Install PySpark on Windows

The first step of working with big data is to set up your environment. For learning and testing purposes, you can set up the environment on a single machine. Let’s install PySpark on Windows 10 or Windows 11.

Install Java

You can install the official JDK from Oracle, but you might consider using OpenJDK instead.

Download the JDK zip file for Windows and unzip it in the folder of your choice, such as “C:\Apps\jdk-11.0.17”. You can find the executables in the “C:\Apps\jdk-11.0.17\bin” folder.

Add the environment variables.

JAVA_HOME = C:\Apps\jdk-11.0.17
PATH = %PATH%;C:\Apps\jdk-11.0.17\bin

And then, you can verify the installation. Open the Command Prompt.
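
Run java -version. With the OpenJDK 11 build above, you should see output similar to the following (the exact version and build strings will vary by distribution):

> java -version
openjdk version "11.0.17" 2022-10-18
OpenJDK Runtime Environment (build 11.0.17+8)
OpenJDK 64-Bit Server VM (build 11.0.17+8, mixed mode)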


Install Python

Installing Python is easy.

Go to the Python download page and download the latest version, or a version of your choice.

  • If the latest version does not work with Spark (for example, you get an “IndexError”), try Python 3.10.9.

You can simply download the Windows installer (.exe) and run it. Make sure the “Add python.exe to PATH” option is checked.

And then verify the installation.

> python --version
Python 3.10.9

On Windows, Python is installed in C:\Users\{username}\AppData\Local\Programs\Python\Python{version}.
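
You can confirm the location from the Command Prompt. The output below assumes the default per-user install of Python 3.10; substitute your actual username:

> where python
C:\Users\{username}\AppData\Local\Programs\Python\Python310\python.exe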


Install Apache Spark

  • First, go to the Spark download page.
  • Select the latest stable release.
  • Choose the package type: pre-built for the latest version of Apache Hadoop.
  • Download the tgz file.
  • You do not need to install Spark. Just extract the tgz file to the folder of your choice (“C:\Apps”) using 7-Zip.
  • Set the environment variables.

SPARK_HOME = C:\Apps\spark-3.3.1-bin-hadoop3
HADOOP_HOME = C:\Apps\spark-3.3.1-bin-hadoop3
PYSPARK_PYTHON = C:\{python_path}\python.exe
PATH = %PATH%;C:\Apps\spark-3.3.1-bin-hadoop3\bin
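
Open a new Command Prompt (so the new variables take effect) and confirm they are set:

> echo %SPARK_HOME%
C:\Apps\spark-3.3.1-bin-hadoop3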

Let’s verify the installation by running the PySpark shell. You can find pyspark in the “C:\Apps\spark-3.3.1-bin-hadoop3\bin” folder.
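
Since that folder is on the PATH, you can start the shell from anywhere:

> pyspark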

Windows Defender Firewall might block java.exe from running. Click the “Allow access” button.

At this point, however, the shell reports an error about the missing Hadoop binaries for Windows. The installation is not quite done yet.


Alternatively, you can install pyspark using pip and set up the environment variables accordingly.

pip install pyspark

pip install --upgrade pandas

With this approach, the Spark files live in “C:\Users\{username}\AppData\Local\Programs\Python\Python{version}\Lib\site-packages\pyspark”.
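
If you go this route, a sketch of the variables, assuming the default per-user install location (substitute your actual username and Python version):

SPARK_HOME = C:\Users\{username}\AppData\Local\Programs\Python\Python{version}\Lib\site-packages\pyspark
HADOOP_HOME = C:\Users\{username}\AppData\Local\Programs\Python\Python{version}\Lib\site-packages\pyspark
PATH = %PATH%;C:\Users\{username}\AppData\Local\Programs\Python\Python{version}\Lib\site-packages\pyspark\bin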


Install winutils

Download the winutils.exe and hadoop.dll files for the Hadoop version your Spark installation was built against.

https://github.com/kontext-tech/winutils

Copy the files to the %SPARK_HOME%\bin folder (“C:\Apps\spark-3.3.1-bin-hadoop3\bin”).
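
A quick check that both files are in place:

> dir %SPARK_HOME%\bin\winutils.exe %SPARK_HOME%\bin\hadoop.dll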


PySpark Shell

Let’s run the PySpark Shell again.

You are good to go if you see the Welcome screen without an error.
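
As a quick sanity check, run a small job from the shell. The PySpark shell predefines a SparkSession named spark:

>>> spark.range(5).show()
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
+---+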


Spark Web UI

Apache Spark provides a suite of web user interfaces (UIs) that you can use to monitor the status and resource consumption of your Spark cluster.

https://spark.apache.org/docs/latest/web-ui.html

Once you start the PySpark Shell, you can access the Spark Web UI through port 4040. If port 4040 is already in use, Spark falls back to 4041, 4042, and so on.

http://localhost:4040/


Stopping the PySpark Shell

You can exit the shell by using the “quit()” function.
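
For example:

>>> quit()

exit() also works, as does pressing Ctrl-Z followed by Enter in the Windows console.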
