The first step in working with big data is to set up your environment. For learning and testing purposes, you can set up everything on a single machine. Let's install PySpark on Windows 10 or Windows 11.
Install Java
You can install the official JDK from Oracle, but you might consider using OpenJDK instead.
- Spark is picky about JDK versions. Install JDK 8 or JDK 11; do not use the latest version.
- Java Archive Downloads – Java SE 11 (oracle.com)
Download the JDK zip file for Windows and unzip it in the folder of your choice, such as "c:\Apps\jdk-11.0.17". You can find the executables in the "C:\Apps\jdk-11.0.17\bin" folder.
Add the following environment variables:
JAVA_HOME = c:\Apps\jdk-11.0.17
PATH = %PATH%;c:\Apps\jdk-11.0.17\bin
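You can set these in the System Properties dialog (Environment Variables), or from a Command Prompt with the setx utility (a sketch; setx writes to the user environment and takes effect only in new Command Prompt windows):
> setx JAVA_HOME "c:\Apps\jdk-11.0.17"
> setx PATH "%PATH%;c:\Apps\jdk-11.0.17\bin"
Be aware that setx truncates values longer than 1024 characters, so edit PATH through the dialog if yours is long.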
And then, you can verify the installation. Open the Command Prompt and check the Java version:
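> java -version
openjdk version "11.0.17" ...
The output should report the JDK you installed; the exact text varies by vendor and build.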
Install Python
Installing Python is easy.
Go to the Python download page and download the latest release, or the version of your choice.
- If the latest version does not work with Spark (for example, you get an "IndexError"), try Python 3.10.9.
You can simply download the Windows installer (.exe) and run it. Make sure the "Add python.exe to PATH" option is checked.
And then verify the installation:
> python --version
Python 3.10.9
On Windows, Python is installed in C:\Users\{username}\AppData\Local\Programs\Python\Python{version} by default.
Install Apache Spark
- First, go to the Spark download page.
- Select the latest stable release.
- Choose the package type: pre-built for the latest version of Apache Hadoop.
- Download the tgz file.
- You do not need to run an installer. Just unarchive the tgz file in the folder of your choice ("C:\Apps") using 7-Zip, or with the tar command shown after this list.
- Set the environment variables.
SPARK_HOME = C:\Apps\spark-3.3.1-bin-hadoop3
HADOOP_HOME = C:\Apps\spark-3.3.1-bin-hadoop3
PYSPARK_PYTHON = C:\{python_path}\python.exe
PATH = %PATH%;C:\Apps\spark-3.3.1-bin-hadoop3\bin
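If you prefer the command line over 7-Zip, recent Windows 10 and 11 builds include a tar command that extracts the archive in one step (a sketch; adjust the file name to the release you downloaded):
> tar -xvzf spark-3.3.1-bin-hadoop3.tgz -C C:\Apps
After setting the variables, open a new Command Prompt and confirm they are visible:
> echo %SPARK_HOME%
C:\Apps\spark-3.3.1-bin-hadoop3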
Let's verify the installation by running the PySpark shell. You can find pyspark in the "C:\Apps\spark-3.3.1-bin-hadoop3\bin" folder:
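> pyspark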
Windows Defender Firewall might block java.exe from running. Click the "Allow access" button.
But the shell starts with an error (on a fresh setup, typically a complaint that winutils.exe cannot be found). The installation is not quite done yet.
Alternatively, you can install PySpark using pip and set up the environment variables accordingly:
pip install pyspark
pip install --upgrade pandas
In that case, Spark is installed in "C:\Users\{username}\AppData\Local\Programs\Python\Python{version}\Lib\site-packages\pyspark", so point SPARK_HOME there instead.
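If you are unsure where pip placed the package, you can ask pip directly; the Location field is the site-packages folder that contains pyspark:
> pip show pyspark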
Install winutils
Download the winutils.exe and hadoop.dll files for the version of Hadoop against which your Spark installation was built.
https://github.com/kontext-tech/winutils
Copy the files to the %SPARK_HOME%\bin folder ("C:\Apps\spark-3.3.1-bin-hadoop3\bin").
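Because that folder is already on PATH, you can confirm Windows can find winutils with the built-in where utility:
> where winutils
C:\Apps\spark-3.3.1-bin-hadoop3\bin\winutils.exe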
PySpark Shell
Let's run the PySpark shell again.
You are good to go if you see the Welcome screen without an error.
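The shell exposes the active session as spark, so you can run a quick sanity check (a minimal example; any small DataFrame works):
>>> spark.range(5).count()
5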
Spark Web UI
Apache Spark provides a suite of web user interfaces (UIs) that you can use to monitor the status and resource consumption of your Spark cluster.
https://spark.apache.org/docs/latest/web-ui.html
Once you start the PySpark shell, you can access the Spark Web UI on port 4040, for example at http://localhost:4040.
Stopping the PySpark Shell
You can exit the shell by using the "quit()" or "exit()" function.