The first step in working with big data is to set up your environment. For learning and testing purposes, you can set up everything on a single machine. Let's install PySpark on Windows 10 or Windows 11.
Install Java
You can install the official JDK from Oracle, but you might consider using OpenJDK instead.
- Spark is picky about JDK versions. Install JDK 8 or JDK 11; do not use the latest version.
- Java Archive Downloads – Java SE 11 (oracle.com)
Download the JDK zip file for Windows and unzip it in the folder of your choice, such as "c:\Apps\jdk-11.0.17". You can find the executables in the "C:\Apps\jdk-11.0.17\bin" folder.
Add the following environment variables:
JAVA_HOME = c:\Apps\jdk-11.0.17
PATH = %PATH%;c:\Apps\jdk-11.0.17\bin
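You can set these in the System Properties dialog (Environment Variables), or from a Command Prompt with the setx utility (a sketch; setx writes to the user environment and takes effect only in new Command Prompt windows):
> setx JAVA_HOME "c:\Apps\jdk-11.0.17"
> setx PATH "%PATH%;c:\Apps\jdk-11.0.17\bin"
Be aware that setx truncates values longer than 1024 characters, so edit PATH through the dialog if yours is long.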
And then, you can verify the installation. Open the Command Prompt and check the Java version:
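> java -version
openjdk version "11.0.17" ...
The output should report the JDK you installed; the exact text varies by vendor and build.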
Install Python
Installing Python is easy.
Go to the Python download page and download the latest release, or the version of your choice.
- If the latest version does not work with Spark (for example, you get an "IndexError"), try Python 3.10.9.
You can simply download the Windows installer (.exe) and run it. Make sure the "Add python.exe to PATH" option is checked.
And then verify the installation:
> python --version
Python 3.10.9
On Windows, Python is installed in C:\Users\{username}\AppData\Local\Programs\Python\Python{version} by default.
Install Apache Spark
- First, go to the Spark download page.
- Select the latest stable release.
- Choose the package type: pre-built for the latest version of Apache Hadoop.
- Download the tgz file.
- You do not need to run an installer. Just unarchive the tgz file in the folder of your choice ("C:\Apps") using 7-Zip, or with the tar command shown after this list.
- Set the environment variables.
SPARK_HOME = C:\Apps\spark-3.3.1-bin-hadoop3
HADOOP_HOME = C:\Apps\spark-3.3.1-bin-hadoop3
PYSPARK_PYTHON = C:\{python_path}\python.exe
PATH = %PATH%;C:\Apps\spark-3.3.1-bin-hadoop3\bin
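If you prefer the command line over 7-Zip, recent Windows 10 and 11 builds include a tar command that extracts the archive in one step (a sketch; adjust the file name to the release you downloaded):
> tar -xvzf spark-3.3.1-bin-hadoop3.tgz -C C:\Apps
After setting the variables, open a new Command Prompt and confirm they are visible:
> echo %SPARK_HOME%
C:\Apps\spark-3.3.1-bin-hadoop3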
Let's verify the installation by running the PySpark shell. You can find pyspark in the "C:\Apps\spark-3.3.1-bin-hadoop3\bin" folder:
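> pyspark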
Windows Defender Firewall might block java.exe from running. Click the "Allow access" button.
But the shell starts with an error (on a fresh setup, typically a complaint that winutils.exe cannot be found). The installation is not quite done yet.
Alternatively, you can install PySpark using pip and set up the environment variables accordingly:
pip install pyspark
pip install --upgrade pandas
In that case, Spark is installed in "C:\Users\{username}\AppData\Local\Programs\Python\Python{version}\Lib\site-packages\pyspark", so point SPARK_HOME there instead.
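If you are unsure where pip placed the package, you can ask pip directly; the Location field is the site-packages folder that contains pyspark:
> pip show pyspark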
Install winutils
Download the winutils.exe and hadoop.dll files for the version of Hadoop against which your Spark installation was built.
https://github.com/kontext-tech/winutils
Copy the files to the %SPARK_HOME%\bin folder ("C:\Apps\spark-3.3.1-bin-hadoop3\bin").
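Because that folder is already on PATH, you can confirm Windows can find winutils with the built-in where utility:
> where winutils
C:\Apps\spark-3.3.1-bin-hadoop3\bin\winutils.exe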
PySpark Shell
Let's run the PySpark shell again.
You are good to go if you see the Welcome screen without an error.
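The shell exposes the active session as spark, so you can run a quick sanity check (a minimal example; any small DataFrame works):
>>> spark.range(5).count()
5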
Spark Web UI
Apache Spark provides a suite of web user interfaces (UIs) that you can use to monitor the status and resource consumption of your Spark cluster.
https://spark.apache.org/docs/latest/web-ui.html
Once you start the PySpark shell, you can access the Spark Web UI on port 4040, for example at http://localhost:4040.
Stopping the PySpark Shell
You can exit the shell by using the "quit()" or "exit()" function.