One error that you might encounter when working with Python is:
ModuleNotFoundError: No module named 'pyspark'
This error occurs when Python can’t find the pyspark module in your current Python environment.
This tutorial shows examples that cause this error and how to fix it.
How to reproduce the error
Suppose you want to use the pyspark module for large-scale data processing. You import the SparkSession class from the pyspark.sql module in your code as follows:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
But you get the following error when running the code:
Traceback (most recent call last):
File "main.py", line 1, in <module>
from pyspark.sql import SparkSession
ModuleNotFoundError: No module named 'pyspark'
This error occurs because the pyspark module is not a built-in Python module, so you need to install it before using it.
To see if you have the pyspark module installed, you can run the pip show pyspark command from the terminal as follows:
$ pip3 show pyspark
WARNING: Package(s) not found: pyspark
If you get the warning shown above, then you need to install the pyspark module.
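You can also check from inside Python whether the current interpreter can locate the module at all. The snippet below is a small sanity check (not part of the original steps) that uses the standard importlib module:
# Check whether the current interpreter can locate the pyspark module
import importlib.util

spec = importlib.util.find_spec("pyspark")
if spec is None:
    print("pyspark is not installed in this environment")
else:
    print(f"pyspark found at: {spec.origin}")
If it reports that the module can’t be found, the fix below applies.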
How to fix this error
To resolve this error, you need to install the pyspark library using pip as shown below:
pip install pyspark
# For pip3:
pip3 install pyspark
Once the module is installed, you should be able to run the code that imports pyspark without receiving the error.
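If you want to confirm the installation before rerunning your script, one quick check is to import the module and print its version (the exact version number you see depends on what pip installed):
# Verify that pyspark can now be imported and print its version
import pyspark

print(pyspark.__version__)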
Install commands for other environments
The install command can differ depending on the environment you use to run your Python code. Here’s a list of common install commands for the pyspark module in popular Python environments:
# if you don't have pip in your PATH:
python -m pip install pyspark
python3 -m pip install pyspark
# Windows
py -m pip install pyspark
# Anaconda
conda install -c conda-forge pyspark
# Jupyter Notebook
!pip install pyspark
Once the module is installed, you should be able to run the code without receiving this error.
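If the error persists after installing, a common cause is that pip installed the package for a different Python interpreter than the one running your code. One way to check, shown here as a small sketch, is to print which interpreter your script or notebook actually uses:
# Print the Python interpreter that runs this script or notebook cell
import sys

print(sys.executable)
You can then run that interpreter with -m pip install pyspark so the package lands in the same environment.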
Use findspark if you still see the error
If the error still occurs even after installing PySpark, then you might not have the path to PySpark added to your environment.
The easiest way to resolve this error is to install the findspark package and use it before importing PySpark.
The findspark package is used to find the path to PySpark and make it importable. Install the package using pip as shown below:
pip install findspark
# For pip3:
pip3 install findspark
Then, add the code to initialize findspark like this:
# Initialize findspark
import findspark
findspark.init()
# Now import PySpark
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()

# Create a Spark DataFrame from two Row objects
sdf = spark.createDataFrame([
    Row(name="Nathan", age=29),
    Row(name="Jane", age=26)
])
# Show the DataFrame
sdf.show()
Output:
+------+---+
| name|age|
+------+---+
|Nathan| 29|
| Jane| 26|
+------+---+
Now you can create a DataFrame using PySpark. The error has been resolved.
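If findspark.init() can’t locate your Spark installation automatically, it also accepts the path to the Spark home directory as an argument. The path below is only a placeholder; replace it with wherever Spark lives on your machine:
# Point findspark at a specific Spark installation (placeholder path)
import findspark

findspark.init("/opt/spark")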
Add the path to PySpark manually
As an alternative, you can also set the path to PySpark manually. This is especially useful if you downloaded Spark manually from the Apache Spark downloads page.
First, download the tgz file for the Spark release you want, for example spark-3.3.2-bin-hadoop3.tgz.
After you download the package, extract it using the tar command as follows:
tar -xzf spark-3.3.2-bin-hadoop3.tgz
Once you’ve extracted the package, you need to add the following exports to your .zshrc or .bashrc file:
export SPARK_HOME="$HOME/Downloads/spark-3.3.2-bin-hadoop3"
export PATH=$SPARK_HOME/bin:$PATH
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_PYTHON=python3
The SPARK_HOME path must be the absolute path to the spark-3.3.2-bin-hadoop3 directory that you extracted from the tgz file.
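If you’re not sure what the absolute path is, you can print it from the terminal after extracting the archive, for example:
cd spark-3.3.2-bin-hadoop3 && pwd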
The PYTHONPATH and PYSPARK_PYTHON lines are optional. Adding $SPARK_HOME/python to PYTHONPATH lets Python import the PySpark library bundled with the Spark package you just downloaded, and PYSPARK_PYTHON tells Spark which Python interpreter to use.
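After restarting your shell (or sourcing the file), you can do a quick sanity check from Python that the variables are visible:
# Confirm the environment variables are visible to Python
import os

print(os.environ.get("SPARK_HOME"))
print(os.environ.get("PYSPARK_PYTHON"))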
Test whether PySpark is available by running the pyspark command from the terminal. If everything is set up correctly, the interactive PySpark shell starts and prints the Spark version.
You may also need the Java Development Kit (JDK) available on your computer to run Spark properly.
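You can check whether a JDK is already available by printing the Java version from the terminal:
java -version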
Conclusion
The ModuleNotFoundError: No module named 'pyspark' error occurs when you attempt to import the pyspark module in Python. You need to install the module using pip, as it’s not a built-in Python module.
If you still get the error after installing the module, then you can use the findspark module to help you import PySpark.
You can also set the SPARK_HOME path manually; the pip show pyspark command can help you find where the package is installed.
I hope this tutorial is useful. See you in other tutorials! 👋