How to fix ModuleNotFoundError: No module named 'pyspark' in Python

One error that you might encounter when working with Python is:

ModuleNotFoundError: No module named 'pyspark'

This error occurs when Python can’t find the pyspark module in your current Python environment.

This tutorial shows examples that cause this error and how to fix it.

How to reproduce the error

Suppose you want to use the pyspark module for large-scale data processing.

You import the SparkSession class from the pyspark.sql module in your code as follows:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

But you get the following error when running the code:

Traceback (most recent call last):
  File "main.py", line 1, in <module>
    from pyspark.sql import SparkSession
ModuleNotFoundError: No module named 'pyspark'

This error occurs because the pyspark module is not a built-in Python module, so you need to install it before using it.

To see if you have the pyspark module installed, you can run the pip show pyspark command from the terminal as follows:

$ pip3 show pyspark      
WARNING: Package(s) not found: pyspark

If you get the warning shown above, then you need to install the pyspark module.
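You can also check from Python itself using the standard-library importlib module. Here's a minimal sketch:

import importlib.util

# find_spec() returns None if the module can't be found in the current environment
if importlib.util.find_spec("pyspark") is None:
    print("pyspark is not installed in this environment")
else:
    print("pyspark is available")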

How to fix this error

To resolve this error, you need to install the pyspark library using pip as shown below:

pip install pyspark

# For pip3:
pip3 install pyspark

Once the module is installed, you should be able to run the code that imports pyspark without receiving the error.
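If you want to confirm that the installation worked, you can import pyspark and print its version number (the exact version depends on what pip installed):

import pyspark

# Prints the installed PySpark version, e.g. 3.3.2
print(pyspark.__version__)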

Install commands for other environments

The install command can differ depending on which environment you use to run your Python code.

Here’s a list of common install commands in popular Python environments to install the pyspark module:

# if you don't have pip in your PATH:
python -m pip install pyspark

python3 -m pip install pyspark

# Windows
py -m pip install pyspark

# Anaconda
conda install -c conda-forge pyspark

# Jupyter Notebook
!pip install pyspark

Once the module is installed, you should be able to run the code without receiving this error.
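A common cause of this error is installing pyspark into a different Python interpreter than the one that runs your script. You can print the interpreter your script actually uses and then run pip with that exact interpreter:

import sys

# Shows the full path of the Python interpreter running this script.
# Install pyspark with this exact interpreter, for example:
#   /path/to/python -m pip install pyspark
print(sys.executable)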

Use findspark if you still see the error

If the error still occurs even after installing PySpark, then you might not have the path to PySpark added to your environment.

The easiest way to resolve this error is to install the findspark package and use it before importing PySpark.

The findspark package is used to find the path to PySpark and make it importable. Install the package using pip as shown below:

pip install findspark

# For pip3:
pip3 install findspark

Then, add the code to initialize findspark like this:

# Initialize findspark
import findspark
findspark.init()

# Now import PySpark
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

# Create Spark DataFrame
sdf = spark.createDataFrame([
    Row(name="Nathan", age=29),
    Row(name="Jane", age=26)
])

# Show the DataFrame
sdf.show()

Output:

+------+---+
|  name|age|
+------+---+
|Nathan| 29|
|  Jane| 26|
+------+---+

Now you can create a DataFrame using PySpark. The error has been resolved.
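If findspark still can't locate your Spark installation automatically, you can pass the path to findspark.init() explicitly. The path below is only a placeholder; replace it with your own Spark directory:

import findspark

# Point findspark at a specific Spark installation
# (replace the path with your own SPARK_HOME)
findspark.init("/opt/spark-3.3.2-bin-hadoop3")

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()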

Add the path to PySpark manually

As an alternative, you can also set the path to PySpark manually. This is especially useful if you downloaded PySpark manually from the Apache Spark page.

First, download the tgz file (for example, spark-3.3.2-bin-hadoop3.tgz) from the Apache Spark downloads page.

After you download the package, extract it using the tar command as follows:

tar -xzf spark-3.3.2-bin-hadoop3.tgz

Once you've extracted the package, you need to add the following exports to your .zshrc or .bashrc file:

export SPARK_HOME="$HOME/Downloads/spark-3.3.2-bin-hadoop3"
export PATH=$SPARK_HOME/bin:$PATH

export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_PYTHON=python3

The SPARK_HOME path must be the absolute path to the spark-3.3.2-bin-hadoop3 directory that you extracted from the tgz file.

Adding $SPARK_HOME/python to PYTHONPATH is what makes the pyspark module bundled with the Spark package importable, while PYSPARK_PYTHON tells Spark which Python interpreter to use when running your code.
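If you'd rather not edit your shell profile, a rough equivalent is to set these paths from Python at runtime before importing PySpark. This is only a sketch; replace the example path with your own extracted Spark directory:

import glob
import os
import sys

# Example only: point this at your own extracted Spark directory
spark_home = "/Users/you/Downloads/spark-3.3.2-bin-hadoop3"
os.environ["SPARK_HOME"] = spark_home

# Make the bundled pyspark module (and its py4j dependency) importable
sys.path.insert(0, os.path.join(spark_home, "python"))
sys.path.extend(glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*.zip")))

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()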

Test whether PySpark is available by running the pyspark command from the terminal. If everything is set up correctly, you should see the Spark version banner followed by an interactive PySpark shell prompt.

You may also need the Java Development Kit (JDK) available on your computer to run Spark properly.

Conclusion

The ModuleNotFoundError: No module named 'pyspark' error occurs when you attempt to import the pyspark module without having it installed. Since pyspark is not a built-in Python module, you need to install it using pip.

If you still get the error after installing the module, then you can use the findspark module to help you import PySpark.

You can also set the SPARK_HOME environment variable manually if you downloaded Spark directly from the Apache Spark website.

I hope this tutorial is useful. See you in other tutorials! 👋
