Fix TypeError: 'Column' object is not callable

When you’re working with PySpark and the pandas library in Python, you might encounter the following error:

TypeError: 'Column' object is not callable

This error usually occurs when you attempt to call a method on a Column object of a PySpark DataFrame.

A PySpark DataFrame object is different from a pandas DataFrame object. This article shows examples that can cause this error and how to fix them.
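To see the difference, here’s a minimal sketch comparing the two libraries (the sample data is made up for illustration):

import pandas as pd

# In pandas, indexing a DataFrame returns a Series that holds
# the actual data, so you can call methods on it right away:
pdf = pd.DataFrame({"name": ["Nathan", "Jane"], "age": [29, 26]})
print(pdf["name"].str.upper())

# In PySpark, indexing a DataFrame returns a Column, which is only
# an expression describing a computation. It holds no data itself,
# so most DataFrame methods are not available on it.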

1. You’re calling a method directly on the Column object

Suppose you have a PySpark DataFrame object defined as follows:

from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()

# Create Spark DataFrame
sdf = spark.createDataFrame([
    Row(name="Nathan", age=29),
    Row(name="Jane", age=26),
    Row(name="John", age=28),
    Row(name="Lisa", age=22)
])

# Show the DataFrame
sdf.show()

Output:

+------+---+
|  name|age|
+------+---+
|Nathan| 29|
|  Jane| 26|
|  John| 28|
|  Lisa| 22|
+------+---+

Next, you try to show only the name column from the sdf DataFrame above like this:

sdf['name'].show()

But since a Column object has no show() method, calling it raises the following error:

Traceback (most recent call last):
  File "main.py", line 20, in <module>
    sdf['name'].show()
TypeError: 'Column' object is not callable
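You can see what you’re actually dealing with by printing the column instead; a quick check (the exact repr can vary slightly between PySpark versions):

print(sdf['name'])        # Column<'name'>
print(type(sdf['name']))  # <class 'pyspark.sql.column.Column'>

The result is a Column expression, not a DataFrame, and a Column holds no rows to show.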

To solve this error, you need to use the select() method from the DataFrame object to get a new DataFrame that contains only the column you want to show.

After that, you can call the show() method on the new DataFrame as follows:

sdf.select('name').show()

This time, the error won’t appear and you’ll get the following output:

+------+
|  name|
+------+
|Nathan|
|  Jane|
|  John|
|  Lisa|
+------+

The select() method returns a new DataFrame object, on which you can call the show() method.
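The same pattern works for more than one column, and select() also accepts Column objects built with the col() function. A short sketch:

from pyspark.sql.functions import col

sdf.select('name', 'age').show()
sdf.select(col('name')).show()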

2. You’re calling a method that’s not available

This error also happens when you call a method that isn’t available. One possible cause is that you’re using an outdated PySpark version that doesn’t have that method yet.

For example, the contains() method is only available in PySpark version 2.2 and above, so the following code causes the error if you have PySpark version 2.1 or below:

new_df = sdf.filter(sdf.name.contains('a'))

new_df.show()

To use the contains() method on DataFrame columns, you need to upgrade your PySpark installation to v2.2 or above.
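You can check the version you have installed with pyspark.__version__. The reason the error says the object is “not callable” instead of raising an AttributeError is that, as far as I can tell, a Column treats any unknown attribute as a nested-field lookup, so the attribute access succeeds and returns another Column, and only the call fails. The method name below is deliberately made up to demonstrate this:

import pyspark

print(pyspark.__version__)  # check your installed PySpark version

bogus = sdf.name.not_a_real_method  # no error here: returns a Column
bogus()  # TypeError: 'Column' object is not callable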

3. Calling a Python function on Column objects

Suppose you need to transform a string column in your DataFrame object to all uppercase.

You might create a custom function named to_upper() and use it as follows:

from pyspark.sql.functions import col

sdf = spark.createDataFrame([
    Row(name="Nathan", age=29),
    Row(name="Jane", age=26),
    Row(name="John", age=28),
    Row(name="Lisa", age=22)
])

# A plain Python function that uppercases a string
def to_upper(text):
    return text.upper()

new_df = sdf.withColumn("name", to_upper(col('name')))
new_df.show()

Using the withColumn() method, you try to apply the to_upper() function to each row of the name column.

But the code returns an error as follows:

Traceback (most recent call last):
  File "main.py", line 28, in <module>
    new_df = sdf.withColumn("name", to_upper(col('name')))
  File "main.py", line 26, in to_upper
    return text.upper()
TypeError: 'Column' object is not callable

This is because to_upper() receives the Column object itself, not the string values inside it, so text.upper() fails. The function needs to run on each row value in the name column instead.

To let PySpark know that you want to operate on the column values, you need to add the @udf decorator to the function.

from pyspark.sql.functions import udf, col

#...

# Register to_upper() as a UDF (by default, the result is a string)
@udf
def to_upper(text):
    return text.upper()

new_df = sdf.withColumn("name", to_upper(col('name')))
new_df.show()

By adding the @udf decorator above the function definition, you tell PySpark to call the to_upper() function on each value inside the name column.

The output is as follows:

+------+---+
|  name|age|
+------+---+
|NATHAN| 29|
|  JANE| 26|
|  JOHN| 28|
|  LISA| 22|
+------+---+
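One caveat: when you use @udf without arguments like this, PySpark assumes the function returns a string. If your function returns another type, pass the return type explicitly. A minimal sketch with a hypothetical add_one() function:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType

# add_one() is a made-up example; declare its integer return type
@udf(returnType=IntegerType())
def add_one(n):
    return n + 1

sdf.withColumn("age", add_one(col("age"))).show()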

With the @udf decorator, you can call Python functions on PySpark Column values.
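That said, a UDF isn’t always necessary. For common transformations, pyspark.sql.functions usually ships a built-in column function that runs inside the JVM and avoids the overhead of calling back into Python. For this example, the built-in upper() function does the same job:

from pyspark.sql.functions import upper, col

new_df = sdf.withColumn("name", upper(col("name")))
new_df.show()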

As you can see from these examples, the PySpark DataFrame behaves differently from the pandas DataFrame.

To avoid this kind of error, you need to understand how PySpark’s DataFrame and Column APIs work.

I hope this tutorial is useful. See you around! 👋
