When you’re working with PySpark and the pandas library in Python, you might encounter the following error:
TypeError: 'Column' object is not callable
This error usually occurs when you attempt to call a function on the Column object of PySpark’s DataFrame. The PySpark DataFrame object is different from a pandas DataFrame object. This article shows examples that could cause this error and how to fix them.
1. You’re calling a method directly on the Column object
Suppose you have a PySpark DataFrame object defined as follows:
from pyspark.sql import SparkSession, Row
spark = SparkSession.builder.getOrCreate()
# Create Spark DataFrame
sdf = spark.createDataFrame([
    Row(name="Nathan", age=29),
    Row(name="Jane", age=26),
    Row(name="John", age=28),
    Row(name="Lisa", age=22)
])
# Show the DataFrame
sdf.show()
Output:
+------+---+
| name|age|
+------+---+
|Nathan| 29|
| Jane| 26|
| John| 28|
| Lisa| 22|
+------+---+
Next, you try to show only the name column from the sdf DataFrame above like this:
sdf['name'].show()
But since you can’t call a method like show() directly on a DataFrame column, you get the following error:
Traceback (most recent call last):
  File "main.py", line 20, in <module>
    sdf['name'].show()
TypeError: 'Column' object is not callable
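If you’re curious why, indexing a PySpark DataFrame returns a Column expression rather than the data itself. A quick check you can run yourself:
# Indexing a PySpark DataFrame yields a Column expression, not data
print(type(sdf['name']))  # <class 'pyspark.sql.column.Column'>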
To solve this error, you need to use the select() method of the DataFrame object to get a new DataFrame that contains only the column you want to show. After that, you can call the show() method on the new DataFrame as follows:
sdf.select('name').show()
This time, the error won’t appear and you’ll get the following output:
+------+
| name|
+------+
|Nathan|
| Jane|
| John|
| Lisa|
+------+
The select() method returns a new DataFrame object, from which you can call the show() method.
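Note that select() accepts both column name strings and Column objects, and you can pass several at once. A small sketch:
# Both calls return new DataFrames you can call show() on
sdf.select('name', 'age').show()
sdf.select(sdf['name']).show()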
2. You’re calling functions that are not available
This error also happens when you call a function that isn’t available on the Column object. One possible cause is that you’re using an outdated PySpark version that doesn’t have that function yet.
For example, the contains() function is only available in PySpark version 2.2 and above, so the following code causes the error if you have PySpark version 2.1 or below:
new_df = sdf.filter(sdf.name.contains('a'))
new_df.show()
To use the contains() function on DataFrame columns, you need to upgrade your PySpark version to v2.2 and above.
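If you’re not sure which version you’re running, you can check it from the active session. And if upgrading isn’t an option, the older like() function covers simple substring matching with SQL wildcard syntax. A quick sketch:
# Check the running Spark version
print(spark.version)  # e.g. '2.1.0'

# like() has been available since early PySpark releases,
# so this substring filter works on older versions too
new_df = sdf.filter(sdf.name.like('%a%'))
new_df.show()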
3. Calling a Python function on Column objects
Suppose you need to transform a string column in your DataFrame object to all uppercase.
You might create a custom function named to_upper() and use it as follows:
from pyspark.sql.functions import col

sdf = spark.createDataFrame([
    Row(name="Nathan", age=29),
    Row(name="Jane", age=26),
    Row(name="John", age=28),
    Row(name="Lisa", age=22)
])

# A plain Python function, not a UDF
def to_upper(text):
    return text.upper()

new_df = sdf.withColumn("name", to_upper(col('name')))
new_df.show()
Using the withColumn() function, you tried to call the to_upper() function on each row of the name column. But the code raises an error as follows:
Traceback (most recent call last):
  File "main.py", line 28, in <module>
    new_df = sdf.withColumn("name", to_upper(col('name')))
  File "main.py", line 26, in to_upper
    return text.upper()
TypeError: 'Column' object is not callable
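Under the hood, accessing an attribute like text.upper on a Column simply builds another Column expression (PySpark treats it as a nested-field lookup), and Column objects aren’t callable. A quick check makes this visible:
from pyspark.sql.functions import col

# Attribute access on a Column returns another Column...
print(type(col('name').upper))  # <class 'pyspark.sql.column.Column'>

# ...and calling a Column raises the error above:
# col('name').upper()  # TypeError: 'Column' object is not callable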
In other words, to_upper() receives the Column object itself rather than the string values inside it. The to_upper() function must instead be called on each row value in the name column. To let PySpark know that you want to operate on the column values, you need to add the @udf decorator to the function:
from pyspark.sql.functions import udf, col

# ...

@udf
def to_upper(text):
    return text.upper()

new_df = sdf.withColumn("name", to_upper(col('name')))
new_df.show()
By adding the @udf decorator above the function definition, PySpark will call the to_upper() function on each value inside the name column.
The output is as follows:
+------+---+
| name|age|
+------+---+
|NATHAN| 29|
| JANE| 26|
| JOHN| 28|
| LISA| 22|
+------+---+
With the @udf decorator, you can call Python functions on a PySpark Column.
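As a side note, a simple transformation like uppercasing doesn’t actually require a UDF. PySpark ships built-in column functions for many common operations, and they are generally faster because they run inside the JVM without Python serialization overhead. A minimal sketch using the built-in upper():
from pyspark.sql.functions import upper, col

# The built-in upper() produces the same result without a Python UDF
new_df = sdf.withColumn("name", upper(col("name")))
new_df.show()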
As you can see from these examples, the PySpark DataFrame is different from the pandas DataFrame. To avoid this kind of error, keep in mind that a PySpark Column is an expression, not the data itself, so you can’t call arbitrary methods on it.
I hope this tutorial is useful. See you around! 👋