
I want to filter a DataFrame using a condition on the length of a column. This may be a very easy question, but I couldn't find a related one on SO.

More specifically, I have a DataFrame with a single column of type ArrayType(StringType()), and I want to filter the DataFrame by the length of that array. Here is a snippet:

df = sqlContext.read.parquet("letters.parquet")

# The output will be

# +------------+
# |      tokens|
# +------------+
# |[L, S, Y, S]|
# |[L, V, I, S]|
# |[I, A, N, A]|
# |[I, L, S, A]|
# |[E, N, N, Y]|
# |[E, I, M, A]|
# |[O, A, N, A]|
# |   [S, U, S]|
# +------------+

# But I want only the entries with length 3 or less

fdf = df.filter(len(df.tokens) <= 3)

# But this raises TypeError: object of type 'Column' has no len(), so the previous statement is obviously incorrect.

Please help me out! 

1 Answer


To keep only the entries with length 3 or less, I would suggest using the size function, which is available in Spark >= 1.5.

Using pyspark:

