Could someone help me solve this problem I have with Spark DataFrame?

When I do myFloatRdd.toDF() I get an error:

TypeError: Can not infer schema for type: <type 'float'>

I don't understand why...

Example:

myFloatRdd = sc.parallelize([1.0,2.0,3.0])
df = myFloatRdd.toDF()

2 Answers

0 votes
by (32.3k points)

SparkSession.createDataFrame, which toDF() calls under the hood, requires an RDD of Row, tuple, or list elements unless a schema with a DataType is provided. I would suggest wrapping each float in a Row like this:

from pyspark.sql import Row

row = Row("val")  # Or some other column name
myFloatRdd.map(row).toDF()
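As a rough mental model of why the wrapping matters, the check can be sketched in plain Python. This is a simplified approximation and not PySpark's actual code; the helper name `can_infer_schema` is hypothetical and exists only for illustration.

```python
def can_infer_schema(element):
    """Hypothetical helper: a simplified model of which element shapes
    give Spark enough structure to infer columns from."""
    # dicts, tuples and lists carry field structure
    # (Row and namedtuple instances are tuples, so they qualify too)
    if isinstance(element, (dict, tuple, list)):
        return True
    # plain objects expose their fields through __dict__
    if hasattr(element, "__dict__"):
        return True
    # bare scalars such as float or int have no fields to map to columns
    return False


bare_float_ok = can_infer_schema(1.0)         # bare scalar
wrapped_float_ok = can_infer_schema((1.0,))   # one-element tuple
```

Under this model a bare float is rejected while the same value wrapped in a one-element tuple is accepted, which is the situation the Row wrapper above resolves.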

To create a DataFrame from a list of scalars, you'll have to use SparkSession.createDataFrame directly and provide a schema:

from pyspark.sql.types import FloatType

df = spark.createDataFrame([1.0, 2.0, 3.0], FloatType())
df.show()

## +-----+
## |value|
## +-----+
## |  1.0|
## |  2.0|
## |  3.0|
## +-----+

0 votes
by (1.9k points)

It doesn't work because Spark cannot infer a schema when the elements of your RDD are a simple type, such as float or int. A bare scalar has no fields, so toDF() has nothing to map to columns. Passing a column name alone is not enough; each element must first be wrapped in a tuple (or a Row or list) so that Spark sees a one-field record.

Here's how to do it.

Convert the RDD to a DataFrame with a column name:

Wrap each float in a one-element tuple, then call toDF() with the column name:

from pyspark.sql import SparkSession

# Start a SparkSession and get its SparkContext
spark = SparkSession.builder.appName("Example").getOrCreate()
sc = spark.sparkContext

# Create an RDD of floats
myFloatRdd = sc.parallelize([1.0, 2.0, 3.0])

# Wrap each float in a one-element tuple, then convert with a named column
df = myFloatRdd.map(lambda x: (x,)).toDF(["value"])
df.show()
