0 votes
2 views
in Big Data Hadoop & Spark by (11.4k points)

Could someone help me solve this problem I'm having with a Spark DataFrame?

When I do myFloatRdd.toDF() I get an error:

TypeError: Can not infer schema for type: <type 'float'>

I don't understand why...

Example:

myFloatRdd = sc.parallelize([1.0,2.0,3.0])
df = myFloatRdd.toDF()

2 Answers

0 votes
by (32.3k points)
edited by

SparkSession.createDataFrame requires an RDD of Row/tuple/list, unless a schema with a DataType is provided. I would suggest you wrap each float in a Row like this:

from pyspark.sql import Row

row = Row("val") # Or some other column name

myFloatRdd.map(row).toDF()  # wrap each float in a Row, then convert to a DataFrame
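For reference, the resulting DataFrame has a single column named val (from the Row factory above), so calling show() on it should print something like:

## +---+
## |val|
## +---+
## |1.0|
## |2.0|
## |3.0|
## +---+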

To create a DataFrame from a list of scalars, you'll have to use SparkSession.createDataFrame directly and provide a schema:

from pyspark.sql.types import FloatType

df = spark.createDataFrame([1.0, 2.0, 3.0], FloatType())

df.show()

## +-----+
## |value|
## +-----+
## |  1.0|
## |  2.0|
## |  3.0|
## +-----+
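Equivalently, you can spell out the schema as a StructType and pass the values as one-element tuples. This is a minimal sketch assuming the same spark session; the column name "value" is just an example:

from pyspark.sql.types import StructType, StructField, FloatType

# Explicit schema: a single nullable float column (the name is arbitrary)
schema = StructType([StructField("value", FloatType(), True)])

df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], schema)
df.show()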


0 votes
by (1.1k points)

It doesn't work because Spark cannot infer a schema from an RDD of bare scalars such as floats. toDF() (which calls SparkSession.createDataFrame under the hood) expects an RDD of Row, tuple, or list objects, or an explicit schema; from a plain float it cannot infer a column name or type. The fix is to wrap each value in a tuple (or Row) and give the column a name.

Here's how to do it.

Convert the RDD to a DataFrame with a column name:

Wrap each float in a tuple, then pass the column name to toDF():

from pyspark.sql import SparkSession

# Start SparkSession
spark = SparkSession.builder.appName("Example").getOrCreate()
sc = spark.sparkContext

# Create an RDD of floats
myFloatRdd = sc.parallelize([1.0, 2.0, 3.0])

# Wrap each float in a tuple, then convert to a DataFrame with a specified column name
df = myFloatRdd.map(lambda x: (x,)).toDF(["value"])

df.show()
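With the tuple wrapping in place, df.show() should print something like:

# +-----+
# |value|
# +-----+
# |  1.0|
# |  2.0|
# |  3.0|
# +-----+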
