I'm trying to convert a pandas DataFrame into a Spark DataFrame. The head of the DataFrame:

10000001,1,0,1,12:35,OK,10002,1,0,9,f,NA,24,24,0,3,9,0,0,1,1,0,0,4,543
10000001,2,0,1,12:36,OK,10002,1,0,9,f,NA,24,24,0,3,9,2,1,1,3,1,3,2,611
10000002,1,0,4,12:19,PA,10003,1,1,7,f,NA,74,74,0,2,15,2,0,2,3,1,2,2,691


Code:

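import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SQLContext

dataset = pd.read_csv("data/AS/test_v2.csv")
sc = SparkContext(conf=conf)  # conf (a SparkConf) is defined elsewhere
sqlCtx = SQLContext(sc)
sdf = sqlCtx.createDataFrame(dataset)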


And I got an error:

TypeError: Can not merge type <class 'pyspark.sql.types.StringType'> and <class 'pyspark.sql.types.DoubleType'>
 

1 Answer


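This error typically arises when Spark infers the column types from the data and finds a column that mixes strings with missing values, which pandas stores as float NaN. You can avoid such type-related errors by imposing an explicit schema, as follows.

Suppose the original data (as above) is saved to a text file, samp.csv, with a header row of hypothetical column names ("col1", "col2", ..., "col25").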

import pyspark
from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.appName('pandasToSparkDF').getOrCreate()
pdDF = pd.read_csv("samp.csv")

Contents of the pandas DataFrame:

pdDF

       col1  col2  col3  col4   col5  col6   col7  col8  col9  col10  ...  col16  col17  col18  col19  col20  col21  col22  col23  col24  col25
0  10000001     1     0     1  12:35    OK  10002     1     0      9  ...      3      9      0      0      1      1      0      0      4    543
1  10000001     2     0     1  12:36    OK  10002     1     0      9  ...      3      9      2      1      1      3      1      3      2    611
2  10000002     1     0     4  12:19    PA  10003     1     1      7  ...      2     15      2      0      2      3      1      2      2    691
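Note that pandas treats the literal text NA (the twelfth column in the sample) as a missing value by default, so that column arrives as NaN rather than as a string. If the text should be preserved, one option is to switch off the default NA parsing when reading the file. The sketch below is only illustrative: the helper name read_keeping_na_text and the trimmed-down inline CSV are assumptions, not part of the original question.

```python
import io

import pandas as pd


def read_keeping_na_text(source, string_cols):
    """Read a CSV while keeping values such as 'NA' as plain text.

    source      -- a path or file-like object accepted by pandas.read_csv
    string_cols -- names of the columns that should be read as strings
    """
    return pd.read_csv(
        source,
        dtype={name: str for name in string_cols},  # force text columns
        keep_default_na=False,  # do not turn 'NA', 'NaN', '' etc. into NaN
    )


# Hypothetical, shortened version of the sample data (4 of the 25 columns).
csv_text = "col1,col11,col12,col13\n10000001,f,NA,24\n10000002,f,NA,74\n"
pdf = read_keeping_na_text(io.StringIO(csv_text), ["col11", "col12"])
print(pdf.dtypes)
```

A caveat on this design choice: with keep_default_na=False, empty fields in numeric columns are no longer read as missing either, so it fits best when only the text columns can contain such markers.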

Next, create the schema:

from pyspark.sql.types import (
    StructType, StructField, LongType, IntegerType, StringType
)

mySchema = StructType([
    StructField("Col1", LongType(), True),
    StructField("Col2", IntegerType(), True),
    StructField("Col3", IntegerType(), True),
    StructField("Col4", IntegerType(), True),
    StructField("Col5", StringType(), True),
    StructField("Col6", StringType(), True),
    StructField("Col7", IntegerType(), True),
    StructField("Col8", IntegerType(), True),
    StructField("Col9", IntegerType(), True),
    StructField("Col10", IntegerType(), True),
    StructField("Col11", StringType(), True),
    StructField("Col12", StringType(), True),
    StructField("Col13", IntegerType(), True),
    StructField("Col14", IntegerType(), True),
    StructField("Col15", IntegerType(), True),
    StructField("Col16", IntegerType(), True),
    StructField("Col17", IntegerType(), True),
    StructField("Col18", IntegerType(), True),
    StructField("Col19", IntegerType(), True),
    StructField("Col20", IntegerType(), True),
    StructField("Col21", IntegerType(), True),
    StructField("Col22", IntegerType(), True),
    StructField("Col23", IntegerType(), True),
    StructField("Col24", IntegerType(), True),
    StructField("Col25", IntegerType(), True),
])

Note: the third argument of each StructField, True, marks the field as nullable.
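Typing out 25 StructField entries by hand is easy to get wrong. As a possible alternative, createDataFrame also accepts the schema as a DDL-formatted string, and such a string can be generated. The sketch below assumes the same column layout as the hand-written schema; the helper name build_ddl_schema and its parameters are made up for illustration and are not part of PySpark.

```python
def build_ddl_schema(n_cols, string_cols=(), long_cols=()):
    """Return a DDL schema string such as 'Col1 long, Col2 int, ...'.

    Columns are named Col1..Col<n_cols>. Any column position not listed
    in string_cols or long_cols defaults to int.
    """
    fields = []
    for i in range(1, n_cols + 1):
        if i in string_cols:
            spark_type = "string"
        elif i in long_cols:
            spark_type = "long"
        else:
            spark_type = "int"
        fields.append(f"Col{i} {spark_type}")
    return ", ".join(fields)


# Same layout as the hand-written schema: Col1 is long,
# Col5, Col6, Col11 and Col12 are strings, everything else is int.
ddl = build_ddl_schema(25, string_cols={5, 6, 11, 12}, long_cols={1})
print(ddl)
```

The resulting string could then be passed in place of the StructType, for example spark.createDataFrame(pdDF, schema=ddl).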

Create the PySpark DataFrame:

df = spark.createDataFrame(pdDF, schema=mySchema)

Confirm that the pandas DataFrame is now a PySpark DataFrame:

type(df)

This returns:

pyspark.sql.dataframe.DataFrame
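Before writing a schema, it can help to check which columns are likely to trip up Spark's type inference. The following sketch lists the object-typed columns of a pandas DataFrame whose values have more than one Python type (for example str alongside float NaN). The function name mixed_type_columns and the small sample frame are hypothetical and only serve the illustration.

```python
import pandas as pd


def mixed_type_columns(pdf):
    """Map each object column with mixed Python types to its type names."""
    mixed = {}
    for name in pdf.columns:
        # Numeric columns hold a single type, so only object columns matter.
        if pdf[name].dtype != object:
            continue
        kinds = sorted({type(value).__name__ for value in pdf[name]})
        if len(kinds) > 1:
            mixed[name] = kinds
    return mixed


# Hypothetical sample: a text column with one missing value (float NaN)
# next to a clean integer column.
sample = pd.DataFrame({
    "col11": pd.Series(["f", float("nan"), "f"], dtype=object),
    "col13": [24, 24, 74],
})
print(mixed_type_columns(sample))
```

Columns reported here are candidates for an explicit StringType field in the schema, or for cleaning up the missing values before conversion.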
