in Big Data Hadoop & Spark by (11.4k points)

I am using two Jupyter notebooks to do different things in an analysis. In my Scala notebook, I write some of my cleaned data to parquet:

df.select("noStopWords","lowerText","prediction").write.parquet("swift2d://xxxx.keystone/commentClusters.parquet")

I then go to my Python notebook to read in the data:

df = spark.read.parquet("swift2d://xxxx.keystone/commentClusters.parquet")

and I get the following error:

AnalysisException: u'Unable to infer schema for ParquetFormat at swift2d://RedditTextAnalysis.keystone/commentClusters.parquet. It must be specified manually;'
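This error usually means Spark found no parquet data at the given path, so it has nothing to infer a schema from; a failed or misdirected write from the Scala notebook is a common cause. Before debugging the read side, it can help to confirm the target directory actually contains Spark output. A minimal stdlib sketch (the directory layout below is simulated; a successful filesystem write produces part files and a _SUCCESS marker):

```python
import os
import tempfile

def parquet_dir_looks_written(path):
    """Return True if the directory exists and holds Spark part files plus a _SUCCESS marker."""
    if not os.path.isdir(path):
        return False
    names = os.listdir(path)
    has_parts = any(n.startswith("part-") for n in names)
    return has_parts and "_SUCCESS" in names

# Simulate the layout Spark produces on a successful write.
demo = tempfile.mkdtemp()
open(os.path.join(demo, "_SUCCESS"), "w").close()
open(os.path.join(demo, "part-00000-c000.snappy.parquet"), "w").close()

print(parquet_dir_looks_written(demo))        # True: looks like a completed write
print(parquet_dir_looks_written(demo + "x"))  # False: directory does not exist
```

For an object store like swift2d you would list the container with its own client instead, but the idea is the same: verify the write succeeded before blaming the read.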

1 Answer

by (32.3k points)

To read a parquet file, simply use the parquet method of the Spark session's DataFrameReader. Do it like this:

yourdf = spark.read.parquet("your_path_tofile/abc.parquet")

More specifically, follow this approach:

from pyspark.sql import SparkSession

# initialise the SparkSession (this also creates the SparkContext)
spark = SparkSession.builder \
    .master('local') \
    .appName('myAppName') \
    .config('spark.executor.memory', '5gb') \
    .config("spark.cores.max", "6") \
    .getOrCreate()

sc = spark.sparkContext

# using SQLContext to read the parquet file
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

# read the parquet file into a DataFrame
df = sqlContext.read.parquet('path-to-file/commentClusters.parquet')
