in Big Data Hadoop & Spark by (11.4k points)

I am using two Jupyter notebooks to do different things in an analysis. In my Scala notebook, I write some of my cleaned data to parquet:

df.select("noStopWords","lowerText","prediction").write.parquet("swift2d://xxxx.keystone/commentClusters.parquet")

I then go to my Python notebook to read in the data:

df = spark.read.parquet("swift2d://xxxx.keystone/commentClusters.parquet")

and I get the following error:

AnalysisException: u'Unable to infer schema for ParquetFormat at swift2d://RedditTextAnalysis.keystone/commentClusters.parquet. It must be specified manually;'
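This error usually means Spark found no parquet data at the given path, so it has nothing to infer a schema from; a failed or misdirected write from the Scala notebook is a common cause. Before debugging the read side, it can help to confirm the target directory actually contains Spark output. A minimal stdlib sketch (the directory layout below is simulated; a successful filesystem write produces part files and a _SUCCESS marker):

```python
import os
import tempfile

def parquet_dir_looks_written(path):
    """Return True if the directory exists and holds Spark part files plus a _SUCCESS marker."""
    if not os.path.isdir(path):
        return False
    names = os.listdir(path)
    has_parts = any(n.startswith("part-") for n in names)
    return has_parts and "_SUCCESS" in names

# Simulate the layout Spark produces on a successful write.
demo = tempfile.mkdtemp()
open(os.path.join(demo, "_SUCCESS"), "w").close()
open(os.path.join(demo, "part-00000-c000.snappy.parquet"), "w").close()

print(parquet_dir_looks_written(demo))        # True: looks like a completed write
print(parquet_dir_looks_written(demo + "x"))  # False: directory does not exist
```

For an object store like swift2d you would list the container with its own client instead, but the idea is the same: verify the write succeeded before blaming the read.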

1 Answer

by (32.3k points)

To read a parquet file, simply use the parquet method of the Spark session's DataFrameReader. Do it like this:

yourdf = spark.read.parquet("your_path_tofile/abc.parquet")

More specifically, follow this approach:

from pyspark.sql import SparkSession

# initialise the SparkSession (this also creates the SparkContext)
spark = SparkSession.builder \
    .master('local') \
    .appName('myAppName') \
    .config('spark.executor.memory', '5gb') \
    .config("spark.cores.max", "6") \
    .getOrCreate()

sc = spark.sparkContext

# using SQLContext to read the parquet file
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

# read the parquet file into a DataFrame
df = sqlContext.read.parquet('path-to-file/commentClusters.parquet')
