in Big Data Hadoop & Spark by (120 points)

Hi,

I want to automate schema creation and apply the schema to a DataFrame while reading files such as Parquet or CSV from a cloud storage system. For example:

I have a file a.csv or a.parquet. While creating the DataFrame during the read, we can explicitly define the schema with a StructType. Instead of writing the schema in the notebook, I want to keep one schema for all my CSV files, say csv_schema, stored in cloud storage. If there is any addition or deletion, I will make it in the csv_schema file separately.

In the notebook, when creating the DataFrame while reading the file, I want to pass this schema stored in the separate file. Please suggest whether we can write a function in Python, or any other approach, to automate schema creation and apply it to DataFrames for the different file types.
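For reference, this is the kind of explicit in-notebook schema definition I want to avoid repeating (a minimal sketch with hypothetical column names and paths):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()

# Hypothetical columns -- today this StructType lives inside the notebook itself.
csv_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])

df = spark.read.schema(csv_schema).option("header", "true").csv("s3://my-bucket/a.csv")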

1 Answer

by (33.1k points)

Hi Sunil,

In PySpark you can use Spark SQL to load CSV or Parquet files into a DataFrame. If you want to combine multiple files from cloud storage into one DataFrame, Spark SQL commands can do that as well.

First, load a single file into a DataFrame. Then use the join function to bring the other files into this DataFrame. After loading all the data into a single DataFrame, you can perform your data wrangling on it.
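A minimal sketch of that flow, assuming hypothetical paths and a shared id column to join on:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical cloud-storage paths -- adjust to your bucket/container layout.
csv_df = spark.read.option("header", "true").csv("s3://my-bucket/a.csv")
parquet_df = spark.read.parquet("s3://my-bucket/a.parquet")

# Join the two DataFrames on a shared column, then continue with wrangling.
combined_df = csv_df.join(parquet_df, on="id", how="inner")
combined_df.show()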

You can check this documentation for more information.

Hope this answer will help you!

by (120 points)
Thanks Anurag. What I am looking for is this: I have one file, a.parquet. While reading this file I need to pass the schema explicitly. Instead of creating the schema in the notebook with StructType and StructField, I want a generic schema, say a.schema, stored in a file at a storage location. While reading a.parquet, I want to pass the schema that is defined in that file path, i.e. a.path.

So it is like having a schema file a.schema that holds the schema definition for all my Parquet files (or any file); it is generic.

df = spark.read.schema(generic_schema).parquet(...)

How can I get or create that generic schema, ideally with some PySpark function, instead of adding the schema definition within the notebook?
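One approach (a minimal sketch, assuming the schema file holds the JSON text produced by df.schema.json() and lives at a hypothetical path):

import json
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession.builder.getOrCreate()

def load_schema(schema_path):
    # Read the whole schema file through Spark, so any storage system Spark can
    # reach works, then rebuild the StructType from the JSON text.
    schema_text = spark.sparkContext.wholeTextFiles(schema_path).collect()[0][1]
    return StructType.fromJson(json.loads(schema_text))

# Hypothetical path to the shared schema file in cloud storage.
generic_schema = load_schema("s3://my-bucket/schemas/a.schema")

df = spark.read.schema(generic_schema).parquet("s3://my-bucket/a.parquet")

To create a.schema in the first place, you can infer the schema once from a sample file and write df.schema.json() out to that path; later edits to the file are picked up on the next read. Alternatively, spark.read.schema() also accepts a DDL string such as "id INT, name STRING", so the schema file could simply contain DDL text that you read and pass straight through.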

