in Big Data Hadoop & Spark by (120 points)

Hi,

I want to automate schema creation and apply the schema to a DataFrame while reading files such as Parquet or CSV from a cloud storage system. For example:

I have a file a.csv or a.parquet. While creating the DataFrame during the read, we can explicitly define the schema with a StructType. Instead of writing the schema in the notebook, I want to keep one schema for all my CSV files, say csv_schema, stored in cloud storage. If there is any addition or deletion, I will make it in the csv_schema file separately.

In the notebook, when creating the DataFrame while reading the file, I want to pass this schema stored in the separate file. Please suggest whether we can write a function in Python, or any other approach, to automate schema creation and apply it to DataFrames for the different file types.
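For reference, this is the kind of explicit in-notebook schema definition I want to avoid repeating (a minimal sketch with hypothetical column names and paths):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()

# Hypothetical columns -- today this StructType lives inside the notebook itself.
csv_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])

df = spark.read.schema(csv_schema).option("header", "true").csv("s3://my-bucket/a.csv")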

1 Answer

by (33.1k points)

Hi Sunil,

In PySpark you can use Spark SQL to load CSV or Parquet files into a DataFrame. If you want to combine multiple files from cloud storage into one DataFrame, Spark SQL commands can do that as well.

First, load a single file into a DataFrame. Then use the join function to bring the other files into this DataFrame. After loading all the data into a single DataFrame, you can perform your data wrangling on it.
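A minimal sketch of that flow, assuming hypothetical paths and a shared id column to join on:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical cloud-storage paths -- adjust to your bucket/container layout.
csv_df = spark.read.option("header", "true").csv("s3://my-bucket/a.csv")
parquet_df = spark.read.parquet("s3://my-bucket/a.parquet")

# Join the two DataFrames on a shared column, then continue with wrangling.
combined_df = csv_df.join(parquet_df, on="id", how="inner")
combined_df.show()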

You can check this documentation for more information.

Hope this answer will help you!

by (120 points)
Thanks Anurag. What I am looking for is this: I have one file, a.parquet. While reading this file I need to pass the schema explicitly. Instead of creating the schema in the notebook with StructType and StructField, I want a generic schema, say a.schema, stored in a file at a storage location. While reading a.parquet, I want to pass the schema that is defined in that file path, i.e. a.path.

So it is like having a schema file a.schema that holds the schema definition for all my Parquet files (or any file); it is generic.

df = spark.read.schema(generic_schema).parquet(...)

How can I get or create that generic schema, ideally with some PySpark function, instead of adding the schema definition within the notebook?
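One approach (a minimal sketch, assuming the schema file holds the JSON text produced by df.schema.json() and lives at a hypothetical path):

import json
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession.builder.getOrCreate()

def load_schema(schema_path):
    # Read the whole schema file through Spark, so any storage system Spark can
    # reach works, then rebuild the StructType from the JSON text.
    schema_text = spark.sparkContext.wholeTextFiles(schema_path).collect()[0][1]
    return StructType.fromJson(json.loads(schema_text))

# Hypothetical path to the shared schema file in cloud storage.
generic_schema = load_schema("s3://my-bucket/schemas/a.schema")

df = spark.read.schema(generic_schema).parquet("s3://my-bucket/a.parquet")

To create a.schema in the first place, you can infer the schema once from a sample file and write df.schema.json() out to that path; later edits to the file are picked up on the next read. Alternatively, spark.read.schema() also accepts a DDL string such as "id INT, name STRING", so the schema file could simply contain DDL text that you read and pass straight through.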

