overwriting a spark output using pyspark

Question

1 Answer

Amit Rawat · Answer 1 · 2019-07-23T14:01:19+0000

For Spark 1.4+, you have been provided with a built in csv function for the dataframewriter

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter

e.g.

spark_df.write.csv(path=self.output_file_path, header="true", mode="overwrite", sep="\t")

Which is syntactic sugar for:

spark_df.write.format("csv").mode("overwrite").options(header="true",sep="\t").save(path=self.output_file_path)

I think what’s confusing here is finding where exactly the options are available for each format in the docs.

Actually, the write related methods belong to the DataFrameWriter class:https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter

The csv method has these options available, also available when using format("csv"):https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter.csv

The way you need to supply parameters also depends on if the method takes a single (key, value)tuple or keyword args. It's fairly standard to the way python works generally though, using (*args, **kwargs), it just differs from the Scala syntax.

For example The option(key, value) method takes one option as a tuple like option(header,"true") and the .options(**options) method takes a bunch of keyword assignments e.g. .options(header="true",sep="\t")

overwriting a spark output using pyspark

overwriting a spark output using pyspark

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Related questions

Browse Categories

Popular Courses

Top Tutorials

Top Articles

Top Interview Questions