0 votes
2 views
in Big Data Hadoop & Spark by (11.4k points)

I'm new to Spark and I'm trying to read CSV data from a file. Here's what I am doing:

sc.textFile('file.csv')
    .map(lambda line: (line.split(',')[0], line.split(',')[1]))
    .collect()


I would expect this call to give me a list of the first two columns of my file, but I'm getting this error:

File "<ipython-input-60-73ea98550983>", line 1, in <lambda>
IndexError: list index out of range


although my CSV file has more than one column.

1 Answer

0 votes
by (32.3k points)

Just try this instead of your code:

sc.textFile("file.csv") \
    .map(lambda line: line.split(",")) \
    .filter(lambda line: len(line) > 1) \
    .map(lambda line: (line[0], line[1])) \
    .collect()

This should work for you. The filter step drops lines with fewer than two fields, which is what was causing your IndexError.
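Note that a plain split(",") will mis-parse fields that contain quoted commas. As a minimal sketch, you can parse each line with Python's built-in csv module instead (the parse_line helper here is hypothetical, written just for illustration):

import csv
import io

# Hypothetical helper: parse one raw line with csv.reader so that quoted
# commas (e.g. "last, first",age) are split correctly.
def parse_line(line):
    return next(csv.reader(io.StringIO(line)))

sc.textFile("file.csv") \
    .map(parse_line) \
    .filter(lambda fields: len(fields) > 1) \
    .map(lambda fields: (fields[0], fields[1])) \
    .collect()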

You can also use the built-in CSV data source directly:

spark.read.csv(
    "some_input_file.csv", header=True, mode="DROPMALFORMED", schema=schema
)

or

(spark.read
    .schema(schema)
    .option("header", "true")
    .option("mode", "DROPMALFORMED")
    .csv("some_input_file.csv"))

without including any external dependencies.
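In both snippets, schema refers to a StructType that you define to match your file. A minimal sketch for a simple two-column file (the column names are assumptions for illustration):

from pyspark.sql.types import StructType, StructField, StringType

# Assumed column names; adjust them to match your file's header.
schema = StructType([
    StructField("col1", StringType(), True),
    StructField("col2", StringType(), True),
])

If you assign the reader expression to df, you can then get the same result as the RDD version (a list of the first two columns) like this:

# Assuming the reader expression above was assigned to `df`:
first_two = [tuple(row) for row in df.select(df.columns[:2]).collect()]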
