I have been trying to get the Databricks spark-csv library for reading CSVs to work. I want to read a TSV file created by Hive into a Spark DataFrame using the Scala API.

Here is an example that you can run in the Spark shell (I made the sample data public so that it works for you):

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val sqlContext = new SQLContext(sc)
val segments = sqlContext.read.format("com.databricks.spark.csv").load("s3n://michaeldiscenza/data/test_segments")

The documentation says you can specify the delimiter, but I am unclear about how to set that option.

1 Answer

For Spark 2.0+, I would suggest using the built-in CSV connector, which avoids the third-party dependency and performs better:

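import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
// "sep" sets the field delimiter; "\t" reads tab-separated values
val segments = spark.read.option("sep", "\t").csv("/path/to/file")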
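
If you are still on Spark 1.x with the spark-csv package, as in the question, the delimiter is passed as a reader option as well. The following is a minimal sketch, assuming a Spark shell started with the package on the classpath (for example `--packages com.databricks:spark-csv_2.10:1.5.0`) so that `sc` already exists. The temporary file and its two sample rows are hypothetical stand-ins for the Hive output:

```scala
import java.nio.file.Files
import org.apache.spark.sql.SQLContext

// `sc` is the SparkContext provided by the Spark shell (assumption)
val sqlContext = new SQLContext(sc)

// Hypothetical sample data: two tab-separated rows, no header line
val path = Files.createTempFile("test_segments", ".tsv")
Files.write(path, "1\talpha\n2\tbeta\n".getBytes("UTF-8"))

// spark-csv takes the field separator through the "delimiter" option
val segments = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("delimiter", "\t")
  .load(path.toUri.toString)

segments.show()
```

To read the original data, replace the temporary file with the S3 path from the question. If the file starts with a header row, adding `.option("header", "true")` tells the reader to use it for column names.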
