
I have an application that sends data to AWS Kinesis Firehose, which writes the data into my S3 bucket. Firehose uses the "yyyy/MM/dd/HH" format for the file paths.

Like in this sample S3 path:


Now I have a Spark application written in Scala, where I need to read data from a specific time period. I have start and end dates. The data is in JSON format, which is why I use sqlContext.read.json() rather than sc.textFile().

How can I read the data quickly and efficiently?

1 Answer


Have a look at the DataFrameReader API and you'll notice that there is a .json(paths: String*) method. This will easily solve your problem. Just build a collection of the paths you want, with globs or not, as you prefer, and then call the method, e.g.:

val paths: Seq[String] = ...
val df = sqlContext.read.json(paths: _*)
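For example, here is a minimal sketch of how you might build one glob per hour for your time range using java.time. The bucket name, prefix, and dates are placeholders; adjust them to match your Firehose delivery stream's actual S3 layout:

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit

// Build one S3 glob per hour between start and end (inclusive),
// matching Firehose's "yyyy/MM/dd/HH" directory layout.
// "my-bucket" and "firehose" below are hypothetical placeholders.
def hourlyPaths(bucket: String, prefix: String,
                start: LocalDateTime, end: LocalDateTime): Seq[String] = {
  val fmt = DateTimeFormatter.ofPattern("yyyy/MM/dd/HH")
  val hours = ChronoUnit.HOURS.between(start, end)
  (0L to hours).map { h =>
    s"s3://$bucket/$prefix/${start.plusHours(h).format(fmt)}/*"
  }
}

val paths = hourlyPaths("my-bucket", "firehose",
  LocalDateTime.of(2016, 9, 26, 22, 0),
  LocalDateTime.of(2016, 9, 27, 1, 0))
// paths covers the four hours 2016/09/26/22 through 2016/09/27/01
```

You can then pass the result straight to the reader with sqlContext.read.json(paths: _*). Because Spark only lists the directories you name, it never touches data outside the requested window, which is what makes this approach efficient compared to reading the whole bucket and filtering afterwards.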

