Spark provides a simple way to load and save data in a large number of file formats. These formats range from unstructured, like text, to semi-structured, like JSON, to structured, like sequence files. For the input formats that Spark wraps, compression is handled transparently based on the file extension.
Text files are simple and convenient to load from and save to in Spark applications. When we load a single text file as an RDD, each input line becomes an element in the RDD. Spark can also load multiple whole text files at the same time into a pair RDD using wholeTextFiles(), with the key being the file name and the value being the contents of that file.
- Loading the text files: Loading a single text file is as simple as calling the textFile() function on our SparkContext with the path to the file, as shown below:
input = sc.textFile("file:///home/holden/repos/spark/README.md")
- Saving the text files: Spark provides a function called saveAsTextFile(), which takes a path and writes the contents of the RDD to it. The path is treated as a directory, and multiple output files will be produced in that directory. This is how Spark is able to write output from multiple nodes.
- Loading the JSON Files: In all supported languages, JSON can be loaded by reading the data as text and then parsing each record with a JSON parser. This works when each line holds one JSON record; if a file contains multi-line JSON records, the developer will have to load the entire file and then parse it.
- Saving the JSON Files: Writing JSON files is much easier than loading them because the developer does not have to worry about incorrectly formatted data values. The same libraries that were used to parse the strings into structured data can be used in the opposite direction: RDDs of structured data are converted into RDDs of strings, which are then written out as text files.
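The load-as-text-then-parse approach for JSON can be sketched in Python with the standard json module. This is only a minimal sketch that assumes one JSON record per line; the helper names, the lovesPandas field, and the input_path and output_path names in the comments are hypothetical. The Spark calls appear as comments so that the parsing logic can be run on its own.

```python
import json

def parse_record(line):
    """Parse one line of text into a dict; return None for malformed lines."""
    try:
        record = json.loads(line)
    except ValueError:
        return None
    # Keep only JSON objects; bare numbers or strings are not records.
    return record if isinstance(record, dict) else None

def to_json_line(record):
    """Serialize one structured record back into a JSON string."""
    return json.dumps(record, sort_keys=True)

# Local check of the helpers on a few sample lines.
lines = [
    '{"name": "Sparky", "lovesPandas": true}',
    'not json',
    '{"name": "Holden", "lovesPandas": false}',
]
records = [r for r in map(parse_record, lines) if r is not None]
panda_lovers = [to_json_line(r) for r in records if r.get("lovesPandas")]
print(panda_lovers)

# With a SparkContext `sc`, the same helpers would be applied per element:
#   data = sc.textFile(input_path).map(parse_record).filter(lambda r: r is not None)
#   (data.filter(lambda r: r.get("lovesPandas"))
#        .map(to_json_line)
#        .saveAsTextFile(output_path))
```

Returning None for malformed lines and filtering them out is one possible way to cope with badly formatted input; logging or counting the skipped lines may be preferable in a real job.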
CSV and TSV Files
Comma-separated values (CSV) files are a very common format used to store tables. These files have a fixed number of fields in each line, and the values are separated by commas. Similarly, in tab-separated values (TSV) files, the field values are separated by tabs.
- Loading the CSV Files: The loading procedure for CSV and TSV files is quite similar to that for JSON files. To load a CSV/TSV file, its content is first loaded as text and then parsed. As with JSON, many different CSV libraries are available, but it is suggested to use the one that is standard for each language.
- Saving the CSV Files: Writing CSV/TSV files is also quite easy. However, as the output does not carry the field names with each record, a mapping from fields to column positions is required to keep the output consistent. One easy way to do this is to write a function that converts the fields into fixed positions in an array.
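The CSV steps above can be sketched with Python's standard csv module. This is a minimal sketch, not a definitive implementation: the FIELDS column order, the helper names, and the sample rows are hypothetical, and it assumes that no field contains an embedded newline, so each line is a complete record. The Spark calls are shown as comments only.

```python
import csv
import io

# Hypothetical, fixed column order: the mapping from field name to position.
FIELDS = ["name", "favouriteAnimal"]

def load_record(line, delimiter=","):
    """Parse a single CSV (or TSV) line into a dict keyed by FIELDS."""
    reader = csv.DictReader(io.StringIO(line), fieldnames=FIELDS, delimiter=delimiter)
    return next(reader)

def to_csv_line(record, delimiter=","):
    """Place the fields at their fixed positions and format them as one line."""
    output = io.StringIO()
    # An empty line terminator is used because saveAsTextFile adds the newline.
    writer = csv.writer(output, delimiter=delimiter, lineterminator="")
    writer.writerow([record[field] for field in FIELDS])
    return output.getvalue()

# Local check, including a value that contains the delimiter.
rows = [load_record(line) for line in ['Holden,panda', '"Smith, Jr.",bear']]
out_lines = [to_csv_line(row) for row in rows]
print(out_lines)

# With a SparkContext `sc`, the helpers would be applied per element:
#   records = sc.textFile(input_path).map(load_record)
#   records.map(to_csv_line).saveAsTextFile(output_path)
```

For TSV data, the same helpers can be called with delimiter="\t".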
A sequence file is a flat file that consists of binary key/value pairs. Sequence files are widely used in Hadoop. The sync markers in these files allow Spark to seek to a point in a file and then re-synchronize with the record boundaries.
- Loading the Sequence Files: Spark comes with a specialized API for reading sequence files. All we have to do is call sequenceFile(path, keyClass, valueClass, minPartitions) on the SparkContext.
- Saving the Sequence Files: To save a sequence file, a pair RDD is required, along with the types to write. For several native types, implicit conversions between Scala types and Hadoop Writables exist. Hence, to write a native type, we just save the pair RDD by calling the saveAsSequenceFile(path) function. If the conversion is not automatic, we first have to map over the data and convert it before saving.
Object files are a simple wrapper around sequence files that allows saving RDDs containing only value records. Saving an object file is quite simple as it just requires calling saveAsObjectFile() on an RDD.
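The map-then-save step for sequence files might look like the following in PySpark. This is a sketch under stated assumptions: the record layout (animal, count), the helper names, and the path arguments are hypothetical, and the two functions that take a SparkContext `sc` are defined but not called here, since they need a running Spark installation. In PySpark, the key and value class arguments of sequenceFile() are optional, so the sketch omits them.

```python
def to_pair(record):
    """Convert a record into a (str, int) pair of native types for saving."""
    return (str(record["animal"]), int(record["count"]))

def save_pairs(sc, records, path):
    """Map records to key/value pairs, then write them as a sequence file."""
    sc.parallelize(records).map(to_pair).saveAsSequenceFile(path)

def load_pairs(sc, path):
    """Read the key/value pairs back from a sequence file as a pair RDD."""
    return sc.sequenceFile(path)

# Local check of the conversion step only; no Spark is needed for this part.
records = [{"animal": "panda", "count": "3"}, {"animal": "kay", "count": 6}]
pairs = [to_pair(r) for r in records]
print(pairs)
```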
Hadoop Input and Output Formats
An input split refers to a chunk of the input data, such as data stored in HDFS, that is read as a unit. Spark provides APIs to use Hadoop InputFormats from Scala, Python, and Java. The hadoopRDD and hadoopFile functions work with the old Hadoop API, while newAPIHadoopRDD and newAPIHadoopFile work with the new Hadoop API.
For output, Hadoop commonly uses TextOutputFormat, in which each key and value pair is separated by a tab by default and saved in a part file. Spark has APIs for both the old mapred and the new mapreduce Hadoop packages.
- File Compression: For most Hadoop outputs, a compression codec can be specified, which is used to compress the data.
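As an illustration of the Hadoop format and compression options, the sketch below wraps two PySpark calls in small helper functions. The helper names and the choice of KeyValueTextInputFormat and the gzip codec are assumptions made for this example, not something the text above prescribes. The functions expect an existing SparkContext or RDD and are only defined here, not run.

```python
# Fully qualified Hadoop class names, passed to PySpark as strings.
KEY_VALUE_INPUT_FORMAT = "org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat"
TEXT_CLASS = "org.apache.hadoop.io.Text"
GZIP_CODEC = "org.apache.hadoop.io.compress.GzipCodec"

def load_key_value_text(sc, path):
    """Read tab-separated key/value lines through the new Hadoop API."""
    return sc.newAPIHadoopFile(path, KEY_VALUE_INPUT_FORMAT, TEXT_CLASS, TEXT_CLASS)

def save_gzipped(rdd, path):
    """Write an RDD as text, compressing each part file with gzip."""
    rdd.saveAsTextFile(path, compressionCodecClass=GZIP_CODEC)
```

Passing the codec per call keeps the compression choice next to the write it affects; it could also be set once in the Hadoop configuration.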
Apache Spark supports a wide array of file systems. Some of them are discussed below:
- Local/Regular FS: Spark is able to load files from the local file system, which requires files to remain on the same path on all nodes.
- Amazon S3: This file system is suitable for storing large amounts of data. It works faster when the compute nodes are inside Amazon EC2. However, performance can degrade if the data has to travel over the public network.
- HDFS: It is a distributed file system that works well on commodity hardware. It provides high throughput.
Structured Data with Spark SQL
Spark SQL works effectively on semi-structured and structured data. Structured data is data that has a schema, that is, a consistent set of fields.
One of the common structured data sources on Hadoop is Apache Hive. Hive can store tables in a variety of formats, from plain text to column-oriented formats, inside HDFS or on other storage systems. Spark SQL can load any table supported by Hive.
Spark can access a wide range of databases through Hadoop connectors or custom Spark connectors. Examples include relational databases via JDBC, Cassandra, HBase, and Elasticsearch.