Loading and Saving your Data
File Formats : Spark provides a simple way to load and save data in a large number of file formats. Formats range from unstructured, like text, to semi-structured, like JSON, to structured, like sequence files. The input formats that Spark wraps transparently handle compressed data based on the file extension.
Text Files : Text files are simple and convenient to load and save in Spark applications. When we load a single text file as an RDD, each input line becomes an element in the RDD. Spark can also load multiple whole text files at the same time into a pair RDD, with the key being the file name and the value being the contents of each file.
Loading text files : Loading a single text file is as simple as calling the textFile() function on our SparkContext with the path to the file, as in the example below:
input = sc.textFile("file:///home/holden/repos/spark/README.md")
Saving text files : Spark provides a function called saveAsTextFile(), which takes a path and writes the contents of the RDD to it. The path is treated as a directory, and Spark writes multiple output files into that directory; this is how Spark is able to write output from multiple nodes.
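The text-file calls described above can be sketched as a few small helpers. This is only an illustration: the helper names are hypothetical, and each one assumes you pass in an existing SparkContext (`sc`) or RDD rather than creating one.

```python
def load_lines(sc, path):
    """Return an RDD with one element per input line."""
    return sc.textFile(path)


def load_whole_files(sc, directory):
    """Return a pair RDD of (file name, file contents)."""
    return sc.wholeTextFiles(directory)


def save_lines(rdd, output_dir):
    """Write the RDD as part files inside output_dir.

    output_dir is treated as a directory, not a single file.
    """
    rdd.saveAsTextFile(output_dir)
```

With a live context, `save_lines(load_lines(sc, path), output_dir)` would copy the lines of one file into part files under the output directory.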
Loading JSON : In all the supported languages, JSON can be loaded by reading the data as text and then parsing it. This works when each line holds one JSON record; if records span multiple lines, the developer has to load each whole file and then parse it.
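A minimal sketch of the load-as-text-then-parse approach, assuming one JSON record per line. The helper names are hypothetical, the parsing uses Python's standard `json` module, and `sc` is assumed to be an existing SparkContext. Skipping malformed lines is a design choice made here, not something Spark requires.

```python
import json


def parse_json_line(line):
    """Parse one line of text as JSON; return None if it is malformed."""
    try:
        return json.loads(line)
    except ValueError:
        return None


def load_json_records(sc, path):
    """Load a file with one JSON record per line into an RDD of parsed values."""
    return (sc.textFile(path)
              .map(parse_json_line)
              .filter(lambda record: record is not None))
```

Returning None for bad lines and filtering them out keeps one corrupt record from failing the whole job; counting the skipped lines would be a sensible addition in real use.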
Saving JSON : Compared with loading JSON, writing it is much easier, as the developer does not need to worry about incorrectly formatted data values. The same libraries that were used to parse the strings into structured data can be used here; this time an RDD of structured data is taken and converted into an RDD of strings.
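The reverse direction might look like the sketch below, which turns each structured record into a single-line JSON string before writing it as text. The helper names are hypothetical, and `rdd` is assumed to hold JSON-serializable values such as dicts.

```python
import json


def to_json_line(record):
    """Serialize one record to a single-line JSON string."""
    # sort_keys gives a stable field order across records.
    return json.dumps(record, sort_keys=True)


def save_json_records(rdd, output_dir):
    """Convert an RDD of structured records to strings and write them as text."""
    rdd.map(to_json_line).saveAsTextFile(output_dir)
```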
Comma-Separated Values and Tab-Separated Values : CSV is a very common format used to store tables. These files have a fixed number of fields on each line, with the values separated by commas. Similarly, in TSV files the field values are separated by tabs.
Loading CSV : Loading CSV and TSV files is quite similar to loading JSON files: the content is first loaded as text and then processed. As with JSON, there are many different CSV libraries, but it is suggested to use only one for each language.
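As an illustration, the sketch below parses each line with Python's standard `csv` module. It assumes that no field contains an embedded newline, so every line is a complete record. The helper names are hypothetical, and `sc` is assumed to be an existing SparkContext.

```python
import csv
import io


def parse_csv_line(line, delimiter=","):
    """Parse one CSV line into a list of fields (pass a tab for TSV)."""
    reader = csv.reader(io.StringIO(line), delimiter=delimiter)
    # An empty line yields no row, so fall back to an empty list.
    return next(reader, [])


def load_csv(sc, path, delimiter=","):
    """Load a text file and split every line into its fields."""
    return sc.textFile(path).map(lambda line: parse_csv_line(line, delimiter))
```

Using a real CSV parser instead of `line.split(",")` keeps quoted fields that contain the delimiter intact.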
Saving CSV : Writing CSV or TSV files is quite easy; however, because the output does not carry the field names with each record, a mapping is required to keep the output consistent. One easy way to do this is to write a function that converts the fields to fixed positions in an array.
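One way to sketch the field-to-position mapping described above is shown below. The column list `FIELDS` and the helper names are hypothetical, and records are assumed to be dicts that contain every listed field.

```python
import csv
import io

# Hypothetical column order; every record is written in this order.
FIELDS = ["name", "favorite_animal"]


def to_csv_line(record, fields=FIELDS):
    """Put the record's fields in fixed positions and format one CSV line."""
    row = [record[name] for name in fields]
    buffer = io.StringIO()
    csv.writer(buffer).writerow(row)
    # Drop the writer's line terminator; saveAsTextFile adds its own newline.
    return buffer.getvalue().rstrip("\r\n")


def save_csv(rdd, output_dir):
    """Convert an RDD of dict records to CSV lines and write them as text."""
    rdd.map(to_csv_line).saveAsTextFile(output_dir)
```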
Sequence Files : Sequence files are a widely used Hadoop format made up of flat files containing binary key/value pairs. The sync markers in these files allow Spark to seek to a particular point in the file and then resynchronize with the record boundaries.
Loading Sequence Files : Spark comes with a specialized API for reading sequence files. All you have to do is call sequenceFile(path, keyClass, valueClass, minPartitions) on the SparkContext.
Saving Sequence Files : To save a sequence file, a pair RDD is required along with the types to write. For several native types, implicit conversions between Scala types and Hadoop Writables exist, so to write a native type you only have to call saveAsSequenceFile(path) on the pair RDD. If the conversion is not automatic, map over the data and convert it before saving.
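In Python, the two calls could be wrapped as in the sketch below. The helper names are hypothetical; the Writable class names are given as strings, and `Text` keys with `IntWritable` values are only an assumed example of what the file contains.

```python
TEXT = "org.apache.hadoop.io.Text"
INT_WRITABLE = "org.apache.hadoop.io.IntWritable"


def load_sequence_file(sc, path, key_class=TEXT, value_class=INT_WRITABLE):
    """Read a sequence file into a pair RDD of (key, value)."""
    return sc.sequenceFile(path, key_class, value_class)


def save_sequence_file(pair_rdd, output_dir):
    """Write a pair RDD of simple key/value types as a sequence file."""
    pair_rdd.saveAsSequenceFile(output_dir)
```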
Object Files : Object files are a simple wrapper around sequence files that allows saving RDDs containing only value records. Saving an object file is quite simple, as it just requires calling saveAsObjectFile() on an RDD.
Hadoop Input and Output Format : An input split refers to a portion of the data stored in HDFS. Spark provides APIs for using Hadoop InputFormats from Scala, Python and Java. hadoopRDD and hadoopFile work with the old Hadoop API, while newAPIHadoopRDD and newAPIHadoopFile work with the new one.
For Hadoop output formats, Hadoop commonly uses TextOutputFormat, in which each key and value pair is written separated by a tab (the default separator) and saved in a part file. Spark has APIs for both the mapred and mapreduce Hadoop packages.
File Compression : For most Hadoop output formats, a compression codec can be specified, which is then used to compress the data being written.
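For example, a codec can be passed when saving an RDD as text. The sketch below assumes the gzip codec that ships with Hadoop; the helper name is hypothetical.

```python
GZIP_CODEC = "org.apache.hadoop.io.compress.GzipCodec"


def save_compressed(rdd, output_dir, codec=GZIP_CODEC):
    """Write the RDD as text part files compressed with the given codec."""
    rdd.saveAsTextFile(output_dir, compressionCodecClass=codec)
```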
File systems : Spark supports a wide array of file systems. Some of them are discussed below:
- Local/“Regular” FS : Spark can load files from the local file system, but this requires the files to be available at the same path on all nodes.
- Amazon S3 : This storage system is suitable for storing large amounts of data. It works fastest when the compute nodes are inside Amazon EC2; performance can drop considerably if the data has to travel over the public network.
- HDFS : It is a distributed file system that works well on commodity hardware and provides high throughput.
Structured Data with Spark SQL : Spark SQL works effectively on structured and semi-structured data. Structured data is data that has a schema, that is, a consistent set of fields across records.
Apache Hive : One common structured data source on Hadoop is Apache Hive. Hive can store tables in a variety of formats, from plain text to column-oriented formats, inside HDFS or other storage systems. Spark SQL can load any table supported by Hive.
Databases : Spark supports a wide range of databases through Hadoop connectors or custom Spark connectors. Examples include databases reachable over JDBC, Cassandra, HBase and Elasticsearch.
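As one illustration of the JDBC route, the sketch below reads a table through Spark SQL's built-in `jdbc` data source. It assumes an existing SparkSession (`spark`) and a JDBC driver on the classpath. The helper names are hypothetical, and the connection details are placeholders that the caller supplies.

```python
def jdbc_options(url, table, user, password):
    """Collect the connection settings the jdbc data source expects."""
    return {"url": url, "dbtable": table, "user": user, "password": password}


def load_table(spark, options):
    """Read a database table into a DataFrame via the jdbc data source."""
    return spark.read.format("jdbc").options(**options).load()
```

Keeping the options in a plain dict makes it easy to build them from configuration and to avoid hard-coding credentials in the job.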