Spark can load and save data from multiple input and output sources. It accesses data through the InputFormat and OutputFormat interfaces used by Hadoop MapReduce, which are available for many formats and file systems, such as text files, JSON files, CSV and TSV files, sequence files, object files, and other Hadoop input and output formats. Let’s get started.
File Formats
Spark provides a simple way to load and save data in a large number of file formats. These formats range from unstructured, like text, to semi-structured, like JSON, to structured, like sequence files. The input formats that Spark wraps transparently handle compressed files based on the file extension.
Text Files
Loading and saving text files in Spark is straightforward. When we load a single text file as an RDD, each input line becomes an element of the RDD. It is also possible to load multiple whole text files at once into a pair RDD, in which the key is the name of each file and the value is that file's contents.
- Loading the text files: Loading a single text file is as simple as calling the textFile() function on our SparkContext with the path to the file, as shown below:
input = sc.textFile("file:///home/holden/repos/spark/README.md")
- Saving the text files: Spark provides a function called saveAsTextFile(), which takes a path and writes the contents of the RDD to it. The path is treated as a directory, and Spark writes multiple output files into that directory. This is what allows Spark to write the output from multiple nodes.
result.saveAsTextFile(outputFile)
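The load and save steps above can be sketched together as follows. This is a minimal illustration, assuming an existing SparkContext is passed in as `sc`; the helper names and the choice to drop blank lines are hypothetical and only serve the example.

```python
def is_nonblank(line):
    """Return True for lines that contain something other than whitespace."""
    return line.strip() != ""


def copy_nonblank_lines(sc, input_path, output_dir):
    """Load a text file, drop blank lines, and write the result as text.

    `sc` is assumed to be an existing SparkContext. Spark treats
    `output_dir` as a directory and, by default, refuses to overwrite it.
    """
    lines = sc.textFile(input_path)      # one RDD element per input line
    kept = lines.filter(is_nonblank)
    kept.saveAsTextFile(output_dir)      # writes part-* files into output_dir
    return kept


def line_counts_per_file(sc, directory):
    """Return a pair RDD of (file path, number of lines) for a directory."""
    files = sc.wholeTextFiles(directory)  # pair RDD: (file path, contents)
    return files.mapValues(lambda contents: len(contents.splitlines()))
```

Passing the context in as a parameter keeps the helpers independent of how the application creates its SparkContext.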
JSON Files
JSON stands for JavaScript Object Notation, a lightweight data interchange format. It is text-only, so it can easily be sent to and received from a server. Python has a built-in package named ‘json’ for working with JSON.
- JSON files can be loaded by reading the data as text and then parsing each record with a JSON parser. This approach works in all supported languages, provided that each line holds one JSON record. If a record spans multiple lines, the whole file has to be loaded and then parsed.
- Saving JSON files is easier than loading them, because developers don’t have to worry about incorrectly formatted data. The same libraries used to parse JSON strings into records can be used in reverse to convert an RDD of structured data into an RDD of strings, which can then be written out as text.
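As a rough sketch of both directions, the helpers below use Python's built-in json package and assume one JSON record per line. The function names are hypothetical, and `sc` is assumed to be an existing SparkContext; skipping malformed lines is one possible policy, not the only one.

```python
import json


def parse_json_line(line):
    """Parse one line of JSON, returning None if the line is malformed."""
    try:
        return json.loads(line)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


def load_json_records(sc, path):
    """Load a file with one JSON record per line, skipping bad lines."""
    records = sc.textFile(path).map(parse_json_line)
    return records.filter(lambda record: record is not None)


def save_json_records(records, output_dir):
    """Serialize each record to a JSON string and write it as text."""
    records.map(json.dumps).saveAsTextFile(output_dir)
```

Note that a line containing only the JSON literal `null` also parses to None and is therefore dropped by this filter.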
CSV and TSV Files
Comma-separated values (CSV) files are a very common format used to store tables. These files have a fixed number of fields in each line, and the field values are separated by commas. Similarly, in tab-separated values (TSV) files, the field values are separated by tabs.
- Loading the CSV Files: Loading CSV and TSV files is quite similar to loading JSON files. The content is first loaded as text and then parsed. As with JSON, many CSV libraries exist, and it is best to use the one that fits the language the application is written in.
- Saving the CSV Files: Writing CSV or TSV files is also quite easy. However, because the output does not carry the field names with each record, a mapping is required to keep the output consistent. One easy way to do this is to write a function that converts the fields to fixed positions in an array.
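The mapping between named fields and array positions can be sketched with Python's built-in csv module. The field list, the function names, and the `sc` parameter (an existing SparkContext) are assumptions made for this example.

```python
import csv
import io

# Hypothetical field order; it fixes the position of each field in a row.
FIELDS = ["name", "favorite_animal"]


def parse_line(line, delimiter=","):
    """Parse one CSV (or TSV) line into a dict keyed by FIELDS."""
    reader = csv.DictReader(io.StringIO(line), fieldnames=FIELDS,
                            delimiter=delimiter)
    return next(reader)


def format_record(record, delimiter=","):
    """Write one record as a CSV (or TSV) line, fields in FIELDS order."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter)
    writer.writerow([record[field] for field in FIELDS])
    return buffer.getvalue().rstrip("\r\n")  # drop the writer's line ending


def load_csv(sc, path):
    """Load the file as text, then parse each line."""
    return sc.textFile(path).map(parse_line)


def save_csv(records, output_dir):
    """Convert each record to a line of text, then save as a text file."""
    records.map(format_record).saveAsTextFile(output_dir)
```

Parsing line by line only works when no field contains an embedded newline; passing `delimiter="\t"` handles TSV files the same way.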
Sequence Files
A sequence file is a flat file that consists of binary key/value pairs and is widely used in Hadoop. The sync markers in these files allow Spark to seek to a particular point in a file and then resynchronize with the record boundaries.
- Loading the Sequence Files: Spark comes with a specialized API for reading sequence files. All we have to do is call sequenceFile(path, keyClass, valueClass, minPartitions) on the SparkContext.
- Saving the Sequence Files: To save a sequence file, a pair RDD is required, along with the types to write. For several native types, implicit conversions between Scala types and Hadoop Writables exist, so to write a native type we simply call the saveAsSequenceFile(path) function on the pair RDD. If the conversion is not automatic, we have to map over the data and convert it before saving.
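In Python, the same round trip might look like the sketch below. The record layout (dicts with `name` and `count` keys) and the helper names are hypothetical, and `sc` is assumed to be an existing SparkContext. The two class names are the standard Hadoop Writable types for text and integers.

```python
# Hadoop Writable classes used for the key and the value.
TEXT = "org.apache.hadoop.io.Text"
INT_WRITABLE = "org.apache.hadoop.io.IntWritable"


def to_pair(record):
    """Turn a (hypothetical) dict record into a (key, value) tuple."""
    return (record["name"], int(record["count"]))


def save_counts(records, output_dir):
    """Map records to key/value pairs, then save them as a sequence file.

    `records` is assumed to be an RDD of dicts with 'name' and 'count'.
    """
    records.map(to_pair).saveAsSequenceFile(output_dir)


def load_counts(sc, path):
    """Read the pairs back as an RDD of (str, int) tuples."""
    return sc.sequenceFile(path, TEXT, INT_WRITABLE)
```

The explicit `to_pair` step plays the role of the manual conversion described above: the data is shaped into key/value pairs of simple types before saving.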
Object Files
Object files are a simple wrapper around sequence files that allows saving RDDs containing only values, rather than key/value pairs. Saving an object file is quite simple, as it just requires calling saveAsObjectFile() on an RDD.
Hadoop Input and Output Formats
An input split refers to a chunk of the data stored in HDFS. Spark provides APIs for using Hadoop's InputFormat classes in Scala, Python, and Java. The methods for the old Hadoop API are hadoopRDD and hadoopFile, and their counterparts for the new Hadoop API are newAPIHadoopRDD and newAPIHadoopFile.
For output, Hadoop provides TextOutputFormat, in which the key and value of each pair are separated by a tab by default and saved in a part file. Spark supports both the older mapred and the newer mapreduce Hadoop APIs.
- File Compression: For most Hadoop outputs, a compression codec can be specified, which is then used to compress the data.
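As an illustration of specifying a codec, the sketch below saves an RDD as gzip- or bzip2-compressed text. The lookup table and helper names are hypothetical; the class names are the standard Hadoop codec classes, and `rdd` is assumed to be an existing RDD.

```python
# Short names mapped to Hadoop compression codec classes.
CODECS = {
    "gzip": "org.apache.hadoop.io.compress.GzipCodec",
    "bzip2": "org.apache.hadoop.io.compress.BZip2Codec",
}


def codec_class(name):
    """Return the Hadoop codec class name for a short codec name."""
    try:
        return CODECS[name.lower()]
    except KeyError:
        raise ValueError("unsupported codec: %r" % name)


def save_compressed(rdd, output_dir, codec="gzip"):
    """Save an RDD as text, compressing each part file with the codec."""
    rdd.saveAsTextFile(output_dir, compressionCodecClass=codec_class(codec))
```

When the compressed output is read back, textFile() picks the matching decompression based on the file extension.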
File Systems
A wide array of file systems are supported by Apache Spark. Some of them are discussed below:
- Local/Regular FS: Spark can load files from the local file system, which requires the files to be available at the same path on all nodes.
- Amazon S3: The Amazon S3 file system is suitable for storing large amounts of data. It works faster when the compute nodes are inside Amazon EC2. However, performance can drop considerably if the data has to travel over the public network.
- HDFS: It is a distributed file system that works well on commodity hardware. It provides high throughput.
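In practice, the file system is usually selected by the URI scheme of the path passed to Spark. The sketch below shows the idea; the host name, port, and bucket name are hypothetical placeholders, and `sc` is assumed to be an existing SparkContext.

```python
from urllib.parse import urlparse

# Example paths; the HDFS host/port and the S3 bucket are placeholders.
EXAMPLE_PATHS = {
    "local": "file:///home/holden/repos/spark/README.md",
    "hdfs": "hdfs://namenode:8020/data/README.md",
    "s3": "s3a://example-bucket/data/README.md",
}


def file_system_of(path):
    """Return the URI scheme of a path, or 'default' if it has none.

    A path without a scheme is resolved against the cluster's configured
    default file system.
    """
    return urlparse(path).scheme or "default"


def load_from(sc, path):
    """The same call loads text from any supported file system."""
    return sc.textFile(path)
```

Reading from S3 with the `s3a` scheme additionally assumes that the Hadoop AWS connector and credentials are configured for the cluster.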
Structured Data with Spark SQL
Spark SQL works effectively on semi-structured and structured data. Structured data is data that has a schema, that is, a consistent set of fields across records.
Apache Hive
One of the common structured data sources on Hadoop is Apache Hive. Hive can store tables in a variety of formats, from plain text to column-oriented formats, inside HDFS or other storage systems. Spark SQL can load any table supported by Hive.
Databases
Spark supports a wide range of databases with the help of Hadoop connectors or custom Spark connectors. These include relational databases accessed through JDBC, as well as Cassandra, HBase, and Elasticsearch.
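For the JDBC case, a minimal sketch using the DataFrame reader could look like this. It assumes an existing SparkSession passed in as `spark` and a JDBC driver for the target database on the classpath; the helper names, connection URL, table, and credentials are hypothetical.

```python
def jdbc_options(url, table, user, password):
    """Collect the options that Spark's JDBC data source expects."""
    return {
        "url": url,          # e.g. a jdbc:postgresql://... connection URL
        "dbtable": table,    # table to read
        "user": user,
        "password": password,
    }


def load_table(spark, options):
    """Read a database table into a DataFrame through JDBC."""
    return spark.read.format("jdbc").options(**options).load()
```

Building the options separately makes it easy to keep credentials out of the code, for example by reading them from the environment before calling `load_table`.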