Loading and Saving your Data
File Formats : Spark makes it very simple to load and save data in a large number of file formats. Formats range from unstructured, like text, to semi-structured, like JSON, to structured, like SequenceFiles. The input formats that Spark wraps are transparently handled for compressed files, based on the file extension.
Text Files : Text files are very simple and convenient to load from and save to in Spark applications. When we load a single text file as an RDD, each input line becomes an element in the RDD. Spark can also load multiple whole text files at the same time into a pair RDD, with the key being the file name and the value being the contents of that file.
Loading text files : Loading a single text file is as simple as calling the textFile() function on our SparkContext with the path to the file, as in the example below:
input = sc.textFile("file:///home/holden/repos/spark/README.md")
Saving text files : The method saveAsTextFile() takes a path and writes the contents of the RDD to that location. The path is treated as a directory, and Spark will output multiple files underneath that directory. This allows Spark to write the output from multiple nodes. Example:
result.saveAsTextFile(outputFile)
JSON : JSON (JavaScript Object Notation) is a popular semi-structured data format. The simplest way to load JSON data is to load it as a text file and then map a JSON parser over the values.
Loading JSON : Loading the data as text and then parsing the JSON is an approach we can use in all of the supported languages. This works well provided you have one JSON record per line; if you have multiline JSON files, you instead have to load the whole file and then parse each file individually.
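The per-record parsing step can be sketched in plain Python; in a real Spark job, the same parser would be mapped over an RDD of lines produced by sc.textFile(). The sample records here are hypothetical.

```python
import json

# Each input line is one JSON record, as it would be after sc.textFile().
lines = [
    '{"name": "Holden", "lovesPandas": true}',
    '{"name": "Sparky", "lovesPandas": false}',
]

# Map a JSON parser over the lines; in PySpark this would be
# lines.map(json.loads) instead of a list comprehension.
records = [json.loads(line) for line in lines]

print(records[0]["name"])  # -> Holden
```

With real data you would also want to catch parse errors per line, so that one malformed record does not fail the whole job.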
Saving JSON : Writing out JSON files is much simpler than loading them, because we don't have to worry about incorrectly formatted data and we know what kind of data we are writing out. We can use the same libraries we used to parse JSON, but in reverse: take our RDD of structured data, convert it into an RDD of strings, and then write it out using Spark's text file API (Application Programming Interface).
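The reverse direction looks like this in plain Python; in PySpark the map would run over an RDD of records before saveAsTextFile(). The record contents are hypothetical.

```python
import json

# Structured records, standing in for an RDD of dicts.
records = [
    {"name": "Holden", "lovesPandas": True},
    {"name": "Sparky", "lovesPandas": False},
]

# In PySpark: records.map(json.dumps).saveAsTextFile(path).
# Here we just build the text lines that would be written out.
out_lines = [json.dumps(r) for r in records]

print(out_lines[0])
```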
Comma-Separated Values and Tab-Separated Values : Comma-separated values (CSV) files are supposed to contain a fixed number of fields per line, with the fields separated by a comma (or by a tab, in the case of tab-separated values, or TSV, files).
Loading CSV : Loading CSV/TSV data is similar to loading JSON data in that we can first load it as text and then process it. The lack of standardization of the format leads to different versions of the same library sometimes handling input in different ways. As with JSON, there are many different CSV libraries, but we will use only one for each language.
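Using one library per language, the per-line parsing step might look like this in plain Python with the standard csv module; in PySpark the parse function would be mapped over the lines of an RDD. The sample lines are hypothetical.

```python
import csv

def parse_line(line):
    """Parse a single CSV record; csv.reader accepts any iterable of lines."""
    return next(csv.reader([line]))

# One CSV record per line, as produced by sc.textFile().
lines = ['pandas,10', '"red pandas",5']

# In PySpark: lines.map(parse_line).
rows = [parse_line(line) for line in lines]

print(rows)  # [['pandas', '10'], ['red pandas', '5']]
```

Note that the library, not a hand-rolled split on commas, handles quoted fields such as "red pandas" correctly.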
Saving CSV : As with JSON data, writing out CSV/TSV data is quite simple and we can benefit from reusing the output encoding object. Since in CSV we don't output the field name with each record, to get consistent output we need to create a mapping. One of the easiest ways to do this is to write a function that converts the fields to given positions in an array.
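Such a field-ordering function can be sketched as follows; the field names are hypothetical, and in PySpark the function would be mapped over an RDD of records before saveAsTextFile().

```python
import csv
import io

# Fixed field positions, so every output line has the same column order.
FIELD_ORDER = ["name", "favouriteAnimal"]

def to_csv_line(record):
    """Encode a dict as one CSV line with fields in a fixed order."""
    buf = io.StringIO()
    csv.writer(buf).writerow([record[f] for f in FIELD_ORDER])
    return buf.getvalue().strip("\r\n")

record = {"name": "Holden", "favouriteAnimal": "panda"}
print(to_csv_line(record))  # -> Holden,panda
```

Letting csv.writer do the encoding means fields containing commas are quoted automatically.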
Sequence Files : SequenceFiles are a popular Hadoop format composed of flat files with key/value pairs. SequenceFiles have sync markers that allow Spark to seek to a point in the file and then resynchronize with the record boundaries.
Loading Sequence Files : Spark has a specialized API for reading in SequenceFiles. On the SparkContext we can call sequenceFile(path, keyClass, valueClass, minPartitions).
Saving Sequence Files : Since SequenceFiles consist of key/value pairs, we need a pair RDD with types that our SequenceFile can write out. Implicit conversions between Scala types and Hadoop Writables exist for many native types, so if you are writing out a native type you can just save your PairRDD by calling saveAsSequenceFile(path), and it will write out the data for you.
If there isn’t an automatic conversion from our key and value to Writable, or if we want to use variable-length types (e.g., VIntWritable), we can simply map over the data and convert it before saving.
Object Files : Object files are a deceptively simple wrapper around SequenceFiles that allows us to save RDDs containing just values. Saving an object file is as simple as calling saveAsObjectFile() on an RDD. Reading an object file back is also quite simple: the objectFile() function on the SparkContext takes in a path and returns an RDD.
Hadoop Input and Output Formats : Loading with other Hadoop input formats: To read in a file using the new Hadoop API we need to tell Spark a few things. The newAPIHadoopFile function takes a path and three classes. The first is the "format" class, which is the class representing our input format. (A similar function, hadoopFile(), exists for working with Hadoop input formats implemented with the older API.) The next class is the class of our key, and the final class is the class of our value. If we need to specify additional Hadoop configuration properties, we can also pass in a conf object.
File Compression : Frequently when working with big data, we find ourselves needing to use compressed data to save storage space and network overhead. With most Hadoop output formats, we can specify a compression codec that will compress the data so that it can be easily extracted and used.
File Systems : Spark supports a large number of file systems for reading and writing, which we can use with any of the file formats we wish.
- Local/“Regular” FS : While Spark supports loading files from the local file system, it requires that the files be available at the same path on all nodes in your cluster.
- Amazon S3 : Amazon S3 is an increasingly popular option for storing large amounts of data. S3 is especially fast when your compute nodes are located inside Amazon EC2, but it can easily have much worse performance if you have to go over the public Internet.
- HDFS : The Hadoop Distributed File System (HDFS) is a popular distributed file system with which Spark works well. HDFS is designed to work on commodity hardware and be resilient to node failure while providing high data throughput.
Structured Data with Spark SQL : Spark SQL is a component added in Spark 1.0 that is quickly becoming Spark’s preferred way to work with structured and semi-structured data. By structured data, we mean data that has a schema, that is, a consistent set of fields across data records.
Apache Hive : One of the most common structured data sources on Hadoop is Apache Hive. Hive can store tables in a variety of formats, from plain text to column-oriented formats, inside HDFS or other storage systems. Spark SQL can load any table supported by Hive.
JSON : If you have JSON data with a consistent schema across records, Spark SQL can infer the schema and load this data as rows, making it very simple to pull out the fields you need. To load JSON data, first create a HiveContext as when using Hive. Then use the HiveContext.jsonFile method to get an RDD of Row objects for the whole file.
Databases : Spark can access several popular databases using either their Hadoop connectors or custom Spark connectors.
- Java Database Connectivity : Spark can load data from any relational database that supports Java Database Connectivity (JDBC), including MySQL, Postgres, and other systems. To access this data, we construct an org.apache.spark.rdd.JdbcRDD and provide it with our SparkContext and the other parameters.
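Among the parameters JdbcRDD takes are a lower bound, an upper bound, and a number of partitions: it splits a numeric key range across partitions, each running the same parameterized query against its own slice. That partitioning idea can be sketched in plain Python with the stdlib sqlite3 module standing in for a remote database; the table and column names are hypothetical.

```python
import sqlite3

# A throwaway in-memory table standing in for a remote relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE panda (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO panda VALUES (?, ?)",
                 [(i, f"panda{i}") for i in range(1, 9)])

def bounds(lower, upper, num_partitions):
    """Split [lower, upper] into contiguous key ranges, one per partition."""
    step = (upper - lower + 1) // num_partitions
    for p in range(num_partitions):
        lo = lower + p * step
        hi = upper if p == num_partitions - 1 else lo + step - 1
        yield lo, hi

# Each "partition" runs the same query over its own key range; in Spark
# these queries would execute in parallel on different workers.
rows = []
for lo, hi in bounds(1, 8, 2):
    rows += conn.execute(
        "SELECT id, name FROM panda WHERE id >= ? AND id <= ?", (lo, hi)
    ).fetchall()

print(len(rows))  # 8 rows total across both partitions
```

The query's two placeholders mirror the two bound parameters JdbcRDD substitutes into the SQL it is given.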
- Cassandra : Spark’s Cassandra support has improved greatly with the introduction of the open source Spark Cassandra connector from DataStax. Since the connector is not currently part of Spark, you will need to add some further dependencies to your build file. The Cassandra connector doesn’t currently use Spark SQL; instead it returns RDDs of CassandraRow objects, which have some of the same methods as Spark SQL’s Row object.
- HBase : Spark can access HBase through its Hadoop input format, implemented in the org.apache.hadoop.hbase.mapreduce.TableInputFormat class. This input format returns key/value pairs where the key is of type org.apache.hadoop.hbase.io.ImmutableBytesWritable and the value is of type org.apache.hadoop.hbase.client.Result. The Result class includes various methods for getting values based on their column family, as described in its API documentation.
- Elasticsearch : Spark can both read and write data from Elasticsearch using Elasticsearch-Hadoop. Elasticsearch is a new open source, Lucene-based search system. The Elasticsearch connector is a bit different from the other connectors we have examined, since it ignores the path information we provide and instead depends on configuration set on our SparkContext.