You can parse a CSV file with Spark's built-in CSV reader. A successful read returns a DataFrame/Dataset, and you can then easily apply SQL-like operations on top of that DataFrame/Dataset.
Using Spark 2.x (and above) with Java
Create a SparkSession object (here named spark)
import org.apache.spark.sql.SparkSession;
SparkSession spark = SparkSession
.builder()
.appName("Java Spark SQL Example")
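// .master("local[*]") // optional: set a master for local runs (an assumption; usually supplied via spark-submit)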
.getOrCreate();
Create a schema for the rows with StructType
import org.apache.spark.sql.types.StructType;
StructType schema = new StructType()
.add("department", "string")
.add("designation", "string")
.add("ctc", "long")
.add("state", "string");
Create a DataFrame from the CSV file and apply the schema to it
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
Dataset<Row> df = spark.read()
.option("mode", "DROPMALFORMED")
.schema(schema)
.csv("hdfs://path/input.csv");
For more options when reading data from CSV files, see the Spark DataFrameReader documentation.
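A few commonly used options, shown as a sketch (the option values here are illustrative assumptions, not taken from the original):
Dataset<Row> dfWithOptions = spark.read()
    .option("header", "true")        // treat the first line as column names
    .option("inferSchema", "true")   // let Spark guess column types (costs an extra pass over the data)
    .option("sep", ",")              // field delimiter; "," is the default
    .option("mode", "DROPMALFORMED") // drop rows that do not match the schema
    .csv("hdfs://path/input.csv");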
Now we can aggregate the data in two ways:
1. SQL way
To perform SQL operations, first register the DataFrame as a temporary view in Spark SQL:
df.createOrReplaceTempView("employee");
Run a SQL query against the registered view
Dataset<Row> sqlResult = spark.sql(
"SELECT department, designation, state, SUM(ctc), COUNT(department)"
+ " FROM employee GROUP BY department, designation, state");
sqlResult.show(); // for testing
We can even execute SQL directly on the CSV file, without registering any table or view, using Spark SQL.
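A minimal sketch using the same placeholder path as above (note the backticks around the file path inside the query):
Dataset<Row> direct = spark.sql(
    "SELECT * FROM csv.`hdfs://path/input.csv`");
direct.show();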
2. Object chaining (the programmatic, Java-like way)
Do the necessary static imports for the SQL functions
import static org.apache.spark.sql.functions.count;
import static org.apache.spark.sql.functions.sum;
Use groupBy and agg on the DataFrame/Dataset to compute the count and the sum:
Dataset<Row> dfResult = df.groupBy("department", "designation", "state")
.agg(sum("ctc"), count("department"));
// In Spark 1.6+, the grouping columns are included in the result by default
dfResult.show(); // for testing
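If you want friendlier column names in the result, the aggregate columns can be aliased; a small sketch (total_ctc and emp_count are made-up names):
Dataset<Row> dfNamed = df.groupBy("department", "designation", "state")
    .agg(sum("ctc").as("total_ctc"), count("department").as("emp_count"));
dfNamed.show();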
Dependent libraries (sbt coordinates)
"org.apache.spark" % "spark-core_2.11" % "2.0.0"
"org.apache.spark" % "spark-sql_2.11" % "2.0.0"