
I am new to Spark, and I want to use group-by & reduce to find the following from a CSV (one line per employee):

  Department, Designation, costToCompany, State
  Sales, Trainee, 12000, UP
  Sales, Lead, 32000, AP
  Sales, Lead, 32000, LA
  Sales, Lead, 32000, TN
  Sales, Lead, 32000, AP
  Sales, Lead, 32000, TN
  Sales, Lead, 32000, LA
  Sales, Lead, 32000, LA
  Marketing, Associate, 18000, TN
  Marketing, Associate, 18000, TN
  HR, Manager, 58000, TN

I would like to aggregate the above CSV by grouping on Department, Designation, and State, with additional columns for sum(costToCompany) and TotalEmployeeCount.

I should get a result like:

  Dept, Desg, state, empCount, totalCost
  Sales,Lead,AP,2,64000
  Sales,Lead,LA,3,96000 
  Sales,Lead,TN,2,64000


Is there any way to achieve this using transformations and actions?

1 Answer


You can parse a CSV file with Spark's built-in CSV reader, which returns a DataFrame/Dataset on a successful read. You can then easily apply SQL-like operations on top of that DataFrame/Dataset.

Using Spark 2.x (and above) with Java

Create a SparkSession object, aka spark:

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession
    .builder()
    .appName("Java Spark SQL Example")
    .getOrCreate();

Create a schema for the rows with StructType:

import org.apache.spark.sql.types.StructType;

StructType schema = new StructType()
    .add("department", "string")
    .add("designation", "string")
    .add("ctc", "long")
    .add("state", "string");

Create a DataFrame from the CSV file and apply the schema to it:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> df = spark.read()
    .option("mode", "DROPMALFORMED") // drop rows that do not fit the schema (this also drops the header line, since "costToCompany" is not a long)
    .schema(schema)
    .csv("hdfs://path/input.csv");

For more options on reading data from a CSV file, see the DataFrameReader documentation; a few commonly used ones are shown below.
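As an illustrative sketch (dfInferred is a hypothetical name; the options themselves are standard Spark CSV data source options), you could instead let Spark use the header line and infer the column types:

// Alternative read: take column names from the header, infer types,
// and trim the spaces that follow each comma in the sample data
Dataset<Row> dfInferred = spark.read()
    .option("header", "true")                  // first line holds column names
    .option("inferSchema", "true")             // sample the data to guess column types
    .option("ignoreLeadingWhiteSpace", "true") // " Lead" -> "Lead"
    .csv("hdfs://path/input.csv");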

Now we can perform aggregation on the data in two ways:

1. SQL way

To perform SQL operations, first register the DataFrame as a temporary view:

df.createOrReplaceTempView("employee");

Run a SQL query against the registered view:

Dataset<Row> sqlResult = spark.sql(
    "SELECT department, designation, state, SUM(ctc) AS totalCost, COUNT(department) AS empCount"
        + " FROM employee GROUP BY department, designation, state");
sqlResult.show(); // for testing

With Spark SQL we can even execute SQL directly against the CSV file, without creating a table or view, as sketched below.
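A minimal sketch (fromFile is a hypothetical name): the csv. prefix tells Spark SQL to treat the backquoted path as a CSV data source. Note that without a schema the columns come back named _c0.._c3 and typed as strings, so the schema-based read above is usually preferable for aggregation:

// Query the CSV file in place; no temp view or table needed
Dataset<Row> fromFile = spark.sql(
    "SELECT * FROM csv.`hdfs://path/input.csv`");
fromFile.show();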

2. Object chaining, or the programmatic (Java-like) way

Do the necessary imports for the SQL functions:

import static org.apache.spark.sql.functions.count;
import static org.apache.spark.sql.functions.sum;

Use groupBy and agg on the DataFrame/Dataset to perform the count and sum on the data:

Dataset<Row> dfResult = df.groupBy("department", "designation", "state")
    .agg(sum("ctc").as("totalCost"), count("department").as("empCount"));
// In Spark 1.6+, columns mentioned in groupBy are added to the result by default

dfResult.show(); // for testing
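With the sample data from the question (the header line having been dropped by DROPMALFORMED, and assuming the whitespace after each comma is trimmed, e.g. via the ignoreLeadingWhiteSpace option above), both the SQL and the programmatic version should print something along these lines; row order in show() is not guaranteed:

+----------+-----------+-----+---------+--------+
|department|designation|state|totalCost|empCount|
+----------+-----------+-----+---------+--------+
|     Sales|       Lead|   AP|    64000|       2|
|     Sales|       Lead|   LA|    96000|       3|
|     Sales|       Lead|   TN|    64000|       2|
|     Sales|    Trainee|   UP|    12000|       1|
| Marketing|  Associate|   TN|    36000|       2|
|        HR|    Manager|   TN|    58000|       1|
+----------+-----------+-----+---------+--------+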

Dependent libraries (sbt):

"org.apache.spark" % "spark-core_2.11" % "2.0.0" 

"org.apache.spark" % "spark-sql_2.11" % "2.0.0"
