
in Big Data Hadoop & Spark by (11.4k points)

I have a huge file in HDFS containing time series data points (Yahoo stock prices).

I want to find the moving average of the time series. How do I go about writing an Apache Spark job to do that?

1 Answer

by (32.3k points)

A moving average is a tricky task for Spark, or for any distributed system, because each window can straddle a partition boundary.

The approach here is to duplicate the first window - 1 rows of each partition into the preceding partition, so that computing the moving average independently within each partition gives complete coverage.

Here is one way to do this in Spark, sketched below in Scala. First, the example data:

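A minimal sketch, assuming the spark-shell (where sc is the predefined SparkContext) and a toy series of the numbers 0 to 100 spread over 10 partitions, with a window of 3. The names ts and window are illustrative:

// Doubles avoid integer division when averaging later.
val ts = sc.parallelize((0 to 100).map(_.toDouble), 10)
val window = 3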

A simple partitioner that puts each row in the partition given by its key:

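A sketch of such a partitioner; the keys are assumed to be the target partition indices:

import org.apache.spark.Partitioner

// Routes each (key, value) pair to the partition named by its integer key.
class StraightPartitioner(p: Int) extends Partitioner {
  override def numPartitions: Int = p
  override def getPartition(key: Any): Int = key.asInstanceOf[Int]
}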

Next, create the data set in which the first window - 1 rows of each partition are copied to the previous partition:

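One way to build the overlap: key every row by its destination partition, spill a copy of each partition's first window - 1 rows to the partition before it, then repartition with the partitioner above:

// "keep": every row stays in its own partition i.
// "spill": a copy of the first window - 1 rows also goes to partition i - 1.
// Partition 0 has no predecessor, so it spills nothing.
val partitioned = ts.mapPartitionsWithIndex((i, p) => {
  val overlap = p.take(window - 1).toArray
  val spill = overlap.iterator.map(x => (i - 1, x))
  val keep = (overlap.iterator ++ p).map(x => (i, x))
  if (i == 0) keep else keep ++ spill
}).partitionBy(new StraightPartitioner(ts.partitions.length)).values

Note this assumes window - 1 never exceeds a partition's size; a much larger window would need the overlap to span several partitions.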

Just calculate the moving average on each partition:

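Each partition now holds a contiguous run of the series, so a plain windowed pass over its sorted rows is enough. A minimal version using sliding (real data would carry timestamps and sort by those rather than by value):

val movingAverage = partitioned.mapPartitions(p => {
  // Sorting by value works here only because the toy series is increasing.
  val sorted = p.toSeq.sorted
  // Average every length-`window` slice; each slice yields one output row.
  sorted.sliding(window).map(_.sum / window)
})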

Because of the duplicated segments, there are no gaps in coverage.

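On the toy data, collecting the result gives one average per window, with no seam at the partition boundaries:

// 101 points and a window of 3 give 99 averages: 1.0, 2.0, ..., 99.0.
movingAverage.collect().foreach(println)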

