
I have a huge file in HDFS containing time-series data points (Yahoo stock prices).

I want to compute the moving average of the time series. How do I go about writing an Apache Spark job to do that?

1 Answer


Computing a moving average is a tricky task in Spark, or in any distributed system, because each window can span rows that sit in different partitions.

In our approach, we duplicate the first few rows of each partition onto the previous partition, so that calculating the moving average independently per partition still gives complete coverage.

Here is a way to do this in Spark. The example data:

[code screenshot not available]
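Since the original screenshot is unavailable, here is a minimal stand-in in plain Python. The prices and partition count are hypothetical; in Spark this step would be an `sc.parallelize` over (row index, price) pairs:

```python
# Hypothetical tick data: (row index, price) pairs, arranged in the
# contiguous chunks that Spark would spread across partitions.
prices = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 9.0, 5.0]
num_partitions = 2
rows = list(enumerate(prices))
chunk = len(rows) // num_partitions
partitions = [rows[i * chunk:(i + 1) * chunk] for i in range(num_partitions)]
```

Keying each row by its index is what later lets us route rows to specific partitions.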

A simple partitioner that puts each row in the partition we specify by the key:

[code screenshot not available]
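As a sketch of the missing code: in Spark this would subclass `Partitioner` and return the key from `getPartition`; the class below mimics that idea in plain Python (the name `StraightPartitioner` is illustrative):

```python
class StraightPartitioner:
    """Mimics a Spark Partitioner whose getPartition simply returns
    the key: each row lands in the partition named by its key."""
    def __init__(self, num_partitions):
        self.num_partitions = num_partitions

    def get_partition(self, key):
        # Keys are assumed to already be valid partition ids,
        # so no hashing is involved.
        return key
```

Because no hashing happens, a row keyed with partition id 1 is guaranteed to land in partition 1, which is exactly what the duplication step below relies on.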

Create the data where the first window - 1 rows of each partition are copied to the previous partition:

[code screenshot not available]
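A plain-Python sketch of this duplication step (in Spark it would be done with `mapPartitionsWithIndex` followed by `partitionBy`; the helper name and sample rows here are hypothetical):

```python
def add_overlap(partitions, window):
    """Copy the first window - 1 rows of each partition onto the end of
    the previous partition, so every window fits in a single partition."""
    out = [list(p) for p in partitions]
    for i in range(1, len(partitions)):
        out[i - 1].extend(partitions[i][:window - 1])
    return out

# With window=2, one row spills from partition 1 back into partition 0.
overlapped = add_overlap([[(0, 3.0), (1, 5.0)], [(2, 4.0), (3, 6.0)]], window=2)
```

The duplicated rows are cheap: each partition only grows by window - 1 rows, regardless of how large the partition is.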

Just calculate the moving average on each partition:

[code screenshot not available]
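A sketch of the per-partition computation (in Spark this would run inside `mapPartitions`; the function name is illustrative). Each partition sorts its rows back into time order and emits one average per full window:

```python
def moving_average(partition, window):
    """Emit (index, average) for every full window inside one partition."""
    rows = sorted(partition)  # restore time order within the partition
    return [
        (rows[i + window - 1][0],  # index of the last row in the window
         sum(price for _, price in rows[i:i + window]) / window)
        for i in range(len(rows) - window + 1)
    ]

# A partition holding its own rows plus one overlap row, window of 2.
avgs = moving_average([(0, 3.0), (1, 5.0), (2, 4.0)], window=2)
```

Only full windows are emitted, so a window that would straddle a partition boundary is produced by the previous partition, which received the overlap rows.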

Because of the duplicated segments, the combined output has no gaps in coverage:

[code screenshot not available]
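The steps above can be sketched end-to-end in plain Python (hypothetical prices, two partitions, window of 3) to show that every index from window - 1 onward gets exactly one average:

```python
window = 3
prices = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 9.0, 5.0]
rows = list(enumerate(prices))
partitions = [rows[:4], rows[4:]]

# Spill the first window - 1 rows of each partition into the previous one.
for i in range(1, len(partitions)):
    partitions[i - 1] = partitions[i - 1] + partitions[i][:window - 1]

# Per-partition moving average over full windows only.
result = []
for part in partitions:
    part = sorted(part)
    for i in range(len(part) - window + 1):
        idx = part[i + window - 1][0]
        avg = sum(v for _, v in part[i:i + window]) / window
        result.append((idx, avg))

result.sort()
# result covers indices 2 through 7 with no gaps and no duplicates.
```

Each index appears exactly once because only the partition that holds the full window for that index emits it.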


