+1 vote
1 view
in Big Data Hadoop & Spark by (900 points)

We know that in Spark there is a method, rdd.collect(), which converts an RDD to a list.

List<String> f= rdd.collect();
String[] array = f.toArray(new String[f.size()]);

I am trying to do exactly the opposite in my project. I have an ArrayList of String that I want to convert to a JavaRDD. I have been looking for a solution for quite some time but have not found one. Can anybody please help me out here?

2 Answers

0 votes
by (13.2k points)

To convert a list to a JavaRDD, you can use

JavaSparkContext.parallelize(List)

Since ArrayList implements List, you can pass your ArrayList<String> to it directly. Below is an example:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RDDfromList {

    public static void main(String[] args) {
        // configure Spark to run locally with 2 threads
        SparkConf sparkConfiguration = new SparkConf()
                .setAppName("Spark RDD")
                .setMaster("local[2]")
                .set("spark.executor.memory", "2g");

        // start a Spark context
        JavaSparkContext sparkContext = new JavaSparkContext(sparkConfiguration);

        // distribute the list as an RDD with 1 partition
        List<String> data = Arrays.asList("A", "B", "C", "D", "E");
        JavaRDD<String> items = sparkContext.parallelize(data, 1);

        // apply a function to each element of the RDD
        // (with a local master, the output appears on this console)
        items.foreach(item -> {
            System.out.println("* " + item);
        });

        // stop the context and release its resources
        sparkContext.close();
    }
}
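
The second argument to parallelize in the example above is the number of partitions (slices) the list is divided into. To give a feel for what that means, here is a plain-Java sketch that needs no Spark at all. The slice helper and the SliceSketch class are hypothetical names made up for this illustration, and the even, contiguous split is an assumption about the general idea, not Spark's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SliceSketch {

    // Hypothetical helper (not a Spark API): split a list into numSlices
    // contiguous chunks whose sizes differ by at most one element.
    static <T> List<List<T>> slice(List<T> data, int numSlices) {
        if (numSlices < 1) {
            throw new IllegalArgumentException("numSlices must be at least 1");
        }
        List<List<T>> slices = new ArrayList<>();
        int n = data.size();
        for (int i = 0; i < numSlices; i++) {
            // chunk i covers the index range [start, end)
            int start = (int) ((long) i * n / numSlices);
            int end = (int) ((long) (i + 1) * n / numSlices);
            slices.add(new ArrayList<>(data.subList(start, end)));
        }
        return slices;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("A", "B", "C", "D", "E");
        System.out.println(slice(data, 1)); // [[A, B, C, D, E]]
        System.out.println(slice(data, 2)); // [[A, B], [C, D, E]]
    }
}
```

With one slice, as in the answer, all five elements stay in a single partition, so a foreach over it visits them in list order. With more slices, each chunk could be processed by a different task.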

0 votes
by (31.4k points)

If you have a List of tuples instead, you can use JavaSparkContext#parallelizePairs, which returns a JavaPairRDD. Refer to the following code:

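import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

List<Tuple2<Integer, Integer>> pairs = new ArrayList<>();
pairs.add(new Tuple2<>(0, 5));
pairs.add(new Tuple2<>(1, 3));
JavaSparkContext sc = new JavaSparkContext(); // loads settings from system properties, e.g. under spark-submit
JavaPairRDD<Integer, Integer> rdd = sc.parallelizePairs(pairs);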
