How to load local file in sc.textFile, instead of HDFS

Question

asked Jul 5, 2019 in Big Data Hadoop & Spark by Aarav (11.4k points)

I'm following the great Spark tutorial, and at 46m:00s I try to load README.md, but it fails. This is what I'm doing:

$ sudo docker run -i -t -h sandbox sequenceiq/spark:1.1.0 /etc/bootstrap.sh -bash
bash-4.1# cd /usr/local/spark-1.1.0-bin-hadoop2.4
bash-4.1# ls README.md
README.md
bash-4.1# ./bin/spark-shell
scala> val f = sc.textFile("README.md")
14/12/04 12:11:14 INFO storage.MemoryStore: ensureFreeSpace(164073) called with curMem=0, maxMem=278302556
14/12/04 12:11:14 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 160.2 KB, free 265.3 MB)
f: org.apache.spark.rdd.RDD[String] = README.md MappedRDD[1] at textFile at <console>:12
scala> val wc = f.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://sandbox:9000/user/root/README.md
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)

How can I load that README.md?

1 Answer

Amit Rawat · Answer 1 · 2019-07-05T14:45:59+0000

While Spark supports loading files from the local filesystem, it requires that the files are available on the same path on all nodes in your cluster.

Some network filesystems, like NFS, AFS, and MapR’s NFS layer, are exposed to the user as a regular filesystem.

If your data is already in one of these systems, you can use it as input by specifying a file:// path; Spark will handle it as long as the filesystem is mounted at the same path on each node:

rdd = sc.textFile("file:///path/to/file")
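
In the question, the bare path README.md carries no scheme, so it was resolved against the default filesystem, which is HDFS in that container (hence hdfs://sandbox:9000/user/root/README.md in the error). A minimal sketch of building an explicit file:// URI from a local path follows; the helper name `local_uri` is hypothetical, and `sc` is assumed to be an existing SparkContext.

```python
from pathlib import Path

def local_uri(path):
    """Return an absolute file:// URI for a path on the local filesystem."""
    # resolve() makes the path absolute relative to the current working directory;
    # as_uri() then prepends the file:// scheme.
    return Path(path).resolve().as_uri()

# Usage, assuming `sc` is a live SparkContext (for example in the pyspark shell):
# rdd = sc.textFile(local_uri("README.md"))
```

Run from the directory shown in the question, this would be expected to produce something like file:///usr/local/spark-1.1.0-bin-hadoop2.4/README.md, which can also be passed to sc.textFile directly.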

If your file isn’t already on all nodes in the cluster, you can load it locally on the driver without going through Spark and then call parallelize to distribute the contents to workers.
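
The driver-side approach could look roughly like the sketch below. The function names are hypothetical, `sc` is again assumed to be an existing SparkContext, and the whole file is read into driver memory, so this only suits files that fit there.

```python
def read_local_lines(path):
    """Read a text file on the driver and return its lines without newlines."""
    with open(path, encoding="utf-8") as fh:
        return fh.read().splitlines()

def parallelize_local_file(sc, path):
    """Distribute the lines of a driver-local file as an RDD."""
    # sc.parallelize ships the in-memory list to the workers,
    # so the file only has to exist on the driver.
    return sc.parallelize(read_local_lines(path))

# Usage, assuming `sc` is a live SparkContext:
# rdd = parallelize_local_file(sc, "README.md")
```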
