
I want to list all folders within an HDFS directory using Scala/Spark. In Hadoop I can do this with the command:

hadoop fs -ls hdfs://sandbox.hortonworks.com/demo/

I tried it with:

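import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()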
val fs = FileSystem.get(new URI("hdfs://sandbox.hortonworks.com/"), conf)

val path = new Path("hdfs://sandbox.hortonworks.com/demo/")

val files = fs.listFiles(path, false)

But it does not seem to look in the Hadoop directory, as I cannot find my folders/files.

I also tried with:

FileSystem.get(sc.hadoopConfiguration).listFiles(new Path("hdfs://sandbox.hortonworks.com/demo/"), true)

But this also does not help.

Do you have any other ideas?

1 Answer


Hadoop 1.4 does not provide a listFiles method, so we use listStatus to get the directories. listStatus has no recursive option, but a recursive lookup is easy to implement on top of it.

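import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())

// listStatus returns both files and directories directly under the path
val status = fs.listStatus(new Path(YOUR_HDFS_PATH))

status.foreach(x => println(x.getPath))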
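
The recursive lookup mentioned above could be sketched roughly as follows. This is only an illustration, not part of the original answer: the helper name `listDirsRecursively` is hypothetical, and it assumes a Hadoop 2.x or later client where `FileStatus.isDirectory` is available (on older clients `isDir` plays the same role).

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: collects every directory below `root`, depth-first,
// using nothing but listStatus.
def listDirsRecursively(fs: FileSystem, root: Path): Seq[Path] = {
  // Immediate subdirectories of `root`
  val dirs = fs.listStatus(root).filter(_.isDirectory).map(_.getPath).toSeq
  // Each subdirectory, followed by everything found beneath it
  dirs.flatMap(dir => dir +: listDirsRecursively(fs, dir))
}
```

With the `fs` value from the snippet above, it could be called as `listDirsRecursively(fs, new Path(YOUR_HDFS_PATH)).foreach(println)`.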

For Spark 2.0+, reuse the Hadoop configuration that the SparkSession already carries:

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// YOUR_HDFS_PATH is a placeholder; isDirectory replaces the deprecated isDir
fs.listStatus(new Path(YOUR_HDFS_PATH)).filter(_.isDirectory).map(_.getPath).foreach(println)
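
If the path is a fully qualified hdfs:// URI, as in the question, one option is to let the path resolve its own file system instead of relying on the default one in the configuration. The following is a minimal sketch under that assumption, again for a Hadoop 2.x or later client; the helper name `listSubdirectories` is hypothetical.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical helper: lists the immediate subdirectories of `dir`.
// Path.getFileSystem picks the file system from the path's own scheme and
// authority (e.g. hdfs://host/) and falls back to fs.defaultFS when the
// path has no scheme.
def listSubdirectories(dir: String, conf: Configuration): Seq[Path] = {
  val path = new Path(dir)
  val fs = path.getFileSystem(conf)
  fs.listStatus(path).filter(_.isDirectory).map(_.getPath).toSeq
}
```

In a Spark shell this could be called as `listSubdirectories("hdfs://sandbox.hortonworks.com/demo/", spark.sparkContext.hadoopConfiguration).foreach(println)`. Note that listFiles returns an iterator over files only, never directories, which would explain why the folders did not show up in the attempts from the question.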
