
I am new to Spark and I have a question. I have a two-step process in which the first step writes a SUCCESS.txt file to a location on HDFS. My second step, a Spark job, has to verify that the SUCCESS.txt file exists before it starts processing the data.

I checked the Spark API and didn't find any method that checks whether a file exists. Any ideas on how to handle this?

1 Answer


For a file in HDFS, I would suggest going with the Hadoop way of doing this:

// Get the Hadoop FileSystem from the SparkContext's configuration
val conf = sc.hadoopConfiguration
val fs = org.apache.hadoop.fs.FileSystem.get(conf)
val exists = fs.exists(new org.apache.hadoop.fs.Path("/path/on/hdfs/to/SUCCESS.txt"))

For Pyspark:

You can also achieve this without invoking a subprocess. Try something like:

# Reach the same Hadoop FileSystem API through the Py4J gateway
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
fs.exists(sc._jvm.org.apache.hadoop.fs.Path("path/to/SUCCESS.txt"))
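To show how this existence check gates the second step, here is a minimal sketch. The helper name `ready_to_process` is my own, and `os.path.exists` is a local stand-in for `fs.exists(Path(...))` so the sketch runs without an HDFS cluster; in your job you would pass the HDFS check in instead:

```python
import os
import tempfile

def ready_to_process(marker_path, exists_fn=os.path.exists):
    """Return True when the upstream step's SUCCESS marker is present.

    exists_fn stands in for fs.exists(Path(...)) from the answer above;
    os.path.exists is used here only so the sketch runs without HDFS.
    """
    return exists_fn(marker_path)

# Demonstrate the gate with a local temp directory standing in for HDFS.
with tempfile.TemporaryDirectory() as d:
    marker = os.path.join(d, "SUCCESS.txt")
    assert not ready_to_process(marker)   # step 1 has not run yet
    open(marker, "w").close()             # step 1 writes the marker file
    assert ready_to_process(marker)       # step 2 may start processing
```

In the real PySpark job you would call `ready_to_process("path/to/SUCCESS.txt", exists_fn=lambda p: fs.exists(sc._jvm.org.apache.hadoop.fs.Path(p)))` and exit (or wait and retry) when it returns False.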
