
I am new to Spark and I have a question. I have a two-step process in which the first step writes a SUCCESS.txt file to a location on HDFS. My second step, a Spark job, has to verify that the SUCCESS.txt file exists before it starts processing the data.

I checked the Spark API and didn't find any method that checks whether a file exists. Any ideas on how to handle this?

1 Answer


For a file in HDFS, I would suggest going with the Hadoop way of doing this:

// Get the Hadoop FileSystem from the SparkContext's configuration
val conf = sc.hadoopConfiguration
val fs = org.apache.hadoop.fs.FileSystem.get(conf)
val exists = fs.exists(new org.apache.hadoop.fs.Path("/path/on/hdfs/to/SUCCESS.txt"))

For Pyspark:

You can also achieve this without invoking a subprocess. Try something like:

# Reach the same Hadoop FileSystem API through the Py4J gateway
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
fs.exists(sc._jvm.org.apache.hadoop.fs.Path("path/to/SUCCESS.txt"))
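To show how this existence check gates the second step, here is a minimal sketch. The helper name `ready_to_process` is my own, and `os.path.exists` is a local stand-in for `fs.exists(Path(...))` so the sketch runs without an HDFS cluster; in your job you would pass the HDFS check in instead:

```python
import os
import tempfile

def ready_to_process(marker_path, exists_fn=os.path.exists):
    """Return True when the upstream step's SUCCESS marker is present.

    exists_fn stands in for fs.exists(Path(...)) from the answer above;
    os.path.exists is used here only so the sketch runs without HDFS.
    """
    return exists_fn(marker_path)

# Demonstrate the gate with a local temp directory standing in for HDFS.
with tempfile.TemporaryDirectory() as d:
    marker = os.path.join(d, "SUCCESS.txt")
    assert not ready_to_process(marker)   # step 1 has not run yet
    open(marker, "w").close()             # step 1 writes the marker file
    assert ready_to_process(marker)       # step 2 may start processing
```

In the real PySpark job you would call `ready_to_process("path/to/SUCCESS.txt", exists_fn=lambda p: fs.exists(sc._jvm.org.apache.hadoop.fs.Path(p)))` and exit (or wait and retry) when it returns False.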
