
I am new to Spark and I have a question. I have a two-step process in which the first step writes a SUCCESS.txt file to a location on HDFS. The second step, a Spark job, has to verify that SUCCESS.txt exists before it starts processing the data.

I checked the Spark API and didn't find any method that checks whether a file exists. Any ideas on how to handle this?

1 Answer


For a file in HDFS, I would suggest doing this the Hadoop way, through the Hadoop FileSystem API:

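import org.apache.hadoop.fs.{FileSystem, Path}

// Reuse the Hadoop configuration that the SparkContext already holds
val fs = FileSystem.get(sc.hadoopConfiguration)

// true if the marker file written by the first step is present
val exists = fs.exists(new Path("/path/on/hdfs/to/SUCCESS.txt"))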

For PySpark:

You can run the same check without invoking a subprocess by calling the Hadoop API through Spark's JVM gateway:

fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())

fs.exists(sc._jvm.org.apache.hadoop.fs.Path("/path/on/hdfs/to/SUCCESS.txt"))
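
To make the second step actually wait on the marker, the check can be wrapped in a small guard that runs before any data is read. The following is only a sketch: the function names hdfs_path_exists and require_marker are made up for illustration, raising FileNotFoundError is one possible way to stop the job, and sc._jvm and sc._jsc are private PySpark attributes that may change between versions.

```python
def hdfs_path_exists(sc, path):
    """Return True if `path` exists on the cluster's default file system.

    `sc` is assumed to be an active SparkContext. `_jvm` and `_jsc` are
    private PySpark attributes, used here as in the answer above.
    """
    hadoop_fs = sc._jvm.org.apache.hadoop.fs
    fs = hadoop_fs.FileSystem.get(sc._jsc.hadoopConfiguration())
    return bool(fs.exists(hadoop_fs.Path(path)))


def require_marker(sc, marker_path):
    """Raise if the marker file is missing, so the job stops before reading data.

    Hypothetical helper: the exception type is an illustrative choice.
    """
    if not hdfs_path_exists(sc, marker_path):
        raise FileNotFoundError(f"marker file not found: {marker_path}")
```

In the Spark job you would call require_marker(sc, "/path/on/hdfs/to/SUCCESS.txt") first and start processing only if it returns without raising.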
