You can do this, at least for filtering, if you're willing to use a language-integrated query.
For a data file dates.txt containing:
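(illustrative contents, since the original file isn't shown; one name,date pair per line)

Hannah,1972-09-06
Barry,1976-08-25
Ahmed,1980-08-01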
You can pack as much Scala date magic into your UDF as you want, but I'll keep it simple:
def myDateFilter(date: String) = date contains "-08-"
Set it all up as follows -- a lot of this is from the Spark SQL Programming Guide.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// implicitly convert RDDs of case classes to SchemaRDDs (needed for the query DSL below)
import sqlContext.createSchemaRDD
// case class for your records
case class Entry(name: String, when: String)
// read and parse the data
val entries = sc.textFile("dates.txt").map(_.split(",")).map(e => Entry(e(0), e(1)))
You can use the UDF as part of your WHERE clause:
val augustEntries = entries.where('when)(myDateFilter).select('name, 'when)
and see the results:
augustEntries.map(r => r(0)).collect().foreach(println)
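Given the illustrative rows above, this would print the names from the two August entries:

Barry
Ahmed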
Notice the where method I've used here, declared as follows in the SchemaRDD docs:
def where[T1](arg1: Symbol)(udf: (T1) ⇒ Boolean): SchemaRDD
So the UDF can take only one argument, but you can compose several .where() calls to filter on multiple columns.
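For example, a second single-column filter can be chained onto the first (the predicate on 'name here is just an illustration):

// hypothetical extra predicate on 'name, chained with the date filter
val filtered = entries.where('name)((n: String) => n.nonEmpty)
                      .where('when)(myDateFilter)
                      .select('name, 'when)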
For Spark 1.2.0 (and really 1.1.0 too): I don't know whether it's officially documented, but Spark now supports registering a UDF so that it can be called from SQL queries.
The above UDF could be registered using:
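// registerFunction is the UDF registration API on SQLContext in Spark 1.1/1.2
sqlContext.registerFunction("myDateFilter", myDateFilter _)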
and, if the table has been registered:
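entries.registerTempTable("entries") // registerTempTable replaced registerAsTable in 1.1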
it could be queried using:
sqlContext.sql("SELECT * FROM entries WHERE myDateFilter(when)")
For more details, check this.