To solve your problem, you have two options:
1. Try to create a singleton object with a lazy val representing the data, as given below:
object WekaModel {
lazy val data = {
// initialize data here. This will only happen once per JVM process
}
}
Then try to use the lazy val in your map function. The lazy val makes sure that each worker JVM initializes their own instance of the data. No serialization or broadcasts will be performed for data.
elementsRDD.map { element =>
// use WekaModel.data here
}
Advantages
This is more efficient, as it allows you to initialize your data once per JVM instance. This approach is actually a very good choice when needing to initialize a database connection pool for example.
Disadvantages
Less control over initialization, e.g. it gets trickier to initialize your object if you require runtime parameters.
You can not free up or release the object if you need to.Also, that is acceptable as the OS will free up the resources when the process exits.
2. Use mapPartition (or foreachPartition) method on the RDD instead of using just map.
This allows you to initialize whatever you need for the entire partition.
elementsRDD.mapPartition { elements =>
val model = new WekaModel()
elements.map { element =>
// use model and element. There is a single instance of model per partition.
}
}
Advantage
Provides more flexibility in the initialization and deinitialization of objects.
Disadvantage
Each partition creates and initializes a new instance of your object. Depending on how many partitions you have per JVM instance, it may or may not be an issue.