Back

Explore Courses Blog Tutorials Interview Questions
0 votes
2 views
in Big Data Hadoop & Spark by (11.4k points)

I've been searching for a while if there is any way to use a Scala class in Pyspark, and I haven't found any documentation nor guide about this subject.

Let's say I create a simple class in Scala that uses some libraries of apache-spark, something like:

class SimpleClass(sqlContext: SQLContext, df: DataFrame, column: String) {
  def exe(): DataFrame = {
    import sqlContext.implicits._

    df.select(col(column))
  }
}

 

  • Is there any possible way to use this class in Pyspark?
  • Is it too tough?
  • Do I have to create a .py file?
  • Is there any guide that shows how to do that?

By the way, I also looked at the spark code and I felt a bit lost, and I was incapable of replicating their functionality for my own purpose.

1 Answer

0 votes
by (32.3k points)

What you are asking is possible to do but it can be far away from a trivial approach. What you need here is a Java (friendly) wrapper so you don't have to deal with Scala features which cannot be easily expressed using plain Java and as a result don't play well with Py4J gateway.

Assuming your class is int the package com.example and have Python DataFrame called df

df = ... # Python DataFrame

you'll have to:

  • Include it in the driver classpath, for example, using --driver-class-path argument for PySpark shell / spark-submit. Depending on the exact code you may have to pass it using --jars as well

  • Extract JVM instance from a Python SparkContext instance:

            jvm = sc._jvm

  • Extract Scala SQLContext from a SQLContext instance:

            ssqlContext = sqlContext._ssql_ctx

  • Extract Java DataFrame from the df:

            jdf = df._jdf

  • Create new instance of SimpleClass:

simpleObject = jvm.com.example.SimpleClass(ssqlContext, jdf, "v")

  • Call exe method and wrap the result using Python DataFrame:

            from pyspark.sql import DataFrame

            DataFrame(simpleObject.exe(), ssqlContext)

The result should be a valid PySpark DataFrame. Also, you can combine all the steps into a single call.

Important Note: Just keep in mind that this approach is possible only if Python code is executed solely on the driver. It won’t work if the code is  used inside Python action or transformation

Browse Categories

...