in Big Data Hadoop & Spark by (11.5k points)

Is there an equivalent of the pandas melt function in Apache Spark, either in PySpark or at least in Scala?

Until now I have been working with a sample dataset in Python, and I now want to use Spark to process the entire dataset.

1 Answer

by (32.2k points)

Older Spark releases have no built-in function for this (Spark 3.4 and later provide DataFrame.unpivot, with melt as an alias). However, I researched and came across this solution:

from typing import Iterable

from pyspark.sql import DataFrame
from pyspark.sql.functions import array, col, explode, lit, struct


def melt(
        df: DataFrame,
        id_vars: Iterable[str], value_vars: Iterable[str],
        var_name: str = "variable", value_name: str = "value") -> DataFrame:
    """Convert :class:`DataFrame` from wide to long format."""

    # Create array<struct<variable: str, value: ...>>.
    # The value_vars columns should share a compatible type, since they
    # all end up in a single value column.
    _vars_and_vals = array(*(
        struct(lit(c).alias(var_name), col(c).alias(value_name))
        for c in value_vars))

    # Add to the DataFrame and explode: one row per (input row, value column).
    _tmp = df.withColumn("_vars_and_vals", explode(_vars_and_vals))

    # list() lets id_vars be any iterable, not only a list.
    cols = list(id_vars) + [
        col("_vars_and_vals")[x].alias(x) for x in [var_name, value_name]]
    return _tmp.select(*cols)


# Wide to long in PySpark
melt(df_web_browsing_full_test,
     id_vars=['ID_variable'],
     value_vars=['VALUE_variable_1', 'VALUE_variable_2']).show()
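
To make the reshaping concrete, here is a minimal plain-Python sketch of the same wide-to-long logic applied to a list of dicts, so it runs without a Spark session. The helper name melt_rows and the sample values are hypothetical and only for illustration; it models what the array/struct/explode steps do and is not a replacement for the Spark function.

```python
from typing import Any, Dict, Iterable, List


def melt_rows(
        rows: Iterable[Dict[str, Any]],
        id_vars: Iterable[str], value_vars: Iterable[str],
        var_name: str = "variable",
        value_name: str = "value") -> List[Dict[str, Any]]:
    """Plain-Python model of melt: one output row per (input row, value column)."""
    id_vars, value_vars = list(id_vars), list(value_vars)
    out = []
    for row in rows:
        # Mirrors array(struct(...)): one (name, value) pair per value column.
        pairs = [(c, row[c]) for c in value_vars]
        # Mirrors explode(): each pair becomes its own row, with ids repeated.
        for name, value in pairs:
            melted = {k: row[k] for k in id_vars}
            melted[var_name] = name
            melted[value_name] = value
            out.append(melted)
    return out


# Hypothetical sample data, reusing the column names from the call above.
wide = [
    {"ID_variable": 1, "VALUE_variable_1": 10, "VALUE_variable_2": 20},
    {"ID_variable": 2, "VALUE_variable_1": 30, "VALUE_variable_2": 40},
]
long_rows = melt_rows(
    wide, ["ID_variable"], ["VALUE_variable_1", "VALUE_variable_2"])
for r in long_rows:
    print(r)
# {'ID_variable': 1, 'variable': 'VALUE_variable_1', 'value': 10}
# {'ID_variable': 1, 'variable': 'VALUE_variable_2', 'value': 20}
# {'ID_variable': 2, 'variable': 'VALUE_variable_1', 'value': 30}
# {'ID_variable': 2, 'variable': 'VALUE_variable_2', 'value': 40}
```

Two wide rows with two value columns become four long rows. The Spark version should yield the same shape: the number of input rows multiplied by the number of value_vars columns.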
