in Big Data Hadoop & Spark by (11.5k points)

I am trying to improve the accuracy of a logistic regression algorithm implemented in Spark using Java. To do this, I want to replace the null or invalid values in a column with the most frequent value of that column. For example:

Name|Place
a   |a1
a   |a2
a   |a2
    |d1
b   |a2
c   |a2
c   |
    |
d   |c1


In this case I would replace all the NULL values in column "Name" with 'a' and in column "Place" with 'a2'. So far I am only able to extract the most frequent value of a particular column. Can you please help me with the second step: how to replace the null or invalid values with the most frequent value of that column?
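For context, the two steps described here can be sketched outside Spark with plain Java collections — a minimal illustration of the logic only (the class and method names below are made up for this example, not Spark API), before translating it to DataFrame operations:

```java
import java.util.*;
import java.util.stream.*;

public class ModeFill {

    // Step 1: find the most frequent non-null value of a "column" (here a List).
    static String mostFrequent(List<String> column) {
        return column.stream()
                .filter(Objects::nonNull)
                .collect(Collectors.groupingBy(v -> v, Collectors.counting()))
                .entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null);
    }

    // Step 2: replace every null with that most frequent value.
    static List<String> fillNulls(List<String> column) {
        String mode = mostFrequent(column);
        return column.stream()
                .map(v -> v == null ? mode : v)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The "Name" column from the example above.
        List<String> name = Arrays.asList("a", "a", "a", null, "b", "c", "c", null, "d");
        System.out.println(mostFrequent(name)); // a
        System.out.println(fillNulls(name));    // [a, a, a, a, b, c, c, a, d]
    }
}
```

In Spark, step 1 maps to a `groupBy`/`count`/`orderBy` over the column, and step 2 is what the answer below addresses with the `na` fill functions.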

1 Answer

by (25.6k points)

For Java:

I think you need the fill(String value, String[] cols) method of your DataFrame's na() functions (org.apache.spark.sql.DataFrameNaFunctions), which replaces null values in the given list of columns with the value you specify.

So if you already know the value you want to replace the nulls with:

String[] colNames = {"Name"};

dataframe = dataframe.na().fill("a", colNames);

You can do the same for the rest of your columns.
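If you would rather fill several columns in one call, the na functions also accept a map from column name to replacement value (a sketch, assuming `dataframe` is an `org.apache.spark.sql.Dataset<Row>`):

```java
import java.util.HashMap;
import java.util.Map;

// Per-column replacement values, applied in a single fill() call.
Map<String, Object> fillValues = new HashMap<>();
fillValues.put("Name", "a");
fillValues.put("Place", "a2");
dataframe = dataframe.na().fill(fillValues);
```

This is convenient when the replacement value for each column comes from your earlier "most frequent value" computation: build the map from those results instead of hard-coding them.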

And if you want to solve this kind of problem in Scala:

You can use the .na.fill function (see org.apache.spark.sql.DataFrameNaFunctions for reference).

The function that you need here is:

 def fill(value: String, cols: Seq[String]): DataFrame

Now you can freely choose the columns, as well as the value to use in place of null or NaN entries.

For your case, do something like this:

val df2 = df.na.fill("a", Seq("Name"))
            .na.fill("a2", Seq("Place"))

...