in Big Data Hadoop & Spark by (11.4k points)

I am trying to improve the accuracy of the Logistic Regression algorithm implemented in Spark using Java. To do this, I am trying to replace the null or invalid values in a column with the most frequent value of that column. For example:

Name|Place
a   |a1
a   |a2
a   |a2
    |d1
b   |a2
c   |a2
c   |
    |
d   |c1


In this case I'll replace all the NULL values in column "Name" with 'a' and in column "Place" with 'a2'. So far I am only able to extract the most frequent value in a particular column. Can you please help me with the second step: how to replace the null or invalid values with the most frequent value of that column?

1 Answer

by (32.3k points)

For Java:

You can use the fill(String value, String[] columns) method available through your DataFrame's na() functions, which replaces null values in the given list of columns with the value you specify.

So, if you are clear about the value you want to replace the nulls with:

String[] colNames = {"Name"};

dataframe = dataframe.na().fill("a", colNames);

You can do the same for the rest of your columns.
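Spark's API aside, the second step you asked about reduces to "compute the column's mode, then substitute it for every null". Here is a minimal plain-Java sketch of that logic on an in-memory column (the class and method names are illustrative, not part of any Spark API); in Spark you would compute the mode with a groupBy/count and pass the result to fill as above.

```java
import java.util.*;
import java.util.stream.*;

public class ModeFill {
    // Most frequent non-null value in a column (assumes at least one non-null entry).
    static String mostFrequent(List<String> column) {
        Map<String, Long> counts = column.stream()
                .filter(Objects::nonNull)
                .collect(Collectors.groupingBy(v -> v, Collectors.counting()));
        return Collections.max(counts.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    // Replace every null in the column with its most frequent value.
    static List<String> fillWithMode(List<String> column) {
        String mode = mostFrequent(column);
        return column.stream()
                .map(v -> v == null ? mode : v)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The "Name" column from the question, with nulls where values are missing.
        List<String> names = Arrays.asList("a", "a", "a", null, "b", "c", "c", null, "d");
        System.out.println(fillWithMode(names)); // prints [a, a, a, a, b, c, c, a, d]
    }
}
```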

And if you want to solve the same problem in Scala:

You can use the .na.fill function (see org.apache.spark.sql.DataFrameNaFunctions for reference).

The function that you need here is:

 def fill(value: String, cols: Seq[String]): DataFrame

Here you can freely choose the columns, as well as the value with which to replace the null or NaN entries.

For your case, do something like this:

val df2 = df.na.fill("a", Seq("Name"))

            .na.fill("a2", Seq("Place"))

Learn Spark with this Spark Certification Course by Intellipaat.
