Back

Explore Courses Blog Tutorials Interview Questions
0 votes
2 views
in Data Science by (17.6k points)

So I have a 2D scatter filled with points (x,y). I want to draw k vertical lines (x_1 = a, x_2 = b, ..., x_k = k), so as to partition the points into k groups.

The optimal solution would minimize the average variance of each group's y_value.

What is the appropriate algorithm? It sounded like k-means but I have the constraint that the lines must be vertical.

1 Answer

0 votes
by (41.4k points)

Here, the solution is based on dynamic programming.

Now, with the following notations: (x_1, y_1), ..., (x_n, y_n) the points, with x_1 <= x_2 <= ... <= x_n to cut in K groups.

Var(i, j) the variance of the y's: y_i, ..., y_j.

F_K((x_1,y_1), ..., (x_n, y_n)) = F_k(1,n) the value of the best solution for the problem.

After that, we have the following:

F_k(i,j) = min for l in i...j-k+1 of (Var(i,l) + F_(k-1)(l+1, j) and

F_1(i,j) = Var(i,j).

Now, the best way to split the points in 'k' groups is to select the leftmost cut and for the remaining points, the best choice of k-1 cuts.

Now, we need a 3D array A of dimensions n*n*K to store the value of F_k(i,j) for all i,j,k. 

So, the program will be like this:

function get_value(P: points, A: 3D array, i, j, k){

  if A[i][j][k] is defined{

    result = A[i][j][k]

  } else if k == 1 {

    A[i][j][k] = get_var(P, i, j)

    result = A[i][j][k] 

  } else {

    result = +INF

    for l in i ... j-k+1 {

      tmp = get_value(P, A, i, l, 1) + get_value(P, A, l+1, j, k-1)

      if tmp < result {

        result = tmp

      }

    }

  }

  return result

}

If you wish to learn more about how to use python for data science, then go through this data science python course by Intellipaat for more insights.

Browse Categories

...