0 votes
1 view
in Data Science by (17.6k points)

So I have a 2D scatter filled with points (x,y). I want to draw k vertical lines (x_1 = a, x_2 = b, ..., x_k = k), so as to partition the points into k groups.

The optimal solution would minimize the average variance of each group's y_value.

What is the appropriate algorithm? It sounded like k-means but I have the constraint that the lines must be vertical.

1 Answer

0 votes
by (38.2k points)

Here, the solution is based on dynamic programming.

Now, with the following notations: (x_1, y_1), ..., (x_n, y_n) the points, with x_1 <= x_2 <= ... <= x_n to cut in K groups.

Var(i, j) the variance of the y's: y_i, ..., y_j.

F_K((x_1,y_1), ..., (x_n, y_n)) = F_k(1,n) the value of the best solution for the problem.

After that, we have the following:

F_k(i,j) = min for l in i...j-k+1 of (Var(i,l) + F_(k-1)(l+1, j) and

F_1(i,j) = Var(i,j).

Now, the best way to split the points in 'k' groups is to select the leftmost cut and for the remaining points, the best choice of k-1 cuts.

Now, we need a 3D array A of dimensions n*n*K to store the value of F_k(i,j) for all i,j,k. 

So, the program will be like this:

function get_value(P: points, A: 3D array, i, j, k){

  if A[i][j][k] is defined{

    result = A[i][j][k]

  } else if k == 1 {

    A[i][j][k] = get_var(P, i, j)

    result = A[i][j][k] 

  } else {

    result = +INF

    for l in i ... j-k+1 {

      tmp = get_value(P, A, i, l, 1) + get_value(P, A, l+1, j, k-1)

      if tmp < result {

        result = tmp

      }

    }

  }

  return result

}

If you wish to learn more about how to use python for data science, then go through this data science python course by Intellipaat for more insights.

Welcome to Intellipaat Community. Get your technical queries answered by top developers !


Categories

...