2 views

So I have a 2D scatter filled with points (x,y). I want to draw k vertical lines (x_1 = a, x_2 = b, ..., x_k = k), so as to partition the points into k groups.

The optimal solution would minimize the average variance of each group's y_value.

What is the appropriate algorithm? It sounded like k-means but I have the constraint that the lines must be vertical.

by (41.4k points)

Here, the solution is based on dynamic programming.

Now, with the following notations: (x_1, y_1), ..., (x_n, y_n) the points, with x_1 <= x_2 <= ... <= x_n to cut in K groups.

Var(i, j) the variance of the y's: y_i, ..., y_j.

F_K((x_1,y_1), ..., (x_n, y_n)) = F_k(1,n) the value of the best solution for the problem.

After that, we have the following:

F_k(i,j) = min for l in i...j-k+1 of (Var(i,l) + F_(k-1)(l+1, j) and

F_1(i,j) = Var(i,j).

Now, the best way to split the points in 'k' groups is to select the leftmost cut and for the remaining points, the best choice of k-1 cuts.

Now, we need a 3D array A of dimensions n*n*K to store the value of F_k(i,j) for all i,j,k.

So, the program will be like this:

function get_value(P: points, A: 3D array, i, j, k){

if A[i][j][k] is defined{

result = A[i][j][k]

} else if k == 1 {

A[i][j][k] = get_var(P, i, j)

result = A[i][j][k]

} else {

result = +INF

for l in i ... j-k+1 {

tmp = get_value(P, A, i, l, 1) + get_value(P, A, l+1, j, k-1)

if tmp < result {

result = tmp

}

}

}

return result

}

If you wish to learn more about how to use python for data science, then go through this data science python course by Intellipaat for more insights.