
I have implemented a simple Linear Regression (single variate for now) example in C++ to help me get my head around the concepts. I'm pretty sure that the key algorithm is right but my performance is terrible.

This is the method which actually performs the gradient descent:

void LinearRegression::BatchGradientDescent(std::vector<std::pair<int,int>>& data, float& theta1, float& theta2)
{
    float weight = 1.0f / static_cast<float>(data.size());
    float theta1Res = 0.0f;
    float theta2Res = 0.0f;
    for (auto p : data)
    {
        float cost = Hypothesis(p.first, theta1, theta2) - p.second;
        theta1Res += cost;
        theta2Res += cost * p.first;
    }
    theta1 = theta1 - (m_LearningRate * weight * theta1Res);
    theta2 = theta2 - (m_LearningRate * weight * theta2Res);
}

With the other key functions given as:

float LinearRegression::Hypothesis(float x, float theta1, float theta2) const
{
    return theta1 + x * theta2;
}

float LinearRegression::CostFunction(std::vector<std::pair<int,int>>& data, float theta1, float theta2) const
{
    float error = 0.0f;
    for (auto p : data)
    {
        float prediction = Hypothesis(p.first, theta1, theta2) - p.second;
        error += prediction * prediction;
    }
    error *= 1.0f / (data.size() * 2.0f);
    return error;
}

void LinearRegression::Regress(std::vector<std::pair<int,int>>& data)
{
    for (unsigned int itr = 0; itr < MAX_ITERATIONS; ++itr)
    {
        BatchGradientDescent(data, m_Theta1, m_Theta2);
        // Some visualisation code
    }
}

Now the issue is that if the learning rate is greater than around 0.000001, the value of the cost function after each gradient descent step is higher than it was before. That is to say, the algorithm is diverging. The line snaps to a straight line through the origin fairly quickly, but then takes millions of iterations to actually reach a reasonably well-fitted line.

With a learning rate of 0.01, after six iterations the output is: (where the difference is costAfter-costBefore)

Cost before 102901.945312, cost after 517539430400.000000, difference 517539332096.000000
Cost before 517539430400.000000, cost after 3131945127824588800.000000, difference 3131944578068774912.000000
Cost before 3131945127824588800.000000, cost after 18953312418560698826620928.000000, difference 18953308959796185006080000.000000
Cost before 18953312418560698826620928.000000, cost after 114697949347691988409089177681920.000000, difference 114697930004878874575022382383104.000000
Cost before 114697949347691988409089177681920.000000, cost after inf, difference inf
Cost before inf, cost after inf, difference nan

In this example, the thetas are set to zero, the learning rate to 0.000001, and there are 8,000,000 iterations! The visualization code only updates the graph after every 100,000 iterations. The function which creates the data points:

static void SetupRegressionData(std::vector<std::pair<int,int>>& data)
{
    srand(time(NULL));
    for (int x = 50; x < 750; x += 3)
    {
        data.push_back(std::pair<int,int>(x + (rand() % 100), 400 + (rand() % 100)));
    }
}

In short: if my learning rate is too high, the gradient descent algorithm effectively runs backward and tends to infinity, and if it is lowered to the point where it actually converges towards the minimum, the number of iterations required to do so is unacceptably high.

Have I missed anything/made a mistake in the core algorithm?


Normalizing the variables will work well. Once you do that, you can experiment with different learning rates and iteration counts, and the algorithm behaves exactly as you would expect. For a vanilla implementation with a fixed learning rate, you should make your life easier by normalizing the data to zero mean and unit standard deviation before you feed it into the gradient descent algorithm. That way you can reason about the learning rate much more easily. Afterwards, you simply rescale your predictions (or fitted parameters) back to the original scale.
