I have implemented a simple linear regression example in C++ (single-variable for now) to help me get my head around the concepts. I'm pretty sure the key algorithm is right, but my performance is terrible.
This is the method which actually performs the gradient descent:
    void LinearRegression::BatchGradientDescent(std::vector<std::pair<int,int>>& data, float& theta1, float& theta2)
    {
        // 1/m, where m is the number of training examples.
        float weight = (1.0f / static_cast<float>(data.size()));
        float theta1Res = 0.0f;
        float theta2Res = 0.0f;

        // Accumulate the gradient terms over the whole data set.
        for (auto p : data)
        {
            float cost = Hypothesis(p.first, theta1, theta2) - p.second;
            theta1Res += cost;            // intercept term
            theta2Res += cost * p.first;  // slope term, scaled by x
        }

        // Simultaneous update of both parameters.
        theta1 = theta1 - (m_LearningRate * weight * theta1Res);
        theta2 = theta2 - (m_LearningRate * weight * theta2Res);
    }
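For reference, the update rule this is meant to implement (with m data points, learning rate alpha, and hypothesis h(x) = theta1 + x*theta2) is:

    \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h(x_i) - y_i \right)
    \theta_2 := \theta_2 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h(x_i) - y_i \right) x_i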
With the other key functions given as:
    float LinearRegression::Hypothesis(float x, float theta1, float theta2) const
    {
        return theta1 + x * theta2;
    }

    float LinearRegression::CostFunction(std::vector<std::pair<int,int>>& data, float theta1, float theta2) const
    {
        float error = 0.0f;
        for (auto p : data)
        {
            float prediction = (Hypothesis(p.first, theta1, theta2) - p.second);
            error += prediction * prediction;
        }
        error *= 1.0f / (data.size() * 2.0f);
        return error;
    }

    void LinearRegression::Regress(std::vector<std::pair<int,int>>& data)
    {
        for (unsigned int itr = 0; itr < MAX_ITERATIONS; ++itr)
        {
            BatchGradientDescent(data, m_Theta1, m_Theta2);
            //Some visualisation code
        }
    }
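The cost figures quoted below come from logging the cost around each descent step, roughly like this (a simplified sketch of the instrumented loop, not my exact visualisation code):

    // Needs <cstdio> for printf.
    void LinearRegression::Regress(std::vector<std::pair<int,int>>& data)
    {
        for (unsigned int itr = 0; itr < MAX_ITERATIONS; ++itr)
        {
            float costBefore = CostFunction(data, m_Theta1, m_Theta2);
            BatchGradientDescent(data, m_Theta1, m_Theta2);
            float costAfter = CostFunction(data, m_Theta1, m_Theta2);

            printf("Cost before %f, cost after %f, difference %f\n",
                   costBefore, costAfter, costAfter - costBefore);

            //Some visualisation code
        }
    }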
Now the issue is that if the learning rate is greater than roughly 0.000001, the value of the cost function after gradient descent is higher than it was before; in other words, the algorithm is running in reverse. If I lower the learning rate to the point where it does converge, the line snaps onto a straight line through the origin pretty quickly, but then takes millions of iterations to actually reach a reasonably well-fit line.
With a learning rate of 0.01, the output for the first six iterations is as follows (where the difference is costAfter - costBefore):
    Cost before 102901.945312, cost after 517539430400.000000, difference 517539332096.000000
    Cost before 517539430400.000000, cost after 3131945127824588800.000000, difference 3131944578068774912.000000
    Cost before 3131945127824588800.000000, cost after 18953312418560698826620928.000000, difference 18953308959796185006080000.000000
    Cost before 18953312418560698826620928.000000, cost after 114697949347691988409089177681920.000000, difference 114697930004878874575022382383104.000000
    Cost before 114697949347691988409089177681920.000000, cost after inf, difference inf
    Cost before inf, cost after inf, difference nan
For the run that does converge, the thetas start at zero, the learning rate is 0.000001, and it takes 8,000,000 iterations! The visualisation code only updates the graph after every 100,000 iterations.
The function which creates the data points:
    // Needs <cstdlib> for srand/rand and <ctime> for time.
    static void SetupRegressionData(std::vector<std::pair<int,int>>& data)
    {
        srand(time(NULL));

        // x values end up roughly in [50, 848], y values in [400, 499].
        for (int x = 50; x < 750; x += 3)
        {
            data.push_back(std::pair<int,int>(x + (rand() % 100), 400 + (rand() % 100)));
        }
    }
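For completeness, the test is driven roughly like this (a sketch only: the LinearRegression constructor shown here, taking the initial thetas and the learning rate, is an assumed interface rather than my actual one):

    #include <vector>
    #include <utility>

    int main()
    {
        std::vector<std::pair<int,int>> data;
        SetupRegressionData(data);

        // Assumed constructor: initial theta1, initial theta2, learning rate.
        LinearRegression regression(0.0f, 0.0f, 0.000001f);
        regression.Regress(data);

        return 0;
    }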
In short: if my learning rate is too high, the gradient descent algorithm effectively runs in reverse and the cost tends to infinity, and if it is lowered to the point where it actually converges towards the minimum, the number of iterations required to do so is unacceptably high.
Have I missed anything/made a mistake in the core algorithm?