The reason is that we are trying to minimize the loss, and we do so with a gradient descent method. In essence, from our current point in parameter space (determined by the complete set of current weights), we want to move in a direction that decreases the loss function. Imagine standing on a hillside and walking downhill in the direction where the slope is steepest.

Mathematically, the direction of steepest descent from your current point in parameter space is the negative gradient. The gradient is simply the vector of partial derivatives of the loss function with respect to every single parameter.
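
To make this concrete, here is a minimal sketch of a gradient descent loop on a toy quadratic loss. The function names (`loss`, `grad`), the starting point, and the learning rate are illustrative assumptions, not taken from the text above; they simply show the update "move against the gradient" in action.

```python
import numpy as np

def loss(w):
    """Toy loss: sum of squared parameters, minimized at w = 0."""
    return np.sum(w ** 2)

def grad(w):
    """Gradient of the toy loss: the vector of partial derivatives
    with respect to each parameter."""
    return 2 * w

w = np.array([3.0, -2.0])   # current point in parameter space (the weights)
learning_rate = 0.1         # step size (an assumed value for illustration)

for step in range(50):
    # Step in the direction of the negative gradient, i.e. steepest descent.
    w = w - learning_rate * grad(w)

print(w, loss(w))  # w approaches [0, 0] and the loss approaches 0
```

Each iteration subtracts a small multiple of the gradient from the weights, so the loss shrinks at every step as long as the learning rate is not too large.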