In this article, I will present linear models in terms of questions and answers that can come up during an interview. I will start from very basic interview questions and cover more advanced ones later. So let’s get started.


Table of Contents

  • What is a linear model?
  • How do we train a Linear Regression model?
  • Explain Gradient Descent
  • What is a learning rate?
  • What are some Gradient Descent pitfalls?
  • What is a convex function? Give an example of a convex function.
  • Why do we use feature scaling in gradient descent?
  • Conclusion
  • References

What is a linear model?

A linear model makes a prediction by simply computing a weighted sum of the input features, plus a constant called the bias term.
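To make this concrete, here is a minimal NumPy sketch; the weights, bias, and feature values are made up purely for illustration:

```python
import numpy as np

# A linear model's prediction is a weighted sum of the input features plus a bias.
weights = np.array([0.5, -1.2, 3.0])   # one weight per feature (theta_1 .. theta_n)
bias = 0.7                             # the bias (intercept) term, theta_0
x = np.array([2.0, 1.0, 0.5])          # one instance with three feature values

y_hat = bias + weights @ x             # theta_0 + theta_1*x_1 + ... + theta_n*x_n
print(y_hat)                           # 0.7 + 1.0 - 1.2 + 1.5 = 2.0
```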


How do we train a Linear Regression model?

To train a Linear Regression model, we first need to select a performance metric, usually the Root Mean Square Error (RMSE), to evaluate how well the model fits the training set.

In practice, it is simpler to minimize the Mean Squared Error (MSE) than the RMSE, and it leads to the same result (because the value that minimizes a function also minimizes its square root).

In this way, we find the model parameters that minimize the cost function over the training set.
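As a tiny illustration with made-up targets and predictions, note that the RMSE is just the square root of the MSE:

```python
import numpy as np

# Made-up targets and predictions, just to show the two metrics.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.3, 2.9, 6.8])

mse = np.mean((y_true - y_pred) ** 2)   # Mean Squared Error
rmse = np.sqrt(mse)                     # RMSE is the square root of the MSE
print(mse, rmse)                        # ~0.0825 and ~0.287
```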

There are two techniques to train a Linear Regression model:

  • Normal Equation: a closed-form mathematical equation that gives the result directly. However, this technique is rarely used because it is computationally expensive: it requires computing the inverse of XᵀX, which is an (n + 1) × (n + 1) matrix (where n is the number of features). The computational complexity of inverting such a matrix is typically about O(n^2.4) to O(n^3), depending on the implementation. In other words, if you double the number of features, you multiply the computation time by roughly 2^2.4 ≈ 5.3 to 2^3 = 8. A short sketch of this equation follows the list below.
  • Gradient Descent: a very generic optimization algorithm capable of finding optimal solutions to a wide range of problems. It is better suited for cases where there are a large number of features, or too many training instances to fit in memory. The general idea of Gradient Descent is to tweak the parameters iteratively in order to minimize a cost function.
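Here is a minimal NumPy sketch of the Normal Equation on made-up toy data (the underlying line y ≈ 4 + 3x and the noise are arbitrary choices):

```python
import numpy as np

# Toy data: y ≈ 4 + 3x plus Gaussian noise (all numbers made up).
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.standard_normal(100)

X_b = np.c_[np.ones((100, 1)), X]       # add x0 = 1 to each instance for the bias term
# Normal Equation: theta = (X^T X)^(-1) X^T y
theta_best = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
print(theta_best)                       # close to [4, 3]
```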

Explain Gradient Descent

Gradient Descent measures the local gradient of the error function with regard to the parameter vector θ, and it goes in the direction of the descending gradient. Once the gradient is zero, you have reached a minimum!

The steps are:

  • Start by filling θ with random values (this is called random initialization).
  • Improve it gradually, taking one baby step at a time, each step attempting to decrease the cost function (e.g., the MSE), until the algorithm converges to a minimum, as in the sketch below.
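Here is a minimal batch Gradient Descent sketch for Linear Regression with the MSE cost, on the same kind of made-up toy data as above (the learning rate and iteration count are arbitrary picks):

```python
import numpy as np

# Toy data: y ≈ 4 + 3x plus Gaussian noise (all numbers made up).
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.standard_normal(100)
X_b = np.c_[np.ones((100, 1)), X]       # add the bias feature x0 = 1

eta = 0.1                               # learning rate
theta = rng.standard_normal(2)          # random initialization
for _ in range(1000):
    gradients = 2 / len(X_b) * X_b.T @ (X_b @ theta - y)  # gradient of the MSE
    theta -= eta * gradients            # one step in the direction of descending gradient
print(theta)                            # converges close to [4, 3]
```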

What is a learning rate?

An important parameter in Gradient Descent is the size of the steps, determined by the learning rate hyperparameter. If the learning rate is too small, then the algorithm will have to go through many iterations to converge, which will take a long time.

Too small learning rate

On the other hand, if the learning rate is too high, the algorithm might diverge, failing to find a good solution.

Too large learning rate
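To see both failure modes in code, here is the same toy problem trained with a learning rate that is too small, one that is reasonable, and one that is too large (all three values are arbitrary picks):

```python
import numpy as np

# Same made-up toy problem as above, trained with three learning rates.
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.standard_normal(100)
X_b = np.c_[np.ones((100, 1)), X]

for eta in (0.02, 0.1, 1.1):            # too small, reasonable, too large
    theta = np.zeros(2)
    for _ in range(100):
        theta -= eta * 2 / len(X_b) * X_b.T @ (X_b @ theta - y)
    print(f"eta={eta}: theta={theta}")
# eta=0.02 is still far from [4, 3] after 100 steps, eta=0.1 is already close,
# and eta=1.1 blows up to huge values: the algorithm diverges.
```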

What are some Gradient Descent pitfalls?

Note that not all cost functions are convex (i.e., shaped like a regular bowl with a single global minimum). There may be holes, ridges, plateaus, and all sorts of irregular terrain, making convergence to the minimum very difficult.

Picture a cost function with a local minimum on its left side, the global minimum in the middle, and a long plateau on its right side. We can notice:

  • If the random initialization starts the algorithm on the left, it will converge to the local minimum, which is not as good as the global minimum (the sketch after this list demonstrates exactly this pitfall).
  • If it starts on the right, it will take a very long time to cross the plateau, and if you stop too early, you will never reach the global minimum.
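Here is a tiny demonstration of the local-minimum pitfall on a made-up one-dimensional non-convex cost function (the function, learning rate, and starting points are all arbitrary): depending on where the random initialization lands, plain Gradient Descent settles in a different minimum.

```python
# A made-up 1-D non-convex cost with one local and one global minimum.
cost = lambda t: t**4 - 3 * t**2 + t     # global minimum near t ≈ -1.30, local near t ≈ 1.13
grad = lambda t: 4 * t**3 - 6 * t + 1    # its derivative

for t0 in (-2.0, 2.0):                   # two different initializations
    t = t0
    for _ in range(200):
        t -= 0.01 * grad(t)              # plain Gradient Descent step
    print(f"start={t0:+.1f} -> t={t:.3f}, cost={cost(t):.3f}")
# Starting at -2.0 reaches the global minimum (cost ≈ -3.51); starting at
# +2.0 gets stuck in the worse local minimum (cost ≈ -1.07).
```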

What is a convex function? Give an example of a convex function.

A function is called convex if the line segment between any two points on its curve never crosses the curve (the segment always lies on or above it). This implies that there are no local minima, just one global minimum.

An example of a convex function is the MSE cost function of a Linear Regression model. It is also a continuous function with a slope that never changes abruptly (its derivative is Lipschitz continuous).

These two facts have a great consequence: Gradient Descent is guaranteed to get arbitrarily close to the global minimum (if you wait long enough and if the learning rate is not too high).
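As a quick sanity check, here is a small script that numerically tests the defining inequality of convexity, f(t·x1 + (1−t)·x2) ≤ t·f(x1) + (1−t)·f(x2), on f(x) = x² (the sampled points are arbitrary):

```python
import numpy as np

# Numerically test the convexity inequality on f(x) = x**2.
f = lambda x: x ** 2
rng = np.random.default_rng(0)
for _ in range(1000):
    x1, x2 = rng.uniform(-10, 10, size=2)   # two random points
    t = rng.uniform()                        # random weight in [0, 1)
    # The chord at t must lie on or above the curve (tolerance for float error).
    assert f(t * x1 + (1 - t) * x2) <= t * f(x1) + (1 - t) * f(x2) + 1e-12
print("f(x) = x**2 passed the convexity check on 1000 random segments")
```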


Why do we use feature scaling in gradient descent?

For Linear Regression, the MSE cost function has the shape of a bowl, but it can be an elongated bowl if the features have very different scales.

Gradient Descent with and without feature scaling

As you can see, on the left (with feature scaling) the Gradient Descent algorithm goes straight toward the minimum, thereby reaching it quickly, whereas on the right (without feature scaling) it eventually reaches the minimum, but it takes a long time.

As a rule of thumb, when using Gradient Descent you should ensure that all features have a similar scale (e.g., using Scikit-Learn’s StandardScaler class), or else training will take much longer to converge.
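In Scikit-Learn this is typically done with a pipeline; here is a minimal sketch on the same kind of made-up toy data (the hyperparameters are arbitrary):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up toy data: y ≈ 4 + 3x plus Gaussian noise.
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.standard_normal(100)

# Scale the features, then fit a linear model with Stochastic Gradient Descent.
model = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000, tol=1e-3))
model.fit(X, y)
print(model.predict([[1.5]]))            # prediction for a new instance with x = 1.5
```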

Conclusion

This brings us to the end of this article. I hope you got a basic understanding of linear models.

If you liked this article, please consider subscribing to my blog. That way I get to know that my work is valuable to you, and you will also be notified about future articles.
Thanks for reading, and I look forward to hearing your questions :)
Stay tuned and Happy Machine Learning.


References

  • Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, O’Reilly Media.