Linear regression is a supervised machine learning algorithm that learns from labeled examples and then predicts the result for new items that were not used during the learning phase. Although it is often considered outdated and is outperformed by neural networks on many tasks, it is a great place to start learning about machine learning. It teaches useful concepts such as cost functions, gradient descent, and using derivatives to calculate improvement. In this example, I will generate a random set of samples, each with a single input variable and a label, and then use a simple linear regression model to recover the original function.
Generating the data set
To generate a simple linear dataset we will first think of an arbitrary linear equation. We will then generate some random points around the resulting line. After this step, we will pretend to have forgotten about the original equation and use the points to predict the linear equation parameters. In the end, we will compare the original equation parameters with the predicted parameters and observe how well the predicted linear equation fits the original one.
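    import matplotlib.pyplot as plt
    import numpy as np


    def func(x):
        # The "secret" linear equation we will later try to recover
        return 6 * x + 2


    # Evenly spaced points used to draw the original line
    xrng = np.arange(-2, 2, 0.01)
    yrng = func(xrng)

    # 100 random points scattered around the line (noise in [-3, 3])
    xxrng = np.random.uniform(-2, 2, 100)
    yyrng = func(xxrng) + np.random.uniform(-3, 3, 100)

    plt.plot(xrng, yrng)
    plt.scatter(xxrng, yyrng, c='r')
    plt.show()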
This code generates a random dataset similar to the one displayed in figure 1 (but with many more data points), which will be used in the next step. The randomly generated data can be exchanged for real data from one of the many datasets freely available on the Internet, as long as the dataset is reshaped to have a single input attribute and a single dependent attribute, and the relationship between them is linear by nature (it can be approximated well by a straight line).
Figure 1: The goal is to minimize all the errors e1 to e3, which is usually done by minimizing their average. Different cost functions use different types of averages.

The cost function measures how far the learned values are from the actual values. In this case, it is the distance between the actual point and the point calculated using the predicted function. This distance is used as the prediction error. During the learning process, we want to minimize this distance, which effectively means reducing the error of the predicted linear function. At the same time, we do not want to overfit, which could result in really bad predictions on new data. The goal is to keep both the training error and the validation error low.
Here are some well-known cost functions which can be used in the linear regression model:
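As an illustration, three commonly used regression cost functions are the mean squared error (MSE), the mean absolute error (MAE), and the root mean squared error (RMSE). The sketch below is a minimal NumPy version; the function names and the sample values are made up for this example and are not taken from the original code.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error: average of the squared distances."""
    return np.mean((y_true - y_pred) ** 2)


def mae(y_true, y_pred):
    """Mean absolute error: average of the absolute distances."""
    return np.mean(np.abs(y_true - y_pred))


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the MSE."""
    return np.sqrt(mse(y_true, y_pred))


# Hypothetical labels and predictions, chosen only for illustration.
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 3.0, 5.0])

print(mse(y_true, y_pred))   # (0 + 1 + 4) / 3
print(mae(y_true, y_pred))   # (0 + 1 + 2) / 3
print(rmse(y_true, y_pred))  # square root of the MSE above
```

Because the MSE squares each distance, a few large errors dominate it, while the MAE weighs all errors proportionally.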
Gradient descent is an algorithm that iteratively improves the desired result. In the case of linear regression, the weight and bias of the linear function being learned are initially selected at random (or set to zero) and then improved in each learning epoch. In each step the parameters are moved only by a small fraction of the slope, controlled by the learning rate, which needs to be selected carefully. If the learning rate is too small, the learning process might take too long. If it is too big, the minimum could be overshot, which results in the learned parameters exploding to totally wrong values.
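The effect of the learning rate can be seen on a toy problem. The sketch below minimizes f(w) = w², whose derivative is 2w, with three hypothetical learning rates; the function and the values are assumptions chosen for illustration, not part of the original code.

```python
def descend(alpha, w=1.0, steps=20):
    """Run gradient descent on f(w) = w**2 and return the final w."""
    for _ in range(steps):
        grad = 2 * w           # derivative of w**2
        w = w - alpha * grad   # move against the slope
    return w


small = descend(alpha=0.001)  # barely moves away from the starting point
good = descend(alpha=0.1)     # approaches the minimum at w = 0
large = descend(alpha=1.1)    # overshoots on every step and keeps growing

print(small, good, large)
```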
Gradient descent repeatedly applies the following update equations:
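In the standard formulation, and assuming the symbols $w$ for the weight, $b$ for the bias, $\alpha$ for the learning rate, and $J$ for the cost function (the symbol names are an assumption), the update rules take the form:

```latex
w_{\text{new}} = w_{\text{old}} - \alpha \, \frac{\partial J(w, b)}{\partial w}
\qquad
b_{\text{new}} = b_{\text{old}} - \alpha \, \frac{\partial J(w, b)}{\partial b}
```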
Each calculation step takes the old value of a parameter and improves it by a fraction of the partial derivative of the cost function. The partial derivatives are the slopes of the cost function at a particular point with respect to the weight and the bias separately. In the next subsection, we will see how to calculate partial differentials.
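A single update step could be sketched as follows, assuming the mean squared error as the cost function and a model of the form y = w·x + b. The helper name `gradient_step` and the default learning rate are hypothetical.

```python
import numpy as np


def gradient_step(w, b, x, y, alpha=0.1):
    """Return the weight and bias after one gradient descent update (MSE cost)."""
    error = (w * x + b) - y          # prediction minus actual value
    dw = 2 * np.mean(error * x)      # partial derivative of the MSE w.r.t. w
    db = 2 * np.mean(error)          # partial derivative of the MSE w.r.t. b
    return w - alpha * dw, b - alpha * db


# Noise-free points on the line y = 6x + 2, used only for illustration.
x = np.linspace(-2, 2, 50)
y = 6 * x + 2

w, b = 0.0, 0.0
for _ in range(200):
    w, b = gradient_step(w, b, x, y)

print(w, b)  # should end up close to 6 and 2
```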
Calculating partial differentials
Figure 2: Slope determines how much the value y changes when increasing the x value. In this case, y gets smaller if x is less than zero and then climbs up when x is greater than zero.

The derivative of a function tells us about the slope of that function with respect to the parameter we differentiate by. More precisely, its sign tells us the direction of change: a positive value means the function is increasing, a negative value means it is decreasing. This is why the equations in the previous section use subtraction: we want to move against the slope to find the minimum. The partial derivatives tell us how steep the function is in the direction of the respective parameter we differentiated by. So if we, for example, calculate a partial derivative with respect to x, the result tells us how quickly the function changes along the x axis.
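The sign of the slope can be checked numerically. The sketch below assumes the curve in figure 2 behaves like f(x) = x² (an assumption, since the exact function is not stated) and approximates the derivative with a central difference; `slope` is a hypothetical helper.

```python
def f(x):
    return x ** 2


def slope(func, x, h=1e-6):
    """Approximate the derivative of func at x with a central difference."""
    return (func(x + h) - func(x - h)) / (2 * h)


left = slope(f, -1.0)   # negative: f decreases here, the minimum lies to the right
right = slope(f, 1.0)   # positive: f increases here, the minimum lies to the left

# Subtracting a fraction of the slope moves x toward the minimum in both cases.
print(-1.0 - 0.1 * left, 1.0 - 0.1 * right)
```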
To differentiate a function of our choice, we often use the so-called chain rule. It tells us how to differentiate composite functions, that is, functions of other functions.
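In its general form, for a composite function $f(g(x))$, the chain rule reads as follows, with a small worked example added here for illustration:

```latex
\frac{d}{dx} f\bigl(g(x)\bigr) = f'\bigl(g(x)\bigr) \cdot g'(x)
\qquad \text{e.g.} \qquad
\frac{d}{dx} (3x + 1)^2 = 2\,(3x + 1) \cdot 3 = 6\,(3x + 1)
```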
A more complex chain rule example is the following:
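One such example, relevant to this article, is the mean squared error of the linear model. Assuming the cost $J(w, b) = \frac{1}{n}\sum_{i=1}^{n} (w x_i + b - y_i)^2$ (the choice of MSE and the symbol names are assumptions), applying the chain rule to the squared term gives:

```latex
\frac{\partial J}{\partial w} = \frac{1}{n} \sum_{i=1}^{n} 2\,(w x_i + b - y_i) \cdot x_i
\qquad
\frac{\partial J}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} 2\,(w x_i + b - y_i) \cdot 1
```

The outer function is the square, which contributes the factor $2\,(w x_i + b - y_i)$; the inner function is the linear model, whose derivative is $x_i$ with respect to $w$ and $1$ with respect to $b$.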
Although linear regression might be considered an older, less applicable, and less precise machine learning method, I believe it teaches some of the important basics. Understanding it can help with understanding more complex machine learning models like neural networks.
The full code for this topic can be found at the link below. I would encourage readers to play with the code to get a better understanding of the matter:
- Change the amount of randomness when generating the points used in learning. Observe how increasing and decreasing the random distances influences the precision.
- Use other cost functions. You will have to calculate and use their partial derivatives.
- Change the alpha (the learning rate) and see how it influences the learning process. Try to find an alpha value that makes the result explode instead of converging to the minimum.
- Reimplement the epochs so that, instead of using a hardcoded number, the code checks the difference between two consecutive epochs and stops learning once it is small enough.
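The last suggestion could be sketched roughly as follows. This is not the original code: it assumes the mean squared error as the cost function, and the names `fit`, `alpha`, `tol`, and `max_epochs` are hypothetical.

```python
import numpy as np


def fit(x, y, alpha=0.05, tol=1e-9, max_epochs=100_000):
    """Gradient descent that stops once the cost barely changes between epochs."""
    w, b = 0.0, 0.0
    prev_cost = np.inf
    for epoch in range(max_epochs):
        error = (w * x + b) - y
        cost = np.mean(error ** 2)
        if abs(prev_cost - cost) < tol:
            break  # improvement is small enough, stop learning
        prev_cost = cost
        w -= alpha * 2 * np.mean(error * x)  # step along dJ/dw
        b -= alpha * 2 * np.mean(error)      # step along dJ/db
    return w, b, epoch


rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
x = rng.uniform(-2, 2, 100)
y = 6 * x + 2 + rng.uniform(-3, 3, 100)

w, b, epochs = fit(x, y)
print(w, b, epochs)  # w and b should land near 6 and 2
```

The `max_epochs` cap is kept as a safety net in case the chosen alpha never lets the cost settle.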
Feel free to ask any questions here or on Twitter. I'll do my best to help.