Linear and Logistic Regression

Intuition and implementation behind the base algorithms for supervised machine learning
Apr 28, 2021

Introduction

The most common form of machine learning is supervised learning, which requires a labeled dataset in order to generate predictions. Supervised learning has two subfields: regression and classification. In regression, the model predicts a continuous value (a number); in classification, the model assigns each data point to a class. The most basic algorithm for regression is linear regression, and the most basic for classification is logistic regression. These two algorithms come under the microscope in the following sections.

Linear Regression

The aim of the linear regression algorithm is to formulate a linear equation that captures the relationship between the independent and dependent variables; in other words, to find the line of best fit that predicts the dependent variable as closely as possible. Linear regression comes in two forms: simple and multiple.

When a single independent variable is used to predict the output, it is called simple linear regression; when there are two or more independent variables, it is multiple linear regression. The word linear comes from the fact that, in finding the line of best fit, the algorithm establishes a straight-line relationship between the independent and dependent variables.

Linear Regression Steps

As indicated above, the output of linear regression should be a continuous value, as can be seen in the figure above. The dependent variable (y) is marks and the independent variable (x) is hours_study. The equation of the best-fitted line can then be written in the form:

$$y = mx + c$$

where m is the slope of the line and c is the intercept. To generate the best-fitted line, random numbers are first assigned to both m and c, so that y can be calculated for a given value of x. For example, if hours_study is 5, the randomly initialized m and c yield a predicted value for the marks achieved by a student who studies for 5 hours. As indicated earlier, linear regression is a supervised learning algorithm, so the true values of y are already known.
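As a minimal sketch of this first step, the snippet below draws arbitrary starting values for m and c and uses them to predict the marks for 5 hours of study. The value ranges, the seed, and the helper name `predict` are assumptions made purely for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Start with arbitrary values for the slope (m) and intercept (c)
m = random.uniform(0, 10)
c = random.uniform(0, 10)

def predict(x, m, c):
    """Return the predicted marks for x hours of study using y = m*x + c."""
    return m * x + c

hours = 5
y_hat = predict(hours, m, c)
print(f"m={m:.2f}, c={c:.2f}, predicted marks for {hours} hours: {y_hat:.2f}")
```

With random parameters the prediction will almost certainly be poor, which is why the next step measures the error.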

Looking at the figure again, you can see that the values of y (marks) are already known, so we can check how close our predicted output, ŷ (y hat), is to the truth. The way to do this is to calculate the loss between what we predicted and what the value actually is. Suppose we predicted that studying for 10 hours would yield a mark of 48 but the actual value is 50; a function known as the loss function then quantifies the error in that prediction.
A very common loss function for this task is the mean squared error (MSE), which is represented mathematically as:

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

where n is the number of data points, y_i is the actual value and ŷ_i is the predicted value for the i-th data point.
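The mean squared error can be sketched in a few lines of plain Python. The function name `mean_squared_error_manual` and the three-point dataset are hypothetical; only the 48-versus-50 example comes from the text above.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    n = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n

# Single prediction from the text: predicted 48, actual 50
single_error = mean_squared_error_manual([50], [48])  # (50 - 48)^2 / 1

# A few hypothetical points (illustrative values, not the article's dataset)
actual = [50, 62, 71]
predicted = [48, 65, 70]
mse = mean_squared_error_manual(actual, predicted)  # (4 + 9 + 1) / 3

print(single_error, mse)
```

Squaring the differences keeps positive and negative errors from cancelling out and penalizes large misses more heavily than small ones.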

To achieve the line of best fit, the error is minimized using a technique called gradient descent, which repeatedly adjusts m and c in the direction that reduces the loss. After minimizing the error, the final equation can be used to predict the y-value for any x, so the marks can be predicted for any number of hours studied.
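The following is a rough sketch of gradient descent for a single-feature line, assuming NumPy is available. The learning rate, the number of iterations, the zero starting values, and the toy data are all assumptions chosen so the example converges; they are not taken from the article's dataset.

```python
import numpy as np

def gradient_descent(x, y, lr=0.01, epochs=10000):
    """Fit y = m*x + c by repeatedly stepping against the MSE gradient."""
    m, c = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        y_hat = m * x + c
        error = y_hat - y
        # Partial derivatives of MSE = (1/n) * sum((y_hat - y)^2)
        grad_m = (2.0 / n) * np.sum(error * x)
        grad_c = (2.0 / n) * np.sum(error)
        # Move each parameter a small step in the direction that lowers the MSE
        m -= lr * grad_m
        c -= lr * grad_c
    return m, c

# Hypothetical data lying exactly on the line y = 5x + 20
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 5.0 * x + 20.0

m, c = gradient_descent(x, y)
print(f"m is approximately {m:.3f}, c is approximately {c:.3f}")
```

On this toy data the fitted values should approach m = 5 and c = 20. Note that scikit-learn's LinearRegression, used in the next section, solves the least-squares problem directly rather than iterating like this, but it minimizes the same squared error.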

Linear Regression Implementation

In this section, we are going to implement the linear regression algorithm in Python. Python's libraries make it easy to implement the steps mentioned above: generating predictions and calculating the error between the actual and predicted values. Using the same data, where we try to predict the marks scored by students given the number of hours studied, we begin by importing the necessary libraries.

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
```

The scikit-learn library is a machine learning and preprocessing library that does a lot of the heavy lifting for us and provides functions that make linear regression easier and faster to perform.

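```python
# Independent variable (feature) as a 2D column, as scikit-learn expects
X = data['hours_study'].values.reshape(-1, 1)
# Dependent variable (target)
y = data['marks'].values.reshape(-1, 1)
```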

We first split our data into X and y, with the independent variable represented by X and the dependent variable (the value we are trying to predict) represented by y, so that the linear regression algorithm can use them to find the line of best fit and generate the predictions for us.
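The `reshape(-1, 1)` call is needed because scikit-learn estimators expect the features as a 2D array of shape (n_samples, n_features). A small sketch with made-up hours, assuming NumPy is available, shows the effect:

```python
import numpy as np

# Hypothetical study hours for four students (a flat, 1D array)
hours = np.array([2, 5, 7, 10])
print(hours.shape)   # one dimension: four values

# -1 lets NumPy infer the number of rows; 1 fixes a single column
X = hours.reshape(-1, 1)
print(X.shape)       # two dimensions: four rows, one column
```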

```python
Lr = LinearRegression()
Lr.fit(X, y)
```

We first instantiate the LinearRegression object and store it in a variable named Lr, so that we can refer to the model by this short name. We then call the fit function, which fits the model to X and y; in other words, we train the algorithm.
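After fitting, the learned slope and intercept can be read from the model's `coef_` and `intercept_` attributes. The sketch below uses a small made-up dataset as a stand-in for the hours_study and marks columns, and the variable name `model` is an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in data lying exactly on y = 5x + 20
X = np.array([1, 2, 3, 4, 5], dtype=float).reshape(-1, 1)
y = 5.0 * X.ravel() + 20.0

model = LinearRegression()
model.fit(X, y)

m = model.coef_[0]      # slope of the fitted line
c = model.intercept_    # intercept of the fitted line
print(f"y = {m:.2f} * x + {c:.2f}")
```

These two numbers are the m and c of the best-fitted line described in the steps above.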

```python
y_hat = Lr.predict(X)
mean_squared_error(y, y_hat)
```

We then create our predictions by calling the predict function and storing the result in a variable, y_hat. Using the mean squared error as the error metric, we compute the error between our predictions and the actual values. In this case, the value was 10.2. Because the MSE is measured in squared marks, its square root (about 3.2) is easier to interpret: when predicting the marks scored by students from the number of hours studied, our model is typically off by roughly 3.2 marks.
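Since the mean squared error is expressed in squared units, a common follow-up is the root mean squared error (RMSE), which is back in the units of the target. A short sketch using the MSE value reported above:

```python
import math

mse = 10.2              # mean squared error reported in the text
rmse = math.sqrt(mse)   # root mean squared error, in marks

print(f"RMSE is about {rmse:.2f} marks")
```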

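```python
import matplotlib.pyplot as plt

# Scatter the predicted marks against the actual marks
plt.scatter(y, y_hat)
# Dashed reference line where predicted marks equal actual marks
plt.plot([y.min(), y.max()], [y.min(), y.max()], '--', color='green', lw=3)
plt.xlabel('Actual Marks')
plt.ylabel('Predicted Marks')
plt.title('Actuals Vs Predicted Marks')
plt.show()
```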

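We use a Python library called matplotlib, which generates graphs and figures with a few lines of code, to plot the predicted marks against the actual marks. The dashed green line marks where the predicted and actual values are equal, so the closer a point lies to it, the better the prediction.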

It can be seen that the model did a reasonable job of predicting the marks of students given the number of hours studied, but it could have done better: some data points lie clearly off the line, which is also reflected in the mean squared error of 10.2.

Logistic Regression

Logistic regression is used for classification problems where we want to classify elements into one of two groups, also known as binary classification. It predicts a categorical dependent variable (y) given the independent variables (x). Its output is always between 0 and 1; in other words, it is used where the probability of belonging to one of two classes is required, such as whether something is expensive or not. We will use the same data as for the linear regression, namely the marks students obtained and the number of hours they studied.

Logistic Regression Steps

Logistic regression needs a threshold: if the predicted probability for an element is higher than this threshold, the element is classified as belonging to one group; otherwise it is assigned to the other. To kick things off, we compute a linear combination of the inputs, z = mx + c, by following the steps in the linear regression. This straight line cannot be used directly to predict classes, because its output is unbounded and can fall below 0 or above 1, particularly for extreme inputs. To take care of this, we convert the linear output into a probability by using the sigmoid equation:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

This equation converts any real number to a value between 0 and 1, which can be read as a probability and then compared against the threshold.
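A minimal sketch of the sigmoid function in plain Python follows; the sample inputs are arbitrary and chosen only to show the behavior at negative, zero, and positive values.

```python
import math

def sigmoid(z):
    """Map any real number z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Large negative inputs approach 0, large positive inputs approach 1
for z in (-5, 0, 5):
    print(z, round(sigmoid(z), 4))
```

An input of 0 maps to exactly 0.5, the midpoint of the curve.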

As can be seen in the figure, the sigmoid curve maps any value fed to it to a value between 0 and 1. Back to our example: as indicated earlier, our independent variable is hours_study and the dependent variable is marks. For the classification, we need to partition our data into two classes, passed or not. We set the pass mark to 70, so a student scoring above 70 passed the exam, and a student who did not reach that score did not pass.

To build this classification model, we label each student 1 if they passed (got more than 70) and 0 if they did not. The model then feeds a linear combination of hours_study into the sigmoid function to obtain the probability that a student passed, and that probability is compared with a probability threshold (commonly 0.5) to output a 1 or a 0. Therefore, we can easily classify our outputs into two classes, passed or not.

As can be seen in the figure, the upper part of the sigmoid represents passed and the lower portion represents did not pass, so any probability above the threshold is set to 1, and any probability below it is set to 0. Note that the sigmoid output is a probability, not the mark itself. For example, a student whose predicted probability of passing is 0.82 is above the 0.5 threshold, so the model outputs a 1 and the student is predicted to pass; likewise, a student with a predicted probability of 0.35 is below the threshold, so the model assigns a 0 and the student is predicted to fail.
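The thresholding step can be sketched as below. The coefficients m and c are hypothetical values picked only so the two example students fall on opposite sides of the curve, and the 0.5 default for `threshold` is an assumption matching the common convention; it can be changed.

```python
import math

def sigmoid(z):
    """Map any real number z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(hours, m, c, threshold=0.5):
    """Return (probability of passing, predicted label) for given study hours."""
    p = sigmoid(m * hours + c)
    return p, int(p >= threshold)

# Hypothetical coefficients chosen only for illustration
m, c = 1.0, -6.0

p_low, label_low = classify(3, m, c)     # z = -3, probability below 0.5
p_high, label_high = classify(9, m, c)   # z = 3, probability above 0.5

print(round(p_low, 3), label_low)
print(round(p_high, 3), label_high)
```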

Here too, we need a way to determine how well our model performs this classification. A common metric for this is the accuracy score, which is simply the proportion of correctly classified observations out of all observations, usually expressed as a percentage.
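Accuracy is simple enough to compute by hand. The sketch below uses made-up labels (1 for passed, 0 for did not pass), and the function name `accuracy` is hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the actual labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical labels for eight students
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

score = accuracy(actual, predicted)   # 6 of the 8 predictions match
print(f"{score:.0%}")
```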

Logistic Regression Implementation

In this section, we are going to implement the logistic regression algorithm for our dataset. Remember that we are using the same dataset as for the linear regression, so X is the same, while the target must be the binary pass/fail label (1 if the marks are above 70, otherwise 0) rather than the raw marks. As mentioned before, the libraries make it easy for us to implement this with a few lines of code.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
```

We import the logistic regression algorithm from the scikit-learn library, along with the accuracy score in order to evaluate our model. Note that the libraries already imported for the linear regression are reused, so they are not shown again in this section.

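```python
# Binary target: 1 if the student passed (marks above 70), otherwise 0
y = (data['marks'] > 70).astype(int).values
Lr = LogisticRegression()
Lr.fit(X, y)
```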

We instantiate the logistic regression algorithm just as we did for the linear regression, store it as Lr, and fit it on X and y, where y must hold the binary pass/fail labels. As before, fitting means that we are training our algorithm.

``pred = Lr.predict(X)``

After training our algorithm, we need to make predictions, so we call the predict function and store the result as pred.

``accuracy_score(y, pred)``

We call the accuracy function, passing the actual labels first and the predictions second, to compute how accurate our model was in predicting whether a student passed or not. It achieved a score of 86.3%, meaning that, using the number of hours studied, the logistic regression model correctly predicts whether a student passes the exam for 86.3% of the students in this dataset, which is quite good.
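To look at the probabilities behind the 0/1 predictions, scikit-learn's `predict_proba` can be used. The sketch below runs on a small made-up dataset (not the article's data), and the names `hours`, `passed`, and `clf` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: study hours and whether the student passed (1) or not (0)
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float).reshape(-1, 1)
passed = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

clf = LogisticRegression()
clf.fit(hours, passed)

# Column 1 of predict_proba holds the probability of class 1 (passed)
proba = clf.predict_proba(hours)[:, 1]
labels = clf.predict(hours)

for h, p, label in zip(hours.ravel(), proba, labels):
    print(f"{h:>4.0f} hours -> P(pass) = {p:.2f}, predicted class = {label}")
```

For a binary problem, `predict` returns class 1 whenever this probability is above 0.5, which is the thresholding described in the steps section.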

Conclusion

We looked at the intuition behind the linear and logistic regression algorithms and at how to implement them in Python. Note that there are more advanced approaches to regression and classification tasks, which can give lower errors and higher accuracies respectively. Hopefully, this article serves as a guide for individuals who are just starting out and want a good grounding in the base algorithms for supervised learning and how to apply them to their own use cases.