Bias-Variance Tradeoff in Machine Learning
Before speaking about bias and variance, let's understand what a hypothesis set is and how we are going to define it. When you train a model, you are searching for a hypothesis function within some space of candidate functions. Whether you train a linear regression, a logistic regression, or a deep network, you need to understand what the hypothesis set is and how you are going to find the function you are looking for.
If we create a model to approximate a given target function, we are defining a hypothesis set. Our trained model is a single point in that set, which can be far from or close to the target function.
Now let's define our hypothesis set. Let's assume we have chosen a linear model, which defines the hypothesis set H.
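The formula for the chosen model is not shown in the text. As an assumption consistent with the linear functions discussed here, a one-dimensional linear model and its hypothesis set could be written as follows, where the parameter names w and b are illustrative:

```latex
\[
h_{w,b}(x) = w x + b,
\qquad
H = \left\{\, h_{w,b} \;:\; w, b \in \mathbb{R} \,\right\}
\]
```

Every choice of (w, b) is one point in H, and training picks one of those points.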
By choosing this function as the hypothesis function, we define a space that contains all possible linear functions. Now we have two main questions to answer.
- How well can my hypothesis set approximate the target function?
- How hard is it to find the particular function in the set that best approximates the target function?
To answer these questions, let's talk about the bias and variance of models.
Understanding Bias and Variance
To measure how well our model performs, we will use the squared error and its mathematical expectation. There is a great course by Professor Yaser Abu-Mostafa from Caltech that explains these ideas in depth.
In other words, we want to know the expected performance of our model, which is the mathematical expectation of the squared difference between our hypothesis and the target function. D denotes the fixed dataset we used for training. Because D is only one subset of all possible examples, to get an overall measurement we need to take the expectation (integrate) over all possible training sets.
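The quantity described above can be written down explicitly. The notation here is an assumption, not taken from the original: h^{(D)} is the hypothesis learned from the dataset D, f is the target function, and x is a single example.

```latex
\[
\mathbb{E}_{D}\!\left[ \left( h^{(D)}(x) - f(x) \right)^{2} \right]
\]
```

The expectation is taken over all datasets D that could have been drawn for training, while x is held fixed.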
This is a general analysis of our model's behavior over the entire space of examples: we want to know how it will behave on the whole set of examples. Now let's break this expression into parts.
Before moving on, let's take a look at the quantity we add on the right side. Once we have chosen a hypothesis set, the exact function we find to approximate the target depends on the particular dataset we have. For example, suppose we have a dataset of N examples to use for finding the best approximation. If we replace this dataset with another one, the function we find will be different, but it will come from the same hypothesis space. By doing this over and over again, we get many different functions. We want to define the best hypothesis function among those obtained by changing the dataset D.
The best function is the mathematical expectation of the hypothesis function over datasets, in other words, the mean of those functions. Now let's see what we get as a result of taking the expectation.
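As a sketch of this step, assume the notation h^{(D)} for the hypothesis learned from dataset D, f for the target function, and \bar{h} for the mean hypothesis (these symbols are illustrative). Adding and subtracting \bar{h}(x) inside the square gives:

```latex
\[
\bar{h}(x) = \mathbb{E}_{D}\!\left[ h^{(D)}(x) \right]
\]
\[
\begin{aligned}
\mathbb{E}_{D}\!\left[ \left( h^{(D)}(x) - f(x) \right)^{2} \right]
&= \mathbb{E}_{D}\!\left[ \left( h^{(D)}(x) - \bar{h}(x) + \bar{h}(x) - f(x) \right)^{2} \right] \\
&= \underbrace{\mathbb{E}_{D}\!\left[ \left( h^{(D)}(x) - \bar{h}(x) \right)^{2} \right]}_{\text{variance}(x)}
 + \underbrace{\left( \bar{h}(x) - f(x) \right)^{2}}_{\text{bias}(x)}
\end{aligned}
\]
```

The cross term 2(\bar{h}(x) - f(x)) E_D[h^{(D)}(x) - \bar{h}(x)] drops out because E_D[h^{(D)}(x)] equals \bar{h}(x) by definition.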
The first part is the variance and the second part is the bias. The bias shows how well the best (average) hypothesis approximates the target function, that is, the distance between them. The variance shows how much the learned hypotheses vary around that average when we change the training set. Looking back, we see that the mathematical expectation of our squared loss depends on the bias and the variance. At this point both are functions of the example x. To get their overall values, we take the expectation over the example space.
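In symbols, and again assuming the illustrative notation h^{(D)} for the hypothesis learned from dataset D, \bar{h} for the mean hypothesis over datasets, and f for the target function, the overall values could be written as:

```latex
\[
\text{bias} = \mathbb{E}_{x}\!\left[ \left( \bar{h}(x) - f(x) \right)^{2} \right],
\qquad
\text{variance} = \mathbb{E}_{x}\!\left[ \mathbb{E}_{D}\!\left[ \left( h^{(D)}(x) - \bar{h}(x) \right)^{2} \right] \right]
\]
\[
\mathbb{E}_{x}\!\left[ \mathbb{E}_{D}\!\left[ \left( h^{(D)}(x) - f(x) \right)^{2} \right] \right]
= \text{variance} + \text{bias}
\]
```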
But what will happen when we start to change the hypothesis set? How will this variance and bias behave?
Let's assume the target function is a cubic function and we have chosen the linear hypothesis set to approximate it.
This means our hypothesis set is small, so the variance will not be too big: the space is narrow, and any particular hypothesis from it will not be too far from the best (average) hypothesis. In other words, the variance will be small. But what about the bias? The bias tells us how far our best hypothesis is from the target function: is it close, or will even the best hypothesis be far from the target no matter which dataset we take? Remember, our target function is cubic. Whatever dataset you take, a linear function cannot approximate it well, and this is what a high bias value tells us. Here is what it looks like.
The left example is a small hypothesis set, which has small variance but high bias; the other one has small bias and high variance. It turns out that minimizing the MSE, which consists of bias and variance, is really hard: decreasing one value can increase the other.
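The contrast between a small and a larger hypothesis set can be checked numerically. The following is a minimal sketch, not the author's experiment: it assumes a cubic target x^3 on [-1, 1], Gaussian noise, training sets of 30 points, and polynomial fits of degree 1 and degree 5 as the two hypothesis sets. The function name `estimate_bias_variance` and all parameter values are hypothetical choices for illustration.

```python
import numpy as np


def target(x):
    # Assumed cubic target function (illustrative choice).
    return x ** 3


def estimate_bias_variance(degree, n_datasets=500, n_points=30,
                           noise_std=0.5, seed=0):
    """Monte Carlo estimate of bias and variance for polynomial fits."""
    rng = np.random.default_rng(seed)
    x_eval = np.linspace(-1.0, 1.0, 200)  # points where hypotheses are compared
    preds = np.empty((n_datasets, x_eval.size))
    for i in range(n_datasets):
        # Draw a new training set D and fit one hypothesis on it.
        x = rng.uniform(-1.0, 1.0, n_points)
        y = target(x) + rng.normal(0.0, noise_std, n_points)
        coeffs = np.polyfit(x, y, degree)
        preds[i] = np.polyval(coeffs, x_eval)
    mean_pred = preds.mean(axis=0)  # average hypothesis over datasets
    # Bias: squared distance between the average hypothesis and the target.
    bias = float(np.mean((mean_pred - target(x_eval)) ** 2))
    # Variance: spread of individual hypotheses around the average one.
    variance = float(np.mean(preds.var(axis=0)))
    return bias, variance


bias_small, var_small = estimate_bias_variance(degree=1)  # small hypothesis set
bias_large, var_large = estimate_bias_variance(degree=5)  # larger hypothesis set
print(f"degree 1: bias={bias_small:.4f}, variance={var_small:.4f}")
print(f"degree 5: bias={bias_large:.4f}, variance={var_large:.4f}")
```

Under these assumptions the linear fits should show the higher bias and the degree-5 fits the higher variance; the exact numbers depend on the noise level, the sample size, and the seed.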
There is an optimal point where bias and variance are balanced and both values are reasonable. To find that point, we plot each quantity as a curve against the complexity of the model. By the complexity of the model, we mean the complexity of the hypothesis set, that is, its size.
Our goal is to minimize the total loss, which consists of bias, variance, and a small noise term. These curves show that as we increase the complexity of the model, the bias decreases but the variance increases, so beyond the optimal point the total loss becomes high again.
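The noise term enters when the observed labels are noisy. As a sketch, assume y = f(x) + \epsilon with zero-mean noise of variance \sigma^2 that is independent of the training set; the symbols h^{(D)} (hypothesis learned from dataset D), \bar{h} (mean hypothesis), and f (target) are illustrative notation. The expected loss at a point x then splits into three parts:

```latex
\[
\mathbb{E}_{D,\epsilon}\!\left[ \left( h^{(D)}(x) - y \right)^{2} \right]
= \underbrace{\left( \bar{h}(x) - f(x) \right)^{2}}_{\text{bias}(x)}
+ \underbrace{\mathbb{E}_{D}\!\left[ \left( h^{(D)}(x) - \bar{h}(x) \right)^{2} \right]}_{\text{variance}(x)}
+ \underbrace{\sigma^{2}}_{\text{noise}}
\]
```

Only the first two terms depend on the hypothesis set; the noise term is a floor that no choice of model complexity removes.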
We can't choose a model that is too simple, because it can't even approximate the target function, and we can't choose one that is too big either, because it has high variance.
In the first case, our model will predict wrong results even on the training set, because the model is too simple. In the second case, our model will predict perfectly on the training set but will suffer on examples from the validation set, which the model hasn't seen during training. Here is what the predictions look like.
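The two failure cases can be reproduced on a single dataset by comparing training and validation error. This is a minimal sketch under assumed settings: a cubic target, 10 evenly spaced training points, Gaussian noise with standard deviation 0.2, and polynomial models of degree 1, 3, and 9. The degree-9 model has as many coefficients as there are training points, so it can fit them exactly.

```python
import numpy as np

rng = np.random.default_rng(0)


def target(x):
    # Assumed cubic target function (illustrative choice).
    return x ** 3


noise_std = 0.2                       # assumed noise level
x_train = np.linspace(-1.0, 1.0, 10)  # 10 training examples
y_train = target(x_train) + rng.normal(0.0, noise_std, x_train.size)
x_val = rng.uniform(-1.0, 1.0, 200)   # examples not seen during training
y_val = target(x_val) + rng.normal(0.0, noise_std, x_val.size)


def mse(coeffs, x, y):
    # Mean squared error of a fitted polynomial on the given examples.
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


results = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train),
                       mse(coeffs, x_val, y_val))
    print(f"degree {degree}: train MSE={results[degree][0]:.2e}, "
          f"validation MSE={results[degree][1]:.2e}")
```

A large gap between a near-zero training error and a much higher validation error is the usual symptom of overfitting, while a training error that is already high points to underfitting.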
This is a well-known problem in machine learning, especially in deep learning. Practitioners know it as underfitting (high bias) and overfitting (high variance). These are the main problems everybody faces, and there are many approaches to fixing them. People have tried to solve this in the following ways.
- Model Selection / Early Stopping
- Normalization Functions
- Augmentation Techniques
Every item on this list is a big topic in its own right, with many implementations and observations. Each has its own use cases, frameworks, and purposes.