Overfitting Vs Underfitting

Published on :

21 Aug, 2024

Blog Author :

N/A

Edited by :

Rashmi Kulkarni

Reviewed by :

Dheeraj Vaidya

Difference Between Overfitting And Underfitting

The key difference between overfitting and underfitting lies in how well a model identifies patterns in data. Overfitting occurs when a machine learning model fits the training data so closely that it cannot interpret or analyze new, unseen data, which makes its predictions inaccurate and unreliable. Underfitting occurs when a model cannot learn the relationship between input variables and output (target) values, even on the training data. A machine learning model is therefore considered reliable only when it generalizes from its training data and makes accurate predictions on new data.

  • Overfitting is when the model works well with training data but not with new or unseen data. On new data, it performs poorly and offers inaccurate predictions.
  • Underfitting is when a model is incapable of linking input variables and output or target values. It is easier to identify than overfitting.
  • Both are problems with machine learning models. An ideal, balanced model generalizes to new data, predicts outcomes accurately, and assists in critical decision-making.
  • Variance and bias help identify whether a machine learning model is underfitting or overfitting.
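The takeaways above suggest a simple check: compare the error on training data with the error on held-out data. The sketch below is only an illustration. The function name `diagnose_fit` and both thresholds (`high_error`, `max_gap`) are assumptions chosen for the example, not standard values.

```python
def diagnose_fit(train_error, val_error, high_error=0.30, max_gap=0.10):
    """Label a model from its training and validation error rates.

    high_error and max_gap are illustrative thresholds, not standard values.
    """
    if train_error >= high_error:
        # Poor even on data the model has already seen: high bias.
        return "underfitting"
    if val_error - train_error > max_gap:
        # Good on training data, much worse on unseen data: high variance.
        return "overfitting"
    return "balanced"


print(diagnose_fit(0.02, 0.35))  # low training error, large gap
print(diagnose_fit(0.40, 0.42))  # high error on both sets
print(diagnose_fit(0.08, 0.11))  # low error, small gap
```

In practice, suitable thresholds depend on the task and the metric, so they would need to be tuned rather than copied.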

Comparative Table

Particulars | Overfitting | Underfitting
Meaning | The model performs well with training data but not with new data. | The model performs poorly because it fails to link input and output.
Data/Duration | Models need to be regularized and adjusted to minimize overfitting. | To reduce underfitting, training duration and data size must be increased.
Variance | It has a high variance. | It has a low variance.
Bias | It has a low bias. | It has a high bias.
Complexity | The model is very complex. Hence, complexity must be reduced. | The model is too simple, which is not ideal. Hence, complexity must be added for better prediction results.
Spotting | Overfitting is difficult to spot in a machine learning model. | Compared to overfitting, underfitting models are easy to identify as they perform poorly, even with training datasets.
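
The variance and bias rows of the table can be tied together with the standard bias-variance decomposition of expected squared error. The symbols are assumptions introduced for this sketch: the data follow y = f(x) + ε with zero-mean noise of variance σ², f̂ is the trained model, and the expectation is taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \sigma^2
```

An overfitting model makes the variance term large, an underfitting model makes the bias term large, and the noise term σ² cannot be reduced by any model.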

What Is Overfitting?

Overfitting is an accuracy problem in which a machine learning model responds correctly only to the data it was trained on. Data scientists train a model on a set of training data, also called a known dataset, and the model then attempts to make predictions on new data. An overfitting model presents inaccurate results on that new data and therefore does not support decision-making. In a nutshell, the model has memorized the training data, including its noise, instead of learning patterns that carry over to unseen data.

The reasons for overfitting in machine learning models are:

  • The model fits the training data too closely, so noise and inaccurate values in that data can mislead it.
  • The model has high complexity with multiple details, levels, and features. Random or irrelevant information can create noise.
  • A model exhibiting overfitting has high variance and low bias.
  • The training data is small and does not have enough data samples to represent or cover all kinds of possible values.
  • The model is trained for too long on data that is noisy or irrelevant.
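
The reasons above can be seen in a small experiment. The sketch below assumes hypothetical data (a line y = 2x plus Gaussian noise) and uses a 1-nearest-neighbor predictor as a stand-in for an overly flexible model, since it memorizes every training point, noise included. The function names are illustrative.

```python
import random


def nearest_neighbor_predict(train_x, train_y, x):
    """Return the label of the single closest training point (1-NN)."""
    closest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[closest]


def mean_squared_error(train_x, train_y, xs, ys):
    """Average squared error of 1-NN predictions on (xs, ys)."""
    errors = [(nearest_neighbor_predict(train_x, train_y, x) - y) ** 2
              for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


def make_noisy_line(n, rng):
    """Hypothetical data: y = 2x plus Gaussian noise."""
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, 0.5) for x in xs]
    return xs, ys


rng = random.Random(0)
train_x, train_y = make_noisy_line(20, rng)   # small, noisy training set
test_x, test_y = make_noisy_line(200, rng)    # unseen data, same process

train_mse = mean_squared_error(train_x, train_y, train_x, train_y)
test_mse = mean_squared_error(train_x, train_y, test_x, test_y)
print(f"train MSE: {train_mse:.3f}")  # each point is its own nearest neighbor
print(f"test MSE:  {test_mse:.3f}")   # memorized noise does not carry over
```

The training error is zero because every training point is its own nearest neighbor, while the error on unseen data is not, which is the gap that signals overfitting.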

Some techniques to reduce overfitting in machine learning models are listed below.

  • Use larger datasets whose samples cover the range of possible values.
  • Simplify the model and reduce its complexity.
  • Use dropout in neural networks, which randomly disables units during training so that the network does not rely on any single one.
  • Apply ridge (L2) or lasso (L1) regularization to penalize large coefficients.
  • Use early stopping: track the validation loss during training and stop once it no longer improves.
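
Two of the techniques above, ridge regularization and early stopping, are small enough to sketch directly. This is a minimal illustration under stated assumptions: `ridge_slope` fits a single-feature model with no intercept, `early_stopping_epoch` scans an already recorded list of validation losses, and the function names, the `patience` parameter, and the sample numbers are all hypothetical.

```python
def ridge_slope(xs, ys, lam):
    """Slope of a no-intercept ridge fit.

    Minimizes sum((y - w * x) ** 2) + lam * w ** 2 over w.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch index with the best validation loss, ending the
    scan once the loss has failed to improve `patience` epochs in a row."""
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch


xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(ridge_slope(xs, ys, lam=0.0))   # ordinary least squares slope
print(ridge_slope(xs, ys, lam=14.0))  # penalty shrinks the slope toward zero

val_losses = [0.90, 0.60, 0.45, 0.47, 0.50, 0.40]
print(early_stopping_epoch(val_losses, patience=2))
```

With `patience=2`, the scan stops after two epochs without improvement and never sees the later, lower loss of 0.40. A larger patience trades extra training time for a chance to catch such late improvements.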

Example

Suppose FirstAdvantage Bank launched a new loan approval system, LoanFirst, to streamline its lending process. Based on machine learning, LoanFirst aimed to identify creditworthy borrowers. After it was installed, loan applications were processed in no time, and approval rates increased with each passing day. However, the bank noticed a high number of defaulters in the coming months, with loan delinquency rates rising month-on-month. As the team began to investigate, a problematic pattern emerged.

LoanFirst approved applicants based on specific but minor details seen in their financial profiles. These features were prominent in the training data when the model was trained. For instance, the tech team saw that LoanFirst approved applicants who belonged to a specific area in the city (based on how it was trained).

In other words, LoanFirst treated certain incidental parameters as important and prioritized applicants who matched them. This was overfitting: the model focused on specific patterns in the training data and ignored all other financial health indicators.

Due to this, FirstAdvantage Bank had to discontinue LoanFirst in its existing form. The tech team was instructed to build a new model that did not have overfitting problems.

What Is Underfitting?

Underfitting is a scenario where machine learning models fail to establish associations between input variables and output values. It is a common pitfall in machine learning, caused by insufficient training, too few features, or other reasons, and it results in a high degree of error and inaccuracy on both the training data and any new, unseen data.

It happens when a model is too simple and lacks detailed features. The model cannot capture important aspects of the data, which leads to high bias and low variance. It is the opposite of overfitting. With overfitting, the model at least predicts correct outcomes on training data, but with underfitting, even that is not possible, because the dominant trend (the important patterns) is never learned. Underfitting calls for more training, more features, and more data samples.

The reasons for underfitting in a machine learning model are:

  • The training data is noisy and unclean, consisting of values irrelevant to the model.
  • The machine learning model has high bias and low variance.
  • The training dataset size is inadequate and does not cover all types of possible values.
  • The model is too simple and lacks the levels and features needed to capture the pattern.

Some ways to reduce underfitting in machine learning models are:

  • Increase the number of features and levels in the model.
  • Increase the model's complexity.
  • Make the training data clean and relevant, covering all types of possible values, so that the model can establish links between input and output variables.
  • Increase the training duration of the model.
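
The first two remedies can be illustrated with a tiny, noise-free example. The sketch below assumes hypothetical data where the target is the square of the input. A straight line in x underfits, and adding x squared as a feature lets the same fitting routine capture the pattern. The function names are illustrative.

```python
def fit_line(features, targets):
    """Ordinary least squares for y = intercept + slope * feature."""
    n = len(features)
    mean_f = sum(features) / n
    mean_t = sum(targets) / n
    cov = sum((f - mean_f) * (t - mean_t) for f, t in zip(features, targets))
    var = sum((f - mean_f) ** 2 for f in features)
    slope = cov / var
    return mean_t - slope * mean_f, slope


def mse(features, targets, intercept, slope):
    """Mean squared error of the fitted line on the given data."""
    return sum((intercept + slope * f - t) ** 2
               for f, t in zip(features, targets)) / len(targets)


xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [x * x for x in xs]            # the true pattern is curved

# Too simple: a straight line in x cannot follow the curve.
a, b = fit_line(xs, ys)
print(mse(xs, ys, a, b))            # large error even on the training data

# Added feature: the same linear fit, but on x squared.
squared = [x * x for x in xs]
a2, b2 = fit_line(squared, ys)
print(mse(squared, ys, a2, b2))     # error drops to zero
```

The first model has a high error on the very data it was fitted to, which is the signature of underfitting. The second model is still linear in its feature, so the improvement comes entirely from the added feature.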

Example

Suppose Safe & Sound Insurance built a machine learning model to streamline risk assessment for car insurance policies. The model, SafeCar, was required to assess driving records, car details, etc., and compute the right coverage and premium amounts.

When the income from car premiums began to decline a few months later, Safe & Sound Insurance decided to investigate the matter. The team realized that the machine learning model suggested low coverage options to nearly all applicants, irrespective of their driving history, car type, etc. The company realized that many customers began moving to other insurance service providers where the coverage was high, and premiums were competitive.

The team understood that SafeCar was a victim of underfitting. The model, built to have a conservative approach, had become overly simple. It had failed to understand and acknowledge complexities in the data. Once this was clear, the team expanded the training data and trained the model to accommodate varied patterns. In this manner, Safe & Sound Insurance built a new model to regain its customers and improve its profitability.

Similarities

In this section, let us understand the similarities between overfitting and underfitting.

  • Overfitting and underfitting are two of the most common problems in machine learning, and both result in poor performance and inaccurate outcomes.
  • Analysts and data scientists do not prefer either situation. They prefer a reliable machine learning model that generalizes to new data and offers accurate outcomes.
  • Both can stem from inadequate data and from sample data containing noisy, irrelevant, or garbage values.

This article has been a guide to Overfitting Vs. Underfitting. We explain the differences and similarities between them, with examples and a comparative table.