
[Machine Learning] Regularization - Early Stopping | Weight Decay | Dropout

paka_corn 2023. 11. 6. 00:42

Regularization 

: prevents overfitting and improves the generalization of a model

- It introduces additional constraints or penalties into the model training process to discourage the model from becoming too complex.

- It aims to strike a balance between fitting the training data well and keeping the model simple.

 

 

Early Stopping 

: monitor performance on the validation set after each epoch and stop training once the validation performance starts to get worse.
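A minimal sketch of such a loop, assuming a PyTorch-style model; model, train_loader, val_loader, train_one_epoch, and validate are hypothetical placeholders, and patience is the number of bad epochs to tolerate:

```python
import copy

# Hypothetical placeholders: model, train_loader, val_loader,
# train_one_epoch(), and validate() (returns the validation loss).
best_val_loss = float("inf")
best_weights = None
patience = 5                          # how many bad epochs to tolerate
epochs_without_improvement = 0

for epoch in range(100):
    train_one_epoch(model, train_loader)
    val_loss = validate(model, val_loader)

    if val_loss < best_val_loss:      # validation improved -> remember this model
        best_val_loss = val_loss
        epochs_without_improvement = 0
        best_weights = copy.deepcopy(model.state_dict())
    else:                             # validation got worse
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break                     # stop training early

model.load_state_dict(best_weights)   # restore the best checkpoint
```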

 

When Does Overfitting Happen?

- too many parameters

- too little training data

 

 

How to Avoid Overfitting? 

- Improve Generalization! 

1. Weight Decay

: Limits the size of weights (model parameters) and prefers smaller weights, thereby reducing the complexity of the model

 

· Why do we use weight decay?

: to avoid overfitting

 

 

· When is it useful?

- Limited training data

=> A small dataset carries a higher risk of the model memorizing the training data

 

- Complex model (high-dimensional)

=> A higher capacity to fit the training data perfectly can lead to overfitting

 

- Many parameters

=> More parameters make the model more prone to overfitting

=> Weight decay reduces the influence of individual parameters.

 

L2 Regularization (Ridge Regularization)

- The regularization term penalizes large weights

- It encourages the model to have small weights.

- The strength of this penalty is controlled by a hyperparameter commonly denoted as "λ" (lambda).

- A higher λ value leads to a stronger regularization effect, pushing the weights closer to zero.
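Written out as a formula (the standard form of the penalty; L_data is the original training loss, w_i the individual weights, and λ the regularization strength):

```latex
L_{\text{total}}(w) = L_{\text{data}}(w) + \lambda \sum_{i} w_i^{2}
```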

 

Weight decay for vanilla SGD corresponds to L2 regularization.
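A small sketch of this correspondence, assuming PyTorch; the model, batch, learning rate, and λ value are arbitrary placeholders:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                            # placeholder model
criterion = nn.MSELoss()
lam = 1e-4                                          # λ, regularization strength
x, y = torch.randn(32, 10), torch.randn(32, 1)      # dummy batch

# Option 1: add the L2 penalty to the loss by hand.
# The 0.5 factor makes the penalty's gradient exactly lam * w.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
optimizer.zero_grad()
l2_penalty = 0.5 * lam * sum((p ** 2).sum() for p in model.parameters())
loss = criterion(model(x), y) + l2_penalty
loss.backward()
optimizer.step()

# Option 2: let the optimizer apply weight decay in its update step.
# For vanilla SGD (no momentum) this yields the same update as Option 1.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=lam)
```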

 

2. Dropout

: masks neurons only during training, NOT at test time

- useful for small datasets

Dropout: a regularization technique used in neural networks and deep learning to prevent overfitting.

 

·  During training,

- Dropout operates by randomly deactivating selected neurons (nodes) and their connections. => For example, if the dropout probability is 0.5, each neuron has a 50% chance of being deactivated.

- This random deactivation encourages the model to learn various weight combinations, preventing overfitting and enhancing generalization.

 

· During testing,

- All neurons are active. In other words, Dropout is deactivated, and all weights are utilized.
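A short sketch of this train/test behaviour, assuming PyTorch; the layer sizes and the dropout probability of 0.5 are arbitrary:

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(100, 50),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # each hidden unit is zeroed with 50% probability
    nn.Linear(50, 10),
)

x = torch.randn(1, 100)  # dummy input

net.train()              # training mode: dropout is active
out_train = net(x)

net.eval()               # test mode: dropout is off, every neuron contributes
out_test = net(x)
```

PyTorch's Dropout rescales the surviving activations by 1/(1 - p) during training, so no extra rescaling is needed when every neuron is used at test time.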
