
[Deep Learning] Optimization - Gradient Descent | Stochastic Gradient Descent | Mini-batch SGD

paka_corn 2023. 11. 1. 11:08

Optimization

- Training a machine learning model often requires solving an optimization problem 

=> we have to find the parameters of the function f that minimize the loss function on the training data.

 

 

 

 

Problems in Optimization in Multi-dimensional Spaces

- TOO MANY CRITICAL POINTS! (critical points where f'(x) = 0) 

=> local minima, maxima, and saddle points 

 

 

 

How to Solve Optimization Problem?

Solution 1 : Find the Analytical Solution 

1) compute the gradient f'(x)

2) find a closed-form expression for all the critical points

   (hard to find a closed-form expression except for linear/simple functions)

3) test all the critical points we found and choose the ONE THAT MINIMIZES THE LOSS! (see the sketch below)
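As a concrete illustration (a minimal sketch, not from the original post, assuming an ordinary least-squares model), setting the gradient to zero gives the closed-form normal equations:

```python
import numpy as np

# Minimal sketch: for linear least squares, L(w) = ||Xw - y||^2 has a single
# critical point, found in closed form by setting the gradient to zero:
# X^T X w = X^T y   (the normal equations).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                # 100 samples, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w_closed = np.linalg.solve(X.T @ X, X.T @ y)
print(w_closed)                              # recovers something close to true_w
```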

 

 

Solution 2 : Find the Numerical Solution 

: finding an approximate solution to the problem 

1) Start from a set of candidate solutions (these methods typically find LOCAL OPTIMA only)

2) Progressively improve the candidate solutions until convergence.

(The algorithm is sensitive to the initialization point)

 

WHY LOCAL OPTIMA?

* finding a global optimum is computationally demanding 

* some algorithms can only find local optima 

(hill-climbing, coordinate descent, gradient descent)

* Local optimization is much faster than global optimization. 

 

 

 

Gradient Descent

 

Gradient (Derivative)

: When we have more than one parameter, we have to compute the partial derivatives with respect to all the parameters

-> It is a vector that points in the direction of the greatest increase of the objective (loss)
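As an illustration (a minimal sketch, not from the original post), the gradient of a mean-squared-error loss is the vector of partial derivatives, one per parameter; here it is computed analytically and checked against finite differences:

```python
import numpy as np

def loss(w, X, y):
    # mean-squared-error objective L(w) = mean((Xw - y)^2)
    return np.mean((X @ w - y) ** 2)

def grad(w, X, y):
    # vector of partial derivatives dL/dw_j, one entry per parameter
    return 2.0 / len(y) * X.T @ (X @ w - y)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 3)), rng.normal(size=50)
w = np.zeros(3)

eps = 1e-6
g_numeric = np.array([(loss(w + eps * np.eye(3)[j], X, y) - loss(w, X, y)) / eps
                      for j in range(3)])
print(np.allclose(grad(w, X, y), g_numeric, atol=1e-4))   # True
```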

 

Learning Rate 

: a crucial factor that influences how quickly or slowly the model converges to an optimal solution.

* Changing the learning rate over the epochs can improve performance and reduce training time. 

- The learning rate is a hyperparameter that determines how much the weights are adjusted with respect to the loss gradient.

 

- If lr is too small, it requires many updates and progresses very slowly! 

- If lr is too large, it causes drastic updates (divergent behavior) 

=> too large - oscillation 

=> much too large - instability (see the sketch below)
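A tiny sketch (illustrative, not from the original post) of these regimes on the 1-D loss L(w) = w², whose gradient is 2w:

```python
def run_gd(lr, steps=20, w=1.0):
    # gradient descent on L(w) = w^2, i.e. w_{t+1} = w_t - lr * 2 * w_t
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

print(run_gd(lr=0.01))   # too small: barely moved towards the minimum at 0
print(run_gd(lr=0.4))    # reasonable: essentially at 0
print(run_gd(lr=0.9))    # too large: the sign flips every step (oscillation)
print(run_gd(lr=1.1))    # much too large: |w| grows every step (instability)
```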

 

 

 

 

Learning Rate Annealing

: reduce the learning rate while we train. 

- new-bob annealing : reduce the learning rate when some condition is met (e.g., when the validation loss stops improving enough between epochs)
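A rough sketch of one such condition-based schedule (my interpretation; the post does not spell out the exact new-bob rule): halve the learning rate whenever the validation loss improves by less than a small threshold between epochs.

```python
def anneal(lr, prev_val_loss, val_loss, threshold=1e-3, factor=0.5):
    # If the validation loss improved by less than `threshold` since the
    # previous epoch, reduce the learning rate by `factor`.
    if prev_val_loss is not None and (prev_val_loss - val_loss) < threshold:
        lr *= factor
    return lr
```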

 

Gradient Descent

- we update the parameters by stepping in the direction opposite to the gradient.

- proper initialization is IMPORTANT!

- each iteration uses the entire dataset to update the weights

 

Update Rules

w_{t+1} = w_t − α ∇_w ℒ(X, Y, w_t)
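A minimal sketch of this update rule (illustrative, assuming a linear least-squares model): every step computes the gradient on the ENTIRE dataset (X, Y).

```python
import numpy as np

def gradient_descent(X, Y, lr=0.1, epochs=100):
    w = np.zeros(X.shape[1])                     # initialization matters
    for _ in range(epochs):
        grad = 2.0 / len(Y) * X.T @ (X @ w - Y)  # gradient over the full dataset
        w = w - lr * grad                        # w_{t+1} = w_t − α ∇_w L
    return w
```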

 

Advantages and Disadvantages

Disadvantages

- GD can get stuck in local optima

- difficult to apply to non-differentiable loss functions

- requires tuning the learning rate (+ batch size for SGD) 

- can be costly (time & memory) since we need to evaluate the whole training dataset before we take one step towards the minimum

(significantly more computational resources)

 

 

Batch Size 

: number of samples in the minibatch

=> a larger batch increases the computational cost per update, but also the accuracy of the gradient estimate 

- Minibatch

: set of training samples used for computing the gradient.

 

 

 

Stochastic Gradient Descent (SGD) 

: uses noisy versions of the gradient; the model's parameters are updated using the gradient of the loss function with respect to a single training example (traditionally a batch of size 1) at a time. 

- a single point (or a batch of points) is used in each iteration to update the weights.

- It introduces some randomness into the learning process that helps the algorithm escape from saddle points and local minima => Regularization Effect! 

 

Update Rules 

w_{t+1} = w_t − α ∇_w ℓ(x, y, w_t)
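A minimal sketch of the SGD update (illustrative, same least-squares setting as above): each step uses a single shuffled training example, so the gradient is a noisy estimate of the full-batch one.

```python
import numpy as np

def sgd(X, Y, lr=0.01, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(Y)):    # visit samples in random order
            x, y = X[i], Y[i]
            grad = 2.0 * x * (x @ w - y)     # gradient of the loss on one example
            w = w - lr * grad                # w_{t+1} = w_t − α ∇_w l(x, y, w_t)
    return w
```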

 

Advantages and Disadvantages

Advantages

- It converges faster compared to GD

- It can be shown to reach the global minimum in convex settings

- SGD is much more scalable to large datasets compared to GD

- Local minima of non-convex functions obtained by SGD are often “close” to the global minimum.

=> SGD naturally introduces some regularization that yields better generalization properties.

 

 

Disadvantages

- Computation is inefficient (one example at a time) and the updates are noisy

- Due to the randomness, convergence can be unstable, and noise can be present.

- the error function is not as well minimized as in the case of GD

 

 

 

 

 

Mini-batch SGD

: updating the model's parameters using a small random subset (mini-batch) of the training data at each iteration.

 

Update Rules 

w_{t+1} = w_t − α ∇_w ℒ(X_n, Y_n, w_t)
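A minimal sketch of the mini-batch update (illustrative, same least-squares setting as above): each step averages the gradient over one mini-batch (X_n, Y_n) drawn from a shuffled pass over the data.

```python
import numpy as np

def minibatch_sgd(X, Y, lr=0.05, epochs=10, batch_size=32, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(Y))                    # reshuffle every epoch
        for start in range(0, len(Y), batch_size):
            batch = idx[start:start + batch_size]
            Xn, Yn = X[batch], Y[batch]
            grad = 2.0 / len(Yn) * Xn.T @ (Xn @ w - Yn)  # gradient on the mini-batch
            w = w - lr * grad                            # w_{t+1} = w_t − α ∇_w L(X_n, Y_n, w_t)
    return w
```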

 

 

Advantages and Disadvantages

Advantages

- can be seen as obtaining a lower-variance gradient estimate without the need for the full batch

(a compromise between GD and SGD)

- forward and backward processing of a mini-batch is often more efficient on a single GPU than processing examples one at a time

* SGD with small mini-batches tends to find flatter minima, and such flat minima generalize better

"Flat regions" 

- Refer to areas in the loss function where the gradient is close to zero.

- In such regions, the gradient doesn't change abruptly, allowing gradient descent to converge quickly.

- In flat regions, gradient descent optimizes rapidly, leading to stable convergence when updating the model's weights.

 

 

Disadvantages

- Initialization Sensitivity

- Plateaus (regions of very small gradient) and Saddle Points (points where the gradient is zero but not necessarily a minimum).
