# Notes on DeepLearning.ai Specialization - Improving Deep Neural Nets

published on September 3rd, 2023

I just completed the second part of the DeepLearning.ai specialization on Coursera. I'm writing these notes as a summary to remember better and be able to review things quickly.

## Resources

My notes (PDF of notes I took on the Remarkable 2)

Slides from the lecture

Jupyter notebooks

Here is a summary of the topics and the notes I took on them.

### Bias-Variance tradeoff or Underfit vs Overfit

• To understand the bias and variance of our model, we need to split the data into a training set, a dev set and (maybe) a test set
• We need a test set if we want an unbiased estimate of our model's real-world performance
• 70%/30% or 60%/20%/20% split rules may or may not apply - you just need enough data in the dev and test sets to estimate the fit
• e.g. with 10M examples, 100K may be enough for the dev / test set
• High bias means underfitting the training set (high training error)
• High variance means overfitting (dev error much higher than training error)

What to do against high bias (underfitting)?

• More powerful model (e.g. more layers / hidden units / different architecture)
• Better optimization algorithm
• Better initialization
• Tune learning rate

What to do against high variance (overfitting)?

• Regularization (L2, dropout, early stopping)
• Larger / more diverse training set

### Regularization

#### L2 Regularization

• Adds a cost term $\frac{\lambda}{2m}\sum_{i=0}^n w_i^2$ to the cost function
• This adds a term that is linear in the parameters, $\frac{\lambda}{m} w_i$, to the gradient
• Sometimes called "weight decay"
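To make the penalty and its gradient concrete, here is a minimal NumPy sketch. The function names `l2_penalty` and `l2_gradient` are my own, and I assume the sum runs over the weight matrices of all layers (biases are left out).

```python
import numpy as np

def l2_penalty(weights, lam, m):
    """Cost term lam / (2m) * sum of squared weights over all weight matrices."""
    return lam / (2 * m) * sum(np.sum(W ** 2) for W in weights)

def l2_gradient(W, lam, m):
    """Gradient of the penalty w.r.t. one weight matrix: linear in W."""
    return lam / m * W

W1 = np.array([[1.0, -2.0], [0.5, 0.0]])
W2 = np.array([[3.0, 1.0]])
penalty = l2_penalty([W1, W2], lam=0.1, m=10)

# One plain gradient step on the penalty alone shrinks W by a constant factor,
# which is where the name "weight decay" comes from.
alpha = 0.5
W1_decayed = W1 - alpha * l2_gradient(W1, lam=0.1, m=10)
```

With these numbers, `W1_decayed` equals `(1 - alpha * lam / m) * W1`, i.e. every weight is multiplied by the same factor slightly below 1.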

#### Dropout

• Drops random neurons in the hidden layers of the network
• Hyperparameter $\text{keepprob}$: the probability that each neuron is kept alive
• Scale activations by $1 / \text{keepprob}$ after dropping, so that their expected value stays the same
• During backprop, drop the same neurons as during forward prop from the gradient of the activations $dA$, and scale $dA$ by $1/\text{keepprob}$
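The forward and backward steps above can be sketched as follows. This is only an illustration of (inverted) dropout on a single activation matrix; the function names and the use of a NumPy `Generator` are my own choices.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng):
    """Drop each activation with probability 1 - keep_prob, then rescale."""
    mask = rng.random(A.shape) < keep_prob   # True = neuron stays alive
    A_dropped = A * mask / keep_prob         # rescale to keep the expected value
    return A_dropped, mask

def dropout_backward(dA, mask, keep_prob):
    """Apply the same mask and the same scaling to the gradient dA."""
    return dA * mask / keep_prob

rng = np.random.default_rng(0)
A = np.ones((4, 1000))
A_dropped, mask = dropout_forward(A, keep_prob=0.8, rng=rng)
dA = dropout_backward(np.ones_like(A), mask, keep_prob=0.8)
```

The mask has to be cached between the forward and the backward pass, otherwise the gradient would flow through neurons that were switched off.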

### Normalizing Inputs

• Having inputs on the same scale speeds up learning
• The contours of the cost function become more uniform (rounder), making it easier for gradient descent to point towards the minimum
• Need to normalize data in the same way during dev/test/production!
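A small sketch of what "normalize in the same way" means in practice: the mean and standard deviation are computed on the training set only and then reused everywhere else. The helper names are hypothetical, and I assume one example per row.

```python
import numpy as np

def fit_normalizer(X_train):
    """Compute per-feature mean and standard deviation on the training set only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + 1e-8   # small constant avoids division by zero
    return mu, sigma

def normalize(X, mu, sigma):
    """Apply the training-set statistics to any split (train/dev/test/prod)."""
    return (X - mu) / sigma

X_train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
X_dev = np.array([[2.0, 300.0]])

mu, sigma = fit_normalizer(X_train)
X_train_n = normalize(X_train, mu, sigma)
X_dev_n = normalize(X_dev, mu, sigma)    # reuses mu and sigma from training
```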

### Vanishing & Exploding Gradients, Initializations

• Gradients in earlier layers depend multiplicatively on gradients of later layers
• If the factors contributed by later layers mostly have magnitude $> 1$, the gradient will explode
• If the factors contributed by later layers mostly have magnitude $< 1$, the gradient will vanish
• Parameter initialization can help with this
• Don't initialize parameters to 0 or large values. This will lead to vanishing or exploding gradients, respectively
• Recommended: He initialization: $\text{Var}(w^l_{ij})=\frac{2}{k^{l-1}}$ (where $k^l$ is the number of nodes in layer $l$).
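As a rough sketch, He initialization draws each weight from a zero-mean normal distribution and scales it by $\sqrt{2 / k^{l-1}}$. The function name, the dictionary layout and the zero-initialized biases are assumptions of mine.

```python
import numpy as np

def he_init(layer_sizes, rng):
    """Return weights with Var(w) = 2 / fan_in and zero biases for each layer."""
    params = {}
    for l in range(1, len(layer_sizes)):
        fan_in = layer_sizes[l - 1]          # number of nodes in the previous layer
        fan_out = layer_sizes[l]
        params[f"W{l}"] = rng.standard_normal((fan_out, fan_in)) * np.sqrt(2.0 / fan_in)
        params[f"b{l}"] = np.zeros((fan_out, 1))
    return params

rng = np.random.default_rng(0)
params = he_init([1000, 500, 10], rng)
```

For the first layer here, the empirical variance of `params["W1"]` should come out close to `2 / 1000`.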

### Numerical Approximation and Gradient Checking

• It's easy to get the gradient implementation wrong, so we can use a numerical approximation to check it:

$$\frac{dJ}{d\theta_i} \approx d\tilde{\theta}_i = \frac{J(\theta + \epsilon_i) - J(\theta - \epsilon_i)}{2\epsilon}$$

Where $\epsilon_i$ is the $i$-th unit vector multiplied by $\epsilon$.

To validate that the actual implementation is close enough, check

$$\frac{\Vert d\tilde{\theta} - d\theta \Vert_2}{\Vert d\tilde{\theta} \Vert_2 + \Vert d\theta \Vert_2} \lesssim 10^{-7}$$
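Here is a minimal sketch of such a check on a toy cost function. The cost `J`, the deliberately wrong gradient `buggy` and the choice $\epsilon = 10^{-7}$ are my own assumptions for illustration.

```python
import numpy as np

def numerical_gradient(J, theta, eps=1e-7):
    """Two-sided finite-difference approximation, one coordinate at a time."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps                        # eps times the i-th unit vector
        grad[i] = (J(theta + step) - J(theta - step)) / (2 * eps)
    return grad

def relative_difference(grad_approx, grad):
    """Normalized distance between the approximate and the implemented gradient."""
    num = np.linalg.norm(grad_approx - grad)
    den = np.linalg.norm(grad_approx) + np.linalg.norm(grad)
    return num / den

J = lambda theta: np.sum(theta ** 3)         # toy cost function
analytic = lambda theta: 3 * theta ** 2      # correct gradient
buggy = lambda theta: 2 * theta ** 2         # gradient with a deliberate bug

theta = np.array([1.0, -2.0, 0.5])
approx = numerical_gradient(J, theta)
good = relative_difference(approx, analytic(theta))
bad = relative_difference(approx, buggy(theta))
```

The correct gradient gives a tiny relative difference, while the buggy one gives a value many orders of magnitude larger, which is the signal to go looking for a mistake.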

### Mini-batch Gradient Descent

• Instead of doing gradient descent on (a) the full batch of training examples or (b) a single example (stochastic gradient descent), do it on small batches (e.g. 64 - 2048 items)
• Still benefits from vectorization: forward prop and backprop process a whole mini-batch at a time, so it tends to be faster
• Cost function and gradients will be less smooth than with full-batch => use momentum or Adam
• Tune the mini-batch size (another hyperparameter), ideally so that a mini-batch fits into GPU memory
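Splitting the training set into shuffled mini-batches could look roughly like this. The generator name `minibatches` is hypothetical, and I assume one example per row.

```python
import numpy as np

def minibatches(X, Y, batch_size, rng):
    """Yield shuffled (X_batch, Y_batch) pairs; the last batch may be smaller."""
    m = X.shape[0]
    perm = rng.permutation(m)                # reshuffle once per epoch
    for start in range(0, m, batch_size):
        idx = perm[start:start + batch_size]
        yield X[idx], Y[idx]

rng = np.random.default_rng(0)
X = np.arange(20, dtype=float).reshape(10, 2)   # 10 examples, 2 features
Y = np.arange(10)
batches = list(minibatches(X, Y, batch_size=4, rng=rng))
```

One pass over `batches` is one epoch; each batch would get its own forward prop, backprop and parameter update.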

### Exponentially weighted averages

Keep an exponentially weighted average of $\theta$'s by computing

$$v_t = \beta v_{t-1} + (1-\beta)\theta_t$$

Intuition: $v_t$ is a velocity that each new value $\theta_t$ nudges, and $\beta$ controls how quickly the velocity changes. We're averaging over approximately $\frac{1}{1-\beta}$ values (e.g. 10 values for $\beta=0.9$).

When $v_0=0$ we can use a bias correction term $\tilde{v}_t = \frac{v_t}{1-\beta^t}$.

This is a very memory-efficient way of maintaining a running average (only need to remember one term).
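A short sketch of the running average with and without bias correction. The function name `ewa` is my own; the only state carried between steps is the single number `v`.

```python
def ewa(values, beta, bias_correction=True):
    """Exponentially weighted average of a sequence, starting from v_0 = 0."""
    v = 0.0
    out = []
    for t, theta in enumerate(values, start=1):
        v = beta * v + (1 - beta) * theta
        out.append(v / (1 - beta ** t) if bias_correction else v)
    return out

raw = ewa([10.0] * 5, beta=0.9, bias_correction=False)
corrected = ewa([10.0] * 5, beta=0.9)
```

On a constant series of 10s, the uncorrected average starts at 1.0 and only slowly climbs towards 10, while the bias-corrected version reports 10 from the first step.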

### Gradient Descent with Momentum

Use an exponentially weighted average of the gradients in the update step.

$$\begin{aligned}
v_{dW} &= \beta v_{dW} + (1 - \beta)dW \\
v_{db} &= \beta v_{db} + (1 - \beta)db \\
W &:= W - \alpha v_{dW} \\
b &:= b - \alpha v_{db}
\end{aligned}$$

Smooths over fluctuations, which helps with stochastic or mini-batch gradient descent.
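The momentum update above, written as a single step function and tried on a toy quadratic. The function name, the cost $J(W) = \frac{1}{2}\Vert W \Vert^2$ and the values of $\alpha$ and $\beta$ are assumptions for illustration only.

```python
import numpy as np

def momentum_step(W, dW, v_dW, alpha=0.1, beta=0.9):
    """One update: average the gradient, then step along the average."""
    v_dW = beta * v_dW + (1 - beta) * dW
    W = W - alpha * v_dW
    return W, v_dW

# First step from W = 1 with gradient 1 and zero velocity
W_first, v_first = momentum_step(np.array([1.0]), np.array([1.0]), np.array([0.0]))

# Toy problem: J(W) = 0.5 * ||W||^2, so the gradient is simply W
W = np.array([1.0, -2.0])
v = np.zeros_like(W)
for _ in range(200):
    W, v = momentum_step(W, dW=W, v_dW=v)
```

After the loop, `W` has moved close to the minimum at the origin.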

### RMSProp

$$\begin{aligned}
S_{dW} &= \beta_2 S_{dW} + (1-\beta_2)dW^2 \\
S_{db} &= \beta_2 S_{db} + (1-\beta_2)db^2 \\
W &:= W - \alpha\frac{dW}{\sqrt{S_{dW} + \epsilon}} \\
b &:= b - \alpha\frac{db}{\sqrt{S_{db} + \epsilon}}
\end{aligned}$$
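A sketch of one RMSProp step, following the equations above with $\epsilon$ inside the square root (many implementations add it outside instead). The function name and the example values are mine; the squares are element-wise.

```python
import numpy as np

def rmsprop_step(W, dW, S_dW, alpha=0.01, beta2=0.999, eps=1e-8):
    """Divide each gradient component by a running estimate of its magnitude."""
    S_dW = beta2 * S_dW + (1 - beta2) * dW ** 2
    W = W - alpha * dW / np.sqrt(S_dW + eps)
    return W, S_dW

W = np.array([1.0, 1.0])
dW = np.array([100.0, 0.01])                 # gradients of very different scale
W_new, S = rmsprop_step(W, dW, np.zeros_like(W))
step = W - W_new
```

Although the two gradient components differ by a factor of 10,000, the two entries of `step` end up with nearly the same size, which is the point of the normalization.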

### Adam

• Uses both momentum and RMSprop together

$$\gdef\corr#1{#1^{\text{corrected}}}$$

$$\begin{aligned}
v_{dW} &= \beta_1 v_{dW} + (1-\beta_1) dW \\
v_{db} &= \beta_1 v_{db} + (1-\beta_1) db \\
S_{dW} &= \beta_2 S_{dW} + (1-\beta_2) dW^2 \\
S_{db} &= \beta_2 S_{db} + (1-\beta_2) db^2 \\
\corr{v_{dW}} &= \frac{v_{dW}}{1-\beta_1^t}\\
\corr{v_{db}} &= \frac{v_{db}}{1-\beta_1^t}\\
\corr{S_{dW}} &= \frac{S_{dW}}{1-\beta_2^t}\\
\corr{S_{db}} &= \frac{S_{db}}{1-\beta_2^t}\\
W &:= W - \alpha \frac{\corr{v_{dW}}}{\sqrt{\corr{S_{dW}} + \epsilon}} \\
b &:= b - \alpha \frac{\corr{v_{db}}}{\sqrt{\corr{S_{db}} + \epsilon}}
\end{aligned}$$

• Hyperparameters: typically tune $\alpha$, leave $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$.
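Putting the equations above into one step function gives roughly the following. This is a sketch for a single parameter array; the function name, the toy gradient `2 * W` and the value of $\alpha$ are assumptions, and $\epsilon$ sits inside the square root as in the formulas above.

```python
import numpy as np

def adam_step(W, dW, v, S, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter used for bias correction."""
    v = beta1 * v + (1 - beta1) * dW             # momentum term
    S = beta2 * S + (1 - beta2) * dW ** 2        # RMSprop term
    v_corr = v / (1 - beta1 ** t)
    S_corr = S / (1 - beta2 ** t)
    W = W - alpha * v_corr / np.sqrt(S_corr + eps)
    return W, v, S

W0 = np.array([1.0, -3.0])
v0 = np.zeros_like(W0)
S0 = np.zeros_like(W0)
W1, v1, S1 = adam_step(W0, dW=2 * W0, v=v0, S=S0, t=1)
```

On the very first step the bias correction cancels the averaging, so each parameter moves by almost exactly $\alpha$ in the direction opposite to the sign of its gradient.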

### Learning Rate Decay

• Can be useful to decrease learning rate as learning progresses, to take smaller steps near the minimum
• There are different schemes
• e.g. $\alpha = \frac{1}{1 + \text{decayrate}\times\text{epochnum}}\alpha_0$
• stepwise manual decay
• exponential decay
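The three schemes can be written as small functions of the epoch number. The function names and the extra parameters (`base`, `factor`, `every`) are hypothetical; only the first formula is taken directly from the notes above.

```python
def inverse_time_decay(alpha0, decay_rate, epoch_num):
    """alpha = alpha0 / (1 + decay_rate * epoch_num)."""
    return alpha0 / (1 + decay_rate * epoch_num)

def exponential_decay(alpha0, base, epoch_num):
    """alpha = alpha0 * base ** epoch_num, with 0 < base < 1."""
    return alpha0 * base ** epoch_num

def step_decay(alpha0, factor, every, epoch_num):
    """Multiply alpha by `factor` once every `every` epochs."""
    return alpha0 * factor ** (epoch_num // every)

schedule = [inverse_time_decay(0.2, 1.0, epoch) for epoch in range(4)]
```

With $\alpha_0 = 0.2$ and a decay rate of 1, the learning rate halves after the first epoch and then shrinks more and more slowly.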

### Local optima & Saddle points

• Because there are many parameters, optimizers of NNs are unlikely to get stuck in a local optimum
• E.g. in 20000 dimensions, it is unlikely that a point with zero gradient is a minimum along all dimensions unless it is the global minimum
• However, saddle points and plateaus are a problem because they can make learning very slow

### Hyperparameter Tuning

#### Hyperparameters and their Priorities

• Not all hyperparameters are created equal / have the same priority
• P0: $\alpha$
• P1: #hidden_units, $\beta$ (momentum), mini-batch size
• P2: #layers, learning rate decay
• P3: Adam parameters ($\beta_1, \beta_2,\epsilon$)

#### How to search for Hyperparameters

• Don't search on a grid, search with random values
• This lets you try out many more distinct values for each hyperparameter
• Consider a coarse-to-fine approach: First search a larger space, then zoom in on the space that works best

#### Picking an appropriate scale

• Some parameters are OK to search for on a linear scale, e.g. #hidden_units
• Others need a log scale, e.g. $\alpha, \beta$.
• When wanting to search $\alpha \in [0.0001, 0.1]$ it doesn't make sense to search linearly, search in log-scale instead
• Sample the exponent $a$ uniformly from $[-4, -1]$, then set $\alpha$ to $10^a$.
• Similar for $\beta$, when wanting to search in $[0.9, 0.999]$ search for $1-\beta$ with the above method instead
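A minimal sketch of sampling on a log scale. The helper names are my own, and the exponent ranges correspond to $\alpha \in [0.0001, 0.1]$ and $\beta \in [0.9, 0.999]$ from the bullets above.

```python
import numpy as np

def sample_alpha(rng, low_exp=-4, high_exp=-1):
    """Sample the exponent uniformly, then take 10 to that power."""
    a = rng.uniform(low_exp, high_exp)
    return 10 ** a

def sample_beta(rng, low_exp=-3, high_exp=-1):
    """Sample 1 - beta on a log scale, i.e. 1 - beta in [0.001, 0.1]."""
    return 1 - 10 ** rng.uniform(low_exp, high_exp)

rng = np.random.default_rng(0)
alphas = np.array([sample_alpha(rng) for _ in range(1000)])
betas = np.array([sample_beta(rng) for _ in range(1000)])
```

About a third of the sampled `alphas` fall into each decade ($10^{-4}$ to $10^{-3}$, $10^{-3}$ to $10^{-2}$, $10^{-2}$ to $10^{-1}$), whereas linear sampling would put roughly 90% of them into the last one.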

### Panda vs Caviar Strategy

• Panda strategy: babysit one model, try different optimization strategies one after another

• Caviar strategy: try many approaches at once, then pick the winners

Use caviar strategy when you can (=enough computational power for the data set), panda when you must.

### Normalized Batch / Batch Norm

• Instead of learning the mean and scale of each layer implicitly, learn them as explicit parameters
• Normalize each $z_j^{(l)}$ to have 0 mean and variance 1. Normalization is done over the different examples in each mini-batch
• Then scale by $\gamma_j^{(l)}$ and shift by $\beta_j^{(l)}$; these are learned parameters
• For test / prod, replace the mini-batch mean and variance with exponentially weighted averages collected over the training mini-batches
• Because we explicitly shift, we don't need the bias units $b$
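The steps above can be sketched as a forward pass for one layer. This is an illustration only: the function name, the `running` dictionary, the `momentum` factor for the running averages and the small `eps` are my assumptions, and I assume one example per row.

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, running, training=True, momentum=0.9, eps=1e-5):
    """Normalize Z of shape (batch, units) per unit, then scale and shift."""
    if training:
        mu = Z.mean(axis=0)                  # statistics over the mini-batch
        var = Z.var(axis=0)
        # exponentially weighted averages, kept for test / prod
        running["mu"] = momentum * running["mu"] + (1 - momentum) * mu
        running["var"] = momentum * running["var"] + (1 - momentum) * var
    else:
        mu, var = running["mu"], running["var"]
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    return gamma * Z_norm + beta             # learned scale and shift

rng = np.random.default_rng(0)
Z = rng.normal(5.0, 3.0, size=(256, 4))
gamma, beta = np.ones(4), np.zeros(4)
running = {"mu": np.zeros(4), "var": np.ones(4)}

out_train = batchnorm_forward(Z, gamma, beta, running)
out_test = batchnorm_forward(Z[:1], gamma, beta, running, training=False)
```

In test mode a single example can be processed, because the statistics no longer come from the batch itself.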

Intuition:

• Batch normalization works because later layers see smaller shifts in their inputs (because these are now explicitly controlled / learned)
• Changes in the distribution of a layer's inputs are sometimes referred to as "covariate shift"
• This can speed up learning in later layers, because they need to adapt less to rapid changes in the previous layers.

### Softmax Regression

• A softmax layer can be used for multi-class classification. It outputs an $n \times 1$ vector of class probabilities that sum to 1
• Somewhat unusual layer since it acts on the whole vector rather than element-wise

$$\text{softmax}_i = \frac{e^{a_i}}{\sum_j e^{a_j}}$$

where the sum over $j$ runs over the activations of the last layer.
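A small sketch of the formula for a single activation vector. Subtracting the maximum before exponentiating is a common numerical-stability trick that I added; it does not change the result.

```python
import numpy as np

def softmax(a):
    """Softmax of a 1-D activation vector."""
    shifted = a - np.max(a)                  # avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
big = softmax(np.array([1000.0, 1000.0]))    # would overflow without the shift
```

The outputs are positive and sum to 1, and the largest activation gets the largest probability.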

### ML Frameworks and TensorFlow

• Many different frameworks to choose from, many viable alternatives
• Choose a framework that
• Is nice to code in (reading and writing)
• Has good performance
• Is truly open (open source + good governance)