Types and Tailcalls

Notes on DeepLearning.ai Specialization - Improving Deep Neural Nets

published on September 3rd, 2023

I just completed the second part of the DeepLearning.ai specialization on Coursera. I'm writing these notes as a summary so I remember the material better and can review it quickly.


My notes (PDF of notes I took on the Remarkable 2)

Slides from the lecture

Jupyter notebooks


Here is a summary of the topics, along with the notes I took for them:

Bias-Variance tradeoff or Underfit vs Overfit

What to do against high bias (underfitting)?

What to do against high variance (overfitting)?


L2 Regularization


Normalizing Inputs

Vanishing & Exploding Gradients, Initializations

Numerical Approximation and Gradient Checking

$$ \frac{dJ}{d\theta_i} \approx d\tilde{\theta}_i = \frac{J(\theta + \epsilon_i) - J(\theta - \epsilon_i)}{2\epsilon} $$

Where $\epsilon_i$ is the $i$-th unit vector multiplied by $\epsilon$.

To validate whether the actual implementation is close enough, check

$$ \frac{\Vert d\tilde{\theta} - d\theta \Vert_2}{\Vert d\tilde{\theta} \Vert_2 + \Vert d\theta \Vert_2} \lesssim 10^{-7} $$
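A minimal NumPy sketch of this check (the function names are my own, not from the course notebooks):

```python
import numpy as np

def numerical_grad(J, theta, eps=1e-7):
    """Central-difference approximation of dJ/dtheta, one component at a time."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e.flat[i] = eps  # the i-th unit vector times epsilon
        grad.flat[i] = (J(theta + e) - J(theta - e)) / (2 * eps)
    return grad

def grad_check(d_theta_approx, d_theta):
    """Relative difference between numerical and analytic gradients."""
    num = np.linalg.norm(d_theta_approx - d_theta)
    return num / (np.linalg.norm(d_theta_approx) + np.linalg.norm(d_theta))

# Example: J(theta) = sum(theta^2) has analytic gradient 2*theta
theta = np.array([1.0, -2.0, 3.0])
diff = grad_check(numerical_grad(lambda t: np.sum(t**2), theta), 2 * theta)
```

If `diff` is around $10^{-7}$ or below, the backprop implementation is very likely correct.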

Mini-Batch Gradient Descent

Exponentially weighted averages

Keep an exponentially weighted average of $\theta$'s by computing

$$ v_t = \beta v_{t-1} + (1-\beta)\theta_t $$

The intuition is that $v_t$ acts like a velocity that each new observation $\theta_t$ nudges; $\beta$ controls how quickly the velocity changes. We are approximately averaging over the last $\frac{1}{1-\beta}$ values (e.g. 10 values for $\beta=0.9$).

When $v_0=0$ we can use a bias correction term $\tilde{v}_t = \frac{v_t}{1-\beta^t}$.

This is a very memory-efficient way of maintaining a running average (only need to remember one term).
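The running average and its bias correction fit in a few lines of plain Python (a sketch; the function name is mine):

```python
def ewa(values, beta=0.9, bias_correct=True):
    """Exponentially weighted average of a sequence, keeping one term of state."""
    v = 0.0
    averages = []
    for t, theta in enumerate(values, start=1):
        v = beta * v + (1 - beta) * theta          # v_t = beta*v_{t-1} + (1-beta)*theta_t
        averages.append(v / (1 - beta**t) if bias_correct else v)
    return averages
```

With bias correction, a constant sequence averages to exactly that constant from $t=1$ on; without it, the early values are biased toward the $v_0=0$ initialization.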

Gradient Descent with Momentum

Use exponentially weighted average on the gradients.

\[\begin{aligned} v_{dW} &= \beta v_{dW} + (1 - \beta)dW \\ v_{db} &= \beta v_{db} + (1 - \beta)db \\ W &:= W - \alpha v_{dW} \\ b &:= b - \alpha v_{db} \\ \end{aligned}\]

Smoothes over fluctuations, helps with stochastic or mini-batch gradient descent.
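The update above for a single parameter tensor, as a NumPy sketch (names are my own):

```python
import numpy as np

def momentum_step(W, dW, v_dW, alpha=0.01, beta=0.9):
    """One gradient-descent-with-momentum update for one parameter tensor."""
    v_dW = beta * v_dW + (1 - beta) * dW   # exponentially weighted average of gradients
    W = W - alpha * v_dW                   # step along the smoothed gradient
    return W, v_dW

# Example: one step from W=1 with gradient 1 and zero initial velocity
W, v = momentum_step(np.array([1.0]), np.array([1.0]), np.array([0.0]),
                     alpha=0.1, beta=0.9)
```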


RMSProp

RMSProp (root mean square prop) is an alternative way of smoothing out gradient updates.

\[\begin{aligned} S_{dW} &= \beta_2 S_{dW} + (1-\beta_2)dW^2 \\ S_{db} &= \beta_2 S_{db} + (1-\beta_2)db^2 \\ W &:= W - \alpha\frac{dW}{\sqrt{S_{dW}} + \epsilon} \\ b &:= b - \alpha\frac{db}{\sqrt{S_{db}} + \epsilon} \end{aligned}\]
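A NumPy sketch of one RMSProp step for a single parameter tensor (names are mine; $\epsilon$ is added outside the square root for numerical stability):

```python
import numpy as np

def rmsprop_step(W, dW, S_dW, alpha=0.001, beta2=0.999, eps=1e-8):
    """One RMSProp update: scale the step by the RMS of recent gradients."""
    S_dW = beta2 * S_dW + (1 - beta2) * dW**2    # running average of squared gradients
    W = W - alpha * dW / (np.sqrt(S_dW) + eps)   # large recent gradients shrink the step
    return W, S_dW

# Example: one step from W=1 with gradient 2 and zero initial state
W, S = rmsprop_step(1.0, 2.0, 0.0)
```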

Adam Optimization

$$ \gdef\corr#1{#1^{\text{corrected}}} $$

\[\begin{aligned} v_{dW} & = \beta_1 v_{dW} + (1-\beta_1) dW \\ v_{db} & = \beta_1 v_{db} + (1-\beta_1) db \\ S_{dW} & = \beta_2 S_{dW} + (1-\beta_2) dW^2 \\ S_{db} & = \beta_2 S_{db} + (1-\beta_2) db^2 \\ \corr{v_{dW}} &= \frac{v_{dW}}{1-\beta_1^t}\\ \corr{v_{db}} &= \frac{v_{db}}{1-\beta_1^t}\\ \corr{S_{dW}} &= \frac{S_{dW}}{1-\beta_2^t}\\ \corr{S_{db}} &= \frac{S_{db}}{1-\beta_2^t}\\ W &:= W - \alpha \frac{\corr{v_{dW}}}{\sqrt{\corr{S_{dW}}} + \epsilon} \\ b &:= b - \alpha \frac{\corr{v_{db}}}{\sqrt{\corr{S_{db}}} + \epsilon} \end{aligned}\]

Adam paper
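Adam combines the momentum and RMSProp ideas plus bias correction. A sketch of one step for a single parameter tensor (names are my own; $\epsilon$ outside the square root, as in the paper's defaults):

```python
import numpy as np

def adam_step(W, dW, v, S, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for one parameter tensor; t is the step count (from 1)."""
    v = beta1 * v + (1 - beta1) * dW       # momentum term
    S = beta2 * S + (1 - beta2) * dW**2    # RMSProp term
    v_corr = v / (1 - beta1**t)            # bias correction, matters for small t
    S_corr = S / (1 - beta2**t)
    W = W - alpha * v_corr / (np.sqrt(S_corr) + eps)
    return W, v, S

# Example: at t=1 the bias-corrected step is roughly alpha in the gradient's direction
W, v, S = adam_step(0.0, 5.0, 0.0, 0.0, t=1)
```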

Learning Rate Decay

Local optima & Saddle points

Hyperparameters and their Priorities

How to search for Hyperparameters

Picking an appropriate scale

Panda vs Caviar Strategy

Use the caviar strategy (train many models in parallel) when you can, i.e. when you have enough computational power relative to the data set; use the panda strategy (babysit a single model) when you must.

Batch Normalization / Batch Norm

Softmax Regression

$$\text{softmax}_i = \frac{e^{a_i}}{\sum_j e^{a_j}}$$

where the sum over $j$ runs over the activations of the last layer.
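In code it is worth subtracting the maximum activation before exponentiating; this leaves the result unchanged but avoids overflow. A short NumPy sketch:

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax over a vector of activations."""
    z = np.exp(a - np.max(a))   # shifting by max(a) cancels in the ratio
    return z / np.sum(z)

p = softmax(np.array([1.0, 2.0, 3.0]))
```

The outputs are positive and sum to 1, so they can be read as class probabilities.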

ML Frameworks and TensorFlow
