
Review

This post is a set of reading notes on https://arxiv.org/pdf/1803.01164.

This is perhaps not an excellent survey: the layout is rather messy, it does not appear to have been carefully proofread (there are many errors such as 'than' spelled as 'then'), and some topics are not covered in enough detail.

原文标题:"The History Began from AlexNet: A Comprehensive Survey on Deep Learning Approaches"

Warning

This post contains a large number of direct quotations from the original paper, so text selection and copying have been disabled on this page.

Abstract

Topics:

  • Deep Neural Network (DNN)
  • Convolutional Neural Network (CNN)
  • Recurrent Neural Network (RNN)
    • Long Short-Term Memory (LSTM)
    • Gated Recurrent Units (GRU)
  • Auto-Encoder (AE)
  • Deep Belief Network (DBN)
  • Generative Adversarial Network (GAN)
  • Deep Reinforcement Learning (DRL)
  • Transfer Learning

Intro

Machine Learning > Neural Network (NN) \(\Rightarrow\) Deep Learning (DL)

Categorization

  • Supervised: DNN, CNN, RNN (LSTM, GRU)
    • With labeled data: inputs are given together with their corresponding outputs.
    • A loss function \(l(y_t, \hat{y}_t)\) is given and calculated at each time step (a gradient is computed for each time step); see the sketch after this list.
  • Semi-supervised (partially supervised): RNN (LSTM, GRU), GAN
    • Reinforcement Learning
    • Semi-labeled datasets
  • Unsupervised: RNN (LSTM), RL
    • Clustering, generative techniques. No labels are given.
    • Auto-Encoders (AE), Restricted Boltzmann Machines (RBM), GAN
  • Deep Reinforcement Learning (DRL)
    • Does not have a straightforward loss function: the agent does not have full access to the function it is trying to optimize and must query it through interaction; the environment is state-based, so the input depends on previous actions.
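To make the per-time-step supervised loss above concrete, here is a minimal sketch (not from the survey) that accumulates \(l(y_t, \hat{y}_t)\) over a sequence; the squared-error loss and the random toy data are assumptions for illustration.

```python
import numpy as np

def sequence_loss(y_true, y_pred):
    """Accumulate a per-time-step loss l(y_t, y_hat_t) over a sequence."""
    total = 0.0
    for y_t, y_hat_t in zip(y_true, y_pred):
        total += float(np.sum((y_t - y_hat_t) ** 2))  # l(y_t, y_hat_t), here squared error
    return total

# Toy usage: a sequence of 5 time steps with 3-dimensional targets.
rng = np.random.default_rng(0)
targets = rng.normal(size=(5, 3))
predictions = rng.normal(size=(5, 3))
print(sequence_loss(targets, predictions))
```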

Feature Learning

  • Traditional ML Approach:

    Hand-crafted features, obtained by applying feature-extraction algorithms such as:

    • Scale Invariant Feature Transform (SIFT)
    • Speeded Up Robust Features (SURF)
    • GIST
    • RANSAC
    • Histogram of Oriented Gradients (HOG)
    • Local Binary Patterns (LBP)
    • Empirical Mode Decomposition (EMD) for speech analysis

    Learning approaches:

    • Support Vector Machine (SVM)
    • Random Forest (RF)
    • K-Nearest Neighbors (KNN)
    • Principal Component Analysis (PCA)
    • Kernel PCA
    • Linear Discriminant Analysis (LDA)
    • Independent Component Analysis (ICA)
    • Fisher Discriminant Analysis (FDA)

Applications: when and where

skipped

State of the art

  • Image classification

    Dataset: ImageNet

  • Automatic Speech Recognition (ASR)

    Dataset: TIMIT

Why Deep Learning?

  • Universal
  • Robust
  • Generalizable
  • Scalable

DNN

History

Year  Event                Description
1943  McCulloch and Pitts  Turing complete
1958  Rosenblatt           Perceptrons will converge if what they are trying to learn can be represented
1969  Minsky and Papert    Limitations of the perceptron
1985  Geoff Hinton         Backpropagation
1998  Yann LeCun           Convolutional Neural Network (CNN)
2006  Hinton               Training deep networks
2012  AlexNet              ImageNet

Gradient Descent

\[ \theta \leftarrow \theta - \eta \nabla_{\theta} J(\theta) \]
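As a concrete illustration of this update rule, the sketch below (an assumption, not code from the survey) runs gradient descent on the simple quadratic loss \(J(\theta) = \lVert\theta\rVert^2\), whose gradient is known in closed form.

```python
import numpy as np

def gradient_descent(theta, grad_J, eta=0.1, steps=100):
    """Repeatedly apply theta <- theta - eta * grad_J(theta)."""
    for _ in range(steps):
        theta = theta - eta * grad_J(theta)
    return theta

# Example: J(theta) = ||theta||^2, so grad_J(theta) = 2 * theta.
theta0 = np.array([3.0, -2.0])
print(gradient_descent(theta0, lambda th: 2.0 * th))  # converges toward [0, 0]
```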

Stochastic Gradient Descent

Compared to (batch) Gradient Descent, Stochastic Gradient Descent is faster and more efficient because each update uses only a subset of the data. In most cases the dataset is too large to fit into memory, so mini-batches are used.

Steps:

  • The dataset is shuffled and divided into mini-batches, and the gradient is computed for each mini-batch.
    • The mini-batch size is a hyperparameter.
  • For each mini-batch, the weights are updated using the gradient of the loss function with respect to the weights.
    • One complete pass over the entire training set (i.e. over all mini-batches) is called an epoch; a single forward and backward pass on one mini-batch is an iteration.
    • The process is repeated for a number of epochs to ensure the model converges to the optimal weights.
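A minimal sketch of the procedure above, assuming a hypothetical helper `grad_loss(theta, X_batch, y_batch)` that returns the mini-batch gradient; the least-squares example at the end is only for illustration.

```python
import numpy as np

def sgd(theta, grad_loss, X, y, eta=0.01, batch_size=32, epochs=10, seed=0):
    """Mini-batch SGD: shuffle, split into mini-batches, update once per batch."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    for _ in range(epochs):                     # one epoch = one pass over the data
        idx = rng.permutation(n)                # shuffle the dataset
        for start in range(0, n, batch_size):   # iterate over mini-batches
            batch = idx[start:start + batch_size]
            theta = theta - eta * grad_loss(theta, X[batch], y[batch])
    return theta

# Example: least-squares linear regression, grad = (2/B) * X^T (X theta - y).
def grad_loss(theta, Xb, yb):
    return 2.0 / len(yb) * Xb.T @ (Xb @ theta - yb)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.01 * rng.normal(size=200)
print(sgd(np.zeros(3), grad_loss, X, y, epochs=50))  # close to true_theta
```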

Backpropagation

skipped

Learning rate

  • For a smaller learning rate, the model would take longer to converge;
  • For a larger learning rate, the model might overshoot the optimal weights.

The typical solution for this problem is to reduce the learning rate during training. Three common approaches are used for reducing the learning rate during training: constant, factored and exponential decay.

\[ \begin{aligned} \text{Constant decay:} & \quad \eta_t = \frac{1}{1 + \eta \cdot t} \\ \text{Factored decay:} & \quad \eta_t = \frac{1}{1 + \eta \cdot \sqrt{t}} \\ \text{Exponential decay:} & \quad \eta_t = \alpha_0 \cdot e^{-\eta\: t} \end{aligned} \]

Common Practice

Use exponential decay with \(e^{-\eta} = 0.1\), i.e. the learning rate is reduced by a factor of 10 per unit of \(t\).
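The three schedules can be written out directly from the formulas above; the sketch below is only an illustration following that notation (\(\eta\), \(\alpha_0\), \(t\)), not code from the survey.

```python
import math

def constant_decay(eta, t):
    # eta_t = 1 / (1 + eta * t)
    return 1.0 / (1.0 + eta * t)

def factored_decay(eta, t):
    # eta_t = 1 / (1 + eta * sqrt(t))
    return 1.0 / (1.0 + eta * math.sqrt(t))

def exponential_decay(alpha0, eta, t):
    # eta_t = alpha0 * exp(-eta * t)
    return alpha0 * math.exp(-eta * t)

# With e^{-eta} = 0.1 (i.e. eta = ln 10), the rate drops by 10x per unit of t:
eta = math.log(10.0)
print([exponential_decay(0.1, eta, t) for t in range(4)])  # ~[0.1, 0.01, 0.001, 0.0001]
```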

Weight decay

To prevent overfitting, a regularization term is added to the loss function.

\[ L(\theta, x) = J(\theta, x) + \lambda \Omega(\theta) \]

where \(\Omega(\theta)\) is the regularization term.

  • L1 regularization: \(\Omega(\theta) = \sum_{i} |\theta_i|\)
  • L2 regularization: \(\Omega(\theta) = \sum_{i} \theta_i^2\)

Common Practice

\(L2\) regularization is more commonly used, typically with \(\lambda = 0.0004\); a smaller \(\lambda\) accelerates convergence.
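A minimal sketch of adding the regularization term to the loss, assuming a generic data-fit term `J(theta, x)` (a hypothetical placeholder) and the \(\lambda = 0.0004\) mentioned above.

```python
import numpy as np

def regularized_loss(J, theta, x, lam=0.0004, kind="l2"):
    """L(theta, x) = J(theta, x) + lambda * Omega(theta)."""
    if kind == "l1":
        omega = np.sum(np.abs(theta))   # Omega(theta) = sum_i |theta_i|
    else:
        omega = np.sum(theta ** 2)      # Omega(theta) = sum_i theta_i^2
    return J(theta, x) + lam * omega

# Toy usage with a dummy data-fit term.
theta = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 2.0, 3.0])
dummy_J = lambda th, xv: float(np.mean((xv - th) ** 2))
print(regularized_loss(dummy_J, theta, x))             # L2-regularized
print(regularized_loss(dummy_J, theta, x, kind="l1"))  # L1-regularized
```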

Notes

Other necessary components for efficient training:

  • data preprocessing & augmentation
  • network initialization approaches
  • batch normalization
  • activation functions
  • regularization with dropout
  • different optimization algorithms

Before 2006, attempts at training deep architectures failed: training a deep supervised feed-forward neural network tended to yield worse results (in both training and test error) than shallow ones (with 1 or 2 hidden layers). Hinton's revolutionary work on DBNs spearheaded a change in this in 2006.

Because they compose many layers, DNNs are more capable of representing highly varying nonlinear functions than shallow learning approaches. Moreover, DNNs are more efficient learners because they combine feature-extraction and classification layers.

CNN

History

Year  Event      Description
1988  Fukushima  Neocognitron
1998  LeCun      LeNet-5

CNNs have several advantages over DNNs, including being more similar to the human visual processing system, being highly optimized in structure for processing 2D and 3D images, and being effective at learning and extracting abstractions of 2D features. The max-pooling layer of CNNs is effective at absorbing shape variations. Moreover, being composed of sparse connections with tied weights, CNNs have significantly fewer parameters than a fully connected network of similar size. Most of all, CNNs are trained with gradient-based optimization algorithms and suffer less from the diminishing gradient problem.