Review¶
This post is a set of reading notes on https://arxiv.org/pdf/1803.01164.
This is arguably not an excellent survey: the layout is rather messy, it does not appear to have been carefully proofread (there are many errors such as 'than' spelled as 'then'), and some topics are not covered in enough detail.
Original title: "The History Began from AlexNet: A Comprehensive Survey on Deep Learning Approaches"
Warning
This post quotes the original paper extensively, so mouse selection and copying are disabled.
Abstract¶
Topics:
- Deep Neural Network (DNN)
- Convolutional Neural Network (CNN)
- Recurrent Neural Network (RNN)
- Long Short-Term Memory (LSTM)
- Gated Recurrent Units (GRU)
- Auto-Encoder (AE)
- Deep Belief Network (DBN)
- Generative Adversarial Network (GAN)
- Deep Reinforcement Learning (DRL)
- Transfer Learning
Intro¶
Machine Learning > Neural Network (NN) \(\Rightarrow\) Deep Learning (DL)
Categorization¶
- Supervised: DNN, CNN, RNN (LSTM, GRU)
- With labeled data: given input and corresponding outputs.
- A loss function \(l(y_t, \hat{y}_t)\) is given and calculated at each time step (a gradient is calculated for each time step).
- Semi-supervised (partially supervised): RNN (LSTM, GRU), GAN
- Reinforcement Learning
- Semi-labeled datasets
- Unsupervised: RNN (LSTM), RL
- Clustering, generative techniques. No label is given.
- Auto Encoders (AE), Restricted Boltzmann Machines (RBM), GAN
- Deep Reinforcement Learning (DRL)
- Do not have a straightforward loss function (you do not have full access to the function you are trying to optimize and must query it through interaction; the environment is state-based, so the input depends on previous actions)
Feature Learning¶
Traditional ML Approach: hand-crafted features are obtained by applying feature extraction algorithms such as:
- Scale Invariant Feature Transform (SIFT)
- Speeded Up Robust Features (SURF)
- GIST, RANSAC
- Histogram of Oriented Gradients (HOG)
- Local Binary Patterns (LBP)
- Empirical Mode Decomposition (EMD) for speech analysis
Learning approaches:
- Support Vector Machine (SVM)
- Random Forest (RF)
- K-Nearest Neighbors (KNN)
- Principal Component Analysis (PCA)
- Kernel PCA
- Linear Discriminant Analysis (LDA)
- Independent Component Analysis (ICA)
- Fisher Discriminant Analysis (FDA)
Applications: when and where¶
skipped
State of the art¶
- Image classification
  - Dataset: ImageNet
- Automatic Speech Recognition (ASR)
  - Dataset: TIMIT
Why Deep Learning?¶
- Universal
- Robust
- Generalizable
- Scalable
DNN¶
History¶
Year | Event | Description |
---|---|---|
1943 | McCulloch and Pitts | Turing complete |
1958 | Rosenblatt | Perceptrons will converge if what they are trying to learn can be represented |
1969 | Minsky and Papert | Limitations of the perceptron |
1985 | Geoff Hinton | Backpropagation |
1998 | Yann LeCun | Convolutional Neural Network (CNN) |
2006 | Hinton | Training deep networks |
2012 | AlexNet | ImageNet |
Gradient Descent¶
Stochastic Gradient Descent¶
Compared to Gradient Descent, Stochastic Gradient Descent is faster and more efficient. In most cases the dataset is too large to fit into memory, so mini-batches are used (a minimal sketch follows the steps below).
Steps:
- The dataset is shuffled and divided into mini-batches, and the gradient is computed for each mini-batch.
- The mini-batch size is a hyperparameter.
- For each mini-batch, the weights are updated using the gradient of the loss function with respect to the weights.
- One complete forward and backward pass over the entire training set is called an epoch.
- The process is repeated for a number of epochs to ensure the model converges to the optimal weights.
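A minimal sketch of this mini-batch SGD loop on a toy linear model; the data, model, and hyperparameter values are illustrative assumptions, not taken from the survey.

```python
import numpy as np

# Toy regression data and a linear model (illustrative, not from the survey).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=1000)

w, b = np.zeros(10), 0.0
lr = 0.01          # learning rate
batch_size = 32    # mini-batch size (a hyperparameter)
epochs = 20        # one epoch = one full pass over the training set

for epoch in range(epochs):
    # Shuffle the dataset and split it into mini-batches.
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        # Gradient of the mean-squared-error loss on this mini-batch.
        err = Xb @ w + b - yb
        grad_w = 2 * Xb.T @ err / len(idx)
        grad_b = 2 * err.mean()
        # Update the weights using the mini-batch gradient.
        w -= lr * grad_w
        b -= lr * grad_b
```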
Backpropagation¶
skipped
Learning rate¶
- For a smaller learning rate, the model would take longer to converge;
- For a larger learning rate, the model might overshoot the optimal weights.
The typical solution to this problem is to reduce the learning rate during training. Three common approaches are used: constant, factored, and exponential decay (sketched below).
Common Practice
Use a learning rate decay of \(e^{-\eta} = 0.1\).
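As a rough illustration, the three decay schedules above can be written as simple functions of the epoch index; the function names, factor, and step values are illustrative assumptions.

```python
import math

def constant_lr(lr0, epoch):
    # Constant: keep the initial learning rate unchanged.
    return lr0

def factored_lr(lr0, epoch, factor=0.1, step=30):
    # Factored (step) decay: multiply by `factor` every `step` epochs.
    return lr0 * factor ** (epoch // step)

def exponential_lr(lr0, epoch, eta=math.log(10)):
    # Exponential decay: lr = lr0 * exp(-eta * epoch);
    # the "common practice" above corresponds to exp(-eta) = 0.1.
    return lr0 * math.exp(-eta * epoch)

for epoch in (0, 1, 30):
    print(epoch,
          constant_lr(0.1, epoch),
          factored_lr(0.1, epoch),
          exponential_lr(0.1, epoch))
```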
Weight decay¶
To prevent overfitting, a regularization term is added to the loss function:

\[\tilde{l}(y_t, \hat{y}_t) = l(y_t, \hat{y}_t) + \lambda\,\Omega(\theta)\]

where \(\Omega(\theta)\) is the regularization term and \(\lambda\) controls its strength.
- L1 regularization: \(\Omega(\theta) = \sum_{i} |\theta_i|\)
- L2 regularization: \(\Omega(\theta) = \sum_{i} \theta_i^2\)
Common Practice
\(L2\) regularization is more commonly used, typically with \(\lambda = 0.0004\); a smaller \(\lambda\) accelerates convergence. A minimal sketch of both penalties follows.
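A minimal sketch of L1/L2 regularization, assuming a generic parameter vector `theta`; the function names are made up for illustration.

```python
import numpy as np

def regularized_loss(base_loss, theta, lam=0.0004, kind="l2"):
    # Add the penalty Omega(theta) to a base loss value.
    if kind == "l1":
        omega = np.sum(np.abs(theta))   # L1: sum of |theta_i|
    else:
        omega = np.sum(theta ** 2)      # L2: sum of theta_i^2
    return base_loss + lam * omega

def regularized_grad(base_grad, theta, lam=0.0004):
    # With L2, the gradient gains an extra 2 * lam * theta term,
    # which shrinks ("decays") the weights at every update.
    return base_grad + 2 * lam * theta
```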
Notes¶
Other necessary components for efficient training:
- data preprocessing & augmentation
- network initialization approaches
- batch normalization
- activation functions
- regularization with dropout
- different optimization algorithms
Before 2006, attempts at training deep architectures failed: training a deep supervised feed-forward neural network tended to yield worse results (in both training and test error) than shallow ones (with 1 or 2 hidden layers). Hinton's revolutionary work on DBNs spearheaded a change in this in 2006.
Due to their composition, many layers of DNNs are more capable of representing highly varying nonlinear functions compared to shallow learning approaches. Moreover, DNNs are more efficient for learning because of the combination of feature extraction and classification layers.
CNN¶
History¶
Year | Event | Description |
---|---|---|
1988 | Fukushima | Neocognitron |
1998 | LeCun | LeNet-5 |
CNNs have several advantages over DNNs, including being more similar to the human visual processing system, being highly optimized in structure for processing 2D and 3D images, and being effective at learning and extracting abstractions of 2D features. The max pooling layer of CNNs is effective in absorbing shape variations. Moreover, being composed of sparse connections with tied weights, CNNs have significantly fewer parameters than a fully connected network of similar size. Most of all, CNNs are trained with gradient-based optimization algorithms and suffer less from the diminishing gradient problem.
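A back-of-the-envelope comparison behind the "fewer parameters" claim; the layer shapes are illustrative assumptions, not from the survey.

```python
def conv_params(c_in, c_out, k):
    # A k x k convolution shares its weights across all spatial positions.
    return k * k * c_in * c_out + c_out                 # weights + biases

def dense_params(n_in, n_out):
    # A fully connected layer has one weight per (input, output) pair.
    return n_in * n_out + n_out

# Example: a 32x32x3 input mapped to 64 output channels / units.
print(conv_params(3, 64, 3))          # 1,792 parameters
print(dense_params(32 * 32 * 3, 64))  # 196,672 parameters
```

The comparison is not exact (the convolution produces a full 32x32x64 feature map while the dense layer only produces 64 units), but it shows how weight sharing and sparse connectivity keep the parameter count low.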