4. Outline
What is Machine Learning?
What is Deep Learning?
Hands-on Tutorial of Deep Learning
Tips for Training DL Models
Variants - Convolutional Neural Network
6. Machine Learning in One Sentence
"Field of study that gives computers the ability
to learn without being explicitly programmed."
- Arthur Lee Samuel, 1959
7. Machine Learning vs Artificial Intelligence
AI is the simulation of human intelligence processes
Outcome-based: judged by the result, does it exhibit human intelligence?
A sufficiently detailed rule-based system can also count as AI
Machine learning is one way to achieve AI
Learn the rules from data
Find a good-enough function
that solves a specific problem
[Venn diagram: Machine Learning is a subset of Artificial Intelligence]
8. Goal of Machine Learning
For a specific task, find the best function to complete it
Task: the "Who's That Pokémon?" quiz at the end of each episode
[Figure: f*(silhouette) = Ans: Bulbasaur; f*(silhouette) = Ans: Mewtwo]
10. 1. Define a Set of Functions
[Framework: Define a set of functions → Evaluate and Search → Pick the best function]
A set of functions, f(·): {f(θ1), …, f(θ*), …, f(θn)}
Example: the Pokémon trainers in Beitou Park
11. 2. Evaluate and Search
[Framework: Define a set of functions → Evaluate and Search → Pick the best function]
Try f1 = f(θ1) on the task
Based on the results, adjust θ:
avoid people carrying a Pikachu (move away from θ1)
12. 3. Pick the Best Function
[Framework: Define a set of functions → Evaluate and Search → Pick the best function]
The Pokémon trainer master is found
23. A Neuron
[Diagram of a neuron]
Inputs: x1, …, xn
Weights: w1, …, wn; Bias: b
Weighted sum: z = w1x1 + w2x2 + … + wnxn + b
σ: activation function
Output: ŷ = σ(z)
24. How a Neuron Works
Example: a neuron with a linear activation function, σ(z) = z
Inputs: x1 = 5, x2 = 2; weights: w1 = -1, w2 = 3; bias: b = 3
z = w1x1 + w2x2 + … + wnxn + b
ŷ = z = (-1)*5 + 3*2 + 3 = 4
θ: weights, bias
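The neuron computation above can be sketched in a few lines of NumPy (a minimal illustration, not part of the course code; the numbers are the slide's example):

```python
import numpy as np

def neuron(x, w, b, activation=lambda z: z):
    # weighted sum z = w·x + b, then the activation function
    z = np.dot(w, x) + b
    return activation(z)

# the slide's example: x = (5, 2), w = (-1, 3), b = 3, linear activation
y_hat = neuron(np.array([5, 2]), np.array([-1, 3]), 3)
print(y_hat)  # 4
```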
25. Fully Connected Neural Network
Many neurons connected into a network
Universality theorem: a network with enough neurons can represent any function
[Diagram: inputs X1, …, Xn fully connected to the next layer with weights w1,1, …, w2,n]
26. Fully Connected Neural Network
A simple network with linear activation functions
[Diagram: inputs (5, 2) fed through two layers with weights such as -0.5, +0.2, -0.1, +0.5, -1.5, +0.8]
27. Fully Connected Neural Network
A simple network with linear activation functions
[Diagram: the same network with all weights filled in, mapping inputs (5, 2) to outputs (0.12, 0.55)]
28. Given the Network Weights
Given the weights and biases, f(x, θ) maps the inputs (5, 2) to the outputs (0.12, 0.55)
θ: weights, bias
A Neural Network = A Function
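A forward pass through such a small linear network is just two matrix multiplications. A minimal sketch (the weight and bias values here are illustrative, not the ones from the slide figure):

```python
import numpy as np

# a 2-2-2 network with linear activations; weights/biases are
# illustrative values, not those from the slide
W1 = np.array([[-0.5, 0.2],
               [-0.1, 0.5]])
b1 = np.array([0.4, 0.1])
W2 = np.array([[-1.5, 0.8],
               [ 0.5, 0.9]])
b2 = np.array([0.0, 0.0])

def f(x):
    h = W1 @ x + b1      # hidden layer (linear activation)
    return W2 @ h + b2   # output layer

print(f(np.array([5, 2])))
```

Because every layer is linear, the whole network collapses to one affine map; nonlinear activations are what make depth useful.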
29. Recall: Deep Learning Framework
Define a set of functions: a specific network architecture defines the set {f(θ1), …, f(θ*), …, f(θn)}
Evaluate and search: iteratively adjust the parameters of f, e.g. f(θ94), f(θ87), f(θ945), …
Pick the best function: find the best-fitting parameters, f(θ*)
30. The closer the output values are to the actual values, the better
How do we evaluate whether a model is good?
A loss function quantifies the gap between network outputs and actual values
The loss function L(θ) is a function of θ
[Diagram: inputs X1, X2 → f(x, θ) → outputs ŷ1, ŷ2, compared against actual values y1, y2 to give L(θ)]
31. Goal: Minimize the Total Loss
Find the best function that minimizes total loss
Find the best network weights θ*:
θ* = argmin_θ L(θ)
The key question: how do we find θ*?
Brute force (enumerate all possible values)?
Suppose each weight can only be 0.0, 0.1, …, 0.9 and there are 500 weights
That is 10^500 combinations in total
Evaluating 10^6 per second, it would take about 10^486 years
The universe is only about 10^10 years old
Impossible to enumerate
37. Summary – Gradient Descent
Gradient descent optimizes a continuous objective function
by stepping in the direction of improvement
The gradient is affected by the loss function
The gradient is affected by the activation function
The update is affected by the learning rate
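The update rule can be sketched on a toy objective (this 1-D example is an illustration, not the slides' network loss):

```python
# minimal gradient descent on L(θ) = (θ - 3)^2, a toy loss
def grad(theta):
    # dL/dθ = 2(θ - 3)
    return 2 * (theta - 3)

theta, lr = 0.0, 0.1
for _ in range(100):
    theta -= lr * grad(theta)   # step against the gradient

print(round(theta, 4))  # ≈ 3.0, the minimizer θ*
```

Too large a learning rate makes the iterates diverge; too small a rate makes convergence very slow, which is why the deck tunes it one order of magnitude at a time.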
38. Drawbacks of Gradient Descent
Problem 1
Updating once per epoch converges very slowly
(one epoch = one pass over all the training data)
Can we speed it up?
A solution: stochastic gradient descent (SGD)
Problem 2
Gradient-based methods cannot guarantee finding the global optimum
Momentum can reduce the chance of getting stuck in a local minimum
40. Stochastic Gradient Descent
Randomly draw one training sample and update once based on its loss
Another problem: updating one sample at a time is also slow
Mini-batch: update once per mini-batch
Benefits of mini-batch
Compared to SGD: faster to complete one epoch
Compared to GD: faster to converge (to the optimum)
[Diagram: the loss is computed and the weights updated once per mini-batch]
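The shuffle-then-split mini-batch loop can be sketched on a toy linear-regression problem (the data and learning rate here are illustrative assumptions, not the course dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
# toy data: y = 2x + 1 with a little noise
X = rng.uniform(-1, 1, size=(256, 1))
y = 2 * X[:, 0] + 1 + 0.01 * rng.normal(size=256)

w, b, lr, batch_size = 0.0, 0.0, 0.1, 32
for epoch in range(50):
    idx = rng.permutation(len(X))              # shuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]  # one mini-batch
        err = (w * X[batch, 0] + b) - y[batch]
        # gradients of the mean squared error over the mini-batch
        w -= lr * np.mean(err * X[batch, 0])
        b -= lr * np.mean(err)

print(round(w, 2), round(b, 2))  # ≈ 2.0 and 1.0
```

Each epoch makes len(X)/batch_size updates instead of one, which is why mini-batch training converges faster than full-batch gradient descent.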
56. Handling Ordinal Data
Ordinal variables (ordered categories)
For example: {Low, Medium, High}
Encode them in order
{Low, Medium, High} → {1, 2, 3}
Or create a new feature using the mean or median of each range:

UID  Age        UID  Age
P1   0-17       P1   15
P2   0-17   →   P2   15
P3   55+        P3   70
P4   26-35      P4   30
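The mapping in the table above is just a lookup from each range to a representative number; a minimal sketch (only the three ranges shown in the table, with the table's values):

```python
# map each age range to a representative value, as in the table
age_to_number = {'0-17': 15, '26-35': 30, '55+': 70}

ages = ['0-17', '0-17', '55+', '26-35']   # P1..P4 from the table
encoded = [age_to_number[a] for a in ages]
print(encoded)  # [15, 15, 70, 30]
```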
82. Alternative: Functional API
The way to go for defining a complex model
For example: multiple outputs, multiple input sources
Why "Functional API"?
All layers and models are callable (like a function call)
Example
from keras.layers import Input, Dense
input = Input(shape=(200,))
output = Dense(10)(input)
83.
# the Sequential deep learning model
model = Sequential()
model.add(Dense(128, input_dim=200))
model.add(Activation('sigmoid'))
model.add(Dense(256))
model.add(Activation('sigmoid'))
model.add(Dense(5))
model.add(Activation('softmax'))
model.summary()
# Functional API
from keras.layers import Input, Dense
from keras.models import Model
input = Input(shape=(200,))
x = Dense(128,activation='sigmoid')(input)
x = Dense(256,activation='sigmoid')(x)
output = Dense(5,activation='softmax')(x)
# define the Model (function-like)
model = Model(inputs=[input], outputs=[output])
84. Good Use Case for Functional API (1)
Model is callable as well, so it is easy to re-use the
trained model
Re-use the architecture and weights as well
# If model and input are defined already
# re-use the same architecture of the above model
y1 = model(input)
85. Good Use Case for Functional API (2)
Easy to manipulate various input sources
[Diagram: x1 → Dense(100) → y1; y1 concatenated with x2 → Dense(200) → output]
x1 = Input(shape=(10,))
y1 = Dense(100)(x1)
x2 = Input(shape=(20,))
new_x2 = keras.layers.concatenate([y1, x2])
output = Dense(200)(new_x2)
model = Model(inputs=[x1, x2], outputs=[output])
86. Today
Our exercise uses the "Sequential" model
because it is more straightforward for understanding the
details of stacking layers
90. Tips for Deep Learning
[Flowchart: Good result on the training dataset? If no, revisit the Activation Function, Loss Function, Optimizer, or Learning Rate. If yes, check the result on the testing dataset.]
91. Tips for Deep Learning
[Flowchart: same training/testing check as in slide 90]
∂L/∂θ = (∂L/∂ŷ)(∂ŷ/∂z)(∂z/∂θ)
The gradient is affected by the loss function
92. Using MSE
When specifying the loss function:

# specify the loss function and optimizer
model.compile(loss='categorical_crossentropy',
              optimizer=sgd)

# specify the loss function and optimizer
model.compile(loss='mean_squared_error',
              optimizer=sgd)
96. How to Select a Loss Function
Classification commonly uses cross-entropy
paired with softmax as the output layer's activation function
Regression commonly uses mean absolute/squared error
Define a loss function for your specific problem
e.g. an unbalanced dataset, class 0 : class 1 = 99 : 1
Self-defined loss function, e.g. a cost matrix:

Loss      Predicted 0  Predicted 1
Actual 0       0            99
Actual 1       1             0
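One way to use such a cost matrix is to score predictions by their average misclassification cost instead of plain accuracy. A minimal sketch (the 0.5 decision threshold and the sample predictions are assumptions for illustration):

```python
import numpy as np

# cost matrix from the slide: rows = actual class, columns = predicted class
cost = np.array([[0, 99],
                 [1,  0]])

def expected_cost(y_true, p_pred):
    # threshold the predicted probabilities at 0.5 (an assumption),
    # then weight every error by the cost matrix and average
    pred = (p_pred >= 0.5).astype(int)
    return np.mean(cost[y_true, pred])

y_true = np.array([0, 0, 1, 1])
p_pred = np.array([0.1, 0.8, 0.6, 0.4])  # one false positive, one false negative
print(expected_cost(y_true, p_pred))  # 25.0
```

With a 99:1 cost ratio, a single false positive dominates the score, which is exactly the behavior a self-defined loss is meant to capture.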
97. Current Best Model Configuration

Component            Selection
Loss function        categorical_crossentropy
Activation function  sigmoid + softmax
Optimizer            SGD
98. Tips for Deep Learning
[Flowchart: same training/testing check as in slide 90; next up, the learning rate]
99. Exercise: 02_learningRateSelection.py
(5-8 minutes)

# specify the optimizer
from keras.optimizers import SGD, Adam, RMSprop, Adagrad
sgd = SGD(lr=0.01, momentum=0.0, decay=0.0, nesterov=False)

Try changing the learning rate and pick the best one.
Suggestion: lower it one order of magnitude at a time, e.g. 0.1 vs 0.01 vs 0.001
101. How to Set the Learning Rate
Mostly trial and error; it is usually not larger than 0.1
Adjust one order of magnitude at a time:
0.1 → 0.01 → 0.001
Then fine-tune: 0.01 → 0.012 → 0.015 → 0.018 → …
Or your lucky number!
102. Tips for Deep Learning
[Flowchart: same training/testing check as in slide 90]
∂L/∂θ = (∂L/∂ŷ)(∂ŷ/∂z)(∂z/∂θ)
The gradient is affected by the activation function
108. Leaky ReLU
Allows a small gradient when the input to the activation function is smaller than 0 (e.g. α = 0.1):
f(x) = x if x > 0, αx otherwise
df/dx = 1 if x > 0, α otherwise
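The piecewise definition above is one `np.where` per formula (a minimal NumPy sketch, separate from the Keras layer used on the next slide):

```python
import numpy as np

def leaky_relu(x, alpha=0.1):
    # f(x) = x for x > 0, αx otherwise
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.1):
    # df/dx = 1 for x > 0, α otherwise: never exactly zero,
    # so some gradient always flows, unlike plain ReLU
    return np.where(x > 0, 1.0, alpha)

print(leaky_relu(np.array([-2.0, 3.0])))  # [-0.2  3. ]
```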
109. Leaky ReLU in Keras
More activation functions at
https://keras.io/layers/advanced-activations/

# For example
from keras.layers.advanced_activations import LeakyReLU
lrelu = LeakyReLU(alpha=0.02)
model.add(Dense(128, input_dim=200))
# specify the activation function
model.add(lrelu)
113. How to Select Activation Functions
Hidden layers
ReLU is the usual choice
Sigmoid is less recommended because of its vanishing gradient problem
Output layer
Regression: linear
Classification: softmax
114. Current Best Model Configuration

Component            Selection
Loss function        categorical_crossentropy
Activation function  softplus + softmax
Optimizer            SGD
115. Tips for Deep Learning
[Flowchart: same training/testing check as in slide 90; next up, the optimizer]
116. Optimizers in Keras
SGD – Stochastic Gradient Descent
Adagrad – Adaptive Learning Rate
RMSprop – Similar to Adagrad
Adam – Similar to RMSprop + Momentum
Nadam – Adam + Nesterov Momentum
117. Optimizer – SGD
Stochastic gradient descent
Supports momentum, learning rate decay, and Nesterov momentum
Effect of momentum
Without momentum: update = -lr*gradient
With momentum: update = -lr*gradient + m*last_update
Learning rate decay after each update
A 1/t decay: lr = lr / (1 + decay*t)
t: number of updates done
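Plugging toy numbers into the two update rules above makes the effect of momentum concrete (the gradient, previous update, and decay values are illustrative assumptions):

```python
# one step of the SGD update rules from this slide, with toy numbers
lr, m = 0.01, 0.9
gradient, last_update = 2.0, -0.05

update_plain = -lr * gradient                    # -0.02
update_momentum = -lr * gradient + m * last_update
print(update_plain, update_momentum)             # -0.02 -0.065

# 1/t learning-rate decay after t = 10 updates
decay, t = 0.01, 10
lr_t = lr / (1 + decay * t)
print(round(lr_t, 6))                            # 0.009091
```

Because the previous update pointed the same way as the current gradient step, momentum makes the combined step larger; steps in alternating directions would partially cancel instead.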
130. How to Select Optimizers
The usual opening move: Adam
Adaptive learning rate for every weight
Momentum included
Keras recommends RMSprop for RNNs
When training RNNs, watch out for the exploding gradient problem
gradient clipping: brute force that works
The RMSprop vs Adam debate is still ongoing
131. Tips for Deep Learning
[Flowchart: with a good training result, check the result on the testing data]
132. Current Best Model Configuration

Component            Selection
Loss function        categorical_crossentropy
Activation function  softplus + softmax
Optimizer            Adam

After 50 epochs: 90% accuracy!
135. Tips for Deep Learning
[Flowchart: good result on the training dataset, but not on the testing dataset]
What is overfitting?
The training result keeps improving, but the testing result gets worse
Remedies: Early Stopping, Regularization, Dropout
136. Tips for Deep Learning
[Flowchart: good training result but poor testing result → Early Stopping, Regularization, Dropout]
145. Early Stopping in Keras
EarlyStopping
monitor: the performance metric to watch
patience: how many consecutive epochs without improvement to tolerate

''' EarlyStopping '''
from keras.callbacks import EarlyStopping
earlyStopping = EarlyStopping(monitor='val_loss',
                              patience=3)
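In Keras the callback is passed to training via `model.fit(..., callbacks=[earlyStopping])`. The patience logic itself can be sketched in plain Python (a simplified sketch of the stopping rule, with an illustrative loss history):

```python
# stop when val_loss has not improved for `patience` consecutive epochs
def early_stop_epoch(val_losses, patience=3):
    best, wait = float('inf'), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0   # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:
                return epoch       # stop training here
    return len(val_losses) - 1

losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.74, 0.6]
print(early_stop_epoch(losses))  # 5
```

Note that training stops at epoch 5 even though epoch 7 would have improved; patience trades a small risk of stopping early for protection against overfitting.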
156. How to Set Dropout
Do not add Dropout from the start
Do not add Dropout from the start
Do not add Dropout from the start
a) Dropout makes training performance worse
b) Dropout prevents overfitting; it is not a cure-all
c) When there are few parameters, use regularization
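In Keras a dropout layer is added with `model.add(Dropout(0.5))`. What it does at training time can be sketched in NumPy (a minimal illustration of inverted dropout, with an assumed drop rate of 0.5):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, rate=0.5, training=True):
    # inverted dropout: zero a fraction `rate` of units and rescale the
    # survivors by 1/(1-rate) so the expected activation is unchanged
    if not training:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

x = np.ones(10000)
y = dropout(x, rate=0.5)
print(round(y.mean(), 1))  # ≈ 1.0; roughly half the units are zeroed
```

Because of the rescaling, nothing special is needed at test time; with `training=False` the layer is a no-op, which matches why dropout hurts raw training performance but helps generalization.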
163. Tips for Training Your Own DL Model
[Flowchart: Good result on the training dataset? If no, revisit the Activation Function, Loss Function, Optimizer, or Learning Rate. If yes but the result on the testing dataset is poor, apply Early Stopping, Regularization, or Dropout.]
173. Convolution in Computer Vision (CV)
Common applications: blurring, sharpening, embossing
http://setosa.io/ev/image-kernels/
174. Convolution in Computer Vision (CV)
Add up each pixel and its local neighbors, weighted by a filter (kernel)
Perform this convolution process on every pixel
[Figure: an image convolved with a filter]
175. Convolution in Computer Vision (CV)
A filter can be seen as a pattern
[Figure: the image, the filter, and the filter viewed as a pattern]
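The weighted-neighborhood sum described above can be written directly as nested loops (a minimal NumPy sketch with no padding and stride 1; the 4x4 ramp image and sharpen kernel are illustrative):

```python
import numpy as np

def convolve2d(image, kernel):
    # slide the kernel over the image; each output pixel is the
    # weighted sum of that pixel's neighborhood (no padding, stride 1)
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]], dtype=float)
print(convolve2d(image, sharpen))
```

A convolutional layer learns the kernel entries instead of fixing them by hand, but the sliding-window computation is the same.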
176. Real Example: Sobel Edge Detection
An edge is where the brightness changes sharply
Horizontal gradient: G_h(i, j) = p_{i+1,j} − p_{i,j}
Multiply by a constant c: c · G_h(i, j)
to accentuate the difference between neighboring pixels
[Figure: pixel values plotted along the x-axis]
Figure reference
https://en.wikipedia.org/wiki/Sobel_operator#/media/File:Bikesgraygh.jpg
180. CNN Structure
[Diagram: input image → (Convolutional Layer & Activation Layer → Pooling Layer), which can be performed many times, each producing a new image → Flatten → Feedforward DNN → "PIKACHU!!!"]
193. Pooling Layer
Why do we need pooling layers?
Reduce the number of weights
Prevent overfitting
Max pooling
Considers whether the pattern exists in each region

1 2 2 0
1 2 3 2     Max pooling     2 3
3 1 3 2    ────────────→    3 3
0 2 0 2

*How about average pooling?
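The 2x2 max pooling on the slide can be sketched with a reshape trick (a minimal NumPy illustration using the slide's 4x4 example):

```python
import numpy as np

def max_pool(x, size=2):
    # split into non-overlapping size x size regions, keep each region's max
    h, w = x.shape
    return x.reshape(h // size, size, w // size, size).max(axis=(1, 3))

x = np.array([[1, 2, 2, 0],
              [1, 2, 3, 2],
              [3, 1, 3, 2],
              [0, 2, 0, 2]])
print(max_pool(x))  # [[2 3]
                    #  [3 3]]
```

Swapping `.max` for `.mean` gives average pooling, which reports how strongly the pattern responds on average rather than whether it occurs at all.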
200. CIFAR-10 Dataset
60,000 samples (50,000 training + 10,000 testing),
32x32 color images in 10 classes
10 classes:
airplane, automobile, ship, truck,
bird, cat, deer, dog, frog, horse
Official website
https://www.cs.toronto.edu/~kriz/cifar.html
201. Overview of CIFAR-10 Dataset
Files of CIFAR-10 dataset
data_batch_1, …, data_batch_5
test_batch
4 elements in the input dataset
data
labels
batch_label
filenames
202. How to Load Samples from a File
This reading function is provided on the official site

# this function is provided from the official site
def unpickle(file):
    import cPickle
    fo = open(file, 'rb')
    dict = cPickle.load(fo)
    fo.close()
    return dict

# reading a batch file
raw_data = unpickle(dataset_path + fn)
203. How to Load Samples from a File
The function, fixed for Python 3

# this function is provided from the official site
def unpickle(file):
    import pickle
    fo = open(file, 'rb')
    dict = pickle.load(fo, encoding='latin1')
    fo.close()
    return dict

# reading a batch file
raw_data = unpickle(dataset_path + fn)
204. Checking the Data Structure
Useful functions and attributes

# [1] the type of input dataset
type(raw_data)
# <type 'dict'>
# [2] check keys in the dictionary
raw_data_keys = raw_data.keys()
# ['data', 'labels', 'batch_label', 'filenames']
# [3] check dimensions of pixel values
print("dim(data)", numpy.array(raw_data['data']).shape)
# dim(data) (10000, 3072)
210. Reshape the Training/Testing Inputs
Use the image height and width to split the RGB channels, reshape each 1-D vector into a 2-D matrix, then use dstack to stack the R, G, B planes into a 3-D array

X_train = numpy.asarray(
    [numpy.dstack(
        (
            r[0:(width*height)].reshape(height, width),
            r[(width*height):(2*width*height)].reshape(height, width),
            r[(2*width*height):(3*width*height)].reshape(height, width))
    ) for r in img_px_values]
)
Y_train = np_utils.to_categorical(numpy.array(img_lab), classes)
211. Saving Each Sample as an Image
SciPy library
The dimension of "arr_data" should be (height, width, 3)
Supported image formats: .bmp, .png

''' saving ndarray to image '''
from scipy.misc import imsave
def ndarray2image(arr_data, image_fn):
    imsave(image_fn, arr_data)
212. Saving Each Sample as an Image
PIL library (Linux OS)
The dimension of "arr_data" should be (height, width, 3)
Supported image formats: .bmp, .jpeg, .png, etc.

''' saving ndarray to image '''
from PIL import Image
def ndarray2image(arr_data, image_fn):
    img = Image.fromarray(arr_data, 'RGB')
    img.save(image_fn)
216. Number of Parameters of Each Layer
Only one function needed

# check parameters of every layer
model.summary()

Total parameters per layer:
32*3*3*3 + 32 = 896
32*3*3*32 + 32 = 9,248
7200*512 + 512 = 3,686,912
217. Let's Start Training
Two validation methods
Validate by splitting the training samples
Validate with the testing samples

''' training '''
# define the batch size and number of epochs
batch_size = 128
epoch = 32
# [1] validation data comes from training data
fit_log = model.fit(X_train, Y_train, batch_size=batch_size,
                    nb_epoch=epoch, validation_split=0.1,
                    shuffle=True)
# [2] validation data comes from testing data
fit_log = model.fit(X_train, Y_train, batch_size=batch_size,
                    nb_epoch=epoch, shuffle=True,
                    validation_data=(X_test, Y_test))
218. Saving the Training History
Save the training history to a .csv file

''' saving training history '''
import csv
# define the output file name
history_fn = 'ccmd.csv'
# create the output file (text mode with newline='' for Python 3)
with open(history_fn, 'w', newline='') as csv_file:
    w = csv.writer(csv_file)
    # convert the data structure from dictionary to ndarray
    temp = numpy.array(list(fit_log.history.values()))
    # write headers
    w.writerow(fit_log.history.keys())
    # write values
    for i in range(temp.shape[1]):
        w.writerow(temp[:, i])
219. Model Saving and Prediction
Saving/loading the whole CNN model
Predicting the classes of new image samples

''' saving model '''
from keras.models import load_model
model.save('cifar10.h5')
del model
''' loading model '''
model = load_model('cifar10.h5')
''' prediction '''
pred = model.predict_classes(X_test, batch_size, verbose=0)
221. Practice 1 – Dimensions of Inputs
Find the dimensions of the images and the label classes in
read_dataset2vec.py and read_dataset2img.py
Follow the dimension transformations from raw inputs to training inputs (lines 50-110)

# define the information of images, which can be obtained from the official website
height, width, dim = 32, 32, 3
classes = 10
222. Practice 2 – Design a CNN Model
Design a CNN model and make it run successfully (lines 16-25)

# set the dataset path
dataset_path = './cifar_10/'
exec(open("read_dataset2img.py").read())
''' CNN model '''
model = Sequential()
# build your CNN model here
# CNN
model.add(Flatten())
# DNN
223. Let's Try CNN
Hints
Check the format of the training dataset / validation dataset
Design your own CNN model
Don't forget to save the model
Figure reference
https://unsplash.com/collections/186797/coding
232. Introduction
"Transfer": use the knowledge learned from task A to tackle another task B
Example: a sheep/alpaca classifier
[Figure: sheep and alpaca photos, plus images of other animals]
233. Use as a Fixed Feature Extractor
A known model, like VGG, trained on ImageNet
ImageNet: 10 million images with labels
[Diagram: take the output of some layer of the known model as feature vectors]
Train a classifier based on the features extracted by the known model
234. Use as Initialization
Initialize your network with the weights of a known model
Use your dataset to further train your model
i.e. fine-tune the known model
[Diagram: the VGG model's weights initialize your model]
235. Short Summary
Unlabeled data (lack of y)
Semi-supervised learning
Insufficient data (lack of both x and y)
Transfer learning (focus on layer transfer)
Use as a fixed feature extractor
Use as initialization
Resources: https://keras.io/applications/
Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson, "How transferable are features in deep neural networks?", https://arxiv.org/abs/1411.1792, 2014
237. Recap – Fundamentals
Fundamentals of deep learning
A neural network = a function
Gradient descent
Stochastic gradient descent
Mini-batch
Guidelines to determine a network structure
238. Recap – Improvement on Training Set
How to improve performance on training dataset
Activation Function
Loss Function
Optimizer
Learning Rate
239. Recap – Improvement on Testing Set
How to improve performance on testing dataset
Early Stopping
Regularization
Dropout
240. Recap – CNN
Fundamentals of CNN
Concept of filters
Hyper-parameters
Filter size
Zero-padding
Stride
Depth (total number of filters)
Pooling layers
How to train a CNN in Keras
CIFAR-10 dataset
242. How to Get Trained Weights

# get weights
myweights = model.get_weights()
# set weights
model.layers[1].set_weights(myweights[0:2])
# BTW, use model.summary() to check your layers
model.summary()
243. fit_generator
For when the data is too large to read in at once (memory limitation)

# a generator that yields one mini-batch at a time
def train_generator(batch_size):
    while 1:
        data = np.genfromtxt('pkgo_city66_class5_v1.csv',
                             delimiter=',',
                             skip_header=1)
        for i in range(0, len(data) // batch_size):
            x = data[i*batch_size:(i+1)*batch_size, :200]
            y = data[i*batch_size:(i+1)*batch_size, 200]
            yield x, y

model.fit_generator(train_generator(28),
                    epochs=30,
                    steps_per_epoch=100,
                    validation_steps=100)  # or validation_data
244. Residual Network
A residual block learns F(x) = H(x) − x, so its output is F(x) + x = H(x)
The authors hypothesize that it is easier to optimize the
residual mapping F(x) than the original mapping H(x)
https://arxiv.org/pdf/1512.03385.pdf
250. Go Deeper in Deep Learning
"Neural Networks and Deep Learning"
written by Michael Nielsen
http://neuralnetworksanddeeplearning.com/
"Deep Learning"
written by Yoshua Bengio, Ian J. Goodfellow, and Aaron Courville
http://www.iro.umontreal.ca/~bengioy/dlbook/
Course: Machine Learning and Having It Deep and Structured
http://speech.ee.ntu.edu.tw/~tlkagk/courses_MLSD15_2.html
(Slide Credit: Hung-Yi Lee)
251. References
Keras documentation – the official Keras site, very detailed
Keras GitHub – find examples suited to your application under example/
YouTube channel – Prof. Hung-Yi Lee, NTU EE
cs231n – Convolutional Neural Networks for Visual Recognition
Suggestions about the course are welcome by email:
cmchang@iis.sinica.edu.tw and chihfan@iis.sinica.edu.tw