DaSE-Computer-Vision-2021
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from google.colab import drive\n",
"\n",
"drive.mount('/content/drive', force_remount=True)\n",
"\n",
"# 输入daseCV所在的路径\n",
"# 'daseCV' 文件夹包括 '.py', 'classifiers' 和'datasets'文件夹\n",
"# 例如 'CV/assignments/assignment1/daseCV/'\n",
"FOLDERNAME = None\n",
"\n",
"assert FOLDERNAME is not None, \"[!] Enter the foldername.\"\n",
"\n",
"%cd drive/My\\ Drive\n",
"%cp -r $FOLDERNAME ../../\n",
"%cd ../../\n",
"%cd daseCV/datasets/\n",
"!bash get_datasets.sh\n",
"%cd ../../"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": [
"pdf-title"
]
},
"source": [
"# Batch Normalization\n",
"One way to make deep networks easier to train is to use more sophisticated optimization procedures such as SGD+momentum, RMSProp, or Adam. Another strategy is to change the architecture of the network to make it easier to train. \n",
"One idea along these lines is batch normalization which was proposed by [1] in 2015.\n",
"\n",
"The idea is relatively straightforward. Machine learning methods tend to work better when their input data consists of uncorrelated features with zero mean and unit variance. When training a neural network, we can preprocess the data before feeding it to the network to explicitly decorrelate its features; this will ensure that the first layer of the network sees data that follows a nice distribution. However, even if we preprocess the input data, the activations at deeper layers of the network will likely no longer be decorrelated and will no longer have zero mean or unit variance since they are output from earlier layers in the network. Even worse, during the training process the distribution of features at each layer of the network will shift as the weights of each layer are updated.\n",
"\n",
"The authors of [1] hypothesize that the shifting distribution of features inside deep neural networks may make training deep networks more difficult. To overcome this problem, [1] proposes to insert batch normalization layers into the network. At training time, a batch normalization layer uses a minibatch of data to estimate the mean and standard deviation of each feature. These estimated means and standard deviations are then used to center and normalize the features of the minibatch. A running average of these means and standard deviations is kept during training, and at test time these running averages are used to center and normalize features.\n",
"\n",
"It is possible that this normalization strategy could reduce the representational power of the network, since it may sometimes be optimal for certain layers to have features that are not zero-mean or unit variance. To this end, the batch normalization layer includes learnable shift and scale parameters for each feature dimension.\n",
"\n",
"[1] [Sergey Ioffe and Christian Szegedy, \"Batch Normalization: Accelerating Deep Network Training by Reducing\n",
"Internal Covariate Shift\", ICML 2015.](https://arxiv.org/abs/1502.03167)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"pdf-ignore"
]
},
"outputs": [],
"source": [
"# As usual, a bit of setup\n",
"import time\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from daseCV.classifiers.fc_net import *\n",
"from daseCV.data_utils import get_CIFAR10_data\n",
"from daseCV.gradient_check import eval_numerical_gradient, eval_numerical_gradient_array\n",
"from daseCV.solver import Solver\n",
"\n",
"%matplotlib inline\n",
"plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n",
"plt.rcParams['image.interpolation'] = 'nearest'\n",
"plt.rcParams['image.cmap'] = 'gray'\n",
"\n",
"# for auto-reloading external modules\n",
"# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n",
"%load_ext autoreload\n",
"%autoreload 2\n",
"\n",
"\n",
"def rel_error(x, y):\n",
" \"\"\" returns relative error \"\"\"\n",
" return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))\n",
"\n",
"\n",
"def print_mean_std(x, axis=0):\n",
" print(' means: ', x.mean(axis=axis))\n",
" print(' stds: ', x.std(axis=axis))\n",
" print()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"pdf-ignore"
]
},
"outputs": [],
"source": [
"# Load the (preprocessed) CIFAR10 data.\n",
"data = get_CIFAR10_data()\n",
"for k, v in data.items():\n",
" print('%s: ' % k, v.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Batch normalization: forward\n",
"\n",
"在文件 `daseCV/layers` 中实现 `batchnorm_forward` 函数完成batch normalization的前向传播。然后运行以下代码测试你的实现是否准确。\n",
"\n",
"上面参考论文[1]可能会对你有帮助"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check the training-time forward pass by checking means and variances\n",
"# of features both before and after batch normalization \n",
"\n",
"# Simulate the forward pass for a two-layer network\n",
"np.random.seed(231)\n",
"N, D1, D2, D3 = 200, 50, 60, 3\n",
"X = np.random.randn(N, D1)\n",
"W1 = np.random.randn(D1, D2)\n",
"W2 = np.random.randn(D2, D3)\n",
"a = np.maximum(0, X.dot(W1)).dot(W2)\n",
"\n",
"print('Before batch normalization:')\n",
"print_mean_std(a,axis=0)\n",
"\n",
"gamma = np.ones((D3,))\n",
"beta = np.zeros((D3,))\n",
"# Means should be close to zero and stds close to one\n",
"print('After batch normalization (gamma=1, beta=0)')\n",
"a_norm, _ = batchnorm_forward(a, gamma, beta, {'mode': 'train'})\n",
"print_mean_std(a_norm,axis=0)\n",
"\n",
"gamma = np.asarray([1.0, 2.0, 3.0])\n",
"beta = np.asarray([11.0, 12.0, 13.0])\n",
"# Now means should be close to beta and stds close to gamma\n",
"print('After batch normalization (gamma=', gamma, ', beta=', beta, ')')\n",
"a_norm, _ = batchnorm_forward(a, gamma, beta, {'mode': 'train'})\n",
"print_mean_std(a_norm,axis=0)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check the test-time forward pass by running the training-time\n",
"# forward pass many times to warm up the running averages, and then\n",
"# checking the means and variances of activations after a test-time\n",
"# forward pass.\n",
"\n",
"np.random.seed(231)\n",
"N, D1, D2, D3 = 200, 50, 60, 3\n",
"W1 = np.random.randn(D1, D2)\n",
"W2 = np.random.randn(D2, D3)\n",
"\n",
"bn_param = {'mode': 'train'}\n",
"gamma = np.ones(D3)\n",
"beta = np.zeros(D3)\n",
"\n",
"for t in range(50):\n",
" X = np.random.randn(N, D1)\n",
" a = np.maximum(0, X.dot(W1)).dot(W2)\n",
" batchnorm_forward(a, gamma, beta, bn_param)\n",
"\n",
"bn_param['mode'] = 'test'\n",
"X = np.random.randn(N, D1)\n",
"a = np.maximum(0, X.dot(W1)).dot(W2)\n",
"a_norm, _ = batchnorm_forward(a, gamma, beta, bn_param)\n",
"\n",
"# Means should be close to zero and stds close to one, but will be\n",
"# noisier than training-time forward passes.\n",
"print('After batch normalization (test-time):')\n",
"print_mean_std(a_norm,axis=0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Batch normalization: backward\n",
"在 `batchnorm_backward` 中实现batch normalization的反向传播\n",
"\n",
"要想得到反向传播的公式,你应该写出batch normalization的计算图,并且对每个中间节点求反向传播公式。一些中间节点可能有多个传出分支;注意要在反向传播中对这些分支的梯度求和。\n",
"\n",
"一旦你实现了该功能,请运行下面的代码进行梯度数值检测。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Gradient check batchnorm backward pass\n",
"np.random.seed(231)\n",
"N, D = 4, 5\n",
"x = 5 * np.random.randn(N, D) + 12\n",
"gamma = np.random.randn(D)\n",
"beta = np.random.randn(D)\n",
"dout = np.random.randn(N, D)\n",
"\n",
"bn_param = {'mode': 'train'}\n",
"fx = lambda x: batchnorm_forward(x, gamma, beta, bn_param)[0]\n",
"fg = lambda a: batchnorm_forward(x, a, beta, bn_param)[0]\n",
"fb = lambda b: batchnorm_forward(x, gamma, b, bn_param)[0]\n",
"\n",
"dx_num = eval_numerical_gradient_array(fx, x, dout)\n",
"da_num = eval_numerical_gradient_array(fg, gamma.copy(), dout)\n",
"db_num = eval_numerical_gradient_array(fb, beta.copy(), dout)\n",
"\n",
"_, cache = batchnorm_forward(x, gamma, beta, bn_param)\n",
"dx, dgamma, dbeta = batchnorm_backward(dout, cache)\n",
"#You should expect to see relative errors between 1e-13 and 1e-8\n",
"print('dx error: ', rel_error(dx_num, dx))\n",
"print('dgamma error: ', rel_error(da_num, dgamma))\n",
"print('dbeta error: ', rel_error(db_num, dbeta))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Batch normalization: alternative backward\n",
"\n",
"课堂上我们讨论过两种求sigmoid反向传播公式的方法,第一种是写出计算图,然后对计算图中的每一个中间变量求导;另一种方法是在纸上计算好最终的梯度,得到一个很简单的公式。打个比方,你可以先在纸上算出sigmoid的反向传播公式,然后直接实现就可以了,不需要算中间变量的梯度。\n",
"\n",
"BN也有这个性质,你可以自己推一波公式!(接下来不翻译了,自己看)\n",
"\n",
"In the forward pass, given a set of inputs $X=\\begin{bmatrix}x_1\\\\x_2\\\\...\\\\x_N\\end{bmatrix}$, \n",
"\n",
"we first calculate the mean $\\mu$ and variance $v$.\n",
"With $\\mu$ and $v$ calculated, we can calculate the standard deviation $\\sigma$ and normalized data $Y$.\n",
"The equations and graph illustration below describe the computation ($y_i$ is the i-th element of the vector $Y$).\n",
"\n",
"\\begin{align}\n",
"& \\mu=\\frac{1}{N}\\sum_{k=1}^N x_k & v=\\frac{1}{N}\\sum_{k=1}^N (x_k-\\mu)^2 \\\\\n",
"& \\sigma=\\sqrt{v+\\epsilon} & y_i=\\frac{x_i-\\mu}{\\sigma}\n",
"\\end{align}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"notebook_images/batchnorm_graph.png\" width=691 height=202>"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": [
"pdf-ignore"
]
},
"source": [
"The meat of our problem during backpropagation is to compute $\\frac{\\partial L}{\\partial X}$, given the upstream gradient we receive, $\\frac{\\partial L}{\\partial Y}.$ To do this, recall the chain rule in calculus gives us $\\frac{\\partial L}{\\partial X} = \\frac{\\partial L}{\\partial Y} \\cdot \\frac{\\partial Y}{\\partial X}$.\n",
"\n",
"The unknown/hart part is $\\frac{\\partial Y}{\\partial X}$. We can find this by first deriving step-by-step our local gradients at \n",
"$\\frac{\\partial v}{\\partial X}$, $\\frac{\\partial \\mu}{\\partial X}$,\n",
"$\\frac{\\partial \\sigma}{\\partial v}$, \n",
"$\\frac{\\partial Y}{\\partial \\sigma}$, and $\\frac{\\partial Y}{\\partial \\mu}$,\n",
"and then use the chain rule to compose these gradients (which appear in the form of vectors!) appropriately to compute $\\frac{\\partial Y}{\\partial X}$.\n",
"\n",
"If it's challenging to directly reason about the gradients over $X$ and $Y$ which require matrix multiplication, try reasoning about the gradients in terms of individual elements $x_i$ and $y_i$ first: in that case, you will need to come up with the derivations for $\\frac{\\partial L}{\\partial x_i}$, by relying on the Chain Rule to first calculate the intermediate $\\frac{\\partial \\mu}{\\partial x_i}, \\frac{\\partial v}{\\partial x_i}, \\frac{\\partial \\sigma}{\\partial x_i},$ then assemble these pieces to calculate $\\frac{\\partial y_i}{\\partial x_i}$. \n",
"\n",
"You should make sure each of the intermediary gradient derivations are all as simplified as possible, for ease of implementation. \n",
"\n",
"\n",
"算好之后,在 `batchnorm_backward_alt` 函数中实现简化版的batch normalization的反向传播公式,然后分别运行两种反向传播实现并比较结果,你的结果应该是一致的,但是简化版的实现应该会更快一点。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"np.random.seed(231)\n",
"N, D = 100, 500\n",
"x = 5 * np.random.randn(N, D) + 12\n",
"gamma = np.random.randn(D)\n",
"beta = np.random.randn(D)\n",
"dout = np.random.randn(N, D)\n",
"\n",
"bn_param = {'mode': 'train'}\n",
"out, cache = batchnorm_forward(x, gamma, beta, bn_param)\n",
"\n",
"t1 = time.time()\n",
"dx1, dgamma1, dbeta1 = batchnorm_backward(dout, cache)\n",
"t2 = time.time()\n",
"dx2, dgamma2, dbeta2 = batchnorm_backward_alt(dout, cache)\n",
"t3 = time.time()\n",
"\n",
"print('dx difference: ', rel_error(dx1, dx2))\n",
"print('dgamma difference: ', rel_error(dgamma1, dgamma2))\n",
"print('dbeta difference: ', rel_error(dbeta1, dbeta2))\n",
"print('speedup: %.2fx' % ((t2 - t1) / (t3 - t2)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Fully Connected Nets with Batch Normalization\n",
"\n",
"现在你已经实现了Batch Normalization,请在`daseCV/classifiers/fc_net.py`中的`FullyConnectedNet`上添加Batch Norm。\n",
"\n",
"具体来说,当在构造函数中`normalization`标记设置为`batchnorm`时,应该在每个ReLU激活层之前插入一个Batch Norm层。网络最后一层的输出不应该加Batch Norm。\n",
"\n",
"当你完成该功能,运行以下代码进行梯度检查。\n",
"\n",
"HINT: You might find it useful to define an additional helper layer similar to those in the file `daseCV/layer_utils.py`. If you decide to do so, do it in the file `daseCV/classifiers/fc_net.py`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"np.random.seed(231)\n",
"N, D, H1, H2, C = 2, 15, 20, 30, 10\n",
"X = np.random.randn(N, D)\n",
"y = np.random.randint(C, size=(N,))\n",
"\n",
"# You should expect losses between 1e-4~1e-10 for W, \n",
"# losses between 1e-08~1e-10 for b,\n",
"# and losses between 1e-08~1e-09 for beta and gammas.\n",
"for reg in [0, 3.14]:\n",
" print('Running check with reg = ', reg)\n",
" model = FullyConnectedNet([H1, H2], input_dim=D, num_classes=C,\n",
" reg=reg, weight_scale=5e-2, dtype=np.float64,\n",
" normalization='batchnorm')\n",
"\n",
" loss, grads = model.loss(X, y)\n",
" print('Initial loss: ', loss)\n",
"\n",
" for name in sorted(grads):\n",
" f = lambda _: model.loss(X, y)[0]\n",
" grad_num = eval_numerical_gradient(f, model.params[name], verbose=False, h=1e-5)\n",
" print('%s relative error: %.2e' % (name, rel_error(grad_num, grads[name])))\n",
" if reg == 0: print()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Batchnorm for deep networks\n",
"\n",
"运行以下代码,在1000个样本的子集上训练一个六层网络,包括有和没有Batch Norm的版本。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"np.random.seed(231)\n",
"# Try training a very deep net with batchnorm\n",
"hidden_dims = [100, 100, 100, 100, 100]\n",
"\n",
"num_train = 1000\n",
"small_data = {\n",
" 'X_train': data['X_train'][:num_train],\n",
" 'y_train': data['y_train'][:num_train],\n",
" 'X_val': data['X_val'],\n",
" 'y_val': data['y_val'],\n",
"}\n",
"\n",
"weight_scale = 2e-2\n",
"bn_model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization='batchnorm')\n",
"model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization=None)\n",
"\n",
"print('Solver with batch norm:')\n",
"bn_solver = Solver(bn_model, small_data,\n",
" num_epochs=10, batch_size=50,\n",
" update_rule='adam',\n",
" optim_config={\n",
" 'learning_rate': 1e-3,\n",
" },\n",
" verbose=True,print_every=20)\n",
"bn_solver.train()\n",
"\n",
"print('\\nSolver without batch norm:')\n",
"solver = Solver(model, small_data,\n",
" num_epochs=10, batch_size=50,\n",
" update_rule='adam',\n",
" optim_config={\n",
" 'learning_rate': 1e-3,\n",
" },\n",
" verbose=True, print_every=20)\n",
"solver.train()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"运行以下命令来可视化上面训练的两个网络的结果。你会发现,使用Batch Norm有助于网络更快地收敛。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"pdf-ignore-input"
]
},
"outputs": [],
"source": [
"def plot_training_history(title, label, baseline, bn_solvers, plot_fn, bl_marker='.', bn_marker='.', labels=None):\n",
" \"\"\"utility function for plotting training history\"\"\"\n",
" plt.title(title)\n",
" plt.xlabel(label)\n",
" bn_plots = [plot_fn(bn_solver) for bn_solver in bn_solvers]\n",
" bl_plot = plot_fn(baseline)\n",
" num_bn = len(bn_plots)\n",
" for i in range(num_bn):\n",
" label='with_norm'\n",
" if labels is not None:\n",
" label += str(labels[i])\n",
" plt.plot(bn_plots[i], bn_marker, label=label)\n",
" label='baseline'\n",
" if labels is not None:\n",
" label += str(labels[0])\n",
" plt.plot(bl_plot, bl_marker, label=label)\n",
" plt.legend(loc='lower center', ncol=num_bn+1) \n",
"\n",
" \n",
"plt.subplot(3, 1, 1)\n",
"plot_training_history('Training loss','Iteration', solver, [bn_solver], \\\n",
" lambda x: x.loss_history, bl_marker='o', bn_marker='o')\n",
"plt.subplot(3, 1, 2)\n",
"plot_training_history('Training accuracy','Epoch', solver, [bn_solver], \\\n",
" lambda x: x.train_acc_history, bl_marker='-o', bn_marker='-o')\n",
"plt.subplot(3, 1, 3)\n",
"plot_training_history('Validation accuracy','Epoch', solver, [bn_solver], \\\n",
" lambda x: x.val_acc_history, bl_marker='-o', bn_marker='-o')\n",
"\n",
"plt.gcf().set_size_inches(15, 15)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Batch normalization and initialization\n",
"\n",
"我们将进行一个小实验来研究Batch Norm和权值初始化之间的相互关系。\n",
"\n",
"下面代码将训练8层网络,分别使用不同规模的权重初始化进行Batch Norm和不进行Batch Norm。\n",
"然后绘制训练精度、验证集精度、训练损失。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"pdf-ignore-input"
]
},
"outputs": [],
"source": [
"np.random.seed(231)\n",
"# Try training a very deep net with batchnorm\n",
"hidden_dims = [50, 50, 50, 50, 50, 50, 50]\n",
"num_train = 1000\n",
"small_data = {\n",
" 'X_train': data['X_train'][:num_train],\n",
" 'y_train': data['y_train'][:num_train],\n",
" 'X_val': data['X_val'],\n",
" 'y_val': data['y_val'],\n",
"}\n",
"\n",
"bn_solvers_ws = {}\n",
"solvers_ws = {}\n",
"weight_scales = np.logspace(-4, 0, num=20)\n",
"for i, weight_scale in enumerate(weight_scales):\n",
" print('Running weight scale %d / %d' % (i + 1, len(weight_scales)))\n",
" bn_model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization='batchnorm')\n",
" model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization=None)\n",
"\n",
" bn_solver = Solver(bn_model, small_data,\n",
" num_epochs=10, batch_size=50,\n",
" update_rule='adam',\n",
" optim_config={\n",
" 'learning_rate': 1e-3,\n",
" },\n",
" verbose=False, print_every=200)\n",
" bn_solver.train()\n",
" bn_solvers_ws[weight_scale] = bn_solver\n",
"\n",
" solver = Solver(model, small_data,\n",
" num_epochs=10, batch_size=50,\n",
" update_rule='adam',\n",
" optim_config={\n",
" 'learning_rate': 1e-3,\n",
" },\n",
" verbose=False, print_every=200)\n",
" solver.train()\n",
" solvers_ws[weight_scale] = solver"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"pdf-ignore-input"
]
},
"outputs": [],
"source": [
"# Plot results of weight scale experiment\n",
"best_train_accs, bn_best_train_accs = [], []\n",
"best_val_accs, bn_best_val_accs = [], []\n",
"final_train_loss, bn_final_train_loss = [], []\n",
"\n",
"for ws in weight_scales:\n",
" best_train_accs.append(max(solvers_ws[ws].train_acc_history))\n",
" bn_best_train_accs.append(max(bn_solvers_ws[ws].train_acc_history))\n",
" \n",
" best_val_accs.append(max(solvers_ws[ws].val_acc_history))\n",
" bn_best_val_accs.append(max(bn_solvers_ws[ws].val_acc_history))\n",
" \n",
" final_train_loss.append(np.mean(solvers_ws[ws].loss_history[-100:]))\n",
" bn_final_train_loss.append(np.mean(bn_solvers_ws[ws].loss_history[-100:]))\n",
" \n",
"plt.subplot(3, 1, 1)\n",
"plt.title('Best val accuracy vs weight initialization scale')\n",
"plt.xlabel('Weight initialization scale')\n",
"plt.ylabel('Best val accuracy')\n",
"plt.semilogx(weight_scales, best_val_accs, '-o', label='baseline')\n",
"plt.semilogx(weight_scales, bn_best_val_accs, '-o', label='batchnorm')\n",
"plt.legend(ncol=2, loc='lower right')\n",
"\n",
"plt.subplot(3, 1, 2)\n",
"plt.title('Best train accuracy vs weight initialization scale')\n",
"plt.xlabel('Weight initialization scale')\n",
"plt.ylabel('Best training accuracy')\n",
"plt.semilogx(weight_scales, best_train_accs, '-o', label='baseline')\n",
"plt.semilogx(weight_scales, bn_best_train_accs, '-o', label='batchnorm')\n",
"plt.legend()\n",
"\n",
"plt.subplot(3, 1, 3)\n",
"plt.title('Final training loss vs weight initialization scale')\n",
"plt.xlabel('Weight initialization scale')\n",
"plt.ylabel('Final training loss')\n",
"plt.semilogx(weight_scales, final_train_loss, '-o', label='baseline')\n",
"plt.semilogx(weight_scales, bn_final_train_loss, '-o', label='batchnorm')\n",
"plt.legend()\n",
"plt.gca().set_ylim(1.0, 3.5)\n",
"\n",
"plt.gcf().set_size_inches(15, 15)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": [
"pdf-inline"
]
},
"source": [
"## Inline Question 1:\n",
"描述一下这个实验的结果。权重初始化的规模如何影响 带有/没有Batch Norm的模型,为什么?\n",
"\n",
"## Answer:\n",
"[FILL THIS IN]\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Batch normalization and batch size\n",
"\n",
"我们将进行一个小实验来研究Batch Norm和batch size之间的相互关系。\n",
"\n",
"下面的代码将使用不同的batch size来训练带有/没有Batch Norm的6层网络。\n",
"然后将绘制随时间变化的训练准确率和验证集的准确率。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"pdf-ignore-input"
]
},
"outputs": [],
"source": [
"def run_batchsize_experiments(normalization_mode):\n",
" np.random.seed(231)\n",
" # Try training a very deep net with batchnorm\n",
" hidden_dims = [100, 100, 100, 100, 100]\n",
" num_train = 1000\n",
" small_data = {\n",
" 'X_train': data['X_train'][:num_train],\n",
" 'y_train': data['y_train'][:num_train],\n",
" 'X_val': data['X_val'],\n",
" 'y_val': data['y_val'],\n",
" }\n",
" n_epochs=10\n",
" weight_scale = 2e-2\n",
" batch_sizes = [5,10,50]\n",
" lr = 10**(-3.5)\n",
" solver_bsize = batch_sizes[0]\n",
"\n",
" print('No normalization: batch size = ',solver_bsize)\n",
" model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization=None)\n",
" solver = Solver(model, small_data,\n",
" num_epochs=n_epochs, batch_size=solver_bsize,\n",
" update_rule='adam',\n",
" optim_config={\n",
" 'learning_rate': lr,\n",
" },\n",
" verbose=False)\n",
" solver.train()\n",
" \n",
" bn_solvers = []\n",
" for i in range(len(batch_sizes)):\n",
" b_size=batch_sizes[i]\n",
" print('Normalization: batch size = ',b_size)\n",
" bn_model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization=normalization_mode)\n",
" bn_solver = Solver(bn_model, small_data,\n",
" num_epochs=n_epochs, batch_size=b_size,\n",
" update_rule='adam',\n",
" optim_config={\n",
" 'learning_rate': lr,\n",
" },\n",
" verbose=False)\n",
" bn_solver.train()\n",
" bn_solvers.append(bn_solver)\n",
" \n",
" return bn_solvers, solver, batch_sizes\n",
"\n",
"batch_sizes = [5,10,50]\n",
"bn_solvers_bsize, solver_bsize, batch_sizes = run_batchsize_experiments('batchnorm')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.subplot(2, 1, 1)\n",
"plot_training_history('Training accuracy (Batch Normalization)','Epoch', solver_bsize, bn_solvers_bsize, \\\n",
" lambda x: x.train_acc_history, bl_marker='-^', bn_marker='-o', labels=batch_sizes)\n",
"plt.subplot(2, 1, 2)\n",
"plot_training_history('Validation accuracy (Batch Normalization)','Epoch', solver_bsize, bn_solvers_bsize, \\\n",
" lambda x: x.val_acc_history, bl_marker='-^', bn_marker='-o', labels=batch_sizes)\n",
"\n",
"plt.gcf().set_size_inches(15, 10)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": [
"pdf-inline"
]
},
"source": [
"## Inline Question 2:\n",
"描述一下这个实验的结果。请问Batch Norm和batch size之间的又什么关系?为什么会出现这种关系?\n",
"\n",
"## Answer:\n",
"[FILL THIS IN]\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Layer Normalization\n",
"\n",
"(这里大概讲的是batch norm受限于batch size的取值,但是受限于硬件资源,batch size不能取太大,所以提出了layer norm,对一个样本的特征向量进行归一化,均值和方差由该样本的特征向量的所有元素算出来,具体的自己看英文和论文。)\n",
"\n",
"Batch normalization has proved to be effective in making networks easier to train, but the dependency on batch size makes it less useful in complex networks which have a cap on the input batch size due to hardware limitations. \n",
"\n",
"Several alternatives to batch normalization have been proposed to mitigate this problem; one such technique is Layer Normalization [2]. Instead of normalizing over the batch, we normalize over the features. In other words, when using Layer Normalization, each feature vector corresponding to a single datapoint is normalized based on the sum of all terms within that feature vector.\n",
"\n",
"[2] [Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E. Hinton. \"Layer Normalization.\" stat 1050 (2016): 21.](https://arxiv.org/pdf/1607.06450.pdf)"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": [
"pdf-inline"
]
},
"source": [
"## Inline Question 3:\n",
"\n",
"下面的数据预处理步骤中,哪些类似于Batch Norm,哪些类似于Layer Norm?\n",
"\n",
"1. Scaling each image in the dataset, so that the RGB channels for each row of pixels within an image sums up to 1.\n",
"2. Scaling each image in the dataset, so that the RGB channels for all pixels within an image sums up to 1. \n",
"3. Subtracting the mean image of the dataset from each image in the dataset.\n",
"4. Setting all RGB values to either 0 or 1 depending on a given threshold.\n",
"\n",
"## Answer:\n",
"[FILL THIS IN]\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Layer Normalization: Implementation\n",
"\n",
"现在你要实现layer normalization。这步应该相对简单,因为在概念上,layer norm的实现几乎与batch norm一样。不过一个重要的区别是,对于layer norm,我们使用moments,并且测试阶段与训练阶段是相同的,每个数据样本直接计算平均值和方差。\n",
"\n",
"你要完成下面的工作\n",
"\n",
"* 实现 `daseCV/layers.py` 中的`layernorm_forward`。 \n",
"\n",
"运行下面第一个cell检查你的结果\n",
"\n",
"* 实现 `daseCV/layers.py` 中的`layernorm_backward`。\n",
"运行下面第二个cell检查你的结果\n",
"\n",
"* 修改 `daseCV/classifiers/fc_net.py`,在`FullyConnectedNet`上增加layer normalization。当构造函数中的`normalization`标记为`\"layernorm\"`时,你应该在每个ReLU层前插入layer normalization层。\n",
"\n",
"运行下面第三个cell进行关于在layer normalization上的batch size的实验。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check the training-time forward pass by checking means and variances\n",
"# of features both before and after layer normalization \n",
"\n",
"# Simulate the forward pass for a two-layer network\n",
"np.random.seed(231)\n",
"N, D1, D2, D3 =4, 50, 60, 3\n",
"X = np.random.randn(N, D1)\n",
"W1 = np.random.randn(D1, D2)\n",
"W2 = np.random.randn(D2, D3)\n",
"a = np.maximum(0, X.dot(W1)).dot(W2)\n",
"\n",
"print('Before layer normalization:')\n",
"print_mean_std(a,axis=1)\n",
"\n",
"gamma = np.ones(D3)\n",
"beta = np.zeros(D3)\n",
"# Means should be close to zero and stds close to one\n",
"print('After layer normalization (gamma=1, beta=0)')\n",
"a_norm, _ = layernorm_forward(a, gamma, beta, {'mode': 'train'})\n",
"print_mean_std(a_norm,axis=1)\n",
"\n",
"gamma = np.asarray([3.0,3.0,3.0])\n",
"beta = np.asarray([5.0,5.0,5.0])\n",
"# Now means should be close to beta and stds close to gamma\n",
"print('After layer normalization (gamma=', gamma, ', beta=', beta, ')')\n",
"a_norm, _ = layernorm_forward(a, gamma, beta, {'mode': 'train'})\n",
"print_mean_std(a_norm,axis=1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Gradient check batchnorm backward pass\n",
"np.random.seed(231)\n",
"N, D = 4, 5\n",
"x = 5 * np.random.randn(N, D) + 12\n",
"gamma = np.random.randn(D)\n",
"beta = np.random.randn(D)\n",
"dout = np.random.randn(N, D)\n",
"\n",
"ln_param = {}\n",
"fx = lambda x: layernorm_forward(x, gamma, beta, ln_param)[0]\n",
"fg = lambda a: layernorm_forward(x, a, beta, ln_param)[0]\n",
"fb = lambda b: layernorm_forward(x, gamma, b, ln_param)[0]\n",
"\n",
"dx_num = eval_numerical_gradient_array(fx, x, dout)\n",
"da_num = eval_numerical_gradient_array(fg, gamma.copy(), dout)\n",
"db_num = eval_numerical_gradient_array(fb, beta.copy(), dout)\n",
"\n",
"_, cache = layernorm_forward(x, gamma, beta, ln_param)\n",
"dx, dgamma, dbeta = layernorm_backward(dout, cache)\n",
"\n",
"#You should expect to see relative errors between 1e-12 and 1e-8\n",
"print('dx error: ', rel_error(dx_num, dx))\n",
"print('dgamma error: ', rel_error(da_num, dgamma))\n",
"print('dbeta error: ', rel_error(db_num, dbeta))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Layer Normalization and batch size\n",
"\n",
"我们将使用layer norm来进行前面的batch size实验。与之前的实验相比,batch size对训练精度的影响要小得多!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ln_solvers_bsize, solver_bsize, batch_sizes = run_batchsize_experiments('layernorm')\n",
"\n",
"plt.subplot(2, 1, 1)\n",
"plot_training_history('Training accuracy (Layer Normalization)','Epoch', solver_bsize, ln_solvers_bsize, \\\n",
" lambda x: x.train_acc_history, bl_marker='-^', bn_marker='-o', labels=batch_sizes)\n",
"plt.subplot(2, 1, 2)\n",
"plot_training_history('Validation accuracy (Layer Normalization)','Epoch', solver_bsize, ln_solvers_bsize, \\\n",
" lambda x: x.val_acc_history, bl_marker='-^', bn_marker='-o', labels=batch_sizes)\n",
"\n",
"plt.gcf().set_size_inches(15, 10)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": [
"pdf-inline"
]
},
"source": [
"## Inline Question 4:\n",
"什么时候layer normalization可能不工作(不起作用),为什么?\n",
"\n",
"1. 在非常深的网络上使用\n",
"2. 特征的维度非常的小\n",
"3. 有非常高的正则化项\n",
"\n",
"\n",
"## Answer:\n",
"[FILL THIS IN]\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"# 重要\n",
"\n",
"这里是作业的结尾处,请执行以下步骤:\n",
"\n",
"1. 点击`File -> Save`或者用`control+s`组合键,确保你最新的的notebook的作业已经保存到谷歌云。\n",
"2. 执行以下代码确保 `.py` 文件保存回你的谷歌云。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"FOLDER_TO_SAVE = os.path.join('drive/My Drive/', FOLDERNAME)\n",
"FILES_TO_SAVE = ['daseCV/classifiers/cnn.py', 'daseCV/classifiers/fc_net.py']\n",
"\n",
"for files in FILES_TO_SAVE:\n",
" with open(os.path.join(FOLDER_TO_SAVE, '/'.join(files.split('/')[1:])), 'w') as f:\n",
" f.write(''.join(open(files).readlines()))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 4
}