assignment2/BatchNormalization.ipynb

"from google.colab import drive\n",
"drive.mount('/content/drive', force_remount=True)\n",
"# Enter the path to the daseCV folder\n",
"# The 'daseCV' folder contains '.py', 'classifiers' and 'datasets' folders\n",
"# e.g. 'CV/assignments/assignment1/daseCV/'\n",
"FOLDERNAME = None\n",
"assert FOLDERNAME is not None, \"[!] Enter the foldername.\"\n",
"%cd drive/My\\ Drive\n",
"%cp -r $FOLDERNAME ../../\n",
"%cd ../../\n",
"%cd daseCV/datasets/\n",
"%cd ../../"
"# Batch Normalization\n",
"One way to make deep networks easier to train is to use more sophisticated optimization procedures such as SGD+momentum, RMSProp, or Adam. Another strategy is to change the architecture of the network to make it easier to train. \n",
"One idea along these lines is batch normalization, which was proposed in [1] in 2015.\n",
"The idea is relatively straightforward. Machine learning methods tend to work better when their input data consists of uncorrelated features with zero mean and unit variance. When training a neural network, we can preprocess the data before feeding it to the network to explicitly decorrelate its features; this will ensure that the first layer of the network sees data that follows a nice distribution. However, even if we preprocess the input data, the activations at deeper layers of the network will likely no longer be decorrelated and will no longer have zero mean or unit variance since they are output from earlier layers in the network. Even worse, during the training process the distribution of features at each layer of the network will shift as the weights of each layer are updated.\n",
"The authors of [1] hypothesize that the shifting distribution of features inside deep neural networks may make training deep networks more difficult. To overcome this problem, [1] proposes to insert batch normalization layers into the network. At training time, a batch normalization layer uses a minibatch of data to estimate the mean and standard deviation of each feature. These estimated means and standard deviations are then used to center and normalize the features of the minibatch. A running average of these means and standard deviations is kept during training, and at test time these running averages are used to center and normalize features.\n",
"It is possible that this normalization strategy could reduce the representational power of the network, since it may sometimes be optimal for certain layers to have features that are not zero-mean or unit variance. To this end, the batch normalization layer includes learnable shift and scale parameters for each feature dimension.\n",
"[1] [Sergey Ioffe and Christian Szegedy, \"Batch Normalization: Accelerating Deep Network Training by Reducing\n",
"Internal Covariate Shift\", ICML 2015.]("
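The training-time and test-time behaviour described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the assignment's reference implementation: the function names, the `momentum` and `eps` defaults, and the choice to track a running variance (rather than a running standard deviation) are all assumptions made for the example.

```python
import numpy as np

def batchnorm_train_step(x, gamma, beta, running_mean, running_var,
                         momentum=0.9, eps=1e-5):
    """Training-time batch norm on an (N, D) minibatch (illustrative sketch)."""
    mu = x.mean(axis=0)                     # per-feature minibatch mean
    var = x.var(axis=0)                     # per-feature minibatch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # center and normalize each feature
    out = gamma * x_hat + beta              # learnable scale and shift
    # Running averages (assumed exponential decay), used at test time
    running_mean = momentum * running_mean + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return out, running_mean, running_var

def batchnorm_test_step(x, gamma, beta, running_mean, running_var, eps=1e-5):
    """Test-time batch norm: normalize with the stored running statistics."""
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = 5 * rng.standard_normal((200, 3)) + 12
out, rm, rv = batchnorm_train_step(x, np.ones(3), np.zeros(3),
                                   np.zeros(3), np.ones(3))
print(out.mean(axis=0), out.std(axis=0))  # close to 0 and close to 1
```

With `gamma=1` and `beta=0` the output features are approximately zero-mean and unit-variance; other values of `gamma` and `beta` let the layer recover a different scale and shift if that is what training prefers.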
"# As usual, a bit of setup\n",
"import time\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from daseCV.classifiers.fc_net import *\n",
"from daseCV.data_utils import get_CIFAR10_data\n",
"from daseCV.gradient_check import eval_numerical_gradient, eval_numerical_gradient_array\n",
"from daseCV.solver import Solver\n",
"%matplotlib inline\n",
"plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n",
"plt.rcParams['image.interpolation'] = 'nearest'\n",
"plt.rcParams['image.cmap'] = 'gray'\n",
"# for auto-reloading external modules\n",
"# see\n",
"%load_ext autoreload\n",
"%autoreload 2\n",
"def rel_error(x, y):\n",
" \"\"\" returns relative error \"\"\"\n",
" return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))\n",
"def print_mean_std(x, axis=0):\n",
" print(' means: ', x.mean(axis=axis))\n",
" print(' stds: ', x.std(axis=axis))\n",
" print()"
"# Load the (preprocessed) CIFAR10 data.\n",
"data = get_CIFAR10_data()\n",
"for k, v in data.items():\n",
" print('%s: ' % k, v.shape)"
"## Batch normalization: forward\n",
"In the file `daseCV/layers`, implement the batch normalization forward pass in the function `batchnorm_forward`. Then run the following code to test your implementation.\n",
"# Check the training-time forward pass by checking means and variances\n",
"# of features both before and after batch normalization \n",
"# Simulate the forward pass for a two-layer network\n",
"N, D1, D2, D3 = 200, 50, 60, 3\n",
"X = np.random.randn(N, D1)\n",
"W1 = np.random.randn(D1, D2)\n",
"W2 = np.random.randn(D2, D3)\n",
"a = np.maximum(0, X.dot(W1)).dot(W2)\n",
"print('Before batch normalization:')\n",
"print_mean_std(a, axis=0)\n",
"gamma = np.ones((D3,))\n",
"beta = np.zeros((D3,))\n",
"# Means should be close to zero and stds close to one\n",
"print('After batch normalization (gamma=1, beta=0)')\n",
"a_norm, _ = batchnorm_forward(a, gamma, beta, {'mode': 'train'})\n",
"print_mean_std(a_norm, axis=0)\n",
"gamma = np.asarray([1.0, 2.0, 3.0])\n",
"beta = np.asarray([11.0, 12.0, 13.0])\n",
"# Now means should be close to beta and stds close to gamma\n",
"print('After batch normalization (gamma=', gamma, ', beta=', beta, ')')\n",
"a_norm, _ = batchnorm_forward(a, gamma, beta, {'mode': 'train'})\n",
"print_mean_std(a_norm, axis=0)"
"# Check the test-time forward pass by running the training-time\n",
"# forward pass many times to warm up the running averages, and then\n",
"# checking the means and variances of activations after a test-time\n",
"# forward pass.\n",
"N, D1, D2, D3 = 200, 50, 60, 3\n",
"W1 = np.random.randn(D1, D2)\n",
"W2 = np.random.randn(D2, D3)\n",
"bn_param = {'mode': 'train'}\n",
"gamma = np.ones(D3)\n",
"beta = np.zeros(D3)\n",
"for t in range(50):\n",
" X = np.random.randn(N, D1)\n",
" a = np.maximum(0, X.dot(W1)).dot(W2)\n",
" batchnorm_forward(a, gamma, beta, bn_param)\n",
"bn_param['mode'] = 'test'\n",
"X = np.random.randn(N, D1)\n",
"a = np.maximum(0, X.dot(W1)).dot(W2)\n",
"a_norm, _ = batchnorm_forward(a, gamma, beta, bn_param)\n",
"# Means should be close to zero and stds close to one, but will be\n",
"# noisier than training-time forward passes.\n",
"print('After batch normalization (test-time):')\n",
"print_mean_std(a_norm, axis=0)"
"## Batch normalization: backward\n",
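"Implement the backward pass for batch normalization in the function `batchnorm_backward`.\n",
"To derive the backward pass, write out the computation graph for batch normalization and backpropagate through each intermediate node. Some intermediate nodes may have multiple outgoing branches; make sure to sum the gradients across these branches in the backward pass.\n",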
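The gradient checks in this notebook compare an analytic backward pass against a numerical estimate. As a rough sketch of how a centered-difference estimate of this kind can be computed (the function name `numerical_gradient_array` is hypothetical and is not the daseCV helper, whose implementation is not shown here):

```python
import numpy as np

def numerical_gradient_array(f, x, df, h=1e-5):
    """Centered-difference estimate of dL/dx, given the upstream gradient df.

    f maps an array x to an array f(x); df has the same shape as f(x).
    x must be a float array; it is perturbed in place and then restored.
    """
    grad = np.zeros_like(x)
    for ix in np.ndindex(*x.shape):
        old = x[ix]
        x[ix] = old + h
        pos = f(x).copy()          # f evaluated at x + h along coordinate ix
        x[ix] = old - h
        neg = f(x).copy()          # f evaluated at x - h along coordinate ix
        x[ix] = old                # restore the original value
        grad[ix] = np.sum((pos - neg) * df) / (2 * h)
    return grad

x = np.array([[1.0, 2.0], [3.0, 4.0]])
df = np.ones_like(x)
grad = numerical_gradient_array(lambda v: v ** 2, x, df)
print(grad)  # approximately 2 * x
```

Because every entry of `x` needs two forward passes, this check is only practical on tiny inputs such as the `N, D = 4, 5` arrays used in the gradient-check cells.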
"# Gradient check batchnorm backward pass\n",
"N, D = 4, 5\n",
"x = 5 * np.random.randn(N, D) + 12\n",
"gamma = np.random.randn(D)\n",
"beta = np.random.randn(D)\n",
"dout = np.random.randn(N, D)\n",
"bn_param = {'mode': 'train'}\n",
"fx = lambda x: batchnorm_forward(x, gamma, beta, bn_param)[0]\n",
"fg = lambda a: batchnorm_forward(x, a, beta, bn_param)[0]\n",
"fb = lambda b: batchnorm_forward(x, gamma, b, bn_param)[0]\n",
"dx_num = eval_numerical_gradient_array(fx, x, dout)\n",
"da_num = eval_numerical_gradient_array(fg, gamma.copy(), dout)\n",
"db_num = eval_numerical_gradient_array(fb, beta.copy(), dout)\n",
"_, cache = batchnorm_forward(x, gamma, beta, bn_param)\n",
"dx, dgamma, dbeta = batchnorm_backward(dout, cache)\n",
"# You should expect to see relative errors between 1e-13 and 1e-8\n",
"print('dx error: ', rel_error(dx_num, dx))\n",
"print('dgamma error: ', rel_error(da_num, dgamma))\n",
"print('dbeta error: ', rel_error(db_num, dbeta))"
"## Batch normalization: alternative backward\n",
"In the forward pass, given a set of inputs $X=\\begin{bmatrix}x_1\\\\x_2\\\\...\\\\x_N\\end{bmatrix}$, \n",
"we first calculate the mean $\\mu$ and variance $v$.\n",
"With $\\mu$ and $v$ calculated, we can calculate the standard deviation $\\sigma$ and normalized data $Y$.\n",
"The equations and graph illustration below describe the computation ($y_i$ is the i-th element of the vector $Y$).\n",
"\\begin{align}\n",
"& \\mu=\\frac{1}{N}\\sum_{k=1}^N x_k & v=\\frac{1}{N}\\sum_{k=1}^N (x_k-\\mu)^2 \\\\\n",
"& \\sigma=\\sqrt{v+\\epsilon} & y_i=\\frac{x_i-\\mu}{\\sigma}\n",
"\\end{align}"
"<img src=\"notebook_images/batchnorm_graph.png\" width=691 height=202>"
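As a small worked instance of the four equations for $\mu$, $v$, $\sigma$ and $y_i$, the snippet below evaluates them for a single feature with $N=4$ inputs. The input values and the value of $\epsilon$ are arbitrary choices made for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # x_1 .. x_N with N = 4
eps = 1e-5                            # assumed value of epsilon

mu = x.mean()                         # mu = (1/N) * sum_k x_k
v = ((x - mu) ** 2).mean()            # v = (1/N) * sum_k (x_k - mu)^2
sigma = np.sqrt(v + eps)              # sigma = sqrt(v + eps)
y = (x - mu) / sigma                  # y_i = (x_i - mu) / sigma

print(mu, v)   # 2.5 1.25
print(y)       # symmetric around zero, roughly unit variance
```

Note that every $y_i$ depends on every $x_k$ through both $\mu$ and $\sigma$, which is why the backward pass has to sum contributions over several paths in the graph.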
"The meat of our problem during backpropagation is to compute $\\frac{\\partial L}{\\partial X}$, given the upstream gradient we receive, $\\frac{\\partial L}{\\partial Y}.$ To do this, recall the chain rule in calculus gives us $\\frac{\\partial L}{\\partial X} = \\frac{\\partial L}{\\partial Y} \\cdot \\frac{\\partial Y}{\\partial X}$.\n",
"The unknown/hard part is $\\frac{\\partial Y}{\\partial X}$. We can find this by first deriving, step by step, our local gradients at \n",
"$\\frac{\\partial v}{\\partial X}$, $\\frac{\\partial \\mu}{\\partial X}$,\n",
"$\\frac{\\partial \\sigma}{\\partial v}$, \n",
"$\\frac{\\partial Y}{\\partial \\sigma}$, and $\\frac{\\partial Y}{\\partial \\mu}$,\n",
"and then use the chain rule to compose these gradients (which appear in the form of vectors!) appropriately to compute $\\frac{\\partial Y}{\\partial X}$.\n",
"If it's challenging to directly reason about the gradients over $X$ and $Y$ which require matrix multiplication, try reasoning about the gradients in terms of individual elements $x_i$ and $y_i$ first: in that case, you will need to come up with the derivations for $\\frac{\\partial L}{\\partial x_i}$, by relying on the Chain Rule to first calculate the intermediate $\\frac{\\partial \\mu}{\\partial x_i}, \\frac{\\partial v}{\\partial x_i}, \\frac{\\partial \\sigma}{\\partial x_i},$ then assemble these pieces to calculate $\\frac{\\partial y_i}{\\partial x_i}$. \n",
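"Make sure each of the intermediate gradient derivations is simplified as far as possible, for ease of implementation. \n",
"Once you have done so, implement the simplified batch normalization backward pass in the function `batchnorm_backward_alt`, then run both backward implementations and compare the results. The two should agree, but the simplified implementation should be somewhat faster."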
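For orientation, one possible simplified form of the gradient through the normalization step alone (ignoring the learnable scale and shift) is $\frac{\partial L}{\partial x_i} = \frac{1}{\sigma}\left(\frac{\partial L}{\partial y_i} - \frac{1}{N}\sum_j \frac{\partial L}{\partial y_j} - y_i \cdot \frac{1}{N}\sum_j \frac{\partial L}{\partial y_j} y_j\right)$. The sketch below checks this expression numerically. The helper names are hypothetical, and this is one way to arrive at a compact formula rather than the required solution.

```python
import numpy as np

def normalize(x, eps=1e-5):
    """y = (x - mu) / sigma per feature (column), as in the equations above."""
    mu = x.mean(axis=0)
    sigma = np.sqrt(x.var(axis=0) + eps)
    return (x - mu) / sigma, sigma

def normalize_backward(dy, y, sigma):
    """Simplified dL/dx for the normalization step (no gamma/beta)."""
    return (dy - dy.mean(axis=0) - y * (dy * y).mean(axis=0)) / sigma

rng = np.random.default_rng(0)
x = 5 * rng.standard_normal((4, 5)) + 12
dy = rng.standard_normal((4, 5))

y, sigma = normalize(x)
dx = normalize_backward(dy, y, sigma)

# Centered-difference check of the analytic gradient
h = 1e-5
dx_num = np.zeros_like(x)
for ix in np.ndindex(*x.shape):
    old = x[ix]
    x[ix] = old + h
    pos = normalize(x)[0]
    x[ix] = old - h
    neg = normalize(x)[0]
    x[ix] = old
    dx_num[ix] = np.sum((pos - neg) * dy) / (2 * h)

print(np.max(np.abs(dx - dx_num)))  # should be very small
```

In a full layer, the upstream gradient with respect to the normalized values is obtained by multiplying the incoming gradient by the scale parameter before applying this formula.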
"N, D = 100, 500\n",
"x = 5 * np.random.randn(N, D) + 12\n",
"gamma = np.random.randn(D)\n",
"beta = np.random.randn(D)\n",
"dout = np.random.randn(N, D)\n",
"bn_param = {'mode': 'train'}\n",
"out, cache = batchnorm_forward(x, gamma, beta, bn_param)\n",
"t1 = time.time()\n",
"dx1, dgamma1, dbeta1 = batchnorm_backward(dout, cache)\n",
"t2 = time.time()\n",
"dx2, dgamma2, dbeta2 = batchnorm_backward_alt(dout, cache)\n",
"t3 = time.time()\n",
"print('dx difference: ', rel_error(dx1, dx2))\n",
"print('dgamma difference: ', rel_error(dgamma1, dgamma2))\n",
"print('dbeta difference: ', rel_error(dbeta1, dbeta2))\n",
"print('speedup: %.2fx' % ((t2 - t1) / (t3 - t2)))"
"## Fully Connected Nets with Batch Normalization\n",
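"Now that you have a working implementation of batch normalization, go back to `FullyConnectedNet` in `daseCV/classifiers/` and add batch normalization.\n",
"Concretely, when the `normalization` flag is set to `batchnorm` in the constructor, you should insert a batch normalization layer before each ReLU nonlinearity. The outputs of the last layer of the network should not be normalized.\n",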
"HINT: You might find it useful to define an additional helper layer similar to those in the file `daseCV/`. If you decide to do so, do it in the file `daseCV/classifiers/`."
"N, D, H1, H2, C = 2, 15, 20, 30, 10\n",
"X = np.random.randn(N, D)\n",
"y = np.random.randint(C, size=(N,))\n",
"# You should expect relative errors between 1e-4~1e-10 for W, \n",
"# relative errors between 1e-08~1e-10 for b,\n",
"# and relative errors between 1e-08~1e-09 for beta and gammas.\n",
"for reg in [0, 3.14]:\n",
" print('Running check with reg = ', reg)\n",
" model = FullyConnectedNet([H1, H2], input_dim=D, num_classes=C,\n",
" reg=reg, weight_scale=5e-2, dtype=np.float64,\n",
" normalization='batchnorm')\n",
" loss, grads = model.loss(X, y)\n",
" print('Initial loss: ', loss)\n",
" for name in sorted(grads):\n",
" f = lambda _: model.loss(X, y)[0]\n",
" grad_num = eval_numerical_gradient(f, model.params[name], verbose=False, h=1e-5)\n",
" print('%s relative error: %.2e' % (name, rel_error(grad_num, grads[name])))\n",
" if reg == 0: print()"
"# Batchnorm for deep networks\n",
"Run the following to train a six-layer network on a subset of 1000 training examples, both with and without batch normalization."
"# Try training a very deep net with batchnorm\n",
"hidden_dims = [100, 100, 100, 100, 100]\n",
"num_train = 1000\n",
"small_data = {\n",
" 'X_train': data['X_train'][:num_train],\n",
" 'y_train': data['y_train'][:num_train],\n",
" 'X_val': data['X_val'],\n",
" 'y_val': data['y_val'],\n",
"}\n",
"weight_scale = 2e-2\n",
"bn_model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization='batchnorm')\n",
"model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization=None)\n",
"print('Solver with batch norm:')\n",
"bn_solver = Solver(bn_model, small_data,\n",
" num_epochs=10, batch_size=50,\n",
" update_rule='adam',\n",
" optim_config={\n",
" 'learning_rate': 1e-3,\n",
" },\n",
" verbose=True, print_every=20)\n",
"bn_solver.train()\n",
"print('\\nSolver without batch norm:')\n",
"solver = Solver(model, small_data,\n",
" num_epochs=10, batch_size=50,\n",
" update_rule='adam',\n",
" optim_config={\n",
" 'learning_rate': 1e-3,\n",
" },\n",
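" verbose=True, print_every=20)\n",
"solver.train()"
"Run the following to visualize the results from the two networks trained above. You should find that using batch normalization helps the network converge faster."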
"def plot_training_history(title, label, baseline, bn_solvers, plot_fn, bl_marker='.', bn_marker='.', labels=None):\n",
" \"\"\"utility function for plotting training history\"\"\"\n",
" plt.title(title)\n",
" plt.xlabel(label)\n",
" bn_plots = [plot_fn(bn_solver) for bn_solver in bn_solvers]\n",
" bl_plot = plot_fn(baseline)\n",
" num_bn = len(bn_plots)\n",
" for i in range(num_bn):\n",
" label='with_norm'\n",
" if labels is not None:\n",
" label += str(labels[i])\n",
" plt.plot(bn_plots[i], bn_marker, label=label)\n",
" label='baseline'\n",
" if labels is not None:\n",
" label += str(labels[0])\n",
" plt.plot(bl_plot, bl_marker, label=label)\n",
" plt.legend(loc='lower center', ncol=num_bn+1) \n",
" \n",
"plt.subplot(3, 1, 1)\n",
"plot_training_history('Training loss','Iteration', solver, [bn_solver], \\\n",
" lambda x: x.loss_history, bl_marker='o', bn_marker='o')\n",
"plt.subplot(3, 1, 2)\n",
"plot_training_history('Training accuracy','Epoch', solver, [bn_solver], \\\n",
" lambda x: x.train_acc_history, bl_marker='-o', bn_marker='-o')\n",
"plt.subplot(3, 1, 3)\n",
"plot_training_history('Validation accuracy','Epoch', solver, [bn_solver], \\\n",
" lambda x: x.val_acc_history, bl_marker='-o', bn_marker='-o')\n",
"plt.gcf().set_size_inches(15, 15)\n",
"# Batch normalization and initialization\n",
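"We will now run a small experiment to study the interaction between batch normalization and weight initialization.\n",
"The code below trains 8-layer networks, both with and without batch normalization, using different scales for weight initialization.\n",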
"# Try training a very deep net with batchnorm\n",
"hidden_dims = [50, 50, 50, 50, 50, 50, 50]\n",
"num_train = 1000\n",
"small_data = {\n",
" 'X_train': data['X_train'][:num_train],\n",
" 'y_train': data['y_train'][:num_train],\n",
" 'X_val': data['X_val'],\n",
" 'y_val': data['y_val'],\n",
"}\n",
"bn_solvers_ws = {}\n",
"solvers_ws = {}\n",
"weight_scales = np.logspace(-4, 0, num=20)\n",
"for i, weight_scale in enumerate(weight_scales):\n",
" print('Running weight scale %d / %d' % (i + 1, len(weight_scales)))\n",
" bn_model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization='batchnorm')\n",
" model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization=None)\n",
" bn_solver = Solver(bn_model, small_data,\n",
" num_epochs=10, batch_size=50,\n",
" update_rule='adam',\n",
" optim_config={\n",
" 'learning_rate': 1e-3,\n",
" },\n",
" verbose=False, print_every=200)\n",
" bn_solver.train()\n",
" bn_solvers_ws[weight_scale] = bn_solver\n",
" solver = Solver(model, small_data,\n",
" num_epochs=10, batch_size=50,\n",
" update_rule='adam',\n",
" optim_config={\n",
" 'learning_rate': 1e-3,\n",
" },\n",
" verbose=False, print_every=200)\n",
" solver.train()\n",
" solvers_ws[weight_scale] = solver"
"# Plot results of weight scale experiment\n",
"best_train_accs, bn_best_train_accs = [], []\n",
"best_val_accs, bn_best_val_accs = [], []\n",
"final_train_loss, bn_final_train_loss = [], []\n",
"for ws in weight_scales:\n",
" best_train_accs.append(max(solvers_ws[ws].train_acc_history))\n",
" bn_best_train_accs.append(max(bn_solvers_ws[ws].train_acc_history))\n",
" \n",
" best_val_accs.append(max(solvers_ws[ws].val_acc_history))\n",
" bn_best_val_accs.append(max(bn_solvers_ws[ws].val_acc_history))\n",
" \n",
" final_train_loss.append(np.mean(solvers_ws[ws].loss_history[-100:]))\n",
" bn_final_train_loss.append(np.mean(bn_solvers_ws[ws].loss_history[-100:]))\n",
" \n",
"plt.subplot(3, 1, 1)\n",
"plt.title('Best val accuracy vs weight initialization scale')\n",
"plt.xlabel('Weight initialization scale')\n",
"plt.ylabel('Best val accuracy')\n",
"plt.semilogx(weight_scales, best_val_accs, '-o', label='baseline')\n",
"plt.semilogx(weight_scales, bn_best_val_accs, '-o', label='batchnorm')\n",
"plt.legend(ncol=2, loc='lower right')\n",
"plt.subplot(3, 1, 2)\n",
"plt.title('Best train accuracy vs weight initialization scale')\n",
"plt.xlabel('Weight initialization scale')\n",
"plt.ylabel('Best training accuracy')\n",
"plt.semilogx(weight_scales, best_train_accs, '-o', label='baseline')\n",
"plt.semilogx(weight_scales, bn_best_train_accs, '-o', label='batchnorm')\n",
"plt.subplot(3, 1, 3)\n",
"plt.title('Final training loss vs weight initialization scale')\n",
"plt.xlabel('Weight initialization scale')\n",
"plt.ylabel('Final training loss')\n",
"plt.semilogx(weight_scales, final_train_loss, '-o', label='baseline')\n",
"plt.semilogx(weight_scales, bn_final_train_loss, '-o', label='batchnorm')\n",
"plt.gca().set_ylim(1.0, 3.5)\n",
"plt.gcf().set_size_inches(15, 15)\n",
"## Inline Question 1:\n",
"Describe the results of this experiment. How does the scale of weight initialization affect models with and without batch normalization, and why?\n",
"## Answer:\n",
"# Batch normalization and batch size\n",
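"We will now run a small experiment to study the interaction between batch normalization and batch size.\n",
"The code below trains 6-layer networks, both with and without batch normalization, using different batch sizes.\n",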
"def run_batchsize_experiments(normalization_mode):\n",
" np.random.seed(231)\n",
" # Try training a very deep net with batchnorm\n",
" hidden_dims = [100, 100, 100, 100, 100]\n",
" num_train = 1000\n",
" small_data = {\n",
" 'X_train': data['X_train'][:num_train],\n",
" 'y_train': data['y_train'][:num_train],\n",
" 'X_val': data['X_val'],\n",
" 'y_val': data['y_val'],\n",
" }\n",
" n_epochs=10\n",
" weight_scale = 2e-2\n",
" batch_sizes = [5,10,50]\n",
" lr = 10**(-3.5)\n",
" solver_bsize = batch_sizes[0]\n",
" print('No normalization: batch size = ',solver_bsize)\n",
" model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization=None)\n",
" solver = Solver(model, small_data,\n",
" num_epochs=n_epochs, batch_size=solver_bsize,\n",
" update_rule='adam',\n",
" optim_config={\n",
" 'learning_rate': lr,\n",
" },\n",
" verbose=False)\n",
" solver.train()\n",
" \n",
" bn_solvers = []\n",
" for i in range(len(batch_sizes)):\n",
" b_size=batch_sizes[i]\n",
" print('Normalization: batch size = ',b_size)\n",
" bn_model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization=normalization_mode)\n",
" bn_solver = Solver(bn_model, small_data,\n",
" num_epochs=n_epochs, batch_size=b_size,\n",
" update_rule='adam',\n",
" optim_config={\n",
" 'learning_rate': lr,\n",
" },\n",
" verbose=False)\n",
" bn_solver.train()\n",
" bn_solvers.append(bn_solver)\n",
" \n",
" return bn_solvers, solver, batch_sizes\n",
"batch_sizes = [5,10,50]\n",
"bn_solvers_bsize, solver_bsize, batch_sizes = run_batchsize_experiments('batchnorm')"
"plt.subplot(2, 1, 1)\n",
"plot_training_history('Training accuracy (Batch Normalization)','Epoch', solver_bsize, bn_solvers_bsize, \\\n",
" lambda x: x.train_acc_history, bl_marker='-^', bn_marker='-o', labels=batch_sizes)\n",
"plt.subplot(2, 1, 2)\n",
"plot_training_history('Validation accuracy (Batch Normalization)','Epoch', solver_bsize, bn_solvers_bsize, \\\n",
" lambda x: x.val_acc_history, bl_marker='-^', bn_marker='-o', labels=batch_sizes)\n",
"plt.gcf().set_size_inches(15, 10)\n",
"## Inline Question 2:\n",
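"Describe the results of this experiment. What does it imply about the relationship between batch normalization and batch size? Why is this relationship observed?\n",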
"## Answer:\n",
"# Layer Normalization\n",
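"(In short: batch normalization depends on the batch size, but hardware limits how large a batch can be, so layer normalization was proposed. It normalizes the feature vector of a single example, with the mean and variance computed over all elements of that feature vector. See the text below and the paper for details.)\n",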
"Batch normalization has proved to be effective in making networks easier to train, but the dependency on batch size makes it less useful in complex networks which have a cap on the input batch size due to hardware limitations. \n",
"Several alternatives to batch normalization have been proposed to mitigate this problem; one such technique is Layer Normalization [2]. Instead of normalizing over the batch, we normalize over the features. In other words, when using Layer Normalization, each feature vector corresponding to a single datapoint is normalized based on the sum of all terms within that feature vector.\n",
"[2] [Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E. Hinton. \"Layer Normalization.\" stat 1050 (2016): 21.]("
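The difference from batch normalization comes down to the axis over which statistics are computed. The sketch below normalizes each row (one datapoint) of an `(N, D)` array over its `D` features. The function name and the `eps` default are assumptions made for illustration, not the assignment's reference implementation.

```python
import numpy as np

def layernorm_sketch(x, gamma, beta, eps=1e-5):
    """Layer norm on an (N, D) array: statistics are computed per datapoint."""
    mu = x.mean(axis=1, keepdims=True)      # one mean per row (datapoint)
    var = x.var(axis=1, keepdims=True)      # one variance per row
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize across the features
    return gamma * x_hat + beta             # per-feature scale and shift

rng = np.random.default_rng(0)
x = 3 * rng.standard_normal((4, 6)) + 7
out = layernorm_sketch(x, np.ones(6), np.zeros(6))
print(out.mean(axis=1), out.std(axis=1))  # per-row means near 0, stds near 1
```

Since no statistic is shared across the batch, the same computation applies at training and test time, and the result for one datapoint does not depend on which other datapoints are in the minibatch.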
"## Inline Question 3:\n",
"Which of the following data preprocessing steps is analogous to batch normalization, and which is analogous to layer normalization?\n",
"1. Scaling each image in the dataset, so that the RGB channels for each row of pixels within an image sum to 1.\n",
"2. Scaling each image in the dataset, so that the RGB channels for all pixels within an image sum to 1. \n",
"3. Subtracting the mean image of the dataset from each image in the dataset.\n",
"4. Setting all RGB values to either 0 or 1 depending on a given threshold.\n",
"## Answer:\n",
"# Layer Normalization: Implementation\n",
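"Now you will implement layer normalization. This step should be relatively straightforward, since conceptually the implementation is almost identical to that of batch normalization. One important difference is that for layer normalization we do not keep track of running moments, and the test phase is identical to the training phase: the mean and variance are computed directly for each datapoint.\n",
"* Implement `layernorm_forward` in `daseCV/`. \n",
"* Implement `layernorm_backward` in `daseCV/`.\n",
"* Modify `daseCV/classifiers/` to add layer normalization to `FullyConnectedNet`. When the `normalization` flag is set to `\"layernorm\"` in the constructor, you should insert a layer normalization layer before each ReLU nonlinearity.\n",
"Run the third cell below to run the batch size experiment on layer normalization."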
"# Check the training-time forward pass by checking means and variances\n",
"# of features both before and after layer normalization \n",
"# Simulate the forward pass for a two-layer network\n",
"N, D1, D2, D3 =4, 50, 60, 3\n",
"X = np.random.randn(N, D1)\n",
"W1 = np.random.randn(D1, D2)\n",
"W2 = np.random.randn(D2, D3)\n",
"a = np.maximum(0, X.dot(W1)).dot(W2)\n",
"print('Before layer normalization:')\n",
"print_mean_std(a, axis=1)\n",
"gamma = np.ones(D3)\n",
"beta = np.zeros(D3)\n",
"# Means should be close to zero and stds close to one\n",
"print('After layer normalization (gamma=1, beta=0)')\n",
"a_norm, _ = layernorm_forward(a, gamma, beta, {'mode': 'train'})\n",
"print_mean_std(a_norm, axis=1)\n",
"gamma = np.asarray([3.0,3.0,3.0])\n",
"beta = np.asarray([5.0,5.0,5.0])\n",
"# Now means should be close to beta and stds close to gamma\n",
"print('After layer normalization (gamma=', gamma, ', beta=', beta, ')')\n",
"a_norm, _ = layernorm_forward(a, gamma, beta, {'mode': 'train'})\n",
"print_mean_std(a_norm, axis=1)"
"# Gradient check layernorm backward pass\n",
"N, D = 4, 5\n",
"x = 5 * np.random.randn(N, D) + 12\n",
"gamma = np.random.randn(D)\n",
"beta = np.random.randn(D)\n",
"dout = np.random.randn(N, D)\n",
"ln_param = {}\n",
"fx = lambda x: layernorm_forward(x, gamma, beta, ln_param)[0]\n",
"fg = lambda a: layernorm_forward(x, a, beta, ln_param)[0]\n",
"fb = lambda b: layernorm_forward(x, gamma, b, ln_param)[0]\n",
"dx_num = eval_numerical_gradient_array(fx, x, dout)\n",
"da_num = eval_numerical_gradient_array(fg, gamma.copy(), dout)\n",
"db_num = eval_numerical_gradient_array(fb, beta.copy(), dout)\n",
"_, cache = layernorm_forward(x, gamma, beta, ln_param)\n",
"dx, dgamma, dbeta = layernorm_backward(dout, cache)\n",
"#You should expect to see relative errors between 1e-12 and 1e-8\n",
"print('dx error: ', rel_error(dx_num, dx))\n",
"print('dgamma error: ', rel_error(da_num, dgamma))\n",
"print('dbeta error: ', rel_error(db_num, dbeta))"
"# Layer Normalization and batch size\n",
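"We will now run the previous batch size experiment with layer normalization. Compared to the previous experiment, batch size should have a much smaller influence on training accuracy!"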
"ln_solvers_bsize, solver_bsize, batch_sizes = run_batchsize_experiments('layernorm')\n",
"plt.subplot(2, 1, 1)\n",
"plot_training_history('Training accuracy (Layer Normalization)','Epoch', solver_bsize, ln_solvers_bsize, \\\n",
" lambda x: x.train_acc_history, bl_marker='-^', bn_marker='-o', labels=batch_sizes)\n",
"plt.subplot(2, 1, 2)\n",
"plot_training_history('Validation accuracy (Layer Normalization)','Epoch', solver_bsize, ln_solvers_bsize, \\\n",
" lambda x: x.val_acc_history, bl_marker='-^', bn_marker='-o', labels=batch_sizes)\n",
"plt.gcf().set_size_inches(15, 10)\n",
"## Inline Question 4:\n",
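"When is layer normalization likely to not work well, and why?\n",
"1. Using it in a very deep network\n",
"2. Having a very small dimension of features\n",
"3. Having a very high regularization term\n",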
"## Answer:\n",
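"# IMPORTANT\n",
"1. Click `File -> Save` or press `control+s` to make sure the latest version of your notebook is saved to Google Drive.\n",
"2. Run the following code to make sure your `.py` files are saved back to your Google Drive."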
"import os\n",
"FOLDER_TO_SAVE = os.path.join('drive/My Drive/', FOLDERNAME)\n",
"FILES_TO_SAVE = ['daseCV/classifiers/', 'daseCV/classifiers/']\n",
"for files in FILES_TO_SAVE:\n",
" with open(os.path.join(FOLDER_TO_SAVE, '/'.join(files.split('/')[1:])), 'w') as f:\n",
" f.write(''.join(open(files).readlines()))"
assignment2/ConvolutionalNetworks.ipynb

"from google.colab import drive\n",
"drive.mount('/content/drive', force_remount=True)\n",
"# Enter the path to the daseCV folder\n",
"# The 'daseCV' folder contains '.py', 'classifiers' and 'datasets' folders\n",
"# e.g. 'CV/assignments/assignment1/daseCV/'\n",
"FOLDERNAME = None\n",
"assert FOLDERNAME is not None, \"[!] Enter the foldername.\"\n",
"%cd drive/My\\ Drive\n",
"%cp -r $FOLDERNAME ../../\n",
"%cd ../../\n",
"%cd daseCV/datasets/\n",
"%cd ../../"
"# Convolutional Networks\n",
"# As usual, a bit of setup\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from daseCV.classifiers.cnn import *\n",
"from daseCV.data_utils import get_CIFAR10_data\n",
"from daseCV.gradient_check import eval_numerical_gradient_array, eval_numerical_gradient\n",
"from daseCV.layers import *\n",
"from daseCV.fast_layers import *\n",
"from daseCV.solver import Solver\n",
"%matplotlib inline\n",
"plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n",
"plt.rcParams['image.interpolation'] = 'nearest'\n",
"plt.rcParams['image.cmap'] = 'gray'\n",
"# for auto-reloading external modules\n",
"# see\n",
"%load_ext autoreload\n",
"%autoreload 2\n",
"def rel_error(x, y):\n",
" \"\"\" returns relative error \"\"\"\n",
" return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))"
"# Load the (preprocessed) CIFAR10 data.\n",
"data = get_CIFAR10_data()\n",
"for k, v in data.items():\n",
" print('%s: ' % k, v.shape)"
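"# Convolution: Naive forward pass\n",
"The core of a convolutional network is the convolution operation. In the file `daseCV/`, implement the forward pass for the convolution layer in the function `conv_forward_naive`.\n",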
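Before writing the loops, it can help to pin down the output shape. The helper below is a hypothetical sketch that assumes the `(N, C, H, W)` input layout and `(F, C, HH, WW)` filter layout suggested by the shapes in the test cells, together with the usual rule `H_out = 1 + (H + 2 * pad - HH) / stride`.

```python
def conv_output_shape(x_shape, w_shape, stride, pad):
    """Output shape of a strided, zero-padded convolution (illustrative)."""
    N, C, H, W = x_shape          # assumed layout: batch, channels, height, width
    F, C_w, HH, WW = w_shape      # assumed layout: filters, channels, height, width
    assert C == C_w, "input and filter channel counts must match"
    assert (H + 2 * pad - HH) % stride == 0, "filter does not tile the height"
    assert (W + 2 * pad - WW) % stride == 0, "filter does not tile the width"
    H_out = 1 + (H + 2 * pad - HH) // stride
    W_out = 1 + (W + 2 * pad - WW) // stride
    return (N, F, H_out, W_out)

# Shapes from the naive forward-pass check in this notebook
print(conv_output_shape((2, 3, 4, 4), (3, 3, 4, 4), stride=2, pad=1))  # (2, 3, 2, 2)
```

The result `(2, 3, 2, 2)` matches the shape of the `correct_out` array used in the forward-pass check.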
"x_shape = (2, 3, 4, 4)\n",
"w_shape = (3, 3, 4, 4)\n",
"x = np.linspace(-0.1, 0.5, num=np.prod(x_shape)).reshape(x_shape)\n",
"w = np.linspace(-0.2, 0.3, num=np.prod(w_shape)).reshape(w_shape)\n",
"b = np.linspace(-0.1, 0.2, num=3)\n",
"conv_param = {'stride': 2, 'pad': 1}\n",
"out, _ = conv_forward_naive(x, w, b, conv_param)\n",
"correct_out = np.array([[[[-0.08759809, -0.10987781],\n",
" [-0.18387192, -0.2109216 ]],\n",
" [[ 0.21027089, 0.21661097],\n",
" [ 0.22847626, 0.23004637]],\n",
" [[ 0.50813986, 0.54309974],\n",
" [ 0.64082444, 0.67101435]]],\n",
" [[[-0.98053589, -1.03143541],\n",
" [-1.19128892, -1.24695841]],\n",
" [[ 0.69108355, 0.66880383],\n",
" [ 0.59480972, 0.56776003]],\n",
" [[ 2.36270298, 2.36904306],\n",
" [ 2.38090835, 2.38247847]]]])\n",
"# Compare your output to ours; difference should be around e-8\n",
"print('Testing conv_forward_naive')\n",
"print('difference: ', rel_error(out, correct_out))"
"# Aside: Image processing via convolutions\n",
"from imageio import imread\n",
"from PIL import Image\n",
"kitten = imread('notebook_images/kitten.jpg')\n",
"puppy = imread('notebook_images/puppy.jpg')\n",
"# kitten is wide, and puppy is already square\n",
"d = kitten.shape[1] - kitten.shape[0]\n",
"kitten_cropped = kitten[:, d//2:-d//2, :]\n",
"img_size = 200 # Make this smaller if it runs too slow\n",
"resized_puppy = np.array(Image.fromarray(puppy).resize((img_size, img_size)))\n",
"resized_kitten = np.array(Image.fromarray(kitten_cropped).resize((img_size, img_size)))\n",
"x = np.zeros((2, 3, img_size, img_size))\n",
"x[0, :, :, :] = resized_puppy.transpose((2, 0, 1))\n",
"x[1, :, :, :] = resized_kitten.transpose((2, 0, 1))\n",
"# Set up a convolutional weights holding 2 filters, each 3x3\n",
"w = np.zeros((2, 3, 3, 3))\n",
"# The first filter converts the image to grayscale.\n",
"# Set up the red, green, and blue channels of the filter.\n",
"w[0, 0, :, :] = [[0, 0, 0], [0, 0.3, 0], [0, 0, 0]]\n",
"w[0, 1, :, :] = [[0, 0, 0], [0, 0.6, 0], [0, 0, 0]]\n",
"w[0, 2, :, :] = [[0, 0, 0], [0, 0.1, 0], [0, 0, 0]]\n",
"# Second filter detects horizontal edges in the blue channel.\n",
"w[1, 2, :, :] = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]\n",
"# Vector of biases. We don't need any bias for the grayscale\n",
"# filter, but for the edge detection filter we want to add 128\n",
"# to each output so that nothing is negative.\n",
"b = np.array([0, 128])\n",
"# Compute the result of convolving each input in x with each filter in w,\n",
"# offsetting by b, and storing the results in out.\n",
"out, _ = conv_forward_naive(x, w, b, {'stride': 1, 'pad': 1})\n",
"def imshow_no_ax(img, normalize=True):\n",
" \"\"\" Tiny helper to show images as uint8 and remove axis labels \"\"\"\n",
" if normalize:\n",
" img_max, img_min = np.max(img), np.min(img)\n",
" img = 255.0 * (img - img_min) / (img_max - img_min)\n",
" plt.imshow(img.astype('uint8'))\n",
" plt.gca().axis('off')\n",
"# Show the original images and the results of the conv operation\n",
"plt.subplot(2, 3, 1)\n",
"imshow_no_ax(puppy, normalize=False)\n",
"plt.title('Original image')\n",
"plt.subplot(2, 3, 2)\n",
"imshow_no_ax(out[0, 0])\n",
"plt.subplot(2, 3, 3)\n",
"imshow_no_ax(out[0, 1])\n",
"plt.subplot(2, 3, 4)\n",
"imshow_no_ax(kitten_cropped, normalize=False)\n",
"plt.subplot(2, 3, 5)\n",
"imshow_no_ax(out[1, 0])\n",
"plt.subplot(2, 3, 6)\n",
"imshow_no_ax(out[1, 1])\n",
"# Convolution: Naive backward pass\n",
"x = np.random.randn(4, 3, 5, 5)\n",
"w = np.random.randn(2, 3, 3, 3)\n",
"b = np.random.randn(2,)\n",
"dout = np.random.randn(4, 2, 5, 5)\n",
"conv_param = {'stride': 1, 'pad': 1}\n",
"dx_num = eval_numerical_gradient_array(lambda x: conv_forward_naive(x, w, b, conv_param)[0], x, dout)\n",
"dw_num = eval_numerical_gradient_array(lambda w: conv_forward_naive(x, w, b, conv_param)[0], w, dout)\n",
"db_num = eval_numerical_gradient_array(lambda b: conv_forward_naive(x, w, b, conv_param)[0], b, dout)\n",
"out, cache = conv_forward_naive(x, w, b, conv_param)\n",
"dx, dw, db = conv_backward_naive(dout, cache)\n",
"# Your errors should be around e-8 or less.\n",
"print('Testing conv_backward_naive function')\n",
"print('dx error: ', rel_error(dx, dx_num))\n",
"print('dw error: ', rel_error(dw, dw_num))\n",
"print('db error: ', rel_error(db, db_num))"
"# Max-Pooling: Naive forward pass\n",
"x_shape = (2, 3, 4, 4)\n",
"x = np.linspace(-0.3, 0.4, num=np.prod(x_shape)).reshape(x_shape)\n",
"pool_param = {'pool_width': 2, 'pool_height': 2, 'stride': 2}\n",
"out, _ = max_pool_forward_naive(x, pool_param)\n",
"correct_out = np.array([[[[-0.26315789, -0.24842105],\n",
" [-0.20421053, -0.18947368]],\n",
" [[-0.14526316, -0.13052632],\n",
" [-0.08631579, -0.07157895]],\n",
" [[-0.02736842, -0.01263158],\n",
" [ 0.03157895, 0.04631579]]],\n",
" [[[ 0.09052632, 0.10526316],\n",
" [ 0.14947368, 0.16421053]],\n",
" [[ 0.20842105, 0.22315789],\n",
" [ 0.26736842, 0.28210526]],\n",
" [[ 0.32631579, 0.34105263],\n",
" [ 0.38526316, 0.4 ]]]])\n",
"# Compare your output with ours. Difference should be on the order of e-8.\n",
"print('Testing max_pool_forward_naive function:')\n",
"print('difference: ', rel_error(out, correct_out))"
"# Max-Pooling: Naive backward pass\n",
"x = np.random.randn(3, 2, 8, 8)\n",
"dout = np.random.randn(3, 2, 4, 4)\n",
"pool_param = {'pool_height': 2, 'pool_width': 2, 'stride': 2}\n",
"dx_num = eval_numerical_gradient_array(lambda x: max_pool_forward_naive(x, pool_param)[0], x, dout)\n",
"out, cache = max_pool_forward_naive(x, pool_param)\n",
"dx = max_pool_backward_naive(dout, cache)\n",
"# Your error should be on the order of e-12\n",
"print('Testing max_pool_backward_naive function:')\n",
"print('dx error: ', rel_error(dx, dx_num))"
"# Fast layers\n",
"python build_ext --inplace\n",
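"**NOTE:** The fast implementation of pooling only performs at its best when the pooling regions are non-overlapping and tile the input. If these conditions are not met, the fast pooling implementation will not be much faster than the naive implementation.\n",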
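To see why non-overlapping, tiling regions matter, note that in that case pooling can be expressed as a reshape followed by a reduction, with no explicit loops. The sketch below illustrates the idea; it is an assumption about how such a fast path can work, not a description of the actual daseCV fast layer, and the function name is hypothetical.

```python
import numpy as np

def max_pool_reshape(x, pool_height, pool_width):
    """Max pooling for non-overlapping windows that tile the input exactly."""
    N, C, H, W = x.shape
    assert H % pool_height == 0 and W % pool_width == 0, "windows must tile the input"
    # Split each spatial axis into (number of windows, window size)
    tiles = x.reshape(N, C, H // pool_height, pool_height,
                      W // pool_width, pool_width)
    # Reduce over the two within-window axes
    return tiles.max(axis=(3, 5))

x = np.arange(16, dtype=float).reshape(1, 1, 4, 4)
print(max_pool_reshape(x, 2, 2))  # [[[[ 5.  7.] [13. 15.]]]]
```

When the stride differs from the window size or the windows do not tile the input, this reshape is no longer valid and a more general (and typically slower) strategy is needed.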
"# Rel errors should be around e-9 or less\n",
"from daseCV.fast_layers import conv_forward_fast, conv_backward_fast\n",
"from time import time\n",
"x = np.random.randn(100, 3, 31, 31)\n",
"w = np.random.randn(25, 3, 3, 3)\n",
"b = np.random.randn(25,)\n",
"dout = np.random.randn(100, 25, 16, 16)\n",
"conv_param = {'stride': 2, 'pad': 1}\n",
"t0 = time()\n",
"out_naive, cache_naive = conv_forward_naive(x, w, b, conv_param)\n",
"t1 = time()\n",
"out_fast, cache_fast = conv_forward_fast(x, w, b, conv_param)\n",
"t2 = time()\n",
"print('Testing conv_forward_fast:')\n",
"print('Naive: %fs' % (t1 - t0))\n",
"print('Fast: %fs' % (t2 - t1))\n",
"print('Speedup: %fx' % ((t1 - t0) / (t2 - t1)))\n",
"print('Difference: ', rel_error(out_naive, out_fast))\n",
"t0 = time()\n",
"dx_naive, dw_naive, db_naive = conv_backward_naive(dout, cache_naive)\n",
"t1 = time()\n",
"dx_fast, dw_fast, db_fast = conv_backward_fast(dout, cache_fast)\n",
"t2 = time()\n",
"print('\\nTesting conv_backward_fast:')\n",
"print('Naive: %fs' % (t1 - t0))\n",
"print('Fast: %fs' % (t2 - t1))\n",
"print('Speedup: %fx' % ((t1 - t0) / (t2 - t1)))\n",
"print('dx difference: ', rel_error(dx_naive, dx_fast))\n",
"print('dw difference: ', rel_error(dw_naive, dw_fast))\n",
"print('db difference: ', rel_error(db_naive, db_fast))"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Relative errors should be close to 0.0\n",
"from daseCV.fast_layers import max_pool_forward_fast, max_pool_backward_fast\n",
"x = np.random.randn(100, 3, 32, 32)\n",
"dout = np.random.randn(100, 3, 16, 16)\n",
"pool_param = {'pool_height': 2, 'pool_width': 2, 'stride': 2}\n",
"t0 = time()\n",
"out_naive, cache_naive = max_pool_forward_naive(x, pool_param)\n",
"t1 = time()\n",
"out_fast, cache_fast = max_pool_forward_fast(x, pool_param)\n",
"t2 = time()\n",
"print('Testing pool_forward_fast:')\n",
"print('Naive: %fs' % (t1 - t0))\n",
"print('fast: %fs' % (t2 - t1))\n",
"print('speedup: %fx' % ((t1 - t0) / (t2 - t1)))\n",
"print('difference: ', rel_error(out_naive, out_fast))\n",
"t0 = time()\n",
"dx_naive = max_pool_backward_naive(dout, cache_naive)\n",
"t1 = time()\n",
"dx_fast = max_pool_backward_fast(dout, cache_fast)\n",
"t2 = time()\n",
"print('\\nTesting pool_backward_fast:')\n",
"print('Naive: %fs' % (t1 - t0))\n",
"print('fast: %fs' % (t2 - t1))\n",
"print('speedup: %fx' % ((t1 - t0) / (t2 - t1)))\n",
"print('dx difference: ', rel_error(dx_naive, dx_fast))"
"# 卷积 \"sandwich\" 层\n",
"from daseCV.layer_utils import conv_relu_pool_forward, conv_relu_pool_backward\n",
"x = np.random.randn(2, 3, 16, 16)\n",
"w = np.random.randn(3, 3, 3, 3)\n",
"b = np.random.randn(3,)\n",
"dout = np.random.randn(2, 3, 8, 8)\n",
"conv_param = {'stride': 1, 'pad': 1}\n",
"pool_param = {'pool_height': 2, 'pool_width': 2, 'stride': 2}\n",
"out, cache = conv_relu_pool_forward(x, w, b, conv_param, pool_param)\n",
"dx, dw, db = conv_relu_pool_backward(dout, cache)\n",
"dx_num = eval_numerical_gradient_array(lambda x: conv_relu_pool_forward(x, w, b, conv_param, pool_param)[0], x, dout)\n",
"dw_num = eval_numerical_gradient_array(lambda w: conv_relu_pool_forward(x, w, b, conv_param, pool_param)[0], w, dout)\n",
"db_num = eval_numerical_gradient_array(lambda b: conv_relu_pool_forward(x, w, b, conv_param, pool_param)[0], b, dout)\n",
"# Relative errors should be around e-8 or less\n",
"print('Testing conv_relu_pool')\n",
"print('dx error: ', rel_error(dx_num, dx))\n",
"print('dw error: ', rel_error(dw_num, dw))\n",
"print('db error: ', rel_error(db_num, db))"
"from daseCV.layer_utils import conv_relu_forward, conv_relu_backward\n",
"x = np.random.randn(2, 3, 8, 8)\n",
"w = np.random.randn(3, 3, 3, 3)\n",
"b = np.random.randn(3,)\n",
"dout = np.random.randn(2, 3, 8, 8)\n",
"conv_param = {'stride': 1, 'pad': 1}\n",
"out, cache = conv_relu_forward(x, w, b, conv_param)\n",
"dx, dw, db = conv_relu_backward(dout, cache)\n",
"dx_num = eval_numerical_gradient_array(lambda x: conv_relu_forward(x, w, b, conv_param)[0], x, dout)\n",
"dw_num = eval_numerical_gradient_array(lambda w: conv_relu_forward(x, w, b, conv_param)[0], w, dout)\n",
"db_num = eval_numerical_gradient_array(lambda b: conv_relu_forward(x, w, b, conv_param)[0], b, dout)\n",
"# Relative errors should be around e-8 or less\n",
"print('Testing conv_relu:')\n",
"print('dx error: ', rel_error(dx_num, dx))\n",
"print('dw error: ', rel_error(dw_num, dw))\n",
"print('db error: ', rel_error(db_num, db))"
"# 三层卷积网络\n",
"## 检查loss\n",
"model = ThreeLayerConvNet()\n",
"N = 50\n",
"X = np.random.randn(N, 3, 32, 32)\n",
"y = np.random.randint(10, size=N)\n",
"loss, grads = model.loss(X, y)\n",
"print('Initial loss (no regularization): ', loss)\n",
"model.reg = 0.5\n",
"loss, grads = model.loss(X, y)\n",
"print('Initial loss (with regularization): ', loss)"
"## 梯度检查\n",
"num_inputs = 2\n",
"input_dim = (3, 16, 16)\n",
"reg = 0.0\n",
"num_classes = 10\n",
"X = np.random.randn(num_inputs, *input_dim)\n",
"y = np.random.randint(num_classes, size=num_inputs)\n",
"model = ThreeLayerConvNet(num_filters=3, filter_size=3,\n",
" input_dim=input_dim, hidden_dim=7,\n",
" dtype=np.float64)\n",
"loss, grads = model.loss(X, y)\n",
"# Errors should be small, but correct implementations may have\n",
"# relative errors up to the order of e-2\n",
"for param_name in sorted(grads):\n",
" f = lambda _: model.loss(X, y)[0]\n",
" param_grad_num = eval_numerical_gradient(f, model.params[param_name], verbose=False, h=1e-6)\n",
" e = rel_error(param_grad_num, grads[param_name])\n",
" print('%s max relative error: %e' % (param_name, rel_error(param_grad_num, grads[param_name])))"
"## 小样本的过拟合\n",
"num_train = 100\n",
"small_data = {\n",
" 'X_train': data['X_train'][:num_train],\n",
" 'y_train': data['y_train'][:num_train],\n",
" 'X_val': data['X_val'],\n",
" 'y_val': data['y_val'],\n",
"model = ThreeLayerConvNet(weight_scale=1e-2)\n",
"solver = Solver(model, small_data,\n",
" num_epochs=15, batch_size=50,\n",
" update_rule='adam',\n",
" optim_config={\n",
" 'learning_rate': 1e-3,\n",
" },\n",
" verbose=True, print_every=1)\n",
"cell_type": "markdown",
"plt.subplot(2, 1, 1)\n",
"plt.plot(solver.loss_history, 'o')\n",
"plt.subplot(2, 1, 2)\n",
"plt.plot(solver.train_acc_history, '-o')\n",
"plt.plot(solver.val_acc_history, '-o')\n",
"plt.legend(['train', 'val'], loc='upper left')\n",
"## 训练网络\n",
"model = ThreeLayerConvNet(weight_scale=0.001, hidden_dim=500, reg=0.001)\n",
"solver = Solver(model, data,\n",
" num_epochs=1, batch_size=50,\n",
" update_rule='adam',\n",
" optim_config={\n",
" 'learning_rate': 1e-3,\n",
" },\n",
" verbose=True, print_every=20)\n",
"cell_type": "markdown",
"You can visualize the first-layer convolutional filters from the trained network by running the following:\n",
"from daseCV.vis_utils import visualize_grid\n",
"grid = visualize_grid(model.params['W1'].transpose(0, 2, 3, 1))\n",
"plt.gcf().set_size_inches(5, 5)\n",
"# 空间批量归一化\n",
"通常,当我们对维数为`N`的最小批进行批归一化时接受的形状为 `(N, D)`的输入,之后生成形状为`(N, D)`的输出。对于来自卷积层的数据,批归一化需要接受形状为`(N, C, H, W)`的输入,并产生形状为`(N, C, H, W)`的输出,其中`N`维度为最小批大小而 `(H, W)` 维度是特征图的大小。\n",
"如果特征图是使用卷积生成的,那么我们期望每个特征通道的两个不同图像以及同一图像内不同位置之间的统计信息例如均值、方差相对一致。毕竟每个特征通道都是由相同的卷积滤波器产生的!因此,空间批量归一化通过计算最小批维度`N`以及空间维度 `H` 和`W`的统计信息,为每个 `C`特征通道计算均值和方差。\n",
"[1] [Sergey Ioffe and Christian Szegedy, \"Batch Normalization: Accelerating Deep Network Training by Reducing\n",
"Internal Covariate Shift\", ICML 2015.]("
"## 空间批量归一化:正向传播\n",
"在文件 `daseCV/`中的`spatial_batchnorm_forward`函数里实现空间批归一化的正向传播。通过运行以下命令检查您的代码:"
"# Check the training-time forward pass by checking means and variances\n",
"# of features both before and after spatial batch normalization\n",
"N, C, H, W = 2, 3, 4, 5\n",
"x = 4 * np.random.randn(N, C, H, W) + 10\n",
"print('Before spatial batch normalization:')\n",
"print(' Shape: ', x.shape)\n",
"print(' Means: ', x.mean(axis=(0, 2, 3)))\n",
"print(' Stds: ', x.std(axis=(0, 2, 3)))\n",
"# Means should be close to zero and stds close to one\n",
"gamma, beta = np.ones(C), np.zeros(C)\n",
"bn_param = {'mode': 'train'}\n",
"out, _ = spatial_batchnorm_forward(x, gamma, beta, bn_param)\n",
"print('After spatial batch normalization:')\n",
"print(' Shape: ', out.shape)\n",
"print(' Means: ', out.mean(axis=(0, 2, 3)))\n",
"print(' Stds: ', out.std(axis=(0, 2, 3)))\n",
"# Means should be close to beta and stds close to gamma\n",
"gamma, beta = np.asarray([3, 4, 5]), np.asarray([6, 7, 8])\n",
"out, _ = spatial_batchnorm_forward(x, gamma, beta, bn_param)\n",
"print('After spatial batch normalization (nontrivial gamma, beta):')\n",
"print(' Shape: ', out.shape)\n",
"print(' Means: ', out.mean(axis=(0, 2, 3)))\n",
"print(' Stds: ', out.std(axis=(0, 2, 3)))"
"# Check the test-time forward pass by running the training-time\n",
"# forward pass many times to warm up the running averages, and then\n",
"# checking the means and variances of activations after a test-time\n",
"# forward pass.\n",
"N, C, H, W = 10, 4, 11, 12\n",
"bn_param = {'mode': 'train'}\n",
"gamma = np.ones(C)\n",
"beta = np.zeros(C)\n",
"for t in range(50):\n",
" x = 2.3 * np.random.randn(N, C, H, W) + 13\n",
" spatial_batchnorm_forward(x, gamma, beta, bn_param)\n",
"bn_param['mode'] = 'test'\n",
"x = 2.3 * np.random.randn(N, C, H, W) + 13\n",
"a_norm, _ = spatial_batchnorm_forward(x, gamma, beta, bn_param)\n",
"# Means should be close to zero and stds close to one, but will be\n",
"# noisier than training-time forward passes.\n",
"print('After spatial batch normalization (test-time):')\n",
"print(' means: ', a_norm.mean(axis=(0, 2, 3)))\n",
"print(' stds: ', a_norm.std(axis=(0, 2, 3)))"
"## 空间批量归一化:反向传播\n",
"N, C, H, W = 2, 3, 4, 5\n",
"x = 5 * np.random.randn(N, C, H, W) + 12\n",
"gamma = np.random.randn(C)\n",
"beta = np.random.randn(C)\n",
"dout = np.random.randn(N, C, H, W)\n",
"bn_param = {'mode': 'train'}\n",
"fx = lambda x: spatial_batchnorm_forward(x, gamma, beta, bn_param)[0]\n",
"fg = lambda a: spatial_batchnorm_forward(x, gamma, beta, bn_param)[0]\n",
"fb = lambda b: spatial_batchnorm_forward(x, gamma, beta, bn_param)[0]\n",
"dx_num = eval_numerical_gradient_array(fx, x, dout)\n",
"da_num = eval_numerical_gradient_array(fg, gamma, dout)\n",
"db_num = eval_numerical_gradient_array(fb, beta, dout)\n",
"#You should expect errors of magnitudes between 1e-12~1e-06\n",
"_, cache = spatial_batchnorm_forward(x, gamma, beta, bn_param)\n",
"dx, dgamma, dbeta = spatial_batchnorm_backward(dout, cache)\n",
"print('dx error: ', rel_error(dx_num, dx))\n",
"print('dgamma error: ', rel_error(da_num, dgamma))\n",
"print('dbeta error: ', rel_error(db_num, dbeta))"
"# 组归一化\n",
"在之前的notebook中,我们提到了“层归一化”是一种替代的归一化技术,它减轻了“批归一化”的批大小限制。但是,正如 [2] 的作者所观察到的,当与卷积层一起使用时,层归一化的性能不如批归一化:\n",
">With fully connected layers, all the hidden units in a layer tend to make similar contributions to the final prediction, and re-centering and rescaling the summed inputs to a layer works well. However, the assumption of similar contributions is no longer true for convolutional neural networks. The large number of the hidden units whose\n",
"receptive fields lie near the boundary of the image are rarely turned on and thus have very different\n",
"statistics from the rest of the hidden units within the same layer.\n",
"[3] 的作者提出了一种中间技术。与“层归一化”相反,在“层归一化”中您对每个数据点的整个特征进行归一化,他们建议将每个数据点一致的特征划分为G组,然后对每个组的每个数据点进行归一化。\n",
"![Comparison of normalization techniques discussed so far](notebook_images/normalization.png)\n",
"<center>**Visual comparison of the normalization techniques discussed so far (image edited from [3])**</center>\n",
"尽管在每一组中仍然存在贡献相等的假设,但作者假设这不是问题,因为在视觉识别的特征中出现了天生的分组。他们用来说明这一点的一个例子是,在传统的计算机视觉中,许多高性能的传统的特征都有明确分组在一起的术语。以Histogram of Oriented Gradients[4]为例——在计算每个空间局部块的直方图后,对每个块的直方图进行归一化处理,然后拼接在一起形成最终的特征向量。\n",
"[2] [Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E. Hinton. \"Layer Normalization.\" stat 1050 (2016): 21.](\n",
"[3] [Wu, Yuxin, and Kaiming He. \"Group Normalization.\" arXiv preprint arXiv:1803.08494 (2018).](\n",
"[4] [N. Dalal and B. Triggs. Histograms of oriented gradients for\n",
"human detection. In Computer Vision and Pattern Recognition\n",
"(CVPR), 2005.]("
"## 组归一化:正向传播\n",
"# Check the training-time forward pass by checking means and variances\n",
"# of features both before and after spatial batch normalization\n",
"N, C, H, W = 2, 6, 4, 5\n",
"G = 2\n",
"x = 4 * np.random.randn(N, C, H, W) + 10\n",
"x_g = x.reshape((N*G,-1))\n",
"print('Before spatial group normalization:')\n",
"print(' Shape: ', x.shape)\n",
"print(' Means: ', x_g.mean(axis=1))\n",
"print(' Stds: ', x_g.std(axis=1))\n",
"# Means should be close to zero and stds close to one\n",
"gamma, beta = np.ones((1,C,1,1)), np.zeros((1,C,1,1))\n",
"bn_param = {'mode': 'train'}\n",
"out, _ = spatial_groupnorm_forward(x, gamma, beta, G, bn_param)\n",
"out_g = out.reshape((N*G,-1))\n",
"print('After spatial group normalization:')\n",
"print(' Shape: ', out.shape)\n",
"print(' Means: ', out_g.mean(axis=1))\n",
"print(' Stds: ', out_g.std(axis=1))"
"## 空间组归一化:反向传播\n",
"在文件 `daseCV/`中的`spatial_groupnorm_backward`函数里实现空间批量归一化的反向传播。运行以下命令以检查您的代码:"
"N, C, H, W = 2, 6, 4, 5\n",
"G = 2\n",
"x = 5 * np.random.randn(N, C, H, W) + 12\n",
"gamma = np.random.randn(1,C,1,1)\n",
"beta = np.random.randn(1,C,1,1)\n",
"dout = np.random.randn(N, C, H, W)\n",
"gn_param = {}\n",
"fx = lambda x: spatial_groupnorm_forward(x, gamma, beta, G, gn_param)[0]\n",
"fg = lambda a: spatial_groupnorm_forward(x, gamma, beta, G, gn_param)[0]\n",
"fb = lambda b: spatial_groupnorm_forward(x, gamma, beta, G, gn_param)[0]\n",
"dx_num = eval_numerical_gradient_array(fx, x, dout)\n",
"da_num = eval_numerical_gradient_array(fg, gamma, dout)\n",
"db_num = eval_numerical_gradient_array(fb, beta, dout)\n",
"_, cache = spatial_groupnorm_forward(x, gamma, beta, G, gn_param)\n",
"dx, dgamma, dbeta = spatial_groupnorm_backward(dout, cache)\n",
"#You should expect errors of magnitudes between 1e-12~1e-07\n",
"print('dx error: ', rel_error(dx_num, dx))\n",
"print('dgamma error: ', rel_error(da_num, dgamma))\n",
"print('dbeta error: ', rel_error(db_num, dbeta))"
"# 重要\n",
"1. 点击`File -> Save`或者用`control+s`组合键,确保你最新的的notebook的作业已经保存到谷歌云。\n",
"2. 执行以下代码确保 `.py` 文件保存回你的谷歌云。"
"import os\n",
"FOLDER_TO_SAVE = os.path.join('drive/My Drive/', FOLDERNAME)\n",
"FILES_TO_SAVE = ['daseCV/classifiers/', 'daseCV/classifiers/']\n",
"for files in FILES_TO_SAVE:\n",
" with open(os.path.join(FOLDER_TO_SAVE, '/'.join(files.split('/')[1:])), 'w') as f:\n",
" f.write(''.join(open(files).readlines()))"
assignment2/Dropout.ipynb

"from google.colab import drive\n",
"drive.mount('/content/drive', force_remount=True)\n",
"# 输入daseCV所在的路径\n",
"# 'daseCV' 文件夹包括 '.py', 'classifiers' 和'datasets'文件夹\n",
"# 例如 'CV/assignments/assignment1/daseCV/'\n",
"FOLDERNAME = None\n",
"assert FOLDERNAME is not None, \"[!] Enter the foldername.\"\n",
"%cd drive/My\\ Drive\n",
"%cp -r $FOLDERNAME ../../\n",
"%cd ../../\n",
"%cd daseCV/datasets/\n",
"%cd ../../"
"# Dropout\n",
"Dropout [1] 是一种通过在正向传播中将一些输出随机设置为零,神经网络正则化的方法。在这个练习中,你将实现一个dropout层,并修改你的全连接网络使其可选择的使用dropout\n",
"[1] [Geoffrey E. Hinton et al, \"Improving neural networks by preventing co-adaptation of feature detectors\", arXiv 2012]("
"# As usual, a bit of setup\n",
"from __future__ import print_function\n",
"import time\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from daseCV.classifiers.fc_net import *\n",
"from daseCV.data_utils import get_CIFAR10_data\n",
"from daseCV.gradient_check import eval_numerical_gradient, eval_numerical_gradient_array\n",
"from daseCV.solver import Solver\n",
"%matplotlib inline\n",
"plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n",
"plt.rcParams['image.interpolation'] = 'nearest'\n",
"plt.rcParams['image.cmap'] = 'gray'\n",
"# for auto-reloading external modules\n",
"# see\n",
"%load_ext autoreload\n",
"%autoreload 2\n",
"def rel_error(x, y):\n",
" \"\"\" returns relative error \"\"\"\n",
" return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))\n"
"# Load the (preprocessed) CIFAR10 data.\n",
"data = get_CIFAR10_data()\n",
"for k, v in data.items():\n",
" print('%s: ' % k, v.shape)"
"# Dropout 正向传播\n",
"在文件 `daseCV/` 中完成dropout的正向传播过程。由于dropout在训练和测试期间的行为是不同的,因此请确保两种模式下都实现完成。\n",
"x = np.random.randn(500, 500) + 10\n",
"for p in [0.25, 0.4, 0.7]:\n",
" out, _ = dropout_forward(x, {'mode': 'train', 'p': p})\n",
" out_test, _ = dropout_forward(x, {'mode': 'test', 'p': p})\n",
" print('Running tests with p = ', p)\n",
" print('Mean of input: ', x.mean())\n",
" print('Mean of train-time output: ', out.mean())\n",
" print('Mean of test-time output: ', out_test.mean())\n",
" print('Fraction of train-time output set to zero: ', (out == 0).mean())\n",
" print('Fraction of test-time output set to zero: ', (out_test == 0).mean())\n",
" print()"
"# Dropout 反向传播\n",
"在文件 `daseCV/` 中完成dropout的反向传播。完成之后运行以下cell以对你的实现代码进行梯度检查。"
"x = np.random.randn(10, 10) + 10\n",
"dout = np.random.randn(*x.shape)\n",
"dropout_param = {'mode': 'train', 'p': 0.2, 'seed': 123}\n",
"out, cache = dropout_forward(x, dropout_param)\n",
"dx = dropout_backward(dout, cache)\n",
"dx_num = eval_numerical_gradient_array(lambda xx: dropout_forward(xx, dropout_param)[0], x, dout)\n",
"# Error should be around e-10 or less\n",
"print('dx relative error: ', rel_error(dx, dx_num))"
"## 问题 1:\n",
"如果我们不利用inverted dropout,在训练的时候直接将dropout后的值除以 `p`,会发生什么?为什么会这样呢?\n",
"## 回答:\n",
"# 全连接网络的Dropout\n",
"N, D, H1, H2, C = 2, 15, 20, 30, 10\n",
"X = np.random.randn(N, D)\n",
"y = np.random.randint(C, size=(N,))\n",
"for dropout in [1, 0.75, 0.5]:\n",
" print('Running check with dropout = ', dropout)\n",
" model = FullyConnectedNet([H1, H2], input_dim=D, num_classes=C,\n",
" weight_scale=5e-2, dtype=np.float64,\n",
" dropout=dropout, seed=123)\n",
" loss, grads = model.loss(X, y)\n",
" print('Initial loss: ', loss)\n",
" \n",
" # Relative errors should be around e-6 or less; Note that it's fine\n",
" # if for dropout=1 you have W2 error be on the order of e-5.\n",
" for name in sorted(grads):\n",
" f = lambda _: model.loss(X, y)[0]\n",
" grad_num = eval_numerical_gradient(f, model.params[name], verbose=False, h=1e-5)\n",
" print('%s relative error: %.2e' % (name, rel_error(grad_num, grads[name])))\n",
" print()"
"# 正则化实验\n",
"# Train two identical nets, one with dropout and one without\n",
"num_train = 500\n",
"small_data = {\n",
" 'X_train': data['X_train'][:num_train],\n",
" 'y_train': data['y_train'][:num_train],\n",
" 'X_val': data['X_val'],\n",
" 'y_val': data['y_val'],\n",
"solvers = {}\n",
"dropout_choices = [1, 0.25]\n",
"for dropout in dropout_choices:\n",
" model = FullyConnectedNet([500], dropout=dropout)\n",
" print(dropout)\n",
" solver = Solver(model, small_data,\n",
" num_epochs=25, batch_size=100,\n",
" update_rule='adam',\n",
" optim_config={\n",
" 'learning_rate': 5e-4,\n",
" },\n",
" verbose=True, print_every=100)\n",
" solver.train()\n",
" solvers[dropout] = solver\n",
" print()"
"# Plot train and validation accuracies of the two models\n",
"train_accs = []\n",
"val_accs = []\n",
"for dropout in dropout_choices:\n",
" solver = solvers[dropout]\n",
" train_accs.append(solver.train_acc_history[-1])\n",
" val_accs.append(solver.val_acc_history[-1])\n",
"plt.subplot(3, 1, 1)\n",
"for dropout in dropout_choices:\n",
" plt.plot(solvers[dropout].train_acc_history, 'o', label='%.2f dropout' % dropout)\n",
"plt.title('Train accuracy')\n",
"plt.legend(ncol=2, loc='lower right')\n",
" \n",
"plt.subplot(3, 1, 2)\n",
"for dropout in dropout_choices:\n",
" plt.plot(solvers[dropout].val_acc_history, 'o', label='%.2f dropout' % dropout)\n",
"plt.title('Val accuracy')\n",
"plt.legend(ncol=2, loc='lower right')\n",
"plt.gcf().set_size_inches(15, 15)\n",
"## 问题 2:\n",
"## 回答:\n",
"## 问题三 3:\n",
"## 回答:\n",
"# 重要\n",
"1. 点击`File -> Save`或者用`control+s`组合键,确保你最新的的notebook的作业已经保存到谷歌云。\n",
"2. 执行以下代码确保 `.py` 文件保存回你的谷歌云。"
"import os\n",
"FOLDER_TO_SAVE = os.path.join('drive/My Drive/', FOLDERNAME)\n",
"FILES_TO_SAVE = ['daseCV/classifiers/', 'daseCV/classifiers/']\n",
"for files in FILES_TO_SAVE:\n",
" with open(os.path.join(FOLDER_TO_SAVE, '/'.join(files.split('/')[1:])), 'w') as f:\n",
" f.write(''.join(open(files).readlines()))"
