Deriving the Forward and Backward Propagation Algorithms for Neural Networks

I. Objectives

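Derive the forward propagation and backpropagation algorithms for a neural network with a single hidden layer, and implement them in code (the neural network in sklearn may be used).
- Examine how the number of hidden nodes (10, 30, 100, 300, 1000) affects network performance.
- Examine how the learning rate and the number of training epochs affect network performance.
- Change the data normalization method and examine its effect on training.
- Consult references and explain what the Hebb learning rule is.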

II. Deriving Forward and Backward Propagation for a Single-Hidden-Layer Network

Reference: https://blog.csdn.net/Lucky_Go/article/details/89738286
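
As a compact summary of the derivation, here is a sketch under assumptions taken from the NumPy code later in this report: a tanh hidden layer, a single sigmoid output unit, a cross-entropy cost, and $m$ training examples stored as the columns of $X$. The symbols $\odot$ (element-wise product), $\alpha$ (learning rate), and $\mathbf{1}$ (a column vector of $m$ ones) are notation introduced here, and $dZ$ denotes the gradient of the per-example loss, so the factor $1/m$ appears only in the weight and bias gradients.

Forward propagation and cost:

$$
\begin{aligned}
Z_1 &= W_1 X + b_1, & A_1 &= \tanh(Z_1),\\
Z_2 &= W_2 A_1 + b_2, & A_2 &= \sigma(Z_2) = \frac{1}{1+e^{-Z_2}},\\
J &= -\frac{1}{m}\sum_{k=1}^{m}\Big[y^{(k)}\log a_2^{(k)} + \big(1-y^{(k)}\big)\log\big(1-a_2^{(k)}\big)\Big].
\end{aligned}
$$

Backpropagation, using $\sigma'(z)=\sigma(z)\,(1-\sigma(z))$ and $\tanh'(z)=1-\tanh^2(z)$:

$$
\begin{aligned}
dZ_2 &= A_2 - Y, & dW_2 &= \tfrac{1}{m}\, dZ_2 A_1^{T}, & db_2 &= \tfrac{1}{m}\, dZ_2 \mathbf{1},\\
dZ_1 &= \big(W_2^{T} dZ_2\big) \odot \big(1 - A_1 \odot A_1\big), & dW_1 &= \tfrac{1}{m}\, dZ_1 X^{T}, & db_1 &= \tfrac{1}{m}\, dZ_1 \mathbf{1}.
\end{aligned}
$$

Gradient descent then updates every parameter $\theta \in \{W_1, b_1, W_2, b_2\}$ by $\theta \leftarrow \theta - \alpha\, d\theta$.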

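III. Implementation

Reference: https://blog.csdn.net/zsx17/article/details/89342506

Because most neural-network code available online is written with TensorFlow, this section simply calls that library. After completing the basic requirements of the assignment, I also tried implementing a single-hidden-layer network myself (see the later part of this report).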

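1. Loading the data

# 1. Load the data
# Note: tensorflow.examples.tutorials and the placeholder/Session API used below require TensorFlow 1.x
import numpy as np
import tensorflow as tf
import tensorflow.examples.tutorials.mnist.input_data as input_data

# Read the MNIST data with one-hot labels
mnist = input_data.read_data_sets('MNIST_data/', one_hot=True)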

2. Building the model

# 2. Build the model

# 2.1 Input layer: flattened 28x28 images and one-hot labels
x = tf.placeholder(tf.float32, [None, 784], name='X')
y = tf.placeholder(tf.float32, [None, 10], name='Y')

# 2.2 Hidden layer
# Number of hidden neurons (chosen arbitrarily)
H1_NN = 256
# Weights
W1 = tf.Variable(tf.random_normal([784, H1_NN]))
# Biases
b1 = tf.Variable(tf.zeros([H1_NN]))

Y1 = tf.nn.relu(tf.matmul(x, W1) + b1)

# 2.3 Output layer
W2 = tf.Variable(tf.random_normal([H1_NN, 10]))
b2 = tf.Variable(tf.zeros([10]))

forward = tf.matmul(Y1, W2) + b2  # logits
pred = tf.nn.softmax(forward)  # class probabilities

3. Training the model

# 3. Train the model

# 3.1 Define the loss function
# TensorFlow provides the function below, which computes the cross-entropy directly from the
# logits and so avoids the NaN that log(0) would otherwise cause
loss_function = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=forward, labels=y))
# # Cross-entropy loss written out by hand
# loss_function = tf.reduce_mean(-tf.reduce_sum(y*tf.log(pred), reduction_indices=1))

# 3.2 Set the training parameters
train_epochs = 40  # number of training epochs
batch_size = 50  # samples per training step (batch size)
# Number of batches in one epoch
total_batch = mnist.train.num_examples // batch_size
display_step = 1  # print progress every display_step epochs
learning_rate = 0.01  # learning rate

# 3.3 Choose the optimizer
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss_function)

# 3.4 Define the accuracy
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(pred, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

# 3.5 Train the model
# Record the time at which training starts
from time import time

startTime = time()

sess = tf.Session()
sess.run(tf.global_variables_initializer())

for epoch in range(train_epochs):
    for batch in range(total_batch):
        # Read one batch of training data
        xs, ys = mnist.train.next_batch(batch_size)
        # Run one training step on the batch
        sess.run(optimizer, feed_dict={x: xs, y: ys})
    # After all total_batch batches, compute loss and accuracy on the whole validation set (not batched)
    loss, acc = sess.run([loss_function, accuracy], feed_dict={x: mnist.validation.images, y: mnist.validation.labels})
    # Print progress for this epoch
    if (epoch + 1) % display_step == 0:
        print('Epoch:', '%02d' % (epoch + 1),
              'loss:', '{:.9f}'.format(loss),
              'accuracy:', '{:.4f}'.format(acc))
print('Training finished')
# Show the total run time
duration = time() - startTime
print("Total run time:", "{:.2f}".format(duration))
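
The comment in step 3.1 says that computing the loss from the logits avoids the NaN caused by log(0). The NumPy sketch below illustrates the idea outside TensorFlow; the function names `softmax_cross_entropy` and `naive_cross_entropy` and the extreme logits are made up for this illustration, and the code is not TensorFlow's actual implementation.

```python
import numpy as np


def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy computed from raw logits with the log-sum-exp trick."""
    # Subtracting the row maximum leaves the softmax unchanged but prevents overflow in exp
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-(labels * log_probs).sum(axis=1).mean())


def naive_cross_entropy(logits, labels):
    """Softmax first, then log: breaks down when a probability underflows to 0."""
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)
    with np.errstate(divide='ignore', invalid='ignore'):
        # log(0) is -inf, and 0 * -inf is NaN
        return float(-(labels * np.log(probs)).sum(axis=1).mean())


# Hypothetical extreme logits for one sample whose true class is index 1
logits = np.array([[1000.0, 0.0, -1000.0]])
labels = np.array([[0.0, 1.0, 0.0]])

stable = softmax_cross_entropy(logits, labels)
naive = naive_cross_entropy(logits, labels)
print(stable, naive)
```

The stable version stays finite because it never takes the logarithm of a probability that has already underflowed to zero.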

4. Evaluating the model

# 4. Evaluate the model on the test set
accu_test = sess.run(accuracy,
                     feed_dict={x: mnist.test.images, y: mnist.test.labels})
print('Test accuracy:', accu_test)

5. Applying the model

# 5. Apply the model
prediction_result = sess.run(tf.argmax(pred, 1), feed_dict={x: mnist.test.images})
# Show the first 10 predictions
print("First 10 predictions:", prediction_result[0:10])

# 5.1 Find the misclassified samples
compare_lists = prediction_result == np.argmax(mnist.test.labels, 1)
print(compare_lists)
err_lists = [i for i in range(len(compare_lists)) if not compare_lists[i]]
print('Misclassified images:', err_lists)
print('Total number of misclassified images:', len(err_lists))

# Define a function that prints the misclassified samples
def print_predict_errs(labels,  # one-hot label array
                       prediction):  # predicted class indices
    compare_lists = (prediction == np.argmax(labels, 1))
    err_lists = [i for i in range(len(compare_lists)) if not compare_lists[i]]
    for idx in err_lists:
        print('index=' + str(idx), 'label=', np.argmax(labels[idx]), 'prediction=', prediction[idx])
    print("Total: " + str(len(err_lists)))


print_predict_errs(labels=mnist.test.labels, prediction=prediction_result)

# Visualization
import matplotlib.pyplot as plt


def plot_images_labels_prediction(images,  # image array
                                  labels,  # one-hot label array
                                  prediction,  # predicted class indices
                                  index,  # index of the first image to show
                                  num=10):  # show 10 images by default
    fig = plt.gcf()  # get the current figure
    fig.set_size_inches(10, 12)  # size in inches (1 inch = 2.54 cm)
    if num > 25:
        num = 25  # show at most 25 subplots
    for i in range(0, num):
        ax = plt.subplot(5, 5, i + 1)  # select the subplot to draw on
        # Show the image at position index
        ax.imshow(np.reshape(images[index], (28, 28)), cmap='binary')

        # Build the title shown above the image
        title = 'label=' + str(np.argmax(labels[index]))
        if len(prediction) > 0:
            title += ",predict=" + str(prediction[index])

        # Set the title
        ax.set_title(title, fontsize=10)
        ax.set_xticks([])  # hide the axis ticks
        ax.set_yticks([])
        index += 1

    plt.show()


plot_images_labels_prediction(mnist.test.images,
                              mnist.test.labels,
                              prediction_result, 10, 25)
plot_images_labels_prediction(mnist.test.images,
                              mnist.test.labels,
                              prediction_result, 610, 20)

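6. Results

In the code above, the hidden layer has 256 nodes, the learning rate is 0.01, and training runs for 40 epochs. The training results are as follows:

Some of the classified images are shown below: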

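IV. Tuning

The model above uses 256 hidden nodes, a learning rate of 0.01, and 40 training epochs.

The following subsections tune the model along three axes: the number of hidden nodes, the learning rate, and the number of training epochs.

1. Number of hidden nodes

With 10 hidden nodes, the results are shown in the figure below:

The detailed output for 30, 100, 300, and 1000 hidden nodes is not shown; the results are summarized in the following table: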

Hidden nodes	Total run time (s)	Misclassified images	Accuracy
10	46.29	736	0.9264
30	43.46	528	0.9472
100	59.06	343	0.9657
256	84.48	249	0.9751
300	76.64	269	0.9731
1000	302.27	240	0.9760

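The table shows that accuracy generally rises as the number of hidden nodes increases, with diminishing gains: going from 10 to 100 nodes raises accuracy from 0.9264 to 0.9657, whereas going from 256 to 1000 nodes adds only about 0.001. The trend is not strictly monotonic (300 nodes scores slightly below 256), and the run time grows sharply at 1000 nodes.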
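
One way to see why the larger networks cost more to train is to count their trainable parameters. The sketch below does this for the 784-input, 10-output network used above; the helper name `parameter_count` is made up for this illustration, and parameter count is only a rough proxy for run time (the measured times in the table are not strictly proportional to it).

```python
def parameter_count(n_hidden, n_in=784, n_out=10):
    """Weights plus biases of a fully connected n_in -> n_hidden -> n_out network."""
    hidden_layer = n_in * n_hidden + n_hidden
    output_layer = n_hidden * n_out + n_out
    return hidden_layer + output_layer


for n_hidden in (10, 30, 100, 256, 300, 1000):
    print(n_hidden, parameter_count(n_hidden))
```

The count grows linearly with the hidden size (795 parameters per hidden node plus 10 output biases), so the 1000-node network has roughly one hundred times as many parameters as the 10-node one.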

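2. Learning rate

Learning rates of 0.005, 0.01, 0.02, and 0.1 were tested with 256 hidden nodes and 40 training epochs. The classification results are as follows:

Learning rate	Total run time (s)	Misclassified images	Accuracy
0.005	78.81	231	0.9769
0.01	84.48	249	0.9751
0.02	69.72	446	0.9554
0.1	73.87	2561	0.7439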

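The table shows that, over the tested range, accuracy falls as the learning rate increases, dropping sharply to 0.7439 at a rate of 0.1. Below 0.01 the gain is small: lowering the rate from 0.01 to 0.005 improves accuracy only from 0.9751 to 0.9769.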
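
A toy example can show why an overly large learning rate hurts. The sketch below runs plain gradient descent on the one-dimensional function f(w) = w², which is an assumption made purely for illustration: the experiments above use the Adam optimizer on a non-convex loss, so the specific thresholds here do not carry over.

```python
def gradient_descent(learning_rate, steps=20, w0=1.0):
    """Minimize f(w) = w**2 (gradient 2*w) and return the final |w|."""
    w = w0
    for _ in range(steps):
        w -= learning_rate * 2.0 * w
    return abs(w)


for lr in (0.1, 0.5, 0.9, 1.1):
    print(lr, gradient_descent(lr))
```

Each step multiplies w by (1 - 2 * learning_rate), so the iterate shrinks toward the minimum only while that factor has magnitude below 1; at a rate of 1.1 every step overshoots by more than it corrects and |w| grows.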

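3. Number of epochs

Training was run for 20, 40, and 100 epochs with 256 hidden nodes and a learning rate of 0.01. The classification results are as follows:

Epochs	Total run time (s)	Misclassified images	Accuracy
20	37.12	307	0.9693
40	84.48	249	0.9751
100	184.39	239	0.9761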

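The table shows that the number of epochs strongly affects the total run time, which grows roughly in proportion to it. Accuracy increases with more epochs, but the gain from 40 to 100 epochs is only 0.001; the number of hidden nodes and the learning rate remain the decisive factors for accuracy.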

4. Changing the data normalization method

Min-max normalization

`Z-score` normalization
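
The two methods named above can be sketched in NumPy as follows. This is a minimal illustration rather than the code used for the experiments: the function names, the `eps` guard against constant features, and the small sample matrix are all assumptions introduced here.

```python
import numpy as np


def min_max_normalize(X, eps=1e-12):
    """Rescale each feature (column) linearly to the range [0, 1]."""
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # eps avoids division by zero for features that never change
    return (X - col_min) / np.maximum(col_range, eps)


def z_score_normalize(X, eps=1e-12):
    """Shift each feature (column) to zero mean and scale it to unit standard deviation."""
    return (X - X.mean(axis=0)) / np.maximum(X.std(axis=0), eps)


# Hypothetical data: 3 samples (rows) with 3 features (columns); the last feature is constant
X = np.array([[0.0, 10.0, 5.0],
              [5.0, 20.0, 5.0],
              [10.0, 60.0, 5.0]])

X_minmax = min_max_normalize(X)
X_zscore = z_score_normalize(X)
print(X_minmax)
print(X_zscore.mean(axis=0), X_zscore.std(axis=0))
```

When comparing the two methods on a real dataset, the statistics (minimum, range, mean, standard deviation) would normally be computed on the training set only and then reused for the validation and test sets.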

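V. The `Hebb` Learning Rule

Reference: https://baike.baidu.com/item/Hebb%E5%AD%A6%E4%B9%A0%E8%A7%84%E5%88%99/3061563?fr=aladdin

The Hebb learning rule is an unsupervised learning rule. Learning of this kind lets a network extract the statistical properties of the training set and thereby group inputs into classes according to their similarity. This closely matches how people observe and understand the world, which to a considerable extent means classifying things by their statistical features. The Hebb rule changes a weight based only on the activation levels of the two neurons that the connection joins, so the method is also called correlation learning or parallel learning.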

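Donald Hebb (1904-1985) was a well-known Canadian physiological psychologist. The Hebb learning rule is consistent with the mechanism of the conditioned reflex and has been confirmed by neuron theory.
Pavlov's conditioned-reflex experiment: a bell is rung every time before a dog is fed. After a while the dog associates the bell with food, and later it salivates when the bell rings even if no food is given.
Inspired by this experiment, Hebb's theory holds that the connection between neurons that fire at the same time is strengthened. For example, when the bell rings one neuron fires, and the food appearing at the same time fires another nearby neuron; the connection between the two is then strengthened, recording that the two events are associated. Conversely, if two neurons never fire in synchrony, the connection between them grows weaker and weaker.
The Hebb learning rule can be written as:
$W_{ij}(t+1)=W_{ij}(t)+a \cdot y_i \cdot y_j$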

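Here $W_{ij}$ is the connection weight from neuron $j$ to neuron $i$, $y_i$ and $y_j$ are the outputs of the two neurons, and $a$ is a constant learning rate. If $y_i$ and $y_j$ are activated together, that is, both are positive, $W_{ij}$ increases. If $y_i$ is activated while $y_j$ is inhibited, that is, $y_i$ is positive and $y_j$ is negative, $W_{ij}$ decreases.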
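
The update rule can be tried out with a few lines of NumPy. In this minimal sketch the function name `hebb_update`, the three-neuron example, and the value a = 0.1 are assumptions chosen for illustration; the outer product applies the rule to every pair (i, j) at once, including the diagonal terms with i = j.

```python
import numpy as np


def hebb_update(W, y, a=0.1):
    """One Hebbian step: W[i, j] += a * y[i] * y[j] for every pair of neurons."""
    return W + a * np.outer(y, y)


W = np.zeros((3, 3))
y = np.array([1.0, 1.0, -1.0])  # neurons 0 and 1 active, neuron 2 inhibited
W = hebb_update(W, y)
print(W)
```

The weight between the two active neurons (W[0, 1]) increases, while the weight between an active and an inhibited neuron (W[0, 2]) decreases, matching the description above.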

VI. Implementing a Single-Hidden-Layer Network by Hand

Reference: https://blog.csdn.net/hellozhxy/article/details/81055391

Function that defines the network structure:

def layer_sizes(X, Y):
    n_x = X.shape[0] # size of input layer
    n_h = 4 # size of hidden layer
    n_y = Y.shape[0] # size of output layer
    return (n_x, n_h, n_y)

Parameter initialization function:

import numpy as np


def initialize_parameters(n_x, n_h, n_y):
    # Small random weights break the symmetry between hidden units; biases start at zero
    W1 = np.random.randn(n_h, n_x) * 0.01
    b1 = np.zeros((n_h, 1))
    W2 = np.random.randn(n_y, n_h) * 0.01
    b2 = np.zeros((n_y, 1))

    assert (W1.shape == (n_h, n_x))
    assert (b1.shape == (n_h, 1))
    assert (W2.shape == (n_y, n_h))
    assert (b2.shape == (n_y, 1))
    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2}

    return parameters

Forward propagation function:

def sigmoid(z):
    # Logistic function used by the output layer
    return 1 / (1 + np.exp(-z))


def forward_propagation(X, parameters):
    # Retrieve each parameter from the dictionary "parameters"
    W1 = parameters['W1']
    b1 = parameters['b1']
    W2 = parameters['W2']
    b2 = parameters['b2']
    # Implement Forward Propagation to calculate A2 (probabilities)
    Z1 = np.dot(W1, X) + b1
    A1 = np.tanh(Z1)
    Z2 = np.dot(W2, A1) + b2  # the output layer takes the hidden activations A1, not Z1
    A2 = sigmoid(Z2)
    assert(A2.shape == (1, X.shape[1]))
    cache = {"Z1": Z1,
             "A1": A1,
             "Z2": Z2,
             "A2": A2}
    return A2, cache

Cost function:

def compute_cost(A2, Y, parameters):
    m = Y.shape[1]  # number of examples
    # Compute the cross-entropy cost
    logprobs = np.multiply(np.log(A2), Y) + np.multiply(np.log(1 - A2), 1 - Y)
    cost = -1 / m * np.sum(logprobs)
    cost = float(np.squeeze(cost))  # make sure cost is a plain Python float
    assert(isinstance(cost, float))
    return cost

Backpropagation function:

def backward_propagation(parameters, cache, X, Y):
    m = X.shape[1]    
    # First, retrieve W1 and W2 from the dictionary "parameters".
    W1 = parameters['W1']
    W2 = parameters['W2']    
    # Retrieve also A1 and A2 from dictionary "cache".
    A1 = cache['A1']
    A2 = cache['A2']    
    # Backward propagation: calculate dW1, db1, dW2, db2. 
    dZ2 = A2-Y
    dW2 = 1/m * np.dot(dZ2, A1.T)
    db2 = 1/m * np.sum(dZ2, axis=1, keepdims=True)
    dZ1 = np.dot(W2.T, dZ2)*(1-np.power(A1, 2))
    dW1 = 1/m * np.dot(dZ1, X.T)
    db1 = 1/m * np.sum(dZ1, axis=1, keepdims=True)
    grads = {"dW1": dW1,
             "db1": db1,                      
             "dW2": dW2,             
             "db2": db2}   
    return grads
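
A common way to check backpropagation formulas like these is to compare them with numerical derivatives. The sketch below is self-contained and therefore re-implements the forward and backward pass in compact form; the names `cost_and_grads` and `max_gradient_error`, the random test data, and the step size `eps` are assumptions made for this illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def cost_and_grads(params, X, Y):
    """Cross-entropy cost and analytic gradients for a tanh/sigmoid network."""
    W1, b1, W2, b2 = params["W1"], params["b1"], params["W2"], params["b2"]
    m = X.shape[1]
    A1 = np.tanh(W1 @ X + b1)
    A2 = sigmoid(W2 @ A1 + b2)
    cost = float(-np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m)
    dZ2 = A2 - Y
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)
    grads = {"W1": dZ1 @ X.T / m, "b1": dZ1.sum(axis=1, keepdims=True) / m,
             "W2": dZ2 @ A1.T / m, "b2": dZ2.sum(axis=1, keepdims=True) / m}
    return cost, grads


def max_gradient_error(params, X, Y, eps=1e-6):
    """Largest gap between analytic and central-difference gradients."""
    _, grads = cost_and_grads(params, X, Y)
    worst = 0.0
    for name, value in params.items():
        for idx in np.ndindex(value.shape):
            original = value[idx]
            value[idx] = original + eps
            cost_plus, _ = cost_and_grads(params, X, Y)
            value[idx] = original - eps
            cost_minus, _ = cost_and_grads(params, X, Y)
            value[idx] = original  # restore the parameter
            numeric = (cost_plus - cost_minus) / (2 * eps)
            worst = max(worst, abs(numeric - grads[name][idx]))
    return worst


# Hypothetical small problem: 2 inputs, 4 hidden units, 1 output, 5 examples
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 5))
Y = (rng.random((1, 5)) > 0.5).astype(float)
params = {"W1": rng.normal(size=(4, 2)), "b1": np.zeros((4, 1)),
          "W2": rng.normal(size=(1, 4)), "b2": np.zeros((1, 1))}

gradient_gap = max_gradient_error(params, X, Y)
print(gradient_gap)
```

If the analytic formulas are right, the printed gap should be many orders of magnitude smaller than the gradients themselves; a large gap usually points to a mistake such as feeding the wrong activation into a layer.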

Weight update function:

def update_parameters(parameters, grads, learning_rate = 1.2):
    # Retrieve each parameter from the dictionary "parameters"
    W1 = parameters['W1']
    b1 = parameters['b1']
    W2 = parameters['W2']
    b2 = parameters['b2']    
    # Retrieve each gradient from the dictionary "grads"
    dW1 = grads['dW1']
    db1 = grads['db1']
    dW2 = grads['dW2']
    db2 = grads['db2']    
    # Update rule for each parameter
    W1 -= dW1 * learning_rate
    b1 -= db1 * learning_rate
    W2 -= dW2 * learning_rate
    b2 -= db2 * learning_rate
    parameters = {"W1": W1, 
                  "b1": b1,            
                  "W2": W2,   
                  "b2": b2}    
    return parameters

The final neural network model:

def nn_model(X, Y, n_h, num_iterations=10000, print_cost=False, learning_rate=1.2):
    np.random.seed(3)
    n_x = layer_sizes(X, Y)[0]
    n_y = layer_sizes(X, Y)[2]
    # Initialize parameters. Inputs: "n_x, n_h, n_y". Output: "parameters".
    parameters = initialize_parameters(n_x, n_h, n_y)
    # Loop (gradient descent)
    for i in range(0, num_iterations):
        # Forward propagation. Inputs: "X, parameters". Outputs: "A2, cache".
        A2, cache = forward_propagation(X, parameters)
        # Cost function. Inputs: "A2, Y, parameters". Outputs: "cost".
        cost = compute_cost(A2, Y, parameters)
        # Backpropagation. Inputs: "parameters, cache, X, Y". Outputs: "grads".
        grads = backward_propagation(parameters, cache, X, Y)
        # Gradient descent parameter update. Inputs: "parameters, grads". Outputs: "parameters".
        parameters = update_parameters(parameters, grads, learning_rate=learning_rate)
        # Print the cost every 1000 iterations
        if print_cost and i % 1000 == 0:
            print("Cost after iteration %i: %f" % (i, cost))

    return parameters
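
To see a network of this kind train end to end, the following self-contained sketch condenses the same steps (initialize, forward pass, cost, backward pass, update) into one loop and runs it on synthetic data. Everything specific here is an assumption made for illustration: the function name `train_tiny_network`, the linearly separable toy dataset, the learning rate of 0.5, and the clipping used to guard the logarithm.

```python
import numpy as np


def train_tiny_network(X, Y, n_h=4, num_iterations=2000, learning_rate=0.5, seed=3):
    """Full-batch gradient descent for a tanh hidden layer and a sigmoid output."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(size=(n_h, X.shape[0])) * 0.01
    b1 = np.zeros((n_h, 1))
    W2 = rng.normal(size=(Y.shape[0], n_h)) * 0.01
    b2 = np.zeros((Y.shape[0], 1))
    m = X.shape[1]
    costs = []
    for _ in range(num_iterations):
        # Forward pass
        A1 = np.tanh(W1 @ X + b1)
        A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))
        # Cross-entropy cost (clipping guards against log(0))
        A2c = np.clip(A2, 1e-12, 1 - 1e-12)
        costs.append(float(-np.mean(Y * np.log(A2c) + (1 - Y) * np.log(1 - A2c))))
        # Backward pass (computed before any parameter is changed)
        dZ2 = A2 - Y
        dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)
        dW2, db2 = dZ2 @ A1.T / m, dZ2.sum(axis=1, keepdims=True) / m
        dW1, db1 = dZ1 @ X.T / m, dZ1.sum(axis=1, keepdims=True) / m
        # Gradient descent update
        W1, b1 = W1 - learning_rate * dW1, b1 - learning_rate * db1
        W2, b2 = W2 - learning_rate * dW2, b2 - learning_rate * db2
    params = {"W1": W1, "b1": b1, "W2": W2, "b2": b2}
    return params, costs


def predict(params, X):
    """Return 0/1 predictions by thresholding the sigmoid output at 0.5."""
    A1 = np.tanh(params["W1"] @ X + params["b1"])
    A2 = 1.0 / (1.0 + np.exp(-(params["W2"] @ A1 + params["b2"])))
    return (A2 > 0.5).astype(float)


# Hypothetical toy data: 200 points in 2-D, labeled 1 when x1 + x2 > 0
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 200))
Y = (X[0] + X[1] > 0).astype(float).reshape(1, -1)

params, costs = train_tiny_network(X, Y)
accuracy = float(np.mean(predict(params, X) == Y))
print(costs[0], costs[-1], accuracy)
```

The examples are stored as columns, matching the (features, examples) layout that the functions above assume.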

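This article is licensed under the Attribution-NonCommercial-ShareAlike 4.0 International license (CC BY-NC-SA 4.0); please credit the source when reposting.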