Adding a batch-normalization layer makes the backpropagated gradients vanish

I've run into a problem while trying out batch normalization:

My original net converges, but as soon as I add BN layers (BatchNorm + Scale) it stops converging.

To locate the problem, I cut the net down to 3 layers, added BN after the middle layer, and ran it in debug mode. The log looks like this:
I0803 17:30 net.cpp:666] [Backward] Layer loss, bottom blob conv16 diff: 0.00189681
I0803 17:30 net.cpp:666] [Backward] Layer conv16, bottom blob conv1 diff: 0.00020805
I0803 17:30 net.cpp:677] [Backward] Layer conv16, param blob 0 diff: 0
I0803 17:30 net.cpp:677] [Backward] Layer conv16, param blob 1 diff: 124.31
I0803 17:30 net.cpp:666] [Backward] Layer relu1, bottom blob conv1 diff: 0
I0803 17:30 net.cpp:666] [Backward] Layer scale1, bottom blob conv1 diff: 0
I0803 17:30 net.cpp:677] [Backward] Layer scale1, param blob 0 diff: 0
I0803 17:30 net.cpp:677] [Backward] Layer scale1, param blob 1 diff: 0
I0803 17:30 net.cpp:666] [Backward] Layer bnorm1, bottom blob conv1 diff: 0
I0803 17:30 net.cpp:677] [Backward] Layer bnorm1, param blob 0 diff: 0
I0803 17:30 net.cpp:677] [Backward] Layer bnorm1, param blob 1 diff: 0
I0803 17:30 net.cpp:677] [Backward] Layer bnorm1, param blob 2 diff: 0
I0803 17:30 net.cpp:677] [Backward] Layer bnorm1, param blob 3 diff: 0
I0803 17:30 net.cpp:677] [Backward] Layer bnorm1, param blob 4 diff: 0
I0803 17:30 net.cpp:666] [Backward] Layer conv1, bottom blob conv0 diff: 0
I0803 17:30 net.cpp:677] [Backward] Layer conv1, param blob 0 diff: 0
I0803 17:30 net.cpp:666] [Backward] Layer relu0, bottom blob conv0 diff: 0
I0803 17:30 net.cpp:677] [Backward] Layer conv0, param blob 0 diff: 0
I0803 17:30 net.cpp:677] [Backward] Layer conv0, param blob 1 diff: 0
The above shows the backpropagated diffs. You can see that at conv16 the diff is non-zero, but as soon as it reaches relu1 it becomes 0.
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "conv0"
  top: "conv1"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 0
    decay_mult: 0
  }
  convolution_param {
    num_output: 32
    pad: 1
    kernel_size: 3
    weight_filler {
      type: "gaussian"
      std: 0.0589
    }
    bias_filler {
      type: "constant"
      value: 0
    }
    engine: CUDNN
  }
}
layer {
  name: "bnorm1"
  type: "BatchNorm"
  bottom: "conv1"
  top: "conv1"
  batch_norm_param {
    use_global_stats: false
  }
}
layer {
  name: "scale1"
  type: "Scale"
  bottom: "conv1"
  top: "conv1"
  scale_param {
    bias_term: true
  }
}
layer {
  name: "relu1"
  type: "ReLU"
  bottom: "conv1"
  top: "conv1"
}

layer {
  name: "conv16"
  type: "Convolution"
  bottom: "conv1"
  top: "conv16"
  param {
    lr_mult: 1
    decay_mult: 1
  }
This is a section from the middle of my prototxt.

Interestingly, if I comment out the BatchNorm and Scale layers, the problem goes away: the gradients backpropagate normally and the network converges.
So I'm wondering whether something in my layer definitions is wrong.

First, the BatchNorm and Scale layers use the same blob for both bottom and top. I picked this style up from the residual-net prototxts, and I think I also saw it in the thread http://www.caffecn.cn/?/question/51.
But I've also seen networks where the BatchNorm layer's input and output are different blobs (a sketch of that variant follows below). How exactly should the BN layer be used in Caffe?
Should its input and output be the same blob?
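For comparison, here is a minimal sketch of the non-in-place variant I mentioned, assuming the standard BatchNorm/Scale/ReLU layers; the blob name conv1_bn is just a placeholder I made up:

# BN writes to a new blob instead of overwriting conv1
layer {
  name: "bnorm1"
  type: "BatchNorm"
  bottom: "conv1"
  top: "conv1_bn"
  batch_norm_param {
    use_global_stats: false
  }
}
# Scale and ReLU then operate in-place on the new blob
layer {
  name: "scale1"
  type: "Scale"
  bottom: "conv1_bn"
  top: "conv1_bn"
  scale_param {
    bias_term: true
  }
}
layer {
  name: "relu1"
  type: "ReLU"
  bottom: "conv1_bn"
  top: "conv1_bn"
}
# Downstream layers (e.g. conv16) would then take conv1_bn as bottom.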

Second, is my understanding correct that training should use use_global_stats: false and testing should use use_global_stats: true? (A sketch of how I think this is usually written is below.)
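As far as I know, the usual way to switch this per phase is to define two phase-scoped copies of the BN layer (a sketch only; I'm not certain this is strictly required, since I believe recent Caffe versions also pick the default from the phase automatically):

layer {
  name: "bnorm1"
  type: "BatchNorm"
  bottom: "conv1"
  top: "conv1"
  # training: normalize with the statistics of the current mini-batch
  batch_norm_param { use_global_stats: false }
  include { phase: TRAIN }
}
layer {
  name: "bnorm1"
  type: "BatchNorm"
  bottom: "conv1"
  top: "conv1"
  # testing: normalize with the accumulated running mean/variance
  batch_norm_param { use_global_stats: true }
  include { phase: TEST }
}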

Third, have I written anything else wrong here? I mean apart from the conv16 layer, which I didn't copy in full.
Specifically, is there any serious mistake inside the Conv→BN→Scale→ReLU block?

Thanks


alex68 - generally no nonsense~


The placement of your BatchNorm and Scale layers is fine, and your understanding of use_global_stats is correct. The BN layer supports in-place computation, so the input and output can be the same blob.
When you don't use the BN layers, what do conv16's gradients look like?
I0803 17:30 net.cpp:666] [Backward] Layer loss, bottom blob conv16 diff: 0.00189681
I0803 17:30 net.cpp:666] [Backward] Layer conv16, bottom blob conv1 diff: 0.00020805
Are they also this small?

pyramide - medical image processing engineer


Thanks for the help.

I checked: without the BN layers, conv16's gradients are much larger.

This is without the BN layers:
I0803 17:01:38.139041 30274 net.cpp:666]     [Backward] Layer loss, bottom blob conv16 diff: 0.979801
I0803 17:01:38.139472 30274 net.cpp:666] [Backward] Layer conv16, bottom blob conv1 diff: 0.124307
I0803 17:01:38.139539 30274 net.cpp:677] [Backward] Layer conv16, param blob 0 diff: 9.15838
I0803 17:01:38.139598 30274 net.cpp:677] [Backward] Layer conv16, param blob 1 diff: 250.829
I0803 17:01:38.139714 30274 net.cpp:666] [Backward] Layer relu1, bottom blob conv1 diff: 0.0558361
I0803 17:01:38.140074 30274 net.cpp:666] [Backward] Layer conv1, bottom blob conv0 diff: 0.0765413
I0803 17:01:38.140161 30274 net.cpp:677] [Backward] Layer conv1, param blob 0 diff: 0.873169
I0803 17:01:38.140240 30274 net.cpp:666] [Backward] Layer relu0, bottom blob conv0 diff: 0.0393307
I0803 17:01:38.140476 30274 net.cpp:677] [Backward] Layer conv0, param blob 0 diff: 2.86087
I0803 17:01:38.140554 30274 net.cpp:677] [Backward] Layer conv0, param blob 1 diff: 8.34717
And this is with the BN layers:
I0804 10:23:44.707945  8318 net.cpp:666]     [Backward] Layer loss, bottom blob conv16 diff: 0.00189681
I0804 10:23:44.711360 8318 net.cpp:666] [Backward] Layer conv16, bottom blob conv1 diff: 0.000280446
I0804 10:23:44.711444 8318 net.cpp:677] [Backward] Layer conv16, param blob 0 diff: 0
I0804 10:23:44.711511 8318 net.cpp:677] [Backward] Layer conv16, param blob 1 diff: 124.31
I0804 10:23:44.711822 8318 net.cpp:666] [Backward] Layer relu1, bottom blob conv1 diff: 0
I0804 10:23:44.715348 8318 net.cpp:666] [Backward] Layer scale1, bottom blob conv1 diff: 0
I0804 10:23:44.715420 8318 net.cpp:677] [Backward] Layer scale1, param blob 0 diff: 0
I0804 10:23:44.715484 8318 net.cpp:677] [Backward] Layer scale1, param blob 1 diff: 0
I0804 10:23:44.715926 8318 net.cpp:666] [Backward] Layer bnorm1, bottom blob conv1 diff: 0
I0804 10:23:44.716004 8318 net.cpp:677] [Backward] Layer bnorm1, param blob 0 diff: 0
I0804 10:23:44.716068 8318 net.cpp:677] [Backward] Layer bnorm1, param blob 1 diff: 0
I0804 10:23:44.716132 8318 net.cpp:677] [Backward] Layer bnorm1, param blob 2 diff: 0
I0804 10:23:44.716223 8318 net.cpp:677] [Backward] Layer bnorm1, param blob 3 diff: 0
I0804 10:23:44.716290 8318 net.cpp:677] [Backward] Layer bnorm1, param blob 4 diff: 0
I0804 10:23:44.720715 8318 net.cpp:666] [Backward] Layer conv1, bottom blob conv0 diff: 0
I0804 10:23:44.720784 8318 net.cpp:677] [Backward] Layer conv1, param blob 0 diff: 0
I0804 10:23:44.721057 8318 net.cpp:666] [Backward] Layer relu0, bottom blob conv0 diff: 0
I0804 10:23:44.723947 8318 net.cpp:677] [Backward] Layer conv0, param blob 0 diff: 0
I0804 10:23:44.724014 8318 net.cpp:677] [Backward] Layer conv0, param blob 1 diff: 0
I think it's expected for the gradients to get smaller, since BN essentially applies a zero-mean/unit-variance normalization, so the values should be compressed.
But the real question is: why does bnorm1 have 5 param blobs?? No other layer seems to have that many! Could this point to where the error is? (See the sketch right after this paragraph.)
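I'm not sure whether 5 param blobs is expected in my Caffe build, but for reference, the ResNet-style prototxts I've seen usually freeze the BatchNorm layer's param blobs (three of them there) with lr_mult: 0, since as far as I understand they hold running statistics that are updated during the forward pass rather than learned by the solver; my layer definition above omits these param blocks. A sketch of that convention:

# Freeze the BN statistics blobs so the solver never updates them
layer {
  name: "bnorm1"
  type: "BatchNorm"
  bottom: "conv1"
  top: "conv1"
  param { lr_mult: 0 decay_mult: 0 }
  param { lr_mult: 0 decay_mult: 0 }
  param { lr_mult: 0 decay_mult: 0 }
  batch_norm_param {
    use_global_stats: false
  }
}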

In addition, I also looked at the forward pass. Same thing: the data becomes 0 once it reaches the BN layer.
I0804 10:22:42.074671  8318 net.cpp:638]     [Forward] Layer loadtestdata, top blob data data: 0.368457
I0804 10:22:42.074757 8318 net.cpp:638] [Forward] Layer loadtestdata, top blob label data: 0.514496
I0804 10:22:42.076117 8318 net.cpp:638] [Forward] Layer conv0, top blob conv0 data: 0.115678
I0804 10:22:42.076200 8318 net.cpp:650] [Forward] Layer conv0, param blob 0 data: 0.0455077
I0804 10:22:42.076273 8318 net.cpp:650] [Forward] Layer conv0, param blob 1 data: 0
I0804 10:22:42.076539 8318 net.cpp:638] [Forward] Layer relu0, top blob conv0 data: 0.0446758
I0804 10:22:42.078435 8318 net.cpp:638] [Forward] Layer conv1, top blob conv1 data: 0.0675479
I0804 10:22:42.078516 8318 net.cpp:650] [Forward] Layer conv1, param blob 0 data: 0.0470226
I0804 10:22:42.078589 8318 net.cpp:650] [Forward] Layer conv1, param blob 1 data: 0
I0804 10:22:42.079108 8318 net.cpp:638] [Forward] Layer bnorm1, top blob conv1 data: 0
I0804 10:22:42.079197 8318 net.cpp:650] [Forward] Layer bnorm1, param blob 0 data: 0
I0804 10:22:42.079270 8318 net.cpp:650] [Forward] Layer bnorm1, param blob 1 data: 0
I0804 10:22:42.079350 8318 net.cpp:650] [Forward] Layer bnorm1, param blob 2 data: 0
I0804 10:22:42.079421 8318 net.cpp:650] [Forward] Layer bnorm1, param blob 3 data: 0
I0804 10:22:42.079505 8318 net.cpp:650] [Forward] Layer bnorm1, param blob 4 data: 0
I0804 10:22:42.080267 8318 net.cpp:638] [Forward] Layer scale1, top blob conv1 data: 0
I0804 10:22:42.080345 8318 net.cpp:650] [Forward] Layer scale1, param blob 0 data: 1
I0804 10:22:42.080418 8318 net.cpp:650] [Forward] Layer scale1, param blob 1 data: 0
I0804 10:22:42.080651 8318 net.cpp:638] [Forward] Layer relu1, top blob conv1 data: 0
I0804 10:22:42.082074 8318 net.cpp:638] [Forward] Layer conv16, top blob conv16 data: 0
I0804 10:22:42.082154 8318 net.cpp:650] [Forward] Layer conv16, param blob 0 data: 0.0485365
I0804 10:22:42.082226 8318 net.cpp:650] [Forward] Layer conv16, param blob 1 data: 0
I0804 10:22:42.082675 8318 net.cpp:638] [Forward] Layer loss, top blob loss data: 42.0327
Whereas without the BN layers, the forward pass is normal:
I0803 17:01:29.700850 30274 net.cpp:638]     [Forward] Layer loadtestdata, top blob data data: 0.320584
I0803 17:01:29.700920 30274 net.cpp:638] [Forward] Layer loadtestdata, top blob label data: 0.236383
I0803 17:01:29.701556 30274 net.cpp:638] [Forward] Layer conv0, top blob conv0 data: 0.106141
I0803 17:01:29.701633 30274 net.cpp:650] [Forward] Layer conv0, param blob 0 data: 0.0467062
I0803 17:01:29.701692 30274 net.cpp:650] [Forward] Layer conv0, param blob 1 data: 0
I0803 17:01:29.701835 30274 net.cpp:638] [Forward] Layer relu0, top blob conv0 data: 0.0547961
I0803 17:01:29.702193 30274 net.cpp:638] [Forward] Layer conv1, top blob conv1 data: 0.0716117
I0803 17:01:29.702267 30274 net.cpp:650] [Forward] Layer conv1, param blob 0 data: 0.0473551
I0803 17:01:29.702327 30274 net.cpp:650] [Forward] Layer conv1, param blob 1 data: 0
I0803 17:01:29.702425 30274 net.cpp:638] [Forward] Layer relu1, top blob conv1 data: 0.0318472
I0803 17:01:29.702781 30274 net.cpp:638] [Forward] Layer conv16, top blob conv16 data: 0.0403702
I0803 17:01:29.702847 30274 net.cpp:650] [Forward] Layer conv16, param blob 0 data: 0.0474007
I0803 17:01:29.702908 30274 net.cpp:650] [Forward] Layer conv16, param blob 1 data: 0
I0803 17:01:29.703228 30274 net.cpp:638] [Forward] Layer loss, top blob loss data: 11.2245

Any help cracking this would be appreciated.

============================================================
PS:
Some of you may have noticed the anomaly in the Forward log: bnorm1 appears 6 times and scale1 appears 3 times.
In the prototxt, both of these layers take conv1 as bottom and conv1 as top, i.e. just like the relu1 layer they should be in-place layers.
But in practice the ReLU layer produces only 1 line of output, whereas the scale layer has 3 lines and the bnorm layer has 6!
I believe this is where the problem lies; at the very least it shows that the bnorm and scale layers here are not being treated as in-place layers.
How do I force in-place=true in the prototxt? I've searched for a long time and never seen such an attribute appear in any prototxt file.
