Forced shutdown and reboot when training SSD on a GPU

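Problem description:
    I downloaded the SSD branch of Caffe from weiliu89/caffe together with the VGG network model, and I am training the model on VOC2007. When I train on the GPU, the machine abruptly powers off and reboots right after the loss for the first iteration is computed (see the screenshot below).

11.jpg

    When I train on the CPU, the shutdown and reboot do not occur. This suggests that my Caffe environment, the VGG model, and the LMDB data are all set up correctly, and that the problem is specific to GPU mode: training there always ends in a forced shutdown and reboot immediately after the first loss computation.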
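For reference, the CPU/GPU switch in Caffe is the `solver_mode` field of the solver prototxt. Below is a minimal sketch of toggling that field in the solver text so both runs use otherwise identical settings. The helper name `set_solver_mode` is hypothetical and not part of Caffe or the SSD scripts, and it assumes the field sits on a line of its own.

```python
import re

# Matches a line such as "solver_mode: GPU" (optionally indented).
_SOLVER_MODE = re.compile(r"^([ \t]*solver_mode[ \t]*:[ \t]*)\w+", re.MULTILINE)


def set_solver_mode(solver_text, mode):
    """Return solver prototxt text with solver_mode set to 'CPU' or 'GPU'.

    Hypothetical helper for illustration; it only edits text and does not
    touch Caffe itself.
    """
    if mode not in ("CPU", "GPU"):
        raise ValueError("mode must be 'CPU' or 'GPU'")
    if _SOLVER_MODE.search(solver_text):
        # Keep the original indentation and spacing, swap only the value.
        return _SOLVER_MODE.sub(lambda m: m.group(1) + mode, solver_text)
    # Field not present: append it explicitly.
    return solver_text.rstrip("\n") + "\nsolver_mode: " + mode + "\n"
```

Keeping everything else in the solver unchanged makes the CPU run and the GPU run directly comparable.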
Environment:
-Computer-

Processor        :  4x Intel(R) Core(TM) i5-6600K CPU @ 3.50GHz
Memory           :  16346MB (4223MB used)
Operating System        : Ubuntu 15.10
GPU              :  Titan X

-System-
Kernel        : Linux 4.2.0-42-generic (x86_64)
Default C Compiler        : GNU C Compiler version 5.2.1 20151010 (Ubuntu 5.2.1-22ubuntu2) 
Distribution        : Ubuntu 15.10

-Driver-
Driver Version        : 352.39
CUDA        : 7.5
cuDNN        : 5.1.13

-Display-
Resolution        : 2820x1440 pixels
OpenGL Renderer        : Gallium 0.4 on llvmpipe (LLVM 3.6, 256 bits)
X11 Vendor        :    The X.Org Foundation

Partial system log:
 
Oct 17 09:19:06 wjq-NUDT systemd[1]: Created slice user-1000.slice.
Oct 17 09:19:06 wjq-NUDT systemd[1]: Starting User Manager for UID 1000...
Oct 17 09:19:06 wjq-NUDT systemd[1]: Started Session c2 of user wjq.
Oct 17 09:19:06 wjq-NUDT systemd[1393]: Reached target Sockets.
Oct 17 09:19:06 wjq-NUDT systemd[1393]: Reached target Timers.
Oct 17 09:19:06 wjq-NUDT systemd[1393]: Reached target Paths.
Oct 17 09:19:06 wjq-NUDT systemd[1393]: Reached target Basic System.
Oct 17 09:19:06 wjq-NUDT systemd[1393]: Reached target Default.
Oct 17 09:19:06 wjq-NUDT systemd[1393]: Startup finished in 56ms.
Oct 17 09:19:06 wjq-NUDT systemd[1]: Started User Manager for UID 1000.
Oct 17 09:19:08 wjq-NUDT org.a11y.atspi.Registry[1610]: SpiRegistry daemon is running with well-known name - org.a11y.atspi.Registry
Oct 17 09:19:08 wjq-NUDT gnome-session[1603]: gnome-session-is-accelerated: llvmpipe detected.
Oct 17 09:19:09 wjq-NUDT org.gnome.ScreenSaver[1497]: ** (gnome-screensaver:1644): WARNING **: Couldn't get presence status: The name org.gnome.SessionManager was not provided by any .service files
Oct 17 09:19:09 wjq-NUDT gnome-session[1603]: gnome-keyring-daemon: insufficient process capabilities, unsecure memory might get used
Oct 17 09:19:09 wjq-NUDT gnome-session[1603]: SSH_AUTH_SOCK=/run/user/1000/keyring/ssh
Oct 17 09:19:09 wjq-NUDT gnome-session[1603]: gnome-keyring-daemon: insufficient process capabilities, unsecure memory might get used
Oct 17 09:19:09 wjq-NUDT gnome-session[1603]: SSH_AUTH_SOCK=/run/user/1000/keyring/ssh
Oct 17 09:19:09 wjq-NUDT gnome-session[1603]: gnome-keyring-daemon: insufficient process capabilities, unsecure memory might get used
Oct 17 09:19:09 wjq-NUDT gnome-session[1603]: SSH_AUTH_SOCK=/run/user/1000/keyring/ssh
Oct 17 09:19:09 wjq-NUDT rtkit-daemon[984]: Successfully made thread 1702 of process 1702 (n/a) owned by '1000' high priority at nice level -11.
Oct 17 09:19:09 wjq-NUDT rtkit-daemon[984]: Supervising 5 threads of 2 processes of 2 users.
Oct 17 09:19:11 wjq-NUDT pulseaudio[983]: [pulseaudio] bluez5-util.c: GetManagedObjects() failed: org.freedesktop.DBus.Error.TimedOut: Failed to activate service 'org.bluez': timed out
Oct 17 09:19:12 wjq-NUDT dbus[635]: [system] Activating via systemd: service name='org.freedesktop.UDisks2' unit='udisks2.service'
Oct 17 09:19:12 wjq-NUDT systemd[1]: Starting Disk Manager...
Oct 17 09:19:12 wjq-NUDT udisksd[1719]: udisks daemon version 2.1.6 starting
Oct 17 09:19:12 wjq-NUDT dbus[635]: [system] Successfully activated service 'org.freedesktop.UDisks2'
Oct 17 09:19:12 wjq-NUDT udisksd[1719]: Acquired the name org.freedesktop.UDisks2 on the system message bus
Oct 17 09:19:12 wjq-NUDT systemd[1]: Started Disk Manager.
Oct 17 09:19:12 wjq-NUDT org.gtk.Private.AfcVolumeMonitor[1497]: Volume monitor alive
Oct 17 09:19:12 wjq-NUDT dbus[635]: [system] Activating via systemd: service name='org.bluez' unit='dbus-org.bluez.service'
Oct 17 09:19:13 wjq-NUDT systemd[868]: Time has been changed
Oct 17 09:19:13 wjq-NUDT systemd[1393]: Time has been changed
Oct 17 09:19:13 wjq-NUDT systemd-timesyncd[522]: Synchronized to time server 91.189.91.157:123 (ntp.ubuntu.com).
Oct 17 09:19:13 wjq-NUDT systemd[1]: Time has been changed
Oct 17 09:19:13 wjq-NUDT rtkit-daemon[984]: Supervising 5 threads of 2 processes of 2 users.
Oct 17 09:19:13 wjq-NUDT rtkit-daemon[984]: Successfully made thread 1812 of process 1702 (n/a) owned by '1000' RT at priority 5.
Oct 17 09:19:13 wjq-NUDT rtkit-daemon[984]: Supervising 6 threads of 2 processes of 2 users.
Oct 17 09:19:14 wjq-NUDT rtkit-daemon[984]: Supervising 6 threads of 2 processes of 2 users.
Oct 17 09:19:14 wjq-NUDT rtkit-daemon[984]: Successfully made thread 1832 of process 1702 (n/a) owned by '1000' RT at priority 5.
Oct 17 09:19:14 wjq-NUDT rtkit-daemon[984]: Supervising 7 threads of 2 processes of 2 users.
Oct 17 09:19:14 wjq-NUDT rtkit-daemon[984]: Supervising 7 threads of 2 processes of 2 users.
Oct 17 09:19:14 wjq-NUDT rtkit-daemon[984]: Successfully made thread 1833 of process 1702 (n/a) owned by '1000' RT at priority 5.
Oct 17 09:19:14 wjq-NUDT rtkit-daemon[984]: Supervising 8 threads of 2 processes of 2 users.
Oct 17 09:19:14 wjq-NUDT rtkit-daemon[984]: Successfully made thread 1859 of process 1859 (n/a) owned by '1000' high priority at nice level -11.
Oct 17 09:19:14 wjq-NUDT rtkit-daemon[984]: Supervising 9 threads of 3 processes of 2 users.
Oct 17 09:19:14 wjq-NUDT pulseaudio[1859]: [pulseaudio] pid.c: Daemon already running.
Oct 17 09:19:14 wjq-NUDT gnome-session[1603]: (process:1868): indicator-application-service-WARNING **: Unable to get watcher name 'org.kde.StatusNotifierWatcher'
Oct 17 09:19:14 wjq-NUDT gnome-session[1603]: (process:1868): indicator-application-service-WARNING **: Name Lost
Oct 17 09:19:14 wjq-NUDT gnome-session[1603]: Entering running state
Oct 17 09:19:15 wjq-NUDT org.gnome.ScreenSaver[1497]: ** Message: Lost the name, shutting down.
Oct 17 09:19:15 wjq-NUDT org.freedesktop.FileManager1[1497]: (nautilus:1898): GLib-GIO-CRITICAL **: g_dbus_interface_skeleton_unexport: assertion 'interface_->priv->connections != NULL' failed
Oct 17 09:19:15 wjq-NUDT org.freedesktop.FileManager1[1497]: (nautilus:1898): GLib-GIO-CRITICAL **: g_dbus_interface_skeleton_unexport: assertion 'interface_->priv->connections != NULL' failed
Oct 17 09:19:15 wjq-NUDT org.freedesktop.FileManager1[1497]: Could not register the application: Unable to acquire bus name 'org.gnome.Nautilus'
Oct 17 09:19:15 wjq-NUDT org.freedesktop.FileManager1[1497]: (nautilus:1898): Gtk-CRITICAL **: gtk_icon_theme_get_for_screen: assertion 'GDK_IS_SCREEN (screen)' failed
Oct 17 09:19:15 wjq-NUDT org.freedesktop.FileManager1[1497]: (nautilus:1898): GLib-GObject-WARNING **: invalid (NULL) pointer instance
Oct 17 09:19:15 wjq-NUDT org.freedesktop.FileManager1[1497]: (nautilus:1898): GLib-GObject-CRITICAL **: g_signal_connect_object: assertion 'G_TYPE_CHECK_INSTANCE (instance)' failed
Oct 17 09:19:16 wjq-NUDT gnome-session[1603]: QDBusConnection: session D-Bus connection created before QCoreApplication. Application may misbehave.
Oct 17 09:19:16 wjq-NUDT gnome-session[1603]: No LSB modules are available.
Oct 17 09:19:16 wjq-NUDT gnome-session[1603]: No LSB modules are available.
Oct 17 09:19:20 wjq-NUDT gnome-session[1603]: (WARN-1931 /build/fcitx-cRl76F/fcitx-4.2.9/src/lib/fcitx-config/fcitx-config.c:922) Invalid Entry: line 150 missing '='
Oct 17 09:19:20 wjq-NUDT gnome-session[1603]: (WARN-1931 /build/fcitx-cRl76F/fcitx-4.2.9/src/lib/fcitx-config/fcitx-config.c:922) Invalid Entry: line 152 missing '='
Oct 17 09:19:20 wjq-NUDT gnome-session[1603]: (WARN-1931 /build/fcitx-cRl76F/fcitx-4.2.9/src/lib/fcitx-config/fcitx-config.c:922) Invalid Entry: line 150 missing '='
Oct 17 09:19:20 wjq-NUDT gnome-session[1603]: (WARN-1931 /build/fcitx-cRl76F/fcitx-4.2.9/src/lib/fcitx-config/fcitx-config.c:922) Invalid Entry: line 148 missing '='
Oct 17 09:19:20 wjq-NUDT gnome-session[1603]: (WARN-1931 /build/fcitx-cRl76F/fcitx-4.2.9/src/lib/fcitx-config/fcitx-config.c:922) Invalid Entry: line 152 missing '='
Oct 17 09:19:20 wjq-NUDT gnome-session[1603]: (WARN-1931 /build/fcitx-cRl76F/fcitx-4.2.9/src/lib/fcitx-config/fcitx-config.c:922) Invalid Entry: line 150 missing '='
Oct 17 09:19:20 wjq-NUDT gnome-session[1603]: message repeated 2 times: [ (WARN-1931 /build/fcitx-cRl76F/fcitx-4.2.9/src/lib/fcitx-config/fcitx-config.c:922) Invalid Entry: line 150 missing '=']
Oct 17 09:19:20 wjq-NUDT gnome-session[1603]: (WARN-1931 /build/fcitx-cRl76F/fcitx-4.2.9/src/lib/fcitx-config/fcitx-config.c:922) Invalid Entry: line 152 missing '='
Oct 17 09:19:20 wjq-NUDT gnome-session[1603]: (WARN-1931 /build/fcitx-cRl76F/fcitx-4.2.9/src/lib/fcitx-config/fcitx-config.c:922) Invalid Entry: line 154 missing '='
Oct 17 09:19:20 wjq-NUDT gnome-session[1603]: (WARN-1931 /build/fcitx-cRl76F/fcitx-4.2.9/src/lib/fcitx-config/fcitx-config.c:922) Invalid Entry: line 152 missing '='
Oct 17 09:19:21 wjq-NUDT gnome-session[1603]: Nautilus-Share-Message: Called "net usershare info" but it failed: 'net usershare' returned error 255: mkdir failed on directory /var/run/samba/msg.lock: Permission denied
Oct 17 09:19:21 wjq-NUDT gnome-session[1603]: net usershare: cannot open usershare directory /var/lib/samba/usershares. Error No such file or directory
Oct 17 09:19:21 wjq-NUDT gnome-session[1603]: Please ask your system administrator to enable user sharing.
Oct 17 09:19:25 wjq-NUDT gnome-session[1603]: (nm-applet:1869): nm-applet-WARNING **: Could not find ShellVersion property on org.gnome.Shell after 5 tries
Oct 17 09:19:34 wjq-NUDT org.gnome.zeitgeist.Engine[1497]: ** (zeitgeist-datahub:2151): WARNING **: zeitgeist-datahub.vala:229: Unable to get name "org.gnome.zeitgeist.datahub" on the bus!
Oct 17 09:19:35 wjq-NUDT systemd[1]: Starting Stop ureadahead data collection...
Oct 17 09:19:35 wjq-NUDT systemd[1]: Stopped Read required files in advance.
Oct 17 09:19:35 wjq-NUDT systemd[1]: Started Stop ureadahead data collection.
Oct 17 09:19:35 wjq-NUDT com.canonical.Unity.Scope.Home[1497]: (unity-scope-home:2178): unity-scope-home-WARNING **: platform-info.vala:112: Unable to read SIM properties: GDBus.Error:org.freedesktop.DBus.Error.ServiceUnknown: The name org.ofono was not provided by any .service files
Oct 17 09:19:35 wjq-NUDT com.canonical.Unity.Scope.LocalFiles[1497]: (process:2197): unity-files-daemon-WARNING **: folder.vala:65: Failed to read favorites: Failed to open file '/home/wjq/.gtk-bookmarks': No such file or directory
Oct 17 09:19:38 wjq-NUDT pulseaudio[1702]: [pulseaudio] bluez5-util.c: GetManagedObjects() failed: org.freedesktop.DBus.Error.TimedOut: Failed to activate service 'org.bluez': timed out
Oct 17 09:19:49 wjq-NUDT avahi-daemon[599]: Invalid response packet from host 192.168.1.104.
Oct 17 09:19:49 wjq-NUDT avahi-daemon[599]: Invalid response packet from host 10.0.10.207.
Oct 17 09:19:49 wjq-NUDT avahi-daemon[599]: Invalid response packet from host 192.168.1.104.
Oct 17 09:19:49 wjq-NUDT avahi-daemon[599]: Invalid response packet from host 10.0.10.207.
Oct 17 09:20:46 wjq-NUDT systemd[1]: Stopping User Manager for UID 119...
Oct 17 09:20:46 wjq-NUDT systemd[868]: Reached target Shutdown.
Oct 17 09:20:46 wjq-NUDT systemd[868]: Stopped target Default.
Oct 17 09:20:46 wjq-NUDT systemd[868]: Starting Exit the Session...
Oct 17 09:20:46 wjq-NUDT systemd[868]: Stopped target Basic System.
Oct 17 09:20:46 wjq-NUDT systemd[868]: Stopped target Paths.
Oct 17 09:20:46 wjq-NUDT systemd[868]: Stopped target Sockets.
Oct 17 09:20:46 wjq-NUDT systemd[868]: Stopped target Timers.
Oct 17 09:20:46 wjq-NUDT systemd[868]: Received SIGRTMIN+24 from PID 2333 (kill).
Oct 17 09:20:46 wjq-NUDT systemd[1]: Stopped User Manager for UID 119.
Oct 17 09:20:46 wjq-NUDT systemd[1]: Removed slice user-119.slice.
Oct 17 09:21:36 wjq-NUDT com.canonical.Unity.Scope.File.Gdrive[1497]: Search changed to: ''
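To check whether the log holds anything from the GPU driver before the reboot, a small filter can help. This is a minimal sketch: the function name `gpu_related_lines` is hypothetical, and the keywords are an assumption based on the `NVRM` and `Xid` prefixes that the NVIDIA kernel driver normally uses in its messages. On Ubuntu the input would typically be `/var/log/syslog` or `/var/log/kern.log`.

```python
import re

# Keywords assumed to mark NVIDIA driver messages in syslog/kern.log.
GPU_LOG_PATTERN = re.compile(r"NVRM|\bXid\b|nvidia", re.IGNORECASE)


def gpu_related_lines(log_lines):
    """Return the log lines that mention the NVIDIA driver, newline stripped."""
    return [line.rstrip("\n") for line in log_lines if GPU_LOG_PATTERN.search(line)]


if __name__ == "__main__":
    # Assumes the default Ubuntu syslog location; adjust as needed.
    with open("/var/log/syslog", errors="replace") as f:
        for hit in gpu_related_lines(f):
            print(hit)
```

None of the lines quoted above contain these keywords, so the filter would find nothing in this excerpt. If the full log shows the same, that would be consistent with the machine powering off before the driver could log an error.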

Could anyone suggest how to approach this, or help pinpoint where the problem lies? Many thanks!
Appendix:
Full system log

 
 