自由Man

Solutions for a DataLoader That Freezes or Hangs After One Epoch During PyTorch Training

Problem description:

    Epoch 0 runs normally, but when the next epoch starts the program stops and does nothing. Pressing CTRL+C shows the following:

------Epoch 1------

^CTraceback (most recent call last):

  File "train.py", line 505, in <module>

Process Process-13:

  File "train.py", line 469, in main

    trainLoss, mIoU_train = train(epoch, seg_model, model, metrics, criterion, max_pooling_loss, train_loader, optimizer, n_classes, args.batch_size)

  File "train.py", line 315, in train

    for batch_idx, (segImages, Flows, target) in enumerate(train_data):

  File "/data/anaconda3/envs/py36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 330, in __next__

    idx, batch = self._get_batch()

  File "/data/anaconda3/envs/py36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 309, in _get_batch

    return self.data_queue.get()

  File "/data/anaconda3/envs/py36/lib/python3.6/multiprocessing/queues.py", line 335, in get

    res = self._reader.recv_bytes()

  File "/data/anaconda3/envs/py36/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes

    buf = self._recv_bytes(maxlength)

  File "/data/anaconda3/envs/py36/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes

    buf = self._recv(4)

  File "/data/anaconda3/envs/py36/lib/python3.6/multiprocessing/connection.py", line 379, in _recv

    chunk = read(handle, remaining)

KeyboardInterrupt


Recommended solutions:

  1. The previous process may not have finished before the next one starts, causing a deadlock. Pause briefly after each epoch, or after fetching each batch: time.sleep(0.003)

  2. Memory issue: try toggling the DataLoader option pin_memory=True/False

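  3. Deadlock caused by multiple worker processes: reduce the number of workers, or use just one: num_workers=0/1 (with num_workers=0, data is loaded in the main process)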

  4. A different DataLoader is being used: switch to the standard one: from torch.utils.data.dataloader import DataLoader

  5. Not enough memory available: write 8192 to /proc/sys/kernel/shmmni (the system-wide limit on the number of shared memory segments)

  6. If the script also uses OpenCV, OpenCV and PyTorch may be deadlocking each other.

    Disable OpenCV's multithreading:

    cv2.setNumThreads(0)

    cv2.ocl.setUseOpenCL(False)

  7. Check whether any opened files are left unclosed: [openfile].close()
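
A minimal sketch of how several of the suggestions above (items 1 to 4, 6 and 7) might be combined in a training script. The helper names safe_loader_kwargs, disable_opencv_threads, read_sample and run_epochs are hypothetical and do not come from the original train.py. The values shown are only conservative starting points for narrowing down the hang, not a confirmed fix.

```python
import time


def safe_loader_kwargs(batch_size, num_workers=0, pin_memory=False):
    """Hypothetical helper: conservative DataLoader settings for debugging.

    The keyword names are real DataLoader arguments; the values are
    assumptions chosen to rule out worker-related hangs first.
    """
    return {
        "batch_size": batch_size,
        "shuffle": True,
        "num_workers": num_workers,  # item 3: 0 loads data in the main process
        "pin_memory": pin_memory,    # item 2: try both True and False
    }


def disable_opencv_threads():
    """Item 6: turn off OpenCV's own threading, if OpenCV is installed."""
    try:
        import cv2
    except ImportError:
        return False
    cv2.setNumThreads(0)
    cv2.ocl.setUseOpenCL(False)
    return True


def read_sample(path):
    """Item 7: a with-block closes the file even if reading fails."""
    with open(path, "rb") as f:
        return f.read()


def run_epochs(train_dataset, num_epochs, batch_size, pause=0.003):
    """Hypothetical training loop skeleton; the model code is omitted."""
    # Item 4: import the standard DataLoader explicitly.
    from torch.utils.data.dataloader import DataLoader

    disable_opencv_threads()
    loader = DataLoader(train_dataset, **safe_loader_kwargs(batch_size))
    for epoch in range(num_epochs):
        for batch_idx, batch in enumerate(loader):
            ...  # forward and backward pass go here
            time.sleep(pause)  # item 1: brief pause after each batch
```

If the hang disappears with num_workers=0, raising the worker count again one step at a time can help show whether the worker processes are the cause.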


