[Note] Process Crash with SIGSEGV in Distributed Training

type
Post
date
Jun 17, 2025
summary
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGSEGV
category
Practical Tips
tags
PD
Distributed Training
Property
Jun 20, 2025 01:48 AM

Error Log

Traceback (most recent call last):
  File "/nfsdata/DataSynthesis_/src/synthetic_data.py", line 2455, in test
    integrated(table)
  File "/nfsdata/DataSynthesis_/src/synthetic_data.py", line 2276, in integrated
    raise e  # re-raise so the parent process can catch the error
  File "/nfsdata/DataSynthesis_/src/synthetic_data.py", line 2270, in integrated
    synthesizer.fit()  # train the model
  File "/root/miniconda3/envs/sh_data_synthesis/lib/python3.10/site-packages/sdgx/synthesizer.py", line 327, in fit
    self.model.fit(metadata, processed_dataloader, **(model_fit_kwargs or {}))
  File "/nfsdata/DataSynthesis_/src/synthetic_data.py", line 284, in fit
    return self._fit_multi_gpu(metadata, dataloader, epochs, *args, **kwargs)
  File "/nfsdata/DataSynthesis_/src/synthetic_data.py", line 328, in _fit_multi_gpu
    mp.spawn(
  File "/root/miniconda3/envs/sh_data_synthesis/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 328, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/root/miniconda3/envs/sh_data_synthesis/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 284, in start_processes
    while not context.join():
  File "/root/miniconda3/envs/sh_data_synthesis/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 184, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGSEGV
This means that during multi-process (multi-GPU) training, child process 1 crashed with SIGSEGV (a segmentation fault), which caused the whole mp.spawn call to fail.
SIGSEGV errors most often show up in PyTorch's CUDA-related operations.
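The failure surfaces in the parent process as the ProcessExitedException raised by mp.spawn. Below is a minimal sketch of catching and inspecting that exception; the worker function and GPU list are placeholders, not the project's real code:

  import torch.multiprocessing as mp
  from torch.multiprocessing.spawn import ProcessExitedException

  def worker(rank, gpu_ids):
      ...  # placeholder training code; a SIGSEGV here kills only this child

  if __name__ == "__main__":
      try:
          mp.spawn(worker, args=([0, 2],), nprocs=2, join=True)
      except ProcessExitedException as e:
          # exit_code is negative for signal deaths, e.g. -11 for SIGSEGV
          print(f"worker {e.error_index} died, exit_code={e.exit_code}")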

Troubleshooting

1. Do the objects passed to the child processes include anything that cannot be pickled?
Log at the very start of the worker function to confirm that each process starts and receives its GPU assignment. If the log lines appear, the workers launched successfully and the passed objects are not the problem (a minimal logging sketch follows the log output below). Check passed.
2025-06-17 07:53:11.333 | INFO | __mp_main__:_train_worker:339 - Worker 0 started, pid=62587
2025-06-17 07:53:13.600 | INFO | __mp_main__:_train_worker:339 - Worker 1 started, pid=62814
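A minimal sketch of that early logging, assuming a worker signature like the _train_worker seen in the log above and loguru as the logger (both are assumptions; the project's real code is not shown here):

  import os
  import torch
  import torch.multiprocessing as mp
  from loguru import logger  # the timestamped log format above is loguru's default

  def _train_worker(rank, gpu_ids):
      # Log before touching CUDA: if this line never appears, suspect the spawn
      # arguments (e.g. something unpicklable) rather than the GPU code.
      logger.info(f"Worker {rank} started, pid={os.getpid()}")
      torch.cuda.set_device(gpu_ids[rank])
      # ... training code ...

  if __name__ == "__main__":
      gpu_ids = [0, 2]  # the cards used in this post
      mp.spawn(_train_worker, args=(gpu_ids,), nprocs=len(gpu_ids), join=True)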
2. Memory issues: check GPU memory and shared memory
  • Check shared memory: df -h /dev/shm shows its size and usage. Check passed.
    • If it is very small (e.g. only a few GB), consider enlarging it by editing /etc/fstab or remounting with mount. For data-heavy jobs, a few tens of GB is recommended. To grow it to 64G temporarily: sudo mount -o remount,size=64G /dev/shm
    • Filesystem      Size  Used Avail Use% Mounted on
      tmpfs            32G  1.2M   32G   1% /dev/shm
  • Check GPU memory: nvidia-smi shows per-card memory usage. Check passed.
    • If very little memory is free, the crash may be caused by resource contention; the code pins the job to GPUs [0, 2] (a quick check script follows the nvidia-smi output below).
    • | GPU  Name                 Persistence-M | ... |         Memory-Usage | GPU-Util  Compute M. |
      |=========================================+=======================+======================|
      |   0  NVIDIA GeForce RTX 4090        Off | ... |    393MiB / 24564MiB |      0%      Default |
      |   1  NVIDIA GeForce RTX 4090        Off | ... |  21238MiB / 24564MiB |      0%      Default |
      |   2  NVIDIA GeForce RTX 4090        Off | ... |    393MiB / 24564MiB |      0%      Default |
      |   3  NVIDIA GeForce RTX 4090        Off | ... |  20442MiB / 24564MiB |      0%      Default |
      ...
      | Processes:                                                                              |
      |  GPU   GI   CI      PID   Type   Process name                               GPU Memory |
      |=========================================================================================|
      |    0   N/A  N/A    5619      C   ...lama-box/llama-box-rpc-server               384MiB |
      |    1   N/A  N/A    5621      C   ...lama-box/llama-box-rpc-server               384MiB |
      |    1   N/A  N/A   30802      C   Model: ROOT-qwen1.5-0.5b-0                   20828MiB |
      |    2   N/A  N/A    5620      C   ...lama-box/llama-box-rpc-server               384MiB |
      |    3   N/A  N/A    5618      C   ...lama-box/llama-box-rpc-server               384MiB |
      |    3   N/A  N/A   26237      C   ...ipx/venvs/gpustack/bin/python             20044MiB |
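A small sketch that checks both resources from Python before spawning workers; the GPU list is the one used above, everything else is illustrative:

  import shutil
  import torch

  # Shared memory: same numbers as `df -h /dev/shm`
  shm = shutil.disk_usage("/dev/shm")
  print(f"/dev/shm: {shm.free / 1024**3:.1f} GiB free of {shm.total / 1024**3:.1f} GiB")

  # Free memory on the GPUs this job is pinned to
  for i in [0, 2]:
      free, total = torch.cuda.mem_get_info(i)  # returned in bytes
      print(f"GPU {i}: {free / 1024**3:.1f} GiB free of {total / 1024**3:.1f} GiB")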
3. Single-GPU test: check whether the code runs correctly on a single card. Test passed.
4. NCCL and multi-GPU communication issues
  • Check GPU communication: run nvidia-smi topo -m to see how the GPUs are connected; link types such as PIX or PXB (or, as here, PHB) mean the cards can reach each other over PCIe without issue. Check passed.
    •        GPU0  GPU1  GPU2  GPU3  CPU Affinity  NUMA Affinity  GPU NUMA ID
      GPU0    X    PHB   PHB   PHB   0-31          0              N/A
      GPU1   PHB    X    PHB   PHB   0-31          0              N/A
      GPU2   PHB   PHB    X    PHB   0-31          0              N/A
      GPU3   PHB   PHB   PHB    X    0-31          0              N/A

      Legend:
        X    = Self
        SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
        NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
        PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
        PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
        PIX  = Connection traversing at most a single PCIe bridge
        NV#  = Connection traversing a bonded set of # NVLinks
  • NCCL log check: set NCCL environment variables to get verbose logs; this produced no useful information.
    • export NCCL_DEBUG=INFO
      export NCCL_P2P_DISABLE=1
      export NCCL_IB_DISABLE=1
  • NCCL compatibility check:
    • Current combination: CUDA 12.8 + NCCL 2.21.5 + RTX 4090 + torch 2.5.1
      A combination that previously worked: CUDA 12.1 + NCCL 2.21.5 + T4 + torch 2.5.1
      This suggests an incompatibility between the CUDA version and the NCCL version.
    • Reinstall a torch/NCCL build that targets the newest supported CUDA (12.4):
      • pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
    • Setting backend='gloo' (originally 'nccl') when initializing the process group pinpointed the problem: with gloo the program ran successfully, which confirms an NCCL compatibility issue (a fuller initialization sketch follows the call below).
      • dist.init_process_group(
            backend='gloo',
            init_method='env://',
            world_size=len(self.gpu_ids),
            rank=rank,
        )
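A minimal standalone sketch of that experiment, assuming the usual env:// rendezvous over localhost (the MASTER_ADDR/MASTER_PORT values are illustrative); switching backend between 'nccl' and 'gloo' is the whole test:

  import os
  import torch.distributed as dist

  def init_dist(rank, gpu_ids, backend="gloo"):  # try "nccl" again once the stack is fixed
      os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
      os.environ.setdefault("MASTER_PORT", "29500")
      os.environ.setdefault("NCCL_DEBUG", "INFO")  # verbose NCCL logs, as above
      dist.init_process_group(
          backend=backend,
          init_method="env://",
          world_size=len(gpu_ids),
          rank=rank,
      )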

Solution

Although the code runs with the gloo backend, gloo is much slower than nccl, so when throughput matters nccl is still the backend to use.
Final approach:
  • Adjust the CUDA and NCCL versions. The current CUDA version, 12.8, is too new; try downgrading to 12.4 or 12.1.
  • Because NCCL normally ships together with torch, changing the NCCL version in practice means changing the torch version; the PyTorch website provides install commands for CUDA 12.1 and 12.4 (link).
  • Alternatively, try PyTorch 2.2.x + CUDA 11.8 (the most stable combination). A quick version check after reinstalling is sketched below.
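After reinstalling, a short sanity check (plain torch calls, nothing project-specific) shows which CUDA and NCCL the new wheel actually bundles:

  import torch

  print(torch.__version__)              # e.g. 2.5.1+cu124
  print(torch.version.cuda)             # CUDA version the wheel was built with
  print(torch.cuda.nccl.version())      # bundled NCCL, e.g. (2, 21, 5)
  print(torch.cuda.get_device_name(0))  # e.g. NVIDIA GeForce RTX 4090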
If you have any questions, please contact me.