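Hello, programming experts,
For the past few days I have been trying to run this GitHub project on my computer: improved-diffusion
In short, it trains a denoising diffusion probabilistic model (DDPM), a generative model that has become increasingly popular in recent years and can be used for various image restoration tasks (filling in missing regions, super-resolution, and so on).
I mainly want to train my own model on CelebA, a dataset of face images.
However, when I follow the Training section of the improved-diffusion README and run python scripts/image_train.py --data_dir "D:\WeiTsung\improved_diffusion_main\DDPM_train\img_align_celeba", I get the following error message: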
Traceback (most recent call last):
File "D:\WeiTsung\improved_diffusion_main\DDPM_train\scripts\image_train.py", line 87, in <module>
main()
File "D:\WeiTsung\improved_diffusion_main\DDPM_train\scripts\image_train.py", line 26, in main
dist_util.setup_dist()
File "D:\WeiTsung\improved_diffusion_main\DDPM_train\improved_diffusion\dist_util.py", line 51, in setup_dist
dist.init_process_group(backend=backend, init_method="env://")
File "C:\Users\user\anaconda3\envs\DDPM_training_v2\Lib\site-packages\torch\distributed\c10d_logger.py", line 74, in wrapper
func_return = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\anaconda3\envs\DDPM_training_v2\Lib\site-packages\torch\distributed\distributed_c10d.py", line 1145, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\anaconda3\envs\DDPM_training_v2\Lib\site-packages\torch\distributed\rendezvous.py", line 247, in _env_rendezvous_handler
store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\user\anaconda3\envs\DDPM_training_v2\Lib\site-packages\torch\distributed\rendezvous.py", line 178, in _create_c10d_store
return TCPStore(
^^^^^^^^^
RuntimeError: unmatched '}' in format string
=============================================================================
The arguments passed to TCPStore(hostname, port, world_size, start_daemon, timeout, multi_tenant=True) were:
hostname=140.117.172.120
port=2021 (differs on every run)
world_size=1
start_daemon=True
timeout=0:30:00
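Since TCPStore is a TCP-based key-value store in which clients connect to a server at hostname:port, one way to narrow the problem down is to check, without PyTorch, whether a plain TCP listener on this machine is reachable through the hostname shown above. The following is only a rough sketch using the standard library; the function name `can_listen_and_connect` is hypothetical and not part of improved-diffusion or PyTorch.

```python
import socket


def can_listen_and_connect(host: str, port: int = 0, timeout: float = 3.0) -> bool:
    """Return True if a local TCP listener can be reached through `host`.

    A server socket is bound on all interfaces, then a client connects to
    it via `host`. Any socket-level failure is reported as False.
    """
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
            server.bind(("", port))  # port 0 lets the OS pick a free port
            server.listen(1)
            actual_port = server.getsockname()[1]
            # The connection completes in the listen backlog, so no accept() is needed.
            with socket.create_connection((host, actual_port), timeout=timeout):
                return True
    except OSError:
        return False
```

Comparing `can_listen_and_connect("127.0.0.1")` with `can_listen_and_connect("140.117.172.120")` may show whether the public address itself is the obstacle (for example because of a firewall). If only the loopback address works, the cause is more likely the network setup than the training code.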
Below are the packages, with their versions, installed in the anaconda environment where I run the program:
# Name Version Build Channel
blas 1.0 mkl
blobfile 2.1.1 pypi_0 pypi
brotli-python 1.0.9 py311hd77b12b_7
bzip2 1.0.8 he774522_0
ca-certificates 2023.08.22 haa95532_0
certifi 2023.11.17 py311haa95532_0
cffi 1.16.0 py311h2bbff1b_0
charset-normalizer 2.0.4 pyhd3eb1b0_0
cryptography 41.0.3 py311h89fc84f_0
cuda-cccl 12.3.101 0 nvidia
cuda-cudart 12.1.105 0 nvidia
cuda-cudart-dev 12.1.105 0 nvidia
cuda-cupti 12.1.105 0 nvidia
cuda-libraries 12.1.0 0 nvidia
cuda-libraries-dev 12.1.0 0 nvidia
cuda-nvrtc 12.1.105 0 nvidia
cuda-nvrtc-dev 12.1.105 0 nvidia
cuda-nvtx 12.1.105 0 nvidia
cuda-opencl 12.3.101 0 nvidia
cuda-opencl-dev 12.3.101 0 nvidia
cuda-profiler-api 12.3.101 0 nvidia
cuda-runtime 12.1.0 0 nvidia
filelock 3.13.1 py311haa95532_0
freetype 2.12.1 ha860e81_0
giflib 5.2.1 h8cc25b3_3
gmpy2 2.1.2 py311h7f96b67_0
idna 3.4 py311haa95532_0
intel-openmp 2023.1.0 h59b6b97_46320
jinja2 3.1.2 py311haa95532_0
jpeg 9e h2bbff1b_1
lerc 3.0 hd77b12b_0
libcublas 12.1.0.26 0 nvidia
libcublas-dev 12.1.0.26 0 nvidia
libcufft 11.0.2.4 0 nvidia
libcufft-dev 11.0.2.4 0 nvidia
libcurand 10.3.4.101 0 nvidia
libcurand-dev 10.3.4.101 0 nvidia
libcusolver 11.4.4.55 0 nvidia
libcusolver-dev 11.4.4.55 0 nvidia
libcusparse 12.0.2.55 0 nvidia
libcusparse-dev 12.0.2.55 0 nvidia
libdeflate 1.17 h2bbff1b_1
libffi 3.4.4 hd77b12b_0
libjpeg-turbo 2.0.0 h196d8e1_0
libnpp 12.0.2.50 0 nvidia
libnpp-dev 12.0.2.50 0 nvidia
libnvjitlink 12.1.105 0 nvidia
libnvjitlink-dev 12.1.105 0 nvidia
libnvjpeg 12.1.1.14 0 nvidia
libnvjpeg-dev 12.1.1.14 0 nvidia
libpng 1.6.39 h8cc25b3_0
libtiff 4.5.1 hd77b12b_0
libuv 1.44.2 h2bbff1b_0
libwebp 1.3.2 hbc33d0d_0
libwebp-base 1.3.2 h2bbff1b_0
lxml 4.9.3 pypi_0 pypi
lz4-c 1.9.4 h2bbff1b_0
markupsafe 2.1.1 py311h2bbff1b_0
mkl 2023.1.0 h6b88ed4_46358
mkl-service 2.4.0 py311h2bbff1b_1
mkl_fft 1.3.8 py311h2bbff1b_0
mkl_random 1.2.4 py311h59b6b97_0
mpc 1.1.0 h7edee0f_1
mpfr 4.0.2 h62dcd97_1
mpi4py 3.1.5 pypi_0 pypi
mpir 3.0.0 hec2e145_1
mpmath 1.3.0 py311haa95532_0
networkx 3.1 py311haa95532_0
numpy 1.26.2 py311hdab7c0b_0
numpy-base 1.26.2 py311hd01c5d8_0
openjpeg 2.4.0 h4fc8c34_0
openssl 3.0.12 h2bbff1b_0
pillow 10.0.1 py311h045eedc_0
pip 23.3.1 py311haa95532_0
pycparser 2.21 pyhd3eb1b0_0
pycryptodomex 3.19.0 pypi_0 pypi
pyopenssl 23.2.0 py311haa95532_0
pysocks 1.7.1 py311haa95532_0
python 3.11.5 he1021f5_0
pytorch 2.1.1 py3.11_cuda12.1_cudnn8_0 pytorch
pytorch-cuda 12.1 hde6ce7c_5 pytorch
pytorch-mutex 1.0 cuda pytorch
pyyaml 6.0.1 py311h2bbff1b_0
requests 2.31.0 py311haa95532_0
setuptools 68.0.0 py311haa95532_0
sqlite 3.41.2 h2bbff1b_0
sympy 1.12 py311haa95532_0
tbb 2021.8.0 h59b6b97_0
tk 8.6.12 h2bbff1b_0
torchaudio 2.1.1 pypi_0 pypi
torchvision 0.16.1 pypi_0 pypi
typing_extensions 4.7.1 py311haa95532_0
tzdata 2023c h04d1e81_0
urllib3 1.26.18 py311haa95532_0
vc 14.2 h21ff451_1
vs2015_runtime 14.27.29016 h5e58377_2
wheel 0.41.2 py311haa95532_0
win_inet_pton 1.1.0 py311haa95532_0
xz 5.4.2 h8cc25b3_0
yaml 0.2.5 he774522_0
zlib 1.2.13 h8cc25b3_0
zstd 1.5.5 hd43e919_0
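My computer has two GeForce RTX 4070 12GB GPUs.
When I run:
print(th.cuda.device_count()) I get 2
print(th.cuda.is_available()) I get True
so the GPUs appear to be detected.
But I don't know what causes the error above, which prevents the program from running.
In short, I would like the GitHub code to train the model successfully on my GPUs, since training on the CPU alone would take a very long time. Yet even though the GPUs are detected, the program cannot run because of the error above.
I hope someone can help me resolve this error. Many thanks!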
RuntimeError: unmatched '}' in format string
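Check whether the dict you pass in has an extra character or had something accidentally deleted,
leaving the dict malformed.
Another possibility is that your torch version is too new and some parameters have changed.
Try downgrading; the project is about two years old.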
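Before changing versions, it may help to reproduce the failing call outside the repository, so that the same few lines can be run under each torch version. The sketch below sets the four environment variables that init_method="env://" reads and initialises a one-process group. The helper names (`find_free_port`, `rendezvous_env`, `try_single_process_init`) are hypothetical, and the choice of 127.0.0.1 and the gloo backend is an assumption made for a single Windows machine, where NCCL is not available in PyTorch builds.

```python
import os
import socket


def find_free_port() -> int:
    """Ask the OS for a currently unused TCP port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]


def rendezvous_env(master_addr: str = "127.0.0.1") -> dict:
    """Build the variables read by init_method="env://" for one process."""
    return {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(find_free_port()),
        "RANK": "0",
        "WORLD_SIZE": "1",
    }


def try_single_process_init(master_addr: str = "127.0.0.1", backend: str = "gloo") -> None:
    """Initialise and tear down a one-process group to test rendezvous alone."""
    # Imported lazily so the helpers above can be used without torch installed.
    import torch
    import torch.distributed as dist

    os.environ.update(rendezvous_env(master_addr))
    print("torch", torch.__version__, "backend", backend, "addr", master_addr)
    dist.init_process_group(backend=backend, init_method="env://")
    print("initialised, world size =", dist.get_world_size())
    dist.destroy_process_group()
```

Calling `try_single_process_init()` and then `try_single_process_init("140.117.172.120")` in the same environment should show whether the error depends on the address, and repeating that under an older torch would show whether it depends on the version.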
Tools like Nsight Systems can help you profile your application and identify bottlenecks or issues in the code.