horovodrun -np 1 python train.py
Running deep learning code that uses Horovod with the command above produced the following error.
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
getting local rank failed
--> Returned value No permission (-17) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_ess_init failed
--> Returned value No permission (-17) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
ompi_mpi_init: ompi_rte_init failed
--> Returned "No permission" (-17) instead of "Success" (0)
--------------------------------------------------------------------------
[1,0]<stderr>:*** An error occurred in MPI_Init_thread
[1,0]<stderr>:*** on a NULL communicator
[1,0]<stderr>:*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[1,0]<stderr>:*** and potentially your MPI job)
[1,0]<stderr>:[K0878P2:12600] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
The root of the error turned up somewhere I never expected.
train.py calls a function built with Horovod directly at module level, and simply wrapping that call in
if __name__ == "__main__":
was enough to make the error go away... why?
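For reference, here is a minimal sketch of the layout that fixed it. The names `train` and `main` are hypothetical stand-ins, not the original train.py, and the Horovod calls appear only as comments so the sketch runs without Horovod installed. One possible explanation, which this post does not confirm, is that something re-imported train.py in another process (for example `multiprocessing` with the spawn start method), so the unguarded module-level call tried to initialize MPI a second time.

```python
# train.py (sketch): keep module level free of side effects.


def train(num_steps=3):
    """Hypothetical stand-in for the Horovod-based training function.

    In the real script, the Horovod import and `hvd.init()` would live in
    here, so MPI is initialized only when training is actually started.
    """
    losses = []
    for step in range(num_steps):
        # Placeholder for one training step; a real step would compute a loss.
        losses.append(1.0 / (step + 1))
    return losses


def main():
    losses = train()
    print(f"finished {len(losses)} steps, last loss = {losses[-1]:.3f}")
    return losses


# Runs only when the file is executed as a script, e.g. via
# `horovodrun -np 1 python train.py`. It does NOT run when the module is
# imported, including when a child process re-imports it.
if __name__ == "__main__":
    main()
```

With this layout, importing train.py only defines functions, and training starts once, from the process that was launched as the script.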
Hopefully this helps anyone struggling with the same error.