
Error when using Horovod: getting local rank failed, orte_ess_init failed, ompi_rte_init failed

by 🧞‍♂️, June 8, 2020
๋ฐ˜์‘ํ˜•
horovodrun -np 1 python train.py

์œ„์˜ ๋ฐฉ์‹์œผ๋กœ Horovod๋ฅผ ์‚ฌ์šฉํ•œ deep learning์ฝ”๋“œ ์‹คํ–‰์‹œ ์•„๋ž˜์™€ ๊ฐ™์€ ์—๋Ÿฌ๊ฐ€ ๋ฐœ์ƒํ–ˆ๋‹ค.

--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  getting local rank failed
  --> Returned value No permission (-17) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value No permission (-17) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "No permission" (-17) instead of "Success" (0)
--------------------------------------------------------------------------
[1,0]<stderr>:*** An error occurred in MPI_Init_thread
[1,0]<stderr>:*** on a NULL communicator
[1,0]<stderr>:*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[1,0]<stderr>:***    and potentially your MPI job)
[1,0]<stderr>:[K0878P2:12600] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

์—๋Ÿฌ์˜ ๊ทผ์›์€ ์ƒ๊ฐ์ง€๋„ ๋ชปํ•œ ๊ฒƒ์—์„œ ๋ฐœ๊ฒฌ๋๋‹ค.

At the very top level of train.py there is a call to a function built on Horovod. Simply wrapping that part in

if __name__ == "__main__":

made the error disappear... huh?
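For a concrete picture, here is a minimal sketch of the fix. The horovod.tensorflow.keras import and the train() function are my assumptions for illustration, not the actual contents of my train.py; the point is simply that the Horovod calls no longer execute at import time, only when the script is run directly, which is presumably what triggered the permission error.

import horovod.tensorflow.keras as hvd  # assumed binding; other hvd modules behave the same way


def train():
    # Hypothetical stand-in for the actual training routine in train.py.
    print("rank %d of %d started" % (hvd.rank(), hvd.size()))


if __name__ == "__main__":
    # Before the fix, hvd.init() and train() were called at module level.
    # Moving them under the main guard made the orte_init / "No permission
    # (-17)" errors shown above go away.
    hvd.init()
    train()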

ํ˜น์‹œ๋ผ๋„ ๊ฐ™์€ ์—๋Ÿฌ๋กœ ๊ณ ์ƒํ•˜๋Š” ๋ถ„๋“ค์—๊ฒŒ ๋„์›€์ด ๋˜๊ธธ ๋ฐ”๋žŒ.

๋ฐ˜์‘ํ˜•

๋Œ“๊ธ€