Error executing sbatch scripts/imagenav-v1-hm3d-ovrl-rn50.sh #10

Open
Lazy-boy193 opened this issue Apr 8, 2025 · 0 comments

Running sbatch scripts/imagenav-v1-hm3d-ovrl-rn50.sh fails with an error. My script file is as follows:

#SBATCH --signal=USR1@600
#SBATCH --requeue

export GLOG_minloglevel=2
export MAGNUM_LOG=quiet

# Print the current working directory
pwd

# Change to the correct working directory
cd /share/home/HCI/liuyang/zson

# Print the working directory after the change
pwd

MASTER_ADDR=$(srun --ntasks=1 hostname 2>&1 | tail -n1)
export MASTER_ADDR

source activate zson

set -x
srun \
python -u run.py \
--exp-config configs/experiments/imagenav_hm3d_v1.yaml \
--run-type train \
NUM_CHECKPOINTS 200 \
TOTAL_NUM_STEPS 500e6 \
TASK_CONFIG.ENVIRONMENT.MAX_EPISODE_STEPS 500 \
TASK_CONFIG.TASK.SENSORS '["CACHED_GOAL_SENSOR"]' \
TASK_CONFIG.TASK.CACHED_GOAL_SENSOR.DATA_PATH 'data/goal_datasets/imagenav/hm3d/v1-rn50/{split}/content/{scene}.npy' \
TASK_CONFIG.TASK.CACHED_GOAL_SENSOR.SINGLE_VIEW True \
TASK_CONFIG.TASK.CACHED_GOAL_SENSOR.SAMPLE_GOAL_ANGLE True \
TASK_CONFIG.TASK.SIMPLE_REWARD.SUCCESS_REWARD 5.0 \
TASK_CONFIG.TASK.SIMPLE_REWARD.ANGLE_SUCCESS_REWARD 5.0 \
TASK_CONFIG.TASK.SIMPLE_REWARD.USE_DTG_REWARD True \
TASK_CONFIG.TASK.SIMPLE_REWARD.USE_ATG_REWARD True \
RL.POLICY.use_data_aug True \
RL.POLICY.CLIP_MODEL "RN50" \
RL.POLICY.pretrained_encoder 'data/models/omnidata_DINO_02.pth' \

The following error occurred (output viewed with vim log.err):

    backend, store=tcp_store, rank=world_rank, world_size=world_size
  File "/share/home/HCI/liuyang/anaconda3/envs/zson/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 608, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/share/home/HCI/liuyang/anaconda3/envs/zson/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 247, in _store_based_barrier
    rank, store_key, world_size, worker_count, timeout
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=4, worker_count=2, timeout=0:30:00)
Traceback (most recent call last):
  File "run.py", line 91, in <module>
    main()
  File "run.py", line 38, in main
    run_exp(**vars(args))
  File "run.py", line 86, in run_exp
    execute_exp(config, run_type)
  File "run.py", line 69, in execute_exp
    trainer.train()
  File "/share/home/HCI/liuyang/anaconda3/envs/zson/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/share/home/HCI/liuyang/habitat-lab-challenge-2022/habitat_baselines/rl/ppo/ppo_trainer.py", line 709, in train
    self._init_train()
  File "/share/home/HCI/liuyang/habitat-lab-challenge-2022/habitat_baselines/rl/ppo/ppo_trainer.py", line 216, in _init_train
    self.config.RL.DDPPO.distrib_backend
  File "/share/home/HCI/liuyang/habitat-lab-challenge-2022/habitat_baselines/rl/ddppo/ddp_utils.py", line 266, in init_distrib_slurm
    backend, store=tcp_store, rank=world_rank, world_size=world_size
  File "/share/home/HCI/liuyang/anaconda3/envs/zson/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 608, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/share/home/HCI/liuyang/anaconda3/envs/zson/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 247, in _store_based_barrier
    rank, store_key, world_size, worker_count, timeout
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=4, worker_count=2, timeout=0:30:00)
srun: error: gpu01: task 0: Exited with exit code 1
srun: error: gpu04: task 0: Exited with exit code 1

What shall I do?
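
For reference, the error message itself narrows things down: it reports world_size=4 but worker_count=2, so only two of the four expected processes had registered with the store when the 30-minute timeout expired. A minimal sketch, assuming the message format shown in the log above, that extracts those numbers from the log text (the helper name `summarize_barrier_timeout` and the regular expression are illustrative, not part of PyTorch or habitat-lab):

```python
import re

# Hypothetical pattern: assumes the barrier message format seen in log.err above.
BARRIER_RE = re.compile(r"world_size=(\d+), worker_count=(\d+), timeout=([\d:]+)")


def summarize_barrier_timeout(log_text):
    """Return (world_size, worker_count, missing) for the first barrier timeout, or None."""
    match = BARRIER_RE.search(log_text)
    if match is None:
        return None
    world_size = int(match.group(1))
    worker_count = int(match.group(2))
    # Number of processes that never reached the barrier.
    return world_size, worker_count, world_size - worker_count


# The error line copied from the log above.
sample = (
    "RuntimeError: Timed out initializing process group in store based barrier "
    "on rank: 0, for key: store_based_barrier_key:1 "
    "(world_size=4, worker_count=2, timeout=0:30:00)"
)
print(summarize_barrier_timeout(sample))
```

In this log the two srun error lines come from gpu01 and gpu04, which matches two processes having started; whether the other two were never launched or could not reach MASTER_ADDR cannot be told from this output alone.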
