Hi, I am currently training InternVL with xtuner. However, I have encountered an issue with resuming training, and I would greatly appreciate your assistance.
Specifically, I am running distributed training on a SLURM cluster. Due to resource constraints, I can only allocate a few hours per job. Consequently, I need to resume training multiple times using checkpoint files from the .pth folder (e.g., mp_rank_00_model_states.pt). Unfortunately, each resume operation incurs a substantial delay during the “mmengine - WARNING - Advance dataloader 14000 steps to skip data that has already been trained” phase.
Could you please advise if there is any procedure or configuration setting to avoid this lengthy skipping process without compromising training performance?
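For context, one workaround I have been considering (a minimal sketch, not an mmengine or xtuner API; the `SkippingSampler` name and its interface are hypothetical) is to make the sampler itself resumable, so skipping already-trained samples becomes index arithmetic instead of iterating the dataloader for thousands of steps:

```python
import random

class SkippingSampler:
    """Hypothetical deterministic shuffled sampler that resumes from an offset.

    Because the shuffle is seeded, fast-forwarding is just slicing the
    precomputed index list -- O(1) work rather than replaying the dataloader.
    """

    def __init__(self, dataset_size, seed=0, start_index=0):
        self.dataset_size = dataset_size
        self.seed = seed
        self.start_index = start_index  # number of samples already consumed

    def __iter__(self):
        rng = random.Random(self.seed)
        indices = list(range(self.dataset_size))
        rng.shuffle(indices)
        # Skip already-trained samples by slicing instead of iterating.
        return iter(indices[self.start_index:])

# Resuming after 6 samples yields exactly the remainder of a fresh epoch:
fresh = list(SkippingSampler(10, seed=42))
resumed = list(SkippingSampler(10, seed=42, start_index=6))
assert resumed == fresh[6:]
```

If something along these lines is already supported (or could be hooked into the resume path), that would solve the delay; otherwise, is there a recommended alternative?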