Hi, I am currently training InternVL with xtuner. However, I have encountered an issue with resuming training, and I would greatly appreciate your assistance.
Specifically, I am running distributed training on a SLURM cluster. Due to resource constraints, I can only allocate a few hours per job. Consequently, I need to resume training multiple times using checkpoint files from the .pth folder (e.g., mp_rank_00_model_states.pt). Unfortunately, each resume operation incurs a substantial delay during the “mmengine - WARNING - Advance dataloader 14000 steps to skip data that has already been trained” phase.
Could you please advise if there is any procedure or configuration setting to avoid this lengthy skipping process without compromising training performance?
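For context, one workaround I have been considering (a minimal sketch, not an mmengine or xtuner API; the `SkippingSampler` name and its interface are hypothetical) is to make the sampler itself resumable, so skipping already-trained samples becomes index arithmetic instead of iterating the dataloader for thousands of steps:

```python
import random

class SkippingSampler:
    """Hypothetical deterministic shuffled sampler that resumes from an offset.

    Because the shuffle is seeded, fast-forwarding is just slicing the
    precomputed index list -- O(1) work rather than replaying the dataloader.
    """

    def __init__(self, dataset_size, seed=0, start_index=0):
        self.dataset_size = dataset_size
        self.seed = seed
        self.start_index = start_index  # number of samples already consumed

    def __iter__(self):
        rng = random.Random(self.seed)
        indices = list(range(self.dataset_size))
        rng.shuffle(indices)
        # Skip already-trained samples by slicing instead of iterating.
        return iter(indices[self.start_index:])

# Resuming after 6 samples yields exactly the remainder of a fresh epoch:
fresh = list(SkippingSampler(10, seed=42))
resumed = list(SkippingSampler(10, seed=42, start_index=6))
assert resumed == fresh[6:]
```

If something along these lines is already supported (or could be hooked into the resume path), that would solve the delay; otherwise, is there a recommended alternative?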