A Question Regarding Resuming InternVL Training #991

Open
sunye23 opened this issue Feb 11, 2025 · 0 comments

sunye23 commented Feb 11, 2025

Hi, I am currently training InternVL with xtuner. However, I have run into an issue when resuming training, and I would greatly appreciate your assistance.

Specifically, I am running distributed training on a SLURM cluster. Due to resource constraints, I can only allocate a few hours per job. Consequently, I need to resume training multiple times using checkpoint files from the .pth folder (e.g., mp_rank_00_model_states.pt). Unfortunately, each resume operation incurs a substantial delay during the “mmengine - WARNING - Advance dataloader 14000 steps to skip data that has already been trained” phase.
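For context on why this phase is slow: fast-forwarding a dataloader by drawing and discarding batches pays the full data-loading cost (decoding, augmentation, collation) for every skipped step. The sketch below is an illustration of that pattern under my own assumptions, not mmengine's actual source; `ToyDataset` and the step counts are hypothetical.

```python
# Illustration only (not mmengine's code): advancing a dataloader by
# consuming batches costs one full data-loading pass per skipped step.
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    def __len__(self):
        return 1_000_000

    def __getitem__(self, idx):
        # Real pipelines do image decoding / augmentation here, which is
        # what makes each "skipped" batch expensive.
        return idx


loader = DataLoader(ToyDataset(), batch_size=8, num_workers=4)

resumed_step = 14_000  # steps already trained before the job was preempted
it = iter(loader)
for _ in range(resumed_step):
    next(it)  # every skipped batch is still fully loaded, then thrown away
```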

Could you please advise whether there is any procedure or configuration setting to avoid this lengthy skipping process without compromising training performance?
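One general mitigation I have seen elsewhere (a sketch under my own assumptions, not a built-in xtuner or mmengine switch) is a resumable sampler: the shuffle order is regenerated from a fixed seed and the start position is restored by slicing the index permutation, which costs nothing in data loading, instead of consuming already-seen batches. All names below, such as `ResumableRandomSampler` and `start_index`, are hypothetical.

```python
# Sketch of a resumable sampler (hypothetical, not an xtuner/mmengine API).
import torch
from torch.utils.data import Sampler


class ResumableRandomSampler(Sampler):
    def __init__(self, dataset_len, seed=0, start_index=0):
        self.dataset_len = dataset_len
        self.seed = seed              # fixed seed -> same shuffle every run
        self.start_index = start_index

    def __iter__(self):
        g = torch.Generator()
        g.manual_seed(self.seed)      # reproduce the epoch's permutation
        perm = torch.randperm(self.dataset_len, generator=g).tolist()
        # Resume by slicing: no batches are loaded just to be discarded.
        return iter(perm[self.start_index:])

    def __len__(self):
        return self.dataset_len - self.start_index


# On resume, start_index would be roughly resumed_step * batch_size for the
# samples this rank has already seen; the dataloader is then built as usual.
```

Whether something equivalent can be wired into xtuner's dataloader configuration is exactly what I am asking about.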
