
Running the train script with torchrun raises an error #30

Open · yiyepiaoling0715 opened this issue Nov 25, 2024 · 0 comments

Training script as follows:

```bash
torchrun --nproc_per_node=4 \
    ../train.py --train_args_file ../train_args/dpo/full/deepseek-dpo-full.json
```

When execution reaches this line:

```python
blender.loadranker("llm-blender/PairRM", device=Accelerator().device)  # load PairRM
```

it fails with the following error:
```
oad_fuser()
[rank0]: Traceback (most recent call last):
[rank0]:   File "/lpai/code/firefly/shells/../train.py", line 923, in <module>
[rank0]:     main()
[rank0]:   File "/lpai/code/firefly/shells/../train.py", line 865, in main
[rank0]:     trainer = init_components(args, training_args)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/lpai/code/firefly/shells/../train.py", line 816, in init_components
[rank0]:     judge=PairRMJudge()
[rank0]:           ^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/trl/trainer/judges.py", line 167, in __init__
[rank0]:     self.blender.loadranker("llm-blender/PairRM", device=Accelerator().device)
[rank0]:                                                          ^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/accelerate/accelerator.py", line 292, in __init__
[rank0]:     deepspeed_plugins = AcceleratorState().deepspeed_plugins
[rank0]:                         ^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/accelerate/state.py", line 887, in __init__
[rank0]:     raise ValueError(
[rank0]: ValueError: Please make sure to properly initialize your accelerator via `accelerator = Accelerator()` before using any functionality from the `accelerate` library.
```
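The ValueError comes from `PairRMJudge.__init__` calling `Accelerator()` a second time inside a torchrun + DeepSpeed run, before the training script's own Accelerator has set up the shared state. Until this is fixed upstream, the internal `Accelerator()` call can be bypassed by loading the ranker on the device torchrun assigned to the process. A minimal sketch, assuming the `__init__` body matches the traceback above (`LocalPairRMJudge` and the `LOCAL_RANK` device choice are my own, not part of trl):

```python
import os

import llm_blender
from trl.trainer.judges import PairRMJudge


class LocalPairRMJudge(PairRMJudge):
    """Hypothetical drop-in for PairRMJudge that avoids the nested Accelerator().

    Overrides __init__ to load PairRM on the device torchrun assigned to this
    process (torchrun sets LOCAL_RANK; falls back to 0 for single-process runs)
    instead of asking a fresh Accelerator() for its device.
    """

    def __init__(self):
        self.blender = llm_blender.Blender()
        self.blender.loadranker(
            "llm-blender/PairRM",
            device=f"cuda:{int(os.environ.get('LOCAL_RANK', 0))}",
        )


# In init_components, replace judge=PairRMJudge() with:
judge = LocalPairRMJudge()
```

This only sidesteps the state error; whether the judge then behaves correctly under DeepSpeed still needs checking.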
The load also reports size mismatches for the PairRM checkpoint (excerpt):

```
[rank0]: size mismatch for pretrained_model.encoder.layer.23.output.dense.weight: copying a param with shape torch.Size([1024, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for pretrained_model.encoder.layer.23.output.dense.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for pretrained_model.encoder.layer.23.output.LayerNorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for pretrained_model.encoder.layer.23.output.LayerNorm.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for pretrained_model.encoder.rel_embeddings.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for pretrained_model.encoder.LayerNorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for pretrained_model.encoder.LayerNorm.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
```
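The torch.Size([0]) shapes are the signature of DeepSpeed ZeRO-3 parameter partitioning: when zero.Init is active while a model is constructed, its parameters are created as empty placeholders, so copying the full PairRM checkpoint into them fails exactly like this. A quick check, as a sketch (`is_deepspeed_zero3_enabled` is a real transformers helper; that it explains this particular run is my assumption):

```python
from transformers.integrations.deepspeed import is_deepspeed_zero3_enabled

# If this prints True right before PairRM is loaded, the ranker's parameters
# are being created as ZeRO-3 partitioned (empty) tensors, which matches the
# torch.Size([0]) shapes in the size-mismatch errors above.
print("ZeRO-3 init active during PairRM load:", is_deepspeed_zero3_enabled())
```

If it prints True, loading the judge before the DeepSpeed config is registered, or outside the ZeRO-3 init context, would be the direction to try.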
