
Running the train script with torchrun raises an error #30

Open · yiyepiaoling0715 opened this issue Nov 25, 2024 · 0 comments

Training script as follows:

```bash
torchrun --nproc_per_node=4 \
    ../train.py --train_args_file ../train_args/dpo/full/deepseek-dpo-full.json
```

When execution reaches this line:

```python
blender.loadranker("llm-blender/PairRM", device=Accelerator().device)  # load PairRM
```

it fails with the following error:
```
oad_fuser()
[rank0]: Traceback (most recent call last):
[rank0]:   File "/lpai/code/firefly/shells/../train.py", line 923, in <module>
[rank0]:     main()
[rank0]:   File "/lpai/code/firefly/shells/../train.py", line 865, in main
[rank0]:     trainer = init_components(args, training_args)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/lpai/code/firefly/shells/../train.py", line 816, in init_components
[rank0]:     judge=PairRMJudge()
[rank0]:           ^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/trl/trainer/judges.py", line 167, in __init__
[rank0]:     self.blender.loadranker("llm-blender/PairRM", device=Accelerator().device)
[rank0]:                                                          ^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/accelerate/accelerator.py", line 292, in __init__
[rank0]:     deepspeed_plugins = AcceleratorState().deepspeed_plugins
[rank0]:                         ^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/accelerate/state.py", line 887, in __init__
[rank0]:     raise ValueError(
[rank0]: ValueError: Please make sure to properly initialize your accelerator via `accelerator = Accelerator()` before using any functionality from the `accelerate` library.
```
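The ValueError comes from `PairRMJudge.__init__` calling `Accelerator()` a second time inside a torchrun + DeepSpeed run, before the training script's own Accelerator has set up the shared state. Until this is fixed upstream, the internal `Accelerator()` call can be bypassed by loading the ranker on the device torchrun assigned to the process. A minimal sketch, assuming the `__init__` body matches the traceback above (`LocalPairRMJudge` and the `LOCAL_RANK` device choice are my own, not part of trl):

```python
import os

import llm_blender
from trl.trainer.judges import PairRMJudge


class LocalPairRMJudge(PairRMJudge):
    """Hypothetical drop-in for PairRMJudge that avoids the nested Accelerator().

    Overrides __init__ to load PairRM on the device torchrun assigned to this
    process (torchrun sets LOCAL_RANK; falls back to 0 for single-process runs)
    instead of asking a fresh Accelerator() for its device.
    """

    def __init__(self):
        self.blender = llm_blender.Blender()
        self.blender.loadranker(
            "llm-blender/PairRM",
            device=f"cuda:{int(os.environ.get('LOCAL_RANK', 0))}",
        )


# In init_components, replace judge=PairRMJudge() with:
judge = LocalPairRMJudge()
```

This only sidesteps the state error; whether the judge then behaves correctly under DeepSpeed still needs checking.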
The load also reports size mismatches for the PairRM checkpoint (excerpt):

```
[rank0]: size mismatch for pretrained_model.encoder.layer.23.output.dense.weight: copying a param with shape torch.Size([1024, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for pretrained_model.encoder.layer.23.output.dense.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for pretrained_model.encoder.layer.23.output.LayerNorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for pretrained_model.encoder.layer.23.output.LayerNorm.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for pretrained_model.encoder.rel_embeddings.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for pretrained_model.encoder.LayerNorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]: size mismatch for pretrained_model.encoder.LayerNorm.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([0]).
```
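The torch.Size([0]) shapes are the signature of DeepSpeed ZeRO-3 parameter partitioning: when zero.Init is active while a model is constructed, its parameters are created as empty placeholders, so copying the full PairRM checkpoint into them fails exactly like this. A quick check, as a sketch (`is_deepspeed_zero3_enabled` is a real transformers helper; that it explains this particular run is my assumption):

```python
from transformers.integrations.deepspeed import is_deepspeed_zero3_enabled

# If this prints True right before PairRM is loaded, the ranker's parameters
# are being created as ZeRO-3 partitioned (empty) tensors, which matches the
# torch.Size([0]) shapes in the size-mismatch errors above.
print("ZeRO-3 init active during PairRM load:", is_deepspeed_zero3_enabled())
```

If it prints True, loading the judge before the DeepSpeed config is registered, or outside the ZeRO-3 init context, would be the direction to try.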
