This repo aims to provide a production-ready library for modeling and training Linear-MoE models, non-invasively built on the latest Megatron-Core. Contributions through pull requests are highly encouraged!
| Linear Sequence Modeling | Instance | Qwen2 MoE (@Alibaba) | Deepseek v2 MoE (@Deepseek) | Mixtral MoE (@Mistral AI) | Llama3 (@Meta) |
|---|---|---|---|---|---|
| Linear Attention (LA) | Basic Linear Attention (@Idiap@EPFL) | ✅ | ✅ | ✅ | ✅ |
| | Lightning Attention (@Shanghai AI Lab) | ✅ | ✅ | ✅ | ✅ |
| | Retention (@MSRA@THU) | ✅ | ✅ | ✅ | ✅ |
| | GLA (@MIT@IBM) | ✅ | ✅ | ✅ | ✅ |
| | Delta Net (@MIT) | ✅ | ✅ | ✅ | ✅ |
| | GSA (@SUDA@MIT) | ✅ | ✅ | ✅ | ✅ |
| | Based (@Stanford) | ✅ | ✅ | ✅ | ✅ |
| | Rebased (@Tinkoff) | ✅ | ✅ | ✅ | ✅ |
| | LASP-2 (@Shanghai AI Lab) | ✅ | ✅ | ✅ | ✅ |
| | Gated DeltaNet (@MIT@NVIDIA) | ✅ | ✅ | ✅ | ✅ |
| | 🔥MoM (with GLA) (@Shanghai AI Lab) | ✅ | ✅ | ✅ | ✅ |
| | 🔥MoM (with Gated DeltaNet) (@Shanghai AI Lab) | ✅ | ✅ | ✅ | ✅ |
| State Space Modeling (SSM) | Mamba2 (@Princeton@CMU) | ✅ | ✅ | ✅ | ✅ |
| Linear RNN | RWKV6 (@RWKV) | ✅ | ✅ | ✅ | ✅ |
| | HGRN2 (@TapTap@Shanghai AI Lab) | ✅ | ✅ | ✅ | ✅ |
| Softmax Attention | Softmax Attention (@Google) | ✅ | ✅ | ✅ | ✅ |
| | FlashAttention-2 (@Princeton@Stanford) | ✅ | ✅ | ✅ | ✅ |
Your environment should satisfy the following requirements:
```bash
# create a conda env, install PyTorch
conda create -n linear-moe python=3.11
conda activate linear-moe
conda install pytorch pytorch-cuda=12.1 -c pytorch -c nvidia

# (if needed) Apex
git clone https://github.com/NVIDIA/apex.git
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
cd ..

# (if needed) FlashAttention
MAX_JOBS=8 pip install flash-attn --no-build-isolation

# (if needed) dropout_layer_norm in FlashAttention
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention/csrc/layer_norm && pip install .
cd ../../..

# Transformer Engine
pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable

# Linear-MoE
git clone --recurse-submodules https://github.com/OpenSparseLLMs/Linear-MoE.git
cd Linear-MoE

# requirements
pip install -r requirements.txt
```
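After installation, you can optionally sanity-check the core dependencies. The snippet below is only a minimal check we suggest, not part of the repo's scripts:

```bash
# optional: verify that the core dependencies import and CUDA is visible
python -c "import torch; print('torch', torch.__version__, 'cuda available:', torch.cuda.is_available())"
python -c "import triton; print('triton', triton.__version__)"
python -c "import flash_attn; print('flash-attn', flash_attn.__version__)"   # only if FlashAttention was installed
python -c "import transformer_engine; print('transformer_engine ok')"
```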
We recommend using the latest release of NGC's PyTorch container on DGX nodes, which already ships with relatively new versions of CUDA, cuDNN, NCCL, PyTorch, Triton, Apex, TransformerEngine, etc.
On top of the NGC PyTorch container, you can set up Linear-MoE with:
```bash
# Linear-MoE
git clone --recurse-submodules https://github.com/OpenSparseLLMs/Linear-MoE.git
cd Linear-MoE

# requirements
pip install -r requirements.txt
```
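For reference, launching the NGC PyTorch container might look like the sketch below; the image tag and mount path are placeholders, so pick the release that matches your driver and data layout:

```bash
# example only: start an NGC PyTorch container (tag and mount path are placeholders)
docker run --gpus all -it --rm --ipc=host \
  -v /path/to/workspace:/workspace \
  nvcr.io/nvidia/pytorch:24.04-py3
```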
To pretrain or finetune a Linear-MoE model, you can:
- Open `examples` and choose the model you are going to pretrain or finetune, e.g. `linear_moe_qwen2`.
- Edit `run_pretrain_qwen.sh` or `run_finetune_qwen.sh` to set your configurations, including:
  - Model size (e.g., 0.5B, 1.5B, 7B)
  - Batch size
  - Learning rate
  - Model architecture (e.g., LSM modules, number of experts)
  - Distributed training settings (TP, PP, CP, EP sizes)
  - ...
- Start pretraining or finetuning by running `sh run_pretrain_qwen.sh` or `sh run_finetune_qwen.sh`.
For example, to train an A0.3B (hybrid) `linear-moe-qwen2` model with `LA_MODULE=hgrn2`, you can configure `run_pretrain_qwen.sh` as:
```bash
ENV=dsw
MODEL_SIZE=A0.3B
BATCH_SIZE=2
GLOBAL_BATCH_SIZE=4
LR=1e-4
MIN_LR=1e-5
SEQ_LEN=2048
PAD_LEN=2048
PR=bf16
TP=1
PP=1
CP=1
EP=1
AC=sel
DO=true
FL=false
FU=false
SP=false
TE=false
MB=false
USE_GEMM=false
TOKEN_DROPPING=false
TRAIN_CAPACITY_FACTOR=1.25
EVAL_CAPACITY_FACTOR=2.0
SAVE_INTERVAL=100000
DATASET_PATH=xxx/qwen-datasets/wudao_qwenbpe_text_document
PRETRAIN_CHECKPOINT_PATH=xxx/qwen-ckpts/Qwen2-0.5B
TRAIN_TOKENS=15000000000
WARMUP_TOKENS=10000
OUTPUT_BASEPATH=./output

LA_MODULE="hgrn2"
BASE_MODEL="qwen2"

# for linear attention and linear RNN models
# pure linear
# LAYER_TYPE_LIST="LLLLLLLLLLLL"
# hybrid model
LAYER_TYPE_LIST="LLLNLLLNLLLN"

# for SSM models (Mamba2), MLP layers are fixed behind mamba or attention layers.
# M: mamba layer, *: attention layer
# pure mamba2
# HYBRID_OVERRIDE_PATTERN="MMMMMMMMMMMM"
# hybrid mamba2
# HYBRID_OVERRIDE_PATTERN="MMM*MMM*MMM*"

# Linear Attention & Linear RNN
linear_moe_options=" \
            --use-la-module \
            --la-module ${LA_MODULE} \
            --la-mode fused_chunk \
            --base-model ${BASE_MODEL} \
            --la-feature-map swish \
            --la-output-norm rmsnorm \
            --la-gate-fn swish \
            --layer-type-list ${LAYER_TYPE_LIST} \
            "

# # SSM
# linear_moe_options=" \
#             --use-la-module \
#             --la-module ${LA_MODULE} \
#             --base-model ${BASE_MODEL} \
#             "
```
We use EleutherAI/lm-evaluation-harness for benchmark evaluation. See `eval/README.md` for detailed instructions.
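For orientation, a typical lm-evaluation-harness invocation on a Hugging Face-format checkpoint is sketched below; the checkpoint path and task list are placeholders, and the repo's own evaluation workflow in `eval/README.md` may differ:

```bash
# generic lm-evaluation-harness run (paths and tasks are placeholders)
pip install lm-eval
lm_eval --model hf \
  --model_args pretrained=/path/to/hf_checkpoint \
  --tasks piqa,hellaswag \
  --batch_size 8
```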
We built this repo upon alibaba/PAI-Megatron-Patch and use Megatron-Core as the training engine. We use the Triton-implemented linear attention kernels from fla-org/flash-linear-attention and the CUDA-implemented Mamba2 kernel from state-spaces/mamba to accelerate execution.
If you find this repo useful, please consider citing our work:
```bibtex
@article{sun2025linear-moe,
  title={Linear-MoE: Linear Sequence Modeling Meets Mixture-of-Experts},
  author={Sun, Weigao and Lan, Disen and Zhu, Tong and Qu, Xiaoye and Cheng, Yu},
  journal={arXiv preprint arXiv:2503.05447},
  year={2025}
}

@software{sun2024linear-moe,
  title={Linear-MoE: A Production-Ready Library for Modeling and Training Linear-MoE Models},
  author={Sun, Weigao and Lan, Disen and Zhu, Tong and Du, Jusen},
  url={https://github.com/OpenSparseLLMs/Linear-MoE},
  year={2024}
}

@article{du2025mom,
  title={MoM: Linear Sequence Modeling with Mixture-of-Memories},
  author={Du, Jusen and Sun, Weigao and Lan, Disen and Hu, Jiaxi and Cheng, Yu},
  journal={arXiv preprint arXiv:2502.13685},
  year={2025}
}

@article{sun2025lasp2,
  title={LASP-2: Rethinking Sequence Parallelism for Linear Attention and Its Hybrid},
  author={Sun, Weigao and Lan, Disen and Zhong, Yiran and Qu, Xiaoye and Cheng, Yu},
  journal={arXiv preprint arXiv:2502.07563},
  year={2025}
}
```