Mixture of Experts (MoE) plays an important role in the development of more efficient and effective large language models (LLMs). Due to the enormous resource requirements, studying large-scale MoE algorithms remains inaccessible to many researchers. This work develops LibMoE, a comprehensive and modular framework that streamlines the research, training, and evaluation of MoE algorithms. Built upon three core principles, (i) modular design, (ii) efficient training, and (iii) comprehensive evaluation, LibMoE makes MoE in LLMs more accessible to a wide range of researchers by standardizing the training and evaluation pipelines. Using LibMoE, we extensively benchmarked five state-of-the-art MoE algorithms over three different LLMs and 11 datasets under the zero-shot setting. The results show that, despite their unique characteristics, all MoE algorithms perform roughly similarly when averaged across a wide range of tasks. With its modular design and extensive evaluation, we believe LibMoE will be invaluable for researchers to make meaningful progress towards the next generation of MoE and LLMs.
Our work introduces LibMoE, a toolkit designed to simplify MoE research in LLMs by supporting distributed training and comprehensive evaluation across multiple MoE algorithms. With a modular design, LibMoE allows extensive customization of MoE components (e.g., sparsity, router interactions, balancing losses). By incorporating the latest sparse upcycling techniques, it enables affordable MoE integration into existing dense LLM checkpoints. The full training pipeline completes within 55 hours on 4 x A100 GPUs, with the MoE upcycling step alone finishing within 32 hours, offering a cost-effective setup while preserving evaluation fidelity. Our main contributions are summarized below:
Training Pipeline: For training, we adopt the vision-language pre-training task, which is one of the more challenging problems and requires only a rather small amount of data to start (around 1e9 tokens). To this end, we follow the CuMo framework to upcycle the LLaVA model, which consists of three modules: a pre-trained visual encoder, a pre-trained LLM, and a randomly initialized MLP connector. Training follows a two-stage process: a dense training stage (Pre-Training and Pre-FineTuning), followed by an MoE stage in which the dense checkpoint is upcycled into experts and trained with visual instruction data.
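As a rough illustration of the upcycling step, here is a minimal sketch of sparse upcycling; this is our own illustrative code, not LibMoE's or CuMo's actual implementation, and the `MoELayer` class and its arguments are names we made up. The idea is simply to copy a dense FFN into several experts and attach a freshly initialized top-k router:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Sparse MoE layer whose experts are initialized from a dense FFN (sparse upcycling)."""
    def __init__(self, dense_ffn: nn.Module, hidden_dim: int,
                 num_experts: int = 4, top_k: int = 2):
        super().__init__()
        # Sparse upcycling: every expert starts as a copy of the dense FFN.
        self.experts = nn.ModuleList(
            [copy.deepcopy(dense_ffn) for _ in range(num_experts)]
        )
        # The router is the only randomly initialized component.
        self.router = nn.Linear(hidden_dim, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden_dim)
        logits = self.router(x)                              # (tokens, experts)
        weights, indices = logits.topk(self.top_k, dim=-1)   # keep top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Example: upcycle a dense 2-layer FFN into a 4-expert MoE layer.
hidden_dim = 512
dense_ffn = nn.Sequential(nn.Linear(hidden_dim, 4 * hidden_dim), nn.GELU(),
                          nn.Linear(4 * hidden_dim, hidden_dim))
moe = MoELayer(dense_ffn, hidden_dim, num_experts=4, top_k=2)
tokens = torch.randn(8, hidden_dim)
print(moe(tokens).shape)  # torch.Size([8, 512])
```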
Evaluation Pipeline: To drive MoE development towards real-world scenarios, we design LibMoE to evaluate algorithms in the zero-shot setting. To this end, we adapt the LMMS-Eval framework to evaluate the final checkpoints of the various MoE algorithms. In particular, we carefully select 11 popular benchmarks provided by LMMS-Eval and report the evaluation results. We also provide a LibMoE model loader so that future users can freely explore the nearly 100 benchmarks supported by LMMS-Eval.
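To make the modular design mentioned in the contributions more concrete, the snippet below sketches one component that is typically swappable in such a framework: a Switch Transformer-style auxiliary load-balancing loss. This is our own minimal sketch, not code from LibMoE, and the function name `load_balancing_loss` is purely illustrative:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, num_experts: int) -> torch.Tensor:
    """Switch Transformer-style auxiliary loss: num_experts * sum_i f_i * P_i,
    where f_i is the fraction of tokens routed (top-1) to expert i and
    P_i is the mean router probability assigned to expert i."""
    probs = F.softmax(router_logits, dim=-1)              # (tokens, experts)
    top1 = probs.argmax(dim=-1)                           # (tokens,)
    # f_i: empirical fraction of tokens whose top-1 choice is expert i.
    dispatch_frac = F.one_hot(top1, num_experts).float().mean(dim=0)
    # P_i: average router probability mass placed on expert i.
    mean_prob = probs.mean(dim=0)
    return num_experts * torch.sum(dispatch_frac * mean_prob)

# Usage: add the auxiliary term to the task loss with a small coefficient.
logits = torch.randn(128, 4)           # router logits for 128 tokens, 4 experts
aux = load_balancing_loss(logits, num_experts=4)
total_loss = 0.01 * aux                # + task_loss in a real training step
print(aux.item())
```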
Comparison of MoE algorithms across different models and training data sizes for visual instruction tuning. The datasets are constructed from LLaVA-665K. We highlight the highest (best) results in bold. We consider five algorithms: SMoE-R (SMoE Router), Cosine-R (Cosine Router), Sigmoid-R (Sigmoid Router), Hyper-R (Hyper Router), and Perturbed Cosine-R (Perturbed Cosine Router). Here we only show the performance of the CLIP + Phi3 model; for the full results, please see our 👉 Paper 👈.
| Data | Model | MoE Method | AI2D | TextVQA | GQA | HallusionBench | MathVista (val) | MMBench (EN, dev) | MMMU (val) | MMStar | POPE | SQA (full) | MME | Average (w/o MME) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 332k | CLIP + Phi3 | SMoE-R | 63.67 | 47.47 | 59.46 | **43.32** | 31.60 | 66.67 | 40.11 | 37.94 | 86.87 | 77.23 | 1,608.21 | 55.42 |
| | | Cosine-R | 63.31 | **48.83** | 59.25 | 41.54 | 31.80 | 67.96 | 39.56 | 39.09 | 86.81 | 76.96 | **1,637.99** | 55.51 |
| | | Sigmoid-R | 63.80 | 47.74 | 59.24 | 41.43 | 31.40 | 68.30 | 40.78 | 38.70 | **87.49** | 77.61 | 1,611.36 | 55.65 |
| | | Hyper-R | 64.05 | 47.76 | **59.61** | 41.11 | **32.50** | **69.24** | **41.33** | **39.27** | 86.68 | 77.31 | 1,602.59 | **55.89** |
| | | Perturbed Cosine-R | **64.60** | 47.92 | 59.08 | 41.54 | 30.60 | 67.87 | 40.22 | 38.84 | 86.81 | **77.82** | 1,619.69 | 55.63 |
| 665k | CLIP + Phi3 | SMoE-R | 64.25 | 46.57 | **62.12** | 40.48 | 31.00 | 68.12 | 39.89 | 37.13 | 87.50 | 77.74 | 1,700.61 | 55.48 |
| | | Cosine-R | 64.51 | **49.79** | 61.38 | **40.80** | 31.30 | 67.01 | 40.67 | 39.36 | **87.52** | 77.48 | 1,687.37 | 55.98 |
| | | Sigmoid-R | 64.38 | 47.12 | 61.65 | **40.80** | 31.90 | 67.87 | 40.11 | 39.20 | 86.93 | 77.17 | 1,710.42 | 55.71 |
| | | Hyper-R | 64.37 | 47.59 | 59.70 | 40.38 | 31.30 | 68.30 | **40.78** | 38.33 | 85.70 | **80.33** | **1,726.87** | 55.68 |
| | | Perturbed Cosine-R | **64.70** | 47.16 | 61.90 | 39.43 | **32.80** | **69.50** | 39.89 | **40.33** | 87.42 | 77.64 | 1,672.70 | **56.08** |
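The router variants compared above differ mainly in how the routing logits are produced from the token representation. As a rough illustration, and using the commonly published formulation of a cosine router with expert embeddings and a learnable temperature rather than LibMoE's exact implementation, the two routers below expose the same interface and could be dropped into the same MoE layer interchangeably (all names are ours):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearRouter(nn.Module):
    """Standard SMoE router: logits are a linear map of the token representation."""
    def __init__(self, hidden_dim: int, num_experts: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, num_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)                               # (tokens, experts)

class CosineRouter(nn.Module):
    """Cosine router: logits are cosine similarities between a projected token
    representation and learned expert embeddings, scaled by a learnable temperature."""
    def __init__(self, hidden_dim: int, num_experts: int, proj_dim: int = 256,
                 init_temperature: float = 0.07):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, proj_dim, bias=False)
        self.expert_embed = nn.Parameter(torch.randn(num_experts, proj_dim))
        self.log_tau = nn.Parameter(torch.log(torch.tensor(init_temperature)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = F.normalize(self.proj(x), dim=-1)              # (tokens, proj_dim)
        e = F.normalize(self.expert_embed, dim=-1)         # (experts, proj_dim)
        return (h @ e.t()) / self.log_tau.exp()            # (tokens, experts)

# Both routers produce logits of the same shape, so swapping them only changes
# how the scores fed to top-k expert selection are computed.
x = torch.randn(8, 512)
print(LinearRouter(512, 4)(x).shape, CosineRouter(512, 4)(x).shape)
```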
The figure below offers a detailed view of the time-dependent performance of the five MoE algorithms across 11 benchmarks. It illustrates the unique behavioral characteristics of each algorithm and supports our observation that, in most cases, the final checkpoints of the MoE algorithms do not necessarily yield the best performance. This finding underscores the potential benefit of applying early stopping to achieve optimal results.
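In practice, this amounts to selecting the checkpoint with the best held-out average rather than defaulting to the last one. A tiny sketch of that selection step, using made-up scores purely for illustration:

```python
# Hypothetical per-checkpoint average benchmark scores collected during training.
scores = {"step_2000": 54.8, "step_4000": 55.9, "step_6000": 55.6, "final": 55.2}

# Early-stopping-style selection: keep the checkpoint with the best average
# instead of defaulting to the final one.
best_ckpt = max(scores, key=scores.get)
print(best_ckpt, scores[best_ckpt])   # step_4000 55.9
```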
We are making all of our experiment checkpoints publicly available to support the community's research on Mixture of Experts (MoE). By reusing our checkpoints from the Pre-Training and Pre-FineTuning stages, we hope to help others save time and computational resources in their own experiments.
```bibtex
@misc{nguyen2024libmoelibrarycomprehensivebenchmarking,
      title={LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models},
      author={Nam V. Nguyen and Thong T. Doan and Luong Tran and Van Nguyen and Quang Pham},
      year={2024},
      eprint={2411.00918},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.00918},
}
```