Mixture of Experts (MoE) plays an important role in the development of more efficient and effective large language models (LLMs). Due to the enormous resource requirements, studying large-scale MoE algorithms remains inaccessible to many researchers. This work develops LibMoE, a comprehensive and modular framework that streamlines the research, training, and evaluation of MoE algorithms. Built upon three core principles, (i) modular design, (ii) efficient training, and (iii) comprehensive evaluation, LibMoE makes MoE in LLMs more accessible to a wide range of researchers by standardizing the training and evaluation pipelines. Using LibMoE, we extensively benchmarked five state-of-the-art MoE algorithms over three different LLMs and 11 datasets under the zero-shot setting. The results show that, despite their unique characteristics, all MoE algorithms perform roughly similarly when averaged across a wide range of tasks. With its modular design and extensive evaluation, we believe LibMoE will be invaluable for researchers to make meaningful progress towards the next generation of MoE and LLMs.
Our work introduces LibMoE, a toolkit designed to simplify MoE research in LLMs by supporting distributed training and comprehensive evaluation across multiple MoE algorithms. With a modular design, LibMoE allows extensive customization of MoE components (e.g., sparsity, router interactions, balancing losses). By incorporating the latest sparse upcycling techniques, it enables affordable MoE integration into existing dense LLM checkpoints. The full training pipeline completes within 55 hours on 4 x A100 GPUs, with the MoE upcycling step alone finishing within 32 hours, offering a cost-effective setup while preserving evaluation fidelity. Our main contributions are summarized below:
Training Pipeline: For training, we adopt the vision-language pre-training task, which is one of the more challenging problems and requires only a rather small amount of data to start (around 1e9 tokens). To this end, we follow the CuMo framework to upcycle the LLaVA model, which consists of three modules: a pre-trained visual encoder, a pre-trained LLM, and a randomly initialized MLP connector. Training follows a two-stage process: a dense training stage (Pre-Training and Pre-FineTuning), followed by an MoE stage in which the dense checkpoint is upcycled into experts and trained with visual instruction data.
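As a rough illustration of the upcycling step, here is a minimal sketch of sparse upcycling; this is our own illustrative code, not LibMoE's or CuMo's actual implementation, and the `MoELayer` class and its arguments are names we made up. The idea is simply to copy a dense FFN into several experts and attach a freshly initialized top-k router:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Sparse MoE layer whose experts are initialized from a dense FFN (sparse upcycling)."""
    def __init__(self, dense_ffn: nn.Module, hidden_dim: int,
                 num_experts: int = 4, top_k: int = 2):
        super().__init__()
        # Sparse upcycling: every expert starts as a copy of the dense FFN.
        self.experts = nn.ModuleList(
            [copy.deepcopy(dense_ffn) for _ in range(num_experts)]
        )
        # The router is the only randomly initialized component.
        self.router = nn.Linear(hidden_dim, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden_dim)
        logits = self.router(x)                              # (tokens, experts)
        weights, indices = logits.topk(self.top_k, dim=-1)   # keep top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Example: upcycle a dense 2-layer FFN into a 4-expert MoE layer.
hidden_dim = 512
dense_ffn = nn.Sequential(nn.Linear(hidden_dim, 4 * hidden_dim), nn.GELU(),
                          nn.Linear(4 * hidden_dim, hidden_dim))
moe = MoELayer(dense_ffn, hidden_dim, num_experts=4, top_k=2)
tokens = torch.randn(8, hidden_dim)
print(moe(tokens).shape)  # torch.Size([8, 512])
```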
Evaluation Pipeline: To drive MoE development towards real-world scenarios, we design LibMoE to evaluate algorithms in the zero-shot setting. To this end, we adapt the LMMS-Eval framework to evaluate the final checkpoints of the various MoE algorithms. In particular, we carefully select 11 popular benchmarks provided by LMMS-Eval and report the evaluation results. We also provide a LibMoE model loader so that future users can freely explore the nearly 100 benchmarks supported by LMMS-Eval.
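To make the modular design mentioned in the contributions more concrete, the snippet below sketches one component that is typically swappable in such a framework: a Switch Transformer-style auxiliary load-balancing loss. This is our own minimal sketch, not code from LibMoE, and the function name `load_balancing_loss` is purely illustrative:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, num_experts: int) -> torch.Tensor:
    """Switch Transformer-style auxiliary loss: num_experts * sum_i f_i * P_i,
    where f_i is the fraction of tokens routed (top-1) to expert i and
    P_i is the mean router probability assigned to expert i."""
    probs = F.softmax(router_logits, dim=-1)              # (tokens, experts)
    top1 = probs.argmax(dim=-1)                           # (tokens,)
    # f_i: empirical fraction of tokens whose top-1 choice is expert i.
    dispatch_frac = F.one_hot(top1, num_experts).float().mean(dim=0)
    # P_i: average router probability mass placed on expert i.
    mean_prob = probs.mean(dim=0)
    return num_experts * torch.sum(dispatch_frac * mean_prob)

# Usage: add the auxiliary term to the task loss with a small coefficient.
logits = torch.randn(128, 4)           # router logits for 128 tokens, 4 experts
aux = load_balancing_loss(logits, num_experts=4)
total_loss = 0.01 * aux                # + task_loss in a real training step
print(aux.item())
```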
Comparison of MoE algorithms across different models and training data sizes for visual instruction tuning. The datasets are constructed from LLaVA-665K. We highlight the highest (best) results in bold. We consider five algorithms: SMoE-R (SMoE Router), Cosine-R (Cosine Router), Sigmoid-R (Sigmoid Router), Hyper-R (Hyper Router), and Perturbed Cosine-R (Perturbed Cosine Router). Here we only show the performance of the CLIP + Phi3 model; for the full results, please see our 👉 Paper 👈.
| Data | Model | MoE Method | AI2D | TextVQA | GQA | HallusionBench | MathVista (val) | MMBench (EN, dev) | MMMU (val) | MMStar | POPE | SQA (full) | MME | Average (w/o MME) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 332k | CLIP + Phi3 | SMoE-R | 63.67 | 47.47 | 59.46 | **43.32** | 31.60 | 66.67 | 40.11 | 37.94 | 86.87 | 77.23 | 1,608.21 | 55.42 |
| | | Cosine-R | 63.31 | **48.83** | 59.25 | 41.54 | 31.80 | 67.96 | 39.56 | 39.09 | 86.81 | 76.96 | **1,637.99** | 55.51 |
| | | Sigmoid-R | 63.80 | 47.74 | 59.24 | 41.43 | 31.40 | 68.30 | 40.78 | 38.70 | **87.49** | 77.61 | 1,611.36 | 55.65 |
| | | Hyper-R | 64.05 | 47.76 | **59.61** | 41.11 | **32.50** | **69.24** | **41.33** | **39.27** | 86.68 | 77.31 | 1,602.59 | **55.89** |
| | | Perturbed Cosine-R | **64.60** | 47.92 | 59.08 | 41.54 | 30.60 | 67.87 | 40.22 | 38.84 | 86.81 | **77.82** | 1,619.69 | 55.63 |
| 665k | CLIP + Phi3 | SMoE-R | 64.25 | 46.57 | **62.12** | 40.48 | 31.00 | 68.12 | 39.89 | 37.13 | 87.50 | 77.74 | 1,700.61 | 55.48 |
| | | Cosine-R | 64.51 | **49.79** | 61.38 | **40.80** | 31.30 | 67.01 | 40.67 | 39.36 | **87.52** | 77.48 | 1,687.37 | 55.98 |
| | | Sigmoid-R | 64.38 | 47.12 | 61.65 | **40.80** | 31.90 | 67.87 | 40.11 | 39.20 | 86.93 | 77.17 | 1,710.42 | 55.71 |
| | | Hyper-R | 64.37 | 47.59 | 59.70 | 40.38 | 31.30 | 68.30 | **40.78** | 38.33 | 85.70 | **80.33** | **1,726.87** | 55.68 |
| | | Perturbed Cosine-R | **64.70** | 47.16 | 61.90 | 39.43 | **32.80** | **69.50** | 39.89 | **40.33** | 87.42 | 77.64 | 1,672.70 | **56.08** |
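The router variants compared above differ mainly in how the routing logits are produced from the token representation. As a rough illustration, and using the commonly published formulation of a cosine router with expert embeddings and a learnable temperature rather than LibMoE's exact implementation, the two routers below expose the same interface and could be dropped into the same MoE layer interchangeably (all names are ours):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearRouter(nn.Module):
    """Standard SMoE router: logits are a linear map of the token representation."""
    def __init__(self, hidden_dim: int, num_experts: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, num_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)                               # (tokens, experts)

class CosineRouter(nn.Module):
    """Cosine router: logits are cosine similarities between a projected token
    representation and learned expert embeddings, scaled by a learnable temperature."""
    def __init__(self, hidden_dim: int, num_experts: int, proj_dim: int = 256,
                 init_temperature: float = 0.07):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, proj_dim, bias=False)
        self.expert_embed = nn.Parameter(torch.randn(num_experts, proj_dim))
        self.log_tau = nn.Parameter(torch.log(torch.tensor(init_temperature)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = F.normalize(self.proj(x), dim=-1)              # (tokens, proj_dim)
        e = F.normalize(self.expert_embed, dim=-1)         # (experts, proj_dim)
        return (h @ e.t()) / self.log_tau.exp()            # (tokens, experts)

# Both routers produce logits of the same shape, so swapping them only changes
# how the scores fed to top-k expert selection are computed.
x = torch.randn(8, 512)
print(LinearRouter(512, 4)(x).shape, CosineRouter(512, 4)(x).shape)
```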
The figure below offers a detailed view of the time-dependent performance of the five MoE algorithms across 11 benchmarks. It illustrates the unique behavioral characteristics of each algorithm and supports our observation that, in most cases, the final checkpoints of the MoE algorithms do not necessarily yield the best performance. This finding underscores the potential benefit of applying early stopping to achieve optimal results.
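In practice, this amounts to selecting the checkpoint with the best held-out average rather than defaulting to the last one. A tiny sketch of that selection step, using made-up scores purely for illustration:

```python
# Hypothetical per-checkpoint average benchmark scores collected during training.
scores = {"step_2000": 54.8, "step_4000": 55.9, "step_6000": 55.6, "final": 55.2}

# Early-stopping-style selection: keep the checkpoint with the best average
# instead of defaulting to the final one.
best_ckpt = max(scores, key=scores.get)
print(best_ckpt, scores[best_ckpt])   # step_4000 55.9
```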
We are making all of our experiment checkpoints publicly available to support the community's research on Mixture of Experts (MoE). By reusing our checkpoints from the Pre-Training and Pre-FineTuning stages, we hope to help others save time and computational resources in their own experiments.
```bibtex
@misc{nguyen2024libmoelibrarycomprehensivebenchmarking,
      title={LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models},
      author={Nam V. Nguyen and Thong T. Doan and Luong Tran and Van Nguyen and Quang Pham},
      year={2024},
      eprint={2411.00918},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.00918},
}
```