Monkey Jump: MoE-Style PEFT for Efficient Multi-Task Learning

MoE-style specialization for PEFT without adding trainable routers or expert parameters. Gradient-free routing via k-means with EMA-updated centers.
🔀 Router-free MoE 🧠 Token-wise routing 💾 Up to 48% memory savings ⚡ 1.5–2× faster training
Monkey Jump method overview

🧠 Abstract

MoE-style PEFT without extra params

Mixture-of-experts (MoE) variants of parameter-efficient fine-tuning (PEFT) enable per-token specialization, but they introduce additional trainable routers and expert parameters, increasing memory and training costs. This undermines the core goal of parameter-efficient fine-tuning.

We propose Monkey Jump (MJ), named for the selective activation pattern: adapters "jump" on for some projections and off for others. MJ brings MoE-style specialization to PEFT without adding extra trainable parameters. Instead, MJ reuses the PEFT adapters already present in each Transformer block (e.g., query, key, value, up, down) as implicit experts, and routes tokens among them using k-means clustering with EMA-updated centers (no gradients, no learned parameters).

๐Ÿ“ 14 Text datasets ๐Ÿ–ผ๏ธ 14 Image datasets ๐ŸŽฌ 19 Video datasets ๐Ÿงฉ Architecture-agnostic

🚀 Features

Fast • Sparse • Gradient-free

What Monkey Jump adds

  • 🔀 MoE-style routing without any trainable parameters
  • 🧠 Token-wise and sentence-wise clustering-based routing
  • 🧪 Gradient-free token routing via k-means + EMA
  • ⚡ 1.5–2× faster training and inference
  • 💾 Up to 48% GPU memory savings

Compatibility

  • 🔧 Works with LoRA, AdaLoRA, LoRA-FA, Propulsion
  • 🧩 Adapter-based PEFT methods
  • ✅ Uses existing per-projection adapters as implicit experts
  • 🎯 Top-k sparse activation per token (see the sketch below)
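As a rough illustration of the last two points, the sketch below (a hypothetical LoRA wrapper, not the repository's API) gates a projection's existing adapter with a routing weight: a zero gate skips the adapter entirely, which is what gives top-k sparse activation per token.

import torch
import torch.nn as nn

class ImplicitExpertProjection(nn.Module):
    """Hypothetical wrapper: a projection's own LoRA adapter acts as one implicit expert."""

    def __init__(self, base: nn.Linear, rank=2, alpha=16.0):
        super().__init__()
        self.base = base
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # adapter starts as a no-op on the base weights
        self.scale = alpha / rank

    def forward(self, x, gate):
        # gate: (..., 1) routing weight for this projection's adapter. A zero gate skips
        # the adapter, so only the top-k selected adapters contribute for each token.
        return self.base(x) + gate * self.scale * self.lora_b(self.lora_a(x))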

โš™๏ธ Installation

pip / conda friendly
git clone https://github.com/yourusername/MonkeyJump.git
cd MonkeyJump
pip install torch torchvision torchaudio
pip install transformers accelerate datasets peft
pip install scikit-learn tqdm numpy pandas

Tip: Replace yourusername with your GitHub username.

💻 Quick Start

Minimal integration
from transformers import AutoModelForCausalLM
from src.MJLoRA import apply_monkeyjump

# Load the base model (replace "model_name" with an actual checkpoint id).
model = AutoModelForCausalLM.from_pretrained("model_name")

# Attach MJ adapters: the per-projection adapters act as implicit experts.
model = apply_monkeyjump(
    model,
    blocks={"LlamaDecoderLayer": list(range(32))},     # block class and layer indices to patch
    linears=["q_proj", "k_proj", "v_proj", "o_proj"],  # routed projections (implicit experts)
    shared_expert=["up_proj", "down_proj"],            # projections treated as a shared expert
    rank=2,                                            # adapter rank
    alpha=16.0,                                        # adapter scaling factor
    temperature=1.0,                                   # routing temperature
    ema_momentum=0.2,                                  # EMA smoothing for cluster centers
    top_k=2,                                           # adapters activated per token
    rep_mode="token",                                  # routing granularity (see Routing Modes)
)

Initialize Router (k-means centers)

from src.kmneas import init_router_centers

# Initialize routing cluster centers with k-means before training starts.
init_router_centers(
    trainer,            # trainer object holding the model and training data
    subset_size=4000,   # number of samples used for k-means initialization
    kmeans_iters=15,    # k-means iterations
    rep_mode="token",   # same routing granularity as in apply_monkeyjump
)
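Putting the two snippets together, a minimal end-to-end run could look like the following; the Hugging Face Trainer settings and the tokenized train_dataset are assumptions for illustration, not something prescribed by Monkey Jump.

from transformers import Trainer, TrainingArguments

# Assumes `model` from the Quick Start snippet and a tokenized `train_dataset`.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mj_runs",
                           per_device_train_batch_size=8,
                           gradient_accumulation_steps=2,
                           num_train_epochs=3),
    train_dataset=train_dataset,
)

init_router_centers(trainer, subset_size=4000, kmeans_iters=15, rep_mode="token")
trainer.train()  # adapters are trained as usual; the routing itself receives no gradients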

🧪 Routing Modes

Token / Sequence routing
Mode         Description
token        Per-token routing
last         Uses the last token only
mean         Mean of all tokens
prompt_end   Token at the prompt boundary
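A hedged sketch of how these modes could map hidden states to the representation that gets routed (illustrative function, not the repository's code):

import torch

def routing_representation(hidden, rep_mode="token", prompt_len=None):
    """hidden: (batch, seq_len, d) hidden states of one Transformer block."""
    if rep_mode == "token":
        return hidden                                  # route every token independently
    if rep_mode == "last":
        return hidden[:, -1:, :]                       # one decision per sequence, from the final token
    if rep_mode == "mean":
        return hidden.mean(dim=1, keepdim=True)        # average-pooled sequence representation
    if rep_mode == "prompt_end":
        # token sitting at the prompt/response boundary (prompt_len given per example)
        idx = torch.as_tensor(prompt_len, device=hidden.device) - 1
        return hidden[torch.arange(hidden.size(0)), idx].unsqueeze(1)
    raise ValueError(f"unknown rep_mode: {rep_mode}")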

📈 Efficiency Analysis

H100 • Batch 8 • GA 2
Efficiency chart
Comparison of MJ variants and MoE-PEFT baselines across key efficiency metrics (GPU: NVIDIA H100 80GB, PyTorch + HF Transformers, batch size 8, grad accumulation 2).

🔢 Parameter Efficiency

Method          Params (K)
MJ-Propulsion   49
MixLoRA         364
HydraLoRA       909
MoELoRA         1,425
MJ-LoRAFA       98
MJ-LoRA         270

Total model size remains nearly identical across methods (~1,705 MB) because MJ reuses existing adapters instead of adding new experts.
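To sanity-check trainable-parameter counts on your own model, plain PyTorch is enough (no MJ-specific API assumed):

# Count trainable vs. total parameters of the wrapped model.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable / 1e3:.0f}K of {total / 1e6:.0f}M total")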

💾 Memory & ⚡ Speed

Method          Peak Memory (GB)
MJ-Propulsion   12.0
MoEAdaLoRA      23.2
MoELoRA         22.8
MJ-AdaLoRA      15.4

MJ achieves up to 48% memory savings via top-k sparse routing, and improves training/inference throughput.
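Peak memory figures like these can be reproduced with PyTorch's built-in CUDA statistics; where you place the reset/read calls around your own training steps is up to you.

import torch

torch.cuda.reset_peak_memory_stats()
# ... run one or more training steps here ...
peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
print(f"peak GPU memory: {peak_gb:.1f} GB")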

🔬 Theoretical Insights

Expressivity + Information theory

1) Token-wise routing increases expressivity

Standard PEFT sums adapter updates, which can cause cancellation:

U^{PEFT} = ( Σ_{e=1..E} ΔW_e ) H

MJ routes tokens selectively to adapters:

U^{MJ} = [ ΔW_1 H_1  ...  ΔW_E H_E ]
rank(U^{MJ}) ≥ rank(U^{PEFT})
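A tiny numeric example (illustrative, not taken from the paper) makes the cancellation point concrete: two adapter updates that negate each other erase the summed PEFT update, while routing each token to its own adapter preserves it.

import numpy as np

H = np.eye(2)                         # two token representations as columns, d = 2
dW1 = np.array([[1.0, 0.0], [0.0, 0.0]])
dW2 = -dW1                            # adapter updates that cancel when summed

U_peft = (dW1 + dW2) @ H              # standard PEFT: summed update collapses to zero
U_mj = np.concatenate([dW1 @ H[:, :1], dW2 @ H[:, 1:]], axis=1)  # token 1 -> adapter 1, token 2 -> adapter 2

print(np.linalg.matrix_rank(U_peft))  # 0
print(np.linalg.matrix_rank(U_mj))    # 1  (rank(U^{MJ}) >= rank(U^{PEFT}))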

2) Last-token routing is optimal (sequence-wise)

In causal Transformers, the last token representation has attended to the full sequence:

I(h_T; X) ≥ I(h_t; X)   for all t < T
I(h_T; X) ≥ I(mean(h); X)


🧪 Experiments

47 benchmarks • 3 modalities

We evaluate Monkey Jump (MJ) on 47 multi-task benchmarks: 📝 14 Text, 🖼️ 14 Image, 🎬 19 Video.

โš™๏ธ Setup

  • Text: LLaMA-3-8B-Instruct
  • Image/Video: LLaVA-OneVision-Qwen2-7B
  • PEFT / MoE-PEFT applied to attention projections (Q, K, V, O) and FFN gate
  • MJ variants: MJLoRA, MJLoRAFA, MJAdaLoRA, MJPropulsion
MJ benchmark results
Average performance across task families (mean ± std over 5 runs).

๐Ÿ” Key Takeaways

  • ✅ Comparable or better performance than MoE-PEFT with 7–29× fewer trainable parameters
  • 🏆 MJLoRA ties or outperforms HydraLoRA and MoA on GLUE and QA tasks
  • 🖼️ MJAdaLoRA leads image classification and action-object tasks
  • ⚡ MJPropulsion is strong on motion and high-level video reasoning

🧪 Ablation Study

Visualization + topics
Layer-wise clustering visualization
Layer-wise cluster visualization (t-SNE projection) colored by assigned expert (E0–E4).

📊 Ablation Study Topics

  • Initialization Method
  • K-Means Sample Size
  • Cluster Update Coverage
  • Router Count
  • Routing Granularity
  • Similarity Function
  • Routing Temperature
  • EMA Smoothing Factor
  • Update Schedule
  • Projection Specialization
  • Linear Probing for Last-Token Routing
  • Expert Permutation Analysis
  • Shared Expert Selection
  • Rank Sensitivity
  • Expert Combination Analysis
  • Impact of K-Means Initialization
  • Expert Usage and Self-Balancing
  • Complexity and Parameter Analysis

For full details, please refer to the full paper.

📜 Citation

BibTeX
@article{prottasha2025monkeyjump,
  title={Monkey Jump: MoE-Style PEFT for Efficient Multi-Task Learning},
  author={Prottasha, Nusrat Jahan and Kowsher, Md and Yu, Chun-Nam and Chen, Chen and Garibay, Ozlem},
  journal={arXiv preprint arXiv:2601.06356v1},
  year={2026}
}

๐Ÿ“ License

MIT

MIT License

© Monkey Jump • Built for GitHub Pages