Mixture-of-experts (MoE) variants of parameter-efficient fine-tuning (PEFT) enable per-token specialization, but they introduce additional trainable routers and expert parameters, increasing memory and training costs. This undermines the core goal of PEFT.
We propose Monkey Jump (MJ), named for its selective activation pattern: adapters "jump" on for some projections and off for others. MJ brings MoE-style specialization to PEFT without adding extra trainable parameters. Instead, MJ reuses the PEFT adapters already present in each Transformer block (e.g., query, key, value, up, down) as implicit experts, and routes tokens among them using k-means clustering with EMA-updated centers (no gradients, no learned parameters).
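To make the routing mechanism concrete, here is a minimal sketch of a gradient-free router with EMA-updated k-means centers, assuming token representations are assigned to their nearest centers. The class name, initialization, and defaults are illustrative, not the repository's implementation.

```python
import torch

class KMeansEMARouter:
    """Gradient-free router sketch: assigns each token representation to the
    nearest of E cluster centers and updates the centers with an EMA.
    Illustrative only; names, shapes, and defaults are assumptions."""

    def __init__(self, num_experts: int, hidden_dim: int, momentum: float = 0.2):
        self.centers = torch.randn(num_experts, hidden_dim)  # (E, d), typically initialized via k-means
        self.momentum = momentum

    @torch.no_grad()
    def route(self, reps: torch.Tensor, top_k: int = 2, update: bool = True):
        # reps: (N, d) token (or sequence-level) representations
        dists = torch.cdist(reps, self.centers)              # (N, E) distances to centers
        top_idx = dists.topk(top_k, largest=False).indices   # k nearest centers per token

        if update:
            # EMA update: move each center toward the mean of its assigned tokens
            assign = top_idx[:, 0]                            # hard assignment to nearest center
            for e in range(self.centers.size(0)):
                mask = assign == e
                if mask.any():
                    batch_mean = reps[mask].mean(dim=0)
                    self.centers[e] = (1 - self.momentum) * self.centers[e] \
                                      + self.momentum * batch_mean
        return top_idx  # indices of the adapters to activate for each token
```

Because the centers are updated in place without gradients, routing adds no trainable parameters beyond the adapters that are already there.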
```bash
git clone https://github.com/yourusername/MonkeyJump.git
cd MonkeyJump
pip install torch torchvision torchaudio
pip install transformers accelerate datasets peft
pip install scikit-learn tqdm numpy pandas
```

Tip: Replace `yourusername` with the GitHub username that hosts the repository.
```python
from transformers import AutoModelForCausalLM
from src.MJLoRA import apply_monkeyjump

model = AutoModelForCausalLM.from_pretrained("model_name")  # your base model checkpoint

model = apply_monkeyjump(
    model,
    blocks={"LlamaDecoderLayer": list(range(32))},     # decoder layer class and layer indices to wrap
    linears=["q_proj", "k_proj", "v_proj", "o_proj"],  # adapters on these projections act as routed experts
    shared_expert=["up_proj", "down_proj"],            # adapters on these projections stay always active
    rank=2,                                            # adapter rank
    alpha=16.0,                                        # adapter scaling
    temperature=1.0,                                   # softmax temperature for routing weights
    ema_momentum=0.2,                                  # EMA momentum for the k-means centers
    top_k=2,                                           # adapters activated per token
    rep_mode="token",                                  # routing representation (see modes below)
)
```
Before training, initialize the routing centers with k-means over a subset of the training data:

```python
from src.kmneas import init_router_centers

init_router_centers(
    trainer,            # training object that exposes the model and training data
    subset_size=4000,   # number of examples used to fit the initial centers
    kmeans_iters=15,    # k-means iterations
    rep_mode="token",   # routing representation (see modes below)
)
```
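For context, the `trainer` above is assumed here to be a standard Hugging Face `Trainer` holding the MJ-wrapped model and a tokenized dataset; the sketch below shows one way the pieces could fit together, with placeholder hyperparameters and a placeholder `train_dataset`.

```python
from transformers import Trainer, TrainingArguments

# Sketch only: assumes init_router_centers accepts a standard Hugging Face Trainer.
trainer = Trainer(
    model=model,  # the MJ-wrapped model returned by apply_monkeyjump(...)
    args=TrainingArguments(output_dir="mj_out", per_device_train_batch_size=8),
    train_dataset=train_dataset,  # placeholder: your tokenized training split
)

# Fit the k-means routing centers on a data subset, then fine-tune as usual.
init_router_centers(trainer, subset_size=4000, kmeans_iters=15, rep_mode="token")
trainer.train()
```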
| Mode | Description |
|---|---|
| `token` | Per-token routing |
| `last` | Uses the last token only |
| `mean` | Uses the mean of all tokens |
| `prompt_end` | Uses the token at the prompt boundary |
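The modes above differ only in which hidden states are handed to the router. A minimal sketch of how those representations could be selected (the function name and the `prompt_len` argument are illustrative assumptions, not the repository's interface):

```python
import torch

def routing_representation(hidden, mode, prompt_len=None):
    """Select the representation(s) used for routing.

    hidden: (batch, seq_len, d) hidden states of the current block.
    Returns (batch, seq_len, d) for per-token routing, or (batch, 1, d) for
    sequence-level routing. Illustrative sketch only.
    """
    if mode == "token":        # route every token independently
        return hidden
    if mode == "last":         # use only the final token of each sequence
        return hidden[:, -1:, :]
    if mode == "mean":         # average all tokens into a single representation
        return hidden.mean(dim=1, keepdim=True)
    if mode == "prompt_end":   # use the token at the prompt/response boundary
        if prompt_len is None:
            raise ValueError("prompt_end mode requires prompt_len")
        return hidden[:, prompt_len - 1 : prompt_len, :]
    raise ValueError(f"unknown rep_mode: {mode}")
```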
| Method | Trainable Params (K) |
|---|---|
| MJ-Propulsion | 49 |
| MixLoRA | 364 |
| HydraLoRA | 909 |
| MoELoRA | 1,425 |
| MJ-LoRAFA | 98 |
| MJ-LoRA | 270 |
Total model size remains nearly identical (~1,705 MB) because MJ reuses existing adapters instead of adding new experts.
| Method | Peak Memory (GB) |
|---|---|
| MJ-Propulsion | 12.0 |
| MoEAdaLoRA | 23.2 |
| MoELoRA | 22.8 |
| MJ-AdaLoRA | 15.4 |
MJ achieves up to 48% lower peak memory via top-k sparse routing (e.g., MJ-Propulsion at 12.0 GB vs. MoEAdaLoRA at 23.2 GB) and also improves training and inference throughput.
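The savings come from running each adapter only on the tokens routed to it rather than on the whole batch. A hedged sketch of that gated forward pass (function name, gate inputs, and the toy usage are illustrative, not the repository's code):

```python
import torch
import torch.nn as nn

def gated_adapter_forward(linear, adapter, h, gate_mask, gate_weight):
    """Forward for one wrapped projection: the frozen linear always runs,
    while its adapter runs only on the tokens routed to it (top-k gating).

    h: (N, d_in); gate_mask: (N,) bool; gate_weight: (N,) routing weights.
    """
    out = linear(h)                                  # frozen base projection, all tokens
    idx = gate_mask.nonzero(as_tuple=True)[0]
    if idx.numel() > 0:
        # Adapter delta is computed only for the routed tokens,
        # which is where the memory and throughput savings come from.
        delta = gate_weight[idx].unsqueeze(-1) * adapter(h[idx])
        out = out.index_add(0, idx, delta)           # scatter the deltas back (out-of-place)
    return out

# Toy usage with a Linear standing in for a low-rank adapter.
base = nn.Linear(16, 16)
lora = nn.Linear(16, 16, bias=False)
h = torch.randn(8, 16)
mask = torch.tensor([True, False] * 4)
weight = torch.full((8,), 0.5)
print(gated_adapter_forward(base, lora, h, mask, weight).shape)  # torch.Size([8, 16])
```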
Standard PEFT sums adapter updates, which can cause cancellation:
U^{PEFT} = ( Σ_{e=1..E} ΔW_e ) H
MJ instead routes tokens selectively, so each adapter acts only on the hidden states H_e of the tokens routed to it, and the per-expert contributions occupy disjoint column blocks that cannot cancel each other:
U^{MJ} = [ ΔW_1 H_1 ... ΔW_E H_E ]
rank(U^{MJ}) ≥ rank(U^{PEFT})
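A tiny numerical illustration of the cancellation argument, using an extreme case where two adapters exactly cancel when summed (sizes and the exact-cancellation setup are contrived for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, r = 32, 64, 4

H = rng.standard_normal((d, T))
A = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))
deltas = [A, -A]                              # two adapters whose sum cancels exactly

U_peft = sum(deltas) @ H                      # summed update: collapses to zero
U_mj = np.concatenate(                        # routed update: each expert keeps its own token block
    [deltas[0] @ H[:, : T // 2], deltas[1] @ H[:, T // 2 :]], axis=1
)

print(np.linalg.matrix_rank(U_peft))          # 0  -> complete cancellation
print(np.linalg.matrix_rank(U_mj))            # 4  -> rank r preserved
```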
In causal Transformers, the last token's representation h_T has attended to the entire sequence, so it retains at least as much information about the input X as any earlier token or the token average:
I(h_T; X) ≥ I(h_t; X) for all t < T
I(h_T; X) ≥ I(mean(h); X)
Note: Equations on this page are written in plain text so they render on GitHub Pages without extra libraries such as MathJax.
We evaluate Monkey Jump (MJ) on 47 multi-task benchmarks: 14 text, 14 image, and 19 video tasks.
Available MJ variants: MJLoRA, MJLoRAFA, MJAdaLoRA, and MJPropulsion.
For full details, please refer to the paper.
```bibtex
@article{prottasha2025monkeyjump,
  title={Monkey Jump: MoE-Style PEFT for Efficient Multi-Task Learning},
  author={Prottasha, Nusrat Jahan and Kowsher, Md and Yu, Chun-Nam and Chen, Chen and Garibay, Ozlem},
  journal={arXiv preprint arXiv:2601.06356v1},
  year={2026}
}
```
MIT License