Mixture-of-experts (MoE) variants of parameter-efficient fine-tuning (PEFT) enable per-token specialization, but they introduce additional trainable routers and expert parameters, increasing memory and training costs. This undermines the core goal of PEFT.
We propose Monkey Jump (MJ), named for its selective activation pattern: adapters "jump" on for some projections and off for others. MJ brings MoE-style specialization to PEFT without adding extra trainable parameters. Instead, MJ reuses the PEFT adapters already present in each Transformer block (e.g., query, key, value, up, down) as implicit experts, and routes tokens among them using k-means clustering with EMA-updated centers (no gradients, no learned parameters).
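To make the routing mechanism concrete, here is a minimal sketch of a gradient-free router with EMA-updated k-means centers, assuming token representations are assigned to their nearest centers. The class name, initialization, and defaults are illustrative, not the repository's implementation.

```python
import torch

class KMeansEMARouter:
    """Gradient-free router sketch: assigns each token representation to the
    nearest of E cluster centers and updates the centers with an EMA.
    Illustrative only; names, shapes, and defaults are assumptions."""

    def __init__(self, num_experts: int, hidden_dim: int, momentum: float = 0.2):
        self.centers = torch.randn(num_experts, hidden_dim)  # (E, d), typically initialized via k-means
        self.momentum = momentum

    @torch.no_grad()
    def route(self, reps: torch.Tensor, top_k: int = 2, update: bool = True):
        # reps: (N, d) token (or sequence-level) representations
        dists = torch.cdist(reps, self.centers)              # (N, E) distances to centers
        top_idx = dists.topk(top_k, largest=False).indices   # k nearest centers per token

        if update:
            # EMA update: move each center toward the mean of its assigned tokens
            assign = top_idx[:, 0]                            # hard assignment to nearest center
            for e in range(self.centers.size(0)):
                mask = assign == e
                if mask.any():
                    batch_mean = reps[mask].mean(dim=0)
                    self.centers[e] = (1 - self.momentum) * self.centers[e] \
                                      + self.momentum * batch_mean
        return top_idx  # indices of the adapters to activate for each token
```

Because the centers are updated in place without gradients, routing adds no trainable parameters beyond the adapters that are already there.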
```bash
git clone https://github.com/yourusername/MonkeyJump.git
cd MonkeyJump
pip install torch torchvision torchaudio
pip install transformers accelerate datasets peft
pip install scikit-learn tqdm numpy pandas
```

Tip: Replace `yourusername` with the GitHub username that hosts the repository.
```python
from transformers import AutoModelForCausalLM
from src.MJLoRA import apply_monkeyjump

model = AutoModelForCausalLM.from_pretrained("model_name")  # your base model checkpoint

model = apply_monkeyjump(
    model,
    blocks={"LlamaDecoderLayer": list(range(32))},     # decoder layer class and layer indices to wrap
    linears=["q_proj", "k_proj", "v_proj", "o_proj"],  # adapters on these projections act as routed experts
    shared_expert=["up_proj", "down_proj"],            # adapters on these projections stay always active
    rank=2,                                            # adapter rank
    alpha=16.0,                                        # adapter scaling
    temperature=1.0,                                   # softmax temperature for routing weights
    ema_momentum=0.2,                                  # EMA momentum for the k-means centers
    top_k=2,                                           # adapters activated per token
    rep_mode="token",                                  # routing representation (see modes below)
)
```
Before training, initialize the routing centers with k-means over a subset of the training data:

```python
from src.kmneas import init_router_centers

init_router_centers(
    trainer,            # training object that exposes the model and training data
    subset_size=4000,   # number of examples used to fit the initial centers
    kmeans_iters=15,    # k-means iterations
    rep_mode="token",   # routing representation (see modes below)
)
```
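For context, the `trainer` above is assumed here to be a standard Hugging Face `Trainer` holding the MJ-wrapped model and a tokenized dataset; the sketch below shows one way the pieces could fit together, with placeholder hyperparameters and a placeholder `train_dataset`.

```python
from transformers import Trainer, TrainingArguments

# Sketch only: assumes init_router_centers accepts a standard Hugging Face Trainer.
trainer = Trainer(
    model=model,  # the MJ-wrapped model returned by apply_monkeyjump(...)
    args=TrainingArguments(output_dir="mj_out", per_device_train_batch_size=8),
    train_dataset=train_dataset,  # placeholder: your tokenized training split
)

# Fit the k-means routing centers on a data subset, then fine-tune as usual.
init_router_centers(trainer, subset_size=4000, kmeans_iters=15, rep_mode="token")
trainer.train()
```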
| Mode | Description |
|---|---|
| `token` | Per-token routing |
| `last` | Uses the last token only |
| `mean` | Uses the mean of all tokens |
| `prompt_end` | Uses the token at the prompt boundary |
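The modes above differ only in which hidden states are handed to the router. A minimal sketch of how those representations could be selected (the function name and the `prompt_len` argument are illustrative assumptions, not the repository's interface):

```python
import torch

def routing_representation(hidden, mode, prompt_len=None):
    """Select the representation(s) used for routing.

    hidden: (batch, seq_len, d) hidden states of the current block.
    Returns (batch, seq_len, d) for per-token routing, or (batch, 1, d) for
    sequence-level routing. Illustrative sketch only.
    """
    if mode == "token":        # route every token independently
        return hidden
    if mode == "last":         # use only the final token of each sequence
        return hidden[:, -1:, :]
    if mode == "mean":         # average all tokens into a single representation
        return hidden.mean(dim=1, keepdim=True)
    if mode == "prompt_end":   # use the token at the prompt/response boundary
        if prompt_len is None:
            raise ValueError("prompt_end mode requires prompt_len")
        return hidden[:, prompt_len - 1 : prompt_len, :]
    raise ValueError(f"unknown rep_mode: {mode}")
```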
| Method | Trainable Params (K) |
|---|---|
| MJ-Propulsion | 49 |
| MixLoRA | 364 |
| HydraLoRA | 909 |
| MoELoRA | 1,425 |
| MJ-LoRAFA | 98 |
| MJ-LoRA | 270 |
Total model size remains nearly identical (~1,705 MB) because MJ reuses existing adapters instead of adding new experts.
| Method | Peak Memory (GB) |
|---|---|
| MJ-Propulsion | 12.0 |
| MoEAdaLoRA | 23.2 |
| MoELoRA | 22.8 |
| MJ-AdaLoRA | 15.4 |
MJ achieves up to 48% lower peak memory via top-k sparse routing (e.g., MJ-Propulsion at 12.0 GB vs. MoEAdaLoRA at 23.2 GB) and also improves training and inference throughput.
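The savings come from running each adapter only on the tokens routed to it rather than on the whole batch. A hedged sketch of that gated forward pass (function name, gate inputs, and the toy usage are illustrative, not the repository's code):

```python
import torch
import torch.nn as nn

def gated_adapter_forward(linear, adapter, h, gate_mask, gate_weight):
    """Forward for one wrapped projection: the frozen linear always runs,
    while its adapter runs only on the tokens routed to it (top-k gating).

    h: (N, d_in); gate_mask: (N,) bool; gate_weight: (N,) routing weights.
    """
    out = linear(h)                                  # frozen base projection, all tokens
    idx = gate_mask.nonzero(as_tuple=True)[0]
    if idx.numel() > 0:
        # Adapter delta is computed only for the routed tokens,
        # which is where the memory and throughput savings come from.
        delta = gate_weight[idx].unsqueeze(-1) * adapter(h[idx])
        out = out.index_add(0, idx, delta)           # scatter the deltas back (out-of-place)
    return out

# Toy usage with a Linear standing in for a low-rank adapter.
base = nn.Linear(16, 16)
lora = nn.Linear(16, 16, bias=False)
h = torch.randn(8, 16)
mask = torch.tensor([True, False] * 4)
weight = torch.full((8,), 0.5)
print(gated_adapter_forward(base, lora, h, mask, weight).shape)  # torch.Size([8, 16])
```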
Standard PEFT sums adapter updates, which can cause cancellation:
U^{PEFT} = ( Σ_{e=1..E} ΔW_e ) H
MJ instead routes tokens selectively, so each adapter acts only on the hidden states H_e of the tokens routed to it, and the per-expert contributions occupy disjoint column blocks that cannot cancel each other:
U^{MJ} = [ ΔW_1 H_1 ... ΔW_E H_E ]
rank(U^{MJ}) ≥ rank(U^{PEFT})
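A tiny numerical illustration of the cancellation argument, using an extreme case where two adapters exactly cancel when summed (sizes and the exact-cancellation setup are contrived for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, r = 32, 64, 4

H = rng.standard_normal((d, T))
A = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))
deltas = [A, -A]                              # two adapters whose sum cancels exactly

U_peft = sum(deltas) @ H                      # summed update: collapses to zero
U_mj = np.concatenate(                        # routed update: each expert keeps its own token block
    [deltas[0] @ H[:, : T // 2], deltas[1] @ H[:, T // 2 :]], axis=1
)

print(np.linalg.matrix_rank(U_peft))          # 0  -> complete cancellation
print(np.linalg.matrix_rank(U_mj))            # 4  -> rank r preserved
```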
In causal Transformers, the last token's representation h_T has attended to the entire sequence, so it retains at least as much information about the input X as any earlier token or the token average:
I(h_T; X) ≥ I(h_t; X) for all t < T
I(h_T; X) ≥ I(mean(h); X)
Note: Equations on this page are written in plain text so they render on GitHub Pages without extra libraries such as MathJax.
We evaluate Monkey Jump (MJ) on 47 multi-task benchmarks: 14 text, 14 image, and 19 video tasks.
Available MJ variants: MJLoRA, MJLoRAFA, MJAdaLoRA, and MJPropulsion.
For full details, please refer to the paper.
```bibtex
@article{prottasha2025monkeyjump,
  title={Monkey Jump: MoE-Style PEFT for Efficient Multi-Task Learning},
  author={Prottasha, Nusrat Jahan and Kowsher, Md and Yu, Chun-Nam and Chen, Chen and Garibay, Ozlem},
  journal={arXiv preprint arXiv:2601.06356v1},
  year={2026}
}
```
MIT License