- cross-posted to:
- ai_@lemmy.world
- technology@lemmy.ml
The project implements sparse multiplication and fuses the up/down projections in the MLP layers through low-rank weight-activation prediction. The work is based on Deja Vu and Apple's LLM in a Flash.
This approach avoids loading, and computing activations with, the feed-forward weights whose outputs would be zeroed out anyway.
It is a lossless approach, since those weights do not contribute to the current token's prediction in any case. It does, however, require the predictors to be accurate in clustering the weights. A minimal sketch of the idea follows below.
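For intuition, here is a minimal PyTorch sketch of predictor-gated sparse MLP inference in the Deja Vu / LLM in a Flash style. This is not the project's code: the layer sizes, the `pred_A`/`pred_B` low-rank predictor, and the top-k gating rule are illustrative assumptions.

```python
# Sketch of a predictor-gated sparse MLP forward pass (illustrative, not the project's code).
import torch

hidden, intermediate, rank = 3072, 8192, 256  # Llama-3.2-3B-like sizes, for illustration

# Dense SwiGLU MLP weights (gate/up/down) and a low-rank predictor A, B (trained offline in practice).
gate = torch.randn(intermediate, hidden) * 0.02
up   = torch.randn(intermediate, hidden) * 0.02
down = torch.randn(hidden, intermediate) * 0.02
pred_A = torch.randn(rank, hidden) * 0.02
pred_B = torch.randn(intermediate, rank) * 0.02

def sparse_mlp(x: torch.Tensor, keep: float = 0.5) -> torch.Tensor:
    """Compute the MLP output using only the neurons the predictor marks as active."""
    # 1) Cheap low-rank prediction of per-neuron activation scores.
    scores = pred_B @ (pred_A @ x)                   # (intermediate,)
    k = int(keep * intermediate)
    idx = torch.topk(scores, k).indices              # indices of the "awake" neurons

    # 2) Gather only the selected rows and apply the gated activation on them.
    g = torch.nn.functional.silu(gate[idx] @ x) * (up[idx] @ x)   # (k,)

    # 3) Down-project with the matching columns; skipped neurons contribute exactly zero.
    return down[:, idx] @ g                          # (hidden,)

x = torch.randn(hidden)
print(sparse_mlp(x).shape)  # torch.Size([3072])
```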
The result? 5× faster MLP-layer performance in transformers with 50% lower memory consumption, by skipping the sleeping nodes on every token prediction. For Llama 3.2, the feed-forward layers account for roughly 30% of the total weights and forward-pass computation, which translates into a 1.6-1.8× increase in throughput:
Sparse LLaMA 3.2 3B vs. LLaMA 3.2 3B (HuggingFace implementation):
- Time to First Token (TTFT): 1.51× faster (1.209s → 0.803s)
- Output Generation Speed: 1.79× faster (0.7 → 1.2 tokens/sec)
- Total Throughput: 1.78× faster (0.7 → 1.3 tokens/sec)
- Memory Usage: 26.4% reduction (6.125GB → 4.15GB)
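For reference, numbers like TTFT, generation speed, and peak memory can be measured with a rough script along the following lines using the HuggingFace stack. The model id, prompt, and token counts are placeholders, not the project's actual benchmark setup.

```python
# Rough sketch of measuring TTFT, generation speed, and GPU memory (illustrative only).
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-3B"   # assumed model id for illustration
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tok("Explain contextual sparsity in one paragraph.", return_tensors="pt").to(model.device)

# Time to first token: generate exactly one new token.
t0 = time.perf_counter()
model.generate(**inputs, max_new_tokens=1, do_sample=False)
ttft = time.perf_counter() - t0

# Generation speed: time a longer completion and divide by tokens produced.
n_new = 128
t0 = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=n_new, do_sample=False)
elapsed = time.perf_counter() - t0
gen_tokens = out.shape[1] - inputs["input_ids"].shape[1]

print(f"TTFT: {ttft:.3f}s")
print(f"Generation speed: {gen_tokens / elapsed:.2f} tokens/sec")
print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```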