- cross-posted to:
- ai_@lemmy.world
- technology@lemmy.ml
The project implements sparse multiplication and fuses the up/down projections in the MLP layers through low-rank weight-activation prediction. The work is based on Deja Vu and Apple's LLM in a Flash.
This approach avoids loading, and computing activations with, the feed-forward weights whose outputs would be zeroed out anyway.
It is a lossless approach, since those weights do not contribute to the current token's prediction in any case. It does, however, require the predictors to be accurate in clustering the weights. A minimal sketch of the idea follows below.
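For intuition, here is a minimal PyTorch sketch of predictor-gated sparse MLP inference in the Deja Vu / LLM in a Flash style. This is not the project's code: the layer sizes, the `pred_A`/`pred_B` low-rank predictor, and the top-k gating rule are illustrative assumptions.

```python
# Sketch of a predictor-gated sparse MLP forward pass (illustrative, not the project's code).
import torch

hidden, intermediate, rank = 3072, 8192, 256  # Llama-3.2-3B-like sizes, for illustration

# Dense SwiGLU MLP weights (gate/up/down) and a low-rank predictor A, B (trained offline in practice).
gate = torch.randn(intermediate, hidden) * 0.02
up   = torch.randn(intermediate, hidden) * 0.02
down = torch.randn(hidden, intermediate) * 0.02
pred_A = torch.randn(rank, hidden) * 0.02
pred_B = torch.randn(intermediate, rank) * 0.02

def sparse_mlp(x: torch.Tensor, keep: float = 0.5) -> torch.Tensor:
    """Compute the MLP output using only the neurons the predictor marks as active."""
    # 1) Cheap low-rank prediction of per-neuron activation scores.
    scores = pred_B @ (pred_A @ x)                   # (intermediate,)
    k = int(keep * intermediate)
    idx = torch.topk(scores, k).indices              # indices of the "awake" neurons

    # 2) Gather only the selected rows and apply the gated activation on them.
    g = torch.nn.functional.silu(gate[idx] @ x) * (up[idx] @ x)   # (k,)

    # 3) Down-project with the matching columns; skipped neurons contribute exactly zero.
    return down[:, idx] @ g                          # (hidden,)

x = torch.randn(hidden)
print(sparse_mlp(x).shape)  # torch.Size([3072])
```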
The result? 5× faster MLP-layer performance in transformers with 50% lower memory consumption, by skipping the sleeping nodes on every token prediction. For Llama 3.2, the feed-forward layers account for roughly 30% of the total weights and forward-pass computation, which translates into a 1.6-1.8× increase in throughput:
Sparse LLaMA 3.2 3B vs. LLaMA 3.2 3B (HuggingFace implementation):
- Time to First Token (TTFT): 1.51× faster (1.209s → 0.803s)
- Output Generation Speed: 1.79× faster (0.7 → 1.2 tokens/sec)
- Total Throughput: 1.78× faster (0.7 → 1.3 tokens/sec)
- Memory Usage: 26.4% reduction (6.125GB → 4.15GB)
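For reference, numbers like TTFT, generation speed, and peak memory can be measured with a rough script along the following lines using the HuggingFace stack. The model id, prompt, and token counts are placeholders, not the project's actual benchmark setup.

```python
# Rough sketch of measuring TTFT, generation speed, and GPU memory (illustrative only).
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-3B"   # assumed model id for illustration
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tok("Explain contextual sparsity in one paragraph.", return_tensors="pt").to(model.device)

# Time to first token: generate exactly one new token.
t0 = time.perf_counter()
model.generate(**inputs, max_new_tokens=1, do_sample=False)
ttft = time.perf_counter() - t0

# Generation speed: time a longer completion and divide by tokens produced.
n_new = 128
t0 = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=n_new, do_sample=False)
elapsed = time.perf_counter() - t0
gen_tokens = out.shape[1] - inputs["input_ids"].shape[1]

print(f"TTFT: {ttft:.3f}s")
print(f"Generation speed: {gen_tokens / elapsed:.2f} tokens/sec")
print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```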