submitted 6 days ago* (last edited 6 days ago) by [email protected] to c/[email protected]

The project implements sparse multiplication and fuses the up and down projections in the MLP layers through low-rank weight activations. The work builds on Deja Vu and Apple's LLM in a Flash.

This approach avoids loading feed-forward layer weights, and computing activations with them, when their outputs will eventually be zeroed out anyway.
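To make the idea concrete, here is a minimal sketch in plain Python, not the project's actual kernels. It assumes a ReLU-style activation that produces exact zeros and a row-per-neuron weight layout, and the names `dense_mlp`, `sparse_mlp` and `active` are made up for illustration. The list of active neurons is computed by an oracle here; in the real system a predictor would supply it.

```python
import random

def relu(v):
    return v if v > 0.0 else 0.0

def dense_mlp(x, w_up, w_down):
    """Reference block: every hidden neuron is computed."""
    d_ff, d_model = len(w_up), len(w_down[0])
    hidden = [relu(sum(xi * wi for xi, wi in zip(x, w_up[j]))) for j in range(d_ff)]
    return [sum(hidden[j] * w_down[j][k] for j in range(d_ff)) for k in range(d_model)]

def sparse_mlp(x, w_up, w_down, active):
    """Same block, but only the rows for neurons in `active` are read and computed."""
    d_model = len(w_down[0])
    out = [0.0] * d_model
    for j in active:
        h = relu(sum(xi * wi for xi, wi in zip(x, w_up[j])))
        for k in range(d_model):
            out[k] += h * w_down[j][k]
    return out

random.seed(0)
d_model, d_ff = 8, 32  # toy sizes, chosen only for the example
x = [random.gauss(0, 1) for _ in range(d_model)]
# One row per hidden neuron, so skipping a neuron skips one row in each matrix.
w_up = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_ff)]
w_down = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_ff)]

# Oracle mask (assumption): neurons whose pre-activation is positive.
active = [j for j in range(d_ff) if sum(xi * wi for xi, wi in zip(x, w_up[j])) > 0.0]

dense_out = dense_mlp(x, w_up, w_down)
sparse_out = sparse_mlp(x, w_up, w_down, active)
max_abs_diff = max(abs(a - b) for a, b in zip(dense_out, sparse_out))
print(f"computed {len(active)} of {d_ff} neurons, max abs diff = {max_abs_diff}")
```

The skipped neurons would have contributed zero to the output, so the sparse path matches the dense one while reading only the rows in `active`.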

It's a lossless approach, since these weights do not contribute to the current token prediction anyway. It does, however, need the predictors to be accurate in clustering the weights.
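One rough way to think about "accurate" here is recall: of the neurons that really fire for a token, how many did the predictor pick? The sketch below is only an illustration; `predictor_recall` and the neuron indices are hypothetical and not taken from the project.

```python
def predictor_recall(predicted, actual):
    """Fraction of truly active neurons that the predictor also selected."""
    actual = set(actual)
    if not actual:
        return 1.0  # nothing to miss
    return len(actual & set(predicted)) / len(actual)

truly_active = [1, 4, 7]         # hypothetical neurons that fire for this token
over_predicted = [1, 4, 7, 9]    # one extra neuron: wasted compute, same output
under_predicted = [1, 4]         # neuron 7 missed: the output would change

recall_over = predictor_recall(over_predicted, truly_active)
recall_under = predictor_recall(under_predicted, truly_active)
```

Under this view the result stays lossless only while recall is 1.0. Extra predicted neurons cost some speed, while missed ones cost accuracy.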

The result? 5x faster MLP layers in transformers with 50% less memory consumption, achieved by skipping the sleeping nodes in every token prediction. For Llama 3.2, feed-forward layers accounted for 30% of total weights and forward-pass computation, resulting in a 1.6-1.8x increase in throughput:

Sparse LLaMA 3.2 3B vs LLaMA 3.2 3B (HuggingFace implementation):

- Time to First Token (TTFT):  1.51× faster (1.209s → 0.803s)
- Output Generation Speed:     1.79× faster (0.7 → 1.2 tokens/sec)  
- Total Throughput:            1.78× faster (0.7 → 1.3 tokens/sec)
- Memory Usage:                26.4% reduction (6.125GB → 4.15GB)
this post was submitted on 05 Jun 2025