
Alex Cheema - e/acc

@alexocheema

Published: January 7, 2025

While Apple has been positioning M4 chips for local AI inference with their unified memory architecture, NVIDIA just undercut them massively. Stacking Project Digits personal computers is now the most affordable way to run frontier LLMs locally. The 1 petaflop headline feels like marketing hyperbole, but otherwise this is a huge deal:

Project Digits: 128GB @ 512GB/s, 250 TFLOPS (fp16), $3,000
M4 Pro Mac Mini: 64GB @ 273GB/s, 17 TFLOPS (fp16), $2,200
M4 Max MacBook Pro: 128GB @ 546GB/s, 34 TFLOPS (fp16), $4,700

Project Digits has 2x the memory bandwidth of the M4 Pro with 14x the compute! Project Digits can run Llama 3.3 70B (fp8) at 8 tok/sec (reading speed).

Single-request (batch_size=1) inference is bottlenecked by memory capacity and memory bandwidth. This was always the constraint with the RTX 4090, and it's why a gaming PC can't compete on tokens per second at batch_size=1: the whole model can't fit into an RTX 4090 (24GB), so it has to be streamed into the GPU from system RAM, bottlenecked by the GPU's PCIe 4.0 link at 64GB/s.

You will also start to see builds with multiple 5070 GPUs. The upgrade to PCIe 5.0 means a 2 x 5070 machine could support 256GB/s of bandwidth from system RAM to GPU. I estimate this build at ~$6,000 in total (supporting full x16/x16 PCIe 5.0 is expensive), the same as the cost of two Project Digits PCs.

Congrats NVIDIA, you just found yourself a new market. @exolabs will support high-performance inference on a cluster of Project Digits PCs on day 1.
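A quick sanity check of the token rates cited above (not part of the original tweet): at batch_size=1, decode speed is roughly memory bandwidth divided by model size, since every weight must be read once per token. A minimal Python sketch, assuming fp8 weights (~1 byte per parameter, so ~70GB for Llama 3.3 70B) and ignoring KV-cache traffic and kernel inefficiencies:

```python
def tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper-bound decode speed when batch_size=1 inference is memory-bandwidth-bound."""
    return bandwidth_gb_s / model_size_gb

LLAMA_33_70B_FP8_GB = 70  # ~1 byte per parameter at fp8 (assumption)

# Project Digits: weights fit in 128GB unified memory at 512GB/s
print(tokens_per_sec(512, LLAMA_33_70B_FP8_GB))  # ~7.3 tok/sec, close to the ~8 tok/sec claim

# RTX 4090: model doesn't fit in 24GB VRAM, so weights stream over PCIe 4.0
# (using the 64GB/s figure from the tweet)
print(tokens_per_sec(64, LLAMA_33_70B_FP8_GB))   # ~0.9 tok/sec

# Hypothetical 2x 5070 build over PCIe 5.0 (~256GB/s aggregate, per the tweet)
print(tokens_per_sec(256, LLAMA_33_70B_FP8_GB))  # ~3.7 tok/sec
```

These are ceilings, not measurements; real throughput depends on how efficiently the runtime overlaps transfers and compute.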


Performance benchmarks will be added here once the chip is out https://x.com/exolabs/status/1...
