Kai INUI's Twitter Thread

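A patch that substantially improves AI performance on the 3090/4090 is apparently landing in PyTorch. For LLM inference tasks it looks like roughly +40% on the 4090. Until now, FP16 matrix multiplications were implemented with GEMMs that accumulate in FP32, and on consumer GPUs in particular FP32 accumulation runs at only about half the throughput of FP16 accumulation.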

> show double throughput when doing FP16 GEMM with FP16 accumulation compared to FP32 accumulation.
> 40% end-to-end speedup on 4090, with a minimal perplexity increase (0.0006) in LLM serving scenarios.

https://github.com/pytorch/pyt...

Note: this applies not only to the 3090/4090 but to the RTX series in general.
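To get a feel for what the FP16 vs FP32 accumulation trade-off means numerically, here is a rough, CPU-only sketch in plain Python. It is not the PyTorch patch and does not model how tensor cores actually sum partial products. It only assumes a simplified model in which the running sum of a dot product is rounded to the accumulator precision after every addition. The helper names (`to_fp16`, `to_fp32`, `dot_with_accumulator`, `accumulation_errors`) are made up for this illustration.

```python
import math
import random
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot_with_accumulator(a, b, round_acc):
    """Dot product of FP16 inputs with a reduced-precision running sum.

    Simplified model (an assumption, not real hardware behavior): each
    product is added to the accumulator in double precision and the result
    is then rounded with `round_acc`.
    """
    acc = 0.0
    for x, y in zip(a, b):
        acc = round_acc(acc + x * y)
    return acc


def accumulation_errors(n: int, seed: int = 0):
    """Return (fp16_error, fp32_error) against a high-precision reference."""
    rng = random.Random(seed)
    # Inputs are stored in FP16 in both cases; only the accumulator differs.
    a = [to_fp16(rng.uniform(-1.0, 1.0)) for _ in range(n)]
    b = [to_fp16(rng.uniform(-1.0, 1.0)) for _ in range(n)]
    reference = math.fsum(x * y for x, y in zip(a, b))
    err16 = abs(dot_with_accumulator(a, b, to_fp16) - reference)
    err32 = abs(dot_with_accumulator(a, b, to_fp32) - reference)
    return err16, err32


if __name__ == "__main__":
    err16, err32 = accumulation_errors(4096)
    print(f"FP16 accumulator absolute error: {err16:.6f}")
    print(f"FP32 accumulator absolute error: {err32:.6f}")
```

In this toy model the FP16 accumulator drifts further from the reference than the FP32 one, which is the accuracy cost being traded for throughput. Whether that drift matters in practice depends on the workload; the quoted PR text reports only a minimal perplexity increase (0.0006) in its LLM serving scenario.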