Osvaldo Pinali Doederlein's Twitter Thread

First look at NVIDIA's Neural Texture Compression with DXR1.2 Cooperative Vectors! First, this needs a preview driver (590.26). I installed that so you don't have to, and it corrupted the screen; only after a few hard resets did it decide to work.😅🧵 https://github.com/NVIDIA-RTX/...

But my system survived, so I ran the demo. And well... the test scene is low-detail AF, you don't even need to zoom much. It has 15 textures, most of them 2K x 2K (as PNGs), yet the rendered scene is unimpressive. But let's chalk that up to a cheap demo.

How does it perform? V-sync disabled, RTX 5080, demo at the startup position (modes explained next tweet):
Default: 2,350 fps / 9.20 MB
No FP8: 2,160 fps / 9.20 MB
No Int8: 2,350 fps / 9.20 MB
DP4A: 1,030 fps / 9.14 MB
Transcoded: 2,600 fps / 79.38 MB

NVIDIA's NTC is sweet in that it has many execution modes. The default/ideal mode uses Cooperative Vectors with FP8. DXR1.2 Coop Vec includes FP8, but actual support will be GPU-dependent; for example, RDNA 3 could potentially support CV but not any FP8 data type.

So if you don't have FP8 but you have INT8 (like RDNA 3 again), NTC still works, and the perf loss is surprisingly low: about -8% (2,160 vs 2,350 fps), at least on Blackwell. There's also an option to block use of INT8 but allow FP8; that makes no difference, so it seems the default decoder doesn't use INT8.

You can also disable use of Coop Vectors completely; then NTC runs on any Shader Model 6.x GPU (DP4A mode), but the perf loss is more severe: about -56% (1,030 vs 2,350 fps, i.e. roughly 44% of the default frame rate), again on Blackwell. How bad is this in practice? You'd need to compare to rendering with standard BCx textures.
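To make the fallback order concrete, here's a minimal sketch of how a renderer might pick a decode mode from GPU capabilities. The `GpuCaps` flags and the mode names are hypothetical, made up for illustration; this is not the NTC SDK's actual API, just the priority order described in the tweets above.

```python
from dataclasses import dataclass


@dataclass
class GpuCaps:
    # Hypothetical capability flags (assumption, not a real driver/SDK query).
    coop_vec: bool  # Cooperative Vectors available
    fp8: bool       # FP8 data type usable with Cooperative Vectors
    int8: bool      # INT8 data type usable with Cooperative Vectors
    dp4a: bool      # packed 8-bit dot product in plain shaders


def pick_decode_mode(caps: GpuCaps) -> str:
    """Return the preferred decode mode, fastest option first."""
    if caps.coop_vec and caps.fp8:
        return "CoopVec-FP8"   # default / ideal path
    if caps.coop_vec and caps.int8:
        return "CoopVec-INT8"  # no FP8, small perf loss in the demo
    if caps.dp4a:
        return "DP4A"          # no Cooperative Vectors, larger perf loss
    return "Unsupported"


# Example: a GPU with Cooperative Vectors and INT8 but no FP8.
print(pick_decode_mode(GpuCaps(coop_vec=True, fp8=False, int8=True, dp4a=True)))
```

A real integration would query these capabilities from the graphics API and the SDK rather than hard-coding flags.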

We can test that, too: the sample also allows disabling inference on sample (when rendering); in that case it does inference on load, where decoding performance matters much less, but I suppose it will cost some extra DRAM space for caching textures transcoded to BC.

The demo shows that inference on load results in a gain of +11% fps compared to the best inference on sample result. So even with Blackwell's super-duper Tensor Cores, NTC is less efficient than old-fashioned BC & TMUs (the latter also help with filtering, which can't be used for NTC).
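For anyone who wants to check the percentages, here's a small sketch that recomputes them from the fps numbers posted earlier in the thread. It also converts fps to frame time, since at 2,000+ fps the differences in milliseconds are tiny. The helper names are mine, just for illustration.

```python
# fps figures from the thread (RTX 5080, demo startup position).
fps = {
    "Default": 2350,
    "No FP8": 2160,
    "No Int8": 2350,
    "DP4A": 1030,
    "Transcoded": 2600,
}


def pct_change(new: float, base: float) -> float:
    """Relative change of `new` versus `base`, in percent."""
    return (new - base) / base * 100.0


def frame_ms(f: float) -> float:
    """Frame time in milliseconds for a given frame rate."""
    return 1000.0 / f


base = fps["Default"]
for mode, f in fps.items():
    print(f"{mode:<10} {f:>5} fps  {frame_ms(f):.3f} ms  "
          f"{pct_change(f, base):+.1f}% vs Default")
```

In absolute terms the DP4A path adds roughly half a millisecond per frame in this demo; how that scales to a real game with far more texture samples per frame is an open question.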

But the big win ofc is VRAM footprint. In this demo, NTC saves almost 90% (9.20 MB vs 79.38 MB transcoded to BC). Textures can be 50%-70% of the VRAM used by games, so this is HUGE. In a real game, considering bandwidth, GPU copy costs, cache efficiency... I bet NTC will easily be a net win in perf/fps too.
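A quick back-of-the-envelope sketch of the memory numbers: the two sizes come from the demo, while the extrapolation to a whole game assumes (loudly!) that the demo's compression ratio carries over and that textures are 50%-70% of VRAM, which is a guess, not a measurement.

```python
ntc_mb = 9.20   # inference on sample (textures stay in NTC form)
bc_mb = 79.38   # inference on load (textures transcoded to BC)

texture_saving = 1.0 - ntc_mb / bc_mb  # fraction of texture memory saved
ratio = bc_mb / ntc_mb                 # how many times smaller NTC is

print(f"Texture memory saved: {texture_saving:.1%} ({ratio:.1f}x smaller)")


def total_vram_saving(texture_share: float) -> float:
    """Fraction of total VRAM saved if `texture_share` of it is textures.

    Assumes the demo's saving applies uniformly to every texture.
    """
    return texture_share * texture_saving


for share in (0.5, 0.7):
    print(f"Textures = {share:.0%} of VRAM -> "
          f"{total_vram_saving(share):.1%} total VRAM saved")
```

Real games mix texture types and quality targets, so the actual ratio will vary per title.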

Now we still need to see how this performs on other GPUs / AMD and Intel. Note those vendors are prepping their own neural rendering kits, but NVIDIA's will likely have the easier adoption, so it's great that it's designed to be very compatible, as long as it performs OK.

NTC can be great for download sizes as well; your next AAA games could be 50GB smaller. Hopefully DirectStorage will allow nice integration with this kind of thing, so NTC could replace GDeflate and the whole solution would make more sense on Windows.🔚
