
CUDA cores vs tensor cores in high-FLOPS settings

An RTX 2080 Super trains ~1.6x faster than a GTX 1070 for small models. For a model requiring 1+ TFLOPs, however, it trains ~6x faster, and I can't understand why.

The RTX has only 1.6x as many CUDA cores as the GTX (3072 vs 1920) and similar clock speeds, but the RTX additionally has 384 tensor cores. Published benchmark comparisons don't show any figure at 500%+, however.

Is this expected, and if so, how's it explained? Do tensor cores just have that much greater "FLOP capacity"?
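
One way to probe this would be to time raw matmul throughput in FP32 (CUDA cores on both cards) vs FP16 (tensor cores on the RTX). Below is a minimal sketch, assuming PyTorch with a CUDA device and that cuBLAS routes FP16 matmuls to tensor cores on Turing; matmul_tflops is just a helper name I made up:

```python
import time
import torch

def matmul_tflops(dtype, n=4096, iters=50):
    """Time large square matmuls and report effective TFLOPS (made-up helper)."""
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    for _ in range(5):  # warm-up so cuBLAS heuristics and clocks settle
        _ = a @ b
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(iters):
        _ = a @ b
    torch.cuda.synchronize()
    elapsed = time.time() - t0
    return 2 * n**3 * iters / elapsed / 1e12  # ~2*n^3 FLOPs per n-by-n matmul

print("FP32 TFLOPS:", matmul_tflops(torch.float32))  # CUDA cores on both cards
print("FP16 TFLOPS:", matmul_tflops(torch.float16))  # tensor cores on the RTX only
```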

Setting: automatic mixed precision in PyTorch; same software (OS, Anaconda environment), models, and datasets; RAM and VRAM not saturated. "Training" = applying a bunch of math to input arrays; "bigger model" = more math on more data (at the same time, i.e. in parallel).
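
To be concrete, by "automatic mixed precision in PyTorch" I mean the standard torch.cuda.amp pattern. A minimal sketch of one training step follows; the model, optimizer, loss function, and tensors are placeholders, not my actual setup:

```python
import torch

# Placeholder model/optimizer/loss, for illustration only
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()
scaler = torch.cuda.amp.GradScaler()  # scales loss to avoid FP16 gradient underflow

inputs = torch.randn(64, 1024, device="cuda")
targets = torch.randn(64, 1024, device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():  # eligible ops (e.g. matmuls) run in FP16
    loss = loss_fn(model(inputs), targets)
scaler.scale(loss).backward()  # backprop through the scaled loss
scaler.step(optimizer)         # unscales gradients, then steps the optimizer
scaler.update()                # adjusts the loss scale for the next step
```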
