
CUDA cores vs tensor cores in high-FLOPS settings

An RTX 2080 Super trains ~1.6x faster than a GTX 1070 for small models. For a model requiring 1+ TFLOPs, however, it trains ~6x faster, and I can't understand why.

The RTX has only 1.6x as many CUDA cores as the GTX (3072 vs 1920) and similar clock speeds, but the RTX additionally has 384 tensor cores. Published benchmark comparisons don't show any figure at 500%+, however.

Is this expected, and if so, how's it explained? Do tensor cores just have that much greater "FLOP capacity"?
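
One way to probe this would be to time raw matmul throughput in FP32 (CUDA cores on both cards) vs FP16 (tensor cores on the RTX). Below is a minimal sketch, assuming PyTorch with a CUDA device and that cuBLAS routes FP16 matmuls to tensor cores on Turing; matmul_tflops is just a helper name I made up:

```python
import time
import torch

def matmul_tflops(dtype, n=4096, iters=50):
    """Time large square matmuls and report effective TFLOPS (made-up helper)."""
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    for _ in range(5):  # warm-up so cuBLAS heuristics and clocks settle
        _ = a @ b
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(iters):
        _ = a @ b
    torch.cuda.synchronize()
    elapsed = time.time() - t0
    return 2 * n**3 * iters / elapsed / 1e12  # ~2*n^3 FLOPs per n-by-n matmul

print("FP32 TFLOPS:", matmul_tflops(torch.float32))  # CUDA cores on both cards
print("FP16 TFLOPS:", matmul_tflops(torch.float16))  # tensor cores on the RTX only
```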

Setting: automatic mixed precision in PyTorch; same software (OS, Anaconda environment), models, and datasets; RAM and VRAM not saturated. "Training" = applying a bunch of math to input arrays; "bigger model" = more math on more data (at the same time, i.e. in parallel).
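
To be concrete, by "automatic mixed precision in PyTorch" I mean the standard torch.cuda.amp pattern. A minimal sketch of one training step follows; the model, optimizer, loss function, and tensors are placeholders, not my actual setup:

```python
import torch

# Placeholder model/optimizer/loss, for illustration only
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()
scaler = torch.cuda.amp.GradScaler()  # scales loss to avoid FP16 gradient underflow

inputs = torch.randn(64, 1024, device="cuda")
targets = torch.randn(64, 1024, device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():  # eligible ops (e.g. matmuls) run in FP16
    loss = loss_fn(model(inputs), targets)
scaler.scale(loss).backward()  # backprop through the scaled loss
scaler.step(optimizer)         # unscales gradients, then steps the optimizer
scaler.update()                # adjusts the loss scale for the next step
```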
