40 Comments

  • Freakie - Monday, July 25, 2016 - link

    Neutered FP16 really stinks. It was wishful thinking to be able to use the new Titan X for neural networking, but I was at least hoping that Quadro would do it. Guess I won't be buying anything Nvidia just for the neural networking boost. The P100 is just way too pricey, and I think Nvidia is really shooting themselves in the foot when it comes to making their products the de facto neural network accelerators. Good news for other vendors selling more capable hardware, though!
  • Yojimbo - Monday, July 25, 2016 - link

    What other vendors have anything more capable than running the code using FP32 on Pascal?
  • Gondalf - Tuesday, July 26, 2016 - link

    Too bad FP64 performance is 1/32 of this.

    This Quadro line is not competitive at all with Xeon Phi in a lot of applications, which is why Intel has orders for 100K Phi this year.
  • extide - Tuesday, July 26, 2016 - link

    Neural networking is largely FP16, not FP32. With P100, you get a 2x perf boost going from FP32 to FP16. On Quadro and Titan X, there is no performance boost. That's what he is talking about. Not FP32, not FP64, guys.

    I think Polaris also has a 2x speedup going from FP32 to FP16 -- and AMD tends to be much more generous in leaving the enhanced compute abilities intact on consumer SKUs, so maybe have a look over there.
  • jabbadap - Monday, July 25, 2016 - link

    Ryan, they did mention fp32 flops:
    P6000 12TFlops and P5000 8.9TFlops.

    That would make clocks(boost clocks maybe):
    P6000 12000/(2*3840) ~ 1.56GHz and P5000 8900/(2*2560) ~ 1.738 GHz

    http://www.nvidia.com/object/quadro-graphics-with-...
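
The clock arithmetic in the comment above can be reproduced with a short Python sketch. It assumes, as the comment's formula does, that each CUDA core retires one fused multiply-add (2 FLOPs) per clock; the function name `implied_clock_ghz` is made up for illustration, and the result is only an estimate of the (probably boost) clock behind the quoted FP32 rating.

```python
# Back-of-the-envelope clock estimate from a quoted FP32 rating.
# Assumption: one fused multiply-add (2 FLOPs) per CUDA core per clock.

def implied_clock_ghz(tflops: float, cuda_cores: int,
                      flops_per_core_per_clock: int = 2) -> float:
    """Return the clock (GHz) needed to reach `tflops` with `cuda_cores` cores."""
    gflops = tflops * 1000.0
    return gflops / (flops_per_core_per_clock * cuda_cores)


# FP32 ratings and core counts as quoted in the comment above.
cards = {"P6000": (12.0, 3840), "P5000": (8.9, 2560)}

for name, (tflops, cores) in cards.items():
    print(f"{name}: ~{implied_clock_ghz(tflops, cores):.2f} GHz")
```

The same function can be run in reverse as a sanity check: multiplying the implied clock by 2 and by the core count should give back the quoted rating.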
  • Ryan Smith - Monday, July 25, 2016 - link

    Interesting. Thanks. That information was not available at my briefing last week.
  • jabbadap - Monday, July 25, 2016 - link

    Well, there's a bit of a typo in the article too: you start talking about the M5000 after the P6000, while obviously you mean P5000. But overall great read, thanks!
  • eddman - Monday, July 25, 2016 - link

    Now I'm wondering what 1080 Ti is going to be (if we get it this gen to begin with. NVidia might delay it for the eventual 20x0 series, depending on when Vega shows up).

    Given the past relations between quadro, tesla, titan and geforce, it's possible that it'd be a 3584 core chip but with higher clocks than titan. Later on, we might see a new titan black or X black, etc.
  • prashontech - Monday, July 25, 2016 - link

    Well, the next gen would be the 11x0 series, not 20x0
  • eddman - Monday, July 25, 2016 - link

    It's not important. Just a number. Besides, a lot of people thought this gen is going to be 1x00, but nvidia trolled everyone.
  • haukionkannel - Tuesday, July 26, 2016 - link

    Cut-down GP102 with GDDR5 memory instead of GDDR5X memory. So the same as with 1080 vs 1070.
    And it will be released after AMD Vega, so sometime in Q1 2017. There is no need to release it before. Titan X is the ultimate powerhouse and 1080 the "cheap" middle range alternative in the $800 price range. 1080 Ti will be something like $999, but that depends on Vega. If Vega is not faster, then $1050-1099 is possible; if Vega is faster, then $899 is possible for 1080 Ti.
  • eddman - Tuesday, July 26, 2016 - link

    1080 is already maxing out its GDDR5X memory's bandwidth, according to anandtech. There is no way 1080 Ti would be a plain GDDR5, which would starve out the bigger GP102 chip.
  • Kjella - Tuesday, July 26, 2016 - link

    My guess is that Anandtech already has the conclusion for that one written; just substitute the names for this generation:

    "With an average performance deficit of just 3%, GeForce GTX 980 Ti is for all intents and purposes GTX Titan X with a different name. (...) With a launch price of $649, the GTX 980 Ti may as well be an unofficial price cut to GTX Titan X, delivering flagship GeForce performance for 35% less."

    I expect that the GTX 1080 Ti will come in at $799/$899 (FE) in Q4 2016 or Q1 2017, this time with partner boards. And then there will be a new card with HBM2 to become the new Titan.
  • madwolfa - Monday, July 25, 2016 - link

    Just 1/32 FP64 in top end Quadro? Hmm.
  • DanNeely - Monday, July 25, 2016 - link

    That's because now that Titan exists Quadro is being marketed primarily as a professional graphics card, not as a top of the line compute device. As long as pro graphics rendering pipelines don't make heavy use of FP16/64 the Quadro line will likely continue using an architecture that emphasizes packing more FP32 cores on die over widespread support for other compute models.

    OTOH, assuming it has all the circuitry needed to function as a GPU (I've seen rumors claiming otherwise but nothing official either way), once the supercomputing backlog is filled I wouldn't be surprised to see a P7000 Quadro using a GP100 chip. At the same time though, I'd also expect that it's either priced about as high as a Tesla or has its compute abilities gimped enough to avoid undercutting it, e.g. similar core counts at 60-75% of Tesla price, but only 1:4 FP64 and 1:1 FP16 (half of what an unlocked Tesla is capable of).
  • Ryan Smith - Monday, July 25, 2016 - link

    "OTOH, assuming it has all the circuitry needed to function as a GPU (I've seen rumors claiming otherwise but nothing official either way)"

    I can confirm that GP100 is fully graphics capable. NVIDIA has told me as much back at GTC.
  • Tigran - Tuesday, July 26, 2016 - link

    Ryan, please see below: then why does GP102, with its smaller die size and fewer transistors, have more CUDA cores than GP100?
  • DanNeely - Tuesday, July 26, 2016 - link

    The higher transistor count is because making all the cores support FP16, all pairs of cores support FP64, and doubling the number of registers available takes up a lot more die area than using basic FP32 cores everywhere and then throwing in a handful of FP16/64 cores for software development purposes. Not having all the cores turned on means that the yield on the even bigger die wasn't high enough to make the only product sold from it a fully enabled die model.

    At some point we'll probably see a GP100 product of some sort with all the cores enabled, but probably not until they've fully supplied their supercomputer customers. Since GP100 is gfx capable, my guess would be on an even more expensive Quadro (or maybe Titan, but I doubt it) card where they can ration availability to the limited number of perfect dies available.
  • Tigran - Tuesday, July 26, 2016 - link

    Doesn't this explanation contradict Ryan's words ("GP100 is fully graphics capable")?
  • Tigran - Tuesday, July 26, 2016 - link

    DanNeely, sorry if I misunderstood you - please see below my reply to Extide.
  • extide - Tuesday, July 26, 2016 - link

    It doesn't, it has the same amount, the die got smaller because they ripped out all the extra FP64 hardware.
  • Tigran - Tuesday, July 26, 2016 - link

    So there are 3,584 (FP32) + 1,792 (FP64) = 5,376 CUDA cores in GP100, and 3,840 (FP32) + 120 (FP64) = 3,960 CUDA cores, correct?
  • Tigran - Tuesday, July 26, 2016 - link

    ...and 3,840 (FP32) + 120 (FP64) = 3,960 CUDA cores in GP102...
  • DanNeely - Tuesday, July 26, 2016 - link

    Definitely no in the first case. Probably no in the second.

    For GP100 the full die has 30*128 = 3840 total FP32 cores with 256 disabled due to yield management (not enough dies with fewer defects than that are being manufactured). Each FP32 core has additional capabilities that allow it to do two FP16 ops instead of a single FP32 op; as well as additional capabilities that allow it to be paired with a second FP32 core to allow the two of them together to do a single FP64 op instead of 2xFP32 or 4xFP16 ops.

    For GP102, I think I've seen a modified block diagram for a single cluster that showed the minimal FP16/64 support as a few extra cores, but I'm 99% sure that was artistic license on someone's part because it would waste more die area than just using a few of the more flexible cores used in GP100. Since they're launching multiple products based on GP102, nVidia can split them into two usable bins, making Quadros with the perfect dies and Titans with the ones with a few defects. At a future point in time GP102s with more defects could be used to make a GTX 1080 Ti.
  • Tigran - Tuesday, July 26, 2016 - link

    1) It's OK with making two FP16 ops on a single FP32. But I doubt two FP32 can perform single FP64. See the quotes from NVIDIA:

    "The GP100 SM ISA provides new arithmetic operations that can perform two FP16 operations at once on a single-precision CUDA Core, and 32-bit GP100 registers can store two FP16 values"

    "Each GP100 SM has 32 FP64 units, providing a 2:1 ratio of single- to double-precision throughput"

    Doesn't the second quote mean there are separate physical cores for FP64 ops in GP100?

    2) May I ask the source for 3840 (3584 + 256 disabled) cores in GP100? We can see the same number in GP102 (which has smaller die size). And doesn't it contradict Ryan's reply to you (NVidia says GP100 is fully graphics capable)?
  • DanNeely - Wednesday, July 27, 2016 - link

    If you want to be uselessly pedantic it's a single 64bit core that can and normally does perform two 32 bit operations in parallel instead of 1 64 bit operation.

    3840 total cores is direct from the horse's mouth: "Like previous Tesla GPUs, GP100 ... for a total of 3840 CUDA cores and 240 texture units."
    https://devblogs.nvidia.com/parallelforall/inside-...

    I have no idea why you think any of that means that GP100 isn't fully graphics capable.
  • Tigran - Wednesday, July 27, 2016 - link

    1) You mean physically there are only FP64 cores in GP100 (performing FP32 ops when needed)?

    2) You can see in the same link's table that GP100 has 56 SMs (instead of 60) and 3584 FP32 (instead of 3840) cores. Doesn't it mean 4 SMs and 256 cores are disabled?
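
The bookkeeping being argued over here (full die versus shipping configuration) is easy to lay out in a few lines of Python. This is only a sketch using the figures quoted in this thread: 64 CUDA cores and four texture units per GP100 SM, 60 SMs on the full die, and 56 SMs enabled on the Tesla card. The helper name `totals` is hypothetical.

```python
# Per-SM figures for GP100 as quoted from NVIDIA in this thread.
CORES_PER_SM = 64
TEXTURE_UNITS_PER_SM = 4


def totals(sm_count: int) -> dict:
    """Return unit totals for a GP100 configuration with `sm_count` SMs."""
    return {
        "sms": sm_count,
        "fp32_cores": sm_count * CORES_PER_SM,
        "texture_units": sm_count * TEXTURE_UNITS_PER_SM,
    }


full_die = totals(60)     # everything fabricated on the GP100 die
tesla_p100 = totals(56)   # configuration listed in the product table

disabled_sms = full_die["sms"] - tesla_p100["sms"]
disabled_cores = full_die["fp32_cores"] - tesla_p100["fp32_cores"]

print("full die:  ", full_die)
print("Tesla P100:", tesla_p100)
print(f"disabled: {disabled_sms} SMs, {disabled_cores} FP32 cores")
```

Both numbers in the dispute fall out of the same per-SM figures, so the table and the "3840 CUDA cores" quote describe the same chip at different enablement levels.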
  • DanNeely - Wednesday, July 27, 2016 - link

    In reverse order because 2) is much quicker/easier to answer than 1).

    2. Yes. That table is for the configuration used in the card and not the full count of what they're making in the GPU die itself and is what I've been trying to say all along.

    1. You can think of it that way if it's easier for you to build a mental picture of how it works. A lot of people will look askance at you for stating it that way. I think part of the problem is that what's commonly called a core in a GPU is a lot more like a single execution pipeline in a conventional CPU.

    A conventional CPU core will include hardware to decode instructions and cache a small number of micro-ops and somewhere between about 5 and 10 ports/execution units/pipelines each capable of only doing a subset of the total possible instruction types: integer, floating point, and memory access being the three big types with the former often split on larger designs into units that can do all the instructions of a type and those that only do a subset. In a lot of ways the closest equivalent to a CPU core in a GPU is the SM (nVidia) or CU (AMD).

    For SIMD type instructions (MMX and SSE on x86) you also have the ability to do more narrow or fewer wide instructions at once on the same hardware (e.g. 1x128, 2x64, or 4x32 bit operations) in a single pass. This is the setup that's used to combine/split the hardware to make a single chunk able to do 1x64, 2x32, or 4x16 bit floating point operations.

    Counting cores as the number of 32 bit execution units in a GPU is just the standard convention. In this case it's also the most useful one because GPUs do the vast majority of their work for graphics in FP32, meaning that regardless of how large the groupings are (and note that just within the current generation GP100 uses 64-core SMs while GP102/4/6 use 128-core SMs) in normal use all the cores will be active. Comparing GP100 to GP104 using "CPU cores" aka SMs would be 60 vs 20, suggesting a chip with 3x the nominal capability, not 1.5x.

    You should also note in the block diagrams that the Special Function Units (SFUs) and memory Load/Store units (LD/ST) are shown as equivalents to the floating point cores but not counted in the total. (And at least one recentish GPU architecture (don't recall which) combined the SFUs into some of the normal FP32 cores instead of breaking them out as separate execution units.)
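
Two of the points in the comment above can be put into numbers with a rough Python sketch. The first function models the SIMD-style splitting (one 64-bit-wide chunk doing 1x64, 2x32, or 4x16 bit operations per pass); the second part shows why counting SMs instead of FP32 cores gives a misleading GP100 vs GP104 comparison. The function name and dictionary layout are invented for illustration, and the model ignores everything except datapath width.

```python
def ops_per_pass(datapath_bits: int, operand_bits: int) -> int:
    """How many operands of `operand_bits` fit in one pass of the datapath."""
    if operand_bits <= 0 or datapath_bits % operand_bits != 0:
        raise ValueError("operand width must evenly divide the datapath width")
    return datapath_bits // operand_bits


# One 64-bit-wide chunk of hardware, split three ways.
for operand_bits in (64, 32, 16):
    print(f"{ops_per_pass(64, operand_bits)}x{operand_bits}-bit ops per pass")

# Counting by SM versus counting by FP32 core (figures from the comment).
gp100 = {"sms": 60, "cores_per_sm": 64}
gp104 = {"sms": 20, "cores_per_sm": 128}


def fp32_cores(chip: dict) -> int:
    return chip["sms"] * chip["cores_per_sm"]


sm_ratio = gp100["sms"] / gp104["sms"]
core_ratio = fp32_cores(gp100) / fp32_cores(gp104)
print(f"by SM count: {sm_ratio:.1f}x, by FP32 core count: {core_ratio:.1f}x")
```

Because the SM sizes differ between the two chips, only the per-core count tracks the nominal FP32 capability.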
  • Tigran - Thursday, July 28, 2016 - link

    DanNeely, thanks a lot for your patience and for replying to my stupid questions, I learn a lot from you.

    Just to make clear how some GPUs (GP100) perform FP64 ops at 1/2 rate, whereas others (GP102/4/6) perform them at 1/32 rate. Let's assume there are 32-bit CUDA cores, and two of them comprise one FP64 unit (or FP64 CUDA core); this agrees with Nvidia's terminology:

    GP100
    There are 60 64-core "32-bit" SMs, equal to 3840 32-bit cores or 60x64/2=1920 FP64 units when needed.

    GP104
    There are 20 128-core 32-bit SMs, equal to 2560 32-bit cores. BUT only 1/32 FP64 units. It means that only some cores have 64-bit capability: 2560/32=80 FP64 units or 80*2=160 32-bit cores in GP104. Accordingly 4 FP64 units or 8 32-bit cores in each GP104's SM.

    Am I correct (simplified)?

    You said above "this is the setup that's used to combine/split the hardware to make a single chunk able to do 1x64, 2x32, or 4x16 bit floating point operations". So this setup is made at the 32-bit core (FP64 unit) level, isn't it?
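
The simplified model in the comment above (FP64 units derived from the FP32 core count and the FP64:FP32 rate) can be checked with a small Python sketch. The function name `fp64_units` is made up here, and the model is the commenter's simplification rather than a description of the actual silicon layout.

```python
from fractions import Fraction


def fp64_units(fp32_cores: int, fp64_rate: Fraction) -> int:
    """FP64 units implied by an FP32 core count and an FP64:FP32 rate."""
    units = fp32_cores * fp64_rate
    if units.denominator != 1:
        raise ValueError("rate does not divide the core count evenly")
    return int(units)


# GP100: 60 SMs x 64 cores, FP64 at 1/2 rate.
gp100_units = fp64_units(60 * 64, Fraction(1, 2))

# GP104: 20 SMs x 128 cores, FP64 at 1/32 rate.
gp104_sms = 20
gp104_units = fp64_units(gp104_sms * 128, Fraction(1, 32))
gp104_units_per_sm = gp104_units // gp104_sms

print(f"GP100: {gp100_units} FP64 units")
print(f"GP104: {gp104_units} FP64 units, {gp104_units_per_sm} per SM")
```

Using `Fraction` keeps the rates exact, so a core count that does not divide evenly raises an error instead of silently rounding.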
  • DanNeely - Thursday, July 28, 2016 - link

    yes, that all looks correct.
  • extide - Thursday, August 4, 2016 - link

    No, there are 3840 cores in both GP100, and GP102. In GP100 they allow you to run FP64 work by ganging up two FP32 cores -- it does take extra transistors to enable them to function like this and that is what they ripped out. There are not another 1920 complete cores for FP64 work.

    So, they removed stuff and ended up with a smaller die; they just removed all the stuff that isn't really used by gaming workloads.
  • Tigran - Tuesday, July 26, 2016 - link

    Why does GP102, with its smaller die size and fewer transistors (see other sources, and it's also obvious from its number), have more CUDA cores than GP100?
  • DanNeely - Tuesday, July 26, 2016 - link

    It probably has the same number; but being a smaller die GP102 managed to (barely) have yields high enough to allow a product with a 100% active die instead of one with a few sections disabled. With supercomputing customers buying the cards by the shipping container full, the Tesla probably also needed to be available in a much larger volume than the top of the line Quadro.
  • Dobson123 - Tuesday, July 26, 2016 - link

    They have the same number (3840) in hardware; Tesla P100 currently uses a partially deactivated GP100, just like the new Titan X uses a partially deactivated GP102. But GP100 has a different, more HPC focused architecture with 64 CUDA cores per SM, larger register files, NVLink and so on.
  • Tigran - Tuesday, July 26, 2016 - link

    So it's not because of the absence of TMUs and ROPs in Tesla P100? CUDA cores and TMUs/ROPs are separate calculation blocks, independent of each other, aren't they?
  • Tigran - Tuesday, July 26, 2016 - link

    Quick answer for my stupid question: "each SM has 64 CUDA cores and four texture units" (Nvidia ©). And I guess ROP are outside SM.
  • DanNeely - Tuesday, July 26, 2016 - link

    Correct, ROPs are packaged in with the memory controllers.
  • DanNeely - Tuesday, July 26, 2016 - link

    See Ryan's reply to me above; NVidia says GP100 is fully graphics capable.
  • eddman - Tuesday, July 26, 2016 - link

    Ryan, there are a few M5000 typos in the article. On at least two occasions you've written M5000 instead of P5000.
  • Mirel Aretu - Monday, March 20, 2017 - link

    I really hope you guys can help me decide what graphics card to use, since I'm not that tech savvy. I will make a list of all the software I use for my work, to give you a better idea. So here it is: Cinema 4D, RealFlow, XParticles, Turbulence FD, Houdini, After Effects (for compositing and visual effects), Illustrator, Photoshop (extensively, on a daily basis), Maya, 3DS Max, Blender, Mocha, Z-Brush/Mudbox (when needed), basically anything that gets the job done; the list is very long. In essence, I need a graphics card with high computational power to help me with particle simulation, rendering, video encoding and so forth.
    Will the new, cheaper, GTX 1080TI FE do the job or should I just go ahead, sacrifice my soul, and buy a very expensive Quadro P5000?
    Since I never had the chance to put them both to the test and never will, nor do I understand what one does better than the other, I simply cannot decide.
