14 Comments
close - Wednesday, September 16, 2020 - link
What would the "Intel Cascade Lake CPU 440W" label in the chart represent? The largest TDP for Cascade Lake is something like 205 W.
Andrei Frumusanu - Wednesday, September 16, 2020 - link
Dual socket setup.
firewrath9 - Thursday, September 17, 2020 - link
It could also be the Cascade Lake Platinum 9200 series, iirc those have ~400 W TDPs.
bst1 - Thursday, September 17, 2020 - link
Xeon Platinum 9282
Raqia - Wednesday, September 16, 2020 - link
Did they cite any numbers for bigger benchmarks like Google's BERT? This still looks like a very deployable solution that can beat out more power-hungry options where power consumption and remote connectivity are important.
Yojimbo - Wednesday, September 16, 2020 - link
Looks like no, probably because the results aren't so favorable once BERT can't fit in the cache. They started building it years ago, when models were a lot smaller. If models continue to get bigger and bigger, internal caches are probably not going to be able to keep up. Then it becomes a niche product for accelerating small networks very well. They need to publish MLPerf results to let people know what the real situation is.
Raqia - Wednesday, September 16, 2020 - link
I doubt it will be competitive with something like the A100, but I don't think this will have the same use cases as solutions designed with HBM2. Given the form factor and connectivity capabilities of the devkit, they are likely targeting deployable configurations that operate outside of data centers, plus special use cases within the data center. Communication latency to remote clients like cell phones, vehicles, or VR headsets should be much better than something sitting in a data center, and there should be a big market for that.
Yojimbo - Wednesday, September 16, 2020 - link
Well, the PCIe version is the one that is 75 W and 400 TOPS. With that they are targeting the same market the T4 is in. Both are half-height, half-length cards.

As for the other two form factors, if it can't actually compete with the A100 and T4 on real-world networks then they shouldn't be making the comparison to those cards. That's why it's important to have the benchmarks. Once you get down to the 50 TOPS version, the 134 GB/s memory bandwidth is probably enough to keep it fed. But then the comparison with the A100 is just silly. The proper comparison would be with a Jetson Xavier, and then a Jetson Orin when it comes out, assuming there will be one (dunno why there wouldn't be).
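(For anyone who wants to sanity-check the bandwidth point, here is a rough back-of-the-envelope in Python. The 50 TOPS and 134 GB/s figures are the ones quoted above; assuming the 400 TOPS card uses the same LPDDR subsystem is my own guess, included only to show the contrast.)

```python
# Rough roofline-style check: how many ops of compute per byte fetched from
# DRAM each variant needs to stay compute-bound. Treating the 400 TOPS card
# as having the same 134 GB/s LPDDR bandwidth is an assumption, not a spec.

def ops_per_dram_byte(tops: float, bw_gb_per_s: float) -> float:
    """Arithmetic intensity (ops/byte) needed to saturate the compute units."""
    return (tops * 1e12) / (bw_gb_per_s * 1e9)

for name, tops, bw in [("400 TOPS PCIe card", 400, 134),
                       ("50 TOPS variant",     50, 134)]:
    print(f"{name}: ~{ops_per_dram_byte(tops, bw):,.0f} ops per DRAM byte to stay fed")

# -> roughly 3,000 ops/byte vs ~370 ops/byte: the big card only hits peak when
#    weights stay resident in on-chip SRAM, while the 50 TOPS part clears the
#    bar far more easily even if some weights stream from DRAM.
```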
Raqia - Thursday, September 17, 2020 - link
The comparison might very well be worthwhile if you run NNs that play into the chip's strengths and care about both the cost of the chip and the TCO of your data center deployment, which includes power consumption for both the chip and cooling. The Nvidia designs are also loaded with graphics-specific baggage like texture units and ROPs, which are a waste of silicon for AI workloads.
Yojimbo - Thursday, September 17, 2020 - link
The comparison of a 15 W DM.2e NN ASIC to a 350 W general-purpose data center accelerator is worthwhile? I don't think so.
Yojimbo - Wednesday, September 16, 2020 - link
Firstly, it's not good that they are talking about inference and never once mention latency. Secondly, throughput is fast on ResNet-50 because ResNet-50 is small and fits in its cache. But ResNet-50 is also much smaller than what most data centers are running inference on these days, to my understanding. Architectures like this definitely need MLPerf scores to confirm their claims, much like Graphcore, which claims it can stream larger networks effectively from DDR memory but still has no MLPerf results published. This thing, similarly, will be relying on DDR memory for anything that doesn't fit in its internal cache.
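(To put rough numbers on the cache-fit point, a quick sketch below. The ~144 MB on-die SRAM figure is my recollection of the chip's spec and should be treated as an assumption, and weights are only part of the story since activations need SRAM too.)

```python
# Rough illustration of the cache-fit argument: can the weights stay resident
# in on-die SRAM at inference precision? The ~144 MB SRAM capacity is assumed,
# not confirmed here; parameter counts are the commonly quoted ones.

SRAM_BYTES = 144e6  # assumed on-die SRAM capacity

models = {
    "ResNet-50":  25.6e6,   # parameters
    "BERT-base":  110e6,
    "BERT-large": 340e6,
}

for name, params in models.items():
    for label, bytes_per_param in [("INT8", 1), ("FP16", 2)]:
        footprint = params * bytes_per_param  # weights only; activations need room too
        fits = "fits" if footprint <= SRAM_BYTES else "does NOT fit"
        print(f"{name:10s} {label}: ~{footprint/1e6:5.0f} MB of weights -> {fits}")
```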
frbeckenbauer - Wednesday, September 16, 2020 - link
A huge on-chip SRAM cache is also what's being rumored for the "big navi" GPUs. Interesting.
fangdahai - Wednesday, September 16, 2020 - link
>>Precision-wise, the architecture supports INT8, INT16 as well as both FP16 and FP32
I guess 400 TOPS is based on INT8. How about FP32? 50 TOPS?
ChrisGX - Friday, September 25, 2020 - link
I am sceptical that Qualcomm will be able to succeed with a product like this. While the TOPS thing is probably a distraction when what we really need is benchmarking based on well-chosen workloads running on AI inference accelerators, there will still be the inevitable comparisons using those TOPS numbers. And things take on a strange look when we go down that path:

Qualcomm claims 400 TOPS at 75 W for the Cloud AI 100 > 5.3 TOPS/W
Gyrfalcon Technology claims 16.8 TOPS at 700 mW for the Lightspeeur 2803S > 24 TOPS/W
Perceive claims very high performance at 20 mW for the Ergo edge inference processor > 55 TOPS/W
Imec and GLOBALFOUNDRIES claim orders of magnitude higher performance for their Analog Inference Accelerator (with the ambition to evolve the technology even further) > 2,900 TOPS/W (evolving towards 10,000 TOPS/W)
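(For what it's worth, the TOPS/W figures above are just claimed peak TOPS divided by claimed power; a minimal sketch of the arithmetic, using only the numbers quoted in this list:)

```python
# The efficiency figures above are claimed peak TOPS divided by claimed power.
# Only the vendor-quoted numbers are used; nothing here validates the claims.

claims = [
    ("Qualcomm Cloud AI 100",       400.0, 75.0),
    ("Gyrfalcon Lightspeeur 2803S",  16.8,  0.7),
]

for name, tops, watts in claims:
    print(f"{name}: {tops} TOPS / {watts} W = {tops / watts:.1f} TOPS/W")
# -> 5.3 TOPS/W and 24.0 TOPS/W respectively; the Perceive and imec/GF entries
#    above quote TOPS/W directly rather than separate TOPS and power figures.
```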
While the accelerators listed here aren't directly comparable to Qualcomm's Cloud AI 100 (intended for cloud edge data centres), I don't see it being an easy job to put a compelling case for one technology or another to potential users while the long-term viability of the technologies on offer, and the soundness of investment in them, remains in question.