02:04PM EDT - NVIDIA announced at a VLSI conference last year that it had designed a test multi-chip solution for DNN computations. The company is explaining the technology today at Hot Chips, with the idea that what they've created could be a stepping stone for future monetizable products.

02:04PM EDT - This is a test chip

02:04PM EDT - NVIDIA research does many test chips every year

02:05PM EDT - This work is about multi-chip DL inference

02:05PM EDT - CNN was a target for this test chip

02:06PM EDT - System is configurable for scale

02:09PM EDT - 36 small chips

02:09PM EDT - large scale inference accelerators

02:09PM EDT - three key objectives

02:09PM EDT - high inference scaling and performance

02:09PM EDT - each chip could be a DL edge inference accelerator

02:09PM EDT - many chips enable data-center scale throughput

02:09PM EDT - network on package architecture

02:10PM EDT - Ground Reference Signalling as an MCM interconnect

02:10PM EDT - Chiplets enable reuse and lower cost

02:10PM EDT - Assemble existing chips together

02:10PM EDT - NOC uses RISC-V

02:10PM EDT - 20ns per hop

02:10PM EDT - Network on chip and network on package

02:11PM EDT - Ground Reference Signalling - low voltage signalling, up to 1.75 pJ/bit, up to 25 Gbit/s per pin

02:11PM EDT - Single ended links

02:12PM EDT - Tiled Architecture with Distributed Memory

02:12PM EDT - RISC-V controller is a chip controller

02:13PM EDT - 8 Vector MACs per PE

02:13PM EDT - Processing Engine

02:13PM EDT - Each chip is 12 PEs, Each package is 6x6 chips

02:14PM EDT - PE - 8 MACs, chip is 96 MACs, package is 3456 MACs
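The totals here are straight multiplication up the hierarchy (PE, chip, package). A minimal sketch of that arithmetic follows; the function name and parameters are illustrative and not from NVIDIA's material. The second call uses the figures from the reader clarification in the comments at the end of this post.

```python
def total_macs(macs_per_pe, pes_per_chip, chips_per_package):
    """Return (MACs per chip, MACs per package) for a PE/chip/package hierarchy."""
    per_chip = macs_per_pe * pes_per_chip
    per_package = per_chip * chips_per_package
    return per_chip, per_package

# Figures as noted in this live blog: 8 MACs per PE, 12 PEs per chip, 6x6 chips.
print(total_macs(8, 12, 6 * 6))        # (96, 3456)

# Figures from the reader comment: 8 vector MACs x 8 MACs each per PE, 16 PEs per chip.
print(total_macs(8 * 8, 16, 6 * 6))    # (1024, 36864)
```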

02:15PM EDT - Designed for CNNs

02:15PM EDT - Can do different tiling strategies

02:17PM EDT - Multicast support

02:17PM EDT - Extracting model parallelism using the NoP and NoC

02:18PM EDT - TSMC 16nm, 2.5mm x 2.4mm each

02:18PM EDT - 100 Gbps per link
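For a rough sense of scale, the per-link bandwidth can be combined with the GRS figures noted earlier (up to 25 Gbit/s per pin, up to 1.75 pJ/bit). The sketch below is back-of-the-envelope only: it assumes every pin runs at the peak rate and every bit costs the worst-case energy, neither of which was stated in the talk, and the helper names are made up for illustration.

```python
import math

LINK_GBPS = 100.0            # per-link bandwidth from the slide
GRS_MAX_GBPS_PER_PIN = 25.0  # GRS upper bound per pin
GRS_MAX_PJ_PER_BIT = 1.75    # GRS upper bound on energy per bit

def min_data_pins(link_gbps, gbps_per_pin):
    """Fewest data pins needed if each pin runs at its peak rate (assumption)."""
    return math.ceil(link_gbps / gbps_per_pin)

def link_power_watts(link_gbps, pj_per_bit):
    """Power of a fully busy link: Gbit/s * pJ/bit gives mW, then convert to W."""
    return link_gbps * pj_per_bit * 1e-3

print(min_data_pins(LINK_GBPS, GRS_MAX_GBPS_PER_PIN))              # 4
print(round(link_power_watts(LINK_GBPS, GRS_MAX_PJ_PER_BIT), 3))   # 0.175
```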

02:18PM EDT - 9.5 TOPS/W, 128 TOPS

02:18PM EDT - 6 months from spec to tapeout

02:19PM EDT - Designed in high level synthesis

02:19PM EDT - Agile VLSI Design

02:19PM EDT - Continuous integration with automated tool flows

02:20PM EDT - C++ to Gates design in 12 hours

02:20PM EDT - MatchLib is open source

02:24PM EDT - Experimental results

02:24PM EDT - Custom PCB with FPGA DRAM

02:25PM EDT - 27x improvement with 32 chips
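One way to read that number is as scaling efficiency: speedup divided by chip count. The snippet below assumes the 27x is measured against a single chip, which the slide implied but this live blog did not record explicitly; the function name is illustrative.

```python
def scaling_efficiency(speedup, chips):
    """Fraction of ideal linear scaling achieved (1.0 would be perfect)."""
    return speedup / chips

print(scaling_efficiency(27, 32))  # 0.84375
```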

02:25PM EDT - GRS uses most energy at high chip counts

02:25PM EDT - (oh that energy is per image)

02:26PM EDT - At high batch, GRS links are all active all the time, consuming power

02:26PM EDT - No sleep modes enabled with GRS

02:27PM EDT - Again, going to 32 chips, GRS becomes a big energy consumer

02:27PM EDT - 0.11 pJ/Op, 2.5K images/sec with 0.4ms latency on ResNet-50 at batch = 1
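These figures line up with each other and with the 9.5 TOPS/W quoted earlier. Since 1 TOPS/W is 10^12 ops per joule, efficiency in TOPS/W is the reciprocal of energy in pJ/op, and at batch = 1 the throughput is one image per latency period. A quick check follows; it assumes only one image is in flight at a time, and the helper names are illustrative.

```python
def pj_per_op(tops_per_watt):
    """Convert TOPS/W to pJ/op: 1 TOPS/W = 1e12 ops/J = 1 pJ per op."""
    return 1.0 / tops_per_watt

def images_per_sec(latency_ms, batch=1):
    """Throughput when one batch completes every latency period (no pipelining assumed)."""
    return batch * 1000.0 / latency_ms

print(round(pj_per_op(9.5), 2))      # 0.11
print(round(images_per_sec(0.4)))    # 2500
```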

02:28PM EDT - Q&A time

02:30PM EDT - Q: Results show scaling from 1 to 32 chips. Batch went up to 32 - is that one image per chip, or one image spread across all chips? A: Tiling strategy depends on the layer in the CNN. As batch size is scaled, there is more computation to distribute, which gives better scalability. But it's not a catch-all solution.

02:32PM EDT - Q: 10 ns at 1 GHz? A: About 1.1 GHz at 0.7 volts. It includes partition interface latencies and the latency of the router itself

02:33PM EDT - Q: Physical Package? A: Organic substrate. Can be used in 2.5D

02:34PM EDT - That's a wrap. Next up is Xilinx

2 Comments

  • silencer12 - Tuesday, August 20, 2019

    Yeah, i am first. Look at me. Wooo

    (Example of all those other goofballs)

    Cool article.
  • Rangha - Wednesday, August 21, 2019

    Clarification on "2:14PM EDT - PE - 8 MACs, chip is 96 MACs, package is 3456 MACs"

    Each Vector MAC = 8 MACs, 8 Vector MACs per PE, 16 PEs per Chip, Chip is 1024 MACs, and Package is 36864 MACs.