Really helpful summary. I am guessing that with Tesla going public with this information (and recent tweets relating to supercomputing), the performance relative to their GPU cluster must be pretty good. Hoping that AI Day does provide some real-world relative performance figures for DOJO vs the GPU cluster. Not too long to wait.

Oh my. Just listened to the two Tesla talks doing a deeeeep dive into DOJO. So much detail presented. For the first talk, you'd have to be a processor architect (and specifically an AI processor architect) to appreciate the info. And the second talk was just as mind-bending; you'd have to be a deep learning programmer/architect to appreciate the info. I'll do my poor best to summarize.
Before you ask: no comparative performance numbers were given, only that they are "happy" with the performance. More info of that nature will be presented at Tesla AI Day #2.
The DOJO processing node is described as a high-performance general-purpose CPU with a custom instruction set tailored to ML. Each node has 1.25 MB of high-speed SRAM used for both instructions and data; there is no cache. Each node has a proprietary off-node network interface to connect to the 4 adjacent nodes (east/west/north/south). There is no virtual memory, only very limited memory protection mechanisms, and sharing of resources is managed programmatically. Each node runs 4 threads; typically 1-2 are used for compute and 1-2 for communications at any one time. They of course have SIMD instructions, which also execute in parallel.
The SRAM runs at 400 GB/s for loads and 270 GB/s for stores. There is a DMA mechanism to transfer data directly between the SRAM stores of different chips.
He explained something called list parsing: a separate hardware element that "allows efficient packaging of complex transfer sequences" and "runs asynchronously in its own thread."
The instruction set has a bunch of instructions unique to ML workloads, for example one that does stochastic rounding. There are 142 vector instructions, 74 scalar, and 12 front-end (general-purpose) instructions, plus many variants of each.
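For anyone unfamiliar with stochastic rounding: the idea is to round up or down randomly, with probability given by the fractional part, so the result is unbiased in expectation. This matters when accumulating lots of low-precision values, where always rounding to nearest would systematically lose small updates. A minimal sketch in Python (illustrating the concept only, not Tesla's actual instruction):

```python
import math
import random

def stochastic_round(x: float) -> int:
    """Round x to a neighboring integer, unbiased in expectation.
    E.g. 2.3 rounds down to 2 with p = 0.7 and up to 3 with p = 0.3,
    so the expected value of the result is exactly 2.3."""
    lower = math.floor(x)
    frac = x - lower                      # fractional part in [0, 1)
    return lower + (1 if random.random() < frac else 0)

# Over many trials the mean converges to the true value, which is why
# ML hardware favors this when accumulating low-precision gradients.
trials = [stochastic_round(2.3) for _ in range(100_000)]
mean = sum(trials) / len(trials)
```

Each individual result is still just 2 or 3, but the average tracks 2.3, which round-to-nearest can never do.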
It supports a huge number of float formats, and the machine can run with all of these different formats in use at the same time; each off-node network packet is tagged with the type of float (or int) it carries.
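To make the per-packet format tagging concrete, here is a toy sketch of the idea: a one-byte tag on each packet identifies the numeric format of the payload, so mixed-precision traffic can coexist on the same network. The tag values and layout below are invented for illustration, not Tesla's actual wire format:

```python
import struct

# Hypothetical tag values; the real protocol's encoding is not public.
FORMATS = {0: "fp32", 1: "bf16", 2: "int8"}

def make_packet(fmt_tag: int, payload: bytes) -> bytes:
    # One-byte format tag followed by the raw payload.
    return struct.pack("B", fmt_tag) + payload

def parse_packet(pkt: bytes):
    # Recover the format name and payload from a packet.
    (tag,) = struct.unpack("B", pkt[:1])
    return FORMATS[tag], pkt[1:]

fmt, body = parse_packet(make_packet(1, b"\x3f\x80"))
```

The receiver never has to guess the precision: it reads the tag and interprets the bytes accordingly.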
There are 354 of these processing nodes per chip, each chip being approx. 2.5 cm x 2.5 cm. So each chip has about 440 MB of SRAM distributed among those 354 nodes. Each chip has a raw performance of 362 TFLOPS using BF16 arithmetic, or 22 TFLOPS using FP32, running at 2 GHz. It is built by TSMC using their 7 nm process.
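A quick back-of-envelope check shows these per-chip numbers hang together (the per-node FLOPs-per-cycle figure below is my derivation from the quoted specs, not something stated in the talk):

```python
# Sanity-checking the quoted per-chip figures.
nodes = 354
sram_total_mb = 1.25 * nodes               # 442.5 MB, quoted as ~440 MB

clock_hz = 2e9                             # 2 GHz
bf16_flops = 362e12                        # 362 TFLOPS per chip
per_node_per_cycle = bf16_flops / (nodes * clock_hz)   # ~511 FLOPs
```

That works out to roughly 512 BF16 FLOPs per node per clock, which gives a feel for how wide each node's matrix/SIMD hardware must be.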
At the edge of each chip there are a total of 576 bidirectional channels connecting each edge node to the corresponding edge node on the next chip.
These 2.5 x 2.5 cm chips are then packaged into a 5x5 array on one DOJO training tile, and now you also have a communication network at each tile edge for cross-tile communications. The training tile is what gets stacked in the third dimension to allow for power delivery and liquid cooling. Each tile consumes 15 kW of power, an insane amount for such a compact device (probably around 40 cm x 40 cm or so).
The full system is organized as a plane of these training tiles. I don't know how big a plane they've created, but at each edge of the virtual plane (bear in mind that physically, I think these things are packed something like 2 tiles per multi-U rack mount space), the edge tiles each connect to a DIP (DOJO Interface Processor). The DIP acts as a host interface to a general-purpose host machine, as well as providing shared memory. The DIPs contain an additional 160 GB of shared DRAM per tile for use with high-parameter models, in addition to the 11 GB of highly distributed SRAM per tile.
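The tile-level SRAM figure follows directly from the chip numbers above, which is a nice consistency check:

```python
# Each tile is a 5x5 array of chips, each with ~440 MB of SRAM.
chips_per_tile = 5 * 5
sram_per_chip_mb = 440
tile_sram_gb = chips_per_tile * sram_per_chip_mb / 1000   # 11 GB per tile
```

So the quoted 11 GB of SRAM per tile is exactly 25 chips' worth, sitting alongside the DIPs' 160 GB of DRAM.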
There are two ways packets get routed between nodes: through the on-chip hardware communication network, and through the DIPs connected to a high-speed shared Ethernet switched backplane. For nodes that are far apart, it can be faster to go off-tile through a DIP, across the Ethernet network, and back in through the DIP of the faraway tile than to make many chip-to-chip and tile-to-tile hops. Basically, there is a third dimension of connectivity via Ethernet for faraway nodes. The compiler is aware of the topology of the entire machine and will use whichever mechanism is quicker.
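The compiler's routing choice boils down to comparing two latency estimates. A minimal sketch of that decision, with made-up placeholder numbers (the real compiler works from a detailed model of the machine's topology, which is not public):

```python
def pick_route(mesh_hops: int, ns_per_hop: float, ethernet_ns: float) -> str:
    """Choose between hop-by-hop mesh routing and the DIP/Ethernet
    shortcut. All latency figures are hypothetical placeholders."""
    mesh_latency = mesh_hops * ns_per_hop   # cost grows with distance
    return "mesh" if mesh_latency <= ethernet_ns else "ethernet"

# Nearby nodes stay on the mesh; distant ones take the Ethernet shortcut.
near = pick_route(mesh_hops=3, ns_per_hop=50.0, ethernet_ns=2000.0)
far = pick_route(mesh_hops=120, ns_per_hop=50.0, ethernet_ns=2000.0)
```

The mesh wins for short distances because its per-hop cost is tiny, while the Ethernet path has a large fixed cost but doesn't grow with distance.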
DOJO uses a custom communication protocol on top of its node-to-node/chip-to-chip interconnects and Ethernet to shuttle packets around. This protocol, called TTP (Tesla Transport Protocol), is efficient enough to get 50 GB/s of bandwidth when going over Ethernet (yeah, that's 400 Gbps Ethernet, which is apparently a thing).
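Worth noting how strong that 50 GB/s claim is: 400 Gbps divided by 8 bits per byte is exactly 50 GB/s, so hitting that figure means TTP's protocol overhead on the wire is close to zero:

```python
# 400 Gbps Ethernet expressed in bytes per second.
link_gbps = 400
ideal_gb_per_s = link_gbps / 8    # theoretical max with zero overhead
```

Achieving essentially the raw line rate is what "efficient enough" is doing a lot of work for in that sentence.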
Synchronization is done using two mechanisms. Semaphores are used to block threads when needed. In addition, the compiler defines sets of nodes that can run independently of one another and uses software signals between node groups to sync. Aren't compilers wonderful.
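For readers who haven't used semaphores: the pattern is simply that a consumer blocks until a producer signals that data is ready. A toy Python version of that blocking behavior (standing in for DOJO's hardware-managed semaphores; the scenario and names are purely illustrative):

```python
import threading

ready = threading.Semaphore(0)    # starts at 0, so acquire() blocks
results = []

def producer():
    results.append("gradient chunk")   # pretend this is node compute
    ready.release()                    # signal: data is ready

def consumer():
    ready.acquire()                    # blocks here until the signal
    results.append("consumed " + results[0])

t_consumer = threading.Thread(target=consumer)
t_producer = threading.Thread(target=producer)
t_consumer.start()                # consumer starts first, but blocks
t_producer.start()
t_producer.join()
t_consumer.join()
```

Even though the consumer thread starts first, the semaphore guarantees it cannot run ahead of the producer, which is exactly the ordering property the hardware needs.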
OK, stepping up a level. The DOJO system was designed so that it can run with a unique combination of compute, memory, and I/O appropriate to each workload, i.e. the compiler (that wonderful compiler again) can adjust how this massive machine's resources are allocated to balance out those three resources.
I've left out a ton of ridiculous detail. Anyways, that's DOJO in too much detail, we will no doubt get a more understandable view of it all on AI Day #2.
Yeah, I literally cannot listen to it. Too bad, because it did look like it had useful info.

Hey, the video was very helpful. But her voice! Reminded me of my daughter's voice when she was 2...