Really helpful summary. I am guessing that with Tesla going public with this information (and recent tweets relating to supercomputing), the performance relative to their GPU cluster must be pretty good. Hoping that AI Day does provide some real-world relative performance figures for DOJO vs the GPU cluster. Not too long to wait.

Oh my. Just listened to the two Tesla talks doing a deeeeep dive into DOJO. So much detail presented. For the first talk, you'd have to be a processor architect (and specifically an AI processor architect) to appreciate the info. And the second talk was just as mind-bending; you'd have to be a deep learning programmer/architect to appreciate the info. I'll do my poor best to summarize.
Before you ask: no comparative performance numbers were given, only that they are "happy" with the performance. More info of that nature will be presented at Tesla AI Day #2.
The DOJO processing node is described as a high-performance general-purpose CPU with a custom instruction set tailored to ML. Each node has 1.25 MB of high-speed SRAM used for both instructions and data; there is no cache. Each node has a proprietary off-node network interface to connect to the 4 adjacent nodes (east/west/north/south). There is no virtual memory, only very limited memory protection mechanisms, and sharing of resources is managed programmatically. Each node runs 4 threads; typically 1-2 are used for compute and 1-2 for communications at any one time. They of course have SIMD instructions, which also execute in parallel.
The SRAM runs at 400 GB/s for loads and 270 GB/s for stores. There is a DMA mechanism to transfer data directly between the SRAM stores of different chips.
He explained something called list parsing: a separate hardware element that "allows efficient packaging of complex transfer sequences" and "runs asynchronously in its own thread."
The instruction set has a bunch of instructions unique to ML workloads, for example one that does stochastic rounding. There are 142 vector instructions, 74 scalar, and 12 front-end (general-purpose) instructions, plus many variants of each.
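For anyone unfamiliar with stochastic rounding: the idea is to round up or down randomly, with probability given by the fractional part, so the result is unbiased in expectation. This matters when accumulating lots of low-precision values, where always rounding to nearest would systematically lose small updates. A minimal sketch in Python (illustrating the concept only, not Tesla's actual instruction):

```python
import math
import random

def stochastic_round(x: float) -> int:
    """Round x to a neighboring integer, unbiased in expectation.
    E.g. 2.3 rounds down to 2 with p = 0.7 and up to 3 with p = 0.3,
    so the expected value of the result is exactly 2.3."""
    lower = math.floor(x)
    frac = x - lower                      # fractional part in [0, 1)
    return lower + (1 if random.random() < frac else 0)

# Over many trials the mean converges to the true value, which is why
# ML hardware favors this when accumulating low-precision gradients.
trials = [stochastic_round(2.3) for _ in range(100_000)]
mean = sum(trials) / len(trials)
```

Each individual result is still just 2 or 3, but the average tracks 2.3, which round-to-nearest can never do.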
It supports a huge number of float formats, and the machine can run with all of these different formats in use at the same time; each off-node network packet is tagged with the type of float (or int) it carries.
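To make the per-packet format tagging concrete, here is a toy sketch of the idea: a one-byte tag on each packet identifies the numeric format of the payload, so mixed-precision traffic can coexist on the same network. The tag values and layout below are invented for illustration, not Tesla's actual wire format:

```python
import struct

# Hypothetical tag values; the real protocol's encoding is not public.
FORMATS = {0: "fp32", 1: "bf16", 2: "int8"}

def make_packet(fmt_tag: int, payload: bytes) -> bytes:
    # One-byte format tag followed by the raw payload.
    return struct.pack("B", fmt_tag) + payload

def parse_packet(pkt: bytes):
    # Recover the format name and payload from a packet.
    (tag,) = struct.unpack("B", pkt[:1])
    return FORMATS[tag], pkt[1:]

fmt, body = parse_packet(make_packet(1, b"\x3f\x80"))
```

The receiver never has to guess the precision: it reads the tag and interprets the bytes accordingly.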
There are 354 of these processing nodes per chip, each chip being approx. 2.5 cm x 2.5 cm. So each chip has about 440 MB of SRAM distributed among those 354 nodes. Each chip has a raw performance of 362 TFLOPS using BF16 arithmetic, or 22 TFLOPS using FP32, running at 2 GHz. It is built by TSMC using their 7 nm process.
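A quick back-of-envelope check shows these per-chip numbers hang together (the per-node FLOPs-per-cycle figure below is my derivation from the quoted specs, not something stated in the talk):

```python
# Sanity-checking the quoted per-chip figures.
nodes = 354
sram_total_mb = 1.25 * nodes               # 442.5 MB, quoted as ~440 MB

clock_hz = 2e9                             # 2 GHz
bf16_flops = 362e12                        # 362 TFLOPS per chip
per_node_per_cycle = bf16_flops / (nodes * clock_hz)   # ~511 FLOPs
```

That works out to roughly 512 BF16 FLOPs per node per clock, which gives a feel for how wide each node's matrix/SIMD hardware must be.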
At the edge of each chip there are a total of 576 bidirectional channels connecting each edge node to the corresponding edge node on the next chip.
These 2.5 x 2.5 cm chips are then packaged into a 5x5 array on one DOJO training tile, and now you also have a communication network at each tile edge for cross-tile communications. The training tile is what gets stacked in the third dimension to allow for power delivery and liquid cooling. Each tile consumes 15 kW of power, an insane amount for such a compact device (probably around 40 cm x 40 cm or so).
The full system is organized as a plane of these training tiles. I don't know how big a plane they've created, but at each edge of the virtual plane (bear in mind that physically, I think these things are packed something like 2 tiles per multi-U rack mount space), the edge tiles each connect to a DIP (DOJO Interface Processor). The DIP acts as a host interface to a general-purpose host machine, as well as providing shared memory. The DIPs contain an additional 160 GB of shared DRAM per tile for use with high-parameter models, in addition to the 11 GB of highly distributed SRAM per tile.
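The tile-level SRAM figure follows directly from the chip numbers above, which is a nice consistency check:

```python
# Each tile is a 5x5 array of chips, each with ~440 MB of SRAM.
chips_per_tile = 5 * 5
sram_per_chip_mb = 440
tile_sram_gb = chips_per_tile * sram_per_chip_mb / 1000   # 11 GB per tile
```

So the quoted 11 GB of SRAM per tile is exactly 25 chips' worth, sitting alongside the DIPs' 160 GB of DRAM.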
There are two ways packets get routed between nodes: through the on-chip hardware communication network, and through the DIPs connected to a high-speed shared Ethernet switched backplane. For nodes that are far apart, it can be faster to go off-tile through a DIP, across the Ethernet network, and back in through the DIP of the faraway tile than to make many chip-to-chip and tile-to-tile hops. Basically, there is a third dimension of connectivity via Ethernet for faraway nodes. The compiler is aware of the topology of the entire machine and will use whichever mechanism is quicker.
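The compiler's routing choice boils down to comparing two latency estimates. A minimal sketch of that decision, with made-up placeholder numbers (the real compiler works from a detailed model of the machine's topology, which is not public):

```python
def pick_route(mesh_hops: int, ns_per_hop: float, ethernet_ns: float) -> str:
    """Choose between hop-by-hop mesh routing and the DIP/Ethernet
    shortcut. All latency figures are hypothetical placeholders."""
    mesh_latency = mesh_hops * ns_per_hop   # cost grows with distance
    return "mesh" if mesh_latency <= ethernet_ns else "ethernet"

# Nearby nodes stay on the mesh; distant ones take the Ethernet shortcut.
near = pick_route(mesh_hops=3, ns_per_hop=50.0, ethernet_ns=2000.0)
far = pick_route(mesh_hops=120, ns_per_hop=50.0, ethernet_ns=2000.0)
```

The mesh wins for short distances because its per-hop cost is tiny, while the Ethernet path has a large fixed cost but doesn't grow with distance.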
DOJO uses a custom communication protocol on top of its node-to-node/chip-to-chip interconnects and Ethernet to shuttle packets around. This protocol, called TTP (Tesla Transport Protocol), is efficient enough to get 50 GB/s of bandwidth when going over Ethernet (yeah, that's 400 Gbps Ethernet, which is apparently a thing).
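Worth noting how strong that 50 GB/s claim is: 400 Gbps divided by 8 bits per byte is exactly 50 GB/s, so hitting that figure means TTP's protocol overhead on the wire is close to zero:

```python
# 400 Gbps Ethernet expressed in bytes per second.
link_gbps = 400
ideal_gb_per_s = link_gbps / 8    # theoretical max with zero overhead
```

Achieving essentially the raw line rate is what "efficient enough" is doing a lot of work for in that sentence.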
Synchronization is done using two mechanisms. Semaphores are used to block threads when needed. In addition, the compiler defines sets of nodes that can run independently of one another and uses software signals between node groups to sync. Aren't compilers wonderful.
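For readers who haven't used semaphores: the pattern is simply that a consumer blocks until a producer signals that data is ready. A toy Python version of that blocking behavior (standing in for DOJO's hardware-managed semaphores; the scenario and names are purely illustrative):

```python
import threading

ready = threading.Semaphore(0)    # starts at 0, so acquire() blocks
results = []

def producer():
    results.append("gradient chunk")   # pretend this is node compute
    ready.release()                    # signal: data is ready

def consumer():
    ready.acquire()                    # blocks here until the signal
    results.append("consumed " + results[0])

t_consumer = threading.Thread(target=consumer)
t_producer = threading.Thread(target=producer)
t_consumer.start()                # consumer starts first, but blocks
t_producer.start()
t_producer.join()
t_consumer.join()
```

Even though the consumer thread starts first, the semaphore guarantees it cannot run ahead of the producer, which is exactly the ordering property the hardware needs.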
OK, stepping up a level. The DOJO system was designed so that it can run with a unique combination of compute, memory, and I/O appropriate to each workload, i.e. the compiler (that wonderful compiler again) can adjust how this massive machine's resources are allocated to balance out those three resources.
I've left out a ton of ridiculous detail. Anyways, that's DOJO in too much detail, we will no doubt get a more understandable view of it all on AI Day #2.
Yeah, I literally cannot listen to it. Too bad, because it did look like it had useful info.

Hey, the video was very helpful. But her voice! Reminded me of my daughter's voice when she was 2...