Here are the notes from the Dojo presentation:
Some slides:
Tesla Dojo Custom AI Supercomputer at HC34
Beyond Compute - Enabling AI through System Integration
Ganesh Venkataramanan, Tesla Motors
Ganesh was previously at AMD (Opteron '64) and a few other companies. He also holds the world record for fastest-clocked silicon
“Hot Chips - one of the premier chip conferences”
80% of data is unstructured, requires new types of processing
Need new data types
New types of cores
‘What is AI’
‘What is ML’
‘What is DL’
AI Training Systems - Dataset Models (SW) and Compute Scale (HW)
4D labels in video data - space and time
Taking 3D models of the environment, labeling them, and tracking them over time
Automated labeling helps, but it still requires a lot of processing and human involvement
If Human is in the AI loop, progress will be slow
Solution is recursion
More good data, better models!
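The "recursion" note refers to a closed data-engine loop: the current model auto-labels new clips, humans review only the uncertain ones, and the model retrains on the grown dataset. A minimal toy sketch of that loop; all names, numbers, and thresholds here are hypothetical illustrations, not Tesla's pipeline:

```python
import random

def train(dataset):
    """Toy stand-in for retraining: 'accuracy' grows with data volume."""
    return {"accuracy": min(0.99, 0.5 + 0.005 * len(dataset))}

def auto_label(model, clip):
    """Toy auto-labeler: returns (label, confidence) for a clip."""
    return "car", random.uniform(model["accuracy"] - 0.3, 1.0)

def data_engine(rounds=3, clips_per_round=50):
    dataset, model = [], train([])
    for r in range(rounds):
        proposals = [(clip,) + auto_label(model, clip)
                     for clip in range(clips_per_round)]
        # Humans review only low-confidence proposals, staying out of
        # the inner loop as much as possible.
        auto_kept   = [p for p in proposals if p[2] >= 0.8]
        human_fixed = [p for p in proposals if p[2] < 0.8]  # sent to review
        dataset += auto_kept + human_fixed
        model = train(dataset)  # more good data -> better model
        print(f"round {r}: dataset={len(dataset)} acc={model['accuracy']:.2f}")
    return model

data_engine()
```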
Different types of online and offline models
Knowledge graphs - Semantic networks, BigGAN, Multi-Modal AI, Transfer Learning, Data-centric AI, ART Networks
Since 2012, effective compute gains:
8x from Moore's Law
25x from algorithms
37,500x from scale-out ($$$)
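A back-of-envelope check on these factors, under the assumption (mine, not the slide's claim) that the three gains compose multiplicatively:

```python
# Back-of-envelope check. Assumption: the three gains since 2012
# compose multiplicatively (the slide lists them separately).
moore, algos, scale = 8, 25, 37_500

total = moore * algos * scale
print(f"combined effective-compute gain: {total:,}x")         # 7,500,000x
print(f"scale-out vs Moore's Law alone:  {scale // moore:,}x")  # 4,687x
```

Either way, the scale-out ($$$) term dwarfs the other two, which is what the later Q&A question about limits is driving at.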
Differentiated hardware
GPU power trend is up
2014: 300W
2016: 350W
2019: 400W
2021: 500W
2022: 700W
2030: 9000W+
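As a sanity check on the 2030 figure: extrapolating the 2014-2022 average growth rate lands well under 9kW, so the projection presumably assumes the recent (2021-2022) growth rate continues. A hedged sketch of both extrapolations; this is my arithmetic, not the method behind the slide:

```python
# Extrapolating the listed GPU TDPs two ways.
tdp = {2014: 300, 2016: 350, 2019: 400, 2021: 500, 2022: 700}

avg_rate    = (tdp[2022] / tdp[2014]) ** (1 / 8)  # ~1.11x per year
recent_rate = tdp[2022] / tdp[2021]               # 1.40x per year

for label, rate in (("2014-2022 avg ", avg_rate),
                    ("2021-2022 rate", recent_rate)):
    print(f"{label}: 2030 TDP ~ {tdp[2022] * rate ** 8:,.0f} W")
# avg:    ~1,633 W
# recent: ~10,330 W  (consistent with "9000W+")
```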
Difficulty of cooling
Dissipating more power in the same area requires exponentially more cooling capacity
Power delivery
Most GPUs today use lateral power delivery
Some AI chips use vertical power delivery
So… vertical power delivery
Datacenter architecture evolution
It’s all about connectivity
Focused compute and connectivity
D1 Chip
400W TDP
645mm², TSMC N7
50B transistors
362 TFLOPs BF16/CFP8
22.6 TFLOPs FP32
25 D1 chips on a Tile
14kW on a tile
9 PFLOPs BF16/CFP8
Packaging - uses all the fan-out layers
Unparalleled integration
25 chips in a Dojo Tile
2 chips in a PCIe card by comparison
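The tile numbers follow from the chip numbers. A quick cross-check; the gap between 25 × 400W and the quoted 14kW is my assumption about power delivery and I/O overhead, not something stated in the talk:

```python
# Cross-checking the tile figures against the per-chip figures above.
chips_per_tile   = 25
chip_tflops_bf16 = 362   # per D1, BF16/CFP8
chip_tdp_w       = 400   # per D1

print(f"tile compute: {chips_per_tile * chip_tflops_bf16 / 1000:.2f} PFLOPs")
# -> 9.05 PFLOPs, matching the quoted "9 PFLOPs"
print(f"chip power:   {chips_per_tile * chip_tdp_w / 1000:.1f} kW of 14 kW")
# Assumption: the remaining ~4 kW covers power delivery losses and I/O.
```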
You need a software stack
More goals
Architecture
Integration
Disaggregation
Abstraction
Algorithm
Compilers
Q&A
Q: NVIDIA: Dojo for non-Tesla users?
A: Focused on internal customers first. Elon has made it public; over time it will be made available to researchers, but no time frame.
Q: Google: Scale-out was ~37,500x vs Moore's Law. How much more scale can we get from $$, and what happens as we approach that limit?
A: No idea. But that rate of scale-out growth is not sustainable. We have to get more efficient - do similar compute at lower power. Most algorithms are developed for current architectures, so those need to evolve as well.
Q: Have algorithms changed in Tesla with the HW?
A: In the process. More at AI Day
Q: Chair: What was the process to build Dojo?
A: It started with the interview with Elon. We wanted to do something different from CPU/GPU; the whole team is still answering that question. We noticed many bottlenecks in inference first, hence FSD, then saw similar scale issues for training - that's simply how it began. Knowing your workloads is important; you need to optimize systems around them.
Q: Qualcomm: Training tile cooling?
A: Partner IPs are involved. At the infrastructure level, it's just a liquid-cooled computer, but for efficiency we had to do a lot of design work.
Q: Learnings from designing with fan-out wafer?
A: The alternatives (MCM, interposer) didn't integrate enough; their scale-out wasn't sufficient. The platform has to be as large as possible. We learned a lot - the key factor for fan-out is yield: you don't need perfect yield, and then you can focus on power and performance. The chip industry is going 2.5D/3D, and all of that increases power density, which means cooling difficulties - you fall off a cliff in requirements. All those discontinuities were considered.
Q: How did you think about the configuration? Bandwidth scales linearly, compute quadratically.
A: We made the best judgment call based on physics and modeling. We can refine it at any time.
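The question's premise: for a die or tile of side L, compute scales with area (L^2) while off-platform bandwidth scales with the edge (L), so bandwidth per unit of compute shrinks as the platform grows. A toy illustration with arbitrary sizes and units:

```python
# Toy model of the questioner's point: compute ~ area (L^2),
# edge bandwidth ~ perimeter (L). Sizes and constants are arbitrary.
for side in (10, 25, 100):
    compute   = side ** 2   # ~ die/tile area
    bandwidth = 4 * side    # ~ perimeter (escape bandwidth)
    print(f"side={side:>3}: bandwidth/compute = {bandwidth / compute:.3f}")
# The ratio falls as 4/side: bigger platforms are compute-rich but
# bandwidth-poor, hence the "judgment call" in the answer.
```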
Q: How do you deal with failures?
A: Mentioned earlier.
Q: SiFive: The IBM 650 was the first large-scale production computer; Tesla built a machine because it couldn't buy one. Would it make more business sense to create standards?
A: If we can solve our problems with any platform, that's what we aim for. We did this in-house because we didn't find a commercial solution moving fast enough.
Q: Software target challenges?
A: There was no commercial software for us, so we built our own, even down to the intermediate representations. Our distributed compiler is amazing. A lot of distributed compiler work was talked about in the '60s/'70s.
Q: What types of NNs, and how do they map to Dojo?
A: Transformer NNs are published. Lots of things are changing; we fast-follow.
Q: Dojo 2?
A: We won't sit idle.