Project Dojo - the SaaS Product?

Interesting: now we have some performance reviews of Nvidia's H100 versus the A100.

Nvidia H100
700W TDP
BF16 performance compared to A100: 3.2X

Dojo D1 Chip
400W TDP
BF16 performance compared to A100 (with Tesla compiler): 3.2X-4X

Looks like the D1 chip has similar performance to Nvidia's latest while using almost half the power. Performance when going off-chip should be an order of magnitude greater with Dojo's chip-on-wafer design. This is actually pretty incredible!
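
A quick back-of-the-envelope in Python, using only the figures quoted above (the 3.2X/4X numbers are relative training-throughput claims rather than raw FLOPS, and I'm assuming the 400 W SXM A100 as the baseline):

a100_tdp_w = 400  # assumed baseline: A100 SXM at 400 W

chips = {
    "H100": {"speedup_vs_a100": 3.2, "tdp_w": 700},
    "Dojo D1 (Tesla compiler, low end)": {"speedup_vs_a100": 3.2, "tdp_w": 400},
    "Dojo D1 (Tesla compiler, high end)": {"speedup_vs_a100": 4.0, "tdp_w": 400},
}

for name, chip in chips.items():
    # Speedup vs. A100 divided by the TDP ratio vs. A100 gives a rough perf-per-watt factor.
    perf_per_watt = chip["speedup_vs_a100"] / (chip["tdp_w"] / a100_tdp_w)
    print(f"{name}: {perf_per_watt:.2f}x A100 performance per watt")

By that rough measure the H100 works out to about 1.8x the A100 per watt, while the D1 claim works out to 3.2x-4x per watt.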

 
That list is based on the LINPACK benchmark for traditional x86 (64-bit) CPU-based systems. Those supercomputers are designed for traditional compute workloads. Hence, it's not a useful comparison for Machine Learning systems.

CPUs are well-suited for general-purpose workloads (all software tasks), allowing them to handle diverse instruction types, including arithmetic, logical, control flow, and I/O operations. CPUs are highly optimized for sequential operations and quick memory access.

Traditional compute work is limited by algorithmic complexity. Scaling the system often doesn't provide proportional improvements in performance or results due to the inherent limitations of the algorithms being used.

On the other hand, in machine learning, better results are often achieved with more data and larger models (although there are other factors involved, this is generally true for very complex problems). Scaling supercomputers for machine learning tasks can provide significant improvements. The parallel and distributed nature of computing tasks allows for infinite scalability (in theory).

Machine learning computers use GPUs or accelerators, which are excellent for performing massive-scale matrix multiplication operations required for machine learning tasks. Accelerators and GPUs have hundreds of cores that can process these specific arithmetic operations in parallel.
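
To make the "massive-scale matrix multiplication" point concrete: a single (M x K) by (K x N) matmul costs about 2*M*K*N floating-point operations, and that is what every petaFLOPS figure in this thread is ultimately counting. A minimal NumPy sketch (the sizes are arbitrary, chosen only for illustration):

import time
import numpy as np

# One square matmul, roughly the shape of a transformer projection layer.
M, K, N = 4096, 4096, 4096
a = np.random.rand(M, K).astype(np.float32)
b = np.random.rand(K, N).astype(np.float32)

flops = 2 * M * K * N  # one multiply plus one add per inner-loop step
t0 = time.perf_counter()
c = a @ b
elapsed = time.perf_counter() - t0

print(f"{flops / 1e9:.0f} GFLOP in {elapsed:.3f} s "
      f"-> {flops / elapsed / 1e12:.2f} TFLOPS on this machine")

On a typical CPU this lands somewhere between tens of GFLOPS and a few TFLOPS; a single A100 delivers roughly 312 TFLOPS of the same arithmetic in FP16 on its tensor cores, and accelerators spend almost all of their silicon on exactly this operation.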

Now, going on a bit of a tangent...

Not all theoretical FLOPs on accelerators are created equal. The systems surrounding the accelerators play a critical role in the overall system's utilization, occupancy, and energy consumption. Dojo was specifically designed for massive video training. It's important to recognize that Dojo's performance should be evaluated within the context of its intended purpose. It may not do well in some arbitrary training benchmarks the industry uses to compare training hardware.

On a small benchmark model, Dojo's performance is on par with the A100.
View attachment 955005

However, Dojo excels on large-scale, complex models with high-intensity arithmetic workloads. These models face critical data-transfer bottlenecks and diminishing returns when training is scaled on the Nvidia stack. Dojo is 3.2x and 4.4x faster than the A100:
View attachment 955007

Dojo V2 will be more general purpose (Autopilot, Bots, AGI and, potentially, opened up to everyone as a pay-as-you-go service).

View attachment 955013
 
Comparison of NVIDIA's and Tesla's AI Hardware

It seems to me that we have now reached the steep part of the S-curve in the evolution of AI and AI hardware. To get a better understanding, I compared NVIDIA's portfolio with Tesla's in this area. Not being an AI expert at all, the products look comparable to me from a performance viewpoint and also from a performance-per-watt viewpoint. What is also important now is who (e.g. Microsoft/OpenAI, Meta, Google and Tesla/x.AI) is able to scale very fast and can leverage the additional computing capacity. Tesla's D1 and NVIDIA's A100 use the older TSMC 7 nm process, while the NVIDIA H100 uses the TSMC 4 nm process, which is also used for Apple's A16 system-on-a-chip in the iPhone 14 Pro/Pro Max. It's possible that Tesla can scale faster due to better available manufacturing capacity at TSMC, but I don't know which part of the system will be the bottleneck.

For comparison, the most powerful supercomputing system according to the TOP500 list is currently Frontier, built from AMD components, with a performance of 1.2 exaFLOPS FP64, which I would translate to about 20 exaFLOPS of FP16 AI performance.

My main sources were:
  • Tweet Tesla AI June 21, 2023
  • Presentation Tesla AI Day 2 October 1, 2022
  • NVIDIA’s Website, July 16, 2023
  • Wikipedia
AMD's portfolio was out of scope for my comparison, due to my limited time.

NVIDIA's current AI Portfolio (A100 based)

The A100 Tensor Core GPU has an FP16 (AI) performance of 0.312 petaFLOPS and a max. TDP of 300 W or 400 W, depending on the configuration. All NVIDIA FP16 AI performance figures in this post are quoted without the "sparsity" optimization, since Tesla made the same assumption in the graph tweeted on June 21, 2023.

The DGX A100 consists of 8x NVIDIA A100 80GB Tensor Core GPUs, has a max. system power usage of 6.5 kW and delivers 2.5 petaFLOPS of AI performance. The NVIDIA A100 Tensor Core GPUs seem to account for about 50 % of the max. system power, a split I use below for ballpark estimations.

For 1 exaFLOP of FP16 AI performance, 400 DGX A100 systems would be needed, with a total max. system power usage of 2.6 MW.
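
For anyone who wants to check that arithmetic, here it is spelled out in Python (numbers taken straight from the spec figures above):

dgx_a100_pflops = 8 * 0.312      # 8 A100 GPUs per DGX A100, FP16 without sparsity
dgx_a100_power_kw = 6.5          # max. system power per DGX A100

systems_for_1_ef = 1000 / dgx_a100_pflops            # 1 exaFLOP = 1000 petaFLOPS
total_power_mw = systems_for_1_ef * dgx_a100_power_kw / 1000

print(f"{systems_for_1_ef:.0f} DGX A100 systems, {total_power_mw:.2f} MW")
# prints "401 DGX A100 systems, 2.60 MW"; the post rounds to 400 systems and 2.6 MW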

NVIDIA's upcoming AI Portfolio (H100 based)

The new Grace Hopper superchip with 1000 W TDP consists of a CPU (Grace, Arm Neoverse V2), a GPU (NVIDIA H100 Tensor Core GPU), up to 480 GB of LPDDR5X ECC memory and up to 96 GB of HBM3. FP16 performance is 0.990 petaFLOPS.

The DGX GH200, planned for H2 2023, consists of 256 NVIDIA Grace Hopper superchips (total TDP of the Grace Hopper superchips: 256 kW) and has an FP16 performance of 0.25 exaFLOPS.

Helios will consist of 4 DGX GH200 systems (TDP of Grace Hopper superchips: 1024 kW) and will have an FP16 performance of 1 exaFLOP.

The total power draw of a system with 1 exaFLOP of FP16 AI performance is roughly 2 MW (assumption: half of the energy is used for the Grace Hopper superchips).
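
The same back-of-the-envelope for the Grace Hopper generation, under the same assumption that the superchips account for about half of the total system power:

gh200_pflops = 0.990     # FP16 per Grace Hopper superchip, without sparsity
gh200_tdp_kw = 1.0       # 1000 W TDP per superchip

helios_chips = 4 * 256   # Helios: 4x DGX GH200 with 256 superchips each
helios_ef = helios_chips * gh200_pflops / 1000
chip_power_mw = helios_chips * gh200_tdp_kw / 1000
system_power_mw = chip_power_mw / 0.5    # chips assumed to be ~50 % of total draw

print(f"Helios: {helios_ef:.2f} exaFLOPS FP16, ~{system_power_mw:.1f} MW total")
# prints "Helios: 1.01 exaFLOPS FP16, ~2.0 MW total"
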
Tesla's upcoming AI Portfolio

The new D1 Chip with 400 W TDP has an FP16 Performance of 0.362 petaFLOPS.

A tile consists of 25 D1 Chips (TDP of D1 chips: 10 kW) and has a total FP16 Performance of 9.05 petaFLOPS.

A cabinet consists of 12 Tiles (TDP of D1 chips: 120 kW) and has an FP16 Performance of 108 petaFLOPS.

An ExaPOD consists of 10 Cabinets (TDP of D1 chips: 1200 kW) and has a total FP16 Performance of 1.08 exaFLOPS.

The total power draw of the system for 1.08 exaFLOPS of FP16 AI Performance is roughly 2.4 MW (assumption: half of the energy is used for the D1 chips). For 1 exaFLOP, 2.22 MW of electrical power is required.
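
And the corresponding sketch for the Dojo ExaPOD, again assuming that the D1 chips account for roughly half of the total power draw:

d1_pflops = 0.362        # FP16 per D1 chip
d1_tdp_kw = 0.4          # 400 W TDP per D1 chip

chips_per_exapod = 25 * 12 * 10   # 25 chips/tile, 12 tiles/cabinet, 10 cabinets
exapod_ef = chips_per_exapod * d1_pflops / 1000
chip_power_mw = chips_per_exapod * d1_tdp_kw / 1000
system_power_mw = chip_power_mw / 0.5    # chips assumed to be ~50 % of total draw

print(f"ExaPOD: {exapod_ef:.2f} exaFLOPS FP16, ~{system_power_mw:.1f} MW total, "
      f"{system_power_mw / exapod_ef:.2f} MW per exaFLOP")
# prints "ExaPOD: 1.09 exaFLOPS FP16, ~2.4 MW total, 2.21 MW per exaFLOP"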

100 exaFLOP FP16 AI Performance, scheduled for October 2023, will require roughly 220 MW, which equals 1.9 TWh per year if running at 100 % uptime at full capacity.
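
Extrapolating that to the 100 exaFLOP target (my own rough scaling from the ExaPOD numbers above, not an official figure):

exapods_needed = 100 / 1.08        # 100 exaFLOPS target / 1.08 exaFLOPS per ExaPOD
power_mw = exapods_needed * 2.4    # ~2.4 MW per ExaPOD (see above)
energy_twh_per_year = power_mw * 8760 / 1e6   # MW * hours per year -> TWh

print(f"~{exapods_needed:.0f} ExaPODs, ~{power_mw:.0f} MW, "
      f"~{energy_twh_per_year:.1f} TWh per year at 100 % utilization")
# prints "~93 ExaPODs, ~222 MW, ~1.9 TWh per year at 100 % utilization"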