Project Dojo - the SaaS Product?

Interesting: now we have some performance reviews of Nvidia's H100 versus the A100.

Nvidia H100
700W TDP
BF16 performance compared to A100: 3.2X

Dojo D1 Chip
400W TDP
BF16 performance compared to A100 (with Tesla compiler): 3.2X-4X

Looks like the D1 chip has similar performance to Nvidia's latest while using almost half the power. Performance when going off-chip should be an order of magnitude greater with Dojo's chip-on-wafer design. This is actually pretty incredible!
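
A quick back-of-the-envelope in Python, using only the figures quoted above (the 3.2X/4X numbers are relative training-throughput claims rather than raw FLOPS, and I'm assuming the 400 W SXM A100 as the baseline):

a100_tdp_w = 400  # assumed baseline: A100 SXM at 400 W

chips = {
    "H100": {"speedup_vs_a100": 3.2, "tdp_w": 700},
    "Dojo D1 (Tesla compiler, low end)": {"speedup_vs_a100": 3.2, "tdp_w": 400},
    "Dojo D1 (Tesla compiler, high end)": {"speedup_vs_a100": 4.0, "tdp_w": 400},
}

for name, chip in chips.items():
    # Speedup vs. A100 divided by the TDP ratio vs. A100 gives a rough perf-per-watt factor.
    perf_per_watt = chip["speedup_vs_a100"] / (chip["tdp_w"] / a100_tdp_w)
    print(f"{name}: {perf_per_watt:.2f}x A100 performance per watt")

By that rough measure the H100 works out to about 1.8x the A100 per watt, while the D1 claim works out to 3.2x-4x per watt.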

 
That list is based on the LINPACK benchmark for traditional x86 (64-bit) CPU-based systems. Those supercomputers are designed for traditional compute workloads. Hence, it's not a useful comparison for Machine Learning systems.

CPUs are well-suited for general-purpose workloads (all software tasks), allowing them to handle diverse instruction types, including arithmetic, logical, control flow, and I/O operations. CPUs are highly optimized for sequential operations and quick memory access.

Traditional compute work is limited by algorithmic complexity. Scaling the system often doesn't provide proportional improvements in performance or results due to the inherent limitations of the algorithms being used.

On the other hand, in machine learning, better results are often achieved with more data and larger models (although there are other factors involved, this is generally true for very complex problems). Scaling supercomputers for machine learning tasks can provide significant improvements. The parallel and distributed nature of computing tasks allows for infinite scalability (in theory).

Machine learning computers use GPUs or accelerators, which are excellent for performing massive-scale matrix multiplication operations required for machine learning tasks. Accelerators and GPUs have hundreds of cores that can process these specific arithmetic operations in parallel.
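
To make the "massive-scale matrix multiplication" point concrete: a single (M x K) by (K x N) matmul costs about 2*M*K*N floating-point operations, and that is what every petaFLOPS figure in this thread is ultimately counting. A minimal NumPy sketch (the sizes are arbitrary, chosen only for illustration):

import time
import numpy as np

# One square matmul, roughly the shape of a transformer projection layer.
M, K, N = 4096, 4096, 4096
a = np.random.rand(M, K).astype(np.float32)
b = np.random.rand(K, N).astype(np.float32)

flops = 2 * M * K * N  # one multiply plus one add per inner-loop step
t0 = time.perf_counter()
c = a @ b
elapsed = time.perf_counter() - t0

print(f"{flops / 1e9:.0f} GFLOP in {elapsed:.3f} s "
      f"-> {flops / elapsed / 1e12:.2f} TFLOPS on this machine")

On a typical CPU this lands somewhere between tens of GFLOPS and a few TFLOPS; a single A100 delivers roughly 312 TFLOPS of the same arithmetic in FP16 on its tensor cores, and accelerators spend almost all of their silicon on exactly this operation.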

Now, going on a bit of a tangent...

Not all theoretical FLOPs on accelerators are created equal. The systems surrounding the accelerators play a critical role in the overall system's utilization, occupancy, and energy consumption. Dojo was specifically designed for massive video training. It's important to recognize that Dojo's performance should be evaluated within the context of its intended purpose. It may not do well in some arbitrary training benchmarks the industry uses to compare training hardware.

On a small benchmark model, Dojo's performance is on par with the A100.
View attachment 955005

However, Dojo excels on large-scale, complex models with high-intensity arithmetic workloads. These models face critical data-transfer bottlenecks and diminishing returns when training is scaled on the Nvidia stack. Dojo is 3.2x and 4.4x faster than the A100:
View attachment 955007

Dojo V2 will be more general purpose (Autopilot, Bots, AGI and, potentially, opened up to everyone as a pay-as-you-go service).

View attachment 955013
 
Comparison of NVIDIA's and Tesla's AI Hardware

It seems to me that we have now reached the steep part of the S-curve in the evolution of AI and AI hardware. To get a better understanding, I compared NVIDIA's portfolio with Tesla's in this area. Not being an AI expert at all, the products look comparable to me from a performance viewpoint and also from a performance-per-watt viewpoint. What is also important now is who (e.g. Microsoft/OpenAI, Meta, Google and Tesla/x.AI) is able to scale very fast and can leverage the additional computing capacity. Tesla's D1 and NVIDIA's A100 use the older TSMC 7 nm process, while the NVIDIA H100 uses the TSMC 4 nm process, which is also used for Apple's A16 system-on-a-chip in the iPhone 14 Pro/Pro Max. It's possible that Tesla can scale faster due to better available manufacturing capacity at TSMC, but I don't know which part of the system will be the bottleneck.

For comparison, the most powerful supercomputing system according to the TOP500 list is currently Frontier, built from AMD components, with a performance of 1.2 exaFLOPS FP64, which I would translate to about 20 exaFLOPS of FP16 AI performance.

My main sources were:
  • Tweet Tesla AI June 21, 2023
  • Presentation Tesla AI Day 2 October 1, 2022
  • NVIDIA’s Website, July 16, 2023
  • Wikipedia
AMD's portfolio was out of scope for my comparison, due to my limited time.

NVIDIA's current AI Portfolio (A100 based)

The A100 Tensor Core GPU has an FP16 (AI) performance of 0.312 petaFLOPS and a max. TDP of 300 W or 400 W, depending on the configuration. All NVIDIA FP16 AI performance figures in this post are quoted without the "sparsity" optimization, since Tesla made the same assumption in the graph tweeted on June 21, 2023.

The DGX A100 consists of 8x NVIDIA A100 80GB Tensor Core GPUs, has a max. system power usage of 6.5 kW and delivers 2.5 petaFLOPS of AI performance. The NVIDIA A100 Tensor Core GPUs seem to account for about 50 % of the max. system power, a split I use below for ballpark estimations.

For 1 exaFLOP of FP16 AI performance, 400 DGX A100 systems would be needed, with a total max. system power usage of 2.6 MW.
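
For anyone who wants to check that arithmetic, here it is spelled out in Python (numbers taken straight from the spec figures above):

dgx_a100_pflops = 8 * 0.312      # 8 A100 GPUs per DGX A100, FP16 without sparsity
dgx_a100_power_kw = 6.5          # max. system power per DGX A100

systems_for_1_ef = 1000 / dgx_a100_pflops            # 1 exaFLOP = 1000 petaFLOPS
total_power_mw = systems_for_1_ef * dgx_a100_power_kw / 1000

print(f"{systems_for_1_ef:.0f} DGX A100 systems, {total_power_mw:.2f} MW")
# prints "401 DGX A100 systems, 2.60 MW"; the post rounds to 400 systems and 2.6 MW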

NVIDIA's upcoming AI Portfolio (H100 based)

The new Grace Hopper superchip with 1000 W TDP consists of a CPU (Grace, Arm Neoverse V2), a GPU (NVIDIA H100 Tensor Core GPU), up to 480 GB of LPDDR5X ECC memory and up to 96 GB of HBM3. FP16 performance is 0.990 petaFLOPS.

The DGX GH200, planned for H2 2023, consists of 256 NVIDIA Grace Hopper superchips (total TDP of the Grace Hopper superchips: 256 kW) and has an FP16 performance of 0.25 exaFLOPS.

Helios will consist of 4 DGX GH200 systems (TDP of Grace Hopper superchips: 1024 kW) and will have an FP16 performance of 1 exaFLOP.

The total power draw of a system with 1 exaFLOP of FP16 AI performance is roughly 2 MW (assumption: half of the energy is used for the Grace Hopper superchips).
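
The same back-of-the-envelope for the Grace Hopper generation, under the same assumption that the superchips account for about half of the total system power:

gh200_pflops = 0.990     # FP16 per Grace Hopper superchip, without sparsity
gh200_tdp_kw = 1.0       # 1000 W TDP per superchip

helios_chips = 4 * 256   # Helios: 4x DGX GH200 with 256 superchips each
helios_ef = helios_chips * gh200_pflops / 1000
chip_power_mw = helios_chips * gh200_tdp_kw / 1000
system_power_mw = chip_power_mw / 0.5    # chips assumed to be ~50 % of total draw

print(f"Helios: {helios_ef:.2f} exaFLOPS FP16, ~{system_power_mw:.1f} MW total")
# prints "Helios: 1.01 exaFLOPS FP16, ~2.0 MW total"
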
Tesla's upcoming AI Portfolio

The new D1 Chip with 400 W TDP has an FP16 Performance of 0.362 petaFLOPS.

A tile consists of 25 D1 Chips (TDP of D1 chips: 10 kW) and has a total FP16 Performance of 9.05 petaFLOPS.

A cabinet consists of 12 Tiles (TDP of D1 chips: 120 kW) and has an FP16 Performance of 108 petaFLOPS.

An ExaPOD consists of 10 Cabinets (TDP of D1 chips: 1200 kW) and has a total FP16 Performance of 1.08 exaFLOPS.

The total power draw of the system for 1.08 exaFLOPS of FP16 AI Performance is roughly 2.4 MW (assumption: half of the energy is used for the D1 chips). For 1 exaFLOP, 2.22 MW of electrical power is required.
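
And the corresponding sketch for the Dojo ExaPOD, again assuming that the D1 chips account for roughly half of the total power draw:

d1_pflops = 0.362        # FP16 per D1 chip
d1_tdp_kw = 0.4          # 400 W TDP per D1 chip

chips_per_exapod = 25 * 12 * 10   # 25 chips/tile, 12 tiles/cabinet, 10 cabinets
exapod_ef = chips_per_exapod * d1_pflops / 1000
chip_power_mw = chips_per_exapod * d1_tdp_kw / 1000
system_power_mw = chip_power_mw / 0.5    # chips assumed to be ~50 % of total draw

print(f"ExaPOD: {exapod_ef:.2f} exaFLOPS FP16, ~{system_power_mw:.1f} MW total, "
      f"{system_power_mw / exapod_ef:.2f} MW per exaFLOP")
# prints "ExaPOD: 1.09 exaFLOPS FP16, ~2.4 MW total, 2.21 MW per exaFLOP"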

100 exaFLOP FP16 AI Performance, scheduled for October 2023, will require roughly 220 MW, which equals 1.9 TWh per year if running at 100 % uptime at full capacity.
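
Extrapolating that to the 100 exaFLOP target (my own rough scaling from the ExaPOD numbers above, not an official figure):

exapods_needed = 100 / 1.08        # 100 exaFLOPS target / 1.08 exaFLOPS per ExaPOD
power_mw = exapods_needed * 2.4    # ~2.4 MW per ExaPOD (see above)
energy_twh_per_year = power_mw * 8760 / 1e6   # MW * hours per year -> TWh

print(f"~{exapods_needed:.0f} ExaPODs, ~{power_mw:.0f} MW, "
      f"~{energy_twh_per_year:.1f} TWh per year at 100 % utilization")
# prints "~93 ExaPODs, ~222 MW, ~1.9 TWh per year at 100 % utilization"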