Tesla, TSLA & the Investment World: the Perpetual Investors' Roundtable

With that being said, there is never a time when you would give up performance per watt. Your system having worse perf/W just means some other hardware can do the task faster at the same power, or do the same task using less power.
That's normally true for most workloads and entities, but like I said, if it takes 30 days to run your workload once (and you can't speed it up with more A100s / H100s / etc. because you've hit a bottleneck somewhere - look at their graph showing diminishing returns as they added GPUs to a given task), then additional performance would be worth it up to a point even if Perf/W were worse, if you can afford it.

From Tesla's own real-world usage we can see that there are diminishing returns the more GPUs you add. While performance might go up from 64 to 128 GPUs (not indicated in the graph), going higher clearly isn't getting you much, so there's no way to significantly reduce the time it takes to run such large jobs.

AI Day 2 @ 2:15:44

And even with significant efforts from our ML engineers, we see such models don't scale linearly. The Dojo system was built to make such models work at high utilization. The high-density integration was built to not only accelerate the compute-bound portions of the model, but also the latency-bound portions, like a batch norm, or the bandwidth-bound portions, like a gradient all-reduce or a parameter all-gather.
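To make the shape of that concrete, here's a toy scaling model in the spirit of that quote; every number below is invented purely for illustration and is not taken from Tesla's chart:

```python
# Toy scaling model, illustrative numbers only (not Tesla's data):
# per-step time = parallelizable compute / N + communication that doesn't shrink with N
import math

def step_time(n_gpus: int,
              compute_s: float = 60.0,      # perfectly parallel work, seconds on 1 GPU
              allreduce_s: float = 4.0,     # bandwidth-bound collectives, roughly flat
              hop_latency_s: float = 0.05   # latency-bound cost, grows ~log2(N)
              ) -> float:
    return compute_s / n_gpus + allreduce_s + hop_latency_s * math.log2(n_gpus)

for n in (64, 128, 256, 512, 1024):
    print(f"{n:5d} GPUs -> {step_time(n):.2f} s per step")
# The compute term vanishes as N grows, but the communication terms don't, so past
# a few hundred GPUs the step time flattens out around ~4.5 s in this toy example.
```

The exact numbers don't matter; the point is that once the communication terms dominate, adding GPUs barely moves the step time.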

Whether the point where Perf/W becomes more important again is at a week per job, or a day per job, or an hour per job, versus the original 30 days per job, is entirely going to depend on the goals and resources (i.e. how much money to throw at it) of the entity running the job. I would guess that if Dojo was only as fast (due to bottlenecks or whatever) as their existing GPU clusters, they would not have proceeded with it. But even cutting execution time in half to ~2 weeks was clearly worth it, at least to Tesla, even with the costs of engineering Dojo and all that goes with that.

For such large jobs that take so long to run, the old saying "time is money" applies - reducing the time it takes to run the job outweighs raw Perf/W, if the value of getting those results back faster is sufficiently high to offset the extra costs.

Though, at a claimed potential 2x~4x the performance of an A100 die for die (depending on the workload), and roughly 1.75x~2.8x the watts per die (comparing official A100 TDPs for the various models against the claimed 104 kW for a 150-die 'system tray', i.e. six tiles of 25 D1 dies, from the AI Day 2 presentation), and ignoring the rest of the supporting system power (since there's no easy way to determine that for either cluster type), Perf/W is either not much worse or is much better for Dojo anyway. With Perf/W that close, it becomes even less relevant compared to the execution speedup. If we were talking double the performance for a job that finishes in minutes, seconds, or less, then of course Perf/W is king. But going from 30 days to ~2 weeks, and potentially even faster, has significant business implications that far outweigh the traditional Perf/W considerations.
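Spelling out that arithmetic, using only the figures quoted above and treating the ~104 kW tray as powering 150 D1 dies (that tray layout is my reading, not an official spec):

```python
# Back-of-the-envelope per-die comparison from the numbers quoted above.
# Assumption: the ~104 kW 'system tray' figure covers 150 D1 dies (6 tiles x 25 dies).
DOJO_TRAY_WATTS = 104_000
DOJO_DIES_PER_TRAY = 150
A100_TDP_WATTS = {"PCIe": 250, "SXM": 400}

dojo_watts_per_die = DOJO_TRAY_WATTS / DOJO_DIES_PER_TRAY    # ~693 W per D1 die

for model, tdp in A100_TDP_WATTS.items():
    power_ratio = dojo_watts_per_die / tdp                    # ~2.8x (PCIe), ~1.7x (SXM)
    for perf_ratio in (2.0, 4.0):                             # claimed 2x-4x per die
        print(f"vs A100 {model}: {perf_ratio:.0f}x perf at {power_ratio:.2f}x power "
              f"-> {perf_ratio / power_ratio:.2f}x perf/W for Dojo")
```

Depending on which end of each claimed range you take, Dojo's perf/W lands anywhere from ~0.7x to ~2.3x the A100's, which is the "not much worse or much better" point above.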
 
Honestly, so much of the commentary treats this like some kind of product unveil, as people have become accustomed to polished productions from Apple and others. The only goal was to attract talent, which I am sure they did. Nothing else matters, because that will dictate the success and pace of this project moving forward. If the world wants to laugh at the presentation because it was lame or unpolished or poorly thought out, have at it and enjoy. I doubt the target audience was concerned about these things.
One unintended effect could be that the team appears too awesome, and some people who are going through a phase of imposter syndrome might not even think about trying to join. Haha…
(Side note: personal experience shows that smarter groups of people tend to have a higher percentage of imposter syndrome.)
 
Well, I said Dojo's perf/watt is 2.7x better than the A100, so whichever way you cut it, it's worth implementing. The only question I have is whether it will be better than Nvidia's next-gen Hopper, which is coming out around the same time Dojo is trying to go live. If Nvidia hadn't claimed a 4-6x improvement it wouldn't matter, but they did, so it will be the new reference for comparison.

I was extremely impressed with Dojo a year ago because the improvement in performance vs. the A100 was obvious and Tesla's chip knocked it out of the park. However, it seems that cooling/powering this ultra-dense die gave Tesla lots of problems and added a few delays along the way. Now, with Nvidia's next gen ready for launch, they've made up a lot of ground (according to their claims) and it's a drop-in, plug-and-play upgrade. I did not expect Nvidia's next gen to leapfrog Dojo in any meaningful way, and the claims still need to be verified. At the end of the day, if Tesla's NN hits diminishing returns sooner on Nvidia's hardware than on Dojo, then none of what I'm comparing here matters anyway.
 
Let's not get ahead of ourselves here. I heard nothing about AGI last night. That's a much bigger step. What I saw last night was a platform that could replace some factory workers, i.e. replace workers who do the same thing over and over.

You will note that Tesla didn't address how to tell the robot what to do. No interactions with the robot. All actions were pre-programmed.

Tesla FSD has a relatively simple task of following a pre-set route. The user doesn't have to tell FSD what to do, it is all baked into the training data and algorithmic code. A general purpose robot is a different animal. I'm not saying Tesla can't get there, but they have a big task ahead of them on that front and they haven't started it yet.

I think we are on the same page; I am not saying AGI is near, although Elon did answer a question by saying that progress toward AGI is rapid and exponential.

The robot doesn’t need AGI to be useful. There are plenty of simple tasks that don’t need any online training or deep understanding, think more about tasks like installing solar panels in a desert, instead of cleaning a house.

A human-language interface for simple tasks is a solved problem now. For the tasks the robot will start with, it's likely not even needed. I'd rather they not spend time on that at this stage just to impress the wrong crowd.

With what they already have, plus two weeks of improvements, they would be able to deploy millions of robots and create a lot of value already.

For AGI, I am in the camp that AGI needs embodiment. With the robot fleet, Tesla again would have a data flywheel no one else has, so it almost guarantees they would be the first to give birth to AGI, whenever that may be.

AGI is the endgame, once we are there, we are not going to care about investing or economy or anything we are keeping ourselves busy with now.
 
My biggest concern with Tesla’s AI is the lack of self-learning. Every improvement has to be trained on a neural net and pushed in a software update.

Do robots dream of electric sheep? - That's the function of the training environment. Trust me, you DO NOT want robotaxis teaching themselves to 'drive better' in Montreal... :p


Maybe self-learning AI will be appropriate in the outer Solar System (or the Kuiper Belt), where robots will operate almost completely autonomously. Until then (and here on Earth), you want HUMANS to test, approve, and implement the AI. Honestly*.

Cheers!

*AND Tesla designs the hardware - no mimetic polyalloy, no private APIs.
 
My biggest concern with Tesla’s AI is the lack of self-learning. Every improvement has to be trained on a neural net and pushed in a software update. When FSD beta does something stupid, I can’t give it real time feedback and say “don’t drive in the cross hatch area, it’s not a lane.” I can press the feedback button and send the FSD team a feedback email, but I still have to wait until the next software release to see if there’s improvement.

I don't think Tesla will allow true self-learning because it would make Tesla Bot too unpredictable (Skynet becoming sentient, etc.). I hope they can find a way to automate feedback, so that when Teslabot makes a mistake, it automatically gets fast-tracked and prioritized in Dojo to retrain the neural nets and push an update by the next day. "Teslabot, quit crashing through the sliding glass door after you Windex it." "Feedback received and submitted to Dojo priority list, expect improved glass recognition in 69 hours."
I think there will be a natural break into four levels, which were to an extent alluded to in the presentation:

- essentially hardwired (ROM) 'emergency stop' code, never updated;
- generic Robo-FSD, updated by Tesla periodically;
- add-on software modules ('apps'), whether from Tesla or other vendors with potential for update (e.g. plant/weed recognition, which by the way already exists*);
- individual user training, only capable of update by the user.

With respect to the last, let us say that you are using your bot to do a gardening task. The bot can recognise different plants. But you still need to instruct it that the daisies growing in this area of the border are the ones you want to keep, whilst the ones that are spreading to the gravel pathways should be weeded out except in that area over there where I want a more informal look. That last level needs to be taught by the user. Of course it may be that the Tesla mothership will harvest all those personalised user instructions and do something meaningful with them so that either the generic or the add-on software becomes globally better.

We know that the first release will be to well-controlled areas for relatively industrial tasks. My expectation is also that in time we would see different hardware versions, but that the core software will increasingly edge out existing specialised software in industrial robots. You can imagine that in a decade or so most industrial robots (welding stations etc.) will run Robo-FSD rather than (say) Fanuc G-Code. This is in much the same way that the Android/Apple OS ecosystems have progressively displaced almost all the previous operating systems that the individual mobile phone makers used to maintain. So in the ecosystem some use-cases will be cars/etc.; other use-cases will be generic humanoid bots; and then there will be an array of form-specific use-cases.

Naturally at some point we should expect to see other competitors bring something equivalent to market. What is hard in one year becomes easy five years later and trivial a decade later. Whether those competitors will originate from (say) phonecos like Google/Android and Apple; or from trad-legacy-automotive/silicon like Intel/MobilEye; or from other places entirely is unclear at this stage. I am sure a lot of industrial turmoil and M&A will occur en route. You can bet that there are a lot of folk all over the world thinking this through. If nothing else the militaries of the world will have sat up and shat themselves (they are very aware and they've been working on this stuff a long time - I did my first neural net work in the military back in the 1980s). Of course, to do the background training you have to have the world's top supercomputers, be prepared to put a few million instrumented data capture devices into the world, and to run a top AI industrial R&D team for a decade with little return... if you want to have a crack at being one of those ecosystem owners in the future. And I can't imagine more than 3-4 such ecosystems being viable.

The first iPhone was released in June 2007. The first Android phone was released in September 2008. Look where we are now.

My personal opinion is that FSD for Tesla vehicles is now overpriced at $15k and is increasingly being rejected, though I don't have the data insight to be sure. But when a Tesla bot is mooted at $20k with the same software, Tesla themselves are putting down a pretty firm marker that the FSD price will drop.

* see e.g. Homepage - Fieldwork Robotics Ltd but there are others, and they all have software libraries behind the scenes
 
Lots of chatter on social media about Credit Suisse and/or Deutsche Bank being on the brink of collapse, with the Fed potentially convening an “emergency” meeting on Monday to discuss.

Hadn’t seen it mentioned here yet. Not surprising, but could validate a lot of the doom prophets.
Business with the Russian oligarchy must be bad these days ;)
 
Just curious - what phone has lidar, and what kind of apps is it used for? Something I wasn't aware of. I do have a robot vacuum that has it.
A friend has an iPhone 13 Pro Max, which has lidar - he showed me an app where you could walk through a room, scan the surfaces, and overlay them with the regular camera input for an instant 3D model. An obvious application would be for people selling/renting apartments.
 
Uh, yeah, the time it takes to process is the performance. The amount of energy to do such a thing is the watt. This is the only metric that matters for compute in the data center. And it's Tesla that is telling you the performance per watt by using pictures of Dojo chips and GPU clusters. Notice Tesla spent some time explaining their power usage and cooling solution, because this is all datacenter 101. Insane performance while using insane power still needs to be distilled down to performance per watt. That's how you compare hardware to hardware. It's like $/watt when it comes to buying solar. There may be 100 different types of panels, but at the end of the day the end user only cares about $/watt.

Are you saying the H100 will be worse than the A100, or won't work at all? That's the only argument you can make by throwing all metrics out the window.
Energy is measured in joules (J) or watt-hours (Wh); watts are the rate of energy consumption.
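A minimal sketch of that distinction as it applies to the perf/W argument above (every number here is made up for illustration):

```python
# Power (W) is a rate; energy is power integrated over time (J or Wh).
# For a fixed job, perf/W is equivalent to work done per unit of energy.
def energy_mwh(power_w: float, hours: float) -> float:
    return power_w * hours / 1e6

def work_per_joule(work_units: float, power_w: float, hours: float) -> float:
    return work_units / (power_w * hours * 3600.0)

# Made-up example: the same fixed job on a 100 kW cluster for 30 days
# vs. a 130 kW cluster for 14 days.
job = 1.0e18  # arbitrary "work units"
print(energy_mwh(100_000, 30 * 24), "MWh,", work_per_joule(job, 100_000, 30 * 24))
print(energy_mwh(130_000, 14 * 24), "MWh,", work_per_joule(job, 130_000, 14 * 24))
# The faster cluster here finishes sooner AND uses less total energy,
# i.e. better perf/W, even though its instantaneous power draw is higher.
```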
 

It's not just the hardware, but just as much the mathematics.
A model of a problem is often defined by a set of coupled differential equations that evolve over time, which frequently means repeatedly solving an N×N system of equations.
In university they teach that a so-called Runge-Kutta method can be used for this on a computer. But that may not always be the best solution.

True story:
In university we wanted to solve an engineering problem on a microcomputer, but the estimated computing time to solve just one case came out at around 90 million years.
The professor of our engineering department ridiculed what we tried to do: there was a huge Amdahl computer available, why use these slow silly new microcomputers? So... he proposed a hardware solution.
Well, we're talking 1980's here; we were young and stubborn because we thought there was a huge future for microcomputers.
So we went to another professor in the mathematics department, who explained to us civil engineering students that we likely had a set of 'stiff' differential equations at hand.
Ehh, what?
Explanation: a little change in one equation resulted in a huge change in the other. So... very small steps were needed to solve in a stable way without the numbers becoming too big, resulting in huge computing time when using the Runge-Kutta method.
We had simply used the wrong method.
He advised a so-called implicit method, which we had never heard of. After some programming it worked, and the computing time went down to... around half an hour on a microcomputer.
We had a predicting model for water disinfection that proved to be very revolutionary at the time.

So... it's more than just the hardware.
In addition to optimised software, brilliant mathematicians are necessary.
Tesla will definitely have attracted them, just like Google, who uses them to a large extent.
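For anyone curious what "implicit vs. explicit" buys you on a stiff problem, a minimal sketch on a made-up toy equation (not the original water-disinfection model):

```python
# Toy stiff ODE: dy/dt = -1000 * (y - cos(t)), y(0) = 0.
# Explicit (forward) Euler is only stable for tiny steps (h < 2/1000 here),
# while implicit (backward) Euler stays stable with much larger steps.
import math

LAM = 1000.0

def forward_euler(h: float, t_end: float) -> float:
    t, y = 0.0, 0.0
    while t < t_end:
        y = y + h * (-LAM * (y - math.cos(t)))          # explicit update
        t += h
    return y

def backward_euler(h: float, t_end: float) -> float:
    t, y = 0.0, 0.0
    while t < t_end:
        t += h
        # Solve y_new = y + h * (-LAM * (y_new - cos(t))) for y_new;
        # it's linear in y_new, so there's a closed form.
        y = (y + h * LAM * math.cos(t)) / (1.0 + h * LAM)
    return y

print("explicit, h=0.01:", forward_euler(0.01, 2.0))    # blows up to a huge number
print("implicit, h=0.01:", backward_euler(0.01, 2.0))   # stays near cos(2) ~ -0.416
print("reference cos(2):", math.cos(2.0))
```

The explicit method needs a step size below ~0.002 just to stay stable on this toy problem, which is exactly the "tiny steps, huge run time" trap described above.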
 
Oh yeah, for most NN models Hopper might have equivalent or better perf/W, but to handle the large Tesla models that were the reason to develop Dojo, Nvidia would basically have to deliver something at a similar scale to Dojo. The bottlenecks end up being bandwidth- and latency-related when you get to models that huge, so even if Hopper had 10x the peak TFLOPS of Ampere, you wouldn't get such a big speedup for these big Tesla models. You might get 2x or so in that case, but still not matching Dojo, because the raw compute throughput can't be taken advantage of due to the other bottlenecks.

In addition to the dedicated hardware that improved utilization by more than 90% on Dojo by feeding it better, there's also the superior interconnect for low-latency propagation through the tile mesh. The first would be easier to solve, but the latter would require something like NVLink on steroids, maybe with a dedicated bespoke fiber network mesh connecting all the GPUs, and even then I think they'd struggle, as they're just not designed to be used in that fashion. There's a lot more overhead on a GPU to support more diverse uses beyond pure compute, after all.
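For a feel of why the gradient all-reduce ends up bandwidth-bound, a back-of-the-envelope sketch using the standard ring all-reduce cost; the model size and link speeds below are made up, not Tesla's:

```python
# Ring all-reduce moves roughly 2*(N-1)/N of each GPU's gradient buffer
# over its slowest link, every optimizer step, regardless of peak TFLOPS.
def allreduce_seconds(n_params: float, bytes_per_param: int,
                      n_gpus: int, link_gbit_s: float) -> float:
    buffer_bytes = n_params * bytes_per_param
    bytes_moved = 2 * (n_gpus - 1) / n_gpus * buffer_bytes
    return bytes_moved / (link_gbit_s * 1e9 / 8)          # Gbit/s -> bytes/s

# Hypothetical 1B-parameter model in BF16 (2 bytes/param) on 128 GPUs:
for gbps in (200, 400, 1600):
    t = allreduce_seconds(1e9, 2, 128, gbps)
    print(f"{gbps:5d} Gb/s per link -> {t:.3f} s per all-reduce")
```

That per-step cost doesn't shrink as the chips get faster, which is why a raw TFLOPS jump alone wouldn't close the gap described above.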

I think the only place where Hopper vs. Dojo becomes a legitimate question is on "smaller" models (i.e. basically everyone who isn't Tesla), and even then, a few years from now Dojo might still be competitive with traditional GPU clusters from a compute-as-a-service perspective. Of course, if you want to own the hardware and your name isn't Tesla, you're probably sticking with the other guys. I don't think there's any scenario where Tesla sells the ExaPODs, even to SpaceX. There's no reason to do so, even if you're getting into the business of selling runtime on them.

There's no question in my mind that Hopper, or any other GPU or TPU that isn't built from the ground up for Dojo-like scale, can't compete with Dojo-type hardware as far as these hilariously huge models are concerned. I'm not ruling out someone else doing the same thing - I'd consider it inevitable - but the existing compute architectures out there have no way to scale this way. For the other 99% of the market, perf/W does matter, and Nvidia doesn't really need to worry about Dojo yet, but rather about whether AMD and Intel can get their acts together.
 
Anyone else pick up that DOJO will squarely take the #2 spot for most powerful supercomputer (and be knocking on #1's door by just 1-2%)?

Note that TOP500 performance is measured in FP64 (64-bit floating point), while the ExaPOD's 1.1 EFLOPS figure is BF16 (bfloat16), so the two are not comparable. I think the listed FP32 performance of the D1 chip is an order of magnitude lower than its BF16 performance, and FP64 would be at least 2x lower on top of that, so you need to take this into account if you'd like to get an idea of where Dojo would stand on the TOP500. But I don't think the D1 supports FP64 anyway, so you likely won't see it on the TOP500 at all, which is not a bad thing: since Dojo is optimized for machine learning, it should get a bigger bang for the buck by optimizing for one thing instead of covering all high-performance computing scenarios.
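Applying those rough ratios to the 1.1 EFLOPS BF16 figure, purely as a sanity check (these are the ratios stated above, not official numbers):

```python
# Rough conversion using only the ratios stated above; not official figures.
BF16_EFLOPS = 1.1
FP32_EST = BF16_EFLOPS / 10        # "an order of magnitude" lower  -> ~0.11 EFLOPS
FP64_EST = FP32_EST / 2            # "at least 2x lower on top"     -> ~0.055 EFLOPS
print(f"BF16: {BF16_EFLOPS:.2f} EFLOPS")
print(f"FP32: {FP32_EST:.3f} EFLOPS (rough)")
print(f"FP64: {FP64_EST:.3f} EFLOPS (rough, ~55 PFLOPS, if the hardware supported it)")
```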