
Tesla AI Day - 2021

Not NN. Path planning and car control are still handled by complex C++ code. While they want to get to end-to-end car control via the NN, it sounds like that's still a future project. It sounded to me like they are so close to being able to release vision-only FSD in the US that they are gunning for that before they explore other things. Also, having Dojo will let them do more, faster, once it comes online.

Yeah, what made me react was this: it shows the Monte Carlo function in the "neural net planner". His accent was quite heavy, so maybe I missed that this is what they WANT it to look like (i.e. NOT finished).

[Attached screenshot: the "neural net planner" slide showing the Monte Carlo function]
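For what it's worth, a "Monte Carlo" box in a planning stack usually points at Monte Carlo tree search or random-rollout evaluation of candidate maneuvers rather than a single learned network. Tesla gave no implementation details, so the following is only a toy, hypothetical sketch of the rollout idea; the action set, cost function and all constants are made up for illustration:

```python
import random

# Toy, hypothetical sketch of Monte Carlo rollouts over candidate maneuvers.
# Nothing here reflects Tesla's actual planner; the state, actions and cost
# are invented purely to illustrate what a "Monte Carlo" step in a planner
# typically means: sample lots of possible futures, score them, and pick the
# first action whose futures look best on average.

ACTIONS = [-0.5, 0.0, +0.5]   # lateral nudges in meters per step (made up)
HORIZON = 10                  # how many steps each rollout looks ahead
N_ROLLOUTS = 500              # random futures sampled per candidate action


def rollout_cost(offset: float) -> float:
    """Simulate one random future and return its cost (lower is better)."""
    cost = 0.0
    for _ in range(HORIZON):
        offset += random.choice(ACTIONS)   # random behavior in the future
        cost += offset ** 2                # penalize drifting from lane center
    return cost


def plan_first_action(current_offset: float) -> float:
    """Pick the first lateral nudge whose rollouts are cheapest on average."""
    best_action, best_score = 0.0, float("inf")
    for action in ACTIONS:
        start = current_offset + action
        avg = sum(rollout_cost(start) for _ in range(N_ROLLOUTS)) / N_ROLLOUTS
        if avg < best_score:
            best_action, best_score = action, avg
    return best_action


print("chosen lateral nudge:", plan_first_action(current_offset=1.2))
```

In a real planner the rollouts would presumably be guided by learned networks and a proper vehicle model rather than uniform random sampling; the point is only that "Monte Carlo" implies sampling candidate futures, which fits the read that the NN planner on that slide is still a work in progress rather than a finished end-to-end network.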
 
If Tesla Vision is 10m+ from a large, mostly featureless continuous wall, or from a large Gaussian-noise-like surface, how can it determine distance to an object that resists stitching, when the ultrasonics can't be used?

Same way humans do? Well before you are 10m from it, you'll be 40m from it, and its apparent size will change as you close in, which gives you a distance estimate.
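To put a rough formula on the "its size will change" intuition: under a simple pinhole-camera model, apparent size is inversely proportional to distance, so the rate at which an object grows in the image gives you range even if you never know its true size. This is just the textbook geometry, not a claim about how Tesla's network does it, and the numbers are purely illustrative:

```python
def range_from_growth(h1_px: float, h2_px: float, travelled_m: float) -> float:
    """Distance to an object at the first of two frames, from its apparent
    heights h1_px and h2_px (h2 taken later, so h2 > h1 when approaching)
    and the distance travelled between the frames.

    Pinhole model: h = f * H / d, so h1 / h2 = d2 / d1 with d2 = d1 - travelled.
    Solving for d1 gives d1 = travelled / (1 - h1 / h2).
    """
    return travelled_m / (1.0 - h1_px / h2_px)


# A featureless wall that grows from 80 px to 100 px tall while the car
# moves 10 m forward was therefore about 50 m away at the first frame.
print(range_from_growth(80, 100, 10.0))   # -> 50.0
```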

Even in hard cases: say you drove down a gravel alley beside a grey building at night. Or past a mirrored building on a wet asphalt road. A road covered with leaves, no curbs visible. A flat, freshly snow-covered road next to snow-covered cars and snow piles.

You'd be amazed what you (and the Tesla NN) can glean from the faintest signal. The presenters did stress that they are often amazed at what the NN can figure out.

Can vision do everything at 70mph without any range-finding sensors?

Humans can. The NN should be able to as well.
 
I ended up staying up pretty late watching the whole thing last night. I work in a similar field of research (not self-driving, but computer vision / object detection / segmentation / tracking, etc.) and really enjoyed the talk. Here are some general thoughts and notes I had as I watched it:

1. I've always admired Andrej Karpathy as a very pragmatic, no-BS ML researcher, and I continue to be glad that he is leading this effort. Not to take anything away from the rest of the team, but the overall direction and approach they have taken is quite something: not just where they are going with their ML approaches, but also the spectacular amount of engineering they have done to build up the right tooling and infrastructure so they can iterate fast.

2. The evolution of their networks, architectures and training approaches is quite sensible, but these models are extremely complex and require a ridiculous amount of data to train, which is why Tesla has built out all that amazing infrastructure on the labeling/reconstruction side of things, as well as with simulations.

3. Their approach to spatial and temporal memory is a nice and much-needed upgrade, and the performance improvement in the demos was very nice to see.

4. The policy-learning section was really neat, and I'm sure there is a lot more interesting stuff under the hood.

5. Very excited to see where they have gone with simulations. The imagery looks quite impressive. It is very tricky because if your visuals don't look near-identical to the real world, the networks are so powerful that they can just learn to do well on synthetic imagery and then fall apart on real-world imagery. This is a common challenge when folding in synthetic data from simulations, so it was nice to see the extent of work they have done and continue to do with neural rendering to try and bridge the gap between simulation and reality (a generic sketch of one common mitigation follows after this list).
- I did note that the vehicle dynamics of cars/trucks, etc. in the simulations were fairly non-existent in terms of body roll and so on. But that's a second-order concern at the moment and unlikely to matter too much for what they are trying to do.

6. It's clear that they are starting to hit the limits of being able to fit all the inference for these complex networks onto their current FSD hardware. In the diagram they showed in one of the slides, it's clear that they are now leveraging both compute units to try and squeeze the needed performance (rather than the original vision of having the 2nd unit for redundancy). I wonder how they are going to handle the inevitable fragmentation when they release new hardware and have to support two separate sets of FSD/AP software and associated networks, because they certainly aren't going to foot the bill for upgrading the entire fleet.

7. The crazy level of custom engineering that has gone into every aspect of their pipeline, from labeling tools and infrastructure to training networks, simulations and regression testing of their models, is quite spectacular.

8. Their offline processing of videos to generate ground-truth 3D point clouds and other assets, plus auto-labeling, is extremely impressive. Anyone who works in that kind of field probably appreciated even more just how neat it is.

9. They've definitely made the right decisions early on in how they set up the closed loop between the experiments they want to run, or the data they want to collect, and their vehicle fleet. Tying hard/weird examples in with simulations to generate a large number of variants of a single real-world "weird" scenario is a logical extension and really neat to see, because it helps address the problem of the long tail of weird scenarios. It doesn't fundamentally solve it, because the real-world long tail is really long, but over time this will still improve FSD's ability to handle weird scenarios.

10. I feel like they were a bit disingenuous with their Dojo presentation. There was very little distinction made between what they have built, tested and benchmarked to-date vs. what was aspirational with regards to Dojo. They only just got one working D1 on a benchtop that they managed to train a small GPT model on, but the slides would have you believe they have this huge room-sized cluster almost built up and ready to go. During the Q&A session, a researcher who works on compilers for distributed computing systems asked whether Tesla had managed to solve a very hard problem with such systems that is still an active area of research in academia, and the reply from the Tesla counterpart was very wishy-washy: basically no, it's hard, but we think we can solve it. How Dojo eventually shakes out is still a big unknown at this point.

11. While the inner geek in me loved the Dojo stuff, and I'm sure anyone working on computing hardware would absolutely love to see someone spending time and a lot of resources on developing a whole new architecture, I have to say that taking a step back and looking past all the hype, it still isn't clear to me if investing all of this effort into Dojo is really all that beneficial to Tesla. It feels more like someone said it would be cool if we did this, and they just ran with it. At the end of the day, even if they meet all their goals for performance and efficiency, it's not like Dojo is going to let them train fundamentally new types of networks that are impossible to train on any other hardware. The Googles, Facebooks and OpenAIs of the world are doing just fine training extraordinarily complex ML systems on conventional clusters of Nvidia GPUs or Google TPUs. Even Tesla to date has managed to do everything they have achieved so far on such clusters. All Dojo will let them do is train things faster by some amount, and at a lower cost. It isn't a fundamental game-changer in any aspect that opens up fundamentally new directions and opportunities for Tesla. I wonder if the company wouldn't have been better served building out a regular GPU computing cluster and saving all these resources and allocating them elsewhere. Still curious to see how it shakes out, but it still seems like a bit of a gamble to me with very questionable long-term benefit.

All in all, it was a neat presentation and everything looks to be going in a better direction; it certainly doesn't look like Tesla is just stuck spinning its wheels. I'm still very skeptical of them getting anywhere near L5 anytime soon, but I do feel fairly confident that they will have an extremely solid AP, even in city conditions, with the progress they are making. I still wish they had just taken the plunge and augmented with automotive lidar, but I guess that's not going to happen, given the optics of switching up their sensor suite so late in the game and the fact that they have already sold FSD to so many customers with the current camera suite.
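On point 5 above: one common, generic mitigation for the sim-to-real gap (on top of making the renderer as photorealistic as Tesla showed) is simply to cap the fraction of synthetic samples the network sees in each training batch, so it can't specialize on rendered imagery. This is a standard trick in the field, not something Tesla described; the pools, batch size and 25% cap below are arbitrary and for illustration only:

```python
import random

def make_batch(real_pool, sim_pool, batch_size=32, sim_fraction=0.25):
    """Sample a training batch that is at most `sim_fraction` synthetic."""
    n_sim = int(batch_size * sim_fraction)
    n_real = batch_size - n_sim
    batch = random.sample(real_pool, n_real) + random.sample(sim_pool, n_sim)
    random.shuffle(batch)
    return batch

# Stand-ins for real fleet clips and simulator clips.
real_pool = [("real", i) for i in range(1000)]
sim_pool = [("sim", i) for i in range(1000)]

batch = make_batch(real_pool, sim_pool)
print(sum(1 for source, _ in batch if source == "sim"), "synthetic samples of", len(batch))
```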
 
During the Q&A session, a researcher who works on compilers for distributed computing systems asked whether Tesla had managed to solve a very hard problem with such systems that is still an active area of research in academia, and the reply from the Tesla counterpart was very wishy-washy: basically no, it's hard, but we think we can solve it.


Tesla being extraordinarily reductive and thinking they've got a problem solved before they've even really started? There's no precedent for such a thing!
 
11. While the inner geek in me loved the Dojo stuff, and I'm sure anyone working on computing hardware would absolutely love to see someone spending time and a lot of resources on developing a whole new architecture, I have to say that taking a step back and looking past all the hype, it still isn't clear to me if investing all of this effort into Dojo is really all that beneficial to Tesla. It feels more like someone said it would be cool if we did this, and they just ran with it. At the end of the day, even if they meet all their goals for performance and efficiency, it's not like Dojo is going to let them train fundamentally new types of networks that are impossible to train on any other hardware. The Googles, Facebooks and OpenAIs of the world are doing just fine training extraordinarily complex ML systems on conventional clusters of Nvidia GPUs or Google TPUs. Even Tesla to date has managed to do everything they have achieved so far on such clusters. All Dojo will let them do is train things faster by some amount, and at a lower cost. It isn't a fundamental game-changer in any aspect that opens up fundamentally new directions and opportunities for Tesla. I wonder if the company wouldn't have been better served building out a regular GPU computing cluster and saving all these resources and allocating them elsewhere. Still curious to see how it shakes out, but it still seems like a bit of a gamble to me with very questionable long-term benefit.
Characterizing what they presented as 'train things faster by some amount and at a lower cost' is pretty telling about your expertise on what was presented.
 
All Dojo will let them do is train things faster by some amount, and at a lower cost. It isn't a fundamental game-changer in any aspect that opens up fundamentally new directions and opportunities for Tesla.

What?! Training things faster at a lower cost is a big deal. The faster you train NNs, the sooner you can deploy them, and the closer you will be to solving FSD. That's a big advantage. Furthermore, the NNs needed for FSD are extremely complex. You need very powerful computing to train them in a reasonable amount of time. If you can speed that up, you should do that. And if you can do it at a lower cost, that is huge because a business will want to save money, to maximize profit.
 
What?! Training things faster at a lower cost is a big deal. The faster you train NNs, the sooner you can deploy them, and the closer you will be to solving FSD. That's a big advantage. Furthermore, the NNs needed for FSD are extremely complex. You need very powerful computing to train them in a reasonable amount of time. If you can speed that up, you should do that. And if you can do it at a lower cost, that is huge because a business will want to save money, to maximize profit.

Yup, Elon said just as much. The AI Day slide said 4x performance, whatever that means.

But I see orion2001's perspective. From Elon's commentary, Dojo is still in experimental territory. It might end up being vaporware if the software team still clings to the GPU cluster.
 
What?! Training things faster at a lower cost is a big deal. The faster you train NNs, the sooner you can deploy them, and the closer you will be to solving FSD. That's a big advantage. Furthermore, the NNs needed for FSD are extremely complex. You need very powerful computing to train them in a reasonable amount of time. If you can speed that up, you should do that. And if you can do it at a lower cost, that is huge because a business will want to save money, to maximize profit.
Yeah, sure. But they could also have trained faster and deployed quicker TODAY if they had spent all those resources on building out a more powerful and bigger compute cluster. My point isn't that it isn't valuable at all, but that Dojo is still a risky bet on that front. Clearly they still have a long way to go before they are even at a place to benchmark and compare performance, and right now their plot showing how they stack up is purely aspirational from a theoretical standpoint. Also, they compared against TPU v3 in their chart but TPU v4 is already a real thing that is ~ 2.7X faster than TPU v3. My point is that the incumbents aren't stagnating in performance, and there is a far easier path available to Tesla to get faster computing capabilities in-house today to develop and deploy models faster. I want them to succeed with Dojo because more competition in the space is always nice, but for Tesla the car company, I really don't know if it was the prudent decision, given that the current hardware providers for ML compute are continuously improving and have existing products ready to ship today.
 
I'll be re-listening and I'll try to figure it out. It wasn't so much his accent, but geez, that guy talked fast!
Hmmm, I was thinking I was the only one having trouble understanding. After all the presentation and the questions to follow, I was wondering if the event was being held in India. BUT, I guess it is because I am getting to be too hard of hearing and have a different accent of my own. :)
 
Yeah, sure. But they could also have trained faster and deployed quicker TODAY if they had spent all those resources on building out a more powerful and bigger compute cluster. My point isn't that it isn't valuable at all, but that Dojo is still a risky bet on that front. Clearly they still have a long way to go before they are even at a place to benchmark and compare performance, and right now their plot showing how they stack up is purely aspirational from a theoretical standpoint. Also, they compared against TPU v3 in their chart but TPU v4 is already a real thing that is ~ 2.7X faster than TPU v3. My point is that the incumbents aren't stagnating in performance, and there is a far easier path available to Tesla to get faster computing capabilities in-house today to develop and deploy models faster. I want them to succeed with Dojo because more competition in the space is always nice, but for Tesla the car company, I really don't know if it was the prudent decision, given that the current hardware providers for ML compute are continuously improving and have existing products ready to ship today.

Thanks.
 
Also, they compared against TPU v3 in their chart but TPU v4 is already a real thing that is ~ 2.7X faster than TPU v3.

I'm pretty sure Tesla is trying to avoid using third-party cloud services for training (for security or data-management reasons, since their data set is so large), so they're limited to what they can purchase to create their own clusters.

Almost all the best custom training chips I know of are only available through cloud services.
 
6. It's clear that they are starting to hit the limits of being able to fit all the inference for these complex networks onto their current FSD hardware. In the diagram they showed in one of the slides, it's clear that they are now leveraging both compute units to try and squeeze the needed performance (rather than the original vision of having the 2nd unit for redundancy).

It wasn't only in the slides, they said as much in the presentation. However I got the impression that they were only now using a small portion of the second chip. I suspect they still have a ways to go before hitting a limit.

10. I feel like they were a bit disingenuous with their Dojo presentation. There was very little distinction made between what they have built, tested and benchmarked to-date vs. what was aspirational with regards to Dojo.

I thought it was pretty clear. It's still very much in development. That 1 cu ft blade might have been the only one they've made so far. They haven't racked anything yet. Indeed, they may still be writing communications code.

They only just got one working D1 on a benchtop that they managed to train a small GPT model on, but the slides would have you believe they have this huge room-sized cluster almost built up and ready to go.

Nah. They never showed a cabinet, let alone a room. The slides were all block diagrams, obviously aspirational.

During the Q&A session, a researcher who works on compilers for distributed computing systems asked whether Tesla had managed to solve a very hard problem with such systems that is still an active area of research in academia, and the reply from the Tesla counterpart was very wishy-washy: basically no, it's hard, but we think we can solve it. How Dojo eventually shakes out is still a big unknown at this point.

Yes, BUT: that chip has 10 TBps of on-chip bandwidth and 4 x 4 TBps of off-chip bandwidth. That's huge! Much more than other TPUs and the like. ALSO, Tesla created this chip architecture specifically for THEIR workload. So it is optimized for it and should work much better than the alternatives.

It still isn't clear to me if investing all of this effort into Dojo is really all that beneficial to Tesla.

Elon has been quoted as saying that Tesla should be viewed as an amalgamation of startup companies. Not all startups flourish, and not all technologies within Tesla will survive either. But many will, and those that do will propel Tesla upwards really fast. For instance, if you talk to many automotive engineers, they would be aghast at Tesla's humongous single-piece castings and would give you a litany of reasons why it's a bad idea. Tesla doesn't care. They'll try it out, and if it indeed turns out to be a bad idea, they'll quickly discard it, just as they did with their hyper-automation idea during the Model 3 ramp. Or their battery-swap idea.

Maybe Dojo will be a bust, but if it isn't, Tesla now has amazing AI supercomputer technology (and it isn't only in the chip design - it is also in the huge communications bandwidth, the compiler software, the packaging, the cooling, the power electronics, etc.) which can be leveraged for god knows what.

And no, I don't believe that doing Dojo "took away" resources from elsewhere. Chip architects and the like aren't NN engineers. Parallel projects, really.
 
I'm pretty sure Tesla is trying to avoid using third-party cloud services for training (for security or data-management reasons, since their data set is so large), so they're limited to what they can purchase to create their own clusters.
There are customers at Amazon and Google with exabytes of data (millions of terabytes). As for data security, Wall Street companies are using Amazon for their trading data crunching.

Tesla inventing their own [steering] wheel is another sign that they simply don't have enough focus.
 
There are customers at Amazon and Google with exabytes of data (millions of terabytes). As for data security, Wall Street companies are using Amazon for their trading data crunching.

Tesla should be rightfully concerned about depending on third-party competitors' servers to store their trillion-dollar FSD data and resources.

Your comparison doesn't make much sense. Apple storing customer cloud data != Tesla storing training data. Along with your steering wheel mention, I can't take you seriously lol.
 
There are customers at Amazon and Google with exabytes of data (millions of terabytes). As for data security, Wall Street companies are using Amazon for their trading data crunching.

Tesla inventing their own [steering] wheel is another sign that they simply don't have enough focus.

Once you get beyond a certain size, it is much, much cheaper to build your own data center and process in-house. In addition, Tesla processes video; I doubt there is enough Internet bandwidth to push all of that to the cloud.

Finally, who the hell cares? Are you going to criticize their ERP system too? How about the version of Linux they use? I’ll bet their toilet paper supplier is all wrong for them too.
 
It wasn't only in the slides, they said as much in the presentation. However I got the impression that they were only now using a small portion of the second chip. I suspect they still have a ways to go before hitting a limit.
At 1:44:00 into the video they talk about this; they use both FSD chips interchangeably. During the presentation it was reiterated that they do run new models in parallel in shadow mode, so it stands to reason that this could be some of the extra work on the extra FSD chip.
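For anyone unfamiliar with the term: "shadow mode" generally means running a candidate network on the same inputs as the production one, logging where they disagree, and never letting the candidate's output touch the controls. Here's a minimal sketch of that pattern, with placeholder models and an arbitrary disagreement threshold; this is the general idea, not Tesla's actual plumbing:

```python
def run_frame(frame, production_model, candidate_model, disagreements):
    """Drive off the production model; only log the shadow model's output."""
    control = production_model(frame)     # this value actually drives the car
    shadow = candidate_model(frame)       # runs in parallel, e.g. on the 2nd SoC
    if abs(control - shadow) > 0.1:       # arbitrary disagreement threshold
        disagreements.append({"frame": frame, "prod": control, "shadow": shadow})
    return control                        # candidate output is never acted on

# Placeholder "models": steering command as a function of some scalar input.
production = lambda x: 0.0 * x            # always steer straight
candidate = lambda x: 0.2 * x             # a slightly different policy

log = []
for frame in [0.0, 0.3, 1.0]:
    run_frame(frame, production, candidate, log)
print(len(log), "disagreements logged for later upload/triage")
```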