
Tesla Optimus Sub-Prime Robot

How to Train an Optimus

I think I figured out how Tesla will be training the Optimus. Basically, they will start with a very simple system and gradually decrease the amount of human effort. Here is the development, stage by stage:

1. Human motion is captured, simulated, then deployed on the robot. Shown at AI Day 2022.
2. Human performs the motion with the sensei helmet, backpack and gloves, directly controlling the robot, aka teleoperation (like Sanctuary.ai). Shown at Shareholder Day 2023 when the robot was moving small objects.
3. Human performs the motion with the sensei helmet, backpack and gloves, and the sequences are recorded. The AI learns to perform the same movement on the robot, aka "end2end". Shown at Shareholder Day 2023 when the guy celebrates.
4. Robot observes a human performing the task and translates this into robot movement; no backpack needed, so it can be done by the customer. Will be shown at AI Day 2023.
5. Robot hears a voice command and translates it into a text string. An LLM figures out what the task is given the environment, converts it into a sequence of motions, and shows it in simulation; the user can verify that it's the correct task, and then the robot performs it (a rough sketch of this flow follows the list). Will be shown at AI Day 2024.
6. Robot doesn't even need voice commands, it just figures out what it should do. Decides that the dishes need to be cleaned and puts them in the correct place. AI Day 2026!?
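To make stage 5 concrete, here is a toy sketch of how that voice -> LLM -> simulation -> execution loop could be wired together. Every function in it is a made-up stand-in, not anything Tesla has shown; it only illustrates the control flow described above.

```python
# Toy, self-contained sketch of stage 5: voice -> text -> LLM task plan ->
# simulated preview -> user confirmation -> execution. All functions are stand-ins.

def speech_to_text(audio: bytes) -> str:
    # Stand-in for a real speech-recognition model.
    return "put the dishes in the cupboard"

def llm_plan_task(command: str, scene: dict) -> list:
    # Stand-in for an LLM that breaks the command into motion primitives,
    # conditioned on what the robot currently sees.
    return ["walk_to(sink)", "grasp(dish)", "walk_to(cupboard)", "place(dish, shelf)"]

def simulate(plan: list, scene: dict) -> str:
    # Stand-in for rendering the planned motions in simulation for review.
    return " -> ".join(plan)

def execute(primitive: str) -> None:
    print(f"executing {primitive}")

def run_voice_command(audio: bytes, scene: dict) -> None:
    command = speech_to_text(audio)
    plan = llm_plan_task(command, scene)
    preview = simulate(plan, scene)
    if input(f"Preview: {preview}. Run it? [y/n] ") == "y":  # user verifies the task
        for primitive in plan:
            execute(primitive)

run_voice_command(b"", {"objects": ["dish", "sink", "cupboard"]})
```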


Currently Tesla is rapidly iterating on the hardware of the robot: which mechanical joints, motors, batteries, electronics etc. it should have, and which cameras at which angles. The nice thing is that they can keep iterating without having to throw away previous data. All they need is an intermediate step that reconstructs the world. This can easily be done; they already do this with the autolabeler. Basically, generate a "ground truth" environment and then simulate what the cameras should be seeing given a robot's configuration and position. Then, if a human is performing a task, the autolabeler can imagine what a robot would see if it were in the human's position, and how the world would behave given the robot's execution of the task.

So they can start generating a dataset of camera input -> robot motion -> object manipulation. At first it will be very small, built from motion capture with the sensei backpack/helmet/gloves. Then eventually they will grow the dataset with humans interacting with objects as seen from the robot. Heck, they can probably even take data from the car fleet's dataset, see how the world evolves when humans manipulate objects in it, and then translate it into the robot's reference frame.
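As a rough illustration, here is what one sample in such a dataset might look like, together with the core "translate into the robot's reference frame" step: re-expressing a point seen from a human demonstrator's camera in the robot's camera frame. The field names and the 4x4 pose convention are my own assumptions, not anything Tesla has described.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Sample:
    camera_frames: np.ndarray   # (T, H, W, 3): what the (real or imagined) robot sees
    joint_commands: np.ndarray  # (T, num_joints): the motion that was performed
    object_poses: np.ndarray    # (T, num_objects, 4, 4): how the objects moved

def to_robot_frame(point_human_cam: np.ndarray,
                   human_cam_to_world: np.ndarray,
                   robot_cam_to_world: np.ndarray) -> np.ndarray:
    """Map a 3D point from the human demonstrator's camera frame into the robot's."""
    p = np.append(point_human_cam, 1.0)                    # homogeneous coordinates
    p_world = human_cam_to_world @ p                       # human camera -> world
    p_robot = np.linalg.inv(robot_cam_to_world) @ p_world  # world -> robot camera
    return p_robot[:3]

# Example: the same mug position, seen from two different viewpoints.
mug_in_human_cam = np.array([0.2, -0.1, 0.8])
human_pose = np.eye(4)
robot_pose = np.eye(4)
robot_pose[:3, 3] = [0.5, 0.0, 0.0]  # the robot stands half a metre to the side
print(to_robot_frame(mug_in_human_cam, human_pose, robot_pose))
```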

I recommend rewatching the autolabeler and simulation parts of AI Day:

At Shareholder Day 2023 they showed the robots "memorizing" the environment, aka SLAM. I believe this is the first step of creating the ground-truth dataset; with it they can "imagine" what a robot would be seeing and, more importantly, should be perceiving (the output from the vision neural network) at any given position. That is used to train the neural network to accurately output the correct lane lines etc. in the car, and, in the case of the robot, the physical shape and properties of the objects it should interact with. The control network then uses the vision output as its input. Thus they can quickly iterate on the hardware, and have the vision input from old hardware, from the point of view of a human in a different position, etc. translated into training data for the control network.
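The way I picture that split, as a toy sketch (the interfaces below are my guess at the idea, not Tesla's actual architecture): the vision network outputs a hardware-agnostic scene description, and the control network only ever consumes that description, so a camera change only requires retraining the vision stage while the control training data stays valid.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class SceneState:
    # Hardware-agnostic description of "what the robot should be perceiving".
    object_positions: np.ndarray  # (num_objects, 3), in the robot's body frame
    object_classes: list
    occupancy: np.ndarray         # coarse 3D occupancy of the surroundings

def vision_network(images: np.ndarray, camera_calibration: dict) -> SceneState:
    # Retrained whenever the cameras or their placement change. Dummy output here.
    return SceneState(object_positions=np.zeros((1, 3)),
                      object_classes=["mug"],
                      occupancy=np.zeros((8, 8, 8)))

def control_network(state: SceneState, task: str) -> np.ndarray:
    # Consumes only SceneState, so its training pairs (state -> joint commands) remain
    # valid across hardware revisions, human-viewpoint captures and simulation.
    num_joints = 28  # placeholder joint count for the example
    return np.zeros(num_joints)

state = vision_network(np.zeros((2, 480, 640, 3)), {"fov_deg": 90})
commands = control_network(state, "pick up the mug")
print(commands.shape)
```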
 
Time to edit ran out, so I will just add this. I realized that NeRF output today can go straight into Unreal, which Tesla is using for their simulation. So basically robot camera capture -> Unreal is done automatically today. Even normal amateurs can do this pretty easily with modern software such as Luma.ai:


It's getting crazy good. This should be excellent for going from robot/human capture -> simulation of the environment in Unreal. Tesla has probably augmented their Unreal engine with excellent physics, crash-simulation software, etc.
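Roughly, the capture -> reconstruction -> game-engine flow would look something like the sketch below. Every function is a stand-in for whatever tooling is actually used (Luma.ai, other NeRF/photogrammetry software, Unreal's importers); only the data flow is the point.

```python
# Toy pipeline: camera footage -> camera poses -> reconstructed scene -> engine asset.

def extract_frames(video_path: str) -> list:
    return [f"frame_{i}" for i in range(3)]           # stand-in for decoded video frames

def estimate_camera_poses(frames: list) -> list:
    return [f"pose_of_{f}" for f in frames]           # stand-in for structure-from-motion

def reconstruct_scene(frames: list, poses: list) -> dict:
    return {"mesh": "kitchen_mesh", "textures": []}   # stand-in for NeRF/photogrammetry

def export_for_engine(scene: dict, out_path: str) -> None:
    print(f"would write {scene['mesh']} to {out_path}")  # stand-in for an asset export

def capture_to_simulation(video_path: str, out_path: str) -> None:
    frames = extract_frames(video_path)
    poses = estimate_camera_poses(frames)
    scene = reconstruct_scene(frames, poses)
    export_for_engine(scene, out_path)

capture_to_simulation("robot_walkthrough.mp4", "unreal_assets/kitchen")
```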
 
Plus they can fire up virtual robots in their thousands or millions and modify their environment slightly or significantly. In a virtual world, add pets, children playing with balls, multiple robots, obstacles - initially just to navigate around, eventually being aware of what an unpredictable animal can do.
 
Yes. Lots of the massive compute from Dojo will go to training massive offline models that generate very good simulation for the robot to train in. Models that will need to understand the world very accurately. Real-world AI. Small changes in the environment and in how the robot interacts with the environment. Planning motion around dogs, kids, cars, etc. The Tesla car fleet will be a massive advantage in understanding how dogs and kids behave, and as they get more and more robots out there they will get a better understanding of how humans interact with the robots, how humans walk, how humans move their hands, etc.
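A tiny sketch of the "thousands of virtual robots, each with a slightly different world" idea (i.e. domain randomization): generate many scenario configurations that vary the clutter, the agents and the layout, then farm them out to simulation workers. The config fields below are purely illustrative, not from any Tesla material.

```python
import random

def random_scenario(seed: int) -> dict:
    # One randomized virtual world per seed; vary it slightly or significantly.
    rng = random.Random(seed)
    return {
        "room_size_m": (rng.uniform(3, 8), rng.uniform(3, 8)),
        "num_obstacles": rng.randint(0, 10),
        "pets": rng.choice([[], ["dog"], ["cat"], ["dog", "cat"]]),
        "children_with_balls": rng.randint(0, 2),
        "other_robots": rng.randint(0, 3),
        "lighting": rng.choice(["day", "dusk", "artificial"]),
    }

# e.g. spin up one simulated robot per seed (just printing the first few configs here)
for seed in range(3):
    print(random_scenario(seed))
```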
 
Simulation/AI learning is a big advantage compared to others.

A Starship robot near me had difficulty crossing the road and ended up under a car recently. There was some discussion about whether it actually uses the crossing like a human does, reading the green-man signal, or whether it just crosses when there's a gap in traffic. I know they ask people to press the crossing button, as the robots can't do it themselves. A few cross (maybe two) while the others queue up out of the way of pedestrians and wait for another opportunity. Anyway, presumably the car driver didn't see the robot - I think the driver was turning right (the reverse for the US equivalent) onto the road and didn't see the robot crossing, perhaps because the other lane's vehicles obscured their view, or maybe it was an SUV with little forward view. The little orange triangle isn't obvious enough for some. I can't remember if the triangle is lit up. It was low speed, but the robot turned into a wedge/jack underneath. Might have been a nasty repair price.

Apparently a few robots have been hit recently (according to rumour). Sunday mornings are the busiest times, presumably after a few sherberts on Saturday night - bringing hair of the dog, McDonald's, or ingredients for an English breakfast (I'm not sure who else uses them apart from Co-op shops & McD's).

[Attached photo]


Obviously not competition for Tesla's Optimus Subprime, BUT they've been operating & expanding for years, so providing evidence of a business case for even simple robots.
 
I mean, it's not actually doing that of course.... The "OPTIMUS IS LEARNING JUST BY WATCHING HUMANS." bit simply is not so... no training ever happens local to the bot, nor does it happen in real time.... just as none ever happens local to a car or in real time - it doesn't have REMOTELY the compute power for that sort of thing.

What you saw shown was a human with sensor gear performing a task over and over with a bunch of data captured (quite a bit more than just "watching"- note the sensor-filled gloves for example)... and the captured data from it will go back to the giant GPU NN training clusters.... same as the fleet data from the cars does for training FSD.

Some folks seem to think they showed some kind of "Show your individual bot how to do something and that bot, just by watching you, will learn to do that thing" and that is not REMOTELY how any of that works.
 
You are correct, but nonetheless, no custom code was written for that task. Tack on an LLM that can interpret and break tasks down to the level the offline-trained NN has learned, and you've got something very powerful.
 
Some folks seem to think they showed some kind of "Show your individual bot how to do something and that bot, just by watching you, will learn to do that thing" and that is not REMOTELY how any of that works.
Probably not right now. For now they probably have to retrain on the cluster for every new task. But once the robot masters many tasks, it can probably few-shot new tasks.

Few-Shot Learning (FSL) is a Machine Learning framework that enables a pre-trained model to generalize over new categories of data (that the pre-trained model has not seen during training) using only a few labeled samples per class. It falls under the paradigm of meta-learning (meta-learning means learning to learn).
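As a generic illustration of the few-shot idea in that definition (nothing robot-specific): given a frozen, pre-trained embedding, you can handle a brand-new category from just a few labelled examples by comparing against their mean embedding, prototypical-network style. The embedding function below is a stand-in for the big pre-trained model.

```python
import numpy as np

def embed(x: np.ndarray) -> np.ndarray:
    # Stand-in for a frozen, pre-trained encoder.
    return x / (np.linalg.norm(x) + 1e-8)

def few_shot_classify(query: np.ndarray, support: dict) -> str:
    # support: a handful of labelled examples per *new* class the model never trained on.
    prototypes = {label: np.mean([embed(x) for x in examples], axis=0)
                  for label, examples in support.items()}
    q = embed(query)
    return min(prototypes, key=lambda label: np.linalg.norm(q - prototypes[label]))

support = {
    "fold_towel": [np.random.rand(8) for _ in range(3)],  # three demos of a new task
    "stack_cups": [np.random.rand(8) for _ in range(3)],
}
print(few_shot_classify(np.random.rand(8), support))
```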
 
The other thing that was interesting with the Teslabot video was that the bot was mapping out its environment and remembering(?) it. That’s definitely a departure from FSD which doesn’t remember anything from one drive to the next.

Well, it KIND of does, though... fleet cars send map-relevant data back to Tesla, and Tesla pushes that info back out to other cars as soon as they put in a route that touches the relevant locations. We recently had a whole big thread on Twitter from Green about the surprising amount of drive-specific info that gets pushed to cars each time a destination is put in, including fleet-gathered map data. I would expect it's doing the same type of thing here, just without the "base map" info to begin from that they get from TomTom, Google, etc. So if you, say, hired 20 bots to do a thing in a new, unmapped location, they wouldn't ALL need to map the location - one would, the data would upload to a back end, and then be distributed back out to the rest (or, if it's a big area, I suppose you could split the mapping task up among the bots and have the back end stitch together and push the full map).
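A toy sketch of that "one bot maps it, the back end stitches and redistributes" flow. Maps here are just sets of occupied grid cells keyed by location; a real system would obviously be far richer, and all of the names are illustrative.

```python
class MapServer:
    """Back-end that merges partial maps from bots and serves the stitched result."""

    def __init__(self) -> None:
        self.maps: dict = {}  # location -> set of occupied (x, y) grid cells

    def upload(self, location: str, partial_map: set) -> None:
        # Each bot (or each bot covering part of a big site) contributes what it mapped.
        self.maps.setdefault(location, set()).update(partial_map)

    def download(self, location: str) -> set:
        # Any other bot sent to the same location starts from the stitched map.
        return self.maps.get(location, set())

server = MapServer()
server.upload("warehouse_42", {(0, 0), (0, 1), (1, 1)})  # bot A maps the west half
server.upload("warehouse_42", {(5, 5), (5, 6)})          # bot B maps the east half
print(server.download("warehouse_42"))                   # bot C gets the full stitched map
```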
 

(Transcript of the embedded Boston Dynamics Spot promotional video:)

"Spot is the world leader in the emerging mobile robotics market, with more than a thousand robots in over 35 countries. No other robot has been deployed more often to tackle some of the industry's toughest, most dangerous tasks. Spot handles tasks that are difficult or dangerous for people. Spot spends hours and hours each week walking factory floors checking gauges and machinery, exposes itself to high radiation in nuclear facilities, goes offshore, and much more, so people like you don't have to. Every single day our robot is being deployed at job sites all over the world, and it's making a big difference in industries like manufacturing, construction, power and utilities, mining, oil and gas, and even in the classroom, where hopefully we're helping inspire the next generation of young roboticists. But we want Spot to do even more."


So basically they have sold 1,000 units! Yay! Elon is going for billions... Different scopes. And it's basically a camera on legs so far, not manipulating the environment at any large scale. A useful camera, but why not just put a stationary camera at every gauge, or even just have the equipment read digitally and sent over a network? Is that so difficult to do? (Just asking, not claiming that it isn't.)
 

I mean, this is the Waymo vs Tesla FSD argument, isn't it?

One is trying for a general, works-everywhere solution but has zero actually deployed examples so far, and the other has a lot of obstacles to scaling affordably or at any speed, but actually has small numbers of working units in the field.
 