Can someone explain to me why Elon talks compute in terms of Energy?
Fairly straightforward. Nobody sane would put a Cray in an automobile or in a cell phone, even if it gave the Best Performance in the World for whatever compute one needed.
Going along those lines, the power a device of a given performance (measured in floating-point operations per second, say) draws goes down as the line widths in the device shrink. The typical power draw of a digital gate goes (very roughly) as P = VCC^2 * frequency * C, where VCC is the supply voltage for the gate, "frequency" is how often per second the gate changes state, and "C" is the effective switched capacitance, which depends hugely upon the area taken up by the transistors involved. That area is, again very roughly, the square of the line width (i.e., just how small one can make a transistor), given the usual capacitor equation of C = epsilon*A/d, where epsilon = epsilon_0 * epsilon_r: epsilon_0 is the permittivity of free space (8.854*10^-12 F/m) and epsilon_r is the relative permittivity of the material in question. "d" is the distance between the base substrate and the gate of the transistor, essentially the gate-oxide thickness. The overall capacitance includes the capacitance of any traces on the die as well.
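To make that concrete, here's a rough Python sketch of those two formulas. Every number in it (gate area, oxide thickness, supply voltage, toggle rate) is an illustrative assumption, not a figure from any real process:

```python
# Rough, illustrative numbers only; not measured values for any real process.
EPS0 = 8.854e-12  # permittivity of free space, F/m

def gate_capacitance(area_m2, d_m, eps_r=3.9):
    """C = epsilon * A / d, with epsilon = EPS0 * eps_r (3.9 is roughly silicon dioxide)."""
    return EPS0 * eps_r * area_m2 / d_m

def dynamic_power(vcc, freq_hz, cap_f):
    """Dynamic switching power of one gate: P = C * VCC^2 * frequency."""
    return cap_f * vcc ** 2 * freq_hz

# Hypothetical gate: 100 nm x 100 nm of gate area, 2 nm oxide, 1.0 V supply, toggling at 1 GHz.
c_gate = gate_capacitance(area_m2=100e-9 * 100e-9, d_m=2e-9)
p_gate = dynamic_power(vcc=1.0, freq_hz=1e9, cap_f=c_gate)

print(f"C per gate : {c_gate:.2e} F")
print(f"P per gate : {p_gate:.2e} W")
print(f"P for 1e9 such gates, all toggling every cycle: {p_gate * 1e9:.0f} W")
```

Run it and you get a fraction of a femtofarad and a fraction of a microwatt per gate, which looks harmless until you multiply by a billion gates.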
But this "power" equation is the general reason that processors keep getting more and more powerful in terms of compute as the line widths and VCCs of semiconductors shrink. The 8080 of the mid-1970s had comparatively enormous features, with line widths in the micrometers; the CPUs of the current day are built on single-digit-nanometer process nodes. Essentially, improvements in photolithography drive the shrinkage, including going from visible light to deep UV to extreme UV to reduce the wavelength of the light, giving better accuracy, and so on. It's fun being a silicon processing engineer.
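Plugging ballpark numbers for two generations into that same formula shows why the shrink matters so much. Again, these are assumed round numbers, not specs for any actual part:

```python
# Same P = C * VCC^2 * frequency idea, comparing two generations at the SAME clock.
# C is taken to scale with line width squared; all numbers are ballpark assumptions.

def relative_power(vcc, line_width_m, freq_hz):
    return (line_width_m ** 2) * vcc ** 2 * freq_hz  # arbitrary units

old = relative_power(vcc=5.0, line_width_m=6e-6, freq_hz=2e6)   # 1970s-era part: ~6 um lines, ~5 V
new = relative_power(vcc=1.0, line_width_m=7e-9, freq_hz=2e6)   # modern node: ~7 nm lines, ~1 V

print(f"Per-gate power, old vs. new, at the same frequency: {old / new:,.0f}x")
```

At the same clock, the per-gate power drops by a factor in the millions, which is the headroom that lets a modern chip run billions of gates at gigahertz clocks instead of thousands of gates at megahertz.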
Having said the above, algorithms have a lot to do with power consumption as well. CPUs in search of reasonable power consumption (think cell phones) will actually shut down huge chunks of the hardware (i.e., no clock) in order to save power; sometimes this is done truly dynamically during processing. How far one wants to go down this road depends upon how inventive one gets.
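Here's a toy model of that idea in Python, just enough to show how gating the clock on an idle block changes the energy bill. The per-cycle energy numbers are made up purely for illustration:

```python
# Toy model of clock gating: only "clock" a block while it has work to do.
ENERGY_ACTIVE_CYCLE = 1.00   # block clocked, bits flipping (arbitrary units)
ENERGY_GATED_CYCLE  = 0.05   # clock stopped, only a leakage-ish residue

def total_energy(workload, gate_when_idle):
    energy = 0.0
    for has_work in workload:
        if has_work or not gate_when_idle:
            energy += ENERGY_ACTIVE_CYCLE
        else:
            energy += ENERGY_GATED_CYCLE
    return energy

# A bursty workload: the block is busy only 20% of the cycles.
workload = ([True] * 2 + [False] * 8) * 100

print("always clocked:", total_energy(workload, gate_when_idle=False))
print("clock gated   :", total_energy(workload, gate_when_idle=True))
```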
As I understand it, the driving computer in a Tesla (the one with all that NN hardware) draws a hundred-plus watts of power or so when it's running at full tilt. And this goes to another interesting point: for what it does, a dedicated bunch of NN hardware is more efficient than, say, a standard processor doing image recognition or what-all. Fewer bits are changing state, really, and that reduces power consumption. Doing what those NN chips do without actually using NN chips could likely be done - but it would take more hardware and more power to do it.
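A back-of-the-envelope comparison makes the point. Every throughput and efficiency number below is an assumption picked for illustration, not a measurement of any Tesla (or any other) hardware:

```python
# Back-of-the-envelope: energy cost of the same image-recognition workload on
# dedicated NN hardware vs. a general-purpose processor. All numbers assumed.

OPS_PER_FRAME       = 50e9   # hypothetical network: 50 billion ops per camera frame
FRAMES_PER_SECOND   = 30

ACCEL_OPS_PER_JOULE = 2e12   # assumed: dedicated MAC arrays, ~2 TOPS per watt
CPU_OPS_PER_JOULE   = 2e10   # assumed: general-purpose cores, ~20 GOPS per watt

accel_watts = OPS_PER_FRAME * FRAMES_PER_SECOND / ACCEL_OPS_PER_JOULE
cpu_watts   = OPS_PER_FRAME * FRAMES_PER_SECOND / CPU_OPS_PER_JOULE

print(f"dedicated NN hardware: ~{accel_watts:.1f} W")
print(f"general-purpose CPU  : ~{cpu_watts:.0f} W")
```

Same math, wildly different power budget - which is the whole argument for dedicated NN silicon.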
So, when Tesla is evaluating various solutions to getting Work Done (ha - and "Power", in this case, lines right up with the physics definition: work per unit time!), they're doing a balancing act between improved line widths, improved hardware algorithms, improved software algorithms, segmenting problems in different ways, turning stuff on and off (which costs some performance, since turning something on or off takes time, but may be worth the reduction in power consumption), and stuff that I haven't thought of off the top of my head.
Assuming that the Task At Hand (driving a car) is vaguely fixed in terms of compute needs, then how much hardware it takes and how efficient it all is at what it does becomes vitally important - and that efficiency comes down, pretty much, to how much power the whole business uses. For somebody at Elon's level, that's a pretty handy measuring stick. And it wouldn't be just at Elon's level: it's a pretty good way for anyone to figure out how well they're doing at the job.
Fun stuff.