
TSLA Market Action: 2018 Investor Roundtable

“Tesla expects the factory to produce its first cars in three years, according to an earnings release in August”

CNBC conveniently forgets about the Oct 2nd Tesla delivery update, which said that Shanghai factory plans have been accelerated.

Also, even according to the original plans they wanted to scale the Shanghai Gigafactory up to 250K units as early as 2020. I.e., even the original plan was closer to 2 years than 3. This is what Elon said on the August 2 conference call:

Elon Musk: "I think so, yeah. If it's not a million, it's going to be pretty close. I'd say if it's not a million it'd probably be 750,000 or something like that in 2020. So, we're aiming for a million, 2020, but somewhere between half million and a million seems pretty likely."

Tim Higgins: "Where do you get the capacity to do that?"

Elon Musk: "There's this place called Shanghai."

Tim Higgins: "Okay. Shanghai will be important for that, that goal?"

Elon Musk: "Yeah."​

Then they've announced plans to further accelerate that...

BTW, another tidbit from the Q2 conference call:

Elon Musk: "Yeah, Model Y is sort of a whole separate thing but it's definitely one of the elements that convinced us that we can scale up quickly and at low CapEx in Shanghai, where we do an improved version of GA4. And then, we're also figuring out how to make the paint shop a lot simpler and general assembly a lot simpler."​

GA4 is the "Sprung Tent" assembly line, which they constructed out of existing unused stations in only a couple of weeks.

Everything depends on the cell manufacturing aspect, I suspect: maybe they'll initially be shipping cells from Nevada to Shanghai? Since cells are pretty dense, container shipping should be a fairly efficient logistics route until they get their 2170 lines up and running in Shanghai.

The Shanghai Gigafactory could start making Model 3's a lot faster than people expect...
 
Indeed, and this significantly lowers the $600-$700 price I estimated for HW2.5: GP106 chips should be much less expensive than GP102 chips.

An equivalent GPU would be the "NVIDIA GeForce GTX 1060 Max-Q" discrete laptop GPU with an 80W TDP, which has a similar boost clock of 1,480 MHz:


The price of the board should be similar to discrete GP106-based NVIDIA cards, such as the "GeForce GTX 1060 6 GB":


except that the Tesla board has 8 GB of GPU RAM.

The GTX 1060 board retails for around $200-$300, and I suspect Tesla gets the GP106 for much less than $200 in bulk quantities, maybe even below $100. So the Tesla AI chip's direct per-unit cost savings should be less than $100, not the $400-$600 I estimated previously.

This is also more in line with what @oneday noted:




Regarding GPU clock speed:



I'm pretty certain that when Autopilot is active (i.e. when the car is being driven) the chips typically just clock up to the maximum frequency. It's all liquid-cooled, so there should be no thermal throttling.

My guess is that low-power mode matters mostly when the car is not driving: you'd still want vehicle control software running to react to certain sensor inputs (such as temperature sensors to keep the BMS running, the security system, or cabin overheat protection), but full Autopilot processing of the video+sensor feeds is not required. In this scenario the discrete Pascal GP106 chip is turned off entirely and the two Parker SoCs are in low-power mode. (Maybe even the integrated GPU is off in this case and only some of the ARM cores are running.)

Regarding memory bus speed: inference on already-trained, static neural nets with no back-propagation involves exceedingly simple calculations that combine the weights with the input values, and the number of weight values in their neural nets far exceeds the limited hardware cache sizes of Pascal chips - so I suspect their NN throughput is primarily memory-bus limited. If the memory performance of the GP106 and the Parker chips differs significantly, that would have a direct effect on the NN processing performance of the integrated GPUs.
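
To put rough numbers on this, here's the simple bandwidth-bound arithmetic (all figures below are illustrative assumptions on my part, not known Tesla specs):

```python
# Back-of-the-envelope: if the weights don't fit in cache, every forward
# pass must stream them from RAM, so inference rate <= bandwidth / size.
# All numbers are illustrative assumptions, not known Tesla specs.

WEIGHT_BYTES = 1e9        # assume ~1 GB of neural net weight data
BANDWIDTHS = {
    "GP106 (GTX 1060-class GDDR5)": 192e9,  # ~192 GB/s
    "Parker (LPDDR4)":               50e9,  # ~50 GB/s
}

for chip, bw in BANDWIDTHS.items():
    print(f"{chip}: at most ~{bw / WEIGHT_BYTES:.0f} forward passes/sec")
```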

An interesting question is whether Tesla is going to replace the Parker SoCs as well, or only the GP106 discrete GPU chip. The safest iterative step would be to replace only the discrete GPU and keep the Parker SoCs, which would leave much of the ARM v8 based vehicle control platform unmodified. Making their own chip is a complex enough step already; they'd want to reduce the HW3 migration risks as much as possible.

But, these are just guesses and wild speculation, and I've been wrong a number of times in this short discussion already.

On recent Nvidia architectures the top speed, even without power or temperature limitations, is often highly variable from part to part, by as much as 5%, due to variations in silicon quality (tested on multiple identical GPUs by Gamers Nexus; I couldn't find a dedicated article/video about it, but I've seen them mention it several times recently in various benchmarking videos).

So I imagine Tesla would leave some headroom in their performance expectations to allow for this silicon quality variability, and just peg the clock speed to the worst-case top speed. The actual performance might be just a tad below the touted top speeds, but even a 5% deficit in clock speed is likely hidden by other inefficiencies (it's really hard to keep a GPU fed with data to crunch).

As for HW3, I would agree that most likely they just rip out the GP106 (and associated bits like memory and such) and replace it with their custom ASIC (and its corresponding associated bits), assuming it has a PCIe interface and not something more proprietary. It might be more efficient overall if they ditched Parker entirely for a more purpose-tuned ARM SoC (as Parker has a bunch of things they don't need, or might want to have more of), but changing out just the attached PCIe device (GP106 for the Tesla NN chip) is far more straightforward and lower friction.
 
Yeah.

So the reason I suggested a shared-RAM design is that I think there's a chance the Tesla AI NN chip has a really radical design: DRAM integrated onto the NN CPU die itself. This is a relatively modern technique that Intel (Haswell and later) and IBM (POWER chips) are using:

https://www.eetimes.com/author.asp?section_id=36&doc_id=1323410

[Image: TechInsights die photo of Intel's embedded DRAM]

(Having all the weights in SRAM doesn't seem possible currently: the simplest SRAM cell design, at 6 transistors per bit, would require about ~48 billion transistors for 1 GB of weight data, which would result in too-large dies - and indications are that they are using at least that much weight data.)
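
For reference, the arithmetic behind that ~48 billion figure:

```python
# Standard 6T SRAM cell: 6 transistors per stored bit.
weight_bytes = 1e9                  # 1 GB of weight data (decimal GB)
transistors = weight_bytes * 8 * 6  # 8 bits/byte * 6 transistors/bit
print(f"~{transistors / 1e9:.0f} billion transistors")  # ~48 billion
```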

The Tesla NN chip might have gone one step further and basically integrated the NN forward-calculation functional units into the DRAM cells themselves. One possible design would be an NN input/output SRAM area in the 10-30 MB size range, with the functional units propagating those values through the neural net almost like real neurons.

Such a design would have numerous advantages:
  • Heat dissipation properties would be very good, as all the functional units would be distributed across the die evenly in a very homogeneous layout.
  • Execution time would be very deterministic as there's effectively no caching required.
  • Lack of caching also frees up a lot of die area to put the eDRAM cells on.
  • This design would also allow very small gate count mini-float functional units and very high inherent parallelism.
  • Scaling it up to higher frequencies would also be easier, due to the lower inherent complexity and the lower critical path length.
  • All of this makes it very power efficient as well, i.e. a very high NN throughput for a given die size, gate count and power envelope.
In such a design external RAM modules have a secondary role: they are basically just for initializing the internal "neurons" (multiplier and saturated-add functional unit) and "axons" (weight value) with the static neural net, and to store the output results.

Other designs are possible too - such as self-contained, all-in-one 'neuron' functional units programmable to perform a given loop of weight calculations with no external communication other than input fetches from other functional units, eDRAM cell fetches and output stores (i.e. intermediate state would not be stored anywhere external to the functional unit; it's all in small local registers within the functional unit itself, with no bus access to them whatsoever) - but the basic idea is to keep the NN weight data on-die.
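
To make the dataflow concrete, here's a toy functional model of the idea - purely my illustration of the speculation, not Tesla's actual architecture: the weights stay resident in per-layer on-die arrays, and only a small activation buffer moves.

```python
import numpy as np

# Toy model: weights live in fixed on-die arrays ("axons") and never move;
# only a small input/output activation buffer (the hypothetical 10-30 MB
# SRAM area) is read and written as values propagate layer to layer.
rng = np.random.default_rng(0)
resident_weights = [rng.standard_normal((256, 256)).astype(np.float16)
                    for _ in range(4)]  # stand-ins for eDRAM-resident layers

def forward(activations):
    for w in resident_weights:
        # "neuron" = multiplier + saturated-add functional unit
        activations = np.clip(activations @ w, -64.0, 64.0)  # saturated add
        activations = np.maximum(activations, 0.0)           # nonlinearity
    return activations

print(forward(rng.standard_normal(256).astype(np.float16)).shape)  # (256,)
```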

If that's the NN chip design Tesla invented then I'd expect the NN chips on multi-chip boards to share any external RAM, as it's not a performance bottleneck anymore.

But maybe I'm missing some complication that makes such a design impractical - for example the latency of eDRAM cell fetches would be a critical property.

eSRAM/eDRAM isn't *that* cutting edge anymore; almost nobody uses it because of the massive increase in chip cost from the die area required (and the corresponding hit to chip yields). The Xbox 360 had 10MB of eDRAM to improve GPU performance (basically enough for a frame buffer or two), essentially borrowing from IBM's POWER developments (the CPU was a custom PowerPC variant), and the Xbox One originally had 32MB of eSRAM, though the Xbox One X dropped it in favor of more GPU power and upgraded memory bandwidth. Intel uses eDRAM for improved GPU performance because they can afford to and their GPU performance is garbage otherwise (integrated GPUs are always starving for bandwidth, but Intel GPUs have earned a reputation for being particularly slow), and on larger Xeon server chips they have very large caches. But larger chips are more likely to have defects, which reduces yields and drives up costs.

Interesting to note: while I've never seen latency specs for the Xbox One's eSRAM (so its latency might have been better than the faster One X SoC, which has no eSRAM), the Xbox One X's main memory bandwidth is higher than even the eSRAM of the Xbox One SoC, never mind its main memory: 326 GB/s for the One X (main memory) vs. 68.3 GB/s (main memory) + 102 GB/s (eSRAM) for the One.

eSRAM/eDRAM is still potentially useful where latency is important (and NN inference seems like it might be), but for general raw performance its time has come and gone. For raw performance, it's cheaper to just keep adding memory channels and use external memory chips than to embed enough memory to make a difference. For latency of course it can be night and day, but the more you add, the higher the latency, and at some point you're better off going with external memory.

If the NN processing time per phase is sufficiently deterministic and the memory needs not too large, you might be able to do something clever: have enough memory for two processing phases, and as one phase runs, load the next one into the other half - rinse and repeat - so that you never stall on memory access but also don't need to fit everything into on-chip memory at the same time. It also helps a lot if the individual NN units (whatever they're called) only need to access a small amount of memory; if they potentially need to access the entire data set, it gets much harder to optimize. But if their needs at any given moment are small enough, lots of smaller caches directly adjacent to them (versus one large cache block) can have extremely tight latencies and bandwidth.
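
A minimal sketch of that ping-pong scheme (layer sizes and the fetch function are hypothetical placeholders): compute out of one buffer while the other is being filled with the next phase's weights.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def fetch_weights(layer):
    """Stand-in for a DMA transfer from external RAM into on-chip memory."""
    rng = np.random.default_rng(layer)
    return rng.standard_normal((256, 256)).astype(np.float32)

def run_network(x, num_layers=6):
    # Double buffering: compute on the current weights while the "DMA
    # engine" (here, a worker thread) fetches the next layer's, then swap.
    # We never stall unless a fetch takes longer than a compute phase.
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(fetch_weights, 0)  # prefetch layer 0
        for layer in range(num_layers):
            weights = pending.result()          # wait for this layer's weights
            if layer + 1 < num_layers:
                pending = dma.submit(fetch_weights, layer + 1)  # prefetch next
            x = np.maximum(x @ weights, 0.0)    # compute overlaps the fetch
    return x

print(run_network(np.ones(256, dtype=np.float32)).shape)  # (256,)
```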

I think the idea of having part of the NN hardware in the memory itself may be a bit too esoteric, though. RAM is already expensive enough without having to design your own custom flavor.
 
On recent Nvidia architectures the top speed, even without power or temperature limitations, is often highly variable from part to part, by as much as 5%, due to variations in silicon quality (tested on multiple identical GPUs by Gamers Nexus; I couldn't find a dedicated article/video about it, but I've seen them mention it several times recently in various benchmarking videos).

Does performance differ at the exact same frequencies, or is it the top frequency that is variable between the parts?

NVIDIA advertises these boards with a fixed "boost clock" max frequency, so I have trouble seeing how some parts could miss that frequency without being returned as defective. I must be missing something...

Also, even many of the high-end cards tend to be air-cooled, while Tesla's chips are liquid-cooled, which is a lot more efficient. There are some water-cooled variants available, but they're not typical.
 
For latency of course it can be night and day, but the more you add, the higher the latency, and at some point you're better off going with external memory.

Yes, so let me rephrase the concept differently: an NN chip with highly simplified functional units (it could even include full loop units, avoiding register files entirely and working straight from/to RAM cells!) is essentially a high-performance memory chip with NN functionality embedded.

As you said yourself:

"integrated GPUs are always starving for bandwidth"

The same is true of many NN workloads, which are much simpler and much more specialized than what 3D shaders generally require. Much of the Pascal GPU's transistor count is probably wasted when used for NN computations: I bet that even with high-performance RAM chips nearby it's currently RAM bandwidth (and latency) limited, fetching the NN weight data all the time - weight data that has no chance of fitting into the cache.

And I don't think that concept of eliminating caches and embedding NN neuron functional units in the RAM modules themselves has ever been tried before - and Tesla might just have done it. :D

Die size and yield should not be a problem: such a layout would probably be very homogeneous and easy to scale to any given process size and target workload they are optimizing it for.

In fact integration might offer advantages in this area:
  • They could size the performance of the functional units to the bandwidth and latency of the RAM cells perfectly, so that it's very well balanced and makes optimal use of the transistor and power budget (see the sketch after this list).
  • This balancing would be largely an invariant and could be maintained when the layout is scaled to smaller fab process sizes and/or higher frequencies.
  • With separate CPU and RAM chips there's always an imbalance - one of them is always under-utilized and there's not always a degree of freedom to match RAM frequencies and bandwidth to the CPU, without changing the design of the CPU (such as memory bus width and pin layout).
(But then again, this is just a thought experiment, I could be missing serious drawbacks or practical complications.)
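
As a sketch of that first balancing point (every number here is an assumption for illustration): with weights resident next to the functional units, you simply provision one multiply-add per weight byte that the local RAM bank can deliver per cycle.

```python
# Balancing compute to local memory delivery - all numbers are assumptions.
BANK_BYTES_PER_CYCLE = 32    # assumed local eDRAM bank read width
BYTES_PER_WEIGHT     = 1     # assumed 8-bit "mini-float" weights
NUM_BANKS            = 1024  # assumed banks tiled across the die
CLOCK_HZ             = 1e9   # assumed 1 GHz

units_per_bank = BANK_BYTES_PER_CYCLE // BYTES_PER_WEIGHT  # 32 MACs/bank/cycle
macs_per_sec = units_per_bank * NUM_BANKS * CLOCK_HZ
print(f"~{macs_per_sec / 1e12:.0f} trillion MACs/sec, fully memory-matched")
```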
 
Suffered a pulmonary embolism on the highway. Used Autopilot to drive to the nearest hospital.

Tesla's Autopilot takes the wheel as driver suffers pulmonary embolism | ZDNet

Naturally the article has an extremely negative slant. And the comments are of course closed.

The odds of him getting in a crash with AP (on the highway at least) were low, and had he become completely unresponsive, the car would have stopped. He was clearly still able to handle surface streets, given that he parked. The odds of him dying of his embolism if he had had to wait too long for an ambulance were high. He made entirely the right choice. I would have done precisely the same thing in his situation.
 
Bankers are reportedly approaching Tesla as debt payments loom

I'm loving the quiet confidence that Tesla is exhibiting lately. What could Tesla possibly have up their sleeve that lets them decline future loans from banks? No analyst took Elon seriously when he stated that he didn't need or want to raise capital, and they also underestimated Tesla's ability to borrow from China... what can Tesla possibly do to pay off debt, besides borrowing? Could it possibly be profits? Hmmmm...

This unsolicited DIP financing offer, when Tesla has repeatedly stated that it has no interest and is now on the verge of turning a profit, always struck me as a naive Hail Mary at best and deliberate market manipulation via FUD at worst - and much more likely the latter than the former. It's akin to setting up a billboard across the street from the Fremont plant advertising corporate bankruptcy lawyers.

"Oh no, we're not trying to manipulate the market - we're just advertising services! What do you mean 'do you have any short positions in TSLA'? My, I can't see how that would be relevant..."
 
Yes, so let me rephrase the concept differently: an NN chip with highly simplified functional units (it could even include full loop units, avoiding register files entirely and working straight from/to RAM cells!) is essentially a high-performance memory chip with NN functionality embedded. [...]

Ugh, okay, can you please move this to another thread? I enjoy random off-topic as much as the next person (and Fact Checking's posts in general), but this NN chip design stuff has been taking up half of the thread for the past dozen pages.
 
But maybe I'm missing some complication that makes such a design impractical - for example the latency of eDRAM cell fetches would be a critical property.
Yeah, having read your analysis, your understanding of chip design seems very superficial. Maybe you should talk to an expert before daring to post a draft like that over here. /s :p:D
 
Of course they *forget* it, but they need some kind of negative damper on what is incredibly bullish news.

In any case, we all know that Tesla can just throw up a few tents, stick in some Grohmann machinery, throw together GA from some old bits of junk they found in an alley, and be spitting out cars by Valentine's Day 2019 :D
Please stop encouraging Elon. He may read this, and I can already see Tesla's engineering leads sweating bullets. :D
 
[Screenshot: ir.tesla.com "stay tuned" notice]

Having checked ir.tesla.com daily for weeks now, I find it amusing that they're using "stay tuned"... like we were all listening to an AM radio broadcast or something.

I'm hoping they were waiting to announce the Q3 ER conference because they wanted to disclose information about the China factory? Could it be that now that the land buy is official, we'll finally get the conference call scheduled?
 
Fact Checking,
I love your postings! Thanks for your input.
I sometimes have doubts about how Elon's doing stuff (with all the Twitter/media stuff blown up), but you make it sound rational/logical. It's very hard these days to separate 'what is known/released by Tesla' from 'what everybody else is shouting'.
 
I'm hoping they were waiting to announce the Q3 ER conference because they wanted to disclose information about the China factory? Could it be that now that the land buy is official, we'll finally get the conference call scheduled?

Since according to their schedule the earnings report would be due around October 31 or November 1-2, and the China deadline was apparently today, I wonder why they'd have worried about it in any fashion. Also, had anything gone wrong with the Shanghai land purchase, they couldn't have delayed the ER for that, as it likely wouldn't have been resolved by ER time anyway.

But yes, I find it weird too that they have not scheduled an ER date yet. They usually post it a few days after the delivery report (that deadline is long past), and the typical announcement schedule at other firms is roughly two weeks before the earnings report - which would have been sometime this week if they're keeping to their original schedule.

So I too am wondering what's going on. :D
 
The timing of the Shanghai announcement is also great in terms of building on the last few days of positive coverage and SP action. Hopefully this leads into a great ER with FCF+ and a small profit overall, plus reaffirming Q4 profit and ramp.

In terms of production capacity, equally important would be confirming GF1 as the production site for the Semi and Model Y, and at least leaking some progress on GF4 (Europe). Hopefully the ER will do that, or we can get details on the call.
 
Weeklies: Yesterday I sold 277s for nice gains and shifted the majority of proceeds to Aug19s. I had grabbed the 277s when we were in the 250s, so yeah, that went really well.

But with a hefty slice of the returns I piled on 300s and 320s. This week, lol.

So it would be super cool to get a nice 3%+ run-up to start the day. Hoping that China's land acquisition, and funds shifting slightly to tech after NFLX earnings set the path, give us early morning strength.

Also snuck in some Amazon Jan19s when it was stupid low...

This dip has been awesome.
 
Breaking: Oct 17, Tesla (Shanghai) Co., Ltd. successfully acquired 864885 square meters (a total of 1297.32 acres) of industrial land in Q01-05, Shanghai Lingang Equipment Industrial Zone, and officially [signed] with Shanghai Planning and Land Resources Administration. $TSLA #TeslaChina

vincent on Twitter


864,885 square meters is about 214 acres. The conversion listed (1,297.32 "acres") would be about two square miles; it matches the area in mu, the Chinese land unit, mislabeled as acres...
(It's numbers, with commas, thus market action.)
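
The conversions, spelled out:

```python
AREA_M2     = 864_885          # figure from the tweet
M2_PER_ACRE = 4_046.86
M2_PER_MU   = 10_000 / 15      # 1 mu (Chinese land unit) = ~666.67 m^2
M2_PER_SQMI = 2_589_988

print(f"{AREA_M2 / M2_PER_ACRE:,.0f} acres")        # ~214 acres
print(f"{AREA_M2 / M2_PER_MU:,.2f} mu")             # ~1,297.33 - the tweet's "acres"
print(f"{AREA_M2 / M2_PER_SQMI:.2f} square miles")  # ~0.33, not ~2
```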
 