Well, I'm not Elon, I don't work for Tesla, and I have no special inside knowledge of how their automated driving systems work, but I am a software engineer. So I'm going to speculate about what is going on; please take what I have to say with a grain of salt.
From what I know, the smarts in Tesla's systems are based on neural networks, with some degree of image processing to complement them. I've worked with neural networks in the past for defect detection in factories, and if properly sized, designed, and trained to handle enough variance, they can do a very good job of recognizing what they have been taught. But that can require a lot of data to capture all the subtle variances, depending on what information is being fed into the neural network for training. The more inputs you have, and the more ways those inputs can vary, the more data you will need to properly condition and train the network, and when it comes to vision-based driving systems with multiple cameras and sensors, that can mean petabytes or even exabytes of data. That is probably one of the reasons why all Teslas have the FSD sensors, cameras, and cellular radios: to send real-world driving data back to the mother ship for training purposes. Every newly captured sequence will be slightly different and will help fill in the voids and better train the neural networks, making them more robust. This is a huge advantage Tesla has over the other manufacturers, one that will take the competition years to catch up on. There is only so much variation you can generate and capture from a limited set of vehicles driving around the simulated cities some manufacturers use for their AI and FSD development.
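To make the "more inputs, more variance, more data" point concrete, here is a back-of-envelope sketch of how quickly the space of driving scenes multiplies. The factors and their counts below are entirely invented for illustration; real perception inputs vary continuously and far more richly.

```python
# Toy illustration: if each independent scene factor takes one of K
# discrete values, the number of combinations to cover in training
# grows multiplicatively. All numbers here are made up.
factors = {
    "lighting": 5,    # e.g. dawn, midday, dusk, night, glare
    "weather": 4,     # clear, rain, fog, snow
    "road_type": 6,   # highway, city, rural, ...
    "traffic": 5,     # empty through gridlock
    "signage": 8,     # various signs and markings
}

combinations = 1
for name, levels in factors.items():
    combinations *= levels

print(combinations)  # 4800 distinct scene combinations to cover
```

Even this crude model, with only five coarse factors, already demands thousands of distinct examples, which is why a large fleet capturing real variance beats a handful of test vehicles in a simulated city.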
Sounds great, right? With all the data Tesla already has, it should just work? Well, the problem with neural networks is that they can be unpredictable when you give them something they have not seen in training. So there are decisions to make and outcomes to bias for the cases where the network is not 100 percent certain one way or the other. How the software handles these scenarios, and whether it should err on the side of caution (and brake) or assume it is nothing (and keep on driving), is most likely what this phantom braking is all about. What the car is sensing and seeing is not close enough to anything in its training data, so how it will handle the situation is unknown, and I can only assume the software is biased toward braking in those scenarios.
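A minimal sketch of what "bias toward braking" could look like in code, assuming the classifier outputs per-class probabilities. The threshold value, class names, and the `decide_action` function are all my own invention, not anything Tesla has published.

```python
# Hypothetical sketch: bias an uncertain classifier toward the safe action.
# BRAKE_BIAS_THRESHOLD and the class names are invented for illustration.
BRAKE_BIAS_THRESHOLD = 0.90  # below this confidence, err on the side of caution

def decide_action(class_probs: dict[str, float]) -> str:
    """Pick an action from softmax-style class probabilities."""
    best_class = max(class_probs, key=class_probs.get)
    confidence = class_probs[best_class]
    if confidence < BRAKE_BIAS_THRESHOLD:
        # The scene doesn't closely match anything seen in training,
        # so default to the cautious choice: slow down.
        return "brake"
    return "brake" if best_class == "obstacle" else "continue"

# A confident "clear road" keeps driving; an ambiguous scene brakes.
print(decide_action({"clear_road": 0.97, "obstacle": 0.03}))  # continue
print(decide_action({"clear_road": 0.55, "obstacle": 0.45}))  # brake
```

Under this kind of policy, any scene the network finds ambiguous triggers a brake, which is exactly the behavior drivers would experience as phantom braking.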
The good news is, the more Teslas out there sending back this data, the more robust and complete the training set should become over time, and future versions of the software should better handle these previously unseen cases. There are also tricks you can play, like pre-processing the data before sending it to the neural network to reduce the number of metrics and variations to train against, and hopefully get better coverage and more predictable outcomes, as long as you do not throw away too much in that reduction. And there is the possibility of having the car learn while you drive which events are real braking events and which are okay to ignore, much like the way we press the accelerator today to confirm continuing at a green light. But what Tesla is actually doing is probably a closely guarded secret and we may never know the exact details, so it is a waiting game for the next software version and the hope that it fixes some of these issues and gets better and better...
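As a toy example of that kind of pre-processing, here is one common trick: downsample and normalize a frame before it reaches the network, so there are fewer inputs and less irrelevant variation (like overall brightness) to train against. The frame sizes and the 4x4 pooling factor are arbitrary choices for illustration, not anything specific to Tesla's pipeline.

```python
import numpy as np

def preprocess(frame: np.ndarray, pool: int = 4) -> np.ndarray:
    """Downsample by average-pooling, then normalize to zero mean / unit scale."""
    h, w = frame.shape
    # Crop so dimensions divide evenly, then average each pool x pool block.
    frame = frame[: h - h % pool, : w - w % pool]
    pooled = frame.reshape(h // pool, pool, w // pool, pool).mean(axis=(1, 3))
    # Normalizing removes brightness variation (day vs night, shadows)
    # that would otherwise need extra training data to cover.
    return (pooled - pooled.mean()) / (pooled.std() + 1e-8)

# Synthetic 480x640 grayscale frame standing in for a camera image.
raw = np.random.default_rng(0).uniform(0, 255, size=(480, 640))
small = preprocess(raw)
print(small.shape)  # (120, 160): 16x fewer inputs to train against
```

The trade-off mentioned above shows up directly here: pooling too aggressively (a large `pool` value) would throw away the fine detail needed to tell, say, an overpass shadow from an actual obstacle.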
I do appreciate you taking the time to write down your thoughts, and I agree for the most part. However, there are some objects the neural net should be pretty familiar with by now. For example, overpasses. It seems very random in how it reacts for many people. In my case, it hardly ever slows for them anymore (which I assume is a result of recent updates), but for others, it still brakes. These are the kinds of questions I would love to get answers to. And like you mentioned, it could just be a simple case of the neural net not being properly trained and erring on the side of caution, but why would it take so long to train it to react appropriately to an object like an overpass? It would seem like a simple enough object to recognize.