1) Different jurisdictions or nations will likely require testing to be done within their own road systems. They may even require separate NN fitting and hand coding. So if 400M to 800M miles are required per jurisdiction, you could easily need on the order of 6B miles across some 10 or so different jurisdictions. I doubt that Musk would construe this as one giant test in which Tesla must demonstrate zero fatalities over the course of 6B miles. Indeed, as your own calculation has shown, at 0.12 deaths per 100M miles (implying 7.2 expected fatalities), Tesla would have less than a 0.1% chance of going 6B miles with 0 fatalities. In practical terms, this is a nigh-impossible test, doomed to failure. Maybe it would help to follow the implications of setting 7.2 fatalities as your proposed null hypothesis. The test statistic, the actual number of fatalities over 6B miles, is Poisson distributed with mean 7.2 under the null hypothesis, and hence has standard deviation sqrt(7.2) = 2.68. With probability of about 95%, the test statistic will fall between 3 and 13 fatalities. So I think the test you are really trying to set up rejects this null hypothesis if there are 2 or fewer deaths, or 14 or more. In that case, we are talking about a two-sided test, and it is not really clear why Tesla would ever need to show that its fatality rate is so much lower than 0.12 per 100M miles that it could reject the 0.12 rate. If Tesla merely wants to demonstrate it is safer than a human driver, it would be better just to estimate the rate and provide a confidence interval. The hypothesis testing framework is not really the best framing for that public communications objective. Rather, hypothesis testing is relevant when seeking approval from a governing body; moreover, that body is likely to state what rate to test against. That is, the regulator sets the null hypothesis, while Tesla needs to supply sufficient data to clear that regulatory hurdle.
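Those Poisson numbers are easy to check. A quick sketch in plain Python, using exact Poisson probabilities rather than a normal approximation:

```python
import math

def poisson_pmf(k, lam):
    """Exact Poisson probability P(X = k) with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def poisson_cdf(k, lam):
    """Exact Poisson probability P(X <= k) with mean lam."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))

lam = 0.12e-8 * 6e9  # 0.12 fatalities per 100M miles over 6B miles -> 7.2 expected

print(f"expected fatalities    = {lam:.1f}")
print(f"standard deviation     = {math.sqrt(lam):.2f}")
print(f"P(0 fatalities)        = {poisson_pmf(0, lam):.5f}")  # well under 0.1%
print(f"P(X <= 2), lower tail  = {poisson_cdf(2, lam):.4f}")
print(f"P(X >= 14), upper tail = {1 - poisson_cdf(13, lam):.4f}")
print(f"P(3 <= X <= 13)        = {poisson_cdf(13, lam) - poisson_cdf(2, lam):.4f}")
```

The central interval [3, 13] does indeed capture a bit over 95% of the probability mass under the null, and the chance of 0 fatalities comes out below 0.1%.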
2) The weighting and stratification I was writing about was in reference to testing for regulatory approval. Collecting data for training FSD is actually a much more complex task to do well. Certainly weighting and stratification can play a role in training a model, but that was not my point. Indeed, for training you want to be sure that you have broad data covering the full scope of driving conditions, but you also want to oversample certain data where critical and rare events happen. For example, you will likely want to oversample collisions, especially collisions involving injuries and fatalities. Tesla is even using simulations of critical collisions to augment the data and revisit the scenario under varying conditions. Basically, Tesla wants to learn as much as possible from each critical event so that FSD will never make the same mistakes again in such scenarios. These are the sorts of considerations that go into curating training data.
Regulators will likely be interested in how Tesla curates training data and may analyze how representative the coverage is, the quality of the data, and many other issues. This is data review, but it is not road testing. For road testing, the regulator will likely want data on where and when the miles were driven, plus much more detailed information on critical events: collisions, injuries, fatalities, etc. They will analyze the data for representativeness and consider any weighting methodology deployed. The test data may well be segmented. For example, Tesla may need a certain number of beta testers from each state. State or city segmentation helps assure representativeness. Time of day, day of the year, and weather conditions are other factors that may call for weighting or segmentation. So the regulator will need to be persuaded that the test exposure miles are sufficiently representative and have adequate coverage. They will also want to analyze the critical event data with special attention to any factors that may reveal a weakness or flaw in the driving system.
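As one illustration of the kind of weighting methodology a regulator might review, here is a minimal post-stratification sketch. The regions, mile counts, event counts, and target exposure shares are all made up for illustration; the point is only the mechanics of reweighting the observed event rate to a target exposure mix:

```python
# Hypothetical post-stratification: reweight test miles by region so the
# estimated event rate reflects a target exposure mix rather than wherever
# the beta testers happened to drive. All numbers are invented.
observed = {            # region -> (test miles driven, critical events observed)
    "urban":    (150e6, 3),
    "suburban": (200e6, 2),
    "rural":    (50e6,  2),
}
target_share = {"urban": 0.30, "suburban": 0.45, "rural": 0.25}  # desired mix

total_miles = sum(miles for miles, _ in observed.values())

# Per-mile event rate within each stratum
rate = {region: events / miles for region, (miles, events) in observed.items()}

# Naive rate: pooled events over pooled miles (reflects the sample mix)
naive_rate = sum(events for _, events in observed.values()) / total_miles

# Weighted rate: each stratum's rate weighted by the target mix
weighted_rate = sum(target_share[r] * rate[r] for r in observed)

print(f"naive rate:    {naive_rate * 1e8:.2f} per 100M miles")
print(f"weighted rate: {weighted_rate * 1e8:.2f} per 100M miles")
```

With these made-up numbers the rural stratum is undersampled relative to its target share, so the weighted estimate comes out higher than the naive pooled rate.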
But after all that work, the regulators are confronted with the final counts of each type of critical event. The regulators will want the data to show that the true frequencies of certain outcomes are far enough below a critical threshold that random error can be ruled out. This is where hypothesis testing comes in. The regulator may say to Tesla: you need to demonstrate that your fatality rate is below 1.2 per 100M miles. That rate of 1.2 is the null hypothesis, which Tesla must reject. And Tesla might believe that its system is actually functioning at or below 0.12 per 100M miles, so 0.12 is the alternative hypothesis they want to optimize around. The regulator doesn't care what the alternative hypothesis is. But for Tesla it gives them the basis for planning how many test miles of exposure they will want to accumulate before they present their results to the regulators. The power of the test is extremely important to Tesla, as they want a high probability of being able to submit enough data to pass the test, assuming their alternative hypothesis holds. But it is the regulator who cares about the significance of the test, as they want a low likelihood of being fooled by statistical error. Type I error is the error the regulator wants to avoid, while Type II error is the error the regulated entity wants to avoid.
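That planning exercise can be sketched directly. With the regulator's null at 1.2 fatalities per 100M miles and Tesla's alternative at 0.12, one can search for the least exposure giving, say, 90% power at a 5% significance level. The 50M-mile search step and the 0.9/0.05 targets are my own illustrative choices, not anything from the discussion above:

```python
import math

def poisson_cdf(k, lam):
    """Exact Poisson P(X <= k) with mean lam, computed iteratively."""
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return total

def critical_value(lam0, alpha=0.05):
    """Largest count c with P(X <= c | null mean lam0) <= alpha.
    Observing c or fewer fatalities rejects the null at level alpha.
    Returns -1 if even 0 fatalities would not be significant."""
    c = -1
    while poisson_cdf(c + 1, lam0) <= alpha:
        c += 1
    return c

NULL_RATE = 1.2e-8   # regulator's null: 1.2 fatalities per 100M miles
ALT_RATE = 0.12e-8   # Tesla's alternative: 0.12 per 100M miles

# Search in 50M-mile steps for the least exposure with power >= 0.9
miles = 50e6
while True:
    c = critical_value(NULL_RATE * miles)
    power = poisson_cdf(c, ALT_RATE * miles) if c >= 0 else 0.0
    if power >= 0.9:
        break
    miles += 50e6

print(f"required exposure: {miles/1e6:.0f}M miles, "
      f"reject null if <= {c} fatalities, power = {power:.3f}")
```

Under these assumptions the search stops at 400M miles (reject if at most 1 fatality is observed, power about 0.92), which lands at the low end of the 400M to 800M per-jurisdiction range mentioned in point 1.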
Now suppose that Tesla makes significant advances in training its FSD neural nets, enough to convince them that their fatality rate is at or below 0.06 per 100M miles. In this situation, Tesla could choose 0.06 as their alternative hypothesis. What happens here is that, for the same amount of test exposure, the power of the test goes up: they have a higher chance of passing the regulator's test. Or, put another way, they could proceed with less data and still have sufficient power. The implication is that Tesla's choice of alternative hypothesis drives how much exposure data they need. This is why I put the emphasis on Tesla engineering a better FSD system. The better it truly is, the less exposure data will be required to reject the null hypothesis posed by the regulator. This means Tesla can get through regulatory testing faster and cheaper. So how does Tesla engineer a safer FSD system? Primarily, it must do a damn careful job of curating training data. Any lack of vital experience in the training data exposes Tesla to incremental risk in beta testing (and full public release).
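To illustrate the power gain: hold exposure fixed at, say, 400M miles (an illustrative figure) and compare the two alternative hypotheses against the same 1.2-per-100M null. The rejection rule depends only on the null, so it is identical in both cases; only the probability of landing inside the rejection region changes:

```python
import math

def poisson_cdf(k, lam):
    """Exact Poisson P(X <= k) with mean lam."""
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return total

MILES = 400e6            # illustrative fixed test exposure
lam0 = 1.2e-8 * MILES    # null mean: 4.8 expected fatalities

# Critical value at alpha = 0.05: largest c with P(X <= c | lam0) <= 0.05.
# (At this lam0, even 0 observed fatalities is significant, so starting at 0 is safe.)
c = 0
while poisson_cdf(c + 1, lam0) <= 0.05:
    c += 1

power_012 = poisson_cdf(c, 0.12e-8 * MILES)  # alternative: 0.12 per 100M miles
power_006 = poisson_cdf(c, 0.06e-8 * MILES)  # improved alternative: 0.06 per 100M

print(f"reject null if <= {c} fatalities in {MILES/1e6:.0f}M miles")
print(f"power at 0.12/100M = {power_012:.3f}")
print(f"power at 0.06/100M = {power_006:.3f}")
```

With the truly safer system the power rises from roughly 0.92 to roughly 0.98 on the same exposure, which is exactly the "same data, higher chance of passing" effect described above.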
One other issue: suppose the regulators will be testing multiple outcomes. Say they require demonstrations that the fatality rate is significantly below 1.2 per 100M miles, that bicycle collisions are significantly below 10 cases per 100M miles, and that pedestrian collisions are significantly below 6 cases per 100M miles. Now the power calculations become much more difficult. You need enough miles of exposure to have a very good chance of passing all three tests simultaneously. So this multi-test situation could push Tesla to do more test miles than the fatality test alone would call for. This also could help explain question 1. Some jurisdictions might require multiple outcomes to be tested, and that could drive up the required sample size. Indeed, it looked like the NHTSA was going on a fishing expedition, just looking for any outcome that might be higher than average. If a regulator aggressively pursues finding any fault whatsoever, they will likely succeed, and no amount of data would have an adequate chance of passing all the tests. But this is veering off into a hostile political situation, not good statistical or regulatory practice. At any rate, the point here is that, if regulators will be testing many endpoints, that can drive up the amount of exposure miles needed for regulatory approval. But again, even if that is what the regulators are demanding, the best strategy for Tesla is simply to work on improving FSD along every conceivable test dimension, and to do that Tesla will need to curate substantial training data on every conceivable misstep a driver could make.
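A rough sketch of why multiple endpoints hurt: treat the three counts as independent Poisson outcomes, so the joint probability of passing is the product of the individual powers. The null rates are the three thresholds above; the assumed "true" rates for bicycle and pedestrian collisions are invented for illustration (only the nulls were stated), and the 400M-mile exposure is likewise illustrative:

```python
import math

def poisson_cdf(k, lam):
    """Exact Poisson P(X <= k) with mean lam."""
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return total

def critical_value(lam0, alpha=0.05):
    """Largest count c with P(X <= c | lam0) <= alpha; -1 if none."""
    c = -1
    while poisson_cdf(c + 1, lam0) <= alpha:
        c += 1
    return c

MILES = 400e6  # illustrative exposure
# endpoint -> (null rate, assumed true rate), both per 100M miles;
# the bicycle and pedestrian true rates are made up for this sketch
endpoints = {
    "fatalities": (1.2, 0.12),
    "bicycle":    (10.0, 7.0),
    "pedestrian": (6.0, 4.0),
}

powers = {}
joint_power = 1.0
for name, (null_rate, true_rate) in endpoints.items():
    c = critical_value(null_rate * 1e-8 * MILES)
    p = poisson_cdf(c, true_rate * 1e-8 * MILES) if c >= 0 else 0.0
    powers[name] = p
    joint_power *= p  # treating the three counts as independent (a simplification)
    print(f"{name}: reject null if <= {c} events, power = {p:.3f}")

print(f"joint power to pass all three tests: {joint_power:.3f}")
```

With these assumed rates, the fatality test alone has power above 0.9 at this exposure, but the joint power to pass all three tests falls well below one half, so the multi-endpoint requirement is what forces the extra miles.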
Just to get through regulatory approval, to pass all the stated and unstated tests, Tesla may well need FSD to be 10 times better than human drivers.