The entire premise rests on what is probably an incorrect assumption: that Tesla has hundreds of engineers sitting around watching thousands of video clips all day. In all likelihood the clips go into software that analyzes, sorts, prioritizes, and aggregates most of them (or maybe they just go into a black hole). I'd bet that at most a small percentage (probably <0.5%) are actually deemed important or unique enough to be analyzed by a human.
That means there is no place for voice descriptions in the pipeline.