At a very high level-
Previous system- Each camera ran its own single-frame NN to act as visual perception for the car. Radar was a primary sensor for forward speed and distance of objects. Nothing persisted over time.
Current beta FSD system- All camera inputs are fed into a series of cascading NNs that essentially build a BEV (bird's-eye-view) 4D perception of the world (think of it as 360 surround video that persists over time). Radar is not used at all. Speed and distance are determined from the video inputs, which are used to build a point cloud. (There's a rough before/after sketch just below.)
There were steps in between where, for example, some BEV views existed but were still computed frame by frame, and some radar inputs were still being used as well. But that's the general "How it started and how it's going" of the design philosophy.
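To make the contrast concrete, here's a rough, hedged sketch of the two designs in PyTorch-style code. This is not Tesla's actual architecture or code; the module names, shapes, the GRU memory, and the camera-to-BEV projection are all placeholders I made up just to show the shape of the idea: independent single-frame nets per camera vs. one fused top-down representation with state carried across frames.

```python
# Hedged sketch (not Tesla's actual code) contrasting the two designs.
# Assumes 8 cameras, tiny CNN backbones, and a GRU as the temporal memory;
# every shape and name here is illustrative only.
import torch
import torch.nn as nn

NUM_CAMS = 8

class PerCameraSingleFrame(nn.Module):
    """Old style: each camera gets its own NN, one frame at a time, no memory."""
    def __init__(self):
        super().__init__()
        self.backbones = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(3, 16, 3, 2, 1), nn.ReLU(),
                           nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                           nn.Linear(16, 10))        # per-camera detections
             for _ in range(NUM_CAMS)])

    def forward(self, frames):                       # frames: (NUM_CAMS, 3, H, W)
        # Each camera is processed independently; nothing is fused or remembered.
        return [net(img.unsqueeze(0)) for net, img in zip(self.backbones, frames)]

class FusedBEVTemporal(nn.Module):
    """New style: all cameras fused into one BEV feature, carried over time."""
    def __init__(self, bev_dim=64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, 16, 3, 2, 1), nn.ReLU(),
                                      nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.to_bev = nn.Linear(NUM_CAMS * 16, bev_dim)  # stand-in for the cam->BEV transform
        self.memory = nn.GRUCell(bev_dim, bev_dim)       # temporal persistence
        self.head = nn.Linear(bev_dim, 10)               # e.g. speed/distance outputs

    def forward(self, frames, state):                # frames: (NUM_CAMS, 3, H, W)
        feats = torch.cat([self.backbone(img.unsqueeze(0)) for img in frames], dim=1)
        bev = self.to_bev(feats)                     # one shared top-down representation
        state = self.memory(bev, state)              # the world model persists across frames
        return self.head(state), state

# Usage: the fused model keeps a hidden state across consecutive video frames,
# so quantities like speed and distance can be estimated from vision alone.
model = FusedBEVTemporal()
state = torch.zeros(1, 64)
for _ in range(3):                                   # three frames of 8-camera video
    frames = torch.randn(NUM_CAMS, 3, 64, 96)
    out, state = model(frames, state)
```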
This also required rewriting a lot of the training and labeling code. The upside is that once the tooling understands 360 view and time, you only have to manually label something in frame 1, and the new code can auto-label that object for the rest of the clip as long as it remains in view of any camera (and can even predict it reappearing if it moves behind something briefly- they gave examples of this at AI Day).
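The auto-labeling idea is easier to see in toy code. This is a minimal, made-up sketch of the general technique (nearest-neighbor track association with a small occlusion budget), not Tesla's actual tooling; the names, distance threshold, and MAX_MISSED value are all assumptions, just to show how one manual label in frame 1 can be carried through the rest of a clip and briefly "predicted" while the object is hidden.

```python
# Hedged sketch of the auto-labeling idea (not Tesla's tooling):
# a human labels an object once, and a simple tracker copies that label
# forward through the clip, coasting through brief occlusions.
from dataclasses import dataclass

@dataclass
class Track:
    label: str          # the one human-provided label, e.g. "parked car"
    last_pos: tuple     # last known (x, y) in some shared top-down frame
    missed: int = 0     # frames since any camera has seen it

MAX_MISSED = 5          # how long we keep predicting the object while occluded

def auto_label(clip_detections, human_label, start_pos):
    """clip_detections: list over frames; each frame is a list of (x, y)
    detections from any camera, already mapped into a shared top-down frame."""
    track = Track(label=human_label, last_pos=start_pos)
    labels_per_frame = []
    for detections in clip_detections:
        # Associate the track with the nearest detection, if one is close enough.
        match = min(detections,
                    key=lambda d: (d[0] - track.last_pos[0])**2 + (d[1] - track.last_pos[1])**2,
                    default=None)
        if match and (match[0] - track.last_pos[0])**2 + (match[1] - track.last_pos[1])**2 < 4.0:
            track.last_pos, track.missed = match, 0
            labels_per_frame.append((track.label, match))           # auto-labeled
        elif track.missed < MAX_MISSED:
            track.missed += 1
            labels_per_frame.append((track.label, track.last_pos))  # predicted while hidden
        else:
            labels_per_frame.append(None)                           # out of view too long
    return labels_per_frame

# Usage: label frame 0 by hand, let the tracker label the rest of the clip.
frames = [[(0.0, 10.0)], [(0.2, 9.5)], [], [(0.5, 8.6)]]  # frame 2: briefly occluded
print(auto_label(frames, "parked car", start_pos=(0.0, 10.0)))
```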
Here's Karpathy discussing the overall transition, with examples in practice, from mid-2020: