Note that the primary factor creating the fork is the two vastly different versions of the hardware: HW3 is 10x-20x faster than HW2 and can load much larger networks as well.
This is a physical fact that cannot be designed away, and splitting out functionality that can only run on the newer hardware is standard industry practice.
For shared functionality, it's possible to keep a largely unified "codebase" by downscaling the HW3 networks to fit into HW2's computing constraints, which is a mostly automatic step.
The obvious explanation of why Tesla tried to max out HW2 for so long is that it wanted to:
- implement EAP and basic Autopilot functionality on a HW2 basis,
- leverage the combined HW2+HW3 fleet to train HW3-destined networks for as long as possible,
- wait for the HW3 fleet to become almost as large as the HW2 fleet,
- wait for a comparatively "quiet" January-March period when North American and European service centers have excess capacity to perform the mass retrofits.