In Level 2 the system assists the driver with the OEDR (object and event detection and response). The driver is always driving in L2, even though the system assists with OEDR and steering.
In both Level 3 and Level 4 the system performs the full OEDR, so the driver does not have to watch the road anymore. She can instead watch a movie or answer some emails.
The main difference between Level 4 and Level 3 is that L3 needs a fallback-ready driver. This means that the system can ask you to take over at any time, and while you reorient to the OEDR (stop looking at the movie and start looking at the road), a process that takes 10-15 seconds, the system keeps driving at the same high reliability. An L3 system therefore needs to be designed so that it knows in advance when it will not be able to handle an upcoming road situation or weather condition. An L2 system will just beep, fail and drive off the road (like FSD does).
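To put the 10-15 second takeover window into perspective, here is a rough back-of-the-envelope sketch of how far the car travels while the driver reorients. The two speeds are illustrative assumptions (a traffic-jam speed and a highway speed), and the function name is made up for this example.

```python
def takeover_distance_m(speed_kmh: float, takeover_s: float) -> float:
    """Distance covered (in metres) while the driver reorients to the road."""
    return speed_kmh / 3.6 * takeover_s  # km/h -> m/s, then times seconds


# Illustrative speeds only; the 10-15 s window is the figure quoted above.
for speed_kmh in (60, 130):
    low = takeover_distance_m(speed_kmh, 10)
    high = takeover_distance_m(speed_kmh, 15)
    print(f"{speed_kmh} km/h: {low:.0f}-{high:.0f} m")
# 60 km/h: 167-250 m
# 130 km/h: 361-542 m
```

At highway speed that is several hundred metres the system has to handle on its own after it has already asked for help, which is why it has to see the problem coming.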
In L4 the system will never need you to take over, and you can be asleep in the back seat.
L4 robotaxis are typically bounded to a geographic area as part of the operational design domain (ODD), so that the service provider can validate functionality and reliability.
For L3 the ODD tends to be narrow, for example highway-only on dry roads up to 60 mph. The MB Drive Pilot traffic-jam chauffeur, for instance, is limited to dry roads and 40 mph. There has been a standard for highway autonomy (UNECE R157) for a few years now, which allows systems up to 130 km/h.
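One way to picture an ODD is as a set of conditions that must all hold at once. The sketch below is purely illustrative: the class and field names are hypothetical, not taken from any real system, and the values are the highway-only, dry-road, 60 mph example mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Odd:
    """Hypothetical ODD description; field names are illustrative only."""
    road_types: frozenset
    max_speed_mph: float
    dry_roads_only: bool


@dataclass(frozen=True)
class Conditions:
    """Hypothetical snapshot of the current driving situation."""
    road_type: str
    speed_mph: float
    road_is_dry: bool


def inside_odd(odd: Odd, now: Conditions) -> bool:
    """Return True only if every ODD constraint is satisfied."""
    if now.road_type not in odd.road_types:
        return False
    if now.speed_mph > odd.max_speed_mph:
        return False
    if odd.dry_roads_only and not now.road_is_dry:
        return False
    return True


# Example ODD from the text: highway-only, dry roads, up to 60 mph.
highway_odd = Odd(frozenset({"highway"}), 60.0, True)

print(inside_odd(highway_odd, Conditions("highway", 55.0, True)))   # True
print(inside_odd(highway_odd, Conditions("highway", 55.0, False)))  # False
print(inside_odd(highway_odd, Conditions("urban", 25.0, True)))     # False
```

This only checks the current moment. A real L3 system would also have to predict that it is about to leave the ODD early enough to give the driver the takeover time.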
To answer your question, the thing that's missing is reliability. Tesla does about 10-15 miles per disengagement right now, and you probably need 30,000-50,000 miles or more between failures before you can let people stop watching the road.
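To get a feel for the size of that gap, here is a quick sketch of the arithmetic. It uses only the rough figures quoted above and, as a simplifying assumption, treats a disengagement as a proxy for a failure.

```python
# Rough figures from the text, in miles per disengagement / failure.
current_miles = (10, 15)
target_miles = (30_000, 50_000)

# Best case: upper end of today's range vs. lower end of the target.
smallest_gap = target_miles[0] / current_miles[1]
# Worst case: lower end of today's range vs. upper end of the target.
largest_gap = target_miles[1] / current_miles[0]

print(f"Required improvement: {smallest_gap:.0f}x to {largest_gap:.0f}x")
# Required improvement: 2000x to 5000x
```

So on these numbers the system would need to get three to four orders of magnitude more reliable.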
It doesn't seem very likely to me that any camera-only system, including Tesla's (FSDb on HW3 or HW4), will evolve into an autonomous L3+ system with a meaningful ODD anytime this decade, unless there are a few major breakthroughs in computer vision.