Here's how I'd explain the common misconception:
The levels apply solely to the software design / feature (a.k.a. the "driving automation system" or "ADS").
The levels have no performance criteria for executing the dynamic driving task (DDT) on a sustained basis. They also don't define a minimum duration for "sustained"; they only say that automatic emergency braking systems don't count as "sustained."
Putting it together: the levels simply describe how the roles of the driver and the ADS are designed within the software feature. If the feature is designed to not require a fallback driver, then it can be level 4 or 5 (as long as it can perform the DDT on a sustained basis, and again, no minimum time is defined for "sustained"). It doesn't matter how good or bad the software is at performing the DDT. The levels' purpose is solely to describe ADS autonomy, not performance. A stupid analogy would be animal species: a newborn baby is a human, and a 90-year-old is also a human, but we know they're very different in their capabilities, experience, etc.
As for the developer, I've said this before, but what the developer markets or says publicly about their ADS doesn't matter at all. If a developer says their ADS is level 5 but the software requires a fallback driver, then it's level 3 at most. It doesn't matter whether they're testing it or not. Again, if the software requires a fallback driver, the feature can't be level 4/5, no matter what! The levels apply solely to the ADS / software, not to how the developer markets or describes their system.
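If it helps, here's the same argument as a rough Python sketch. This is only my own illustration of the reasoning above, not anything taken from the standard: the `Feature` class, its field names, and `max_possible_level` are all made up for the example. The point is that the marketing claim and the performance number are never read when working out the highest level a feature could be.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Feature:
    """Hypothetical description of a driving automation feature (illustrative only)."""
    requires_fallback_driver: bool  # design intent: must a human be ready to take over?
    sustained_ddt: bool             # does it perform the DDT on a sustained basis?
    marketed_level: int             # what the developer claims publicly
    miles_per_intervention: float   # stand-in for "how good the software is"


def max_possible_level(feature: Feature) -> Optional[int]:
    """Return the highest level the feature could be, based on design alone.

    Assumption for this sketch: features that are not "sustained"
    (e.g. automatic emergency braking) are left unclassified and return None.
    """
    if not feature.sustained_ddt:
        return None
    # marketed_level and miles_per_intervention are deliberately ignored.
    if feature.requires_fallback_driver:
        return 3
    return 5  # i.e. it can be level 4 or 5


# Marketed as level 5 and performs well, but still needs a fallback driver.
overhyped = Feature(True, True, marketed_level=5, miles_per_intervention=100_000.0)
# Performs badly, but is designed to run with no fallback driver.
clumsy = Feature(False, True, marketed_level=2, miles_per_intervention=0.5)

print(max_possible_level(overhyped))  # 3
print(max_possible_level(clumsy))     # 5
```

Note that the function only gives an upper bound. Telling level 4 from level 5, or level 1/2 from level 3, depends on other design details that this sketch doesn't model.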