SLAM (simultaneous localisation and mapping) just refers to the broader field. You can say autonomous driving is an application of SLAM, so Tesla and other manufacturers are already doing SLAM. Other applications include robot hoovers/mowers, drones, AR, robotics in general, etc.
SLAM in general uses all kinds of sensors - radar, sonar, lidar, ultrasonics (USS), laser rangefinding, cameras... What Tesla is doing with Vision could be said to come under Visual SLAM (VSLAM), which uses only cameras. Note VSLAM is pursued for cheapness, not because it is in any way better or easier than SLAM with other/multiple sensor types. And even within VSLAM people are playing with stereo cameras and RGB-D cameras (cameras that also sense depth, via time-of-flight or structured-light measurement), where Tesla are not.
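Just to show why the extra modality matters: with a stereo baseline (or ToF), range falls straight out of the geometry, whereas a single camera has to infer it indirectly from motion or learned priors. A minimal sketch with made-up numbers (nothing here is Tesla's actual setup):

```python
# Classic pinhole stereo relation: depth Z = f * B / d
# (focal length in pixels, baseline in metres, disparity in pixels).
def stereo_depth(disparity_px: float, focal_px: float, baseline_m: float) -> float:
    return focal_px * baseline_m / disparity_px

# Hypothetical rig: 1000 px focal length, 12 cm baseline,
# a feature seen 8 px apart between the left and right images.
print(stereo_depth(disparity_px=8.0, focal_px=1000.0, baseline_m=0.12))  # ~15 m
```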
So again it all boils down to: can Tesla really get all this working with only simple cameras? No active sensing that gives range data - lidar, radar, RGB-D, USS, etc. - just trying to work it all out from 2D camera frames. I'll believe it when I see it! It just seems to me they are avoiding one problem (sensor fusion) by giving themselves an even bigger one (autonomous driving with cameras alone)!
Cameras don't have memory either! It's all in the implementation. All this talk of persistence and the occupancy network is not exclusive to Tesla Vision - it can just as easily, and arguably much more precisely, be done with other sensor inputs. So I don't see it as some slam-dunk justification for ditching the USS and going Vision-only, nor anything to be super optimistic about. You could maintain a persistent occupancy map from USS data if you really wanted to - the reality is the basic parking functions never needed it, because they simply put appropriate sensors in the appropriate positions for the task!
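To be concrete about what "persistent occupancy" actually is: it's just a grid you keep updating as readings come in, and the data structure doesn't care what sensor the readings came from. A minimal sketch (all names and tuning numbers are mine, not Tesla's, and the input could just as well be a USS range return as a vision-derived depth):

```python
import math

# Minimal 2D log-odds occupancy grid, updated from range + bearing readings.
CELL = 0.1                   # grid resolution in metres
L_OCC, L_FREE = 0.85, -0.4   # log-odds increments (illustrative tuning values)

grid = {}  # (ix, iy) -> log-odds, persists across updates

def update(x, y, heading, rng, bearing):
    """Mark cells along the beam as free and the endpoint as occupied."""
    ang = heading + bearing
    for i in range(int(rng / CELL)):
        d = i * CELL
        cell = (int((x + d * math.cos(ang)) / CELL),
                int((y + d * math.sin(ang)) / CELL))
        grid[cell] = grid.get(cell, 0.0) + L_FREE
    hit = (int((x + rng * math.cos(ang)) / CELL),
           int((y + rng * math.sin(ang)) / CELL))
    grid[hit] = grid.get(hit, 0.0) + L_OCC

def occupied(cell, threshold=0.5):
    return grid.get(cell, 0.0) > threshold

# e.g. a parking-sensor-style reading: obstacle 0.6 m away, 10 deg off the nose
update(x=0.0, y=0.0, heading=0.0, rng=0.6, bearing=math.radians(10))
```

The point being: the memory lives in the map, not in the sensor, so persistence is an argument about software, not about which sensors you bolt to the bumper.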