Voxels are discrete: they form a grid of locations of a consistent, fixed size. They are thus similar to the pixels in an image, and many of the same algorithms used to handle 2D images can be used to process them. Point clouds instead have many more degrees of freedom, which makes them much harder to classify.
You already have a bunch of cameras with different alignments, so you've got to solve that problem anyway.
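To make the voxel vs. point cloud difference concrete, here is a minimal sketch of snapping an unordered point cloud onto a fixed-size occupancy grid. The function name `voxelize` and the parameters `voxel_size`, `origin` and `grid_shape` are my own choices for illustration, not anything from Tesla's stack.

```python
import numpy as np

def voxelize(points, voxel_size, origin, grid_shape):
    """Convert an (N, 3) array of points into a boolean occupancy grid."""
    points = np.asarray(points, dtype=float)
    # Integer voxel index of each point along x, y and z.
    idx = np.floor((points - np.asarray(origin, dtype=float)) / voxel_size).astype(int)
    # Discard points that fall outside the grid bounds.
    inside = np.all((idx >= 0) & (idx < np.asarray(grid_shape)), axis=1)
    idx = idx[inside]
    grid = np.zeros(grid_shape, dtype=bool)
    # Mark every voxel that contains at least one point.
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid

# Hypothetical sample data: two nearby points, one farther away, one out of range.
points = np.array([
    [0.10, 0.20, 0.30],
    [0.15, 0.25, 0.35],
    [1.60, 0.20, 0.30],
    [9.00, 9.00, 9.00],
])
grid = voxelize(points, voxel_size=0.5, origin=(0, 0, 0), grid_shape=(4, 4, 4))
print(grid.shape, int(grid.sum()))
```

The result is a dense 3D array, so things like 3D convolutions work on it the same way 2D convolutions work on an image. The raw points have no fixed order or spacing, which is what makes them harder to feed to standard image-style algorithms.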
I think the moving objects are filtered out by the way they train the neural net. They talked about moving through the world and finding the point cloud that is consistent across all frames. That would naturally filter out moving objects.
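As a rough illustration of why cross-frame consistency drops moving objects, here is a simple voting heuristic over occupancy grids. This is only a sketch, not the neural-net training described in the talk: it assumes the frames are already aligned to a common world frame, and `static_voxels` and `min_fraction` are hypothetical names.

```python
import numpy as np

def static_voxels(frames, min_fraction=0.8):
    """Keep voxels that are occupied in at least min_fraction of the frames."""
    # Stack to shape (T, X, Y, Z) and average occupancy over time.
    stack = np.stack(frames).astype(float)
    return stack.mean(axis=0) >= min_fraction

# Hypothetical data: a wall at x=0 in every frame, plus an object that moves
# one voxel along x per frame.
frames = []
for t in range(5):
    frame = np.zeros((6, 1, 1), dtype=bool)
    frame[0, 0, 0] = True        # static wall
    frame[t + 1, 0, 0] = True    # moving object
    frames.append(frame)

static = static_voxels(frames)
print(static[:, 0, 0])
```

The wall is occupied in every frame and survives, while each voxel the moving object passes through is occupied in only one frame out of five and gets dropped.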
What is the advantage of voxels over point clouds?
Beyond the pixel plane: sensing and learning in 3D
The Main Benefits and Disadvantages of Voxel Modeling
Again, I encourage you to play around with the tool; the differences between voxels and point clouds are fairly apparent:
Tesla Data Renderer