Without reading the post at all, and going only by the picture: I just wanted to say that non-stacked (branching) convolution layers are pretty standard in the world of CNNs nowadays. It all started with the first Inception network from Google, which really showed that you don't have to just stack conv layers, pooling, and dropout on top of each other, but that you can get clever with it. Between that and using smaller conv filters, there is nothing mind-blowing about it today. It's pretty standard.
Here is Inception v1 from 2014
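For anyone who prefers code to a diagram, here is a rough sketch of what "non-stacked" means in an Inception-style block: several branches look at the same input in parallel and their outputs are concatenated along the channel axis. It is plain Python shape bookkeeping, not a real network. The helper names (`conv_out`, `inception_module`) are made up for this example, and the branch widths in the usage line are the ones I recall for the "inception (3a)" block, so treat them as illustrative.

```python
def conv_out(h, w, k, stride=1, pad=0):
    """Spatial output size of a k x k conv or pool (standard formula)."""
    return ((h + 2 * pad - k) // stride + 1,
            (w + 2 * pad - k) // stride + 1)


def inception_module(in_shape, n1x1, n3x3_reduce, n3x3,
                     n5x5_reduce, n5x5, pool_proj):
    """Return the (channels, height, width) produced by one module.

    Hypothetical helper: it only tracks shapes, no weights or math.
    The *_reduce widths are the 1x1 bottlenecks in front of the 3x3 and
    5x5 convs; they affect cost, not the output shape.
    """
    _, h, w = in_shape

    # Four parallel branches, all fed the same input.
    branches = [
        # 1x1 conv
        (n1x1, conv_out(h, w, 1)),
        # 1x1 reduce, then 3x3 conv with padding 1
        (n3x3, conv_out(*conv_out(h, w, 1), 3, pad=1)),
        # 1x1 reduce, then 5x5 conv with padding 2
        (n5x5, conv_out(*conv_out(h, w, 1), 5, pad=2)),
        # 3x3 max pool (stride 1, padding 1), then 1x1 projection
        (pool_proj, conv_out(*conv_out(h, w, 3, pad=1), 1)),
    ]

    # Concatenation only works if every branch keeps the same H x W.
    sizes = {size for _, size in branches}
    assert len(sizes) == 1, "branches must agree on spatial size"

    out_h, out_w = sizes.pop()
    out_channels = sum(channels for channels, _ in branches)
    return (out_channels, out_h, out_w)


# Widths as I recall them for inception (3a); input is 192 x 28 x 28.
print(inception_module((192, 28, 28), 64, 96, 128, 16, 32, 32))
```

The point is that the output width is the sum of the branch widths (64 + 128 + 32 + 32 here), rather than whatever the last layer in a single stack happens to produce.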
EDIT: After reading the post, my comment doesn't change. The only thing I would add is that I'm surprised people still take anything jimmy_d says seriously.