Yes, I'll do my best.
Think of handwriting or optical character recognition (OCR) as a three-dimensional problem that consists of a two-dimensional drawing pad with a height and width of 32 pixels. For the sake of the example, let's say every pixel holds three bits of information:
- 000: Pixel is completely black (not written on)
- 001: Pixel is slightly grey-ish (not directly written on, minimal "bleeding" from a neighboring pixel)
- 010: Pixel is a bit more grey-ish (not directly written on, a bit more "bleeding" from a neighboring pixel)
- etc.
- 111: Pixel is completely white (directly written on)
That makes a 32 * 32 * 3 bit three-dimensional array we could now feed into a CNN.
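As a minimal sketch of the idea (using NumPy; the stroke positions are made up for illustration), the drawing pad can be represented like this before handing it to a CNN:

```python
import numpy as np

# 32x32 drawing pad; each pixel holds a 3-bit intensity,
# 0 = completely black (not written on) ... 7 = completely white (directly written on),
# matching the levels listed above.
pad = np.zeros((32, 32), dtype=np.uint8)
pad[10:22, 15] = 7  # hypothetical vertical stroke, directly written on
pad[10:22, 14] = 3  # some "bleeding" into the neighboring column

# CNN frameworks typically expect an explicit channel axis: (height, width, channels),
# and floating-point values, so normalize the 3-bit range to [0, 1].
cnn_input = pad[..., np.newaxis].astype(np.float32) / 7.0
print(cnn_input.shape)  # (32, 32, 1)
```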
But if we now want to factor in time t, we could just use one such three-dimensional array per frame (32 * 32 * 3 * t), which bumps the newly created array into the fourth dimension. This four-dimensional array can then be fed into a CNN just like any other one-, two- or three-dimensional array, hence calling this approach "naïve".
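Stacking frames along a new time axis can be sketched like this (again with NumPy; the frame count and random contents are placeholders):

```python
import numpy as np

t = 10  # hypothetical number of frames, e.g. the pen position sampled over time
# One 32x32 pad of 3-bit values (0..7) per frame.
frames = np.random.randint(0, 8, size=(t, 32, 32), dtype=np.uint8)

# Stack into a four-dimensional input: (time, height, width, channels),
# normalized to [0, 1] as before. A 3D convolution (or a CNN treating time
# as just another axis) can consume this directly.
video_input = frames[..., np.newaxis].astype(np.float32) / 7.0
print(video_input.shape)  # (10, 32, 32, 1)
```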
This is a pretty common technique for 2D data + time (e.g. stock charts), but you can theoretically add extra dimensions to a dataset of any given dimensionality, as described above.
If Kip says the math gets too hairy, you better believe him, haha.
Referring to my example above, I'd like you to think of dimensions in the mathematical sense: time is just a bit of (extra) information that isn't special in any way.
EDIT: This is my last post on this topic. At least in this thread. Sorry for de-railing the discussion.