My understanding is that he is talking about part 2, "Mixtures of Experts," while you are talking about part 1.
You aren't averaging the outputs to improve accuracy (as would be intuitive). Instead, you pick from a bunch of experts, using a function that looks at the input and selects the right expert for it. Under this scheme you could potentially merge 100k different models without drastically higher processing requirements, since apart from the few models deemed suitable for a given input, the rest are not run. Definitely an interesting idea.
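To make the "the rest are not run" point concrete, here is a minimal sketch of that kind of routing. Everything in it is made up for illustration: the experts are toy functions, the gate just scores each expert by how close the input is to a hypothetical "center" for that expert, and `k` is the number of experts allowed to run. A real system would learn the gate.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def make_expert(slope, calls, idx):
    """Toy expert: multiplies the input by a slope and counts how often it runs."""
    def expert(x):
        calls[idx] += 1
        return slope * x
    return expert

def gate_scores(x, centers):
    """Hypothetical gate: an expert scores higher when x is near its center."""
    return [-(x - c) ** 2 for c in centers]

def sparse_moe(x, experts, centers, k=2):
    """Run only the k best-scoring experts and blend their outputs."""
    scores = gate_scores(x, centers)
    top = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([scores[i] for i in top])
    return sum(w * experts[i](x) for w, i in zip(weights, top))

num_experts = 1000
calls = [0] * num_experts
experts = [make_expert(float(i), calls, i) for i in range(num_experts)]
centers = [float(i) for i in range(num_experts)]

y = sparse_moe(3.2, experts, centers, k=2)
experts_run = sum(calls)  # how many experts were actually evaluated
```

With 1000 experts available, only the two nearest the input (experts 3 and 4 here) are evaluated. The gate itself still has to score every expert, so it needs to be cheap compared to running one.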
Nah, that's still not combining models.
The mixture of experts is simply training different networks on different tasks.
For example, one model is trained on stop signs and another on street signs, or one model is trained to see stop signs at night and another to see them during the day. You then look at the input data and decide which model to use.
Or you train separate models to detect stop signs in heavy rain, heavy snow, and normal weather, then look at the input and pick which one to use.
The training data is segregated, and only a subset of it is used for each specialized model.
So only pictures of stop signs in the rain are fed to one model for training.
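As a rough sketch of what "segregated" means here, assuming each picture already comes tagged with its weather regime (the file names and tags below are invented):

```python
from collections import defaultdict

def split_by_regime(samples):
    """Group pictures by regime so each specialist sees only its own subset."""
    subsets = defaultdict(list)
    for picture, regime in samples:
        subsets[regime].append(picture)
    return dict(subsets)

# Hypothetical tagged pictures of stop signs.
samples = [
    ("stop_001.jpg", "rain"),
    ("stop_002.jpg", "clear"),
    ("stop_003.jpg", "rain"),
    ("stop_004.jpg", "snow"),
]

subsets = split_by_regime(samples)
rain_only = subsets["rain"]  # the only pictures the rain specialist trains on
```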
Geoffrey himself said:
"The idea is to train a number of nn, each of which specializes in a different part of the data. We assume we have a dataset which comes from a number of different regimes.
We train a system in which one nn will specialize in one regime and a managing nn will look at the input data and decide which specialist to give it to."
Walkthrough
We have 3 training sets (heavy-rain stop signs, clear-weather stop signs, and a mixture of heavy-rain and clear-weather stop signs).
Model 1
Trained on pictures of stop signs in heavy rain to recognize them.
Model 2
Trained on pictures of stop signs in clear weather to recognize them.
Model 3
Trained on a mixture of pictures of stop signs in both kinds of weather, together with a label for each picture saying which model to use (model 1 or model 2).
When input data comes in, model 3 decides whether to use model 1 or model 2.
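Here is the walkthrough as a toy sketch. The assumptions are all mine: each picture is reduced to two invented numbers (`wetness` and `redness`), and each "model" is just a threshold placed halfway between the class means. Real networks would replace both, but the flow of data is the same.

```python
def fit_threshold(values, labels):
    """Toy training: midpoint between the mean positive and mean negative value."""
    pos = [v for v, y in zip(values, labels) if y]
    neg = [v for v, y in zip(values, labels) if not y]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def make_classifier(threshold):
    """Toy model: answers True when the feature exceeds the learned threshold."""
    return lambda value: value > threshold

# Hypothetical data: (wetness, redness, is_stop_sign)
rain = [(0.90, 0.50, True), (0.80, 0.45, True), (0.90, 0.20, False), (0.85, 0.15, False)]
clear = [(0.10, 0.90, True), (0.20, 0.85, True), (0.10, 0.50, False), (0.15, 0.45, False)]

# Model 1: trained only on heavy-rain pictures.
model_1 = make_classifier(fit_threshold([r for _, r, _ in rain], [y for _, _, y in rain]))

# Model 2: trained only on clear-weather pictures.
model_2 = make_classifier(fit_threshold([r for _, r, _ in clear], [y for _, _, y in clear]))

# Model 3: trained on the mixture, where the label is which model to use
# (True means model 1, False means model 2).
mixed = rain + clear
use_model_1 = [True] * len(rain) + [False] * len(clear)
model_3 = make_classifier(fit_threshold([w for w, _, _ in mixed], use_model_1))

def predict(wetness, redness):
    """Model 3 picks the specialist; only that specialist sees the picture."""
    if model_3(wetness):
        return "model 1", model_1(redness)
    return "model 2", model_2(redness)

rainy_result = predict(0.9, 0.5)
clear_result = predict(0.1, 0.5)
```

The same `redness` of 0.5 is accepted as a stop sign by the rain specialist and rejected by the clear-weather specialist, which is the whole point of routing by regime.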
This technique, which came out in the 90s, is not utilized because it just doesn't give better accuracy.
Almost everyone uses one model for everything.
So they feed traffic signs in all weather to one network.
Or all types of traffic lights in all weather to one network, or all types of pedestrians in all weather.