Yes, there are many potential bottlenecks in parallel compute, such as gathering results from multiple processors (NPU, GPU, or CPU), with some units waiting on inter-chip communication before the next sequential step can begin. The neural network scheduling visualization actually shows the CPUs as the least utilized, but that may be because the existing C++ control code occupied them and isn't shown, since it isn't a neural network. He stated that the combination of the existing NNs and the C++ code was slower than an all-NN solution.
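The gather bottleneck described above can be sketched in a few lines: fan work out to several processors, then wait at a barrier for all of them. Everything here is illustrative, a minimal sketch; the chip names and latency numbers are made up for the example, not measured figures from any real hardware.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-chip latencies (seconds) for one step of a pipeline.
# Purely illustrative values, not real NPU/GPU/CPU measurements.
STAGE_LATENCY = {"NPU": 0.01, "GPU": 0.03, "CPU": 0.05}

def run_stage(name: str) -> tuple[str, float]:
    """Simulate one processor producing its partial result."""
    start = time.perf_counter()
    time.sleep(STAGE_LATENCY[name])  # stand-in for real compute
    return name, time.perf_counter() - start

def gather_step() -> float:
    """Fan out to all processors, then gather every result.

    The next sequential step cannot start until the slowest unit
    finishes, so step latency is roughly max(per-chip latency),
    and the faster chips sit idle at the barrier.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(run_stage, STAGE_LATENCY))
    total = time.perf_counter() - start
    for name, elapsed in results:
        print(f"{name}: busy {elapsed:.3f}s, idle {total - elapsed:.3f}s")
    return total

total = gather_step()
print(f"step latency {total:.3f}s (bounded by the slowest chip)")
```

Running this shows the step taking about as long as the slowest unit alone, which is why balancing work across heterogeneous chips (rather than adding raw throughput to the fastest one) is what reduces end-to-end latency.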
Overall, Tesla was able to significantly speed things up with end-to-end, even though the NPUs were already close to full utilization in 11.x. They must have found optimizations or simplifications that freed enough capacity to additionally process the control-related neural network parameters while still rendering the visualizations.