It’s not reading the sign, though. Reading the sign would involve identifying the lettering, words, or symbols on it, establishing what those words or symbols mean, interpreting the rule from that meaning, and incorporating the rule into decision making. The NN is just saying “pixels of some color in some position on the road make factor y higher or lower, which makes output z more or less likely.”
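To make that concrete, here is a toy sketch of the “pixels make an output more or less likely” idea: a linear classifier mapping raw pixel values straight to an action score. This is purely illustrative (it does not represent any real driving system), and all the names and weights are made up for the example. Nothing in it represents letters, words, or rules; red pixels in one region just push a score up.

```python
import numpy as np

def action_score(pixels, weights, bias):
    # "pixels of some color in some position make the output more or less likely":
    # a weighted sum of raw pixels squashed into a probability. No symbols,
    # no words, no rule -- just correlation between pixel values and an action.
    logit = np.dot(weights, pixels) + bias
    return 1.0 / (1.0 + np.exp(-logit))  # probability of, e.g., "stop"

# A 4-"pixel" scene: [red_top, red_bottom, white_top, white_bottom]
weights = np.array([2.5, 0.1, -0.5, -0.5])  # red-at-top is strongly weighted
bias = -1.0

scene_with_red_sign = np.array([1.0, 0.0, 0.0, 1.0])
scene_without_sign  = np.array([0.0, 0.0, 1.0, 1.0])

print(action_score(scene_with_red_sign, weights, bias))  # high score
print(action_score(scene_without_sign, weights, bias))   # low score
```

The classifier behaves as if it noticed a sign, but at no point does anything resembling “meaning” appear in the computation.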
In a lot of ways it’s more like what an experienced driver does in an area or region they are familiar with. But when they go to, say, a different country, a human starts reading the signs again.
Now you are getting into the actual mechanics. And the differences between a brain and artificial NNs (they aren’t the same or similar at all!).
By your definition OCR and other text recognition tools are also not reading anything. Which I would sort of agree with. It’s not really doing the same thing as a human, probably.
But I was speaking of the end result. In a perfect implementation (what I was referring to), it’s nearly indistinguishable from a human reading the sign. I said that to solve the proposed problem the system would have to “read” the sign. There is not really another way.
And then it was said that no, actually, the decision would be made based on the presence or absence of a sign within the NN’s decision process. It would all just feed into the network, and then there would be an output action.
But this is effectively reading the sign - just like text recognition. The specific output “turn right on red here” based on the overall scene is the same as the output “the letter P” based on a specific region of a photo or whatever.
What is different, in terms of end result?
See the examples from @KArnold above. These will just go into the network, and it will flawlessly produce outputs across a huge variety of inputs (either because every possible input has been thoroughly trained and no other unrecognizable signs exist, or because it is a foundation model with emergent capabilities that does not need training on every possibility - I have no idea which, and it doesn’t matter). If a text or sign recognition output was the desired result instead of driving, it would be perfect: every sign in an image would produce a flawless output, with all the text and everything lifted from the scene. So how is that not “reading” signs?
Obviously it has absolutely no idea what the signs mean or what the output it is producing is, or whether the output is correct (humans can double-check whether they produced the right result) - it is just producing the output. But there’s a strong argument that this is still “reading.” Otherwise these tools for capturing text from images would be useless! But they are not.