Has anyone tried to use Vicarious’ Recursive Cortical Network for 3D computer vision?

I’m flummoxed by a recent discovery. The AI/robotics startup Vicarious (in which Elon is an investor, along with Jeff Bezos and Mark Zuckerberg) has developed a new neural network architecture they call a Recursive Cortical Network (RCN). Vicarious used its RCN to solve CAPTCHAs with the same accuracy as a Google DeepMind convolutional neural network. Here’s the kicker: the RCN was trained on only 260 examples, versus 2.3 million for the ConvNet. So that’s a ~900,000% improvement in training data efficiency. Holy $%!&.

You can read about the RCN solving CAPTCHAs in Vicarious’ blog post on the matter, or you can read their paper in the journal Science, if you happen to have access. Vicarious also has a reference implementation of its RCN up on GitHub.

So, the RCN has achieved state-of-the-art accuracy on optical character recognition with ~900,000% higher training data efficiency. Here’s my question: has anyone tried to adapt Vicarious’ RCN for 3D computer vision?

I’m a lay enthusiast and CS 101 dropout, not a computer scientist or software engineer. So I don’t have the ability to try this myself, or even the knowledge to say whether it would be feasible to try using the code Vicarious has made available on GitHub. So apologies if this is a misconceived question.

But if I have not exceeded my depth here, this seems like such an exciting experiment. If the RCN can match the accuracy of state-of-the-art ConvNets not just on character recognition, but on object detection in a 3D environment, and do so after being trained on ~0.011% as many examples, imagine the possibilities. Imagine training the vision neural networks for an autonomous car on video from 500 miles of driving, and achieving the same object detection accuracy as Waymo’s neural nets after 5 million miles.

Or, exciting for companies like Waymo and Tesla, what if using RCNs, 5 million miles of test driving is as good as 50 billion miles with ConvNets?
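For anyone who wants to check the arithmetic behind those figures, here is a quick sketch in Python. The only inputs are the two training-set sizes quoted above (260 and 2.3 million). The function names are made up for illustration, and the mileage part rests on a big assumption: that the CAPTCHA data-efficiency ratio would carry over unchanged to driving data, which nothing in the blog post establishes.

```python
# Training-set sizes quoted in the post above.
RCN_EXAMPLES = 260
CONVNET_EXAMPLES = 2_300_000


def efficiency_ratio(small_set: int, large_set: int) -> float:
    """How many times more examples the larger training set contains."""
    return large_set / small_set


def equivalent_miles(miles: float, ratio: float) -> float:
    """Hypothetical: scale driving miles by the same ratio.

    Assumes the CAPTCHA ratio transfers one-to-one to driving data,
    which is purely a thought experiment.
    """
    return miles * ratio


ratio = efficiency_ratio(RCN_EXAMPLES, CONVNET_EXAMPLES)
percent_of_convnet_data = 100 * RCN_EXAMPLES / CONVNET_EXAMPLES

print(f"ConvNet used about {ratio:,.0f}x as many examples")
print(f"RCN used about {percent_of_convnet_data:.3f}% as many examples")
print(f"500 miles scales to about {equivalent_miles(500, ratio):,.0f} miles")
print(f"5 million miles scales to about {equivalent_miles(5_000_000, ratio):,.0f} miles")
```

The ratio works out to roughly 8,800x, so the "~900,000%" and "5 million miles" figures are rounded up a little, and the 500-mile case lands closer to 4.4 million miles.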
 
You rang?

I took a look at the paper - had not noticed it before.

My initial take is that it's probably not going to perform as well on other visual applications without some significant work. It's interesting, but it's early research level stuff and far from certain to be useful in the real world.

The approach is specifically designed to infer the shapes of obfuscated text, which maps well onto CAPTCHAs but not as well onto more generic vision applications. And while it's true that they achieved relatively high performance in a sample-efficient way, that metric only applies if you ignore the pre-training phase. Most of the network is pre-trained on a set of 'generic' imaging challenges. The paper doesn't say what the size of the pre-training dataset is, but I get the impression that it's substantial. They demonstrate some interesting results on MNIST as well, but once again those results are constrained to particular metrics and they don't show overall performance that is particularly compelling. Other variations of one-shot or few-shot learning on pre-trained networks have been demonstrated, and they are interesting as a category but not remarkable in terms of their performance at real-world applications.

It's also worth keeping in mind that succeeding at CAPTCHAs (according to the paper) only requires a 1% accuracy rate. They demonstrated much better performance than the minimum required, achieving numbers in the 60% range. Still, those numbers are not adequate for the kind of generic object recognition that self-driving cars require, though it's hard to say exactly how those kinds of results would map onto generic object recognition since the paper's target application is quite different.

The authors seem to be particularly interested in developing models that work well on text in challenging settings. They seem to think this challenge is interesting because they see it as epistemologically related to broader intelligence capabilities. That is not a widely held view, but I can't deny that it might be true.

I hope that's a useful breakdown. Given what I see in the paper I wouldn't be in a rush to try it on widely variant visual applications. In any case, if the technique has real merit you'll see it being adopted by other groups over the next 12 months - groups that understand the paper a lot better than I do. If that happens then the publicly available codebases and implementation hints will expand and you'll see people trying to apply it to more disparate applications, like generic object recognition in the real world.
 
Thank you! I don’t have access to the original paper, so I wasn’t aware of the pre-training phase. I will see if I can get access and look into that aspect of it.

Human accuracy on reCAPTCHAs is 75%, so Vicarious’ 66.6% is not bad. The human failure rate on driving-related visual tasks is obviously not that high or we’d all be dead. We make roads, cars, etc. easy to see, whereas CAPTCHAs are deliberately hard to see and sometimes defeat humans.

Here’s how I interpreted Vicarious’ comments in their blog post. They seem to want to make AI that is capable of understanding concepts, like the concept “A” for example, in a way that’s resilient to surface-level distortions. A lot of currently used neural networks seem to identify objects based on the surface-level statistical properties of images, and can therefore be tricked by changing the surface-level properties, like changing a few pixels — something that might not even be discernible to the human visual system.

For the purpose of self-driving cars, it doesn’t matter how neural networks identify objects as long as they do it well. Objects in the real world don’t undergo deliberately engineered surface-level distortions in their statistical properties. Neural networks can even be made resilient to adversarial attacks like fake stop signs. Adversarial attacks on a car’s mechanical systems, like tires and brakes, are already possible and not much of a concern, especially compared to the 1 million+ accidental deaths a year.

But some cognitive tasks require concept understanding, so if we want AI to be capable of a more general, human-like set of cognitive tasks, we need to develop AI with concept understanding. I think that’s Vicarious’ goal. I like how they are looking at neuroscience and cognitive science to try to figure out how the human brain does vision, and then try to implement similar computational techniques in their neural network.

My background is in philosophy of mind and cognitive science, so this approach naturally resonates with me. When you have functional theories about how the human brain accomplishes cognitive tasks, it is hard to believe that a mechanism as simple as a conventional deep neural network will ever be capable of the level of generality and complexity of human cognition. And indeed, I don’t think anyone actually believes that they will be. Some people just extrapolate progress in machine learning forward on some unclear, dimly imagined trajectory, and assume that this trajectory of improvement will not converge on the mechanisms discovered or invented by natural selection (perhaps many times independently).

I can’t say there is no distinctly non-biological form of intelligence out there in the space of logical possibilities — how could I possibly know that? — but I can make the weaker argument that the quickest path to human-level intelligence is copying human biology. Either copying the low-level details with brain scanning technology or copying the high-level computational mechanisms, like Vicarious is trying to do. Why reinvent the wheel? We already have a wheel. And if you’re concerned about existential risk from distinctly non-biological AI, what better response than to pre-emptively develop biological AI? (“Biological AI” is almost an oxymoron, but not. Whether an intelligence’s computing substrate is made of organic or inorganic matter should be less of a dividing line than whether the design of its cognitive architecture is similar or foreign to natural, biological intelligence.)

Maybe the brute force, statistical number crunching approach is going to be the best for applied robotics tasks like self-driving for years to come. That’s my default assumption and fine with me. But it’s exciting to see people like Vicarious do fundamental research on neural networks that is potentially paradigm shifting. I don’t discount the possibility that a new, biologically inspired paradigm might overtake conventional deep learning in a similar way that deep learning overtook previous machine learning techniques. I’ll have to look into the pre-training stuff to see how valid the headline figure of ~900,000% training data efficiency improvement really is.
 