Musk insists on navigation using computer vision (CV) alone, partly due to cost and partly because he believes that, just as humans rely on vision to drive, machines should be able to do the same with advanced neural networks. His argument is that, given sufficient compute power and high-quality data, computer vision should be able to match or exceed human perception. However, achieving this level of reliability depends on three key factors: image quality, compute speed, and the accuracy of the neural network models.

Image Quality

Higher pixel count increases image detail but also demands greater computational power for processing. Tesla’s HW4 cameras capture 5-megapixel (MP) images, providing approximately 2592 × 1944 pixels per frame. While this equates to over five million raw pixel values per image, not all pixels are directly fed into the neural network in their raw form. Instead, the image undergoes preprocessing, where it may be cropped, resized, normalized, or converted into feature maps before entering the first layer of the neural network. Additionally, Tesla’s FSD system likely processes multiple frames over time, enabling motion estimation and tracking, further reducing reliance on individual pixel values.
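
To make the preprocessing step concrete, here is a minimal sketch of the kind of crop/resize/normalize pipeline described above. It is a generic illustration, not Tesla's actual pipeline; the target resolution, crop region, and normalization constants are placeholder assumptions.

```python
import numpy as np
import cv2  # OpenCV, used here only for illustration

def preprocess_frame(frame_bgr: np.ndarray,
                     target_size=(960, 640)) -> np.ndarray:
    """Toy crop/resize/normalize pipeline (illustrative values, not Tesla's)."""
    h, w, _ = frame_bgr.shape                 # e.g. 1944 x 2592 x 3 for a 5 MP frame
    # Crop away the top and bottom eighths as a placeholder region of interest
    crop = frame_bgr[h // 8 : h - h // 8, :, :]
    # Resize down to the network's expected input resolution (width, height)
    resized = cv2.resize(crop, target_size, interpolation=cv2.INTER_AREA)
    # Scale to [0, 1] and normalize per channel (constants are illustrative)
    x = resized.astype(np.float32) / 255.0
    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
    return (x - mean) / std                   # shape (640, 960, 3), ready for the first layer
```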

As a side note, in a deep learning pipeline, the first stage is typically a convolutional neural network (CNN), whose early layers do not treat each pixel as an independent input. Instead, small kernels slide over local regions (patches) of the image in parallel, extracting low-level spatial features such as edges, textures, and patterns, which are then passed forward through the network.
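
A tiny PyTorch sketch of what such a first convolutional layer looks like. The channel count, kernel size, stride, and input resolution are illustrative assumptions, not Tesla's architecture.

```python
import torch
import torch.nn as nn

# First convolutional layer: 32 small 3x3 kernels over a 3-channel image
first_layer = nn.Conv2d(in_channels=3, out_channels=32,
                        kernel_size=3, stride=2, padding=1)

# A fake batch of one 640x960 RGB frame (channels-first, as PyTorch expects)
frame = torch.rand(1, 3, 640, 960)

features = first_layer(frame)
print(features.shape)  # torch.Size([1, 32, 320, 480]) -- low-level feature maps,
                       # not per-pixel values, are what flows deeper into the network
```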

Compute Speed

Higher computational speed enables more decisions to be made per second, improving responsiveness and real-time performance. Tesla's HW4 FSD computer is estimated to deliver two to four times the performance of HW3. HW3 operated at around 72 TOPS (tera operations per second), which would put HW4 somewhere in the 150-300 TOPS range. Real-time inferencing runs at approximately 50-100 Hz (every 10-20 ms) per camera. Latency remains a limiting factor: even a few milliseconds of delay in processing and decision-making can mean the difference between avoiding and causing an accident.

As a side note: If an autonomous vehicle needs to react within 100ms, inference should run every 10-20ms to provide a buffer.
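
Putting rough numbers on that side note: at highway speed, the car covers several meters during a 100 ms reaction budget, and well under a meter between 50 Hz inference steps. The speed and timing values below are illustrative assumptions, not measured Tesla figures.

```python
# Back-of-the-envelope latency budget (all numbers are assumptions)
speed_mph = 70
speed_m_per_s = speed_mph * 1609.34 / 3600       # ~31.3 m/s

reaction_budget_s = 0.100                        # react-within budget (100 ms)
inference_period_s = 0.020                       # one inference every 20 ms (50 Hz)

distance_per_reaction = speed_m_per_s * reaction_budget_s   # ~3.1 m per 100 ms
distance_per_frame = speed_m_per_s * inference_period_s     # ~0.6 m between inferences

print(f"{distance_per_reaction:.2f} m traveled per 100 ms at {speed_mph} mph")
print(f"{distance_per_frame:.2f} m traveled between 50 Hz inference steps")
```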

Neural Network Accuracy

Like any neural network, the accuracy of Tesla’s models depends largely on how well they are trained to generalize and adapt to real-world driving conditions. Training a neural network for self-driving is inherently data-dependent, meaning it can only learn from what it has been exposed to. This creates a fundamental challenge: “You cannot learn what you haven’t been exposed to.” Rare edge cases—such as an unusual road hazard, an unpredictable pedestrian movement, or an animal crossing at night—are difficult to anticipate and require enormous amounts of real-world driving data to cover all possibilities.
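
A quick back-of-the-envelope calculation shows why rare edge cases are so data-hungry. The event rate and the number of training examples below are made-up assumptions for illustration only.

```python
# Why rare edge cases demand enormous fleet mileage (assumed numbers, not Tesla data)
event_rate_per_mile = 1 / 1_000_000   # assume the edge case occurs once per million miles
examples_needed = 10_000              # assumed number of examples to learn it robustly

miles_required = examples_needed / event_rate_per_mile
print(f"{miles_required:,.0f} miles of driving needed to see it {examples_needed:,} times")
# -> 10,000,000,000 miles
```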

If the system misses just one critical scenario, the consequences can be severe. There have been cases where FSD has failed to recognize a deer in the roadway or misjudged an obstacle, leading to a crash. The only way to improve is through continuous training across millions of miles, capturing as many real-world scenarios as possible. Perhaps in the future, with quantum computing, AI could train on all possible driving situations simultaneously, but for now, training takes time and vast amounts of data collected over years of driving. See Tesla’s Limitations and Warnings page.

Radar as a Redundant Safety Sensor

While self-driving cars are not classified as man-rated applications, redundancy remains a fundamental requirement in any system where failures could lead to fatal accidents. Tesla initially used radar as part of its sensor suite but later removed it, betting on vision-only navigation. The reasoning was that advanced CV models could eventually outperform traditional sensor-fusion approaches. However, radar has distinct advantages that vision alone struggles to match. Unlike cameras, radar is largely unaffected by poor weather conditions such as fog, rain, or snow. It can detect objects even when they are occluded, such as a stopped vehicle hidden behind another car or a pedestrian emerging from behind a truck. Most importantly, radar provides direct velocity measurements via the Doppler effect, whereas vision-based depth and speed estimation rely on inference, which can introduce errors.
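
To illustrate why radar velocity is a direct measurement rather than an inference, here is the standard Doppler relationship with illustrative numbers. The 77 GHz carrier is a common automotive radar band; the measured shift below is an assumed example value.

```python
# Radial velocity from a radar Doppler shift: v = f_doppler * c / (2 * f_carrier)
c = 3.0e8             # speed of light, m/s
f_carrier = 77e9      # 77 GHz automotive radar band
f_doppler = 10_000    # measured Doppler shift in Hz (assumed example value)

radial_velocity = f_doppler * c / (2 * f_carrier)
print(f"{radial_velocity:.2f} m/s toward the sensor")   # ~19.5 m/s (~44 mph)
```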

After removing radar in 2021, Tesla has now begun reintroducing “HD Radar” in newer models, suggesting that even they recognize the need for redundancy. The new HD radar likely improves on traditional radar by offering higher resolution and better object classification, making it more useful in complex driving environments.

What About LiDAR?

LiDAR, on the other hand, offers even greater precision than both radar and vision when it comes to depth perception. It can generate highly accurate 3D maps of the surrounding environment, making it an attractive choice for fully autonomous vehicles. However, LiDAR is expensive, power-hungry, and difficult to scale for mass production. Companies like Waymo and Cruise use LiDAR as a central component in their self-driving stacks, but Tesla continues to argue that computer vision will eventually be sufficient on its own.
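
For comparison, each LiDAR return already contains a direct range measurement, so turning it into a 3D point is plain trigonometry. The single return below uses assumed example values.

```python
import math

# One LiDAR return: measured range plus the beam's pointing angles (assumed values)
range_m = 42.7            # time-of-flight distance
azimuth_deg = 15.0        # horizontal beam angle
elevation_deg = -2.0      # vertical beam angle

az = math.radians(azimuth_deg)
el = math.radians(elevation_deg)
x = range_m * math.cos(el) * math.cos(az)   # forward
y = range_m * math.cos(el) * math.sin(az)   # left
z = range_m * math.sin(el)                  # up

print(f"point: ({x:.2f}, {y:.2f}, {z:.2f}) m")  # one of many thousands of points per sweep
```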

The Cost of a Pure Vision Approach

While computer vision-based neural networks may eventually reach human-level perception, and I believe they will, today they remain imperfect, mainly because of the vast range of scenarios they must be trained on. The cost of a vision-only approach is real, and that cost is measured in lives, whether deer, pedestrians, or the occupants of other vehicles involved in accidents that might have been prevented with additional sensor redundancy. For now, it is clear that redundancy in perception remains one of the safest approaches to autonomous driving.

/ceo