Inside Tesla’s FSD: Patent Explains How FSD Works

Thanks to a Tesla patent published last year, we have a great look into how FSD operates and the various systems it uses. SETI Park, who examines and writes about patents, also highlighted this one on X.
This patent breaks down the core technology behind Tesla’s FSD and gives us a clear picture of how the system processes and analyzes data.
To make it easy to follow, we’ll go through the patent section by section and explain how each part contributes to FSD.
Vision-Based
First, this patent describes a vision-only system, in line with Tesla’s stated approach, that enables vehicles to see, understand, and interact with the world around them. The system uses multiple cameras, some with overlapping coverage, to capture a 360-degree view around the vehicle, mimicking and in some ways exceeding human vision.
What’s most interesting is how the system adapts to the various focal lengths and perspectives of the different cameras around the vehicle. It then combines all of this into a cohesive picture, which we’ll get to shortly.
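To make this concrete, here’s a minimal sketch in Python (using PyTorch) of what per-camera processing followed by fusion could look like. The camera names, layer sizes, and image shapes are illustrative assumptions, not details taken from the patent.

```python
# Hypothetical sketch: per-camera feature extraction followed by fusion into
# a single 360-degree representation. All names and sizes are illustrative.
import torch
import torch.nn as nn

CAMERAS = ["front_main", "front_wide", "left_pillar", "right_pillar",
           "left_repeater", "right_repeater", "rear"]

class PerCameraBackbone(nn.Module):
    """Extracts features from one camera, letting the network compensate for
    that camera's particular focal length and perspective."""
    def __init__(self, out_channels: int = 64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.Conv2d(32, out_channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.stem(image)

class MultiCameraFusion(nn.Module):
    """Fuses the per-camera feature maps into one cohesive picture."""
    def __init__(self):
        super().__init__()
        self.backbones = nn.ModuleDict({name: PerCameraBackbone() for name in CAMERAS})
        self.fuse = nn.Conv2d(64 * len(CAMERAS), 128, kernel_size=1)

    def forward(self, images: dict) -> torch.Tensor:
        feats = [self.backbones[name](images[name]) for name in CAMERAS]
        return self.fuse(torch.cat(feats, dim=1))

# Usage: one 3x240x320 frame per camera, batch size 1.
frames = {name: torch.rand(1, 3, 240, 320) for name in CAMERAS}
fused = MultiCameraFusion()(frames)   # shape: [1, 128, 60, 80]
```

The key design point is that each camera gets its own small network before fusion, which is one plausible way to handle differing focal lengths and perspectives.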
Branching
The system is divided into two branches: one for Vulnerable Road Users, or VRUs, and one for everything else. The divide is simple: VRUs are pedestrians, cyclists, baby carriages, skateboarders, animals, essentially anything that can get hurt. The non-VRU branch covers everything else: cars, emergency vehicles, traffic cones, debris, and so on.
Splitting processing into two branches enables FSD to look for, analyze, and prioritize each category differently. Essentially, VRUs are prioritized over other objects throughout the Virtual Camera system.
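As a rough illustration of that split, the sketch below runs a shared feature map through two separate detection heads, one for VRUs and one for everything else. The class counts, head structure, and names are assumptions made for this example; the patent doesn’t spell them out.

```python
# Hypothetical sketch of the two-branch split over a shared feature map.
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Predicts per-cell class scores and box parameters for one branch."""
    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.cls = nn.Conv2d(in_channels, num_classes, kernel_size=1)
        self.box = nn.Conv2d(in_channels, 4, kernel_size=1)  # x, y, w, h

    def forward(self, features: torch.Tensor):
        return self.cls(features), self.box(features)

class BranchedDetector(nn.Module):
    def __init__(self, in_channels: int = 128):
        super().__init__()
        # VRU branch: pedestrians, cyclists, animals, strollers, ...
        self.vru_head = DetectionHead(in_channels, num_classes=5)
        # Non-VRU branch: cars, emergency vehicles, cones, debris, ...
        self.non_vru_head = DetectionHead(in_channels, num_classes=10)

    def forward(self, features: torch.Tensor):
        return {"vru": self.vru_head(features),
                "non_vru": self.non_vru_head(features)}

# Usage with a feature map shaped like the fused output sketched earlier.
detections = BranchedDetector()(torch.rand(1, 128, 60, 80))
```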

Virtual Camera
Tesla processes all of that raw imagery, feeds it into the VRU and non-VRU branches, and picks out only the essential information, which is then used for object detection and classification.
The system then draws these objects on a 3D plane and creates “virtual cameras” at varying heights. Think of a virtual camera as a real camera you’d use to shoot a movie. It allows you to see the scene from a certain perspective.
The VRU branch places its virtual camera at human height, which gives it a better understanding of VRU behavior, likely because far more data exists at human height than from overhead or other angles. The non-VRU branch raises its camera above that height, letting it see over and around obstacles for a wider view of traffic.
This effectively provides two forms of input for FSD to analyze: one at pedestrian level and one with a wider view of the surrounding road.
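As a toy illustration of the virtual camera idea, the function below projects 3D points onto an image plane placed at a chosen height, once at roughly eye level and once elevated. The pinhole model, heights, and focal length are assumptions for illustration, not values from the patent.

```python
# Hypothetical sketch of a "virtual camera": project 3D world points onto an
# image plane looking forward from a configurable height.
import numpy as np

def virtual_camera_view(points_3d: np.ndarray, camera_height: float,
                        focal: float = 500.0) -> np.ndarray:
    """Project Nx3 world points (x forward, y left, z up) through a virtual
    pinhole camera positioned at (0, 0, camera_height), facing forward."""
    rel = points_3d - np.array([0.0, 0.0, camera_height])  # camera-relative
    x, y, z = rel[:, 0], rel[:, 1], rel[:, 2]
    x = np.maximum(x, 1e-3)        # avoid dividing by zero for points behind us
    u = focal * (-y / x)           # image column
    v = focal * (-z / x)           # image row
    return np.stack([u, v], axis=1)

points = np.array([[10.0, 2.0, 1.0], [25.0, -3.0, 1.5]])       # two nearby objects
vru_view = virtual_camera_view(points, camera_height=1.7)      # human height
non_vru_view = virtual_camera_view(points, camera_height=4.0)  # elevated view
```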
3D Mapping
Now, all this data has to be combined. The two virtual camera views are synchronized, and their information is fed back into the system to maintain an accurate 3D map of what’s happening around the vehicle.
And it's not just the cameras. The Virtual Camera system and 3D mapping work together with the car’s other sensors to incorporate movement data—speed and acceleration—into the analysis and production of the 3D map.
This system is best understood through the FSD visualization displayed on the screen. It picks up and tracks many moving cars and pedestrians at once, but what we see is only a fraction of the information being tracked. Think of each object as having a list of properties that isn’t displayed on the screen. A pedestrian, for example, may have properties the system can access that state how far away it is, which direction it’s moving, and how fast it’s going.
Other moving objects, such as vehicles, may carry additional properties: width, height, speed, direction, planned path, and more. Even static, non-VRU elements have properties; the road, for instance, has its width, speed limit, and more determined from AI and map data.
The vehicle itself has its own set of properties, such as speed, width, length, planned path, etc. When you combine everything, you end up with a great understanding of the surrounding environment and how best to navigate it.
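One way to picture those property lists is as plain data structures, as in the sketch below. The field names and example values are illustrative assumptions; the patent doesn’t enumerate them this way.

```python
# Hypothetical sketch of the per-object "list of properties" idea.
from dataclasses import dataclass, field

@dataclass
class TrackedObject:
    object_id: int
    category: str              # "pedestrian", "car", "cone", ...
    is_vru: bool
    position_m: tuple          # (x, y, z) relative to the vehicle
    velocity_mps: tuple        # (vx, vy)
    width_m: float
    height_m: float
    predicted_path: list = field(default_factory=list)  # future (x, y) points

@dataclass
class EgoVehicle:
    speed_mps: float
    width_m: float
    length_m: float
    planned_path: list = field(default_factory=list)

@dataclass
class SceneMap:
    ego: EgoVehicle
    objects: list = field(default_factory=list)
    road_width_m: float = 0.0
    speed_limit_mps: float = 0.0

scene = SceneMap(
    ego=EgoVehicle(speed_mps=13.4, width_m=2.0, length_m=4.7),
    objects=[TrackedObject(1, "pedestrian", True, (12.0, 3.0, 0.0),
                           (0.0, -1.2), 0.5, 1.7)],
    road_width_m=7.0,
    speed_limit_mps=13.4,
)
```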

Temporal Indexing
Tesla calls this feature Temporal Indexing. In layman’s terms, it is how the vision system analyzes images over time and keeps track of them. The world isn’t captured as a single snapshot but as a series of them, which allows FSD to understand how objects are moving. That enables path prediction and also lets FSD estimate where vehicles or objects might be, even when it doesn’t have a direct view of them.
This temporal indexing is done through “Video Modules”, the “brains” that analyze sequences of images, tracking objects over time and estimating their velocities and future paths.
Once again, heavy traffic and the FSD visualization, which keeps track of many vehicles in lanes around you—even those not in your direct line of sight—are excellent examples.
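A toy version of this idea keeps a time-indexed buffer of observations per object and uses it to estimate velocity and extrapolate position while the object is out of sight. The class name and the constant-velocity prediction are simplifying assumptions for illustration; the patent’s video modules are, of course, far more sophisticated.

```python
# Hypothetical sketch of temporal indexing: a per-object history buffer used
# for velocity estimation and short-term prediction during occlusion.
from collections import deque

class VideoModule:
    """Tracks one object across a sequence of timestamped observations."""
    def __init__(self, history: int = 10):
        self.buffer = deque(maxlen=history)   # (timestamp, (x, y)) pairs

    def observe(self, timestamp: float, position: tuple) -> None:
        self.buffer.append((timestamp, position))

    def velocity(self) -> tuple:
        if len(self.buffer) < 2:
            return (0.0, 0.0)
        (t0, p0), (t1, p1) = self.buffer[0], self.buffer[-1]
        dt = max(t1 - t0, 1e-3)
        return ((p1[0] - p0[0]) / dt, (p1[1] - p0[1]) / dt)

    def predict(self, timestamp: float) -> tuple:
        """Constant-velocity guess for an object we can no longer see."""
        t_last, p_last = self.buffer[-1]
        vx, vy = self.velocity()
        dt = timestamp - t_last
        return (p_last[0] + vx * dt, p_last[1] + vy * dt)

track = VideoModule()
track.observe(0.0, (20.0, 0.0))
track.observe(0.5, (19.0, 0.1))
print(track.predict(1.0))   # estimated position half a second after last sighting
```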
End-to-End
Finally, the patent also mentions that the entire system, from front to back, can be, and is, trained together. This training approach, which now includes end-to-end AI, optimizes overall system performance by letting each individual component learn how to interact with the other components in the system.
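To show what training everything together means in practice, here’s a minimal sketch in which a single optimizer updates a shared backbone and both branch heads against one combined loss, so gradients flow through every component at once. The model, shapes, and loss weights are made up for illustration and are not Tesla’s.

```python
# Hypothetical sketch of end-to-end training: one optimizer, one combined loss.
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
vru_head = nn.Conv2d(16, 5, kernel_size=1)       # VRU class scores
non_vru_head = nn.Conv2d(16, 10, kernel_size=1)  # non-VRU class scores

params = (list(backbone.parameters()) + list(vru_head.parameters())
          + list(non_vru_head.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-4)

def training_step(images, vru_targets, non_vru_targets):
    features = backbone(images)                  # shared representation
    loss = (nn.functional.mse_loss(vru_head(features), vru_targets)
            + 0.5 * nn.functional.mse_loss(non_vru_head(features), non_vru_targets))
    optimizer.zero_grad()
    loss.backward()                              # gradients reach every component
    optimizer.step()
    return loss.item()

images = torch.rand(2, 3, 64, 64)
loss = training_step(images, torch.rand(2, 5, 64, 64), torch.rand(2, 10, 64, 64))
```

Because every component sits inside one differentiable pipeline, the branches are optimized to cooperate rather than tuned in isolation.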

Summary
Essentially, Tesla sees FSD as a brain, and the cameras are its eyes. It has a memory, and that memory enables it to categorize and analyze what it sees. It can keep track of a wide array of objects and their properties to predict their movements and determine a path around them. This is a lot like how humans operate, except FSD can track far more objects at once and determine their properties, such as speed and size, much more accurately. On top of that, it can do so faster than a human and in all directions at once.
FSD and its vision-based camera system essentially create a live 3D map of the road that is continuously updated and used to make decisions.