Camera-Based Tool Detection and the ML Pipeline
The defining feature of AR Toolbox is its ability to look at a collection of tools through your phone's camera and tell you exactly what it sees. Behind that seemingly simple experience is a multi-stage machine learning pipeline that captures frames, preprocesses them, runs inference through a YOLO object detection model, and translates raw predictions into the labeled bounding boxes and AR overlays you see on screen. Understanding how this pipeline works helps explain why detection is so fast, what affects accuracy, and how the system handles the wide variety of tools it encounters in the field.
From Camera Frame to Model Input
When you open the scan screen, AR Toolbox begins capturing frames from your phone's rear camera at a steady rate. Not every frame goes through the full detection pipeline. The app samples frames at a frequency balanced between responsiveness and battery efficiency, typically processing several frames per second on modern devices. Each selected frame is resized and normalized to match the input dimensions the YOLO model expects. This preprocessing step converts the raw camera image into a tensor format that the model can consume, adjusting pixel values to a standard range and ensuring consistent dimensions regardless of your phone's native camera resolution.
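A minimal sketch of this preprocessing step, assuming a square model input of 640x640 (a common YOLO input size, not confirmed by the post) and RGB frames. A real app would use its platform's image library for a bilinear resize; the index-sampling resize here just keeps the example self-contained:

```python
import numpy as np

# Hypothetical model input size; many YOLO variants use 640x640.
MODEL_SIZE = 640

def preprocess(frame: np.ndarray) -> np.ndarray:
    """Resize a camera frame to the model's input size and normalize
    pixel values to [0, 1]. `frame` is an HxWx3 uint8 RGB image."""
    h, w, _ = frame.shape
    # Nearest-neighbor resize via index sampling (stand-in for a
    # proper bilinear resize in a production app).
    ys = np.arange(MODEL_SIZE) * h // MODEL_SIZE
    xs = np.arange(MODEL_SIZE) * w // MODEL_SIZE
    resized = frame[ys][:, xs]
    # Scale uint8 pixels to floats in [0, 1] and add a batch dimension:
    # output shape is (1, MODEL_SIZE, MODEL_SIZE, 3).
    return resized.astype(np.float32)[np.newaxis, ...] / 255.0
```

Whatever the phone's native resolution, the model always receives the same tensor shape and value range.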
The preprocessing also accounts for camera orientation. Whether you are holding your phone in portrait or landscape mode, the image is correctly oriented before being fed to the model. This matters because tools look very different when rotated, and consistent orientation ensures the model sees them in the same way it was trained to recognize them.
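The orientation fix can be sketched as a rotation applied before preprocessing. The rotation angle here stands in for whatever the platform's camera API reports (e.g. sensor rotation on Android); the specific API is not named in the post:

```python
import numpy as np

def orient_frame(frame: np.ndarray, rotation_degrees: int) -> np.ndarray:
    """Rotate a frame upright before it is fed to the model.
    `rotation_degrees` is the device's reported camera rotation
    (0, 90, 180, or 270) -- a stand-in for the real orientation API."""
    if rotation_degrees % 90 != 0:
        raise ValueError("expected a multiple of 90 degrees")
    # np.rot90 rotates counter-clockwise, one quarter-turn per k.
    return np.rot90(frame, k=(rotation_degrees // 90) % 4)
```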
YOLO Detection and Bounding Boxes
The core of the detection pipeline is a YOLO (You Only Look Once) model that has been trained specifically on tool imagery. YOLO is a single-pass object detection architecture, which means it analyzes the entire image in one forward pass rather than scanning regions sequentially. This single-pass design is what makes real-time detection possible on a mobile device. The model divides the input image into a grid, and for each grid cell, it predicts bounding box coordinates, an objectness score indicating whether something is present, and class probabilities across all 130+ tool types.
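The per-cell prediction layout can be illustrated with a toy decoder. This is a simplified, single-box-per-cell sketch (real YOLO heads predict multiple anchors and scales, and the grid size here is assumed), but the structure of each prediction vector matches the description above: box coordinates, an objectness score, then class scores:

```python
import numpy as np

# Illustrative values only: grid size is assumed, class count from the post.
GRID = 20
NUM_CLASSES = 130

def decode(pred: np.ndarray, conf_threshold: float = 0.5):
    """Decode a (GRID, GRID, 5 + NUM_CLASSES) prediction tensor.
    Each cell holds [cx, cy, w, h, objectness, class scores...],
    with box coordinates normalized to [0, 1]."""
    boxes = []
    for gy in range(GRID):
        for gx in range(GRID):
            cell = pred[gy, gx]
            objectness = cell[4]
            class_id = int(np.argmax(cell[5:]))
            # Combined confidence: did the cell see anything, and
            # how sure is it about which class it saw?
            score = float(objectness * cell[5 + class_id])
            if score >= conf_threshold:
                cx, cy, w, h = cell[:4]
                boxes.append((float(cx), float(cy), float(w), float(h),
                              class_id, score))
    return boxes
```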
After the raw predictions come out of the model, a post-processing step called non-maximum suppression filters out duplicate and overlapping detections. If three grid cells all detect the same wrench, NMS keeps only the highest-confidence prediction and discards the rest. The result is a clean set of bounding boxes, each with a tool class label and a confidence score expressed as a percentage.
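Non-maximum suppression itself is a short, well-known algorithm. A minimal greedy version, using intersection-over-union (IoU) to measure overlap, looks like this (thresholds here are illustrative defaults, not the app's actual settings):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(detections, iou_threshold=0.5):
    """Greedy NMS. `detections` is a list of (box, class_id, score).
    Keep the highest-scoring box, drop anything that overlaps it
    heavily, and repeat with what remains."""
    remaining = sorted(detections, key=lambda d: d[2], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [d for d in remaining
                     if iou(best[0], d[0]) < iou_threshold]
    return kept
```

Three overlapping detections of the same wrench collapse to the single highest-confidence box, while a detection elsewhere in the frame survives untouched.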
These bounding boxes are then mapped back to the original camera frame coordinates so they align precisely with the tools visible on screen. The AR overlay system uses these coordinates to position labels and information cards directly over each detected tool, creating the augmented reality effect that gives the app its name.
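For the simple-resize preprocessing sketched earlier, mapping a box back to frame coordinates is just rescaling by the resize ratio in each axis. (If the app letterboxes the image instead of stretching it, the padding offsets would be subtracted first; the post does not say which approach is used.)

```python
def to_frame_coords(box, model_size, frame_w, frame_h):
    """Scale a box predicted in model-input pixels back to the
    original camera frame. `box` is (x1, y1, x2, y2) in model
    coordinates after a plain resize to model_size x model_size."""
    sx = frame_w / model_size
    sy = frame_h / model_size
    x1, y1, x2, y2 = box
    return (x1 * sx, y1 * sy, x2 * sx, y2 * sy)
```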
The YOLO architecture was chosen specifically because it prioritizes speed without sacrificing too much accuracy. For a field tool like AR Toolbox, being able to detect tools in real time is more important than squeezing out an extra percentage point of precision on a static benchmark.
Confidence Scores and Supported Tool Types
Every detection comes with a confidence score that tells you how certain the model is about its prediction. A score above 90 percent generally means the model has a clear view of a well-known tool type. Scores between 70 and 90 percent are common for partially obscured tools, unusual angles, or less common tool types. AR Toolbox applies a configurable confidence threshold, and detections below that threshold are not shown by default. This keeps the screen from becoming cluttered with low-quality guesses while still surfacing reliable identifications.
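The thresholding step is conceptually simple. A sketch, assuming a default threshold of 70 percent (the post says the threshold is configurable but does not state its default):

```python
DEFAULT_THRESHOLD = 0.70  # assumed default; the actual value is configurable

def filter_detections(detections, threshold=DEFAULT_THRESHOLD):
    """Keep only detections whose confidence meets the threshold.
    Each detection is a (label, confidence) pair, confidence in [0, 1]."""
    return [(label, conf) for label, conf in detections
            if conf >= threshold]
```

Raising the threshold trades recall for precision: fewer boxes appear, but the ones that do are more trustworthy.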
The model currently supports over 130 tool types organized into 11 categories. These categories span hand tools, power tools, measurement and layout tools, electrical tools, plumbing tools, HVAC tools, fasteners and hardware, safety equipment, cutting tools, automotive tools, and general accessories. Each category was trained with thousands of example images captured in real workshop and job site conditions, including varied lighting, angles, backgrounds, and states of wear. This training diversity is what allows the model to recognize a beat-up crescent wrench on a dirty truck bed just as readily as a brand-new one on a clean workbench.
The pipeline is designed to be extensible. As new tool types are identified by users in the field, they can be incorporated into future model updates. These updates are delivered as compact model file replacements, meaning you get improved detection without reinstalling the app or losing any of your existing inventory data.
What's Next
Now that you understand how the camera feed becomes a set of labeled detections, the natural follow-up is how to make those detections even more accurate. In the next post, we will cover practical tips for improving detection accuracy, including lighting, angles, and how user corrections contribute to a better model over time. We will also look at how the confidence threshold can be tuned to match your workflow preferences.