AR Toolbox

On-Device ML and the YOLO Detection Model

The engine behind every tool detection in AR Toolbox is a YOLO (You Only Look Once) object detection model running entirely on your phone. No server calls, no cloud inference, no network dependency. The model analyzes camera frames locally and produces detections in real time using the processing power already in your pocket. This post takes a technical look at how the model works, how it has been optimized for mobile hardware, what inference performance looks like in practice, and how updates are delivered without disrupting your workflow.

YOLO Architecture for Tool Detection

YOLO is a family of object detection architectures that revolutionized real-time detection by treating the problem as a single regression task rather than a multi-stage pipeline. Traditional detection approaches use a region proposal network to identify candidate areas in an image, then classify each region separately. This multi-stage process is accurate but slow. YOLO instead divides the input image into a grid and predicts bounding boxes, objectness scores, and class probabilities for all grid cells simultaneously in a single forward pass through the network.
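The single-pass decoding step can be sketched in a few lines. This is an illustrative reconstruction, not AR Toolbox's actual code: the raw output layout (`tx, ty, tw, th, objectness, class scores`) and the absence of anchor boxes are simplifying assumptions, but the core idea — each grid cell's raw numbers become one candidate box in a single pass — matches how YOLO heads work.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_cell(raw, cell_x, cell_y, grid_size, img_size):
    """Decode one grid cell's raw prediction into an image-space box.

    raw = [tx, ty, tw, th, objectness, class scores...].
    Layout is illustrative; real YOLO heads also apply anchor priors.
    """
    stride = img_size / grid_size
    # Box center = cell origin plus a sigmoid-squashed offset within the cell.
    cx = (cell_x + sigmoid(raw[0])) * stride
    cy = (cell_y + sigmoid(raw[1])) * stride
    # Width and height are predicted in log space.
    w = np.exp(raw[2]) * stride
    h = np.exp(raw[3]) * stride
    objectness = sigmoid(raw[4])
    class_id = int(np.argmax(raw[5:]))
    confidence = objectness * sigmoid(raw[5 + class_id])
    return (cx, cy, w, h), confidence, class_id

# One fabricated prediction in cell (3, 5) of a 13x13 grid on a 416px input.
raw = np.array([0.0, 0.0, 0.0, 0.0, 2.0, 0.1, 3.0, -1.0])
box, conf, cls = decode_cell(raw, 3, 5, 13, 416)
# box is centered in cell (3, 5): (112.0, 176.0, 32.0, 32.0)
```

Because every cell is decoded from the same forward pass, the whole image is covered in one shot — this is the "only look once" in the name.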

The AR Toolbox model is based on a compact YOLO variant specifically chosen for its balance of accuracy and speed on mobile hardware. The backbone network, which extracts visual features from the input image, uses depth-wise separable convolutions to reduce the computational cost of each layer without proportionally reducing the feature extraction capability. The detection head sits on top of the backbone and produces the final predictions: bounding box coordinates, confidence scores, and classification probabilities across all 130+ tool classes.
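The savings from depth-wise separable convolutions are easy to quantify with a back-of-the-envelope count of multiply-accumulate operations (MACs). The layer dimensions below are hypothetical, chosen only to show the scale of the reduction:

```python
def conv_macs(h, w, k, c_in, c_out):
    # MACs for a standard k x k convolution over an h x w feature map.
    return h * w * k * k * c_in * c_out

def dw_separable_macs(h, w, k, c_in, c_out):
    # Depthwise pass: one k x k filter per input channel...
    depthwise = h * w * k * k * c_in
    # ...then a 1x1 pointwise convolution that mixes channels.
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise

# Hypothetical mid-network layer: 52x52 map, 3x3 kernel, 128 -> 256 channels.
standard = conv_macs(52, 52, 3, 128, 256)
separable = dw_separable_macs(52, 52, 3, 128, 256)
savings = standard / separable  # roughly 1 / (1/c_out + 1/k^2), here ~8.7x
```

An eight-to-ninefold reduction per layer is why this substitution is the standard trick for mobile backbones: feature extraction quality drops far less than the compute cost does.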

Training the model involved assembling a large dataset of tool images captured in real-world conditions. This dataset includes tools photographed on workbenches, inside toolboxes, spread across truck beds, laid on concrete floors, and positioned against a wide variety of backgrounds. The images span different lighting conditions from bright outdoor sun to dim basement fluorescents, different camera angles from directly overhead to steep oblique views, and different states of tool wear from brand new to heavily used. This diversity in the training data is what enables the model to perform reliably in the unpredictable conditions of actual field use rather than only in controlled laboratory settings.

Mobile Optimization with TFLite and CoreML

A model that runs well on a desktop GPU does not automatically run well on a phone. Mobile devices have constrained memory, thermal limits that throttle sustained computation, and battery budgets that penalize inefficiency. Optimizing the YOLO model for mobile required several techniques applied at both the model architecture level and the deployment level.

Quantization is the most impactful optimization. The full-precision model uses 32-bit floating point numbers for every weight and activation. Quantizing to 8-bit integers cuts the model size to roughly a quarter and accelerates inference on hardware that has integer-optimized compute units, which includes the neural processing units and GPU shader cores found in modern mobile chips. The accuracy loss from quantization is minimal for tool detection because the visual differences between tool types are large enough that reduced numerical precision does not meaningfully affect classification boundaries.
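A minimal sketch of symmetric per-tensor int8 quantization shows both effects: the 4x size reduction falls out of the byte widths, and the round-trip error is bounded by half the quantization step. This is a generic illustration of the technique, not AR Toolbox's deployment pipeline (which would use the TFLite and CoreML converters rather than hand-rolled numpy):

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor quantization of float32 weights to int8."""
    scale = np.abs(weights).max() / 127.0   # one step of the int8 grid
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.05, size=4096).astype(np.float32)  # fake weight tensor
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

size_ratio = w.nbytes / q.nbytes       # 4 bytes/weight -> 1 byte/weight
max_err = np.abs(w - w_hat).max()      # bounded by scale / 2
```

The error bound is why accuracy holds up: as long as `scale / 2` is small relative to the weight magnitudes that separate classes, the quantized model makes the same decisions as the full-precision one.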

On Android, the quantized model is deployed as a TensorFlow Lite (TFLite) file and executed using the TFLite interpreter with GPU delegate or NNAPI delegate, depending on which hardware acceleration path is available on the specific device. On iOS, the model is converted to CoreML format and runs through Apple's Neural Engine, which is dedicated silicon designed for exactly this kind of workload. Both deployment paths take advantage of hardware acceleration that would not be available to a model running in generic CPU mode.

Additional optimizations include input resolution tuning: the model accepts a smaller input image during live preview mode for faster throughput, then switches to a larger input during committed capture for higher accuracy. Layer fusion, which combines multiple sequential operations into a single kernel, further reduces the overhead of moving data between processing stages.
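The payoff of resolution tuning follows directly from how convolutional cost scales. The specific sizes below (320px preview, 640px capture) are hypothetical stand-ins, not the app's actual configuration:

```python
def relative_cost(side):
    # Convolutional inference cost scales roughly with the number of
    # input pixels, i.e. with the square of the input side length.
    return side * side

# Hypothetical input sizes for the two modes.
PREVIEW_SIDE, CAPTURE_SIDE = 320, 640

def input_side(mode):
    """Pick the smaller input for live preview, the larger for capture."""
    return PREVIEW_SIDE if mode == "preview" else CAPTURE_SIDE

# Halving the input side makes each preview frame ~4x cheaper.
speedup = relative_cost(CAPTURE_SIDE) / relative_cost(PREVIEW_SIDE)
```

A 4x per-frame saving during preview, where dozens of frames are processed per second, dwarfs the one-time cost of the higher-resolution capture pass.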

The goal of mobile optimization is not to make the model smaller for its own sake. It is to make inference fast enough that the user never perceives a delay between pointing the camera and seeing results. Every millisecond of inference time matters because it directly affects how responsive and natural the scanning experience feels.

Inference Speed, Model Updates, and Battery Impact

On flagship devices from the last two to three years, the AR Toolbox model completes a single inference pass in approximately 20 to 40 milliseconds, which translates to 25 to 50 frames per second of potential detection throughput. In practice, the app does not need to process every camera frame, so it samples at a lower rate that provides smooth real-time overlay updates while conserving battery. On mid-range devices, inference times are somewhat longer, typically 50 to 80 milliseconds, which still provides comfortably real-time performance for the scanning use case.

Model updates are delivered as part of standard app updates through the App Store and Google Play. When the detection model is improved with better accuracy for specific tool types or expanded to cover new tools, the updated model file is included in the app package. The update process replaces the model file on device while leaving the inventory database completely untouched. You never lose data during an update, and you never need to rescan your existing inventory because of a model change. The model and the data are architecturally independent.

Battery impact is a practical concern for field use where a phone needs to last a full work day. AR Toolbox manages battery consumption through several strategies. The detection pipeline only runs when the scan screen is active. Background screens like inventory browsing and report generation do not use the camera or the model. During active scanning, the intelligent frame sampling system skips frames when the camera view has not changed meaningfully, reducing unnecessary computation. Thermal monitoring prevents the processor from sustaining maximum load for extended periods, which would trigger thermal throttling and degrade both performance and battery life. In practice, a ten-minute scanning session consumes roughly the same battery as recording a ten-minute video, which is a manageable trade-off for the utility it provides.

What's Next

With a solid understanding of the technology powering AR Toolbox, the final piece is seeing how it all comes together in the real world. In the next post, we will walk through concrete field use cases, from HVAC technicians loading trucks, to electricians auditing job sites, to contractors managing fleet tool assets across multiple vehicles and teams.
