VisionSpeak Object Detection and Narration System
1Ms. K P Chinmayi, 2Ms. Mahima Hanchinal, 3Mr. Anurag Dindalkopp, 4Ms. Neha Khan, 5Prof. Plasin Francis Dias
1,2,3,4UG Students at KLS VDIT Haliyal, India
5Assistant Professor, Department of Electronics and Communication Engineering, KLS VDIT Haliyal, India
Abstract - We present VisionSpeak, a web-based object detection system that narrates visual scenes through audio feedback. Using a laptop webcam and the pre-trained YOLOv8s model, our system achieves 22 FPS on consumer hardware (Intel Core i3) with 81% average precision across seven common indoor objects. The core contribution lies in the narration control pipeline: temporal stability filtering reduces unnecessary speech by 73% (from 23.7 to 6.4 narrations per minute) while maintaining 95% detection recall.
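For concreteness, temporal stability filtering of this kind can be sketched as below. The specific policy shown (a five-frame persistence requirement plus a per-label cooldown) and all names and parameter values (NarrationFilter, STABLE_FRAMES, COOLDOWN_S) are illustrative assumptions, not the measured configuration reported above.

    import time
    from collections import defaultdict

    STABLE_FRAMES = 5   # consecutive frames before a label counts as stable (assumed value)
    COOLDOWN_S = 10.0   # minimum seconds between repeat narrations of a label (assumed value)

    class NarrationFilter:
        """Temporal stability filter: suppress transient and repeated detections."""

        def __init__(self):
            self.streaks = defaultdict(int)  # consecutive-frame count per label
            self.last_spoken = {}            # label -> time of last narration

        def update(self, labels):
            """Given this frame's detected labels, return the labels to narrate now."""
            now = time.monotonic()
            current = set(labels)
            to_narrate = []
            for label in current:
                self.streaks[label] += 1
                last = self.last_spoken.get(label)
                if self.streaks[label] >= STABLE_FRAMES and (
                        last is None or now - last >= COOLDOWN_S):
                    self.last_spoken[label] = now
                    to_narrate.append(label)
            # a label that vanished this frame loses its stability streak
            for label in list(self.streaks):
                if label not in current:
                    self.streaks[label] = 0
            return to_narrate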
Our system runs entirely offline using pyttsx3 for text-to-speech, ensuring privacy and consistent operation without cloud dependencies. The graphical interface eliminates command-line complexity, allowing non-technical users to adjust confidence thresholds (0.1–0.9) and monitor live detections through an annotated video feed.
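A minimal sketch of this offline detect-and-narrate loop follows, assuming the ultralytics and pyttsx3 packages, OpenCV for capture, and a default webcam at index 0; the weights file name and the 0.5 threshold are illustrative.

    import cv2
    import pyttsx3
    from ultralytics import YOLO

    model = YOLO("yolov8s.pt")   # pre-trained YOLOv8s weights
    engine = pyttsx3.init()      # offline text-to-speech; no network required
    cap = cv2.VideoCapture(0)    # laptop webcam
    conf_threshold = 0.5         # user-adjustable in the GUI (0.1-0.9)

    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        results = model.predict(frame, conf=conf_threshold, verbose=False)
        labels = {model.names[int(box.cls[0])] for box in results[0].boxes}
        if labels:
            engine.say("I see " + ", ".join(sorted(labels)))
            engine.runAndWait()                # blocks until speech completes
        cv2.imshow("VisionSpeak", results[0].plot())  # annotated live feed
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break

    cap.release()
    cv2.destroyAllWindows()

Note that the blocking runAndWait() call would stall the video loop in practice; a real system would queue speech on a separate thread and feed labels through a stability filter such as the one sketched earlier. The inline call is kept here only for brevity.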
We evaluated VisionSpeak through 500 manually annotated test frames across five indoor environments and a usability study with 10 participants completing standardized tasks. Results show 87% task success rate and a System Usability Scale score of 82/100. Performance degrades predictably under challenging conditions: dim lighting reduces precision by 15%, while objects smaller than 5% of frame area show 35% lower detection rates.
VisionSpeak serves as an educational demonstration tool for computer vision concepts and a prototype platform for studying narration strategies in vision-based systems. The modular architecture supports future extensions including OCR integration, depth estimation, and deployment on wearable devices.
Keywords: Object detection, YOLOv8, assistive technology, real-time narration, computer vision, accessibility, offline text-to-speech, human–computer interaction, visual impairment, frontend-backend integration.