Personalized English-Amharic Medical Image Caption and Speech Generation for Visually Impaired Patients Using Vision Transformer Fused with LLM
Abstract
Access to medical information is critical for healthcare equity, particularly for visually impaired people and speakers of low-resource languages. Accessible images are essential for those who are blind or visually impaired. Our goal is to build a model that lets visually impaired individuals access their medical image results by generating captions from the images, translating the captions into Amharic so that patients can understand their results in their mother tongue, and converting the text into speech. To this end, the study fuses computer vision and generative AI. Following the design science approach, we gathered data from Tikur Anbessa Specialized Hospital, Addis Ababa University; a domain expert annotated the data, and we preprocessed it to make it suitable for the models. The work presents a novel model-fusion approach to medical image captioning using Vision Transformer (ViT)-GPT2, ViT-Llama2, and VGG16-LSTM architectures. The system is designed to generate detailed captions for radiologists, translate the generated captions into Amharic, and produce speech for visually impaired patients. Among these models, ViT-Llama2 generates high-quality captions, and its robust feature extraction ensures that the captions are precise and context-aware. Experiments demonstrate the effectiveness of this method: ViT-Llama2 achieves a high BLEU score of 0.633 in image captioning, along with enhanced usability and accessibility. The system is deployed as a user-friendly application that accepts medical images as input, processes them through the models, and outputs textual captions, their Amharic translations, and speech. This work bridges the gap in medical accessibility for low-resource language speakers, empowering visually impaired individuals to understand their medical image results.