03 September 2024
TalkBack is Android’s screen reader, part of the Android Accessibility Suite, that describes text and images for Android users who are blind or have low vision. The TalkBack team is always working to make Android more accessible. Today, thanks to Gemini Nano with multimodality, TalkBack automatically provides blind and low-vision users with more vivid and detailed image descriptions, so they can better understand the images on their screen.
Advancing accessibility is a core part of Google’s mission to build for everyone. That’s why TalkBack has a feature to describe images when developers haven’t included descriptive alt text. This feature was previously powered by a small ML model called Garcon. However, Garcon produced short, generic responses and couldn’t specify relevant details like landmarks or products.
The development of Gemini Nano with multimodality was the perfect opportunity to use the latest AI technology to increase accessibility with TalkBack. Now, when TalkBack users opt in on eligible devices, the screen reader uses Gemini Nano’s new multimodal capabilities to automatically provide users with clear, detailed image descriptions in apps including Google Photos and Chrome, even if the device is offline or has an unstable network connection.
“Gemini Nano helps fill in missing information,” said Lisie Lillianfeld, product manager at Google. “Whether it’s more details about what’s in a photo a friend sent or the style and cut of clothing when shopping online.”
Here’s an example that illustrates how Gemini Nano improves image descriptions: When Garcon is presented with a panorama of the Sydney, Australia shoreline at night, it might read: “Full moon over the ocean.” Gemini Nano with multimodality can paint a richer picture, with a description like: “A panoramic view of Sydney Opera House and the Sydney Harbour Bridge from the north shore of Sydney, New South Wales, Australia.”
“It's amazing how Nano can recognize something specific. For instance, the model will recognize not just a tower, but the Eiffel Tower,” said Lisie. “This kind of context takes advantage of the unique strengths of LLMs to deliver a helpful experience for our users.”
Using an on-device model like Gemini Nano was the only feasible solution for TalkBack to automatically generate detailed image descriptions, even while the device is offline.
“The average TalkBack user comes across 90 unlabeled images per day, and those images weren't as accessible before this new feature,” said Lisie. The feature has gained positive user feedback, with early testers writing that the new image descriptions are a “game changer” and that it’s “wonderful” to have detailed image descriptions built into TalkBack.
One important trade-off the Android accessibility team weighed when implementing Gemini Nano with multimodality was between inference verbosity and speed, which is partly determined by image resolution. Gemini Nano with multimodality currently accepts images at a resolution of either 512 or 768 pixels.
“The 512-pixel resolution emitted its first token almost two seconds faster than 768 pixels, but the output wasn't as detailed,” said Tyler Freeman, a senior software engineer at Google. “For our users, we decided a longer, richer description was worth the increased latency. We were able to hide the perceived latency a bit by streaming the tokens directly to the text-to-speech system, so users don’t have to wait for the full text to be generated before hearing a response.”
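A rough sketch of this streaming pattern is below, assuming a model API that emits text chunks as a Kotlin Flow (as the experimental Google AI Edge SDK’s streaming calls do) and Android’s standard TextToSpeech engine; TalkBack’s actual integration is internal, and the names here are illustrative.

```kotlin
import android.speech.tts.TextToSpeech
import kotlinx.coroutines.flow.Flow

// Stream generated text chunks into TextToSpeech so speech starts before the
// full description has been generated. `chunks` stands in for a streaming
// model response (e.g. a Flow built from a generateContentStream-style call).
suspend fun speakStreamingDescription(chunks: Flow<String>, tts: TextToSpeech) {
    val buffer = StringBuilder()
    var utteranceId = 0
    chunks.collect { chunk ->
        buffer.append(chunk)
        // Speak at sentence boundaries so audio begins early but sounds natural.
        val boundary = buffer.lastIndexOfAny(charArrayOf('.', '!', '?'))
        if (boundary >= 0) {
            tts.speak(
                buffer.substring(0, boundary + 1),
                TextToSpeech.QUEUE_ADD,
                null,
                "desc-${utteranceId++}"
            )
            buffer.delete(0, boundary + 1)
        }
    }
    // Flush any trailing text once the stream completes.
    if (buffer.isNotBlank()) {
        tts.speak(buffer.toString(), TextToSpeech.QUEUE_ADD, null, "desc-$utteranceId")
    }
}
```

Buffering to sentence boundaries keeps the speech natural while still letting audio begin well before the full description is available.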
TalkBack developers also implemented a hybrid AI solution using Gemini 1.5 Flash. With this server-based AI model, TalkBack can provide the best of on-device and server-based generative AI features to make the screen reader even more powerful.
When users want more details after hearing an automatically generated image description from Gemini Nano, TalkBack offers to run the image through Gemini 1.5 Flash. With focus on an image, a three-finger tap opens the TalkBack menu, where selecting the “Describe image” option sends the image to Gemini 1.5 Flash on the server for an even more detailed description.
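The hybrid routing itself comes down to a simple decision: automatic descriptions always come from the on-device model, and the server-based model is only consulted on explicit request when the device is online. A minimal sketch, where ImageDescriber, HybridImageDescriber, and isOnline are hypothetical stand-ins rather than TalkBack’s real internals:

```kotlin
import android.graphics.Bitmap

// Hypothetical abstraction over the two models used in the hybrid approach.
interface ImageDescriber {
    suspend fun describe(image: Bitmap): String
}

class HybridImageDescriber(
    private val onDevice: ImageDescriber, // e.g. Gemini Nano, works offline
    private val cloud: ImageDescriber,    // e.g. Gemini 1.5 Flash via a server call
    private val isOnline: () -> Boolean
) {
    // Automatic descriptions always use the on-device model; the cloud model is
    // only used when the user explicitly asks for more detail and is online.
    suspend fun describe(image: Bitmap, userAskedForMoreDetail: Boolean): String =
        if (userAskedForMoreDetail && isOnline()) cloud.describe(image)
        else onDevice.describe(image)
}
```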
By combining the advantages of Gemini Nano's on-device processing with the full power of cloud-based Gemini 1.5 Flash, TalkBack provides blind and low-vision Android users with a helpful and informative experience with images. The “Describe image” feature powered by Gemini 1.5 Flash has launched to TalkBack users on more Android devices, so even more users can get detailed image descriptions.
The Android accessibility team recommends that developers looking to use Gemini Nano with multimodality first prototype and test on a powerful server-side model. There, developers can validate the UX faster, iterate on prompt engineering, and get a better sense of the highest quality possible with the most capable model available.
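A prototype against server-side Gemini 1.5 Flash can be only a few lines, assuming the Google AI client SDK for Android; the model name, prompt, and function below are illustrative.

```kotlin
import android.graphics.Bitmap
import com.google.ai.client.generativeai.GenerativeModel
import com.google.ai.client.generativeai.type.content

// Iterate on the description prompt against the server-side model first,
// then move the finished prompt to the on-device model.
suspend fun prototypeImageDescription(bitmap: Bitmap, apiKey: String): String? {
    val model = GenerativeModel(modelName = "gemini-1.5-flash", apiKey = apiKey)
    val response = model.generateContent(
        content {
            image(bitmap)
            text("Describe this image for a screen reader user in one or two sentences.")
        }
    )
    return response.text
}
```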
While Gemini Nano with multimodality can add missing context to improve image descriptions, it’s still best practice for developers to provide detailed alt text for all images in their apps or on their websites. When alt text is missing, TalkBack can help fill in the gaps.
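Providing alt text is the simplest fix. In Jetpack Compose that is just the contentDescription parameter; the resource name and wording below are illustrative.

```kotlin
import androidx.compose.foundation.Image
import androidx.compose.runtime.Composable
import androidx.compose.ui.res.painterResource

@Composable
fun HarbourPhoto() {
    Image(
        painter = painterResource(R.drawable.sydney_harbour), // illustrative resource
        // TalkBack reads contentDescription aloud; generated descriptions are a
        // fallback for images that ship without one.
        contentDescription = "Panoramic night view of the Sydney Opera House and Harbour Bridge"
    )
}
```

For View-based layouts, the equivalent is android:contentDescription in XML or setting contentDescription on the ImageView in code.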
The Android accessibility team’s goal is to create inclusive and accessible features, and leveraging Gemini Nano with multimodality to provide vivid and detailed image descriptions automatically is a big step towards that. Furthermore, their hybrid approach to AI, combining the strengths of Gemini Nano on the device and Gemini 1.5 Flash on the server, showcases the transformative potential of AI in promoting inclusivity and accessibility and highlights Google's ongoing commitment to building for everyone.
Learn more about Gemini Nano for app development.
This blog post is part of our series: Spotlight Week on Android 15, where we provide resources — blog posts, videos, sample code, and more — all designed to help you prepare your apps and take advantage of the latest features in Android 15. You can read more in the overview of Spotlight Week: Android 15, which will be updated throughout the week.