AI Vision in Action: Building a Multimodal App with .NET MAUI
- Yulian Airapetov
With the rapid evolution of artificial intelligence, both mobile and desktop applications are gaining access to new capabilities — particularly in the area of visual understanding. Multimodal models that can analyse both text and images are paving the way for smarter interfaces that can comprehend, interpret, and describe visual content.
In a recent article, Microsoft demonstrated how such models can be integrated into cross-platform applications built with .NET MAUI using cloud-based AI services. Developers now have a straightforward way to access powerful AI models without the need to dive deep into machine learning or build their own infrastructure.
In this post, I’ll briefly summarise the key ideas from Microsoft’s article and share my thoughts on how this approach can be applied in real-world projects.

Introduction to Multimodal AI Models
Multimodal models are advanced AI systems capable of processing and analysing data from multiple sources simultaneously, such as text, images, audio, and more. In the context of this article, the focus is on models that can combine textual and visual information.
For example, these models can take a photo and a natural language question as input, then provide a meaningful response based on the image content. This includes scene descriptions, object counting, reading text from photos, and even interpreting complex visual data.

Such capabilities are already being used across various industries. In healthcare, AI assists with analysing medical scans; in retail, it can automatically generate product descriptions based on photos. In educational apps, multimodal models enable interactive content with visual explanations.
Overall, multimodal models greatly enhance user interfaces by making them more intuitive and intelligent, thanks to their ability to deeply understand both context and visual input.
Building a Multimodal App with .NET MAUI
In the official Microsoft article, a cross-platform app is built using .NET MAUI that interacts with a multimodal AI model via a cloud-based service. The app allows users to upload an image and receive a meaningful response in natural language from the model.
How the app works:
The user selects or captures an image using the device.
The image is sent to a server, where it is processed by a multimodal model (e.g., GPT-4 Turbo with Vision).
The model returns a text response, which is displayed in the app interface.
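The three steps above can be sketched in a MAUI page roughly as follows. This is a minimal illustration, not code from the Microsoft article: the page class, the `ResultLabel` element, and the button handler are invented for the example, and it assumes an `IChatClient` from the Microsoft.Extensions.AI package (whose member names may differ slightly between preview releases).

```csharp
// Illustrative sketch of the pick-image → ask-model → show-answer flow.
// Assumes an IChatClient (Microsoft.Extensions.AI) injected via DI and a
// ResultLabel defined in the page's XAML.
using Microsoft.Extensions.AI;

public partial class VisionPage : ContentPage
{
    private readonly IChatClient _chatClient;

    public VisionPage(IChatClient chatClient)
    {
        InitializeComponent();
        _chatClient = chatClient;
    }

    private async void OnAnalyzeClicked(object sender, EventArgs e)
    {
        // Step 1: the user selects a photo on the device.
        FileResult? photo = await MediaPicker.Default.PickPhotoAsync();
        if (photo is null)
            return;

        using Stream stream = await photo.OpenReadAsync();
        using var buffer = new MemoryStream();
        await stream.CopyToAsync(buffer);

        // Step 2: send the image plus a natural-language question to the model.
        var message = new ChatMessage(ChatRole.User, "Describe what is in this image.");
        message.Contents.Add(new DataContent(buffer.ToArray(), "image/jpeg"));
        var response = await _chatClient.GetResponseAsync(new[] { message });

        // Step 3: display the model's text answer in the UI.
        ResultLabel.Text = response.Text;
    }
}
```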
The key component — Microsoft.Extensions.AI
One of the most important parts of this example is the use of Microsoft.Extensions.AI, a new abstraction layer introduced by Microsoft to simplify AI integration in .NET applications.
It provides:
Easy integration with various AI providers (OpenAI, Azure OpenAI, and others)
A unified and standardised way to configure and call AI services
Encapsulation of model logic into clear and extensible components
Built-in support for dependency injection to access AI functionality throughout the app
Thanks to this abstraction, developers can focus on application logic without dealing with the complexities of the underlying AI APIs.
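As a rough sketch of how that dependency-injection support looks in practice, the chat client can be registered once in `MauiProgram.cs` and then injected into any page. The package combination and the `AsIChatClient()` adapter below reflect the Microsoft.Extensions.AI.OpenAI preview at the time of writing, so treat the exact names as assumptions; the API key should of course come from secure configuration rather than a literal string.

```csharp
// Sketch: registering an IChatClient in the MAUI service container.
// Assumes the Microsoft.Extensions.AI, Microsoft.Extensions.AI.OpenAI,
// and OpenAI NuGet packages.
using Microsoft.Extensions.AI;
using OpenAI;

public static class MauiProgram
{
    public static MauiApp CreateMauiApp()
    {
        var builder = MauiApp.CreateBuilder();
        builder.UseMauiApp<App>();

        // Register one chat client; pages then receive IChatClient
        // through constructor injection.
        builder.Services.AddChatClient(
            new OpenAIClient("YOUR_API_KEY")   // placeholder; load from secure storage
                .GetChatClient("gpt-4o")
                .AsIChatClient());

        return builder.Build();
    }
}
```

Swapping OpenAI for Azure OpenAI (or another provider) then only changes this registration; the pages consuming `IChatClient` stay untouched.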
Technologies used in the project:
.NET MAUI — a cross-platform UI framework for Android, iOS, Windows, and macOS
Microsoft.Extensions.AI + OpenAI (or Azure OpenAI) — to interact with the multimodal language model
HttpClient with multipart/form-data — to upload image data to the API
This example demonstrates how Microsoft is working to make AI integration more accessible and developer-friendly within the .NET ecosystem.
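For the multipart upload piece, a minimal sketch looks like the snippet below. It applies when the image goes through your own backend rather than directly to the provider; the endpoint URL, field names, and helper class are all illustrative.

```csharp
// Sketch: uploading an image and a question as multipart/form-data.
// The endpoint and form field names are placeholders for your own API.
using System.Net.Http;
using System.Net.Http.Headers;

public static class ImageUploader
{
    public static async Task<string> AskAboutImageAsync(
        HttpClient http, Stream image, string question)
    {
        using var form = new MultipartFormDataContent();

        // Attach the image bytes with an explicit content type.
        var imagePart = new StreamContent(image);
        imagePart.Headers.ContentType = new MediaTypeHeaderValue("image/jpeg");
        form.Add(imagePart, name: "image", fileName: "photo.jpg");

        // Attach the user's question as a plain text field.
        form.Add(new StringContent(question), name: "question");

        var response = await http.PostAsync("https://example.com/api/vision", form);
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();
    }
}
```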
Real-world implementation example:
Below is a code snippet that shows how tasks can be extracted from an image using the multimodal model:
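The sketch below illustrates the idea under stated assumptions rather than reproducing Microsoft's exact code: it uses the `IChatClient` abstraction from Microsoft.Extensions.AI (whose member names may vary between preview releases), and the helper class and prompt wording are my own.

```csharp
// Sketch: extracting tasks from an image (e.g. a photo of a whiteboard or
// a handwritten to-do list) via a multimodal model.
using Microsoft.Extensions.AI;

public static class TaskExtractor
{
    public static async Task<string> ExtractTasksAsync(
        IChatClient chatClient, byte[] imageBytes)
    {
        // One user message carrying both the instruction and the image.
        var prompt = new ChatMessage(ChatRole.User,
            "List the tasks visible in this image, one per line.");
        prompt.Contents.Add(new DataContent(imageBytes, "image/png"));

        ChatResponse response = await chatClient.GetResponseAsync(new[] { prompt });
        return response.Text; // the model's newline-separated task list
    }
}
```

The same pattern generalises to any image-plus-question scenario: only the prompt text changes.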
Use Cases
Multimodal models unlock a wide range of scenarios where AI can interpret visual content in combination with text — making them especially valuable in real-world business applications. Here are some common examples:
Healthcare
Detecting abnormalities in medical scans (MRI, X-ray, CT) and generating descriptive reports for doctors. The model can answer questions like: “Do you see signs of pneumonia in this image?”
Education
Creating learning materials where AI explains what’s shown in an image or answers questions about diagrams, graphs, maps, and illustrations.
Logistics & Industry
Analysing photos from warehouse or factory cameras to assess package conditions, detect damage, or assist in sorting.
E-commerce
Automatically generating product descriptions from photos, including titles and categories. For example, asking: “What’s in this image?” — and receiving: “Red Nike Air Max 2023”.
Accessibility
Helping visually impaired users by describing the environment, reading visual information from signs or documents, and recognising objects in real time.
Document Processing
Working with scanned documents that require both OCR (text recognition) and structural understanding — including tables, signatures, and stamps.
Multimodal AI systems are especially powerful when contextual visual understanding is needed — not just basic recognition. This makes them a key enabler in process automation and digital transformation.

Conclusion
Integrating multimodal AI into .NET applications has become significantly easier thanks to Microsoft’s efforts. By using Microsoft.Extensions.AI alongside OpenAI or Azure OpenAI, developers without deep machine learning expertise can now build intelligent apps that understand both text and images 🖼️.
The .NET MAUI example demonstrates how to implement a complete interaction with a multimodal model — from image selection to receiving a meaningful response. What once required complex infrastructure and a team of specialists is now available almost out of the box 📦.
Multimodal models represent the next step in how humans interact with machines 🤖➡️👤. The easier they are to integrate, the faster they will be adopted across industries like healthcare 🏥, education 🎓, business 💼, and everyday life 🏡.
Our team at Igniscor 💡 is ready to help you bring these technologies into real products. We specialise in cross-platform development and AI integration — from MVP 🚀 to scalable commercial applications 📱💻, we’ll help turn your ideas into reality.