Mistral's Multimodal Model: Introducing Pixtral 12B

Gábor Bíró September 9, 2024
3 min read

The rapidly rising French AI startup, Mistral AI, has ventured into the realm of multimodal artificial intelligence with the release of Pixtral 12B. Multimodal AI refers to systems capable of processing and understanding information from multiple data types simultaneously, such as text and images. This new 12 billion-parameter model positions Mistral, known for its focus on open-source solutions and challenging US tech giants, to compete with similar offerings from major players like OpenAI and Anthropic.

(Image source: Mistral)

Pixtral 12B Features

Pixtral 12B builds upon Mistral's earlier text-only Nemo 12B model, adding a 400 million-parameter vision encoder that lets it process images alongside text. Its 12 billion parameters make it a mid-sized model next to some industry giants, but it offers significant capabilities, especially as an open-source release. The model handles images up to 1024x1024 pixels, breaking them into 16x16 pixel patches for analysis, and uses 2D Rotary Position Embeddings (RoPE), which crucially help the model understand the spatial relationships between objects within an image. With a vocabulary of 131,072 tokens, including specialized image-processing tokens, Pixtral 12B excels at tasks such as image captioning (describing scenes in pictures), object counting (e.g., counting apples in a basket), and visual question answering (VQA), like responding to "What color is the car in the image?".
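To get a feel for what the patching scheme above means in practice, here is a minimal sketch (illustrative arithmetic only; the real preprocessing is handled by Mistral's own tokenizer and vision encoder) of how many 16x16 patches an image contributes:

```python
# Illustrative sketch: how many 16x16 patches a Pixtral-style vision
# encoder would carve a given image into. The helper names are
# hypothetical; actual preprocessing lives in Mistral's tooling.

def patch_grid(width: int, height: int, patch: int = 16) -> tuple[int, int]:
    """Return the (columns, rows) grid of patches covering the image."""
    return width // patch, height // patch

def num_patches(width: int, height: int, patch: int = 16) -> int:
    """Total number of patches (i.e., image tokens before any merging)."""
    cols, rows = patch_grid(width, height, patch)
    return cols * rows

# A maximum-size 1024x1024 input yields a 64x64 grid of patches.
print(patch_grid(1024, 1024))   # (64, 64)
print(num_patches(1024, 1024))  # 4096
```

This is why image inputs are comparatively token-hungry: a single full-resolution image occupies thousands of positions in the model's context.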

Licensing and Availability

Pixtral 12B is released under the permissive Apache 2.0 license. This is a significant advantage for the AI community, as it means the model can be freely downloaded, used, modified, and deployed, even for commercial purposes, without requiring users to share their modifications. This fosters innovation, allows businesses to integrate it into their products without vendor lock-in concerns, and promotes transparency. Developers can access the model, which has a size of approximately 24GB, via GitHub and Hugging Face, enabling them to fine-tune it for various specific applications.

Comparison with Other Models

Pixtral 12B enters a highly competitive field populated by powerful multimodal models like OpenAI's GPT-4o, Anthropic's Claude, and Google's Gemini family. A key differentiator for Mistral's model is its open-source nature. While competitors often provide access primarily through commercial APIs (Application Programming Interfaces), Pixtral 12B's open availability grants researchers and developers greater access, transparency, and customization capabilities. This approach is crucial for accelerating research, enabling independent audits, and fostering a collaborative development ecosystem. While its performance needs comprehensive benchmarking against these closed-source counterparts, its accessible size and flexibility make it an attractive alternative for the AI community.

| Model | Company | Key Features | Availability |
|---|---|---|---|
| Pixtral 12B | Mistral AI | 12B parameters, text & image processing, open-source | Freely available under Apache 2.0 license |
| GPT-4o | OpenAI | Large-scale multimodal model, advanced reasoning | Commercial API access |
| Claude 3 (Opus/Sonnet/Haiku) | Anthropic | Text & image understanding, strong performance, ethics focus | Commercial API access |
| Gemini (Pro/Ultra) | Google | Multimodal capabilities, integrated into Google services | API access & via Google products |

Future Outlook

Fresh off a $645 million funding round that valued the company at an impressive $6 billion, Mistral AI is poised for significant growth. This substantial investment underscores market confidence and provides the resources needed to rapidly innovate and compete globally. The release of Pixtral 12B aligns perfectly with Mistral's strategy of offering powerful open models freely while generating revenue through optimized, managed versions and enterprise consulting services. As Mistral continues to expand its portfolio, Pixtral 12B is expected to be integrated into the company's chat platform (Le Chat) and API platform (La Plateforme) soon. This integration will allow a broader range of users to easily test, utilize, and explore the model's expanding capabilities, further driving its adoption and development.
