Mistral ships a 12-billion-parameter vision-language model that runs on a laptop GPU
Pixtral 12B pairs a 12-billion-parameter language decoder with a 400-million-parameter vision encoder trained on 1.2 trillion image-text tokens. Quantized to 4-bit precision, it fits in 24 GB of VRAM and scores 78.4 percent on DocVQA with no cloud calls.
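As a rough sanity check on the 24 GB figure, the weight footprint can be estimated from the parameter counts above. This is a back-of-the-envelope sketch: it counts only raw weights and ignores activations, the KV cache, and quantization metadata, all of which add to real usage. The function name and the decimal-gigabyte convention are illustrative choices, not part of any Mistral tooling.

```python
# Rough weight-memory estimate. Ignores activations, KV cache, and
# quantization metadata, so real usage will be higher.

DECODER_PARAMS = 12_000_000_000   # 12-billion-parameter decoder (from the text)
ENCODER_PARAMS = 400_000_000      # 400-million-parameter encoder (from the text)


def weight_gigabytes(num_params: int, bits_per_param: int) -> float:
    """Return raw weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


total_params = DECODER_PARAMS + ENCODER_PARAMS
gb_4bit = weight_gigabytes(total_params, 4)
gb_16bit = weight_gigabytes(total_params, 16)

print(f"4-bit weights:  {gb_4bit:.1f} GB")
print(f"16-bit weights: {gb_16bit:.1f} GB")
```

Under these assumptions the 4-bit weights come to about 6.2 GB, which leaves headroom inside a 24 GB card, while 16-bit weights alone would need about 24.8 GB.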
You can move private document and image workflows off shared servers. Local inference removes per-token fees and keeps sensitive files inside your network boundary.
Mistral reports that the model reached 180,000 downloads on Hugging Face within two weeks of release and is running in production at a French legal-tech startup that processes 40,000 contracts per day.
Step 1: Install the mistral-inference package with pip install mistral-inference.
Step 2: Download the 4-bit weights with mistral-download --model pixtral-12b.
Step 3: Run python -m pixtral.chat --image contract.pdf --prompt "Extract total amount" and confirm that the answer appears in under two seconds on an RTX 4090.
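The two-second check in the last step can be made repeatable with a small timing harness. The sketch below is a minimal example under stated assumptions: timed and fake_extract are hypothetical names, and fake_extract is only a stand-in that sleeps briefly, not a call into any Mistral API. Swap in whatever function or subprocess invocation runs the model on your machine.

```python
import time
from typing import Callable, Tuple


def timed(fn: Callable[[], str], budget_s: float = 2.0) -> Tuple[str, float, bool]:
    """Call fn once; return (result, elapsed seconds, finished within budget?)."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    return result, elapsed, elapsed < budget_s


def fake_extract() -> str:
    # Hypothetical stand-in for the real model call. Replace it with the
    # function or subprocess invocation that runs Pixtral locally.
    time.sleep(0.05)
    return "Total amount: <placeholder>"


answer, seconds, ok = timed(fake_extract)
print(f"{answer!r} in {seconds:.2f} s, within budget: {ok}")
```

A single run includes model load time if the process starts cold, so timing a second call after a warm-up usually gives a fairer picture of steady-state latency.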