
Meta releases first open AI model that can process images

Just two months after releasing its last big AI model, Meta is back with a major update: its first open-source model that can process both images and text.

The new model, Llama 3.2, could allow developers to build more advanced AI applications, including augmented reality apps that provide real-time understanding of video, visual search engines that sort images by content, or document analysis tools that summarize long passages of text for you.

Meta says it will be easy for developers to get the new model up and running. Ahmad Al-Dahle, vice president of generative AI at Meta, said there’s little developers need to do other than “add this new multimodality and have Llama display images and communicate.”

Other AI developers, including OpenAI and Google, have already launched multimodal models in the past year, so Meta is playing catch-up here. Adding vision support will also play a key role as Meta continues to develop its AI capabilities in hardware like the Ray-Ban Meta glasses.

Llama 3.2 includes two image models (11 billion parameters and 90 billion parameters) and two lightweight text-only models (1 billion parameters and 3 billion parameters). The smaller models are designed to run on Qualcomm, MediaTek, and other Arm hardware, and Meta clearly hopes to see them used on mobile.

There’s still room for the (slightly) older Llama 3.1, though: released in July, that model included a version with 405 billion parameters, which should theoretically be more capable when it comes to generating text.

Alex Heath contributed reporting