KOSMOS-2: A Multimodal Large Language Model by Microsoft


2023 has been a landmark year for AI, from large language models to Stable Diffusion image models. One of the new players to take center stage is KOSMOS-2, developed by Microsoft. It is a multimodal large language model (MLLM) that is making waves with its ability to understand both text and images. Building a language model is one thing and building a vision model is another, but combining both in a single model takes artificial intelligence to another level. In this article, we will delve into the features and potential applications of KOSMOS-2 and its impact on AI and machine learning.

Learning Objectives

  • Understand the KOSMOS-2 multimodal large language model.
  • Learn how KOSMOS-2 performs multimodal grounding and referring expression generation.
  • Gain insights into the real-world applications of KOSMOS-2.
  • Run inference with KOSMOS-2 in Colab.

This article was published as a part of the Data Science Blogathon.

Understanding KOSMOS-2 Model

KOSMOS-2 was introduced by a team of researchers at Microsoft in the paper titled “Kosmos-2: Grounding Multimodal Large Language Models to the World.” Designed to handle text and images together, KOSMOS-2 is built on a Transformer-based causal language model architecture, similar to other well-known models such as Llama 2 and Mistral AI’s 7B model.


However, what sets KOSMOS-2 apart is its training process. It is trained on a large dataset of grounded image-text pairs known as GRIT, in which the text contains references to objects in the image, with each object’s bounding box encoded as special tokens. This approach lets KOSMOS-2 link the words it reads and writes to specific regions of an image.

What is Multimodal Grounding?

One of the standout features of KOSMOS-2 is its ability to perform “multimodal grounding.” This means that it can generate captions for images that describe the objects and their locations within the image. Text is connected to objects in the image through special location tokens, effectively “grounding” each mention in the visual context.

Because each described object is tied to a region of the image, grounding helps reduce “hallucinations,” a common issue in language models, and improves the accuracy and reliability of the generated captions.

Referring Expression Generation

KOSMOS-2 also excels in “referring expression generation.” This feature lets users prompt the model with a specific bounding box in an image and a question. The model can then answer questions about that specific location in the image, providing a powerful tool for understanding and interpreting visual content.

Because a region can be pointed to directly in the prompt, this opens new avenues for natural language interactions with visual content.

Code Demo with KOSMOS-2

We will see how to run inference on Colab using the KOSMOS-2 model. Find the entire code here: https://github.com/inuwamobarak/KOSMOS-2

Step 1: Set Up Environment

In this step, we install necessary dependencies like 🤗 Transformers, Accelerate, and Bitsandbytes. These libraries are crucial for efficient inference with KOSMOS-2.

!pip install -q git+https://github.com/huggingface/transformers.git accelerate bitsandbytes

Step 2: Load the KOSMOS-2 Model

Next, we load the KOSMOS-2 model and its processor.

from transformers import AutoProcessor, AutoModelForVision2Seq

processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")
model = AutoModelForVision2Seq.from_pretrained("microsoft/kosmos-2-patch14-224", load_in_4bit=True, device_map={"": 0})

Step 3: Load Image and Prompt

In this step, we do image grounding. We load an image and provide a prompt for the model to complete. The prompt starts with the special <grounding> token, which tells the model to reference objects in the image with location tokens.

import requests
from PIL import Image

prompt = "<grounding>An image of"
url = "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.png"
image = Image.open(requests.get(url, stream=True).raw)

Step 4: Generate Completion

Next, we prepare the image and prompt for the model using the processor. We then let the model autoregressively generate a completion. The generated completion provides information about the image and its content.

inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda:0")

# Autoregressively generate completion
generated_ids = model.generate(**inputs, max_new_tokens=128)
# Convert generated token IDs back to strings
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

Step 5: Post-Processing

We first look at the raw generated text. The run of common words between <image> and </image> is the decoded form of placeholder tokens that stand in for the image embeddings; it is not part of the caption. The caption itself, with its <phrase> and <object> markup, follows the </image> tag.

print(generated_text)

<image>. the, to and of as in I that' for is was- on’ it with The as at bet he have from by are " you his “ this said not has an ( but had we her they will my or were their): up about out who one all been she can more would It</image><grounding> An image of<phrase> a snowman</phrase><object><patch_index_0044><patch_index_0863></object> warming up by<phrase> a fire</phrase><object><patch_index_0006><patch_index_0879></object>
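
To make the markup in the raw output above easier to read, here is a minimal sketch that pulls out each phrase together with its pair of location tokens using only regular expressions. It is an illustration of the tag structure, not the processor's own implementation; the helper names parse_grounded_text and strip_tags are hypothetical, and the sketch assumes each box is written as two consecutive patch_index tokens (top-left, then bottom-right).

```python
import re

# Raw caption copied from the output above (image placeholder tokens omitted).
raw = (
    "<grounding> An image of<phrase> a snowman</phrase>"
    "<object><patch_index_0044><patch_index_0863></object>"
    " warming up by<phrase> a fire</phrase>"
    "<object><patch_index_0006><patch_index_0879></object>"
)

# A phrase followed by its <object> block of location tokens.
PHRASE_RE = re.compile(r"<phrase>(.*?)</phrase><object>(.*?)</object>")
PATCH_RE = re.compile(r"<patch_index_(\d+)>")


def parse_grounded_text(text):
    """Return [(phrase, [(top_left_idx, bottom_right_idx), ...]), ...]."""
    results = []
    for phrase, object_body in PHRASE_RE.findall(text):
        indices = [int(i) for i in PATCH_RE.findall(object_body)]
        # Assumption: location tokens come in pairs (top-left, bottom-right).
        pairs = list(zip(indices[0::2], indices[1::2]))
        results.append((phrase.strip(), pairs))
    return results


def strip_tags(text):
    """Drop <object> blocks and all other tags, then collapse whitespace."""
    no_objects = re.sub(r"<object>.*?</object>", "", text)
    no_tags = re.sub(r"<[^>]+>", "", no_objects)
    return " ".join(no_tags.split())


print(parse_grounded_text(raw))
print(strip_tags(raw))
```

In practice, processor.post_process_generation does this work (and more) for you; the sketch is only meant to show what information the tags carry.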

Step 6: Further Processing

This step focuses on the generated text beyond the initial image-related tokens. We extract details, including object names, phrases, and location tokens. This extracted information is more meaningful and allows us to better understand the model’s response.

# By default, the generated text is cleaned up and the entities are extracted.
processed_text, entities = processor.post_process_generation(generated_text)
print(processed_text)
print(entities)

Output:

An image of a snowman warming up by a fire
[('a snowman', (12, 21), [(0.390625, 0.046875, 0.984375, 0.828125)]), ('a fire', (36, 42), [(0.203125, 0.015625, 0.484375, 0.859375)])]

# Alternatively, drop only the image placeholder tokens and keep the grounding markup.
end_of_image_token = processor.eoi_token
caption = generated_text.split(end_of_image_token)[-1]
print(caption)

Output:

<grounding> An image of<phrase> a snowman</phrase><object><patch_index_0044><patch_index_0863></object> warming up by<phrase> a fire</phrase><object><patch_index_0006><patch_index_0879></object>
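
The location tokens (for example <patch_index_0044> and <patch_index_0863>) and the normalized boxes in the entity list are two views of the same information. The sketch below shows one mapping that reproduces the numbers above. It assumes the image is divided into a 32 × 32 grid of cells numbered row by row, and that each box corner is placed at the center of its cell. The function names are hypothetical, and cases where both tokens fall in the same cell, row, or column are not handled here.

```python
GRID = 32  # assumed number of location bins per image side (32 * 32 = 1024 tokens)


def patch_index_to_cell(index, grid=GRID):
    """Map a flat patch index to its (column, row) cell, counting row by row."""
    return index % grid, index // grid


def patch_pair_to_box(top_left_idx, bottom_right_idx, grid=GRID):
    """Return a normalized (x1, y1, x2, y2) box using the centers of both cells."""
    cell = 1.0 / grid
    col1, row1 = patch_index_to_cell(top_left_idx, grid)
    col2, row2 = patch_index_to_cell(bottom_right_idx, grid)
    return (
        col1 * cell + cell / 2,
        row1 * cell + cell / 2,
        col2 * cell + cell / 2,
        row2 * cell + cell / 2,
    )


print(patch_pair_to_box(44, 863))  # tokens attached to "a snowman"
print(patch_pair_to_box(6, 879))   # tokens attached to "a fire"
```

Under these assumptions the two calls give (0.390625, 0.046875, 0.984375, 0.828125) and (0.203125, 0.015625, 0.484375, 0.859375), the same boxes reported for the snowman and the fire.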

Step 7: Plot Bounding Boxes

We show how to visualize the bounding boxes of objects identified in the image. This step allows us to understand where the model has located specific objects. We leverage the extracted information to annotate the image.

from PIL import ImageDraw

width, height = image.size
draw = ImageDraw.Draw(image)

for entity, _, box in entities:
    box = [round(i, 2) for i in box[0]]
    x1, y1, x2, y2 = tuple(box)
    x1, x2 = x1 * width, x2 * width
    y1, y2 = y1 * height, y2 * height
    draw.rectangle(xy=((x1, y1), (x2, y2)), outline="red")
    draw.text(xy=(x1, y1), text=entity)
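
The loop above draws only the first box of each entity. As a hedged variation, the sketch below separates the coordinate arithmetic from the drawing and keeps every box an entity carries. The helper to_pixel_boxes is hypothetical, and the 640 × 640 image size is made up purely to keep the arithmetic easy to follow; with a real image you would pass image.size instead.

```python
def to_pixel_boxes(entities, width, height):
    """Flatten entities into (label, (left, top, right, bottom)) in whole pixels."""
    pixel_boxes = []
    for label, _span, boxes in entities:
        for x1, y1, x2, y2 in boxes:  # an entity may carry more than one box
            pixel_boxes.append(
                (
                    label,
                    (
                        round(x1 * width),
                        round(y1 * height),
                        round(x2 * width),
                        round(y2 * height),
                    ),
                )
            )
    return pixel_boxes


# Entities copied from the post-processing output shown earlier.
entities = [
    ("a snowman", (12, 21), [(0.390625, 0.046875, 0.984375, 0.828125)]),
    ("a fire", (36, 42), [(0.203125, 0.015625, 0.484375, 0.859375)]),
]

# Hypothetical image size, chosen only for illustration.
for label, box in to_pixel_boxes(entities, width=640, height=640):
    print(label, box)
```

Each resulting tuple can be passed to draw.rectangle as in the loop above. Scaling the full-precision values and rounding once at the end avoids the small offset introduced by rounding the normalized coordinates to two decimals first.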


Step 8: Grounded Question Answering

KOSMOS-2 allows you to interact with specific objects in an image. In this step, we prompt the model with a bounding box and a question related to a particular object. The model provides answers based on the context and information from the image.

url = "https://huggingface.co/ydshieh/kosmos-2-patch14-224/resolve/main/pikachu.png"
image = Image.open(requests.get(url, stream=True).raw)

Next, we prepare a question and a bounding box for Pikachu. The special <phrase> tokens mark the phrase in the question that the bounding box refers to. This step showcases how to get specific information from an image with grounded question answering.

prompt = "<grounding> Question: What is<phrase> this character</phrase>? Answer:"

inputs = processor(text=prompt, images=image, bboxes=[(0.04182509505703422, 0.39244186046511625, 0.38783269961977185, 1.0)], return_tensors="pt").to("cuda:0")

Step 9: Generate Grounded Answer

We allow the model to autoregressively complete the question, generating an answer based on the provided context.

generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# By default, the generated text is cleaned up, and the entities are extracted.
processed_text, entities = processor.post_process_generation(generated_text)
print(processed_text)
print(entities)

Output:

Question: What is this character? Answer: Pikachu in the anime.
[('this character', (18, 32), [(0.046875, 0.390625, 0.390625, 0.984375)])]
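
Notice that the box returned for “this character” is close to, but not identical to, the box passed in the prompt. One explanation that reproduces these numbers is that the input box is snapped to a grid of location tokens and then decoded back to cell centers. The sketch below illustrates that round trip. The 32 × 32 grid size, the floor/ceil rounding rule, and the function names are assumptions made for illustration, not a statement of the library's internals.

```python
import math

GRID = 32  # assumed number of location bins per image side


def box_to_patch_pair(box, grid=GRID):
    """Quantize a normalized (x1, y1, x2, y2) box to two flat patch indices."""
    x1, y1, x2, y2 = box
    col1 = math.floor(x1 * grid)
    row1 = math.floor(y1 * grid)
    col2 = math.ceil(x2 * grid - 1)
    row2 = math.ceil(y2 * grid - 1)
    return row1 * grid + col1, row2 * grid + col2


def patch_pair_to_box(top_left_idx, bottom_right_idx, grid=GRID):
    """Decode two patch indices back to a box through the cell centers."""
    cell = 1.0 / grid
    col1, row1 = top_left_idx % grid, top_left_idx // grid
    col2, row2 = bottom_right_idx % grid, bottom_right_idx // grid
    return (
        col1 * cell + cell / 2,
        row1 * cell + cell / 2,
        col2 * cell + cell / 2,
        row2 * cell + cell / 2,
    )


# Bounding box for Pikachu, copied from the prompt above.
pikachu_box = (0.04182509505703422, 0.39244186046511625, 0.38783269961977185, 1.0)

pair = box_to_patch_pair(pikachu_box)
print(pair)
print(patch_pair_to_box(*pair))
```

Under these assumptions the box maps to patch indices 385 and 1004, which decode to (0.046875, 0.390625, 0.390625, 0.984375), the box shown in the output above. The practical takeaway is that box coordinates are only resolved to the nearest grid cell, so small differences between input and output boxes are expected.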

Applications of KOSMOS-2

KOSMOS-2’s capabilities extend far beyond the lab and into real-world applications. Some of the areas where it can make an impact include:

  1. Robotics: Imagine telling your robot to wake you up if the clouds look heavy. To do that, it needs to see the sky and interpret it in context. KOSMOS-2 could be integrated into robots so that they understand their environment, follow instructions, and learn from experience by observing their surroundings and interacting with the world through text and images.
  2. Document Intelligence: Beyond the physical environment, KOSMOS-2 can be used for document intelligence: analyzing and understanding complex documents containing text, images, and tables, which makes extracting and processing relevant information easier.
  3. Multimodal Dialogue: AI assistants have typically handled either language or vision. With KOSMOS-2, chatbots and virtual assistants can combine the two, understanding and responding to user queries that involve both text and images.
  4. Image Captioning and Visual Question Answering: These involve automatically generating captions for images and answering questions based on visual information, with applications in industries like advertising, journalism, and education. This includes building specialized or fine-tuned versions for specific use cases.

Practical Real-World Use Cases

We have seen that KOSMOS-2’s capabilities extend beyond traditional AI and language models. Let us look at some specific applications:

  • Automated Driving: It has the potential to improve automated driving systems by detecting objects on a vehicle, such as the turn signals (trafficators) and the wheels, and understanding their relative positions, enabling more intelligent decision-making in complex driving scenarios. It could also identify pedestrians and infer their intentions on the highway from their body position.
  • Safety and Security: When building police or security robots, the KOSMOS-2 architecture could be trained to detect whether or not people have frozen in place.
  • Market Research: It could also be a game-changer in market research, where vast amounts of user feedback, images, and reviews can be analyzed together. KOSMOS-2 offers new ways to surface valuable insights at scale by quantifying qualitative data and combining it with statistical analysis.

The Future of Multimodal AI

KOSMOS-2 represents a leap forward in the field of multimodal AI. Its ability to understand and describe text and images, down to the location of individual objects, opens up new possibilities. As AI advances, models like KOSMOS-2 bring us closer to advanced machine intelligence and could reshape many industries.

It is among the models that move toward artificial general intelligence (AGI), which is currently only a hypothetical type of intelligent agent. If realized, an AGI could learn to perform the tasks that humans can perform.


Microsoft’s KOSMOS-2 is a testament to the potential of AI in combining text and images to create new capabilities and applications. As it finds its way into more domains, we can expect to see AI-driven innovations that were once considered beyond the reach of technology. By bridging the gap between text and images, models like KOSMOS-2 are a step forward for AI and machine learning and could open the door to innovative applications across industries. As we continue to explore the possibilities of multimodal language models, we can expect exciting advancements in AI, paving the way toward advanced machine intelligence such as AGI.

Key Takeaways

  • KOSMOS-2 is a groundbreaking multimodal large language model that can understand text and images, with a unique training process that uses in-text references to bounding boxes.
  • KOSMOS-2 excels in multimodal grounding, generating image captions that specify the locations of objects, which reduces hallucinations and improves model accuracy.
  • The model can answer questions about specific locations in an image using bounding boxes, opening up new possibilities for natural language interactions with visual content.

Frequently Asked Questions

Q1: What is KOSMOS-2, and what makes it unique?

A1: KOSMOS-2 is a multimodal large language model developed by Microsoft. What sets it apart is its ability to understand both text and images simultaneously, with a unique training process that uses in-text references to bounding boxes.

Q2: How does KOSMOS-2 improve the accuracy of language models?

A2: KOSMOS-2 enhances accuracy by performing multimodal grounding, which generates image captions with object locations. This reduces hallucinations and ties the generated text to the visual content.

Q3: What is multimodal grounding, and why is it important?

A3: Multimodal grounding is the ability of KOSMOS-2 to connect text to objects in images using unique tokens. This is crucial for reducing ambiguity in language models and improving their performance in visual content tasks.

Q4: What are some practical applications of KOSMOS-2?

A4: KOSMOS-2 can be integrated into robotics, document intelligence, multimodal dialogue systems, and image captioning. It enables robots to understand their environment, helps process complex documents, and supports natural language interactions with visual content.

Q5: How does KOSMOS-2 generate captions for images with object locations?

A5: KOSMOS-2 uses special tokens as in-text references to the bounding boxes of objects in images. These tokens guide the model in generating accurate captions that include object positions.


  • https://github.com/inuwamobarak/KOSMOS-2
  • https://github.com/NielsRogge/Transformers-Tutorials/tree/master/KOSMOS-2
  • https://arxiv.org/pdf/2306.14824.pdf
  • https://huggingface.co/docs/transformers/main/en/model_doc/kosmos-2
  • https://huggingface.co/datasets/zzliang/GRIT
  • Peng, Z., Wang, W., Dong, L., Hao, Y., Huang, S., Ma, S., & Wei, F. (2023). Kosmos-2: Grounding Multimodal Large Language Models to the World. arXiv. https://arxiv.org/abs/2306.14824

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion. 
