Object detection has been a fundamental challenge in the computer vision industry, with applications in robotics, image understanding, autonomous vehicles, and image recognition. In recent years, groundbreaking work in AI, particularly through deep neural networks, has significantly advanced object detection. However, these models operate with a fixed vocabulary; a detector trained on the COCO dataset, for example, is limited to its 80 categories. This limitation stems from the training process, where object detectors learn to recognize only the categories they were trained on, which restricts their applicability.
To overcome this, we introduce YOLO-World, an innovative approach aimed at enhancing the YOLO (You Only Look Once) framework with open vocabulary detection capabilities. This is achieved by pre-training the framework on large-scale datasets and implementing a vision-language modeling approach. Specifically, YOLO-World employs a Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) and region-text contrastive loss to foster interaction between linguistic and visual information. Through RepVL-PAN and region-text contrastive loss, YOLO-World can accurately detect a wide range of objects in a zero-shot setting, showing remarkable performance in open-vocabulary segmentation and object detection tasks.
This article aims to provide a thorough understanding of YOLO-World’s technical foundations, model architecture, training process, and application scenarios. Let’s dive in.
YOLO, or You Only Look Once, is one of the most popular methods for modern-day object detection in the computer vision industry. Renowned for its speed and efficiency, the advent of the YOLO framework revolutionized the way machines interpret and detect objects in images and videos in real time. Traditional object detection frameworks take a two-step approach: first, the framework proposes regions that might contain an object, and then it classifies the object within each region. The YOLO framework, on the other hand, integrates these two steps into a single neural network, allowing the model to look at the image only once to predict both the object and its location, hence the name YOLO, or You Only Look Once.
Furthermore, the YOLO framework treats object detection as a regression problem and predicts class probabilities and bounding boxes directly from the full image in a single pass. This not only increases the speed of detection, but also improves the model's ability to generalize from complex and diverse data, making it a suitable choice for real-time applications such as autonomous driving, speed detection, or number plate recognition. Although the rapid progress of deep neural networks in recent years has contributed significantly to the development of object detection frameworks, their success is still limited by a fixed vocabulary: once the object categories are defined and labeled in the dataset, the trained detectors can recognize only those specific categories, which limits the applicability of object detection models in real-time and open scenarios.
Moving along, recently developed vision-language models distill vocabulary knowledge from language encoders to address open-vocabulary detection. Although these frameworks outperform traditional object detection models on open-vocabulary detection, their applicability is still limited by the scarcity of training data with diverse vocabulary. Some frameworks train open-vocabulary object detectors at scale, framing the task as region-level vision-language pre-training. However, this approach still struggles to detect objects in real time for two primary reasons: a complex deployment process for edge devices, and heavy computational requirements. On a positive note, these frameworks have demonstrated that pre-training large detectors can equip them with open-recognition capabilities.
The YOLO-World framework aims to achieve highly efficient open-vocabulary object detection and explores large-scale pre-training approaches to boost the capability of traditional YOLO detectors for open-vocabulary detection. Contrary to previous work in object detection, the YOLO-World framework displays remarkable efficiency with high inference speeds and can be deployed in downstream applications with ease. The YOLO-World model follows the traditional YOLO architecture and encodes input texts by leveraging a pre-trained CLIP text encoder. Furthermore, the framework includes a Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) component in its architecture to connect image and text features for enhanced visual-semantic representations. During inference, the framework removes the text encoder and re-parameterizes the text embeddings into the RepVL-PAN weights, resulting in efficient deployment. The framework also employs region-text contrastive learning to study open-vocabulary pre-training for traditional YOLO models, unifying image-text data, grounding data, and detection data into region-text pairs. Pre-trained on these region-text pairs, the YOLO-World framework demonstrates remarkable capabilities for open and large-vocabulary detection. Additionally, the framework explores a prompt-then-detect paradigm with the aim of enhancing the efficiency of open-vocabulary object detection in real-time, real-world scenarios.
As demonstrated in the following image, traditional object detectors focus on closed-set detection with a fixed, predefined vocabulary, whereas open-vocabulary detectors detect objects by encoding user prompts with text encoders for an open vocabulary. In comparison, YOLO-World's prompt-then-detect approach first builds an offline vocabulary (a varying vocabulary for varying needs) by encoding the user prompts, allowing the detector to use this offline vocabulary at inference time without having to re-encode the prompts.
YOLO-World: Method and Architecture
Region-Text Pairs
Traditionally, object detection frameworks, including the YOLO family of detectors, are trained using instance annotations that contain category labels and bounding boxes. In contrast, the YOLO-World framework re-formulates instance annotations as region-text pairs, where the text can be a description of the object, a noun phrase, or a category name. It is worth pointing out that the YOLO-World framework takes both texts and images as input and outputs predicted boxes with their corresponding object embeddings.
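To make the reformulation concrete, here is a minimal illustration of how a conventional instance annotation could be rewritten as a region-text pair; the field names and values below are purely illustrative, not the framework's actual data format.

```python
# A conventional detection annotation pairs a box with a category index ...
instance_annotation = {"bbox": [48.0, 20.0, 210.0, 180.0], "category_id": 17}

# ... whereas a region-text pair keeps the box but attaches free-form text,
# which may be a category name, a noun phrase, or a description of the region.
region_text_pair = {"bbox": [48.0, 20.0, 210.0, 180.0], "text": "a brown dog lying on the grass"}
```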
Model Architecture
At its core, the YOLO-World model consists of a Text Encoder, a YOLO detector, and the Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) component, as illustrated in the following image.
For an input text, the text encoder component encodes the text into text embeddings, while the image encoder in the YOLO detector component extracts multi-scale features from the input image. The Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) component then exploits cross-modality fusion between the text embeddings and image features to enhance both the text and image representations.
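The data flow described above can be summarized with a short sketch. The PyTorch-style outline below is an assumption-laden illustration, not the official implementation: every submodule passed to the constructor is a placeholder for the corresponding component in this section.

```python
import torch.nn as nn

class YOLOWorldSketch(nn.Module):
    """High-level sketch of YOLO-World's data flow; all submodules are placeholders."""

    def __init__(self, text_encoder, backbone, repvl_pan, head):
        super().__init__()
        self.text_encoder = text_encoder   # frozen CLIP text encoder
        self.backbone = backbone           # YOLOv8 Darknet image encoder
        self.repvl_pan = repvl_pan         # vision-language path aggregation network
        self.head = head                   # bounding box regression + object embedding head

    def forward(self, image, prompts):
        text_embeds = self.text_encoder(prompts)              # one embedding per prompt
        multi_scale_feats = self.backbone(image)               # pyramid of image features
        fused_feats, text_embeds = self.repvl_pan(multi_scale_feats, text_embeds)
        boxes, obj_embeds = self.head(fused_feats)              # predictions + object embeddings
        return boxes, obj_embeds, text_embeds
```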
YOLO Detector
The YOLO-World model is built on top of the existing YOLOv8 framework, which contains a Darknet backbone as its image encoder, a head for object embeddings and bounding box regression, and a PAN, or Path Aggregation Network, for multi-scale feature pyramids.
Text Encoder
For a given text, the YOLO-World model extracts the corresponding text embeddings by adopting a pre-trained CLIP Transformer text encoder, which maps each noun in the vocabulary to a fixed-dimension embedding. The primary reason the YOLO-World framework adopts a CLIP text encoder is that it offers better visual-semantic alignment for connecting texts with visual objects, significantly outperforming traditional text-only language encoders. However, if the input text is a caption or a referring expression, the YOLO-World model uses a simple n-gram algorithm to extract the noun phrases, which are then fed to the text encoder.
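As a concrete illustration, a vocabulary can be embedded with an off-the-shelf CLIP text encoder. The sketch below uses the Hugging Face `openai/clip-vit-base-patch32` checkpoint as a stand-in for the encoder used by YOLO-World; the specific checkpoint and the final normalization step are assumptions for illustration.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_model = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

# User vocabulary: category names or extracted noun phrases
nouns = ["person", "bicycle", "red backpack"]
inputs = tokenizer(nouns, padding=True, return_tensors="pt")

with torch.no_grad():
    text_embeds = text_model(**inputs).text_embeds   # one embedding per noun phrase

# L2-normalize so the embeddings can be compared with object embeddings via cosine similarity
text_embeds = F.normalize(text_embeds, dim=-1)
```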
Text Contrastive Head
A decoupled head is a component used by earlier object detection models, and the YOLO-World framework adopts a decoupled head with dual 3×3 convolutions to regress bounding boxes and object embeddings for a fixed number of objects. The framework then employs a text contrastive head that computes the object-text similarity between the L2-normalized object embeddings and text embeddings. Additionally, the YOLO-World model applies an affine transformation with a shifting factor and a learnable scaling factor; the L2 normalization and the affine transformation together enhance the stability of the model during region-text training.
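In code, the object-text similarity described above might look like the following sketch, where `alpha` and `beta` play the role of the learnable scaling and shifting factors; the tensor shapes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def text_contrastive_similarity(obj_emb, text_emb, alpha, beta):
    """Sketch of the text contrastive head: cosine similarity plus an affine transform.

    obj_emb:  (B, K, D) object embeddings from the decoupled head
    text_emb: (B, C, D) text embeddings from the text encoder
    """
    o = F.normalize(obj_emb, dim=-1)            # L2-normalize object embeddings
    t = F.normalize(text_emb, dim=-1)           # L2-normalize text embeddings
    sim = torch.einsum("bkd,bcd->bkc", o, t)    # object-text cosine similarity (B, K, C)
    return alpha * sim + beta                   # learnable scaling and shifting for stable training
```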
Online Vocabulary Training
During the training phase, the YOLO-World model constructs an online vocabulary for each mosaic sample, which consists of 4 images. The model takes all positive nouns that appear in the mosaic images and randomly samples negative nouns from the corresponding dataset. The vocabulary for each sample contains a maximum of 80 nouns by default.
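A minimal sketch of this vocabulary construction is shown below; the helper name and the sampling details are assumptions, intended only to illustrate how positives from the mosaic are mixed with randomly drawn negatives up to the 80-noun cap.

```python
import random

def build_online_vocabulary(mosaic_nouns, dataset_nouns, max_size=80):
    """Hypothetical helper: build the per-mosaic vocabulary used during training.

    mosaic_nouns:  nouns that actually appear in the 4 mosaic images (positives)
    dataset_nouns: the dataset's full noun list, used to draw negatives
    """
    positives = list(dict.fromkeys(mosaic_nouns))                 # unique positives, order preserved
    num_negatives = max(0, max_size - len(positives))
    candidates = [n for n in dataset_nouns if n not in positives]
    negatives = random.sample(candidates, min(num_negatives, len(candidates)))
    return positives + negatives                                  # at most max_size nouns per sample
```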
Offline Vocabulary Inference
During inference, the YOLO-World model uses a prompt-then-detect strategy with an offline vocabulary to further enhance efficiency. The user first defines a series of custom prompts, which might include categories or even captions. The model then obtains offline vocabulary embeddings by encoding these prompts with the text encoder. As a result, the offline vocabulary helps the model avoid text-encoder computations for each input, and also allows the vocabulary to be adjusted flexibly according to the requirements.
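The prompt-then-detect idea can be pictured as a small caching wrapper: the prompts are encoded once, and the resulting embeddings are reused for every image. The `encode_text` and `detector` callables below are placeholders standing in for the text encoder and detector, not the official API.

```python
import torch

class OfflineVocabulary:
    """Sketch of prompt-then-detect: encode user prompts once, reuse them at inference."""

    def __init__(self, encode_text, prompts):
        self.prompts = prompts
        with torch.no_grad():
            self.embeddings = encode_text(prompts)   # computed once, offline

    def detect(self, detector, image):
        # The text encoder is never run here: the cached embeddings are passed straight
        # to the detector (or re-parameterized into its weights beforehand).
        return detector(image, self.embeddings)
```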
Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN)
The following figure illustrates the structure of the proposed Re-parameterizable Vision-Language Path Aggregation Network, which follows top-down and bottom-up paths to establish a feature pyramid with multi-scale image features.
To enhance the interaction between text and image features, the YOLO-World model proposes an Image-Pooling Attention and a Text-Guided CSPLayer (Cross-Stage Partial Layers), with the ultimate aim of improving the visual-semantic representations for open-vocabulary capabilities. During inference, the YOLO-World model re-parameterizes the offline vocabulary embeddings into the weights of the linear or convolutional layers for efficient deployment.
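One way to picture this re-parameterization is to fold the offline vocabulary embeddings into the weights of a 1×1 convolution, so that the per-class similarity map becomes a plain convolution over the object embedding map. The sketch below is an illustration under assumed shapes, not the exact procedure used in the released code.

```python
import torch
import torch.nn as nn

def reparameterize_vocabulary(text_embeds):
    """Sketch: fold offline vocabulary embeddings (C, D) into a 1x1 convolution.

    After this step the text encoder is no longer needed at inference time:
    the per-class similarity map is produced by a plain conv over the
    D-channel object embedding map.
    """
    C, D = text_embeds.shape
    conv = nn.Conv2d(D, C, kernel_size=1, bias=False)
    conv.weight.data = text_embeds.view(C, D, 1, 1).clone()   # text embeddings become conv weights
    return conv

# usage sketch: class_logit_map = reparameterize_vocabulary(text_embeds)(object_embedding_map)
```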
As can be seen in the above figure, the YOLO-World model extends the CSPLayer used after the top-down or bottom-up fusion by incorporating text guidance into the multi-scale image features, forming the Text-Guided CSPLayer. For a given image feature and its corresponding text embeddings, the model applies max-sigmoid attention after the last bottleneck block to aggregate text features into the image features. The updated image feature is then concatenated with the cross-stage features and presented as the output.
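The max-sigmoid attention can be sketched as follows: each spatial location's similarity to every vocabulary embedding is reduced with a max, squashed with a sigmoid, and used to scale the image feature at that location. The tensor shapes, and the omission of normalization and per-level details, are assumptions for illustration.

```python
import torch

def max_sigmoid_attention(image_feat, text_emb):
    """Sketch of max-sigmoid attention in the Text-Guided CSPLayer.

    image_feat: (B, D, H, W) one level of the multi-scale image features
    text_emb:   (B, C, D) text embeddings for the C vocabulary entries
    """
    B, D, H, W = image_feat.shape
    x = image_feat.flatten(2).transpose(1, 2)           # (B, H*W, D)
    sim = torch.einsum("bnd,bcd->bnc", x, text_emb)     # region-text similarity (B, H*W, C)
    gate = sim.max(dim=-1).values.sigmoid()             # max over texts, squashed to (0, 1)
    out = x * gate.unsqueeze(-1)                        # scale each location by its gate
    return out.transpose(1, 2).reshape(B, D, H, W)
```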
Moving on, the YOLO-World model aggregates image features to update the text embeddings by introducing an Image-Pooling Attention layer, which enhances the text embeddings with image-aware information. Instead of applying cross-attention directly to the image features, the model uses max pooling on the multi-scale features to obtain 3×3 regions per scale, resulting in 27 patch tokens in total, which are then used to update the text embeddings.
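A simplified version of the Image-Pooling Attention is sketched below: each pyramid level is max-pooled to a 3×3 grid, the pooled tokens from all levels are concatenated, and the text embeddings attend to them. The sketch assumes a shared embedding dimension across levels and omits the projection layers the real module would need.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImagePoolingAttention(nn.Module):
    """Sketch: update text embeddings with image-aware context from pooled pyramid features."""

    def __init__(self, embed_dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, text_emb, multi_scale_feats):
        # text_emb: (B, C, D); multi_scale_feats: list of (B, D, H_i, W_i) pyramid levels
        tokens = []
        for feat in multi_scale_feats:
            pooled = F.adaptive_max_pool2d(feat, output_size=3)   # (B, D, 3, 3): 9 regions per level
            tokens.append(pooled.flatten(2).transpose(1, 2))      # (B, 9, D)
        patches = torch.cat(tokens, dim=1)                        # (B, 27, D) for 3 levels
        updated, _ = self.attn(query=text_emb, key=patches, value=patches)
        return text_emb + updated                                 # residual update of text embeddings
```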
Pre-Training Schemes
The YOLO-World model follows two primary pre-training schemes: learning from region-text contrastive loss and pseudo labeling with image-text data. For the first scheme, the model outputs object predictions along with annotations for a given text and mosaic samples. The YOLO-World framework matches these predictions with the ground-truth annotations by leveraging task-aligned label assignment, and assigns each positive prediction a text index that serves as its classification label. The pseudo-labeling scheme, on the other hand, uses an automated labeling approach to generate region-text pairs from image-text data rather than relying on manual annotation. The proposed labeling approach consists of three steps: extracting noun phrases, pseudo labeling, and filtering. The first step uses the n-gram algorithm to extract noun phrases from the input text; the second step adopts a pre-trained open-vocabulary detector to generate pseudo boxes for each noun phrase in the individual images; and the third and final step employs a pre-trained CLIP framework to evaluate the relevance of the region-text and image-text pairs, after which the model filters out low-relevance pseudo images and annotations.
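The three-step labeling pipeline can be outlined as below. The `noun_extractor`, `ov_detector`, and `clip_scorer` callables, along with the relevance threshold, are placeholders standing in for the n-gram extractor, the pre-trained open-vocabulary detector, and the pre-trained CLIP scorer described above.

```python
def pseudo_label(image, caption, noun_extractor, ov_detector, clip_scorer, threshold=0.3):
    """Sketch of the three-step automatic labeling pipeline; all callables are placeholders."""
    phrases = noun_extractor(caption)                          # step 1: extract noun phrases
    region_text_pairs = []
    for phrase in phrases:
        for box in ov_detector(image, phrase):                 # step 2: pseudo boxes per phrase
            if clip_scorer(image, box, phrase) >= threshold:   # step 3: CLIP relevance filtering
                region_text_pairs.append((box, phrase))
    return region_text_pairs
```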
YOLO-World : Results
Once the YOLO-World model has been pre-trained, it is evaluated directly on the LVIS dataset in a zero-shot setting. The LVIS dataset contains over 1,200 categories, significantly more than the datasets used to pre-train existing frameworks, making it a demanding benchmark for large-vocabulary detection. The following figure compares the performance of the YOLO-World framework with some of the existing state-of-the-art object detection frameworks on the LVIS dataset in a zero-shot setting.
As can be observed, the YOLO-World framework outperforms a majority of existing frameworks in terms of inference speed and zero-shot performance, even compared with frameworks like Grounding DINO, GLIP, and GLIPv2 that incorporate more data. Overall, the results demonstrate that small object detection models like YOLO-World-S, with only 13 million parameters, can be pre-trained on vision-language tasks with remarkable open-vocabulary capabilities.
Final Thoughts
In this article, we have talked about YOLO-World, an innovative approach that aims to enhance the abilities of the YOLO, or You Only Look Once, framework with open-vocabulary detection capabilities by pre-training the framework on large-scale datasets and implementing a vision-language modeling approach. More specifically, the YOLO-World framework proposes a Re-parameterizable Vision-Language Path Aggregation Network, or RepVL-PAN, along with region-text contrastive loss to facilitate interaction between linguistic and visual information. By implementing RepVL-PAN and region-text contrastive loss, the YOLO-World framework is able to accurately and efficiently detect a wide range of objects in a zero-shot setting.