How to leverage BLIP Transformer model for image captioning

In this article, we will use the BLIP transformer model from the Hugging Face platform to caption images with a Python script.


Hugging Face Transformers

Hugging Face Transformers is a popular open-source Python library that provides tools to easily download, train, and deploy state-of-the-art pre-trained machine learning models.

Supported Modalities

The library was originally built for natural language processing, but it has since expanded to support multiple AI fields.

Text: Text generation, translation, summarization, and sentiment analysis.
Vision: Image classification, object detection, and image segmentation.
Audio: Automatic speech recognition and audio classification.
Multimodal: Combining text, vision, and audio for tasks like visual question answering.

Hugging Face Transformers supports multimodal learning, which combines text and image data for tasks such as image captioning and visual question answering.

This capability is directly relevant to Bootstrapping Language-Image Pre-training (BLIP), which uses both text and image data to improve a model's understanding and generation of image descriptions.

Bootstrapping Language-Image Pre-training (BLIP)

BLIP (Bootstrapping Language-Image Pre-training) is a powerful AI framework designed to bridge the gap between computer vision and natural language processing. Developed by Salesforce Research, it allows machines to “see” images and talk about them with high accuracy.

It enhances the model's ability to understand and generate image descriptions. It learns to associate images with relevant text, allowing it to generate captions, answer image-related questions, and support image-based search queries.

If you would rather watch a video, here is the YouTube video covering the same step-by-step procedure outlined below.

Procedure

Step1: Install pre-requisite libraries

First, we will install the libraries below for this demo. Let us briefly look at what each one does.

Torch (torch): The core PyTorch deep learning framework. It provides the tensor operations and automatic differentiation engine needed to build and train neural networks.
Transformers (transformers): Developed by Hugging Face, this library gives you access to thousands of pre-trained state-of-the-art models (like BERT, GPT, and Vision Transformers).
Pillow (pillow): A maintained fork of PIL (the Python Imaging Library) and the de facto standard image processing library in Python. It handles opening, manipulating, and saving image files.
Torchvision (torchvision): A companion library for PyTorch specifically tailored for computer vision.
Torchaudio (torchaudio): The audio equivalent of Torchvision. It handles loading audio files, performing feature extraction (like spectrograms), and running pre-trained audio models. The captioning script in this article does not import it directly.

admin@fedser:computervision$ pip install transformers Pillow torch torchvision torchaudio
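
To confirm that the installation worked, a quick check like the one below can help. It is a minimal sketch that uses only the Python standard library; the helper name `installed_versions` is made up for this illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as passed to pip in the install command
REQUIRED = ["transformers", "Pillow", "torch", "torchvision", "torchaudio"]


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Print one line per package so a missing one is easy to spot
for name, ver in installed_versions(REQUIRED).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED can be installed again with pip before moving on.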

Step2: Download Sample Image

Let's download a sample image that we will use for image captioning. It is the BLIP demo image, hosted in a Google Cloud Storage bucket.

admin@fedser:computervision$ wget https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg -O demo.jpg
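
If wget is not available, the same download can be sketched in Python with the standard library. The helper names `looks_like_jpeg` and `download_image` are illustrative, and the magic-byte check is only a rough sanity test, not full image validation.

```python
from urllib.request import urlopen

# Same URL as in the wget command
DEMO_URL = "https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg"


def looks_like_jpeg(data: bytes) -> bool:
    """JPEG files start with the bytes FF D8 FF (start-of-image marker)."""
    return data[:3] == b"\xff\xd8\xff"


def download_image(url: str, dest: str, timeout: float = 30.0) -> int:
    """Download url to dest and return the number of bytes written."""
    with urlopen(url, timeout=timeout) as response:
        data = response.read()
    if not looks_like_jpeg(data):
        raise ValueError(f"{url} did not return JPEG data")
    with open(dest, "wb") as fh:
        fh.write(data)
    return len(data)
```

Calling `download_image(DEMO_URL, "demo.jpg")` should leave a `demo.jpg` file in the current directory, just like the wget command.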

Step3: Generate a Python Script to Caption Image

Here is the sample Python script that we will use to caption an image with the BLIP model from Hugging Face. It relies on two classes:

BlipProcessor: handles preprocessing. It resizes and normalizes the image and tokenizes any text into the tensors the model expects.
BlipForConditionalGeneration: the neural network itself, which outputs token sequences (captions).
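
To make the "normalizes" part concrete, here is a rough sketch of the per-pixel arithmetic. It assumes the per-channel mean and standard deviation that BLIP image processors are commonly configured with (the CLIP statistics). Those numbers and the helper name `normalize_pixel` are assumptions for illustration, so inspect `processor.image_processor` for the values your checkpoint actually uses.

```python
# Assumed per-channel statistics (RGB); verify against your checkpoint's config
IMAGE_MEAN = (0.48145466, 0.4578275, 0.40821073)
IMAGE_STD = (0.26862954, 0.26130258, 0.27577711)


def normalize_pixel(rgb):
    """Rescale 0-255 channel values to 0-1, then standardize each channel."""
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb, IMAGE_MEAN, IMAGE_STD)
    )


print(normalize_pixel((0, 0, 0)))        # darkest pixel: every channel is negative
print(normalize_pixel((255, 255, 255)))  # brightest pixel: every channel is positive
```

The real processor also resizes the image first and applies this arithmetic to the whole image tensor at once, so this sketch covers only the normalization step.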

admin@fedser:computervision$ cat imagecaption.py 
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

# Initialize the processor and model from Hugging Face
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Load the image and convert it to RGB, which the model expects
image = Image.open("demo.jpg").convert("RGB")

# Prepare image
inputs = processor(image, return_tensors="pt")

# Generate caption token IDs and decode them into text
outputs = model.generate(**inputs, max_length=100)
caption = processor.decode(outputs[0], skip_special_tokens=True)

print("Generated Caption: " + caption)
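
BLIP captioning checkpoints can also continue a text prompt instead of captioning from scratch. The sketch below wraps both modes in one function. The function name `caption_image`, the example prompt, and the `max_new_tokens` value are choices made for this illustration rather than part of the script above. The heavy imports sit inside the function so that merely defining it stays cheap.

```python
def caption_image(image_path, prompt=None,
                  model_id="Salesforce/blip-image-captioning-base"):
    """Caption an image; if prompt is given, the caption continues that text."""
    # Imported lazily because loading these libraries is slow
    import torch
    from PIL import Image
    from transformers import BlipProcessor, BlipForConditionalGeneration

    processor = BlipProcessor.from_pretrained(model_id)
    model = BlipForConditionalGeneration.from_pretrained(model_id)

    image = Image.open(image_path).convert("RGB")
    if prompt is None:
        # Unconditional captioning: the image is the only input
        inputs = processor(images=image, return_tensors="pt")
    else:
        # Conditional captioning: the model continues the given text
        inputs = processor(images=image, text=prompt, return_tensors="pt")

    # Inference only, so gradient tracking is switched off
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(outputs[0], skip_special_tokens=True)
```

For example, `caption_image("demo.jpg")` captions the image from scratch, while `caption_image("demo.jpg", prompt="a photography of")` asks the model to complete that phrase. Note that this sketch reloads the model on every call; for many images it would be better to load the processor and model once and reuse them.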

Step4: Validate Image Captioning Script

Now you can run the Python script and see it produce a caption for our image, as shown below.

admin@fedser:computervision$ python imagecaption.py 
Loading weights: 100%|█████████████████████████████████████████████████████████████████████████████| 473/473 [00:00<00:00, 22816.37it/s]
Generated Caption: a woman sitting on the beach with her dog

Hope you enjoyed reading this article. Thank you.
