Why use image captions?

Some people hate to write them, some ignore them, and others rely on them. But regardless of opinion, they make websites and articles accessible to the visually impaired. If that isn’t incentive enough for you, they also help improve your SEO ranking. When used in your own database, they make images searchable by text input. There’s no real argument against image captions, other than “they are so annoying to write”. But AI comes to the rescue: you don’t have to write your image captions yourself anymore (unless you want to; in that case, knock yourself out).

AI captions

In the early 2000s, researchers started working on the problem of teaching AI to capture the essence of an image in a textual caption and not just in a classification label. And while this problem cannot yet be labelled as solved (to be honest, which AI problem is ever declared as solved?), models have certainly developed to a point where their outputs are useful, coherent captions that a human could have written. In most cases at least. A comparatively small model that outputs detailed captions is BLIP. (If you don’t care about how BLIP works and just want it to help you do your job more efficiently, skip this next part.)

BLIP by Salesforce

Like many AI models, BLIP was given a whimsical name based on an acronym: Bootstrapping Language-Image Pre-training. This model consists of four modules, enabling it to solve a whole variety of Computer Vision and Natural Language Processing tasks. The image encoder module is trained together with a text encoder module, which encourages a similar encoding of images and text that belong together. The two other modules are an image-grounded text encoder and an image-grounded text decoder. These two share most of their parameters during pre-training, but not during fine-tuning.
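
To make “a similar encoding of images and text that belong together” a bit more tangible, here is a minimal sketch. The vectors are invented toy embeddings, not outputs of BLIP, and the helper names are hypothetical. It only shows the image-to-text half of a contrastive loss: the loss is low when each image is closest to its own text and high when the pairs are mixed up.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def contrastive_loss(image_vecs, text_vecs, temperature=0.1):
    """Average cross-entropy where image i should pick text i out of all texts."""
    total = 0.0
    for i, img in enumerate(image_vecs):
        logits = [cosine_similarity(img, txt) / temperature for txt in text_vecs]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_sum - logits[i]
    return total / len(image_vecs)

# Toy embeddings (invented): index i of each list belongs together.
images = [[1.0, 0.1], [0.1, 1.0]]
aligned_texts = [[0.9, 0.2], [0.2, 0.9]]
swapped_texts = [[0.2, 0.9], [0.9, 0.2]]

loss_aligned = contrastive_loss(images, aligned_texts)
loss_swapped = contrastive_loss(images, swapped_texts)
print(loss_aligned < loss_swapped)
```

Training pushes the encoders towards the low-loss situation, which is what makes matching images and texts end up close to each other.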

Furthermore, BLIP introduces a new approach to improving noisy image-text data sets built from web-crawled image and alt-text pairs. This approach is called CapFilt and is used to clean up the data BLIP is trained on. CapFilt consists of two modules: a captioner that generates new synthetic captions and a filter that removes noisy image-text pairs. The captioner is initialised with the pre-trained image-grounded text decoder and the filter with the pre-trained image-grounded text encoder. The captioner/decoder uses nucleus sampling, giving BLIP the ability to generate various captions for the same image. If this behaviour is unwanted, the captioner/decoder can also use beam search to create a deterministic caption that will always be the same for the same image.
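
The CapFilt idea can be sketched in a few lines. Everything here is a stand-in: `toy_captioner` and `toy_match_score` are hypothetical placeholders for the learned captioner and filter, images are represented as sets of tags, and the word-overlap score is only an assumption that keeps the example runnable.

```python
def toy_captioner(image_tags):
    """Hypothetical captioner: turns the objects 'seen' in an image into text."""
    return " and ".join(sorted(image_tags))

def toy_match_score(image_tags, caption):
    """Hypothetical filter score: share of caption words that match the image."""
    words = caption.lower().split()
    if not words:
        return 0.0
    return sum(word in image_tags for word in words) / len(words)

def capfilt(web_pairs, threshold=0.5):
    """Add a synthetic caption per image, then keep only well-matching pairs."""
    cleaned = []
    for image_tags, web_text in web_pairs:
        # Both the original alt-text and the synthetic caption pass the filter.
        candidates = [web_text, toy_captioner(image_tags)]
        for caption in candidates:
            if toy_match_score(image_tags, caption) >= threshold:
                cleaned.append((image_tags, caption))
    return cleaned

web_pairs = [
    ({"dogs", "snow"}, "dogs in snow"),            # alt-text matches the image
    ({"dogs", "snow"}, "buy cheap shoes online"),  # noisy alt-text
]
cleaned_pairs = capfilt(web_pairs)
for tags, caption in cleaned_pairs:
    print(caption)
```

In this toy run the noisy alt-text is dropped, while the matching alt-text and the synthetic captions survive.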

Training on the cleaned-up data helps BLIP learn to create state-of-the-art captions even though the original data set is noisy. There are many other vision-language tasks on which BLIP outperforms older state-of-the-art models. For more details, you can read the paper here: https://arxiv.org/abs/2201.12086

Multiple Nucleus Sampling Captions: “two dogs playing in the snow wearing coats”, “two dogs are playing in the snow on a snowy hill”, “two dogs play in the snow in their coats” (Image by Jan Walter Luigi)
Beam Search Caption: “two dogs playing in the snow with a frisbee”
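
Why does one strategy give varying captions and the other always the same one? Here is a minimal sketch for a single next-word choice. The probabilities are invented, the function names are hypothetical, and the deterministic part is a greedy pick, which corresponds to beam search with only one beam rather than the beam search BLIP actually runs.

```python
import random

def nucleus_sample(probs, top_p=0.9, rng=random):
    """Sample a token from the smallest set whose probabilities sum to >= top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, p in ranked:
        nucleus.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    tokens = [token for token, _ in nucleus]
    weights = [p for _, p in nucleus]
    return rng.choices(tokens, weights=weights, k=1)[0]

def greedy_pick(probs):
    """Deterministic choice: always the single most likely token."""
    return max(probs, key=probs.get)

# Invented next-word probabilities after "two dogs playing in the snow ..."
next_word = {"wearing": 0.4, "with": 0.35, "on": 0.2, "banana": 0.05}

rng = random.Random(0)  # fixed seed so the demo is repeatable
samples = {nucleus_sample(next_word, top_p=0.9, rng=rng) for _ in range(200)}
print(sorted(samples))         # only words from the nucleus, never "banana"
print(greedy_pick(next_word))  # the same word on every call
```

Repeating this choice word by word is what produces different captions under nucleus sampling, while the deterministic strategy walks the same path every time.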

Let the AI work for you

In this part, you will finally learn how AI can make your life easier. Depending on how you wish to use BLIP for image captioning, you will find the explanation in one of the following sections:

Just gimme the caption

To get a good caption out of this AI, you won’t have to program anything yourself. Thanks to Hugging Face Spaces, you can simply drag and drop an image into a rectangle on a website. How convenient is that?! For captions in English, you can use the original BLIP space by Salesforce:  https://huggingface.co/spaces/Salesforce/BLIP  You will have to select “Image Captioning” (sorry to be Captain Obvious) and leave the question field blank. (If you want to know what that is for, check out this article and scroll to “Visual Question Answering”.)

Then you can choose between “Beam Search” and “Nucleus Sampling”. Beam search is deterministic, meaning it will always give you the same output for the same input. Nucleus sampling is stochastic: it samples from the model’s probabilities, so you can get different outputs for the same input (try it by hitting the send button again after you already got a caption). Once you hit send, the AI needs anywhere from a few seconds to a minute to compute a good caption. A seconds counter can be found in the top right corner; if the process takes longer than a minute, there might be problems with the website.

To generate captions in another language, such as German, French, Spanish, Ukrainian, Swedish, Arabic, Italian, Hindi or Chinese, you can do the same but use this BLIP space that I created instead:  https://huggingface.co/spaces/sophiaaez/BLIPsinki  If the language that you need is not on there yet, feel free to reach out so that I can add it.

Side note: if you change the language or settings after hitting send, the caption does not update on its own. In this case, you will have to press send again.

I want to integrate and customise

BLIP can easily be integrated into your own software via an HTTP request. BLIP is available in a Hugging Face space, which has an automatically generated API provided by Gradio. In Python, accessing it looks as follows:

1. Don’t forget to install and import the necessary libraries.

import requests
import gradio

2. Load the image from the file straight into a base64 string with the help of Gradio.

image_path = 'path/to/image.jpg'  # placeholder: replace with the path to your image
b64_string = gradio.processing_utils.encode_url_or_file_to_base64(image_path)

3. Choose a strategy, meaning choose between beam search (deterministic) and nucleus sampling (stochastic):

strategy = 'Beam search'  # or 'Nucleus sampling'

4. Make the request:

response = requests.post(
    url='https://hf.space/embed/Salesforce/BLIP/+/api/predict/',
    json={"data": [b64_string, "Image Captioning", "None", strategy]},
)

5. Unpack the response:

jres = response.json()
caption = jres["data"][0]

6. If you then want your caption translated, I trust you to figure out how to use Hugging Face Transformers with ready-to-use trained models. More information on that can be found here: https://huggingface.co/tasks/translation
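
If you want to check the request body and the response handling without Gradio or network access, the pieces from steps 2, 4 and 5 can be sketched with the standard library alone. The helper names are hypothetical, and the data URL format (`data:<mime type>;base64,<payload>`) is my assumption about what the Gradio helper produces, so treat this as a sketch rather than a drop-in replacement.

```python
import base64
import mimetypes

def to_data_url(image_bytes, filename):
    """Encode raw image bytes as a base64 data URL (assumed input format)."""
    mime_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"

def build_payload(image_bytes, filename, strategy="Beam search"):
    """Assemble the JSON body in the same order as the request in step 4."""
    if strategy not in ("Beam search", "Nucleus sampling"):
        raise ValueError(f"unknown strategy: {strategy}")
    data_url = to_data_url(image_bytes, filename)
    return {"data": [data_url, "Image Captioning", "None", strategy]}

def extract_caption(response_json):
    """Pull the caption out of the parsed response, as in step 5."""
    return response_json["data"][0]

# Usage with fake bytes and a fake response, so nothing is sent anywhere.
payload = build_payload(b"not-a-real-image", "dogs.jpg", strategy="Nucleus sampling")
fake_response = {"data": ["two dogs playing in the snow"]}
print(payload["data"][1:])
print(extract_caption(fake_response))
```

Validating the strategy string up front catches typos such as “Beam Search” before a request is ever made.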

If you wish to make changes to the parameters for beam search or nucleus sampling or generally play around some more, you will have to clone the GitHub repository: https://github.com/salesforce/BLIP 


Stage-Image by Cole Keister
Teaser-Image by Annie Spratt