Good Caption Hunting

Sophia Zell

Last modified: 29.07.2022 (5-minute read)

Thanks to AI, you’ll never have to write another image caption again if you don’t want to. In this article, you will learn how to use image captioning AI regardless of your tech background.

Why use image captions?

Some people hate writing them, some ignore them, and others rely on them. Whatever your opinion, they make websites and articles accessible to visually impaired readers. If that isn’t incentive enough for you, they also help improve your SEO ranking. Used in your own database, they make images searchable by text input. There is no real argument against image captions other than “they are so annoying to write”. But AI comes to the rescue: you don’t have to write your image captions yourself anymore (unless you want to; in that case, knock yourself out).

AI captions

In the early 2000s, the problem of teaching AI to capture the essence of an image in a textual caption, and not just a classification label, was introduced. While this problem cannot yet be labelled as solved (to be honest, which AI problem is ever declared solved?), models have certainly developed to a point where their outputs are useful, coherent captions that a human could have written. In most cases at least. An especially small model that outputs detailed captions is BLIP. (If you don’t care about how BLIP works and just want it to help you do your job more efficiently, skip this next part.)

BLIP by Salesforce

Like many AI models, BLIP was given a whimsical name based on an acronym: Bootstrapping Language-Image Pre-training. This model consists of four modules, enabling it to solve a whole variety of Computer Vision and Natural Language Processing tasks. The image encoder module is trained together with a text encoder module, which encourages a similar encoding of images and text that belong together. The two other modules are an image-grounded text encoder and an image-grounded text decoder. These two share most of their parameters during pre-training, but not during fine-tuning.
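To make “a similar encoding of images and text that belong together” a bit more concrete, here is a minimal sketch of a contrastive objective, one common way to encourage this. It is an illustration only, not BLIP's training code: the two-dimensional vectors, the temperature value and the function names are all invented for this example, and only the image-to-text direction is shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Average image-to-text cross-entropy over a batch.

    image_embs[i] and text_embs[i] are assumed to belong together;
    every other text in the batch counts as a negative example.
    """
    total = 0.0
    for i, img in enumerate(image_embs):
        logits = [cosine(img, txt) / temperature for txt in text_embs]
        log_sum_exp = math.log(sum(math.exp(x) for x in logits))
        total += log_sum_exp - logits[i]  # -log softmax of the matching pair
    return total / len(image_embs)


# Toy embeddings (invented): index 0 = "dogs in snow", index 1 = "city at night".
images = [[1.0, 0.1], [0.1, 1.0]]
aligned_texts = [[0.9, 0.2], [0.2, 0.9]]     # each text points the same way as its image
misaligned_texts = [[0.2, 0.9], [0.9, 0.2]]  # texts swapped

loss_aligned = contrastive_loss(images, aligned_texts)
loss_misaligned = contrastive_loss(images, misaligned_texts)
print("aligned loss is lower:", loss_aligned < loss_misaligned)
```

Minimising such a loss pulls matching image and text vectors together and pushes non-matching ones apart, which is the behaviour the paragraph above describes.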

Furthermore, BLIP introduces a new approach to improving noisy image-text data sets from web-crawled image and alt-text pairs. This approach is called CapFilt and is used for fine-tuning BLIP. CapFilt consists of two modules, a captioner that generates new synthetic captions and a filter that removes noisy text-image pairs. The two modules are initialised with the pre-trained image-grounded text decoder as the captioner and the pre-trained image-grounded text encoder as the filter. The captioner/decoder uses nucleus sampling, giving BLIP the ability to generate various captions for the same image. If this behaviour is unwanted, the captioner/decoder can also use beam search for creating a deterministic caption that will always be the same for the same image.

This style of fine-tuning helps BLIP learn to create state-of-the-art captions based on a noisy data set. There are many other vision-language tasks on which BLIP outperforms older state-of-the-art models. For more details, you can read the paper here: https://arxiv.org/abs/2201.12086
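The CapFilt idea described above can be sketched as a small data-cleaning loop. This is a toy illustration under loud assumptions: `toy_captioner` and `toy_filter_score` are hypothetical stand-ins for the captioner and the filter, the word-overlap score and the 0.5 threshold are invented, and the real models look at image pixels rather than tag lists.

```python
def toy_captioner(image_tags):
    """Hypothetical stand-in for the captioner: builds a caption from tags."""
    return "a photo of " + " and ".join(image_tags)


def toy_filter_score(image_tags, text):
    """Hypothetical stand-in for the filter: share of tags mentioned in the text."""
    words = text.lower().split()
    hits = sum(1 for tag in image_tags if tag in words)
    return hits / len(image_tags)


def capfilt(web_pairs, threshold=0.5):
    """Return cleaned (image_tags, text) pairs.

    For every web image, both the original alt-text and a synthetic caption
    are candidates; each candidate is kept only if the filter accepts it.
    """
    cleaned = []
    for image_tags, alt_text in web_pairs:
        candidates = [alt_text, toy_captioner(image_tags)]
        for text in candidates:
            if toy_filter_score(image_tags, text) >= threshold:
                cleaned.append((image_tags, text))
    return cleaned


# Invented web-crawled data: the second alt-text is noise.
web_pairs = [
    (["dogs", "snow"], "two dogs playing in the snow"),
    (["cat", "sofa"], "IMG_4032.jpg click to enlarge"),
]

cleaned_pairs = capfilt(web_pairs)
for tags, text in cleaned_pairs:
    print(tags, "->", text)
```

In this toy run, the noisy alt-text of the second image is dropped, while its synthetic caption survives, so the image is not lost from the data set.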

Multiple nucleus sampling captions: “two dogs playing in the snow wearing coats.” “two dogs are playing in the snow on a snowy hill.” “two dogs play in the snow in their coats.” (Image by Jan Walter Luigi)
Beam search caption: “two dogs playing in the snow with a frisbee.”
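Why does nucleus sampling give several captions while beam search gives exactly one? The following sketch shows both decoding strategies on a tiny word-by-word model. It is a simplified illustration, not BLIP's implementation: `TOY_MODEL` and its probabilities are invented, and the `top_p` and `beam_width` defaults are arbitrary choices for the example.

```python
import math
import random

# Invented toy "language model": next-word probabilities given the previous word.
TOY_MODEL = {
    "<start>": {"two": 1.0},
    "two": {"dogs": 1.0},
    "dogs": {"playing": 0.5, "play": 0.3, "running": 0.2},
    "playing": {"outside": 0.6, "together": 0.4},
    "play": {"outside": 0.9, "together": 0.1},
    "running": {"outside": 0.5, "together": 0.5},
    "outside": {"<end>": 1.0},
    "together": {"<end>": 1.0},
}


def nucleus_sample(probs, top_p, rng):
    """Sample from the smallest set of words whose probabilities sum to >= top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for word, p in ranked:
        nucleus.append((word, p))
        cumulative += p
        if cumulative >= top_p:
            break
    words = [w for w, _ in nucleus]
    weights = [p for _, p in nucleus]  # random.choices treats these as relative weights
    return rng.choices(words, weights=weights, k=1)[0]


def generate_nucleus(top_p=0.9, seed=None):
    """Stochastic decoding: different seeds can give different captions."""
    rng = random.Random(seed)
    word, caption = "<start>", []
    while True:
        word = nucleus_sample(TOY_MODEL[word], top_p, rng)
        if word == "<end>":
            return " ".join(caption)
        caption.append(word)


def generate_beam(beam_width=2):
    """Deterministic decoding: keep the beam_width most probable partial captions."""
    beams = [(0.0, ["<start>"])]  # (log-probability, words so far)
    finished = []
    while beams:
        candidates = []
        for logp, words in beams:
            for nxt, p in TOY_MODEL[words[-1]].items():
                candidates.append((logp + math.log(p), words + [nxt]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for logp, words in candidates[:beam_width]:
            if words[-1] == "<end>":
                finished.append((logp, words))
            else:
                beams.append((logp, words))
    best = max(finished, key=lambda c: c[0])
    return " ".join(best[1][1:-1])  # drop the <start> and <end> markers


print("beam:   ", generate_beam())
print("nucleus:", {generate_nucleus(seed=s) for s in range(20)})
```

Beam search has no random component, so repeated calls return the same caption. Nucleus sampling draws from the most probable words at each step, so the result depends on the random seed.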

Let the AI work for you

In this part, you will finally learn how AI can make your life easier. Depending on how you wish to use BLIP for image captioning, you will find the explanation in one of the following sections:

Just gimme the caption

To get a good caption out of this AI, you won’t have to program anything yourself. Thanks to Hugging Face Spaces, you can simply drag and drop an image into a rectangle on a website! How convenient is that?! For captions in English, you can use the original BLIP space by Salesforce: https://huggingface.co/spaces/Salesforce/BLIP You will have to select “Image Captioning” (sorry to be Captain Obvious) and leave the question field blank. (If you want to know what that is for, check out this article and scroll to “Visual Question Answering”.) Then you can choose between “Beam Search” and “Nucleus Sampling”. Beam search is deterministic, meaning it will always give you the same output for the same input. Nucleus sampling is stochastic: it samples from a probability distribution, so you can get different outputs for the same input (try it by hitting the send button again after you already got a caption). Once you hit send, the AI needs between a few seconds and a minute to compute a good caption and, if needed, translate it. A seconds counter can be found in the top right corner; if the process takes longer than a minute, there might be problems with the website.

For generating captions in another language, such as German, French, Spanish, Ukrainian, Swedish, Arabic, Italian, Hindi or Chinese, you can do the same but use this BLIP space that I created instead: https://huggingface.co/spaces/sophiaaez/BLIPsinki If the language that you need is not on there yet, feel free to reach out so that I can add it.

Side note: if you change the language or settings after hitting send, nothing changes. In this case, you will have to press send again.

I want to integrate and customise

BLIP can easily be integrated into your own software via an HTTP request. It is available in a Hugging Face Space, which comes with an API automatically generated by Gradio. In Python, accessing it looks as follows:

1. Don’t forget to install and import the necessary libraries.

import requests  # sends the HTTP request to the Space
import gradio    # provides the base64 helper used in step 2

2. Load the image from the file straight into a base64 string with the help of Gradio.

image_path = 'my_image.jpg'  # placeholder: replace with the path to your own image file
b64_string = gradio.processing_utils.encode_url_or_file_to_base64(image_path)

3. Choose a strategy, meaning choose between beam search (deterministic) and nucleus sampling (stochastic):

strategy = 'Beam search'  # or 'Nucleus sampling'

4. Make the request:

payload = {"data": [b64_string, "Image Captioning", "None", strategy]}
response = requests.post(url='https://hf.space/embed/Salesforce/BLIP/+/api/predict/', json=payload)

5. Unpack the response:

jres = response.json()  # parse the JSON body of the response
caption = jres["data"][0]  # the first entry holds the caption text

6. WELL DONE!
7. If you then want your caption translated, I trust you to figure out how to use Hugging Face Transformers with ready-to-use trained models. More information on that can be found here: https://huggingface.co/tasks/translation
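Steps 1 to 5 can also be bundled into a few small functions. The sketch below deliberately swaps in Python's standard library (`base64`, `mimetypes`, `urllib.request`) for `gradio` and `requests`, so it is an alternative under assumptions rather than the method shown above. It assumes the endpoint accepts a base64 data URI and still answers at the URL from step 4, which may have changed since this article was written. All function names are made up for this example.

```python
import base64
import json
import mimetypes
import urllib.request

# Endpoint copied from step 4; it may have changed since the article was written.
API_URL = "https://hf.space/embed/Salesforce/BLIP/+/api/predict/"


def file_to_data_uri(image_path):
    """Read an image file and return it as a base64 data URI string (assumed input format)."""
    mime_type, _ = mimetypes.guess_type(image_path)
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime_type or 'application/octet-stream'};base64,{encoded}"


def build_payload(b64_string, strategy="Beam search"):
    """Build the JSON body with the fields in the same order as in step 4."""
    if strategy not in ("Beam search", "Nucleus sampling"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    return {"data": [b64_string, "Image Captioning", "None", strategy]}


def extract_caption(response_json):
    """Unpack the caption from a response shaped like the one in step 5."""
    return response_json["data"][0]


def caption_image(image_path, strategy="Beam search", timeout=120):
    """Send the image to the Space and return the caption (needs network access)."""
    payload = build_payload(file_to_data_uri(image_path), strategy)
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_caption(json.loads(response.read().decode("utf-8")))
```

With an image file of your own, a call would look like `caption = caption_image("my_image.jpg", strategy="Nucleus sampling")`. Keeping the payload and the unpacking in separate functions means they can be checked without touching the network.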

If you wish to make changes to the parameters for beam search or nucleus sampling or generally play around some more, you will have to clone the GitHub repository: https://github.com/salesforce/BLIP

Stage-Image by Cole Keister
Teaser-Image by Annie Spratt

The opinions and information stated in this article are personal to the individual author and do not necessarily represent Bertelsmann.

Sophia Zell, Data Scientist, RTL Group

#nlp #artificial-life #computer-vision
