Photo by Anomaly on Unsplash

There are many different ways of enjoying magazines beyond flipping through the physical pages. These days you don’t have to be able to see, read or touch the magazine to consume its content. AI helps make magazines, websites and other forms of media more accessible. 
This article accompanies a current exhibition showcasing creative AI pieces at our RTL offices in Hamburg. The exhibit described here is a GEO Kompakt magazine that can be controlled with gestures and uses Text-to-Speech technology to read its pages aloud.

Photo by Jamie Street on Unsplash

Relevance

Gesture Control

Imagine you are standing on a platform waiting for your U-Bahn. In front of you is a large display showing an ad for the latest GEO magazine. Wouldn’t it be nice to flick through some selected pages rather than just staring at the cover and pricing info?
This doesn’t have to be the only use case for gesture control: you could replace static ads with ones that carry more information, much like digital leaflets, or make information stands in cities and malls smarter without adding a fragile touchscreen or button controls. All you need is a camera. Furthermore, this type of control is touch-free and therefore germ-free, which can be valuable depending on the location (e.g. a hospital) or the season (e.g. flu season). It can also make magazines more accessible to people who lack fine motor skills due to disabilities.

Text-to-Speech

In much the same way, Text-to-Speech can serve as a nice-to-have feature for some user groups and as an essential accessibility feature for others. Text-to-Speech makes media accessible to visually impaired and blind people, helping them use websites, “read” a blog, book or magazine, or do other tasks online that sighted people rarely think twice about. And while this has been possible for a while with standard Text-to-Speech approaches, consider what is more pleasant to listen to over a longer period: a metallic-sounding voice or a human-sounding one with the correct intonation. Furthermore, Text-to-Speech can lend a voice to those who have lost theirs, Stephen Hawking being the most famous example.

The technology

Gesture Control

There are many AI models that recognise diverse gestures or hand poses directly from video or photos. As part of computer vision, many of these models make use of vision transformers or convolutional neural networks. However, it is also possible to take a detour via skeleton-based gesture recognition: instead of recognising hand poses and gestures straight from pixels, we first extract the internal hand skeleton and then process that representation afterwards.

For this, Google MediaPipe Hands provides an AI that retrieves the skeleton of a hand from video or photos. Exactly which models are used inside this AI is not revealed; it is only known that it consists of two modules: palm detection and hand-landmark detection, which finds and returns 21 hand landmarks (aka the skeleton).
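To give an idea of how little code this takes, here is a minimal sketch of extracting the 21 landmarks with the MediaPipe Python package; the webcam index and confidence threshold are assumptions you would tune for your own setup.

```python
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

# Track at most one hand in a live video stream.
with mp_hands.Hands(static_image_mode=False,
                    max_num_hands=1,
                    min_detection_confidence=0.5) as hands:
    cap = cv2.VideoCapture(0)  # default webcam; adjust the index as needed
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input, while OpenCV delivers BGR frames.
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            landmarks = results.multi_hand_landmarks[0].landmark
            # 21 landmarks, each with normalised x, y and a relative z estimate.
            wrist = landmarks[mp_hands.HandLandmark.WRIST]
            print(f"hand base at x={wrist.x:.2f}, y={wrist.y:.2f}, z={wrist.z:.2f}")
    cap.release()
```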

Using these hand landmarks, we can now implement our own gesture recognition with custom gestures. In this piece, there are 10 distinct gestures, 3 moving and 7 still ones: open (moving left, moving right, moving towards the camera), closed, zoom in, zoom out, middle finger, and rock hand.

Still gestures are recognised from the positions of the hand landmarks relative to the hand base, so they can be recognised stably regardless of where the hand is in the frame. This recognition is done with a multi-layer perceptron trained on these 7 still poses.

The detection of movement is more straightforward and based solely on the direction in which the hand base point moves: right, left, or towards the camera. For the latter, the middle fingertip is also considered, because with a simple RGB camera the z-coordinates of the hand skeleton are not as reliable as the x and y coordinates. Depending on the recognised gesture, the magazine turns to the next or previous page or plays the Text-to-Speech audio.
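As a rough illustration of the two recognition paths, the sketch below normalises the landmarks relative to the hand base for a small scikit-learn MLP and classifies horizontal movement from the wrist trajectory. The feature scheme, classifier settings and movement threshold are assumptions for illustration, not the exhibit’s actual network or training data.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def landmark_features(landmarks):
    """Turn 21 MediaPipe landmarks into a translation- and
    scale-invariant 42-dim feature vector for the still-pose MLP."""
    pts = np.array([(lm.x, lm.y) for lm in landmarks])  # shape (21, 2)
    rel = pts - pts[0]                 # positions relative to the hand base (wrist)
    scale = np.abs(rel).max() or 1.0   # normalise for hand size / camera distance
    return (rel / scale).flatten()

def horizontal_movement(wrist_xs, threshold=0.15):
    """Classify left/right movement from the wrist's x-trajectory over a
    short window of frames (normalised image coordinates, assumed threshold)."""
    delta = wrist_xs[-1] - wrist_xs[0]
    if delta > threshold:
        return "right"
    if delta < -threshold:
        return "left"
    return None

# Hypothetical training setup: one feature vector per labelled frame.
# X.shape == (n_frames, 42); y holds pose labels such as "open", "closed", ...
# clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=1000).fit(X, y)
# pose = clf.predict([landmark_features(current_landmarks)])[0]
```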

Text-to-Speech

The process of synthesising speech from text has been around since the 70s, and you might remember the robotic-sounding voices of train announcements and other early Text-to-Speech applications. Nowadays, it is possible to create speech segments that sound realistic and human-like with the help of AI such as transformers or Long Short-Term Memory networks (LSTMs for short). You can even create custom voices of celebrities, or of anyone of whom you have enough voice material; however, this can become an ethical problem if the person in question did not consent to this usage of their voice.

Photo by Ritupon Baishya on Unsplash

A great open-source Text-to-Speech model was published by Silero AI and covers 10 languages: English, German, French, Spanish, Ukrainian, Russian, Tatar, Uzbek, Kalmyk, and various Indic languages, with multiple speakers for most of them. It is one of the few open-source models offering such a variety of languages. It even supports auto-stress to enhance the quality of the speech; as of now, however, this feature is only available for Russian. While Silero’s models are free to use, the model and training code are kept secret so that the AI cannot be misused for unethical endeavours.
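For a feel of how this is used, here is a minimal sketch that loads a Silero German model via torch.hub and synthesises one sentence; the model version and speaker name are assumptions, so check Silero’s model list for the current options.

```python
import torch
import soundfile as sf

# Load a German Silero TTS model from the snakers4/silero-models hub repo.
model, example_text = torch.hub.load(
    repo_or_dir="snakers4/silero-models",
    model="silero_tts",
    language="de",
    speaker="v3_de",       # model version; assumed, see Silero's model list
)

# Synthesise one magazine sentence into a 48 kHz waveform tensor.
audio = model.apply_tts(
    text="Willkommen bei GEO Kompakt.",
    speaker="karlsson",    # voice within the model; assumed, varies by model
    sample_rate=48000,
)
sf.write("page.wav", audio.numpy(), 48000)
```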

To avoid such misuse, our colleagues from ntv Digital have strict rules for their custom voices. The voices of Maik Meuser (RTL News) and Inken Wriedt (Audio Alliance) were synthesised in the project “Synthetische Stimmen” (German for “synthetic voices”) in cooperation with Microsoft, so the underlying AI is not disclosed either. The voices may only be used for their intended purpose and for internal testing (such as this exhibition): you can hear them narrate the news on the ntv website and app, making the news more accessible.