Exclusive 15:58 05 Sep 2023

Ukrainian startup teaches AI to recognize endangered Crimean Tatar language

The Ukrainian startup Respeecher initiated a project to collect the voices of Crimean Tatar speakers to teach artificial intelligence to speak the Crimean Tatar language.

What is the problem?

UNESCO lists the Crimean Tatar language among those in need of protection, with the status of severely endangered. It is the language of one of the indigenous peoples of Ukraine, who are subjected to oppression and repression in Crimea, which is temporarily occupied by Russia.

What is the solution?

Ukraine created a special commission this year to preserve the Crimean Tatar language, which should contribute to preserving the country's national culture, traditions, customs, and historical memory, popularizing the Crimean Tatar language among young people, and raising its prestige.

However, the Ukrainian authorities are not the only ones concerned with the fate of the Crimean Tatar language. The Ukrainian startup Respeecher has initiated an extremely important project to popularize the language, the Representation of the President of Ukraine in the Autonomous Republic of Crimea reports.

The project should speed up the adoption of the Crimean Tatar language in many familiar services: telephone assistants, chatbots, and automatic translators.

How does it work?

Ukrainian AI startup Respeecher synthesizes voices for Hollywood productions. Previously, the team used artificial intelligence to create synthesized voices for Darth Vader and Luke Skywalker, as well as for the video game God of War Ragnarök.

In its new project, the startup decided to teach a popular neural network the Crimean Tatar language. This volunteer initiative aims to improve the situation of the Crimean Tatar language and promote it.

To train a free neural network to recognize a language, one needs to collect 1,000 hours of recordings of the Crimean Tatar language. Native speakers are asked to send audio, which can be recorded on a voice recorder at home. The more distinct accents and voice tonalities the AI model analyzes, the more accurate the speech recognition result will be.

Audio recordings of the Crimean Tatar literary language made in a quiet room with as little background noise as possible (such as other people's voices, traffic, or the sound of an air conditioner or refrigerator) are best suited for analysis. It is preferable to record with a good microphone, but recordings made on an iPhone are also suitable. The main thing is that each recording should be between 30 minutes and an hour long.

"In our project, we want to emphasize the variety of voices and the number of hours — this is about 1,000 hours of the Crimean Tatar language in the voices of native speakers," said the company. "Unfortunately, this language still has quite a bit of clean and high-quality audio. Such a dataset will help train and improve speech recognition and other interesting algorithms and will increase the amount of good in the universe."

The startup, which adheres to ethical cooperation standards, assures that no data about specific individuals will be stored and that the team will never reproduce or synthesize anyone's voice without the speaker's permission. All submitted data will be used only to train the neural network, to analyze the Crimean Tatar language in general, and to improve its recognition.

So far, the Respeecher team has collected 100 hours of audio recordings in Crimean Tatar. Some of the recordings were made in the startup's studio. Others were sent in as audio recorded on a dictaphone.

"Unfortunately, so far, only 39 people have made such recordings. The thing is that audio shorter than 30 minutes or those with dialogues or noises/music in the background will not work. The AI model can only train on longer audios that are made in silence,"  says Dmytro Belevtsov, technical director and co-founder of Respeecher. "However, if you emotionally read your favorite book in Crimean Tatar on an ordinary telephone recorder for 40 minutes in a relatively quiet room, the echo will be an invaluable contribution to the project of popularizing the Crimean Tatar language."

The Respeecher team has already trained this neural network to recognize the Ukrainian language. The resource can be used by individual developers and scientists to improve how their products handle spoken Ukrainian, as well as by large corporations, such as Facebook and Google, or in assistants, such as Siri. More generally, to create an assistant for a niche industry, such as the agricultural sector, companies will not need to spend tens of thousands of dollars on collecting a large amount of specialized data and training the network themselves. They can start from a higher point and build speech recognition technologies significantly faster and more cheaply than from scratch.

"The process of collecting and analyzing information is quite time-consuming: it can take many months. However, our team wants this resource to be free and available in open sources," says CEO and co-founder of Respeecher Dmytro Belevtsov. "We believe this will help popularize the use of Ukrainian and Crimean Tatar languages."  

The project's initiators invite everyone to send speech recordings or links to recordings using this form.

Even more useful solutions!

The non-governmental organization QIRI'M Young, which implements the project "National Corpus of the Crimean Tatar Language" as part of the Strategy for the Development of the Crimean Tatar Language for 2022-2032, also participated in the Respeecher project.

"Our team's vision is to digitize the Crimean Tatar language as soon as possible and introduce it in the most common operating systems, search engines, etc. Our team is forming a text base for language research — the National Corpus of the Crimean Tatar language, which can teach AI to understand Crimean Tatar texts and give Respeecher an audio base, which will teach AI to pronounce these texts,"  the public organization comments. "We gladly joined the collection of materials for the Respeecher project, providing about 10 hours of audio recordings. We consider this initiative very important. Projects to popularize and expand the scope of the Crimean Tatar language are extremely necessary and relevant. We wish our colleagues success and look forward to the development results!"

Rubryka previously reported that a Crimean Tatar publishing house, Kitap Qalesi, had opened in Ukraine.
