
Start-up company open-sources a pre-trained model for Danish speech recognition: Together we can make an even better version

Illustration: designer491, bigstock

Denmark has a new language model that can be used to develop speech technology. In December, start-up company Alvenir launched a wav2vec2 model that can be fine-tuned for Danish speech recognition and far outperforms the international models.

“If we compare ourselves with Google in a broad general usage scenario, we have up to double the performance in Danish,” says Martin Carsten Nielsen, who co-founded Alvenir together with Rasmus Arpe Fogh Egebæk.

Performance is measured, among other things, using the word error rate—that is, how many of the spoken words are recorded and interpreted correctly by the model compared to a transcript made by a human. The new model, without specialization and in a general usage scenario, can reach a word error rate of 20 percent—which on average means that in a sentence of ten words, there will be two incorrect, missing, or added words.
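As an illustration of the metric (not Alvenir's own evaluation code), word error rate is conventionally computed as the word-level edit distance between a human reference transcript and the model's output, divided by the number of reference words. A minimal sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Classic dynamic-programming edit distance over words:
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# Ten reference words; one misspelled word and one dropped word
# give 2 errors out of 10, i.e. the 20 percent mentioned above.
ref = "der bor mange mennesker i kommunen og de betaler skat"
hyp = "der bor mange mennesker i komunen og de betaler"
print(wer(ref, hyp))  # 0.2
```

Note that, exactly as Nielsen points out, "komunen" counts as a full word error even though only one letter differs from "kommunen".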

“It may sound bad that the error rate is 20 percent, but keep in mind that a word counts as incorrect as soon as a single letter is wrong,” Martin Carsten Nielsen explains.

“If I say a word like ‘kommunen’, there is a good chance that the acoustic model does not capture the whole word correctly because the double ‘m’ is not audible. But that’s what we measure.”

A word error rate of 20 percent is generally considered acceptable, but it is still a long way from the 5–10 percent that the tech giants’ models can achieve in English. That is the goal Alvenir is working towards.

“This model is trained on 1,300 hours of podcasts and audiobooks. The next big step for Danish speech recognition is that we need a large version with 10 or 100 times more data,” Martin Carsten Nielsen says.

Trained on raw speech

The new Danish wav2vec2 model is based on a framework that Facebook’s AI researchers presented at the end of 2020. The framework uses the transformer architecture to achieve in speech what transformers have achieved in text. Just as the large language models use self-supervised learning from large amounts of raw text data, the wav2vec2 model can be trained on raw speech data.

The method removes one of the major bottlenecks that has historically been present in the development of language technology for a language such as Danish—securing large amounts of speech data that has been transcribed. Transcribed speech is only needed when the pre-trained model needs to be fine-tuned for a specific task.

There are also other benefits to the framework, Martin Carsten Nielsen says.

“Our wav2vec 2.0 models are far more computationally efficient than our old DeepSpeech2 models. With the old models, we had to use a GPU to run inference. Now we can run the models on a CPU,” he says.

“Training, on the other hand, is demanding. We trained for 14 days on six GPUs. But that’s nothing if you look at how other models are made.”

Just a few hours of data may be needed for fine-tuning

The wav2vec2 model cannot be used for speech recognition until it has been fine-tuned. How much data that requires cannot be said definitively, Martin Carsten Nielsen says.

“Pre-training must make the models work both in a car, in a studio, in the office, and with various accents. How much data is needed for fine-tuning is highly dependent on the problem,” he says.

Facebook’s results show that getting a few hours of data—preferably five hours or more—can provide good results. The important thing is that the data covers the vocabulary that is relevant to the task.

“If you want a model with a very limited vocabulary, for example one that only needs to recognize a few words in a production environment, then one hour may be enough,” Martin Carsten Nielsen says.

Language technology for all dialects

In English, the large variety of dialects has long been a source of problems, to the point where the technology has been criticized for discriminating against minorities whose way of speaking the language is not recognized. Danish carries the same inherent risk: language models may come to define Danish as standard Danish only, Martin Carsten Nielsen says.

“You can think of it as having categories under Danish such as standard Danish, South Jutlandic, and the Bornholm dialect. They sound very different, and in many respects use different words,” he says.

It will therefore not necessarily be sufficient in the short run to pre-train the model on all the local dialects for which data can be found. Each dialect will probably have to be treated as a separate language instead, Martin Carsten Nielsen says.

“It’s not a given that you can just throw it all into one model. But one can imagine different models handling different dialects. For example, you could build a system that first determines which dialect is being spoken and then hands the audio to the model for that dialect. That’s probably the most realistic solution in the short term.”
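The routing idea Nielsen describes can be sketched as a two-stage pipeline. Everything below is a hypothetical stand-in, not Alvenir code: in a real system, each recognizer would be a fine-tuned wav2vec2 model and the classifier would actually inspect the audio.

```python
from typing import Callable, Dict

# Hypothetical stand-ins for fine-tuned per-dialect models.
def recognize_standard(audio: bytes) -> str:
    return "standard Danish transcript"

def recognize_sonderjysk(audio: bytes) -> str:
    return "South Jutlandic transcript"

def recognize_bornholmsk(audio: bytes) -> str:
    return "Bornholm transcript"

DIALECT_MODELS: Dict[str, Callable[[bytes], str]] = {
    "standard": recognize_standard,
    "sonderjysk": recognize_sonderjysk,
    "bornholmsk": recognize_bornholmsk,
}

def classify_dialect(audio: bytes) -> str:
    # Placeholder stage 1: a real classifier would analyze the audio
    # and return the most likely dialect label.
    return "standard"

def transcribe(audio: bytes) -> str:
    """Stage 1: detect the dialect; stage 2: run that dialect's recognizer."""
    dialect = classify_dialect(audio)
    model = DIALECT_MODELS.get(dialect, recognize_standard)
    return model(audio)

print(transcribe(b"..."))  # routed through the standard-Danish model
```

The dictionary dispatch keeps the dialect models independent, so a new dialect can be added by registering one more recognizer rather than retraining a single monolithic model.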

Shared blueprint

Alvenir will focus on developing solutions based on the new model. But the two founders still see making the model open source as a good move.

“The general models are tools that can be optimized in many ways. By making them freely available, we give organizations the opportunity to try them out. Some companies may be able to take the model and solve their tasks themselves, and that’s great. Others will need people like us and our expertise,” Martin Carsten Nielsen says.

“We have only just started working on these technologies, and Denmark should have the opportunity to be at the forefront. I might as well share the blueprint for my hammer, so that we can together make an even better version in the long run.”

Moreover, a positive attitude towards open source can also improve the company’s profile, Martin Carsten Nielsen points out:

“We want to be able to recruit talent, and the best people often tend to be interested in open source.”