Can we still believe our ears?

Last week I received a link, via a WhatsApp group of colleagues, to an event which proudly featured AIVIA AI interpretation courtesy of Interprefy. Curious to find out more about what its makers tout as the imminent replacement of human interpreters like me, I took the time to listen to the English AI voice (the presentations I followed were in German) – until I felt I would lose my own grip on the language if I listened for another moment.

I don’t mean at all to say that it was awful. In fact, I found it rather awesome how deceptively natural the voice sounded at first. It had quite a pleasant cadence – until, a few words into a sentence, there was suddenly a full stop where I wasn’t expecting one. I am not sure what came after that first full stop or how it related to the sentence that had been cut short so abruptly; I was still too busy figuring out why this had happened. Listening further, I realised that the misplaced full stops – which were rather frequent – would sometimes land in the middle of a sentence, which would then carry on in the next one with the very word that would have made sense after the one on which it had ended. Except sometimes it wouldn’t.

The source speech, which it was unfortunately not possible to follow alongside the interpretation, was a moderated panel discussion. The original spoken word was coherent enough, in the way it tends to be when people who are not necessarily used to public speaking talk without a script and may be a bit nervous. The interpretation, however, made less and less sense the longer I listened.

And then it struck me. The bits of original speech I listened to involved people answering questions, probably prepared to a degree, but also partly off the cuff, so it would be quite normal to hesitate after a few words while considering how to continue… to restart or rephrase the beginning of the answer… to pause for thought… to slip in a little sideways remark before getting back to the point… to backtrack a bit… or generally just to use sub-clauses in the middle of a main clause, as German grammar, when used skilfully, allows you to do… and the interpretation? It calmly and smoothly continued in its unchangingly soothing tone. It didn’t do hesitation, nor did it do question marks. The quite mellifluous voice – for a machine – stuck with the same tone until it became monotonous in itself and somewhat tiring, as it was impossible to detect where an interjection began or ended. References to anything said a sentence or two earlier went totally over the AI’s head (pardon the pun). The AI interpreting bot (for want of a better term) didn’t do banter and was unable to reproduce passion, emphasis, doubt or conviction – in short, all the human expressions of the spoken word, which tweak meaning in so many subtle and not so subtle ways, were lost in translation.

After over an hour of listening to this I was still fascinated but mentally exhausted, and had frankly barely grasped the gist of what the conversations had been about. Without switching to the original and referring back to the agenda, I’m not sure I would have been able to extract much useful information at all from what was, in some ways, ‘Wortsalat’ (word salad), as we say in German – and in other ways wasn’t. There were passages that were naturally structured and perfectly intelligible; they just didn’t fit together, or were artificially chopped apart. In fact, listening to meaningful snippets patched together in a way that didn’t really make much sense, and then trying to pick them apart and reconstruct them into something resembling coherence (fortunately a core skill of the human members of my profession), was a rather frustrating and tiring experience, even for me as a trained linguist. This is precisely what following a presentation in a language one doesn’t speak should absolutely not feel like!

My conclusion is that AI lacks the human touch. It may get to mimic it at some point, just as it mimics human voices quite successfully now, but it will never move beyond mimicry of anything that goes beyond the cognitive – of that I am convinced. So even if it is at some point able to reproduce the coherence – or sometimes the lack thereof – of human speech, or even emotions, it would still be pretending; in short, it would be fake. People will hopefully pick up on this, the way they pick up on people pretending to be what they are not.

On the other hand, AI will massively boost cognitive tasks, given its access to all our collective human (and AI?) knowledge at lightning speed – even in the interpreting profession. In fact, it is already doing so successfully: live on-screen transcription can highlight names and figures and show, within the text, the translations of terms from glossaries the interpreters have prepared and uploaded to the AI. This support already improves interpreter accuracy in those borderline situations where accents, speaker idiosyncrasies, dodgy audio and other factors would normally impact adversely on interpreting output. Such AI booth mates, as they are called, also reduce cognitive load in those situations, especially when working with another recent technological advance in simultaneous interpretation, remote simultaneous interpretation (RSI), where a team of interpreters may be sitting in different countries and its members are therefore limited in the extent to which they can assist one another.
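For readers curious what that kind of glossary support boils down to technically, here is a minimal, purely illustrative sketch of the idea: flag interpreter-prepared terms in a live transcript segment and show the prepared translation alongside. The names and the naive substring matching are my own assumptions for illustration, not Interprefy’s or any other vendor’s actual implementation.

```python
# Illustrative sketch only: match interpreter-prepared glossary terms against
# a transcript segment and surface the prepared target-language rendering.
# Real AI booth mate tools are far more sophisticated; this just shows the idea.

from dataclasses import dataclass


@dataclass
class GlossaryHit:
    term: str          # source-language term as prepared by the interpreter
    translation: str   # target-language rendering from the uploaded glossary
    position: int      # character offset within the transcript segment


def highlight_terms(segment: str, glossary: dict[str, str]) -> list[GlossaryHit]:
    """Return glossary terms found in a live transcript segment, in order of appearance."""
    hits = []
    lowered = segment.lower()
    for term, translation in glossary.items():
        pos = lowered.find(term.lower())
        if pos != -1:
            hits.append(GlossaryHit(term, translation, pos))
    return sorted(hits, key=lambda h: h.position)


# Example: a German transcript segment and a tiny prepared glossary (both made up).
glossary = {"Aufsichtsrat": "supervisory board", "Umsatzrendite": "return on sales"}
segment = "Der Aufsichtsrat hat die Umsatzrendite für 2023 diskutiert."
for hit in highlight_terms(segment, glossary):
    print(f"{hit.term} -> {hit.translation} (at position {hit.position})")
```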

All in all, I think there are more beneficial and urgent use cases for AI than taking over what makes humans so unique – our ability to communicate, and therefore to act jointly in much larger numbers than any other mammal, all via our language(s). Language should remain a purely human affair, unless we are happy to let some – maybe at some point no longer controllable – algorithms start to dictate to us what our words are supposed to mean, which ones to use and how.

AI live speech translation, right now, is good for public announcements, like those at airports or train stations. Possibly also for prepared presentations with a script that has been submitted upfront to train the AI (just as it should be for human interpreters: ISO standards set out the kind of material to be made available to interpreters before an event so that they can prepare, a requirement that is very rarely adhered to, even when interpreters or LSP project managers repeatedly ask for it). The moment things become interactive, AI live speech translation loses the plot – but ploughs on regardless. A bit like the not-so-good members of the existing crop of conference interpreters it attempts to replace (yes, as in every profession, interpreting too has the good, the bad and the ugly).
