Improving callsign recognition by incorporating information from the radar

When Air Traffic Controllers (ATCs) talk to pilots, they identify the aircraft with a callsign. A callsign usually consists of a term for the airline followed by a sequence of alphanumeric characters, for example "Speedbird Seven Alpha Five". In speech recognition for ATCs, recognizing callsigns correctly is particularly important.

Fortunately, we have additional information that can help us recognize callsigns. The radar, which every air traffic control tower has, tells us which planes are in the vicinity. Since an ATC can only be talking to one of these planes, any callsign that was spoken must be one of those on the radar. We developed two methods that use this information to improve callsign recognition.
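
To make the idea concrete, here is a minimal sketch of how a radar snapshot could be turned into the set of spoken callsigns we expect to hear. It assumes the radar delivers callsigns as a three-letter airline designator followed by alphanumeric characters (e.g. "BAW7A5"). The small lookup table and the function names `spoken_form` and `radar_candidates` are hypothetical and only for illustration; this is not necessarily how our system does it.

```python
import string

# Illustrative excerpt only; a real system would load a complete table.
AIRLINE_TELEPHONY = {"BAW": "speedbird", "DLH": "lufthansa", "RYR": "ryanair"}

DIGIT_WORDS = dict(zip(
    "0123456789",
    "zero one two three four five six seven eight nine".split(),
))
LETTER_WORDS = dict(zip(
    string.ascii_uppercase,
    ("alpha bravo charlie delta echo foxtrot golf hotel india juliett kilo "
     "lima mike november oscar papa quebec romeo sierra tango uniform victor "
     "whiskey xray yankee zulu").split(),
))


def spoken_form(callsign):
    """Return the spoken word sequence for a radar callsign, or None.

    Assumes (hypothetically) a three-letter airline designator followed by
    letters and digits that are spelled out one by one.
    """
    designator, suffix = callsign[:3].upper(), callsign[3:].upper()
    airline = AIRLINE_TELEPHONY.get(designator)
    if airline is None or not suffix:
        return None
    words = [airline]
    for ch in suffix:
        word = DIGIT_WORDS.get(ch) or LETTER_WORDS.get(ch)
        if word is None:  # unexpected character, give up on this callsign
            return None
        words.append(word)
    return " ".join(words)


def radar_candidates(radar_callsigns):
    """Map each expandable radar callsign to its spoken form."""
    candidates = {}
    for callsign in radar_callsigns:
        spoken = spoken_form(callsign)
        if spoken is not None:
            candidates[spoken] = callsign
    return candidates
```

Calling `radar_candidates` on each new radar snapshot would give the list of word sequences that a recognizer could then favour.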

The first method modifies the speech recognition system directly to boost the probability of recognizing the callsigns that we know from the radar. Thanks to advances in efficient transducer composition[1], these modifications can be made in a way that allows the model to be updated continuously as new information from the radar comes in. This means that the model is tuned in real time and can adapt to changes in the real world. We published this technique in a paper at ICASSP 2021[2].
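
The following is a simplified sketch of the general idea of biasing a decoder towards a set of callsigns that can be swapped out at any time. It is not the transducer-composition approach from the paper: instead it uses a plain set of word prefixes and a per-word bonus, which is easier to show in a few lines. The class name `CallsignBias`, its methods, and the bonus value are all hypothetical.

```python
class CallsignBias:
    """Toy decoder bias: reward words that extend a radar callsign."""

    def __init__(self, bonus=1.0):
        self.bonus = bonus
        self.prefixes = set()

    def update(self, spoken_callsigns):
        """Replace the active callsigns when a new radar snapshot arrives."""
        prefixes = set()
        for phrase in spoken_callsigns:
            words = tuple(phrase.split())
            for end in range(1, len(words) + 1):
                prefixes.add(words[:end])
        self.prefixes = prefixes

    def score(self, matched, next_word):
        """Return (bonus, new_matched) for appending next_word.

        `matched` is the tuple of callsign words matched so far on this
        decoding path. This toy version does not take back bonuses given
        to partial matches that later fail.
        """
        extended = matched + (next_word,)
        if extended in self.prefixes:
            return self.bonus, extended
        if (next_word,) in self.prefixes:  # the word starts a new callsign
            return self.bonus, (next_word,)
        return 0.0, ()
```

A decoder would add the returned bonus to the score of each path it extends. Because `update` simply swaps the prefix set, the bias follows the radar without touching the rest of the recognizer.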

The second method post-processes the output of the speech recognition system to boost the probability of recognizing specific callsigns. This is simpler, as it only involves rescoring the system output. Rescoring means giving certain outputs more weight, which increases the chance that the higher-weighted terms are output as the model's prediction. The rescoring was implemented as a transducer composition[3].
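
As an illustration of rescoring, the sketch below works on an n-best list rather than on transducers, so it is a simplification and not the composition-based implementation described above. It assumes each hypothesis comes with a log score (higher is better) and adds a fixed bonus to hypotheses that contain a callsign from the radar. The function names and the bonus value are hypothetical.

```python
def contains_phrase(words, phrase):
    """True if the word list `phrase` occurs contiguously in `words`."""
    n = len(phrase)
    if n == 0:
        return False
    return any(words[i:i + n] == phrase for i in range(len(words) - n + 1))


def rescore(nbest, radar_phrases, bonus=2.0):
    """Re-rank (text, log_score) pairs, favouring radar callsigns.

    Hypotheses containing any radar callsign get `bonus` added to their
    log score; the list is then sorted best first.
    """
    phrases = [p.split() for p in radar_phrases]
    rescored = []
    for text, log_score in nbest:
        words = text.split()
        boosted = any(contains_phrase(words, p) for p in phrases)
        rescored.append((text, log_score + (bonus if boosted else 0.0)))
    return sorted(rescored, key=lambda hyp: hyp[1], reverse=True)
```

For example, if the recognizer slightly prefers "speedbird seven alpha nine" but only "speedbird seven alpha five" is on the radar, the bonus can move the second hypothesis to the top of the list.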

Both methods worked well, increasing callsign accuracy by up to 30%. We believe there is still room for improvement and plan to continue working on this topic.

[1] Filters for Efficient Composition of Weighted Finite-State Transducers

[2] A Comparison of Methods for OOV-Word Recognition on a New Public Dataset

[3] Weighted Finite-State Transducers in Speech Recognition