Figure 1. Interspeech 2021 will be held between August 30 and September 3, 2021.
This blog post briefly reviews each of the three research papers ATCO2 will present on-site at INTERSPEECH. The first paper is related to the language used in ATC communication.
In the ATCO2 project, we launched a community platform for collecting ATC speech worldwide. Filtering out unseen non-English speech is one of the main components of the data processing pipeline. The proposed English Language Detection (ELD) system is based on embeddings from a Bayesian subspace multinomial model. It is trained on word confusion networks from an ASR system. It is robust, easy to train, and lightweight. In the in-domain scenario, we achieved an equal error rate (EER) of 0.0439, a 50% relative reduction compared to a state-of-the-art acoustic ELD system based on x-vectors. In the unseen-language (out-of-domain) condition, we achieved an EER of 0.1352, a 33% relative reduction compared to the acoustic ELD. We plan to publish the evaluation dataset from the ATCO2 project.
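For readers unfamiliar with the metric: the EER is the operating point at which the false-rejection rate on target (English) trials equals the false-acceptance rate on non-target trials. A minimal sketch of how it can be estimated from detection scores (the scores below are synthetic, not from the ATCO2 system):

import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    # Sweep every observed score as a candidate decision threshold.
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    frr = np.array([(target_scores < t).mean() for t in thresholds])      # false rejections
    far = np.array([(nontarget_scores >= t).mean() for t in thresholds])  # false acceptances
    i = np.argmin(np.abs(frr - far))   # threshold where the two rates cross
    return (frr[i] + far[i]) / 2

rng = np.random.default_rng(0)   # toy scores: higher means "more English"
print(equal_error_rate(rng.normal(2, 1, 1000), rng.normal(-2, 1, 1000)))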
Further information in the following links:
Teaser: https://www.youtube.com/watch?v=qj42c4qmmAc
Paper: https://arxiv.org/abs/2104.02332
Contextual adaptation is a technique of “suggesting” small snippets of text that are likely to appear in the speech recognition output. The snippets of text are derived from the current “situation” of the speaker; in the ATCO2 project, these are the location and time of the communication. The location and time are used to query the OpenSky Network for a list of callsigns (airplanes) that match these two inputs.
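For illustration, a query of this kind can be sketched against the public OpenSky Network REST API. The bounding box below is a made-up example, and the live endpoint returns the currently visible aircraft; the project additionally matches the recording time:

import requests

# Hypothetical bounding box around an airport of interest.
params = {"lamin": 49.0, "lomin": 16.3, "lamax": 49.4, "lomax": 16.9}
resp = requests.get("https://opensky-network.org/api/states/all",
                    params=params, timeout=30)
resp.raise_for_status()

# Each state vector starts with [icao24, callsign, origin_country, ...];
# the callsign field is padded with spaces and may be empty.
states = resp.json()["states"] or []
callsigns = sorted({s[1].strip() for s in states if s[1] and s[1].strip()})
print(callsigns)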
Applying Automatic Speech Recognition (ASR) to the Air Traffic Control (ATC) domain is difficult due to factors such as noisy radio channels, foreign accents, cross-language code-switching, a very fast speech rate, and a situation-dependent vocabulary with many infrequent words. Combined, these factors lead to error rates that make it difficult to apply speech recognition in practice.
For ASR in ATC, contextual adaptation is beneficial. For instance, we can use a list of airplanes that are nearby. From an airport identity, we can derive local waypoints, local geographical names, phrases in the local language, etc. It is important that the adaptation is dynamic, i.e. the adaptation snippets of text change over time. The adaptation also has to be lightweight, so it should not require rebuilding the recognition network from scratch. We inject the snippets of text by means of Weighted Finite State Transducer (WFST) composition. An example of a biasing FST is shown in Figure 2.
Figure 2. “Toy-example” topology of a biasing WFST graph for boosting the ASR’s recognition network. The boosted callsign is ‘CSA one two three alfa bravo’.
Further information in the following link:
Paper: Boosting of contextual information in ASR for air-traffic call-sign recognition
Air traffic management, and specifically air-traffic control (ATC), relies mostly on voice communication between Air Traffic Controllers (ATCos) and pilots. In most cases, these voice communications follow a well-defined grammar that can be leveraged by Automatic Speech Recognition (ASR) technologies. The callsign used to address an airplane is an essential part of all ATCo-pilot communications. We propose a two-step approach that adds contextual knowledge during semi-supervised training to reduce the ASR error rate on the part of the utterance that contains the callsign. First, we represent the contextual knowledge (i.e. air-surveillance data) of an ATCo-pilot communication in a WFST. Then, during Semi-Supervised Learning (SSL), the contextual knowledge is added by second-pass decoding (i.e. lattice rescoring). Results show that 'unseen domains' (e.g. data from airports not present in the supervised training data) benefit further from contextual SSL compared to standalone SSL. For this task, we introduce the Callsign Word Error Rate (CA-WER) as an evaluation metric, which assesses ASR performance only on the spoken callsign in an utterance. On a challenging ATC-based test set gathered from LiveATC, we obtained a 32.1% relative CA-WER improvement by applying SSL, with an additional 17.5% CA-WER improvement from adding contextual knowledge during SSL.
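The paper defines CA-WER precisely; as a rough illustration of the idea, the sketch below scores only the callsign part by matching the reference callsign against the best-fitting window of the hypothesis (a simplification of the alignment used in the paper):

import numpy as np

def wer(ref, hyp):
    # Word-level Levenshtein distance divided by the reference length.
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i, r in enumerate(ref, 1):
        for j, h in enumerate(hyp, 1):
            d[i, j] = min(d[i - 1, j] + 1,              # deletion
                          d[i, j - 1] + 1,              # insertion
                          d[i - 1, j - 1] + (r != h))   # substitution
    return d[-1, -1] / len(ref)

def ca_wer(ref_callsign, hyp_words):
    # Score only the callsign: take the best-matching hypothesis window.
    n = len(ref_callsign)
    windows = [hyp_words[i:i + n] for i in range(max(1, len(hyp_words) - n + 1))]
    return min(wer(ref_callsign, w) for w in windows)

ref = "c_s_a one two three alfa bravo".split()
hyp = "c_s_a one two tree alfa bravo roger".split()
print(ca_wer(ref, hyp))   # 1 substitution over 6 words, about 0.167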
Figure 3. Process of retrieving a list of callsigns (contextual data) from OpenSky Network. The contextual data is the compendium of all possible verbalized versions of each callsign.
Further information in the following links:
Paper: Contextual Semi-Supervised Learning: An Approach To Leverage Air-Surveillance and Untranscribed ATC Data in ASR Systems
Programme:
The special session is dedicated to automatic speech recognition in air-traffic management, and the following session agenda has been released:
Thu-M-SS-2 Thursday, September 2, 11:00-13:00 Special-Hybrid: Automatic Speech Recognition in Air Traffic Management
Air-traffic management is a dedicated domain in which, in addition to the voice signal, other contextual information (e.g. air traffic surveillance data, meteorological data) plays an important role. Automatic speech recognition is the first challenge in the whole chain. Further processing usually requires transforming the recognized word sequence into a conceptual form, which is the more important application in ATM. This also means that the usual metrics for evaluating ASR systems (e.g. word error rate) are less important, and other performance criteria are employed: objective ones such as command recognition error rate, callsign detection accuracy, overall algorithmic delay, real-time factor, or reduced flight times, and subjective ones such as a decrease in user workload.
This special session aims to bring together ATM players (both academic and industrial) interested in ASR and ASR researchers looking for new challenges. This can accelerate near-future R&D plans to integrate speech technologies into the challenging, but highly safety-oriented, air-traffic management domain.
The session is organised by two people: Hartmut Helmke (DLR, coordinator of the HAAWAII project) and Pavel Kolcarek (Honeywell, topic manager of the ATCO2 project).
Applying Automatic Speech Recognition (ASR) to the Air Traffic Control (ATC) domain is difficult due to factors such as noisy radio channels, foreign accents, cross-language code-switching, a very fast speech rate, and a situation-dependent vocabulary with many infrequent words. Combined, these factors lead to error rates that make it difficult to apply speech recognition in practice.
For ASR in ATC, contextual adaptation is beneficial. For instance, we can use a list of airplanes that are nearby. From an airport identity, we can derive local waypoints, local geographical names, phrases in the local language, etc. It is important that the adaptation is dynamic, i.e. the adaptation snippets of text change over time. The adaptation also has to be lightweight, so it should not require rebuilding the recognition network from scratch. We inject the snippets of text by means of Weighted Finite State Transducer (WFST) composition.
We apply on-the-fly boosting to the HCLG graph. The HCLG graph is the recognition network that defines the paths the beam-search HMM decoder will explore. This graph contains costs that can be altered. We do this by WFST composition, applied as:
HCLG’ = HCLG o B.
The composition is denoted by the operator ‘o’ and its algorithm is described in [1]. Informally, the output symbols of the left operand are coupled (matched) with the input symbols of the right operand. The weights from both graphs are recombined in a way defined by the semiring of WFST weights. The result is a single graph with the input symbols of the left operand and the output symbols of the right operand. An example of a boosting graph B is in Figure 1.
[1] Mehryar Mohri, Fernando Pereira, Michael Riley: Weighted finite-state transducers in speech recognition. Comput. Speech Lang. 16(1): 69-88 (2002)
[2] Keith B. Hall, Eunjoon Cho, Cyril Allauzen, Françoise Beaufays, Noah Coccaro, Kaisuke Nakajima, Michael Riley, Brian Roark, David Rybach, Linda Zhang: Composition-based on-the-fly rescoring for salient n-gram biasing. INTERSPEECH 2015: 1418-1422
Figure 1. “Toy-example” topology of a WFST graph B for boosting the recognition network HCLG. The boosting is done as the composition HCLG’ = HCLG o B, which introduces the score discounts into the HCLG recognition network.
As you can see, we are boosting individual words. We cannot boost whole phrases here, since such a composition would require a lot of computation time. Also, we should not boost common words that are likely to be present in the lattice anyway. So we boost only ‘rare’ words such as airline designators from callsigns (e.g. ‘air_berlin’). In the future, we plan to boost waypoints, local names and frequent phrases in the local language.
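To make the idea concrete, here is a toy sketch (not the project’s actual tooling) that writes a one-state, word-level boosting graph B in OpenFst’s AT&T text format: every vocabulary word loops back with weight 0, and the boosted rare words carry a negative cost, i.e. a score discount in the tropical semiring:

# Toy vocabulary and the rare words we want to boost (example values).
vocab = ["air_berlin", "lufthansa", "one", "two", "three", "contact"]
boosted = {"air_berlin": -4.0}

with open("B.txt", "w") as f:
    for word in vocab:
        # Arc format: src dst ilabel olabel weight; state 0 loops on every word.
        f.write(f"0 0 {word} {word} {boosted.get(word, 0.0)}\n")
    f.write("0\n")  # state 0 is both initial and final

# Then, with the OpenFst command-line tools:
#   fstcompile --isymbols=words.txt --osymbols=words.txt B.txt B.fst
#   fstcompose HCLG.fst B.fst HCLG_boosted.fst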
In lattice boosting, we have more freedom in designing the boosting graphs. A lattice is a relatively small graph compared to the HCLG graph, and lattices are acyclic. Both properties make the composition operation much faster. The boosting graph B can therefore encode many word sequences, each obtaining its score discount only if the whole word sequence is matched in the lattice during the WFST composition.
As in the previous section, the composition is done as:
L’ = L o B
where L is the input lattice, B is a boosting graph from Figure 2 and L’ is the output lattice with score discounts introduced by the composition.
Figure 2. A “toy-example” topology of a WFST graph B for boosting lattices (speech-to-text output with alternative hypotheses). The boosting is done as the composition L’ = L o B, which introduces score discounts for the word sequences we decided to boost. These word sequences represent the contextual information.
The lattice boosting is specific to each utterance; the composition is run in batch mode for a whole test set. The toy example in Figure 2 has a “lower part” with all the words of the lexicon in parallel; this makes sure no word sequence is dropped by the composition. There is also a phi symbol #0 at the “entrance” to the lower part. The “upper part” encodes the word sequences we want to boost (e.g. callsigns); the score discounts -4 or -8 are on the word links. As we use the phi symbol #0 in the composition, the lower part is accessed only if the partial word sequence in the lattice cannot be matched with the “upper part” of the B graph (the part with discounts).
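As a toy sketch of this topology (again illustrative, not the project’s actual tooling), the following writes such a B graph in AT&T text format, mirroring Figure 2: an upper part with one discounted path per boosted word sequence, and a lower part entered through the failure symbol #0 that accepts any word without a discount:

vocab = ["c_s_a", "one", "two", "three", "alfa", "bravo", "contact"]
phrases = [("c_s_a one two three alfa bravo".split(), -8.0)]  # example callsign

arcs, state = [], 1
for words, discount in phrases:
    prev, per_word = 0, discount / len(words)  # spread the discount over the word links
    for w in words:
        arcs.append((prev, state, w, w, per_word))
        prev, state = state, state + 1
    arcs.append((prev, 0, "<eps>", "<eps>", 0.0))  # full match: rejoin the start state

backoff = state
arcs.append((0, backoff, "#0", "<eps>", 0.0))      # failure arc into the lower part
for w in vocab:
    arcs.append((backoff, 0, w, w, 0.0))           # any word, no discount

with open("B_lat.txt", "w") as f:
    for src, dst, ilab, olab, wgt in arcs:
        f.write(f"{src} {dst} {ilab} {olab} {wgt}\n")
    f.write("0\n")
# Compile with fstcompile as before; the per-utterance composition L' = L o B
# must treat #0 as a phi (failure) symbol in the rescoring tool.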
The experiments with HCLG boosting and lattice boosting are summarized in our paper submitted to Interspeech 2021. Here we share the main table from the paper:
The table contains both Word Error Rate (WER) results and Call-Sign Accuracies (CSA). On liveatc_test_set2 we see a huge CSA improvement from 53.5 to 80.6. For malorca_vienna the absolute CSA improvement is smaller; nevertheless, the gain from 84.4 to 88.1 closes 60.7% of the gap between the baseline and the oracle CSA. We also see that lattice boosting on its own already brings good improvements, and the best results are obtained with the combination of HCLG boosting and lattice boosting.
Check out our previous blog posts:
We introduced the hardware setups in one of our previous blog posts. We used two different antennas and two different SDR receivers. Here we share the results of comparing various combinations of this hardware. To recall, we have the following HW:
| Component | Low performance/quality: Item | Price (EUR) | Higher performance/quality (more expensive): Item | Price (EUR) |
|---|---|---|---|---|
| Antenna | Sirio MD 118-137 incl. 5m cable | 40 | Watson WBA-20 | 60 |
| SDR receiver | RTL-SDR | 50 | SDRplay RSP1A | 130 |
Our experiment was done at LKTB (Brno Airport); we are located at a distance of about 14km from the airport (see this blog post for details). See the altitude profile in the image below.
We placed both antennas for the test at approximately the same height.
One of our interests was to find out the quality of the recorded audio signals (as we want to be as close as possible to the speech observed in the cockpit / tower) and to compare the more expensive and cheaper recording setups. The comparison is based on the estimated SNR values (see the previous blog post). It is worth mentioning that the RSP1A was also run in 8-bit mode (recording 10MHz bandwidth).
We recorded 3 days with both HW setups in parallel (the more expensive RSP1A on the Watson WBA-20 and the cheaper RTL-SDR on the Sirio MD), then switched the antennas and recorded another 3 days (RSP1A on Sirio MD and RTL-SDR on Watson WBA-20). The experiments led to the following results:
To briefly compare the lower quality (~200EUR) and the more expensive (~440EUR) HW setups, refer to the histograms below. The cheaper setup (RTL-SDR dongle with the Sirio antenna) provides an SNR of ~3.6dB on average, while the expensive setup provides ~19.2dB on average. We also indicated the amount of speech and signal in the histogram; speech fills about 70% of the recorded audio.
The next two histograms compare SNRs of ‘fixed’ receivers while we switch the antennas. We see that the Watson antenna provides a 6 to 10dB higher SNR than the Sirio.
The next two histograms compare SNRs with a ‘fixed’ antenna while we switch the receiver. Here the RSP1A is about 4dB better on the Sirio antenna and about 10dB better on the Watson antenna.
Our main conclusion is that a good antenna is important: it alone increases the average SNR from 3.6dB to 9.2dB. With a good antenna deployed, a better receiver brings even more SNR gain (from 9.2dB to 19.2dB).
Let’s summarize mean SNRs in the following table:
| mean SNR [dB] (receiver \ antenna) | Sirio MD (cheaper) | Watson (more expensive) |
|---|---|---|
| RTL-SDR dongle (cheaper) | 3.58 | 9.22 |
| SDRplay RSP1A (more expensive) | 8.78 | 19.16 |
Check out our previous blog posts:
This blog post is more technical than the previous ones. In the next paragraphs we will describe the raw signal processing pipeline. The rtl-airband software is set to produce raw data coming from the SDR hardware in cs16 format (interleaved complex signed 16-bit samples). The produced cs16 files are processed through:
cat ${signalfile}.cs16 | csdr convert_s16_f | csdr amdemod_cf | csdr fastdcblock_ff | csdr gain_ff 3 | csdr limit_ff | csdr convert_f_s16 > ${signalfile}.raw
which does the following:
- convert_s16_f: converts the signed 16-bit samples to floats
- amdemod_cf: AM-demodulates the complex baseband into a real audio signal
- fastdcblock_ff: removes the DC offset
- gain_ff 3: applies a fixed gain of 3
- limit_ff: limits the amplitude to [-1, 1]
- convert_f_s16: converts the floats back to signed 16-bit samples
Next, we drop all segments shorter than 1 second, as they do not contain any meaningful signal. You may have noticed we are not using automatic gain control (AGC). The reason is that AGC deforms the signal (rapidly changing the volume and thus the amount of noise). As we have the whole recording and can process it off-line, we implemented a segment-based gain control instead.
We detect push-to-talk clicks using a wavelet transform and identify the individual utterances in the audio. We amplify each segment so that it does not exceed 95% of the maximum level of the wav file (1.0 in our case); isolated peaks are ignored. See the figure below:
The original raw signal is on top; the amplified signal is on the bottom.
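A minimal numpy sketch of such a segment-based gain (assuming float samples in [-1, 1]; the robust-peak percentile is our illustrative choice for “ignoring the peak levels”):

import numpy as np

def amplify_segment(x, target=0.95, peak_percentile=99.5):
    # Robust peak estimate: a high percentile ignores isolated clicks/peaks.
    peak = np.percentile(np.abs(x), peak_percentile)
    if peak == 0:
        return x
    return np.clip(x * (target / peak), -1.0, 1.0)

# segments: one float array per detected push-to-talk utterance
# amplified = [amplify_segment(seg) for seg in segments]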
We detect the speech parts of the audio, which are further used to reliably estimate the Signal-to-Noise Ratio. The Voice Activity Detector (VAD) is based on a neural network with 2 hidden layers and 2 output classes. It was trained on 1366 hours of a multilingual telephone speech corpus. The neural network output is smoothed by averaging over a 5-frame window, and we can adjust the detection threshold to control the amount of detected speech. See the figure below with the detected speech indicated in the recording (red parts).
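The neural network itself is not shown here, but the smoothing and thresholding of its per-frame output can be sketched as follows:

import numpy as np

def smooth_and_threshold(speech_probs, win=5, threshold=0.5):
    # Average the per-frame speech posteriors over a 5-frame window,
    # then mark frames above the (tunable) detection threshold as speech.
    smoothed = np.convolve(speech_probs, np.ones(win) / win, mode="same")
    return smoothed > threshold

# probs = per-frame posterior of the "speech" class from the VAD network
# speech_mask = smooth_and_threshold(probs, threshold=0.6)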
The SNR estimation technique is based on waveform amplitude distribution analysis (Chanwoo Kim, Richard M. Stern, "Robust Signal-to-Noise Ratio Estimation Based on Waveform Amplitude Distribution Analysis", Interspeech 2008). In principle, the amplitude distribution of noise is Gaussian, while the amplitude distribution of speech follows a Gamma distribution. We can “guess the SNR by estimating where we are between the Gaussian and Gamma distributions” for our signal.
To estimate the SNR reliably, we select only the speech segments and avoid all the non-speech parts. We apply the SNR estimation technique, which provides an SNR estimate for each voiced segment.
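The WADA estimator itself relies on a precomputed lookup table between its Gaussianity statistic and the SNR, so we do not reproduce it here; as a crude energy-based stand-in (not the method of Kim & Stern), one can compare the power of speech and non-speech samples selected by the VAD:

import numpy as np

def snr_db(samples, speech_mask):
    # speech_mask: boolean per-sample flags derived from the VAD output.
    speech_power = np.mean(samples[speech_mask] ** 2)   # speech + noise
    noise_power = np.mean(samples[~speech_mask] ** 2)   # noise only
    return 10 * np.log10(max(speech_power - noise_power, 1e-12) / noise_power)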
Check out our previous blog posts:
Welcome to the next blog post in our series “How to set up an ATC voice recorder”. This one focuses on software installation and SDR settings. We assume you have chosen Linux as the OS.
Please follow the instructions at https://atco.opensky-network.org/. You should end up with a Linux distribution with the SDR drivers installed and the RTL-airband software running (https://github.com/szpajder/RTLSDR-Airband).
You need to do several steps to set up the SDR. First, identify the VHF frequencies you want to record and decide on your center frequency and bandwidth. If some frequencies are too far apart, you may use two SDR devices (we also use this setup). Let's look at two examples.
We checked the available on-line resources and found the main frequencies used at LKTB.
We are not interested in ATIS. Notice that the distance between LKTB_TWR and LKTB_APP is 7.75MHz, which is much larger than the 2.5MHz supported by the RTL-SDR but smaller than the 10.6MHz supported by the SDRplay RSP1A (see the previous blog post for more technical information). So, to fully cover LKTB, we need either a pair of RTL-SDRs or one RSP1A. We chose the second option. See our configuration in the following figure:
The green boxes indicate the 25kHz bandwidth of one channel. We placed the center frequency in the middle. The “bandwidth” of the SDR - the sampling frequency - was chosen wider than needed to overcome possible distortions at the edges.
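The arithmetic behind this planning step is simple enough to sketch; the channel list below is a hypothetical example consistent with the 7.75MHz Tower-Approach spacing mentioned above:

def plan_sdr(freqs_mhz, usable_bw_mhz):
    # Does the channel set fit into one SDR, and where should the center go?
    span = max(freqs_mhz) - min(freqs_mhz)
    center = (max(freqs_mhz) + min(freqs_mhz)) / 2
    return span <= usable_bw_mhz, center, span

channels = [119.600, 127.350]   # hypothetical Tower and Approach frequencies
fits_rtl, _, span = plan_sdr(channels, 2.5)     # RTL-SDR bandwidth
fits_rsp, center, _ = plan_sdr(channels, 10.6)  # RSP1A bandwidth
print(f"span {span:.2f} MHz, RTL-SDR ok: {fits_rtl}, RSP1A ok: {fits_rsp}, center {center:.3f} MHz")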
A sample rtl-airband config for the SDRplay RSP1A device would look like the following:
country = "Czech Republic";
location= "49.25411,16.58154";
fft_size = 1024;
devices:
({
type = "soapysdr"; #driver
device_string="driver=sdrplay,serial=xxxxxxxxxxxxxx";
gain = "IFGR=20,RFGR=2"; #Every type of device has different gain settings
centerfreq = 123.500; #MHz
correction = 0;
mode = "multichannel";
sample_rate = 9.00; #bandwidth in MHz around centerfreq
channels:
({
freq = 119.600;
airport = "LKTB";
label = "BRNO_Tower";
outputs:
({
type = "rawfile";
directory = "/home/pi/output_airband";
filename_template = "BRNO_Tower_119_600MHz";
split_on_transmission = true;
});
});
});
The LKPR airport has more channels. One of the radar channels and the Tower are the problematic ones, as they lie far from the rest; we would need about 16MHz of bandwidth to cover them all. We analyzed the traffic on the channels and found that the radar on 127MHz is a “copy” of the radar on 120MHz, so we discarded it. Finally, our solution was to use an SDRplay RSP1A and an RTL-SDR (on two separate antennas). The RSP1A covered the group of channels around 123MHz and the RTL-SDR took care of the Tower on 134MHz. See the following figure:
We set the bandwidth of the RSP1A to 5MHz, which gave us 14-bit sampling precision (better audio quality). We limited the bandwidth of the RTL-SDR and initially set the center frequency equal to the Tower frequency (134.55MHz).
However, we found a problem with recording the Tower (note: our setup is very close to the airport, so we have a strong signal). We had strong harmonic distortion in the audio signal. See the following spectrogram:
Notice the spectral line around 1.6kHz. The RSP1A did not suffer from this problem. The problem is called ghosting (thanks to https://www.sdrplay.com/community/viewtopic.php?t=2968): a strong source near you may leak into your recording (even if it transmits on a different frequency). We tried changing the bandwidth and gain, but it did not help. The solution was to change the center frequency.
A sample rtl-airband config for the RTL-SDR device:
country = "Czech Republic";
location= "50.10678,14.26600";
fft_size = 512;
devices:
({
type = "rtlsdr"; #driver
index = 0;
gain = 15; #Every type of device has different gain settings
serial = "00000001";
centerfreq = 134.750; #MHz
correction = 0;
mode = "multichannel";
sample_rate = 900100; #bandwidth in Hz around centerfreq
channels:
({
freq = 134.550;
airport = "LKPR";
label = "PRAGUE_Tower";
outputs:
({
type = "rawfile";
directory = "/home/pi/output_airband";
filename_template = "PRAGUE_Tower_134_550MHz";
split_on_transmission = true;
});
});
});
There are two more parameters that have an impact on the audio quality: gain and fft_size.
FFT size is an internal parameter of the signal processing. The larger the value (a power of 2), the slightly better the signal, but the more CPU power is needed. A good rule of thumb is that wider bandwidths call for a larger FFT size. Tune this parameter (128 / 256 / 512 / 1024) and watch the CPU load and signal quality. If you set it too high, the signal starts to get choppy.
Setting up the gain(s) is critical. There may be several gain controllers on your device: the RTL-SDR has 1 gain control, the SDRplay RSP1A has 2. Please consult the documentation, support, or community for your device to find the block diagrams, gain controllers and proper settings. In general, you should set the gain as low as possible. Ideally, you should tweak only the analog gain closest to the antenna; the rest can be switched off. If you set the gain too low, you will receive noisy audio signals, as there is not enough energy and your signal will be coded in only a few of the ADC's lower bits. On the other hand, if you set the gain too high, clipping appears at the ADC and you get distorted, “noisy” recordings.
We tuned the gains carefully and did some more experiments which we will share with you in one of our next blog posts. To make the long story short:
| IFGN \ RFGN | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 20 | 6.75 | 10.33 | 12.29 | 11.76 | 9.70 |
| 25 | 10.39 | 11.385 | 11.55 | 11.83 | 8.97 |
| 30 | 11.20 | 11.36 | 11.39 | 10.71 | 5.47 |
Table of gain tuning of RSP1A connected to the Watson WBA-20 antenna for LKTB. Values are SNR [dB].
The RFGN (columns) is the main gain control on the SDRplay RSP1A; the higher the value (0-9), the smaller the gain. The IFGN is a “minor” gain controller which does not have much influence if the RFGN is tuned properly. You can see that there is an optimal point at RFGN = 2 with the IFGN “switched off”.
We encourage you to do a similar exercise. You do not need to calculate the SNR; just collect a sufficient amount of audio and listen to it. ATIS or Tower channels are good candidates, as their signals should be stable. Then try different gains and find the optimum.
That is all about setting up the SDR software. We hope this will help you set up the recording easily and with good results. We are aware that many things were simplified here; going deeper into the principles is beyond the scope of these blog posts. If you are interested, please study further resources.
Welcome to our ATCO2 project site. This is the first blog post in a short series on “What SDR to buy, where to place it, how to set it up and connect it to the OpenSky Network platform.” We hope these posts help you receive clean audio signals from ATC VHF communication and feed the community. As most of us were noobs in SDR, we had to learn a lot, and now we are sharing what we learned to make your life easier. If you are an expert in this area, feel free to skip this post. If you think something is missing here, share your thoughts!
You have decided to buy, set up and use an ATC receiver. Congratulations on your decision! Now let's see what you need to do. You must decide WHERE to place it, WHAT HW to buy, and HOW to set it up. The WHAT is discussed in this and the next blog post, while the WHERE and HOW are discussed in the following posts.
You should select a place with as clear a line of sight as possible to the airport tower (or to wherever the transmitter antennas are). Use an on-line map and make an elevation profile between your position and the airport; there should not be any hills in between. It is even better if your position is close to an approach route or holding pattern, as you will get a clean signal from the planes above you.
Now come the general WHAT answers. What you buy depends very much on your budget. We have tried two variants:
There are four components you need to take into account:
You want a 50-ohm antenna with the highest “gain” (well, an antenna is a passive device, so technically there is no gain; you want to minimize signal loss). There are many types of antennas, so select one you can easily mount where you want to. We have tried several types (J-pole, discone, dipole). One important parameter of an antenna is its frequency range (or tuned frequency); here, you are interested only in the Rx (receiving) range. The range should cover the airband (ATC frequencies), which lies between 108MHz and 137MHz (usually around 122MHz). It is good to have a narrow-band antenna tuned just for these frequencies: a narrow-band antenna may lower the noise coming from other strong sources around you (AM/FM radio stations, TV stations, GSM, ...).
You need a 50-ohm coaxial cable to connect your antenna to the SDR device. Every cable has a signal loss, so you want a cable as short as possible (but keep some reserve). We used the low-loss LLC category. The lower the loss, the higher the price. It is also good to check the technical specifications and find the loss (in dB per m) for the given frequency range (you are not interested in the loss at 2GHz, just around 120MHz). The last important property of a coax cable is its connectors. Every connector introduces signal losses. Find a cable with the right connectors for your antenna (e.g. an N-type female connector on the cable side) and your SDR device (usually an SMA male connector on the cable side); adding adapters increases the signal loss. Warning: a coaxial cable cannot be bent in a small radius (several centimeters / an inch); such a bend may introduce high losses. Check the smallest allowed radius in the cable's technical specifications. Note: if you want to use more receivers on one antenna, you will need to buy an active splitter. We have tried it and it works, but we will not go into details here.
You need an SDR (Software Defined Radio) receiver. There are other types of receivers, but we ignore them here for the sake of simplicity. SDR means that the receiver just digitizes the analog signal from the antenna; the voice decoding is done by software on a computer. The SDR is the most expensive item on your bill, and the more expensive it is, the better the quality (meaning it copes better with low-quality signals). An SDR usually contains some analog circuits (gain controllers, filters, etc.), an analog-to-digital converter (ADC), and communication chips to talk to the computer (handling the USB port, for example). One of the most important parameters is the dynamic range of the SDR, which is defined by the ADC. The problem is that you will face both strong and weak signals: if the dynamic range is small, the strong signals may lead to clipping (signal distortion) while the weak ones sink into the noise. The quality of the analog part is also essential to overcome noise coming from your computer, power supplies and other electronic devices at home. The minimum is an 8-bit SDR, but if you can afford a 12-, 14- or 16-bit SDR, all the better. (Some more reading about SDR sensitivity is here: SDR Receiver Performance Overview)
Here you want something small with low power consumption, but powerful enough to decode all the channels you want to listen to and share with the community. The computer should also have an internet connection (WiFi, Ethernet, etc.). You can use an old notebook, your desktop, some sort of Raspberry Pi, etc. Just take into account that the computer should be always on (if you want to be one of our data feeders). You connect the SDR to the computer (by USB in most cases) and then the computer to the Internet. We provide you with a description of how to install and configure all the software needed. Several programs run on the computer. First, there is a radio demodulator. This program takes the raw data (digitized signal) from the SDR and extracts the voice; amplitude modulation is used in VHF ATC. The program listens to the selected frequencies (yes, you can tune in and listen to voice communications in parallel), detects communication (when the pilot pushes a button and starts to talk), passes the data through the demodulator, and stores the demodulated audio internally. Another program immediately post-processes these files and sends them to our servers. You can then log in to the OpenSky Network web and listen to your recordings.
That is all the compressed basic information about what you need to set up your own data feeder and start listening to ATC communication. We will go deeper in the next post, where we will share what devices we tried and what results we got.
Check out our previous blog posts:
In this blog post, we will look closer at the hardware (HW) setups for ATC recording from the VHF channel. We previously gave a general overview of the four most important components: antenna, coaxial cable, SDR receiver, and computer (the computing resource). We built and tested two HW setups: the first costs about 200EUR and serves as an “entry solution”, while the second, at about 435EUR, is the better one. The table below describes both configurations:
| Component | Entry solution (more affordable): Item | Price (EUR) | More expensive: Item | Price (EUR) |
|---|---|---|---|---|
| Antenna | Sirio MD 118-137 incl. 5m cable | 40 | Watson WBA-20 | 60 |
| Coax cable | - | 0 | LLC200A 20m | 77 |
| SDR receiver | RTL-SDR | 50 | SDRplay RSP1A | 130 |
| Raspberry Pi | RPi 3 - 1GB | 40 | RPi 4 - 8GB | 92 |
| RPi case | Metal case + active cooling | 24 | Argon One | 28 |
| micro SD | 256GB | 38 | 256GB | 38 |
| Power source | USB 5V 2.5A | 10 | USB-C 5V 3A | 10 |
| SUM | | 202 | | 435 |
Let us discuss the items now.
The antenna is crucial, as it has a direct impact on the SNR of the radio communication. We decided to purchase two dipole antennas tuned for the aviation frequencies (118MHz-137MHz): the Sirio MD 118-137 and the Watson WBA-20. The Sirio comes with a 5m coaxial cable. We also purchased a 20-meter LLC200A coaxial cable for the Watson antenna, to allow easy mounting of the antenna on a roof with minimum signal loss. The Sirio, on the other hand, is good for mounting on a balcony, for example.
We did a set of experiments (see the details in one of our next blog posts) to estimate the impact of the different antennas on ATC voice quality. The voice quality was measured by SNR: signal-to-noise ratio, or rather speech-to-noise ratio. Our conclusion was that we were able to get a +6 to +10dB better SNR with the Watson antenna.
We also tested a wideband double-discone antenna and a narrow-band J-pole antenna. These were tested at other locations close to the airport, so we do not have a direct comparison of all four antennas. The custom-built J-pole-style antenna, tuned to 135MHz, was connected to 5 meters of RG-58-type coaxial cable. The double discone is a wideband antenna tuned to receive 25-2000MHz; it was connected to an active two-way splitter using 2 meters of CFD240-type coaxial cable (rated up to 5GHz).
Both antennas worked well but we cannot make any deeper comparison. These antennas belong to one of our data feeders who allowed us to use them.
We followed the general recommendation and purchased the suggested “standard” for airband: the RTL-SDR dongle. We also aimed to test a technically better solution, still on a reasonable budget; after a quick survey, we decided to go for the SDRplay RSP1A. Both receivers have an SMA female coaxial connector and USB. The difference lies in the internal circuits (gain controllers, filters, ADC, etc.).
The main advantage of the SDRplay RSP1A over the RTL-SDR is that it samples at up to 14 bits (versus 8 bits for the RTL) and has up to 10.6MHz of recording bandwidth (versus 2.5MHz for the RTL). Note: the 14-bit precision is available up to 6MHz bandwidth, 12-bit up to 8MHz, 10-bit up to 9.2MHz and 8-bit above 9.2MHz. Both of these parameters are critical, because we aim to collect all available frequencies used by a given airport (Tower, Approach, Radar, Ground, Departure, ...) to monitor the whole flight communication. It often happens that the frequencies are spread over a window larger than the 2.5MHz of the RTL dongle; sometimes even the 10MHz bandwidth is not enough, and two receivers are required. Furthermore, the 14-bit depth may help to get a better SNR (signal-to-noise ratio), but it depends on the bandwidth used. Please see this post for a deeper channels-vs-bandwidth analysis and suggestions.
We decided to use Raspberry Pi mini computers to run the SDR software and the processing pipeline. Both Raspberry Pi models are small and powerful enough. We bought an RPi 3B+ (with 1GB of RAM) as the ‘entry solution’ and the most powerful RPi 4 with 8GB of RAM. We used a 256GB microSD card for the system and data storage. To avoid overheating, we also used active cooling (heatsink and fan). There are dozens of combinations of RPi models and cases; you probably want a good passive heatsink to minimize the noise coming from a fan. The RPi 4 in the Argon One case (https://www.amazon.com/Argon-Raspberry-Aluminum-Heatsink-Supports/dp/B07WP8WC3V) was an excellent solution: the case is easy to mount and has sufficient passive cooling for processing 4 channels in parallel, which the RPi 4 handles at about 90% load on 1 core. We noticed that the RPi 3B+ cannot handle the 4 channels coming from the RSP1A. So if you want to receive just 1 or 2 channels, the RPi 3B+ and the RTL may be good enough; otherwise we suggest you go with the RPi 4. One of our next posts will discuss the settings of the processing pipeline on the RPi and what can be tweaked.
We hope you have gained some better insight into the HW needed for receiving, processing and feeding ATC communication. There are also other possibilities, so please do not hesitate to search for them. What we put here is our experience and what worked for us on a reasonable budget.
Check out our previous blog posts:
Let’s take a look at where to mount your antenna and how to check whether your place is good or not. The first step is to find the elevation (or altitude) profile between your place and the airport. We expect that the transmitting antennas are on the airport tower or nearby; it is good to check exactly where the airport's antennas are placed. There should not be hills or other obstacles in between; the best option is direct visibility. Any obstacle will block the signal coming from the Tower: you will hear the pilots well, but not the ground. We share two cases: the LKTB and LKPR airports.
The LKTB is a small “international” airport.
We were lucky that one of our project partners lives in a house with a good enough position, and the elevation profile does not look too bad. The distance is about 14 kilometers, and the hill in between is not very high, although there is no direct visibility. On the other hand, he lives “under” one of the approach routes (November - Bravo). We mounted the Watson antenna on the roof (and also experimented with the Sirio antenna - see this post for details). To make a long story short: it works. We got good results with the Watson antenna and the RSP1A. However, the combination of the Sirio antenna and the RTL-SDR was not so successful; the amount of good-quality speech was significantly lower compared to the Watson + RSP1A. So if you are in similar conditions, consider a more expensive (and better) receiver and antenna.
LKPR is the largest airport in the Czech Republic. Here we were luckier: our data feeder lives in an ideal position. The signal there is strong, so all combinations of the devices worked well. The only problem is that part of one runway is below the horizon, so we receive low-quality signals from airplanes in that position.
Notes: If you install the antenna on your roof, be sure to place it as high as possible above a metal roof. Also make sure it is well grounded; the last thing you want is a lightning bolt hitting your antenna. If there is a thunderstorm near you, you should unplug the coaxial cable from your SDR receiver (and ideally put the free end outside). A close lightning strike can induce a high voltage, and the free cable connector may be dangerous. Please search the internet for a proper solution; here is a nice solution for ADS-B antennas.
In the last blog post, we introduced a way to improve the word error rate (WER) on callsigns in the automatic speech recognition (ASR) output by incorporating surveillance information in the transcription process. In this blog post, we want to talk about extracting the callsigns from the ASR output. The process of callsign recognition can be broken down into two stages:
1) Tagging the callsign in the sequence
2) Mapping of the callsign word sequence into its ICAO format (ICAO stands for International Civil Aviation Organization)
Figure 1 illustrates the two-stage process. In the tagging step, the input transcript originating from our ASR system is tagged in the IOB format (short for inside, outside, beginning) to find the tokens that are part of a callsign. In the second step, the part of the ASR transcript tagged as a callsign (labeled B/I-CALL) is mapped to the standard ICAO format for callsigns, which consists of a three-character airline identifier followed by a flight ID made of several digits, optionally followed by one or two characters. (If you are interested, a list of airline identifiers can be found here: https://en.wikipedia.org/wiki/List_of_airline_codes)
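In the project, this mapping is learned by a neural network (see below); purely to illustrate what the mapping does, here is a rule-based toy with made-up lookup tables:

# Toy illustration of step 2. The real system learns this mapping with a
# neural network; these tiny lookup tables are illustrative only.
AIRLINES = {"lufthansa": "DLH", "c_s_a": "CSA", "ryanair": "RYR"}
DIGITS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
          "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}
NATO = {"alfa": "A", "bravo": "B", "charlie": "C", "delta": "D", "echo": "E"}

def to_icao(callsign_words):
    out = []
    for w in callsign_words:
        for table in (AIRLINES, DIGITS, NATO):
            if w in table:
                out.append(table[w])
                break
    return "".join(out)

print(to_icao("lufthansa three two alfa bravo".split()))   # DLH32AB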
Since we have two processes, the obvious idea is to train two different networks for the task: one that specializes in tagging and one that takes care of mapping the sequence tagged as a callsign into the ICAO format. In this case, both processes can be tuned individually. The drawback of this architecture is that information lost in the first step cannot be recovered in the second step. The other possibility is to train an end-to-end network that directly outputs the ICAO callsign given the ASR transcript as input. This architecture has the benefit that there is no information loss in between. Both architectures are visualized in Figure 2. Our experiments showed that the end-to-end approach performs better than the two-network solution in the majority of test cases.
A closer look at Figure 1 reveals that the predicted ICAO callsign contains information that is missing from the labels and the transcript, namely the last two digits of the flight ID. This information comes from the surveillance data. Callsigns from planes near the location where the ATC communication is recorded are time-matched with the recordings and fed as additional input into the network, as seen in Figure 3. If the transcript contains only partial information about a callsign, the missing information can be recovered from the surveillance input. The end-to-end network shows a callsign accuracy over 90% on clean transcripts if surveillance information is available. On our ASR output with a WER of 28.7%, an accuracy over 80% is reached. The network also shows increased resistance to higher ASR WERs. The accuracy scores for two different datasets can be found in our Interspeech paper submission: “Boosting of contextual information in ASR for air-traffic call-sign recognition”.
The ATCO2 project is proud to have in its consortium one of the best research centers in the field of automatic speech recognition:
According to the latest ranking of the AI 2000 Most Influential Scholars, the Faculty of Information Technology (FIT) of BUT is among the world leaders in this field: it ranks among the five most important institutions worldwide, next to Google, Facebook, IBM and Carnegie Mellon University. FIT researchers Lukáš Burget, Jan Černocký and Pavel Matějka are also on the list of the TOP 100 most influential researchers in the world, together with Tomáš Mikolov, a FIT graduate. Brno University of Technology is the only institution from the Czech Republic in this ranking. AMiner indexes authors, publications and data from the field of computer science. The Faculty of Information Technology of BUT and the research group BUT Speech@FIT have long been among the leaders in the field of speech data mining.
This year, Brno will host InterSpeech 2021, the world's largest conference in this field.
Fortunately, we have additional information that can help us with recognizing callsigns. The radar, which every air traffic tower has, tells us which planes are in the vicinity, and since an ATCo can only be talking to one of these planes, we know that if a callsign was said, it must be one of those on the radar. We developed two methods that use this information to improve callsign recognition.
The first method modifies the speech recognition system directly to boost the probability of recognizing the callsigns known from the radar. Thanks to advances in efficient transducer composition [1], these modifications can be made in a way that allows continuous updating of the model as new information from the radar comes in. This means the model is tuned in real time and can adapt to changes in the real world. We published this technique in a paper at ICASSP 2021 [2].
The second method post-processes the output of the speech recognition system to boost the probability of recognizing specific callsigns. This is simpler, as it just involves rescoring the system output. Rescoring means giving certain outputs more weight, thereby increasing the chance that the higher-weighted terms appear in the model predictions. The rescoring was implemented as WFST composition [3].
Both methods worked well, increasing callsign accuracy by up to 30%. We believe there is still room for improvement and plan to work further on this topic.
[1] Filters for Efficient Composition of Weighted Finite-State Transducers
[2] A comparison of methods for oov-word recognition on a new public dataset
Biometrics refers to technologies that measure and analyse a person's physical characteristics, making it possible to identify a person through their biometric features; these technologies can also be used for authentication purposes.
From a data protection perspective, biometric technologies in general are closely linked to specific physical, physiological, behavioural or even psychological characteristics of a person, and some of them might also reveal sensitive data.
As to the voice, biometrics may concern the analysis of the tone, pitch, cadence and frequency of a person’s voice, which can make it possible to determine if a certain person is who he/she declares to be, or the identity of an unknown person, if matched with data from other databases.
Biometric data may also allow for automated tracking, tracing or profiling of persons and, as such, their potential impact on the privacy and the right to data protection of individuals is high, as also observed by the EU data protection authorities.
Moreover, biometric data are irrevocable: a breach concerning biometric data threatens the further safe use of biometrics as an identifier and the right to data protection of the persons concerned, with no possibility of mitigating the effects of the breach.
One can change one's passwords if forgotten or compromised, or one's house keys if lost, but not one's voice.
Voice biometric authentication systems are based on measurements of the biological characteristics of an individual and comparison with data of individuals previously verified and recorded in a database through a mechanism called enrollment.
Every spoken word (of a predefined speech sample) is converted, by a chain of mathematical operations, into a person's voice print (also called an ‘i-vector’ in the R&D community), which is stored in the database. The database can then be interrogated to determine whether a speaker is the person he/she claims to be, by comparing the stored voice print with the speaker's, or even to determine which speaker in a group of known speakers most closely matches an unknown speaker (in which case it is more appropriate to speak of identification systems rather than authentication systems).
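A minimal sketch of the verification step (the threshold and the random vectors are illustrative; real systems calibrate the decision threshold on held-out trials):

import numpy as np

def verify(enrolled_print, test_print, threshold=0.7):
    # Accept the claimed identity if the cosine similarity between the
    # enrolled voice print and the new voice print exceeds the threshold.
    cos = np.dot(enrolled_print, test_print) / (
        np.linalg.norm(enrolled_print) * np.linalg.norm(test_print))
    return cos >= threshold

rng = np.random.default_rng(1)
enrolled = rng.normal(size=400)                      # stored at enrollment
sample = enrolled + rng.normal(scale=0.3, size=400)  # new sample, same speaker
print(verify(enrolled, sample))   # True for this toy example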
According to the General Data Protection Regulation (Article 9), biometric data may be regarded as a ‘special category’ of data (commonly called sensitive data).
However, for it to be considered processing of special categories of personal data (Article 9), the biometric data must be processed “for the purpose of uniquely identifying a natural person”.
In short, in the light of Articles 4(14) and 9, three criteria must be considered: the nature of the data (relating to the physical, physiological or behavioural characteristics of a person), the means and way of the processing (a specific technical processing), and the purpose of the processing (allowing or confirming the unique identification of a natural person).
Sensitive data may only be processed if specific conditions are met: for example, if the data subject has given explicit consent, or if the processing is necessary for reasons of substantial public interest on the basis of EU or Member State law (Article 9(2)).
Being an EU Regulation, the GDPR is directly applicable in all EU Member States, but we should remember that in some cases it leaves States free to adopt specific rules, as in the case of the special categories of data.
Member States may actually maintain or introduce further conditions, including limitations, with regard to the processing of genetic, biometric or health data.
Attention should thus be paid to State-specific rules and regulations.
(Romagna Tech, Claudia Cevenini)
References
Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation)
Obviously, one needs the voice recordings in order to do the conversion, and that will be the focus of this blog post: we'll take a look at how to set up a VHF (very high frequency) receiver. Luckily, it is really simple, and a person interested in eavesdropping on the pilot-controller dialogue does not have to be an expert in radio equipment.
So, what is needed to get started?
- a Raspberry Pi single-board computer (with an SD card and a power supply)
- an SDR receiver dongle
- an antenna
These things will make set-up easier:
The first step to get started is to set up the Raspberry PI, a versatile single-board computer that can be used for developing pretty much anything.
More about the Raspberry Pi project can be found here: https://www.raspberrypi.org/
As the Raspberry Pi project has produced really easy to follow instructions on their homepage, we are not going to give exhaustive instructions here. Instead we’ll guide people to here: https://projects.raspberrypi.org/en/projects/raspberry-pi-setting-up
Alright, hopefully by now you have a Raspberry Pi ready to be used. Next, software to enable radio signal reception needs to be installed. The following is based on the instructions given here: https://atco.opensky-network.org/
The software is based on RTLSDR-Airband. It’s beautifully crafted open source software that can be found here: https://github.com/szpajder/RTLSDR-Airband/
It pretty much has three main parts in it:
(I live in Tallinn, Estonia, so the following examples are based on that. And although you can have more than one receiver dongle attached to your Raspberry Pi, the following examples assume only one.)
a. Hit the “Add New” button and confirm your choices
(Note the comments explaining some of the choices you need to make. You can have more than one receiver dongle per Raspberry Pi.)
b. Specify parameters related to the receiver location
(It will be used for proposing frequencies you could listen to by searching airports that are close to your location. Pick one method to localize yourself.)
c. Choose an airport whose communication you would like to listen to
(Normally you would not want to pick one that is further away than 10km or so, but in some cases it might still be of interest; for example, when you're directly underneath the descent route and would like to eavesdrop on the pilots' talk. In that case you'll probably have bad reception of the controller's voice.)
d. Choose the frequencies you would like to record.
(You can choose to listen to more than one frequency, BUT if the difference between the frequencies is greater than the bandwidth you specified in step “a”, an error message will be given. It looks something like this: “Bandwidth of device 0 - My New VHF Receiver exhausted! Used bandwidth 7.300000000000011 - available 2.4”. If you see something like that, just reselect the frequencies you want to follow and consider using multiple SDR dongles.)
e. Place the config file to the right place
After you download the configuration file, place it in the right folder and restart the device; the receiver will then start to record the communication taking place on the chosen frequencies. By default, the audio files are created in "/home/pi/output_airband".
You can make whatever modifications you like to the configuration file. The instructions can be found here: https://github.com/szpajder/RTLSDR-Airband/wiki
f. Hear the skies with your new VHF receiver
By now, you have probably set up both the hardware and the software, so it is time to start hearing what's in the sky. First, open a terminal and type “rtl_airband -ef”. After some time, you'll see in the terminal the frequencies that your receiver is “hearing”; when you see “*”, it means that the recording system has been activated and an output file is being created.
After some time, you can check the output folder (~/output_airband/) to see the segmented files in cs16 format, each accompanied by a cs16.info file which shows some key information.
And that’s it. I hope you enjoy following what’s going on in the skies above you. And don’t forget to contact any of the project members should you have any comments.