Figure 1. Interspeech 2021 will be held between August 30 and September 3, 2021.
This blog post briefly reviews each of the three research papers ATCO2 will present on-site during INTERSPEECH. The first paper is related to the language used during ATC communication.
We launched a community platform for collecting ATC speech worldwide in the ATCO2 project. Filtering out unseen non-English speech is one of the main components of the data processing pipeline. The proposed English Language Detection (ELD) system is based on embeddings from a Bayesian subspace multinomial model and is trained on word confusion networks from an ASR system. It is robust, easy to train, and lightweight. In the in-domain scenario, we achieved an equal-error rate (EER) of 0.0439, a 50% relative reduction compared to the state-of-the-art acoustic ELD system based on x-vectors. In the unseen-language (out-of-domain) condition, we achieved an EER of 0.1352, a 33% relative reduction compared to the acoustic ELD. We plan to publish the evaluation dataset from the ATCO2 project.
Further information in the following links:
Teaser: https://www.youtube.com/watch?v=qj42c4qmmAc
Abstract and paper: https://arxiv.org/abs/2104.02332
Contextual adaptation is a technique of “suggesting” small snippets of text that are likely to appear in the speech recognition output. The snippets of text are derived from the current “situation” of the speaker; in our project ATCO2, this is location and time. The location and time are then used to query the OpenSky Network for a list of callsigns (airplanes) that match these two inputs.
Applying Automatic Speech Recognition (ASR) to the Air Traffic Control (ATC) domain is difficult due to factors such as noisy radio channels, foreign accents, cross-language code-switching, a very fast speech rate, and a situation-dependent vocabulary with many infrequent words. Combined, these lead to error rates that make it difficult to apply speech recognition.
For ASR in ATC, contextual adaptation is beneficial. For instance, we can use a list of airplanes that are nearby. From an airport identity, we can derive local waypoints, local geographical names, phrases in the local language, etc. It is important that the adaptation is dynamic, i.e. the adaptation snippets of text change over time. The adaptation also has to be lightweight, so it should not require rebuilding the recognition network from scratch. We apply the snippets of text by means of Weighted Finite State Transducer (WFST) composition. An example of a biasing FST is shown in Figure 2.
Figure 2. “Toy-example” topology of a biasing WFST graph for boosting the ASR’s recognition network. The boosted callsign is ‘CSA one two three alfa bravo’.
Further information in the following link:
Paper: Boosting of contextual information in ASR for air-traffic call-sign recognition
Air traffic management and specifically air-traffic control (ATC) rely mostly on voice communications between Air Traffic Controllers (ATCos) and pilots. In most cases, these voice communications follow a well-defined grammar that can be leveraged in Automatic Speech Recognition (ASR) technologies. The callsign used to address an airplane is an essential part of all ATCo-pilot communications. We propose a two-step approach to add contextual knowledge during semi-supervised training to reduce the ASR system's error rate on the part of the utterance that contains the callsign. First, we represent the contextual knowledge (i.e. air-surveillance data) of an ATCo-pilot communication in a WFST. Then, during Semi-Supervised Learning (SSL), the contextual knowledge is added by second-pass decoding (i.e. lattice rescoring). Results show that 'unseen domains' (e.g. data from airports not present in the supervised training data) benefit further from contextual SSL compared to standalone SSL. For this task, we introduce the Callsign Word Error Rate (CA-WER) as an evaluation metric, which assesses ASR performance only on the spoken callsign in an utterance. We obtained a 32.1% relative CA-WER improvement by applying SSL, with an additional 17.5% CA-WER improvement from adding contextual knowledge during SSL, on a challenging ATC-based test set gathered from LiveATC.
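As a reading aid, here is a hedged sketch of how a CA-WER-style metric can be computed, assuming the reference and hypothesis callsign spans have already been extracted; the edit-distance helper is generic and not the paper's exact scoring code.

```
# Sketch: WER restricted to callsign spans (CA-WER-style metric).
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    d = [[i + j if i * j == 0 else 0 for j in range(len(hyp) + 1)]
         for i in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))
    return d[-1][-1]

def ca_wer(pairs):
    """pairs: list of (reference callsign words, hypothesis callsign words)."""
    errors = sum(edit_distance(r, h) for r, h in pairs)
    return errors / sum(len(r) for r, _ in pairs)

# One utterance where the hypothesis dropped a word of the callsign:
print(ca_wer([("csa one two three".split(), "csa one three".split())]))  # 0.25
```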
Figure 3. Process of retrieving a list of callsigns (contextual data) from OpenSky Network. The contextual data is the compendium of all possible verbalized versions of each callsign.
Further information in the following links:
Paper: Contextual Semi-Supervised Learning: An Approach To Leverage Air-Surveillance and Untranscribed ATC Data in ASR Systems
Programme:
The organised session is dedicated to automatic speech recognition in air-traffic management, and the following session agenda has been released:
Thu-M-SS-2 Thursday, September 2, 11:00-13:00 Special-Hybrid: Automatic Speech Recognition in Air Traffic Management
Air-traffic management is a dedicated domain where, in addition to the voice signal, other contextual information (e.g. air-traffic surveillance data, meteorological data) plays an important role. Automatic speech recognition is the first challenge in the whole chain. Further processing usually requires transforming the recognized word sequence into a conceptual form, a more important application in ATM. This also means that the usual metrics for evaluating ASR systems (e.g. word error rate) are less important, and other performance criteria are employed: objective ones such as command recognition error rate, callsign detection accuracy, overall algorithmic delay, real-time factor, or reduced flight times, and subjective ones such as a decrease in user workload.
This special session aims to bring together ATM players (both academic and industrial) interested in ASR and ASR researchers looking for new challenges. This can accelerate near-future R&D plans to enable the integration of speech technologies into the challenging but highly safety-oriented air-traffic management domain.
The organisation is shared by two people: Hartmut Helmke (DLR, coordinator of the HAAWAII project) and Pavel Kolcarek (Honeywell, topic manager of the ATCO2 project).
This page presents more information about the datasets collected and open-sourced by the ATCO2 project. The corpora released by ATCO2 can be used for many speech and text-based machine learning (ML) tasks, including:
The figure below depicts the type of annotations offered by our corpus.
Find below some links of interest:
The ATCO2 corpora are split into 3 main parts:
Consists of audio and raw metadata:
License: Available for Commercial and Non-Commercial Use (see ELRA)
The official test data consist of:
License: Available for Commercial and Non-Commercial Use (see ELRA)
Sample test data for research purposes, consisting of:
License: available for research purposes
An overview of the data processing pipeline developed by the ATCO2 project and used to collect the ATCO2 corpus is depicted in the figure above. The pipeline consists of several steps:
ATCO2 used this pipeline to pre-process both the ATCO2-PL-set corpus (the training corpus) and the ATCO2-test-set corpus.
The ATCO2 corpus is publicly available in the ELDA catalog at the following URL: http://catalog.elra.info/en-us/repository/browse/ELRA-S0484/.
During the ATCO2 project, audio data was collected from radio receivers (feeders) placed near different airports worldwide. Simultaneously, we captured ADS-B (radar) data that we matched with the audio recordings. This step is of special importance because it allows the ATCO2 corpora to be used for contextual ASR. In contextual ASR, we boost certain entities at decoding time, which brings two benefits: i) reduced WER, and ii) increased accuracy on entity detection, e.g. callsigns.
ADS-B data: alongside the audio-transcript pairs of the training data, we also offer radar data (ADS-B) aligned to the target sample. For instance, the sample below shows the files available for the recording `LKPR_Tower_134_560MHz_20220119_185902`.
```
├── LKPR_Tower_134_560MHz_20220119_185902.boosting
├── LKPR_Tower_134_560MHz_20220119_185902.callsigns
├── LKPR_Tower_134_560MHz_20220119_185902.cnet_10_b15-13-400
├── LKPR_Tower_134_560MHz_20220119_185902.info
├── LKPR_Tower_134_560MHz_20220119_185902.segm
├── LKPR_Tower_134_560MHz_20220119_185902.wav
```
The files ending in “.callsigns” and “.boosting” contain ADS-B data in ICAO format, e.g., ECC502 or SWR115Z. The “.boosting” file contains different verbalizations of each callsign: we take the ICAO callsign and verbalize it as, e.g., “eclair five zero two; eclair zero two; swiss one one five zulu; swiss one five zulu”.
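A minimal sketch of this kind of expansion is shown below; the telephony-mapping excerpt and the single shortening rule are illustrative assumptions, not the project's full verbalization rule set.

```
# Illustrative callsign verbalization: ICAO "SWR115Z" -> spoken variants.
NATO = {"A": "alfa", "B": "bravo", "C": "charlie", "S": "sierra",
        "Z": "zulu"}  # excerpt; the full ICAO alphabet is used in practice
DIGITS = {str(i): w for i, w in enumerate(
    "zero one two three four five six seven eight nine".split())}
# Hypothetical excerpt of an airline-designator -> telephony mapping.
TELEPHONY = {"SWR": "swiss", "ECC": "eclair"}

def verbalize(icao: str) -> list[str]:
    """Expand an ICAO callsign into plausible spoken variants."""
    airline, tail = icao[:3], icao[3:]
    words = [TELEPHONY.get(airline, " ".join(NATO[c] for c in airline))]
    words += [DIGITS[c] if c.isdigit() else NATO[c] for c in tail]
    variants = [" ".join(words)]
    if len(words) > 2:  # one shortened form pilots often use: drop a digit
        variants.append(" ".join([words[0]] + words[2:]))
    return variants

print(verbalize("SWR115Z"))  # ['swiss one one five zulu', 'swiss one five zulu']
print(verbalize("ECC502"))   # ['eclair five zero two', 'eclair zero two']
```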
Further information about the verbalization rules can be found in our papers:
In the table below, you can find additional characteristics of the ATCO2 corpora and some statistics about the collected databases, per airport:
If you are interested in acquiring the ATCO2 dataset, check the table above to find out whether the data you are seeking matches one of the airport packages. Note that in most cases you can select data with language scores higher than 0.5, which partly ensures that the audio is in English.
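As an illustration, here is a hedged pandas sketch of that score-based filtering; the metadata layout and the column names are hypothetical, not the corpus's actual schema.

```
# Hypothetical sketch: keep only likely-English recordings (score > 0.5).
import pandas as pd

meta = pd.DataFrame({
    "recording": ["LKPR_Tower_134_560MHz_20220119_185902",   # real example above
                  "LKTB_Tower_119_600MHz_20220119_190000"],  # hypothetical name
    "english_score": [0.92, 0.31],
})
english_only = meta[meta["english_score"] > 0.5]
print(english_only["recording"].tolist())
```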
The characteristics per airport can be easily exported to text files by running the preparation script from our GitHub repository: https://github.com/idiap/atco2-corpus/tree/main/data/databases/atco2_pl_set
In the table below, you can find:
You can find more information, including WERs, in the following papers:
We also release a set of GitHub repositories:
The ATCO2 corpora can be employed for several natural (or spoken) language understanding (NLU) tasks. They can be used to:
Further information is described in https://www.mdpi.com/2226-4310/10/10/898, while the figure below shows examples of the named-entity recognition and text-based speaker role detection tasks.
Furthermore, the table below shows the Precision (@P), Recall (@R), and F1-score (@F1) obtained when fine-tuning a BERT model on the named-entity recognition task with the ATCO2-test-set-4h in a 5-fold cross-validation scheme.
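For readers who want to reproduce something similar, here is a hedged sketch of one fine-tuning step with HuggingFace Transformers; the tag set, the toy utterance, and the choice of bert-base-uncased are illustrative, not the exact setup behind the table.

```
# Sketch: one token-classification training step for ATC NER.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-CALL", "I-CALL", "B-CMD", "I-CMD"]   # illustrative tag set
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels))

words = "csa one two three alfa bravo descend flight level eight zero".split()
word_tags = [1, 2, 2, 2, 2, 2, 3, 4, 4, 4, 4]          # callsign, then command

enc = tok(words, is_split_into_words=True, return_tensors="pt")
# Align word-level tags to subword tokens; special tokens get -100 (ignored).
aligned = [-100 if i is None else word_tags[i] for i in enc.word_ids()]

out = model(**enc, labels=torch.tensor([aligned]))
optim = torch.optim.AdamW(model.parameters(), lr=5e-5)
out.loss.backward()                                     # loop over real data in practice
optim.step()
```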
"The ATC LID/ASR evaluation dataset is going to be published at Interspeech 2021. Stay tuned!"
Abstract: Detecting English Speech in the Air Traffic Control Voice Communication
Name: ATCO2-LIDdataset-v1_beta
Description: This dataset was built for the development and evaluation of techniques for classifying English vs. non-English speech in ATC data. Note: the dataset is considered a beta version and will be updated in the future (more language pairs will be added and some cleaning/debugging may happen). The dataset consists of the following language pairs:
CZEN - devel (6.11 hours)
CZEN - eval (6.21 hours)
FREN - devel (2.68 hours)
FREN - eval (3.27 hours)
GEEN - devel, English only (5.61 hours)
GEEN - eval (2.41 hours)
EN-AU (Australian English) - eval, English only (0.17 hours)
Where possible, we split each pair into development and evaluation subsets. We provide audio (WAV format), an English automatic transcript generated by an ASR system, and an info file with the estimated SNR, language, and length.
Link to file to download: https://www.replaywell.com/atco2/download/ATCO2-LIDdataset-v1_beta.tgz
Applying Automatic Speech Recognition (ASR) to the Air Traffic Control (ATC) domain is difficult due to factors such as noisy radio channels, foreign accents, cross-language code-switching, a very fast speech rate, and a situation-dependent vocabulary with many infrequent words. Combined, these lead to error rates that make it difficult to apply speech recognition.
For ASR in ATC, contextual adaptation is beneficial. For instance, we can use a list of airplanes that are nearby. From an airport identity, we can derive local waypoints, local geographical names, phrases in the local language, etc. It is important that the adaptation is dynamic, i.e. the adaptation snippets of text change over time. The adaptation also has to be lightweight, so it should not require rebuilding the recognition network from scratch. We apply the snippets of text by means of Weighted Finite State Transducer (WFST) composition.
We apply on-the-fly boosting to the HCLG graph. The HCLG graph is the recognition network that defines the paths the beam-search HMM decoder will explore. This graph contains costs that can be altered. We do this by WFST composition, applied as:
HCLG’ = HCLG o B.
The composition is denoted by the operator ‘o’ and its algorithm is described in [1]. Informally, the output symbols of the left operand are coupled (matched) with the input symbols of the right operand, and the weights from both graphs are recombined in a way defined by the semiring of the WFST weights. The result is a single graph with the input symbols of the left operand and the output symbols of the right operand. An example of a boosting graph B is in Figure 1.
[1] Mehryar Mohri, Fernando Pereira, Michael Riley: Weighted finite-state transducers in speech recognition. Comput. Speech Lang. 16(1): 69-88 (2002)
[2] Keith B. Hall, Eunjoon Cho, Cyril Allauzen, Françoise Beaufays, Noah Coccaro, Kaisuke Nakajima, Michael Riley, Brian Roark, David Rybach, Linda Zhang: Composition-based on-the-fly rescoring for salient n-gram biasing. INTERSPEECH 2015: 1418-1422
Figure 1. “Toy-example” topology of a WFST graph B for boosting the recognition network HCLG. The boosting is done as the composition HCLG’ = HCLG o B, which introduces the score discounts into the HCLG recognition network.
As you can see, we are boosting individual words. We cannot boost whole phrases, since such a composition would require a lot of computation time. Also, we should not boost common words that are likely to be present in the lattice anyway. So, here we boost only ‘rare’ words like the airline designators from callsigns (e.g. ‘air_berlin’); in the future, we plan to also boost waypoints, local names, and frequent phrases in the local language. A toy sketch of this unigram boosting follows.
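Below is a toy sketch of the unigram boosting with OpenFst's pywrapfst (class names follow OpenFst >= 1.8; older releases expose fst.Fst() instead of fst.VectorFst()). The vocabulary, the boost value, and the linear “HCLG” stand-in are all illustrative; a real recognition network has transducer structure that is glossed over here.

```
import pywrapfst as fst

syms = fst.SymbolTable()
syms.add_symbol("<eps>")                      # label 0 is reserved for epsilon
VOCAB = ["air_berlin", "turn", "left", "heading"]
for w in VOCAB:
    syms.add_symbol(w)

def unigram_boosting_graph(boosts):
    """Single-state acceptor B: every word self-loops with weight 0;
    boosted rare words self-loop with a negative cost (score discount)."""
    B = fst.VectorFst()
    s = B.add_state()
    B.set_start(s)
    B.set_final(s, fst.Weight(B.weight_type(), 0.0))
    for w in VOCAB:
        cost = fst.Weight(B.weight_type(), boosts.get(w, 0.0))
        lbl = syms.find(w)
        B.add_arc(s, fst.Arc(lbl, lbl, cost, s))
    B.arcsort(sort_type="ilabel")
    return B

def linear_acceptor(words):
    """Toy stand-in for HCLG: a single path accepting `words`."""
    f = fst.VectorFst()
    s = f.add_state()
    f.set_start(s)
    for w in words:
        n = f.add_state()
        lbl = syms.find(w)
        f.add_arc(s, fst.Arc(lbl, lbl, fst.Weight(f.weight_type(), 0.0), n))
        s = n
    f.set_final(s, fst.Weight(f.weight_type(), 0.0))
    return f

B = unigram_boosting_graph({"air_berlin": -4.0})  # discount only the rare word
hclg = linear_acceptor(["air_berlin", "turn", "left"])
hclg.arcsort(sort_type="olabel")                  # left operand: sort outputs
boosted = fst.compose(hclg, B)                    # HCLG' = HCLG o B
print(boosted)                                    # the air_berlin arc now carries -4
```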
In lattice boosting, we have more freedom in designing the boosting graphs. The lattice is a relatively small graph compared to the HCLG graph, and lattices are acyclic; both properties lead to faster runtimes of the composition operation. So, the boosting graph B can encode many word sequences that receive the score discount only if the whole word sequence is matched in the lattice during the WFST composition.
Similarly to the previous section, the composition is done as:
L’ = L o B
where L is the input lattice, B is the boosting graph from Figure 2, and L’ is the output lattice with the score discounts introduced by the composition.
Figure 2. A “toy-example” topology of a WFST graph B for boosting lattices (speech-to-text output with alternative hypotheses). The boosting is done as the composition L’ = L o B, which introduces score discounts for the word sequences that we decided to boost. These word sequences represent the contextual information.
The lattice boosting is specific to each utterance; the composition is run in batch mode for a whole test set. The toy example in Figure 2 has a “lower part” with all the words of the lexicon in parallel, which makes sure no word sequence is dropped by the composition. There is also a phi symbol #0 at the “entrance” to the lower part. The “upper part” encodes the word sequences we want to boost (e.g. callsigns); the score discounts -4 or -8 are on the word links. As we use the phi symbol #0 in the composition, the lower part is accessed only if the partial word sequence in the lattice cannot be matched with the “upper part” of the B graph (the part with the discounts).
The experiments with HCLG boosting and lattice boosting are summarized in the paper we submitted to Interspeech 2021. Here, we share the main table from the paper:
The table contains both Word Error Rate (WER) results and Call-Sign Accuracies (CSA). On liveatc_test_set2 we see a huge improvement, from 53.5 to 80.6. For malorca_vienna the absolute CSA improvement is smaller; nevertheless, the gain from 84.4 to 88.1 removes 60.7% of the gap between the baseline and the oracle CSA. We also see that lattice boosting on its own already brings good improvements, and the best results are obtained with the combination of HCLG boosting and lattice boosting.
Check out our previous blog posts:
We introduced the hardware setups in one of our previous blog posts. We used 2 different antennas and 2 different SDR receivers, and here we share our comparison of various combinations of this HW. To recall, we have the following HW:
| | Low performance/quality | Price (EUR) | Higher performance/quality (more expensive) | Price (EUR) |
|---|---|---|---|---|
| Antenna | Sirio MD 118-137 incl. 5 m cable | 40 | Watson WBA-20 | 60 |
| SDR receiver | RTL-SDR | 50 | SDRPlay RSP1A | 130 |
Our experiment was done at LKTB (Brno airport), where we are located about 14 km from the airport (see this blog post for details). See the altitude profile in the image below.
We placed both antennas for the test at approximately the same height.
One of our interests was to find out the quality of the recorded audio signals (as we want to be as close as possible to the speech observed in the cockpit / tower) and to compare the more expensive and cheaper recording setups. The comparison is made on estimated SNR values (see the previous blog post). It is worth mentioning that the RSP1A was also run in 8-bit mode (recording 10 MHz bandwidth).
We recorded for 3 days with both HW setups in parallel (the more expensive RSP1A on the Watson WBA-20, the cheaper RTL-SDR on the Sirio MD), then switched the antennas and recorded for another 3 days (RSP1A on Sirio MD, RTL-SDR on Watson WBA-20). We conclude the experiments with the following results:
To briefly compare the lower-quality (~200 EUR) and the more expensive (~440 EUR) HW setups, refer to the histograms below. The cheaper setup (RTL-SDR dongle with the Sirio antenna) provides an SNR of ~3.6 dB on average, while the expensive setup provides ~19.2 dB on average. We also show the amounts of speech and signal in the histogram; speech fills about 70% of the recorded audio signals.
The next two histograms compare the SNRs of ‘fixed’ receivers while we switch the antennas. We see that the Watson antenna provides a 6 to 10 dB higher SNR compared to the Sirio.
The next two histograms compare the SNRs with a ‘fixed’ antenna while we switch the receiver. Here we see a 4 dB SNR superiority of the RSP1A on the Sirio antenna and a 10 dB superiority on the Watson antenna.
Our main conclusion is that a good antenna is important (it increases the SNR from 3.6 dB to 9.2 dB on average). With a good antenna deployed, we can get an even larger SNR gain from a better receiver (9.2 dB to 19.2 dB).
Let’s summarize the mean SNRs in the following table:
Mean SNR [dB]:

| receiver \ antenna | Sirio MD (cheaper) | Watson (more expensive) |
|---|---|---|
| RTL-SDR dongle (cheaper) | 3.58 | 9.22 |
| SDRplay RSP1A (more expensive) | 8.78 | 19.16 |
Check out our previous blog posts:
This blog post is more technical than the previous ones. In the next paragraphs, we describe the raw signal processing pipeline. The rtl-airband software is set to produce raw data coming from the SDR hardware in the cs16 format. The produced cs16 files are processed through:

```
cat ${signalfile}.cs16 | csdr convert_s16_f | csdr amdemod_cf | csdr fastdcblock_ff | csdr gain_ff 3 | csdr limit_ff | csdr convert_f_s16 > ${signalfile}.raw
```

which does the following: convert the signed 16-bit samples to floats, AM-demodulate the complex signal, remove the DC offset, apply a gain of 3, limit the signal level to avoid clipping, and convert the floats back to signed 16-bit samples.
Next, we drop all segments shorter than 1 second, as they do not contain any meaningful signal. You may have noticed that we are not using automatic gain control (AGC). The reason is that AGC deforms the signal (rapidly changing the volume and thus the amount of noise). As we have the whole recording and can process it off-line, we implemented segment-based gain control instead.
We detect push-to-talk clicks using a wavelet transform and identify the individual utterances in the audio. We amplify each segment so that it does not exceed 95% of the maximum level of the WAV file (1.0 in our case); isolated peak levels are ignored. See the figure below; a sketch of this per-segment normalization follows the figure.
The original raw signal is on top; the amplified signal is on the bottom.
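A minimal numpy sketch of the per-segment normalization, assuming float samples in [-1, 1]; using a high percentile instead of the absolute maximum is one way to ignore the spurious peak levels (the exact method in our pipeline may differ).

```
import numpy as np

def normalize_segment(x: np.ndarray, target=0.95, pctl=99.9):
    """Amplify one push-to-talk segment so it peaks near `target`.
    A high percentile serves as a robust peak that ignores rare spikes."""
    peak = np.percentile(np.abs(x), pctl)
    if peak <= 0:
        return x
    return np.clip(x * (target / peak), -1.0, 1.0)

# Example: a quiet segment with one click still gets amplified sensibly.
seg = 0.05 * np.sin(np.linspace(0, 100, 16000))
seg[8000] = 0.9                                  # a push-to-talk click
print(np.abs(normalize_segment(seg)).max())
```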
We detect the speech parts of the audio, to be used further for reliably estimating the Signal-to-Noise Ratio. The Voice Activity Detector (VAD) is based on a neural network with 2 hidden layers and 2 output classes, trained on a 1366-hour multilingual telephone speech corpus. The neural network output is smoothed by averaging over a 5-frame window, and we can adjust the detection threshold to control the amount of detected speech. See the figure below with the detected speech indicated in red; a sketch of the smoothing step follows.
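A minimal sketch of the smoothing-and-threshold step; the neural network itself is omitted, and `speech_post` stands in for its per-frame speech posterior output.

```
import numpy as np

def smooth_and_threshold(speech_post: np.ndarray, win=5, thr=0.5):
    """Average posteriors over a 5-frame window, then mark speech frames.
    Raising `thr` detects less speech; lowering it detects more."""
    kernel = np.ones(win) / win
    smoothed = np.convolve(speech_post, kernel, mode="same")
    return smoothed > thr

post = np.array([0.1, 0.2, 0.9, 0.95, 0.9, 0.85, 0.2, 0.1])
print(smooth_and_threshold(post))
```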
The SNR estimation technique is based on waveform amplitude distribution analysis (Chanwoo Kim, Richard M. Stern, "Robust Signal-to-Noise Ratio Estimation Based on Waveform Amplitude Distribution Analysis", Interspeech 2008). In principle, the amplitude distribution of noise is Gaussian while the amplitude distribution of speech is Gamma, so we can estimate the SNR by estimating where our signal lies between the Gaussian and the Gamma distributions.
To estimate the SNR reliably, we select only the speech segments and avoid all the non-speech parts. We then apply the SNR estimation technique, which provides an SNR estimate per voiced segment.
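The pipeline uses the WADA estimator from the cited paper; as a simplified stand-in, the sketch below estimates SNR from the energy ratio between VAD-marked speech frames and non-speech frames. It illustrates the role of the speech mask, not the WADA algorithm itself.

```
import numpy as np

def energy_snr_db(frames: np.ndarray, speech_mask: np.ndarray) -> float:
    """frames: (n_frames, frame_len) float samples; speech_mask: bool per frame.
    Speech frames contain speech plus noise, so noise power is subtracted."""
    p_speech = np.mean(frames[speech_mask] ** 2)
    p_noise = np.mean(frames[~speech_mask] ** 2)
    return 10.0 * np.log10(max(p_speech - p_noise, 1e-12) / max(p_noise, 1e-12))
```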
Check out our previous blog posts:
Welcome to the next blog post in our series “How to set up an ATC voice recorder”. This one focuses on software installation and SDR settings. We assume you have chosen Linux as the OS.
Please follow the instructions at https://atco.opensky-network.org/. You should end up with a Linux distribution with the SDR drivers installed and the RTLSDR-Airband software running (https://github.com/szpajder/RTLSDR-Airband).
You need to do several steps to set up the SDR. First, identify the VHF frequencies you want to record and decide on your central frequency and bandwidth. If some frequencies are too far apart, you may use two SDR devices (we are also using this setup). Let’s see two examples.
We checked the available on-line resources and found the main frequencies used at LKTB. We are not interested in ATIS. You can notice that the distance between LKTB_TWR and LKTB_APP is 7.75 MHz, which is much larger than the 2.5 MHz supported by the RTL-SDR but smaller than the 10.6 MHz supported by the SDRplay RSP1A (see the previous blog post for more technical information). So, to fully cover LKTB, we need either a pair of RTL-SDRs or one RSP1A; we chose the second option. See our configuration in the following figure:
The green boxes indicate the 25 kHz bandwidth of one channel. We placed the central frequency in the middle. The “bandwidth” of the SDR (its sampling frequency) was chosen wider than needed to overcome possible distortions at the edges. A small helper that mirrors this planning is sketched below.
A sample rtl-airband config for an SDRplay RSP1A device would look like the following:
```
country = "Czech Republic";
location = "49.25411,16.58154";
fft_size = 1024;
devices:
({
    type = "soapysdr"; # driver
    device_string = "driver=sdrplay,serial=xxxxxxxxxxxxxx";
    gain = "IFGR=20,RFGR=2"; # every type of device has different gain settings
    centerfreq = 123.500; # MHz
    correction = 0;
    mode = "multichannel";
    sample_rate = 9.00; # bandwidth in MHz around centerfreq
    channels:
    ({
        freq = 119.600;
        airport = "LKTB";
        label = "BRNO_Tower";
        outputs:
        ({
            type = "rawfile";
            directory = "/home/pi/output_airband";
            filename_template = "BRNO_Tower_119_600MHz";
            split_on_transmission = true;
        });
    });
});
```
The LKPR airport has more channels. One of the radars and the Tower are the problematic ones, as they are far away from the rest: we would need about 16 MHz of bandwidth to cover them all. We analyzed the traffic on the channels and found out that the radar on 127 MHz is a “copy” of the radar on 120 MHz, so we discarded it. Finally, our solution was to use an SDRplay RSP1A and an RTL-SDR (on two separate antennas): the RSP1A covered the group of channels around 123 MHz, and the RTL-SDR took care of the Tower on 134 MHz. See the following figure:
We set the bandwidth of the RSP1A to 5 MHz, which gave us 14-bit sampling precision (better audio quality). We limited the bandwidth of the RTL-SDR and set its center frequency to the frequency of the Tower (134.55 MHz).
However, we found a problem with recording the Tower (note: our setup is very close to the airport, so we have a strong signal): strong harmonic distortion in the audio signal. See the following spectrogram:
Notice the spectral line around 1.6 kHz. The RSP1A did not suffer from this problem. The problem is called ghosting (thanks to https://www.sdrplay.com/community/viewtopic.php?t=2968): a strong source near you may leak into your recording, even if it transmits on a different frequency. We tried changing the bandwidth and the gain, but it did not help; the solution was to change the central frequency.
A sample rtl-airband config for an RTL-SDR device:
```
country = "Czech Republic";
location = "50.10678,14.26600";
fft_size = 512;
devices:
({
    type = "rtlsdr"; # driver
    index = 0;
    gain = 15; # every type of device has different gain settings
    serial = "00000001";
    centerfreq = 134.750; # MHz
    correction = 0;
    mode = "multichannel";
    sample_rate = 900100; # bandwidth in Hz around centerfreq
    channels:
    ({
        freq = 134.550;
        airport = "LKPR";
        label = "PRAGUE_Tower";
        outputs:
        ({
            type = "rawfile";
            directory = "/home/pi/output_airband";
            filename_template = "PRAGUE_Tower_134_550MHz";
            split_on_transmission = true;
        });
    });
});
```
There are 2 more parameters that have an impact on the audio quality: the first one is gain and the second one is fft_size.
FFT size is an internal parameter that impacts the signal processing. The larger the value (a power of 2), the slightly better the signal, but the more CPU power is needed. A good practice is that for a wider bandwidth the FFT size should be larger. Tune this parameter (128 / 256 / 512 / 1024) and watch the load and the signal quality; if you set it too high, the signal starts to be choppy.
Setting up the gain(s) correctly is critical. There may be several gain controllers on your device: the RTL-SDR has 1 gain control, the SDRplay RSP1A has 2. Please consult the documentation, support, or community for your device to find block diagrams, the gain controllers, and proper settings. In general, you should set the gain as low as possible. Ideally, you should tweak only the analog gain closest to the antenna; the rest can be switched off. If you set the gain too low, you will receive noisy audio signals: there is not enough energy, and your signal will be coded in only a few lower bits of the ADC. On the other hand, if you set the gain too high, clipping appears at the ADC and you get “noisy” recordings.
We tuned the gains carefully and did some more experiments, which we will share with you in one of our next blog posts. To make a long story short:
| IFGN \ RFGN | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 20 | 6.75 | 10.33 | 12.29 | 11.76 | 9.70 |
| 25 | 10.39 | 11.385 | 11.55 | 11.83 | 8.97 |
| 30 | 11.20 | 11.36 | 11.39 | 10.71 | 5.47 |
Table of gain tuning of the RSP1A connected to the Watson WBA-20 antenna at LKTB. Values are SNR [dB].
The RFGN (columns) is the main gain on the SDRplay RSP1A; the higher the value (0-9), the smaller the gain. The IFGN is a “minor” gain controller that does not have much influence if the RFGN is tuned properly. You can see that there is an optimal point at RFGN = 2 with the IFGN “switched off”.
We encourage you to do something similar. You do not need to calculate the SNR: just collect a sufficient amount of audio and listen to it. You can try this on ATIS or Tower channels, where you should have stable signals; then try different gains and find the optimum.
This is all about setting up the SDR software. We hope it will help you set up the recording easily and with good results. We are aware that many things were simplified here; going deeper into the principles is out of the scope of these blog posts. If you are interested, please study further resources.
Welcome to our ATCO2 project site. This is the first blog post in a short series on “What SDR to buy, where to place it, how to set it up, and how to connect to the OpenSky-Network platform”. We hope these posts help you receive clean audio signals from ATC VHF communication and feed the community. As most of us were noobs in SDR, we had to learn a lot, and now we are sharing what we learned to make your life easier. If you are an expert in this area, skip this post. If you think there is something missing here, share your thoughts!
You have decided to buy, set up, and use an ATC receiver. Congratulations on your decision! Now let’s see what exactly you need to do. You must decide where to place it, what HW to buy, and how to set it up. The WHAT is discussed in this and the next blog post, while the WHERE and HOW are discussed in the following posts.
You should select a place with visibility as clear as possible to the airport tower (or to wherever the transmitter antennas are). Use an on-line map and make an elevation profile between your position and the airport; there should not be any hills in between. It is even better if your position is close to an approach route or holding pattern: you will have a clean signal from the planes above you.
Now the general WHAT answers come. What you buy depends very much on your budget. We have tried two variants:
There are four components you need to take into account:
You want a 50-ohm antenna with the highest “gain” (well, an antenna is a passive component, so technically there is no gain; you want to minimize signal loss). There are many types of antennas, so select one you can mount easily where you want to. We have tried several types (J-pole, discone, dipole). One important parameter of an antenna is its frequency range (or tuned frequency); here, you are interested only in the Rx (receiving) range. The range should cover the airband (ATC frequencies), which spans 108 MHz to 137 MHz (usually around 122 MHz). It is good to have a narrow-band antenna tuned just for these frequencies: a narrow-band antenna may lower the noise coming from other strong sources around you (AM/FM radio stations, TV stations, GSM, ...).
You need a 50-ohm coaxial cable to connect your antenna to the SDR device. Every cable has a signal loss, so you want a cable as short as possible (but keep some reserve). We used the low-loss (LLC) category; the lower the loss, the higher the price. It is also good to check the technical specifications and find the loss (in dB per m) for the given frequency range (you are not interested in the loss at 2 GHz, just around 120 MHz). The last important property of a coax cable is its connectors. Every connector introduces signal loss, so find a cable with the right connectors for your antenna (e.g. an N-type female connector on the cable side) and your SDR device (usually an SMA male connector on the cable side); adding adapters increases the signal loss. Warning: a coaxial cable cannot be bent in a small radius (several centimeters / an inch), as the bend may introduce high losses; check the smallest allowed radius in the cable's technical specifications. Note: if you want to use more receivers on one antenna, you will need to buy an active splitter. We have tried it and it works, but we will not go into details here.
You need an SDR (Software Defined Radio) receiver; there are other types of receivers, but we ignore them here for the sake of simplicity. SDR means that the receiver just digitizes the analog signal from the antenna, and the voice decoding is done by software on a computer. The SDR is the most expensive item on your bill: the more expensive, the better the quality (meaning it copes better with low-quality signals). An SDR usually has some analog circuits (gain controllers, filters, etc.), an analog-to-digital converter (ADC), and communication chips to talk to the computer (handling the USB port, for example). One of the most important parameters is the dynamic range of the SDR, which is defined by the ADC. The problem is that you will face both strong and weak signals: if the dynamic range is small, the strong signals may lead to clipping (signal distortion) while the weak ones are sunk in noise. The quality of the analog part is also essential to overcome noise coming from your computer, power supplies, and other electronic devices at home. The minimum is an 8-bit SDR, but if you can afford a 12-, 14-, or 16-bit SDR, it is better. (Some more reading about SDR sensitivity: SDR Receiver Performance Overview.)
Here you want something small with low power consumption, but powerful enough to decode all the channels you want to listen to and share with the community. The computer should also have an internet connection (WiFi, Ethernet, etc.). You can use an old notebook, your desktop, or some sort of Raspberry Pi. Just take into account that the computer should always be on (if you want to be our data feeder). You connect the SDR to the computer (by USB in most cases) and the computer to the Internet. We provide a description of how to install and configure all the needed software. Several programs run on the computer. First, there is a radio demodulator: this program takes raw data (the digitized signal) from the SDR and extracts the voice (amplitude modulation is used in VHF ATC). It listens to the selected frequencies (yes, you can tune in and listen to voice communications in parallel), detects communication (when the pilot pushes the button and starts to talk), passes the data through the demodulator, and stores the demodulated audio internally. Another program immediately post-processes these files and sends them to our servers. You can then log in to the OpenSky Network web and listen to your recordings.
That is all the compressed basic information about what you need to set up your own data feeder and start listening to ATC communication. We will go deeper in the next post, where we will share which devices we tried and what results we got.
Check out our previous blog posts:
Let’s take a look at where to mount your antenna and how to check whether your place is good. The first step is to find the elevation (or altitude) profile between your place and the airport. We expect the transmitting antennas to be on the airport tower or nearby; it is good to check where exactly the airport's antennas are placed. There should be no hills or other obstacles in between; the best option is direct visibility. Any obstacle will block the signal coming from the tower: you will hear the pilots well, but not the ground. We share two cases: the LKTB and LKPR airports.
The LKTB is a small “international” airport.
We were lucky that one of our project partners lives in a house with a good enough position; the elevation profile does not look too bad.
The distance is about 14 kilometers, and the hill in between is not very high, but there is no direct visibility. On the other hand, he lives “under” one of the approach routes (November - Bravo). We mounted the Watson antenna on the roof (and also experimented with the Sirio antenna; see this post for details).
To make a long story short: it works. We got good results with the Watson antenna and the RSP1A. However, the combination of the Sirio antenna and the RTL-SDR was not as successful; the amount of good-quality speech was significantly lower compared to Watson + RSP1A. So if you are in similar conditions, think about a more expensive (and better) receiver and antenna.
LKPR is the largest airport in the Czech Republic.
Here we were luckier: our data feeder lives in an ideal position.
The signal there is strong, so all combinations of the devices worked well. The only problem is that part of one runway is below the horizon, so we receive low-quality signals from airplanes in that position.
Notes: if you install the antenna on your roof, be sure to place it as high as possible above a metal roof. Also be sure it is well grounded; the last thing you want is a lightning bolt hitting your antenna. If there is a thunderstorm near you, you should unplug the coaxial cable from your SDR receiver (and ideally throw the end outside): a close lightning strike can induce a high voltage, and the free cable connector may be dangerous. Please search the internet for a proper solution. Here is a nice solution for ADS-B antennas.
Check out our previous blog posts:
In this blog post, we will take a closer look at the hardware (HW) setups for ATC recording from the VHF channel. We previously gave a general overview of the four most important components: antenna, coaxial cable, SDR receiver, and computer (the computing resource). We built and tested two HW setups: the first one costs about 200 EUR as an “entry solution”, the second one about 400 EUR as a better one. The table below describes both configurations:
| Item | Entry solution (more affordable) | Price (EUR) | More expensive | Price (EUR) |
|---|---|---|---|---|
| Antenna | Sirio MD 118-137 incl. 5 m cable | 40 | Watson WBA-20 | 60 |
| Coax cable | - | 0 | LLC200A 20 m | 77 |
| SDR receiver | RTL-SDR | 50 | SDRPlay RSP1A | 130 |
| Raspberry Pi | RPi 3 - 1 GB | 40 | RPi 4 - 8 GB | 92 |
| RPi case | Metal case + active cooling | 24 | Argon One | 28 |
| micro SD | 256 GB | 38 | 256 GB | 38 |
| Power source | USB 5V 2.5A | 10 | USB-C 5V 3A | 10 |
| SUM | | 202 | | 435 |
Let us discuss the items now.
The antenna is crucial, as it has a direct impact on the SNR of the radio communication. We decided to purchase two dipole antennas tuned to the aviation frequencies (118 MHz-137 MHz): the Sirio MD 118-137 and the Watson WBA-20.
The Sirio comes with a 5 m coaxial cable. We also purchased a 20-meter LLC200A coaxial cable for the Watson antenna to allow easy mounting of the antenna on a roof with minimal signal loss. The Sirio, on the other hand, is good for mounting on a balcony, for example.
We did a set of experiments (see the details in one of our next blog posts) to estimate the impact of the different antennas on the ATC voice quality. The voice quality was measured by SNR: signal-to-noise ratio, or rather speech-to-noise ratio. Our conclusion was that we were able to get a 6 dB to 10 dB better SNR with the Watson antenna.
We also tested a wideband double-discone antenna and a narrow-band J-pole antenna. This was done at other places close to the airport, so we do not have a direct comparison of all four antennas. The custom-built J-pole style antenna, tuned to 135 MHz, was connected to 5 meters of RG-58 type coaxial cable.
The double discone is a wideband antenna tuned to receive 25-2000 MHz. It was connected to an active two-way splitter using 2 meters of CFD240 type coaxial cable (rated up to 5 GHz).
Both antennas worked well, but we cannot make any deeper comparison; they belong to one of our data feeders, who allowed us to use them.
We followed the general recommendation and purchased the suggested “standard” for airband: the RTL-SDR dongle.
We also aimed to test a technically better solution, still on a reasonable budget. After a quick survey, we decided to go for the SDRplay RSP1A.
Both receivers have an SMA female coaxial connector and USB; the difference is in the internal circuits (gain controllers, filters, ADC, etc.).
The main advantage of the SDRplay RSP1A over the RTL-SDR is that it offers up to 14 bits (versus 8 bits for the RTL) and up to 10.6 MHz recording bandwidth (versus 2.5 MHz for the RTL). Note: the 14-bit precision is available up to 6 MHz bandwidth, 12-bit up to 8 MHz, 10-bit up to 9.2 MHz, and 8-bit above 9.2 MHz. Both of these parameters are critical, because we aim to collect all available frequencies used by a given airport (Tower, Approach, Radar, Ground, Departure, ...) to monitor the whole flight communication. It often happens that the frequencies are spread over a window larger than 2.5 MHz (the RTL dongle); sometimes even the 10 MHz bandwidth is not enough, and two receivers are required. Next, the 14-bit depth may help to get a better SNR (signal-to-noise ratio), but it depends on the bandwidth used. Please see this post for a deeper channels-vs-bandwidth analysis and suggestions.
We decided to use a Raspberry Pi mini computer to run the SDR software and the processing pipeline. Both Raspberry Pi models are small and powerful enough. We bought an RPi 3B+ (with 1 GB of RAM) as the ‘entry solution’ and the more powerful RPi 4 with 8 GB of RAM. We used a 256 GB microSD card for the system and data storage. To avoid overheating, we also used active cooling (heatsink and fan). There are several tens of combinations of RPi models and cases; you probably want a good passive heatsink to minimize the noise coming from a fan. The RPi 4 in the Argon One case (https://www.amazon.com/Argon-Raspberry-Aluminum-Heatsink-Supports/dp/B07WP8WC3V) was an excellent solution: the case is easy to mount and has sufficient passive cooling for processing 4 channels in parallel. The RPi 4 is able to process these 4 channels in parallel at 90% load on 1 core. We noticed that the RPi 3B+ cannot handle the 4 channels coming from the RSP1A, so if you want to receive just 1 or 2 channels, the RPi 3B+ and the RTL may be good enough; otherwise, we suggest you go with the RPi 4. One of our next posts will discuss the settings of the processing pipeline on the RPi and what can be tweaked.
We hope you got a better insight into the HW needed for receiving, processing, and feeding the ATC communication. There are also other possibilities, so please do not hesitate to search for them. What we put here is our experience and what worked for us on a reasonable budget.
In the last blog post, we introduced a way to improve the word error rate (WER) on callsigns in the automatic speech recognition (ASR) output by incorporating surveillance information into the transcription process. In this blog post, we want to talk about extracting the callsigns from the ASR output. The process of callsign recognition can be broken down into two stages:
1) Tagging the callsign in the sequence
2) Mapping of the callsign word sequence into its ICAO format (ICAO stands for International Civil Aviation Organization)
Figure 1 illustrates the two-stage process. In the tagging step, the input transcript originating from our ASR system is tagged in the IOB format (short for inside, outside, beginning) to find the tokens that are part of a callsign. In the second step, the part of the ASR transcript tagged as a callsign (labeled with B/I-CALL) is mapped to the standard ICAO format for callsigns, which consists of a 3-character airline identifier followed by the flight ID: several digits optionally followed by 1-2 characters. (In case of interest, a list of airline identifiers can be found here: https://en.wikipedia.org/wiki/List_of_airline_codes.) A regex sketch of this format follows.
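A small sketch of that ICAO shape as a regular expression; the exact digit count varies in practice, so the 1-4 range below is an assumption.

```
import re

# 3-letter airline designator, digits, up to two optional trailing letters.
ICAO_CALLSIGN = re.compile(r"^[A-Z]{3}\d{1,4}[A-Z]{0,2}$")

for cs in ["CSA123AB", "SWR115Z", "DLH9X", "12AB34"]:
    print(cs, bool(ICAO_CALLSIGN.match(cs)))  # the last one does not match
```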
Since we have two processes, the obvious idea is to train two different networks for the task: one that specializes in tagging and one that takes care of mapping the sequence tagged as a callsign into the ICAO format. In this case, both processes can be tuned individually. The drawback of this architecture is that information lost in the first step cannot be recovered in the second step. The other possibility is to train an end-to-end network that outputs the ICAO callsign directly, given the ASR transcripts as input; this architecture has the benefit that there is no information loss in between. Both architectures are visualized in Figure 2. Our experiments showed that the end-to-end approach performs better than the two-network solution in the majority of test cases.
A closer look at Figure 1 reveals that the predicted ICAO callsign contains information that is missing in the labels and in the transcript, namely the last two digits of the flight ID. This information comes from the surveillance data: callsigns from planes near the location where the ATC communication is recorded are time-matched with the recordings and fed as additional input into the network, as seen in Figure 3. In case the transcript contains only partial information about a callsign, the missing information can be recovered from the surveillance input. The end-to-end network shows a callsign accuracy over 90% on clean transcripts if surveillance information is available; on our ASR output with a WER of 28.7, an accuracy over 80% is reached. The network also shows increased resistance to higher ASR WERs. The accuracy scores for two different datasets can be found in our Interspeech paper submission: “Boosting of contextual information in ASR for air-traffic call-sign recognition”.