ATC recording using SDR - deeper analysis - raw signal processing and SNR estimation

This blog post is more technical than the previous ones: it describes our raw signal processing pipeline. The rtl-airband software is configured to output the raw data coming from the SDR hardware in cs16 format (interleaved 16-bit signed I/Q samples).

Check out our previous blog posts:

Blog 1: Basic terminology and hardware setup description for ATC listening
Blog 2: Where to place your antenna for ATC recordings
Blog 3: What is the best SDR hardware choice for ATC
Blog 4: How to setup your SDR for clean ATC audio

Converting the raw signal into the audio format

The resulting cs16 files are processed with the following csdr pipeline:

cat ${signalfile}.cs16 | csdr convert_s16_f | csdr amdemod_cf | csdr fastdcblock_ff | csdr gain_ff 3 | csdr limit_ff | csdr convert_f_s16 > ${signalfile}.raw

which performs the following steps:

conversion from 16-bit integer to float values
AM demodulation
DC removal
signal level adjustment (gain of 3, then limiting)
conversion back to 16-bit integers
saving as raw PCM
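For readers who want to see what each stage does to the samples, here is a minimal Python sketch of the same chain using only the standard library. It is not bit-exact with csdr: the function name `demodulate_cs16`, the `dc_alpha` parameter, the scaling constants and the simple one-pole DC blocker (a stand-in for `fastdcblock_ff`) are our assumptions for illustration, and little-endian sample order is assumed.

```python
import math
import struct


def demodulate_cs16(raw_bytes, gain=3.0, dc_alpha=0.995):
    """Approximate the csdr chain: s16 -> float, AM demod, DC block, gain, limit, s16."""
    usable = len(raw_bytes) // 4 * 4          # one complex sample = 2 x int16 = 4 bytes
    out = []
    prev_in = prev_out = 0.0
    for i_raw, q_raw in struct.iter_unpack("<hh", raw_bytes[:usable]):
        # int -> float conversion (scaling constant is an assumption)
        i, q = i_raw / 32768.0, q_raw / 32768.0
        # AM demodulation: the envelope is the magnitude of the complex sample
        envelope = math.hypot(i, q)
        # One-pole DC blocker (illustrative stand-in for fastdcblock_ff)
        dc_free = envelope - prev_in + dc_alpha * prev_out
        prev_in, prev_out = envelope, dc_free
        # Fixed gain followed by hard limiting to [-1.0, 1.0]
        limited = max(-1.0, min(1.0, gain * dc_free))
        # float -> int conversion
        out.append(int(round(limited * 32767)))
    return struct.pack("<%dh" % len(out), *out)
```

A pure-Python loop like this is far too slow for long recordings; it is only meant to make the individual stages visible.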

Next, we drop all segments shorter than 1 second, as they do not contain any meaningful signal. You may have noticed that we are not using automatic gain control (AGC). The reason is that AGC distorts the signal: the volume changes rapidly, and the amount of noise changes with it. Since we have the whole recording and can process it offline, we implemented a segment-based gainer instead.

Segment-based gainer

We detect push-to-talk clicks using a wavelet transform and use them to identify the individual utterances in the audio. We then amplify each segment so that it does not exceed 95% of the maximum level of the WAV file (1.0 in our case). The peak levels are ignored. See the figure below:

01_pasted image 0.png

The original raw signal is at the top, the amplified signal is at the bottom.
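The per-segment amplification can be sketched as follows. This is a simplified illustration, not our production code: it assumes the segment boundaries are already known (the wavelet-based click detection is not shown), it scales each segment so that its largest absolute sample reaches the 0.95 target, and it does not model how peak levels are ignored. The function name `apply_segment_gain` and its parameters are hypothetical.

```python
def apply_segment_gain(samples, segments, target=0.95):
    """Scale each (start, end) segment so its largest |sample| equals `target`.

    `samples` are floats in [-1.0, 1.0]; `end` is exclusive.
    Samples outside all segments are left unchanged.
    """
    out = list(samples)
    for start, end in segments:
        peak = max((abs(s) for s in samples[start:end]), default=0.0)
        if peak == 0.0:
            continue  # silent or empty segment: nothing to scale
        gain = target / peak
        for n in range(start, min(end, len(samples))):
            out[n] = samples[n] * gain
    return out
```

Because the gain is constant within a segment, the speech-to-noise relationship inside each utterance is preserved, which is the property we lose with a fast-acting AGC.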

Voice activity detection

We detect the speech parts of the audio so that they can be used to estimate the Signal-to-Noise Ratio (SNR) reliably. The Voice Activity Detector (VAD) is based on a neural network with 2 hidden layers and 2 output classes. It was trained on 1366 hours of a multilingual telephone speech corpus. The neural network output is smoothed by averaging over a 5-frame window, and we can adjust the detection threshold to control the amount of detected speech. See the figure below, where the detected speech is marked in red.

02_pasted image 0.png
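The smoothing and thresholding step can be illustrated with a short sketch. The neural network itself is not shown; the input is assumed to be one speech probability per frame. The centered window, the shortened window at the edges and the default threshold of 0.5 are assumptions made for the example, and `smooth_and_threshold` is a hypothetical name.

```python
def smooth_and_threshold(speech_probs, window=5, threshold=0.5):
    """Average per-frame speech probabilities over `window` frames, then threshold.

    Returns one boolean per frame (True = speech). Near the edges the
    window is shortened to the frames that actually exist.
    """
    half = window // 2
    n = len(speech_probs)
    decisions = []
    for t in range(n):
        lo = max(0, t - half)
        hi = min(n, t + half + 1)
        avg = sum(speech_probs[lo:hi]) / (hi - lo)
        decisions.append(avg >= threshold)
    return decisions
```

With this kind of smoothing, a single high-probability frame surrounded by silence is averaged away, while lowering `threshold` marks more frames as speech.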

Signal-to-Noise Ratio estimation

The SNR estimation technique is based on waveform amplitude distribution analysis (Chanwoo Kim and Richard M. Stern, "Robust Signal-to-Noise Ratio Estimation Based on Waveform Amplitude Distribution Analysis", Interspeech 2008). The method assumes that the amplitude distribution of noise is Gaussian, while the amplitude distribution of speech follows a Gamma distribution. We can “guess the SNR by estimating where we are between Gaussian and Gamma distributions” for our signal.

To estimate the SNR reliably, we use only the speech segments and skip all non-speech parts. The technique then yields one SNR estimate for each detected speech segment.
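To give a feeling for "where we are between Gaussian and Gamma", the sketch below computes a single shape statistic, ln(E|x|) - E[ln|x|], which to our understanding is the quantity the paper maps to an SNR value through a precomputed table. The table lookup is not reproduced here, so this is not a complete SNR estimator. The function name `amplitude_shape_statistic` is hypothetical, and the Gamma shape parameter of 0.4 used in the demo is an assumption.

```python
import math
import random


def amplitude_shape_statistic(samples):
    """Return ln(mean |x|) - mean(ln |x|) over the non-zero samples.

    The value depends only on the shape of the amplitude distribution,
    not on the overall signal level.
    """
    mags = [abs(s) for s in samples if s != 0]
    mean_mag = sum(mags) / len(mags)
    mean_log = sum(math.log(m) for m in mags) / len(mags)
    return math.log(mean_mag) - mean_log


if __name__ == "__main__":
    random.seed(0)
    # Synthetic "noise": Gaussian amplitudes
    noise = [random.gauss(0.0, 1.0) for _ in range(20000)]
    # Synthetic "clean speech": Gamma-distributed magnitudes (shape 0.4 assumed), random sign
    speech_like = [random.choice((-1, 1)) * random.gammavariate(0.4, 1.0)
                   for _ in range(20000)]
    print("noise:      ", amplitude_shape_statistic(noise))
    print("speech-like:", amplitude_shape_statistic(speech_like))
```

In theory the statistic is about 0.41 for purely Gaussian samples and about 1.65 for Gamma-distributed magnitudes with shape 0.4, so a noisy speech segment falls somewhere in between: the closer to the Gaussian end, the lower the SNR.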