
Advances in Sound Quality for Online Musical Collaborations: An Engineer's Perspective

Posted on: March 15, 2021

By Christopher Bennett, Professor in the Music Engineering Technology program at the University of Miami, Frost School of Music and author of Digital Audio Theory.

Introduction

Where I work, the Frost School of Music at the University of Miami, there is a high expectation of audio quality among the faculty and studio directors, many of whom are GRAMMY and Emmy Award winners. This holds as true for in-person rehearsals, lessons, and performances as it does for virtual ones. And while the dramatic switchover to virtual formats has presented many hurdles, musicians worldwide have found ways to adapt and continue to create. Quickly taking notice of this, virtual meeting platforms, such as Microsoft Teams and Zoom, have enhanced their audio capabilities with musicians in mind.

These platforms employ advanced signal processing algorithms that improve conversational speech quality. For example, "noise reduction" works much like an expander or noise gate, suppressing low-level background noise. "Echo cancellation" algorithms improve voice quality by detecting the signal being sent to the far end and suppressing it in the far-end microphone signal, so that it is not transmitted back. Without such algorithms, unwanted echoes, feedback, and noise would degrade conversational speech. Last but not least, to cope with poor network conditions, bit-rates for speech are often reduced; speech remains highly intelligible even at these comparatively low bit-rates. But while these algorithms make speech sound clearer for conversations, they can degrade the sonic quality of musical performances. This is why we were all pleased to see Zoom introduce new meeting options that include the ability to transmit in a high-fidelity music mode.
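As an illustration, the gating behavior described above can be sketched in a few lines of MATLAB. This is a simplified frame-based gate, not Zoom's actual algorithm; the threshold, attenuation, and filename are arbitrary choices for the example.

% minimal noise-gate sketch: attenuate frames whose short-term RMS
% falls below a threshold (all parameter values are illustrative only)
fs = 48000;
thresh_dB = -50;               % gate threshold, dB re: full-scale
atten = 0.1;                   % gain applied to below-threshold frames
frame = round(0.01*fs);        % 10 ms analysis frames

x = audioread('speech.wav');   % hypothetical input file
y = x;
for n = 1:frame:length(x)-frame+1
    idx = n:n+frame-1;
    if db(rms(x(idx))) < thresh_dB
        y(idx) = atten * x(idx);
    end
end

A production gate would also smooth the gain with attack and release time constants to avoid audible pumping; this sketch switches the gain abruptly per frame.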

In this new mode, these speech-oriented features can be adjusted (namely, echo cancellation is disabled, noise suppression is turned down, and a higher-quality audio compression codec is used). But the question remains: what is the audio quality? (And, importantly, how am I defining audio quality?)

But before diving into the details of the quality measures, let’s look at the overall setup. My goal was to capture only the effects of the Zoom processing, and not to introduce any microphone or other acoustic effects. For this reason, I used the “Share Computer Sound” option to send audio directly from MATLAB™ (The MathWorks). From there, the test files were compressed and transmitted to another computer that was recording audio directly from the audio device using Audio Hijack® (Rogue Amoeba). The captured audio was saved in *.aiff file format at a bit depth of 20 bits and a sampling rate of 48 kHz.

Figure 1 caption: A standard battery of audio quality testing could include: impulse response, frequency response, and noise floor.

Noise Floor

The noise floor can be thought of as the sound level that can be measured when no sound source is present. For this test, a 10-second stereo audio file containing silence was generated by the transmitting computer. The noise floor was determined by calculating the root-mean-square (RMS) value of the audio captured by the receiving computer, reported in dB (re: full-scale). We measured a noise floor of –61 dBFS.
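In MATLAB, this measurement amounts to a couple of lines (the filename here is hypothetical):

% receiving computer: noise floor of the captured "silence" recording
noise = audioread('hijack_silence.wav');   % hypothetical filename
noise_floor_dBFS = db(rms(noise(:)));      % RMS level in dB re: full-scale

Note that db() uses the 20*log10 (amplitude) convention by default, which is what we want for a full-scale-referenced level.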

Impulse Response (IR)

 


Figure 2 caption: Electrical clicks transmitted over the internet to the receiving computer were synchronously averaged to generate a clean impulse response

Click IR code:

% transmitting computer
fs = 48000;                                 % sampling rate (Hz)
num_reps = 20;                              % number of clicks to average
clk = [zeros(fs/10,1); 1; zeros(fs/10,1)];  % unit impulse padded with silence
sound(repmat(clk, num_reps, 1), fs);

% receiving computer
data = audioread('hijack_click_train.wav');
len = length(clk);
ir_clk = zeros(len,1);
for ii = 1:num_reps
    start_idx = (ii-1)*len + 1;
    end_idx = start_idx + len - 1;
    L = data(start_idx:end_idx, 1) / num_reps;
    R = data(start_idx:end_idx, 2) / num_reps;
    ir_clk = ir_clk + (L+R)/2;              % synchronous average, both channels
end

Exponential sine sweep

The same transmission and averaging technique was also used for the exponential swept-sine (ESS). But in this case, the stimulus was a sinusoid of continuously varying frequency, ranging from 20 Hz to 20 kHz, with a duration of 2 seconds and an inter-stimulus silence of 100 ms. Four ESS responses were averaged together, then deconvolved with the stimulus to produce an IR.

Figure 3 caption: An exponential sine sweep allows for impulse response capture with a steady-state stimulus

Exponential sine sweep code:

% transmitting computer
fa = 20;       % start frequency (Hz)
fb = 20000;    % stop frequency (Hz)
T = 2;         % sweep duration (s)
sil = 0.1;     % silence after each sweep (s)
fs = 48000;    % sampling rate (Hz)

ESS = sweeptone(T, sil, fs, 'SweepFrequencyRange', [fa fb]);
sound([ESS; ESS; ESS; ESS], fs);

% receiving computer
data = audioread('hijack_ESS.wav');

len = length(ESS);
ir1 = impzest(ESS, data(1:len, :));
ir2 = impzest(ESS, data(len+1:len*2, :));
ir3 = impzest(ESS, data(len*2+1:len*3, :));
ir4 = impzest(ESS, data(len*3+1:len*4, :));

ir_ess = mean([ir1 ir2 ir3 ir4], 2);   % average the four IR estimates

Measured Impulse Response

Not surprisingly, the click-based method (blue in the figure below) and the ESS-based method (red) produced nearly identical IRs. The only notable feature, other than being in phase with the stimulus, is the ringing of a single frequency beginning at 1 ms. Counting peaks, this ringing is around 8 kHz. Interestingly, this frequency does not appear prominently in the frequency response graphs below. A sound card issue cannot be conclusively ruled out, although two different audio interfaces on the transmitting computer were tested with comparable results. Furthermore, this 8 kHz ringing was not observed when transmitting silence through the soundcard, which we would expect to see if it were a soundcard artifact.
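To double-check the by-eye peak-counting estimate, the dominant frequency of the ringing can be read off the spectrum of the IR tail. A quick sketch, reusing the ir_clk and fs variables from the click code above:

% estimate the dominant frequency of the ringing after 1 ms
seg = ir_clk(round(0.001*fs):end);   % IR tail, starting at 1 ms
spec = abs(fft(seg, fs));            % zero-pad to fs points: 1 Hz bins
[~, k] = max(spec(1:fs/2));          % search positive frequencies only
ring_freq = k - 1;                   % dominant frequency in Hz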

Figure 4 caption: The impulse response, as measured from a computer sending an electrical click over Zoom (with enhanced music features enabled) to a receiving computer


Frequency Response

The frequency response was determined by applying the discrete Fourier transform to each of the IRs (click again in blue and ESS in red). The magnitude spectrum is essentially flat up to 10 kHz, then rolls off steeply to the noise floor in the octave between 10 and 20 kHz, with a –3 dB cut-off frequency of 13 kHz.

Figure 5 caption: The bandwidth with enhanced music features is up to 13 kHz


Frequency response code:

% frequency responses of the two IRs and of the captured noise recording
[H_clk, W] = freqz(ir_clk, 1, fs, fs);
[H_ess, ~] = freqz(ir_ess, 1, fs, fs);
[Hn, Wn]   = freqz(noise, 1, fs, fs);   % noise: silence captured in the noise-floor test

figure; hold on;
semilogx(W,  db(abs(H_clk)), 'LineWidth', 3, 'Color', 'b');
semilogx(W,  db(abs(H_ess)), 'LineWidth', 2, 'Color', 'r');
semilogx(Wn, db(abs(Hn)), 'Color', [0.5 0.5 0.5]);
grid on;
box on;
axis([20 20000 -24 6])
set(gca, 'XScale', 'log')

Conclusions

The major issue faced by recording musicians under quarantine conditions was the inability to collaborate and perform with other musicians. One needs to look no further than popular recording artists such as Taylor Swift (see: Folklore and Evermore) to see that we are entering a new era of music recording. To address this need, meeting platforms have begun to roll out features that enhance real-time musical collaboration. After a battery of audio quality testing, I observed that the Zoom audio compression codec exhibits a dynamic range of 60 dB, with a bandwidth up to 13 kHz.

A sample rate of 48 kHz is advertised, which would indicate a bandwidth up to the 24 kHz Nyquist frequency. The soundcard and electrical stimuli were configured to reach these frequencies, so it is curious that the bandwidth was observed to extend only to 13 kHz. Perhaps the audio codec relies on some sort of spectral band replication to reproduce these frequencies on the receiving computer. With respect to the noise floor, the codec does give sufficient headroom, with the digital floor being roughly 1000 times lower than the loudest audio signal levels. The noise floor does appear to be psychoacoustically shaped, with the increased noise floor occurring in the high frequencies, where the ear does not detect noise with great sensitivity.
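For reference, the arithmetic behind that factor of 1000: a decibel figure converts to a linear amplitude ratio as 10^(dB/20), so the measured –61 dBFS floor works out as follows.

% amplitude ratio between full scale and the measured -61 dBFS noise floor
ratio = 10^(61/20);   % approximately 1122, i.e. on the order of 1000:1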

These findings are a great improvement in musical quality. At my institution, we have begun implementing these new settings for remote lessons with great success.

Christopher L. Bennett
https://people.miami.edu/profile/[email protected]
https://www.linkedin.com/in/christopherlbennett/

Zoom logo: Copyright ©2021 Zoom Video Communications, Inc.