Questioning New Dimensions In Conference Audio

Headset vs Conference Phone

I’ve long had a fascination with spatial audio processing. This was in part why Voxeet caught my attention when the service initially launched. It was over a year before we were able to have them appear on VUC #471 on January 10th.

From that session you may recall that Voxeet offers a binaural conference service. Participants join a conference using a PC or smartphone application. They use a stereo headset, which allows the client application to place the individual participants within a controlled sound stage.

Voxeet is interesting. However, it’s not exactly clear what aspect of the service is most compelling. At launch they used the Speex audio codec, which allows wideband audio (aka HDVoice).

In the recent v2 release their PC client has been moved to a WebRTC foundation, leveraging Opus. I’ve done a quick analysis of their updated online demo. Newly fitted with American voices where there were once French accents, it presents a 16 KHz usable audio path, suggesting a 32 KHz sample rate. It certainly sounds very good.
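The step from usable bandwidth to sample rate follows from the Nyquist limit: a sampled signal can only carry frequencies up to half its sample rate. Here is a minimal sketch of that arithmetic. The function names are my own, chosen for illustration, and have nothing to do with Voxeet's software.

```python
def max_audio_bandwidth_hz(sample_rate_hz: float) -> float:
    """Highest frequency a given sample rate can represent (Nyquist limit)."""
    return sample_rate_hz / 2.0


def min_sample_rate_hz(bandwidth_hz: float) -> float:
    """Lowest sample rate able to carry the observed audio bandwidth."""
    return bandwidth_hz * 2.0


if __name__ == "__main__":
    # A 16 kHz usable audio path implies a sample rate of at least 32 kHz.
    print(min_sample_rate_hz(16_000))
    # Narrowband telephony sampled at 8 kHz tops out at 4 kHz of audio.
    print(max_audio_bandwidth_hz(8_000))
```

Note that this only gives a lower bound. A codec could run at a higher sample rate and still band-limit the audio it actually delivers.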

Their mobile apps still rely upon a proprietary codec. They don’t have a client for OSX at present.

Voxeet-Demo-Call

They also have the ability to move an ongoing call from one client to another, say from the PC to an iPhone, allowing the participant to change modes smoothly on the fly. When I tried this myself I found it surprisingly impressive.

Over at TMC’s TechZone 360 Doug Mohney has recently asked if 3D/spatial audio is the next big breakout thing. He notes the collaborative effort of BT Conferencing & Dolby in offering BT’s MeetMe with Dolby Voice. Dolby Voice is also one of the finalists for Best of Enterprise Connect 2014.

Like Voxeet, this service depends upon all participants joining the call using a client application and a stereo headset. Once joined to the call, participants enjoy HDVoice and the ability to place the other participants in a sound stage, so that each voice is heard from a different acoustic position.

On their web site Dolby states that the Dolby Voice client uses the DVC-2 audio codec supporting a 16 KHz audio path. They report 16 KHz audio at around 24 kbps. They also have stated support for G.722 and G.711 in the conference server.

Dolby states that the conference server provides, “Efficient mixing of all active participants.” That implies that the creation of the stereo image, including placement of participants on the sound stage, is done at the server with the resulting stereo stream being delivered to the client application.
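Dolby doesn't publish how its server renders the stereo image, so what follows is only a generic illustration of the idea of mixing several mono talkers into one stereo stream. It uses constant-power panning, which is far simpler than true binaural rendering (that typically involves head-related transfer functions). All names and the position scale are my own assumptions.

```python
import math


def pan_gains(position: float) -> tuple[float, float]:
    """Constant-power (left, right) gains for a position from -1.0 (left) to 1.0 (right)."""
    angle = (position + 1.0) * math.pi / 4.0  # map -1..1 onto 0..pi/2
    return math.cos(angle), math.sin(angle)


def mix_to_stereo(talkers: list[tuple[list[float], float]]) -> list[tuple[float, float]]:
    """Mix mono talkers, each a (samples, position) pair, into one stereo stream."""
    if not talkers:
        return []
    length = max(len(samples) for samples, _ in talkers)
    stereo = [[0.0, 0.0] for _ in range(length)]
    for samples, position in talkers:
        left_gain, right_gain = pan_gains(position)
        for i, sample in enumerate(samples):
            stereo[i][0] += sample * left_gain
            stereo[i][1] += sample * right_gain
    return [(left, right) for left, right in stereo]


if __name__ == "__main__":
    # Hypothetical call: one talker hard left, one centered, one hard right.
    call = [([0.5, 0.5], -1.0), ([0.3, 0.3], 0.0), ([0.4, 0.4], 1.0)]
    print(mix_to_stereo(call))
```

If the mix is done this way at the server, the client only has to play back a single stereo stream, regardless of how many people are on the call.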

There are many things that impact intelligibility of voice communications. I find myself wondering which attribute of these new binaural conference systems does the most to improve the call experience: wideband audio or spatial placement of participants? Since Dolby has an interesting example of Dolby Voice online, I thought I might take a few measurements to see how the representative audio samples compare.

The Dolby Voice example online (requires Flash) has an initial section that is a fixed presentation, followed by a section that’s interactive. During the interactive portion you can toggle between the Dolby Voice rendition of the situation and a plain old PSTN conference call.

To gain a little further insight into the two facets of this demo I captured a series of screencasts and the associated audio streams. This allowed me to create an overlay that displays the sample audio in both waveform and spectral energy distribution, along with the Dolby presentation.

There are two versions; the first has Dolby Voice enabled.

The second example has Dolby Voice disabled.

The only processing of the audio was capture to my Zoom H2 Handy Recorder (48 KHz 16 bit Stereo WAV file) then normalization of level in Adobe Audition.

The example with Dolby Voice enabled is much clearer and easier to understand. I find myself thinking that the single biggest improvement stems from wideband audio, although the sample exhibits an 8 KHz usable channel not unlike plain vanilla G.722. Beyond the wideband codec it’s unclear if better noise reduction or positional trickery has more impact.
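For anyone wanting to put a number on "usable channel" rather than eyeball a spectral display, here is one rough way to do it. This is a sketch under my own assumptions: it takes the highest frequency whose level is within 40 dB of the strongest component, and that threshold is arbitrary. The naive DFT is slow and only suitable for short excerpts.

```python
import cmath
import math


def magnitude_spectrum(samples, sample_rate_hz):
    """Return (frequency_hz, magnitude) pairs computed with a naive DFT."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append((k * sample_rate_hz / n, abs(acc) / n))
    return spectrum


def usable_bandwidth_hz(samples, sample_rate_hz, floor_db=-40.0):
    """Highest frequency whose level is within floor_db of the strongest bin."""
    spectrum = magnitude_spectrum(samples, sample_rate_hz)
    peak = max(mag for _, mag in spectrum)
    if peak == 0.0:
        return 0.0
    threshold = peak * 10 ** (floor_db / 20.0)
    return max(freq for freq, mag in spectrum if mag >= threshold)


if __name__ == "__main__":
    # Hypothetical test signal: tones at 1 kHz and 7 kHz in a 48 kHz capture.
    rate = 48_000
    signal = [
        math.sin(2 * math.pi * 1_000 * t / rate)
        + 0.5 * math.sin(2 * math.pi * 7_000 * t / rate)
        for t in range(480)
    ]
    print(usable_bandwidth_hz(signal, rate))
```

Real speech is far messier than two test tones, so in practice the result would need averaging over many frames before it meant much.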

The example with Dolby Voice disabled is not at the same level as the other. This is due to some noise in the stream. Short duration noise bursts cause simplistic volume normalization schemes to react by lowering the overall level. This can be significant.
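To illustrate how a single noise burst can drag down the level of everything else, here is a small sketch of simple peak normalization. It is not the algorithm Adobe Audition uses, just the simplest scheme of this kind, and the test signal is a made-up stand-in for speech.

```python
import math


def peak_normalize(samples, target_peak=1.0):
    """Scale the signal so its single loudest sample reaches target_peak."""
    peak = max(abs(s) for s in samples)
    if peak == 0.0:
        return list(samples)
    gain = target_peak / peak
    return [s * gain for s in samples]


def rms(samples):
    """Root-mean-square level, a rough proxy for perceived loudness."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


# Hypothetical signal: a steady quiet tone standing in for speech.
speech = [0.2 * math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]

# The same signal with one brief click, far louder than the speech peak.
with_burst = list(speech)
with_burst[4000] = 1.0

clean_level = rms(peak_normalize(speech))
burst_level = rms(peak_normalize(with_burst))

if __name__ == "__main__":
    print(f"normalized level without burst: {clean_level:.3f}")
    print(f"normalized level with burst:    {burst_level:.3f}")
```

Because the click already sits at the target peak, the normalizer applies almost no gain, and the speech ends up several times quieter than it would be in the clean capture.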

From the realm of the audiophiles we know that, all other factors being equal, given two sounds to compare, people tend to prefer the source that is simply louder. Even if it’s only minutely louder. Clearly noise reduction and level control can be important factors in perceived call quality.

The very fact that both Dolby Voice and Voxeet require the use of a headset also implies that there will be a microphone more optimally placed to reduce the impact of room tone and other background noises. This is in sharp contrast to a speakerphone or conference phone. Those devices, even very good ones, allow the acoustics of the room to impact the call. In almost every case the addition of room tone degrades the call experience.

Desk Phone vs mobile headset

As an aside, trends (fashion?) in headsets vary by use case or application. Users who are joined to a binaural conference call by way of a mobile device are very likely to use a wired headset with an inline microphone, in my case an Etymotic Research ER23-HF2.

A microphone inline with the headset usually falls near the user’s throat or upper chest. Such microphones are typically non-directional, making them more susceptible to ambient noise. The microphone placement is much better than a speakerphone, but still not optimal.

PC headset vs mobile headset

There’s a greater chance that a user at a PC or Mac could be wearing a headset with a boom-mounted microphone. This provides the best audio pickup. To wear such a headset in an office isn’t quite the fashion statement that it would be to wear such a thing out in public. Such observations are admittedly just a quibble on my part.

It will be interesting to see what kind of traction these binaural conference services manage to garner. I’ve long felt that, given a choice, headsets are the best approach to business telephony. However, it’s not clear to me how many people will find the requirement to wear a headset to be acceptable.

Nor am I clear about how these services will address the need to accommodate participants in traditional conference/meeting rooms. A group joined to a call by way of a conference phone loses the benefit of the spatial trickery and appears as a single source to others on the call. That would create a marked and potentially jarring contrast between the perception of individual remote users and those in the meeting room.

When I left an inquiry as a comment to a post at the Dolby Labs Blog their response revealed that Dolby is working on a hardware solution to that very concern.

“Your comments about the importance of including rooms in the conference make perfect sense and we plan to launch a conference phone product later this year. The new device is revolutionary and captures all the audio in the room for remote participants and helps to separate multiple remote talkers for those in the room. The experience is fully integrated with the soft-client experience already launched on Macs and PCs and which will be available on smartphones in a few weeks”

Dr. Mike Hollier, Vice President and Chief Technical Officer, Communications Business Group, Dolby Laboratories

It’s also interesting to note that Dr. Hollier worked at BT Labs for more than a decade. This establishes something of a link to BT as the first to offer conferencing with Dolby Voice.

In the end the real question for both services revolves around changing user behavior. Will people find the benefits of binaural conference calls sufficiently compelling to change how they participate? If so, what portion of the market will make the effort? That’s the group that BT+Dolby and Voxeet are trying to hook.

“…it makes me wonder.” – Robert Plant

  • Jeff

    The question of a change in user behavior being required is spot on. This is the key inhibiting factor that is most often cited in opposition to Dolby Voice. We’ve seen a massive shift in user behavior from customers of the BT MeetMe with Dolby Voice solution. Before deploying the service, one enterprise customer reported 100% PSTN dial-in for conferences and within 2 months after roll out, over 30% of all conferencing minutes in the company were on soft-clients with headsets. We expect the mobile app to move this even further (available in April).
    Jeff Smith, Director of Marketing, Communications Business Group, Dolby Laboratories