While at AMOOCON Randy made a point of recording some of the presentations. These recordings can be found on the VoIP Users Conference site. The conference organizers also recorded various sessions, including quite a lot of video of the formal presentations, which are available on the AMOOCON site.
In what I think was a very inspired move Randy also recorded some of the casual conversations that sprang up outside of the formal sessions. This is truly the equivalent of “reading between the lines.” These conversations tend to cover matters that are on people’s minds at the moment, but have not evolved to the point where anyone has a formal presentation to give.
In particular he recorded a conversation with a very good group of people that included John Todd, Tim Panton, Ole Johansson, Michael Ideema, Stefan Wintermeyer, Nir Simionovich, Mark Spencer…and those are just the people that I can identify from listening to the recording. The conversation drifts around a little but it’s definitely worth taking the time (45 minutes) to give it a listen.
Of particular interest to me are the parts about codecs, wideband telephony and the potential of “stereo telephony.” These are topics that are very dear to me. I’d like to offer some commentary on some of the discussion in this area, so…off you go now….give it a listen.
Ok, now with that fresh in your mind I’d like to expand on a few points from the conversation.
At 3:49 into the recording, in reference to wideband telephony based upon G.722, it’s said:
“It shocks people how good what they think a phone is can sound. When you’re listening to it on a soft phone on a PC with a headset (or something) people don’t associate that with being a phone. But when you actually put a phone in front of them, and they pick it up: WOW, this is a phone and it sounds so good!”
…and…
“…people are just using it to socialize….for that the audio quality matters in a way that for a quick phone call it doesn’t.”
This is a very good and interesting observation made by Tim Panton. Simply stated, the context of communication matters. A quick phone call to get some information is very different from a lengthy call that is truly a means of “telepresence.” That is, a sustained call making up for the fact that you are not physically there.
My personal experience definitely supports the use of wideband in this application. I’ve been spending a lot of time away from home the past few months. A wideband call between two soft phones, using high-quality speakerphone devices on our PCs, lets my wife and me conduct lengthy calls comfortably. The enhanced call quality makes it more like we’re in the same room. It’s as if, were I to just look up from what I was doing, she’d be right there.
Part of that sense comes from the fact that my brain is not laboring to understand what she’s saying. I can hear the sounds from the TV that she has on in the background. I know what she’s talking about as she comments on something she sees on TV.
John Todd’s tale of experimental use of Polycom video gear as part of his teleworking further illustrates this point. When the communication channel is more transparent, whether that is via enhanced audio quality or adding a visual component, the communication flows more naturally.
Simply stated…quality matters!
Then John Todd broaches a subject that I find very interesting:
“…I keep wondering why stereo is never on anyone’s lips when we talk about these next generation codecs?”
A number of related issues are mentioned by various parties:
“…stereo is required from the transmitting site…”
“…you’re just thinking about it from a voice perspective…”
“…you don’t have to send the full stereo data. You can send enough to let the client synthesize it.”
“…if you’re going to go to the effort of creating stereo you should really do actual stereo.”
This starts to get under the covers of sending directionally encoded audio. The matter of “stereo” is a leap in logic that might not be well considered. Stereo from the perspective of music is one thing, but what’s truly required is some form of surround sound. The fact that we use the equivalent of two audio streams for the encoded channel is purely incidental.
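To make the “send enough to let the client synthesize it” idea concrete, consider classic mid/side coding, the simplest case of sending less than the full stereo data. This is just a sketch of my own in Python with NumPy, not anything proposed in the recording:

```python
# A minimal sketch of mid/side coding: the mid channel carries most of
# the audible content, while the low-energy side channel (the spatial
# difference) can be coded coarsely or band-limited at the sender.
import numpy as np

def ms_encode(left: np.ndarray, right: np.ndarray):
    """Split a stereo pair into mid (sum) and side (difference) signals."""
    mid = 0.5 * (left + right)
    side = 0.5 * (left - right)
    return mid, side

def ms_decode(mid: np.ndarray, side: np.ndarray):
    """Perfectly reconstruct left/right when side is sent intact."""
    return mid + side, mid - side
```

Parametric stereo codecs push this much further, sending only a handful of spatial cues alongside a mono stream, but the principle is the same.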
All surround sound systems make some effort to be “stereo compatible” so that they will present an acceptable, if less than optimal, result when conveyed through only two speakers.
At 13:40 into the recording someone asks:
“…do we hear phase? Do we need the temporal difference?”
Yes, absolutely! Our hearing is a differential mechanism. Both phase and delay are temporal cues, and their audibility depends on wavelength and on the distance between our ears.
“…if you have a hifi that allows you to invert the phase on one of the channels you can absolutely hear that. But I don’t know if you can hear say, one of the channels being 5ms ahead or behind the other one.”
Time is the key factor here. If we use a digital delay device to insert delay into one channel of a stereo signal, the effect will be wildly different at various settings. When the delay is long, say 300 ms, we hear it as a distinct echo between the two channels.
As the delay is shortened its effect changes character. When the delay becomes comparable to the periods of sounds in our target range, say 50 Hz to 12 kHz, the two channels fuse and the delay is heard instead as comb filtering. If the delay is varied we hear the “phasing” or “flanging” effects commonly used on guitars in recording studios.
Just as extended frequency response makes a call sound more natural, the timing of sounds between channels is important to conveying the natural directional perspective.
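If you’d like to hear these effects for yourself, the experiment is easy to reproduce in software. Here’s a small sketch of my own (Python with NumPy, assuming a 48 kHz sample rate) that delays one channel of a stereo buffer:

```python
# A minimal sketch: delay the right channel of a stereo buffer by a
# chosen number of milliseconds and listen to how the percept changes.
import numpy as np

def delay_right_channel(stereo: np.ndarray, delay_ms: float, rate: int = 48000) -> np.ndarray:
    """stereo has shape (n_samples, 2); delay must be shorter than the buffer."""
    n = int(round(delay_ms * rate / 1000.0))
    out = stereo.copy()
    if n > 0:
        out[n:, 1] = stereo[:-n, 1]  # shift the right channel later in time
        out[:n, 1] = 0.0             # pad the gap with silence
    return out

# Roughly what to expect as delay_ms shrinks:
#   ~300 ms -> a distinct echo between the channels
#   1-30 ms -> the image pulls toward the earlier channel
#   < 1 ms  -> comparable to the periods of audible frequencies;
#              comb filtering, and "flanging" if the delay is swept
```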
Please remember my prior post referencing Pink Floyd’s production of “Money” from the album “Dark Side of The Moon.” The point made there was that you have two basic approaches to directionality:
- You can try to accurately capture, convey and reproduce an original acoustic event
- You can impart directionality as an effect, without regard for the original sound source
Both of these approaches can be taken to the nth degree of sophistication.
I think the idea that you can synthesize directional cues is rooted firmly in the decision that you’re going about the matter of directionality purely as an effect. You must ask yourself a series of questions:
- What am I really trying to convey during a conference call?
- Do I need to shift focus to the dominant speaker in the room?
- Does that change the sonic mix or image positioning?
- What is the effect of rotating the sonic image when the visual image stays with the video screen that shows the remote sites?
- As I move from one acoustic perspective to another will I suffer the equivalent of acoustic whiplash?
- How do I overlay the acoustic perspective of multiple sites?
- What is their relative orientation?
Let’s get what some might feel is a rather controversial statement out on the table. “Surround sound” as we commonly find in the entertainment industry, including all the various forms of 5.1 and 7.1 surround configurations, is not an effort to accurately convey anything at all. The acoustic perspectives presented are usually much more dramatic than the actual events would be. It’s all done for effect.
With 5.1 or 7.1 surround you get the effect of something happening beside or behind you, but typically no really rich directional cues. All the dialogue is arbitrarily placed in the front channels so as to be located on-screen with the characters. Most directional information is based upon relative audio levels in the various channels, usually with few temporal cues at all.
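Level-only panning is trivial to express in code, which is part of why the industry leans on it. Here’s a sketch of my own (Python with NumPy) of the familiar constant-power pan law, placing a mono source between two speakers using relative levels alone, with no temporal cues whatsoever:

```python
# A minimal sketch of constant-power amplitude panning: directionality
# comes purely from the level ratio between the two channels.
import numpy as np

def pan_constant_power(mono: np.ndarray, position: float):
    """position: 0.0 = hard left, 0.5 = center, 1.0 = hard right."""
    theta = position * np.pi / 2.0  # map position onto a quarter circle
    left = np.cos(theta) * mono     # total power stays constant:
    right = np.sin(theta) * mono    # cos^2 + sin^2 = 1
    return left, right
```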
Surround sound systems, and specifically home theater systems, are designed around certain practical realities. There’s a limit to what can be sensibly implemented in a theater or your living room, and every living room is a different acoustic environment.
Nonetheless, in the case of TV and movies the surround sound effect is often very good. I suspect this is because there’s a lot going on in most scenes. With a very rich sound environment the ear can get by with simplistic directional cues. When there’s less activity in a scene our attention focuses on the dialogue, so keeping it anchored on-screen seems natural.
None of the common surround sound systems even take the vertical plane into consideration. That is, they don’t convey height at all. The curious 22.2 channel surround system proposed by NHK for their new Super High-Definition TV (sixteen times the pixels of 1080p!) is the first to make that effort. Even then, they make little effort to convey vertical information in the surround channels, offering it only in the frontal hemisphere.
The very fact that there are so many different, incompatible surround sound schemes tends to suggest that none of them take a comprehensive, scalable approach. It’s a bit like the old days of quadraphonic audio.
My suspicion is that, given the two primary approaches, we’ll be better off taking the more strict path of trying to accurately encode the directional cues inherent in the source event, and recreate these at the remote ends.
This approach is uniquely and elegantly embodied in an existing technology known as “Ambisonics.” That will be a topic for another time.
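As a small teaser, the first-order Ambisonic encoding equations are remarkably compact. This sketch of my own (Python with NumPy, using the standard first-order B-format equations) encodes a mono source, together with its direction of arrival, into the four B-format channels:

```python
# A minimal sketch of first-order Ambisonic (B-format) encoding. Azimuth
# is measured counter-clockwise from straight ahead, elevation up from
# the horizontal plane, both in radians.
import numpy as np

def encode_b_format(mono: np.ndarray, azimuth: float, elevation: float) -> np.ndarray:
    w = mono / np.sqrt(2.0)                         # omnidirectional component
    x = mono * np.cos(azimuth) * np.cos(elevation)  # front-back
    y = mono * np.sin(azimuth) * np.cos(elevation)  # left-right
    z = mono * np.sin(elevation)                    # up-down: height, at last!
    return np.stack([w, x, y, z])
```

Note that the same four signals can be decoded for any speaker layout; that scalability is exactly what the channel-based surround schemes lack.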
For the moment I will close with words of wisdom from Ole Johansson:
“In Sweden the average household broadband connection is 10 Mbps both ways so there’s no reason to compress. It’s time to go wideband. It’s time to go stereo…5.1 surround. Experiment with a new kind of communication experience instead of trying to emulate the PSTN.”
“…the technology has to be advanced anyway. You have to do something beyond the traditional PSTN network. Because, if you don’t, what’s the point of VoIP?”
“…maybe wideband in itself, or stereo won’t pay for itself, but it will certainly create data traffic…and someone is paying for that data traffic.”
Amen, brother!