A Strange Brew: VoIP/Telephony Crossed With Surround Sound

(This started as a quick comment on my Facebook page, but I’m moving it here so that people outside of Facebook can join in.)

With apologies to the McKenzie brothers. There appears to be an odd cross between two of my passions in the works. As I get further into the daily use of wideband telephony, I wonder whether some surround sound techniques could be leveraged to take conferencing to a new level.

It couldn’t be the puritanical kind of approach used in music recording. It would be more a matter of using surround panning to position participants in a synthetic soundfield. I wonder if this has been done to any degree elsewhere?

Dean Collins Comments:

Yep, already thought about it. I suggested using stereo channels with ‘N’ deviations.

Details posted here http://deancollinsblog.blogspot.com/2008/08/diamondware-spatial-conferencing.html

I would be really interested in hearing how this actually works in real life, so if someone takes this idea and implements it, please contact me so I can listen in on a call.

Cheers,
Dean Collins
www.Cognation.net

My response:

Stereo is extremely limited in scope. Most of a synthetic stereo image is manipulated using simplistic level-based panning, not unlike an old-school balance control. It’s coarse and two-dimensional at best.
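To make the "balance control" comparison concrete, here is a minimal sketch of level-based panning using the common constant-power (sine/cosine) law. The function name and the pan range of -1.0 to +1.0 are my own choices for illustration, not anything taken from a particular product.

```python
import math

def constant_power_pan(sample, pan):
    """Split one mono sample into (left, right) by level alone.

    pan runs from -1.0 (hard left) through 0.0 (centre) to +1.0 (hard right).
    """
    if not -1.0 <= pan <= 1.0:
        raise ValueError("pan must be between -1.0 and 1.0")
    angle = (pan + 1.0) * math.pi / 4.0  # maps pan onto 0..pi/2
    # cos/sin gains keep left^2 + right^2 constant across the pan range
    return sample * math.cos(angle), sample * math.sin(angle)

for pan in (-1.0, 0.0, 1.0):
    left, right = constant_power_pan(1.0, pan)
    print(f"pan={pan:+.1f}  left={left:.3f}  right={right:.3f}")
```

The only thing that changes with position is the relative level of the two channels, which is why the resulting image is confined to the line between the two speakers.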

I’m thinking that UHJ-format Ambisonic encoding might prove more useful. It allows accurate, controllable positioning around the listener while using only the equivalent of a stereo stream, although in its two-channel form it carries horizontal placement only, without height.

Also, in recent years Ambisonic functionality has been implemented in common digital audio workstation software, usually as a plug-in module. That suggests that such functions could be hosted on a dedicated hardware conference bridge where DSPs could accelerate the process.

  • Having taken a quick look at Diamondware’s web site (http://www.diamondware.com), it seems that their approach is rooted in both level and temporal (phase) manipulation. They make specific note of interaural delay.

    http://www.dw.com/about_technology_3Dpositioning.php
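As a rough illustration of combining level and interaural delay cues, the sketch below delays and attenuates the ear that is farther from the source. This is not Diamondware's method, which I have not seen. It uses the textbook Woodworth spherical-head approximation for the delay, and the head radius, the 6 dB maximum level drop, and all function names are assumptions made for the example.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second
HEAD_RADIUS = 0.0875    # metres, an assumed average head

def interaural_delay_seconds(azimuth_deg):
    """Woodworth approximation. 0 = straight ahead, +90 = hard right."""
    theta = math.radians(max(-90.0, min(90.0, azimuth_deg)))
    return (HEAD_RADIUS / SPEED_OF_SOUND) * (theta + math.sin(theta))

def position_mono(samples, azimuth_deg, sample_rate=16000, max_level_drop_db=6.0):
    """Return (left, right) sample lists for a mono input."""
    clipped = max(-90.0, min(90.0, azimuth_deg))
    itd = interaural_delay_seconds(clipped)
    delay = round(abs(itd) * sample_rate)  # whole samples only
    # Assumed level cue: far ear drops linearly in dB with angle
    far_gain = 10.0 ** (-max_level_drop_db * (abs(clipped) / 90.0) / 20.0)
    near = list(samples) + [0.0] * delay
    far = [0.0] * delay + [s * far_gain for s in samples]
    if itd >= 0:            # source on the right: right ear is the near ear
        return far, near
    return near, far        # source on the left

left, right = position_mono([1.0, 0.0, 0.0, 0.0], azimuth_deg=90.0)
print(len(left), left.index(max(left)))
```

At a 16 kHz wideband sample rate the maximum delay is only about ten samples, so a real implementation would likely want fractional delays rather than the whole-sample rounding used here.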

  • I know that several companies have done work on this in the past, though of course the requirement would be for a stereo receiver. There is a distinct lack of stereo codecs in the open source world – I think that is a big impediment. I think Speex supports stereo, but few others that are easily available work with two channels. Plus, of course, 99.999% of handsets don’t have two speakers – they have one. So this would be a desktop-specific app to start, which quickly moves it to “proprietary solution” unless there are some RFCs that talk about stereo usage. Anyone putting that many hours into a solution would be unlikely to give it away. Also there are patents. 🙁

    http://www.google.com/search?num=30&hl=en&safe=off&q=conference+call+spatialization&btnG=Search

    The next downside is that the spatial configuration would require fairly intensive re-coding of each audio stream. Since each listener would have a different “perspective”, each listener would need a completely different audio channel constructed on the central server. Certainly this isn’t impossible, but it would be compute-intensive, leading to low call density on a per-system basis.

    Of course, Asterisk is open-source, so anyone wanting a testbed could come up with something there. Code is welcome! 🙂
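To show why per-listener perspectives cost so much on the bridge, here is a minimal sketch of the mixing loop, assuming simple level panning and one frame of decoded samples per talker. The data layout (`frames`, `layouts`) and the function names are hypothetical and do not correspond to any Asterisk API.

```python
import math

def pan_gains(pan):
    """Constant-power gains for pan in -1.0 (left) .. +1.0 (right)."""
    angle = (pan + 1.0) * math.pi / 4.0
    return math.cos(angle), math.sin(angle)

def mix_for_listener(listener, frames, layout):
    """Build one listener's stereo frame from everyone else's mono frame.

    frames: {name: [samples]}  one decoded mono frame per participant
    layout: {name: pan}        where this listener hears each talker
    """
    length = len(next(iter(frames.values())))
    left, right = [0.0] * length, [0.0] * length
    for talker, samples in frames.items():
        if talker == listener:
            continue  # nobody hears their own voice back
        gain_l, gain_r = pan_gains(layout[talker])
        for i, s in enumerate(samples):
            left[i] += gain_l * s
            right[i] += gain_r * s
    return left, right

def mix_conference(frames, layouts):
    """One distinct stereo mix per listener: N listeners x (N - 1) talkers."""
    return {name: mix_for_listener(name, frames, layouts[name]) for name in frames}

frames = {"a": [1.0, 0.5], "b": [0.0, 0.0], "c": [0.0, 0.0]}
layouts = {
    "a": {"b": -1.0, "c": 1.0},
    "b": {"a": -1.0, "c": 1.0},
    "c": {"a": 0.0, "b": 1.0},
}
for name, (left, right) in mix_conference(frames, layouts).items():
    print(name, left, right)
```

The work grows with the square of the number of participants, and each of the N output mixes then has to be encoded separately, which is where the low call density comes from.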

  • There’s a natural tendency to associate “surround” with “stereo” but this need not be the case. In fact, stereo, as presented by two loudspeakers, has very limited application in providing substantive spatial localization. It’s the minimum playback mechanism required but far from effective.

    In contrast, it seems like only a little more effort to use what the surrsound people call planar surround, that is, two-dimensional surround lacking height information. This can be conveyed in a two-channel stream using UHJ encoding as defined by Michael Gerzon in the early days of Ambisonics.

    Unlike the various common Dolby schemes, UHJ is not dependent upon costly licensing. Production equipment (hardware and software) exists to accurately place sounds at an arbitrary location in space. Decoding hardware and software also exist; the Meridian surround decoder is the one I have the most direct experience with, although it is costly.

    I know that the folks on the surrsound mailing list have been working with the Vorbis folks to define a surround-specific container for multi-channel surround streams. There are several implementations or partial implementations around, but no clear standard at the moment.

    Of course any of this would only seem to make sense in the context of a telepresence suite. But perhaps the video could be set aside entirely if the surround environment was very good.
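The planar surround placement mentioned above can be sketched as a first-order horizontal B-format encode, where each mono voice becomes three signals (W, X, Y) that simply add together. This uses the traditional convention with W scaled by 1/√2 and angles measured anticlockwise from straight ahead. The function names are mine, and the further matrixing of W, X and Y down to two-channel UHJ, which involves 90-degree phase shifts, is deliberately not shown.

```python
import math

def encode_planar(sample, azimuth_deg):
    """Encode one mono sample at a horizontal angle into (W, X, Y).

    azimuth_deg: 0 = front, 90 = left, 180 = behind, 270 = right.
    """
    theta = math.radians(azimuth_deg)
    w = sample / math.sqrt(2.0)    # omnidirectional component
    x = sample * math.cos(theta)   # front/back component
    y = sample * math.sin(theta)   # left/right component
    return w, x, y

def mix_soundfield(sources):
    """Sum several (sample, azimuth_deg) voices into one soundfield sample."""
    w = x = y = 0.0
    for sample, azimuth_deg in sources:
        dw, dx, dy = encode_planar(sample, azimuth_deg)
        w, x, y = w + dw, x + dx, y + dy
    return w, x, y

# Three hypothetical callers: front, left and right of the virtual table
print(mix_soundfield([(0.5, 0.0), (0.2, 90.0), (0.2, 270.0)]))
```

One appeal of this approach for a conference bridge is that the soundfield is built once, and rotating it for a given listener is a cheap operation on X and Y rather than a full remix.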

  • I’ve often thought that a Skype plugin that allows the individual voices to be positioned in different locations would be a significant benefit to conference calls.

    I do a lot of conference calls with people whom I have not met and whose voices I am not familiar with … I often have to ask people, after they comment, who they are. If I knew that they were positioned front left, for example, I would not need to ask who they were.

    This would not necessarily require ambisonics, I would have thought. HRTFs would be sufficient. I was thinking that there would not be any need for a VoIP protocol to carry any extra info … the client (e.g. the Skype client) could position the sources as it sees fit … but then I realised that it might be of benefit for each participant to have the same ‘virtual’ space experience, i.e. if Bob is on the left of Jane, then Jane should be on the left of Bob…
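One way to read the shared virtual space idea is a single seating plan from which every client derives its own view. The sketch below assumes a round table with everyone facing the centre. The names, the even spacing, and the sign convention (positive bearings to the listener's left) are all assumptions made for illustration.

```python
import math

def seat_angles(names):
    """Spread participants evenly round a circular table (degrees, anticlockwise)."""
    return {name: 360.0 * i / len(names) for i, name in enumerate(names)}

def bearing(listener_deg, talker_deg):
    """Direction of a talker as heard by a listener facing the table centre.

    Returns degrees in [-180, 180): 0 is straight ahead, positive is to the
    listener's left, negative is to the listener's right.
    """
    lx = math.cos(math.radians(listener_deg))
    ly = math.sin(math.radians(listener_deg))
    tx = math.cos(math.radians(talker_deg))
    ty = math.sin(math.radians(talker_deg))
    absolute = math.degrees(math.atan2(ty - ly, tx - lx))
    facing = listener_deg + 180.0  # looking from the seat toward the centre
    return (absolute - facing + 180.0) % 360.0 - 180.0

seats = seat_angles(["jane", "bob", "ann", "raj"])
for listener in seats:
    view = {t: round(bearing(seats[listener], seats[t]))
            for t in seats if t != listener}
    print(listener, view)
```

With this geometry only the seating plan has to be shared, not any audio metadata per frame. Note that, as at a real table, neighbours come out mirrored rather than identical: a talker 45 degrees to one person's right hears that person 45 degrees to their left.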

    • Etienne,

      First, I am honored that you would stop by and comment. I’ve been on the surrsound list quite a while but generally just lurk.

      I too attend many conference calls, often with a large number of attendees. There is the matter of establishing who sits where at the virtual table, if a table is even the correct metaphor. Perhaps a classroom or auditorium might be more appropriate. Does each person set up their own cast? Or is that established for the group?

      Bandwidth is limited, but modern VoIP codecs can provide very good voice reproduction within those limits. I had thought that if the call were a UHJ stream where source placement was defined at the conference bridge, then it might be conveyed via the equivalent of two call streams. This would have similar bandwidth requirements to stereo, but with perhaps greater image placement opportunity.

      That implies that the conference bridge would be encoding the UHJ stream from a number of callers’ mono streams. As the focus is less on surround accuracy and more on the effect for its own sake, it should be easier to implement than would be the case for music. Each listener would only need to establish what is essentially their orientation to the table.

      Diamondware seems to have been focused on HRTFs as you suggest. Skype is problematic as it is an entirely closed architecture. As such, a plug-in like the one you suggest seems unlikely. OTOH, there are a number of softphone clients, many in the open source realm, that could be extended as necessary if a smart-client approach is most suitable.

      In a related thought, now that a couple of relatively affordable surround mics exist, I’ve been thinking about streaming surround-encoded material from a convention as a novel approach to covering papers, panel discussions, etc.

  • Apparently Apple have a patent in this area, so perhaps we will see it in the iPhone.

    I did some experiments with this using a tweaked version of Asterisk’s MeetMe and our own VoIP client. I took the simplistic line that only one person would be speaking at a time, so I used a mono codec but sent extra info along with the call whenever the speaker changed, allowing the VoIP client to shift the audio source in the stereo stage. I concluded that I needed an audio guru to get the perception stuff right and shelved it.
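The client side of that experiment might look something like the sketch below: the audio stays mono, and a small out-of-band event names the current speaker so the client can move the stereo position. This is not the commenter's code. The class, the event call, and the per-frame glide limit (added so the image slides rather than jumps) are all hypothetical.

```python
import math

class SpeakerPanner:
    """Pans mono frames according to the most recent speaker-change event."""

    def __init__(self, seats, max_step=0.25):
        self.seats = seats        # name -> pan in -1.0 (left) .. +1.0 (right)
        self.max_step = max_step  # largest pan change allowed per frame
        self.current = 0.0        # start in the centre
        self.target = 0.0

    def on_speaker_change(self, name):
        # Unknown speakers fall back to the centre position
        self.target = self.seats.get(name, 0.0)

    def render(self, frame):
        """Glide toward the target, then apply constant-power gains."""
        delta = self.target - self.current
        self.current += max(-self.max_step, min(self.max_step, delta))
        angle = (self.current + 1.0) * math.pi / 4.0
        gain_l, gain_r = math.cos(angle), math.sin(angle)
        return [s * gain_l for s in frame], [s * gain_r for s in frame]

panner = SpeakerPanner({"bob": -1.0, "jane": 1.0}, max_step=0.5)
panner.on_speaker_change("bob")
for _ in range(3):
    left, right = panner.render([1.0])
    print(round(panner.current, 2), round(left[0], 3), round(right[0], 3))
```

The obvious limitation, as the comment notes, is the one-speaker-at-a-time assumption: when two people talk over each other, both voices move together.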

  • Anyone able to use a VST plugin to mix conference audio in a stereo VoIP client? If so, here’s a perfect mixer:
    http://www.jeversi.com/atc/atc.htm

    I hear some of you wish you could try out this virtual audio space. I wrote the mixer & GUI, but I haven’t plugged it into a phone. Let me know if you can get it closer to real.