Considering Our SILKen Future

Much is being made of recent events in the IETF CODEC Working Group, specifically the fact that Skype has included the C source code for their SILK codec in the draft RFC document.

Dan York has some excellent coverage including a good general backgrounder on the matter. Jim Courtney has also posted something interesting, as has Phil Wolff of Skype Journal.

A lot of what is being expressed seems to me unbridled enthusiasm for what is seen as a bold, even surprising move on the part of Skype. I agree that this is a gutsy move…and one that I applaud. However, I’m also here to rein in the enthusiasm just a bit. Tempering it with a dose of reality, we can see this in a larger context and keep our eyes on the larger goal…ubiquitous wideband telephony.

I find that I don’t agree with the manner in which Jim belittles other efforts in the wideband realm. According to Jim:

“There are many efforts to incorporate “wideband (8 Khz)” HD Voice into the telecom infrastructure. GIPS has some excellent wideband codecs; Jeff Pulver is running HD Voice conferences. While HD Voice is definitely an improvement on today’s standard PSTN voice calls it is not up to the quality of SILK-supported calls.”

In my head I keep hearing “One ring to rule them all!” In reality there’s a little more to it than that. In fact, dismissing the other wideband efforts does the broader goal of ubiquitous wideband telephony a disservice on several levels.

First, I’d like to point out that the term “HDVoice” is somewhat vague. In its broadest interpretation it could be taken to mean anything better than the 8 KHz sample rate and the implied 3.4 KHz useful audio path that is the decades-old standard of the current PSTN. Perhaps the HDConnect group may have some more explicit definition arising from their effort to lobby regulators? Or not.

The term “Polycom HDVoice” is a registered trademark of Polycom, and so has a clearer definition in the scope of codecs that they offer in their products. Even that can mean anything from 16 KHz sampled G.722 in the SoundPoint desk phones to 32 KHz sampled G.722.1C, or even 48 KHz sampled Siren 22 or G.719 in the high-end conference room systems.

Some of those codecs equal or surpass SILK, at least on the simplistic basis of sampling rate. Yet SILK is in many ways very different.

SILK is a variable beasty. SILK is adaptive to its momentary network circumstance. It can vary its audio performance depending upon instantaneous network conditions.

This has considerable implications for its implementation on a very broad scale. Specifically, it means that we may need to architect networks so as to provide the codec with the statistical feedback that it needs to perform its adaptive magic.
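To make the idea of an adaptive codec a little more concrete, here is a minimal Python sketch of the kind of decision such a codec might make from network feedback. It is not SILK’s actual algorithm. The four sample rates are the ones SILK supports; the NetworkReport type, the choose_sample_rate function and all of the bitrate thresholds are hypothetical, invented purely for illustration.

```python
from dataclasses import dataclass

# Sample rates (Hz) that SILK supports, highest first. The bitrate floors
# paired with them are made-up illustration values, not taken from the codec.
MODES = [
    (24000, 30.0),  # super-wideband
    (16000, 20.0),  # wideband
    (12000, 14.0),  # mediumband
    (8000, 0.0),    # narrowband, always allowed as a fallback
]


@dataclass
class NetworkReport:
    """Hypothetical feedback record describing current network conditions."""
    loss_fraction: float   # 0.0 .. 1.0, packets lost / packets expected
    available_kbps: float  # estimate of usable bandwidth


def choose_sample_rate(report: NetworkReport) -> int:
    """Pick the highest sample rate whose bitrate floor fits the budget."""
    # Hypothetical policy: hold back part of the budget when loss is high,
    # capped at half, so less remains for audio bandwidth.
    reserve = min(report.loss_fraction * 2.0, 0.5)
    budget_kbps = report.available_kbps * (1.0 - reserve)
    for rate_hz, min_kbps in MODES:
        if budget_kbps >= min_kbps:
            return rate_hz
    return 8000


print(choose_sample_rate(NetworkReport(loss_fraction=0.0, available_kbps=40.0)))
print(choose_sample_rate(NetworkReport(loss_fraction=0.2, available_kbps=40.0)))
```

The point of the sketch is simply that the encoder’s choice depends on numbers that have to come from somewhere, which is why the question of how that feedback reaches the codec matters.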

Tim Panton of Phone From Here was the developer responsible for implementing SILK in the Blabbelon online gaming audio service. Since he is a developer familiar with the public API for SILK, I posed him some questions about its use. According to Tim,

“The main technical barrier to SILKs adoption may be the degree of re-write needed to extract accurate network stats from the RTP protocol stack at the receiving end and feed it back via RTCP to the encoder.”

…on the other hand, on last week’s VUC VoIPathon, BKW, one of the leaders of the FreeSwitch project, off-handedly commented,

“…it’s all happening in the media stream, so there’s nothing external that you have to worry about such as RTCP. It’s all happening in the media.”

BKW had SILK compiled and running with FreeSwitch within hours of the release of the IETF draft document that included the C source code.
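For the curious, the receiver-side statistics that Tim describes can be sketched in a few lines of Python. This is a simplified illustration, assuming the loss and interarrival jitter definitions from the RTP specification (RFC 3550). The ReceiverStats class and its method names are hypothetical, and the sketch ignores sequence number wrap-around and per-interval reporting, both of which a real stack must handle.

```python
class ReceiverStats:
    """Tracks loss and interarrival jitter for one incoming RTP stream."""

    def __init__(self, clock_rate_hz: int):
        self.clock_rate_hz = clock_rate_hz
        self.first_seq = None
        self.highest_seq = None
        self.received = 0
        self.jitter = 0.0  # running estimate, in RTP timestamp units
        self._prev_transit = None

    def on_packet(self, seq: int, rtp_timestamp: int, arrival_seconds: float) -> None:
        """Record one received packet (no sequence wrap-around handling)."""
        if self.first_seq is None:
            self.first_seq = self.highest_seq = seq
        elif seq > self.highest_seq:
            self.highest_seq = seq
        self.received += 1

        # Relative transit time: arrival converted to RTP clock units,
        # minus the sender's timestamp for this packet.
        transit = arrival_seconds * self.clock_rate_hz - rtp_timestamp
        if self._prev_transit is not None:
            d = abs(transit - self._prev_transit)
            # RFC 3550 smoothing: move 1/16 of the way toward the new sample.
            self.jitter += (d - self.jitter) / 16.0
        self._prev_transit = transit

    def loss_fraction(self) -> float:
        """Packets lost divided by packets expected, since the stream began."""
        if self.first_seq is None:
            return 0.0
        expected = self.highest_seq - self.first_seq + 1
        return max(0.0, (expected - self.received) / expected)


# Usage: 20 ms packets on a 16 KHz clock; packet 3 is lost, packet 5 is 10 ms late.
stats = ReceiverStats(clock_rate_hz=16000)
for seq, ts, arrival in [(1, 0, 0.000), (2, 320, 0.020), (4, 960, 0.060), (5, 1280, 0.090)]:
    stats.on_packet(seq, ts, arrival)
print(stats.loss_fraction(), stats.jitter)
```

A real RTP stack would carry numbers like these back to the sender in RTCP receiver reports. Whether SILK actually needs that path is exactly the point on which Tim and BKW seem to differ.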

I admit that I’m not a software engineer, so looking through all the source code attached to the draft RFC does me no good at all. Further, I don’t mean to denigrate SILK at all. I’m certain that it’s a brilliant piece of work, and I look forward to its examination by the IETF WB Codec working group. Actually, I hope to learn a lot by following along.

However, there has been a lot of codec myopia going around. The legacy PSTN codecs tend to encourage this view as they can be swapped in/out without needing to consider re-architecting other aspects of the network.

Clearly, Skype still have some work ahead of them in getting SILK to become an IETF standard. They’re going down that standards path, which is something that should be acknowledged, encouraged and applauded. Even so, I just don’t foresee a future with one wideband codec in all applications. I think that Doug Mohney has recently summed up the codec landscape nicely in “Welcome to the HD Voice Codec Wars.”

It will be literally years before SILK is deployed on a scale similar to G.722, at least as far as hardware end-points are concerned. Further, at this stage most of the effort in moving SILK forward falls to a relatively small group of people: Skype, the IETF, developers, manufacturers, etc. While they press on with those matters the rest of us can also further the cause by simply making the best use of HDVoice as it is currently available.

In so doing we’ll bring to light other issues that also need to be addressed. The matters of name-space, peering & security all need attention. To borrow an analogy, we’ll be fighting the battle on multiple fronts. And the more the public actually experiences HDVoice, the more they will want it for their daily use.

For now I don’t think that people should so easily cast aside G.722. It’s known to be patent-free and can be used today across an immense diversity of hardware and software, including legacy PSTN infrastructure.

Our experience with VUC calls is that 16 KHz sampling is plainly, dramatically better than 8 KHz sampling à la PSTN. We have also found that G.722 support in affordable hardware & software is broad and growing quickly.

For my Astricon 2009 presentation I prepared a set of audio files sampled at 8 KHz (G.711), 16 KHz (G.722) and 32 KHz (G.722.1C) to make comparisons simple and convenient. All are just the human voice, with a mix of male and female voices, in various languages and in various accents. I invite you to give these a listen.

From these two experiences I find that there is a diminishing return for ever-increasing sample rates. That is, at least if the primary concern is the human voice. The difference between G.711 and G.722 is dramatic and extremely useful. The difference between G.722 and G.722.1C (Siren14) is less so.

So I suspect that there’s a middle ground here. SILK will not be widely available on a diversity of hardware for daily use any time soon, but that’s ok.

On wireline networks, as found in corporate installations and CableCos, we can use existing royalty-free codecs like G.722 or its newer brethren. We can do this today, which is critical to promoting the very real merits of wideband telephony.

When it comes to wireless networks (aka cellular) we may still see AMR-WB as the dominant codec. Perhaps SILK can make some greater inroads into this space. Given that cellular handsets turn over faster than any other type of end-point, this is one market where the future may be very fluid.

At the very least SILK may bring competitive pressure that could drive down the license cost of AMR-WB. If that’s the sole achievement arising out of releasing SILK into the IETF process then it will still be a tremendous victory, although probably not the outcome that Skype is hoping for.

When the IETF WB Codec working group has completed its work we may well find that we have a new standard codec, and it might be SILK or something derived from SILK with some other ideas (CELTic?) thrown in. That will be great, and we can act based upon that result…when the time comes.

In the meantime, my employer is using wideband telephony right now…every day…based upon G.722. You can, too. There are very real benefits to be realized today, on today’s networks, with very little effort or cost. You don’t have to wait for some magical SILKen future.

HDVoice is not going to be great, some day. It’s great right now…and from the sound of everything going on it will only be getting better!

  • Great article, but you left out some very important codecs in the race for the best HD codec, like Speex, which has most of the advantages SILK has but has been completely open source from the start. It has gathered important attention from giants like Google, which is working to implement it in its Google Talk applications.

    I’d say SILK is great, but it was opened a bit too late, after Skype realized that something else out there might actually become the standard and not SILK itself. Now Skype is trying to prevent that from happening, but it will most likely need to do a lot if it wants to catch up with the rest.

    SILK doesn’t even appear as an option in most commercial PBXs and SBCs, but Speex is starting to be considered in some of them, and now huge companies like ACME are actually implementing it in their SBCs. So far SILK is falling way behind. But only time will tell what actually happens. Still, I’d think twice about naming SILK the big winner for future HD voice encoding.


    • The thing that makes codecs tricky is the legal issues. The real issue is the vague and illusory nature of some of the patents issued for voice compression. Speex has not seen widespread adoption at least in part because, while they rightfully claim that it’s not infringing on anyone’s patents, that assertion remains untested in court. Any company using Speex runs the risk of being sued by an aggrieved patent holder.

      The current IETF CODEC WG is actually considering CELT, the successor to Speex, along with SILK, the BroadVoice Codecs and something from SpiritDSP. While I’ve been following that activity from a distance it now seems like their work might result in the creation of something new, taking the best ideas from each submission.

      Further, the CPU & power constraints of mobile handsets are becoming a greater factor in the influence of codec design. Some have commented that Speex’s CPU requirements are too high for the resulting audio quality. I doubt very much that we’ll see much further market penetration on the part of Speex. CELT offers lower latency and greater scalability of sample rates.