OLD | NEW |
(Empty) | |
| 1 <?xml version="1.0" encoding="utf-8"?> |
| 2 <!DOCTYPE rfc SYSTEM 'rfc2629.dtd'> |
| 3 <?rfc toc="yes" symrefs="yes" ?> |
| 4 |
| 5 <rfc ipr="trust200902" category="std" docName="draft-ietf-codec-opus-14"> |
| 6 |
| 7 <front> |
| 8 <title abbrev="Interactive Audio Codec">Definition of the Opus Audio Codec</title> |
| 9 |
| 10 |
| 11 <author initials="JM" surname="Valin" fullname="Jean-Marc Valin"> |
| 12 <organization>Mozilla Corporation</organization> |
| 13 <address> |
| 14 <postal> |
| 15 <street>650 Castro Street</street> |
| 16 <city>Mountain View</city> |
| 17 <region>CA</region> |
| 18 <code>94041</code> |
| 19 <country>USA</country> |
| 20 </postal> |
| 21 <phone>+1 650 903-0800</phone> |
| 22 <email>jmvalin@jmvalin.ca</email> |
| 23 </address> |
| 24 </author> |
| 25 |
| 26 <author initials="K." surname="Vos" fullname="Koen Vos"> |
| 27 <organization>Skype Technologies S.A.</organization> |
| 28 <address> |
| 29 <postal> |
| 30 <street>Soder Malarstrand 43</street> |
| 31 <city>Stockholm</city> |
| 32 <region></region> |
| 33 <code>11825</code> |
| 34 <country>SE</country> |
| 35 </postal> |
| 36 <phone>+46 73 085 7619</phone> |
| 37 <email>koen.vos@skype.net</email> |
| 38 </address> |
| 39 </author> |
| 40 |
| 41 <author initials="T." surname="Terriberry" fullname="Timothy B. Terriberry"> |
| 42 <organization>Mozilla Corporation</organization> |
| 43 <address> |
| 44 <postal> |
| 45 <street>650 Castro Street</street> |
| 46 <city>Mountain View</city> |
| 47 <region>CA</region> |
| 48 <code>94041</code> |
| 49 <country>USA</country> |
| 50 </postal> |
| 51 <phone>+1 650 903-0800</phone> |
| 52 <email>tterriberry@mozilla.com</email> |
| 53 </address> |
| 54 </author> |
| 55 |
| 56 <date day="17" month="May" year="2012" /> |
| 57 |
| 58 <area>General</area> |
| 59 |
| 60 <workgroup></workgroup> |
| 61 |
| 62 <abstract> |
| 63 <t> |
| 64 This document defines the Opus interactive speech and audio codec. |
| 65 Opus is designed to handle a wide range of interactive audio applications, |
| 66 including Voice over IP, videoconferencing, in-game chat, and even live, |
| 67 distributed music performances. |
| 68 It scales from low bitrate narrowband speech at 6 kb/s to very high quality |
| 69 stereo music at 510 kb/s. |
| 70 Opus uses both linear prediction (LP) and the Modified Discrete Cosine |
| 71 Transform (MDCT) to achieve good compression of both speech and music. |
| 72 </t> |
| 73 </abstract> |
| 74 </front> |
| 75 |
| 76 <middle> |
| 77 |
| 78 <section anchor="introduction" title="Introduction"> |
| 79 <t> |
| 80 The Opus codec is a real-time interactive audio codec designed to meet the requirements |
| 81 described in <xref target="requirements"></xref>. |
| 82 It is composed of a linear |
| 83 prediction (LP)-based <xref target="LPC"/> layer and a Modified Discrete Cosine Transform |
| 84 (MDCT)-based <xref target="MDCT"/> layer. |
| 85 The main idea behind using two layers is that in speech, linear prediction |
| 86 techniques (such as Code-Excited Linear Prediction, or CELP) code low frequencies more efficiently than transform |
| 87 (e.g., MDCT) domain techniques, while the situation is reversed for music and |
| 88 higher speech frequencies. |
| 89 Thus a codec with both layers available can operate over a wider range than |
| 90 either one alone and, by combining them, achieve better quality than either |
| 91 one individually. |
| 92 </t> |
| 93 |
| 94 <t> |
| 95 The primary normative part of this specification is provided by the source code |
| 96 in <xref target="ref-implementation"></xref>. |
| 97 Only the decoder portion of this software is normative, though a |
| 98 significant amount of code is shared by both the encoder and decoder. |
| 99 <xref target="conformance"/> provides a decoder conformance test. |
| 100 The decoder contains a great deal of integer and fixed-point arithmetic which |
| 101 needs to be performed exactly, including all rounding considerations, so any |
| 102 useful specification requires domain-specific symbolic language to adequately |
| 103 define these operations. |
| 104 Additionally, any |
| 105 conflict between the symbolic representation and the included reference |
| 106 implementation must be resolved. For the practical reasons of compatibility and |
| 107 testability it would be advantageous to give the reference implementation |
| 108 priority in any disagreement. The C language is also one of the most |
| 109 widely understood human-readable symbolic representations for machine |
| 110 behavior. |
| 111 For these reasons this RFC uses the reference implementation as the sole |
| 112 symbolic representation of the codec. |
| 113 </t> |
| 114 |
| 115 <t>While the symbolic representation is unambiguous and complete, it is not |
| 116 always the easiest way to understand the codec's operation. For this reason |
| 117 this document also describes significant parts of the codec in English and |
| 118 takes the opportunity to explain the rationale behind many of the more |
| 119 surprising elements of the design. These descriptions are intended to be |
| 120 accurate and informative, but the limitations of common English sometimes |
| 121 result in ambiguity, so it is expected that the reader will always read |
| 122 them alongside the symbolic representation. Numerous references to the |
| 123 implementation are provided for this purpose. The descriptions sometimes |
| 124 differ from the reference in ordering or through mathematical simplification |
| 125 wherever such deviation makes an explanation easier to understand. |
| 126 For example, the right shift and left shift operations in the reference |
| 127 implementation are often described using division and multiplication in the text. |
| 128 In general, the text is focused on the "what" and "why" while the symbolic |
| 129 representation most clearly provides the "how". |
| 130 </t> |
| 131 |
| 132 <section anchor="notation" title="Notation and Conventions"> |
| 133 <t> |
| 134 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", |
| 135 "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be |
| 136 interpreted as described in RFC 2119 <xref target="rfc2119"></xref>. |
| 137 </t> |
| 138 <t> |
| 139 Various operations in the codec require bit-exact fixed-point behavior, even |
| 140 when writing a floating point implementation. |
| 141 The notation "Q<n>", where n is an integer, denotes the number of binary |
| 142 digits to the right of the binary point in a fixed-point number. |
| 143 For example, a signed Q14 value in a 16-bit word can represent values from |
| 144 -2.0 to 1.99993896484375, inclusive. |
| 145 This notation is for informational purposes only. |
| 146 Arithmetic, when described, always operates on the underlying integer. |
| 147 E.g., the text will explicitly indicate any shifts required after a |
| 148 multiplication. |
| 149 </t> |
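As an illustration of the Q notation, the following C sketch interprets a raw 16-bit integer as a Q14 value and multiplies two Q14 values, applying the shift that brings the Q28 product back to Q14. The names q14_to_double and q14_mul are hypothetical and are not taken from the reference implementation. The sketch truncates instead of rounding, assumes the result fits in 16 bits, and assumes the compiler implements right shifts of negative values as arithmetic shifts.

```c
#include <stdint.h>

#define Q14_SHIFT 14

/* Interpret a raw Q14 integer as a real number (for display only;
   codec arithmetic always operates on the underlying integer). */
static double q14_to_double(int16_t raw)
{
    return (double)raw / (double)(1 << Q14_SHIFT);
}

/* Multiply two Q14 values. The 32-bit product has 28 fractional bits,
   so it is shifted right by 14 to return to Q14. Assumption: truncation
   is acceptable and the result fits in 16 bits. */
static int16_t q14_mul(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;  /* Q28 */
    return (int16_t)(product >> Q14_SHIFT);     /* back to Q14 */
}
```

For instance, q14_mul(8192, 8192) multiplies 0.5 by 0.5 and yields the raw value 4096, which is 0.25 in Q14.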
| 150 <t> |
| 151 Expressions, where included in the text, follow C operator rules and |
| 152 precedence, with the exception that the syntax "x**y" indicates x raised to |
| 153 the power y. |
| 154 The text also makes use of the following functions: |
| 155 </t> |
| 156 |
| 157 <section anchor="min" toc="exclude" title="min(x,y)"> |
| 158 <t> |
| 159 The smaller of two values x and y. |
| 160 </t> |
| 161 </section> |
| 162 |
| 163 <section anchor="max" toc="exclude" title="max(x,y)"> |
| 164 <t> |
| 165 The larger of two values x and y. |
| 166 </t> |
| 167 </section> |
| 168 |
| 169 <section anchor="clamp" toc="exclude" title="clamp(lo,x,hi)"> |
| 170 <figure align="center"> |
| 171 <artwork align="center"><![CDATA[ |
| 172 clamp(lo,x,hi) = max(lo,min(x,hi)) |
| 173 ]]></artwork> |
| 174 </figure> |
| 175 <t> |
| 176 With this definition, if lo > hi, the lower bound is the one that |
| 177 is enforced. |
| 178 </t> |
| 179 </section> |
| 180 |
| 181 <section anchor="sign" toc="exclude" title="sign(x)"> |
| 182 <t> |
| 183 The sign of x, i.e., |
| 184 <figure align="center"> |
| 185 <artwork align="center"><![CDATA[ |
| 186 ( -1, x < 0 , |
| 187 sign(x) = < 0, x == 0 , |
| 188 ( 1, x > 0 . |
| 189 ]]></artwork> |
| 190 </figure> |
| 191 </t> |
| 192 </section> |
| 193 |
| 194 <section anchor="abs" toc="exclude" title="abs(x)"> |
| 195 <t> |
| 196 The absolute value of x, i.e., |
| 197 <figure align="center"> |
| 198 <artwork align="center"><![CDATA[ |
| 199 abs(x) = sign(x)*x . |
| 200 ]]></artwork> |
| 201 </figure> |
| 202 </t> |
| 203 </section> |
| 204 |
| 205 <section anchor="floor" toc="exclude" title="floor(f)"> |
| 206 <t> |
| 207 The largest integer z such that z <= f. |
| 208 </t> |
| 209 </section> |
| 210 |
| 211 <section anchor="ceil" toc="exclude" title="ceil(f)"> |
| 212 <t> |
| 213 The smallest integer z such that z >= f. |
| 214 </t> |
| 215 </section> |
| 216 |
| 217 <section anchor="round" toc="exclude" title="round(f)"> |
| 218 <t> |
| 219 The integer z nearest to f, with ties rounded towards negative infinity, |
| 220 i.e., |
| 221 <figure align="center"> |
| 222 <artwork align="center"><![CDATA[ |
| 223 round(f) = ceil(f - 0.5) . |
| 224 ]]></artwork> |
| 225 </figure> |
| 226 </t> |
| 227 </section> |
| 228 |
| 229 <section anchor="log2" toc="exclude" title="log2(f)"> |
| 230 <t> |
| 231 The base-two logarithm of f. |
| 232 </t> |
| 233 </section> |
| 234 |
| 235 <section anchor="ilog" toc="exclude" title="ilog(n)"> |
| 236 <t> |
| 237 The minimum number of bits required to store a positive integer n in binary, |
| 238 or 0 for a non-positive integer n. |
| 239 <figure align="center"> |
| 240 <artwork align="center"><![CDATA[ |
| 241 ( 0, n <= 0, |
| 242 ilog(n) = < |
| 243 ( floor(log2(n))+1, n > 0 |
| 244 ]]></artwork> |
| 245 </figure> |
| 246 Examples: |
| 247 <list style="symbols"> |
| 248 <t>ilog(-1) = 0</t> |
| 249 <t>ilog(0) = 0</t> |
| 250 <t>ilog(1) = 1</t> |
| 251 <t>ilog(2) = 2</t> |
| 252 <t>ilog(3) = 2</t> |
| 253 <t>ilog(4) = 3</t> |
| 254 <t>ilog(7) = 3</t> |
| 255 </list> |
| 256 </t> |
| 257 </section> |
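The helper functions defined above can be sketched in C as follows. This is an illustrative sketch only: the function names (min_int, clamp_int, round_ties_down, and so on) are hypothetical, the integer helpers are restricted to int, abs_int overflows for the most negative int, and round_ties_down assumes its argument fits in a long.

```c
/* min(x,y) and max(x,y) */
static int min_int(int x, int y) { return x < y ? x : y; }
static int max_int(int x, int y) { return x > y ? x : y; }

/* clamp(lo,x,hi) = max(lo,min(x,hi)); if lo > hi, lo is enforced. */
static int clamp_int(int lo, int x, int hi)
{
    return max_int(lo, min_int(x, hi));
}

/* sign(x): -1, 0, or 1 */
static int sign_int(int x) { return (x > 0) - (x < 0); }

/* abs(x) = sign(x)*x */
static int abs_int(int x) { return sign_int(x) * x; }

/* ceil(f), written without the math library. */
static long ceil_to_long(double f)
{
    long t = (long)f;               /* truncates toward zero */
    return (f > (double)t) ? t + 1 : t;
}

/* round(f) = ceil(f - 0.5): ties go toward negative infinity. */
static long round_ties_down(double f) { return ceil_to_long(f - 0.5); }

/* ilog(n): 0 for n <= 0, otherwise floor(log2(n)) + 1. */
static int ilog(int n)
{
    int bits = 0;
    while (n > 0) {
        bits++;
        n >>= 1;
    }
    return bits;
}
```

With these definitions, clamp_int(5, 3, 2) returns 5, showing that the lower bound wins when lo > hi, and round_ties_down(0.5) returns 0.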
| 258 |
| 259 </section> |
| 260 |
| 261 </section> |
| 262 |
| 263 <section anchor="overview" title="Opus Codec Overview"> |
| 264 |
| 265 <t> |
| 266 The Opus codec scales from 6 kb/s narrowband mono speech to 510 kb/s |
| 267 fullband stereo music, with algorithmic delays ranging from 5 ms to |
| 268 65.2 ms. |
| 269 At any given time, either the LP layer, the MDCT layer, or both, may be active. |
| 270 It can seamlessly switch between all of its various operating modes, giving it |
| 271 a great deal of flexibility to adapt to varying content and network |
| 272 conditions without renegotiating the current session. |
| 273 The codec allows input and output of various audio bandwidths, defined as |
| 274 follows: |
| 275 </t> |
| 276 <texttable anchor="audio-bandwidth"> |
| 277 <ttcol>Abbreviation</ttcol> |
| 278 <ttcol align="right">Audio Bandwidth</ttcol> |
| 279 <ttcol align="right">Sample Rate (Effective)</ttcol> |
| 280 <c>NB (narrowband)</c> <c>4 kHz</c> <c>8 kHz</c> |
| 281 <c>MB (medium-band)</c> <c>6 kHz</c> <c>12 kHz</c> |
| 282 <c>WB (wideband)</c> <c>8 kHz</c> <c>16 kHz</c> |
| 283 <c>SWB (super-wideband)</c> <c>12 kHz</c> <c>24 kHz</c> |
| 284 <c>FB (fullband)</c> <c>20 kHz (*)</c> <c>48 kHz</c> |
| 285 </texttable> |
| 286 <t> |
| 287 (*) Although the sampling theorem allows a bandwidth as large as half the |
| 288 sampling rate, Opus never codes audio above 20 kHz, as that is the |
| 289 generally accepted upper limit of human hearing. |
| 290 </t> |
| 291 |
| 292 <t> |
| 293 Opus defines super-wideband (SWB) with an effective sample rate of 24 kHz, |
| 294 unlike some other audio coding standards that use 32 kHz. |
| 295 This was chosen for a number of reasons. |
| 296 The band layout in the MDCT layer naturally allows skipping coefficients for |
| 297 frequencies over 12 kHz, but does not allow cleanly dropping just those |
| 298 frequencies over 16 kHz. |
| 299 A sample rate of 24 kHz also makes resampling in the MDCT layer easier, |
| 300 as 24 evenly divides 48, and when 24 kHz is sufficient, it can save |
| 301 computation in other processing, such as Acoustic Echo Cancellation (AEC). |
| 302 Experimental changes to the band layout to allow a 16 kHz cutoff |
| 303 (32 kHz effective sample rate) showed potential quality degradations at |
| 304 other sample rates, and at typical bitrates the number of bits saved by using |
| 305 such a cutoff instead of coding in fullband (FB) mode is very small. |
| 306 Therefore, if an application wishes to process a signal sampled at 32 kHz, |
| 307 it should just use FB. |
| 308 </t> |
| 309 |
| 310 <t> |
| 311 The LP layer is based on the SILK codec |
| 312 <xref target="SILK"></xref>. |
| 313 It supports NB, MB, or WB audio and frame sizes from 10 ms to 60 ms, |
| 314 and requires an additional 5 ms look-ahead for noise shaping estimation. |
| 315 A small additional delay (up to 1.5 ms) may be required for sampling rate |
| 316 conversion. |
| 317 Like Vorbis <xref target='Vorbis-website'/> and many other modern codecs, SILK is inherently designed for |
| 318 variable-bitrate (VBR) coding, though the encoder can also produce |
| 319 constant-bitrate (CBR) streams. |
| 320 The version of SILK used in Opus is substantially modified from, and not |
| 321 compatible with, the stand-alone SILK codec previously deployed by Skype. |
| 322 This document does not serve to define that format, but those interested in the |
| 323 original SILK codec should see <xref target="SILK"/> instead. |
| 324 </t> |
| 325 |
| 326 <t> |
| 327 The MDCT layer is based on the CELT codec <xref target="CELT"></xref>. |
| 328 It supports NB, WB, SWB, or FB audio and frame sizes from 2.5 ms to |
| 329 20 ms, and requires an additional 2.5 ms look-ahead due to the |
| 330 overlapping MDCT windows. |
| 331 The CELT codec is inherently designed for CBR coding, but unlike many CBR |
| 332 codecs it is not limited to a set of predetermined rates. |
| 333 It internally allocates bits to exactly fill any given target budget, and an |
| 334 encoder can produce a VBR stream by varying the target on a per-frame basis. |
| 335 The MDCT layer is not used for speech when the audio bandwidth is WB or less, |
| 336 as it is not useful there. |
| 337 On the other hand, non-speech signals are not always adequately coded using |
| 338 linear prediction, so the MDCT layer alone should be used for music. |
| 339 </t> |
| 340 |
| 341 <t> |
| 342 A "Hybrid" mode allows the use of both layers simultaneously with a frame size |
| 343 of 10 or 20 ms and a SWB or FB audio bandwidth. |
| 344 The LP layer codes the low frequencies by resampling the signal down to WB. |
| 345 The MDCT layer follows, coding the high frequency portion of the signal. |
| 346 The cutoff between the two lies at 8 kHz, the maximum WB audio bandwidth. |
| 347 In the MDCT layer, all bands below 8 kHz are discarded, so there is no |
| 348 coding redundancy between the two layers. |
| 349 </t> |
| 350 |
| 351 <t> |
| 352 The sample rate (in contrast to the actual audio bandwidth) can be chosen |
| 353 independently on the encoder and decoder side, e.g., a fullband signal can be |
| 354 decoded as wideband, or vice versa. |
| 355 This approach ensures a sender and receiver can always interoperate, regardless |
| 356 of the capabilities of their actual audio hardware. |
| 357 Internally, the LP layer always operates at a sample rate of twice the audio |
| 358 bandwidth, up to a maximum of 16 kHz, which it continues to use for SWB |
| 359 and FB. |
| 360 The decoder simply resamples its output to support different sample rates. |
| 361 The MDCT layer always operates internally at a sample rate of 48 kHz. |
| 362 Since all the supported sample rates evenly divide this rate, and since the |
| 363 decoder may easily zero out the high frequency portion of the spectrum in |
| 364 the frequency domain, it can simply decimate the MDCT layer output to achieve |
| 365 the other supported sample rates very cheaply. |
| 366 </t> |
| 367 |
| 368 <t> |
| 369 After conversion to the common, desired output sample rate, the decoder simply |
| 370 adds the output from the two layers together. |
| 371 To compensate for the different look-ahead required by each layer, the CELT |
| 372 encoder input is delayed by an additional 2.7 ms. |
| 373 This ensures that low frequencies and high frequencies arrive at the same time. |
| 374 This extra delay may be reduced by an encoder by using less look-ahead for noise |
| 375 shaping or using a simpler resampler in the LP layer, but this will reduce |
| 376 quality. |
| 377 However, the base 2.5 ms look-ahead in the CELT layer cannot be reduced in |
| 378 the encoder because it is needed for the MDCT overlap, whose size is fixed by |
| 379 the decoder. |
| 380 </t> |
| 381 |
| 382 <t> |
| 383 Both layers use the same entropy coder, avoiding any waste from "padding bits" |
| 384 between them. |
| 385 The hybrid approach makes it easy to support both CBR and VBR coding. |
| 386 Although the LP layer is VBR, the bit allocation of the MDCT layer can produce |
| 387 a final stream that is CBR by using all the bits left unused by the LP layer. |
| 388 </t> |
| 389 |
| 390 <section title="Control Parameters"> |
| 391 <t> |
| 392 The Opus codec includes a number of control parameters which can be changed dynamically during |
| 393 regular operation of the codec, without interrupting the audio stream from the encoder to the decoder. |
| 394 These parameters only affect the encoder since any impact they have on the bit-stream is signaled |
| 395 in-band such that a decoder can decode any Opus stream without any out-of-band signaling. Any Opus |
| 396 implementation can add or modify these control parameters without affecting interoperability. The most |
| 397 important encoder control parameters in the reference encoder are listed below. |
| 398 </t> |
| 399 |
| 400 <section title="Bitrate" toc="exclude"> |
| 401 <t> |
| 402 Opus supports all bitrates from 6 kb/s to 510 kb/s. All other parameters being |
| 403 equal, higher bitrate results in higher quality. For a frame size of 20 ms, these |
| 404 are the bitrate "sweet spots" for Opus in various configurations: |
| 405 <list style="symbols"> |
| 406 <t>8-12 kb/s for NB speech,</t> |
| 407 <t>16-20 kb/s for WB speech,</t> |
| 408 <t>28-40 kb/s for FB speech,</t> |
| 409 <t>48-64 kb/s for FB mono music, and</t> |
| 410 <t>64-128 kb/s for FB stereo music.</t> |
| 411 </list> |
| 412 </t> |
| 413 </section> |
| 414 |
| 415 <section title="Number of Channels (Mono/Stereo)" toc="exclude"> |
| 416 <t> |
| 417 Opus can transmit either mono or stereo frames within a single stream. |
| 418 When decoding a mono frame in a stereo decoder, the left and right channels are |
| 419 identical, and when decoding a stereo frame in a mono decoder, the mono output |
| 420 is the average of the left and right channels. |
| 421 In some cases, it is desirable to encode a stereo input stream in mono (e.g., |
| 422 because the bitrate is too low to encode stereo with sufficient quality). |
| 423 The number of channels encoded can be selected in real-time, but by default the |
| 424 reference encoder attempts to make the best decision possible given the |
| 425 current bitrate. |
| 426 </t> |
| 427 </section> |
| 428 |
| 429 <section title="Audio Bandwidth" toc="exclude"> |
| 430 <t> |
| 431 The audio bandwidths supported by Opus are listed in |
| 432 <xref target="audio-bandwidth"/>. |
| 433 As with the number of channels, any decoder can decode audio encoded at |
| 434 any bandwidth. |
| 435 For example, any Opus decoder operating at 8 kHz can decode a FB Opus |
| 436 frame, and any Opus decoder operating at 48 kHz can decode a NB frame. |
| 437 Similarly, the reference encoder can take a 48 kHz input signal and |
| 438 encode it as NB. |
| 439 The higher the audio bandwidth, the higher the required bitrate to achieve |
| 440 acceptable quality. |
| 441 The audio bandwidth can be explicitly specified in real-time, but by default |
| 442 the reference encoder attempts to make the best bandwidth decision possible |
| 443 given the current bitrate. |
| 444 </t> |
| 445 </section> |
| 446 |
| 447 |
| 448 <section title="Frame Duration" toc="exclude"> |
| 449 <t> |
| 450 Opus can encode frames of 2.5, 5, 10, 20, 40 or 60 ms. |
| 451 It can also combine multiple frames into packets of up to 120 ms. |
| 452 For real-time applications, sending fewer packets per second reduces the |
| 453 bitrate, since it reduces the overhead from IP, UDP, and RTP headers. |
| 454 However, it increases latency and sensitivity to packet losses, as losing one |
| 455 packet constitutes a loss of a bigger chunk of audio. |
| 456 Increasing the frame duration also slightly improves coding efficiency, but the |
| 457 gain becomes small for frame sizes above 20 ms. |
| 458 For this reason, 20 ms frames are a good choice for most applications. |
| 459 </t> |
| 460 </section> |
| 461 |
| 462 <section title="Complexity" toc="exclude"> |
| 463 <t> |
| 464 There are various aspects of the Opus encoding process where trade-offs |
| 465 can be made between CPU complexity and quality/bitrate. In the reference |
| 466 encoder, the complexity is selected using an integer from 0 to 10, where |
| 467 0 is the lowest complexity and 10 is the highest. Examples of |
| 468 computations for which such trade-offs may occur are: |
| 469 <list style="symbols"> |
| 470 <t>The order of the pitch analysis whitening filter <xref target="Whitening"/>,</t> |
| 471 <t>The order of the short-term noise shaping filter,</t> |
| 472 <t>The number of states in delayed decision quantization of the |
| 473 residual signal, and</t> |
| 474 <t>The use of certain bit-stream features such as variable time-frequency |
| 475 resolution and the pitch post-filter.</t> |
| 476 </list> |
| 477 </t> |
| 478 </section> |
| 479 |
| 480 <section title="Packet Loss Resilience" toc="exclude"> |
| 481 <t> |
| 482 Audio codecs often exploit inter-frame correlations to reduce the |
| 483 bitrate at a cost in error propagation: after losing one packet |
| 484 several packets need to be received before the decoder is able to |
| 485 accurately reconstruct the speech signal. The extent to which Opus |
| 486 exploits inter-frame dependencies can be adjusted on the fly to |
| 487 choose a trade-off between bitrate and amount of error propagation. |
| 488 </t> |
| 489 </section> |
| 490 |
| 491 <section title="Forward Error Correction (FEC)" toc="exclude"> |
| 492 <t> |
| 493 Another mechanism providing robustness against packet loss is the in-band |
| 494 Forward Error Correction (FEC). Packets that are determined to |
| 495 contain perceptually important speech information, such as onsets or |
| 496 transients, are encoded again at a lower bitrate and this re-encoded |
| 497 information is added to a subsequent packet. |
| 498 </t> |
| 499 </section> |
| 500 |
| 501 <section title="Constant/Variable Bitrate" toc="exclude"> |
| 502 <t> |
| 503 Opus is more efficient when operating with variable bitrate (VBR), which is |
| 504 the default. However, in some (rare) applications, constant bitrate (CBR) |
| 505 is required. There are two main reasons to operate in CBR mode: |
| 506 <list style="symbols"> |
| 507 <t>When the transport only supports a fixed size for each compressed frame, or</t> |
| 508 <t>When encryption is used for an audio stream that is either highly constrained |
| 509 (e.g., yes/no, recorded prompts) or highly sensitive <xref target="SRTP-VBR"></xref>.</t> |
| 510 </list> |
| 511 |
| 512 When low-latency transmission is required over a relatively slow connection, then |
| 513 constrained VBR can also be used. This uses VBR in a way that simulates a |
| 514 "bit reservoir" and is equivalent to what MP3 (MPEG 1, Layer 3) and |
| 515 AAC (Advanced Audio Coding) call CBR (i.e., not true |
| 516 CBR due to the bit reservoir). |
| 517 </t> |
| 518 </section> |
| 519 |
| 520 <section title="Discontinuous Transmission (DTX)" toc="exclude"> |
| 521 <t> |
| 522 Discontinuous Transmission (DTX) reduces the bitrate during silence |
| 523 or background noise. When DTX is enabled, only one frame is encoded |
| 524 every 400 milliseconds. |
| 525 </t> |
| 526 </section> |
| 527 |
| 528 </section> |
| 529 |
| 530 </section> |
| 531 |
| 532 <section anchor="modes" title="Internal Framing"> |
| 533 |
| 534 <t> |
| 535 The Opus encoder produces "packets", which are each a contiguous set of bytes |
| 536 meant to be transmitted as a single unit. |
| 537 The packets described here do not include such things as IP, UDP, or RTP |
| 538 headers which are normally found in a transport-layer packet. |
| 539 A single packet may contain multiple audio frames, so long as they share a |
| 540 common set of parameters, including the operating mode, audio bandwidth, frame |
| 541 size, and channel count (mono vs. stereo). |
| 542 This section describes the possible combinations of these parameters and the |
| 543 internal framing used to pack multiple frames into a single packet. |
| 544 This framing is not self-delimiting. |
| 545 Instead, it assumes that a higher layer (such as UDP or RTP <xref target='RFC3550'/> |
| 546 or Ogg <xref target='RFC3533'/> or Matroska <xref target='Matroska-website'/>) |
| 547 will communicate the length, in bytes, of the packet, and it uses this |
| 548 information to reduce the framing overhead in the packet itself. |
| 549 A decoder implementation MUST support the framing described in this section. |
| 550 An alternative, self-delimiting variant of the framing is described in |
| 551 <xref target="self-delimiting-framing"/>. |
| 552 Support for that variant is OPTIONAL. |
| 553 </t> |
| 554 |
| 555 <t> |
| 556 All bit diagrams in this document number the bits so that bit 0 is the most |
| 557 significant bit of the first byte, and bit 7 is the least significant. |
| 558 Bit 8 is thus the most significant bit of the second byte, etc. |
| 559 Well-formed Opus packets obey certain requirements, marked [R1] through [R7] |
| 560 below. |
| 561 These are summarized in <xref target="malformed-packets"/> along with |
| 562 appropriate means of handling malformed packets. |
| 563 </t> |
| 564 |
| 565 <section anchor="toc_byte" title="The TOC Byte"> |
| 566 <t anchor="R1"> |
| 567 A well-formed Opus packet MUST contain at least one byte [R1]. |
| 568 This byte forms a table-of-contents (TOC) header that signals which of the |
| 569 various modes and configurations a given packet uses. |
| 570 It is composed of a configuration number, "config", a stereo flag, "s", and a |
| 571 frame count code, "c", arranged as illustrated in |
| 572 <xref target="toc_byte_fig"/>. |
| 573 A description of each of these fields follows. |
| 574 </t> |
| 575 |
| 576 <figure anchor="toc_byte_fig" title="The TOC Byte"> |
| 577 <artwork align="center"><![CDATA[ |
| 578 0 |
| 579 0 1 2 3 4 5 6 7 |
| 580 +-+-+-+-+-+-+-+-+ |
| 581 | config |s| c | |
| 582 +-+-+-+-+-+-+-+-+ |
| 583 ]]></artwork> |
| 584 </figure> |
| 585 |
| 586 <t> |
| 587 The top five bits of the TOC byte, labeled "config", encode one of 32 possible |
| 588 configurations of operating mode, audio bandwidth, and frame size. |
| 589 As described, the LP (SILK) layer and MDCT (CELT) layer can be combined in three possible |
| 590 operating modes: |
| 591 <list style="numbers"> |
| 592 <t>A SILK-only mode for use in low bitrate connections with an audio bandwidth |
| 593 of WB or less,</t> |
| 594 <t>A Hybrid (SILK+CELT) mode for SWB or FB speech at medium bitrates, and</t> |
| 595 <t>A CELT-only mode for very low delay speech transmission as well as music |
| 596 transmission (NB to FB).</t> |
| 597 </list> |
| 598 The 32 possible configurations each identify which one of these operating modes |
| 599 the packet uses, as well as the audio bandwidth and the frame size. |
| 600 <xref target="config_bits"/> lists the parameters for each configuration. |
| 601 </t> |
| 602 <texttable anchor="config_bits" title="TOC Byte Configuration Parameters"> |
| 603 <ttcol>Configuration Number(s)</ttcol> |
| 604 <ttcol>Mode</ttcol> |
| 605 <ttcol>Bandwidth</ttcol> |
| 606 <ttcol>Frame Sizes</ttcol> |
| 607 <c>0...3</c> <c>SILK-only</c> <c>NB</c> <c>10, 20, 40, 60 ms</c> |
| 608 <c>4...7</c> <c>SILK-only</c> <c>MB</c> <c>10, 20, 40, 60 ms</c> |
| 609 <c>8...11</c> <c>SILK-only</c> <c>WB</c> <c>10, 20, 40, 60 ms</c> |
| 610 <c>12...13</c> <c>Hybrid</c> <c>SWB</c> <c>10, 20 ms</c> |
| 611 <c>14...15</c> <c>Hybrid</c> <c>FB</c> <c>10, 20 ms</c> |
| 612 <c>16...19</c> <c>CELT-only</c> <c>NB</c> <c>2.5, 5, 10, 20 ms</c> |
| 613 <c>20...23</c> <c>CELT-only</c> <c>WB</c> <c>2.5, 5, 10, 20 ms</c> |
| 614 <c>24...27</c> <c>CELT-only</c> <c>SWB</c> <c>2.5, 5, 10, 20 ms</c> |
| 615 <c>28...31</c> <c>CELT-only</c> <c>FB</c> <c>2.5, 5, 10, 20 ms</c> |
| 616 </texttable> |
| 617 <t> |
| 618 The configuration numbers in each range (e.g., 0...3 for NB SILK-only) |
| 619 correspond to the various choices of frame size, in the same order. |
| 620 For example, configuration 0 has a 10 ms frame size and configuration 3 |
| 621 has a 60 ms frame size. |
| 622 </t> |
| 623 |
| 624 <t> |
| 625 One additional bit, labeled "s", signals mono vs. stereo, with 0 indicating |
| 626 mono and 1 indicating stereo. |
| 627 </t> |
| 628 |
| 629 <t> |
| 630 The remaining two bits of the TOC byte, labeled "c", code the number of frames |
| 631 per packet (codes 0 to 3) as follows: |
| 632 <list style="symbols"> |
| 633 <t>0: 1 frame in the packet</t> |
| 634 <t>1: 2 frames in the packet, each with equal compressed size</t> |
| 635 <t>2: 2 frames in the packet, with different compressed sizes</t> |
| 636 <t>3: an arbitrary number of frames in the packet</t> |
| 637 </list> |
| 638 This draft refers to a packet as a code 0 packet, code 1 packet, etc., based on |
| 639 the value of "c". |
| 640 </t> |
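The TOC byte layout and the configuration table above can be sketched as a small C parser. This is an illustrative sketch, not the reference decoder: the type and function names (toc_info, parse_toc, and the enum constants) are hypothetical, and frame sizes are stored in microseconds only so that 2.5 ms can be held in an integer.

```c
#include <stdint.h>

typedef enum { MODE_SILK_ONLY, MODE_HYBRID, MODE_CELT_ONLY } toc_mode;
typedef enum { BW_NB, BW_MB, BW_WB, BW_SWB, BW_FB } toc_bandwidth;

typedef struct {
    int config;           /* 0...31, top five bits */
    toc_mode mode;
    toc_bandwidth bandwidth;
    int frame_size_us;    /* frame size in microseconds */
    int stereo;           /* "s": 0 = mono, 1 = stereo */
    int code;             /* "c": frame count code, 0...3 */
} toc_info;

/* Split a TOC byte into its fields and look up the configuration. */
static toc_info parse_toc(uint8_t toc)
{
    static const int silk_sizes_us[4] = {10000, 20000, 40000, 60000};
    static const int celt_sizes_us[4] = {2500, 5000, 10000, 20000};
    static const toc_bandwidth silk_bw[3] = {BW_NB, BW_MB, BW_WB};
    static const toc_bandwidth celt_bw[4] = {BW_NB, BW_WB, BW_SWB, BW_FB};
    toc_info info;

    info.config = toc >> 3;
    info.stereo = (toc >> 2) & 1;
    info.code = toc & 3;

    if (info.config < 12) {            /* 0...11: SILK-only */
        info.mode = MODE_SILK_ONLY;
        info.bandwidth = silk_bw[info.config >> 2];
        info.frame_size_us = silk_sizes_us[info.config & 3];
    } else if (info.config < 16) {     /* 12...15: Hybrid */
        info.mode = MODE_HYBRID;
        info.bandwidth = (info.config < 14) ? BW_SWB : BW_FB;
        info.frame_size_us = (info.config & 1) ? 20000 : 10000;
    } else {                           /* 16...31: CELT-only */
        info.mode = MODE_CELT_ONLY;
        info.bandwidth = celt_bw[(info.config - 16) >> 2];
        info.frame_size_us = celt_sizes_us[info.config & 3];
    }
    return info;
}
```

For example, the byte 0x69 (binary 01101 0 01) decodes as configuration 13, i.e., Hybrid SWB with 20 ms frames, mono, in a code 1 packet.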
| 641 |
| 642 </section> |
| 643 |
| 644 <section title="Frame Packing"> |
| 645 |
| 646 <t> |
| 647 This section describes how frames are packed according to each possible value |
| 648 of "c" in the TOC byte. |
| 649 </t> |
| 650 |
| 651 <section anchor="frame-length-coding" title="Frame Length Coding"> |
| 652 <t> |
| 653 When a packet contains multiple VBR frames (i.e., code 2 or 3), the compressed |
| 654 length of one or more of these frames is indicated with a one- or two-byte |
| 655 sequence, with the meaning of the first byte as follows: |
| 656 <list style="symbols"> |
| 657 <t>0: No frame (discontinuous transmission (DTX) or lost packet)</t> |
| 658 <t>1...251: Length of the frame in bytes</t> |
| 659 <t>252...255: A second byte is needed. The total length is (second_byte*4)+first_byte</t> |
| 660 </list> |
| 661 </t> |
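The one- or two-byte length coding above can be sketched in C as follows. The function names and the error convention (returning -1) are assumptions made for this illustration. The encoder shown is one way to produce a byte pair that satisfies the decoding rule; it is not claimed to match the reference encoder line for line.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_FRAME_BYTES 1275  /* 255*4 + 255 */

/* Decode a frame length from data[0..avail-1].
   Returns the number of bytes consumed (1 or 2), or -1 if the
   buffer is too short. */
static int decode_frame_length(const uint8_t *data, size_t avail, int *length)
{
    if (avail < 1)
        return -1;
    if (data[0] < 252) {           /* 0...251: length in one byte */
        *length = data[0];
        return 1;
    }
    if (avail < 2)                 /* 252...255 needs a second byte */
        return -1;
    *length = 4 * data[1] + data[0];
    return 2;
}

/* Encode a frame length. Returns the number of bytes written
   (1 or 2), or -1 if the length is not representable. */
static int encode_frame_length(int length, uint8_t out[2])
{
    if (length < 0 || length > MAX_FRAME_BYTES)
        return -1;
    if (length < 252) {
        out[0] = (uint8_t)length;
        return 1;
    }
    /* Choose the first byte in 252...255 so that the remainder
       is a multiple of 4. */
    out[0] = (uint8_t)(252 + (length & 3));
    out[1] = (uint8_t)((length - out[0]) >> 2);
    return 2;
}
```

Under this sketch, a length of 1275 is coded as the two bytes 255, 255, and a length of 252 as 252, 0.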
| 662 |
| 663 <t> |
| 664 The special length 0 indicates that no frame is available, either because it |
| 665 was dropped during transmission by some intermediary or because the encoder |
| 666 chose not to transmit it. |
| 667 Any Opus frame in any mode MAY have a length of 0. |
| 668 </t> |
| 669 |
| 670 <t> |
| 671 The maximum representable length is 255*4+255=1275 bytes. |
| 672 For 20 ms frames, this represents a bitrate of 510 kb/s, which is |
| 673 approximately the highest useful rate for lossily compressed fullband stereo |
| 674 music. |
| 675 Beyond this point, lossless codecs are more appropriate. |
| 676 It is also roughly the maximum useful rate of the MDCT layer, as shortly |
| 677 thereafter quality no longer improves with additional bits due to limitations |
| 678 on the codebook sizes. |
| 679 </t> |
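The relationship between the maximum frame length and the 510 kb/s figure can be checked with a short calculation. The helper below is a hypothetical illustration; it counts only the compressed frame bytes and ignores the TOC byte and any transport headers.

```c
#include <stdint.h>

/* Bitrate in bits per second of a frame of frame_bytes bytes that
   covers frame_us microseconds of audio. */
static int64_t frame_bitrate_bps(int frame_bytes, int frame_us)
{
    return (int64_t)frame_bytes * 8 * 1000000 / frame_us;
}
```

Here frame_bitrate_bps(1275, 20000) evaluates to 510000, i.e., 1275 bytes * 8 bits / 0.020 s = 510 kb/s.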
| 680 |
| 681 <t anchor="R2"> |
| 682 No length is transmitted for the last frame in a VBR packet, or for any of the |
| 683 frames in a CBR packet, as it can be inferred from the total size of the |
| 684 packet and the size of all other data in the packet. |
| 685 However, the length of any individual frame MUST NOT exceed |
| 686 1275 bytes [R2], to allow for repacketization by gateways, |
| 687 conference bridges, or other software. |
| 688 </t> |
| 689 </section> |
| 690 |
| 691 <section title="Code 0: One Frame in the Packet"> |
| 692 |
| 693 <t> |
| 694 For code 0 packets, the TOC byte is immediately followed by N-1 bytes |
| 695 of compressed data for a single frame (where N is the size of the packet), |
| 696 as illustrated in <xref target="code0_packet"/>. |
| 697 </t> |
| 698 <figure anchor="code0_packet" title="A Code 0 Packet" align="center"> |
| 699 <artwork align="center"><![CDATA[ |
| 700 0 1 2 3 |
| 701 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 |
| 702 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 703 | config |s|0|0| | |
| 704 +-+-+-+-+-+-+-+-+ | |
| 705 | Compressed frame 1 (N-1 bytes)... : |
| 706 : | |
| 707 | | |
| 708 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 709 ]]></artwork> |
| 710 </figure> |
| 711 </section> |
| 712 |
| 713 <section title="Code 1: Two Frames in the Packet, Each with Equal Compressed Size"> |
| 714 <t anchor="R3"> |
| 715 For code 1 packets, the TOC byte is immediately followed by the |
| 716 (N-1)/2 bytes of compressed data for the first frame, followed by |
| 717 (N-1)/2 bytes of compressed data for the second frame, as illustrated in |
| 718 <xref target="code1_packet"/>. |
| 719 The number of payload bytes available for compressed data, N-1, MUST be even |
| 720 for all code 1 packets [R3]. |
| 721 </t> |
| 722 <figure anchor="code1_packet" title="A Code 1 Packet" align="center"> |
| 723 <artwork align="center"><![CDATA[ |
| 724 0 1 2 3 |
| 725 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 |
| 726 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 727 | config |s|0|1| | |
| 728 +-+-+-+-+-+-+-+-+ : |
| 729 | Compressed frame 1 ((N-1)/2 bytes)... | |
| 730 : +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 731 | | | |
| 732 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ : |
| 733 | Compressed frame 2 ((N-1)/2 bytes)... | |
| 734 : +-+-+-+-+-+-+-+-+ |
| 735 | | |
| 736 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 737 ]]></artwork> |
| 738 </figure> |
| 739 </section> |
| 740 |
| 741 <section title="Code 2: Two Frames in the Packet, with Different Compressed Sizes"> |
| 742 <t anchor="R4"> |
| 743 For code 2 packets, the TOC byte is followed by a one- or two-byte sequence |
| 744 indicating the length of the first frame (marked N1 in <xref target='code2_packet'/>), |
| 745 followed by N1 bytes of compressed data for the first frame. |
| 746 The remaining N-N1-2 or N-N1-3 bytes are the compressed data for the |
| 747 second frame. |
| 748 This is illustrated in <xref target="code2_packet"/>. |
| 749 A code 2 packet MUST contain enough bytes to represent a valid length. |
| 750 For example, a 1-byte code 2 packet is always invalid, and a 2-byte code 2 |
| 751 packet whose second byte is in the range 252...255 is also invalid. |
| 752 The length of the first frame, N1, MUST also be no larger than the size of the |
| 753 payload remaining after decoding that length for all code 2 packets [R4]. |
| 754 This makes, for example, a 2-byte code 2 packet with a second byte in the range |
| 755 1...251 invalid as well (the only valid 2-byte code 2 packet is one where the |
| 756 length of both frames is zero). |
| 757 </t> |
| 758 <figure anchor="code2_packet" title="A Code 2 Packet" align="center"> |
| 759 <artwork align="center"><![CDATA[ |
| 760 0 1 2 3 |
| 761 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 |
| 762 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 763 | config |s|1|0| N1 (1-2 bytes): | |
| 764 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ : |
| 765 | Compressed frame 1 (N1 bytes)... | |
| 766 : +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 767 | | | |
| 768 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| 769 | Compressed frame 2... : |
| 770 : | |
| 771 | | |
| 772 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 773 ]]></artwork> |
| 774 </figure> |
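<t>
As an illustration of the code 2 rules, the following C sketch splits a code 2
packet into its two frames and rejects the invalid cases listed above.
The function name split_code2() and its interface are hypothetical; the sketch
does not inspect the TOC byte and assumes the caller has already determined
that the packet uses code 2.
</t>
<figure>
<artwork><![CDATA[
```c
#include <stddef.h>

/* Hypothetical helper: splits a code 2 packet of n bytes (TOC byte
 * included) into two frames.  Returns 0 and stores both frame lengths
 * on success, or -1 if the packet violates [R2] or [R4]. */
static int split_code2(const unsigned char *pkt, size_t n,
                       size_t *len1, size_t *len2)
{
    size_t hdr, n1;

    if (n < 2)
        return -1;                 /* no room for a length byte */
    n1 = pkt[1];
    hdr = 2;                       /* TOC byte + one length byte */
    if (n1 >= 252) {
        if (n < 3)
            return -1;             /* second length byte is missing */
        n1 += 4 * (size_t)pkt[2];
        hdr = 3;                   /* TOC byte + two length bytes */
    }
    if (n1 > n - hdr)
        return -1;                 /* first frame overruns the packet */
    if (n - hdr - n1 > 1275)
        return -1;                 /* implicit length too large [R2] */
    *len1 = n1;
    *len2 = n - hdr - n1;          /* N-N1-2 or N-N1-3 bytes */
    return 0;
}
```
]]></artwork>
</figure>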
| 775 </section> |
| 776 |
| 777 <section title="Code 3: A Signaled Number of Frames in the Packet"> |
| 778 <t anchor="R5"> |
| 779 Code 3 packets signal the number of frames, as well as additional |
| 780 padding, called "Opus padding" to indicate that this padding is added at the |
| 781 Opus layer, rather than at the transport layer. |
| 782 Code 3 packets MUST have at least 2 bytes [R6,R7]. |
| 783 The TOC byte is followed by a byte encoding the number of frames in the packet |
| 784 in bits 2 to 7 (marked "M" in <xref target='frame_count_byte'/>), with bit 1 indicating whether |
| 785 or not Opus padding is inserted (marked "p" in <xref target='frame_count_byte'/>), and bit 0 |
| 786 indicating VBR (marked "v" in <xref target='frame_count_byte'/>). |
| 787 M MUST NOT be zero, and the audio duration contained within a packet MUST NOT |
| 788 exceed 120 ms [R5]. |
| 789 This limits the maximum frame count for any frame size to 48 (for 2.5 ms |
| 790 frames), with lower limits for longer frame sizes. |
| 791 <xref target="frame_count_byte"/> illustrates the layout of the frame count |
| 792 byte. |
| 793 </t> |
| 794 <figure anchor="frame_count_byte" title="The frame count byte"> |
| 795 <artwork align="center"><![CDATA[ |
| 796 0 |
| 797 0 1 2 3 4 5 6 7 |
| 798 +-+-+-+-+-+-+-+-+ |
| 799 |v|p| M | |
| 800 +-+-+-+-+-+-+-+-+ |
| 801 ]]></artwork> |
| 802 </figure> |
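<t>
The frame count byte can be unpacked as in the following C sketch.
Bit 0 in the figure is the most significant bit of the byte.
The helper name parse_frame_count() is hypothetical, and the per-frame
duration is assumed to be supplied by the caller in microseconds after
decoding the TOC byte.
</t>
<figure>
<artwork><![CDATA[
```c
/* Hypothetical helper: unpacks the frame count byte of a code 3 packet.
 * frame_us is the duration of one frame in microseconds, which the
 * caller derives from the TOC byte (e.g., 2500 for 2.5 ms frames).
 * Returns 0 on success, or -1 if [R5] is violated. */
static int parse_frame_count(unsigned char b, long frame_us,
                             int *vbr, int *padding, int *count)
{
    *vbr     = (b >> 7) & 1;   /* "v": bit 0 (most significant) */
    *padding = (b >> 6) & 1;   /* "p": bit 1 */
    *count   = b & 0x3F;       /* "M": bits 2 to 7 */
    if (*count == 0)
        return -1;             /* M MUST NOT be zero */
    if (*count * frame_us > 120000L)
        return -1;             /* more than 120 ms of audio */
    return 0;
}
```
]]></artwork>
</figure>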
| 803 <t> |
| 804 When Opus padding is used, the number of bytes of padding is encoded in the |
| 805 bytes following the frame count byte. |
| 806 Values from 0...254 indicate that 0...254 bytes of padding are included, |
| 807 in addition to the byte(s) used to indicate the size of the padding. |
| 808 If the value is 255, then the size of the additional padding is 254 bytes, |
| 809 plus the padding value encoded in the next byte. |
| 810 There MUST be at least one more byte in the packet in this case [R6,R7]. |
| 811 The additional padding bytes appear at the end of the packet, and MUST be set |
| 812 to zero by the encoder to avoid creating a covert channel. |
| 813 The decoder MUST accept any value for the padding bytes, however. |
| 814 </t> |
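<t>
A minimal C sketch of the padding length decoding follows.
The function name parse_padding() and its return convention are assumptions
made for illustration.
The sketch only decodes the padding size; checking that the total fits in the
packet is left to the caller.
</t>
<figure>
<artwork><![CDATA[
```c
#include <stddef.h>

/* Hypothetical helper: decodes the padding length bytes that follow the
 * frame count byte when the padding bit is set.  data points at the
 * first padding length byte and avail is the number of bytes left in
 * the packet.  Stores the number of trailing padding bytes in *pad and
 * returns the number of length bytes consumed, or -1 if the packet ends
 * after a byte with the value 255. */
static int parse_padding(const unsigned char *data, size_t avail,
                         size_t *pad)
{
    size_t used = 0;
    unsigned p;

    *pad = 0;
    do {
        if (used >= avail)
            return -1;                 /* a further byte is required */
        p = data[used++];
        *pad += (p == 255) ? 254 : p;  /* 255 adds 254 and continues */
    } while (p == 255);
    return (int)used;
}
```
]]></artwork>
</figure>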
| 815 <t> |
| 816 Although this encoding provides multiple ways to indicate a given number of |
| 817 padding bytes, each uses a different number of bytes to indicate the padding |
| 818 size, and thus will increase the total packet size by a different amount. |
| 819 For example, to add 255 bytes to a packet, set the padding bit, p, to 1, insert |
| 820 a single byte after the frame count byte with a value of 254, and append 254 |
| 821 padding bytes with the value zero to the end of the packet. |
| 822 To add 256 bytes to a packet, set the padding bit to 1, insert two bytes after |
| 823 the frame count byte with the values 255 and 0, respectively, and append 254 |
| 824 padding bytes with the value zero to the end of the packet. |
| 825 By using the value 255 multiple times, it is possible to create a packet of any |
| 826 specific, desired size. |
| 827 Let P be the number of header bytes used to indicate the padding size plus the |
| 828 number of padding bytes themselves (i.e., P is the total number of bytes added |
| 829 to the packet). |
| 830 Then P MUST be no more than N-2 [R6,R7]. |
| 831 </t> |
| 832 <t anchor="R6"> |
| 833 In the CBR case, let R=N-2-P be the number of bytes remaining in the packet |
| 834 after subtracting the (optional) padding. |
| 835 Then the compressed length of each frame in bytes is equal to R/M. |
| 836 The value R MUST be a non-negative integer multiple of M [R6]. |
| 837 The compressed data for all M frames follows, each of size |
| 838 R/M bytes, as illustrated in <xref target="code3cbr_packet"/>. |
| 839 </t> |
| 840 |
| 841 <figure anchor="code3cbr_packet" title="A CBR Code 3 Packet" align="center"> |
| 842 <artwork align="center"><![CDATA[ |
| 843 0 1 2 3 |
| 844 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 |
| 845 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 846 | config |s|1|1|0|p| M | Padding length (Optional) : |
| 847 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 848 | | |
| 849 : Compressed frame 1 (R/M bytes)... : |
| 850 | | |
| 851 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 852 | | |
| 853 : Compressed frame 2 (R/M bytes)... : |
| 854 | | |
| 855 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 856 | | |
| 857 : ... : |
| 858 | | |
| 859 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 860 | | |
| 861 : Compressed frame M (R/M bytes)... : |
| 862 | | |
| 863 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 864 : Opus Padding (Optional)... | |
| 865 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 866 ]]></artwork> |
| 867 </figure> |
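<t>
The CBR constraints can be checked with a few lines of C.
The sketch below is illustrative only: cbr_frame_size() is a hypothetical
name, and the packet size N, the padding total P, and the frame count M are
assumed to have been determined already.
</t>
<figure>
<artwork><![CDATA[
```c
/* Hypothetical helper: returns the size in bytes of each frame in a CBR
 * code 3 packet, or -1 if the packet violates [R2] or [R6].
 *   n: total packet size in bytes (N)
 *   p: padding length bytes plus trailing padding bytes (P)
 *   m: frame count from the frame count byte (M) */
static long cbr_frame_size(long n, long p, int m)
{
    long r;

    if (n < 2 || m < 1 || p < 0 || p > n - 2)
        return -1;
    r = n - 2 - p;             /* bytes left for compressed frames */
    if (r % m != 0)
        return -1;             /* R must be a multiple of M */
    if (r / m > 1275)
        return -1;             /* implicit length too large */
    return r / m;
}
```
]]></artwork>
</figure>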
| 868 |
| 869 <t anchor="R7"> |
| 870 In the VBR case, the (optional) padding length is followed by M-1 frame |
| 871 lengths (indicated by "N1" to "N[M-1]" in <xref target='code3vbr_packet'/>), each encoded in a |
| 872 one- or two-byte sequence as described above. |
| 873 The packet MUST contain enough data for the M-1 lengths after removing the |
| 874 (optional) padding, and the sum of these lengths MUST be no larger than the |
| 875 number of bytes remaining in the packet after decoding them [R7]. |
| 876 The compressed data for all M frames follows, each frame consisting of the |
| 877 indicated number of bytes, with the final frame consuming any remaining bytes |
| 878 before the final padding, as illustrated in <xref target="code3vbr_packet"/>. |
| 879 The number of header bytes (TOC byte, frame count byte, padding length bytes, |
| 880 and frame length bytes), plus the signaled length of the first M-1 frames themselves, |
| 881 plus the signaled length of the padding MUST be no larger than N, the total size of the |
| 882 packet. |
| 883 </t> |
| 884 |
| 885 <figure anchor="code3vbr_packet" title="A VBR Code 3 Packet" align="center"> |
| 886 <artwork align="center"><![CDATA[ |
| 887 0 1 2 3 |
| 888 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 |
| 889 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 890 | config |s|1|1|1|p| M | Padding length (Optional) : |
| 891 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 892 : N1 (1-2 bytes): N2 (1-2 bytes): ... : N[M-1] | |
| 893 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 894 | | |
| 895 : Compressed frame 1 (N1 bytes)... : |
| 896 | | |
| 897 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 898 | | |
| 899 : Compressed frame 2 (N2 bytes)... : |
| 900 | | |
| 901 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 902 | | |
| 903 : ... : |
| 904 | | |
| 905 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 906 | | |
| 907 : Compressed frame M... : |
| 908 | | |
| 909 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 910 : Opus Padding (Optional)... | |
| 911 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 912 ]]></artwork> |
| 913 </figure> |
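<t>
The VBR layout can be illustrated with the following C sketch, which reads the
M-1 signaled lengths and infers the length of the final frame.
The name vbr_frame_sizes() and its parameters are hypothetical, and the
caller is assumed to have already decoded the frame count byte and any padding
length bytes.
</t>
<figure>
<artwork><![CDATA[
```c
/* Hypothetical helper: computes the sizes of the m frames in a VBR
 * code 3 packet.
 *   data:  first byte after the frame count byte and any padding
 *          length bytes
 *   avail: bytes from data to the end of the packet, trailing padding
 *          included
 *   pad:   number of trailing padding bytes
 *   sizes: output array with room for m entries
 * Returns 0 on success, or -1 if the packet violates [R2] or [R7]. */
static int vbr_frame_sizes(const unsigned char *data, long avail,
                           long pad, int m, long sizes[])
{
    long remaining = avail - pad;
    long pos = 0;
    int i;

    if (m < 1 || pad < 0 || remaining < 0)
        return -1;
    /* Read the M-1 one- or two-byte frame lengths. */
    for (i = 0; i < m - 1; i++) {
        long len;
        if (remaining < 1)
            return -1;
        len = data[pos++];
        remaining--;
        if (len >= 252) {
            if (remaining < 1)
                return -1;
            len += 4 * (long)data[pos++];
            remaining--;
        }
        sizes[i] = len;
    }
    /* The signaled frames must fit in what is left. */
    for (i = 0; i < m - 1; i++) {
        remaining -= sizes[i];
        if (remaining < 0)
            return -1;
    }
    if (remaining > 1275)
        return -1;             /* implicit last length too large */
    sizes[m - 1] = remaining;  /* final frame takes the rest */
    return 0;
}
```
]]></artwork>
</figure>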
| 914 </section> |
| 915 </section> |
| 916 |
| 917 <section anchor="examples" title="Examples"> |
| 918 <t> |
| 919 Simplest case, one NB mono 20 ms SILK frame: |
| 920 </t> |
| 921 |
| 922 <figure anchor='framing_example_1'> |
| 923 <artwork><![CDATA[ |
| 924 0 1 2 3 |
| 925 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 |
| 926 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 927 | 1 |0|0|0| compressed data... : |
| 928 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 929 ]]></artwork> |
| 930 </figure> |
| 931 |
| 932 <t> |
| 933 Two FB mono 5 ms CELT frames of the same compressed size: |
| 934 </t> |
| 935 |
| 936 <figure anchor='framing_example_2'> |
| 937 <artwork><![CDATA[ |
| 938 0 1 2 3 |
| 939 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 |
| 940 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 941 | 29 |0|0|1| compressed data... : |
| 942 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 943 ]]></artwork> |
| 944 </figure> |
| 945 |
| 946 <t> |
| 947 Two FB mono 20 ms Hybrid frames of different compressed size: |
| 948 </t> |
| 949 |
| 950 <figure anchor='framing_example_3'> |
| 951 <artwork><![CDATA[ |
| 952 0 1 2 3 |
| 953 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 |
| 954 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 955 | 15 |0|1|1|1|0| 2 | N1 | | |
| 956 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| 957 | compressed data... : |
| 958 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 959 ]]></artwork> |
| 960 </figure> |
| 961 |
| 962 <t> |
| 963 Four FB stereo 20 ms CELT frames of the same compressed size: |
| 964 </t> |
| 965 |
| 966 <figure anchor='framing_example_4'> |
| 967 <artwork><![CDATA[ |
| 968 0 1 2 3 |
| 969 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 |
| 970 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 971 | 31 |1|1|1|0|0| 4 | compressed data... : |
| 972 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 973 ]]></artwork> |
| 974 </figure> |
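<t>
The header bytes of the four examples above can be reproduced with the
following C sketch.
The helper names make_toc() and make_frame_count() are hypothetical; the bit
positions follow the TOC byte and frame count byte layouts shown in the
figures.
</t>
<figure>
<artwork><![CDATA[
```c
/* Hypothetical helpers that pack the header bytes used in the examples.
 * config occupies the top five bits of the TOC byte, followed by the
 * stereo flag "s" and the two-bit frame count code "c". */
static unsigned char make_toc(unsigned config, unsigned stereo,
                              unsigned code)
{
    return (unsigned char)((config << 3) | (stereo << 2) | code);
}

/* Packs the frame count byte of a code 3 packet: v, p, then M. */
static unsigned char make_frame_count(unsigned vbr, unsigned padding,
                                      unsigned m)
{
    return (unsigned char)((vbr << 7) | (padding << 6) | m);
}
```
]]></artwork>
</figure>
<t>
Under these assumptions, the four examples begin with the bytes 0x08, 0xE9,
0x7B 0x82, and 0xFF 0x04, respectively.
</t>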
| 975 </section> |
| 976 |
| 977 <section anchor="malformed-packets" title="Receiving Malformed Packets"> |
| 978 <t> |
| 979 A receiver MUST NOT process packets which violate any of the rules above as |
| 980 normal Opus packets. |
| 981 They are reserved for future applications, such as in-band headers (containing |
| 982 metadata, etc.). |
| 983 Packets which violate these constraints may cause implementations of |
| 984 <spanx style="emph">this</spanx> specification to treat them as malformed, and |
| 985 discard them. |
| 986 </t> |
| 987 <t> |
| 988 These constraints are summarized here for reference: |
| 989 <list style="format [R%d]"> |
| 990 <t>Packets are at least one byte.</t> |
| 991 <t>No implicit frame length is larger than 1275 bytes.</t> |
| 992 <t>Code 1 packets have an odd total length, N, so that (N-1)/2 is an |
| 993 integer.</t> |
| 994 <t>Code 2 packets have enough bytes after the TOC for a valid frame |
| 995 length, and that length is no larger than the number of bytes remaining in the |
| 996 packet.</t> |
| 997 <t>Code 3 packets contain at least one frame, but no more than 120 ms |
| 998 of audio total.</t> |
| 999 <t>The length of a CBR code 3 packet, N, is at least two bytes, the number of |
| 1000 bytes added to indicate the padding size plus the trailing padding bytes |
| 1001 themselves, P, is no more than N-2, and the frame count, M, satisfies |
| 1002 the constraint that (N-2-P) is a non-negative integer multiple of M.</t> |
| 1003 <t>VBR code 3 packets are large enough to contain all the header bytes (TOC |
| 1004 byte, frame count byte, any padding length bytes, and any frame length bytes), |
| 1005 plus the length of the first M-1 frames, plus any trailing padding bytes.</t> |
| 1006 </list> |
| 1007 </t> |
| 1008 </section> |
| 1009 |
| 1010 </section> |
| 1011 |
| 1012 <section title="Opus Decoder"> |
| 1013 <t> |
| 1014 The Opus decoder consists of two main blocks: the SILK decoder and the CELT |
| 1015 decoder. |
| 1016 At any given time, one or both of the SILK and CELT decoders may be active. |
| 1017 The output of the Opus decoder is the sum of the outputs from the SILK and CELT |
| 1018 decoders with proper sample rate conversion and delay compensation on the SILK |
| 1019 side, and optional decimation (when decoding to sample rates less than |
| 1020 48 kHz) on the CELT side, as illustrated in the block diagram below. |
| 1021 </t> |
| 1022 <figure> |
| 1023 <artwork> |
| 1024 <![CDATA[ |
| 1025 +---------+ +------------+ |
| 1026 | SILK | | Sample | |
| 1027 +->| Decoder |--->| Rate |----+ |
| 1028 Bit- +---------+ | | | | Conversion | v |
| 1029 stream | Range |---+ +---------+ +------------+ /---\ Audio |
| 1030 ------->| Decoder | | + |------> |
| 1031 | |---+ +---------+ +------------+ \---/ |
| 1032 +---------+ | | CELT | | Decimation | ^ |
| 1033 +->| Decoder |--->| (Optional) |----+ |
| 1034 | | | | |
| 1035 +---------+ +------------+ |
| 1036 ]]> |
| 1037 </artwork> |
| 1038 </figure> |
| 1039 |
| 1040 <section anchor="range-decoder" title="Range Decoder"> |
| 1041 <t> |
| 1042 Opus uses an entropy coder based on range coding <xref target="range-coding"></xref> |
| 1043 <xref target="Martin79"></xref>, |
| 1044 which is itself a rediscovery of the FIFO arithmetic code introduced by <xref target="coding-thesis"></xref>. |
| 1045 It is very similar to arithmetic encoding, except that encoding is done with |
| 1046 digits in any base instead of with bits, |
| 1047 so it is faster when using larger bases (i.e., a byte). All of the |
| 1048 calculations in the range coder must use bit-exact integer arithmetic. |
| 1049 </t> |
| 1050 <t> |
| 1051 Symbols may also be coded as "raw bits" packed directly into the bitstream, |
| 1052 bypassing the range coder. |
| 1053 These are packed backwards starting at the end of the frame, as illustrated in |
| 1054 <xref target="rawbits-example"/>. |
| 1055 This reduces complexity and makes the stream more resilient to bit errors, as |
| 1056 corruption in the raw bits will not desynchronize the decoding process, unlike |
| 1057 corruption in the input to the range decoder. |
| 1058 Raw bits are only used in the CELT layer. |
| 1059 </t> |
| 1060 |
| 1061 <figure anchor="rawbits-example" title="Illustrative example of packing range |
| 1062 coder and raw bits data"> |
| 1063 <artwork align="center"><![CDATA[ |
| 1064 0 1 2 3 |
| 1065 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 |
| 1066 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 1067 | Range coder data (packed MSB to LSB) -> : |
| 1068 + + |
| 1069 : : |
| 1070 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 1071 : | <- Boundary occurs at an arbitrary bit position : |
| 1072 +-+-+-+ + |
| 1073 : <- Raw bits data (packed LSB to MSB) | |
| 1074 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 1075 ]]></artwork> |
| 1076 </figure> |
| 1077 |
| 1078 <t> |
| 1079 Each symbol coded by the range coder is drawn from a finite alphabet and coded |
| 1080 in a separate "context", which describes the size of the alphabet and the |
| 1081 relative frequency of each symbol in that alphabet. |
| 1082 </t> |
| 1083 <t> |
| 1084 Suppose there is a context with n symbols, identified with an index that ranges |
| 1085 from 0 to n-1. |
| 1086 The parameters needed to encode or decode symbol k in this context are |
| 1087 represented by a three-tuple (fl[k], fh[k], ft), with |
| 1088 0 <= fl[k] < fh[k] <= ft <= 65535. |
| 1089 The values of this tuple are derived from the probability model for the |
| 1090 symbol, represented by traditional "frequency counts". |
| 1091 Because Opus uses static contexts these are not updated as symbols are decoded. |
| 1092 Let f[i] be the frequency of symbol i. |
| 1093 Then the three-tuple corresponding to symbol k is given by |
| 1094 </t> |
| 1095 <figure align="center"> |
| 1096 <artwork align="center"><![CDATA[ |
| 1097 k-1 n-1 |
| 1098 __ __ |
| 1099 fl[k] = \ f[i], fh[k] = fl[k] + f[k], ft = \ f[i] |
| 1100 /_ /_ |
| 1101 i=0 i=0 |
| 1102 ]]></artwork> |
| 1103 </figure> |
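<t>
For illustration, the three-tuple can be computed from a table of frequency
counts as in the following C sketch.
The type sym_range and the function symbol_range() are hypothetical names
introduced for this example.
</t>
<figure>
<artwork><![CDATA[
```c
/* Hypothetical types and names, for illustration only. */
typedef struct {
    unsigned fl;   /* sum of f[i] for i < k  */
    unsigned fh;   /* fl + f[k]              */
    unsigned ft;   /* sum of all f[i]        */
} sym_range;

/* Computes (fl[k], fh[k], ft) for symbol k in a context with n symbols
 * whose frequency counts are f[0..n-1]. */
static sym_range symbol_range(const unsigned f[], int n, int k)
{
    sym_range r = {0, 0, 0};
    int i;

    for (i = 0; i < n; i++) {
        if (i < k)
            r.fl += f[i];
        r.ft += f[i];
    }
    r.fh = r.fl + f[k];
    return r;
}
```
]]></artwork>
</figure>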
| 1104 <t> |
| 1105 The range decoder extracts the symbols and integers encoded using the range |
| 1106 encoder in <xref target="range-encoder"/>. |
| 1107 The range decoder maintains an internal state vector composed of the two-tuple |
| 1108 (val, rng), representing the difference between the high end of the |
| 1109 current range and the actual coded value, minus one, and the size of the |
| 1110 current range, respectively. |
| 1111 Both val and rng are 32-bit unsigned integer values. |
| 1112 </t> |
| 1113 |
| 1114 <section anchor="range-decoder-init" title="Range Decoder Initialization"> |
| 1115 <t> |
| 1116 Let b0 be the first input byte (or zero if there are no bytes in this Opus |
| 1117 frame). |
| 1118 The decoder initializes rng to 128 and initializes val to |
| 1119 (127 - (b0>>1)), where (b0>>1) is the top 7 bits of the |
| 1120 first input byte. |
| 1121 It saves the remaining bit, (b0&1), for use in the renormalization |
| 1122 procedure described in <xref target="range-decoder-renorm"/>, which the |
| 1123 decoder invokes immediately after initialization to read additional bits and |
| 1124 establish the invariant that rng > 2**23. |
| 1125 </t> |
| 1126 </section> |
| 1127 |
| 1128 <section anchor="decoding-symbols" title="Decoding Symbols"> |
| 1129 <t> |
| 1130 Decoding a symbol is a two-step process. |
| 1131 The first step determines a 16-bit unsigned value fs, which lies within the |
| 1132 range of some symbol in the current context. |
| 1133 The second step updates the range decoder state with the three-tuple |
| 1134 (fl[k], fh[k], ft) corresponding to that symbol. |
| 1135 </t> |
| 1136 <t> |
| 1137 The first step is implemented by ec_decode() (entdec.c), which computes |
| 1138 <figure align="center"> |
| 1139 <artwork align="center"><![CDATA[ |
| 1140 val |
| 1141 fs = ft - min(------ + 1, ft) . |
| 1142 rng/ft |
| 1143 ]]></artwork> |
| 1144 </figure> |
| 1145 The divisions here are integer division. |
| 1146 </t> |
| 1147 <t> |
| 1148 The decoder then identifies the symbol in the current context corresponding to |
| 1149 fs; i.e., the value of k whose three-tuple (fl[k], fh[k], ft) |
| 1150 satisfies fl[k] <= fs < fh[k]. |
| 1151 It uses this tuple to update val according to |
| 1152 <figure align="center"> |
| 1153 <artwork align="center"><![CDATA[ |
| 1154 rng |
| 1155 val = val - --- * (ft - fh[k]) . |
| 1156 ft |
| 1157 ]]></artwork> |
| 1158 </figure> |
| 1159 If fl[k] is greater than zero, then the decoder updates rng using |
| 1160 <figure align="center"> |
| 1161 <artwork align="center"><![CDATA[ |
| 1162 rng |
| 1163 rng = --- * (fh[k] - fl[k]) . |
| 1164 ft |
| 1165 ]]></artwork> |
| 1166 </figure> |
| 1167 Otherwise, it updates rng using |
| 1168 <figure align="center"> |
| 1169 <artwork align="center"><![CDATA[ |
| 1170 rng |
| 1171 rng = rng - --- * (ft - fh[k]) . |
| 1172 ft |
| 1173 ]]></artwork> |
| 1174 </figure> |
| 1175 </t> |
| 1176 <t> |
| 1177 Using a special case for the first symbol (rather than the last symbol, as is |
| 1178 commonly done in other arithmetic coders) ensures that all the truncation |
| 1179 error from the finite precision arithmetic accumulates in symbol 0. |
| 1180 This makes the cost of coding a 0 slightly smaller, on average, than its |
| 1181 estimated probability indicates and makes the cost of coding any other symbol |
| 1182 slightly larger. |
| 1183 When contexts are designed so that 0 is the most probable symbol, which is |
| 1184 often the case, this strategy minimizes the inefficiency introduced by the |
| 1185 finite precision. |
| 1186 It also makes some of the special-case decoding routines in |
| 1187 <xref target="decoding-alternate"/> particularly simple. |
| 1188 </t> |
| 1189 <t> |
| 1190 After the updates, implemented by ec_dec_update() (entdec.c), the decoder |
| 1191 normalizes the range using the procedure in the next section, and returns the |
| 1192 index k. |
| 1193 </t> |
| 1194 |
| 1195 <section anchor="range-decoder-renorm" title="Renormalization"> |
| 1196 <t> |
| 1197 To normalize the range, the decoder repeats the following process, implemented |
| 1198 by ec_dec_normalize() (entdec.c), until rng > 2**23. |
| 1199 If rng is already greater than 2**23, the entire process is skipped. |
| 1200 First, it sets rng to (rng<<8). |
| 1201 Then it reads the next byte of the Opus frame and forms an 8-bit value sym, |
| 1202 using the left-over bit buffered from the previous byte as the high bit |
| 1203 and the top 7 bits of the byte just read as the other 7 bits of sym. |
| 1204 The remaining bit in the byte just read is buffered for use in the next |
| 1205 iteration. |
| 1206 If no more input bytes remain, it uses zero bits instead. |
| 1207 See <xref target="range-decoder-init"/> for the initialization used to process |
| 1208 the first byte. |
| 1209 Then, it sets |
| 1210 <figure align="center"> |
| 1211 <artwork align="center"><![CDATA[ |
| 1212 val = ((val<<8) + (255-sym)) & 0x7FFFFFFF . |
| 1213 ]]></artwork> |
| 1214 </figure> |
| 1215 </t> |
| 1216 <t> |
| 1217 It is normal and expected that the range decoder will read several bytes |
| 1218 into the raw bits data (if any) at the end of the packet by the time the frame |
| 1219 is completely decoded, as illustrated in <xref target="finalize-example"/>. |
| 1220 This same data MUST also be returned as raw bits when requested. |
| 1221 The encoder is expected to terminate the stream in such a way that the decoder |
| 1222 will decode the intended values regardless of the data contained in the raw |
| 1223 bits. |
| 1224 <xref target="encoder-finalizing"/> describes a procedure for doing this. |
| 1225 If the range decoder consumes all of the bytes belonging to the current frame, |
| 1226 it MUST continue to use zero when any further input bytes are required, even |
| 1227 if there is additional data in the current packet from padding or other |
| 1228 frames. |
| 1229 </t> |
| 1230 |
| 1231 <figure anchor="finalize-example" title="Illustrative example of raw bits |
| 1232 overlapping range coder data"> |
| 1233 <artwork align="center"><![CDATA[ |
| 1234 n n+1 n+2 n+3 |
| 1235 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 |
| 1236 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 1237 : | <----------- Overlap region ------------> | : |
| 1238 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 1239 ^ ^ |
| 1240 | End of data buffered by the range coder | |
| 1241 ...-----------------------------------------------+ |
| 1242 | |
| 1243 | End of data consumed by raw bits |
| 1244 +-------------------------------------------------------... |
| 1245 ]]></artwork> |
| 1246 </figure> |
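<t>
The initialization, symbol decoding, and renormalization steps described above
can be sketched together in C as follows.
This is a direct transcription of the formulas in the text, not the code in
entdec.c: the type range_dec and the rd_*() function names are hypothetical,
and the sketch favors clarity over speed.
</t>
<figure>
<artwork><![CDATA[
```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoder state; the names are illustrative only. */
typedef struct {
    const unsigned char *buf;  /* bytes of the current Opus frame */
    size_t len, pos;
    uint32_t val, rng;
    unsigned rem_bit;          /* left-over bit from the last byte */
} range_dec;

/* Returns the next input byte, or zero once the frame is exhausted. */
static unsigned rd_next_byte(range_dec *d)
{
    return d->pos < d->len ? d->buf[d->pos++] : 0;
}

/* Renormalization: repeat until rng > 2**23. */
static void rd_normalize(range_dec *d)
{
    while (d->rng <= (1u << 23)) {
        unsigned b = rd_next_byte(d);
        unsigned sym = (d->rem_bit << 7) | (b >> 1);
        d->rem_bit = b & 1;
        d->rng <<= 8;
        d->val = ((d->val << 8) + (255 - sym)) & 0x7FFFFFFFu;
    }
}

static void rd_init(range_dec *d, const unsigned char *buf, size_t len)
{
    unsigned b0;

    d->buf = buf;
    d->len = len;
    d->pos = 0;
    b0 = rd_next_byte(d);      /* zero if the frame is empty */
    d->rng = 128;
    d->val = 127 - (b0 >> 1);
    d->rem_bit = b0 & 1;
    rd_normalize(d);
}

/* First step: fs = ft - min(val/(rng/ft) + 1, ft). */
static uint32_t rd_decode(const range_dec *d, uint32_t ft)
{
    uint32_t q = d->val / (d->rng / ft) + 1;
    return ft - (q < ft ? q : ft);
}

/* Second step: update val and rng for the symbol, then renormalize. */
static void rd_update(range_dec *d, uint32_t fl, uint32_t fh,
                      uint32_t ft)
{
    uint32_t s = d->rng / ft;

    d->val -= s * (ft - fh);
    if (fl > 0)
        d->rng = s * (fh - fl);
    else
        d->rng -= s * (ft - fh);
    rd_normalize(d);
}
```
]]></artwork>
</figure>
<t>
A caller would invoke rd_decode() with the total frequency ft of the context,
search the context for the symbol k whose range contains the returned value,
and then call rd_update() with that symbol's three-tuple.
</t>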
| 1247 </section> |
| 1248 </section> |
| 1249 |
| 1250 <section anchor="decoding-alternate" title="Alternate Decoding Methods"> |
| 1251 <t> |
| 1252 The reference implementation uses three additional decoding methods that are |
| 1253 exactly equivalent to the above, but make assumptions and simplifications that |
| 1254 allow for a more efficient implementation. |
| 1255 </t> |
| 1256 <section anchor="ec_decode_bin" title="ec_decode_bin()"> |
| 1257 <t> |
| 1258 The first is ec_decode_bin() (entdec.c), defined using the parameter ftb |
| 1259 instead of ft. |
| 1260 It is mathematically equivalent to calling ec_decode() with |
| 1261 ft = (1<<ftb), but avoids one of the divisions. |
| 1262 </t> |
| 1263 </section> |
| 1264 <section anchor="ec_dec_bit_logp" title="ec_dec_bit_logp()"> |
| 1265 <t> |
| 1266 The next is ec_dec_bit_logp() (entdec.c), which decodes a single binary symbol, |
| 1267 replacing both the ec_decode() and ec_dec_update() steps. |
| 1268 The context is described by a single parameter, logp, which is the absolute |
| 1269 value of the base-2 logarithm of the probability of a "1". |
| 1270 It is mathematically equivalent to calling ec_decode() with |
| 1271 ft = (1<<logp), followed by ec_dec_update() with |
| 1272 the 3-tuple (fl[k] = 0, |
| 1273 fh[k] = (1<<logp) - 1, |
| 1274 ft = (1<<logp)) if the returned value |
| 1275 of fs is less than (1<<logp) - 1 (a "0" was decoded), and with |
| 1276 (fl[k] = (1<<logp) - 1, |
| 1277 fh[k] = ft = (1<<logp)) otherwise (a "1" was |
| 1278 decoded). |
| 1279 The implementation requires no multiplications or divisions. |
| 1280 </t> |
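<t>
The equivalence described above reduces to a shift, a comparison, and a
subtraction, as the following C sketch suggests.
The function name bit_logp_step() is hypothetical; it operates on bare val and
rng values and omits the renormalization that must follow.
</t>
<figure>
<artwork><![CDATA[
```c
#include <stdint.h>

/* Hypothetical helper: decodes one binary symbol whose probability of
 * being "1" is 2**(-logp), updating *val and *rng in place.  The caller
 * is expected to renormalize afterwards. */
static int bit_logp_step(uint32_t *val, uint32_t *rng, unsigned logp)
{
    uint32_t s = *rng >> logp;   /* rng/ft with ft = (1<<logp) */
    int bit = *val < s;          /* fs == ft - 1 exactly when val < s */

    if (bit) {
        *rng = s;                /* fl = ft - 1, fh = ft */
    } else {
        *val -= s;               /* fl = 0, fh = ft - 1 */
        *rng -= s;
    }
    return bit;
}
```
]]></artwork>
</figure>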
| 1281 </section> |
| 1282 <section anchor="ec_dec_icdf" title="ec_dec_icdf()"> |
| 1283 <t> |
| 1284 The last is ec_dec_icdf() (entdec.c), which decodes a single symbol with a |
| 1285 table-based context of up to 8 bits, also replacing both the ec_decode() and |
| 1286 ec_dec_update() steps, as well as the search for the decoded symbol in between. |
| 1287 The context is described by two parameters, an icdf |
| 1288 ("inverse" cumulative distribution function) table and ftb. |
| 1289 As with ec_decode_bin(), (1<<ftb) is equivalent to ft. |
| 1290 icdf[k], on the other hand, stores (1<<ftb)-fh[k], which is equal to |
| 1291 (1<<ftb) - fl[k+1]. |
| 1292 fl[0] is assumed to be 0, and the table is terminated by a value of 0 (where |
| 1293 fh[k] == ft). |
| 1294 </t> |
| 1295 <t> |
| 1296 The function is mathematically equivalent to calling ec_decode() with |
| 1297 ft = (1<<ftb), using the returned value fs to search the table |
| 1298 for the first entry where fs < (1<<ftb)-icdf[k], and |
| 1299 calling ec_dec_update() with |
| 1300 fl[k] = (1<<ftb) - icdf[k-1] (or 0 |
| 1301 if k == 0), fh[k] = (1<<ftb) - icdf[k], |
| 1302 and ft = (1<<ftb). |
| 1303 Combining the search with the update allows the division to be replaced by a |
| 1304 series of multiplications (which are usually much cheaper), and using an |
| 1305 inverse CDF allows the use of an ftb as large as 8 in an 8-bit table without |
| 1306 any special cases. |
| 1307 This is the primary interface with the range decoder in the SILK layer, though |
| 1308 it is used in a few places in the CELT layer as well. |
| 1309 </t> |
| 1310 <t> |
| 1311 Although icdf[k] is more convenient for the code, the frequency counts, f[k], |
| 1312 are a more natural representation of the probability distribution function |
| 1313 (PDF) for a given symbol. |
| 1314 Therefore this draft lists the latter, not the former, when describing the |
| 1315 context in which a symbol is coded as a list, e.g., {4, 4, 4, 4}/16 for a |
| 1316 uniform context with four possible values and ft = 16. |
| 1317 The value of ft after the slash is always the sum of the entries in the PDF, |
| 1318 but is included for convenience. |
| 1319 Contexts with identical probabilities, f[k]/ft, but different values of ft |
| 1320 (or equivalently, ftb) are not the same, and cannot, in general, be used in |
| 1321 place of one another. |
| 1322 An icdf table is also not capable of representing a PDF where the first symbol |
| 1323 has 0 probability. |
| 1324 In such contexts, ec_dec_icdf() can decode the symbol by using a table that |
| 1325 drops the entries for any initial zero-probability values and adding the |
| 1326 constant offset of the first value with a non-zero probability to its return |
| 1327 value. |
| 1328 </t> |
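<t>
The relationship between a PDF and its icdf table, and the combined search and
update, can be sketched in C as follows.
Both function names are hypothetical, and the decoding step works on bare val
and rng values without the renormalization that must follow.
For the uniform context {4, 4, 4, 4}/16, the resulting table is
{12, 8, 4, 0} with ftb = 4.
</t>
<figure>
<artwork><![CDATA[
```c
#include <stdint.h>

/* Hypothetical helper: builds icdf[k] = (1<<ftb) - fh[k] from the
 * frequency counts f[0..n-1], which must sum to (1<<ftb) with
 * f[0] > 0 so that every entry fits in 8 bits. */
static void pdf_to_icdf(const unsigned f[], int n, unsigned ftb,
                        unsigned char icdf[])
{
    unsigned fh = 0;
    int k;

    for (k = 0; k < n; k++) {
        fh += f[k];
        icdf[k] = (unsigned char)((1u << ftb) - fh);
    }
}

/* Hypothetical helper: finds the symbol k and updates *val and *rng.
 * The terminating zero in the table guarantees that the loop ends. */
static int icdf_decode_step(uint32_t *val, uint32_t *rng,
                            const unsigned char icdf[], unsigned ftb)
{
    uint32_t r = *rng >> ftb;    /* rng/ft */
    uint32_t s = *rng;
    uint32_t t;
    int k = -1;

    do {
        t = s;                   /* r*(ft - fl[k]), or rng when k == 0 */
        s = r * icdf[++k];       /* r*(ft - fh[k]) */
    } while (*val < s);
    *val -= s;
    *rng = t - s;
    return k;
}
```
]]></artwork>
</figure>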
| 1329 </section> |
| 1330 </section> |
| 1331 |
| 1332 <section anchor="decoding-bits" title="Decoding Raw Bits"> |
| 1333 <t> |
| 1334 The raw bits used by the CELT layer are packed at the end of the packet, with |
| 1335 the least significant bit of the first value packed in the least significant |
| 1336 bit of the last byte, filling up to the most significant bit in the last byte, |
| 1337 continuing on to the least significant bit of the penultimate byte, and so on. |
| 1338 The reference implementation reads them using ec_dec_bits() (entdec.c). |
| 1339 Because the range decoder must read several bytes ahead in the stream, as |
| 1340 described in <xref target="range-decoder-renorm"/>, the input consumed by the |
| 1341 raw bits may overlap with the input consumed by the range coder, and a decoder |
| 1342 MUST allow this. |
| 1343 The format should render it impossible to attempt to read more raw bits than |
| 1344 there are actual bits in the frame, though a decoder may wish to check for |
| 1345 this and report an error. |
| 1346 </t> |
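<t>
The backwards packing of raw bits can be illustrated with the C sketch below.
The type raw_reader and the function read_raw_bits() are hypothetical and are
not the ec_dec_bits() routine itself; the sketch assumes at most 25 bits are
requested per call and returns zero bits once the frame is exhausted.
</t>
<figure>
<artwork><![CDATA[
```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical raw bits reader; the names are illustrative only. */
typedef struct {
    const unsigned char *buf;  /* bytes of the current frame */
    size_t len;
    size_t used;               /* bytes already taken from the end */
    uint32_t window;           /* buffered bits, next bit in the LSB */
    int nbits;                 /* number of valid bits in window */
} raw_reader;

/* Reads 'bits' raw bits (0 to 25), starting from the least significant
 * bit of the last byte and moving toward the start of the frame. */
static uint32_t read_raw_bits(raw_reader *r, unsigned bits)
{
    uint32_t ret;

    while (r->nbits < (int)bits) {
        unsigned b = 0;
        if (r->used < r->len)
            b = r->buf[r->len - ++r->used];
        r->window |= (uint32_t)b << r->nbits;
        r->nbits += 8;
    }
    ret = r->window & ((1u << bits) - 1);
    r->window >>= bits;
    r->nbits -= (int)bits;
    return ret;
}
```
]]></artwork>
</figure>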
| 1347 </section> |
| 1348 |
| 1349 <section anchor="ec_dec_uint" title="Decoding Uniformly Distributed Integers"> |
| 1350 <t> |
| 1351 The function ec_dec_uint() (entdec.c) decodes one of ft equiprobable values in |
| 1352 the range 0 to (ft - 1), inclusive, each with a frequency of 1, |
| 1353 where ft may be as large as (2**32 - 1). |
| 1354 Because ec_decode() is limited to a total frequency of (2**16 - 1), |
| 1355 it splits up the value into a range coded symbol representing up to 8 of the |
| 1356 high bits, and, if necessary, raw bits representing the remainder of the |
| 1357 value. |
| 1358 The limit of 8 bits in the range coded symbol is a trade-off between |
| 1359 implementation complexity, modeling error (since the symbols no longer truly |
| 1360 have equal coding cost), and rounding error introduced by the range coder |
| 1361 itself (which gets larger as more bits are included). |
| 1362 Using raw bits reduces the maximum number of divisions required in the worst |
| 1363 case, but means that it may be possible to decode a value outside the range |
| 1364 0 to (ft - 1), inclusive. |
| 1365 </t> |
| 1366 |
| 1367 <t> |
| 1368 ec_dec_uint() takes a single, positive parameter, ft, which is not necessarily |
| 1369 a power of two, and returns an integer, t, whose value lies between 0 and |
| 1370 (ft - 1), inclusive. |
| 1371 Let ftb = ilog(ft - 1), i.e., the number of bits required |
| 1372 to store (ft - 1) in two's complement notation. |
| 1373 If ftb is 8 or less, then t is decoded with t = ec_decode(ft), and |
| 1374 the range coder state is updated using the three-tuple (t, t + 1, |
| 1375 ft). |
| 1376 </t> |
| 1377 <t> |
| 1378 If ftb is greater than 8, then the top 8 bits of t are decoded using |
| 1379 <figure align="center"> |
| 1380 <artwork align="center"><![CDATA[ |
| 1381 t = ec_decode(((ft - 1) >> (ftb - 8)) + 1) , |
| 1382 ]]></artwork> |
| 1383 </figure> |
| 1384 the decoder state is updated using the three-tuple |
| 1385 (t, t + 1, |
| 1386 ((ft - 1) >> (ftb - 8)) + 1), |
| 1387 and the remaining bits are decoded as raw bits, setting |
| 1388 <figure align="center"> |
| 1389 <artwork align="center"><![CDATA[ |
| 1390 t = (t << (ftb - 8)) | ec_dec_bits(ftb - 8) . |
| 1391 ]]></artwork> |
| 1392 </figure> |
| 1393 If, at this point, t >= ft, then the current frame is corrupt. |
| 1394 In that case, the decoder should assume there has been an error in the coding, |
| 1395 decoding, or transmission and SHOULD take measures to conceal the |
| 1396 error and/or report to the application that the error has occurred. |
| 1397 </t> |
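<t>
The arithmetic above can be sketched in C as follows.
This is a non-normative illustration that covers only the splitting of the
 value and the final range check; the calls to ec_decode() and ec_dec_bits()
 are left out, and their results are passed in as the parameters hi and raw.
The helper names uint_coded_ft(), uint_raw_bits(), and uint_combine() are
 hypothetical, and ilog() is written out here under the assumption that
 ilog(0) is 0.
</t>
<figure align="center">
<artwork align="center"><![CDATA[
```c
#include <stdint.h>

/* Minimum number of bits needed to store n (assumed 0 for n == 0). */
static int ilog(uint32_t n)
{
    int bits = 0;
    while (n != 0) {
        bits++;
        n >>= 1;
    }
    return bits;
}

/* Total frequency to pass to the range decoder for a given ft > 0. */
static uint32_t uint_coded_ft(uint32_t ft)
{
    int ftb = ilog(ft - 1);
    return ftb <= 8 ? ft : ((ft - 1) >> (ftb - 8)) + 1;
}

/* Number of raw bits that follow the range coded symbol. */
static int uint_raw_bits(uint32_t ft)
{
    int ftb = ilog(ft - 1);
    return ftb > 8 ? ftb - 8 : 0;
}

/* Combine the range coded part (hi) with the raw bits (raw).
 * Returns 0 and stores the value in *t, or returns -1 when the
 * result is not less than ft, i.e., the frame is corrupt. */
static int uint_combine(uint32_t ft, uint32_t hi, uint32_t raw,
                        uint32_t *t)
{
    int      nraw = uint_raw_bits(ft);
    uint32_t v    = nraw > 0 ? ((hi << nraw) | raw) : hi;
    if (v >= ft) {
        return -1;
    }
    *t = v;
    return 0;
}
```
]]></artwork>
</figure>
<t>
For example, with ft = 1001 the range coded symbol has a total frequency of
 251 and is followed by 2 raw bits, so a symbol of 250 followed by raw bits
 equal to 3 yields 1003 and is flagged as corrupt.
</t>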
| 1398 |
| 1399 </section> |
| 1400 |
| 1401 <section anchor="decoder-tell" title="Current Bit Usage"> |
| 1402 <t> |
| 1403 The bit allocation routines in the CELT decoder need a conservative upper bound |
| 1404 on the number of bits that have been used from the current frame thus far, |
| 1405 including both range coder bits and raw bits. |
| 1406 This drives allocation decisions that must match those made in the encoder. |
| 1407 The upper bound is computed in the reference implementation to whole-bit |
| 1408 precision by the function ec_tell() (entcode.h) and to fractional 1/8th bit |
| 1409 precision by the function ec_tell_frac() (entcode.c). |
| 1410 Like all operations in the range coder, it must be implemented in a bit-exact |
| 1411 manner, and must produce exactly the same value returned by the same functions |
| 1412 in the encoder after encoding the same symbols. |
| 1413 </t> |
| 1414 <t> |
| 1415 ec_tell() is guaranteed to return ceil(ec_tell_frac()/8.0). |
| 1416 In various places the codec will check to ensure there is enough room to |
| 1417 contain a symbol before attempting to decode it. |
| 1418 In practice, although the number of bits used so far is an upper bound, |
| 1419 decoding a symbol whose probability model suggests it has a worst-case cost of |
| 1420 p 1/8th bits may actually advance the return value of ec_tell_frac() by |
| 1421 p-1, p, or p+1 1/8th bits, due to approximation error in that upper bound, |
| 1422 truncation error in the range coder, and for large values of ft, modeling |
| 1423 error in ec_dec_uint(). |
| 1424 </t> |
| 1425 <t> |
| 1426 However, this error is bounded, and periodic calls to ec_tell() or |
| 1427 ec_tell_frac() at precisely defined points in the decoding process prevent it |
| 1428 from accumulating. |
| 1429 For a range coder symbol that requires a whole number of bits (i.e., |
| 1430 for which ft/(fh[k] - fl[k]) is a power of two), where there are at |
| 1431 least p 1/8th bits available, decoding the symbol will never cause ec_tell() or |
| 1432 ec_tell_frac() to exceed the size of the frame ("bust the budget"). |
| 1433 In this case the return value of ec_tell_frac() will only advance by more than |
| 1434 p 1/8th bits if there was an additional, fractional number of bits remaining, |
| 1435 and it will never advance beyond the next whole-bit boundary, which is safe, |
| 1436 since frames always contain a whole number of bits. |
| 1437 However, when p is not a whole number of bits, an extra 1/8th bit is required |
| 1438 to ensure that decoding the symbol will not bust the budget. |
| 1439 </t> |
| 1440 <t> |
| 1441 The reference implementation keeps track of the total number of whole bits that |
| 1442 have been processed by the decoder so far in the variable nbits_total, |
| 1443 including the (possibly fractional) number of bits that are currently |
| 1444 buffered, but not consumed, inside the range coder. |
| 1445 nbits_total is initialized to 9 just before the initial range renormalization |
| 1446 process completes (or equivalently, it can be initialized to 33 after the |
| 1447 first renormalization). |
| 1448 The extra two bits over the actual amount buffered by the range coder |
| 1449 guarantee that it is an upper bound and that there is enough room for the |
| 1450 encoder to terminate the stream. |
| 1451 Each iteration through the range coder's renormalization loop increases |
| 1452 nbits_total by 8. |
| 1453 Reading raw bits increases nbits_total by the number of raw bits read. |
| 1454 </t> |
| 1455 |
| 1456 <section anchor="ec_tell" title="ec_tell()"> |
| 1457 <t> |
| 1458 The whole number of bits buffered in rng may be estimated via lg = ilog(rng). |
| 1459 ec_tell() then becomes a simple matter of removing these bits from the total. |
| 1460 It returns (nbits_total - lg). |
| 1461 </t> |
| 1462 <t> |
| 1463 In a newly initialized decoder, before any symbols have been read, this reports |
| 1464 that 1 bit has been used. |
| 1465 This is the bit reserved for termination of the encoder. |
| 1466 </t> |
| 1467 </section> |
| 1468 |
| 1469 <section anchor="ec_tell_frac" title="ec_tell_frac()"> |
| 1470 <t> |
| 1471 ec_tell_frac() estimates the number of bits buffered in rng to fractional |
| 1472 precision. |
| 1473 Since rng must be greater than 2**23 after renormalization, lg must be at least |
| 1474 24. |
| 1475 Let |
| 1476 <figure align="center"> |
| 1477 <artwork align="center"> |
| 1478 <![CDATA[ |
| 1479 r_Q15 = rng >> (lg-16) , |
| 1480 ]]></artwork> |
| 1481 </figure> |
| 1482 so that 32768 <= r_Q15 < 65536, an unsigned Q15 value representing the |
| 1483 fractional part of rng. |
| 1484 Then the following procedure can be used to add one bit of precision to lg. |
| 1485 First, update |
| 1486 <figure align="center"> |
| 1487 <artwork align="center"> |
| 1488 <![CDATA[ |
| 1489 r_Q15 = (r_Q15*r_Q15) >> 15 . |
| 1490 ]]></artwork> |
| 1491 </figure> |
| 1492 Then add the 16th bit of r_Q15 to lg via |
| 1493 <figure align="center"> |
| 1494 <artwork align="center"> |
| 1495 <![CDATA[ |
| 1496 lg = 2*lg + (r_Q15 >> 16) . |
| 1497 ]]></artwork> |
| 1498 </figure> |
| 1499 Finally, if this bit was a 1, reduce r_Q15 by a factor of two via |
| 1500 <figure align="center"> |
| 1501 <artwork align="center"> |
| 1502 <![CDATA[ |
| 1503 r_Q15 = r_Q15 >> 1 , |
| 1504 ]]></artwork> |
| 1505 </figure> |
| 1506 so that it once again lies in the range 32768 <= r_Q15 < 65536. |
| 1507 </t> |
| 1508 <t> |
| 1509 This procedure is repeated three times to extend lg to 1/8th bit precision. |
| 1510 ec_tell_frac() then returns (nbits_total*8 - lg). |
| 1511 </t> |
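<t>
As a non-normative illustration, the whole-bit and 1/8th bit estimates
 described in this section and the preceding one can be sketched in C as
 follows.
The function names tell_bits() and tell_frac() are hypothetical, the decoder
 state is passed in as plain parameters, and ilog() is written out under the
 assumption that it returns the minimum number of bits needed to store its
 argument.
The sketch also assumes rng is greater than 2**23, as it is after
 renormalization.
</t>
<figure align="center">
<artwork align="center"><![CDATA[
```c
#include <stdint.h>

/* Minimum number of bits needed to store n (assumed 0 for n == 0). */
static int ilog(uint32_t n)
{
    int bits = 0;
    while (n != 0) {
        bits++;
        n >>= 1;
    }
    return bits;
}

/* Whole-bit estimate: nbits_total minus the bits buffered in rng. */
static int tell_bits(int nbits_total, uint32_t rng)
{
    return nbits_total - ilog(rng);
}

/* 1/8th bit estimate, following the three squaring steps. */
static int tell_frac(int nbits_total, uint32_t rng)
{
    int      lg    = ilog(rng);
    uint32_t r_Q15 = rng >> (lg - 16);     /* 32768 <= r_Q15 < 65536 */
    for (int i = 0; i < 3; i++) {
        r_Q15 = (r_Q15 * r_Q15) >> 15;     /* fits in 32 bits */
        uint32_t bit = r_Q15 >> 16;        /* next fractional bit of lg */
        lg = 2 * lg + (int)bit;
        r_Q15 >>= bit;                     /* back into [32768, 65536) */
    }
    return nbits_total * 8 - lg;
}
```
]]></artwork>
</figure>
<t>
With nbits_total equal to 33 and rng equal to 2**31, the state after the first
 renormalization, the two functions return 1 and 8, respectively.
</t>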
| 1512 </section> |
| 1513 |
| 1514 </section> |
| 1515 |
| 1516 </section> |
| 1517 |
| 1518 <section anchor="silk_decoder_outline" title="SILK Decoder"> |
| 1519 <t> |
| 1520 The decoder's LP layer uses a modified version of the SILK codec (herein simply |
| 1521 called "SILK"), which runs a decoded excitation signal through adaptive |
| 1522 long-term and short-term prediction synthesis filters. |
| 1523 It runs at NB, MB, and WB sample rates internally. |
| 1524 When used in a SWB or FB Hybrid frame, the LP layer itself still only runs in |
| 1525 WB. |
| 1526 </t> |
| 1527 |
| 1528 <section title="SILK Decoder Modules"> |
| 1529 <t> |
| 1530 An overview of the decoder is given in <xref target="silk_decoder_figure"/>. |
| 1531 </t> |
| 1532 <figure align="center" anchor="silk_decoder_figure" title="SILK Decoder"> |
| 1533 <artwork align="center"> |
| 1534 <![CDATA[ |
| 1535 +---------+ +------------+ |
| 1536 -->| Range |--->| Decode |---------------------------+ |
| 1537 1 | Decoder | 2 | Parameters |----------+ 5 | |
| 1538 +---------+ +------------+ 4 | | |
| 1539 3 | | | |
| 1540 \/ \/ \/ |
| 1541 +------------+ +------------+ +------------+ |
| 1542 | Generate |-->| LTP |-->| LPC | |
| 1543 | Excitation | | Synthesis | | Synthesis | |
| 1544 +------------+ +------------+ +------------+ |
| 1545 ^ | |
| 1546 | | |
| 1547 +-------------------+----------------+ |
| 1548 | 6 |
| 1549 | +------------+ +-------------+ |
| 1550 +-->| Stereo |-->| Sample Rate |--> |
| 1551 | Unmixing | 7 | Conversion | 8 |
| 1552 +------------+ +-------------+ |
| 1553 |
| 1554 1: Range encoded bitstream |
| 1555 2: Coded parameters |
| 1556 3: Pulses, LSBs, and signs |
| 1557 4: Pitch lags, Long-Term Prediction (LTP) coefficients |
| 1558 5: Linear Predictive Coding (LPC) coefficients and gains |
| 1559 6: Decoded signal (mono or mid-side stereo) |
| 1560 7: Unmixed signal (mono or left-right stereo) |
| 1561 8: Resampled signal |
| 1562 ]]> |
| 1563 </artwork> |
| 1564 </figure> |
| 1565 |
| 1566 <t> |
| 1567 The decoder feeds the bitstream (1) to the range decoder from |
| 1568 <xref target="range-decoder"/>, and then decodes the parameters in it (2) |
| 1569 using the procedures detailed in |
| 1570 Sections <xref format="counter" target="silk_header_bits"/> |
| 1571 through <xref format="counter" target="silk_signs"/>. |
| 1572 These parameters (3, 4, 5) are used to generate an excitation signal (see |
| 1573 <xref target="silk_excitation_reconstruction"/>), which is fed to an optional |
| 1574 long-term prediction (LTP) filter (voiced frames only, see |
| 1575 <xref target="silk_ltp_synthesis"/>) and then a short-term prediction filter |
| 1576 (see <xref target="silk_lpc_synthesis"/>), producing the decoded signal (6). |
| 1577 For stereo streams, the mid-side representation is converted to separate left |
| 1578 and right channels (7). |
| 1579 The result is finally resampled to the desired output sample rate (e.g., |
| 1580 48 kHz) so that the resampled signal (8) can be mixed with the CELT |
| 1581 layer. |
| 1582 </t> |
| 1583 |
| 1584 </section> |
| 1585 |
| 1586 <section anchor="silk_layer_organization" title="LP Layer Organization"> |
| 1587 |
| 1588 <t> |
| 1589 Internally, the LP layer of a single Opus frame is composed of either a single |
| 1590 10 ms regular SILK frame or between one and three 20 ms regular SILK |
| 1591 frames. |
| 1592 A stereo Opus frame may double the number of regular SILK frames (up to a total |
| 1593 of six), since it includes separate frames for a mid channel and, optionally, |
| 1594 a side channel. |
| 1595 Optional Low Bit-Rate Redundancy (LBRR) frames, which are reduced-bitrate |
| 1596 encodings of previous SILK frames, may be included to aid in recovery from |
| 1597 packet loss. |
| 1598 If present, these appear before the regular SILK frames. |
| 1599 They are in most respects identical to regular, active SILK frames, except that |
| 1600 they are usually encoded with a lower bitrate. |
| 1601 This draft uses "SILK frame" to refer to either one and "regular SILK frame" if |
| 1602 it needs to draw a distinction between the two. |
| 1603 </t> |
| 1604 <t> |
| 1605 Logically, each SILK frame is in turn composed of either two or four 5 ms |
| 1606 subframes. |
| 1607 Various parameters, such as the quantization gain of the excitation and the |
| 1608 pitch lag and filter coefficients, can vary on a subframe-by-subframe basis. |
| 1609 Physically, the parameters for each subframe are interleaved in the bitstream, |
| 1610 as described in the relevant sections for each parameter. |
| 1611 </t> |
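<t>
The frame and subframe counts described above can be summarized with the
 following non-normative C sketch.
The function names are hypothetical, the Opus frame duration is assumed to be
 given in milliseconds, and the counts are per channel, so a stereo Opus frame
 may contain up to twice as many regular SILK frames.
</t>
<figure align="center">
<artwork align="center"><![CDATA[
```c
/* Number of regular SILK frames per channel in the LP layer of one
 * Opus frame, or -1 for a duration the LP layer does not use. */
static int silk_frames_per_channel(int opus_frame_ms)
{
    switch (opus_frame_ms) {
    case 10: return 1;  /* a single 10 ms SILK frame */
    case 20: return 1;  /* one 20 ms SILK frame */
    case 40: return 2;  /* two 20 ms SILK frames */
    case 60: return 3;  /* three 20 ms SILK frames */
    default: return -1;
    }
}

/* Number of 5 ms subframes in each SILK frame, or -1 as above. */
static int silk_subframes_per_frame(int opus_frame_ms)
{
    if (silk_frames_per_channel(opus_frame_ms) < 0) {
        return -1;
    }
    return opus_frame_ms == 10 ? 2 : 4;
}
```
]]></artwork>
</figure>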
| 1612 <t> |
| 1613 All of these frames and subframes are decoded from the same range coder, with |
| 1614 no padding between them. |
| 1615 Thus packing multiple SILK frames in a single Opus frame saves, on average, |
| 1616 half a byte per SILK frame. |
| 1617 It also allows some parameters to be predicted from prior SILK frames in the |
| 1618 same Opus frame, since this does not degrade packet loss robustness (beyond |
| 1619 any penalty for merely using fewer, larger packets to store multiple frames). |
| 1620 </t> |
| 1621 |
| 1622 <t> |
| 1623 Stereo support in SILK uses a variant of mid-side coding, allowing a mono |
| 1624 decoder to simply decode the mid channel. |
| 1625 However, the data for the two channels is interleaved, so a mono decoder must |
| 1626 still unpack the data for the side channel. |
| 1627 It would be required to do so anyway for Hybrid Opus frames, or to support |
| 1628 decoding individual 20 ms frames. |
| 1629 </t> |
| 1630 |
| 1631 <t> |
| 1632 <xref target="silk_symbols"/> summarizes the overall grouping of the contents of |
| 1633 the LP layer. |
| 1634 Figures <xref format="counter" target="silk_mono_60ms_frame"/> |
| 1635 and <xref format="counter" target="silk_stereo_60ms_frame"/> illustrate |
| 1636 the ordering of the various SILK frames for a 60 ms Opus frame, for both |
| 1637 mono and stereo, respectively. |
| 1638 </t> |
| 1639 |
| 1640 <texttable anchor="silk_symbols" |
| 1641 title="Organization of the SILK layer of an Opus frame"> |
| 1642 <ttcol align="center">Symbol(s)</ttcol> |
| 1643 <ttcol align="center">PDF(s)</ttcol> |
| 1644 <ttcol align="center">Condition</ttcol> |
| 1645 |
| 1646 <c>Voice Activity Detection (VAD) flags</c> |
| 1647 <c>{1, 1}/2</c> |
| 1648 <c/> |
| 1649 |
| 1650 <c>LBRR flag</c> |
| 1651 <c>{1, 1}/2</c> |
| 1652 <c/> |
| 1653 |
| 1654 <c>Per-frame LBRR flags</c> |
| 1655 <c><xref target="silk_lbrr_flag_pdfs"/></c> |
| 1656 <c><xref target="silk_lbrr_flags"/></c> |
| 1657 |
| 1658 <c>LBRR Frame(s)</c> |
| 1659 <c><xref target="silk_frame"/></c> |
| 1660 <c><xref target="silk_lbrr_flags"/></c> |
| 1661 |
| 1662 <c>Regular SILK Frame(s)</c> |
| 1663 <c><xref target="silk_frame"/></c> |
| 1664 <c/> |
| 1665 |
| 1666 </texttable> |
| 1667 |
| 1668 <figure align="center" anchor="silk_mono_60ms_frame" |
| 1669 title="A 60 ms Mono Frame"> |
| 1670 <artwork align="center"><![CDATA[ |
| 1671 +---------------------------------+ |
| 1672 | VAD Flags | |
| 1673 +---------------------------------+ |
| 1674 | LBRR Flag | |
| 1675 +---------------------------------+ |
| 1676 | Per-Frame LBRR Flags (Optional) | |
| 1677 +---------------------------------+ |
| 1678 | LBRR Frame 1 (Optional) | |
| 1679 +---------------------------------+ |
| 1680 | LBRR Frame 2 (Optional) | |
| 1681 +---------------------------------+ |
| 1682 | LBRR Frame 3 (Optional) | |
| 1683 +---------------------------------+ |
| 1684 | Regular SILK Frame 1 | |
| 1685 +---------------------------------+ |
| 1686 | Regular SILK Frame 2 | |
| 1687 +---------------------------------+ |
| 1688 | Regular SILK Frame 3 | |
| 1689 +---------------------------------+ |
| 1690 ]]></artwork> |
| 1691 </figure> |
| 1692 |
| 1693 <figure align="center" anchor="silk_stereo_60ms_frame" |
| 1694 title="A 60 ms Stereo Frame"> |
| 1695 <artwork align="center"><![CDATA[ |
| 1696 +---------------------------------------+ |
| 1697 | Mid VAD Flags | |
| 1698 +---------------------------------------+ |
| 1699 | Mid LBRR Flag | |
| 1700 +---------------------------------------+ |
| 1701 | Side VAD Flags | |
| 1702 +---------------------------------------+ |
| 1703 | Side LBRR Flag | |
| 1704 +---------------------------------------+ |
| 1705 | Mid Per-Frame LBRR Flags (Optional) | |
| 1706 +---------------------------------------+ |
| 1707 | Side Per-Frame LBRR Flags (Optional) | |
| 1708 +---------------------------------------+ |
| 1709 | Mid LBRR Frame 1 (Optional) | |
| 1710 +---------------------------------------+ |
| 1711 | Side LBRR Frame 1 (Optional) | |
| 1712 +---------------------------------------+ |
| 1713 | Mid LBRR Frame 2 (Optional) | |
| 1714 +---------------------------------------+ |
| 1715 | Side LBRR Frame 2 (Optional) | |
| 1716 +---------------------------------------+ |
| 1717 | Mid LBRR Frame 3 (Optional) | |
| 1718 +---------------------------------------+ |
| 1719 | Side LBRR Frame 3 (Optional) | |
| 1720 +---------------------------------------+ |
| 1721 | Mid Regular SILK Frame 1 | |
| 1722 +---------------------------------------+ |
| 1723 | Side Regular SILK Frame 1 (Optional) | |
| 1724 +---------------------------------------+ |
| 1725 | Mid Regular SILK Frame 2 | |
| 1726 +---------------------------------------+ |
| 1727 | Side Regular SILK Frame 2 (Optional) | |
| 1728 +---------------------------------------+ |
| 1729 | Mid Regular SILK Frame 3 | |
| 1730 +---------------------------------------+ |
| 1731 | Side Regular SILK Frame 3 (Optional) | |
| 1732 +---------------------------------------+ |
| 1733 ]]></artwork> |
| 1734 </figure> |
| 1735 |
| 1736 </section> |
| 1737 |
| 1738 <section anchor="silk_header_bits" title="Header Bits"> |
| 1739 <t> |
| 1740 The LP layer begins with two to eight header bits, decoded in silk_Decode() |
| 1741 (dec_API.c). |
| 1742 These consist of one Voice Activity Detection (VAD) bit per frame (up to 3), |
| 1743 followed by a single flag indicating the presence of LBRR frames. |
| 1744 For a stereo packet, these first flags correspond to the mid channel, and a |
| 1745 second set of flags is included for the side channel. |
| 1746 </t> |
| 1747 <t> |
| 1748 Because these are the first symbols decoded by the range coder and because they |
| 1749 are coded as binary values with uniform probability, they can be extracted |
| 1750 directly from the most significant bits of the first byte of compressed data. |
| 1751 Thus, a receiver can determine if an Opus frame contains any active SILK frames |
| 1752 without the overhead of using the range decoder. |
| 1753 </t> |
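<t>
A receiver could perform this check along the lines of the following
 non-normative C sketch for a mono Opus frame.
The function names are hypothetical.
The sketch assumes that the k-th flag in decoding order (counting from zero)
 is the k-th most significant bit of the first byte of range coded data, and
 that the caller already knows the number of SILK frames.
</t>
<figure align="center">
<artwork align="center"><![CDATA[
```c
#include <stdint.h>

/* k-th header flag in decoding order, k = 0..7 (assumed bit order:
 * the first flag decoded is the most significant bit). */
static int silk_header_flag(uint8_t first_byte, int k)
{
    return (first_byte >> (7 - k)) & 1;
}

/* Nonzero if any of the nframes VAD flags of a mono frame is set. */
static int mono_any_active(uint8_t first_byte, int nframes)
{
    int active = 0;
    for (int i = 0; i < nframes; i++) {
        active |= silk_header_flag(first_byte, i);
    }
    return active;
}

/* The LBRR flag of a mono frame follows its nframes VAD flags. */
static int mono_lbrr_flag(uint8_t first_byte, int nframes)
{
    return silk_header_flag(first_byte, nframes);
}
```
]]></artwork>
</figure>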
| 1754 </section> |
| 1755 |
| 1756 <section anchor="silk_lbrr_flags" title="Per-Frame LBRR Flags"> |
| 1757 <t> |
| 1758 For Opus frames longer than 20 ms, a set of LBRR flags is |
| 1759 decoded for each channel that has its LBRR flag set. |
| 1760 Each set contains one flag per 20 ms SILK frame. |
| 1761 40 ms Opus frames use the 2-frame LBRR flag PDF from |
| 1762 <xref target="silk_lbrr_flag_pdfs"/>, and 60 ms Opus frames use the |
| 1763 3-frame LBRR flag PDF. |
| 1764 For each channel, the resulting 2- or 3-bit integer contains the corresponding |
| 1765 LBRR flag for each frame, packed in order from the LSB to the MSB. |
| 1766 </t> |
| 1767 |
| 1768 <texttable anchor="silk_lbrr_flag_pdfs" title="LBRR Flag PDFs"> |
| 1769 <ttcol>Frame Size</ttcol> |
| 1770 <ttcol>PDF</ttcol> |
| 1771 <c>40 ms</c> <c>{0, 53, 53, 150}/256</c> |
| 1772 <c>60 ms</c> <c>{0, 41, 20, 29, 41, 15, 28, 82}/256</c> |
| 1773 </texttable> |
| 1774 |
| 1775 <t> |
| 1776 A 10 or 20 ms Opus frame does not contain any per-frame LBRR flags, |
| 1777 as there may be at most one LBRR frame per channel. |
| 1778 The global LBRR flag in the header bits (see <xref target="silk_header_bits"/>) |
| 1779 is already sufficient to indicate the presence of that single LBRR frame. |
| 1780 </t> |
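<t>
The handling of the LBRR flags for one channel can be sketched in C as
 follows.
This is a non-normative illustration: the function name and its parameters are
 hypothetical, and symbol is assumed to be the value already decoded with the
 2-frame or 3-frame PDF (it is ignored when there is only one SILK frame).
</t>
<figure align="center">
<artwork align="center"><![CDATA[
```c
/* Fill flags[0..2] with the per-frame LBRR flags of one channel.
 *   global_flag: the channel's LBRR flag from the header bits
 *   nframes:     number of SILK frames per channel (1, 2, or 3)
 *   symbol:      decoded per-frame flag value (unused if nframes == 1)
 */
static void lbrr_flags_for_channel(int global_flag, int nframes,
                                   unsigned symbol, int flags[3])
{
    for (int i = 0; i < 3; i++) {
        flags[i] = 0;
    }
    if (!global_flag) {
        return;           /* no LBRR frames for this channel */
    }
    if (nframes == 1) {
        flags[0] = 1;     /* the global flag is already sufficient */
        return;
    }
    for (int i = 0; i < nframes; i++) {
        /* Flags are packed in order from the LSB to the MSB. */
        flags[i] = (int)((symbol >> i) & 1u);
    }
}
```
]]></artwork>
</figure>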
| 1781 |
| 1782 </section> |
| 1783 |
| 1784 <section anchor="silk_lbrr_frames" title="LBRR Frames"> |
| 1785 <t> |
| 1786 The LBRR frames, if present, contain an encoded representation of the signal |
| 1787 immediately prior to the current Opus frame as if it were encoded with the |
| 1788 current mode, frame size, audio bandwidth, and channel count, even if those |
| 1789 differ from the prior Opus frame. |
| 1790 When one of these parameters changes from one Opus frame to the next, this |
| 1791 implies that the LBRR frames of the current Opus frame may not be simple |
| 1792 drop-in replacements for the contents of the previous Opus frame. |
| 1793 </t> |
| 1794 |
| 1795 <t> |
| 1796 For example, when switching from 20 ms to 60 ms, the 60 ms Opus |
| 1797 frame may contain LBRR frames covering up to three prior 20 ms Opus |
| 1798 frames, even if those frames already contained LBRR frames covering some of |
| 1799 the same time periods. |
| 1800 When switching from 20 ms to 10 ms, the 10 ms Opus frame can |
| 1801 contain an LBRR frame covering at most half the prior 20 ms Opus frame, |
| 1802 potentially leaving a hole that needs to be concealed from even a single |
| 1803 packet loss (see <xref target="Packet Loss Concealment"/>). |
| 1804 When switching from mono to stereo, the LBRR frames in the first stereo Opus |
| 1805 frame MAY contain a non-trivial side channel. |
| 1806 </t> |
| 1807 |
| 1808 <t> |
| 1809 In order to properly produce LBRR frames under all conditions, an encoder might |
| 1810 need to buffer up to 60 ms of audio and re-encode it during these |
| 1811 transitions. |
| 1812 However, the reference implementation opts to disable LBRR frames at the |
| 1813 transition point for simplicity. |
| 1814 Since transitions are relatively infrequent in normal usage, this does not have |
| 1815 a significant impact on packet loss robustness. |
| 1816 </t> |
| 1817 |
| 1818 <t> |
| 1819 The LBRR frames immediately follow the LBRR flags, prior to any regular SILK |
| 1820 frames. |
| 1821 <xref target="silk_frame"/> describes their exact contents. |
| 1822 LBRR frames do not include their own separate VAD flags. |
| 1823 LBRR frames are only meant to be transmitted for active speech; thus, all LBRR |
| 1824 frames are treated as active. |
| 1825 </t> |
| 1826 |
| 1827 <t> |
| 1828 In a stereo Opus frame longer than 20 ms, although the per-frame LBRR |
| 1829 flags for the mid channel are coded as a unit before the per-frame LBRR flags |
| 1830 for the side channel, the LBRR frames themselves are interleaved. |
| 1831 The decoder parses an LBRR frame for the mid channel of a given 20 ms |
| 1832 interval (if present) and then immediately parses the corresponding LBRR |
| 1833 frame for the side channel (if present), before proceeding to the next |
| 1834 20 ms interval. |
| 1835 </t> |
| 1836 </section> |
| 1837 |
| 1838 <section anchor="silk_regular_frames" title="Regular SILK Frames"> |
| 1839 <t> |
| 1840 The regular SILK frame(s) follow the LBRR frames (if any). |
| 1841 <xref target="silk_frame"/> describes their contents, as well. |
| 1842 Unlike the LBRR frames, a regular SILK frame is coded for each time interval in |
| 1843 an Opus frame, even if the corresponding VAD flags are unset. |
| 1844 For stereo Opus frames longer than 20 ms, the regular mid and side SILK |
| 1845 frames for each 20 ms interval are interleaved, just as with the LBRR |
| 1846 frames. |
| 1847 The side frame may be skipped by coding an appropriate flag, as detailed in |
| 1848 <xref target="silk_mid_only_flag"/>. |
| 1849 </t> |
| 1850 </section> |
| 1851 |
| 1852 <section anchor="silk_frame" title="SILK Frame Contents"> |
| 1853 <t> |
| 1854 Each SILK frame includes a set of side information that encodes |
| 1855 <list style="symbols"> |
| 1856 <t>The frame type and quantization type (<xref target="silk_frame_type"/>),</t> |
| 1857 <t>Quantization gains (<xref target="silk_gains"/>),</t> |
| 1858 <t>Short-term prediction filter coefficients (<xref target="silk_nlsfs"/>),</t> |
| 1859 <t>A Line Spectral Frequencies (LSF) interpolation weight (<xref target="silk_nlsf_interpolation"/>),</t> |
| 1860 <t> |
| 1861 Long-term prediction filter lags and gains (<xref target="silk_ltp_params"/>), |
| 1862 and |
| 1863 </t> |
| 1864 <t>A linear congruential generator (LCG) seed (<xref target="silk_seed"/>).</t> |
| 1865 </list> |
| 1866 The quantized excitation signal (see <xref target="silk_excitation"/>) follows |
| 1867 these at the end of the frame. |
| 1868 <xref target="silk_frame_symbols"/> details the overall organization of a |
| 1869 SILK frame. |
| 1870 </t> |
| 1871 |
| 1872 <texttable anchor="silk_frame_symbols" |
| 1873 title="Order of the symbols in an individual SILK frame"> |
| 1874 <ttcol align="center">Symbol(s)</ttcol> |
| 1875 <ttcol align="center">PDF(s)</ttcol> |
| 1876 <ttcol align="center">Condition</ttcol> |
| 1877 |
| 1878 <c>Stereo Prediction Weights</c> |
| 1879 <c><xref target="silk_stereo_pred_pdfs"/></c> |
| 1880 <c><xref target="silk_stereo_pred"/></c> |
| 1881 |
| 1882 <c>Mid-only Flag</c> |
| 1883 <c><xref target="silk_mid_only_pdf"/></c> |
| 1884 <c><xref target="silk_mid_only_flag"/></c> |
| 1885 |
| 1886 <c>Frame Type</c> |
| 1887 <c><xref target="silk_frame_type"/></c> |
| 1888 <c/> |
| 1889 |
| 1890 <c>Subframe Gains</c> |
| 1891 <c><xref target="silk_gains"/></c> |
| 1892 <c/> |
| 1893 |
| 1894 <c>Normalized LSF Stage-1 Index</c> |
| 1895 <c><xref target="silk_nlsf_stage1_pdfs"/></c> |
| 1896 <c/> |
| 1897 |
| 1898 <c>Normalized LSF Stage-2 Residual</c> |
| 1899 <c><xref target="silk_nlsf_stage2"/></c> |
| 1900 <c/> |
| 1901 |
| 1902 <c>Normalized LSF Interpolation Weight</c> |
| 1903 <c><xref target="silk_nlsf_interp_pdf"/></c> |
| 1904 <c>20 ms frame</c> |
| 1905 |
| 1906 <c>Primary Pitch Lag</c> |
| 1907 <c><xref target="silk_ltp_lags"/></c> |
| 1908 <c>Voiced frame</c> |
| 1909 |
| 1910 <c>Subframe Pitch Contour</c> |
| 1911 <c><xref target="silk_pitch_contour_pdfs"/></c> |
| 1912 <c>Voiced frame</c> |
| 1913 |
| 1914 <c>Periodicity Index</c> |
| 1915 <c><xref target="silk_perindex_pdf"/></c> |
| 1916 <c>Voiced frame</c> |
| 1917 |
| 1918 <c>LTP Filter</c> |
| 1919 <c><xref target="silk_ltp_filter_pdfs"/></c> |
| 1920 <c>Voiced frame</c> |
| 1921 |
| 1922 <c>LTP Scaling</c> |
| 1923 <c><xref target="silk_ltp_scaling_pdf"/></c> |
| 1924 <c><xref target="silk_ltp_scaling"/></c> |
| 1925 |
| 1926 <c>LCG Seed</c> |
| 1927 <c><xref target="silk_seed_pdf"/></c> |
| 1928 <c/> |
| 1929 |
| 1930 <c>Excitation Rate Level</c> |
| 1931 <c><xref target="silk_rate_level_pdfs"/></c> |
| 1932 <c/> |
| 1933 |
| 1934 <c>Excitation Pulse Counts</c> |
| 1935 <c><xref target="silk_pulse_count_pdfs"/></c> |
| 1936 <c/> |
| 1937 |
| 1938 <c>Excitation Pulse Locations</c> |
| 1939 <c><xref target="silk_pulse_locations"/></c> |
| 1940 <c>Non-zero pulse count</c> |
| 1941 |
| 1942 <c>Excitation LSBs</c> |
| 1943 <c><xref target="silk_shell_lsb_pdf"/></c> |
| 1944 <c><xref target="silk_pulse_counts"/></c> |
| 1945 |
| 1946 <c>Excitation Signs</c> |
| 1947 <c><xref target="silk_sign_pdfs"/></c> |
| 1948 <c/> |
| 1949 |
| 1950 </texttable> |
| 1951 |
| 1952 <section anchor="silk_stereo_pred" toc="include" |
| 1953 title="Stereo Prediction Weights"> |
| 1954 <t> |
| 1955 A SILK frame corresponding to the mid channel of a stereo Opus frame begins |
| 1956 with a pair of side channel prediction weights, designed such that zeros |
| 1957 indicate normal mid-side coupling. |
| 1958 Since these weights can change on every frame, the first portion of each frame |
| 1959 linearly interpolates between the previous weights and the current ones, using |
| 1960 zeros for the previous weights if none are available. |
| 1961 These prediction weights are never included in a mono Opus frame, and the |
| 1962 previous weights are reset to zeros on any transition from mono to stereo. |
| 1963 They are also not included in an LBRR frame for the side channel, even if the |
| 1964 LBRR flags indicate the corresponding mid channel was not coded. |
| 1965 In that case, the previous weights are used, again substituting in zeros if no |
| 1966 previous weights are available since the last decoder reset |
| 1967 (see <xref target="decoder-reset"/>). |
| 1968 </t> |
| 1969 |
| 1970 <t> |
| 1971 To summarize, these weights are coded if and only if |
| 1972 <list style="symbols"> |
| 1973 <t>This is a stereo Opus frame (<xref target="toc_byte"/>), and</t> |
| 1974 <t>The current SILK frame corresponds to the mid channel.</t> |
| 1975 </list> |
| 1976 </t> |
| 1977 |
| 1978 <t> |
| 1979 The prediction weights are coded in three separate pieces, which are decoded |
| 1980 by silk_stereo_decode_pred() (decode_stereo_pred.c). |
| 1981 The first piece jointly codes the high-order part of a table index for both |
| 1982 weights. |
| 1983 The second piece codes the low-order part of each table index. |
| 1984 The third piece codes an offset used to linearly interpolate between table |
| 1985 indices. |
| 1986 The details are as follows. |
| 1987 </t> |
| 1988 |
| 1989 <t> |
| 1990 Let n be an index decoded with the 25-element stage-1 PDF in |
| 1991 <xref target="silk_stereo_pred_pdfs"/>. |
| 1992 Then let i0 and i1 be indices decoded with the stage-2 and stage-3 PDFs in |
| 1993 <xref target="silk_stereo_pred_pdfs"/>, respectively, and let i2 and i3 |
| 1994 be two more indices decoded with the stage-2 and stage-3 PDFs, all in that |
| 1995 order. |
| 1996 </t> |
| 1997 |
| 1998 <texttable anchor="silk_stereo_pred_pdfs" title="Stereo Weight PDFs"> |
| 1999 <ttcol align="left">Stage</ttcol> |
| 2000 <ttcol align="left">PDF</ttcol> |
| 2001 <c>Stage 1</c> |
| 2002 <c>{7, 2, 1, 1, 1, |
| 2003 10, 24, 8, 1, 1, |
| 2004 3, 23, 92, 23, 3, |
| 2005 1, 1, 8, 24, 10, |
| 2006 1, 1, 1, 2, 7}/256</c> |
| 2007 |
| 2008 <c>Stage 2</c> |
| 2009 <c>{85, 86, 85}/256</c> |
| 2010 |
| 2011 <c>Stage 3</c> |
| 2012 <c>{51, 51, 52, 51, 51}/256</c> |
| 2013 </texttable> |
| 2014 |
| 2015 <t> |
| 2016 Then use n, i0, and i2 to form two table indices, wi0 and wi1, according to |
| 2017 <figure align="center"> |
| 2018 <artwork align="center"><![CDATA[ |
| 2019 wi0 = i0 + 3*(n/5) |
| 2020 wi1 = i2 + 3*(n%5) |
| 2021 ]]></artwork> |
| 2022 </figure> |
| 2023 where the division is integer division. |
| 2024 The range of these indices is 0 to 14, inclusive. |
| 2025 Let w_Q13[i] be the i'th weight from <xref target="silk_stereo_weights_table"/>. |
| 2026 Then the two prediction weights, w0_Q13 and w1_Q13, are |
| 2027 <figure align="center"> |
| 2028 <artwork align="center"><![CDATA[ |
| 2029 w1_Q13 = w_Q13[wi1] |
| 2030 + (((w_Q13[wi1+1] - w_Q13[wi1])*6554) >> 16)*(2*i3 + 1) |
| 2031 |
| 2032 w0_Q13 = w_Q13[wi0] |
| 2033 + (((w_Q13[wi0+1] - w_Q13[wi0])*6554) >> 16)*(2*i1 + 1) |
| 2034 - w1_Q13 |
| 2035 ]]></artwork> |
| 2036 </figure> |
| 2037 N.b., w1_Q13 is computed first here, because w0_Q13 depends on it. |
| 2038 The constant 6554 is approximately 0.1 in Q16. |
| 2039 Although wi0 and wi1 only have 15 possible values, |
| 2040 <xref target="silk_stereo_weights_table"/> contains 16 entries to allow |
| 2041 interpolation between entry wi0 and (wi0 + 1) (and likewise for wi1). |
| 2042 </t> |
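<t>
The computation above can be sketched in C as follows.
This is a non-normative illustration that assumes n, i0, i1, i2, and i3 have
 already been decoded; the function name stereo_pred_weights() is hypothetical
 and the table is copied from the Stereo Weight Table.
Because the table is increasing, every difference that is shifted right is
 non-negative.
</t>
<figure align="center">
<artwork align="center"><![CDATA[
```c
#include <stdint.h>

/* Stereo weight table in Q13, with 16 entries so that entry
 * (index + 1) always exists for interpolation. */
static const int16_t w_Q13[16] = {
    -13732, -10050, -8266, -7526, -6500, -5000, -2950,  -820,
       820,   2950,  5000,  6500,  7526,  8266, 10050, 13732
};

/* Compute the two prediction weights from the decoded indices. */
static void stereo_pred_weights(int n, int i0, int i1, int i2, int i3,
                                int32_t *w0_Q13, int32_t *w1_Q13)
{
    int wi0 = i0 + 3 * (n / 5);  /* integer division */
    int wi1 = i2 + 3 * (n % 5);

    /* One tenth of the gap to the next entry (6554 is about 0.1 in Q16). */
    int32_t step0 = ((w_Q13[wi0 + 1] - w_Q13[wi0]) * 6554) >> 16;
    int32_t step1 = ((w_Q13[wi1 + 1] - w_Q13[wi1]) * 6554) >> 16;

    /* w1_Q13 is computed first because w0_Q13 depends on it. */
    *w1_Q13 = w_Q13[wi1] + step1 * (2 * i3 + 1);
    *w0_Q13 = w_Q13[wi0] + step0 * (2 * i1 + 1) - *w1_Q13;
}
```
]]></artwork>
</figure>
<t>
For instance, n = 12 with i0 = 1, i1 = 2, i2 = 1, and i3 = 2 selects the
 middle of the table and yields zero for both weights.
</t>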
| 2043 |
| 2044 <texttable anchor="silk_stereo_weights_table" |
| 2045 title="Stereo Weight Table"> |
| 2046 <ttcol align="left">Index</ttcol> |
| 2047 <ttcol align="right">Weight (Q13)</ttcol> |
| 2048 <c>0</c> <c>-13732</c> |
| 2049 <c>1</c> <c>-10050</c> |
| 2050 <c>2</c> <c>-8266</c> |
| 2051 <c>3</c> <c>-7526</c> |
| 2052 <c>4</c> <c>-6500</c> |
| 2053 <c>5</c> <c>-5000</c> |
| 2054 <c>6</c> <c>-2950</c> |
| 2055 <c>7</c> <c>-820</c> |
| 2056 <c>8</c> <c>820</c> |
| 2057 <c>9</c> <c>2950</c> |
| 2058 <c>10</c> <c>5000</c> |
| 2059 <c>11</c> <c>6500</c> |
| 2060 <c>12</c> <c>7526</c> |
| 2061 <c>13</c> <c>8266</c> |
| 2062 <c>14</c> <c>10050</c> |
| 2063 <c>15</c> <c>13732</c> |
| 2064 </texttable> |
| 2065 |
| 2066 </section> |
| 2067 |
| 2068 <section anchor="silk_mid_only_flag" toc="include" title="Mid-only Flag"> |
| 2069 <t> |
| 2070 A flag appears after the stereo prediction weights that indicates if only the |
| 2071 mid channel is coded for this time interval. |
| 2072 It appears only when |
| 2073 <list style="symbols"> |
| 2074 <t>This is a stereo Opus frame (see <xref target="toc_byte"/>),</t> |
| 2075 <t>The current SILK frame corresponds to the mid channel, and</t> |
| 2076 <t>Either |
| 2077 <list style="symbols"> |
| 2078 <t>This is a regular SILK frame where the VAD flags |
| 2079 (see <xref target="silk_header_bits"/>) indicate that the corresponding side |
| 2080 channel is not active.</t> |
| 2081 <t> |
| 2082 This is an LBRR frame where the LBRR flags |
| 2083 (see <xref target="silk_header_bits"/> and <xref target="silk_lbrr_flags"/>) |
| 2084 indicate that the corresponding side channel is not coded. |
| 2085 </t> |
| 2086 </list> |
| 2087 </t> |
| 2088 </list> |
| 2089 It is omitted when there are no stereo weights, for all of the same reasons. |
| 2090 It is also omitted for a regular SILK frame when the VAD flag of the |
| 2091 corresponding side channel frame is set (indicating it is active). |
| 2092 The side channel must be coded in this case, making the mid-only flag |
| 2093 redundant. |
| 2094 It is also omitted for an LBRR frame when the corresponding LBRR flags |
| 2095 indicate the side channel is coded. |
| 2096 </t> |
| 2097 |
| 2098 <t> |
| 2099 When the flag is present, the decoder reads a single value using the PDF in |
| 2100 <xref target="silk_mid_only_pdf"/>, as implemented in |
| 2101 silk_stereo_decode_mid_only() (decode_stereo_pred.c). |
| 2102 If the flag is set, then there is no corresponding SILK frame for the side |
| 2103 channel, the entire decoding process for the side channel is skipped, and |
| 2104 zeros are fed to the stereo unmixing process (see |
| 2105 <xref target="silk_stereo_unmixing"/>) instead. |
| 2106 As stated above, LBRR frames still include this flag when the LBRR flag |
| 2107 indicates that the side channel is not coded. |
| 2108 In that case, if this flag is zero (indicating that there should be a side |
| 2109 channel), then Packet Loss Concealment (PLC, see |
| 2110 <xref target="Packet Loss Concealment"/>) SHOULD be invoked to recover a |
| 2111 side channel signal. |
| 2112 Otherwise, the stereo image will collapse. |
| 2113 </t> |
| 2114 |
| 2115 <texttable anchor="silk_mid_only_pdf" title="Mid-only Flag PDF"> |
| 2116 <ttcol align="left">PDF</ttcol> |
| 2117 <c>{192, 64}/256</c> |
| 2118 </texttable> |
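<t>
The conditions under which the mid-only flag is present can be summarized as
a small predicate.
This is an illustrative sketch only: the function name, the parameter names,
and the convention that channel 0 is the mid channel and channel 1 the side
channel are all assumptions made for this example.
</t>
<figure align="center">
<artwork align="left"><![CDATA[
```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical predicate: returns true when the mid-only flag is coded
 * for the current SILK frame.
 *   stereo         - this is a stereo Opus frame
 *   channel        - 0 for mid, 1 for side (assumed convention)
 *   is_lbrr        - the current SILK frame is an LBRR frame
 *   side_vad_flag  - VAD flag of the corresponding side channel frame
 *   side_lbrr_flag - LBRR flag of the corresponding side channel frame */
static bool mid_only_flag_present(bool stereo, int channel, bool is_lbrr,
                                  bool side_vad_flag, bool side_lbrr_flag)
{
    assert(channel == 0 || channel == 1);

    if (!stereo || channel != 0) {
        return false;
    }
    /* LBRR frames consult the LBRR flag; regular frames the VAD flag. */
    return is_lbrr ? !side_lbrr_flag : !side_vad_flag;
}
```
]]></artwork>
</figure>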
| 2119 |
| 2120 </section> |
| 2121 |
| 2122 <section anchor="silk_frame_type" toc="include" title="Frame Type"> |
| 2123 <t> |
| 2124 Each SILK frame contains a single "frame type" symbol that jointly codes the |
| 2125 signal type and quantization offset type of the corresponding frame. |
| 2126 If the current frame is a regular SILK frame whose VAD flag was not set (an |
| 2127 "inactive" frame), then the frame type symbol takes on a value of either 0 or |
| 2128 1 and is decoded using the first PDF in <xref target="silk_frame_type_pdfs"/>. |
| 2129 If the frame is an LBRR frame or a regular SILK frame whose VAD flag was set |
| 2130 (an "active" frame), then the value of the symbol may range from 2 to 5, |
| 2131 inclusive, and is decoded using the second PDF in |
| 2132 <xref target="silk_frame_type_pdfs"/>. |
| 2133 <xref target="silk_frame_type_table"/> translates between the value of the |
| 2134 frame type symbol and the corresponding signal type and quantization offset |
| 2135 type. |
| 2136 </t> |
| 2137 |
| 2138 <texttable anchor="silk_frame_type_pdfs" title="Frame Type PDFs"> |
| 2139 <ttcol>VAD Flag</ttcol> |
| 2140 <ttcol>PDF</ttcol> |
| 2141 <c>Inactive</c> <c>{26, 230, 0, 0, 0, 0}/256</c> |
| 2142 <c>Active</c> <c>{0, 0, 24, 74, 148, 10}/256</c> |
| 2143 </texttable> |
| 2144 |
| 2145 <texttable anchor="silk_frame_type_table" |
| 2146 title="Signal Type and Quantization Offset Type from Frame Type"> |
| 2147 <ttcol>Frame Type</ttcol> |
| 2148 <ttcol>Signal Type</ttcol> |
| 2149 <ttcol align="right">Quantization Offset Type</ttcol> |
| 2150 <c>0</c> <c>Inactive</c> <c>Low</c> |
| 2151 <c>1</c> <c>Inactive</c> <c>High</c> |
| 2152 <c>2</c> <c>Unvoiced</c> <c>Low</c> |
| 2153 <c>3</c> <c>Unvoiced</c> <c>High</c> |
| 2154 <c>4</c> <c>Voiced</c> <c>Low</c> |
| 2155 <c>5</c> <c>Voiced</c> <c>High</c> |
| 2156 </texttable> |
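<t>
Because the frame type table is regular, the mapping can be sketched with a
shift and a mask.
The enumeration and function names below are hypothetical and chosen only for
this example.
</t>
<figure align="center">
<artwork align="left"><![CDATA[
```c
#include <assert.h>

enum signal_type { SIGNAL_INACTIVE, SIGNAL_UNVOICED, SIGNAL_VOICED };
enum quant_offset_type { QUANT_OFFSET_LOW, QUANT_OFFSET_HIGH };

/* Frame types 0-1 are inactive, 2-3 unvoiced, and 4-5 voiced. */
static enum signal_type frame_signal_type(int frame_type)
{
    assert(frame_type >= 0 && frame_type <= 5);
    return (enum signal_type)(frame_type >> 1);
}

/* Even frame types use the low offset, odd ones the high offset. */
static enum quant_offset_type frame_quant_offset_type(int frame_type)
{
    assert(frame_type >= 0 && frame_type <= 5);
    return (enum quant_offset_type)(frame_type & 1);
}
```
]]></artwork>
</figure>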
| 2157 |
| 2158 </section> |
| 2159 |
| 2160 <section anchor="silk_gains" toc="include" title="Subframe Gains"> |
| 2161 <t> |
| 2162 A separate quantization gain is coded for each 5 ms subframe. |
| 2163 These gains control the step size between quantization levels of the excitation |
| 2164 signal and, therefore, the quality of the reconstruction. |
| 2165 They are independent of and unrelated to the pitch contours coded for voiced |
| 2166 frames. |
| 2167 The quantization gains are themselves uniformly quantized to 6 bits on a |
| 2168 log scale, giving them a resolution of approximately 1.369 dB and a range |
| 2169 of approximately 1.94 dB to 88.21 dB. |
| 2170 </t> |
| 2171 <t> |
| 2172 The subframe gains are either coded independently, or relative to the gain from |
| 2173 the most recent coded subframe in the same channel. |
| 2174 Independent coding is used if and only if |
| 2175 <list style="symbols"> |
| 2176 <t> |
| 2177 This is the first subframe in the current SILK frame, and |
| 2178 </t> |
| 2179 <t>Either |
| 2180 <list style="symbols"> |
| 2181 <t> |
| 2182 This is the first SILK frame of its type (LBRR or regular) for this channel in |
| 2183 the current Opus frame, or |
| 2184 </t> |
| 2185 <t> |
| 2186 The previous SILK frame of the same type (LBRR or regular) for this channel in |
| 2187 the same Opus frame was not coded. |
| 2188 </t> |
| 2189 </list> |
| 2190 </t> |
| 2191 </list> |
| 2192 </t> |
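<t>
The rule for choosing independent coding can be expressed as a predicate.
The following is a sketch under stated assumptions: the function and
parameter names are hypothetical, and the caller is assumed to track whether
an earlier SILK frame of the same type exists in the current Opus frame and
whether it was coded.
</t>
<figure align="center">
<artwork align="left"><![CDATA[
```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical predicate: returns true when the gain of the given
 * subframe is coded independently.
 *   subframe            - index of the subframe in the SILK frame
 *   first_frame_of_type - first SILK frame of its type (LBRR or
 *                         regular) for this channel in the Opus frame
 *   prev_frame_coded    - the previous SILK frame of the same type for
 *                         this channel in the same Opus frame was coded
 *                         (ignored when first_frame_of_type is true) */
static bool gain_is_independent(int subframe, bool first_frame_of_type,
                                bool prev_frame_coded)
{
    assert(subframe >= 0);

    if (subframe != 0) {
        return false;   /* only the first subframe can be independent */
    }
    return first_frame_of_type || !prev_frame_coded;
}
```
]]></artwork>
</figure>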
| 2193 |
| 2194 <t> |
| 2195 In an independently coded subframe gain, the 3 most significant bits of the |
| 2196 quantization gain are decoded using a PDF selected from |
| 2197 <xref target="silk_independent_gain_msb_pdfs"/> based on the decoded signal |
| 2198 type (see <xref target="silk_frame_type"/>). |
| 2199 </t> |
| 2200 |
| 2201 <texttable anchor="silk_independent_gain_msb_pdfs" |
| 2202 title="PDFs for Independent Quantization Gain MSB Coding"> |
| 2203 <ttcol align="left">Signal Type</ttcol> |
| 2204 <ttcol align="left">PDF</ttcol> |
| 2205 <c>Inactive</c> <c>{32, 112, 68, 29, 12, 1, 1, 1}/256</c> |
| 2206 <c>Unvoiced</c> <c>{2, 17, 45, 60, 62, 47, 19, 4}/256</c> |
| 2207 <c>Voiced</c> <c>{1, 3, 26, 71, 94, 50, 9, 2}/256</c> |
| 2208 </texttable> |
| 2209 |
| 2210 <t> |
| 2211 The 3 least significant bits are decoded using a uniform PDF: |
| 2212 </t> |
| 2213 <texttable anchor="silk_independent_gain_lsb_pdf" |
| 2214 title="PDF for Independent Quantization Gain LSB Coding"> |
| 2215 <ttcol align="left">PDF</ttcol> |
| 2216 <c>{32, 32, 32, 32, 32, 32, 32, 32}/256</c> |
| 2217 </texttable> |
| 2218 |
| 2219 <t> |
| 2220 These 6 bits are combined to form a value, gain_index, between 0 and 63. |
| 2221 When the gain for the previous subframe is available, then the current gain is |
| 2222 limited as follows: |
| 2223 <figure align="center"> |
| 2224 <artwork align="center"><![CDATA[ |
| 2225 log_gain = max(gain_index, previous_log_gain - 16) . |
| 2226 ]]></artwork> |
| 2227 </figure> |
| 2228 This may help some implementations limit the change in precision of their |
| 2229 internal LTP history. |
| 2230 The indices to which this clamp applies cannot simply be removed from the |
| 2231 codebook, because previous_log_gain will not be available after packet loss. |
| 2232 The clamping is skipped after a decoder reset, and in the side channel if the |
| 2233 previous frame in the side channel was not coded, since there is no value for |
| 2234 previous_log_gain available. |
| 2235 It MAY also be skipped after packet loss. |
| 2236 </t> |
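<t>
A minimal sketch of the independent gain path follows.
The function name and the have_previous parameter are hypothetical; msb and
lsb stand for the two symbols already decoded with the PDFs above.
</t>
<figure align="center">
<artwork align="left"><![CDATA[
```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: combines the 3 most significant and 3 least
 * significant bits into gain_index and applies the clamp when a
 * previous gain is available. */
static int independent_log_gain(int msb, int lsb, bool have_previous,
                                int previous_log_gain)
{
    assert(msb >= 0 && msb < 8);
    assert(lsb >= 0 && lsb < 8);

    int gain_index = (msb << 3) | lsb;   /* 0 to 63 */
    if (!have_previous) {
        return gain_index;               /* clamping is skipped */
    }
    int lower_bound = previous_log_gain - 16;
    return gain_index > lower_bound ? gain_index : lower_bound;
}
```
]]></artwork>
</figure>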
| 2237 |
| 2238 <t> |
| 2239 For subframes which do not have an independent gain (including the first |
| 2240 subframe of frames not listed as using independent coding above), the |
| 2241 quantization gain is coded relative to the gain from the previous subframe (in |
| 2242 the same channel). |
| 2243 The PDF in <xref target="silk_delta_gain_pdf"/> yields a delta_gain_index value |
| 2244 between 0 and 40, inclusive. |
| 2245 </t> |
| 2246 <texttable anchor="silk_delta_gain_pdf" |
| 2247 title="PDF for Delta Quantization Gain Coding"> |
| 2248 <ttcol align="left">PDF</ttcol> |
| 2249 <c>{6, 5, 11, 31, 132, 21, 8, 4, |
| 2250 3, 2, 2, 2, 1, 1, 1, 1, |
| 2251 1, 1, 1, 1, 1, 1, 1, 1, |
| 2252 1, 1, 1, 1, 1, 1, 1, 1, |
| 2253 1, 1, 1, 1, 1, 1, 1, 1, 1}/256</c> |
| 2254 </texttable> |
| 2255 <t> |
| 2256 The following formula translates this index into a quantization gain for the |
| 2257 current subframe using the gain from the previous subframe: |
| 2258 <figure align="center"> |
| 2259 <artwork align="center"><![CDATA[ |
| 2260 log_gain = clamp(0, max(2*delta_gain_index - 16, |
| 2261 previous_log_gain + delta_gain_index - 4), 63) . |
| 2262 ]]></artwork> |
| 2263 </figure> |
| 2264 </t> |
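<t>
The delta-coded case can be sketched in the same way.
The function name is hypothetical, and delta_gain_index is assumed to have
been decoded already with the PDF above.
</t>
<figure align="center">
<artwork align="left"><![CDATA[
```c
#include <assert.h>

/* Hypothetical helper: applies the delta gain formula, including the
 * final clamp to the range 0 to 63. */
static int delta_log_gain(int delta_gain_index, int previous_log_gain)
{
    assert(delta_gain_index >= 0 && delta_gain_index <= 40);

    int a = 2*delta_gain_index - 16;
    int b = previous_log_gain + delta_gain_index - 4;
    int log_gain = a > b ? a : b;

    if (log_gain < 0) {
        log_gain = 0;
    }
    if (log_gain > 63) {
        log_gain = 63;
    }
    return log_gain;
}
```
]]></artwork>
</figure>
<t>
With this formula, a delta_gain_index of 4 leaves the gain unchanged, while
large indices let the gain rise by two steps per index.
</t>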
| 2265 <t> |
| 2266 silk_gains_dequant() (gain_quant.c) dequantizes log_gain for the k'th subframe |
| 2267 and converts it into a linear Q16 scale factor via |
| 2268 <figure align="center"> |
| 2269 <artwork align="center"><![CDATA[ |
| 2270 gain_Q16[k] = silk_log2lin((0x1D1C71*log_gain>>16) + 2090) |
| 2271 ]]></artwork> |
| 2272 </figure> |
| 2273 </t> |
| 2274 <t> |
| 2275 The function silk_log2lin() (log2lin.c) computes an approximation of |
| 2276 2**(inLog_Q7/128.0), where inLog_Q7 is its Q7 input. |
| 2277 Let i = inLog_Q7>>7 be the integer part of inLog_Q7 and |
| 2278 f = inLog_Q7&127 be the fractional part. |
| 2279 Then |
| 2280 <figure align="center"> |
| 2281 <artwork align="center"><![CDATA[ |
| 2282 (1<<i) + ((-174*f*(128-f)>>16)+f)*((1<<i)>>7) |
| 2283 ]]></artwork> |
| 2284 </figure> |
| 2285 yields the approximate exponential. |
| 2286 The final Q16 gain values lie between 81920 and 1686110208, inclusive |
| 2287 (representing scale factors of 1.25 to 25728, respectively). |
| 2288 </t> |
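<t>
The two formulas above can be checked with a short C sketch.
This is not the code from log2lin.c or gain_quant.c: the function names are
hypothetical, only the expressions given above are implemented, and right
shifts of negative values are assumed to be arithmetic (the C standard leaves
this implementation-defined).
</t>
<figure align="center">
<artwork align="left"><![CDATA[
```c
#include <assert.h>
#include <stdint.h>

/* Approximates 2**(inLog_Q7/128.0) with the formula given above.
 * Assumes an arithmetic right shift for the negative correction term. */
static int32_t log2lin_sketch(int32_t inLog_Q7)
{
    assert(inLog_Q7 >= 7*128 && inLog_Q7 < 31*128);

    int32_t i = inLog_Q7 >> 7;    /* integer part */
    int32_t f = inLog_Q7 & 127;   /* fractional part */

    return (1 << i) + (((-174*f*(128 - f)) >> 16) + f)*((1 << i) >> 7);
}

/* Converts a log_gain in the range 0 to 63 into a Q16 scale factor. */
static int32_t gain_Q16_from_log_gain(int log_gain)
{
    assert(log_gain >= 0 && log_gain <= 63);
    return log2lin_sketch(((0x1D1C71*log_gain) >> 16) + 2090);
}
```
]]></artwork>
</figure>
<t>
Evaluating the sketch at log_gain values of 0 and 63 reproduces the bounds of
81920 and 1686110208 quoted above.
</t>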
| 2289 </section> |
| 2290 |
| 2291 <section anchor="silk_nlsfs" toc="include" title="Normalized Line Spectral |
| 2292 Frequency (LSF) and Linear Predictive Coding (LPC) Coefficients"> |
| 2293 <t> |
| 2294 A set of normalized Line Spectral Frequency (LSF) coefficients follows the |
| 2295 quantization gains in the bitstream, and represents the Linear Predictive |
| 2296 Coding (LPC) coefficients for the current SILK frame. |
| 2297 Once decoded, the normalized LSFs form an increasing list of Q15 values between |
| 2298 0 and 1. |
| 2299 These represent the interleaved zeros on the upper half of the unit circle |
| 2300 (between 0 and pi, hence "normalized") in the standard decomposition |
| 2301 <xref target="line-spectral-pairs"/> of the LPC filter into a symmetric part |
| 2302 and an anti-symmetric part (P and Q in <xref target="silk_nlsf2lpc"/>). |
| 2303 Because of non-linear effects in the decoding process, an implementation SHOULD |
| 2304 match the fixed-point arithmetic described in this section exactly. |
| 2305 An encoder SHOULD also use the same process. |
| 2306 </t> |
| 2307 <t> |
| 2308 The normalized LSFs are coded using a two-stage vector quantizer (VQ) |
| 2309 (<xref target="silk_nlsf_stage1"/> and <xref target="silk_nlsf_stage2"/>). |
| 2310 NB and MB frames use an order-10 predictor, while WB frames use an order-16 |
| 2311 predictor, and thus have different sets of tables. |
| 2312 After reconstructing the normalized LSFs |
| 2313 (<xref target="silk_nlsf_reconstruction"/>), the decoder runs them through a |
| 2314 stabilization process (<xref target="silk_nlsf_stabilization"/>), interpolates |
| 2315 them between frames (<xref target="silk_nlsf_interpolation"/>), converts them |
| 2316 back into LPC coefficients (<xref target="silk_nlsf2lpc"/>), and then runs |
| 2317 them through further processes to limit the range of the coefficients |
| 2318 (<xref target="silk_lpc_range_limit"/>) and the gain of the filter |
| 2319 (<xref target="silk_lpc_gain_limit"/>). |
| 2320 All of this is necessary to ensure the reconstruction process is stable. |
| 2321 </t> |
| 2322 |
| 2323 <section anchor="silk_nlsf_stage1" title="Normalized LSF Stage 1 Decoding"> |
| 2324 <t> |
| 2325 The first VQ stage uses a 32-element codebook, coded with one of the PDFs in |
| 2326 <xref target="silk_nlsf_stage1_pdfs"/>, depending on the audio bandwidth and |
| 2327 the signal type of the current SILK frame. |
| 2328 This yields a single index, I1, for the entire frame, which |
| 2329 <list style="numbers"> |
| 2330 <t>Indexes an element in a coarse codebook,</t> |
| 2331 <t>Selects the PDFs for the second stage of the VQ, and</t> |
| 2332 <t>Selects the prediction weights used to remove intra-frame redundancy from |
| 2333 the second stage.</t> |
| 2334 </list> |
| 2335 The actual codebook elements are listed in |
| 2336 <xref target="silk_nlsf_nbmb_codebook"/> and |
| 2337 <xref target="silk_nlsf_wb_codebook"/>, but they are not needed until the last |
| 2338 stages of reconstructing the LSF coefficients. |
| 2339 </t> |
| 2340 |
| 2341 <texttable anchor="silk_nlsf_stage1_pdfs" |
| 2342 title="PDFs for Normalized LSF Stage-1 Index Decoding"> |
| 2343 <ttcol align="left">Audio Bandwidth</ttcol> |
| 2344 <ttcol align="left">Signal Type</ttcol> |
| 2345 <ttcol align="left">PDF</ttcol> |
| 2346 <c>NB or MB</c> <c>Inactive or unvoiced</c> |
| 2347 <c> |
| 2348 {44, 34, 30, 19, 21, 12, 11, 3, |
| 2349 3, 2, 16, 2, 2, 1, 5, 2, |
| 2350 1, 3, 3, 1, 1, 2, 2, 2, |
| 2351 3, 1, 9, 9, 2, 7, 2, 1}/256 |
| 2352 </c> |
| 2353 <c>NB or MB</c> <c>Voiced</c> |
| 2354 <c> |
| 2355 {1, 10, 1, 8, 3, 8, 8, 14, |
| 2356 13, 14, 1, 14, 12, 13, 11, 11, |
| 2357 12, 11, 10, 10, 11, 8, 9, 8, |
| 2358 7, 8, 1, 1, 6, 1, 6, 5}/256 |
| 2359 </c> |
| 2360 <c>WB</c> <c>Inactive or unvoiced</c> |
| 2361 <c> |
| 2362 {31, 21, 3, 17, 1, 8, 17, 4, |
| 2363 1, 18, 16, 4, 2, 3, 1, 10, |
| 2364 1, 3, 16, 11, 16, 2, 2, 3, |
| 2365 2, 11, 1, 4, 9, 8, 7, 3}/256 |
| 2366 </c> |
| 2367 <c>WB</c> <c>Voiced</c> |
| 2368 <c> |
| 2369 {1, 4, 16, 5, 18, 11, 5, 14, |
| 2370 15, 1, 3, 12, 13, 14, 14, 6, |
| 2371 14, 12, 2, 6, 1, 12, 12, 11, |
| 2372 10, 3, 10, 5, 1, 1, 1, 3}/256 |
| 2373 </c> |
| 2374 </texttable> |
| 2375 |
| 2376 </section> |
| 2377 |
| 2378 <section anchor="silk_nlsf_stage2" title="Normalized LSF Stage 2 Decoding"> |
| 2379 <t> |
| 2380 A total of 16 PDFs are available for the LSF residual in the second stage: the |
| 2381 8 (a...h) for NB and MB frames given in |
| 2382 <xref target="silk_nlsf_stage2_nbmb_pdfs"/>, and the 8 (i...p) for WB frames |
| 2383 given in <xref target="silk_nlsf_stage2_wb_pdfs"/>. |
| 2384 Which PDF is used for which coefficient is driven by the index, I1, |
| 2385 decoded in the first stage. |
| 2386 <xref target="silk_nlsf_nbmb_stage2_cb_sel"/> lists the letter of the |
| 2387 corresponding PDF for each normalized LSF coefficient for NB and MB, and |
| 2388 <xref target="silk_nlsf_wb_stage2_cb_sel"/> lists the same information for WB. |
| 2389 </t> |
| 2390 |
| 2391 <texttable anchor="silk_nlsf_stage2_nbmb_pdfs" |
| 2392 title="PDFs for NB/MB Normalized LSF Stage-2 Index Decoding"> |
| 2393 <ttcol align="left">Codebook</ttcol> |
| 2394 <ttcol align="left">PDF</ttcol> |
| 2395 <c>a</c> <c>{1, 1, 1, 15, 224, 11, 1, 1, 1}/256</c> |
| 2396 <c>b</c> <c>{1, 1, 2, 34, 183, 32, 1, 1, 1}/256</c> |
| 2397 <c>c</c> <c>{1, 1, 4, 42, 149, 55, 2, 1, 1}/256</c> |
| 2398 <c>d</c> <c>{1, 1, 8, 52, 123, 61, 8, 1, 1}/256</c> |
| 2399 <c>e</c> <c>{1, 3, 16, 53, 101, 74, 6, 1, 1}/256</c> |
| 2400 <c>f</c> <c>{1, 3, 17, 55, 90, 73, 15, 1, 1}/256</c> |
| 2401 <c>g</c> <c>{1, 7, 24, 53, 74, 67, 26, 3, 1}/256</c> |
| 2402 <c>h</c> <c>{1, 1, 18, 63, 78, 58, 30, 6, 1}/256</c> |
| 2403 </texttable> |
| 2404 |
| 2405 <texttable anchor="silk_nlsf_stage2_wb_pdfs" |
| 2406 title="PDFs for WB Normalized LSF Stage-2 Index Decoding"> |
| 2407 <ttcol align="left">Codebook</ttcol> |
| 2408 <ttcol align="left">PDF</ttcol> |
| 2409 <c>i</c> <c>{1, 1, 1, 9, 232, 9, 1, 1, 1}/256</c> |
| 2410 <c>j</c> <c>{1, 1, 2, 28, 186, 35, 1, 1, 1}/256</c> |
| 2411 <c>k</c> <c>{1, 1, 3, 42, 152, 53, 2, 1, 1}/256</c> |
| 2412 <c>l</c> <c>{1, 1, 10, 49, 126, 65, 2, 1, 1}/256</c> |
| 2413 <c>m</c> <c>{1, 4, 19, 48, 100, 77, 5, 1, 1}/256</c> |
| 2414 <c>n</c> <c>{1, 1, 14, 54, 100, 72, 12, 1, 1}/256</c> |
| 2415 <c>o</c> <c>{1, 1, 15, 61, 87, 61, 25, 4, 1}/256</c> |
| 2416 <c>p</c> <c>{1, 7, 21, 50, 77, 81, 17, 1, 1}/256</c> |
| 2417 </texttable> |
| 2418 |
| 2419 <texttable anchor="silk_nlsf_nbmb_stage2_cb_sel" |
| 2420 title="Codebook Selection for NB/MB Normalized LSF Stage-2 Index Decoding"> |
| 2421 <ttcol>I1</ttcol> |
| 2422 <ttcol>Coefficient</ttcol> |
| 2423 <c/> |
| 2424 <c><spanx style="vbare">0 1 2 3 4 5 6 7
8 9</spanx></c> |
| 2425 <c> 0</c> |
| 2426 <c><spanx style="vbare">a a a a a a a a
a a</spanx></c> |
| 2427 <c> 1</c> |
| 2428 <c><spanx style="vbare">b d b c c b c b
b b</spanx></c> |
| 2429 <c> 2</c> |
| 2430 <c><spanx style="vbare">c b b b b b b b
b b</spanx></c> |
| 2431 <c> 3</c> |
| 2432 <c><spanx style="vbare">b c c c c b c b
b b</spanx></c> |
| 2433 <c> 4</c> |
| 2434 <c><spanx style="vbare">c d d d d c c c
c c</spanx></c> |
| 2435 <c> 5</c> |
| 2436 <c><spanx style="vbare">a f d d c c c c
b b</spanx></c> |
| 2437 <c> 6</c> |
| 2438 <c><spanx style="vbare">a c c c c c c c
c b</spanx></c> |
| 2439 <c> 7</c> |
| 2440 <c><spanx style="vbare">c d g e e e f e
f f</spanx></c> |
| 2441 <c> 8</c> |
| 2442 <c><spanx style="vbare">c e f f e f e g
e e</spanx></c> |
| 2443 <c> 9</c> |
| 2444 <c><spanx style="vbare">c e e h e f e f
f e</spanx></c> |
| 2445 <c>10</c> |
| 2446 <c><spanx style="vbare">e d d d c d c c
c c</spanx></c> |
| 2447 <c>11</c> |
| 2448 <c><spanx style="vbare">b f f g e f e f
f f</spanx></c> |
| 2449 <c>12</c> |
| 2450 <c><spanx style="vbare">c h e g f f f f
f f</spanx></c> |
| 2451 <c>13</c> |
| 2452 <c><spanx style="vbare">c h f f f f f g
f e</spanx></c> |
| 2453 <c>14</c> |
| 2454 <c><spanx style="vbare">d d f e e f e f
e e</spanx></c> |
| 2455 <c>15</c> |
| 2456 <c><spanx style="vbare">c d d f f e e e
e e</spanx></c> |
| 2457 <c>16</c> |
| 2458 <c><spanx style="vbare">c e e g e f e f
f f</spanx></c> |
| 2459 <c>17</c> |
| 2460 <c><spanx style="vbare">c f e g f f f e
f e</spanx></c> |
| 2461 <c>18</c> |
| 2462 <c><spanx style="vbare">c h e f e f e f
f f</spanx></c> |
| 2463 <c>19</c> |
| 2464 <c><spanx style="vbare">c f e g h g f g
f e</spanx></c> |
| 2465 <c>20</c> |
| 2466 <c><spanx style="vbare">d g h e g f f g
e f</spanx></c> |
| 2467 <c>21</c> |
| 2468 <c><spanx style="vbare">c h g e e e f e
f f</spanx></c> |
| 2469 <c>22</c> |
| 2470 <c><spanx style="vbare">e f f e g g f g
f e</spanx></c> |
| 2471 <c>23</c> |
| 2472 <c><spanx style="vbare">c f f g f g e g
e e</spanx></c> |
| 2473 <c>24</c> |
| 2474 <c><spanx style="vbare">e f f f d h e f
f e</spanx></c> |
| 2475 <c>25</c> |
| 2476 <c><spanx style="vbare">c d e f f g e f
f e</spanx></c> |
| 2477 <c>26</c> |
| 2478 <c><spanx style="vbare">c d c d d e c d
d d</spanx></c> |
| 2479 <c>27</c> |
| 2480 <c><spanx style="vbare">b b c c c c c d
c c</spanx></c> |
| 2481 <c>28</c> |
| 2482 <c><spanx style="vbare">e f f g g g f g
e f</spanx></c> |
| 2483 <c>29</c> |
| 2484 <c><spanx style="vbare">d f f e e e e d
d c</spanx></c> |
| 2485 <c>30</c> |
| 2486 <c><spanx style="vbare">c f d h f f e e
f e</spanx></c> |
| 2487 <c>31</c> |
| 2488 <c><spanx style="vbare">e e f e f g f g
f e</spanx></c> |
| 2489 </texttable> |
| 2490 |
| 2491 <texttable anchor="silk_nlsf_wb_stage2_cb_sel" |
| 2492 title="Codebook Selection for WB Normalized LSF Stage-2 Index Decoding"> |
| 2493 <ttcol>I1</ttcol> |
| 2494 <ttcol>Coefficient</ttcol> |
| 2495 <c/> |
| 2496 <c><spanx style="vbare">0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15</spanx></c> |
| 2497 <c> 0</c> |
| 2498 <c><spanx style="vbare">i i i i i i i i i i i i i i i i</spanx></c> |
| 2499 <c> 1</c> |
| 2500 <c><spanx style="vbare">k l l l l l k k k k k j j j i l</spanx></c> |
| 2501 <c> 2</c> |
| 2502 <c><spanx style="vbare">k n n l p m m n k n m n n m l l</spanx></c> |
| 2503 <c> 3</c> |
| 2504 <c><spanx style="vbare">i k j k k j j j j j i i i i i j</spanx></c> |
| 2505 <c> 4</c> |
| 2506 <c><spanx style="vbare">i o n m o m p n m m m n n m m l</spanx></c> |
| 2507 <c> 5</c> |
| 2508 <c><spanx style="vbare">i l n n m l l n l l l l l l k m</spanx></c> |
| 2509 <c> 6</c> |
| 2510 <c><spanx style="vbare">i i i i i i i i i i i i i i i i</spanx></c> |
| 2511 <c> 7</c> |
| 2512 <c><spanx style="vbare">i k o l p k n l m n n m l l k l</spanx></c> |
| 2513 <c> 8</c> |
| 2514 <c><spanx style="vbare">i o k o o m n m o n m m n l l l</spanx></c> |
| 2515 <c> 9</c> |
| 2516 <c><spanx style="vbare">k j i i i i i i i i i i i i i i</spanx></c> |
| 2517 <c>10</c> |
| 2518 <c><spanx style="vbare">i j i i i i i i i i i i i i i j</spanx></c> |
| 2519 <c>11</c> |
| 2520 <c><spanx style="vbare">k k l m n l l l l l l l k k j l</spanx></c> |
| 2521 <c>12</c> |
| 2522 <c><spanx style="vbare">k k l l m l l l l l l l l k j l</spanx></c> |
| 2523 <c>13</c> |
| 2524 <c><spanx style="vbare">l m m m o m m n l n m m n m l m</spanx></c> |
| 2525 <c>14</c> |
| 2526 <c><spanx style="vbare">i o m n m p n k o n p m m l n l</spanx></c> |
| 2527 <c>15</c> |
| 2528 <c><spanx style="vbare">i j i j j j j j j j i i i i j i</spanx></c> |
| 2529 <c>16</c> |
| 2530 <c><spanx style="vbare">j o n p n m n l m n m m m l l m</spanx></c> |
| 2531 <c>17</c> |
| 2532 <c><spanx style="vbare">j l l m m l l n k l l n n n l m</spanx></c> |
| 2533 <c>18</c> |
| 2534 <c><spanx style="vbare">k l l k k k l k j k j k j j j m</spanx></c> |
| 2535 <c>19</c> |
| 2536 <c><spanx style="vbare">i k l n l l k k k j j i i i i i</spanx></c> |
| 2537 <c>20</c> |
| 2538 <c><spanx style="vbare">l m l n l l k k j j j j j k k m</spanx></c> |
| 2539 <c>21</c> |
| 2540 <c><spanx style="vbare">k o l p p m n m n l n l l k l l</spanx></c> |
| 2541 <c>22</c> |
| 2542 <c><spanx style="vbare">k l n o o l n l m m l l l l k m</spanx></c> |
| 2543 <c>23</c> |
| 2544 <c><spanx style="vbare">j l l m m m m l n n n l j j j j</spanx></c> |
| 2545 <c>24</c> |
| 2546 <c><spanx style="vbare">k n l o o m p m m n l m m l l l</spanx></c> |
| 2547 <c>25</c> |
| 2548 <c><spanx style="vbare">i o j j i i i i i i i i i i i i</spanx></c> |
| 2549 <c>26</c> |
| 2550 <c><spanx style="vbare">i o o l n k n n l m m p p m m m</spanx></c> |
| 2551 <c>27</c> |
| 2552 <c><spanx style="vbare">l l p l n m l l l k k l l l k l</spanx></c> |
| 2553 <c>28</c> |
| 2554 <c><spanx style="vbare">i i j i i i k j k j j k k k j j</spanx></c> |
| 2555 <c>29</c> |
| 2556 <c><spanx style="vbare">i l k n l l k l k j i i j i i j</spanx></c> |
| 2557 <c>30</c> |
| 2558 <c><spanx style="vbare">l n n m p n l l k l k k j i j i</spanx></c> |
| 2559 <c>31</c> |
| 2560 <c><spanx style="vbare">k l n l m l l l k j k o m i i i</spanx></c> |
| 2561 </texttable> |
| 2562 |
| 2563 <t> |
| 2564 Decoding the second stage residual proceeds as follows. |
| 2565 For each coefficient, the decoder reads a symbol using the PDF corresponding to |
| 2566 I1 from either <xref target="silk_nlsf_nbmb_stage2_cb_sel"/> or |
| 2567 <xref target="silk_nlsf_wb_stage2_cb_sel"/>, and subtracts 4 from the result |
| 2568 to give an index in the range -4 to 4, inclusive. |
| 2569 If the index is either -4 or 4, it reads a second symbol using the PDF in |
| 2570 <xref target="silk_nlsf_ext_pdf"/>, and adds the value of this second symbol |
| 2571 to the index, using the same sign. |
| 2572 This gives the index, I2[k], a total range of -10 to 10, inclusive. |
| 2573 </t> |
| 2574 |
| 2575 <texttable anchor="silk_nlsf_ext_pdf" |
| 2576 title="PDF for Normalized LSF Index Extension Decoding"> |
| 2577 <ttcol align="left">PDF</ttcol> |
| 2578 <c>{156, 60, 24, 9, 4, 2, 1}/256</c> |
| 2579 </texttable> |
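<t>
The index extension step can be sketched as follows.
The function name is hypothetical; for simplicity the sketch takes the
extension symbol as an argument and ignores it when the first symbol does not
map to -4 or 4, whereas a real decoder would read the second symbol only in
those two cases.
</t>
<figure align="center">
<artwork align="left"><![CDATA[
```c
#include <assert.h>

/* Hypothetical helper: converts the first stage-2 symbol (0 to 8) and
 * the optional extension symbol (0 to 6) into the index I2[k]. */
static int nlsf_stage2_index(int symbol, int ext_symbol)
{
    assert(symbol >= 0 && symbol <= 8);
    assert(ext_symbol >= 0 && ext_symbol <= 6);

    int index = symbol - 4;          /* -4 to 4 */
    if (index == -4) {
        index -= ext_symbol;         /* extend with the same sign */
    } else if (index == 4) {
        index += ext_symbol;
    }
    return index;                    /* -10 to 10 */
}
```
]]></artwork>
</figure>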
| 2580 |
| 2581 <t> |
| 2582 The decoded indices from both stages are translated back into normalized LSF |
| 2583 coefficients in silk_NLSF_decode() (NLSF_decode.c). |
| 2584 The stage-2 indices represent residuals after both the first stage of the VQ |
| 2585 and a separate backwards-prediction step. |
| 2586 The backwards prediction process in the encoder subtracts a prediction from |
| 2587 each residual formed by a multiple of the coefficient that follows it. |
| 2588 The decoder must undo this process. |
| 2589 <xref target="silk_nlsf_pred_weights"/> contains lists of prediction weights |
| 2590 for each coefficient. |
| 2591 There are two lists for NB and MB, and another two lists for WB, giving two |
| 2592 possible prediction weights for each coefficient. |
| 2593 </t> |
| 2594 |
| 2595 <texttable anchor="silk_nlsf_pred_weights" |
| 2596 title="Prediction Weights for Normalized LSF Decoding"> |
| 2597 <ttcol align="left">Coefficient</ttcol> |
| 2598 <ttcol align="right">A</ttcol> |
| 2599 <ttcol align="right">B</ttcol> |
| 2600 <ttcol align="right">C</ttcol> |
| 2601 <ttcol align="right">D</ttcol> |
| 2602 <c>0</c> <c>179</c> <c>116</c> <c>175</c> <c>68</c> |
| 2603 <c>1</c> <c>138</c> <c>67</c> <c>148</c> <c>62</c> |
| 2604 <c>2</c> <c>140</c> <c>82</c> <c>160</c> <c>66</c> |
| 2605 <c>3</c> <c>148</c> <c>59</c> <c>176</c> <c>60</c> |
| 2606 <c>4</c> <c>151</c> <c>92</c> <c>178</c> <c>72</c> |
| 2607 <c>5</c> <c>149</c> <c>72</c> <c>173</c> <c>117</c> |
| 2608 <c>6</c> <c>153</c> <c>100</c> <c>174</c> <c>85</c> |
| 2609 <c>7</c> <c>151</c> <c>89</c> <c>164</c> <c>90</c> |
| 2610 <c>8</c> <c>163</c> <c>92</c> <c>177</c> <c>118</c> |
| 2611 <c>9</c> <c/> <c/> <c>174</c> <c>136</c> |
| 2612 <c>10</c> <c/> <c/> <c>196</c> <c>151</c> |
| 2613 <c>11</c> <c/> <c/> <c>182</c> <c>142</c> |
| 2614 <c>12</c> <c/> <c/> <c>198</c> <c>160</c> |
| 2615 <c>13</c> <c/> <c/> <c>192</c> <c>142</c> |
| 2616 <c>14</c> <c/> <c/> <c>182</c> <c>155</c> |
| 2617 </texttable> |
| 2618 |
| 2619 <t> |
| 2620 The prediction is undone using the procedure implemented in |
| 2621 silk_NLSF_residual_dequant() (NLSF_decode.c), which is as follows. |
| 2622 Each coefficient selects its prediction weight from one of the two lists based |
| 2623 on the stage-1 index, I1. |
| 2624 <xref target="silk_nlsf_nbmb_weight_sel"/> gives the selections for each |
| 2625 coefficient for NB and MB, and <xref target="silk_nlsf_wb_weight_sel"/> gives |
| 2626 the selections for WB. |
| 2627 Let d_LPC be the order of the codebook, i.e., 10 for NB and MB, and 16 for WB, |
| 2628 and let pred_Q8[k] be the weight for the k'th coefficient selected by this |
| 2629 process for 0 &lt;= k &lt; d_LPC-1. |
| 2630 Then, the stage-2 residual for each coefficient is computed via |
| 2631 <figure align="center"> |
| 2632 <artwork align="center"><![CDATA[ |
| 2633 res_Q10[k] = (k+1 < d_LPC ? (res_Q10[k+1]*pred_Q8[k])>>8 : 0) |
| 2634 + ((((I2[k]<<10) - sign(I2[k])*102)*qstep)>>16) , |
| 2635 ]]></artwork> |
| 2636 </figure> |
| 2637 where qstep is the Q16 quantization step size, which is 11796 for NB and MB |
| 2638 and 9830 for WB (representing step sizes of approximately 0.18 and 0.15, |
| 2639 respectively). |
| 2640 </t> |
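<t>
Because res_Q10[k] depends on res_Q10[k+1], the residuals have to be computed
from the last coefficient down to the first.
The following C sketch illustrates this ordering; it is not the code of
silk_NLSF_residual_dequant(), and the function name, argument types, and
argument order are assumptions made for this example.
The caller is assumed to have already selected the d_LPC-1 weights in pred_Q8.
</t>
<figure align="center">
<artwork align="left"><![CDATA[
```c
#include <assert.h>
#include <stdint.h>

/* Returns -1, 0, or 1 according to the sign of x. */
static int sign(int x)
{
    return (x > 0) - (x < 0);
}

/* Hypothetical helper: undoes the backwards prediction.
 *   res_Q10 - output, d_LPC entries
 *   I2      - stage-2 indices (-10 to 10), d_LPC entries
 *   pred_Q8 - selected prediction weights, d_LPC-1 entries
 *   qstep   - Q16 step size (11796 for NB and MB, 9830 for WB)
 * Right shifts of negative values are assumed to be arithmetic. */
static void nlsf_residual_dequant(int32_t *res_Q10, const int *I2,
                                  const int *pred_Q8, int qstep,
                                  int d_LPC)
{
    assert(d_LPC == 10 || d_LPC == 16);

    for (int k = d_LPC - 1; k >= 0; k--) {
        int32_t pred = 0;
        if (k + 1 < d_LPC) {
            pred = (res_Q10[k+1]*pred_Q8[k]) >> 8;
        }
        /* I2[k]*1024 equals I2[k]<<10 but is well defined in C for
         * negative indices. */
        res_Q10[k] = pred
                   + (((I2[k]*1024 - sign(I2[k])*102)*qstep) >> 16);
    }
}
```
]]></artwork>
</figure>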
| 2641 |
| 2642 <texttable anchor="silk_nlsf_nbmb_weight_sel" |
| 2643 title="Prediction Weight Selection for NB/MB Normalized LSF Decoding"> |
| 2644 <ttcol>I1</ttcol> |
| 2645 <ttcol>Coefficient</ttcol> |
| 2646 <c/> |
| 2647 <c><spanx style="vbare">0 1 2 3 4 5 6 7
8</spanx></c> |
| 2648 <c> 0</c> |
| 2649 <c><spanx style="vbare">A B A A A A A A
A</spanx></c> |
| 2650 <c> 1</c> |
| 2651 <c><spanx style="vbare">B A A A A A A A
A</spanx></c> |
| 2652 <c> 2</c> |
| 2653 <c><spanx style="vbare">A A A A A A A A
A</spanx></c> |
| 2654 <c> 3</c> |
| 2655 <c><spanx style="vbare">B B B A A A A B
A</spanx></c> |
| 2656 <c> 4</c> |
| 2657 <c><spanx style="vbare">A B A A A A A A
A</spanx></c> |
| 2658 <c> 5</c> |
| 2659 <c><spanx style="vbare">A B A A A A A A
A</spanx></c> |
| 2660 <c> 6</c> |
| 2661 <c><spanx style="vbare">B A B B A A A B
A</spanx></c> |
| 2662 <c> 7</c> |
| 2663 <c><spanx style="vbare">A B B A A B B A
A</spanx></c> |
| 2664 <c> 8</c> |
| 2665 <c><spanx style="vbare">A A B B A B A B
B</spanx></c> |
| 2666 <c> 9</c> |
| 2667 <c><spanx style="vbare">A A B B A A B B
B</spanx></c> |
| 2668 <c>10</c> |
| 2669 <c><spanx style="vbare">A A A A A A A A
A</spanx></c> |
| 2670 <c>11</c> |
| 2671 <c><spanx style="vbare">A B A B B B B B
A</spanx></c> |
| 2672 <c>12</c> |
| 2673 <c><spanx style="vbare">A B A B B B B B
A</spanx></c> |
| 2674 <c>13</c> |
| 2675 <c><spanx style="vbare">A B B B B B B B
A</spanx></c> |
| 2676 <c>14</c> |
| 2677 <c><spanx style="vbare">B A B B A B B B
B</spanx></c> |
| 2678 <c>15</c> |
| 2679 <c><spanx style="vbare">A B B B B B A B
A</spanx></c> |
| 2680 <c>16</c> |
| 2681 <c><spanx style="vbare">A A B B A B A B
A</spanx></c> |
| 2682 <c>17</c> |
| 2683 <c><spanx style="vbare">A A B B B A B B
B</spanx></c> |
| 2684 <c>18</c> |
| 2685 <c><spanx style="vbare">A B B A A B B B
A</spanx></c> |
| 2686 <c>19</c> |
| 2687 <c><spanx style="vbare">A A A B B B A B
A</spanx></c> |
| 2688 <c>20</c> |
| 2689 <c><spanx style="vbare">A B B A A B A B
A</spanx></c> |
| 2690 <c>21</c> |
| 2691 <c><spanx style="vbare">A B B A A A B B
A</spanx></c> |
| 2692 <c>22</c> |
| 2693 <c><spanx style="vbare">A A A A A B B B
B</spanx></c> |
| 2694 <c>23</c> |
| 2695 <c><spanx style="vbare">A A B B A A A B
B</spanx></c> |
| 2696 <c>24</c> |
| 2697 <c><spanx style="vbare">A A A B A B B B
B</spanx></c> |
| 2698 <c>25</c> |
| 2699 <c><spanx style="vbare">A B B B B B B B
A</spanx></c> |
| 2700 <c>26</c> |
| 2701 <c><spanx style="vbare">A A A A A A A A
A</spanx></c> |
| 2702 <c>27</c> |
| 2703 <c><spanx style="vbare">A A A A A A A A
A</spanx></c> |
| 2704 <c>28</c> |
| 2705 <c><spanx style="vbare">A A B A B B A B
A</spanx></c> |
| 2706 <c>29</c> |
| 2707 <c><spanx style="vbare">B A A B A A A A
A</spanx></c> |
| 2708 <c>30</c> |
| 2709 <c><spanx style="vbare">A A A B B A B A
B</spanx></c> |
| 2710 <c>31</c> |
| 2711 <c><spanx style="vbare">B A B B A B B B
B</spanx></c> |
| 2712 </texttable> |
| 2713 |
| 2714 <texttable anchor="silk_nlsf_wb_weight_sel" |
| 2715 title="Prediction Weight Selection for WB Normalized LSF Decoding"> |
| 2716 <ttcol>I1</ttcol> |
| 2717 <ttcol>Coefficient</ttcol> |
| 2718 <c/> |
| 2719 <c><spanx style="vbare">0 1 2 3 4 5 6 7 8 9 10 11 12 13 14</spanx></c> |
| 2720 <c> 0</c> |
| 2721 <c><spanx style="vbare">C C C C C C C C C C C C C C D</spanx></c> |
| 2722 <c> 1</c> |
| 2723 <c><spanx style="vbare">C C C C C C C C C C C C C C C</spanx></c> |
| 2724 <c> 2</c> |
| 2725 <c><spanx style="vbare">C C D C C D D D C D D D D C C</spanx></c> |
| 2726 <c> 3</c> |
| 2727 <c><spanx style="vbare">C C C C C C C C C C C C D C C</spanx></c> |
| 2728 <c> 4</c> |
| 2729 <c><spanx style="vbare">C D D C D C D D C D D D D D C</spanx></c> |
| 2730 <c> 5</c> |
| 2731 <c><spanx style="vbare">C C D C C C C C C C C C C C C</spanx></c> |
| 2732 <c> 6</c> |
| 2733 <c><spanx style="vbare">D C C C C C C C C C C D C D C</spanx></c> |
| 2734 <c> 7</c> |
| 2735 <c><spanx style="vbare">C D D C C C D C D D D C D C D</spanx></c> |
| 2736 <c> 8</c> |
| 2737 <c><spanx style="vbare">C D C D D C D C D C D D D D D</spanx></c> |
| 2738 <c> 9</c> |
| 2739 <c><spanx style="vbare">C C C C C C C C C C C C C C D</spanx></c> |
| 2740 <c>10</c> |
| 2741 <c><spanx style="vbare">C D C C C C C C C C C C C C C</spanx></c> |
| 2742 <c>11</c> |
| 2743 <c><spanx style="vbare">C C D C D D D D D D D C D C C</spanx></c> |
| 2744 <c>12</c> |
| 2745 <c><spanx style="vbare">C C D C C D C D C D C C D C C</spanx></c> |
| 2746 <c>13</c> |
| 2747 <c><spanx style="vbare">C C C C D D C D C D D D D C C</spanx></c> |
| 2748 <c>14</c> |
| 2749 <c><spanx style="vbare">C D C C C D D C D D D C D D D</spanx></c> |
| 2750 <c>15</c> |
| 2751 <c><spanx style="vbare">C C D D C C C C C C C C D D C</spanx></c> |
| 2752 <c>16</c> |
| 2753 <c><spanx style="vbare">C D D C D C D D D D D C D C C</spanx></c> |
| 2754 <c>17</c> |
| 2755 <c><spanx style="vbare">C C D C C C C D C C D D D C C</spanx></c> |
| 2756 <c>18</c> |
| 2757 <c><spanx style="vbare">C C C C C C C C C C C C C C D</spanx></c> |
| 2758 <c>19</c> |
| 2759 <c><spanx style="vbare">C C C C C C C C C C C C D C C</spanx></c> |
| 2760 <c>20</c> |
| 2761 <c><spanx style="vbare">C C C C C C C C C C C C C C C</spanx></c> |
| 2762 <c>21</c> |
| 2763 <c><spanx style="vbare">C D C D C D D C D C D C D D C</spanx></c> |
| 2764 <c>22</c> |
| 2765 <c><spanx style="vbare">C C D D D D C D D C C D D C C</spanx></c> |
| 2766 <c>23</c> |
| 2767 <c><spanx style="vbare">C D D C D C D C D C C C C D C</spanx></c> |
| 2768 <c>24</c> |
| 2769 <c><spanx style="vbare">C C C D D C D C D D D D D D D</spanx></c> |
| 2770 <c>25</c> |
| 2771 <c><spanx style="vbare">C C C C C C C C C C C C C C D</spanx></c> |
| 2772 <c>26</c> |
| 2773 <c><spanx style="vbare">C D D C C C D D C C D D D D D</spanx></c> |
| 2774 <c>27</c> |
| 2775 <c><spanx style="vbare">C C C C C D C D D D D C D D D</spanx></c> |
| 2776 <c>28</c> |
| 2777 <c><spanx style="vbare">C C C C C C C C C C C C C C D</spanx></c> |
| 2778 <c>29</c> |
| 2779 <c><spanx style="vbare">C C C C C C C C C C C C C C D</spanx></c> |
| 2780 <c>30</c> |
| 2781 <c><spanx style="vbare">D C C C C C C C C C C D C C C</spanx></c> |
| 2782 <c>31</c> |
| 2783 <c><spanx style="vbare">C C D C C D D D C C D C C D C</spanx></c> |
| 2784 </texttable> |
| 2785 |
| 2786 </section> |
| 2787 |
| 2788 <section anchor="silk_nlsf_reconstruction" |
| 2789 title="Reconstructing the Normalized LSF Coefficients"> |
| 2790 <t> |
| 2791 Once the stage-1 index I1 and the stage-2 residual res_Q10[] have been decoded, |
| 2792 the final normalized LSF coefficients can be reconstructed. |
| 2793 </t> |
| 2794 <t> |
| 2795 The spectral distortion introduced by the quantization of each LSF coefficient |
| 2796 varies, so the stage-2 residual is weighted accordingly, using the |
| 2797 low-complexity Inverse Harmonic Mean Weighting (IHMW) function proposed in |
| 2798 <xref target="laroia-icassp"/>. |
| 2799 The weights are derived directly from the stage-1 codebook vector. |
| 2800 Let cb1_Q8[k] be the k'th entry of the stage-1 codebook vector from |
| 2801 <xref target="silk_nlsf_nbmb_codebook"/> or |
| 2802 <xref target="silk_nlsf_wb_codebook"/>. |
| 2803 Then for 0 <= k < d_LPC the following expression |
| 2804 computes the square of the weight as a Q18 value: |
| 2805 <figure align="center"> |
| 2806 <artwork align="center"> |
| 2807 <![CDATA[ |
| 2808 w2_Q18[k] = (1024/(cb1_Q8[k] - cb1_Q8[k-1]) |
| 2809 + 1024/(cb1_Q8[k+1] - cb1_Q8[k])) << 16 , |
| 2810 ]]> |
| 2811 </artwork> |
| 2812 </figure> |
| 2813 where cb1_Q8[-1] = 0 and cb1_Q8[d_LPC] = 256, and the |
| 2814 division is integer division. |
| 2815 This is reduced to an unsquared, Q9 value using the following square-root |
| 2816 approximation: |
| 2817 <figure align="center"> |
| 2818 <artwork align="center"><![CDATA[ |
| 2819 i = ilog(w2_Q18[k]) |
| 2820 f = (w2_Q18[k]>>(i-8)) & 127 |
| 2821 y = ((i&1) ? 32768 : 46214) >> ((32-i)>>1) |
| 2822 w_Q9[k] = y + ((213*f*y)>>16) |
| 2823 ]]></artwork> |
| 2824 </figure> |
| 2825 The constant 46214 here is approximately the square root of 2 in Q15. |
| 2826 The cb1_Q8[] vector completely determines these weights, and they may be |
| 2827 tabulated and stored as 13-bit unsigned values (with a range of 1819 to 5227, |
| 2828 inclusive) to avoid computing them when decoding. |
| 2829 The reference implementation already requires code to compute these weights on |
| 2830 unquantized coefficients in the encoder, in silk_NLSF_VQ_weights_laroia() |
| 2831 (NLSF_VQ_weights_laroia.c) and its callers, so it reuses that code in the |
| 2832 decoder instead of using a pre-computed table to reduce the amount of ROM |
| 2833 required. |
| 2834 </t> |
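<t>
The weight computation above can be sketched in C as follows.
This is a minimal illustration rather than the reference code: the names
ilog() and nlsf_weights_sketch() are hypothetical, and the sketch assumes the
codebook vector is strictly increasing so that no division by zero occurs.
</t>
<figure>
<artwork><![CDATA[
```c
#include <stdint.h>

/* Number of bits needed to represent x: floor(log2(x)) + 1, ilog(0) = 0. */
static int ilog(uint32_t x)
{
    int n = 0;
    while (x != 0) {
        n++;
        x >>= 1;
    }
    return n;
}

/* Hypothetical helper: fills w_Q9[0..d_LPC-1] from one stage-1 vector. */
static void nlsf_weights_sketch(int32_t *w_Q9, const uint8_t *cb1_Q8,
                                int d_LPC)
{
    for (int k = 0; k < d_LPC; k++) {
        /* Boundary conditions: cb1_Q8[-1] = 0 and cb1_Q8[d_LPC] = 256. */
        int32_t prev = (k > 0) ? cb1_Q8[k - 1] : 0;
        int32_t next = (k + 1 < d_LPC) ? cb1_Q8[k + 1] : 256;
        int32_t cur = cb1_Q8[k];
        /* Squared weight in Q18, using integer division. */
        int32_t w2_Q18 = (1024 / (cur - prev) + 1024 / (next - cur)) * 65536;
        /* Square-root approximation, reducing the value to Q9. */
        int i = ilog((uint32_t)w2_Q18);
        int32_t f = (w2_Q18 >> (i - 8)) & 127;
        int32_t y = ((i & 1) ? 32768 : 46214) >> ((32 - i) >> 1);
        w_Q9[k] = y + ((213 * f * y) >> 16);
    }
}
```
]]></artwork>
</figure>
<t>
For the NB/MB codebook vector with I1 equal to 0, this sketch gives
w_Q9[0] = 2897, which lies inside the 1819 to 5227 range stated above.
</t>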
| 2835 |
| 2836 <texttable anchor="silk_nlsf_nbmb_codebook" |
| 2837 title="NB/MB Normalized LSF Stage-1 Codebook Vectors"> |
| 2838 <ttcol>I1</ttcol> |
| 2839 <ttcol>Codebook (Q8)</ttcol> |
| 2840 <c/> |
| 2841 <c><spanx style="vbare"> 0 1 2  
; 3 4 5 6 &nb
sp;7 8 9</spanx></c> |
| 2842 <c>0</c> |
| 2843 <c><spanx style="vbare">12 35 60 83 108&nb
sp;132 157 180 206 228</spanx></c> |
| 2844 <c>1</c> |
| 2845 <c><spanx style="vbare">15 32 55 77 101&nb
sp;125 151 175 201 225</spanx></c> |
| 2846 <c>2</c> |
| 2847 <c><spanx style="vbare">19 42 66 89 114&nb
sp;137 162 184 209 230</spanx></c> |
| 2848 <c>3</c> |
| 2849 <c><spanx style="vbare">12 25 50 72
97 120 147 172 200 223</spanx></c> |
| 2850 <c>4</c> |
| 2851 <c><spanx style="vbare">26 44 69 90 114&nb
sp;135 159 180 205 225</spanx></c> |
| 2852 <c>5</c> |
| 2853 <c><spanx style="vbare">13 22 53 80 106&nb
sp;130 156 180 205 228</spanx></c> |
| 2854 <c>6</c> |
| 2855 <c><spanx style="vbare">15 25 44 64
90 115 142 168 196 222</spanx></c> |
| 2856 <c>7</c> |
| 2857 <c><spanx style="vbare">19 24 62 82 100&nb
sp;120 145 168 190 214</spanx></c> |
| 2858 <c>8</c> |
| 2859 <c><spanx style="vbare">22 31 50 79 103&nb
sp;120 151 170 203 227</spanx></c> |
| 2860 <c>9</c> |
| 2861 <c><spanx style="vbare">21 29 45 65 106&nb
sp;124 150 171 196 224</spanx></c> |
| 2862 <c>10</c> |
| 2863 <c><spanx style="vbare">30 49 75 97 121&nb
sp;142 165 186 209 229</spanx></c> |
| 2864 <c>11</c> |
| 2865 <c><spanx style="vbare">19 25 52 70
93 116 143 166 192 219</spanx></c> |
| 2866 <c>12</c> |
| 2867 <c><spanx style="vbare">26 34 62 75
97 118 145 167 194 217</spanx></c> |
| 2868 <c>13</c> |
| 2869 <c><spanx style="vbare">25 33 56 70
91 113 143 165 196 223</spanx></c> |
| 2870 <c>14</c> |
| 2871 <c><spanx style="vbare">21 34 51 72
97 117 145 171 196 222</spanx></c> |
| 2872 <c>15</c> |
| 2873 <c><spanx style="vbare">20 29 50 67
90 117 144 168 197 221</spanx></c> |
| 2874 <c>16</c> |
| 2875 <c><spanx style="vbare">22 31 48 66
95 117 146 168 196 222</spanx></c> |
| 2876 <c>17</c> |
| 2877 <c><spanx style="vbare">24 33 51 77 116&nb
sp;134 158 180 200 224</spanx></c> |
| 2878 <c>18</c> |
| 2879 <c><spanx style="vbare">21 28 70 87 106&nb
sp;124 149 170 194 217</spanx></c> |
| 2880 <c>19</c> |
| 2881 <c><spanx style="vbare">26 33 53 64
83 117 152 173 204 225</spanx></c> |
| 2882 <c>20</c> |
| 2883 <c><spanx style="vbare">27 34 65 95 108&nb
sp;129 155 174 210 225</spanx></c> |
| 2884 <c>21</c> |
| 2885 <c><spanx style="vbare">20 26 72 99 113&nb
sp;131 154 176 200 219</spanx></c> |
| 2886 <c>22</c> |
| 2887 <c><spanx style="vbare">34 43 61 78
93 114 155 177 205 229</spanx></c> |
| 2888 <c>23</c> |
| 2889 <c><spanx style="vbare">23 29 54 97 124&nb
sp;138 163 179 209 229</spanx></c> |
| 2890 <c>24</c> |
| 2891 <c><spanx style="vbare">30 38 56 89 118&nb
sp;129 158 178 200 231</spanx></c> |
| 2892 <c>25</c> |
| 2893 <c><spanx style="vbare">21 29 49 63
85 111 142 163 193 222</spanx></c> |
| 2894 <c>26</c> |
| 2895 <c><spanx style="vbare">27 48 77 103 133 15
8 179 196 215 232</spanx></c> |
| 2896 <c>27</c> |
| 2897 <c><spanx style="vbare">29 47 74 99 124&nb
sp;151 176 198 220 237</spanx></c> |
| 2898 <c>28</c> |
| 2899 <c><spanx style="vbare">33 42 61 76
93 121 155 174 207 225</spanx></c> |
| 2900 <c>29</c> |
| 2901 <c><spanx style="vbare">29 53 87 112 136 15
4 170 188 208 227</spanx></c> |
| 2902 <c>30</c> |
| 2903 <c><spanx style="vbare">24 30 52 84 131&nb
sp;150 166 186 203 229</spanx></c> |
| 2904 <c>31</c> |
| 2905 <c><spanx style="vbare">37 48 64 84 104&nb
sp;118 156 177 201 230</spanx></c> |
| 2906 </texttable> |
| 2907 |
| 2908 <texttable anchor="silk_nlsf_wb_codebook" |
| 2909 title="WB Normalized LSF Stage-1 Codebook Vectors"> |
| 2910 <ttcol>I1</ttcol> |
| 2911 <ttcol>Codebook (Q8)</ttcol> |
| 2912 <c/> |
| 2913 <c><spanx style="vbare"> 0 1 2 3 &nbs
p;4 5 6 7 8&
nbsp; 9 10 11 12 13
14 15</spanx></c> |
| 2914 <c>0</c> |
| 2915 <c><spanx style="vbare"> 7 23 38 54 69 85&nb
sp;100 116 131 147 162 178 193 208 223&n
bsp;239</spanx></c> |
| 2916 <c>1</c> |
| 2917 <c><spanx style="vbare">13 25 41 55 69 83 &n
bsp;98 112 127 142 157 171 187 203 220&n
bsp;236</spanx></c> |
| 2918 <c>2</c> |
| 2919 <c><spanx style="vbare">15 21 34 51 61 78 &n
bsp;92 106 126 136 152 167 185 205 225&n
bsp;240</spanx></c> |
| 2920 <c>3</c> |
| 2921 <c><spanx style="vbare">10 21 36 50 63 79 &n
bsp;95 110 126 141 157 173 189 205 221&n
bsp;237</spanx></c> |
| 2922 <c>4</c> |
| 2923 <c><spanx style="vbare">17 20 37 51 59 78 &n
bsp;89 107 123 134 150 164 184 205 224&n
bsp;240</spanx></c> |
| 2924 <c>5</c> |
| 2925 <c><spanx style="vbare">10 15 32 51 67 81 &n
bsp;96 112 129 142 158 173 189 204 220&n
bsp;236</spanx></c> |
| 2926 <c>6</c> |
| 2927 <c><spanx style="vbare"> 8 21 37 51 65 79&nb
sp; 98 113 126 138 155 168 179 192
209 218</spanx></c> |
| 2928 <c>7</c> |
| 2929 <c><spanx style="vbare">12 15 34 55 63 78 &n
bsp;87 108 118 131 148 167 185 203 219&n
bsp;236</spanx></c> |
| 2930 <c>8</c> |
| 2931 <c><spanx style="vbare">16 19 32 36 56 79 &n
bsp;91 108 118 136 154 171 186 204 220&n
bsp;237</spanx></c> |
| 2932 <c>9</c> |
| 2933 <c><spanx style="vbare">11 28 43 58 74 89 10
5 120 135 150 165 180 196 211 226 2
41</spanx></c> |
| 2934 <c>10</c> |
| 2935 <c><spanx style="vbare"> 6 16 33 46 60 75&nb
sp; 92 107 123 137 156 169 185 199
214 225</spanx></c> |
| 2936 <c>11</c> |
| 2937 <c><spanx style="vbare">11 19 30 44 57 74 &n
bsp;89 105 121 135 152 169 186 202 218&n
bsp;234</spanx></c> |
| 2938 <c>12</c> |
| 2939 <c><spanx style="vbare">12 19 29 46 57 71 &n
bsp;88 100 120 132 148 165 182 199 216&n
bsp;233</spanx></c> |
| 2940 <c>13</c> |
| 2941 <c><spanx style="vbare">17 23 35 46 56 77 &n
bsp;92 106 123 134 152 167 185 204 222&n
bsp;237</spanx></c> |
| 2942 <c>14</c> |
| 2943 <c><spanx style="vbare">14 17 45 53 63 75 &n
bsp;89 107 115 132 151 171 188 206 221&n
bsp;240</spanx></c> |
| 2944 <c>15</c> |
| 2945 <c><spanx style="vbare"> 9 16 29 40 56 71&nb
sp; 88 103 119 137 154 171 189 205
222 237</spanx></c> |
| 2946 <c>16</c> |
| 2947 <c><spanx style="vbare">16 19 36 48 57 76 &n
bsp;87 105 118 132 150 167 185 202 218&n
bsp;236</spanx></c> |
| 2948 <c>17</c> |
| 2949 <c><spanx style="vbare">12 17 29 54 71 81 &n
bsp;94 104 126 136 149 164 182 201 221&n
bsp;237</spanx></c> |
| 2950 <c>18</c> |
| 2951 <c><spanx style="vbare">15 28 47 62 79 97 11
5 129 142 155 168 180 194 208 223 2
38</spanx></c> |
| 2952 <c>19</c> |
| 2953 <c><spanx style="vbare"> 8 14 30 45 62 78&nb
sp; 94 111 127 143 159 175 192 207
223 239</spanx></c> |
| 2954 <c>20</c> |
| 2955 <c><spanx style="vbare">17 30 49 62 79 92 10
7 119 132 145 160 174 190 204 220 2
35</spanx></c> |
| 2956 <c>21</c> |
| 2957 <c><spanx style="vbare">14 19 36 45 61 76 &n
bsp;91 108 121 138 154 172 189 205 222&n
bsp;238</spanx></c> |
| 2958 <c>22</c> |
| 2959 <c><spanx style="vbare">12 18 31 45 60 76 &n
bsp;91 107 123 138 154 171 187 204 221&n
bsp;236</spanx></c> |
| 2960 <c>23</c> |
| 2961 <c><spanx style="vbare">13 17 31 43 53 70 &n
bsp;83 103 114 131 149 167 185 203 220&n
bsp;237</spanx></c> |
| 2962 <c>24</c> |
| 2963 <c><spanx style="vbare">17 22 35 42 58 78 &n
bsp;93 110 125 139 155 170 188 206 224&n
bsp;240</spanx></c> |
| 2964 <c>25</c> |
| 2965 <c><spanx style="vbare"> 8 15 34 50 67 83&nb
sp; 99 115 131 146 162 178 193 209
224 239</spanx></c> |
| 2966 <c>26</c> |
| 2967 <c><spanx style="vbare">13 16 41 66 73 86 &n
bsp;95 111 128 137 150 163 183 206 225&n
bsp;241</spanx></c> |
| 2968 <c>27</c> |
| 2969 <c><spanx style="vbare">17 25 37 52 63 75 &n
bsp;92 102 119 132 144 160 175 191 212&n
bsp;231</spanx></c> |
| 2970 <c>28</c> |
| 2971 <c><spanx style="vbare">19 31 49 65 83 100 117&nbs
p;133 147 161 174 187 200 213 227 242</s
panx></c> |
| 2972 <c>29</c> |
| 2973 <c><spanx style="vbare">18 31 52 68 88 103 117&nbs
p;126 138 149 163 177 192 207 223 239</s
panx></c> |
| 2974 <c>30</c> |
| 2975 <c><spanx style="vbare">16 29 47 61 76 90 10
6 119 133 147 161 176 193 209 224 2
40</spanx></c> |
| 2976 <c>31</c> |
| 2977 <c><spanx style="vbare">15 21 35 50 61 73 &n
bsp;86 97 110 119 129 141 175 198
218 237</spanx></c> |
| 2978 </texttable> |
| 2979 |
| 2980 <t> |
| 2981 Given the stage-1 codebook entry cb1_Q8[], the stage-2 residual res_Q10[], and |
| 2982 their corresponding weights, w_Q9[], the reconstructed normalized LSF |
| 2983 coefficients are |
| 2984 <figure align="center"> |
| 2985 <artwork align="center"><![CDATA[ |
| 2986 NLSF_Q15[k] = clamp(0, |
| 2987 (cb1_Q8[k]<<7) + (res_Q10[k]<<14)/w_Q9[k], 32767) , |
| 2988 ]]></artwork> |
| 2989 </figure> |
| 2990 where the division is integer division. |
| 2991 However, nothing in either the reconstruction process or the |
| 2992 quantization process in the encoder thus far guarantees that the coefficients |
| 2993 are monotonically increasing and separated well enough to ensure a stable |
| 2994 filter <xref target="Kabal86"/>. |
| 2995 When using the reference encoder, roughly 2% of frames violate this constraint. |
| 2996 The next section describes a stabilization procedure used to make these |
| 2997 guarantees. |
| 2998 </t> |
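<t>
As an illustration, the reconstruction formula can be written in C as below.
The function and helper names are hypothetical, and the weights are assumed to
have been computed already.
The sketch multiplies by 128 and 16384 instead of shifting, because res_Q10[k]
may be negative and left-shifting a negative value is undefined in C.
</t>
<figure>
<artwork><![CDATA[
```c
#include <stdint.h>

static int32_t clamp32(int32_t lo, int32_t x, int32_t hi)
{
    return x < lo ? lo : (x > hi ? hi : x);
}

/* Hypothetical helper: reconstructs NLSF_Q15[0..d_LPC-1]. */
static void nlsf_reconstruct_sketch(int32_t *NLSF_Q15,
                                    const uint8_t *cb1_Q8,
                                    const int32_t *res_Q10,
                                    const int32_t *w_Q9, int d_LPC)
{
    for (int k = 0; k < d_LPC; k++) {
        int32_t base = (int32_t)cb1_Q8[k] * 128;        /* cb1_Q8[k]<<7 */
        /* C99 integer division truncates toward zero. */
        int32_t corr = (res_Q10[k] * 16384) / w_Q9[k];  /* (res<<14)/w */
        NLSF_Q15[k] = clamp32(0, base + corr, 32767);
    }
}
```
]]></artwork>
</figure>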
| 2999 |
| 3000 </section> |
| 3001 |
| 3002 <section anchor="silk_nlsf_stabilization" title="Normalized LSF Stabilization"> |
| 3003 <t> |
| 3004 The normalized LSF stabilization procedure is implemented in |
| 3005 silk_NLSF_stabilize() (NLSF_stabilize.c). |
| 3006 This process ensures that consecutive values of the normalized LSF |
| 3007 coefficients, NLSF_Q15[], are spaced some minimum distance apart |
| 3008 (predetermined to be the 0.01 percentile of a large training set). |
| 3009 <xref target="silk_nlsf_min_spacing"/> gives the minimum spacings for NB and MB |
| 3010 and those for WB, where row k is the minimum allowed value of |
| 3011 NLSF_Q15[k]-NLSF_Q15[k-1]. |
| 3012 For the purposes of computing this spacing for the first and last coefficient, |
| 3013 NLSF_Q15[-1] is taken to be 0, and NLSF_Q15[d_LPC] is taken to be 32768. |
| 3014 </t> |
| 3015 |
| 3016 <texttable anchor="silk_nlsf_min_spacing" |
| 3017 title="Minimum Spacing for Normalized LSF Coefficients"> |
| 3018 <ttcol>Coefficient</ttcol> |
| 3019 <ttcol align="right">NB and MB</ttcol> |
| 3020 <ttcol align="right">WB</ttcol> |
| 3021 <c>0</c> <c>250</c> <c>100</c> |
| 3022 <c>1</c> <c>3</c> <c>3</c> |
| 3023 <c>2</c> <c>6</c> <c>40</c> |
| 3024 <c>3</c> <c>3</c> <c>3</c> |
| 3025 <c>4</c> <c>3</c> <c>3</c> |
| 3026 <c>5</c> <c>3</c> <c>3</c> |
| 3027 <c>6</c> <c>4</c> <c>5</c> |
| 3028 <c>7</c> <c>3</c> <c>14</c> |
| 3029 <c>8</c> <c>3</c> <c>14</c> |
| 3030 <c>9</c> <c>3</c> <c>10</c> |
| 3031 <c>10</c> <c>461</c> <c>11</c> |
| 3032 <c>11</c> <c/> <c>3</c> |
| 3033 <c>12</c> <c/> <c>8</c> |
| 3034 <c>13</c> <c/> <c>9</c> |
| 3035 <c>14</c> <c/> <c>7</c> |
| 3036 <c>15</c> <c/> <c>3</c> |
| 3037 <c>16</c> <c/> <c>347</c> |
| 3038 </texttable> |
| 3039 |
| 3040 <t> |
| 3041 The procedure starts off by trying to make small adjustments which attempt to |
| 3042 minimize the amount of distortion introduced. |
| 3043 After 20 such adjustments, it falls back to a more direct method which |
| 3044 guarantees the constraints are enforced but may require large adjustments. |
| 3045 </t> |
| 3046 <t> |
| 3047 Let NDeltaMin_Q15[k] be the minimum required spacing for the current audio |
| 3048 bandwidth from <xref target="silk_nlsf_min_spacing"/>. |
| 3049 First, the procedure finds the index i where |
| 3050 NLSF_Q15[i] - NLSF_Q15[i-1] - NDeltaMin_Q15[i] is the |
| 3051 smallest, breaking ties by using the lower value of i. |
| 3052 If this value is non-negative, then the stabilization stops; the coefficients |
| 3053 satisfy all the constraints. |
| 3054 Otherwise, if i == 0, it sets NLSF_Q15[0] to NDeltaMin_Q15[0], and if |
| 3055 i == d_LPC, it sets NLSF_Q15[d_LPC-1] to |
| 3056 (32768 - NDeltaMin_Q15[d_LPC]). |
| 3057 For all other values of i, both NLSF_Q15[i-1] and NLSF_Q15[i] are updated as |
| 3058 follows: |
| 3059 <figure align="center"> |
| 3060 <artwork align="center"><![CDATA[ |
| 3061 i-1 |
| 3062 __ |
| 3063 min_center_Q15 = (NDeltaMin_Q15[i]>>1) + \ NDeltaMin_Q15[k] |
| 3064 /_ |
| 3065 k=0 |
| 3066 d_LPC |
| 3067 __ |
| 3068 max_center_Q15 = 32768 - (NDeltaMin_Q15[i]>>1) - \ NDeltaMin_Q15[k] |
| 3069 /_ |
| 3070 k=i+1 |
| 3071 center_freq_Q15 = clamp(min_center_Q15, |
| 3072 (NLSF_Q15[i-1] + NLSF_Q15[i] + 1)>>1, |
| 3073 max_center_Q15) |
| 3074 |
| 3075 NLSF_Q15[i-1] = center_freq_Q15 - (NDeltaMin_Q15[i]>>1) |
| 3076 |
| 3077 NLSF_Q15[i] = NLSF_Q15[i-1] + NDeltaMin_Q15[i] . |
| 3078 ]]></artwork> |
| 3079 </figure> |
| 3080 The procedure then repeats until it has either executed 20 times or |
| 3081 has stopped because the coefficients satisfy all the constraints. |
| 3082 </t> |
| 3083 <t> |
| 3084 After the 20th repetition of the above procedure, the following fallback |
| 3085 procedure executes once. |
| 3086 First, the values of NLSF_Q15[k] for 0 <= k < d_LPC |
| 3087 are sorted in ascending order. |
| 3088 Then for each value of k from 0 to d_LPC-1, NLSF_Q15[k] is set to |
| 3089 <figure align="center"> |
| 3090 <artwork align="center"><![CDATA[ |
| 3091 max(NLSF_Q15[k], NLSF_Q15[k-1] + NDeltaMin_Q15[k]) . |
| 3092 ]]></artwork> |
| 3093 </figure> |
| 3094 Next, for each value of k from d_LPC-1 down to 0, NLSF_Q15[k] is set to |
| 3095 <figure align="center"> |
| 3096 <artwork align="center"><![CDATA[ |
| 3097 min(NLSF_Q15[k], NLSF_Q15[k+1] - NDeltaMin_Q15[k+1]) . |
| 3098 ]]></artwork> |
| 3099 </figure> |
| 3100 </t> |
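<t>
The two phases of the stabilization procedure can be sketched in C as follows.
This is a simplified illustration, not the code in NLSF_stabilize.c: the name
nlsf_stabilize_sketch() is hypothetical, all values are held in 32-bit integers
for clarity, and the sketch assumes that min_center_Q15 never exceeds
max_center_Q15, which holds for the spacings in the table above.
</t>
<figure>
<artwork><![CDATA[
```c
#include <stdint.h>

static int32_t clamp32(int32_t lo, int32_t x, int32_t hi)
{
    return x < lo ? lo : (x > hi ? hi : x);
}

/* Hypothetical helper: nlsf has d entries, ndelta_min has d+1 entries. */
static void nlsf_stabilize_sketch(int32_t *nlsf, const int32_t *ndelta_min,
                                  int d)
{
    for (int iter = 0; iter < 20; iter++) {
        /* Find the smallest slack; '<' keeps the lowest index on ties. */
        int best = 0;
        int32_t min_diff = nlsf[0] - ndelta_min[0];    /* NLSF_Q15[-1] = 0 */
        for (int i = 1; i <= d; i++) {
            int32_t upper = (i < d) ? nlsf[i] : 32768; /* NLSF_Q15[d_LPC] */
            int32_t diff = upper - nlsf[i - 1] - ndelta_min[i];
            if (diff < min_diff) {
                min_diff = diff;
                best = i;
            }
        }
        if (min_diff >= 0)
            return;                      /* all constraints are satisfied */
        if (best == 0) {
            nlsf[0] = ndelta_min[0];
        } else if (best == d) {
            nlsf[d - 1] = 32768 - ndelta_min[d];
        } else {
            int32_t half = ndelta_min[best] >> 1;
            int32_t min_center = half;
            int32_t max_center = 32768 - half;
            for (int k = 0; k < best; k++)
                min_center += ndelta_min[k];
            for (int k = best + 1; k <= d; k++)
                max_center -= ndelta_min[k];
            int32_t center = clamp32(min_center,
                                     (nlsf[best - 1] + nlsf[best] + 1) >> 1,
                                     max_center);
            nlsf[best - 1] = center - half;
            nlsf[best] = nlsf[best - 1] + ndelta_min[best];
        }
    }
    /* Fallback after 20 adjustments: insertion sort, ascending. */
    for (int i = 1; i < d; i++) {
        int32_t v = nlsf[i];
        int j = i - 1;
        while (j >= 0 && nlsf[j] > v) {
            nlsf[j + 1] = nlsf[j];
            j--;
        }
        nlsf[j + 1] = v;
    }
    /* Forward pass: enforce the minimum spacing from below. */
    for (int k = 0; k < d; k++) {
        int32_t lower = ((k > 0) ? nlsf[k - 1] : 0) + ndelta_min[k];
        if (nlsf[k] < lower)
            nlsf[k] = lower;
    }
    /* Backward pass: enforce the minimum spacing from above. */
    for (int k = d - 1; k >= 0; k--) {
        int32_t upper = ((k + 1 < d) ? nlsf[k + 1] : 32768)
                        - ndelta_min[k + 1];
        if (nlsf[k] > upper)
            nlsf[k] = upper;
    }
}
```
]]></artwork>
</figure>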
| 3101 |
| 3102 </section> |
| 3103 |
| 3104 <section anchor="silk_nlsf_interpolation" title="Normalized LSF Interpolation"> |
| 3105 <t> |
| 3106 For 20 ms SILK frames, the first half of the frame (i.e., the first two |
| 3107 subframes) may use normalized LSF coefficients that are interpolated between |
| 3108 the decoded LSFs for the most recent coded frame (in the same channel) and the |
| 3109 current frame. |
| 3110 A Q2 interpolation factor follows the LSF coefficient indices in the bitstream, |
| 3111 which is decoded using the PDF in <xref target="silk_nlsf_interp_pdf"/>. |
| 3112 This happens in silk_decode_indices() (decode_indices.c). |
| 3113 After either |
| 3114 <list style="symbols"> |
| 3115 <t>An uncoded regular SILK frame in the side channel, or</t> |
| 3116 <t>A decoder reset (see <xref target="decoder-reset"/>),</t> |
| 3117 </list> |
| 3118 the decoder still decodes this factor, but ignores its value and always uses |
| 3119 4 instead. |
| 3120 For 10 ms SILK frames, this factor is not stored at all. |
| 3121 </t> |
| 3122 |
| 3123 <texttable anchor="silk_nlsf_interp_pdf" |
| 3124 title="PDF for Normalized LSF Interpolation Index"> |
| 3125 <ttcol>PDF</ttcol> |
| 3126 <c>{13, 22, 29, 11, 181}/256</c> |
| 3127 </texttable> |
| 3128 |
| 3129 <t> |
| 3130 Let n2_Q15[k] be the normalized LSF coefficients decoded by the procedure in |
| 3131 <xref target="silk_nlsfs"/>, n0_Q15[k] be the LSF coefficients |
| 3132 decoded for the prior frame, and w_Q2 be the interpolation factor. |
| 3133 Then the normalized LSF coefficients used for the first half of a 20 ms |
| 3134 frame, n1_Q15[k], are |
| 3135 <figure align="center"> |
| 3136 <artwork align="center"><![CDATA[ |
| 3137 n1_Q15[k] = n0_Q15[k] + (w_Q2*(n2_Q15[k] - n0_Q15[k]) >> 2) . |
| 3138 ]]></artwork> |
| 3139 </figure> |
| 3140 This interpolation is performed in silk_decode_parameters() |
| 3141 (decode_parameters.c). |
| 3142 </t> |
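<t>
A minimal C sketch of this interpolation is given below.
The function name is hypothetical, and the sketch assumes that right-shifting
a negative value is an arithmetic shift, which C leaves implementation-defined.
</t>
<figure>
<artwork><![CDATA[
```c
#include <stdint.h>

/* Hypothetical helper: w_Q2 is the decoded factor, in the range 0..4. */
static void nlsf_interpolate_sketch(int32_t *n1_Q15, const int32_t *n0_Q15,
                                    const int32_t *n2_Q15, int w_Q2,
                                    int d_LPC)
{
    for (int k = 0; k < d_LPC; k++) {
        /* Assumes an arithmetic right shift when the product is negative. */
        n1_Q15[k] = n0_Q15[k] + ((w_Q2 * (n2_Q15[k] - n0_Q15[k])) >> 2);
    }
}
```
]]></artwork>
</figure>
<t>
With w_Q2 equal to 4, the result equals the coefficients of the current frame,
so no interpolation takes place.
</t>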
| 3143 </section> |
| 3144 |
| 3145 <section anchor="silk_nlsf2lpc" |
| 3146 title="Converting Normalized LSFs to LPC Coefficients"> |
| 3147 <t> |
| 3148 Any LPC filter A(z) can be split into a symmetric part P(z) and an |
| 3149 anti-symmetric part Q(z) such that |
| 3150 <figure align="center"> |
| 3151 <artwork align="center"><![CDATA[ |
| 3152 d_LPC |
| 3153 __ -k 1 |
| 3154 A(z) = 1 - \ a[k] * z = - * (P(z) + Q(z)) |
| 3155 /_ 2 |
| 3156 k=1 |
| 3157 ]]></artwork> |
| 3158 </figure> |
| 3159 with |
| 3160 <figure align="center"> |
| 3161 <artwork align="center"><![CDATA[ |
| 3162 -d_LPC-1 -1 |
| 3163 P(z) = A(z) + z * A(z ) |
| 3164 |
| 3165 -d_LPC-1 -1 |
| 3166 Q(z) = A(z) - z * A(z ) . |
| 3167 ]]></artwork> |
| 3168 </figure> |
| 3169 The even normalized LSF coefficients correspond to a pair of conjugate roots of |
| 3170 P(z), while the odd coefficients correspond to a pair of conjugate roots of |
| 3171 Q(z), all of which lie on the unit circle. |
| 3172 In addition, P(z) has a root at pi and Q(z) has a root at 0. |
| 3173 Thus, they may be reconstructed mathematically from a set of normalized LSF |
| 3174 coefficients, n[k], as |
| 3175 <figure align="center"> |
| 3176 <artwork align="center"><![CDATA[ |
| 3177 d_LPC/2-1 |
| 3178 -1 ___ -1 -2 |
| 3179 P(z) = (1 + z ) * | | (1 - 2*cos(pi*n[2*k])*z + z ) |
| 3180 k=0 |
| 3181 |
| 3182 d_LPC/2-1 |
| 3183 -1 ___ -1 -2 |
| 3184 Q(z) = (1 - z ) * | | (1 - 2*cos(pi*n[2*k+1])*z + z ) |
| 3185 k=0 |
| 3186 ]]></artwork> |
| 3187 </figure> |
| 3188 </t> |
| 3189 <t> |
| 3190 However, SILK performs this reconstruction using a fixed-point approximation so |
| 3191 that all decoders can reproduce it in a bit-exact manner to avoid prediction |
| 3192 drift. |
| 3193 The function silk_NLSF2A() (NLSF2A.c) implements this procedure. |
| 3194 </t> |
| 3195 <t> |
| 3196 To start, it approximates cos(pi*n[k]) using a table lookup with linear |
| 3197 interpolation. |
| 3198 The encoder SHOULD use the inverse of this piecewise linear approximation, |
| 3199 rather than the true inverse of the cosine function, when deriving the |
| 3200 normalized LSF coefficients. |
| 3201 These values are also re-ordered to improve numerical accuracy when |
| 3202 constructing the LPC polynomials. |
| 3203 </t> |
| 3204 |
| 3205 <texttable anchor="silk_nlsf_orderings" |
| 3206 title="LSF Ordering for Polynomial Evaluation"> |
| 3207 <ttcol>Coefficient</ttcol> |
| 3208 <ttcol align="right">NB and MB</ttcol> |
| 3209 <ttcol align="right">WB</ttcol> |
| 3210 <c>0</c> <c>0</c> <c>0</c> |
| 3211 <c>1</c> <c>9</c> <c>15</c> |
| 3212 <c>2</c> <c>6</c> <c>8</c> |
| 3213 <c>3</c> <c>3</c> <c>7</c> |
| 3214 <c>4</c> <c>4</c> <c>4</c> |
| 3215 <c>5</c> <c>5</c> <c>11</c> |
| 3216 <c>6</c> <c>8</c> <c>12</c> |
| 3217 <c>7</c> <c>1</c> <c>3</c> |
| 3218 <c>8</c> <c>2</c> <c>2</c> |
| 3219 <c>9</c> <c>7</c> <c>13</c> |
| 3220 <c>10</c> <c/> <c>10</c> |
| 3221 <c>11</c> <c/> <c>5</c> |
| 3222 <c>12</c> <c/> <c>6</c> |
| 3223 <c>13</c> <c/> <c>9</c> |
| 3224 <c>14</c> <c/> <c>14</c> |
| 3225 <c>15</c> <c/> <c>1</c> |
| 3226 </texttable> |
| 3227 |
| 3228 <t> |
| 3229 The top 7 bits of each normalized LSF coefficient index a value in the table, |
| 3230 and the next 8 bits interpolate between it and the next value. |
| 3231 Let i = (n[k] >> 8) be the integer index and |
| 3232 f = (n[k] & 255) be the fractional part of a given |
| 3233 coefficient. |
| 3234 Then the re-ordered, approximated cosine, c_Q17[ordering[k]], is |
| 3235 <figure align="center"> |
| 3236 <artwork align="center"><![CDATA[ |
| 3237 c_Q17[ordering[k]] = (cos_Q12[i]*256 |
| 3238 + (cos_Q12[i+1]-cos_Q12[i])*f + 4) >> 3 , |
| 3239 ]]></artwork> |
| 3240 </figure> |
| 3241 where ordering[k] is the k'th entry of the column of |
| 3242 <xref target="silk_nlsf_orderings"/> corresponding to the current audio |
| 3243 bandwidth and cos_Q12[i] is the i'th entry of <xref target="silk_cos_table"/>. |
| 3244 </t> |
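<t>
The table lookup and re-ordering can be sketched in C as follows.
Both function names are hypothetical; the caller is assumed to supply the
129-entry Q12 cosine table and the ordering column for the current bandwidth.
The sketch also assumes an arithmetic right shift for negative values.
</t>
<figure>
<artwork><![CDATA[
```c
#include <stdint.h>

/* Hypothetical helper: cos_Q12 points to the 129-entry cosine table and
 * n_Q15 is a normalized LSF coefficient in the range 0..32767. */
static int32_t nlsf_cos_sketch(const int16_t *cos_Q12, int32_t n_Q15)
{
    int i = n_Q15 >> 8;       /* top 7 bits: table index, 0..127 */
    int32_t f = n_Q15 & 255;  /* low 8 bits: interpolation fraction */
    /* Assumes an arithmetic right shift when the sum is negative. */
    return (cos_Q12[i] * 256 + (cos_Q12[i + 1] - cos_Q12[i]) * f + 4) >> 3;
}

/* Hypothetical helper: stores each cosine at its re-ordered position. */
static void nlsf_cos_reorder_sketch(int32_t *c_Q17, const int32_t *n_Q15,
                                    const uint8_t *ordering,
                                    const int16_t *cos_Q12, int d_LPC)
{
    for (int k = 0; k < d_LPC; k++)
        c_Q17[ordering[k]] = nlsf_cos_sketch(cos_Q12, n_Q15[k]);
}
```
]]></artwork>
</figure>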
| 3245 |
| 3246 <texttable anchor="silk_cos_table" |
| 3247 title="Q12 Cosine Table for LSF Conversion"> |
| 3248 <ttcol align="right">i</ttcol> |
| 3249 <ttcol align="right">+0</ttcol> |
| 3250 <ttcol align="right">+1</ttcol> |
| 3251 <ttcol align="right">+2</ttcol> |
| 3252 <ttcol align="right">+3</ttcol> |
| 3253 <c>0</c> |
| 3254 <c>4096</c> <c>4095</c> <c>4091</c> <c>4085</c> |
| 3255 <c>4</c> |
| 3256 <c>4076</c> <c>4065</c> <c>4052</c> <c>4036</c> |
| 3257 <c>8</c> |
| 3258 <c>4017</c> <c>3997</c> <c>3973</c> <c>3948</c> |
| 3259 <c>12</c> |
| 3260 <c>3920</c> <c>3889</c> <c>3857</c> <c>3822</c> |
| 3261 <c>16</c> |
| 3262 <c>3784</c> <c>3745</c> <c>3703</c> <c>3659</c> |
| 3263 <c>20</c> |
| 3264 <c>3613</c> <c>3564</c> <c>3513</c> <c>3461</c> |
| 3265 <c>24</c> |
| 3266 <c>3406</c> <c>3349</c> <c>3290</c> <c>3229</c> |
| 3267 <c>28</c> |
| 3268 <c>3166</c> <c>3102</c> <c>3035</c> <c>2967</c> |
| 3269 <c>32</c> |
| 3270 <c>2896</c> <c>2824</c> <c>2751</c> <c>2676</c> |
| 3271 <c>36</c> |
| 3272 <c>2599</c> <c>2520</c> <c>2440</c> <c>2359</c> |
| 3273 <c>40</c> |
| 3274 <c>2276</c> <c>2191</c> <c>2106</c> <c>2019</c> |
| 3275 <c>44</c> |
| 3276 <c>1931</c> <c>1842</c> <c>1751</c> <c>1660</c> |
| 3277 <c>48</c> |
| 3278 <c>1568</c> <c>1474</c> <c>1380</c> <c>1285</c> |
| 3279 <c>52</c> |
| 3280 <c>1189</c> <c>1093</c> <c>995</c> <c>897</c> |
| 3281 <c>56</c> |
| 3282 <c>799</c> <c>700</c> <c>601</c> <c>501</c> |
| 3283 <c>60</c> |
| 3284 <c>401</c> <c>301</c> <c>201</c> <c>101</c> |
| 3285 <c>64</c> |
| 3286 <c>0</c> <c>-101</c> <c>-201</c> <c>-301</c> |
| 3287 <c>68</c> |
| 3288 <c>-401</c> <c>-501</c> <c>-601</c> <c>-700</c> |
| 3289 <c>72</c> |
| 3290 <c>-799</c> <c>-897</c> <c>-995</c> <c>-1093</c> |
| 3291 <c>76</c> |
| 3292 <c>-1189</c><c>-1285</c><c>-1380</c><c>-1474</c> |
| 3293 <c>80</c> |
| 3294 <c>-1568</c><c>-1660</c><c>-1751</c><c>-1842</c> |
| 3295 <c>84</c> |
| 3296 <c>-1931</c><c>-2019</c><c>-2106</c><c>-2191</c> |
| 3297 <c>88</c> |
| 3298 <c>-2276</c><c>-2359</c><c>-2440</c><c>-2520</c> |
| 3299 <c>92</c> |
| 3300 <c>-2599</c><c>-2676</c><c>-2751</c><c>-2824</c> |
| 3301 <c>96</c> |
| 3302 <c>-2896</c><c>-2967</c><c>-3035</c><c>-3102</c> |
| 3303 <c>100</c> |
| 3304 <c>-3166</c><c>-3229</c><c>-3290</c><c>-3349</c> |
| 3305 <c>104</c> |
| 3306 <c>-3406</c><c>-3461</c><c>-3513</c><c>-3564</c> |
| 3307 <c>108</c> |
| 3308 <c>-3613</c><c>-3659</c><c>-3703</c><c>-3745</c> |
| 3309 <c>112</c> |
| 3310 <c>-3784</c><c>-3822</c><c>-3857</c><c>-3889</c> |
| 3311 <c>116</c> |
| 3312 <c>-3920</c><c>-3948</c><c>-3973</c><c>-3997</c> |
| 3313 <c>120</c> |
| 3314 <c>-4017</c><c>-4036</c><c>-4052</c><c>-4065</c> |
| 3315 <c>124</c> |
| 3316 <c>-4076</c><c>-4085</c><c>-4091</c><c>-4095</c> |
| 3317 <c>128</c> |
| 3318 <c>-4096</c> <c/> <c/> <c/> |
| 3319 </texttable> |
| 3320 |
| 3321 <t> |
| 3322 Given the list of cosine values, silk_NLSF2A_find_poly() (NLSF2A.c) |
| 3323 computes the coefficients of P and Q, described here via a simple recurrence. |
| 3324 Let p_Q16[k][j] and q_Q16[k][j] be the coefficients of the products of the |
| 3325 first (k+1) root pairs for P and Q, with j indexing the coefficient number. |
| 3326 Only the first (k+2) coefficients are needed, as the products are symmetric. |
| 3327 Let p_Q16[0][0] = q_Q16[0][0] = 1<<16, |
| 3328 p_Q16[0][1] = -c_Q17[0], q_Q16[0][1] = -c_Q17[1], and |
| 3329 d2 = d_LPC/2. |
| 3330 As boundary conditions, assume |
| 3331 p_Q16[k][j] = q_Q16[k][j] = 0 for all |
| 3332 j < 0. |
| 3333 Also, assume p_Q16[k][k+2] = p_Q16[k][k] and |
| 3334 q_Q16[k][k+2] = q_Q16[k][k] (because of the symmetry). |
| 3335 Then, for 0 < k < d2 and 0 <= j <= k+1, |
| 3336 <figure align="center"> |
| 3337 <artwork align="center"><![CDATA[ |
| 3338 p_Q16[k][j] = p_Q16[k-1][j] + p_Q16[k-1][j-2] |
| 3339 - ((c_Q17[2*k]*p_Q16[k-1][j-1] + 32768)>>16) , |
| 3340 |
| 3341 q_Q16[k][j] = q_Q16[k-1][j] + q_Q16[k-1][j-2] |
| 3342 - ((c_Q17[2*k+1]*q_Q16[k-1][j-1] + 32768)>>16) . |
| 3343 ]]></artwork> |
| 3344 </figure> |
| 3345 The use of Q17 values for the cosine terms in an otherwise Q16 expression |
| 3346 implicitly scales them by a factor of 2. |
| 3347 The multiplications in this recurrence may require up to 48 bits of precision |
| 3348 in the result to avoid overflow. |
| 3349 In practice, each row of the recurrence only depends on the previous row, so an |
| 3350 implementation does not need to store all of them. |
| 3351 </t> |
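<t>
The recurrence can be sketched in C with two row buffers, as shown below.
This is an illustration of the recurrence as written above, not the in-place
loop of silk_NLSF2A_find_poly(); the name find_poly_sketch(), the offset
parameter, and the MAX_D2 bound are assumptions made for this example.
</t>
<figure>
<artwork><![CDATA[
```c
#include <stdint.h>

#define MAX_D2 8  /* assumed bound: d_LPC/2 for the 16-coefficient case */

/* Hypothetical helper: offset 0 uses the even cosines (P), offset 1 the
 * odd cosines (Q).  out receives coefficients 0..d2 of the last row. */
static void find_poly_sketch(int32_t *out, const int32_t *c_Q17, int d2,
                             int offset)
{
    int32_t prev[MAX_D2 + 2] = {0};
    int32_t cur[MAX_D2 + 2] = {0};
    prev[0] = 1 << 16;
    prev[1] = -c_Q17[offset];
    for (int k = 1; k < d2; k++) {
        /* Symmetry: entry k+1 of row k-1 equals entry k-1 of that row. */
        prev[k + 1] = prev[k - 1];
        for (int j = 0; j <= k + 1; j++) {
            int32_t pj1 = (j >= 1) ? prev[j - 1] : 0;  /* zero for j-1 < 0 */
            int32_t pj2 = (j >= 2) ? prev[j - 2] : 0;  /* zero for j-2 < 0 */
            /* 64-bit product; arithmetic right shift assumed. */
            int64_t prod = (int64_t)c_Q17[2 * k + offset] * pj1;
            cur[j] = prev[j] + pj2 - (int32_t)((prod + 32768) >> 16);
        }
        for (int j = 0; j <= k + 1; j++)
            prev[j] = cur[j];
    }
    for (int j = 0; j <= d2; j++)
        out[j] = prev[j];
}
```
]]></artwork>
</figure>
<t>
For example, with two root pairs whose Q17 cosines are both 65536 (0.5), the
product is (1 - z**-1 + z**-2) squared, and the sketch returns the Q16
coefficients for 1, -2, and 3.
</t>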
| 3352 <t> |
| 3353 silk_NLSF2A() uses the values from the last row of this recurrence to |
| 3354 reconstruct a 32-bit version of the LPC filter (without the leading 1.0 |
| 3355 coefficient), a32_Q17[k], 0 <= k < d2: |
| 3356 <figure align="center"> |
| 3357 <artwork align="center"><![CDATA[ |
| 3358 a32_Q17[k] = -(q_Q16[d2-1][k+1] - q_Q16[d2-1][k]) |
| 3359 - (p_Q16[d2-1][k+1] + p_Q16[d2-1][k]) , |
| 3360 |
| 3361 a32_Q17[d_LPC-k-1] = (q_Q16[d2-1][k+1] - q_Q16[d2-1][k]) |
| 3362 - (p_Q16[d2-1][k+1] + p_Q16[d2-1][k]) . |
| 3363 ]]></artwork> |
| 3364 </figure> |
| 3365 The sum and difference of two terms from each of the p_Q16 and q_Q16 |
| 3366 coefficient lists reflect the (1 + z**-1) and |
| 3367 (1 - z**-1) factors of P and Q, respectively. |
| 3368 The promotion of the expression from Q16 to Q17 implicitly scales the result |
| 3369 by 1/2. |
| 3370 </t> |
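<t>
A short C sketch of this combination step follows.
The function name is hypothetical; p_Q16 and q_Q16 are assumed to hold
coefficients 0 through d_LPC/2 of the last row of the recurrence.
</t>
<figure>
<artwork><![CDATA[
```c
#include <stdint.h>

/* Hypothetical helper: combines the last rows of P and Q into a32_Q17. */
static void nlsf_poly_to_lpc_sketch(int32_t *a32_Q17, const int32_t *p_Q16,
                                    const int32_t *q_Q16, int d_LPC)
{
    int d2 = d_LPC / 2;
    for (int k = 0; k < d2; k++) {
        int32_t ptmp = p_Q16[k + 1] + p_Q16[k];  /* (1 + z**-1) factor */
        int32_t qtmp = q_Q16[k + 1] - q_Q16[k];  /* (1 - z**-1) factor */
        a32_Q17[k] = -qtmp - ptmp;
        a32_Q17[d_LPC - k - 1] = qtmp - ptmp;
    }
}
```
]]></artwork>
</figure>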
| 3371 </section> |
| 3372 |
| 3373 <section anchor="silk_lpc_range_limit" |
| 3374 title="Limiting the Range of the LPC Coefficients"> |
| 3375 <t> |
| 3376 The a32_Q17[] coefficients are too large to fit in a 16-bit value, which |
| 3377 significantly increases the cost of applying this filter in fixed-point |
| 3378 decoders. |
| 3379 Reducing them to Q12 precision doesn't incur any significant quality loss, |
| 3380 but still does not guarantee they will fit. |
| 3381 silk_NLSF2A() applies up to 10 rounds of bandwidth expansion to limit |
| 3382 the dynamic range of these coefficients. |
| 3383 Even floating-point decoders SHOULD perform these steps, to avoid mismatch. |
| 3384 </t> |
| 3385 <t> |
| 3386 For each round, the process first finds the index k such that abs(a32_Q17[k]) |
| 3387 is largest, breaking ties by choosing the lowest value of k; call this largest magnitude maxabs_Q17. |
| 3388 Then, it computes the corresponding Q12 precision value, maxabs_Q12, subject to |
| 3389 an upper bound to avoid overflow in subsequent computations: |
| 3390 <figure align="center"> |
| 3391 <artwork align="center"><![CDATA[ |
| 3392 maxabs_Q12 = min((maxabs_Q17 + 16) >> 5, 163838) . |
| 3393 ]]></artwork> |
| 3394 </figure> |
| 3395 If this is larger than 32767, the procedure derives the chirp factor, |
| 3396 sc_Q16[0], to use in the bandwidth expansion as |
| 3397 <figure align="center"> |
| 3398 <artwork align="center"><![CDATA[ |
| 3399 (maxabs_Q12 - 32767) << 14 |
| 3400 sc_Q16[0] = 65470 - -------------------------- , |
| 3401 (maxabs_Q12 * (k+1)) >> 2 |
| 3402 ]]></artwork> |
| 3403 </figure> |
| 3404 where the division here is integer division. |
| 3405 This is an approximation of the chirp factor needed to reduce the target |
| 3406 coefficient to 32767, though it is both less than 0.999 and, for |
| 3407 k > 0 when maxabs_Q12 is much greater than 32767, still slightly |
| 3408 too large. |
| 3409 The upper bound on maxabs_Q12, 163838, was chosen because it is equal to |
| 3410 ((2**31 - 1) >> 14) + 32767, i.e., the |
| 3411 largest value of maxabs_Q12 that would not overflow the numerator in the |
| 3412 equation above when stored in a signed 32-bit integer. |
| 3413 </t> |
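<t>
The derivation of the chirp factor can be sketched in C as follows.
The function name and its return convention (1 when expansion is needed, 0
otherwise) are assumptions made for this example; maxabs_Q17 is taken to be
the largest value of abs(a32_Q17[k]) and k its index.
</t>
<figure>
<artwork><![CDATA[
```c
#include <stdint.h>

/* Hypothetical helper: writes sc_Q16[0] to *sc_Q16 and returns 1 when
 * bandwidth expansion is needed; returns 0 when maxabs_Q12 already fits. */
static int chirp_factor_sketch(int32_t maxabs_Q17, int k, int32_t *sc_Q16)
{
    /* Round to Q12 in 64 bits so that adding 16 cannot overflow. */
    int64_t q12 = ((int64_t)maxabs_Q17 + 16) >> 5;
    int32_t maxabs_Q12 = (q12 > 163838) ? 163838 : (int32_t)q12;
    if (maxabs_Q12 <= 32767)
        return 0;
    /* The bound above keeps this numerator within a signed 32-bit value. */
    int32_t num = (maxabs_Q12 - 32767) * 16384;
    int32_t den = (maxabs_Q12 * (k + 1)) >> 2;
    *sc_Q16 = 65470 - num / den;  /* integer division */
    return 1;
}
```
]]></artwork>
</figure>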
| 3414 <t> |
| 3415 silk_bwexpander_32() (bwexpander_32.c) performs the bandwidth expansion (again, |
| 3416 only when maxabs_Q12 is greater than 32767) using the following recurrence: |
| 3417 <figure align="center"> |
| 3418 <artwork align="center"><![CDATA[ |
| 3419 a32_Q17[k] = (a32_Q17[k]*sc_Q16[k]) >> 16 |
| 3420 |
| 3421 sc_Q16[k+1] = (sc_Q16[0]*sc_Q16[k] + 32768) >> 16 |
| 3422 ]]></artwork> |
| 3423 </figure> |
| 3424 The first multiply may require up to 48 bits of precision in the result to |
| 3425 avoid overflow. |
| 3426 The second multiply must be unsigned to avoid overflow with only 32 bits of |
| 3427 precision. |
| 3428 The reference implementation uses a slightly more complex formulation that |
| 3429 avoids the 32-bit overflow using signed multiplication, but is otherwise |
| 3430 equivalent. |
| 3431 </t> |
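<t>
The recurrence above might be written as in the following sketch, which uses a
64-bit product for the first multiply and an unsigned 32-bit product for the
second.
This is an illustration under stated assumptions, not the formulation used in
bwexpander_32.c: the name bwexpand_Q17() is hypothetical, sc0_Q16 is assumed to
lie between 0 and 65535, and right shifts of negative values are assumed to be
arithmetic.
<figure align="left">
<artwork align="left"><![CDATA[
```c
#include <stdint.h>

/* Hypothetical sketch of one round of bandwidth expansion.
 * Assumes 0 <= sc0_Q16 < 65536 and an arithmetic right shift. */
void bwexpand_Q17(int32_t *a32_Q17, int d_LPC, int32_t sc0_Q16)
{
    uint32_t sc_Q16 = (uint32_t)sc0_Q16;
    for (int k = 0; k < d_LPC; k++) {
        /* First multiply: may need up to 48 bits, so use 64. */
        a32_Q17[k] =
            (int32_t)(((int64_t)a32_Q17[k] * (int64_t)sc_Q16) >> 16);
        /* Second multiply: both factors are below 65536, so the
         * product plus 32768 fits in an unsigned 32-bit integer. */
        sc_Q16 = ((uint32_t)sc0_Q16 * sc_Q16 + 32768u) >> 16;
    }
}
```
]]></artwork>
</figure>
</t>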
| 3432 <t> |
| 3433 After 10 rounds of bandwidth expansion are performed, the coefficients are |
| 3434 simply saturated to 16 bits: |
| 3435 <figure align="center"> |
| 3436 <artwork align="center"><![CDATA[ |
| 3437 a32_Q17[k] = clamp(-32768, (a32_Q17[k] + 16) >> 5, 32767) << 5 . |
| 3438 ]]></artwork> |
| 3439 </figure> |
| 3440 Because this performs the actual saturation in the Q12 domain, but converts the |
| 3441 coefficients back to the Q17 domain for the purposes of prediction gain |
| 3442 limiting, this step must be performed after the 10th round of bandwidth |
| 3443 expansion, regardless of whether or not the Q12 version of any coefficient |
| 3444 still overflows a 16-bit integer. |
| 3445 This saturation is not performed if maxabs_Q12 drops to 32767 or less prior to |
| 3446 the 10th round. |
| 3447 </t> |
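<t>
A minimal sketch of this saturation step is shown below.
The function name saturate_Q17() is an assumption of this example, and the
final conversion back to Q17 is written as a multiplication by 32 so that it
stays well defined in C for negative values.
<figure align="left">
<artwork align="left"><![CDATA[
```c
#include <stdint.h>

/* Hypothetical sketch: saturate in the Q12 domain, store back as Q17.
 * Assumes an arithmetic right shift for negative values. */
void saturate_Q17(int32_t *a32_Q17, int d_LPC)
{
    for (int k = 0; k < d_LPC; k++) {
        int64_t q12 = ((int64_t)a32_Q17[k] + 16) >> 5;
        if (q12 < -32768)
            q12 = -32768;
        if (q12 > 32767)
            q12 = 32767;
        a32_Q17[k] = (int32_t)(q12 * 32);   /* same value as q12 << 5 */
    }
}
```
]]></artwork>
</figure>
</t>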
| 3448 </section> |
| 3449 |
| 3450 <section anchor="silk_lpc_gain_limit" |
| 3451 title="Limiting the Prediction Gain of the LPC Filter"> |
| 3452 <t> |
| 3453 The prediction gain of an LPC synthesis filter is the square root of the output |
| 3454 energy when the filter is excited by a unit-energy impulse. |
| 3455 Even if the Q12 coefficients would fit, the resulting filter may still have a |
| 3456 significant gain (especially for voiced sounds), making the filter unstable. |
| 3457 silk_NLSF2A() applies up to 18 additional rounds of bandwidth expansion to |
| 3458 limit the prediction gain. |
| 3459 Instead of controlling the amount of bandwidth expansion using the prediction |
| 3460 gain itself (which may diverge to infinity for an unstable filter), |
| 3461 silk_NLSF2A() uses silk_LPC_inverse_pred_gain_QA() (LPC_inv_pred_gain.c) to |
| 3462 compute the reflection coefficients associated with the filter. |
| 3463 The filter is stable if and only if the magnitude of these coefficients is |
| 3464 sufficiently less than one. |
| 3465 The reflection coefficients, rc[k], can be computed using a simple Levinson |
| 3466 recurrence, initialized with the LPC coefficients |
| 3467 a[d_LPC-1][n] = a[n], and then updated via |
| 3468 <figure align="center"> |
| 3469 <artwork align="center"><![CDATA[ |
| 3470 rc[k] = -a[k][k] , |
| 3471 |
| 3472             a[k][n] - a[k][k-n-1]*rc[k] |
| 3473 a[k-1][n] = --------------------------- . |
| 3474                               2 |
| 3475                      1 - rc[k] |
| 3476 ]]></artwork> |
| 3477 </figure> |
| 3478 </t> |
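<t>
For illustration only, the floating-point form of this recurrence could be
coded as follows.
The function name lpc_reflection_coeffs(), the limit of 16 coefficients, and
the early return when a reflection coefficient reaches magnitude one are
assumptions of this sketch; the decoder itself uses the fixed-point
approximation described in the following paragraphs.
<figure align="left">
<artwork align="left"><![CDATA[
```c
/* Hypothetical floating-point sketch of the recurrence above.
 * a:  LPC coefficients a[0..d_LPC-1] (d_LPC assumed to be <= 16).
 * rc: receives the reflection coefficients rc[0..d_LPC-1].
 * Returns 1 if every |rc[k]| < 1, or 0 as soon as one is not. */
int lpc_reflection_coeffs(const double *a, int d_LPC, double *rc)
{
    double row[16], next[16];
    if (d_LPC < 1 || d_LPC > 16)
        return 0;
    for (int n = 0; n < d_LPC; n++)
        row[n] = a[n];                  /* a[d_LPC-1][n] = a[n] */
    for (int k = d_LPC - 1; k >= 0; k--) {
        rc[k] = -row[k];
        if (rc[k] >= 1.0 || rc[k] <= -1.0)
            return 0;                   /* recursion would divide by <= 0 */
        double den = 1.0 - rc[k] * rc[k];
        for (int n = 0; n < k; n++)
            next[n] = (row[n] - row[k - n - 1] * rc[k]) / den;
        for (int n = 0; n < k; n++)
            row[n] = next[n];
    }
    return 1;
}
```
]]></artwork>
</figure>
</t>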
| 3479 <t> |
| 3480 However, silk_LPC_inverse_pred_gain_QA() approximates this using fixed-point |
| 3481 arithmetic to guarantee reproducible results across platforms and |
| 3482 implementations. |
| 3483 Since small changes in the coefficients can make a stable filter unstable, it |
| 3484 takes the real Q12 coefficients that will be used during reconstruction as |
| 3485 input. |
| 3486 Thus, let |
| 3487 <figure align="center"> |
| 3488 <artwork align="center"><![CDATA[ |
| 3489 a32_Q12[n] = (a32_Q17[n] + 16) >> 5 |
| 3490 ]]></artwork> |
| 3491 </figure> |
| 3492 be the Q12 version of the LPC coefficients that will eventually be used. |
| 3493 As a simple initial check, the decoder computes the DC response as |
| 3494 <figure align="center"> |
| 3495 <artwork align="center"><![CDATA[ |
| 3496           d_LPC-1 |
| 3497             __ |
| 3498 DC_resp =   \   a32_Q12[n] |
| 3499             /_ |
| 3500             n=0 |
| 3501 ]]></artwork> |
| 3502 </figure> |
| 3503 and if DC_resp > 4096, the filter is unstable. |
| 3504 </t> |
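<t>
The initial DC response check can be sketched as below.
The function name dc_response_exceeds_limit() is hypothetical, and the sketch
assumes the Q17 coefficients have already been range limited so that their Q12
versions fit in 16 bits.
<figure align="left">
<artwork align="left"><![CDATA[
```c
#include <stdint.h>

/* Hypothetical sketch: returns 1 if DC_resp > 4096 (filter unstable).
 * Assumes an arithmetic right shift for negative values. */
int dc_response_exceeds_limit(const int32_t *a32_Q17, int d_LPC)
{
    int32_t dc_resp = 0;
    for (int n = 0; n < d_LPC; n++) {
        /* a32_Q12[n] = (a32_Q17[n] + 16) >> 5 */
        dc_resp += (a32_Q17[n] + 16) >> 5;
    }
    return dc_resp > 4096;
}
```
]]></artwork>
</figure>
</t>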
| 3505 <t> |
| 3506 Increasing the precision of these Q12 coefficients to Q24 for intermediate |
| 3507 computations allows more accurate computation of the reflection coefficients, |
| 3508 so the decoder initializes the recurrence via |
| 3509 <figure align="center"> |
| 3510 <artwork align="center"><![CDATA[ |
| 3511 a32_Q24[d_LPC-1][n] = a32_Q12[n] << 12 . |
| 3512 ]]></artwork> |
| 3513 </figure> |
| 3514 Then for each k from d_LPC-1 down to 0, if |
| 3515 abs(a32_Q24[k][k]) > 16773022, the filter is unstable and the |
| 3516 recurrence stops. |
| 3517 The constant 16773022 here is approximately 0.99975 in Q24. |
| 3518 Otherwise, row k-1 of a32_Q24 is computed from row k as |
| 3519 <figure align="center"> |
| 3520 <artwork align="center"><![CDATA[ |
| 3521 rc_Q31[k] = -a32_Q24[k][k] << 7 , |
| 3522 |
| 3523 div_Q30[k] = (1<<30) - (rc_Q31[k]*rc_Q31[k] >> 32) , |
| 3524 |
| 3525 b1[k] = ilog(div_Q30[k]) , |
| 3526 |
| 3527 b2[k] = b1[k] - 16 , |
| 3528 |
| 3529                    (1<<29) - 1 |
| 3530 inv_Qb2[k] = ----------------------- , |
| 3531              div_Q30[k] >> (b2[k]+1) |
| 3532 |
| 3533 err_Q29[k] = (1<<29) |
| 3534 - ((div_Q30[k]<<(15-b2[k]))*inv_Qb2[k] >> 16) , |
| 3535 |
| 3536 gain_Qb1[k] = ((inv_Qb2[k] << 16) |
| 3537 + (err_Q29[k]*inv_Qb2[k] >> 13)) , |
| 3538 |
| 3539 num_Q24[k-1][n] = a32_Q24[k][n] |
| 3540 - ((a32_Q24[k][k-n-1]*rc_Q31[k] + (1<<30)) >> 31) , |
| 3541 |
| 3542 a32_Q24[k-1][n] = (num_Q24[k-1][n]*gain_Qb1[k] |
| 3543 + (1<<(b1[k]-1))) >> b1[k] , |
| 3544 ]]></artwork> |
| 3545 </figure> |
| 3546 where 0 <= n < k. |
| 3547 Here, rc_Q31[k] are the reflection coefficients. |
| 3548 div_Q30[k] is the denominator for each iteration, and gain_Qb1[k] is its |
| 3549 multiplicative inverse (with b1[k] fractional bits, where b1[k] ranges from |
| 3550 20 to 31). |
| 3551 inv_Qb2[k], which ranges from 16384 to 32767, is a low-precision version of |
| 3552 that inverse (with b2[k] fractional bits). |
| 3553 err_Q29[k] is the residual error, ranging from -32763 to 32392, which is used |
| 3554 to improve the accuracy. |
| 3555 The values num_Q24[k-1][n] for each n are the numerators for the next row of |
| 3556 coefficients in the recursion, and a32_Q24[k-1][n] is the final version of |
| 3557 that row. |
| 3558 Every multiply in this procedure except the one used to compute gain_Qb1[k] |
| 3559 requires more than 32 bits of precision, but otherwise all intermediate |
| 3560 results fit in 32 bits or less. |
| 3561 In practice, because each row only depends on the next one, an implementation |
| 3562 does not need to store them all. |
| 3563 </t> |
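<t>
The fixed-point recurrence above can be transcribed almost line for line, as in
the following non-normative sketch.
It is not the code from LPC_inv_pred_gain.c: the names ilog() and
lpc_is_stable_Q12() are chosen for this example, ilog() is assumed to return
the number of bits needed to store its argument (0 for an argument of 0), all
intermediates are held in 64-bit integers for simplicity, and right shifts of
negative values are assumed to be arithmetic.
The check that stops when a row value leaves the signed 32-bit range is a
safeguard added by this sketch to keep the C arithmetic well defined; the text
above states that such values do not occur.
<figure align="left">
<artwork align="left"><![CDATA[
```c
#include <stdint.h>

/* Assumed definition: bits needed to store v, with ilog(0) = 0. */
int ilog(uint32_t v)
{
    int n = 0;
    while (v != 0) {
        n++;
        v >>= 1;
    }
    return n;
}

/* Hypothetical sketch: returns 1 if the Q12 filter passes the
 * reflection coefficient test, 0 otherwise. d_LPC must be 1..16. */
int lpc_is_stable_Q12(const int32_t *a32_Q12, int d_LPC)
{
    int64_t row[16], next[16];
    if (d_LPC < 1 || d_LPC > 16)
        return 0;
    for (int n = 0; n < d_LPC; n++)
        row[n] = (int64_t)a32_Q12[n] * 4096;        /* Q12 -> Q24 */
    for (int k = d_LPC - 1; k >= 0; k--) {
        if (row[k] > 16773022 || row[k] < -16773022)
            return 0;                               /* unstable */
        int64_t rc_Q31 = -row[k] * 128;             /* << 7 */
        int64_t div_Q30 = (INT64_C(1) << 30)
                        - ((rc_Q31 * rc_Q31) >> 32);
        int b1 = ilog((uint32_t)div_Q30);
        int b2 = b1 - 16;
        int64_t inv_Qb2 = ((INT64_C(1) << 29) - 1)
                        / (div_Q30 >> (b2 + 1));
        int64_t err_Q29 = (INT64_C(1) << 29)
            - (((div_Q30 << (15 - b2)) * inv_Qb2) >> 16);
        int64_t gain_Qb1 = inv_Qb2 * 65536
                         + ((err_Q29 * inv_Qb2) >> 13);
        for (int n = 0; n < k; n++) {
            int64_t num_Q24 = row[n]
                - ((row[k - n - 1] * rc_Q31 + (INT64_C(1) << 30)) >> 31);
            next[n] = (num_Q24 * gain_Qb1
                       + (INT64_C(1) << (b1 - 1))) >> b1;
            /* Sketch-only guard, not part of the text above. */
            if (next[n] > INT32_MAX || next[n] < INT32_MIN)
                return 0;
        }
        for (int n = 0; n < k; n++)
            row[n] = next[n];
    }
    return 1;
}
```
]]></artwork>
</figure>
Only one row is stored at a time, which reflects the remark above that an
implementation does not need to keep every row.
</t>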
| 3564 <t> |
| 3565 If abs(a32_Q24[k][k]) <= 16773022 for |
| 3566 0 <= k < d_LPC, then the filter is considered stable. |
| 3567 However, the problem of determining stability is ill-conditioned when the |
| 3568 filter contains several reflection coefficients whose magnitude is very close |
| 3569 to one. |
| 3570 This fixed-point algorithm is not mathematically guaranteed to correctly |
| 3571 classify filters as stable or unstable in this case, though it does very well |
| 3572 in practice. |
| 3573 </t> |
| 3574 <t> |
| 3575 On round i, 1 <= i <= 18, if the filter passes these |
| 3576 stability checks, then this procedure stops, and the final LPC coefficients to |
| 3577 use for reconstruction in <xref target="silk_lpc_synthesis"/> are |
| 3578 <figure align="center"> |
| 3579 <artwork align="center"><![CDATA[ |
| 3580 a_Q12[k] = (a32_Q17[k] + 16) >> 5 . |
| 3581 ]]></artwork> |
| 3582 </figure> |
| 3583 Otherwise, a round of bandwidth expansion is applied using the same procedure |
| 3584 as in <xref target="silk_lpc_range_limit"/>, with |
| 3585 <figure align="center"> |
| 3586 <artwork align="center"><![CDATA[ |
| 3587 sc_Q16[0] = 65536 - (2<<i) . |
| 3588 ]]></artwork> |
| 3589 </figure> |
| 3590 During the 15th round, sc_Q16[0] becomes 0 in the above equation, so a_Q12[k] |
| 3591 is set to 0 for all k, guaranteeing a stable filter. |
| 3592 </t> |
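<t>
As a small worked example of the equation above, the chirp factor for round i
can be computed as follows.
The function name gain_limit_chirp_Q16() is an assumption of this sketch, and i
is assumed to be between 1 and 15.
<figure align="left">
<artwork align="left"><![CDATA[
```c
#include <stdint.h>

/* Hypothetical helper: sc_Q16[0] for gain-limiting round i.
 * Assumes 1 <= i <= 15. */
int32_t gain_limit_chirp_Q16(int i)
{
    return 65536 - (2 << i);
}
```
]]></artwork>
</figure>
Round 1 gives 65532, round 2 gives 65528, and round 15 gives 0, which is the
case that zeroes every coefficient.
</t>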
| 3593 </section> |
| 3594 |
| 3595 </section> |
| 3596 |
| 3597 <section anchor="silk_ltp_params" toc="include" |
| 3598 title="Long-Term Prediction (LTP) Parameters"> |
| 3599 <t> |
| 3600 After the normalized LSF indices and, for 20 ms frames, the LSF |
| 3601 interpolation index, voiced frames (see <xref target="silk_frame_type"/>) |
| 3602 include additional LTP parameters. |
| 3603 There is one primary lag index for each SILK frame, but this is refined to |
| 3604 produce a separate lag index per subframe using a vector quantizer. |
| 3605 Each subframe also gets its own prediction gain coefficient. |
| 3606 </t> |
| 3607 |
| 3608 <section anchor="silk_ltp_lags" title="Pitch Lags"> |
| 3609 <t> |
| 3610 The primary lag index is coded either relative to the primary lag of the prior |
| 3611 frame in the same channel, or as an absolute index. |
| 3612 Absolute coding is used if and only if |
| 3613 <list style="symbols"> |
| 3614 <t> |
| 3615 This is the first SILK frame of its type (LBRR or regular) for this channel in |
| 3616 the current Opus frame, |
| 3617 </t> |
| 3618 <t> |
| 3619 The previous SILK frame of the same type (LBRR or regular) for this channel in |
| 3620 the same Opus frame was not coded, or |
| 3621 </t> |
| 3622 <t> |
| 3623 That previous SILK frame was coded, but was not voiced (see |
| 3624 <xref target="silk_frame_type"/>). |
| 3625 </t> |
| 3626 </list> |
| 3627 </t> |
| 3628 |
| 3629 <t> |
| 3630 With absolute coding, the primary pitch lag may range from 2 ms |
| 3631 (inclusive) up to 18 ms (exclusive), corresponding to pitches from |
| 3632 500 Hz down to 55.6 Hz, respectively. |
| 3633 It consists of a high part and a low part, where the decoder reads the high |
| 3634 part using the 32-entry codebook in <xref target="silk_abs_pitch_high_pdf"/> |
| 3635 and the low part using the codebook corresponding to the current audio |
| 3636 bandwidth from <xref target="silk_abs_pitch_low_pdf"/>. |
| 3637 The final primary pitch lag is then |
| 3638 <figure align="center"> |
| 3639 <artwork align="center"><![CDATA[ |
| 3640 lag = lag_high*lag_scale + lag_low + lag_min |
| 3641 ]]></artwork> |
| 3642 </figure> |
| 3643 where lag_high is the high part, lag_low is the low part, and lag_scale |
| 3644 and lag_min are the values from the "Scale" and "Minimum Lag" columns of |
| 3645 <xref target="silk_abs_pitch_low_pdf"/>, respectively. |
| 3646 </t> |
| 3647 |
| 3648 <texttable anchor="silk_abs_pitch_high_pdf" |
| 3649 title="PDF for High Part of Primary Pitch Lag"> |
| 3650 <ttcol align="left">PDF</ttcol> |
| 3651 <c>{3, 3, 6, 11, 21, 30, 32, 19, |
| 3652 11, 10, 12, 13, 13, 12, 11, 9, |
| 3653 8, 7, 6, 4, 2, 2, 2, 1, |
| 3654 1, 1, 1, 1, 1, 1, 1, 1}/256</c> |
| 3655 </texttable> |
| 3656 |
| 3657 <texttable anchor="silk_abs_pitch_low_pdf" |
| 3658 title="PDF for Low Part of Primary Pitch Lag"> |
| 3659 <ttcol>Audio Bandwidth</ttcol> |
| 3660 <ttcol>PDF</ttcol> |
| 3661 <ttcol>Scale</ttcol> |
| 3662 <ttcol>Minimum Lag</ttcol> |
| 3663 <ttcol>Maximum Lag</ttcol> |
| 3664 <c>NB</c> <c>{64, 64, 64, 64}/256</c> <c>4</c> <c>16</c> <c>144</c> |
| 3665 <c>MB</c> <c>{43, 42, 43, 43, 42, 43}/256</c> <c>6</c> <c>24</c> <c>216</c> |
| 3666 <c>WB</c> <c>{32, 32, 32, 32, 32, 32, 32, 32}/256</c> <c>8</c> <c>32</c> <c>288</c> |
| 3667 </texttable> |
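<t>
The absolute lag computation can be illustrated with the short sketch below,
which copies the "Scale", "Minimum Lag", and "Maximum Lag" columns from the
table above.
The type and function names are hypothetical, and the entropy decoding of
lag_high and lag_low is assumed to have happened already.
<figure align="left">
<artwork align="left"><![CDATA[
```c
/* Hypothetical names; the values come from the table above. */
typedef enum { SILK_NB = 0, SILK_MB = 1, SILK_WB = 2 } silk_bandwidth;

typedef struct {
    int scale;
    int min_lag;
    int max_lag;
} lag_params;

static const lag_params LAG_PARAMS[3] = {
    {4, 16, 144},   /* NB */
    {6, 24, 216},   /* MB */
    {8, 32, 288}    /* WB */
};

/* lag = lag_high*lag_scale + lag_low + lag_min */
int absolute_lag(silk_bandwidth bw, int lag_high, int lag_low)
{
    const lag_params *p = &LAG_PARAMS[bw];
    return lag_high * p->scale + lag_low + p->min_lag;
}
```
]]></artwork>
</figure>
With the largest codebook indices, NB yields a lag of 143 samples, just below
the exclusive 18 ms limit of 144 samples at 8 kHz.
</t>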
| 3668 |
| 3669 <t> |
| 3670 All frames that do not use absolute coding for the primary lag index use |
| 3671 relative coding instead. |
| 3672 The decoder reads a single delta value using the 21-entry PDF in |
| 3673 <xref target="silk_rel_pitch_pdf"/>. |
| 3674 If the resulting value is zero, it falls back to the absolute coding procedure |
| 3675 from the prior paragraph. |
| 3676 Otherwise, the final primary pitch lag is then |
| 3677 <figure align="center"> |
| 3678 <artwork align="center"><![CDATA[ |
| 3679 lag = previous_lag + (delta_lag_index - 9) |
| 3680 ]]></artwork> |
| 3681 </figure> |
| 3682 where previous_lag is the primary pitch lag from the most recent frame in the |
| 3683 same channel and delta_lag_index is the value just decoded. |
| 3684 This allows a per-frame change in the pitch lag of -8 to +11 samples. |
| 3685 The decoder does no clamping at this point, so this value can fall outside the |
| 3686 range of 2 ms to 18 ms, and the decoder must use this unclamped |
| 3687 value when using relative coding in the next SILK frame (if any). |
| 3688 However, because an Opus frame can use relative coding for at most two |
| 3689 consecutive SILK frames, integer overflow should not be an issue. |
| 3690 </t> |
| 3691 |
| 3692 <texttable anchor="silk_rel_pitch_pdf" |
| 3693 title="PDF for Primary Pitch Lag Change"> |
| 3694 <ttcol align="left">PDF</ttcol> |
| 3695 <c>{46, 2, 2, 3, 4, 6, 10, 15, |
| 3696 26, 38, 30, 22, 15, 10, 7, 6, |
| 3697 4, 4, 2, 2, 2}/256</c> |
| 3698 </texttable> |
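<t>
Relative coding might be handled as in the following sketch.
The function name relative_lag() and its return convention are assumptions of
this example: a return value of 0 signals that the decoded delta was zero and
that the caller should fall back to the absolute coding procedure.
<figure align="left">
<artwork align="left"><![CDATA[
```c
/* Hypothetical sketch of relative lag decoding.
 * delta_lag_index: value decoded with the 21-entry PDF (0..20).
 * Returns 1 and stores the unclamped lag in *lag, or returns 0 when
 * delta_lag_index is 0 and absolute coding must be used instead. */
int relative_lag(int previous_lag, int delta_lag_index, int *lag)
{
    if (delta_lag_index == 0)
        return 0;
    *lag = previous_lag + (delta_lag_index - 9);    /* -8 .. +11 */
    return 1;
}
```
]]></artwork>
</figure>
</t>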
| 3699 |
| 3700 <t> |
| 3701 After the primary pitch lag, a "pitch contour", stored as a single entry from |
| 3702 one of four small VQ codebooks, gives lag offsets for each subframe in the |
| 3703 current SILK frame. |
| 3704 The codebook index is decoded using one of the PDFs in |
| 3705 <xref target="silk_pitch_contour_pdfs"/> depending on the current frame size |
| 3706 and audio bandwidth. |
| 3707 Tables <xref format="counter" target="silk_pitch_contour_cb_nb10ms"/> |
| 3708 through <xref format="counter" target="silk_pitch_contour_cb_mbwb20ms"/> |
| 3709 give the corresponding offsets to apply to the primary pitch lag for each |
| 3710 subframe given the decoded codebook index. |
| 3711 </t> |
| 3712 |
| 3713 <texttable anchor="silk_pitch_contour_pdfs" |
| 3714 title="PDFs for Subframe Pitch Contour"> |
| 3715 <ttcol>Audio Bandwidth</ttcol> |
| 3716 <ttcol>SILK Frame Size</ttcol> |
| 3717 <ttcol align="right">Codebook Size</ttcol> |
| 3718 <ttcol>PDF</ttcol> |
| 3719 <c>NB</c> <c>10 ms</c> <c>3</c> |
| 3720 <c>{143, 50, 63}/256</c> |
| 3721 <c>NB</c> <c>20 ms</c> <c>11</c> |
| 3722 <c>{68, 12, 21, 17, 19, 22, 30, 24, |
| 3723 17, 16, 10}/256</c> |
| 3724 <c>MB or WB</c> <c>10 ms</c> <c>12</c> |
| 3725 <c>{91, 46, 39, 19, 14, 12, 8, 7, |
| 3726 6, 5, 5, 4}/256</c> |
| 3727 <c>MB or WB</c> <c>20 ms</c> <c>34</c> |
| 3728 <c>{33, 22, 18, 16, 15, 14, 14, 13, |
| 3729 13, 10, 9, 9, 8, 6, 6, 6, |
| 3730 5, 4, 4, 4, 3, 3, 3, 2, |
| 3731 2, 2, 2, 2, 2, 2, 1, 1, |
| 3732 1, 1}/256</c> |
| 3733 </texttable> |
| 3734 |
| 3735 <texttable anchor="silk_pitch_contour_cb_nb10ms" |
| 3736 title="Codebook Vectors for Subframe Pitch Contour: NB, 10 ms Frames"> |
| 3737 <ttcol>Index</ttcol> |
| 3738 <ttcol align="right">Subframe Offsets</ttcol> |
| 3739 <c>0</c> <c><spanx style="vbare"> 0 0</spanx></c> |
| 3740 <c>1</c> <c><spanx style="vbare"> 1 0</spanx></c> |
| 3741 <c>2</c> <c><spanx style="vbare"> 0 1</spanx></c> |
| 3742 </texttable> |
| 3743 |
| 3744 <texttable anchor="silk_pitch_contour_cb_nb20ms" |
| 3745 title="Codebook Vectors for Subframe Pitch Contour: NB, 20 ms Frames"> |
| 3746 <ttcol>Index</ttcol> |
| 3747 <ttcol align="right">Subframe Offsets</ttcol> |
| 3748 <c>0</c> <c><spanx style="vbare"> 0 0 0 0</spanx></c> |
| 3749 <c>1</c> <c><spanx style="vbare"> 2 1 0 -1</spanx></c> |
| 3750 <c>2</c> <c><spanx style="vbare">-1 0 1 2</spanx></c> |
| 3751 <c>3</c> <c><spanx style="vbare">-1 0 0 1</spanx></c> |
| 3752 <c>4</c> <c><spanx style="vbare">-1 0 0 0</spanx></c> |
| 3753 <c>5</c> <c><spanx style="vbare"> 0 0 0 1</spanx></c> |
| 3754 <c>6</c> <c><spanx style="vbare"> 0 0 1 1</spanx></c> |
| 3755 <c>7</c> <c><spanx style="vbare"> 1 1 0 0</spanx></c> |
| 3756 <c>8</c> <c><spanx style="vbare"> 1 0 0 0</spanx></c> |
| 3757 <c>9</c> <c><spanx style="vbare"> 0 0 0 -1</spanx></c> |
| 3758 <c>10</c> <c><spanx style="vbare"> 1 0 0 -1</spanx></c> |
| 3759 </texttable> |
| 3760 |
| 3761 <texttable anchor="silk_pitch_contour_cb_mbwb10ms" |
| 3762 title="Codebook Vectors for Subframe Pitch Contour: MB or WB, 10 ms Frames"> |
| 3763 <ttcol>Index</ttcol> |
| 3764 <ttcol align="right">Subframe Offsets</ttcol> |
| 3765 <c>0</c> <c><spanx style="vbare"> 0 0</spanx></c> |
| 3766 <c>1</c> <c><spanx style="vbare"> 0 1</spanx></c> |
| 3767 <c>2</c> <c><spanx style="vbare"> 1 0</spanx></c> |
| 3768 <c>3</c> <c><spanx style="vbare">-1 1</spanx></c> |
| 3769 <c>4</c> <c><spanx style="vbare"> 1 -1</spanx></c> |
| 3770 <c>5</c> <c><spanx style="vbare">-1 2</spanx></c> |
| 3771 <c>6</c> <c><spanx style="vbare"> 2 -1</spanx></c> |
| 3772 <c>7</c> <c><spanx style="vbare">-2 2</spanx></c> |
| 3773 <c>8</c> <c><spanx style="vbare"> 2 -2</spanx></c> |
| 3774 <c>9</c> <c><spanx style="vbare">-2 3</spanx></c> |
| 3775 <c>10</c> <c><spanx style="vbare"> 3 -2</spanx></c> |
| 3776 <c>11</c> <c><spanx style="vbare">-3 3</spanx></c> |
| 3777 </texttable> |
| 3778 |
| 3779 <texttable anchor="silk_pitch_contour_cb_mbwb20ms" |
| 3780 title="Codebook Vectors for Subframe Pitch Contour: MB or WB, 20 ms Frames"> |
| 3781 <ttcol>Index</ttcol> |
| 3782 <ttcol align="right">Subframe Offsets</ttcol> |
| 3783 <c>0</c> <c><spanx style="vbare"> 0 0 0 0</spanx></c> |
| 3784 <c>1</c> <c><spanx style="vbare"> 0 0 1 1</spanx></c> |
| 3785 <c>2</c> <c><spanx style="vbare"> 1 1 0 0</spanx></c> |
| 3786 <c>3</c> <c><spanx style="vbare">-1 0 0 0</spanx></c> |
| 3787 <c>4</c> <c><spanx style="vbare"> 0 0 0 1</spanx></c> |
| 3788 <c>5</c> <c><spanx style="vbare"> 1 0 0 0</spanx></c> |
| 3789 <c>6</c> <c><spanx style="vbare">-1 0 0 1</spanx></c> |
| 3790 <c>7</c> <c><spanx style="vbare"> 0 0 0 -1</spanx></c> |
| 3791 <c>8</c> <c><spanx style="vbare">-1 0 1 2</spanx></c> |
| 3792 <c>9</c> <c><spanx style="vbare"> 1 0 0 -1</spanx></c> |
| 3793 <c>10</c> <c><spanx style="vbare">-2 -1 1 2</spanx></c> |
| 3794 <c>11</c> <c><spanx style="vbare"> 2 1 0 -1</spanx></c> |
| 3795 <c>12</c> <c><spanx style="vbare">-2 0 0 2</spanx></c> |
| 3796 <c>13</c> <c><spanx style="vbare">-2 0 1 3</spanx></c> |
| 3797 <c>14</c> <c><spanx style="vbare"> 2 1 -1 -2</spanx></c> |
| 3798 <c>15</c> <c><spanx style="vbare">-3 -1 1 3</spanx></c> |
| 3799 <c>16</c> <c><spanx style="vbare"> 2 0 0 -2</spanx></c> |
| 3800 <c>17</c> <c><spanx style="vbare"> 3 1 0 -2</spanx></c> |
| 3801 <c>18</c> <c><spanx style="vbare">-3 -1 2 4</spanx></c> |
| 3802 <c>19</c> <c><spanx style="vbare">-4 -1 1 4</spanx></c> |
| 3803 <c>20</c> <c><spanx style="vbare"> 3 1 -1 -3</spanx></c> |
| 3804 <c>21</c> <c><spanx style="vbare">-4 -1 2 5</spanx></c> |
| 3805 <c>22</c> <c><spanx style="vbare"> 4 2 -1 -3</spanx></c> |
| 3806 <c>23</c> <c><spanx style="vbare"> 4 1 -1 -4</spanx></c> |
| 3807 <c>24</c> <c><spanx style="vbare">-5 -1 2 6</spanx></c> |
| 3808 <c>25</c> <c><spanx style="vbare"> 5 2 -1 -4</spanx></c> |
| 3809 <c>26</c> <c><spanx style="vbare">-6 -2 2 6</spanx></c> |
| 3810 <c>27</c> <c><spanx style="vbare">-5 -2 2 5</spanx></c> |
| 3811 <c>28</c> <c><spanx style="vbare"> 6 2 -1 -5</spanx></c> |
| 3812 <c>29</c> <c><spanx style="vbare">-7 -2 3 8</spanx></c> |
| 3813 <c>30</c> <c><spanx style="vbare"> 6 2 -2 -6</spanx></c> |
| 3814 <c>31</c> <c><spanx style="vbare"> 5 2 -2 -5</spanx></c> |
| 3815 <c>32</c> <c><spanx style="vbare"> 8 3 -2 -7</spanx></c> |
| 3816 <c>33</c> <c><spanx style="vbare">-9 -3 3 9</spanx></c> |
| 3817 </texttable> |
| 3818 |
| 3819 <t> |
| 3820 The final pitch lag for each subframe is assembled in silk_decode_pitch() |
| 3821 (decode_pitch.c). |
| 3822 Let lag be the primary pitch lag for the current SILK frame, contour_index be |
| 3823 the index into the VQ codebook, and lag_cb[contour_index][k] be the corresponding |
| 3824 entry of the codebook from the appropriate table given above for the k'th |
| 3825 subframe. |
| 3826 Then the final pitch lag for that subframe is |
| 3827 <figure align="center"> |
| 3828 <artwork align="center"><![CDATA[ |
| 3829 pitch_lags[k] = clamp(lag_min, lag + lag_cb[contour_index][k], |
| 3830 lag_max) |
| 3831 ]]></artwork> |
| 3832 </figure> |
| 3833 where lag_min and lag_max are the values from the "Minimum Lag" and |
| 3834 "Maximum Lag" columns of <xref target="silk_abs_pitch_low_pdf"/>, |
| 3835 respectively. |
| 3836 </t> |
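<t>
To make the clamping concrete, the sketch below applies the NB, 10 ms contour
codebook to a primary lag.
It is an illustration rather than the code in decode_pitch.c: the function name
nb10_subframe_lags() is hypothetical, and the NB limits of 16 and 144 and the
three codebook vectors are copied from the tables above.
<figure align="left">
<artwork align="left"><![CDATA[
```c
/* NB, 10 ms pitch contour codebook from the table above. */
static const int CONTOUR_NB_10MS[3][2] = {
    {0, 0},
    {1, 0},
    {0, 1}
};

/* Hypothetical sketch: per-subframe lags for an NB, 10 ms frame.
 * lag may be unclamped; contour_index must be 0, 1, or 2. */
void nb10_subframe_lags(int lag, int contour_index, int pitch_lags[2])
{
    const int lag_min = 16;     /* "Minimum Lag" for NB */
    const int lag_max = 144;    /* "Maximum Lag" for NB */
    for (int k = 0; k < 2; k++) {
        int v = lag + CONTOUR_NB_10MS[contour_index][k];
        if (v < lag_min)
            v = lag_min;
        if (v > lag_max)
            v = lag_max;
        pitch_lags[k] = v;
    }
}
```
]]></artwork>
</figure>
</t>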
| 3837 |
| 3838 </section> |
| 3839 |
| 3840 <section anchor="silk_ltp_filter" title="LTP Filter Coefficients"> |
| 3841 <t> |
| 3842 SILK uses a separate 5-tap pitch filter for each subframe, selected from one |
| 3843 of three codebooks. |
| 3844 The three codebooks each represent different rate-distortion trade-offs, with |
| 3845 average rates of 1.61 bits/subframe, 3.68 bits/subframe, and |
| 3846 4.85 bits/subframe, respectively. |
| 3847 </t> |
| 3848 |
| 3849 <t> |
| 3850 The importance of the filter coefficients generally depends on two factors: the |
| 3851 periodicity of the signal and relative energy between the current subframe and |
| 3852 the signal from one period earlier. |
| 3853 Greater periodicity and decaying energy both make the filter coefficients more |
| 3854 important, so they should be coded with lower distortion and a higher rate. |
| 3855 These properties are relatively stable over the duration of a single SILK |
| 3856 frame, hence all of the subframes in a SILK frame choose their filter from the |
| 3857 same codebook. |
| 3858 This is signaled with an explicitly coded "periodicity index". |
| 3859 This immediately follows the subframe pitch lags, and is coded using the |
| 3860 3-entry PDF from <xref target="silk_perindex_pdf"/>. |
| 3861 </t> |
| 3862 |
| 3863 <texttable anchor="silk_perindex_pdf" title="Periodicity Index PDF"> |
| 3864 <ttcol>PDF</ttcol> |
| 3865 <c>{77, 80, 99}/256</c> |
| 3866 </texttable> |
| 3867 |
| 3868 <t> |
| 3869 The indices of the filters for each subframe follow. |
| 3870 They are all coded using the PDF from <xref target="silk_ltp_filter_pdfs"/> |
| 3871 corresponding to the periodicity index. |
| 3872 Tables <xref format="counter" target="silk_ltp_filter_coeffs0"/> |
| 3873 through <xref format="counter" target="silk_ltp_filter_coeffs2"/> |
| 3874 contain the corresponding filter taps as signed Q7 integers. |
| 3875 </t> |
| 3876 |
| 3877 <texttable anchor="silk_ltp_filter_pdfs" title="LTP Filter PDFs"> |
| 3878 <ttcol>Periodicity Index</ttcol> |
| 3879 <ttcol align="right">Codebook Size</ttcol> |
| 3880 <ttcol>PDF</ttcol> |
| 3881 <c>0</c> <c>8</c> <c>{185, 15, 13, 13, 9, 9, 6, 6}/256</c> |
| 3882 <c>1</c> <c>16</c> <c>{57, 34, 21, 20, 15, 13, 12, 13, |
| 3883 10, 10, 9, 10, 9, 8, 7, 8}/256</c> |
| 3884 <c>2</c> <c>32</c> <c>{15, 16, 14, 12, 12, 12, 11, 11, |
| 3885 11, 10, 9, 9, 9, 9, 8, 8, |
| 3886 8, 8, 7, 7, 6, 6, 5, 4, |
| 3887 5, 4, 4, 4, 3, 4, 3, 2}/256</c> |
| 3888 </texttable> |
| 3889 |
| 3890 <texttable anchor="silk_ltp_filter_coeffs0" |
| 3891 title="Codebook Vectors for LTP Filter, Periodicity Index 0"> |
| 3892 <ttcol>Index</ttcol> |
| 3893 <ttcol align="right">Filter Taps (Q7)</ttcol> |
| 3894 <c>0</c> |
| 3895 <c><spanx style="vbare"> 4 6 24 7 5</spanx></c> |
| 3896 <c>1</c> |
| 3897 <c><spanx style="vbare"> 0 0 2 0 0</spanx></c> |
| 3898 <c>2</c> |
| 3899 <c><spanx style="vbare"> 12 28 41 13 -4</spanx></c> |
| 3900 <c>3</c> |
| 3901 <c><spanx style="vbare"> -9 15 42 25 14</spanx></c> |
| 3902 <c>4</c> |
| 3903 <c><spanx style="vbare"> 1 -2 62 41 -9</spanx></c> |
| 3904 <c>5</c> |
| 3905 <c><spanx style="vbare">-10 37 65 -4 3</spanx></c> |
| 3906 <c>6</c> |
| 3907 <c><spanx style="vbare"> -6 4 66 7 -8</spanx></c> |
| 3908 <c>7</c> |
| 3909 <c><spanx style="vbare"> 16 14 38 -3 33</spanx></c> |
| 3910 </texttable> |
| 3911 |
| 3912 <texttable anchor="silk_ltp_filter_coeffs1" |
| 3913 title="Codebook Vectors for LTP Filter, Periodicity Index 1"> |
| 3914 <ttcol>Index</ttcol> |
| 3915 <ttcol align="right">Filter Taps (Q7)</ttcol> |
| 3916 |
| 3917 <c>0</c> |
| 3918 <c><spanx style="vbare"> 13 22 39 23 12</spanx></c> |
| 3919 <c>1</c> |
| 3920 <c><spanx style="vbare"> -1 36 64 27 -6</spanx></c> |
| 3921 <c>2</c> |
| 3922 <c><spanx style="vbare"> -7 10 55 43 17</spanx></c> |
| 3923 <c>3</c> |
| 3924 <c><spanx style="vbare"> 1 1 8 1 1</spanx></c> |
| 3925 <c>4</c> |
| 3926 <c><spanx style="vbare"> 6 -11 74 53 -9</spanx></c> |
| 3927 <c>5</c> |
| 3928 <c><spanx style="vbare">-12 55 76 -12 8</spanx></c> |
| 3929 <c>6</c> |
| 3930 <c><spanx style="vbare"> -3 3 93 27 -4</spanx></c> |
| 3931 <c>7</c> |
| 3932 <c><spanx style="vbare"> 26 39 59 3 -8</spanx></c> |
| 3933 <c>8</c> |
| 3934 <c><spanx style="vbare"> 2 0 77 11 9</spanx></c> |
| 3935 <c>9</c> |
| 3936 <c><spanx style="vbare"> -8 22 44 -6 7</spanx></c> |
| 3937 <c>10</c> |
| 3938 <c><spanx style="vbare"> 40 9 26 3 9</spanx></c> |
| 3939 <c>11</c> |
| 3940 <c><spanx style="vbare"> -7 20 101 -7 4</spanx></c> |
| 3941 <c>12</c> |
| 3942 <c><spanx style="vbare"> 3 -8 42 26 0</spanx></c> |
| 3943 <c>13</c> |
| 3944 <c><spanx style="vbare">-15 33 68 2 23</spanx></c> |
| 3945 <c>14</c> |
| 3946 <c><spanx style="vbare"> -2 55 46 -2 15</spanx></c> |
| 3947 <c>15</c> |
| 3948 <c><spanx style="vbare"> 3 -1 21 16 41</spanx></c> |
| 3949 </texttable> |
| 3950 |
| 3951 <texttable anchor="silk_ltp_filter_coeffs2" |
| 3952 title="Codebook Vectors for LTP Filter, Periodicity Index 2"> |
| 3953 <ttcol>Index</ttcol> |
| 3954 <ttcol align="right">Filter Taps (Q7)</ttcol> |
| 3955 <c>0</c> |
| 3956 <c><spanx style="vbare"> -6 27 61 39 5</spanx></c> |
| 3957 <c>1</c> |
| 3958 <c><spanx style="vbare">-11 42 88 4 1</spanx></c> |
| 3959 <c>2</c> |
| 3960 <c><spanx style="vbare"> -2 60 65 6 -4</spanx></c> |
| 3961 <c>3</c> |
| 3962 <c><spanx style="vbare"> -1 -5 73 56 1</spanx></c> |
| 3963 <c>4</c> |
| 3964 <c><spanx style="vbare"> -9 19 94 29 -9</spanx></c> |
| 3965 <c>5</c> |
| 3966 <c><spanx style="vbare"> 0 12 99 6 4</spanx></c> |
| 3967 <c>6</c> |
| 3968 <c><spanx style="vbare"> 8 -19 102 46 -13</spanx></c> |
| 3969 <c>7</c> |
| 3970 <c><spanx style="vbare"> 3 2 13 3 2</spanx></c> |
| 3971 <c>8</c> |
| 3972 <c><spanx style="vbare"> 9 -21 84 72 -18</spanx></c> |
| 3973 <c>9</c> |
| 3974 <c><spanx style="vbare">-11 46 104 -22 8</spanx></c> |
| 3975 <c>10</c> |
| 3976 <c><spanx style="vbare"> 18 38 48 23 0</spanx></c> |
| 3977 <c>11</c> |
| 3978 <c><spanx style="vbare">-16 70 83 -21 11</spanx></c> |
| 3979 <c>12</c> |
| 3980 <c><spanx style="vbare"> 5 -11 117 22 -8</spanx></c> |
| 3981 <c>13</c> |
| 3982 <c><spanx style="vbare"> -6 23 117 -12 3</spanx></c> |
| 3983 <c>14</c> |
| 3984 <c><spanx style="vbare"> 3 -8 95 28 4</spanx></c> |
| 3985 <c>15</c> |
| 3986 <c><spanx style="vbare">-10 15 77 60 -15</spanx></c> |
| 3987 <c>16</c> |
| 3988 <c><spanx style="vbare"> -1 4 124 2 -4</spanx></c> |
| 3989 <c>17</c> |
| 3990 <c><spanx style="vbare"> 3 38 84 24 -25</spanx></c> |
| 3991 <c>18</c> |
| 3992 <c><spanx style="vbare"> 2 13 42 13 31</spanx></c> |
| 3993 <c>19</c> |
| 3994 <c><spanx style="vbare"> 21 -4 56 46 -1</spanx></c> |
| 3995 <c>20</c> |
| 3996 <c><spanx style="vbare"> -1 35 79 -13 19</spanx></c> |
| 3997 <c>21</c> |
| 3998 <c><spanx style="vbare"> -7 65 88 -9 -14</spanx></c> |
| 3999 <c>22</c> |
| 4000 <c><spanx style="vbare"> 20 4 81 49 -29</spanx></c> |
| 4001 <c>23</c> |
| 4002 <c><spanx style="vbare"> 20 0 75 3 -17</spanx></c> |
| 4003 <c>24</c> |
| 4004 <c><spanx style="vbare"> 5 -9 44 92 -8</spanx></c> |
| 4005 <c>25</c> |
| 4006 <c><spanx style="vbare"> 1 -3 22 69 31</spanx></c> |
| 4007 <c>26</c> |
| 4008 <c><spanx style="vbare"> -6 95 41 -12 5</spanx></c> |
| 4009 <c>27</c> |
| 4010 <c><spanx style="vbare"> 39 67 16 -4 1</spanx></c> |
| 4011 <c>28</c> |
| 4012 <c><spanx style="vbare"> 0 -6 120 55 -36</spanx></c> |
| 4013 <c>29</c> |
| 4014 <c><spanx style="vbare">-13 44 122 4 -24</spanx></c> |
| 4015 <c>30</c> |
| 4016 <c><spanx style="vbare"> 81 5 11 3 7</spanx></c> |
| 4017 <c>31</c> |
| 4018 <c><spanx style="vbare"> 2 0 9 10 88</spanx></c> |
| 4019 </texttable> |
| 4020 |
| 4021 </section> |
| 4022 |
| 4023 <section anchor="silk_ltp_scaling" title="LTP Scaling Parameter"> |
| 4024 <t> |
| 4025 An LTP scaling parameter appears after the LTP filter coefficients if and only |
| 4026 if |
| 4027 <list style="symbols"> |
| 4028 <t>This is a voiced frame (see <xref target="silk_frame_type"/>), and</t> |
| 4029 <t>Either |
| 4030 <list style="symbols"> |
| 4031 <t> |
| 4032 This SILK frame corresponds to the first time interval of the |
| 4033 current Opus frame for its type (LBRR or regular), or |
| 4034 </t> |
| 4035 <t> |
| 4036 This is an LBRR frame where the LBRR flags (see |
| 4037 <xref target="silk_lbrr_flags"/>) indicate the previous LBRR frame in the same |
| 4038 channel is not coded. |
| 4039 </t> |
| 4040 </list> |
| 4041 </t> |
| 4042 </list> |
| 4043 This allows the encoder to trade off the prediction gain between |
| 4044 packets against the recovery time after packet loss. |
| 4045 Unlike absolute coding for pitch lags, regular SILK frames that are not at the |
| 4046 start of an Opus frame (i.e., that do not correspond to the first 20 ms |
| 4047 time interval in Opus frames of 40 or 60 ms) do not include this |
| 4048 field, even if the prior frame was not voiced, or (in the case of the side |
| 4049 channel) not even coded. |
| 4050 After an uncoded frame in the side channel, the LTP buffer (see |
| 4051 <xref target="silk_ltp_synthesis"/>) is cleared to zero, and is thus in a |
| 4052 known state. |
| 4053 In contrast, LBRR frames do include this field when the prior frame was not |
| 4054 coded, since the LTP buffer contains the output of the PLC, which is |
| 4055 non-normative. |
| 4056 </t> |
| 4057 <t> |
| 4058 If present, the decoder reads a value using the 3-entry PDF in |
| 4059 <xref target="silk_ltp_scaling_pdf"/>. |
| 4060 The three possible values represent Q14 scale factors of 15565, 12288, and |
| 4061 8192, respectively (corresponding to approximately 0.95, 0.75, and 0.5). |
| 4062 Frames that do not code the scaling parameter use the default factor of 15565 |
| 4063 (approximately 0.95). |
| 4064 </t> |
| 4065 |
| 4066 <texttable anchor="silk_ltp_scaling_pdf" |
| 4067 title="PDF for LTP Scaling Parameter"> |
| 4068 <ttcol align="left">PDF</ttcol> |
| 4069 <c>{128, 64, 64}/256</c> |
| 4070 </texttable> |
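<t>
The mapping from the decoded symbol to the Q14 scale factor can be sketched as
a simple table lookup.
The names LTP_SCALE_Q14 and ltp_scale_Q14() are assumptions of this example,
and is_coded stands for the outcome of the conditions listed at the start of
this section.
<figure align="left">
<artwork align="left"><![CDATA[
```c
#include <stdint.h>

/* Q14 scale factors: approximately 0.95, 0.75, and 0.5. */
static const int16_t LTP_SCALE_Q14[3] = {15565, 12288, 8192};

/* Hypothetical helper. is_coded is nonzero when the frame carries an
 * LTP scaling parameter; index is the decoded value (0, 1, or 2). */
int ltp_scale_Q14(int is_coded, int index)
{
    return is_coded ? LTP_SCALE_Q14[index] : 15565;
}
```
]]></artwork>
</figure>
</t>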
| 4071 |
| 4072 </section> |
| 4073 |
| 4074 </section> |
| 4075 |
| 4076 <section anchor="silk_seed" toc="include" |
| 4077 title="Linear Congruential Generator (LCG) Seed"> |
| 4078 <t> |
| 4079 As described in <xref target="silk_excitation_reconstruction"/>, SILK uses a |
| 4080 linear congruential generator (LCG) to inject pseudorandom noise into the |
| 4081 quantized excitation. |
| 4082 To ensure synchronization of this process between the encoder and decoder, each |
| 4083 SILK frame stores a 2-bit seed after the LTP parameters (if any). |
| 4084 The encoder may consider the choice of seed during quantization, and the |
| 4085 flexibility of this choice lets it reduce distortion, helping to pay for the |
| 4086 bit cost required to signal it. |
| 4087 The decoder reads the seed using the uniform 4-entry PDF in |
| 4088 <xref target="silk_seed_pdf"/>, yielding a value between 0 and 3, inclusive. |
| 4089 </t> |
| 4090 |
| 4091 <texttable anchor="silk_seed_pdf" |
| 4092 title="PDF for LCG Seed"> |
| 4093 <ttcol align="left">PDF</ttcol> |
| 4094 <c>{64, 64, 64, 64}/256</c> |
| 4095 </texttable> |
| 4096 |
| 4097 </section> |
| 4098 |
| 4099 <section anchor="silk_excitation" toc="include" title="Excitation"> |
| 4100 <t> |
| 4101 SILK codes the excitation using a modified version of the Pyramid Vector |
| 4102 Quantization (PVQ) codebook <xref target="PVQ"/>. |
| 4103 The PVQ codebook is designed for Laplace-distributed values and consists of all |
| 4104 sums of K signed, unit pulses in a vector of dimension N, where two pulses at |
| 4105 the same position are required to have the same sign. |
| 4106 Thus the codebook includes all integer codevectors y of dimension N that |
| 4107 satisfy |
| 4108 <figure align="center"> |
| 4109 <artwork align="center"><![CDATA[ |
| 4110 N-1 |
| 4111 __ |
| 4112 \ abs(y[j]) = K . |
| 4113 /_ |
| 4114 j=0 |
| 4115 ]]></artwork> |
| 4116 </figure> |
| 4117 Unlike regular PVQ, SILK uses a variable-length, rather than fixed-length, |
| 4118 encoding. |
| 4119 This encoding is better suited to the more Gaussian-like distribution of the |
| 4120 coefficient magnitudes and the non-uniform distribution of their signs (caused |
| 4121 by the quantization offset described below). |
| 4122 SILK also handles large codebooks by coding the least significant bits (LSBs) |
| 4123 of each coefficient directly. |
| 4124 This adds a small coding efficiency loss, but greatly reduces the computation |
| 4125 time and ROM size required for decoding, as implemented in |
| 4126 silk_decode_pulses() (decode_pulses.c). |
| 4127 </t> |
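The codebook constraint above can be checked mechanically. The following is a minimal sketch in C, not part of the reference implementation; the name pvq_is_codevector is hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper (not in the reference code): returns 1 if the integer
   vector y of dimension n satisfies sum(abs(y[j])) == k, and 0 otherwise. */
static int pvq_is_codevector(const int *y, int n, int k)
{
    int sum = 0;
    int j;
    assert(y != NULL && n > 0 && k >= 0);
    for (j = 0; j < n; j++) {
        sum += abs(y[j]);
    }
    return sum == k;
}
```

Because each coefficient carries a single sign, two pulses at the same position always share that sign, so the absolute-value sum is the only condition that needs testing.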
| 4128 |
| 4129 <t> |
| 4130 SILK fixes the dimension of the codebook to N = 16. |
| 4131 The excitation is made up of a number of "shell blocks", each 16 samples in |
| 4132 size. |
| 4133 <xref target="silk_shell_block_table"/> lists the number of shell blocks |
| 4134 required for a SILK frame for each possible audio bandwidth and frame size. |
| 4135 10 ms MB frames nominally contain 120 samples (10 ms at |
| 4136 12 kHz), which is not a multiple of 16. |
| 4137 This is handled by coding 8 shell blocks (128 samples) and discarding the final |
| 4138 8 samples of the last block. |
| 4139 The decoder contains no special case that prevents an encoder from placing |
| 4140 pulses in these samples, and they must be correctly parsed from the bitstream |
| 4141 if present, but they are otherwise ignored. |
| 4142 </t> |
| 4143 |
| 4144 <texttable anchor="silk_shell_block_table" |
| 4145 title="Number of Shell Blocks Per SILK Frame"> |
| 4146 <ttcol>Audio Bandwidth</ttcol> |
| 4147 <ttcol>Frame Size</ttcol> |
| 4148 <ttcol align="right">Number of Shell Blocks</ttcol> |
| 4149 <c>NB</c> <c>10 ms</c> <c>5</c> |
| 4150 <c>MB</c> <c>10 ms</c> <c>8</c> |
| 4151 <c>WB</c> <c>10 ms</c> <c>10</c> |
| 4152 <c>NB</c> <c>20 ms</c> <c>10</c> |
| 4153 <c>MB</c> <c>20 ms</c> <c>15</c> |
| 4154 <c>WB</c> <c>20 ms</c> <c>20</c> |
| 4155 </texttable> |
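The entries in the table above follow from rounding the frame length up to a whole number of 16-sample blocks. A short sketch in C, assuming internal sample rates of 8, 12, and 16 kHz for NB, MB, and WB; the function name silk_shell_block_count is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: number of 16-sample shell blocks needed for a SILK
   frame, given the internal sample rate in kHz (8 for NB, 12 for MB,
   16 for WB) and the frame size in ms (10 or 20). */
static int silk_shell_block_count(int rate_khz, int frame_ms)
{
    int samples;
    assert(rate_khz == 8 || rate_khz == 12 || rate_khz == 16);
    assert(frame_ms == 10 || frame_ms == 20);
    samples = rate_khz * frame_ms;
    /* Round up, so that 10 ms MB frames (120 samples) get 8 blocks. */
    return (samples + 15) / 16;
}
```

Only the 10 ms MB case is not a multiple of 16, which is why it is the only frame type with discarded samples.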
| 4156 |
| 4157 <section anchor="silk_rate_level" title="Rate Level"> |
| 4158 <t> |
| 4159 The first symbol in the excitation is a "rate level", which is an index from 0 |
| 4160 to 8, inclusive, coded using the PDF in <xref target="silk_rate_level_pdfs"/> |
| 4161 corresponding to the signal type of the current frame (from |
| 4162 <xref target="silk_frame_type"/>). |
| 4163 The rate level selects the PDF used to decode the number of pulses in |
| 4164 the individual shell blocks. |
| 4165 It does not directly convey any information about the bitrate or the number of |
| 4166 pulses itself, but merely changes the probability of the symbols in |
| 4167 <xref target="silk_pulse_counts"/>. |
| 4168 Level 0 provides a more efficient encoding at low rates generally, and |
| 4169 level 8 provides a more efficient encoding at high rates generally, |
| 4170 though the most efficient level for a particular SILK frame may depend on the |
| 4171 exact distribution of the coded symbols. |
| 4172 An encoder should, but is not required to, use the most efficient rate level. |
| 4173 </t> |
| 4174 |
| 4175 <texttable anchor="silk_rate_level_pdfs" |
| 4176 title="PDFs for the Rate Level"> |
| 4177 <ttcol>Signal Type</ttcol> |
| 4178 <ttcol>PDF</ttcol> |
| 4179 <c>Inactive or Unvoiced</c> |
| 4180 <c>{15, 51, 12, 46, 45, 13, 33, 27, 14}/256</c> |
| 4181 <c>Voiced</c> |
| 4182 <c>{33, 30, 36, 17, 34, 49, 18, 21, 18}/256</c> |
| 4183 </texttable> |
| 4184 |
| 4185 </section> |
| 4186 |
| 4187 <section anchor="silk_pulse_counts" title="Pulses Per Shell Block"> |
| 4188 <t> |
| 4189 The total number of pulses in each of the shell blocks follows the rate level. |
| 4190 The pulse counts for all of the shell blocks are coded consecutively, before |
| 4191 the content of any of the blocks. |
| 4192 Each block may have anywhere from 0 to 16 pulses, inclusive, coded using the |
| 4193 18-entry PDF in <xref target="silk_pulse_count_pdfs"/> corresponding to the |
| 4194 rate level from <xref target="silk_rate_level"/>. |
| 4195 The special value 17 indicates that this block has one or more additional |
| 4196 LSBs to decode for each coefficient. |
| 4197 If the decoder encounters this value, it decodes another value for the actual |
| 4198 pulse count of the block, but uses the PDF corresponding to the special rate |
| 4199 level 9 instead of the normal rate level. |
| 4200 This process repeats until the decoder reads a value less than 17, and it then |
| 4201 sets the number of extra LSBs used to the number of 17's decoded for that |
| 4202 block. |
| 4203 If it reads the value 17 ten times, then the next iteration uses the special |
| 4204 rate level 10 instead of 9. |
| 4205 The probability of decoding a 17 when using the PDF for rate level 10 is |
| 4206 zero, ensuring that the number of LSBs for a block will not exceed 10. |
| 4207 The cumulative distribution for rate level 10 is just a shifted version of |
| 4208 that for 9 and thus does not require any additional storage. |
| 4209 </t> |
| 4210 |
| 4211 <texttable anchor="silk_pulse_count_pdfs" |
| 4212 title="PDFs for the Pulse Count"> |
| 4213 <ttcol>Rate Level</ttcol> |
| 4214 <ttcol>PDF</ttcol> |
| 4215 <c>0</c> |
| 4216 <c>{131, 74, 25, 8, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}/256</c> |
| 4217 <c>1</c> |
| 4218 <c>{58, 93, 60, 23, 7, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}/256</c> |
| 4219 <c>2</c> |
| 4220 <c>{43, 51, 46, 33, 24, 16, 11, 8, 6, 3, 3, 3, 2, 1, 1, 2, 1, 2}/256</c> |
| 4221 <c>3</c> |
| 4222 <c>{17, 52, 71, 57, 31, 12, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}/256</c> |
| 4223 <c>4</c> |
| 4224 <c>{6, 21, 41, 53, 49, 35, 21, 11, 6, 3, 2, 2, 1, 1, 1, 1, 1, 1}/256</c> |
| 4225 <c>5</c> |
| 4226 <c>{7, 14, 22, 28, 29, 28, 25, 20, 17, 13, 11, 9, 7, 5, 4, 4, 3, 10}/256</c> |
| 4227 <c>6</c> |
| 4228 <c>{2, 5, 14, 29, 42, 46, 41, 31, 19, 11, 6, 3, 2, 1, 1, 1, 1, 1}/256</c> |
| 4229 <c>7</c> |
| 4230 <c>{1, 2, 4, 10, 19, 29, 35, 37, 34, 28, 20, 14, 8, 5, 4, 2, 2, 2}/256</c> |
| 4231 <c>8</c> |
| 4232 <c>{1, 2, 2, 5, 9, 14, 20, 24, 27, 28, 26, 23, 20, 15, 11, 8, 6, 15}/256</c> |
| 4233 <c>9</c> |
| 4234 <c>{1, 1, 1, 6, 27, 58, 56, 39, 25, 14, 10, 6, 3, 3, 2, 1, 1, 2}/256</c> |
| 4235 <c>10</c> |
| 4236 <c>{2, 1, 6, 27, 58, 56, 39, 25, 14, 10, 6, 3, 3, 2, 1, 1, 2, 0}/256</c> |
| 4237 </texttable> |
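The escape mechanism for the value 17 can be sketched as a small loop. The C code below is illustrative only: the range decoder is replaced by a hypothetical callback (pulse_symbol_fn), and the scripted_source type exists purely so the sketch can be exercised without a bitstream.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback standing in for the range decoder: returns the next
   symbol (0..17) decoded with the pulse-count PDF for the given rate level. */
typedef int (*pulse_symbol_fn)(void *ctx, int rate_level);

/* Decodes the pulse count for one shell block and stores the number of
   extra LSBs per coefficient in *lsb_count. */
static int decode_pulse_count(pulse_symbol_fn next, void *ctx,
                              int rate_level, int *lsb_count)
{
    int lsbs = 0;
    int sym;
    assert(next != NULL && lsb_count != NULL);
    assert(rate_level >= 0 && rate_level <= 8);
    sym = next(ctx, rate_level);
    while (sym == 17) {
        lsbs++;
        /* After ten 17s, switch to rate level 10, whose PDF assigns the
           value 17 zero probability, so the loop must terminate. */
        sym = next(ctx, lsbs == 10 ? 10 : 9);
    }
    *lsb_count = lsbs;
    return sym;
}

/* Illustrative symbol source: replays a fixed list of symbols and records
   the rate level requested for the most recent read. */
typedef struct {
    const int *symbols;
    int pos;
    int last_level;
} scripted_source;

static int scripted_next(void *ctx, int rate_level)
{
    scripted_source *src = (scripted_source *)ctx;
    src->last_level = rate_level;
    return src->symbols[src->pos++];
}
```

A real decoder would pass a callback that wraps its entropy decoder and selects the PDF row matching the requested rate level.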
| 4238 |
| 4239 </section> |
| 4240 |
| 4241 <section anchor="silk_pulse_locations" title="Pulse Location Decoding"> |
| 4242 <t> |
| 4243 The locations of the pulses in each shell block follow the pulse counts, |
| 4244 as decoded by silk_shell_decoder() (shell_coder.c). |
| 4245 As with the pulse counts, these locations are coded for all the shell blocks |
| 4246 before any of the remaining information for each block. |
| 4247 Unlike many other codecs, SILK places no restriction on the distribution of |
| 4248 pulses within a shell block. |
| 4249 All of the pulses may be placed in a single location, or each one in a unique |
| 4250 location, or anything in between. |
| 4251 </t> |
| 4252 |
| 4253 <t> |
| 4254 The location of pulses is coded by recursively partitioning each block into |
| 4255 halves, and coding how many pulses fall on the left side of the split. |
| 4256 All remaining pulses must fall on the right side of the split. |
| 4257 The process then recurses into the left half, and after that returns, the |
| 4258 right half (preorder traversal). |
| 4259 The PDF to use is chosen by the size of the current partition (16, 8, 4, or 2) |
| 4260 and the number of pulses in the partition (1 to 16, inclusive). |
| 4261 Tables <xref format="counter" target="silk_shell_code3_pdfs"/> |
| 4262 through <xref format="counter" target="silk_shell_code0_pdfs"/> list the |
| 4263 PDFs used for each partition size and pulse count. |
| 4264 This process skips partitions without any pulses, i.e., where the initial pulse |
| 4265 count from <xref target="silk_pulse_counts"/> was zero, or where the split in |
| 4266 the prior level indicated that all of the pulses fell on the other side. |
| 4267 These partitions have nothing to code, so they require no PDF. |
| 4268 </t> |
| 4269 |
| 4270 <texttable anchor="silk_shell_code3_pdfs" |
| 4271 title="PDFs for Pulse Count Split, 16 Sample Partitions"> |
| 4272 <ttcol>Pulse Count</ttcol> |
| 4273 <ttcol>PDF</ttcol> |
| 4274 <c>1</c> <c>{126, 130}/256</c> |
| 4275 <c>2</c> <c>{56, 142, 58}/256</c> |
| 4276 <c>3</c> <c>{25, 101, 104, 26}/256</c> |
| 4277 <c>4</c> <c>{12, 60, 108, 64, 12}/256</c> |
| 4278 <c>5</c> <c>{7, 35, 84, 87, 37, 6}/256</c> |
| 4279 <c>6</c> <c>{4, 20, 59, 86, 63, 21, 3}/256</c> |
| 4280 <c>7</c> <c>{3, 12, 38, 72, 75, 42, 12, 2}/256</c> |
| 4281 <c>8</c> <c>{2, 8, 25, 54, 73, 59, 27, 7, 1}/256</c> |
| 4282 <c>9</c> <c>{2, 5, 17, 39, 63, 65, 42, 18, 4, 1}/256</c> |
| 4283 <c>10</c> <c>{1, 4, 12, 28, 49, 63, 54, 30, 11, 3, 1}/256</c> |
| 4284 <c>11</c> <c>{1, 4, 8, 20, 37, 55, 57, 41, 22, 8, 2, 1}/256</c> |
| 4285 <c>12</c> <c>{1, 3, 7, 15, 28, 44, 53, 48, 33, 16, 6, 1, 1}/256</c> |
| 4286 <c>13</c> <c>{1, 2, 6, 12, 21, 35, 47, 48, 40, 25, 12, 5, 1, 1}/256</c> |
| 4287 <c>14</c> <c>{1, 1, 4, 10, 17, 27, 37, 47, 43, 33, 21, 9, 4, 1, 1}/256</c> |
| 4288 <c>15</c> <c>{1, 1, 1, 8, 14, 22, 33, 40, 43, 38, 28, 16, 8, 1, 1, 1}/256</c> |
| 4289 <c>16</c> <c>{1, 1, 1, 1, 13, 18, 27, 36, 41, 41, 34, 24, 14, 1, 1, 1, 1}/256</c> |
| 4290 </texttable> |
| 4291 |
| 4292 <texttable anchor="silk_shell_code2_pdfs" |
| 4293 title="PDFs for Pulse Count Split, 8 Sample Partitions"> |
| 4294 <ttcol>Pulse Count</ttcol> |
| 4295 <ttcol>PDF</ttcol> |
| 4296 <c>1</c> <c>{127, 129}/256</c> |
| 4297 <c>2</c> <c>{53, 149, 54}/256</c> |
| 4298 <c>3</c> <c>{22, 105, 106, 23}/256</c> |
| 4299 <c>4</c> <c>{11, 61, 111, 63, 10}/256</c> |
| 4300 <c>5</c> <c>{6, 35, 86, 88, 36, 5}/256</c> |
| 4301 <c>6</c> <c>{4, 20, 59, 87, 62, 21, 3}/256</c> |
| 4302 <c>7</c> <c>{3, 13, 40, 71, 73, 41, 13, 2}/256</c> |
| 4303 <c>8</c> <c>{3, 9, 27, 53, 70, 56, 28, 9, 1}/256</c> |
| 4304 <c>9</c> <c>{3, 8, 19, 37, 57, 61, 44, 20, 6, 1}/256</c> |
| 4305 <c>10</c> <c>{3, 7, 15, 28, 44, 54, 49, 33, 17, 5, 1}/256</c> |
| 4306 <c>11</c> <c>{1, 7, 13, 22, 34, 46, 48, 38, 28, 14, 4, 1}/256</c> |
| 4307 <c>12</c> <c>{1, 1, 11, 22, 27, 35, 42, 47, 33, 25, 10, 1, 1}/256</c> |
| 4308 <c>13</c> <c>{1, 1, 6, 14, 26, 37, 43, 43, 37, 26, 14, 6, 1, 1}/256</c> |
| 4309 <c>14</c> <c>{1, 1, 4, 10, 20, 31, 40, 42, 40, 31, 20, 10, 4, 1, 1}/256</c> |
| 4310 <c>15</c> <c>{1, 1, 3, 8, 16, 26, 35, 38, 38, 35, 26, 16, 8, 3, 1, 1}/256</c> |
| 4311 <c>16</c> <c>{1, 1, 2, 6, 12, 21, 30, 36, 38, 36, 30, 21, 12, 6, 2, 1, 1}/256</c> |
| 4312 </texttable> |
| 4313 |
| 4314 <texttable anchor="silk_shell_code1_pdfs" |
| 4315 title="PDFs for Pulse Count Split, 4 Sample Partitions"> |
| 4316 <ttcol>Pulse Count</ttcol> |
| 4317 <ttcol>PDF</ttcol> |
| 4318 <c>1</c> <c>{127, 129}/256</c> |
| 4319 <c>2</c> <c>{49, 157, 50}/256</c> |
| 4320 <c>3</c> <c>{20, 107, 109, 20}/256</c> |
| 4321 <c>4</c> <c>{11, 60, 113, 62, 10}/256</c> |
| 4322 <c>5</c> <c>{7, 36, 84, 87, 36, 6}/256</c> |
| 4323 <c>6</c> <c>{6, 24, 57, 82, 60, 23, 4}/256</c> |
| 4324 <c>7</c> <c>{5, 18, 39, 64, 68, 42, 16, 4}/256</c> |
| 4325 <c>8</c> <c>{6, 14, 29, 47, 61, 52, 30, 14, 3}/256</c> |
| 4326 <c>9</c> <c>{1, 15, 23, 35, 51, 50, 40, 30, 10, 1}/256</c> |
| 4327 <c>10</c> <c>{1, 1, 21, 32, 42, 52, 46, 41, 18, 1, 1}/256</c> |
| 4328 <c>11</c> <c>{1, 6, 16, 27, 36, 42, 42, 36, 27, 16, 6, 1}/256</c> |
| 4329 <c>12</c> <c>{1, 5, 12, 21, 31, 38, 40, 38, 31, 21, 12, 5, 1}/256</c> |
| 4330 <c>13</c> <c>{1, 3, 9, 17, 26, 34, 38, 38, 34, 26, 17, 9, 3, 1}/256</c> |
| 4331 <c>14</c> <c>{1, 3, 7, 14, 22, 29, 34, 36, 34, 29, 22, 14, 7, 3, 1}/256</c> |
| 4332 <c>15</c> <c>{1, 2, 5, 11, 18, 25, 31, 35, 35, 31, 25, 18, 11, 5, 2, 1}/256</c> |
| 4333 <c>16</c> <c>{1, 1, 4, 9, 15, 21, 28, 32, 34, 32, 28, 21, 15, 9, 4, 1, 1}/256</c> |
| 4334 </texttable> |
| 4335 |
| 4336 <texttable anchor="silk_shell_code0_pdfs" |
| 4337 title="PDFs for Pulse Count Split, 2 Sample Partitions"> |
| 4338 <ttcol>Pulse Count</ttcol> |
| 4339 <ttcol>PDF</ttcol> |
| 4340 <c>1</c> <c>{128, 128}/256</c> |
| 4341 <c>2</c> <c>{42, 172, 42}/256</c> |
| 4342 <c>3</c> <c>{21, 107, 107, 21}/256</c> |
| 4343 <c>4</c> <c>{12, 60, 112, 61, 11}/256</c> |
| 4344 <c>5</c> <c>{8, 34, 86, 86, 35, 7}/256</c> |
| 4345 <c>6</c> <c>{8, 23, 55, 90, 55, 20, 5}/256</c> |
| 4346 <c>7</c> <c>{5, 15, 38, 72, 72, 36, 15, 3}/256</c> |
| 4347 <c>8</c> <c>{6, 12, 27, 52, 77, 47, 20, 10, 5}/256</c> |
| 4348 <c>9</c> <c>{6, 19, 28, 35, 40, 40, 35, 28, 19, 6}/256</c> |
| 4349 <c>10</c> <c>{4, 14, 22, 31, 37, 40, 37, 31, 22, 14, 4}/256</c> |
| 4350 <c>11</c> <c>{3, 10, 18, 26, 33, 38, 38, 33, 26, 18, 10, 3}/256</c> |
| 4351 <c>12</c> <c>{2, 8, 13, 21, 29, 36, 38, 36, 29, 21, 13, 8, 2}/256</c> |
| 4352 <c>13</c> <c>{1, 5, 10, 17, 25, 32, 38, 38, 32, 25, 17, 10, 5, 1}/256</c> |
| 4353 <c>14</c> <c>{1, 4, 7, 13, 21, 29, 35, 36, 35, 29, 21, 13, 7, 4, 1}/256</c> |
| 4354 <c>15</c> <c>{1, 2, 5, 10, 17, 25, 32, 36, 36, 32, 25, 17, 10, 5, 2, 1}/256</c> |
| 4355 <c>16</c> <c>{1, 2, 4, 7, 13, 21, 28, 34, 36, 34, 28, 21, 13, 7, 4, 2, 1}/256</c> |
| 4356 </texttable> |
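The recursive partitioning described in this section can be sketched as follows. This is an illustration in C under stated assumptions, not the code of silk_shell_decoder(): the range decoder is replaced by a hypothetical callback (split_symbol_fn) that returns the number of pulses in the left half, and scripted_splits exists only to drive the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback standing in for the range decoder: returns how many
   of the `pulses` pulses in a partition of `size` samples fall in its left
   half, decoded with the PDF for that partition size and pulse count. */
typedef int (*split_symbol_fn)(void *ctx, int size, int pulses);

/* Recursively distributes `pulses` pulses over out[0..size-1], visiting the
   left half before the right half (preorder traversal). */
static void decode_shell(split_symbol_fn next, void *ctx,
                         int *out, int size, int pulses)
{
    int left;
    int i;
    assert(next != NULL && out != NULL && size >= 1 && pulses >= 0);
    if (pulses == 0) {
        /* Empty partitions are skipped: nothing is read from the stream. */
        for (i = 0; i < size; i++) {
            out[i] = 0;
        }
        return;
    }
    if (size == 1) {
        out[0] = pulses;
        return;
    }
    left = next(ctx, size, pulses);
    assert(left >= 0 && left <= pulses);
    decode_shell(next, ctx, out, size / 2, left);
    decode_shell(next, ctx, out + size / 2, size / 2, pulses - left);
}

/* Illustrative split source: replays a fixed list of left-half counts. */
typedef struct {
    const int *symbols;
    int pos;
} scripted_splits;

static int scripted_split(void *ctx, int size, int pulses)
{
    scripted_splits *src = (scripted_splits *)ctx;
    (void)size;
    (void)pulses;
    return src->symbols[src->pos++];
}
```

For a 16-sample block, decode_shell() is called with size 16 and the pulse count decoded for that block; the number of symbols consumed depends on how many partitions turn out to be empty.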
| 4357 |
| 4358 </section> |
| 4359 |
| 4360 <section anchor="silk_shell_lsb" title="LSB Decoding"> |
| 4361 <t> |
| 4362 After the decoder reads the pulse locations for all blocks, it reads the LSBs |
| 4363 (if any) for each block in turn. |
| 4364 Inside each block, it reads all the LSBs for each coefficient in turn, even |
| 4365 those where no pulses were allocated, before proceeding to the next one. |
| 4366 For 10 ms MB frames, it reads LSBs even for the extra 8 samples in |
| 4367 the last block. |
| 4368 The LSBs are coded from most significant to least significant, and they all use |
| 4369 the PDF in <xref target="silk_shell_lsb_pdf"/>. |
| 4370 </t> |
| 4371 |
| 4372 <texttable anchor="silk_shell_lsb_pdf" title="PDF for Excitation LSBs"> |
| 4373 <ttcol>PDF</ttcol> |
| 4374 <c>{136, 120}/256</c> |
| 4375 </texttable> |
| 4376 |
| 4377 <t> |
| 4378 The number of LSBs read for each coefficient in a block is determined in |
| 4379 <xref target="silk_pulse_counts"/>. |
| 4380 The magnitude of the coefficient is initially equal to the number of pulses |
| 4381 placed at that location in <xref target="silk_pulse_locations"/>. |
| 4382 As each LSB is decoded, the magnitude is doubled, and then the value of the LSB |
| 4383 added to it, to obtain an updated magnitude. |
| 4384 </t> |
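The magnitude update above amounts to shifting in each LSB in turn. A minimal sketch in C, assuming the LSBs for one coefficient have already been decoded into an array, most significant first; the name apply_lsbs is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: applies lsb_count extra bits, given most significant
   first in bits[], to the initial magnitude (the number of pulses placed at
   this location). */
static int apply_lsbs(int pulses, const int *bits, int lsb_count)
{
    int magnitude = pulses;
    int b;
    assert(pulses >= 0 && lsb_count >= 0 && lsb_count <= 10);
    assert(bits != NULL || lsb_count == 0);
    for (b = 0; b < lsb_count; b++) {
        assert(bits[b] == 0 || bits[b] == 1);
        /* Double the magnitude, then add the newly decoded LSB. */
        magnitude = 2 * magnitude + bits[b];
    }
    return magnitude;
}
```

A location with zero pulses can still end up with a non-zero magnitude if any of its LSBs is 1.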
| 4385 </section> |
| 4386 |
| 4387 <section anchor="silk_signs" title="Sign Decoding"> |
| 4388 <t> |
| 4389 After decoding the pulse locations and the LSBs, the decoder knows the |
| 4390 magnitude of each coefficient in the excitation. |
| 4391 It then decodes a sign for all coefficients with a non-zero magnitude, using |
| 4392 one of the PDFs from <xref target="silk_sign_pdfs"/>. |
| 4393 If the value decoded is 0, then the coefficient magnitude is negated. |
| 4394 Otherwise, it remains positive. |
| 4395 </t> |
| 4396 |
| 4397 <t> |
| 4398 The decoder chooses the PDF for the sign based on the signal type and |
| 4399 quantization offset type (from <xref target="silk_frame_type"/>) and the |
| 4400 number of pulses in the block (from <xref target="silk_pulse_counts"/>). |
| 4401 The number of pulses in the block does not take into account any LSBs. |
| 4402 Most PDFs are skewed towards negative signs because of the quantization offset, |
| 4403 but the PDFs for zero pulses are highly skewed towards positive signs. |
| 4404 If a block contains many positive coefficients, it is sometimes beneficial to |
| 4405 code it solely using LSBs (i.e., with zero pulses), since the encoder may be |
| 4406 able to save enough bits on the signs to justify the less efficient |
| 4407 coefficient magnitude encoding. |
| 4408 </t> |
| 4409 |
| 4410 <texttable anchor="silk_sign_pdfs" |
| 4411 title="PDFs for Excitation Signs"> |
| 4412 <ttcol>Signal Type</ttcol> |
| 4413 <ttcol>Quantization Offset Type</ttcol> |
| 4414 <ttcol>Pulse Count</ttcol> |
| 4415 <ttcol>PDF</ttcol> |
| 4416 <c>Inactive</c> <c>Low</c> <c>0</c> <c>{2, 254}/256</c> |
| 4417 <c>Inactive</c> <c>Low</c> <c>1</c> <c>{207, 49}/256</c> |
| 4418 <c>Inactive</c> <c>Low</c> <c>2</c> <c>{189, 67}/256</c> |
| 4419 <c>Inactive</c> <c>Low</c> <c>3</c> <c>{179, 77}/256</c> |
| 4420 <c>Inactive</c> <c>Low</c> <c>4</c> <c>{174, 82}/256</c> |
| 4421 <c>Inactive</c> <c>Low</c> <c>5</c> <c>{163, 93}/256</c> |
| 4422 <c>Inactive</c> <c>Low</c> <c>6 or more</c> <c>{157, 99}/256</c> |
| 4423 <c>Inactive</c> <c>High</c> <c>0</c> <c>{58, 198}/256</c> |
| 4424 <c>Inactive</c> <c>High</c> <c>1</c> <c>{245, 11}/256</c> |
| 4425 <c>Inactive</c> <c>High</c> <c>2</c> <c>{238, 18}/256</c> |
| 4426 <c>Inactive</c> <c>High</c> <c>3</c> <c>{232, 24}/256</c> |
| 4427 <c>Inactive</c> <c>High</c> <c>4</c> <c>{225, 31}/256</c> |
| 4428 <c>Inactive</c> <c>High</c> <c>5</c> <c>{220, 36}/256</c> |
| 4429 <c>Inactive</c> <c>High</c> <c>6 or more</c> <c>{211, 45}/256</c> |
| 4430 <c>Unvoiced</c> <c>Low</c> <c>0</c> <c>{1, 255}/256</c> |
| 4431 <c>Unvoiced</c> <c>Low</c> <c>1</c> <c>{210, 46}/256</c> |
| 4432 <c>Unvoiced</c> <c>Low</c> <c>2</c> <c>{190, 66}/256</c> |
| 4433 <c>Unvoiced</c> <c>Low</c> <c>3</c> <c>{178, 78}/256</c> |
| 4434 <c>Unvoiced</c> <c>Low</c> <c>4</c> <c>{169, 87}/256</c> |
| 4435 <c>Unvoiced</c> <c>Low</c> <c>5</c> <c>{162, 94}/256</c> |
| 4436 <c>Unvoiced</c> <c>Low</c> <c>6 or more</c> <c>{152, 104}/256</c> |
| 4437 <c>Unvoiced</c> <c>High</c> <c>0</c> <c>{48, 208}/256</c> |
| 4438 <c>Unvoiced</c> <c>High</c> <c>1</c> <c>{242, 14}/256</c> |
| 4439 <c>Unvoiced</c> <c>High</c> <c>2</c> <c>{235, 21}/256</c> |
| 4440 <c>Unvoiced</c> <c>High</c> <c>3</c> <c>{224, 32}/256</c> |
| 4441 <c>Unvoiced</c> <c>High</c> <c>4</c> <c>{214, 42}/256</c> |
| 4442 <c>Unvoiced</c> <c>High</c> <c>5</c> <c>{205, 51}/256</c> |
| 4443 <c>Unvoiced</c> <c>High</c> <c>6 or more</c> <c>{190, 66}/256</c> |
| 4444 <c>Voiced</c> <c>Low</c> <c>0</c> <c>{1, 255}/256</c> |
| 4445 <c>Voiced</c> <c>Low</c> <c>1</c> <c>{162, 94}/256</c> |
| 4446 <c>Voiced</c> <c>Low</c> <c>2</c> <c>{152, 104}/256</c> |
| 4447 <c>Voiced</c> <c>Low</c> <c>3</c> <c>{147, 109}/256</c> |
| 4448 <c>Voiced</c> <c>Low</c> <c>4</c> <c>{144, 112}/256</c> |
| 4449 <c>Voiced</c> <c>Low</c> <c>5</c> <c>{141, 115}/256</c> |
| 4450 <c>Voiced</c> <c>Low</c> <c>6 or more</c> <c>{138, 118}/256</c> |
| 4451 <c>Voiced</c> <c>High</c> <c>0</c> <c>{8, 248}/256</c> |
| 4452 <c>Voiced</c> <c>High</c> <c>1</c> <c>{203, 53}/256</c> |
| 4453 <c>Voiced</c> <c>High</c> <c>2</c> <c>{187, 69}/256</c> |
| 4454 <c>Voiced</c> <c>High</c> <c>3</c> <c>{176, 80}/256</c> |
| 4455 <c>Voiced</c> <c>High</c> <c>4</c> <c>{168, 88}/256</c> |
| 4456 <c>Voiced</c> <c>High</c> <c>5</c> <c>{161, 95}/256</c> |
| 4457 <c>Voiced</c> <c>High</c> <c>6 or more</c> <c>{154, 102}/256</c> |
| 4458 </texttable> |
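Selecting a sign PDF amounts to a table lookup keyed on signal type, quantization offset type, and the pulse count capped at 6. The sketch below, in C, stores only the first entry of each PDF from the table above (the probability, out of 256, of decoding a 0); the enum names and function names are hypothetical.

```c
#include <assert.h>

enum { SILK_INACTIVE = 0, SILK_UNVOICED = 1, SILK_VOICED = 2 };
enum { SILK_OFFSET_LOW = 0, SILK_OFFSET_HIGH = 1 };

/* Probability (out of 256) of decoding symbol 0, i.e., a negative sign.
   Rows: {Inactive, Unvoiced, Voiced} x {Low, High}; columns: pulse counts
   0..5, then "6 or more". */
static const unsigned char sign_p_negative[6][7] = {
    {   2, 207, 189, 179, 174, 163, 157 },
    {  58, 245, 238, 232, 225, 220, 211 },
    {   1, 210, 190, 178, 169, 162, 152 },
    {  48, 242, 235, 224, 214, 205, 190 },
    {   1, 162, 152, 147, 144, 141, 138 },
    {   8, 203, 187, 176, 168, 161, 154 }
};

/* Looks up the PDF entry for a block; pulses excludes any LSBs. */
static int sign_prob_negative(int signal_type, int offset_type, int pulses)
{
    int col;
    assert(signal_type >= SILK_INACTIVE && signal_type <= SILK_VOICED);
    assert(offset_type == SILK_OFFSET_LOW || offset_type == SILK_OFFSET_HIGH);
    assert(pulses >= 0);
    col = pulses < 6 ? pulses : 6;
    return sign_p_negative[2 * signal_type + offset_type][col];
}

/* A decoded symbol of 0 negates the magnitude; 1 leaves it positive. */
static int apply_sign(int magnitude, int symbol)
{
    assert(magnitude > 0 && (symbol == 0 || symbol == 1));
    return symbol == 0 ? -magnitude : magnitude;
}
```

The second entry of each PDF is 256 minus the stored value, so it does not need separate storage in this sketch.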
| 4459 |
| 4460 </section> |
| 4461 |
| 4462 <section anchor="silk_excitation_reconstruction" |
| 4463 title="Reconstructing the Excitation"> |
| 4464 |
| 4465 <t> |
| 4466 After the signs have been read, there is enough information to reconstruct the |
| 4467 complete excitation signal. |
| 4468 This requires adding a constant quantization offset to each non-zero sample, |
| 4469 and then pseudorandomly inverting and offsetting every sample. |
| 4470 The constant quantization offset varies depending on the signal type and |
| 4471 quantization offset type (see <xref target="silk_frame_type"/>). |
| 4472 </t> |
| 4473 |
| 4474 <texttable anchor="silk_quantization_offsets" |
| 4475 title="Excitation Quantization Offsets"> |
| 4476 <ttcol align="left">Signal Type</ttcol> |
| 4477 <ttcol align="left">Quantization Offset Type</ttcol> |
| 4478 <ttcol align="right">Quantization Offset (Q23)</ttcol> |
| 4479 <c>Inactive</c> <c>Low</c> <c>25</c> |
| 4480 <c>Inactive</c> <c>High</c> <c>60</c> |
| 4481 <c>Unvoiced</c> <c>Low</c> <c>25</c> |
| 4482 <c>Unvoiced</c> <c>High</c> <c>60</c> |
| 4483 <c>Voiced</c> <c>Low</c> <c>8</c> |
| 4484 <c>Voiced</c> <c>High</c> <c>25</c> |
| 4485 </texttable> |
| 4486 |
| 4487 <t> |
| 4488 Let e_raw[i] be the raw excitation value at position i, with a magnitude |
| 4489 composed of the pulses at that location (see |
| 4490 <xref target="silk_pulse_locations"/>) combined with any additional LSBs (see |
| 4491 <xref target="silk_shell_lsb"/>), and with the corresponding sign decoded in |
| 4492 <xref target="silk_signs"/>. |
| 4493 Additionally, let seed be the current pseudorandom seed, which is initialized |
| 4494 to the value decoded from <xref target="silk_seed"/> for the first sample in |
| 4495 the current SILK frame, and updated for each subsequent sample according to |
| 4496 the procedure below. |
| 4497 Finally, let offset_Q23 be the quantization offset from |
| 4498 <xref target="silk_quantization_offsets"/>. |
| 4499 Then the following procedure produces the final reconstructed excitation value, |
| 4500 e_Q23[i]: |
| 4501 <figure align="center"> |
| 4502 <artwork align="center"><![CDATA[ |
| 4503 e_Q23[i] = (e_raw[i] << 8) - sign(e_raw[i])*20 + offset_Q23; |
| 4504 seed = (196314165*seed + 907633515) & 0xFFFFFFFF; |
| 4505 e_Q23[i] = (seed & 0x80000000) ? -e_Q23[i] : e_Q23[i]; |
| 4506 seed = (seed + e_raw[i]) & 0xFFFFFFFF; |
| 4507 ]]></artwork> |
| 4508 </figure> |
| 4509 When e_raw[i] is zero, sign() returns 0 by the definition in |
| 4510 <xref target="sign"/>, so the adjustment of 20 is not applied. |
| 4511 The final e_Q23[i] value may require more than 16 bits per sample, but will not |
| 4512 require more than 23, including the sign. |
| 4513 </t> |
| 4514 |
| 4515 </section> |
| 4516 |
| 4517 </section> |
| 4518 |
| 4519 <section anchor="silk_frame_reconstruction" toc="include" |
| 4520 title="SILK Frame Reconstruction"> |
| 4521 |
| 4522 <t> |
| 4523 The remainder of the reconstruction process for the frame does not need to be |
| 4524 bit-exact, as small errors should only introduce proportionally small |
| 4525 distortions. |
| 4526 Although the reference implementation only includes a fixed-point version of |
| 4527 the remaining steps, this section describes them in terms of a floating-point |
| 4528 version for simplicity. |
| 4529 This produces a signal with a nominal range of -1.0 to 1.0. |
| 4530 </t> |
| 4531 |
| 4532 <t> |
| 4533 silk_decode_core() (decode_core.c) contains the code for the main |
| 4534 reconstruction process. |
| 4535 It proceeds subframe-by-subframe, since quantization gains, LTP parameters, and |
| 4536 (in 20 ms SILK frames) LPC coefficients can vary from one to the |
| 4537 next. |
| 4538 </t> |
| 4539 |
| 4540 <t> |
| 4541 Let a_Q12[k] be the LPC coefficients for the current subframe. |
| 4542 If this is the first or second subframe of a 20 ms SILK frame and the LSF |
| 4543 interpolation factor, w_Q2 (see <xref target="silk_nlsf_interpolation"/>), is |
| 4544 less than 4, then these correspond to the final LPC coefficients produced by |
| 4545 <xref target="silk_lpc_gain_limit"/> from the interpolated LSF coefficients, |
| 4546 n1_Q15[k] (computed in <xref target="silk_nlsf_interpolation"/>). |
| 4547 Otherwise, they correspond to the final LPC coefficients produced from the |
| 4548 uninterpolated LSF coefficients for the current frame, n2_Q15[k]. |
| 4549 </t> |
| 4550 |
| 4551 <t> |
| 4552 Also, let n be the number of samples in a subframe (40 for NB, 60 for MB, and |
| 4553 80 for WB), s be the index of the current subframe in this SILK frame (0 or 1 |
| 4554 for 10 ms frames, or 0 to 3 for 20 ms frames), and j be the index of |
| 4555 the first sample in the residual corresponding to the current subframe. |
| 4556 </t> |
| 4557 |
| 4558 <section anchor="silk_ltp_synthesis" title="LTP Synthesis"> |
| 4559 <t> |
| 4560 Voiced SILK frames (see <xref target="silk_frame_type"/>) pass the excitation |
| 4561 through an LTP filter using the parameters decoded in |
| 4562 <xref target="silk_ltp_params"/> to produce an LPC residual. |
| 4563 The LTP filter requires LPC residual values from before the current subframe as |
| 4564 input. |
| 4565 However, since the LPC coefficients may have changed, it obtains this residual |
| 4566 by "rewhitening" the corresponding output signal using the LPC coefficients |
| 4567 from the current subframe. |
| 4568 Let out[i] for |
| 4569 (j - pitch_lags[s] - d_LPC - 2) <= i < j |
| 4570 be the fully reconstructed output signal from the last |
| 4571 (pitch_lags[s] + d_LPC + 2) samples of previous subframes |
| 4572 (see <xref target="silk_lpc_synthesis"/>), where pitch_lags[s] is the pitch |
| 4573 lag for the current subframe from <xref target="silk_ltp_lags"/>. |
| 4574 During reconstruction of the first subframe for this channel after either |
| 4575 <list style="symbols"> |
| 4576 <t>An uncoded regular SILK frame (if this is the side channel), or</t> |
| 4577 <t>A decoder reset (see <xref target="decoder-reset"/>),</t> |
| 4578 </list> |
| 4579 out[] is rewhitened into an LPC residual, |
| 4580 res[i], via |
| 4581 <figure align="center"> |
| 4582 <artwork align="center"><![CDATA[ |
| 4583          4.0*LTP_scale_Q14 |
| 4584 res[i] = ----------------- * clamp(-1.0, |
| 4585             gain_Q16[s] |
| 4586 |
| 4587                      d_LPC-1 |
| 4588                        __              a_Q12[k] |
| 4589               out[i] - \  out[i-k-1] * --------, 1.0) . |
| 4590                        /_               4096.0 |
| 4591                        k=0 |
| 4592 ]]></artwork> |
| 4593 </figure> |
| 4594 This requires storage to buffer up to 306 values of out[i] from previous |
| 4595 subframes. |
| 4596 This corresponds to WB with a maximum pitch lag of |
| 4597 18 ms * 16 kHz samples, plus 16 samples for d_LPC, plus 2 |
| 4598 samples for the width of the LTP filter. |
| 4599 </t> |
| 4600 |
| 4601 <t> |
| 4602 Let e_Q23[i] for j <= i < (j + n) be the |
| 4603 excitation for the current subframe, and b_Q7[k] for |
| 4604 0 <= k < 5 be the coefficients of the LTP filter |
| 4605 taken from the codebook entry in one of |
| 4606 Tables <xref format="counter" target="silk_ltp_filter_coeffs0"/> |
| 4607 through <xref format="counter" target="silk_ltp_filter_coeffs2"/> |
| 4608 corresponding to the index decoded for the current subframe in |
| 4609 <xref target="silk_ltp_filter"/>. |
| 4610 Then for i such that j <= i < (j + n), |
| 4611 the LPC residual is |
| 4612 <figure align="center"> |
| 4613 <artwork align="center"><![CDATA[ |
| 4614                      4 |
| 4615          e_Q23[i]    __                                  b_Q7[k] |
| 4616 res[i] = --------- + \  res[i - pitch_lags[s] + 2 - k] * ------- . |
| 4617           2.0**23    /_                                   128.0 |
| 4618                      k=0 |
| 4619 ]]></artwork> |
| 4620 </figure> |
| 4621 </t> |
| 4622 |
| 4623 <t> |
| 4624 For unvoiced frames, the LPC residual for |
| 4625 j <= i < (j + n) is simply a normalized |
| 4626 copy of the excitation signal, i.e., |
| 4627 <figure align="center"> |
| 4628 <artwork align="center"><![CDATA[ |
| 4629          e_Q23[i] |
| 4630 res[i] = --------- |
| 4631           2.0**23 |
| 4632 ]]></artwork> |
| 4633 </figure> |
| 4634 </t> |
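The two residual formulas in this section can be sketched in floating point as follows. This is a simplified illustration in C, not the fixed-point reference code: it assumes res[] already holds the rewhitened history for indices below j, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: LTP synthesis for one voiced subframe.
   res[] must already contain residual history for indices below j,
   reaching back at least pitch_lag + 2 samples. */
static void ltp_synthesis(float *res, const int32_t *e_Q23, const int *b_Q7,
                          int j, int n, int pitch_lag)
{
    int i, k;
    assert(res != NULL && e_Q23 != NULL && b_Q7 != NULL);
    assert(pitch_lag > 2 && j >= pitch_lag + 2 && n > 0);
    for (i = j; i < j + n; i++) {
        float sum = (float)e_Q23[i] / 8388608.0f;  /* divide by 2.0**23 */
        for (k = 0; k < 5; k++) {
            sum += res[i - pitch_lag + 2 - k] * ((float)b_Q7[k] / 128.0f);
        }
        res[i] = sum;
    }
}

/* Hypothetical helper: for unvoiced subframes the residual is just the
   normalized excitation. */
static void unvoiced_residual(float *res, const int32_t *e_Q23, int j, int n)
{
    int i;
    assert(res != NULL && e_Q23 != NULL && j >= 0 && n > 0);
    for (i = j; i < j + n; i++) {
        res[i] = (float)e_Q23[i] / 8388608.0f;
    }
}
```

The 5-tap filter is centered on the pitch lag, which is why the history must extend 2 samples beyond it in each direction.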
| 4635 </section> |
| 4636 |
| 4637 <section anchor="silk_lpc_synthesis" title="LPC Synthesis"> |
| 4638 <t> |
| 4639 LPC synthesis uses the short-term LPC filter to predict the next output |
| 4640 coefficient. |
| 4641 For i such that (j - d_LPC) <= i < j, let |
| 4642 lpc[i] be the result of LPC synthesis from the last d_LPC samples of the |
| 4643 previous subframe, or zeros in the first subframe for this channel after |
| 4644 either |
| 4645 <list style="symbols"> |
| 4646 <t>An uncoded regular SILK frame (if this is the side channel), or</t> |
| 4647 <t>A decoder reset (see <xref target="decoder-reset"/>).</t> |
| 4648 </list> |
| 4649 Then for i such that j <= i < (j + n), the |
| 4650 result of LPC synthesis for the current subframe is |
| 4651 <figure align="center"> |
| 4652 <artwork align="center"><![CDATA[ |
| 4653                               d_LPC-1 |
| 4654          gain_Q16[s]            __              a_Q12[k] |
| 4655 lpc[i] = ----------- * res[i] + \  lpc[i-k-1] * -------- . |
| 4656            65536.0              /_               4096.0 |
| 4657                                 k=0 |
| 4658 ]]></artwork> |
| 4659 </figure> |
| 4660 The decoder saves the final d_LPC values, i.e., lpc[i] such that |
| 4661 (j + n - d_LPC) <= i < (j + n), |
| 4662 to feed into the LPC synthesis of the next subframe. |
| 4663 This requires storage for up to 16 values of lpc[i] (for WB frames). |
| 4664 </t> |
| 4665 |
| 4666 <t> |
| 4667 Then, the signal is clamped into the final nominal range: |
| 4668 <figure align="center"> |
| 4669 <artwork align="center"><![CDATA[ |
| 4670 out[i] = clamp(-1.0, lpc[i], 1.0) . |
| 4671 ]]></artwork> |
| 4672 </figure> |
| 4673 This clamping occurs entirely after the LPC synthesis filter has run. |
| 4674 The decoder saves the unclamped values, lpc[i], to feed into the LPC filter for |
| 4675 the next subframe, but saves the clamped values, out[i], for rewhitening in |
| 4676 voiced frames. |
| 4677 </t> |
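The interaction between the synthesis filter and the final clamp can be sketched as below. This is a floating-point illustration in C with hypothetical function names; it takes the gain of the current subframe as a single Q16 value and keeps the unclamped lpc[] separate from the clamped out[].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamps x into the nominal output range [-1.0, 1.0]. */
static float clamp_unit(float x)
{
    return x < -1.0f ? -1.0f : (x > 1.0f ? 1.0f : x);
}

/* Hypothetical helper: LPC synthesis for one subframe starting at index j.
   lpc[j - d_LPC .. j - 1] must hold the unclamped history (or zeros). */
static void lpc_synthesis(float *lpc, float *out, const float *res,
                          const int *a_Q12, int d_LPC, int32_t gain_Q16,
                          int j, int n)
{
    int i, k;
    assert(lpc != NULL && out != NULL && res != NULL && a_Q12 != NULL);
    assert(d_LPC > 0 && j >= d_LPC && n > 0);
    for (i = j; i < j + n; i++) {
        float sum = ((float)gain_Q16 / 65536.0f) * res[i];
        for (k = 0; k < d_LPC; k++) {
            sum += lpc[i - k - 1] * ((float)a_Q12[k] / 4096.0f);
        }
        lpc[i] = sum;              /* unclamped: feeds later predictions */
        out[i] = clamp_unit(sum);  /* clamped: final output, rewhitening */
    }
}
```

Feeding the unclamped values back into the filter means a clipped sample does not disturb the predictor state for the samples that follow it.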
| 4678 </section> |
| 4679 |
| 4680 </section> |
| 4681 |
| 4682 </section> |
| 4683 |
| 4684 <section anchor="silk_stereo_unmixing" title="Stereo Unmixing"> |
| 4685 <t> |
| 4686 For stereo streams, after decoding a frame from each channel, the decoder must |
| 4687 convert the mid-side (MS) representation into a left-right (LR) |
| 4688 representation. |
| 4689 The function silk_stereo_MS_to_LR() (stereo_MS_to_LR.c) implements this process. |
| 4690 In it, the decoder predicts the side channel using a) a simple low-passed |
| 4691 version of the mid channel, and b) the unfiltered mid channel, using the |
| 4692 prediction weights decoded in <xref target="silk_stereo_pred"/>. |
| 4693 This simple low-pass filter imposes a one-sample delay, and the unfiltered |
| 4694 mid channel is also delayed by one sample. |
| 4695 In order to allow seamless switching between stereo and mono, mono streams must |
| 4696 also impose the same one-sample delay. |
| 4697 The encoder requires an additional one-sample delay for both mono and stereo |
| 4698 streams, though an encoder may omit the delay for mono if it knows it will |
| 4699 never switch to stereo. |
| 4700 </t> |
| 4701 |
| 4702 <t> |
| 4703 The unmixing process operates in two phases. |
| 4704 The first phase lasts for 8 ms, during which it interpolates the |
| 4705 prediction weights from the previous frame, prev_w0_Q13 and prev_w1_Q13, to |
| 4706 the values for the current frame, w0_Q13 and w1_Q13. |
| 4707 The second phase simply uses these weights for the remainder of the frame. |
| 4708 </t> |
| 4709 |
| 4710 <t> |
| 4711 Let mid[i] and side[i] be the contents of out[i] (from |
| 4712 <xref target="silk_lpc_synthesis"/>) for the current mid and side channels, |
| 4713 respectively, and let left[i] and right[i] be the corresponding stereo output |
| 4714 channels. |
| 4715 If the side channel is not coded (see <xref target="silk_mid_only_flag"/>), |
| 4716 then side[i] is set to zero. |
| 4717 Also let j be defined as in <xref target="silk_frame_reconstruction"/>, n1 be |
| 4718 the number of samples in phase 1 (64 for NB, 96 for MB, and 128 for WB), |
| 4719 and n2 be the total number of samples in the frame. |
| 4720 Then for i such that j <= i < (j + n2), |
| 4721 the left and right channel output is |
| 4722 <figure align="center"> |
| 4723 <artwork align="center"><![CDATA[ |
| 4724      prev_w0_Q13                   (w0_Q13 - prev_w0_Q13) |
| 4725 w0 = ----------- + min(i - j, n1)*---------------------- , |
| 4726        8192.0                            8192.0*n1 |
| 4727 |
| 4728      prev_w1_Q13                   (w1_Q13 - prev_w1_Q13) |
| 4729 w1 = ----------- + min(i - j, n1)*---------------------- , |
| 4730        8192.0                            8192.0*n1 |
| 4731 |
| 4732      mid[i-2] + 2*mid[i-1] + mid[i] |
| 4733 p0 = ------------------------------ , |
| 4734                   4.0 |
| 4735 |
| 4736  left[i] = clamp(-1.0, (1 + w1)*mid[i-1] + side[i-1] + w0*p0, 1.0) , |
| 4737 |
| 4738 right[i] = clamp(-1.0, (1 - w1)*mid[i-1] - side[i-1] - w0*p0, 1.0) . |
| 4739 ]]></artwork> |
| 4740 </figure> |
| 4741 These formulas require two samples prior to index j, the start of the |
| 4742 frame, for the mid channel, and one prior sample for the side channel. |
| 4743 For the first frame after a decoder reset, zeros are used instead. |
| 4744 </t> |
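The unmixing formulas above can be sketched in floating point as follows. This C code is an illustration with hypothetical names, not the code of the reference implementation: mid and side point at sample j of the current frame, so negative indices reach the saved history from the previous frame.

```c
#include <assert.h>
#include <stddef.h>

/* Clamps x into the nominal output range [-1.0, 1.0]. */
static float clamp_unit(float x)
{
    return x < -1.0f ? -1.0f : (x > 1.0f ? 1.0f : x);
}

/* Hypothetical helper: converts one frame from mid/side to left/right.
   mid[-2], mid[-1], and side[-1] must be valid (history, or zeros after a
   decoder reset). n1 is the interpolation length, n2 the frame length. */
static void stereo_unmix(const float *mid, const float *side,
                         float *left, float *right, int n1, int n2,
                         int prev_w0_Q13, int prev_w1_Q13,
                         int w0_Q13, int w1_Q13)
{
    int i;
    assert(mid != NULL && side != NULL && left != NULL && right != NULL);
    assert(n1 > 0 && n2 > 0);
    for (i = 0; i < n2; i++) {
        int t = i < n1 ? i : n1;  /* min(i - j, n1), with j = 0 here */
        float w0 = prev_w0_Q13 / 8192.0f
                 + t * (w0_Q13 - prev_w0_Q13) / (8192.0f * n1);
        float w1 = prev_w1_Q13 / 8192.0f
                 + t * (w1_Q13 - prev_w1_Q13) / (8192.0f * n1);
        /* Simple low-pass of the mid channel, centered on mid[i-1]. */
        float p0 = (mid[i - 2] + 2.0f * mid[i - 1] + mid[i]) / 4.0f;
        left[i]  = clamp_unit((1.0f + w1) * mid[i - 1] + side[i - 1] + w0 * p0);
        right[i] = clamp_unit((1.0f - w1) * mid[i - 1] - side[i - 1] - w0 * p0);
    }
}
```

With all weights at zero, the output is simply mid plus or minus side, delayed by one sample, which makes the one-sample delay described above easy to observe.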
| 4745 |
| 4746 </section> |
| 4747 |
| 4748 <section title="Resampling"> |
| 4749 <t> |
| 4750 After stereo unmixing (if any), the decoder applies resampling to convert the |
| 4751 decoded SILK output to the sample rate desired by the application. |
| 4752 This is necessary when decoding a Hybrid frame at SWB or FB sample rates, or |
| 4753 whenever the decoder wants the output at a different sample rate than the |
| 4754 internal SILK sampling rate (e.g., to allow a constant sample rate when the |
| 4755 audio bandwidth changes, or to allow mixing with audio from other |
| 4756 applications). |
| 4757 The resampler itself is non-normative, and a decoder can use any method it |
| 4758 wants to perform the resampling. |
| 4759 </t> |
| 4760 |
| 4761 <t> |
| 4762 However, a minimum amount of delay is imposed to allow the resampler to |
| 4763 operate, and this delay is normative, so that the corresponding delay can be |
| 4764 applied to the MDCT layer in the encoder. |
| 4765 A decoder is always free to use a resampler which requires more delay than |
| 4766 allowed for here (e.g., to improve quality), but it must then delay the output |
| 4767 of the MDCT layer by this extra amount. |
| 4768 Keeping as much delay as possible on the encoder side allows an encoder which |
| 4769 knows it will never use any of the SILK or Hybrid modes to skip this delay. |
| 4770 By contrast, if it were all applied by the decoder, then a decoder which |
| 4771 processes audio in fixed-size blocks would be forced to delay the output of |
| 4772 CELT frames just in case of a later switch to a SILK or Hybrid mode. |
| 4773 </t> |
| 4774 |
| 4775 <t> |
| 4776 <xref target="silk_resampler_delay_alloc"/> gives the maximum resampler delay |
| 4777 in samples at 48 kHz for each SILK audio bandwidth. |
| 4778 Because the actual output rate may not be 48 kHz, it may not be possible |
| 4779 to achieve exactly these delays while using a whole number of input or output |
| 4780 samples. |
| 4781 The reference implementation is able to resample to any of the supported |
| 4782 output sampling rates (8, 12, 16, 24, or 48 kHz) within or near this |
| 4783 delay constraint. |
| 4784 Some resampling filters (including those used by the reference implementation) |
| 4785 may add a delay that is not an exact integer, or is not linear-phase, and so |
| 4786 cannot be represented by a single delay at all frequencies. |
| 4787 However, such deviations are unlikely to be perceptible, and the comparison |
| 4788 tool described in <xref target="conformance"/> is designed to be relatively |
| 4789 insensitive to them. |
| 4790 The delays listed here are the ones that should be targeted by the encoder. |
| 4791 </t> |
| 4792 |
| 4793 <texttable anchor="silk_resampler_delay_alloc" |
| 4794 title="SILK Resampler Delay Allocations"> |
| 4795 <ttcol>Audio Bandwidth</ttcol> |
| 4796 <ttcol>Delay in milliseconds</ttcol> |
| 4797 <c>NB</c> <c>0.538</c> |
| 4798 <c>MB</c> <c>0.692</c> |
| 4799 <c>WB</c> <c>0.706</c> |
| 4800 </texttable> |
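<t>
Because the table lists delays in milliseconds, a short conversion shows why an
exact match is generally not possible with a whole number of samples.
The helper name delay_in_samples() and the dictionary below are assumptions
made for this example; only the three delay values come from the table.
</t>
<figure>
<artwork><![CDATA[
```python
# Delay allocations from the table above, in milliseconds.
SILK_RESAMPLER_DELAY_MS = {"NB": 0.538, "MB": 0.692, "WB": 0.706}


def delay_in_samples(bandwidth, rate_hz=48000):
    """Express the allocated delay as a (fractional) sample count at rate_hz."""
    return SILK_RESAMPLER_DELAY_MS[bandwidth] * rate_hz / 1000.0


for bw in ("NB", "MB", "WB"):
    print(bw, delay_in_samples(bw))
```
]]></artwork>
</figure>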
| 4801 |
| 4802 <t> |
| 4803 NB is given a smaller decoder delay allocation than MB and WB to allow a |
| 4804 higher-order filter when resampling to 8 kHz in both the encoder and |
| 4805 decoder. |
| 4806 This implies that the audio content of two SILK frames operating at different |
| 4807 bandwidths is not perfectly aligned in time. |
| 4808 This is not an issue for any transitions described in |
| 4809 <xref target="switching"/>, because they all involve a SILK decoder reset. |
| 4810 When the decoder is reset, any samples remaining in the resampling buffer |
| 4811 are discarded, and the resampler is re-initialized with silence. |
| 4812 </t> |
| 4813 |
| 4814 </section> |
| 4815 |
| 4816 </section> |
| 4817 |
| 4818 |
| 4819 <section title="CELT Decoder"> |
| 4820 |
| 4821 <t> |
| 4822 The CELT layer of Opus is based on the Modified Discrete Cosine Transform |
| 4823 <xref target='MDCT'/> with partially overlapping windows of 5 to 22.5 ms. |
| 4824 The main principle behind CELT is that the MDCT spectrum is divided into |
| 4825 bands that (roughly) follow the Bark scale, i.e., the scale of the ear's |
| 4826 critical bands <xref target="Zwicker61"/>. The normal CELT layer uses 21 of those bands, though Opus |
| 4827 Custom (see <xref target="opus-custom"/>) may use a different number of bands. |
| 4828 In Hybrid mode, the first 17 bands (up to 8 kHz) are not coded. |
| 4829 A band can contain as little as one MDCT bin per channel, and as many as 176 |
| 4830 bins per channel, as detailed in <xref target="celt_band_sizes"/>. |
| 4831 In each band, the gain (energy) is coded separately from |
| 4832 the shape of the spectrum. Coding the gain explicitly makes it easy to |
| 4833 preserve the spectral envelope of the signal. The remaining unit-norm shape |
| 4834 vector is encoded using a Pyramid Vector Quantizer (PVQ) <xref target='PVQ-decoder'/>. |
| 4835 </t> |
| 4836 |
| 4837 <texttable anchor="celt_band_sizes" |
| 4838 title="MDCT Bins Per Channel Per Band for Each Frame Size"> |
| 4839 <ttcol>Frame Size:</ttcol> |
| 4840 <ttcol align="right">2.5 ms</ttcol> |
| 4841 <ttcol align="right">5 ms</ttcol> |
| 4842 <ttcol align="right">10 ms</ttcol> |
| 4843 <ttcol align="right">20 ms</ttcol> |
| 4844 <ttcol align="right">Start Frequency</ttcol> |
| 4845 <ttcol align="right">Stop Frequency</ttcol> |
| 4846 <c>Band</c> <c>Bins:</c> <c/> <c/> <c/> <c/> <c/> |
| 4847 <c>0</c> <c>1</c> <c>2</c> <c>4</c> <c>8</c> <c>0 Hz</c> <c>200 Hz</c> |
| 4848 <c>1</c> <c>1</c> <c>2</c> <c>4</c> <c>8</c> <c>200 Hz</c> <c>400 Hz</c> |
| 4849 <c>2</c> <c>1</c> <c>2</c> <c>4</c> <c>8</c> <c>400 Hz</c> <c>600 Hz</c> |
| 4850 <c>3</c> <c>1</c> <c>2</c> <c>4</c> <c>8</c> <c>600 Hz</c> <c>800 Hz</c> |
| 4851 <c>4</c> <c>1</c> <c>2</c> <c>4</c> <c>8</c> <c>800 Hz</c> <c>1000 Hz</c> |
| 4852 <c>5</c> <c>1</c> <c>2</c> <c>4</c> <c>8</c> <c>1000 Hz</c> <c>1200 Hz</c> |
| 4853 <c>6</c> <c>1</c> <c>2</c> <c>4</c> <c>8</c> <c>1200 Hz</c> <c>1400 Hz</c> |
| 4854 <c>7</c> <c>1</c> <c>2</c> <c>4</c> <c>8</c> <c>1400 Hz</c> <c>1600 Hz</c> |
| 4855 <c>8</c> <c>2</c> <c>4</c> <c>8</c> <c>16</c> <c>1600 Hz</c> <c>2000 Hz</c> |
| 4856 <c>9</c> <c>2</c> <c>4</c> <c>8</c> <c>16</c> <c>2000 Hz</c> <c>2400 Hz</c> |
| 4857 <c>10</c> <c>2</c> <c>4</c> <c>8</c> <c>16</c> <c>2400 Hz</c> <c>2800 Hz</c> |
| 4858 <c>11</c> <c>2</c> <c>4</c> <c>8</c> <c>16</c> <c>2800 Hz</c> <c>3200 Hz</c> |
| 4859 <c>12</c> <c>4</c> <c>8</c> <c>16</c> <c>32</c> <c>3200 Hz</c> <c>4000 Hz</c> |
| 4860 <c>13</c> <c>4</c> <c>8</c> <c>16</c> <c>32</c> <c>4000 Hz</c> <c>4800 Hz</c> |
| 4861 <c>14</c> <c>4</c> <c>8</c> <c>16</c> <c>32</c> <c>4800 Hz</c> <c>5600 Hz</c> |
| 4862 <c>15</c> <c>6</c> <c>12</c> <c>24</c> <c>48</c> <c>5600 Hz</c> <c>6800 Hz</c> |
| 4863 <c>16</c> <c>6</c> <c>12</c> <c>24</c> <c>48</c> <c>6800 Hz</c> <c>8000 Hz</c> |
| 4864 <c>17</c> <c>8</c> <c>16</c> <c>32</c> <c>64</c> <c>8000 Hz</c> <c>9600 Hz</c> |
| 4865 <c>18</c> <c>12</c> <c>24</c> <c>48</c> <c>96</c> <c>9600 Hz</c> <c>12000 Hz</c> |
| 4866 <c>19</c> <c>18</c> <c>36</c> <c>72</c> <c>144</c> <c>12000 Hz</c> <c>15600 Hz</c> |
| 4867 <c>20</c> <c>22</c> <c>44</c> <c>88</c> <c>176</c> <c>15600 Hz</c> <c>20000 Hz</c> |
| 4868 </texttable> |
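<t>
The bin counts in the table follow directly from the band edges: the 2.5 ms
column corresponds to 200 Hz per MDCT bin, and each doubling of the frame size
doubles the number of bins.
The following Python sketch regenerates the columns from the start and stop
frequencies; the names BAND_EDGES_HZ and bins_per_band(), and the use of LM as
the base-2 log of the frame size in units of 2.5 ms, are assumptions made for
this example.
</t>
<figure>
<artwork><![CDATA[
```python
# Start frequency of each band, followed by the final stop frequency (Hz).
BAND_EDGES_HZ = [0, 200, 400, 600, 800, 1000, 1200, 1400, 1600, 2000, 2400,
                 2800, 3200, 4000, 4800, 5600, 6800, 8000, 9600, 12000,
                 15600, 20000]


def bins_per_band(lm):
    """Return the per-channel bin count of each band.

    lm -- 0, 1, 2, or 3 for 2.5, 5, 10, or 20 ms frames
    """
    widths_hz = [hi - lo for lo, hi in zip(BAND_EDGES_HZ, BAND_EDGES_HZ[1:])]
    # 200 Hz per bin at 2.5 ms; each larger frame size doubles the bin count.
    return [(w // 200) << lm for w in widths_hz]


for lm, label in enumerate(("2.5 ms", "5 ms", "10 ms", "20 ms")):
    print(label, bins_per_band(lm))
```
]]></artwork>
</figure>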
| 4869 |
| 4870 <t> |
| 4871 Transients are notoriously difficult for transform codecs to code. |
| 4872 CELT uses two different strategies for them: |
| 4873 <list style="numbers"> |
| 4874 <t>Using multiple smaller MDCTs instead of a single large MDCT, and</t> |
| 4875 <t>Dynamic time-frequency resolution changes (see <xref target='tf-change'/>).</t> |
| 4876 </list> |
| 4877 To improve quality on highly tonal and periodic signals, CELT includes |
| 4878 a prefilter/postfilter combination. The prefilter on the encoder side |
| 4879 attenuates the signal's harmonics. The postfilter on the decoder side |
| 4880 restores the original gain of the harmonics, while shaping the coding noise |
| 4881 to roughly follow the harmonics. Such noise shaping reduces the perception |
| 4882 of the noise. |
| 4883 </t> |
| 4884 |
| 4885 <t> |
| 4886 When coding a stereo signal, three coding methods are available: |
| 4887 <list style="symbols"> |
| 4888 <t>mid-side stereo: encodes the mean and the difference of the left and right channels,</t> |
| 4889 <t>intensity stereo: only encodes the mean of the left and right channels (discards the difference),</t> |
| 4890 <t>dual stereo: encodes the left and right channels separately.</t> |
| 4891 </list> |
| 4892 </t> |
| 4893 |
| 4894 <t> |
| 4895 An overview of the decoder is given in <xref target="celt-decoder-overview"/>. |
| 4896 </t> |
| 4897 |
| 4898 <figure anchor="celt-decoder-overview" title="Structure of the CELT decoder"> |
| 4899 <artwork align="center"><![CDATA[ |
| 4900 +---------+ |
| 4901 | Coarse | |
| 4902 +->| decoder |----+ |
| 4903 | +---------+ | |
| 4904 | | |
| 4905 | +---------+ v |
| 4906 | | Fine | +---+ |
| 4907 +->| decoder |->| + | |
| 4908 | +---------+ +---+ |
| 4909 | ^ | |
| 4910 +---------+ | | | |
| 4911 | Range | | +----------+ v |
| 4912 | Decoder |-+ | Bit | +------+ |
| 4913 +---------+ | |Allocation| | 2**x | |
| 4914 | +----------+ +------+ |
| 4915 | | | |
| 4916 | v v +--------+ |
| 4917 | +---------+ +---+ +-------+ | pitch | |
| 4918 +->| PVQ |->| * |->| IMDCT |->| post- |---> |
| 4919 | | decoder | +---+ +-------+ | filter | |
| 4920 | +---------+ +--------+ |
| 4921 | ^ |
| 4922 +--------------------------------------+ |
| 4923 ]]></artwork> |
| 4924 </figure> |
| 4925 |
| 4926 <t> |
| 4927 The decoder is based on the following symbols and sets of symbols: |
| 4928 </t> |
| 4929 |
| 4930 <texttable anchor="celt_symbols" |
| 4931 title="Order of the Symbols in the CELT Section of the Bitstream"> |
| 4932 <ttcol align="center">Symbol(s)</ttcol> |
| 4933 <ttcol align="center">PDF</ttcol> |
| 4934 <ttcol align="center">Condition</ttcol> |
| 4935 <c>silence</c> <c>{32767, 1}/32768</c> <c></c> |
| 4936 <c>post-filter</c> <c>{1, 1}/2</c> <c></c> |
| 4937 <c>octave</c> <c>uniform (6)</c><c>post-filter</c> |
| 4938 <c>period</c> <c>raw bits (4+octave)</c><c>post-filter</c> |
| 4939 <c>gain</c> <c>raw bits (3)</c><c>post-filter</c> |
| 4940 <c>tapset</c> <c>{2, 1, 1}/4</c><c>post-filter</c> |
| 4941 <c>transient</c> <c>{7, 1}/8</c><c></c> |
| 4942 <c>intra</c> <c>{7, 1}/8</c><c></c> |
| 4943 <c>coarse energy</c><c><xref target="energy-decoding"/></c><c></c> |
| 4944 <c>tf_change</c> <c><xref target="transient-decoding"/></c><c></c> |
| 4945 <c>tf_select</c> <c>{1, 1}/2</c><c><xref target="transient-decoding"/></c> |
| 4946 <c>spread</c> <c>{7, 2, 21, 2}/32</c><c></c> |
| 4947 <c>dyn. alloc.</c> <c><xref target="allocation"/></c><c></c> |
| 4948 <c>alloc. trim</c> <c>{2, 2, 5, 10, 22, 46, 22, 10, 5, 2, 2}/128</c><c></c> |
| 4949 <c>skip</c> <c>{1, 1}/2</c><c><xref target="allocation"/></c> |
| 4950 <c>intensity</c> <c>uniform</c><c><xref target="allocation"/></c> |
| 4951 <c>dual</c> <c>{1, 1}/2</c><c></c> |
| 4952 <c>fine energy</c> <c><xref target="energy-decoding"/></c><c></c> |
| 4953 <c>residual</c> <c><xref target="PVQ-decoder"/></c><c></c> |
| 4954 <c>anti-collapse</c><c>{1, 1}/2</c><c><xref target="anti-collapse"/></c> |
| 4955 <c>finalize</c> <c><xref target="energy-decoding"/></c><c></c> |
| 4956 </texttable> |
| 4957 |
| 4958 <t> |
| 4959 The decoder extracts information from the range-coded bitstream in the order |
| 4960 described in <xref target='celt_symbols'/>. In some circumstances, it is |
| 4961 possible for a decoded value to be out of range due to a very small amount of redundancy |
| 4962 in the encoding of large integers by the range coder. |
| 4963 In that case, the decoder should assume there has been an error in the coding, |
| 4964 decoding, or transmission and SHOULD take measures to conceal the error and/or report |
| 4965 to the application that a problem has occurred. Such out of range errors cannot occur |
| 4966 in the SILK layer. |
| 4967 </t> |
| 4968 |
| 4969 <section anchor="transient-decoding" title="Transient Decoding"> |
| 4970 <t> |
| 4971 The "transient" flag indicates whether the frame uses a single long MDCT or several short MDCTs. |
| 4972 When it is set, the MDCT coefficients represent multiple |
| 4973 short MDCTs in the frame. When not set, the coefficients represent a single |
| 4974 long MDCT for the frame. The flag is encoded in the bitstream with a probability of 1/8. |
| 4975 In addition to the global transient flag, there is a per-band |
| 4976 binary flag to change the time-frequency (tf) resolution independently in each band. The |
| 4977 change in tf resolution is defined in tf_select_table[][] in celt.c and depends |
| 4978 on the frame size, whether the transient flag is set, and the value of tf_select. |
| 4979 The tf_select flag uses a 1/2 probability, but is only decoded |
| 4980 if it can have an impact on the result knowing the value of all per-band |
| 4981 tf_change flags. |
| 4982 </t> |
| 4983 </section> |
| 4984 |
| 4985 <section anchor="energy-decoding" title="Energy Envelope Decoding"> |
| 4986 |
| 4987 <t> |
| 4988 It is important to quantize the energy with sufficient resolution because |
| 4989 any energy quantization error cannot be compensated for at a later |
| 4990 stage. Regardless of the resolution used for encoding the spectral shape of a band, |
| 4991 it is perceptually important to preserve the energy in each band. CELT uses a |
| 4992 three-step coarse-fine-fine strategy for encoding the energy in the base-2 log |
| 4993 domain, as implemented in quant_bands.c.</t> |
| 4994 |
| 4995 <section anchor="coarse-energy-decoding" title="Coarse energy decoding"> |
| 4996 <t> |
| 4997 Coarse quantization of the energy uses a fixed resolution of 6 dB |
| 4998 (integer part of base-2 log). To minimize the bitrate, prediction is applied |
| 4999 both in time (using the previous frame) and in frequency (using the previous |
| 5000 bands). The part of the prediction that is based on the |
| 5001 previous frame can be disabled, creating an "intra" frame where the energy |
| 5002 is coded without reference to prior frames. The decoder first reads the intra flag |
| 5003 to determine what prediction is used. |
| 5004 The 2-D z-transform <xref target='z-transform'/> of |
| 5005 the prediction filter is: |
| 5006 <figure align="center"> |
| 5007 <artwork align="center"><![CDATA[ |
| 5008 -1 -1 |
| 5009 (1 - alpha*z_l )*(1 - z_b ) |
| 5010 A(z_l, z_b) = ----------------------------- |
| 5011 -1 |
| 5012 1 - beta*z_b |
| 5013 ]]></artwork> |
| 5014 </figure> |
| 5015 where b is the band index and l is the frame index. When not using intra energy, |
| 5016 the prediction coefficients depend on the frame size in use; when using intra energy, |
| 5017 they are alpha=0, beta=4915/32768. |
| 5018 The time-domain prediction is based on the final fine quantization of the previous |
| 5019 frame, while the frequency domain (within the current frame) prediction is based |
| 5020 on coarse quantization only (because the fine quantization has not been computed |
| 5021 yet). The prediction is clamped internally so that fixed point implementations with |
| 5022 limited dynamic range always remain in the same state as floating point implementations. |
| 5023 We approximate the ideal |
| 5024 probability distribution of the prediction error using a Laplace distribution |
| 5025 with separate parameters for each frame size in intra- and inter-frame modes. These |
| 5026 parameters are held in the e_prob_model table in quant_bands.c. |
| 5027 The |
| 5028 coarse energy decoding is performed by unquant_coarse_energy() and |
| 5029 unquant_coarse_energy_impl() (quant_bands.c). The decoding of the Laplace-distributed values is |
| 5030 implemented in ec_laplace_decode() (laplace.c). |
| 5031 </t> |
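<t>
One way to read the prediction filter is as the mapping from the band
log-energies to the coded prediction residual, so that the decoder applies its
inverse.
The Python sketch below illustrates that reading in floating point.
It omits the internal clamping and the entropy decoding described above, the
function names are assumptions made for this example, and the inter-frame
alpha used in the usage note is a made-up value, not one from the reference
tables.
</t>
<figure>
<artwork><![CDATA[
```python
def decode_coarse_energy(residuals, prev_frame_energy, alpha, beta):
    """Rebuild band log-energies from decoded residuals (inverse of A)."""
    energy = []
    band_pred = 0.0                      # running prediction across bands
    for b, q in enumerate(residuals):
        energy.append(alpha * prev_frame_energy[b] + band_pred + q)
        band_pred += (1.0 - beta) * q    # update the frequency predictor
    return energy


def prediction_residual(energy, prev_frame_energy, alpha, beta):
    """Apply A(z_l, z_b) to the band log-energies (encoder-side view)."""
    out = []
    prev_d = 0.0
    prev_r = 0.0
    for b, e in enumerate(energy):
        d = e - alpha * prev_frame_energy[b]   # (1 - alpha*z_l^-1)
        x = d - prev_d                         # (1 - z_b^-1)
        r = x + beta * prev_r                  # 1 / (1 - beta*z_b^-1)
        out.append(r)
        prev_d, prev_r = d, r
    return out
```
]]></artwork>
</figure>
<t>
For an intra frame (alpha=0, beta=4915/32768), a residual of 1 in band 0
followed by zeros yields an energy of 1 in band 0 and 1 - beta in every later
band. Applying prediction_residual() to the output of decode_coarse_energy()
returns the original residuals, which is a convenient consistency check.
</t>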
| 5032 |
| 5033 </section> |
| 5034 |
| 5035 <section anchor="fine-energy-decoding" title="Fine energy quantization"> |
| 5036 <t> |
| 5037 The number of bits assigned to fine energy quantization in each band is determined |
| 5038 by the bit allocation computation described in <xref target="allocation"></xref>. |
| 5039 Let B_i be the number of fine energy bits |
| 5040 for band i; the refinement is an integer f in the range [0,2**B_i-1]. The correction |
| 5041 applied to the coarse energy for a given f is equal to (f+1/2)/2**B_i - 1/2. Fine |
| 5042 energy quantization is implemented in quant_fine_energy() (quant_bands.c). |
| 5043 </t> |
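<t>
The mapping from the refinement integer f to the energy correction can be
written out directly.
The function name fine_energy_offset() is an assumption made for this example;
the result is in the same base-2 log units as the coarse energy.
</t>
<figure>
<artwork><![CDATA[
```python
def fine_energy_offset(f, bits):
    """Correction added to the coarse energy for refinement f using `bits` bits."""
    if not 0 <= f < (1 << bits):
        raise ValueError("f must lie in [0, 2**bits - 1]")
    return (f + 0.5) / (1 << bits) - 0.5


# With two bits, the four refinements are spread evenly around zero.
print([fine_energy_offset(f, 2) for f in range(4)])
```
]]></artwork>
</figure>
<t>
The corrections are symmetric around zero and always lie strictly between
-1/2 and 1/2, so a fine step never moves the energy into a neighboring coarse
quantization interval.
</t>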
| 5044 <t> |
| 5045 When some bits are left "unused" after all other flags have been decoded, these bits |
| 5046 are assigned to a "final" step of fine allocation. In effect, these bits are used |
| 5047 to add one extra fine energy bit per band per channel. The allocation process |
| 5048 determines two "priorities" for the final fine bits. |
| 5049 Any remaining bits are first assigned only to bands of priority 0, starting |
| 5050 from band 0 and going up. If all bands of priority 0 have received one bit per |
| 5051 channel, then bands of priority 1 are assigned an extra bit per channel, |
| 5052 starting from band 0. If any bits are left after this, they are left unused. |
| 5053 This is implemented in unquant_energy_finalise() (quant_bands.c). |
| 5054 </t> |
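<t>
The two-pass assignment of the final fine bits can be sketched as follows.
This is a simplified illustration: the function name assign_final_bits() and
its arguments are assumptions made for this example, and it only tracks which
bands receive an extra bit, not the decoding of the bit values themselves.
</t>
<figure>
<artwork><![CDATA[
```python
def assign_final_bits(bits_left, priorities, channels):
    """Hand out leftover bits, one per channel per band, by priority.

    priorities -- list with one entry (0 or 1) per band, in band order
    Returns (extra, bits_left): extra[b] is 1 if band b got an extra bit
    per channel, and bits_left is whatever could not be used.
    """
    extra = [0] * len(priorities)
    for prio in (0, 1):                      # priority 0 bands go first
        for band, p in enumerate(priorities):
            if p == prio and bits_left >= channels:
                extra[band] = 1
                bits_left -= channels
    return extra, bits_left


# Stereo frame with 7 spare bits: three bands are served, one bit is unused.
print(assign_final_bits(7, [0, 1, 0, 1], 2))
```
]]></artwork>
</figure>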
| 5055 |
| 5056 </section> <!-- fine energy --> |
| 5057 |
| 5058 </section> <!-- Energy decode --> |
| 5059 |
| 5060 <section anchor="allocation" title="Bit Allocation"> |
| 5061 |
| 5062 <t>Because the bit allocation drives the decoding of the range-coder |
| 5063 stream, it MUST be recovered exactly so that identical coding decisions are |
| 5064 made in the encoder and decoder. Any deviation from the reference's resulting |
| 5065 bit allocation will result in corrupted output, though implementers are |
| 5066 free to implement the procedure in any way which produces identical results.</t> |
| 5067 |
| 5068 <t>The per-band gain-shape structure of the CELT layer ensures that using |
| 5069 the same number of bits for the spectral shape of a band in every frame will |
| 5070 result in a roughly constant signal-to-noise ratio in that band. |
| 5071 This results in coding noise that has the same spectral envelope as the signal. |
| 5072 The masking curve produced by a standard psychoacoustic model also closely |
| 5073 follows the spectral envelope of the signal. |
| 5074 This structure means that the ideal allocation is more consistent from frame to |
| 5075 frame than it is for other codecs without an equivalent structure, and that a |
| 5076 fixed allocation provides fairly consistent perceptual |
| 5077 performance <xref target='Valin2010'/>.</t> |
| 5078 |
| 5079 <t>Many codecs transmit significant amounts of side information to control the |
| 5080 bit allocation within a frame. |
| 5081 Often this control is only indirect, and must be exercised carefully to |
| 5082 achieve the desired rate constraints. |
| 5083 The CELT layer, however, can adapt over a very wide range of rates, and thus |
| 5084 has a large number of codebook sizes to choose from for each band. |
| 5085 Explicitly signaling the size of each of these codebooks would impose |
| 5086 considerable overhead, even though the allocation is relatively static from |
| 5087 frame to frame. |
| 5088 This is because all of the information required to compute these codebook sizes |
| 5089 must be derived from a single frame by itself, in order to retain robustness |
| 5090 to packet loss, so the signaling cannot take advantage of knowledge of the |
| 5091 allocation in neighboring frames. |
| 5092 This problem is exacerbated in low-latency (small frame size) applications, |
| 5093 which would include this overhead in every frame.</t> |
| 5094 |
| 5095 <t>For this reason, in the MDCT mode Opus uses a primarily implicit bit |
| 5096 allocation. The available bitstream capacity is known in advance to both |
| 5097 the encoder and decoder without additional signaling, ultimately from the |
| 5098 packet sizes expressed by a higher-level protocol. Using this information, |
| 5099 the codec interpolates an allocation from a hard-coded table.</t> |
| 5100 |
| 5101 <t>While the band-energy structure effectively models intra-band masking, |
| 5102 it ignores the weaker inter-band masking, band-temporal masking, and |
| 5103 other less significant perceptual effects. While these effects can |
| 5104 often be ignored, they can become significant for particular samples. One |
| 5105 mechanism available to encoders would be to simply increase the overall |
| 5106 rate for these frames, but this is not possible in a constant rate mode |
| 5107 and can be fairly inefficient. As a result three explicitly signaled |
| 5108 mechanisms are provided to alter the implicit allocation:</t> |
| 5109 |
| 5110 <t> |
| 5111 <list style="symbols"> |
| 5112 <t>Band boost</t> |
| 5113 <t>Allocation trim</t> |
| 5114 <t>Band skipping</t> |
| 5115 </list> |
| 5116 </t> |
| 5117 |
| 5118 <t>The first of these mechanisms, band boost, allows an encoder to boost |
| 5119 the allocation in specific bands. The second, allocation trim, works by |
| 5120 biasing the overall allocation towards higher or lower frequency bands. The third, band |
| 5121 skipping, selects which low-precision high frequency bands |
| 5122 will be allocated no shape bits at all.</t> |
| 5123 |
| 5124 <t>In stereo mode there are two additional parameters |
| 5125 potentially coded as part of the allocation procedure: a parameter to allow the |
| 5126 selective elimination of allocation for the 'side' (i.e., intensity stereo) in jointly coded bands, |
| 5127 and a flag to deactivate joint coding (i.e., dual stereo). These values are not signaled if |
| 5128 they would be meaningless in the overall context of the allocation.</t> |
| 5129 |
| 5130 <t>Because every signaled adjustment increases overhead and implementation |
| 5131 complexity, none were included speculatively: the reference encoder makes use |
| 5132 of all of these mechanisms. While the decision logic in the reference was |
| 5133 found to be effective enough to justify the overhead and complexity, further |
| 5134 analysis techniques may be discovered which increase the effectiveness of these |
| 5135 parameters. As with other signaled parameters, an encoder is free to choose the |
| 5136 values in any manner, but unless a technique is known to deliver superior |
| 5137 perceptual results the methods used by the reference implementation should be |
| 5138 used.</t> |
| 5139 |
| 5140 <t>The allocation process consists of the following steps: determining the per-band |
| 5141 maximum allocation vector, decoding the boosts, decoding the tilt, determining |
| 5142 the remaining capacity of the frame, searching the mode table for the |
| 5143 entry nearest but not exceeding the available space (subject to the tilt, boosts, band |
| 5144 maximums, and band minimums), linear interpolation, reallocation of |
| 5145 unused bits with concurrent skip decoding, determination of the |
| 5146 fine-energy vs. shape split, and final reallocation. This process results |
| 5147 in a per-band shape allocation (in 1/8th bit units), a per-band fine-energy |
| 5148 allocation (in 1 bit per channel units), a set of band priorities for |
| 5149 controlling the use of remaining bits at the end of the frame, and a |
| 5150 remaining balance of unallocated space, which is usually zero except |
| 5151 at very high rates.</t> |
| 5152 |
| 5153 <t> |
| 5154 The "static" bit allocation (in 1/8 bits) for a quality q, excluding the minimums, maximums, |
| 5155 tilt and boosts, is equal to channels*N*alloc[band][q]<<LM>>2, where |
| 5156 alloc[][] is given in <xref target="static_alloc"/> and LM=log2(frame_size/120). The allocation |
| 5157 is obtained by linearly interpolating between two values of q (in steps of 1/64) to find the |
| 5158 highest allocation that does not exceed the number of bits remaining. |
| 5159 </t> |
| 5160 |
| 5161 <texttable anchor="static_alloc" |
| 5162 title="CELT Static Allocation Table"> |
| 5163 <preamble>Rows indicate the MDCT bands, columns are the different quality (q) parameters. The units are 1/32 bit per MDCT bin.</preamble> |
| 5164 <ttcol align="right">0</ttcol> |
| 5165 <ttcol align="right">1</ttcol> |
| 5166 <ttcol align="right">2</ttcol> |
| 5167 <ttcol align="right">3</ttcol> |
| 5168 <ttcol align="right">4</ttcol> |
| 5169 <ttcol align="right">5</ttcol> |
| 5170 <ttcol align="right">6</ttcol> |
| 5171 <ttcol align="right">7</ttcol> |
| 5172 <ttcol align="right">8</ttcol> |
| 5173 <ttcol align="right">9</ttcol> |
| 5174 <ttcol align="right">10</ttcol> |
| 5175 <c>0</c><c>90</c><c>110</c><c>118</c><c>126</c><c>134</c><c>144</c><c>152</c><c>162</c><c>172</c><c>200</c> |
| 5176 <c>0</c><c>80</c><c>100</c><c>110</c><c>119</c><c>127</c><c>137</c><c>145</c><c>155</c><c>165</c><c>200</c> |
| 5177 <c>0</c><c>75</c><c>90</c><c>103</c><c>112</c><c>120</c><c>130</c><c>138</c><c>148</c><c>158</c><c>200</c> |
| 5178 <c>0</c><c>69</c><c>84</c><c>93</c><c>104</c><c>114</c><c>124</c><c>132</c><c>142</c><c>152</c><c>200</c> |
| 5179 <c>0</c><c>63</c><c>78</c><c>86</c><c>95</c><c>103</c><c>113</c><c>123</c><c>133</c><c>143</c><c>200</c> |
| 5180 <c>0</c><c>56</c><c>71</c><c>80</c><c>89</c><c>97</c><c>107</c><c>117</c><c>127</c><c>137</c><c>200</c> |
| 5181 <c>0</c><c>49</c><c>65</c><c>75</c><c>83</c><c>91</c><c>101</c><c>111</c><c>121</c><c>131</c><c>200</c> |
| 5182 <c>0</c><c>40</c><c>58</c><c>70</c><c>78</c><c>85</c><c>95</c><c>105</c><c>115</c><c>125</c><c>200</c> |
| 5183 <c>0</c><c>34</c><c>51</c><c>65</c><c>72</c><c>78</c><c>88</c><c>98</c><c>108</c><c>118</c><c>198</c> |
| 5184 <c>0</c><c>29</c><c>45</c><c>59</c><c>66</c><c>72</c><c>82</c><c>92</c><c>102</c><c>112</c><c>193</c> |
| 5185 <c>0</c><c>20</c><c>39</c><c>53</c><c>60</c><c>66</c><c>76</c><c>86</c><c>96</c><c>106</c><c>188</c> |
| 5186 <c>0</c><c>18</c><c>32</c><c>47</c><c>54</c><c>60</c><c>70</c><c>80</c><c>90</c><c>100</c><c>183</c> |
| 5187 <c>0</c><c>10</c><c>26</c><c>40</c><c>47</c><c>54</c><c>64</c><c>74</c><c>84</c><c>94</c><c>178</c> |
| 5188 <c>0</c><c>0</c><c>20</c><c>31</c><c>39</c><c>47</c><c>57</c><c>67</c><c>77</c><c>87</c><c>173</c> |
| 5189 <c>0</c><c>0</c><c>12</c><c>23</c><c>32</c><c>41</c><c>51</c><c>61</c><c>71</c><c>81</c><c>168</c> |
| 5190 <c>0</c><c>0</c><c>0</c><c>15</c><c>25</c><c>35</c><c>45</c><c>55</c><c>65</c><c>75</c><c>163</c> |
| 5191 <c>0</c><c>0</c><c>0</c><c>4</c><c>17</c><c>29</c><c>39</c><c>49</c><c>59</c><c>69</c><c>158</c> |
| 5192 <c>0</c><c>0</c><c>0</c><c>0</c><c>12</c><c>23</c><c>33</c><c>43</c><c>53</c><c>63</c><c>153</c> |
| 5193 <c>0</c><c>0</c><c>0</c><c>0</c><c>1</c><c>16</c><c>26</c><c>36</c><c>46</c><c>56</c><c>148</c> |
| 5194 <c>0</c><c>0</c><c>0</c><c>0</c><c>0</c><c>10</c><c>15</c><c>20</c><c>30</c><c>45</c><c>129</c> |
| 5195 <c>0</c><c>0</c><c>0</c><c>0</c><c>0</c><c>1</c><c>1</c><c>1</c><c>1</c><c>20</c><c>104</c> |
| 5196 </texttable> |
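<t>
As a worked example of the static allocation formula, the sketch below
evaluates it for band 0 using the first row of the table above.
It assumes that N is the per-channel bin count of the band at the shortest
frame size (so that the shift by LM scales it to the actual frame size), and
the rounding used in interpolate() is likewise an assumption; the text only
states that interpolation proceeds in steps of 1/64.
The function names are made up for this example.
</t>
<figure>
<artwork><![CDATA[
```python
# First row of the static allocation table (band 0), in 1/32 bit per bin.
ALLOC_BAND0 = [0, 90, 110, 118, 126, 134, 144, 152, 162, 172, 200]


def static_alloc(channels, n, alloc_value, lm):
    """channels*N*alloc<<LM>>2, giving the allocation in 1/8 bit units."""
    return (channels * n * alloc_value << lm) >> 2


def interpolate(lo_bits, hi_bits, step):
    """Linear interpolation between two quality levels, step in [0, 64]."""
    return ((64 - step) * lo_bits + step * hi_bits) >> 6


# Mono, band 0 (N = 1), 20 ms frame (LM = 3), quality 5 and 6.
q5 = static_alloc(1, 1, ALLOC_BAND0[5], 3)
q6 = static_alloc(1, 1, ALLOC_BAND0[6], 3)
print(q5, q6, interpolate(q5, q6, 32))
```
]]></artwork>
</figure>
<t>
For quality 5 this gives 268 eighth-bits, i.e., 33.5 bits for the 8 bins of
band 0 at 20 ms, which agrees with 8 bins times 134/32 bit per bin.
</t>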
| 5197 |
| 5198 <t>The maximum allocation vector is an approximation of the maximum space |
| 5199 that can be used by each band for a given mode. The value is |
| 5200 approximate because the shape encoding is variable rate (due |
| 5201 to entropy coding of splitting parameters). Setting the maximum too low reduces the |
| 5202 maximum achievable quality in a band while setting it too high |
| 5203 may result in waste: bitstream capacity available at the end |
| 5204 of the frame which cannot be put to any use. The maximums |
| 5205 specified by the codec reflect the average maximum. In the reference |
| 5206 implementation, the maximums in bits/sample are precomputed in a static table |
| 5207 (see cache_caps50[] in static_modes_float.h) for each band, |
| 5208 for each value of LM, and for both mono and stereo. |
| 5209 |
| 5210 Implementations are expected |
| 5211 to simply use the same table data, but the procedure for generating |
| 5212 this table is included in rate.c as part of compute_pulse_cache().</t> |
| 5213 |
| 5214 <t>To convert the values in cache.caps into the actual maximums: first |
| 5215 set nbBands to the maximum number of bands for this mode, and stereo to |
| 5216 zero if stereo is not in use and one otherwise. For each band set N |
| 5217 to the number of MDCT bins covered by the band (for one channel), set LM |
| 5218 to the shift value for the frame size, |
| 5219 then set i to nbBands*(2*LM+stereo). Then set the maximum for the band to |
| 5220 the i-th index of cache.caps + 64 and multiply by the number of channels |
| 5221 in the current frame (one or two) and by N, then divide the result by 4 |
| 5222 using integer division. The resulting vector will be called |
| 5223 cap[]. The elements fit in signed 16-bit integers but do not fit in 8 bits. |
| 5224 This procedure is implemented in the reference in the function init_caps() in celt.c. |
| 5225 </t> |
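<t>
The conversion described above can be sketched as follows.
The table contents used here are placeholder values, not the real cache.caps
data, and the function name band_cap() is made up for this example.
The sketch also assumes that the entry for a given band is found at offset
i plus the band index, which the text leaves implicit.
</t>
<figure>
<artwork><![CDATA[
```python
def band_cap(caps, nb_bands, band, lm, stereo, n):
    """Maximum allocation for one band.

    caps     -- flat table with nb_bands entries per (LM, stereo) pair
    stereo   -- 0 for mono, 1 for stereo
    n        -- MDCT bins covered by the band for one channel
    """
    i = nb_bands * (2 * lm + stereo)         # start of this mode's entries
    channels = 1 + stereo
    # Assumption: the band's own entry sits at index i + band.
    return ((caps[i + band] + 64) * channels * n) >> 2


# Placeholder table: 21 bands, LM in 0..3, mono and stereo.
HYPOTHETICAL_CAPS = [100] * (21 * 8)
print(band_cap(HYPOTHETICAL_CAPS, 21, 0, 3, 1, 8))
```
]]></artwork>
</figure>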
| 5226 |
| 5227 <t>The band boosts are represented by a series of binary symbols which |
| 5228 are entropy coded with very low probability. Each band can potentially be boosted |
| 5229 multiple times, subject to the frame actually having enough room to obey |
| 5230 the boost and having enough room to code the boost symbol. The default |
| 5231 coding cost for a boost starts out at six bits (probability p=1/64), but subsequent boosts |
| 5232 in a band cost only a single bit and every time a band is boosted the |
| 5233 initial cost is reduced (down to a minimum of two bits, or p=1/4). Since the initial |
| 5234 cost of coding a boost is 6 bits, the coding cost of the boost symbols when |
| 5235 completely unused is 0.48 bits/frame for a 21 band mode (21*-log2(1-1/2**6)).</t> |
| 5236 |
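<t>
The 0.48 bits/frame figure quoted above can be checked numerically: each band
that is not boosted costs -log2(1 - 2**-6) bits to signal.
The function name unused_boost_cost_bits() is an assumption made for this
example.
</t>
<figure>
<artwork><![CDATA[
```python
import math


def unused_boost_cost_bits(num_bands, logp=6):
    """Total cost of signaling 'no boost' in every band, in bits."""
    p_boost = 1.0 / (1 << logp)              # probability of a boost symbol
    return num_bands * -math.log2(1.0 - p_boost)


print(round(unused_boost_cost_bits(21), 2))
```
]]></artwork>
</figure>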
| 5237 <t>To decode the band boosts: First set 'dynalloc_logp' to 6, the initial |
| 5238 amount of storage required to signal a boost in bits, 'total_bits' to the |
| 5239 size of the frame in 8th bits, 'total_boost' to zero, and 'tell' to the total number |
| 5240 of 8th bits decoded |
| 5241 so far. For each band from the coding start (0 normally, but 17 in Hybrid mode) |
| 5242 to the coding end (which changes depending on the signaled bandwidth), the boost quanta |
| 5243 in units of 1/8 bit is calculated as quanta = min(8*N, max(48, N)). |
| 5244 This represents a boost step size of six bits, subject to a lower limit of |
| 5245 1/8th bit/sample and an upper limit of 1 bit/sample. |
| 5246 Set 'boost' to zero and 'dynalloc_loop_logp' |
| 5247 to dynalloc_logp. While dynalloc_loop_logp (the current worst case symbol cost) in |
| 5248 8th bits plus tell is less than total_bits plus total_boost and boost is less than cap[] for this |
| 5249 band: Decode a bit from the bitstream with dynalloc_loop_logp as the cost |
| 5250 of a one, and update tell to reflect the current used capacity. If the decoded value |
| 5251 is zero, break the loop; otherwise, add quanta to boost and total_boost, subtract quanta from |
| 5252 total_bits, and set dynalloc_loop_logp to 1. When the while loop finishes |
| 5253 boost contains the boost for this band. If boost is non-zero and dynalloc_logp |
| 5254 is greater than 2, decrease dynalloc_logp. Once this process has been |
| 5255 executed on all bands, the band boosts have been decoded. This procedure |
| 5256 is implemented around line 2474 of celt.c.</t> |
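<t>
Two small pieces of the procedure above lend themselves to a direct sketch:
the boost step size for a band of N bins, and the update of the initial symbol
cost after a band has been boosted.
The function names are assumptions made for this example, and the range
decoder loop itself is intentionally not reproduced here.
</t>
<figure>
<artwork><![CDATA[
```python
def boost_quanta(n):
    """Boost step for a band of n bins, in 1/8 bit units."""
    return min(8 * n, max(48, n))


def next_dynalloc_logp(dynalloc_logp, boost):
    """Initial boost cost (in bits) to use for the following band."""
    if boost > 0 and dynalloc_logp > 2:
        return dynalloc_logp - 1
    return dynalloc_logp


# Narrow bands are limited to 1 bit/sample, wide bands to 1/8 bit/sample,
# and everything in between uses the nominal six-bit (48) step.
for n in (1, 2, 8, 48, 176):
    print(n, boost_quanta(n))
```
]]></artwork>
</figure>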
| 5257 |
| 5258 <t>At very low rates it is possible that there won't be enough available |
| 5259 space to execute the inner loop even once. In these cases band boost |
| 5260 is not possible but its overhead is completely eliminated. Because of the |
| 5261 high cost of band boost when activated, a reasonable encoder should not be |
| 5262 using it at very low rates. The reference implements its dynalloc decision |
| 5263 logic around line 1304 of celt.c.</t> |
| 5264 |
| 5265 <t>The allocation trim is an integer value from 0 to 10. The default value of |
| 5266 5 indicates no trim. The trim parameter is entropy coded in order to |
| 5267 lower the coding cost of less extreme adjustments. Values lower than |
| 5268 5 bias the allocation towards lower frequencies and values above 5 |
| 5269 bias it towards higher frequencies. Like other signaled parameters, signaling |
| 5270 of the trim is gated so that it is not included if there is insufficient space |
| 5271 available in the bitstream. To decode the trim, first set |
| 5272 the trim value to 5, then if and only if the count of decoded 8th bits so far (ec_tell_frac) |
| 5273 plus 48 (6 bits) is less than or equal to the total frame size in 8th |
| 5274 bits minus total_boost (a product of the above band boost procedure), |
| 5275 decode the trim value using the PDF in <xref target="celt_trim_pdf"/>.</t> |
| 5276 |
| 5277 <texttable anchor="celt_trim_pdf" title="PDF for the Trim"> |
| 5278 <ttcol>PDF</ttcol> |
| 5279 <c>{2, 2, 5, 10, 22, 46, 22, 10, 5, 2, 2}/128</c> |
| 5280 </texttable> |
| 5281 |
| 5282 <t>For 10 ms and 20 ms frames that use short blocks and have at least LM+2 bits left prior to |
| 5283 the allocation process, one anti-collapse bit is reserved in the allocation process so it can |
| 5284 be decoded later. Following the anti-collapse reservation, one bit is reserved for skip if available.</t> |
| 5285 |
| 5286 <t>For stereo frames, bits are reserved for intensity stereo and for dual stereo. Intensity stereo |
| 5287 requires ilog2(end-start) bits. Those bits are reserved if there are enough bits left. Following this, one |
| 5288 bit is reserved for dual stereo if available.</t> |
| 5289 |
| 5290 |
| 5291 <t>The allocation computation begins by setting up some initial conditions. |
| 5292 'total' is set to the remaining available 8th bits, computed by taking the |
| 5293 size of the coded frame times 8 and subtracting ec_tell_frac(). From this value, one (8th bit) |
| 5294 is subtracted to ensure that the resulting allocation will be conservative. 'anti_collapse_rsv' |
| 5295 is set to 8 (8th bits) if and only if the frame is a transient, LM is greater than 1, and total is |
| 5296 greater than or equal to (LM+2) * 8. Total is then decremented by anti_collapse_rsv and clamped |
| 5297 to be equal to or greater than zero. 'skip_rsv' is set to 8 (8th bits) if total is greater than |
| 5298 8, otherwise it is zero. Total is then decremented by skip_rsv. This reserves space for the |
| 5299 final skipping flag.</t> |
| 5300 |
| 5301 <t>If the current frame is stereo, intensity_rsv is set to the conservative log2 in 8th bits |
| 5302 of the number of coded bands for this frame (given by the table LOG2_FRAC_TABLE in rate.c). If |
| 5303 intensity_rsv is greater than total then intensity_rsv is set to zero. Otherwise total is |
| 5304 decremented by intensity_rsv, and if total is still greater than 8, dual_stereo_rsv is |
| 5305 set to 8 and total is decremented by dual_stereo_rsv.</t> |
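<t>The reservation steps in the two paragraphs above can be sketched as
follows. This is a minimal sketch under stated assumptions: the struct and
function names are hypothetical, frame_bits is the coded frame size in whole
bits, and intensity_log2_8th is the LOG2_FRAC_TABLE value that the caller has
already looked up.</t>
<figure><artwork><![CDATA[
```c
typedef struct {
    int total;              /* remaining budget, in 1/8 bits */
    int anti_collapse_rsv;
    int skip_rsv;
    int intensity_rsv;
    int dual_stereo_rsv;
} alloc_reserve;

static alloc_reserve reserve_bits(int frame_bits, int tell_frac,
                                  int is_transient, int LM, int channels,
                                  int intensity_log2_8th)
{
    alloc_reserve r = {0, 0, 0, 0, 0};

    /* Remaining 1/8 bits, minus one for a conservative allocation. */
    r.total = frame_bits * 8 - tell_frac - 1;

    if (is_transient && LM > 1 && r.total >= (LM + 2) * 8)
        r.anti_collapse_rsv = 8;
    r.total -= r.anti_collapse_rsv;
    if (r.total < 0)
        r.total = 0;

    if (r.total > 8)
        r.skip_rsv = 8;
    r.total -= r.skip_rsv;

    if (channels == 2) {
        r.intensity_rsv = intensity_log2_8th;
        if (r.intensity_rsv > r.total) {
            r.intensity_rsv = 0;
        } else {
            r.total -= r.intensity_rsv;
            if (r.total > 8)
                r.dual_stereo_rsv = 8;
            r.total -= r.dual_stereo_rsv;
        }
    }
    return r;
}
```
]]></artwork></figure>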
| 5306 |
| 5307 <t>The allocation process then computes a vector representing the hard minimum allocation |
| 5308 any band will receive for shape. This minimum is higher than the technical limit of the PVQ |
| 5309 process, but very low rate allocations produce an excessively sparse spectrum and these bands |
| 5310 are better served by having no allocation at all. For each coded band, set thresh[band] to |
| 5311 twenty-four times the number of MDCT bins in the band and divide by 16. If 8 times the number |
| 5312 of channels is greater, use that instead. This sets the minimum allocation to one bit per channel |
| 5313 or 48 128th bits per MDCT bin, whichever is greater. The band-size dependent part of this |
| 5314 value is not scaled by the channel count, because at the very low rates where this limit is |
| 5315 applicable there will usually be no bits allocated to the side.</t> |
| 5316 |
| 5317 <t>The previously decoded allocation trim is used to derive a vector of per-band adjustments, |
| 5318 'trim_offsets[]'. For each coded band take the alloc_trim and subtract 5 and LM. Then multiply |
| 5319 the result by the number of channels, the number of MDCT bins in the shortest frame size for this mode, |
| 5320 the number of remaining bands, 2**LM, and 8. Then divide this value by 64. Finally, if the |
| 5321 number of MDCT bins in the band per channel is only one, 8 times the number of channels is subtracted |
| 5322 in order to diminish the allocation by one bit, because width 1 bands receive greater benefit |
| 5323 from the coarse energy coding.</t> |
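<t>The threshold and trim-offset computations described above can be sketched
as follows. The function names are hypothetical. Two points are assumptions
of this sketch rather than statements of the text: the division by 64 rounds
toward negative infinity, and "remaining bands" is taken to mean the number
of coded bands after the current one.</t>
<figure><artwork><![CDATA[
```c
/* Division rounding toward negative infinity, for b > 0 (assumption). */
static int floor_div(int a, int b)
{
    int q = a / b;
    if (a % b != 0 && a < 0)
        q--;
    return q;
}

/* Minimum shape allocation for a band, in 1/8 bits. */
static int band_thresh(int bins, int channels)
{
    int per_bin = 24 * bins / 16;
    int per_channel = 8 * channels;
    return per_bin > per_channel ? per_bin : per_channel;
}

/* Per-band trim offset, in 1/8 bits.
 * bins_short: MDCT bins of this band at the shortest frame size.
 * bands_remaining: coded bands after this one (assumption). */
static int band_trim_offset(int alloc_trim, int LM, int channels,
                            int bins_short, int bands_remaining)
{
    int bins = bins_short << LM;   /* bins per channel at this frame size */
    int v = (alloc_trim - 5 - LM) * channels * bins_short
            * bands_remaining * (1 << LM) * 8;
    int offset = floor_div(v, 64);

    if (bins == 1)
        offset -= 8 * channels;    /* width 1 bands give up one bit */
    return offset;
}
```
]]></artwork></figure>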
| 5324 |
| 5325 |
| 5326 </section> |
| 5327 |
| 5328 <section anchor="PVQ-decoder" title="Shape Decoding"> |
| 5329 <t> |
| 5330 In each band, the normalized "shape" is encoded |
| 5331 using a vector quantization scheme called a "pyramid vector quantizer". |
| 5332 </t> |
| 5333 |
| 5334 <t>In |
| 5335 the simplest case, the number of bits allocated in |
| 5336 <xref target="allocation"></xref> is converted to a number of pulses as described |
| 5337 by <xref target="bits-pulses"></xref>. Knowing the number of pulses and the |
| 5338 number of samples in the band, the decoder calculates the size of the codebook |
| 5339 as detailed in <xref target="cwrs-decoder"></xref>. The size is used to decode |
| 5340 an unsigned integer (uniform probability model), which is the codeword index. |
| 5341 This index is converted into the corresponding vector as explained in |
| 5342 <xref target="cwrs-decoder"></xref>. This vector is then scaled to unit norm. |
| 5343 </t> |
| 5344 |
| 5345 <section anchor="bits-pulses" title="Bits to Pulses"> |
| 5346 <t> |
| 5347 Although the allocation is performed in 1/8th bit units, the quantization requires |
| 5348 an integer number of pulses K. To do this, the encoder searches for the value |
| 5349 of K that produces the number of bits nearest to the allocated value |
| 5350 (rounding down if exactly halfway between two values), not to exceed |
| 5351 the total number of bits available. For efficiency reasons, the search is performed against a |
| 5352 precomputed allocation table which only permits some K values for each N. The number of |
| 5353 codebook entries can be computed as explained in <xref target="cwrs-decoder"></xref>. The difference |
| 5354 between the number of bits allocated and the number of bits used is accumulated to a |
| 5355 "balance" (initialized to zero) that helps adjust the |
| 5356 allocation for the next bands. One third of the balance is applied to the |
| 5357 bit allocation of each band to help achieve the target allocation. The only |
| 5358 exceptions are the band before the last and the last band, for which half the balance |
| 5359 and the whole balance are applied, respectively. |
| 5360 </t> |
| 5361 </section> |
| 5362 |
| 5363 <section anchor="cwrs-decoder" title="PVQ Decoding"> |
| 5364 |
| 5365 <t> |
| 5366 Decoding of PVQ vectors is implemented in decode_pulses() (cwrs.c). |
| 5367 The unique codeword index is decoded as a uniformly-distributed integer value between 0 and |
| 5368 V(N,K)-1, where V(N,K) is the number of possible combinations of K pulses in |
| 5369 N samples. The index is then converted to a vector in the same way specified in |
| 5370 <xref target="PVQ"></xref>. The indexing is based on the calculation of V(N,K) |
| 5371 (denoted N(L,K) in <xref target="PVQ"></xref>). |
| 5372 </t> |
| 5373 |
| 5374 <t> |
| 5375 The number of combinations can be computed recursively as |
| 5376 V(N,K) = V(N-1,K) + V(N,K-1) + V(N-1,K-1), with V(N,0) = 1 and V(0,K) = 0, K != 0. |
| 5377 There are many different ways to compute V(N,K), including precomputed tables and direct |
| 5378 use of the recursive formulation. The reference implementation applies the recursive |
| 5379 formulation one line (or column) at a time to save on memory use, |
| 5380 along with an alternate, |
| 5381 univariate recurrence to initialize an arbitrary line, and direct |
| 5382 polynomial solutions for small N. All of these methods are |
| 5383 equivalent, and have different trade-offs in speed, memory usage, and |
| 5384 code size. Implementations MAY use any methods they like, as long as |
| 5385 they are equivalent to the mathematical definition. |
| 5386 </t> |
| 5387 |
| 5388 <t> |
| 5389 The decoded vector X is recovered as follows. |
| 5390 Let i be the index decoded with the procedure in <xref target="ec_dec_uint"/> |
| 5391 with ft = V(N,K), so that 0 &lt;= i &lt; V(N,K). |
| 5392 Let k = K. |
| 5393 Then for j = 0 to (N - 1), inclusive, do: |
| 5394 <list style="numbers"> |
| 5395 <t>Let p = (V(N-j-1,k) + V(N-j,k))/2.</t> |
| 5396 <t> |
| 5397 If i &lt; p, then let sgn = 1; otherwise, let sgn = -1 |
| 5398 and set i = i - p. |
| 5399 </t> |
| 5400 <t>Let k0 = k and set p = p - V(N-j-1,k).</t> |
| 5401 <t> |
| 5402 While p > i, set k = k - 1 and |
| 5403 p = p - V(N-j-1,k). |
| 5404 </t> |
| 5405 <t> |
| 5406 Set X[j] = sgn*(k0 - k) and i = i - p. |
| 5407 </t> |
| 5408 </list> |
| 5409 </t> |
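<t>The five steps above can be sketched directly in C. This is a minimal
illustration, not the reference decode_pulses(): the names pvq_v() and
pvq_decode() are hypothetical, and V(N,K) is evaluated by naive recursion,
which is only practical for very small N and K.</t>
<figure><artwork><![CDATA[
```c
#include <stdint.h>

/* V(N,K) by direct use of the recurrence; exponential time, so this is
 * suitable for tiny N and K only. */
static uint32_t pvq_v(int n, int k)
{
    if (k == 0)
        return 1;
    if (n == 0)
        return 0;
    return pvq_v(n - 1, k) + pvq_v(n, k - 1) + pvq_v(n - 1, k - 1);
}

/* Convert codeword index i (0 <= i < V(n,K)) into the pulse vector x. */
static void pvq_decode(uint32_t i, int n, int K, int *x)
{
    int k = K;
    int j;

    for (j = 0; j < n; j++) {
        uint32_t p = (pvq_v(n - j - 1, k) + pvq_v(n - j, k)) / 2;
        int sgn, k0;

        if (i < p) {
            sgn = 1;
        } else {
            sgn = -1;
            i -= p;
        }
        k0 = k;
        p -= pvq_v(n - j - 1, k);
        while (p > i) {
            k--;
            p -= pvq_v(n - j - 1, k);
        }
        x[j] = sgn * (k0 - k);
        i -= p;
    }
}
```
]]></artwork></figure>
<t>For N = 2 and K = 1 the four indices decode to (1,0), (0,1), (0,-1), and
(-1,0). The result still has to be scaled to unit L2 norm as described
below.</t>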
| 5410 |
| 5411 <t> |
| 5412 The decoded vector X is then normalized such that its |
| 5413 L2-norm equals one. |
| 5414 </t> |
| 5415 </section> |
| 5416 |
| 5417 <section anchor="spreading" title="Spreading"> |
| 5418 <t> |
| 5419 The normalized vector decoded in <xref target="cwrs-decoder"/> is then rotated |
| 5420 for the purpose of avoiding tonal artifacts. The rotation gain is equal to |
| 5421 <figure align="center"> |
| 5422 <artwork align="center"><![CDATA[ |
| 5423 g_r = N / (N + f_r*K) |
| 5424 ]]></artwork> |
| 5425 </figure> |
| 5426 |
| 5427 where N is the number of dimensions, K is the number of pulses, and f_r depends on |
| 5428 the value of the "spread" parameter in the bit-stream. |
| 5429 </t> |
| 5430 |
| 5431 <texttable anchor="spread values" title="Spreading Values"> |
| 5432 <ttcol>Spread value</ttcol> |
| 5433 <ttcol>f_r</ttcol> |
| 5434 <c>0</c> <c>infinite (no rotation)</c> |
| 5435 <c>1</c> <c>15</c> |
| 5436 <c>2</c> <c>10</c> |
| 5437 <c>3</c> <c>5</c> |
| 5438 </texttable> |
| 5439 |
| 5440 <t> |
| 5441 The rotation angle is then calculated as |
| 5442 <figure align="center"> |
| 5443 <artwork align="center"><![CDATA[ |
| 5444                  2 |
| 5445          pi * g_r |
| 5446 theta = ---------- |
| 5447             4 |
| 5448 ]]></artwork> |
| 5449 </figure> |
| 5450 A 2-D rotation R(i,j) between points x_i and x_j is defined as: |
| 5451 <figure align="center"> |
| 5452 <artwork align="center"><![CDATA[ |
| 5453 x_i' = cos(theta)*x_i + sin(theta)*x_j |
| 5454 x_j' = -sin(theta)*x_i + cos(theta)*x_j |
| 5455 ]]></artwork> |
| 5456 </figure> |
| 5457 |
| 5458 An N-D rotation is then achieved by applying a series of 2-D rotations back and forth, in the |
| 5459 following order: R(x_1, x_2), R(x_2, x_3), ..., R(x_N-2, x_N-1), R(x_N-1, x_N), |
| 5460 R(x_N-2, x_N-1), ..., R(x_1, x_2). |
| 5461 </t> |
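<t>The rotation angle and the back-and-forth sequence of 2-D rotations can be
sketched as follows. The function names are hypothetical, the sketch follows
the sign convention of the equations above rather than any particular
implementation, and the caller is assumed to pass c = cos(theta) and
s = sin(theta) and to skip the rotation entirely when the spread value is 0.</t>
<figure><artwork><![CDATA[
```c
/* theta = pi * g_r^2 / 4, with g_r = N / (N + f_r*K). */
static double spread_theta(int n, int k, int f_r)
{
    const double pi = 3.14159265358979323846;
    double g_r = (double)n / (double)(n + f_r * k);
    return pi * g_r * g_r / 4.0;
}

/* One 2-D rotation R(x_i, x_j) with c = cos(theta), s = sin(theta). */
static void rotate_pair(double *xi, double *xj, double c, double s)
{
    double a = *xi;
    double b = *xj;
    *xi = c * a + s * b;
    *xj = -s * a + c * b;
}

/* N-D rotation: forward over all adjacent pairs, then back again. */
static void spread_rotate(double *x, int n, double c, double s)
{
    int i;
    for (i = 0; i + 1 < n; i++)   /* R(x_1,x_2) ... R(x_N-1,x_N) */
        rotate_pair(&x[i], &x[i + 1], c, s);
    for (i = n - 3; i >= 0; i--)  /* R(x_N-2,x_N-1) ... R(x_1,x_2) */
        rotate_pair(&x[i], &x[i + 1], c, s);
}
```
]]></artwork></figure>
<t>Because every step is an orthogonal 2-D rotation, the unit norm of the
decoded vector is preserved.</t>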
| 5462 |
| 5463 <t> |
| 5464 If the decoded vector represents more |
| 5465 than one time block, then this spreading process is applied separately on each time block. |
| 5466 Also, if each block represents 8 samples or more, then another N-D rotation, by |
| 5467 (pi/2-theta), is applied <spanx style="emph">before</spanx> the rotation described above. This |
| 5468 extra rotation is applied in an interleaved manner with a stride equal to round(sqrt(N/nb_blocks)), |
| 5469 i.e., it is applied independently for each set of samples S_k = {stride*n + k}, n=0..N/stride-1. |
| 5470 </t> |
| 5471 </section> |
| 5472 |
| 5473 <section anchor="split" title="Split decoding"> |
| 5474 <t> |
| 5475 To avoid the need for multi-precision calculations when decoding PVQ codevectors, |
| 5476 the maximum size allowed for codebooks is 32 bits. When larger codebooks are |
| 5477 needed, the vector is instead split in two sub-vectors of size N/2. |
| 5478 A quantized gain parameter with precision |
| 5479 derived from the current allocation is entropy coded to represent the relative |
| 5480 gains of each side of the split, and the entire decoding process is recursively |
| 5481 applied. Multiple levels of splitting may be applied up to a limit of LM+1 splits. |
| 5482 The same recursive mechanism is applied for the joint coding |
| 5483 of stereo audio. |
| 5484 </t> |
| 5485 |
| 5486 </section> |
| 5487 |
| 5488 <section anchor="tf-change" title="Time-Frequency change"> |
| 5489 <t> |
| 5490 The time-frequency (TF) parameters are used to control the time-frequency resolution tradeoff |
| 5491 in each coded band. For each band, there are two possible TF choices. For the first |
| 5492 band coded, the PDF is {3, 1}/4 for frames marked as transient and {15, 1}/16 for |
| 5493 the other frames. For subsequent bands, the TF choice is coded relative to the |
| 5494 previous TF choice with probability {15, 1}/16 for transient frames and {31, 1}/32 |
| 5495 otherwise. The mapping between the decoded TF choices and the adjustment in TF |
| 5496 resolution is shown in the tables below. |
| 5497 </t> |
| 5498 |
| 5499 <texttable anchor='tf_00' |
| 5500 title="TF Adjustments for Non-transient Frames and tf_select=0"> |
| 5501 <ttcol align='center'>Frame size (ms)</ttcol> |
| 5502 <ttcol align='center'>0</ttcol> |
| 5503 <ttcol align='center'>1</ttcol> |
| 5504 <c>2.5</c> <c>0</c> <c>-1</c> |
| 5505 <c>5</c> <c>0</c> <c>-1</c> |
| 5506 <c>10</c> <c>0</c> <c>-2</c> |
| 5507 <c>20</c> <c>0</c> <c>-2</c> |
| 5508 </texttable> |
| 5509 |
| 5510 <texttable anchor='tf_01' |
| 5511 title="TF Adjustments for Non-transient Frames and tf_select=1"> |
| 5512 <ttcol align='center'>Frame size (ms)</ttcol> |
| 5513 <ttcol align='center'>0</ttcol> |
| 5514 <ttcol align='center'>1</ttcol> |
| 5515 <c>2.5</c> <c>0</c> <c>-1</c> |
| 5516 <c>5</c> <c>0</c> <c>-2</c> |
| 5517 <c>10</c> <c>0</c> <c>-3</c> |
| 5518 <c>20</c> <c>0</c> <c>-3</c> |
| 5519 </texttable> |
| 5520 |
| 5521 |
| 5522 <texttable anchor='tf_10' |
| 5523 title="TF Adjustments for Transient Frames and tf_select=0"> |
| 5524 <ttcol align='center'>Frame size (ms)</ttcol> |
| 5525 <ttcol align='center'>0</ttcol> |
| 5526 <ttcol align='center'>1</ttcol> |
| 5527 <c>2.5</c> <c>0</c> <c>-1</c> |
| 5528 <c>5</c> <c>1</c> <c>0</c> |
| 5529 <c>10</c> <c>2</c> <c>0</c> |
| 5530 <c>20</c> <c>3</c> <c>0</c> |
| 5531 </texttable> |
| 5532 |
| 5533 <texttable anchor='tf_11' |
| 5534 title="TF Adjustments for Transient Frames and tf_select=1"> |
| 5535 <ttcol align='center'>Frame size (ms)</ttcol> |
| 5536 <ttcol align='center'>0</ttcol> |
| 5537 <ttcol align='center'>1</ttcol> |
| 5538 <c>2.5</c> <c>0</c> <c>-1</c> |
| 5539 <c>5</c> <c>1</c> <c>-1</c> |
| 5540 <c>10</c> <c>1</c> <c>-1</c> |
| 5541 <c>20</c> <c>1</c> <c>-1</c> |
| 5542 </texttable> |
| 5543 |
| 5544 <t> |
| 5545 A negative TF adjustment means that the temporal resolution is increased, |
| 5546 while a positive TF adjustment means that the frequency resolution is increased. |
| 5547 Changes in TF resolution are implemented using the Hadamard transform <xref target="Hadamard"/>. To increase |
| 5548 the time resolution by N, N "levels" of the Hadamard transform are applied to the |
| 5549 decoded vector for each interleaved MDCT vector. To increase the frequency resolution |
| 5550 (which assumes a transient frame), N levels of the Hadamard transform are applied |
| 5551 <spanx style="emph">across</spanx> the interleaved MDCT vector. In the case of increased |
| 5552 time resolution the decoder uses the "sequency order" because the input vector |
| 5553 is sorted in time. |
| 5554 </t> |
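<t>The four tables above can be folded into a single lookup. The sketch below
is illustrative: the array layout and the function name are hypothetical,
with LM = 0, 1, 2, 3 standing for the 2.5, 5, 10, and 20 ms frame sizes.</t>
<figure><artwork><![CDATA[
```c
/* Indexed as [LM][is_transient][tf_select][tf_choice]. */
static const signed char tf_adjust_table[4][2][2][2] = {
    /* 2.5 ms */ { { {0, -1}, {0, -1} }, { {0, -1}, {0, -1} } },
    /* 5 ms   */ { { {0, -1}, {0, -2} }, { {1,  0}, {1, -1} } },
    /* 10 ms  */ { { {0, -2}, {0, -3} }, { {2,  0}, {1, -1} } },
    /* 20 ms  */ { { {0, -2}, {0, -3} }, { {3,  0}, {1, -1} } }
};

/* Negative result: more time resolution; positive: more frequency
 * resolution. All arguments are assumed to be already range-checked. */
static int tf_adjustment(int LM, int is_transient, int tf_select,
                         int tf_choice)
{
    return tf_adjust_table[LM][is_transient][tf_select][tf_choice];
}
```
]]></artwork></figure>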
| 5555 </section> |
| 5556 |
| 5557 |
| 5558 </section> |
| 5559 |
| 5560 <section anchor="anti-collapse" title="Anti-Collapse Processing"> |
| 5561 <t> |
| 5562 The anti-collapse feature is designed to avoid the situation where the use of multiple |
| 5563 short MDCTs causes the energy in one or more of the MDCTs to be zero for |
| 5564 some bands, causing unpleasant artifacts. |
| 5565 When the frame has the transient bit set, an anti-collapse bit is decoded. |
| 5566 When anti-collapse is set, the energy in each small MDCT is prevented |
| 5567 from collapsing to zero. For each band of each MDCT where a collapse is |
| 5568 detected, a pseudo-random signal is inserted with an energy corresponding |
| 5569 to the minimum energy over the two previous frames. A renormalization step is |
| 5570 then required to ensure that the anti-collapse step did not alter the |
| 5571 energy preservation property. |
| 5572 </t> |
| 5573 </section> |
| 5574 |
| 5575 <section anchor="denormalization" title="Denormalization"> |
| 5576 <t> |
| 5577 Just as each band was normalized in the encoder, the last step of the decoder before |
| 5578 the inverse MDCT is to denormalize the bands. Each decoded normalized band is |
| 5579 multiplied by the square root of the decoded energy. This is done by denormalise_bands() |
| 5580 (bands.c). |
| 5581 </t> |
| 5582 </section> |
| 5583 |
| 5584 <section anchor="inverse-mdct" title="Inverse MDCT"> |
| 5585 |
| 5586 |
| 5587 <t>The inverse MDCT implementation has no special characteristics. The |
| 5588 input is N frequency-domain samples and the output is 2*N time-domain |
| 5589 samples, while scaling by 1/2. A "low-overlap" window reduces the algorithmic delay. |
| 5590 It is derived from a basic (full overlap) 240-sample version of the window used by the Vorbis codec: |
| 5591 <figure align="center"> |
| 5592 <artwork align="center"><![CDATA[ |
| 5593                                       2 |
| 5594        /   /pi      /pi   n + 1/2\ \ \ |
| 5595 W(n) = |sin|-- * sin|-- * -------| | | . |
| 5596        \   \2       \2       L   / / / |
| 5597 ]]></artwork> |
| 5598 </figure> |
| 5599 The low-overlap window is created by zero-padding the basic window and inserting ones in the |
| 5600 middle, such that the resulting window still satisfies power complementarity <xref target='Princen86'/>. |
| 5601 The IMDCT and |
| 5602 windowing are performed by mdct_backward (mdct.c). |
| 5603 </t> |
| 5604 |
| 5605 <section anchor="post-filter" title="Post-filter"> |
| 5606 <t> |
| 5607 The output of the inverse MDCT (after weighted overlap-add) is sent to the |
| 5608 post-filter. Although the post-filter is applied at the end, the post-filter |
| 5609 parameters are encoded at the beginning, just after the silence flag. |
| 5610 The post-filter can be switched on or off using one bit (logp=1). |
| 5611 If the post-filter is enabled, then the octave is decoded as a uniformly distributed |
| 5612 integer with six possible values (0 to 5). Once the octave is known, the fine pitch |
| 5613 within the octave is decoded using 4+octave raw bits. The final pitch period |
| 5614 is equal to (16&lt;&lt;octave)+fine_pitch-1 so it is bounded between 15 and 1022, |
| 5615 inclusively. Next, the gain is decoded as three raw bits and is equal to |
| 5616 G=3*(int_gain+1)/32. The set of post-filter taps is decoded last, using |
| 5617 a pdf equal to {2, 1, 1}/4. Tapset zero corresponds to the filter coefficients |
| 5618 g0 = 0.3066406250, g1 = 0.2170410156, g2 = 0.1296386719. Tapset one |
| 5619 corresponds to the filter coefficients g0 = 0.4638671875, g1 = 0.2680664062, |
| 5620 g2 = 0, and tapset two uses filter coefficients g0 = 0.7998046875, |
| 5621 g1 = 0.1000976562, g2 = 0. |
| 5622 </t> |
| 5623 |
| 5624 <t> |
| 5625 The post-filter response is thus computed as: |
| 5626 <figure align="center"> |
| 5627 <artwork align="center"> |
| 5628 <![CDATA[ |
| 5629 y(n) = x(n) + G*(g0*y(n-T) + g1*(y(n-T+1)+y(n-T-1)) |
| 5630                            + g2*(y(n-T+2)+y(n-T-2))) |
| 5631 ]]> |
| 5632 </artwork> |
| 5633 </figure> |
| 5634 |
| 5635 During a transition between different gains, a smooth transition is calculated |
| 5636 using the square of the MDCT window. It is important that values of y(n) be |
| 5637 interpolated one at a time such that the past value of y(n) used is interpolated. |
| 5638 </t> |
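<t>The parameter mapping and the steady-state filter can be sketched as
follows. This is a sketch under stated assumptions, not the reference
implementation: the function names are hypothetical, the taps are applied
symmetrically around the pitch lag T, the smooth transition between gains is
omitted, and the caller must provide at least T+2 samples of past output
before y[0].</t>
<figure><artwork><![CDATA[
```c
/* Pitch period from the decoded octave (0..5) and fine pitch. */
static int postfilter_period(int octave, int fine_pitch)
{
    return (16 << octave) + fine_pitch - 1;
}

/* Gain from the three raw bits: G = 3*(int_gain+1)/32. */
static double postfilter_gain(int int_gain)
{
    return 3.0 * (int_gain + 1) / 32.0;
}

/* Filter coefficients {g0, g1, g2} for tapsets 0, 1, and 2. */
static const double postfilter_taps[3][3] = {
    {0.3066406250, 0.2170410156, 0.1296386719},
    {0.4638671875, 0.2680664062, 0.0},
    {0.7998046875, 0.1000976562, 0.0}
};

/* Constant-parameter comb filter. y[-1], y[-2], ... must hold at least
 * T+2 samples of earlier output (the caller's history buffer). */
static void comb_filter(double *y, const double *x, int n, int T,
                        double G, const double g[3])
{
    int i;
    for (i = 0; i < n; i++) {
        y[i] = x[i] + G * (g[0] * y[i - T]
                         + g[1] * (y[i - T + 1] + y[i - T - 1])
                         + g[2] * (y[i - T + 2] + y[i - T - 2]));
    }
}
```
]]></artwork></figure>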
| 5639 </section> |
| 5640 |
| 5641 <section anchor="deemphasis" title="De-emphasis"> |
| 5642 <t> |
| 5643 After the post-filter, |
| 5644 the signal is de-emphasized using the inverse of the pre-emphasis filter |
| 5645 used in the encoder: |
| 5646 <figure align="center"> |
| 5647 <artwork align="center"><![CDATA[ |
| 5648  1            1 |
| 5649 ---- = --------------- , |
| 5650 A(z)                -1 |
| 5651        1 - alpha_p*z |
| 5652 ]]></artwork> |
| 5653 </figure> |
| 5654 where alpha_p=0.8500061035. |
| 5655 </t> |
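<t>In the time domain, the filter 1/A(z) above is the one-pole recursion
y(n) = x(n) + alpha_p*y(n-1). A minimal sketch follows; the function name and
the explicit state variable are hypothetical.</t>
<figure><artwork><![CDATA[
```c
/* De-emphasis: y(n) = x(n) + alpha_p * y(n-1).
 * *mem carries y(n-1) across calls and must start at zero. */
static void deemphasis(const double *x, double *y, int n, double *mem)
{
    const double alpha_p = 0.8500061035;
    int i;
    for (i = 0; i < n; i++) {
        y[i] = x[i] + alpha_p * (*mem);
        *mem = y[i];
    }
}
```
]]></artwork></figure>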
| 5656 </section> |
| 5657 |
| 5658 </section> |
| 5659 |
| 5660 </section> |
| 5661 |
| 5662 <section anchor="Packet Loss Concealment" title="Packet Loss Concealment (PLC)"> |
| 5663 <t> |
| 5664 Packet loss concealment (PLC) is an optional decoder-side feature that |
| 5665 SHOULD be included when receiving from an unreliable channel. Because |
| 5666 PLC is not part of the bitstream, there are many acceptable ways to |
| 5667 implement PLC with different complexity/quality trade-offs. |
| 5668 </t> |
| 5669 |
| 5670 <t> |
| 5671 The PLC in |
| 5672 the reference implementation depends on the mode of the last packet received. |
| 5673 In CELT mode, the PLC finds a periodicity in the decoded |
| 5674 signal and repeats the windowed waveform using the pitch offset. The windowed |
| 5675 waveform is overlapped in such a way as to preserve the time-domain aliasing |
| 5676 cancellation with the previous frame and the next frame. This is implemented |
| 5677 in celt_decode_lost() (celt.c). In SILK mode, the PLC uses LPC extrapolation |
| 5678 from the previous frame, implemented in silk_PLC() (PLC.c). |
| 5679 </t> |
| 5680 |
| 5681 <section anchor="clock-drift" title="Clock Drift Compensation"> |
| 5682 <t> |
| 5683 Clock drift refers to the gradual desynchronization of two endpoints |
| 5684 whose sample clocks run at different frequencies while they are streaming |
| 5685 live audio. Differences in clock frequencies are generally attributable to |
| 5686 manufacturing variation in the endpoints' clock hardware. For long-lived |
| 5687 streams, the time difference between sender and receiver can grow without |
| 5688 bound. |
| 5689 </t> |
| 5690 |
| 5691 <t> |
| 5692 When the sender's clock runs slower than the receiver's, the effect is similar |
| 5693 to packet loss: too few packets are received. The receiver can distinguish |
| 5694 between drift and loss if the transport provides packet timestamps. A receiver |
| 5695 for live streams SHOULD conceal the effects of drift, and MAY do so by invoking |
| 5696 the PLC. |
| 5697 </t> |
| 5698 |
| 5699 <t> |
| 5700 When the sender's clock runs faster than the receiver's, too many packets will |
| 5701 be received. The receiver MAY respond by skipping any packet (i.e., not |
| 5702 submitting the packet for decoding). This is likely to produce a less severe |
| 5703 artifact than if the frame were dropped after decoding. |
| 5704 </t> |
| 5705 |
| 5706 <t> |
| 5707 A decoder MAY employ a more sophisticated drift compensation method. For |
| 5708 example, the |
| 5709 <xref target='Google-NetEQ'>NetEQ component</xref> |
| 5710 of the |
| 5711 <xref target='Google-WebRTC'>Google WebRTC codebase</xref> |
| 5712 compensates for drift by adding or removing |
| 5713 one period when the signal is highly periodic. The reference implementation of |
| 5714 Opus allows a caller to learn whether the current frame's signal is highly |
| 5715 periodic, and if so what the period is, using the OPUS_GET_PITCH() request. |
| 5716 </t> |
| 5717 </section> |
| 5718 |
| 5719 </section> |
| 5720 |
| 5721 <section anchor="switching" title="Configuration Switching"> |
| 5722 |
| 5723 <t> |
| 5724 Switching between the Opus coding modes, audio bandwidths, and channel counts |
| 5725 requires careful consideration to avoid audible glitches. |
| 5726 Switching between any two configurations of the CELT-only mode, any two |
| 5727 configurations of the Hybrid mode, or from WB SILK to Hybrid mode does not |
| 5728 require any special treatment in the decoder, as the MDCT overlap will smooth |
| 5729 the transition. |
| 5730 Switching from Hybrid mode to WB SILK requires adding in the final contents |
| 5731 of the CELT overlap buffer to the first SILK-only packet. |
| 5732 This can be done by decoding a 2.5 ms silence frame with the CELT decoder |
| 5733 using the channel count of the SILK-only packet (and any choice of audio |
| 5734 bandwidth), which will correctly handle the cases when the channel count |
| 5735 changes as well. |
| 5736 </t> |
| 5737 |
| 5738 <t> |
| 5739 When changing the channel count for SILK-only or Hybrid packets, the encoder |
| 5740 can avoid glitches by smoothly varying the stereo width of the input signal |
| 5741 before or after the transition, and SHOULD do so. |
| 5742 However, other transitions between SILK-only packets or between NB or MB SILK |
| 5743 and Hybrid packets may cause glitches, because neither the LSF coefficients |
| 5744 nor the LTP, LPC, stereo unmixing, and resampler buffers are available at the |
| 5745 new sample rate. |
| 5746 These switches SHOULD be delayed by the encoder until quiet periods or |
| 5747 transients, where the inevitable glitches will be less audible. Additionally, |
| 5748 the bit-stream MAY include redundant side information ("redundancy"), in the |
| 5749 form of additional CELT frames embedded in each of the Opus frames around the |
| 5750 transition. |
| 5751 </t> |
| 5752 |
| 5753 <t> |
| 5754 The other transitions that cannot be easily handled are those where the lower |
| 5755 frequencies switch between the SILK LP-based model and the CELT MDCT model. |
| 5756 However, an encoder may not have an opportunity to delay such a switch to a |
| 5757 convenient point. |
| 5758 For example, if the content switches from speech to music, and the encoder does |
| 5759 not have enough latency in its analysis to detect this in advance, there may |
| 5760 be no convenient silence period during which to make the transition for quite |
| 5761 some time. |
| 5762 To avoid or reduce glitches during these problematic mode transitions, and |
| 5763 also between audio bandwidth changes in the SILK-only modes, transitions MAY |
| 5764 include redundant side information ("redundancy"), in the form of an |
| 5765 additional CELT frame embedded in the Opus frame. |
| 5766 </t> |
| 5767 |
| 5768 <t> |
| 5769 A transition between coding the lower frequencies with the LP model and the |
| 5770 MDCT model or a transition that involves changing the SILK bandwidth |
| 5771 is only normatively specified when it includes redundancy. |
| 5772 For those without redundancy, it is RECOMMENDED that the decoder use a |
| 5773 concealment technique (e.g., make use of a PLC algorithm) to "fill in" the |
| 5774 gap or discontinuity caused by the mode transition. |
| 5775 Therefore, PLC MUST NOT be applied during any normative transition, i.e., when |
| 5776 <list style="symbols"> |
| 5777 <t>A packet includes redundancy for this transition (as described below),</t> |
| 5778 <t>The transition is between any WB SILK packet and any Hybrid packet, or vice |
| 5779 versa,</t> |
| 5780 <t>The transition is between any two Hybrid mode packets, or</t> |
| 5781 <t>The transition is between any two CELT mode packets,</t> |
| 5782 </list> |
| 5783 unless there is actual packet loss. |
| 5784 </t> |
| 5785 |
| 5786 <section anchor="side-info" title="Transition Side Information (Redundancy)"> |
| 5787 <t> |
| 5788 Transitions with side information include an extra 5 ms "redundant" CELT |
| 5789 frame within the Opus frame. |
| 5790 This frame is designed to fill in the gap or discontinuity in the different |
| 5791 layers without requiring the decoder to conceal it. |
| 5792 For transitions from CELT-only to SILK-only or Hybrid, the redundant frame is |
| 5793 inserted in the first Opus frame after the transition (i.e., the first |
| 5794 SILK-only or Hybrid frame). |
| 5795 For transitions from SILK-only or Hybrid to CELT-only, the redundant frame is |
| 5796 inserted in the last Opus frame before the transition (i.e., the last |
| 5797 SILK-only or Hybrid frame). |
| 5798 </t> |
| 5799 |
| 5800 <section anchor="opus_redundancy_flag" title="Redundancy Flag"> |
| 5801 <t> |
| 5802 The presence of redundancy is signaled in all SILK-only and Hybrid frames, not |
| 5803 just those involved in a mode transition. |
| 5804 This allows the frames to be decoded correctly even if an adjacent frame is |
| 5805 lost. |
| 5806 For SILK-only frames, this signaling is implicit, based on the size of the |
| 5807 Opus frame and the number of bits consumed decoding the SILK portion of |
| 5808 it. |
| 5809 After decoding the SILK portion of the Opus frame, the decoder uses ec_tell() |
| 5810 (see <xref target="ec_tell"/>) to check if there are at least 17 bits |
| 5811 remaining. |
| 5812 If so, then the frame contains redundancy. |
| 5813 </t> |
| 5814 |
| 5815 <t> |
| 5816 For Hybrid frames, this signaling is explicit. |
| 5817 After decoding the SILK portion of the Opus frame, the decoder uses ec_tell() |
| 5818 (see <xref target="ec_tell"/>) to ensure there are at least 37 bits remaining. |
| 5819 If so, it reads a symbol with the PDF in |
| 5820 <xref target="opus_redundancy_flag_pdf"/>, and if the value is 1, then the |
| 5821 frame contains redundancy. |
| 5822 Otherwise (if there were fewer than 37 bits left or the value was 0), the frame |
| 5823 does not contain redundancy. |
| 5824 </t> |
| 5825 |
| 5826 <texttable anchor="opus_redundancy_flag_pdf" title="Redundancy Flag PDF"> |
| 5827 <ttcol>PDF</ttcol> |
| 5828 <c>{4095, 1}/4096</c> |
| 5829 </texttable> |
| 5830 </section> |
| 5831 |
| 5832 <section anchor="opus_redundancy_pos" title="Redundancy Position Flag"> |
| 5833 <t> |
| 5834 Since the current frame is a SILK-only or a Hybrid frame, it must be at least |
| 5835 10 ms. |
| 5836 Therefore, it needs an additional flag to indicate whether the redundant |
| 5837 5 ms CELT frame should be mixed into the beginning of the current frame, |
| 5838 or the end. |
| 5839 After determining that a frame contains redundancy, the decoder reads a |
| 5840 1 bit symbol with a uniform PDF |
| 5841 (<xref target="opus_redundancy_pos_pdf"/>). |
| 5842 </t> |
| 5843 |
| 5844 <texttable anchor="opus_redundancy_pos_pdf" title="Redundancy Position PDF"> |
| 5845 <ttcol>PDF</ttcol> |
| 5846 <c>{1, 1}/2</c> |
| 5847 </texttable> |
| 5848 |
| 5849 <t> |
| 5850 If the value is zero, this is the first frame in the transition, and the |
| 5851 redundancy belongs at the end. |
| 5852 If the value is one, this is the second frame in the transition, and the |
| 5853 redundancy belongs at the beginning. |
| 5854 There is no way to specify that an Opus frame contains separate redundant CELT |
| 5855 frames at both the beginning and the end. |
| 5856 </t> |
| 5857 </section> |
| 5858 |
| 5859 <section anchor="opus_redundancy_size" title="Redundancy Size"> |
| 5860 <t> |
| 5861 Unlike the CELT portion of a Hybrid frame, the redundant CELT frame does not |
| 5862 use the same entropy coder state as the rest of the Opus frame, because this |
| 5863 would break the CELT bit allocation mechanism in Hybrid frames. |
| 5864 Thus, a redundant CELT frame always starts and ends on a byte boundary, even in |
| 5865 SILK-only frames, where this is not strictly necessary. |
| 5866 </t> |
| 5867 |
| 5868 <t> |
| 5869 For SILK-only frames, the number of bytes in the redundant CELT frame is simply |
| 5870 the number of whole bytes remaining, which must be at least 2, due to the |
| 5871 space check in <xref target="opus_redundancy_flag"/>. |
| 5872 For Hybrid frames, the number of bytes is equal to 2, plus a decoded unsigned |
| 5873 integer less than 256 (see <xref target="ec_dec_uint"/>). |
| 5874 This may be more than the number of whole bytes remaining in the Opus frame, |
| 5875 in which case the frame is invalid. |
| 5876 However, a decoder is not required to ignore the entire frame, as this may be |
| 5877 the result of a bit error that desynchronized the range coder. |
| 5878 There may still be useful data before the error, and a decoder MAY keep any |
| 5879 audio decoded so far instead of invoking the PLC, but it is RECOMMENDED that |
| 5880 the decoder stop decoding and discard the rest of the current Opus frame. |
| 5881 </t> |
| 5882 |
| 5883 <t> |
| 5884 It would have been possible to avoid these invalid states in the design of Opus |
| 5885 by limiting the range of the explicit length decoded from Hybrid frames by the |
| 5886 actual number of whole bytes remaining. |
| 5887 However, this would require an encoder to determine the rate allocation for the |
| 5888 MDCT layer up front, before it began encoding that layer. |
| 5889 By allowing some invalid sizes, the encoder is able to defer that decision |
| 5890 until much later. |
| 5891 When encoding Hybrid frames which do not include redundancy, the encoder must |
| 5892 still decide up front whether it wishes to use the minimum 37 bits required to |
| 5893 trigger encoding of the redundancy flag, but this is a much looser |
| 5894 restriction. |
| 5895 </t> |
| 5896 |
| 5897 <t> |
| 5898 After determining the size of the redundant CELT frame, the decoder reduces |
| 5899 the size of the buffer currently in use by the range coder by that amount. |
| 5900 The CELT layer reads any raw bits from the end of this reduced buffer, and all |
| 5901 calculations of the number of bits remaining in the buffer must be done using |
| 5902 this new, reduced size, rather than the original size of the Opus frame. |
| 5903 </t> |
| 5904 </section> |
| 5905 |
| 5906 <section anchor="opus_redundancy_decoding" title="Decoding the Redundancy"> |
| 5907 <t> |
| 5908 The redundant frame is decoded like any other CELT-only frame, with the |
| 5909 exception that it does not contain a TOC byte. |
| 5910 The frame size is fixed at 5 ms, the channel count is set to that of the |
| 5911 current frame, and the audio bandwidth is also set to that of the current |
| 5912 frame, with the exception that for MB SILK frames, it is set to WB. |
| 5913 </t> |
| 5914 |
| 5915 <t> |
| 5916 If the redundancy belongs at the beginning (in a CELT-only to SILK-only or |
| 5917 Hybrid transition), the final reconstructed output uses the first 2.5 ms |
| 5918 of audio output by the decoder for the redundant frame as-is, discarding |
| 5919 the corresponding output from the SILK-only or Hybrid portion of the frame. |
| 5920 The remaining 2.5 ms is cross-lapped with the decoded SILK/Hybrid signal |
| 5921 using the CELT's power-complementary MDCT window to ensure a smooth |
| 5922 transition. |
| 5923 </t> |
| 5924 |
| 5925 <t> |
| 5926 If the redundancy belongs at the end (in a SILK-only or Hybrid to CELT-only |
| 5927 transition), only the second half (2.5 ms) of the audio output by the |
| 5928 decoder for the redundant frame is used. |
| 5929 In that case, the second half of the redundant frame is cross-lapped with the |
| 5930 end of the SILK/Hybrid signal, again using CELT's power-complementary MDCT |
| 5931 window to ensure a smooth transition. |
| 5932 </t> |
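<t>
The cross-lap described above can be sketched as follows.
This Python fragment is a floating-point illustration under stated
assumptions, not the reference implementation: it assumes a Vorbis-style
power-complementary window and uses the squared window as the gain of the
incoming signal so that the two gains sum to one.
The function names are hypothetical, and 120 samples corresponds to 2.5 ms at
a 48 kHz sampling rate.
</t>
<figure align="center">
<artwork><![CDATA[
```python
import math


def power_complementary_window(length):
    """Assumed window shape: w[n]**2 + w[length-1-n]**2 == 1."""
    return [math.sin(0.5 * math.pi *
                     math.sin(0.5 * math.pi * (n + 0.5) / length) ** 2)
            for n in range(length)]


def cross_lap(fade_out, fade_in, window):
    """Blend two equal-length segments, moving from fade_out to fade_in."""
    out = []
    for a, b, w in zip(fade_out, fade_in, window):
        g = w * w  # gain of the incoming signal, rising from 0 to 1
        out.append((1.0 - g) * a + g * b)
    return out


# Redundancy at the beginning: fade from the redundant CELT audio into the
# SILK/Hybrid audio over 2.5 ms. Constant inputs make the result easy to see.
window = power_complementary_window(120)
redundant_tail = [1.0] * 120
silk_audio = [0.0] * 120
blended = cross_lap(redundant_tail, silk_audio, window)
```
]]></artwork>
</figure>
<t>
For redundancy at the end of a frame, the same routine would be called with
the SILK/Hybrid signal as fade_out and the second half of the redundant frame
as fade_in.
</t>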
| 5933 </section> |
| 5934 |
| 5935 </section> |
| 5936 |
| 5937 <section anchor="decoder-reset" title="State Reset"> |
| 5938 <t> |
| 5939 When a transition occurs, the state of the SILK or the CELT decoder (or both) |
| 5940 may need to be reset before decoding a frame in the new mode. |
| 5941 This avoids reusing "out of date" memory, which may not have been updated in |
| 5942 some time or may not be in a well-defined state due to, e.g., PLC. |
| 5943 The SILK state is reset before every SILK-only or Hybrid frame where the |
| 5944 previous frame was CELT-only. |
| 5945 The CELT state is reset every time the operating mode changes and the new mode |
| 5946 is either Hybrid or CELT-only, except when the transition uses redundancy as |
| 5947 described above. |
| 5948 When switching from SILK-only or Hybrid to CELT-only with redundancy, the CELT |
| 5949 state is reset before decoding the redundant CELT frame embedded in the |
| 5950 SILK-only or Hybrid frame, but it is not reset before decoding the following |
| 5951 CELT-only frame. |
| 5952 When switching from CELT-only mode to SILK-only or Hybrid mode with redundancy, |
| 5953 the CELT decoder is not reset for decoding the redundant CELT frame. |
| 5954 </t> |
| 5955 </section> |
| 5956 |
| 5957 <section title="Summary of Transitions"> |
| 5958 |
| 5959 <t> |
| 5960 <xref target="normative_transitions"/> illustrates all of the normative |
| 5961 transitions involving a mode change, an audio bandwidth change, or both. |
| 5962 Each one uses an S, H, or C to represent an Opus frame in the corresponding |
| 5963 mode. |
| 5964 In addition, an R indicates the presence of redundancy in the Opus frame it is |
| 5965 cross-lapped with. |
| 5966 Its location in the first or last 5 ms is assumed to correspond to whether |
| 5967 it is the frame before or after the transition. |
| 5968 Other uses of redundancy are non-normative. |
| 5969 Finally, a c indicates the contents of the CELT overlap buffer after the |
| 5970 previously decoded frame (i.e., as extracted by decoding a silence frame). |
| 5971 <figure align="center" anchor="normative_transitions" |
| 5972 title="Normative Transitions"> |
| 5973 <artwork align="center"><![CDATA[ |
| 5974 SILK to SILK with Redundancy: S -> S -> S |
| 5975 & |
| 5976 !R -> R |
| 5977 & |
| 5978 ;S -> S -> S |
| 5979 |
| 5980 NB or MB SILK to Hybrid with Redundancy: S -> S -> S |
| 5981 & |
| 5982 !R ->;H -> H -> H |
| 5983 |
| 5984 WB SILK to Hybrid: S -> S -> S ->!H -> H -> H |
| 5985 |
| 5986 SILK to CELT with Redundancy: S -> S -> S |
| 5987 & |
| 5988 !R -> C -> C -> C |
| 5989 |
| 5990 Hybrid to NB or MB SILK with Redundancy: H -> H -> H |
| 5991 & |
| 5992 !R -> R |
| 5993 & |
| 5994 ;S -> S -> S |
| 5995 |
| 5996 Hybrid to WB SILK: H -> H -> H -> c |
| 5997 \ + |
| 5998 > S -> S -> S |
| 5999 |
| 6000 Hybrid to CELT with Redundancy: H -> H -> H |
| 6001 & |
| 6002 !R -> C -> C -> C |
| 6003 |
| 6004 CELT to SILK with Redundancy: C -> C -> C -> R |
| 6005 & |
| 6006 ;S -> S -> S |
| 6007 |
| 6008 CELT to Hybrid with Redundancy: C -> C -> C -> R |
| 6009 & |
| 6010 |H -> H -> H |
| 6011 |
| 6012 Key: |
| 6013 S SILK-only frame ; SILK decoder reset |
| 6014 H Hybrid frame | CELT and SILK decoder resets |
| 6015 C CELT-only frame ! CELT decoder reset |
| 6016 c CELT overlap + Direct mixing |
| 6017 R Redundant CELT frame & Windowed cross-lap |
| 6018 ]]></artwork> |
| 6019 </figure> |
| 6020 The first two and the last two Opus frames in each example are illustrative, |
| 6021 i.e., there is no requirement that a stream remain in the same configuration |
| 6022 for three consecutive frames before or after a switch. |
| 6023 </t> |
| 6024 |
| 6025 <t> |
| 6026 The behavior of transitions without redundancy where PLC is allowed is non-normative. |
| 6027 An encoder might still wish to use these transitions if, for example, it |
| 6028 doesn't want to add the extra bitrate required for redundancy or if it makes |
| 6029 a decision to switch after it has already transmitted the frame that would |
| 6030 have had to contain the redundancy. |
| 6031 <xref target="nonnormative_transitions"/> illustrates the recommended |
| 6032 cross-lapping and decoder resets for these transitions. |
| 6033 <figure align="center" anchor="nonnormative_transitions" |
| 6034 title="Recommended Non-Normative Transitions"> |
| 6035 <artwork align="center"><![CDATA[ |
| 6036 SILK to SILK (audio bandwidth change): S -> S -> S ;S -> S -> S |
| 6037 |
| 6038 NB or MB SILK to Hybrid: S -> S -> S |H -> H -> H |
| 6039 |
| 6040 SILK to CELT without Redundancy: S -> S -> S -> P |
| 6041 & |
| 6042 !C -> C -> C |
| 6043 |
| 6044 Hybrid to NB or MB SILK: H -> H -> H -> c |
| 6045 + |
| 6046 ;S -> S -> S |
| 6047 |
| 6048 Hybrid to CELT without Redundancy: H -> H -> H -> P |
| 6049 & |
| 6050 !C -> C -> C |
| 6051 |
| 6052 CELT to SILK without Redundancy: C -> C -> C -> P |
| 6053 & |
| 6054 ;S -> S -> S |
| 6055 |
| 6056 CELT to Hybrid without Redundancy: C -> C -> C -> P |
| 6057 & |
| 6058 |H -> H -> H |
| 6059 |
| 6060 Key: |
| 6061 S SILK-only frame ; SILK decoder reset |
| 6062 H Hybrid frame | CELT and SILK decoder resets |
| 6063 C CELT-only frame ! CELT decoder reset |
| 6064 c CELT overlap + Direct mixing |
| 6065 P Packet Loss Concealment & Windowed cross-lap |
| 6066 ]]></artwork> |
| 6067 </figure> |
| 6068 Encoders SHOULD NOT use other transitions, e.g., those that involve redundancy |
| 6069 in ways not illustrated in <xref target="normative_transitions"/>. |
| 6070 </t> |
| 6071 |
| 6072 </section> |
| 6073 |
| 6074 </section> |
| 6075 |
| 6076 </section> |
| 6077 |
| 6078 |
| 6079 <!-- ******************************************************************* --> |
| 6080 <!-- ************************** OPUS ENCODER *********************** --> |
| 6081 <!-- ******************************************************************* --> |
| 6082 |
| 6083 <section title="Opus Encoder"> |
| 6084 <t> |
| 6085 Just like the decoder, the Opus encoder also normally consists of two main blocks: the |
| 6086 SILK encoder and the CELT encoder. However, unlike the case of the decoder, a valid |
| 6087 (though potentially suboptimal) Opus encoder is not required to support all modes and |
| 6088 may thus only include a SILK encoder module or a CELT encoder module. |
| 6089 The output bit-stream of the Opus encoding contains bits from the SILK and CELT |
| 6090 encoders, though these are not separable due to the use of a range coder. |
| 6091 A block diagram of the encoder is illustrated below. |
| 6092 |
| 6093 <figure align="center" anchor="opus-encoder-figure" title="Opus Encoder"> |
| 6094 <artwork> |
| 6095 <![CDATA[ |
| 6096 +------------+ +---------+ |
| 6097 | Sample | | SILK |------+ |
| 6098 +->| Rate |--->| Encoder | V |
| 6099 +-----------+ | | Conversion | | | +---------+ |
| 6100 | Optional | | +------------+ +---------+ | Range | |
| 6101 ->| High-pass |--+ | Encoder |----> |
| 6102 | Filter | | +--------------+ +---------+ | | Bit- |
| 6103 +-----------+ | | Delay | | CELT | +---------+ stream |
| 6104 +->| Compensation |->| Encoder | ^ |
| 6105 | | | |------+ |
| 6106 +--------------+ +---------+ |
| 6107 ]]> |
| 6108 </artwork> |
| 6109 </figure> |
| 6110 </t> |
| 6111 |
| 6112 <t> |
| 6113 For a normal encoder where both the SILK and the CELT modules are included, an optimal |
| 6114 encoder should select which coding mode to use at run-time depending on the conditions. |
| 6115 In the reference implementation, the frame size is selected by the application, but the |
| 6116 other configuration parameters (number of channels, bandwidth, mode) are automatically |
| 6117 selected (unless explicitly overridden by the application) depending on the following: |
| 6118 <list style="symbols"> |
| 6119 <t>Requested bitrate</t> |
| 6120 <t>Input sampling rate</t> |
| 6121 <t>Type of signal (speech vs music)</t> |
| 6122 <t>Frame size in use</t> |
| 6123 </list> |
| 6124 |
| 6125 The type of signal currently needs to be provided by the application (though it can be |
| 6126 changed in real-time). An Opus encoder implementation could also do automatic detection, |
| 6127 but since Opus is an interactive codec, such an implementation would likely have to either |
| 6128 delay the signal (for non-interactive applications) or delay the mode switching decisions (for |
| 6129 interactive applications). |
| 6130 </t> |
| 6131 |
| 6132 <t> |
| 6133 When the encoder is configured for voice over IP applications, the input signal is |
| 6134 filtered by a high-pass filter to remove the lowest part of the spectrum |
| 6135 that contains little speech energy and may contain background noise. This is a second-order |
| 6136 Auto-Regressive Moving Average (i.e., with poles and zeros) filter with a cut-off frequency around 50 Hz. |
| 6137 In the future, a music detector may also be used to lower the cut-off frequency when the |
| 6138 input signal is detected to be music rather than speech. |
| 6139 </t> |
| 6140 |
| 6141 <section anchor="range-encoder" title="Range Encoder"> |
| 6142 <t> |
| 6143 The range coder acts as the bit-packer for Opus. |
| 6144 It is used in three different ways: to encode |
| 6145 <list style="symbols"> |
| 6146 <t> |
| 6147 Entropy-coded symbols with a fixed probability model using ec_encode() |
| 6148 (entenc.c), |
| 6149 </t> |
| 6150 <t> |
| 6151 Integers from 0 to (2**M - 1) using ec_enc_uint() or ec_enc_bits() |
| 6152 (entenc.c),</t> |
| 6153 <t> |
| 6154 Integers from 0 to (ft - 1) (where ft is not a power of two) using |
| 6155 ec_enc_uint() (entenc.c). |
| 6156 </t> |
| 6157 </list> |
| 6158 </t> |
| 6159 |
| 6160 <t> |
| 6161 The range encoder maintains an internal state vector composed of the four-tuple |
| 6162 (val, rng, rem, ext) representing the low end of the current |
| 6163 range, the size of the current range, a single buffered output byte, and a |
| 6164 count of additional carry-propagating output bytes. |
| 6165 Both val and rng are 32-bit unsigned integer values, rem is either a byte value |
| 6166 less than 255 or the special value -1, and ext is an unsigned integer with at |
| 6167 least 11 bits. |
| 6168 This state vector is initialized at the start of each frame to the value |
| 6169 (0, 2**31, -1, 0). |
| 6170 After encoding a sequence of symbols, the value of rng in the encoder should |
| 6171 exactly match the value of rng in the decoder after decoding the same sequence |
| 6172 of symbols. |
| 6173 This is a powerful tool for detecting errors in either an encoder or decoder |
| 6174 implementation. |
| 6175 The value of val, on the other hand, represents different things in the encoder |
| 6176 and decoder, and is not expected to match. |
| 6177 </t> |
| 6178 |
| 6179 <t> |
| 6180 The decoder has no analog for rem and ext. |
| 6181 These are used to perform carry propagation in the renormalization loop below. |
| 6182 Each iteration of this loop produces 9 bits of output, consisting of 8 data |
| 6183 bits and a carry flag. |
| 6184 The encoder cannot determine the final value of the output bytes until it |
| 6185 propagates these carry flags. |
| 6186 Therefore the reference implementation buffers a single non-propagating output |
| 6187 byte (i.e., one less than 255) in rem and keeps a count of additional |
| 6188 propagating (i.e., 255) output bytes in ext. |
| 6189 An implementation may choose to use any mathematically equivalent scheme to |
| 6190 perform carry propagation. |
| 6191 </t> |
| 6192 |
| 6193 <section anchor="encoding-symbols" title="Encoding Symbols"> |
| 6194 <t> |
| 6195 The main encoding function is ec_encode() (entenc.c), which encodes symbol k in |
| 6196 the current context using the same three-tuple (fl[k], fh[k], ft) |
| 6197 as the decoder to describe the range of the symbol (see |
| 6198 <xref target="range-decoder"/>). |
| 6199 </t> |
| 6200 <t> |
| 6201 ec_encode() updates the state of the encoder as follows. |
| 6202 If fl[k] is greater than zero, then |
| 6203 <figure align="center"> |
| 6204 <artwork align="center"><![CDATA[ |
| 6205                   rng |
| 6206 val = val + rng - --- * (ft - fl) , |
| 6207                   ft |
| 6208 |
| 6209       rng |
| 6210 rng = --- * (fh - fl) . |
| 6211       ft |
| 6212 ]]></artwork> |
| 6213 </figure> |
| 6214 Otherwise, val is unchanged and |
| 6215 <figure align="center"> |
| 6216 <artwork align="center"><![CDATA[ |
| 6217             rng |
| 6218 rng = rng - --- * (ft - fh) . |
| 6219             ft |
| 6220 ]]></artwork> |
| 6221 </figure> |
| 6222 The divisions here are integer division. |
| 6223 </t> |
| 6224 |
| 6225 <section anchor="range-encoder-renorm" title="Renormalization"> |
| 6226 <t> |
| 6227 After this update, the range is normalized using a procedure very similar to |
| 6228 that of <xref target="range-decoder-renorm"/>, implemented by |
| 6229 ec_enc_normalize() (entenc.c). |
| 6230 The following process is repeated until rng > 2**23. |
| 6231 First, the top 9 bits of val, (val>>23), are sent to the carry buffer, |
| 6232 described in <xref target="ec_enc_carry_out"/>. |
| 6233 Then, the encoder sets |
| 6234 <figure align="center"> |
| 6235 <artwork align="center"><![CDATA[ |
| 6236 val = (val<<8) & 0x7FFFFFFF , |
| 6237 |
| 6238 rng = rng<<8 . |
| 6239 ]]></artwork> |
| 6240 </figure> |
| 6241 </t> |
| 6242 </section> |
| 6243 |
| 6244 <section anchor="ec_enc_carry_out" |
| 6245 title="Carry Propagation and Output Buffering"> |
| 6246 <t> |
| 6247 The function ec_enc_carry_out() (entenc.c) implements carry propagation and |
| 6248 output buffering. |
| 6249 It takes as input a 9-bit value, c, consisting of 8 data bits and an additional |
| 6250 carry bit. |
| 6251 If c is equal to the value 255, then ext is simply incremented, and no other |
| 6252 state updates are performed. |
| 6253 Otherwise, let b = (c>>8) be the carry bit. |
| 6254 Then, |
| 6255 <list style="symbols"> |
| 6256 <t> |
| 6257 If the buffered byte rem contains a value other than -1, the encoder outputs |
| 6258 the byte (rem + b). |
| 6259 Otherwise, if rem is -1, no byte is output. |
| 6260 </t> |
| 6261 <t> |
| 6262 If ext is non-zero, then the encoder outputs ext bytes---all with a value of 0 |
| 6263 if b is set, or 255 if b is unset---and sets ext to 0. |
| 6264 </t> |
| 6265 <t> |
| 6266 rem is set to the 8 data bits: |
| 6267 <figure align="center"> |
| 6268 <artwork align="center"><![CDATA[ |
| 6269 rem = c & 255 . |
| 6270 ]]></artwork> |
| 6271 </figure> |
| 6272 </t> |
| 6273 </list> |
| 6274 </t> |
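<t>
The symbol update, renormalization, and carry propagation steps can be tied
together in a compact sketch.
The Python class below is an illustration written for this description, not a
transcription of entenc.c: the class and method names are assumptions, output
goes to an in-memory byte array, and no buffer-size checks are made.
When fl is zero, it shrinks rng by (rng/ft)*(ft - fh), the same amount the
decoder uses, so that rng stays synchronized between the two.
</t>
<figure align="center">
<artwork><![CDATA[
```python
class RangeEncoder:
    """Toy encoder holding the state (val, rng, rem, ext)."""

    def __init__(self):
        self.val, self.rng, self.rem, self.ext = 0, 1 << 31, -1, 0
        self.out = bytearray()

    def carry_out(self, c):
        """Accept 9 bits: 8 data bits plus a carry flag in bit 8."""
        if c == 255:
            self.ext += 1  # a later carry could still change this byte
            return
        b = c >> 8
        if self.rem != -1:
            self.out.append(self.rem + b)
        if self.ext > 0:
            # Deferred 255s wrap to 0 on a carry and stay 255 otherwise.
            self.out.extend([(255 + b) & 255] * self.ext)
            self.ext = 0
        self.rem = c & 255

    def normalize(self):
        while self.rng <= 1 << 23:
            self.carry_out(self.val >> 23)  # top 9 bits of val
            self.val = (self.val << 8) & 0x7FFFFFFF
            self.rng <<= 8

    def encode(self, fl, fh, ft):
        r = self.rng // ft  # integer division
        if fl > 0:
            self.val += self.rng - r * (ft - fl)
            self.rng = r * (fh - fl)
        else:
            self.rng -= r * (ft - fh)
        self.normalize()
```
]]></artwork>
</figure>
<t>
For example, feeding carry_out() the values 0x0FE, 0x0FF, and 0x100 in turn
buffers 254, defers one byte of 255, and then emits the bytes 255 and 0 once
the carry arrives.
</t>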
| 6275 </section> |
| 6276 |
| 6277 </section> |
| 6278 |
| 6279 <section anchor="encoding-alternate" title="Alternate Encoding Methods"> |
| 6280 <t> |
| 6281 The reference implementation uses three additional encoding methods that are |
| 6282 exactly equivalent to the above, but make assumptions and simplifications that |
| 6283 allow for a more efficient implementation. |
| 6284 </t> |
| 6285 |
| 6286 <section anchor="ec_encode_bin" title="ec_encode_bin()"> |
| 6287 <t> |
| 6288 The first is ec_encode_bin() (entenc.c), defined using the parameter ftb |
| 6289 instead of ft. |
| 6290 It is mathematically equivalent to calling ec_encode() with |
| 6291 ft = (1<<ftb), but avoids using division. |
| 6292 </t> |
| 6293 </section> |
| 6294 |
| 6295 <section anchor="ec_enc_bit_logp" title="ec_enc_bit_logp()"> |
| 6296 <t> |
| 6297 The next is ec_enc_bit_logp() (entenc.c), which encodes a single binary symbol. |
| 6298 The context is described by a single parameter, logp, which is the absolute |
| 6299 value of the base-2 logarithm of the probability of a "1". |
| 6300 It is mathematically equivalent to calling ec_encode() with the 3-tuple |
| 6301 (fl[k] = 0, fh[k] = (1<<logp) - 1, |
| 6302 ft = (1<<logp)) if k is 0 and with |
| 6303 (fl[k] = (1<<logp) - 1, |
| 6304 fh[k] = ft = (1<<logp)) if k is 1. |
| 6305 The implementation requires no multiplications or divisions. |
| 6306 </t> |
| 6307 </section> |
| 6308 |
| 6309 <section anchor="ec_enc_icdf" title="ec_enc_icdf()"> |
| 6310 <t> |
| 6311 The last is ec_enc_icdf() (entenc.c), which encodes a single symbol with |
| 6312 a table-based context of up to 8 bits. |
| 6313 This uses the same icdf table as ec_dec_icdf() from |
| 6314 <xref target="ec_dec_icdf"/>. |
| 6315 The function is mathematically equivalent to calling ec_encode() with |
| 6316 fl[k] = (1<<ftb) - icdf[k-1] (or 0 if |
| 6317 k == 0), fh[k] = (1<<ftb) - icdf[k], and |
| 6318 ft = (1<<ftb). |
| 6319 This only saves a few arithmetic operations over ec_encode_bin(), but allows |
| 6320 the encoder to use the same icdf tables as the decoder. |
| 6321 </t> |
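<t>
The equivalences stated for the three alternate methods can be written out
directly.
The helper names in the following Python sketch are hypothetical; each one
returns the (fl, fh, ft) three-tuple that would be passed to ec_encode()
according to the descriptions above, and none of them performs any actual
encoding.
</t>
<figure align="center">
<artwork><![CDATA[
```python
def bin_triple(fl, fh, ftb):
    """ec_encode_bin(): the total is a power of two, ft = 1 << ftb."""
    return fl, fh, 1 << ftb


def bit_logp_triple(k, logp):
    """ec_enc_bit_logp(): a "1" has probability 2**(-logp)."""
    ft = 1 << logp
    return (0, ft - 1, ft) if k == 0 else (ft - 1, ft, ft)


def icdf_triple(k, icdf, ftb):
    """ec_enc_icdf(): icdf[k] holds ft minus the cumulative frequency."""
    ft = 1 << ftb
    fl = ft - icdf[k - 1] if k > 0 else 0
    return fl, ft - icdf[k], ft


# A made-up three-symbol table with frequencies 64, 128, and 64 out of 256.
example_icdf = [192, 64, 0]
triples = [icdf_triple(k, example_icdf, 8) for k in range(3)]
```
]]></artwork>
</figure>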
| 6322 </section> |
| 6323 |
| 6324 </section> |
| 6325 |
| 6326 <section anchor="encoding-bits" title="Encoding Raw Bits"> |
| 6327 <t> |
| 6328 The raw bits used by the CELT layer are packed at the end of the buffer using |
| 6329 ec_enc_bits() (entenc.c). |
| 6330 Because the raw bits may continue into the last byte output by the range coder |
| 6331 if there is room in the low-order bits, the encoder must be prepared to merge |
| 6332 these values into a single byte. |
| 6333 The procedure in <xref target="encoder-finalizing"/> does this in a way that |
| 6334 ensures both the range coded data and the raw bits can be decoded |
| 6335 successfully. |
| 6336 </t> |
| 6337 </section> |
| 6338 |
| 6339 <section anchor="encoding-ints" title="Encoding Uniformly Distributed Integers"> |
| 6340 <t> |
| 6341 The function ec_enc_uint() (entenc.c) encodes one of ft equiprobable symbols in |
| 6342 the range 0 to (ft - 1), inclusive, each with a frequency of 1, |
| 6343 where ft may be as large as (2**32 - 1). |
| 6344 Like the decoder (see <xref target="ec_dec_uint"/>), it splits up the |
| 6345 value into a range coded symbol representing up to 8 of the high bits, and, if |
| 6346 necessary, raw bits representing the remainder of the value. |
| 6347 </t> |
| 6348 <t> |
| 6349 ec_enc_uint() takes a two-tuple (t, ft), where t is the value to be |
| 6350 encoded, 0 <= t < ft, and ft is not necessarily a |
| 6351 power of two. |
| 6352 Let ftb = ilog(ft - 1), i.e., the number of bits required |
| 6353 to store (ft - 1) in two's complement notation. |
| 6354 If ftb is 8 or less, then t is encoded directly using ec_encode() with the |
| 6355 three-tuple (t, t + 1, ft). |
| 6356 </t> |
| 6357 <t> |
| 6358 If ftb is greater than 8, then the top 8 bits of t are encoded using the |
| 6359 three-tuple (t>>(ftb - 8), |
| 6360 (t>>(ftb - 8)) + 1, |
| 6361 ((ft - 1)>>(ftb - 8)) + 1), and the |
| 6362 remaining bits, |
| 6363 (t & ((1<<(ftb - 8)) - 1)), |
| 6364 are encoded as raw bits with ec_enc_bits(). |
| 6365 </t> |
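<t>
The split between the range coded part and the raw bits can be illustrated
with a small sketch.
The Python function below only computes what would be handed to ec_encode()
and ec_enc_bits(); its name and return layout are assumptions made for this
example, and ilog() is modeled with Python's int.bit_length().
</t>
<figure align="center">
<artwork><![CDATA[
```python
def ilog(x):
    """Number of bits needed to store x (0 when x is 0)."""
    return x.bit_length()


def split_uint(t, ft):
    """Return ((fl, fh, ft_sym), (raw_value, raw_bits)) for encoding t."""
    if not 0 <= t < ft:
        raise ValueError("t must satisfy 0 <= t < ft")
    ftb = ilog(ft - 1)
    if ftb <= 8:
        # Small alphabets are range coded directly, with no raw bits.
        return (t, t + 1, ft), (0, 0)
    shift = ftb - 8
    hi = t >> shift  # top 8 bits of t
    ft_sym = ((ft - 1) >> shift) + 1
    return (hi, hi + 1, ft_sym), (t & ((1 << shift) - 1), shift)
```
]]></artwork>
</figure>
<t>
With ft = 1000 and t = 999, for instance, ftb is 10, so the top 8 bits (249)
are range coded out of 250 possible values and the low 2 bits (3) are sent as
raw bits.
</t>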
| 6366 </section> |
| 6367 |
| 6368 <section anchor="encoder-finalizing" title="Finalizing the Stream"> |
| 6369 <t> |
| 6370 After all symbols are encoded, the stream must be finalized by outputting a |
| 6371 value inside the current range. |
| 6372 Let end be the integer in the interval [val, val + rng) with the |
| 6373 largest number of trailing zero bits, b, such that |
| 6374 (end + (1<<b) - 1) is also in the interval |
| 6375 [val, val + rng). |
| 6376 This choice of end allows the maximum number of trailing bits to be set to |
| 6377 arbitrary values while still ensuring the range coded part of the buffer can |
| 6378 be decoded correctly. |
| 6379 Then, while end is not zero, the top 9 bits of end, i.e., (end>>23), are |
| 6380 passed to the carry buffer in accordance with the procedure in |
| 6381 <xref target="ec_enc_carry_out"/>, and end is updated via |
| 6382 <figure align="center"> |
| 6383 <artwork align="center"><![CDATA[ |
| 6384 end = (end<<8) & 0x7FFFFFFF . |
| 6385 ]]></artwork> |
| 6386 </figure> |
| 6387 Finally, if the buffered output byte, rem, is neither zero nor the special |
| 6388 value -1, or the carry count, ext, is greater than zero, then 9 zero bits are |
| 6389 sent to the carry buffer to flush it to the output buffer. |
| 6390 When outputting the final byte from the range coder, if it would overlap any |
| 6391 raw bits already packed into the end of the output buffer, they should be ORed |
| 6392 into the same byte. |
| 6393 The bit allocation routines in the CELT layer should ensure that this can be |
| 6394 done without corrupting the range coder data so long as end is chosen as |
| 6395 described above. |
| 6396 If there is any space between the end of the range coder data and the end of |
| 6397 the raw bits, it is padded with zero bits. |
| 6398 This entire process is implemented by ec_enc_done() (entenc.c). |
| 6399 </t> |
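<t>
The choice of end can be illustrated with a direct, unoptimized search.
The Python sketch below follows the rule as worded above rather than the
bit-manipulation used by ec_enc_done(); the function name is hypothetical, and
it returns both end and the number b of trailing bits that may later take
arbitrary values.
</t>
<figure align="center">
<artwork><![CDATA[
```python
def choose_end(val, rng):
    """Return (end, b) with end and end + 2**b - 1 in [val, val + rng)."""
    top = val + rng
    for b in range(31, -1, -1):
        mask = (1 << b) - 1
        end = (val + mask) & ~mask  # round val up to a multiple of 2**b
        if end + mask < top:
            return end, b
    raise ValueError("rng must be positive")
```
]]></artwork>
</figure>
<t>
In the initial state (val = 0, rng = 2**31) this yields end = 0 with 31 free
bits, so nothing needs to be written for an empty frame.
</t>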
| 6400 </section> |
| 6401 |
| 6402 <section anchor="encoder-tell" title="Current Bit Usage"> |
| 6403 <t> |
| 6404 The bit allocation routines in Opus need to be able to determine a |
| 6405 conservative upper bound on the number of bits that have been used |
| 6406 to encode the current frame thus far. This drives allocation |
| 6407 decisions and ensures that the range coder and raw bits will not |
| 6408 overflow the output buffer. This is computed in the |
| 6409 reference implementation to whole-bit precision by |
| 6410 the function ec_tell() (entcode.h) and to fractional 1/8th bit |
| 6411 precision by the function ec_tell_frac() (entcode.c). |
| 6412 Like all operations in the range coder, it must be implemented in a |
| 6413 bit-exact manner, and must produce exactly the same value returned by |
| 6414 the same functions in the decoder after decoding the same symbols. |
| 6415 </t> |
| 6416 </section> |
| 6417 |
| 6418 </section> |
| 6419 |
| 6420 <section title='SILK Encoder'> |
| 6421 <t> |
| 6422 In many respects the SILK encoder mirrors the SILK decoder described |
| 6423 in <xref target='silk_decoder_outline'/>. |
| 6424 Details such as the quantization and range coder tables can be found |
| 6425 there, while this section describes the high-level design choices that |
| 6426 were made. |
| 6427 The diagram below shows the basic modules of the SILK encoder. |
| 6428 <figure align="center" anchor="silk_encoder_figure" title="SILK Encoder"> |
| 6429 <artwork> |
| 6430 <![CDATA[ |
| 6431 +----------+ +--------+ +---------+ |
| 6432 | Sample | | Stereo | | SILK | |
| 6433 ------>| Rate |--->| Mixing |--->| Core |----------> |
| 6434 Input |Conversion| | | | Encoder | Bitstream |
| 6435 +----------+ +--------+ +---------+ |
| 6436 ]]> |
| 6437 </artwork> |
| 6438 </figure> |
| 6439 </t> |
| 6440 |
| 6441 <section title='Sample Rate Conversion'> |
| 6442 <t> |
| 6443 The input signal's sampling rate is adjusted by a sample rate conversion |
| 6444 module so that it matches the SILK internal sampling rate. |
| 6445 The input to the sample rate converter is delayed by a number of samples |
| 6446 depending on the sample rate ratio, such that the overall delay is constant |
| 6447 for all input and output sample rates. |
| 6448 </t> |
| 6449 </section> |
| 6450 |
| 6451 <section title='Stereo Mixing'> |
| 6452 <t> |
| 6453 The stereo mixer is only used for stereo input signals. |
| 6454 It converts a stereo left/right signal into an adaptive |
| 6455 mid/side representation. |
| 6456 The first step is to compute non-adaptive mid/side signals |
| 6457 as half the sum and difference between left and right signals. |
| 6458 The side signal is then minimized in energy by subtracting a |
| 6459 prediction of it based on the mid signal. |
| 6460 This prediction works well when the left and right signals |
| 6461 exhibit linear dependency, for instance for an amplitude-panned |
| 6462 input signal. |
| 6463 As in the decoder, the prediction coefficients are linearly |
| 6464 interpolated during the first 8 ms of the frame. |
| 6465 The mid signal is always encoded, whereas the residual |
| 6466 side signal is only encoded if it has sufficient |
| 6467 energy compared to the mid signal's energy. |
| 6468 If it does not, |
| 6469 the "mid_only_flag" is set without encoding the side signal. |
| 6470 </t> |
| 6471 <t> |
| 6472 The predictor coefficients are coded regardless of whether |
| 6473 the side signal is encoded. |
| 6474 For each frame, two predictor coefficients are computed, one |
| 6475 that predicts between low-passed mid and side channels, and |
| 6476 one that predicts between high-passed mid and side channels. |
| 6477 The low-pass filter is a simple three-tap filter |
| 6478 and creates a delay of one sample. |
| 6479 The high-pass filtered signal is the difference between |
| 6480 the mid signal delayed by one sample and the low-passed |
| 6481 signal. Instead of explicitly computing the high-passed |
| 6482 signal, it is computationally more efficient to transform |
| 6483 the prediction coefficients before applying them to the |
| 6484 filtered mid signal, as follows: |
| 6485 <figure align="center"> |
| 6486 <artwork align="center"> |
| 6487 <![CDATA[ |
| 6488 pred(n) = LP(n) * w0 + HP(n) * w1 |
| 6489 = LP(n) * w0 + (mid(n-1) - LP(n)) * w1 |
| 6490 = LP(n) * (w0 - w1) + mid(n-1) * w1 |
| 6491 ]]> |
| 6492 </artwork> |
| 6493 </figure> |
| 6494 where w0 and w1 are the low-pass and high-pass prediction |
| 6495 coefficients, mid(n-1) is the mid signal delayed by one sample, |
| 6496 LP(n) and HP(n) are the low-passed and high-passed |
| 6497 signals and pred(n) is the prediction signal that is subtracted |
| 6498 from the side signal. |
| 6499 </t> |
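<t>
The equivalence of the two forms of pred(n) can be checked numerically.
The Python sketch below is a floating-point illustration only: the three-tap
low-pass filter with taps (1/4, 1/2, 1/4) is an assumption chosen because it
has a delay of one sample, and the function name and test signal are made up
for this example.
</t>
<figure align="center">
<artwork><![CDATA[
```python
def predict_side(mid, w0, w1):
    """Return (direct, folded) predictions for n = 2 .. len(mid) - 1."""
    direct, folded = [], []
    for n in range(2, len(mid)):
        # Assumed three-tap low-pass filter with a one-sample delay.
        lp = 0.25 * mid[n] + 0.5 * mid[n - 1] + 0.25 * mid[n - 2]
        hp = mid[n - 1] - lp  # delayed mid minus its low-passed version
        direct.append(lp * w0 + hp * w1)
        # Same prediction without ever forming the high-passed signal.
        folded.append(lp * (w0 - w1) + mid[n - 1] * w1)
    return direct, folded


mid_signal = [0.0, 1.0, -2.0, 3.5, 0.25, -1.0]
direct, folded = predict_side(mid_signal, 0.3, -0.1)
```
]]></artwork>
</figure>
<t>
The folded form needs only the low-passed signal and the delayed mid signal,
which is why it saves one filtering pass per sample.
</t>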
| 6500 </section> |
| 6501 |
| 6502 <section title='SILK Core Encoder'> |
| 6503 <t> |
| 6504 What follows is a description of the core encoder and its components. |
| 6505 For simplicity, the core encoder is referred to simply as the encoder in |
| 6506 the remainder of this section. An overview of the encoder is given in |
| 6507 <xref target="encoder_figure" />. |
| 6508 </t> |
| 6509 <figure align="center" anchor="encoder_figure" title="SILK Core Encoder"> |
| 6510 <artwork align="center"> |
| 6511 <![CDATA[ |
| 6512 +---+ |
| 6513 +--------------------------------->| | |
| 6514 +---------+ | +---------+ | | |
| 6515 |Voice | | |LTP |12 | | |
| 6516 +-->|Activity |--+ +----->|Scaling |-----------+---->| | |
| 6517 | |Detector |3 | | |Control |<--+ | | | |
| 6518 | +---------+ | | +---------+ | | | | |
| 6519 | | | +---------+ | | | | |
| 6520 | | | |Gains | | | | | |
| 6521 | | | +-->|Processor|---|---+---|---->| R | |
| 6522 | | | | | |11 | | | | a | |
| 6523 | \/ | | +---------+ | | | | n | |
| 6524 | +---------+ | | +---------+ | | | | g | |
| 6525 | |Pitch | | | |LSF | | | | | e | |
| 6526 | +->|Analysis |---+ | |Quantizer|---|---|---|---->| | |
| 6527 | | | |4 | | | |8 | | | | E |--> |
| 6528 | | +---------+ | | +---------+ | | | | n | 2 |
| 6529 | | | | 9/\ 10| | | | | c | |
| 6530 | | | | | \/ | | | | o | |
| 6531 | | +---------+ | | +----------+ | | | | d | |
| 6532 | | |Noise | +--|-->|Prediction|--+---|---|---->| e | |
| 6533 | +->|Shaping |---|--+ |Analysis |7 | | | | r | |
| 6534 | | |Analysis |5 | | | | | | | | | |
| 6535 | | +---------+ | | +----------+ | | | | | |
| 6536 | | | | /\ | | | | | |
| 6537 | | +----------|--|--------+ | | | | | |
| 6538 | | | \/ \/ \/ \/ \/ | | |
| 6539 | | | +---------+ +------------+ | | |
| 6540 | | | | | |Noise | | | |
| 6541 -+-------+-----+------>|Prefilter|--------->|Shaping |-->| | |
| 6542 1 | | 6 |Quantization|13 | | |
| 6543 +---------+ +------------+ +---+ |
| 6544 |
| 6545 1: Input speech signal |
| 6546 2: Range encoded bitstream |
| 6547 3: Voice activity estimate |
| 6548 4: Pitch lags (per 5 ms) and voicing decision (per 20 ms) |
| 6549 5: Noise shaping quantization coefficients |
| 6550 - Short term synthesis and analysis |
| 6551 noise shaping coefficients (per 5 ms) |
| 6552 - Long term synthesis and analysis noise |
| 6553 shaping coefficients (per 5 ms and for voiced speech only) |
| 6554 - Noise shaping tilt (per 5 ms) |
| 6555 - Quantizer gain/step size (per 5 ms) |
| 6556 6: Input signal filtered with analysis noise shaping filters |
| 6557 7: Short and long term prediction coefficients |
| 6558 LTP (per 5 ms) and LPC (per 20 ms) |
| 6559 8: LSF quantization indices |
| 6560 9: LSF coefficients |
| 6561 10: Quantized LSF coefficients |
| 6562 11: Processed gains, and synthesis noise shape coefficients |
| 6563 12: LTP state scaling coefficient. Controlling error propagation |
| 6564 / prediction gain trade-off |
| 6565 13: Quantized signal |
| 6566 ]]> |
| 6567 </artwork> |
| 6568 </figure> |
| 6569 |
| 6570 <section title='Voice Activity Detection'> |
| 6571 <t> |
| 6572 The input signal is processed by a Voice Activity Detector (VAD) to produce |
| 6573 a measure of voice activity, spectral tilt, and signal-to-noise estimates for |
| 6574 each frame. The VAD uses a sequence of half-band filterbanks to split the |
| 6575 signal into four subbands: 0...Fs/16, Fs/16...Fs/8, Fs/8...Fs/4, and |
| 6576 Fs/4...Fs/2, where Fs is the sampling frequency (8, 12, 16, or 24 kHz). |
| 6577 The lowest subband, from 0 - Fs/16, is high-pass filtered with a first-order |
| 6578 moving average (MA) filter (with transfer function H(z) = 1-z**(-1)) to |
| 6579 reduce the energy at the lowest frequencies. For each frame, the signal |
| 6580 energy per subband is computed. |
| 6581 In each subband, a noise level estimator tracks the background noise level |
| 6582 and a Signal-to-Noise Ratio (SNR) value is computed as the logarithm of the |
| 6583 ratio of energy to noise level. |
| 6584 Using these intermediate variables, the following parameters are calculated |
| 6585 for use in other SILK modules: |
| 6586 <list style="symbols"> |
| 6587 <t> |
| 6588 Average SNR. The average of the subband SNR values. |
| 6589 </t> |
| 6590 |
| 6591 <t> |
| 6592 Smoothed subband SNRs. Temporally smoothed subband SNR values. |
| 6593 </t> |
| 6594 |
| 6595 <t> |
| 6596 Speech activity level. Based on the average SNR and a weighted average of the |
| 6597 subband energies. |
| 6598 </t> |
| 6599 |
| 6600 <t> |
| 6601 Spectral tilt. A weighted average of the subband SNRs, with positive weights |
| 6602 for the low subbands and negative weights for the high subbands. |
| 6603 </t> |
| 6604 </list> |
| 6605 </t> |
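<t>
The per-frame VAD outputs listed above can be sketched as follows.
This is a minimal illustration, not the reference implementation: the function
name, the tilt weights, and the use of a decibel scale are assumptions made for
the example, and the temporal smoothing and speech activity level are omitted.
</t>
<figure align="center">
<artwork align="left">
<![CDATA[
```python
import math

def vad_parameters(subband_energy, noise_level,
                   tilt_weights=(1.0, 0.5, -0.5, -1.0)):
    """Return (subband SNRs in dB, average SNR, spectral tilt) for one frame.

    subband_energy, noise_level: four positive values, lowest band first.
    tilt_weights: hypothetical weights, positive for the low subbands and
    negative for the high subbands.
    """
    # SNR per subband: logarithm of the energy-to-noise ratio.
    snr_db = [10.0 * math.log10(e / n)
              for e, n in zip(subband_energy, noise_level)]
    average_snr = sum(snr_db) / len(snr_db)
    # Weighted average of the subband SNRs.
    tilt = sum(w * s for w, s in zip(tilt_weights, snr_db)) / len(snr_db)
    return snr_db, average_snr, tilt
```
]]>
</artwork>
</figure>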
| 6606 </section> |
| 6607 |
| 6608 <section title='Pitch Analysis' anchor='pitch_estimator_overview_section'> |
| 6609 <t> |
| 6610 The input signal is processed by the open loop pitch estimator shown in |
| 6611 <xref target='pitch_estimator_figure' />. |
| 6612 <figure align="center" anchor="pitch_estimator_figure" |
| 6613 title="Block diagram of the pitch estimator"> |
| 6614 <artwork align="center"> |
| 6615 <![CDATA[ |
| 6616                                  +--------+  +----------+
| 6617                                  |2 x Down|  |Time-     |
| 6618                               +->|sampling|->|Correlator|         |
| 6619                               |  |        |  |          |         |4
| 6620                               |  +--------+  +----------+        \/
| 6621                               |                    | 2        +-------+
| 6622                               |                    |      +-->|Speech |5
| 6623     +---------+    +--------+ |                   \/      |   |Type   |->
| 6624     |LPC      |    |Down    | |              +----------+ |   |       |
| 6625  +->|Analysis | +->|sample  |-+------------->|Time-     |     +-------+
| 6626  |  |         | |  |to 8 kHz|                |Correlator|----------->
| 6627  |  +---------+ |  +--------+                |__________|           6
| 6628  |       |      |                                  |3
| 6629  |      \/      |                                 \/
| 6630  |  +---------+ |                            +----------+
| 6631  |  |Whitening| |                            |Time-     |
| 6632 -+->|Filter   |-+--------------------------->|Correlator|----------->
| 6633 1   |         |                              |          |           7
| 6634     +---------+                              +----------+
| 6635 |
| 6636 1: Input signal |
| 6637 2: Lag candidates from stage 1 |
| 6638 3: Lag candidates from stage 2 |
| 6639 4: Correlation threshold |
| 6640 5: Voiced/unvoiced flag |
| 6641 6: Pitch correlation |
| 6642 7: Pitch lags |
| 6643 ]]> |
| 6644 </artwork> |
| 6645 </figure> |
| 6646 The pitch analysis finds a binary voiced/unvoiced classification, and, for |
| 6647 frames classified as voiced, four pitch lags per frame - one for each |
| 6648 5 ms subframe - and a pitch correlation indicating the periodicity of |
| 6649 the signal. |
| 6650 The input is first whitened using a Linear Prediction (LP) whitening filter, |
| 6651 where the coefficients are computed through standard Linear Prediction Coding |
| 6652 (LPC) analysis. The order of the whitening filter is 16 for best results, but |
| 6653 is reduced to 12 for medium complexity and 8 for low complexity modes. |
| 6654 The whitened signal is analyzed to find pitch lags for which the time |
| 6655 correlation is high. |
| 6656 To reduce complexity, the analysis is split into three stages:
| 6657 <list style="symbols"> |
| 6658 <t>In the first stage, the whitened signal is downsampled to 4 kHz |
| 6659 (from 8 kHz) and the current frame is correlated to a signal delayed |
| 6660 by a range of lags, starting from a shortest lag corresponding to |
| 6661 500 Hz, to a longest lag corresponding to 56 Hz.</t> |
| 6662 |
| 6663 <t> |
| 6664 The second stage operates on an 8 kHz signal (downsampled from 12, 16, |
| 6665 or 24 kHz) and measures time correlations only near the lags |
| 6666 corresponding to those that had sufficiently high correlations in the first |
| 6667 stage. The resulting correlations are adjusted for a small bias towards |
| 6668 short lags to avoid ending up with a multiple of the true pitch lag. |
| 6669 The highest adjusted correlation is compared to a threshold depending on: |
| 6670 <list style="symbols"> |
| 6671 <t> |
| 6672 Whether the previous frame was classified as voiced |
| 6673 </t> |
| 6674 <t> |
| 6675 The speech activity level |
| 6676 </t> |
| 6677 <t> |
| 6678 The spectral tilt. |
| 6679 </t> |
| 6680 </list> |
| 6681 If the threshold is exceeded, the current frame is classified as voiced and |
| 6682 the lag with the highest adjusted correlation is stored for a final pitch |
| 6683 analysis of the highest precision in the third stage. |
| 6684 </t> |
| 6685 <t> |
| 6686 The last stage operates directly on the whitened input signal to compute time |
| 6687 correlations for each of the four subframes independently in a narrow range |
| 6688 around the lag with highest correlation from the second stage. |
| 6689 </t> |
| 6690 </list> |
| 6691 </t> |
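<t>
The first-stage search described above can be sketched as a normalized time
correlation evaluated over a range of lags. The function names and the block
layout are assumptions for illustration only. Ties resolve to the shortest lag,
which is only a crude stand-in for the bias towards short lags applied in the
second stage.
</t>
<figure align="center">
<artwork align="left">
<![CDATA[
```python
import math

def lag_range(sample_rate_hz, max_pitch_hz=500.0, min_pitch_hz=56.0):
    """Shortest and longest lag, in samples, for the given pitch range."""
    return (int(sample_rate_hz // max_pitch_hz),
            int(sample_rate_hz // min_pitch_hz))

def normalized_correlation(signal, start, length, lag):
    """Correlate signal[start:start+length] with the block `lag` samples earlier."""
    cur = signal[start:start + length]
    old = signal[start - lag:start - lag + length]
    cross = sum(c * o for c, o in zip(cur, old))
    norm = math.sqrt(sum(c * c for c in cur) * sum(o * o for o in old))
    return cross / norm if norm > 0.0 else 0.0

def best_lag(signal, start, length, min_lag, max_lag):
    """Lag with the highest correlation; the first maximum wins on ties."""
    scores = {lag: normalized_correlation(signal, start, length, lag)
              for lag in range(min_lag, max_lag + 1)}
    return max(scores, key=scores.get)
```
]]>
</artwork>
</figure>
<t>
At 4 kHz, lag_range(4000) gives lags from 8 to 71 samples, matching the
500 Hz and 56 Hz limits quoted for the first stage.
</t>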
| 6692 </section> |
| 6693 |
| 6694 <section title='Noise Shaping Analysis' anchor='noise_shaping_analysis_overview_section'>
| 6695 <t> |
| 6696 The noise shaping analysis finds gains and filter coefficients used in the |
| 6697 prefilter and noise shaping quantizer. These parameters are chosen such that |
| 6698 they will fulfill several requirements: |
| 6699 <list style="symbols"> |
| 6700 <t> |
| 6701 Balancing quantization noise and bitrate. |
| 6702 The quantization gains determine the step size between reconstruction levels |
| 6703 of the excitation signal. Therefore, increasing the quantization gain |
| 6704 amplifies quantization noise, but also reduces the bitrate by lowering |
| 6705 the entropy of the quantization indices. |
| 6706 </t> |
| 6707 <t> |
| 6708 Spectral shaping of the quantization noise; the noise shaping quantizer is |
| 6709 capable of reducing quantization noise in some parts of the spectrum at the |
| 6710 cost of increased noise in other parts without substantially changing the |
| 6711 bitrate. |
| 6712 By shaping the noise such that it follows the signal spectrum, it becomes |
| 6713 less audible. In practice, best results are obtained by making the shape |
| 6714 of the noise spectrum slightly flatter than the signal spectrum. |
| 6715 </t> |
| 6716 <t> |
| 6717 De-emphasizing spectral valleys; by using different coefficients in the |
| 6718 analysis and synthesis part of the prefilter and noise shaping quantizer, |
| 6719 the levels of the spectral valleys can be decreased relative to the levels |
| 6720 of the spectral peaks such as speech formants and harmonics. |
| 6721 This reduces the entropy of the signal, which is the difference between the |
| 6722 coded signal and the quantization noise, thus lowering the bitrate. |
| 6723 </t> |
| 6724 <t> |
| 6725 Matching the levels of the decoded speech formants to the levels of the |
| 6726 original speech formants; an adjustment gain and a first order tilt |
| 6727 coefficient are computed to compensate for the effect of the noise |
| 6728 shaping quantization on the level and spectral tilt. |
| 6729 </t> |
| 6730 </list> |
| 6731 </t> |
| 6732 <t> |
| 6733 <figure align="center" anchor="noise_shape_analysis_spectra_figure" |
| 6734 title="Noise shaping and spectral de-emphasis illustration"> |
| 6735 <artwork align="center"> |
| 6736 <![CDATA[ |
| 6737      / \   ___
| 6738       |   // \\
| 6739       |  //   \\     ____
| 6740       |_//     \\___//  \\         ____
| 6741       | /  ___  \   /    \\       //  \\
| 6742     P |/  /   \  \_/      \\_____//    \\
| 6743     o |  /     \     ____  \     /      \\
| 6744     w | /       \___/    \  \___/  ____  \\___ 1
| 6745     e |/                  \       /    \  \
| 6746     r |                    \_____/      \  \__ 2
| 6747       |                                  \
| 6748       |                                   \___ 3
| 6749       |
| 6750       +---------------------------------------->
| 6751                       Frequency
| 6752 |
| 6753 1: Input signal spectrum |
| 6754 2: De-emphasized and level matched spectrum |
| 6755 3: Quantization noise spectrum |
| 6756 ]]> |
| 6757 </artwork> |
| 6758 </figure> |
| 6759 <xref target='noise_shape_analysis_spectra_figure' /> shows an example of an |
| 6760 input signal spectrum (1). |
| 6761 After de-emphasis and level matching, the spectrum has deeper valleys (2). |
| 6762 The quantization noise spectrum (3) more or less follows the input signal |
| 6763 spectrum, while having slightly less pronounced peaks. |
| 6764 The entropy, which provides a lower bound on the bitrate for encoding the |
| 6765 excitation signal, is proportional to the area between the de-emphasized |
| 6766 spectrum (2) and the quantization noise spectrum (3). Without de-emphasis, |
| 6767 the entropy is proportional to the area between input spectrum (1) and |
| 6768 quantization noise (3) - clearly higher. |
| 6769 </t> |
| 6770 |
| 6771 <t> |
| 6772 The transformation from input signal to de-emphasized signal can be |
| 6773 described as a filtering operation with a filter |
| 6774 <figure align="center"> |
| 6775 <artwork align="center"> |
| 6776 <![CDATA[ |
| 6777                            -1     Wana(z)
| 6778 H(z) = G * ( 1 - c_tilt * z   ) * -------
| 6779                                   Wsyn(z),
| 6780 ]]> |
| 6781 </artwork> |
| 6782 </figure> |
| 6783 having an adjustment gain G, a first order tilt adjustment filter with |
| 6784 tilt coefficient c_tilt, and where |
| 6785 <figure align="center"> |
| 6786 <artwork align="center"> |
| 6787 <![CDATA[ |
| 6788                16                            d
| 6789                __             -k        -L   __             -k
| 6790 Wana(z) = (1 - \  a_ana(k) * z  )*(1 - z   * \  b_ana(k) * z  ),
| 6791                /_                            /_
| 6792                k=1                           k=-d
| 6793 ]]> |
| 6794 </artwork> |
| 6795 </figure> |
| 6796 is the analysis part of the de-emphasis filter, consisting of the short-term |
| 6797 shaping filter with coefficients a_ana(k), and the long-term shaping filter |
| 6798 with coefficients b_ana(k) and pitch lag L. |
| 6799 The parameter d determines the number of long-term shaping filter taps. |
| 6800 </t> |
| 6801 |
| 6802 <t> |
| 6803 Similarly, but without the tilt adjustment, the synthesis part can be written as |
| 6804 <figure align="center"> |
| 6805 <artwork align="center"> |
| 6806 <![CDATA[ |
| 6807                16                            d
| 6808                __             -k        -L   __             -k
| 6809 Wsyn(z) = (1 - \  a_syn(k) * z  )*(1 - z   * \  b_syn(k) * z  ).
| 6810                /_                            /_
| 6811                k=1                           k=-d
| 6812 ]]> |
| 6813 </artwork> |
| 6814 </figure> |
| 6815 </t> |
| 6816 <t> |
| 6817 All noise shaping parameters are computed and applied per subframe of 5 ms. |
| 6818 First, an LPC analysis is performed on a windowed signal block of 15 ms. |
| 6819 The signal block has a look-ahead of 5 ms relative to the current subframe, |
| 6820 and the window is an asymmetric sine window. The LPC analysis is done with the |
| 6821 autocorrelation method, with an order between 8 (lowest-complexity mode)
| 6822 and 16 (best quality).
| 6823 </t> |
| 6824 <t> |
| 6825 Optionally the LPC analysis and noise shaping filters are warped by replacing |
| 6826 the delay elements by first-order allpass filters. |
| 6827 This increases the frequency resolution at low frequencies and reduces it at |
| 6828 high ones, which better matches the human auditory system and improves |
| 6829 quality. |
| 6830 The warped analysis and filtering comes at a cost in complexity |
| 6831 and is therefore only done in higher complexity modes. |
| 6832 </t> |
| 6833 <t> |
| 6834 The quantization gain is found by taking the square root of the residual energy |
| 6835 from the LPC analysis and multiplying it by a value inversely proportional |
| 6836 to the coding quality control parameter and the pitch correlation. |
| 6837 </t> |
| 6838 <t> |
| 6839 Next the two sets of short-term noise shaping coefficients a_ana(k) and |
| 6840 a_syn(k) are obtained by applying different amounts of bandwidth expansion to the
| 6841 coefficients found in the LPC analysis. |
| 6842 This bandwidth expansion moves the roots of the LPC polynomial towards the |
| 6843 origin, using the formulas |
| 6844 <figure align="center"> |
| 6845 <artwork align="center"> |
| 6846 <![CDATA[ |
| 6847                      k
| 6848 a_ana(k) = a(k)*g_ana , and
| 6849
| 6850                      k
| 6851 a_syn(k) = a(k)*g_syn ,
| 6852 ]]> |
| 6853 </artwork> |
| 6854 </figure> |
| 6855 where a(k) is the k'th LPC coefficient, and the bandwidth expansion factors |
| 6856 g_ana and g_syn are calculated as |
| 6857 <figure align="center"> |
| 6858 <artwork align="center"> |
| 6859 <![CDATA[ |
| 6860 g_ana = 0.95 - 0.01*C, and |
| 6861 |
| 6862 g_syn = 0.95 + 0.01*C, |
| 6863 ]]> |
| 6864 </artwork> |
| 6865 </figure> |
| 6866 where C is the coding quality control parameter between 0 and 1. |
| 6867 Applying more bandwidth expansion to the analysis part than to the synthesis |
| 6868 part gives the desired de-emphasis of spectral valleys in between formants. |
| 6869 </t> |
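<t>
As a small illustration of the bandwidth expansion formulas above, the sketch
below derives both coefficient sets from one LPC vector. The function names are
hypothetical, and lpc[0] is assumed to hold a(1).
</t>
<figure align="center">
<artwork align="left">
<![CDATA[
```python
def bandwidth_expand(lpc, g):
    """Return a(k) * g**k for k = 1..len(lpc); lpc[0] holds a(1)."""
    return [a * g ** k for k, a in enumerate(lpc, start=1)]

def shaping_coefficients(lpc, quality):
    """Return (a_ana, a_syn) for coding quality control parameter C in [0, 1]."""
    g_ana = 0.95 - 0.01 * quality
    g_syn = 0.95 + 0.01 * quality
    return bandwidth_expand(lpc, g_ana), bandwidth_expand(lpc, g_syn)
```
]]>
</artwork>
</figure>
<t>
Because g_ana is never larger than g_syn, the analysis coefficients decay
at least as fast with k as the synthesis coefficients, which is the extra
bandwidth expansion on the analysis side described above.
</t>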
| 6870 |
| 6871 <t> |
| 6872 The long-term shaping is applied only during voiced frames. |
| 6873 It uses three filter taps, described by |
| 6874 <figure align="center"> |
| 6875 <artwork align="center"> |
| 6876 <![CDATA[ |
| 6877 b_ana = F_ana * [0.25, 0.5, 0.25], and |
| 6878 |
| 6879 b_syn = F_syn * [0.25, 0.5, 0.25]. |
| 6880 ]]> |
| 6881 </artwork> |
| 6882 </figure> |
| 6883 For unvoiced frames these coefficients are set to 0. The multiplication factors |
| 6884 F_ana and F_syn are chosen between 0 and 1, depending on the coding quality |
| 6885 control parameter, as well as the calculated pitch correlation and smoothed |
| 6886 subband SNR of the lowest subband. By having F_ana less than F_syn, |
| 6887 the pitch harmonics are emphasized relative to the valleys in between the |
| 6888 harmonics. |
| 6889 </t> |
| 6890 |
| 6891 <t> |
| 6892 For unvoiced frames, the tilt coefficient c_tilt is chosen as
| 6893 <figure align="center"> |
| 6894 <artwork align="center"> |
| 6895 <![CDATA[ |
| 6896 c_tilt = 0.25, |
| 6897 ]]> |
| 6898 </artwork> |
| 6899 </figure> |
| 6900 and as |
| 6901 <figure align="center"> |
| 6902 <artwork align="center"> |
| 6903 <![CDATA[ |
| 6904 c_tilt = 0.25 + 0.2625 * V |
| 6905 ]]> |
| 6906 </artwork> |
| 6907 </figure> |
| 6908 for voiced frames, where V is the voice activity level between 0 and 1. |
| 6909 </t> |
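<t>
The long-term shaping taps and the tilt coefficient from the two preceding
paragraphs reduce to a few lines. The sketch below assumes F_ana, F_syn, and V
have already been determined elsewhere; the function names are illustrative.
</t>
<figure align="center">
<artwork align="left">
<![CDATA[
```python
def long_term_taps(voiced, f_ana, f_syn):
    """Return (b_ana, b_syn); both are all-zero for unvoiced frames."""
    shape = (0.25, 0.5, 0.25)
    if not voiced:
        return [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]
    return [f_ana * s for s in shape], [f_syn * s for s in shape]

def tilt_coefficient(voiced, voice_activity):
    """c_tilt as a function of the voice activity level V in [0, 1]."""
    return 0.25 + 0.2625 * voice_activity if voiced else 0.25
```
]]>
</artwork>
</figure>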
| 6910 <t> |
| 6911 The adjustment gain G serves to correct any level mismatch between the original |
| 6912 and decoded signals that might arise from the noise shaping and de-emphasis. |
| 6913 This gain is computed as the ratio of the prediction gain of the short-term |
| 6914 analysis and synthesis filter coefficients. The prediction gain of an LPC |
| 6915 synthesis filter is the square root of the output energy when the filter is |
| 6916 excited by a unit-energy impulse on the input. |
| 6917 An efficient way to compute the prediction gain is by first computing the |
| 6918 reflection coefficients from the LPC coefficients through the step-down |
| 6919 algorithm, and extracting the prediction gain from the reflection coefficients |
| 6920 as |
| 6921 <figure align="center"> |
| 6922 <artwork align="center"> |
| 6923 <![CDATA[ |
| 6924               K
| 6925              ___          2  -0.5
| 6926 predGain = ( | | 1 - (r_k)  )     ,
| 6927              k=1
| 6928 ]]> |
| 6929 </artwork> |
| 6930 </figure> |
| 6931 where r_k is the k'th reflection coefficient and K is the order of the LPC filter.
| 6932 </t> |
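<t>
A minimal sketch of this computation is given below. It assumes the predictor
sign convention A(z) = 1 - sum of a(k) * z**(-k), with lpc[k-1] holding a(k);
the function names are hypothetical, and the reference implementation works in
fixed point rather than floating point.
</t>
<figure align="center">
<artwork align="left">
<![CDATA[
```python
def reflection_coefficients(lpc):
    """Step-down recursion from LPC coefficients to reflection coefficients."""
    a = list(lpc)
    refl = [0.0] * len(a)
    for m in range(len(a), 0, -1):
        r = a[m - 1]               # last coefficient of the order-m predictor
        if abs(r) >= 1.0:
            raise ValueError("filter is not stable")
        refl[m - 1] = r
        # Coefficients of the order-(m-1) predictor.
        a = [(a[i] + r * a[m - 2 - i]) / (1.0 - r * r) for i in range(m - 1)]
    return refl

def prediction_gain(lpc):
    """predGain = (product over k of (1 - r_k**2)) ** -0.5."""
    product = 1.0
    for r in reflection_coefficients(lpc):
        product *= 1.0 - r * r
    return product ** -0.5
```
]]>
</artwork>
</figure>
<t>
For the second-order example a = [1.0, -0.5], the impulse response of the
synthesis filter has energy 2.4, and prediction_gain() returns its square root.
</t>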
| 6933 |
| 6934 <t> |
| 6935 Initial values for the quantization gains are computed as the square-root of |
| 6936 the residual energy of the LPC analysis, adjusted by the coding quality control |
| 6937 parameter. |
| 6938 These quantization gains are later adjusted based on the results of the |
| 6939 prediction analysis. |
| 6940 </t> |
| 6941 </section> |
| 6942 |
| 6943 <section title='Prediction Analysis' anchor='pred_ana_overview_section'> |
| 6944 <t> |
| 6945 The prediction analysis is performed in one of two ways depending on how |
| 6946 the pitch estimator classified the frame. |
| 6947 The processing for voiced and unvoiced speech is described in |
| 6948 <xref target='pred_ana_voiced_overview_section' /> and |
| 6949 <xref target='pred_ana_unvoiced_overview_section' />, respectively. |
| 6950 Inputs to this function include the pre-whitened signal from the |
| 6951 pitch estimator (see <xref target='pitch_estimator_overview_section'/>). |
| 6952 </t> |
| 6953 |
| 6954 <section title='Voiced Speech' anchor='pred_ana_voiced_overview_section'> |
| 6955 <t> |
| 6956 For a frame of voiced speech the pitch pulses will remain dominant in the |
| 6957 pre-whitened input signal. |
| 6958 Further whitening is desirable as it leads to higher quality at the same |
| 6959 available bitrate. |
| 6960 To achieve this, a Long-Term Prediction (LTP) analysis is carried out to |
| 6961 estimate the coefficients of a fifth-order LTP filter for each of four |
| 6962 subframes. |
| 6963 The LTP coefficients are quantized using the method described in |
| 6964 <xref target='ltp_quantizer_overview_section'/>, and the quantized LTP |
| 6965 coefficients are used to compute the LTP residual signal. |
| 6966 This LTP residual signal is the input to an LPC analysis where the LPC coefficients are
| 6967 estimated using Burg's method <xref target="Burg"/>, such that the residual energy is minimized.
| 6968 The estimated LPC coefficients are converted to a Line Spectral Frequency (LSF) vector
| 6969 and quantized as described in <xref target='lsf_quantizer_overview_section'/>. |
| 6970 After quantization, the quantized LSF vector is converted back to LPC |
| 6971 coefficients using the full procedure in <xref target="silk_nlsfs"/>. |
| 6972 By using quantized LTP coefficients and LPC coefficients derived from the |
| 6973 quantized LSF coefficients, the encoder remains fully synchronized with the |
| 6974 decoder. |
| 6975 The quantized LPC and LTP coefficients are also used to filter the input |
| 6976 signal and measure residual energy for each of the four subframes. |
| 6977 </t> |
| 6978 </section> |
| 6979 <section title='Unvoiced Speech' anchor='pred_ana_unvoiced_overview_section'> |
| 6980 <t> |
| 6981 For a speech signal that has been classified as unvoiced, there is no need |
| 6982 for LTP filtering, as it has already been determined that the pre-whitened |
| 6983 input signal is not periodic enough within the allowed pitch period range |
| 6984 for LTP analysis to be worth the cost in terms of complexity and bitrate. |
| 6985 The pre-whitened input signal is therefore discarded, and instead the input |
| 6986 signal is used for LPC analysis using Burg's method. |
| 6987 The resulting LPC coefficients are converted to an LSF vector and quantized
| 6988 as described in <xref target='lsf_quantizer_overview_section'/>.
| 6989 They are then transformed back to obtain quantized LPC coefficients, which |
| 6990 are then used to filter the input signal and measure residual energy for |
| 6991 each of the four subframes. |
| 6992 </t> |
| 6993 <section title="Burg's Method"> |
| 6994 <t> |
| 6995 The main purpose of linear prediction in SILK is to reduce the bitrate by |
| 6996 minimizing the residual energy. |
| 6997 At least at high bitrates, perceptual aspects are handled |
| 6998 independently by the noise shaping filter. |
| 6999 Burg's method is used because it provides higher prediction gain |
| 7000 than the autocorrelation method and, unlike the covariance method, |
| 7001 produces stable filters (provided that numerical errors do not
| 7002 destabilize them). SILK's implementation of Burg's method is also computationally
| 7003 faster than the covariance method.
| 7004 The implementation of Burg's method differs from traditional
| 7005 implementations in two respects.
| 7006 The first difference is that it
| 7007 operates on autocorrelations, similar to the Schur algorithm <xref target="Schur"/>, but
| 7008 with a simple update to the autocorrelations after finding each |
| 7009 reflection coefficient to make the result identical to Burg's method. |
| 7010 This brings down the complexity of Burg's method to near that of |
| 7011 the autocorrelation method. |
| 7012 The second difference is that the signal in each subframe is scaled |
| 7013 by the inverse of the residual quantization step size. Subframes with |
| 7014 a small quantization step size will on average spend more bits for a |
| 7015 given amount of residual energy than subframes with a large step size. |
| 7016 Without scaling, Burg's method minimizes the total residual energy in |
| 7017 all subframes, which does not necessarily minimize the total number of
| 7018 bits needed for coding the quantized residual. The residual energy |
| 7019 of the scaled subframes is a better measure for that number of |
| 7020 bits. |
| 7021 </t> |
| 7022 </section> |
| 7023 </section> |
| 7024 </section> |
| 7025 |
| 7026 <section title='LSF Quantization' anchor='lsf_quantizer_overview_section'> |
| 7027 <t> |
| 7028 Unlike many other speech codecs, SILK uses variable bitrate coding |
| 7029 for the LSFs. |
| 7030 This improves the average rate-distortion (R-D) tradeoff and reduces outliers. |
| 7031 The variable bitrate coding minimizes a linear combination of the weighted |
| 7032 quantization errors and the bitrate. |
| 7033 The weights for the quantization errors are the Inverse |
| 7034 Harmonic Mean Weighting (IHMW) function proposed by Laroia et al. |
| 7035 (see <xref target="laroia-icassp" />). |
| 7036 These weights are referred to here as Laroia weights. |
| 7037 </t> |
| 7038 <t> |
| 7039 The LSF quantizer consists of two stages. |
| 7040 The first stage is an (unweighted) vector quantizer (VQ), with a |
| 7041 codebook size of 32 vectors. |
| 7042 The quantization errors for the codebook vectors are sorted, and
| 7043 for the N best vectors a second stage quantizer is run. |
| 7044 By varying the number N a tradeoff is made between R-D performance |
| 7045 and computational efficiency. |
| 7046 For each of the N codebook vectors the Laroia weights corresponding |
| 7047 to that vector (and not to the input vector) are calculated. |
| 7048 Then the residual between the input LSF vector and the codebook |
| 7049 vector is scaled by the square roots of these Laroia weights. |
| 7050 This scaling partially normalizes error sensitivity for the |
| 7051 residual vector, so that a uniform quantizer with fixed |
| 7052 step sizes can be used in the second stage without too much |
| 7053 performance loss. |
| 7054 And by scaling with Laroia weights determined from the first-stage |
| 7055 codebook vector, the process can be reversed in the decoder. |
| 7056 </t> |
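<t>
The first stage and the weight-based scaling can be sketched as follows.
The weight formula used here, the sum of the reciprocal distances to the two
neighbouring LSFs with boundary values 0 and pi, is the commonly cited form of
the Laroia weights and is an assumption of this example, as are the function
names and the toy codebook used to exercise it.
</t>
<figure align="center">
<artwork align="left">
<![CDATA[
```python
import math

def laroia_weights(lsf, upper=math.pi):
    """w[i] = 1/(f[i] - f[i-1]) + 1/(f[i+1] - f[i]), with borders 0 and upper."""
    f = [0.0] + list(lsf) + [upper]
    return [1.0 / (f[i] - f[i - 1]) + 1.0 / (f[i + 1] - f[i])
            for i in range(1, len(f) - 1)]

def first_stage(lsf, codebook, n_best):
    """Indices of the n_best codebook vectors by unweighted squared error."""
    errors = [sum((x - c) ** 2 for x, c in zip(lsf, vec)) for vec in codebook]
    return sorted(range(len(codebook)), key=errors.__getitem__)[:n_best]

def scaled_residual(lsf, cb_vec):
    """Residual scaled by square roots of weights taken from the codebook vector."""
    return [(x - c) * math.sqrt(w)
            for x, c, w in zip(lsf, cb_vec, laroia_weights(cb_vec))]

def unscale_residual(residual, cb_vec):
    """Decoder side: undo the scaling using the same codebook vector."""
    return [c + r / math.sqrt(w)
            for r, c, w in zip(residual, cb_vec, laroia_weights(cb_vec))]
```
]]>
</artwork>
</figure>
<t>
Because the weights depend only on the first-stage codebook vector, the decoder
can recompute them from the transmitted index and invert the scaling exactly.
</t>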
| 7057 <t> |
| 7058 The second stage uses predictive delayed decision scalar |
| 7059 quantization. |
| 7060 The quantization error is weighted by Laroia weights determined |
| 7061 from the LSF input vector. |
| 7062 The predictor multiplies the previous quantized residual value |
| 7063 by a prediction coefficient that depends on the vector index from the |
| 7064 first stage VQ and on the location in the LSF vector. |
| 7065 The prediction is subtracted from the LSF residual value before |
| 7066 quantizing the result, and added back afterwards. |
| 7067 This subtraction can be interpreted as shifting the quantization levels |
| 7068 of the scalar quantizer, and as a result the quantization error of |
| 7069 each value depends on the quantization decision of the previous value. |
| 7070 This dependency is exploited by the delayed decision mechanism to |
| 7071 search for a quantization sequence with best R-D performance
| 7072 with a Viterbi-like algorithm <xref target="Viterbi"/>. |
| 7073 The quantizer processes the residual LSF vector in reverse order |
| 7074 (i.e., it starts with the highest residual LSF value). |
| 7075 This is done because the prediction works slightly |
| 7076 better in the reverse direction. |
| 7077 </t> |
| 7078 <t> |
| 7079 The quantization index of the first stage is entropy coded. |
| 7080 The quantization sequence from the second stage is also entropy |
| 7081 coded, where for each element the probability table is chosen |
| 7082 depending on the vector index from the first stage and the location |
| 7083 of that element in the LSF vector. |
| 7084 </t> |
| 7085 |
| 7086 <section title='LSF Stabilization' anchor='lsf_stabilizer_overview_section'> |
| 7087 <t> |
| 7088 If the input is stable, finding the best candidate usually results in a |
| 7089 quantized vector that is also stable. Because of the two-stage approach, |
| 7090 however, it is possible that the best quantization candidate is unstable. |
| 7091 The encoder applies the same stabilization procedure applied by the decoder |
| 7092 (see <xref target="silk_nlsf_stabilization"/>) to ensure the LSF parameters
| 7093 are within their valid range, increasingly sorted, and have minimum |
| 7094 distances between each other and the border values. |
| 7095 </t> |
| 7096 </section> |
| 7097 </section> |
| 7098 |
| 7099 <section title='LTP Quantization' anchor='ltp_quantizer_overview_section'> |
| 7100 <t> |
| 7101 For voiced frames, the prediction analysis described in |
| 7102 <xref target='pred_ana_voiced_overview_section' /> resulted in four sets |
| 7103 (one set per subframe) of five LTP coefficients, plus four weighting matrices. |
| 7104 The LTP coefficients for each subframe are quantized using entropy constrained |
| 7105 vector quantization. |
| 7106 A total of three vector codebooks are available for quantization, with |
| 7107 different rate-distortion trade-offs. The three codebooks have 10, 20, and |
| 7108 40 vectors and average rates of about 3, 4, and 5 bits per vector, respectively. |
| 7109 Consequently, the first codebook has larger average quantization distortion at |
| 7110 a lower rate, whereas the last codebook has smaller average quantization |
| 7111 distortion at a higher rate. |
| 7112 Given the weighting matrix W_ltp and LTP vector b, the weighted rate-distortion |
| 7113 measure for a codebook vector cb_i with rate r_i is given by
| 7114 <figure align="center"> |
| 7115 <artwork align="center"> |
| 7116 <![CDATA[ |
| 7117 RD = u * (b - cb_i)' * W_ltp * (b - cb_i) + r_i, |
| 7118 ]]> |
| 7119 </artwork> |
| 7120 </figure> |
| 7121 where u is a fixed, heuristically-determined parameter balancing the distortion |
| 7122 and rate. |
| 7123 Which codebook gives the best performance for a given LTP vector depends on the |
| 7124 weighting matrix for that LTP vector. |
| 7125 For example, for a low valued W_ltp, it is advantageous to use the codebook |
| 7126 with 10 vectors as it has a lower average rate. |
| 7127 For a large W_ltp, on the other hand, it is often better to use the codebook |
| 7128 with 40 vectors, as it is more likely to contain the best codebook vector. |
| 7129 The weighting matrix W_ltp depends mostly on two aspects of the input signal. |
| 7130 The first is the periodicity of the signal; the more periodic, the larger W_ltp. |
| 7131 The second is the change in signal energy in the current subframe, relative to |
| 7132 the signal one pitch lag earlier. |
| 7133 A decaying energy leads to a larger W_ltp than an increasing energy. |
| 7134 Both aspects fluctuate relatively slowly, so the W_ltp matrices for the
| 7135 different subframes of one frame are often similar.
| 7136 Because of this, one of the three codebooks typically gives good performance |
| 7137 for all subframes, and therefore the codebook search for the subframe LTP |
| 7138 vectors is constrained to only allow codebook vectors to be chosen from the |
| 7139 same codebook, resulting in a rate reduction. |
| 7140 </t> |
| 7141 |
| 7142 <t> |
| 7143 To find the best codebook, each of the three vector codebooks is |
| 7144 used to quantize all subframe LTP vectors and produce a combined |
| 7145 weighted rate-distortion measure for each vector codebook. |
| 7146 The vector codebook with the lowest combined rate-distortion |
| 7147 over all subframes is chosen. The quantized LTP vectors are used |
| 7148 in the noise shaping quantizer, and the index of the codebook |
| 7149 plus the four indices for the four subframe codebook vectors |
| 7150 are passed on to the range encoder. |
| 7151 </t> |
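<t>
The codebook selection described above can be sketched as an exhaustive search
that sums the weighted rate-distortion measure over the subframes. The data
layout (each codebook as a pair of vectors and rates) and the function names
are assumptions for illustration; real LTP vectors have five taps, but any
dimension works here.
</t>
<figure align="center">
<artwork align="left">
<![CDATA[
```python
def rd_cost(b, cb_vec, rate, w, u):
    """u * (b - cb)' * W * (b - cb) + rate."""
    d = [x - c for x, c in zip(b, cb_vec)]
    wd = [sum(wij * dj for wij, dj in zip(row, d)) for row in w]
    return u * sum(di * wdi for di, wdi in zip(d, wd)) + rate

def quantize_subframe(b, w, vectors, rates, u):
    """Return (index, cost) of the cheapest vector in one codebook."""
    costs = [rd_cost(b, vec, rate, w, u) for vec, rate in zip(vectors, rates)]
    best = min(range(len(costs)), key=costs.__getitem__)
    return best, costs[best]

def choose_codebook(ltp_vectors, weight_matrices, codebooks, u):
    """Return (codebook index, per-subframe vector indices) with lowest total RD."""
    best = None
    for cb_index, (vectors, rates) in enumerate(codebooks):
        picks = [quantize_subframe(b, w, vectors, rates, u)
                 for b, w in zip(ltp_vectors, weight_matrices)]
        total = sum(cost for _, cost in picks)
        if best is None or total < best[0]:
            best = (total, cb_index, [index for index, _ in picks])
    return best[1], best[2]
```
]]>
</artwork>
</figure>
<t>
With a small W_ltp the rate term dominates and the small codebook is selected;
with a large W_ltp the distortion term dominates and the larger codebook wins,
in line with the example given earlier in this section.
</t>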
| 7152 </section> |
| 7153 |
| 7154 <section title='Prefilter'> |
| 7155 <t> |
| 7156 In the prefilter the input signal is filtered using the spectral valley |
| 7157 de-emphasis filter coefficients from the noise shaping analysis |
| 7158 (see <xref target='noise_shaping_analysis_overview_section'/>). |
| 7159 The prefilter applies only the noise shaping analysis filter to the input
| 7160 signal; its output is the input to the noise shaping quantizer.
| 7161 </t> |
| 7162 </section> |
| 7163 |
| 7164 <section title='Noise Shaping Quantizer'> |
| 7165 <t> |
| 7166 The noise shaping quantizer independently shapes the signal and coding noise |
| 7167 spectra to obtain a perceptually higher quality at the same bitrate. |
| 7168 </t> |
| 7169 <t> |
| 7170 The prefilter output signal is multiplied by the adjustment gain G computed
| 7171 in the noise shaping analysis. Then the output of a synthesis shaping filter |
| 7172 is added, and the output of a prediction filter is subtracted to create a |
| 7173 residual signal. |
| 7174 The residual signal is multiplied by the inverse of the quantized quantization gain
| 7175 from the noise shaping analysis, and input to a scalar quantizer. |
| 7176 The quantization indices of the scalar quantizer represent a signal of pulses |
| 7177 that is input to the pyramid range encoder. |
| 7178 The scalar quantizer also outputs a quantization signal, which is multiplied |
| 7179 by the quantized quantization gain from the noise shaping analysis to create |
| 7180 an excitation signal. |
| 7181 The output of the prediction filter is added to the excitation signal to form |
| 7182 the quantized output signal y(n). |
| 7183 The quantized output signal y(n) is input to the synthesis shaping and |
| 7184 prediction filters. |
| 7185 </t> |
| 7186 <t> |
| 7187 Optionally the noise shaping quantizer operates in a delayed decision |
| 7188 mode. |
| 7189 In this mode it uses a Viterbi algorithm to keep track of |
| 7190 multiple rounding choices in the quantizer and select the best |
| 7191 one after a delay of 32 samples. This improves the rate/distortion |
| 7192 performance of the quantizer. |
| 7193 </t> |
| 7194 </section> |
| 7195 |
| 7196 <section title='Constant Bitrate Mode'> |
| 7197 <t> |
| 7198 SILK was designed to run in Variable Bitrate (VBR) mode. However,
| 7199 the reference implementation also has a Constant Bitrate (CBR) mode |
| 7200 for SILK. In CBR mode SILK will attempt to encode each packet with |
| 7201 no more than the allowed number of bits. The Opus wrapper code |
| 7202 then pads the bitstream if any unused bits are left in SILK mode, or |
| 7203 encodes the high band with the remaining number of bits in Hybrid mode. |
| 7204 The number of payload bits is adjusted by changing |
| 7205 the quantization gains and the rate/distortion tradeoff in the noise |
| 7206 shaping quantizer, in an iterative loop |
| 7207 around the noise shaping quantizer and entropy coding. |
| 7208 Compared to the SILK VBR mode, the CBR mode has lower |
| 7209 audio quality at a given average bitrate, and also has higher |
| 7210 computational complexity. |
| 7211 </t> |
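<t>
The iterative loop can be pictured with the simplified sketch below. It is not
the reference algorithm: encode_bits is a hypothetical callback standing in for
one pass of the noise shaping quantizer and entropy coder, only the
quantization gain is adjusted, and the step factor and iteration limit are
arbitrary choices made for the example.
</t>
<figure align="center">
<artwork align="left">
<![CDATA[
```python
def fit_to_budget(encode_bits, max_bits, gain=1.0, step=1.1, max_iterations=20):
    """Raise the quantization gain until the packet fits within max_bits.

    encode_bits(gain) is a hypothetical callback that returns the number of
    payload bits produced when encoding with the given gain multiplier.
    Returns (gain, bits) for the last attempt.
    """
    bits = encode_bits(gain)
    for _ in range(max_iterations):
        if bits <= max_bits:
            break
        gain *= step              # coarser quantization, fewer bits
        bits = encode_bits(gain)
    return gain, bits
```
]]>
</artwork>
</figure>
<t>
Each extra pass reruns the quantizer and entropy coder, which is why CBR mode
costs more computation than VBR mode.
</t>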
| 7212 </section> |
| 7213 |
| 7214 </section> |
| 7215 |
| 7216 </section> |
| 7217 |
| 7218 |
| 7219 <section title="CELT Encoder"> |
| 7220 <t> |
| 7221 Most of the aspects of the CELT encoder can be directly derived from the description
| 7222 of the decoder. For example, the filters and rotations in the encoder are simply the
| 7223 inverse of the operation performed by the decoder. Similarly, the quantizers generally
| 7224 optimize for the mean square error (because noise shaping is part of the bit-stream itself),
| 7225 so no special search is required. For this reason, only the less straightforward aspects of the
| 7226 encoder are described here. |
| 7227 </t> |
| 7228 |
| 7229 <section anchor="pitch-prefilter" title="Pitch Prefilter"> |
| 7230 <t>The pitch prefilter is applied after the pre-emphasis. It is applied |
| 7231 in such a way as to be the inverse of the decoder's post-filter. The main non-obvious aspect of the
| 7232 prefilter is the selection of the pitch period. The pitch search should be optimized for the
| 7233 following criteria: |
| 7234 <list style="symbols"> |
| 7235 <t>continuity: it is important that the pitch period |
| 7236 does not change abruptly between frames; and</t> |
| 7237 <t>avoidance of pitch multiples: when the period used is a multiple of the real period
| 7238 (lower frequency fundamental), the post-filter loses most of its ability to reduce noise.</t>
| 7239 </list> |
| 7240 </t> |
| 7241 </section> |
| 7242 |
| 7243 <section anchor="normalization" title="Bands and Normalization"> |
| 7244 <t> |
| 7245 The MDCT output is divided into bands that are designed to match the ear's critical
| 7246 bands for the smallest (2.5 ms) frame size. The larger frame sizes use integer
| 7247 multiples of the 2.5 ms layout. For each band, the encoder
| 7248 computes the energy that will later be encoded. Each band is then normalized by the
| 7249 square root of the <spanx style="strong">unquantized</spanx> energy, such that each band now forms a unit vector X.
| 7250 The energy and the normalization are computed by compute_band_energies() |
| 7251 and normalise_bands() (bands.c), respectively. |
| 7252 </t> |
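<t>
As an informal illustration of the normalization described above, the following
Python sketch computes the energy of each band and scales each band to unit norm.
The function name normalize_bands_sketch(), the band_edges argument, and the
choice to leave all-zero bands untouched are assumptions made for this example;
the sketch does not reproduce compute_band_energies() or normalise_bands() from
the reference implementation.
</t>
<figure>
<artwork><![CDATA[
```python
import math


def normalize_bands_sketch(spectrum, band_edges):
    """Return (energies, normalized) for bands [edge[b], edge[b+1]).

    energies[b] is the sum of squares of the MDCT bins in band b.
    normalized holds the same bins divided by sqrt(energies[b]), so
    every band with non-zero energy becomes a unit vector.
    """
    energies = []
    normalized = [float(v) for v in spectrum]
    for start, end in zip(band_edges, band_edges[1:]):
        energy = sum(v * v for v in spectrum[start:end])
        energies.append(energy)
        if energy > 0.0:  # assumption: silent bands are left as zeros
            gain = 1.0 / math.sqrt(energy)
            for i in range(start, end):
                normalized[i] = spectrum[i] * gain
    return energies, normalized
```
]]></artwork>
</figure>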
| 7253 </section> |
| 7254 |
| 7255 <section anchor="energy-quantization" title="Energy Envelope Quantization"> |
| 7256 |
| 7257 <t> |
| 7258 Energy quantization (both coarse and fine) can be easily understood from the dec
oding process. |
| 7259 For all useful bitrates, the coarse quantizer always chooses the quantized log e
nergy value that |
| 7260 minimizes the error for each band. Only at very low rate does the encoder allow
larger errors to |
| 7261 minimize the rate and avoid using more bits than are available. When the |
| 7262 available CPU resources allow it, it is best to try encoding the coarse energy both with and without |
| 7263 inter-frame prediction such that the best prediction mode can be selected. The o
ptimal mode depends on |
| 7264 the coding rate, the available bitrate, and the current rate of packet loss. |
| 7265 </t> |
| 7266 |
| 7267 <t>The fine energy quantizer always chooses the quantized log energy value that |
| 7268 minimizes the error for each band because the rate of the fine quantization depe
nds only |
| 7269 on the bit allocation and not on the values that are coded. |
| 7270 </t> |
| 7271 </section> <!-- Energy quant --> |
| 7272 |
| 7273 <section title="Bit Allocation"> |
| 7274 <t>The encoder must use exactly the same bit allocation process as used by the d
ecoder |
| 7275 and described in <xref target="allocation"/>. The three mechanisms that can be u
sed by the |
| 7276 encoder to adjust the bitrate on a frame-by-frame basis are band boost, allocati
on trim, |
| 7277 and band skipping. |
| 7278 </t> |
| 7279 |
| 7280 <section title="Band Boost"> |
| 7281 <t>The reference encoder makes a decision to boost a band when the energy of tha
t band is significantly |
| 7282 higher than that of the neighboring bands. Letting E_j be the log-energy of band j, we define |
| 7283 <list> |
| 7284 <t>D_j = 2*E_j - E_(j-1) - E_(j+1) </t> |
| 7285 </list> |
| 7286 |
| 7287 The allocation of band j is boosted once if D_j > t1 and twice if D_j > t2. For LM>=1, t1=2 and t2=4, |
| 7288 while for LM<1, t1=3 and t2=5. |
| 7289 </t> |
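<t>
The boost rule above can be sketched in Python as follows. This is only an
illustration under stated assumptions: the function name band_boost_counts() is
hypothetical, the log-energies are taken in whatever unit the thresholds t1 and
t2 are expressed in, and the first and last bands are never boosted here because
the text does not say how a missing neighbor is handled.
</t>
<figure>
<artwork><![CDATA[
```python
def band_boost_counts(log_energy, lm):
    """Return the number of boosts (0, 1 or 2) for each band.

    log_energy[j] is E_j; lm is the frame size index LM.
    """
    t1, t2 = (2, 4) if lm >= 1 else (3, 5)
    boosts = [0] * len(log_energy)
    # Assumption: edge bands lack one neighbor and are left unboosted.
    for j in range(1, len(log_energy) - 1):
        d = 2 * log_energy[j] - log_energy[j - 1] - log_energy[j + 1]
        if d > t2:
            boosts[j] = 2
        elif d > t1:
            boosts[j] = 1
    return boosts
```
]]></artwork>
</figure>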
| 7290 |
| 7291 </section> |
| 7292 |
| 7293 <section title="Allocation Trim"> |
| 7294 <t>The allocation trim is a value between 0 and 10 (inclusive) that controls the allocation |
| 7295 balance between the low and high frequencies. The encoder starts with a safe "default" of 5 |
| 7296 and deviates from that default in two different ways. First, the trim can deviate by +/- 2 |
| 7297 depending on the spectral tilt of the input signal. For signals with more low frequencies, the |
| 7298 trim is increased by up to 2, while for signals with more high frequencies, the trim is |
| 7299 decreased by up to 2. |
| 7300 Second, for stereo inputs, the trim value can |
| 7301 be decreased by up to 4 when the inter-channel correlation at low frequency (first 8 bands) |
| 7302 is high. </t> |
| 7303 </section> |
| 7304 |
| 7305 <section title="Band Skipping"> |
| 7306 <t>The encoder uses band skipping to ensure that the shape of the bands is only
coded |
| 7307 if there is at least 1/2 bit per sample available for the PVQ. If not, then no b
it is allocated |
| 7308 and folding is used instead. To ensure continuity in the allocation, some amount
of hysteresis is |
| 7309 added to the process, such that a band that received PVQ bits in the previous fr
ame only needs 7/16 |
| 7310 bit/sample to be coded for the current frame, while a band that did not receive
PVQ bits in the |
| 7311 previous frame needs at least 9/16 bit/sample to be coded.</t> |
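<t>
The hysteresis described above could be expressed as in the following Python
sketch. The function name band_is_coded() and its arguments are hypothetical,
and the allocation is assumed to be given in whole bits, which may differ from
the units used internally by the reference encoder.
</t>
<figure>
<artwork><![CDATA[
```python
def band_is_coded(bits, n_samples, coded_in_previous_frame):
    """Decide whether a band keeps its PVQ bits or is skipped (folded).

    A band coded in the previous frame needs 7/16 bit/sample; a band
    that was skipped needs 9/16 bit/sample to be coded again.
    """
    threshold_16ths = 7 if coded_in_previous_frame else 9
    # Integer form of: bits / n_samples >= threshold_16ths / 16
    return bits * 16 >= threshold_16ths * n_samples
```
]]></artwork>
</figure>
<t>
With 8 bits available for a 16-sample band (1/2 bit per sample), the band stays
coded if it was coded in the previous frame, but is not re-enabled if it was
skipped.
</t>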
| 7312 </section> |
| 7313 |
| 7314 </section> |
| 7315 |
| 7316 <section title="Stereo Decisions"> |
| 7317 <t>Because CELT applies mid-side stereo coupling in the normalized domain, it do
es not suffer from |
| 7318 important stereo image problems even when the two channels are completely uncorr
elated. For this reason |
| 7319 it is always safe to use stereo coupling on any audio frame. That being said, th
ere are some frames |
| 7320 for which dual (independent) stereo is still more efficient. This decision is ma
de by comparing the estimated |
| 7321 entropy with and without coupling over the first 13 bands, taking into account t
he fact that all bands with |
| 7322 more than two MDCT bins require one extra degree of freedom when coded in mid-si
de. Let L1_ms and L1_lr |
| 7323 be the L1-norm of the mid-side vector and the L1-norm of the left-right vector,
respectively. The decision |
| 7324 to use mid-side is made if and only if |
| 7325 <figure align="center"> |
| 7326 <artwork align="center"><![CDATA[ |
| 7327 L1_ms L1_lr |
| 7328 -------- < ----- |
| 7329 bins + E bins |
| 7330 ]]></artwork> |
| 7331 </figure> |
| 7332 where bins is the number of MDCT bins in the first 13 bands and E is the number
of extra degrees of |
| 7333 freedom for mid-side coding. For LM>1, E=13, otherwise E=5. |
| 7334 </t> |
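<t>
As a minimal sketch of the mid-side decision rule above, the following Python
function takes the two L1 norms as inputs. The function name use_mid_side() is
hypothetical, and the computation of L1_ms and L1_lr from the MDCT coefficients
is deliberately left out because its exact scaling is not given in this section.
</t>
<figure>
<artwork><![CDATA[
```python
def use_mid_side(l1_ms, l1_lr, bins, lm):
    """Return True when mid-side coding is estimated to be cheaper.

    bins is the number of MDCT bins in the first 13 bands and lm is
    the frame size index LM.
    """
    extra = 13 if lm > 1 else 5  # E, the extra degrees of freedom
    # Cross-multiplied form of: l1_ms / (bins + E) < l1_lr / bins
    return l1_ms * bins < l1_lr * (bins + extra)
```
]]></artwork>
</figure>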
| 7335 |
| 7336 <t>The reference encoder decides on the intensity stereo threshold based on the
bitrate alone. After |
| 7337 taking into account the frame size by subtracting 80 bits per frame for coarse e
nergy, the first |
| 7338 band using intensity coding is as follows: |
| 7339 </t> |
| 7340 |
| 7341 <texttable anchor="intensity-thresholds" |
| 7342 title="Thresholds for Intensity Stereo"> |
| 7343 <ttcol align='center'>bitrate (kb/s)</ttcol> |
| 7344 <ttcol align='center'>start band</ttcol> |
| 7345 <c><35</c> <c>8</c> |
| 7346 <c>35-50</c> <c>12</c> |
| 7347 <c>50-68</c> <c>16</c> |
| 7348 <c>68-84</c> <c>18</c> |
| 7349 <c>84-102</c> <c>19</c> |
| 7350 <c>102-130</c> <c>20</c> |
| 7351 <c>>130</c> <c>disabled</c> |
| 7352 </texttable> |
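<t>
The threshold lookup can be sketched in Python as follows. The function name
intensity_start_band() is hypothetical, the input is assumed to be the bitrate
in kb/s after the coarse energy adjustment mentioned above, and a bitrate that
falls exactly on a boundary shared by two rows is assigned to the higher row,
which is an assumption since the table does not say which side is inclusive.
</t>
<figure>
<artwork><![CDATA[
```python
# (exclusive upper bound in kb/s, first band using intensity stereo)
_INTENSITY_ROWS = [(35, 8), (50, 12), (68, 16), (84, 18), (102, 19)]


def intensity_start_band(bitrate_kbps):
    """Return the first intensity-coded band, or None when disabled."""
    if bitrate_kbps > 130:
        return None  # intensity stereo disabled
    for upper, start_band in _INTENSITY_ROWS:
        if bitrate_kbps < upper:
            return start_band
    return 20  # 102 to 130 kb/s
```
]]></artwork>
</figure>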
| 7353 |
| 7354 |
| 7355 </section> |
| 7356 |
| 7357 <section title="Time-Frequency Decision"> |
| 7358 <t> |
| 7359 The choice of time-frequency resolution used in <xref target="tf-change"></xref>
is based on |
| 7360 R-D optimization. The distortion is the L1-norm (sum of absolute values) of each
band |
| 7361 after each TF resolution under consideration. The L1 norm is used because it rep
resents the entropy |
| 7362 for a Laplacian source. The number of bits required to code a change in TF resol
ution between |
| 7363 two bands is higher than the cost of having those two bands use the same resolut
ion, which is |
| 7364 what requires the R-D optimization. The optimal decision is computed using the V
iterbi algorithm. |
| 7365 See tf_analysis() in celt/celt.c. |
| 7366 </t> |
| 7367 </section> |
| 7368 |
| 7369 <section title="Spreading Values Decision"> |
| 7370 <t> |
| 7371 The choice of the spreading value in <xref target="spread values"></xref> has an |
| 7372 impact on the nature of the coding noise introduced by CELT. The larger the f_r
value, the |
| 7373 lower the impact of the rotation, and the more tonal the coding noise. The |
| 7374 more tonal the signal, the more tonal the noise should be, so the CELT encoder d
etermines |
| 7375 the optimal value for f_r by estimating how tonal the signal is. The tonality es
timate |
| 7376 is based on a discrete pdf (4-bin histogram) of each band. Bands that have a large number of small |
| 7377 values are considered more tonal and a decision is made by combining all bands w
ith more than |
| 7378 8 samples. See spreading_decision() in celt/bands.c. |
| 7379 </t> |
| 7380 </section> |
| 7381 |
| 7382 <section anchor="pvq" title="Spherical Vector Quantization"> |
| 7383 <t>CELT uses a Pyramid Vector Quantization (PVQ) <xref target="PVQ"></xref> |
| 7384 codebook for quantizing the details of the spectrum in each band that have not |
| 7385 been predicted by the pitch predictor. The PVQ codebook consists of all sums |
| 7386 of K signed pulses in a vector of N samples, where two pulses at the same positi
on |
| 7387 are required to have the same sign. Thus the codebook includes |
| 7388 all integer codevectors y of N dimensions that satisfy sum(abs(y(j))) = K. |
| 7389 </t> |
| 7390 |
| 7391 <t> |
| 7392 In bands where there are sufficient bits allocated, PVQ is used to encode |
| 7393 the unit vector that results from the normalization in |
| 7394 <xref target="normalization"></xref> directly. Given a PVQ codevector y, |
| 7395 the unit vector X is obtained as X = y/||y||, where ||.|| denotes the |
| 7396 L2 norm. |
| 7397 </t> |
| 7398 |
| 7399 |
| 7400 <section anchor="pvq-search" title="PVQ Search"> |
| 7401 |
| 7402 <t> |
| 7403 The search for the best codevector y is performed by alg_quant() |
| 7404 (vq.c). There are several possible approaches to the |
| 7405 search, with a trade-off between quality and complexity. The method used in the reference |
| 7406 implementation computes an initial codeword y0 by projecting the normalized spectrum |
| 7407 X onto the codebook pyramid of K-1 pulses: |
| 7408 </t> |
| 7409 <t> |
| 7410 y0 = truncate_towards_zero( (K-1) * X / sum(abs(X))) |
| 7411 </t> |
| 7412 |
| 7413 <t> |
| 7414 Depending on N, K and the input data, the initial codeword y0 may contain from |
| 7415 0 to K-1 non-zero values. All the remaining pulses, with the exception of the la
st one, |
| 7416 are found iteratively with a greedy search that minimizes the cost function J, which is |
| 7417 the negated normalized correlation between y and X: |
| 7418 <figure align="center"> |
| 7419 <artwork align="center"><![CDATA[ |
| 7420 T |
| 7421 J = -X * y / ||y|| |
| 7422 ]]></artwork> |
| 7423 </figure> |
| 7424 </t> |
| 7425 |
| 7426 <t> |
| 7427 The search described above is considered to be a good trade-off between quality |
| 7428 and computational cost. However, there are other possible ways to search the PVQ |
| 7429 codebook, and implementers MAY use any other search method. See alg_quant() in celt/vq.c. |
| 7430 </t> |
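<t>
The following Python sketch illustrates the general shape of such a search:
a projection onto the pyramid of K-1 pulses followed by a greedy placement of
the remaining pulses. It is a simplified illustration and not a transcription of
alg_quant(): the function name pvq_search_sketch() is hypothetical, all remaining
pulses (including the last one) are placed with the same greedy criterion, and an
all-zero input is mapped to an arbitrary codevector.
</t>
<figure>
<artwork><![CDATA[
```python
def pvq_search_sketch(x, k):
    """Return an integer vector y with sum(abs(y)) == k that
    approximately maximizes the correlation x.y / ||y||."""
    if k < 1:
        raise ValueError("k must be at least 1")
    n = len(x)
    ax = [abs(v) for v in x]  # search on magnitudes, restore signs later
    total = sum(ax)
    if total == 0:
        # Assumption: any codevector is acceptable for a zero input.
        return [k] + [0] * (n - 1)
    # Initial projection: y0 = truncate((K-1) * X / sum(abs(X)))
    y = [int((k - 1) * a / total) for a in ax]
    xy = sum(a * b for a, b in zip(ax, y))  # running x.y
    yy = sum(b * b for b in y)              # running ||y||^2
    for _ in range(k - sum(y)):
        best_j, best_num, best_den = 0, -1.0, 1.0
        for j in range(n):
            # Adding one pulse at j gives correlation^2 = num / den.
            num = (xy + ax[j]) ** 2
            den = yy + 2 * y[j] + 1
            if num * best_den > best_num * den:
                best_j, best_num, best_den = j, num, den
        xy += ax[best_j]
        yy += 2 * y[best_j] + 1
        y[best_j] += 1
    return [-v if x[j] < 0 else v for j, v in enumerate(y)]
```
]]></artwork>
</figure>
<t>
Comparing the squared correlation by cross-multiplication avoids a division and
a square root in the inner loop; this is a design choice of the sketch.
</t>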
| 7431 </section> |
| 7432 |
| 7433 <section anchor="cwrs-encoder" title="PVQ Encoding"> |
| 7434 |
| 7435 <t> |
| 7436 The vector to encode, X, is converted into an index i such that |
| 7437 0 <= i < V(N,K) as follows. |
| 7438 Let i = 0 and k = 0. |
| 7439 Then for j = (N - 1) down to 0, inclusive, do: |
| 7440 <list style="numbers"> |
| 7441 <t> |
| 7442 If k > 0, set |
| 7443 i = i + (V(N-j-1,k-1) + V(N-j,k-1))/2. |
| 7444 </t> |
| 7445 <t>Set k = k + abs(X[j]).</t> |
| 7446 <t> |
| 7447 If X[j] < 0, set |
| 7448 i = i + (V(N-j-1,k) + V(N-j,k))/2. |
| 7449 </t> |
| 7450 </list> |
| 7451 </t> |
| 7452 |
| 7453 <t> |
| 7454 The index i is then encoded using the procedure in |
| 7455 <xref target="encoding-ints"/> with ft = V(N,K). |
| 7456 </t> |
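<t>
The index computation above can be sketched in Python as follows. The codebook
size V(N,K) is computed with the recurrence V(N,K) = V(N-1,K) + V(N,K-1) +
V(N-1,K-1), with V(N,0) = 1 and V(0,K) = 0 for K != 0, as used in the
description of the decoder. The function names are hypothetical, the input is
the integer codevector, and all_codevectors() exists only to check small cases
by brute force.
</t>
<figure>
<artwork><![CDATA[
```python
import itertools
from functools import lru_cache


@lru_cache(maxsize=None)
def pvq_codebook_size(n, k):
    """V(N,K): number of integer vectors of N dimensions with K pulses."""
    if k == 0:
        return 1
    if n == 0:
        return 0
    return (pvq_codebook_size(n - 1, k) + pvq_codebook_size(n, k - 1)
            + pvq_codebook_size(n - 1, k - 1))


def pvq_encode_index(x):
    """Map an integer codevector x to its index i, 0 <= i < V(N,K)."""
    n = len(x)
    v = pvq_codebook_size
    i = 0
    k = 0
    for j in range(n - 1, -1, -1):
        if k > 0:
            i += (v(n - j - 1, k - 1) + v(n - j, k - 1)) // 2
        k += abs(x[j])
        if x[j] < 0:
            i += (v(n - j - 1, k) + v(n - j, k)) // 2
    return i


def all_codevectors(n, k):
    """Brute-force list of every codevector (only for tiny n and k)."""
    return [c for c in itertools.product(range(-k, k + 1), repeat=n)
            if sum(abs(e) for e in c) == k]
```
]]></artwork>
</figure>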
| 7457 |
| 7458 </section> |
| 7459 |
| 7460 </section> |
| 7461 |
| 7462 |
| 7463 |
| 7464 |
| 7465 |
| 7466 </section> |
| 7467 |
| 7468 </section> |
| 7469 |
| 7470 |
| 7471 <section anchor="conformance" title="Conformance"> |
| 7472 |
| 7473 <t> |
| 7474 It is our intention to allow the greatest possible freedom in |
| 7475 implementing the specification. For this reason, outside of the exceptions |
| 7476 noted in this section, conformance is defined through the reference |
| 7477 implementation of the decoder provided in <xref target="ref-implementation"/>. |
| 7478 Although this document includes an English description of the codec, should |
| 7479 the description contradict the source code of the reference implementation, |
| 7480 the latter shall take precedence. |
| 7481 </t> |
| 7482 |
| 7483 <t> |
| 7484 Compliance with this specification means that in addition to following the norma
tive keywords in this document, |
| 7485 a decoder's output MUST also be |
| 7486 within the thresholds specified by the opus_compare.c tool (included |
| 7487 with the code) when compared to the reference implementation for each of the |
| 7488 test vectors provided (see <xref target="test-vectors"></xref>) and for each ou
tput |
| 7489 sampling rate and channel count supported. In addition, a compliant |
| 7490 decoder implementation MUST have the same final range decoder state as that of
the |
| 7491 reference decoder. It is therefore RECOMMENDED that the |
| 7492 decoder implement the same functional behavior as the reference. |
| 7493 |
| 7494 A decoder implementation is not required to support all output sampling |
| 7495 rates or all output channel counts. |
| 7496 </t> |
| 7497 |
| 7498 <section title="Testing"> |
| 7499 <t> |
| 7500 Using the reference code provided in <xref target="ref-implementation"></xref>, |
| 7501 a test vector can be decoded with |
| 7502 <list> |
| 7503 <t>opus_demo -d <rate> <channels> testvectorX.bit testX.out</t> |
| 7504 </list> |
| 7505 where <rate> is the sampling rate and can be 8000, 12000, 16000, 24000, or
48000, and |
| 7506 <channels> is 1 for mono or 2 for stereo. |
| 7507 </t> |
| 7508 |
| 7509 <t> |
| 7510 If the range decoder state is incorrect for one of the frames, the decoder will
exit with |
| 7511 "Error: Range coder state mismatch between encoder and decoder". If the decoder
succeeds, then |
| 7512 the output can be compared with the "reference" output with |
| 7513 <list> |
| 7514 <t>opus_compare -s -r <rate> testvectorX.dec testX.out</t> |
| 7515 </list> |
| 7516 for stereo or |
| 7517 <list> |
| 7518 <t>opus_compare -r <rate> testvectorX.dec testX.out</t> |
| 7519 </list> |
| 7520 for mono. |
| 7521 </t> |
| 7522 |
| 7523 <t>In addition to indicating whether the test vector comparison passes, the opus
_compare tool |
| 7524 outputs an "Opus quality metric" that indicates how well the tested decoder matc
hes the |
| 7525 reference implementation. A quality of 0 corresponds to the passing threshold, w
hile |
| 7526 a quality of 100 is the highest possible value and means that the output of the
tested decoder is identical to the reference |
| 7527 implementation. The passing threshold (quality 0) was calibrated in such a way t
hat it corresponds to |
| 7528 additive white noise with a 48 dB SNR (similar to what can be obtained on a cass
ette deck). |
| 7529 It is still possible for an implementation to sound very good with such a low qu
ality measure |
| 7530 (e.g. if the deviation is due to inaudible phase distortion), but unless this is
verified by |
| 7531 listening tests, it is RECOMMENDED that implementations achieve a quality above
90 for 48 kHz |
| 7532 decoding. For other sampling rates, it is normal for the quality metric to be lo
wer |
| 7533 (typically as low as 50 even for a good implementation) because of harmless mism
atch with |
| 7534 the delay and phase of the internal sampling rate conversion. |
| 7535 </t> |
| 7536 |
| 7537 <t> |
| 7538 On POSIX environments, the run_vectors.sh script can be used to verify all test |
| 7539 vectors. This can be done with |
| 7540 <list> |
| 7541 <t>run_vectors.sh <exec path> <vector path> <rate></t> |
| 7542 </list> |
| 7543 where <exec path> is the directory where the opus_demo and opus_compare ex
ecutables |
| 7544 are built and <vector path> is the directory containing the test vectors. |
| 7545 </t> |
| 7546 </section> |
| 7547 |
| 7548 <section anchor="opus-custom" title="Opus Custom"> |
| 7549 <t> |
| 7550 Opus Custom is an OPTIONAL part of the specification that is defined to |
| 7551 handle special sample rates and frame rates that are not supported by the |
| 7552 main Opus specification. Use of Opus Custom is discouraged for all but very |
| 7553 special applications for which a frame size different from 2.5, 5, 10, or 20&nbsp;ms is |
| 7554 needed (for either complexity or latency reasons). Because Opus Custom is |
| 7555 optional, streams encoded using Opus Custom cannot be expected to be decodable b
y all Opus |
| 7556 implementations. Also, because no in-band mechanism exists for specifying the sa
mpling |
| 7557 rate and frame size of Opus Custom streams, out-of-band signaling is required. |
| 7558 In Opus Custom operation, only the CELT layer is available, using the opus_custo
m_* function |
| 7559 calls in opus_custom.h. |
| 7560 </t> |
| 7561 </section> |
| 7562 |
| 7563 </section> |
| 7564 |
| 7565 <section anchor="security" title="Security Considerations"> |
| 7566 |
| 7567 <t> |
| 7568 Implementations of the Opus codec need to take appropriate security consideratio
ns |
| 7569 into account, as outlined in <xref target="DOS"/>. |
| 7570 It is extremely important for the decoder to be robust against malicious |
| 7571 payloads. |
| 7572 Malicious payloads must not cause the decoder to overrun its allocated memory |
| 7573 or to take an excessive amount of resources to decode. |
| 7574 Although problems |
| 7575 in encoders are typically rarer, the same applies to the encoder. Malicious |
| 7576 audio streams must not cause the encoder to misbehave because this would |
| 7577 allow an attacker to attack transcoding gateways. |
| 7578 </t> |
| 7579 <t> |
| 7580 The reference implementation contains no known buffer overflows or cases where |
| 7581 a specially crafted packet or audio segment could cause a significant increase |
| 7582 in CPU load. |
| 7583 However, on certain CPU architectures where denormalized floating-point |
| 7584 operations are much slower than normal floating-point operations, it is |
| 7585 possible for some audio content (e.g., silence or near-silence) to cause an |
| 7586 increase in CPU load. |
| 7587 Denormals can be introduced by reordering operations in the compiler and depend |
| 7588 on the target architecture, so it is difficult to guarantee that an implementat
ion |
| 7589 avoids them. |
| 7590 For architectures on which denormals are problematic, adding very small |
| 7591 floating-point offsets to the affected signals to prevent significant numbers |
| 7592 of denormalized operations is RECOMMENDED. |
| 7593 Alternatively, it is often possible to configure the hardware to treat |
| 7594 denormals as zero (DAZ). |
| 7595 No such issue exists for the fixed-point reference implementation. |
| 7596 </t> |
| 7597 <t>The reference implementation was validated in the following conditions: |
| 7598 <list style="numbers"> |
| 7599 <t> |
| 7600 Sending the decoder valid packets generated by the reference encoder and |
| 7601 verifying that the decoder's final range coder state matches that of the |
| 7602 encoder. |
| 7603 </t> |
| 7604 <t> |
| 7605 Sending the decoder packets generated by the reference encoder and then |
| 7606 subjected to random corruption. |
| 7607 </t> |
| 7608 <t>Sending the decoder random packets.</t> |
| 7609 <t> |
| 7610 Sending the decoder packets generated by a version of the reference encoder |
| 7611 modified to make random coding decisions (internal fuzzing), including mode |
| 7612 switching, and verifying that the range coder final states match. |
| 7613 </t> |
| 7614 </list> |
| 7615 In all of the conditions above, both the encoder and the decoder were run |
| 7616 inside the <xref target="Valgrind">Valgrind</xref> memory |
| 7617 debugger, which tracks reads and writes to invalid memory regions as well as |
| 7618 the use of uninitialized memory. |
| 7619 There were no errors reported on any of the tested conditions. |
| 7620 </t> |
| 7621 </section> |
| 7622 |
| 7623 |
| 7624 <section title="IANA Considerations"> |
| 7625 <t> |
| 7626 This document has no actions for IANA. |
| 7627 </t> |
| 7628 </section> |
| 7629 |
| 7630 <section anchor="Acknowledgements" title="Acknowledgements"> |
| 7631 <t> |
| 7632 Thanks to all other developers, including Raymond Chen, Soeren Skak Jensen, Greg
ory Maxwell, |
| 7633 Christopher Montgomery, and Karsten Vandborg Soerensen. We would also |
| 7634 like to thank Igor Dyakonov, Jan Skoglund, and Christian Hoene for their help wi
th subjective testing of the |
| 7635 Opus codec. Thanks to Ralph Giles, John Ridges, Ben Schwartz, Keith Yan, Christi
an Hoene, Kat Walsh, and many others on the Opus and CELT mailing lists |
| 7636 for their bug reports and feedback. |
| 7637 </t> |
| 7638 </section> |
| 7639 |
| 7640 <section title="Copying Conditions"> |
| 7641 <t>The authors agree to grant third parties the irrevocable right to copy, use a
nd distribute |
| 7642 the work (excluding Code Components available under the simplified BSD license),
with or |
| 7643 without modification, in any medium, without royalty, provided that, unless sepa
rate |
| 7644 permission is granted, redistributed modified works do not contain misleading au
thor, version, |
| 7645 name of work, or endorsement information.</t> |
| 7646 </section> |
| 7647 |
| 7648 </middle> |
| 7649 |
| 7650 <back> |
| 7651 |
| 7652 <references title="Normative References"> |
| 7653 |
| 7654 <reference anchor="rfc2119"> |
| 7655 <front> |
| 7656 <title>Key words for use in RFCs to Indicate Requirement Levels </title> |
| 7657 <author initials="S." surname="Bradner" fullname="Scott Bradner"></author> |
| 7658 </front> |
| 7659 <seriesInfo name="RFC" value="2119" /> |
| 7660 </reference> |
| 7661 |
| 7662 </references> |
| 7663 |
| 7664 <references title="Informative References"> |
| 7665 |
| 7666 <reference anchor='requirements'> |
| 7667 <front> |
| 7668 <title>Requirements for an Internet Audio Codec</title> |
| 7669 <author initials='J.-M.' surname='Valin' fullname='J.-M. Valin'> |
| 7670 <organization /></author> |
| 7671 <author initials='K.' surname='Vos' fullname='K. Vos'> |
| 7672 <organization /></author> |
| 7673 <author> |
| 7674 <organization>IETF</organization></author> |
| 7675 <date year='2011' month='August' /> |
| 7676 <abstract> |
| 7677 <t>This document provides specific requirements for an Internet audio |
| 7678 codec. These requirements address quality, sample rate, bitrate, |
| 7679 and packet-loss robustness, as well as other desirable properties. |
| 7680 </t></abstract></front> |
| 7681 <seriesInfo name='RFC' value='6366' /> |
| 7682 <format type='TXT' target='http://tools.ietf.org/rfc/rfc6366.txt' /> |
| 7683 </reference> |
| 7684 |
| 7685 <?rfc include="http://xml.resource.org/public/rfc/bibxml/reference.RFC.3550.xml"
?> |
| 7686 <?rfc include="http://xml.resource.org/public/rfc/bibxml/reference.RFC.3533.xml"
?> |
| 7687 |
| 7688 <reference anchor='SILK' target='http://developer.skype.com/silk'> |
| 7689 <front> |
| 7690 <title>SILK Speech Codec</title> |
| 7691 <author initials='K.' surname='Vos' fullname='K. Vos'> |
| 7692 <organization /></author> |
| 7693 <author initials='S.' surname='Jensen' fullname='S. Jensen'> |
| 7694 <organization /></author> |
| 7695 <author initials='K.' surname='Soerensen' fullname='K. Soerensen'> |
| 7696 <organization /></author> |
| 7697 <date year='2010' month='March' /> |
| 7698 <abstract> |
| 7699 <t></t> |
| 7700 </abstract></front> |
| 7701 <seriesInfo name='Internet-Draft' value='draft-vos-silk-01' /> |
| 7702 <format type='TXT' target='http://tools.ietf.org/html/draft-vos-silk-01' /> |
| 7703 </reference> |
| 7704 |
| 7705 <reference anchor="laroia-icassp"> |
| 7706 <front> |
| 7707 <title abbrev="Robust and Efficient Quantization of Speech LSP"> |
| 7708 Robust and Efficient Quantization of Speech LSP Parameters Using Structured Vect
or Quantization |
| 7709 </title> |
| 7710 <author initials="R.L." surname="Laroia" fullname="R. Laroia"> |
| 7711 <organization/> |
| 7712 </author> |
| 7713 <author initials="N.P." surname="Phamdo" fullname="N. Phamdo"> |
| 7714 <organization/> |
| 7715 </author> |
| 7716 <author initials="N.F." surname="Farvardin" fullname="N. Farvardin"> |
| 7717 <organization/> |
| 7718 </author> |
| 7719 </front> |
| 7720 <seriesInfo name="ICASSP-1991, Proc. IEEE Int. Conf. Acoust., Speech, Signal Pro
cessing, pp. 641-644, October" value="1991"/> |
| 7721 </reference> |
| 7722 |
| 7723 <reference anchor='CELT' target='http://celt-codec.org/'> |
| 7724 <front> |
| 7725 <title>Constrained-Energy Lapped Transform (CELT) Codec</title> |
| 7726 <author initials='J-M.' surname='Valin' fullname='J-M. Valin'> |
| 7727 <organization /></author> |
| 7728 <author initials='T.B.' surname='Terriberry' fullname='Timothy B. Terriberr
y'> |
| 7729 <organization /></author> |
| 7730 <author initials='G.' surname='Maxwell' fullname='G. Maxwell'> |
| 7731 <organization /></author> |
| 7732 <author initials='C.' surname='Montgomery' fullname='C. Montgomery'> |
| 7733 <organization /></author> |
| 7734 <date year='2010' month='July' /> |
| 7735 <abstract> |
| 7736 <t></t> |
| 7737 </abstract></front> |
| 7738 <seriesInfo name='Internet-Draft' value='draft-valin-celt-codec-02' /> |
| 7739 <format type='TXT' target='http://tools.ietf.org/html/draft-valin-celt-codec-02'
/> |
| 7740 </reference> |
| 7741 |
| 7742 <reference anchor='SRTP-VBR'> |
| 7743 <front> |
| 7744 <title>Guidelines for the use of Variable Bit Rate Audio with Secure RTP</title> |
| 7745 <author initials='C.' surname='Perkins' fullname='C. Perkins'> |
| 7746 <organization /></author> |
| 7747 <author initials='J.M.' surname='Valin' fullname='J.M. Valin'> |
| 7748 <organization /></author> |
| 7749 <date year='2011' month='July' /> |
| 7750 <abstract> |
| 7751 <t></t> |
| 7752 </abstract></front> |
| 7753 <seriesInfo name='RFC' value='6562' /> |
| 7754 <format type='TXT' target='http://tools.ietf.org/html/rfc6562' /> |
| 7755 </reference> |
| 7756 |
| 7757 <reference anchor='DOS'> |
| 7758 <front> |
| 7759 <title>Internet Denial-of-Service Considerations</title> |
| 7760 <author initials='M.' surname='Handley' fullname='M. Handley'> |
| 7761 <organization /></author> |
| 7762 <author initials='E.' surname='Rescorla' fullname='E. Rescorla'> |
| 7763 <organization /></author> |
| 7764 <author> |
| 7765 <organization>IAB</organization></author> |
| 7766 <date year='2006' month='December' /> |
| 7767 <abstract> |
| 7768 <t>This document provides an overview of possible avenues for denial-of-service
(DoS) attack on Internet systems. The aim is to encourage protocol designers an
d network engineers towards designs that are more robust. We discuss partial so
lutions that reduce the effectiveness of attacks, and how some solutions might i
nadvertently open up alternative vulnerabilities. This memo provides informatio
n for the Internet community.</t></abstract></front> |
| 7769 <seriesInfo name='RFC' value='4732' /> |
| 7770 <format type='TXT' octets='91844' target='ftp://ftp.isi.edu/in-notes/rfc4732.txt
' /> |
| 7771 </reference> |
| 7772 |
| 7773 <reference anchor="Martin79"> |
| 7774 <front> |
| 7775 <title>Range encoding: An algorithm for removing redundancy from a digitised mes
sage</title> |
| 7776 <author initials="G.N.N." surname="Martin" fullname="G. Nigel N. Martin"><organi
zation/></author> |
| 7777 <date year="1979" /> |
| 7778 </front> |
| 7779 <seriesInfo name="Proc. Institution of Electronic and Radio Engineers Internatio
nal Conference on Video and Data Recording" value="" /> |
| 7780 </reference> |
| 7781 |
| 7782 <reference anchor="coding-thesis"> |
| 7783 <front> |
| 7784 <title>Source coding algorithms for fast data compression</title> |
| 7785 <author initials="R." surname="Pasco" fullname=""><organization/></author> |
| 7786 <date month="May" year="1976" /> |
| 7787 </front> |
| 7788 <seriesInfo name="Ph.D. thesis" value="Dept. of Electrical Engineering, Stanford
University" /> |
| 7789 </reference> |
| 7790 |
| 7791 <reference anchor="PVQ"> |
| 7792 <front> |
| 7793 <title>A Pyramid Vector Quantizer</title> |
| 7794 <author initials="T." surname="Fischer" fullname=""><organization/></author> |
| 7795 <date month="July" year="1986" /> |
| 7796 </front> |
| 7797 <seriesInfo name="IEEE Trans. on Information Theory, Vol. 32" value="pp. 568-583
" /> |
| 7798 </reference> |
| 7799 |
| 7800 <reference anchor="Kabal86"> |
| 7801 <front> |
| 7802 <title>The Computation of Line Spectral Frequencies Using Chebyshev Polynomials<
/title> |
| 7803 <author initials="P." surname="Kabal" fullname="P. Kabal"><organization/></autho
r> |
| 7804 <author initials="R." surname="Ramachandran" fullname="R. P. Ramachandran"><orga
nization/></author> |
| 7805 <date month="December" year="1986" /> |
| 7806 </front> |
| 7807 <seriesInfo name="IEEE Trans. Acoustics, Speech, Signal Processing, vol. 34, no.
6" value="pp. 1419-1426" /> |
| 7808 </reference> |
| 7809 |
| 7810 |
| 7811 <reference anchor="Valgrind" target="http://valgrind.org/"> |
| 7812 <front> |
| 7813 <title>Valgrind website</title> |
| 7814 <author></author> |
| 7815 </front> |
| 7816 </reference> |
| 7817 |
| 7818 <reference anchor="Google-NetEQ" target="http://code.google.com/p/webrtc/source/
browse/trunk/src/modules/audio_coding/NetEQ/main/source/?r=583"> |
| 7819 <front> |
| 7820 <title>Google NetEQ code</title> |
| 7821 <author></author> |
| 7822 </front> |
| 7823 </reference> |
| 7824 |
| 7825 <reference anchor="Google-WebRTC" target="http://code.google.com/p/webrtc/"> |
| 7826 <front> |
| 7827 <title>Google WebRTC code</title> |
| 7828 <author></author> |
| 7829 </front> |
| 7830 </reference> |
| 7831 |
| 7832 |
| 7833 <reference anchor="Opus-git" target="git://git.xiph.org/opus.git"> |
| 7834 <front> |
| 7835 <title>Opus Git Repository</title> |
| 7836 <author></author> |
| 7837 </front> |
| 7838 </reference> |
| 7839 |
| 7840 <reference anchor="Opus-website" target="http://opus-codec.org/"> |
| 7841 <front> |
| 7842 <title>Opus website</title> |
| 7843 <author></author> |
| 7844 </front> |
| 7845 </reference> |
| 7846 |
| 7847 <reference anchor="Vorbis-website" target="http://xiph.org/vorbis/"> |
| 7848 <front> |
| 7849 <title>Vorbis website</title> |
| 7850 <author></author> |
| 7851 </front> |
| 7852 </reference> |
| 7853 |
| 7854 <reference anchor="Matroska-website" target="http://matroska.org/"> |
| 7855 <front> |
| 7856 <title>Matroska website</title> |
| 7857 <author></author> |
| 7858 </front> |
| 7859 </reference> |
| 7860 |
| 7861 <reference anchor="Vectors-website" target="http://opus-codec.org/testvectors/"> |
| 7862 <front> |
| 7863 <title>Opus Testvectors (website)</title> |
| 7864 <author></author> |
| 7865 </front> |
| 7866 </reference> |
| 7867 |
| 7868 <reference anchor="Vectors-proc" target="http://www.ietf.org/proceedings/83/slid
es/slides-83-codec-0.gz"> |
| 7869 <front> |
| 7870 <title>Opus Testvectors (proceedings)</title> |
| 7871 <author></author> |
| 7872 </front> |
| 7873 </reference> |
| 7874 |
| 7875 <reference anchor="line-spectral-pairs" target="http://en.wikipedia.org/wiki/Lin
e_spectral_pairs"> |
| 7876 <front> |
| 7877 <title>Line Spectral Pairs</title> |
| 7878 <author><organization>Wikipedia</organization></author> |
| 7879 </front> |
| 7880 </reference> |
| 7881 |
| 7882 <reference anchor="range-coding" target="http://en.wikipedia.org/wiki/Range_codi
ng"> |
| 7883 <front> |
| 7884 <title>Range Coding</title> |
| 7885 <author><organization>Wikipedia</organization></author> |
| 7886 </front> |
| 7887 </reference> |
| 7888 |
| 7889 <reference anchor="Hadamard" target="http://en.wikipedia.org/wiki/Hadamard_trans
form"> |
| 7890 <front> |
| 7891 <title>Hadamard Transform</title> |
| 7892 <author><organization>Wikipedia</organization></author> |
| 7893 </front> |
| 7894 </reference> |
| 7895 |
| 7896 <reference anchor="Viterbi" target="http://en.wikipedia.org/wiki/Viterbi_algorit
hm"> |
| 7897 <front> |
| 7898 <title>Viterbi Algorithm</title> |
| 7899 <author><organization>Wikipedia</organization></author> |
| 7900 </front> |
| 7901 </reference> |
| 7902 |
| 7903 <reference anchor="Whitening" target="http://en.wikipedia.org/wiki/White_noise"> |
| 7904 <front> |
| 7905 <title>White Noise</title> |
| 7906 <author><organization>Wikipedia</organization></author> |
| 7907 </front> |
| 7908 </reference> |
| 7909 |
| 7910 <reference anchor="LPC" target="http://en.wikipedia.org/wiki/Linear_prediction"> |
| 7911 <front> |
| 7912 <title>Linear Prediction</title> |
| 7913 <author><organization>Wikipedia</organization></author> |
| 7914 </front> |
| 7915 </reference> |
| 7916 |
| 7917 <reference anchor="MDCT" target="http://en.wikipedia.org/wiki/Modified_discrete_cosine_transform"> |
| 7918 <front> |
| 7919 <title>Modified Discrete Cosine Transform</title> |
| 7920 <author><organization>Wikipedia</organization></author> |
| 7921 </front> |
| 7922 </reference> |
| 7923 |
| 7924 <reference anchor="FFT" target="http://en.wikipedia.org/wiki/Fast_Fourier_transform"> |
| 7925 <front> |
| 7926 <title>Fast Fourier Transform</title> |
| 7927 <author><organization>Wikipedia</organization></author> |
| 7928 </front> |
| 7929 </reference> |
| 7930 |
| 7931 <reference anchor="z-transform" target="http://en.wikipedia.org/wiki/Z-transform"> |
| 7932 <front> |
| 7933 <title>Z-transform</title> |
| 7934 <author><organization>Wikipedia</organization></author> |
| 7935 </front> |
| 7936 </reference> |
| 7937 |
| 7938 |
| 7939 <reference anchor="Burg"> |
| 7940 <front> |
| 7941 <title>Maximum Entropy Spectral Analysis</title> |
| 7942 <author initials="JP." surname="Burg" fullname="J.P. Burg"><organization/></author> |
| 7943 </front> |
| 7944 </reference> |
| 7945 |
| 7946 <reference anchor="Schur"> |
| 7947 <front> |
| 7948 <title>A fixed point computation of partial correlation coefficients</title> |
| 7949 <author initials="J." surname="Le Roux" fullname="J. Le Roux"><organization/></author> |
| 7950 <author initials="C." surname="Gueguen" fullname="C. Gueguen"><organization/></author> |
| 7951 </front> |
| 7952 <seriesInfo name="ICASSP-1977, Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 257-259, October" value="1977"/> |
| 7953 </reference> |
| 7954 |
| 7955 <reference anchor="Princen86"> |
| 7956 <front> |
| 7957 <title>Analysis/synthesis filter bank design based on time domain aliasing cancellation</title> |
| 7958 <author initials="J." surname="Princen" fullname="John P. Princen"><organization/></author> |
| 7959 <author initials="A." surname="Bradley" fullname="Alan B. Bradley"><organization/></author> |
| 7960 </front> |
| 7961 <seriesInfo name="IEEE Trans. Acoust. Speech Sig. Proc. ASSP-34 (5), 1153-1161" value="1986"/> |
| 7962 </reference> |
| 7963 |
| 7964 <reference anchor="Valin2010"> |
| 7965 <front> |
| 7966 <title>A High-Quality Speech and Audio Codec With Less Than 10 ms delay</title> |
| 7967 <author initials="JM" surname="Valin" fullname="Jean-Marc Valin"><organization/> |
| 7968 </author> |
| 7969 <author initials="T. B." surname="Terriberry" fullname="Timothy Terriberry"><organization/></author> |
| 7970 <author initials="C." surname="Montgomery" fullname="Christopher Montgomery"><organization/></author> |
| 7971 <author initials="G." surname="Maxwell" fullname="Gregory Maxwell"><organization/></author> |
| 7972 </front> |
| 7973 <seriesInfo name="IEEE Trans. on Audio, Speech and Language Processing, Vol. 18, No. 1, pp. 58-67" value="2010" /> |
| 7974 </reference> |
| 7975 |
| 7976 |
| 7977 <reference anchor="Zwicker61"> |
| 7978 <front> |
| 7979 <title>Subdivision of the audible frequency range into critical bands</title> |
| 7980 <author initials="E." surname="Zwicker" fullname="E. Zwicker"><organization/></author> |
| 7981 <date month="February" year="1961" /> |
| 7982 </front> |
| 7983 <seriesInfo name="The Journal of the Acoustical Society of America, Vol. 33, No. 2" value="p. 248" /> |
| 7984 </reference> |
| 7985 |
| 7986 |
| 7987 </references> |
| 7988 |
| 7989 <section anchor="ref-implementation" title="Reference Implementation"> |
| 7990 |
| 7991 <t>This appendix contains the complete source code for the |
| 7992 reference implementation of the Opus codec written in C. By default, |
| 7993 this implementation relies on floating-point arithmetic, but it can be |
| 7994 compiled to use only fixed-point arithmetic by defining the FIXED_POINT |
| 7995 macro. Information on building and using the reference implementation is |
| 7996 available in the README file. |
| 7997 </t> |
| 7998 |
| 7999 <t>The implementation can be compiled with either a C89 or a C99 |
| 8000 compiler. It is reasonably optimized for most platforms such that |
| 8001 only architecture-specific optimizations are likely to be useful. |
| 8002 The FFT <xref target="FFT"/> used is a slightly modified version of the KISS-FFT library, |
| 8003 but it is easy to substitute any other FFT library. |
| 8004 </t> |
| 8005 |
| 8006 <t> |
| 8007 While the reference implementation does not rely on any |
| 8008 <spanx style="emph">undefined behavior</spanx> as defined by C89 or C99, |
| 8009 it relies on common <spanx style="emph">implementation-defined behavior</spanx> |
| 8010 for two's complement architectures: |
| 8011 <list style="symbols"> |
| 8012 <t>Right shifts of negative values are consistent with two's complement arithmetic, so that a>>b is equivalent to floor(a/(2**b)),</t> |
| 8013 <t>For conversion to a signed integer of N bits, the value is reduced modulo 2**N to be within range of the type,</t> |
| 8014 <t>The result of integer division of a negative value is truncated towards zero, and</t> |
| 8015 <t>The compiler provides a 64-bit integer type (a C99 requirement which is supported by most C89 compilers).</t> |
| 8016 </list> |
| 8017 </t> |
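<t>
The assumptions listed above can be checked on a given compiler and target
before building. The following is a minimal sketch only; the function name
platform_assumptions_hold is hypothetical and is not part of the reference
implementation. It returns 1 when all four behaviors are observed.
</t>
<figure><artwork><![CDATA[
```c
#include <stdint.h>

/* Hypothetical self-check, not part of the reference implementation.
 * Returns 1 if the compiler and target show the implementation-defined
 * behaviors listed above, and 0 otherwise. The volatile qualifiers keep
 * the compiler from folding the checks at compile time. */
static int platform_assumptions_hold(void)
{
    volatile int32_t neg = -5;
    volatile int32_t wide = 0x18000;   /* 98304, does not fit in 16 bits */
    volatile int32_t num = -7;
    int ok = 1;

    /* Right shift of a negative value acts as floor(a/(2**b)):
     * floor(-5/2) is -3, not -2. */
    ok = ok && ((neg >> 1) == -3);

    /* Conversion to a 16-bit signed integer reduces modulo 2**16:
     * 98304 mod 65536 = 32768, which wraps to -32768. */
    ok = ok && ((int16_t)wide == -32768);

    /* Integer division of a negative value truncates towards zero. */
    ok = ok && (num / 2 == -3) && (num % 2 == -1);

    /* A 64-bit integer type is available and holds values above 2**32. */
    ok = ok && (((int64_t)1 << 62) > 0);

    return ok;
}
```
]]></artwork></figure>
<t>
A build script could compile and run such a check once, and refuse to
continue if it returns 0.
</t>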
| 8018 |
| 8019 <t> |
| 8020 In its current form, the reference implementation also requires the following |
| 8021 architectural characteristics to obtain acceptable performance: |
| 8022 <list style="symbols"> |
| 8023 <t>Two's complement arithmetic,</t> |
| 8024 <t>At least a 16-bit by 16-bit integer multiplier (32-bit result), and</t> |
| 8025 <t>At least a 32-bit adder/accumulator.</t> |
| 8026 </list> |
| 8027 </t> |
| 8028 |
| 8029 |
| 8030 <section title="Extracting the source"> |
| 8031 <t> |
| 8032 The complete source code can be extracted from this draft, by running the |
| 8033 following command line: |
| 8034 |
| 8035 <list style="symbols"> |
| 8036 <t><![CDATA[ |
| 8037 cat draft-ietf-codec-opus.txt | grep '^\ \ \ ###' | sed -e 's/...###//' | base64 -d > opus_source.tar.gz |
| 8038 ]]></t> |
| 8039 <t> |
| 8040 tar xzvf opus_source.tar.gz |
| 8041 </t> |
| 8042 <t>cd opus_source</t> |
| 8043 <t>make</t> |
| 8044 </list> |
| 8045 On systems where the provided Makefile does not work, the following command line may be used to compile |
| 8046 the source code: |
| 8047 <list style="symbols"> |
| 8048 <t><![CDATA[ |
| 8049 cc -O2 -g -o opus_demo src/opus_demo.c `cat *.mk | grep -v fixed | sed -e 's/.*=//' -e 's/\\\\//'` -DOPUS_BUILD -Iinclude -Icelt -Isilk -Isilk/float -DUSE_ALLOCA -Drestrict= -lm |
| 8050 ]]></t></list> |
| 8051 </t> |
| 8052 |
| 8053 <t> |
| 8054 On systems where the base64 utility is not present, the following commands can be used instead: |
| 8055 <list style="symbols"> |
| 8056 <t><![CDATA[ |
| 8057 cat draft-ietf-codec-opus.txt | grep '^\ \ \ ###' | sed -e 's/...###//' > opus.b64 |
| 8058 ]]></t> |
| 8059 <t>openssl base64 -d -in opus.b64 > opus_source.tar.gz</t> |
| 8060 </list> |
| 8061 |
| 8062 </t> |
| 8063 </section> |
| 8064 |
| 8065 <section title="Up-to-date Implementation"> |
| 8066 <t> |
| 8067 As of the time of publication of this memo, an up-to-date implementation conforming to |
| 8068 this standard is available in a |
| 8069 <xref target='Opus-git'>Git repository</xref>. |
| 8070 Releases and other resources are available at |
| 8071 <xref target='Opus-website'/>. However, although that implementation is expected to |
| 8072 remain conformant with the standard, it is the code in this document that shall |
| 8073 remain normative. |
| 8074 </t> |
| 8075 </section> |
| 8076 |
| 8077 <section title="Base64-encoded Source Code"> |
| 8078 <t> |
| 8079 <?rfc include="opus_source.base64"?> |
| 8080 </t> |
| 8081 </section> |
| 8082 |
| 8083 <section anchor="test-vectors" title="Test Vectors"> |
| 8084 <t> |
| 8085 Because of size constraints, the Opus test vectors are not distributed in this |
| 8086 draft. They are available in the proceedings of the 83rd IETF meeting (Paris) <xref target="Vectors-proc"/> and from the Opus codec website at |
| 8087 <xref target="Vectors-website"/>. These test vectors were created specifically to exercise |
| 8088 all aspects of the decoder, and therefore the audio quality of the decoded output is |
| 8089 significantly lower than what Opus can achieve in normal operation. |
| 8090 </t> |
| 8091 |
| 8092 <t> |
| 8093 The SHA1 hashes of the files in the test vector package are |
| 8094 <?rfc include="testvectors_sha1"?> |
| 8095 </t> |
| 8096 |
| 8097 </section> |
| 8098 |
| 8099 </section> |
| 8100 |
| 8101 <section anchor="self-delimiting-framing" title="Self-Delimiting Framing"> |
| 8102 <t> |
| 8103 To use the internal framing described in <xref target="modes"/>, the decoder |
| 8104 must know the total length of the Opus packet, in bytes. |
| 8105 This section describes a simple variation of that framing which can be used |
| 8106 when the total length of the packet is not known. |
| 8107 Nothing in the encoding of the packet itself allows a decoder to distinguish |
| 8108 between the regular, undelimited framing and the self-delimiting framing |
| 8109 described in this appendix. |
| 8110 Which one is used and where must be established by context at the transport |
| 8111 layer. |
| 8112 It is RECOMMENDED that a transport layer choose exactly one framing scheme, |
| 8113 rather than allowing an encoder to signal which one it wants to use. |
| 8114 </t> |
| 8115 |
| 8116 <t> |
| 8117 For example, although a regular Opus stream does not support more than two |
| 8118 channels, a multi-channel Opus stream may be formed from several one- and |
| 8119 two-channel streams. |
| 8120 To pack an Opus packet from each of these streams together in a single packet |
| 8121 at the transport layer, one could use the self-delimiting framing for all but |
| 8122 the last stream, and then the regular, undelimited framing for the last one. |
| 8123 Reverting to the undelimited framing for the last stream saves overhead |
| 8124 (because the total size of the transport-layer packet will still be known), |
| 8125 and ensures that a "multi-channel" stream which only has a single Opus stream |
| 8126 uses the same framing as a regular Opus stream does. |
| 8127 This avoids the need for signaling to distinguish these two cases. |
| 8128 </t> |
| 8129 |
| 8130 <t> |
| 8131 The self-delimiting framing is identical to the regular, undelimited framing |
| 8132 from <xref target="modes"/>, except that each Opus packet contains one extra |
| 8133 length field, encoded using the same one- or two-byte scheme from |
| 8134 <xref target="frame-length-coding"/>. |
| 8135 This extra length immediately precedes the compressed data of the first Opus |
| 8136 frame in the packet, and is interpreted in the various modes as follows: |
| 8137 <list style="symbols"> |
| 8138 <t> |
| 8139 Code 0 packets: It is the length of the single Opus frame (see |
| 8140 <xref target="sd_code0_packet"/>). |
| 8141 </t> |
| 8142 <t> |
| 8143 Code 1 packets: It is the length used for both of the Opus frames (see |
| 8144 <xref target="sd_code1_packet"/>). |
| 8145 </t> |
| 8146 <t> |
| 8147 Code 2 packets: It is the length of the second Opus frame (see |
| 8148 <xref target="sd_code2_packet"/>).</t> |
| 8149 <t> |
| 8150 CBR Code 3 packets: It is the length used for all of the Opus frames (see |
| 8151 <xref target="sd_code3cbr_packet"/>). |
| 8152 </t> |
| 8153 <t>VBR Code 3 packets: It is the length of the last Opus frame (see |
| 8154 <xref target="sd_code3vbr_packet"/>). |
| 8155 </t> |
| 8156 </list> |
| 8157 </t> |
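<t>
As an illustration of the rules above, the following sketch computes the
total size of a self-delimited packet for codes 0, 1, and 2, which is what a
transport layer needs in order to find where the next packet starts. It
assumes the one- or two-byte length coding works as follows: a first byte
below 252 is the length itself, and otherwise the length is four times the
second byte plus the first byte. The names read_length and sd_packet_size
are hypothetical, and code 3 packets are deliberately left unhandled to keep
the example short.
</t>
<figure><artwork><![CDATA[
```c
#include <stddef.h>

/* Hypothetical helper: decode a one- or two-byte length field.
 * Returns the number of bytes consumed (1 or 2), or 0 if the
 * buffer is too short. */
static size_t read_length(const unsigned char *p, size_t avail,
                          size_t *len)
{
    if (avail < 1)
        return 0;
    if (p[0] < 252) {
        *len = p[0];
        return 1;
    }
    if (avail < 2)
        return 0;
    *len = 4 * (size_t)p[1] + p[0];
    return 2;
}

/* Hypothetical helper: total size in bytes of the self-delimited
 * packet starting at p. Returns 0 if the packet is truncated or if
 * it is a code 3 packet, which this sketch does not handle. */
static size_t sd_packet_size(const unsigned char *p, size_t avail)
{
    size_t n1 = 0, n2 = 0, used, pos = 1;  /* pos skips the TOC byte */
    unsigned code;

    if (avail < 1)
        return 0;
    code = p[0] & 0x3;             /* low two bits of the TOC byte */
    if (code == 3)
        return 0;

    used = read_length(p + pos, avail - pos, &n1);
    if (used == 0)
        return 0;
    pos += used;

    if (code == 0) {
        pos += n1;                 /* one frame of N1 bytes */
    } else if (code == 1) {
        pos += 2 * n1;             /* two frames of N1 bytes each */
    } else {
        /* Code 2: the extra length N2 follows N1 and gives the size
         * of the second frame. */
        used = read_length(p + pos, avail - pos, &n2);
        if (used == 0)
            return 0;
        pos += used + n1 + n2;
    }
    return pos <= avail ? pos : 0;
}
```
]]></artwork></figure>
<t>
In the multi-stream arrangement described earlier, a receiver could call
such a function once for each stream except the last, advancing by the
returned size each time, and then treat the remaining bytes as a regular,
undelimited packet.
</t>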
| 8158 |
| 8159 <figure anchor="sd_code0_packet" title="A Self-Delimited Code 0 Packet" |
| 8160 align="center"> |
| 8161 <artwork align="center"><![CDATA[ |
| 8162 0 1 2 3 |
| 8163 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 |
| 8164 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 8165 | config |s|0|0| N1 (1-2 bytes): | |
| 8166 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| 8167 | Compressed frame 1 (N1 bytes)... : |
| 8168 : | |
| 8169 | | |
| 8170 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 8171 ]]></artwork> |
| 8172 </figure> |
| 8173 |
| 8174 <figure anchor="sd_code1_packet" title="A Self-Delimited Code 1 Packet" |
| 8175 align="center"> |
| 8176 <artwork align="center"><![CDATA[ |
| 8177 0 1 2 3 |
| 8178 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 |
| 8179 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 8180 | config |s|0|1| N1 (1-2 bytes): | |
| 8181 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ : |
| 8182 | Compressed frame 1 (N1 bytes)... | |
| 8183 : +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 8184 | | | |
| 8185 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ : |
| 8186 | Compressed frame 2 (N1 bytes)... | |
| 8187 : +-+-+-+-+-+-+-+-+ |
| 8188 | | |
| 8189 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 8190 ]]></artwork> |
| 8191 </figure> |
| 8192 |
| 8193 <figure anchor="sd_code2_packet" title="A Self-Delimited Code 2 Packet" |
| 8194 align="center"> |
| 8195 <artwork align="center"><![CDATA[ |
| 8196 0 1 2 3 |
| 8197 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 |
| 8198 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 8199 | config |s|1|0| N1 (1-2 bytes): N2 (1-2 bytes : | |
| 8200 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ : |
| 8201 | Compressed frame 1 (N1 bytes)... | |
| 8202 : +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 8203 | | | |
| 8204 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
| 8205 | Compressed frame 2 (N2 bytes)... : |
| 8206 : | |
| 8207 | | |
| 8208 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 8209 ]]></artwork> |
| 8210 </figure> |
| 8211 |
| 8212 <figure anchor="sd_code3cbr_packet" title="A Self-Delimited CBR Code 3 Packet" |
| 8213 align="center"> |
| 8214 <artwork align="center"><![CDATA[ |
| 8215 0 1 2 3 |
| 8216 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 |
| 8217 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 8218 | config |s|1|1|0|p| M | Pad len (Opt) : N1 (1-2 bytes): |
| 8219 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 8220 | | |
| 8221 : Compressed frame 1 (N1 bytes)... : |
| 8222 | | |
| 8223 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 8224 | | |
| 8225 : Compressed frame 2 (N1 bytes)... : |
| 8226 | | |
| 8227 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 8228 | | |
| 8229 : ... : |
| 8230 | | |
| 8231 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 8232 | | |
| 8233 : Compressed frame M (N1 bytes)... : |
| 8234 | | |
| 8235 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 8236 : Opus Padding (Optional)... | |
| 8237 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 8238 ]]></artwork> |
| 8239 </figure> |
| 8240 |
| 8241 <figure anchor="sd_code3vbr_packet" title="A Self-Delimited VBR Code 3 Packet" |
| 8242 align="center"> |
| 8243 <artwork align="center"><![CDATA[ |
| 8244 0 1 2 3 |
| 8245 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 |
| 8246 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 8247 | config |s|1|1|1|p| M | Padding length (Optional) : |
| 8248 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 8249 : N1 (1-2 bytes): ... : N[M-1] | N[M] : |
| 8250 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 8251 | | |
| 8252 : Compressed frame 1 (N1 bytes)... : |
| 8253 | | |
| 8254 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 8255 | | |
| 8256 : Compressed frame 2 (N2 bytes)... : |
| 8257 | | |
| 8258 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 8259 | | |
| 8260 : ... : |
| 8261 | | |
| 8262 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 8263 | | |
| 8264 : Compressed frame M (N[M] bytes)... : |
| 8265 | | |
| 8266 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 8267 : Opus Padding (Optional)... | |
| 8268 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 8269 ]]></artwork> |
| 8270 </figure> |
| 8271 |
| 8272 </section> |
| 8273 |
| 8274 </back> |
| 8275 |
| 8276 </rfc> |