OLD | NEW |
(Empty) | |
| 1 <?xml version="1.0" encoding="utf-8"?> |
| 2 <!-- |
| 3 Copyright (c) 2012-2016 Xiph.Org Foundation and contributors |
| 4 |
| 5 Redistribution and use in source and binary forms, with or without |
| 6 modification, are permitted provided that the following conditions |
| 7 are met: |
| 8 |
| 9 - Redistributions of source code must retain the above copyright |
| 10 notice, this list of conditions and the following disclaimer. |
| 11 |
| 12 - Redistributions in binary form must reproduce the above copyright |
| 13 notice, this list of conditions and the following disclaimer in the |
| 14 documentation and/or other materials provided with the distribution. |
| 15 |
| 16 THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS |
| 17 ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT |
| 18 LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR |
| 19 A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER |
| 20 OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, |
| 21 EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, |
| 22 PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR |
| 23 PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF |
| 24 LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING |
| 25 NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS |
| 26 SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. |
| 27 |
| 28 Special permission is granted to remove the above copyright notice, list of |
| 29 conditions, and disclaimer when submitting this document, with or without |
| 30 modification, to the IETF. |
| 31 --> |
| 32 <!DOCTYPE rfc SYSTEM 'rfc2629.dtd' [ |
| 33 <!ENTITY rfc2119 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml'> |
| 34 <!ENTITY rfc3533 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.3533.xml'> |
| 35 <!ENTITY rfc3629 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.3629.xml'> |
| 36 <!ENTITY rfc4732 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.4732.xml'> |
| 37 <!ENTITY rfc5226 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.5226.xml'> |
| 38 <!ENTITY rfc5334 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.5334.xml'> |
| 39 <!ENTITY rfc6381 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.6381.xml'> |
| 40 <!ENTITY rfc6716 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.6716.xml'> |
| 41 <!ENTITY rfc6982 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.6982.xml'> |
| 42 <!ENTITY rfc7587 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.7587.xml'> |
| 43 ]> |
| 44 <?rfc toc="yes" symrefs="yes" ?> |
| 45 |
| 46 <rfc ipr="trust200902" category="std" docName="draft-ietf-codec-oggopus-14" |
| 47 updates="5334"> |
| 48 |
| 49 <front> |
| 50 <title abbrev="Ogg Opus">Ogg Encapsulation for the Opus Audio Codec</title> |
| 51 <author initials="T.B." surname="Terriberry" fullname="Timothy B. Terriberry"> |
| 52 <organization>Mozilla Corporation</organization> |
| 53 <address> |
| 54 <postal> |
| 55 <street>650 Castro Street</street> |
| 56 <city>Mountain View</city> |
| 57 <region>CA</region> |
| 58 <code>94041</code> |
| 59 <country>USA</country> |
| 60 </postal> |
| 61 <phone>+1 650 903-0800</phone> |
| 62 <email>tterribe@xiph.org</email> |
| 63 </address> |
| 64 </author> |
| 65 |
| 66 <author initials="R." surname="Lee" fullname="Ron Lee"> |
| 67 <organization>Voicetronix</organization> |
| 68 <address> |
| 69 <postal> |
| 70 <street>246 Pulteney Street, Level 1</street> |
| 71 <city>Adelaide</city> |
| 72 <region>SA</region> |
| 73 <code>5000</code> |
| 74 <country>Australia</country> |
| 75 </postal> |
| 76 <phone>+61 8 8232 9112</phone> |
| 77 <email>ron@debian.org</email> |
| 78 </address> |
| 79 </author> |
| 80 |
| 81 <author initials="R." surname="Giles" fullname="Ralph Giles"> |
| 82 <organization>Mozilla Corporation</organization> |
| 83 <address> |
| 84 <postal> |
| 85 <street>163 West Hastings Street</street> |
| 86 <city>Vancouver</city> |
| 87 <region>BC</region> |
| 88 <code>V6B 1H5</code> |
| 89 <country>Canada</country> |
| 90 </postal> |
| 91 <phone>+1 778 785 1540</phone> |
| 92 <email>giles@xiph.org</email> |
| 93 </address> |
| 94 </author> |
| 95 |
| 96 <date day="22" month="February" year="2016"/> |
| 97 <area>RAI</area> |
| 98 <workgroup>codec</workgroup> |
| 99 |
| 100 <abstract> |
| 101 <t> |
| 102 This document defines the Ogg encapsulation for the Opus interactive speech and |
| 103 audio codec. |
| 104 This allows data encoded in the Opus format to be stored in an Ogg logical |
| 105 bitstream. |
| 106 </t> |
| 107 </abstract> |
| 108 </front> |
| 109 |
| 110 <middle> |
| 111 <section anchor="intro" title="Introduction"> |
| 112 <t> |
| 113 The IETF Opus codec is a low-latency audio codec optimized for both voice and |
| 114 general-purpose audio. |
| 115 See <xref target="RFC6716"/> for technical details. |
| 116 This document defines the encapsulation of Opus in a continuous, logical Ogg |
| 117 bitstream <xref target="RFC3533"/>. |
| 118 Ogg encapsulation provides Opus with a long-term storage format supporting |
| 119 all of the essential features, including metadata, fast and accurate seeking, |
| 120 corruption detection, recapture after errors, low overhead, and the ability to |
| 121 multiplex Opus with other codecs (including video) with minimal buffering. |
| 122 It also provides a live streamable format, capable of delivery over a reliable |
| 123 stream-oriented transport, without requiring all the data, or even the total |
| 124 length of the data, up-front, in a form that is identical to the on-disk |
| 125 storage format. |
| 126 </t> |
| 127 <t> |
| 128 Ogg bitstreams are made up of a series of 'pages', each of which contains data |
| 129 from one or more 'packets'. |
| 130 Pages are the fundamental unit of multiplexing in an Ogg stream. |
| 131 Each page is associated with a particular logical stream and contains a capture |
| 132 pattern and checksum, flags to mark the beginning and end of the logical |
| 133 stream, and a 'granule position' that represents an absolute position in the |
| 134 stream, to aid seeking. |
| 135 A single page can contain up to 65,025 octets of packet data from up to 255 |
| 136 different packets. |
| 137 Packets can be split arbitrarily across pages, and continued from one page to |
| 138 the next (allowing packets much larger than would fit on a single page). |
| 139 Each page contains 'lacing values' that indicate how the data is partitioned |
| 140 into packets, allowing a demultiplexer (demuxer) to recover the packet |
| 141 boundaries without examining the encoded data. |
| 142 A packet is said to 'complete' on a page when the page contains the final |
| 143 lacing value corresponding to that packet. |
| 144 </t> |
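<t>
As a rough illustration of the lacing mechanism described above, the following
Python sketch recovers packet boundaries from a page's lacing values without
examining the packet data.
It assumes the segment table has already been read into a list of integers as
laid out in <xref target="RFC3533"/>; the function name packet_lengths is
hypothetical and does not belong to any existing Ogg library.
</t>
<figure>
<artwork><![CDATA[
```python
def packet_lengths(lacing_values):
    """Return (completed, pending) for one page's lacing values.

    completed: sizes in octets of the packet data runs that end on
               this page (a lacing value below 255 ends a packet).
    pending:   octets of a final packet that continues onto the
               next page, or 0 if the page ends on a packet boundary.
    """
    if len(lacing_values) > 255:
        raise ValueError("a page holds at most 255 lacing values")
    completed = []
    current = 0
    for value in lacing_values:
        if not 0 <= value <= 255:
            raise ValueError("each lacing value is a single octet")
        current += value
        if value < 255:
            # A value below 255 is the final lacing value of a packet.
            completed.append(current)
            current = 0
    return completed, current
```
]]></artwork>
</figure>
<t>
A page whose lacing values are all 255 yields no completed packets, which
corresponds to a page that is entirely spanned by a single packet.
</t>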
| 145 <t> |
| 146 This encapsulation defines the contents of the packet data, including |
| 147 the necessary headers, the organization of those packets into a logical |
| 148 stream, and the interpretation of the codec-specific granule position field. |
| 149 It does not attempt to describe or specify the existing Ogg container format. |
| 150 Readers unfamiliar with the basic concepts mentioned above are encouraged to |
| 151 review the details in <xref target="RFC3533"/>. |
| 152 </t> |
| 153 |
| 154 </section> |
| 155 |
| 156 <section anchor="terminology" title="Terminology"> |
| 157 <t> |
| 158 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", |
| 159 "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this |
| 160 document are to be interpreted as described in <xref target="RFC2119"/>. |
| 161 </t> |
| 162 |
| 163 </section> |
| 164 |
| 165 <section anchor="packet_organization" title="Packet Organization"> |
| 166 <t> |
| 167 An Ogg Opus stream is organized as follows (see |
| 168 <xref target="packet-org-example"/> for an example). |
| 169 </t> |
| 170 |
| 171 <figure anchor="packet-org-example" |
| 172 title="Example packet organization for a logical Ogg Opus stream" |
| 173 align="center"> |
| 174 <artwork align="center"><![CDATA[ |
| 175     Page 0         Pages 1 ... n        Pages (n+1) ... |
| 176 +------------+ +---+ +---+ ... +---+ +-----------+ +---------+ +-- |
| 177 |            | |   | |   |     |   | |           | |         | | |
| 178 |+----------+| |+-----------------+| |+-------------------+ +----- |
| 179 |||ID Header|| ||  Comment Header || ||Audio Data Packet 1| | ... |
| 180 |+----------+| |+-----------------+| |+-------------------+ +----- |
| 181 |            | |   | |   |     |   | |           | |         | | |
| 182 +------------+ +---+ +---+ ... +---+ +-----------+ +---------+ +-- |
| 183 ^      ^                           ^ |
| 184 |      |                           | |
| 185 |      |                           Mandatory Page Break |
| 186 |      | |
| 187 |      ID header is contained on a single page |
| 188 | |
| 189 'Beginning Of Stream' |
| 190 ]]></artwork> |
| 191 </figure> |
| 192 |
| 193 <t> |
| 194 There are two mandatory header packets. |
| 195 The first packet in the logical Ogg bitstream MUST contain the identification |
| 196 (ID) header, which uniquely identifies a stream as Opus audio. |
| 197 The format of this header is defined in <xref target="id_header"/>. |
| 198 It is placed alone (without any other packet data) on the first page of |
| 199 the logical Ogg bitstream, and completes on that page. |
| 200 This page has its 'beginning of stream' flag set. |
| 201 </t> |
| 202 <t> |
| 203 The second packet in the logical Ogg bitstream MUST contain the comment header, |
| 204 which contains user-supplied metadata. |
| 205 The format of this header is defined in <xref target="comment_header"/>. |
| 206 It MAY span multiple pages, beginning on the second page of the logical |
| 207 stream. |
| 208 However many pages it spans, the comment header packet MUST finish the page on |
| 209 which it completes. |
| 210 </t> |
| 211 <t> |
| 212 All subsequent pages are audio data pages, and the Ogg packets they contain are |
| 213 audio data packets. |
| 214 Each audio data packet contains one Opus packet for each of N different |
| 215 streams, where N is typically one for mono or stereo, but MAY be greater than |
| 216 one for multichannel audio. |
| 217 The value N is specified in the ID header (see |
| 218 <xref target="channel_mapping"/>), and is fixed over the entire length of the |
| 219 logical Ogg bitstream. |
| 220 </t> |
| 221 <t> |
| 222 The first (N - 1) Opus packets, if any, are packed one after another |
| 223 into the Ogg packet, using the self-delimiting framing from Appendix B of |
| 224 <xref target="RFC6716"/>. |
| 225 The remaining Opus packet is packed at the end of the Ogg packet using the |
| 226 regular, undelimited framing from Section 3 of <xref target="RFC6716"/>. |
| 227 All of the Opus packets in a single Ogg packet MUST be constrained to have the |
| 228 same duration. |
| 229 An implementation of this specification SHOULD treat any Opus packet whose |
| 230 duration is different from that of the first Opus packet in an Ogg packet as |
| 231 if it were a malformed Opus packet with an invalid Table Of Contents (TOC) |
| 232 sequence. |
| 233 </t> |
| 234 <t> |
| 235 The TOC sequence at the beginning of each Opus packet indicates the coding |
| 236 mode, audio bandwidth, channel count, duration (frame size), and number of |
| 237 frames per packet, as described in Section 3.1 |
| 238 of <xref target="RFC6716"/>. |
| 239 The coding mode is one of SILK, Hybrid, or Constrained Energy Lapped Transform |
| 240 (CELT). |
| 241 The combination of coding mode, audio bandwidth, and frame size is referred to |
| 242 as the configuration of an Opus packet. |
| 243 </t> |
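<t>
The TOC fields listed above can be read with a few bit operations.
The following Python sketch is one possible reading of the configuration table
in Section 3.1 of <xref target="RFC6716"/>, with frame sizes expressed in
samples at 48 kHz.
The function name parse_toc and the dictionary keys are hypothetical; the
sketch handles only the TOC byte and the frame count byte, not the rest of the
packet framing.
</t>
<figure>
<artwork><![CDATA[
```python
SILK_SIZES = (480, 960, 1920, 2880)  # 10, 20, 40, 60 ms
HYBRID_SIZES = (480, 960)            # 10, 20 ms
CELT_SIZES = (120, 240, 480, 960)    # 2.5, 5, 10, 20 ms
MAX_PACKET_SAMPLES = 5760            # 120 ms


def parse_toc(packet):
    """Describe an Opus packet from its TOC sequence."""
    if len(packet) < 1:
        raise ValueError("zero-octet packet is malformed")
    toc = packet[0]
    config = toc >> 3          # top five bits
    stereo = bool(toc & 0x04)  # 's' bit
    code = toc & 0x03          # frame count code
    if config < 12:
        mode = "SILK"
        bandwidth = ("NB", "MB", "WB")[config // 4]
        frame = SILK_SIZES[config % 4]
    elif config < 16:
        mode = "Hybrid"
        bandwidth = ("SWB", "FB")[(config - 12) // 2]
        frame = HYBRID_SIZES[config % 2]
    else:
        mode = "CELT"
        bandwidth = ("NB", "WB", "SWB", "FB")[(config - 16) // 4]
        frame = CELT_SIZES[config % 4]
    if code == 0:
        frames = 1
    elif code in (1, 2):
        frames = 2
    else:
        if len(packet) < 2:
            raise ValueError("code 3 packet lacks a frame count byte")
        frames = packet[1] & 0x3F  # low six bits hold the count
        if frames == 0:
            raise ValueError("code 3 packet with zero frames")
    samples = frame * frames
    if samples > MAX_PACKET_SAMPLES:
        raise ValueError("packet duration exceeds 120 ms")
    return {"mode": mode, "bandwidth": bandwidth, "stereo": stereo,
            "frame_size": frame, "frames": frames, "samples": samples}
```
]]></artwork>
</figure>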
| 244 <t> |
| 245 Packets are placed into Ogg pages in order until the end of stream. |
| 246 Audio data packets might span page boundaries. |
| 247 The first audio data page could have the 'continued packet' flag set |
| 248 (indicating the first audio data packet is continued from a previous page) if, |
| 249 for example, it was a live stream joined mid-broadcast, with the headers |
| 250 pasted on the front. |
| 251 If a page has the 'continued packet' flag set and one of the following |
| 252 conditions is also true: |
| 253 <list style="symbols"> |
| 254 <t>the previous page with packet data does not end in a continued packet (does |
| 255 not end with a lacing value of 255) OR</t> |
| 256 <t>the page sequence numbers are not consecutive,</t> |
| 257 </list> |
| 258 then a demuxer MUST NOT attempt to decode the data for the first packet on the |
| 259 page unless the demuxer has some special knowledge that would allow it to |
| 260 interpret this data despite the missing pieces. |
| 261 An implementation MUST treat a zero-octet audio data packet as if it were a |
| 262 malformed Opus packet as described in |
| 263 Section 3.4 of <xref target="RFC6716"/>. |
| 264 </t> |
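<t>
The two conditions above amount to a short check before decoding the first
packet on a page.
The Python sketch below is illustrative only: the function and parameter names
are hypothetical, and treating the page sequence number as a 32-bit counter
that wraps around is an assumption of this sketch.
It also ignores the "special knowledge" escape clause.
</t>
<figure>
<artwork><![CDATA[
```python
def first_packet_is_decodable(continued, prev_last_lacing,
                              prev_sequence, page_sequence):
    """Decide whether the first packet on a page can be decoded.

    continued:        the page's 'continued packet' flag.
    prev_last_lacing: final lacing value of the previous page that
                      carried packet data, or None if there was none.
    prev_sequence:    sequence number of the preceding page, or None.
    page_sequence:    sequence number of this page.
    """
    if not continued:
        return True  # the first packet starts on this page
    if prev_last_lacing != 255:
        return False  # the earlier part of the packet is missing
    if prev_sequence is None:
        return False
    # Assumption: sequence numbers are 32-bit and wrap around.
    return page_sequence == (prev_sequence + 1) & 0xFFFFFFFF
```
]]></artwork>
</figure>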
| 265 <t> |
| 266 A logical stream ends with a page with the 'end of stream' flag set, but |
| 267 implementations need to be prepared to deal with truncated streams that do not |
| 268 have a page marked 'end of stream'. |
| 269 There is no reason for the final packet on the last page to be a continued |
| 270 packet, i.e., for the final lacing value to be 255. |
| 271 However, demuxers might encounter such streams, possibly as the result of a |
| 272 transfer that did not complete or of corruption. |
| 273 If a packet continues onto a subsequent page (i.e., when the page ends with a |
| 274 lacing value of 255) and one of the following conditions is also true: |
| 275 <list style="symbols"> |
| 276 <t>the next page with packet data does not have the 'continued packet' flag |
| 277 set OR</t> |
| 278 <t>there is no next page with packet data OR</t> |
| 279 <t>the page sequence numbers are not consecutive,</t> |
| 280 </list> |
| 281 then a demuxer MUST NOT attempt to decode the data from that packet unless the |
| 282 demuxer has some special knowledge that would allow it to interpret this data |
| 283 despite the missing pieces. |
| 284 There MUST NOT be any more pages in an Opus logical bitstream after a page |
| 285 marked 'end of stream'. |
| 286 </t> |
| 287 </section> |
| 288 |
| 289 <section anchor="granpos" title="Granule Position"> |
| 290 <t> |
| 291 The granule position MUST be zero for the ID header page and the |
| 292 page where the comment header completes. |
| 293 That is, the first page in the logical stream, and the last header |
| 294 page before the first audio data page both have a granule position of zero. |
| 295 </t> |
| 296 <t> |
| 297 The granule position of an audio data page encodes the total number of PCM |
| 298 samples in the stream up to and including the last fully-decodable sample from |
| 299 the last packet completed on that page. |
| 300 The granule position of the first audio data page will usually be larger than |
| 301 zero, as described in <xref target="start_granpos_restrictions"/>. |
| 302 </t> |
| 303 |
| 304 <t> |
| 305 A page that is entirely spanned by a single packet (that completes on a |
| 306 subsequent page) has no granule position, and the granule position field is |
| 307 set to the special value '-1' in two's complement. |
| 308 </t> |
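<t>
A reader of the page header might handle the special value as sketched below.
This Python fragment assumes the 8-octet granule position field has already
been extracted from the page header and is stored least significant octet
first, as in <xref target="RFC3533"/>; the function name is hypothetical.
</t>
<figure>
<artwork><![CDATA[
```python
import struct


def read_granule_position(field):
    """Decode an 8-octet granule position field.

    Returns None when the field holds -1, meaning that no packet
    completes on the page and it has no granule position.
    """
    if len(field) != 8:
        raise ValueError("granule position field is 8 octets")
    (value,) = struct.unpack("<q", field)  # signed, little-endian
    return None if value == -1 else value
```
]]></artwork>
</figure>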
| 309 |
| 310 <t> |
| 311 The granule position of an audio data page is in units of PCM audio samples at |
| 312 a fixed rate of 48 kHz (per channel; a stereo stream's granule position |
| 313 does not increment at twice the speed of a mono stream). |
| 314 It is possible to run an Opus decoder at other sampling rates, |
| 315 but all Opus packets encode samples at a sampling rate that evenly divides |
| 316 48 kHz. |
| 317 Therefore, the value in the granule position field always counts samples |
| 318 assuming a 48 kHz decoding rate, and the rest of this specification makes |
| 319 the same assumption. |
| 320 </t> |
| 321 |
| 322 <t> |
| 323 The duration of an Opus packet as defined in <xref target="RFC6716"/> can be |
| 324 any multiple of 2.5 ms, up to a maximum of 120 ms. |
| 325 This duration is encoded in the TOC sequence at the beginning of each packet. |
| 326 The number of samples returned by a decoder corresponds to this duration |
| 327 exactly, even for the first few packets. |
| 328 For example, a 20 ms packet fed to a decoder running at 48 kHz will |
| 329 always return 960 samples. |
| 330 A demuxer can parse the TOC sequence at the beginning of each Ogg packet to |
| 331 work backwards or forwards from a packet with a known granule position (i.e., |
| 332 the last packet completed on some page) in order to assign granule positions |
| 333 to every packet, or even every individual sample. |
| 334 The one exception is the last page in the stream, as described below. |
| 335 </t> |
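<t>
Working backwards from a page's granule position can be sketched as follows.
The Python function below is a minimal illustration, assuming the caller has
already obtained the duration in samples (at 48 kHz) of every packet that
completes on the page, for instance by parsing each TOC sequence; the name
packet_end_granules is hypothetical.
It does not handle the last page of a stream, where end trimming applies.
</t>
<figure>
<artwork><![CDATA[
```python
def packet_end_granules(page_granule, durations):
    """Assign a granule position to each packet completed on a page.

    page_granule: granule position stored in the page header.
    durations:    samples per packet, in stream order.
    Returns the granule position at the end of each packet.
    """
    ends = []
    position = page_granule
    # The page's granule position belongs to the last packet, so walk
    # the packets in reverse and subtract each duration in turn.
    for duration in reversed(durations):
        ends.append(position)
        position -= duration
    ends.reverse()
    return ends
```
]]></artwork>
</figure>
<t>
For example, a page with granule position 4800 that completes three 20 ms
packets gives packet end positions of 2880, 3840, and 4800.
</t>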
| 336 |
| 337 <t> |
| 338 All other pages with completed packets after the first MUST have a granule |
| 339 position equal to the number of samples contained in packets that complete on |
| 340 that page plus the granule position of the most recent page with completed |
| 341 packets. |
| 342 This guarantees that a demuxer can assign individual packets the same granule |
| 343 position when working forwards as when working backwards. |
| 344 For this to work, there cannot be any gaps. |
| 345 </t> |
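<t>
The requirement above can be checked mechanically.
The following Python sketch scans the pages after the first audio data page and
reports the first one whose granule position does not equal the running total.
The function name and the shape of its input are hypothetical, and the final
page of a stream, which can be end-trimmed, would need separate treatment.
</t>
<figure>
<artwork><![CDATA[
```python
def first_gap(pages, start_granule):
    """Find the first page whose granule position breaks continuity.

    pages:         (granule_position, durations) pairs for pages with
                   completed packets, after the first such page.
    start_granule: granule position of the first audio data page
                   with a completed packet.
    Returns the index of the offending page, or None if consistent.
    """
    expected = start_granule
    for index, (granule, durations) in enumerate(pages):
        expected += sum(durations)
        if granule != expected:
            return index
    return None
```
]]></artwork>
</figure>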
| 346 |
| 347 <section anchor="gap-repair" title="Repairing Gaps in Real-time Streams"> |
| 348 <t> |
| 349 In order to support capturing a real-time stream that has lost or not |
| 350 transmitted packets, a multiplexer (muxer) SHOULD emit packets that explicitly |
| 351 request the use of Packet Loss Concealment (PLC) in place of the missing |
| 352 packets. |
| 353 Implementations that fail to do so still MUST NOT increment the granule |
| 354 position for a page by anything other than the number of samples contained in |
| 355 packets that actually complete on that page. |
| 356 </t> |
| 357 <t> |
| 358 Only gaps that are a multiple of 2.5 ms are repairable, as these are the |
| 359 only durations that can be created by packet loss or discontinuous |
| 360 transmission. |
| 361 Muxers need not handle other gap sizes. |
| 362 Creating the necessary packets involves synthesizing a TOC byte (defined in |
| 363 Section 3.1 of <xref target="RFC6716"/>), along with whatever |
| 364 additional internal framing is needed, to indicate the packet duration |
| 365 for each stream. |
| 366 The actual length of each missing Opus frame inside the packet is zero bytes, |
| 367 as defined in Section 3.2.1 of <xref target="RFC6716"/>. |
| 368 </t> |
| 369 |
| 370 <t> |
| 371 Zero-byte frames MAY be packed into packets using any of codes 0, 1, |
| 372 2, or 3. |
| 373 When successive frames have the same configuration, the higher code packings |
| 374 reduce overhead. |
| 375 Likewise, if the TOC configuration matches, the muxer MAY further combine the |
| 376 empty frames with previous or subsequent non-zero-length frames (using |
| 377 code 2 or VBR code 3). |
| 378 </t> |
| 379 |
| 380 <t> |
| 381 <xref target="RFC6716"/> does not impose any requirements on the PLC, but this |
| 382 section outlines choices that are expected to have a positive influence on |
| 383 most PLC implementations, including the reference implementation. |
| 384 Synthesized TOC sequences SHOULD maintain the same mode, audio bandwidth, |
| 385 channel count, and frame size as the previous packet (if any). |
| 386 This is the simplest and usually the most well-tested case for the PLC to |
| 387 handle and it covers all losses that do not include a configuration switch, |
| 388 as defined in Section 4.5 of <xref target="RFC6716"/>. |
| 389 </t> |
| 390 |
| 391 <t> |
| 392 When a previous packet is available, keeping the audio bandwidth and channel |
| 393 count the same allows the PLC to provide maximum continuity in the concealment |
| 394 data it generates. |
| 395 However, if the size of the gap is not a multiple of the most recent frame |
| 396 size, then the frame size will have to change for at least some frames. |
| 397 Such changes SHOULD be delayed as long as possible to simplify |
| 398 things for PLC implementations. |
| 399 </t> |
| 400 |
| 401 <t> |
| 402 As an example, a 95 ms gap could be encoded as nineteen 5 ms frames |
| 403 in two bytes with a single CBR code 3 packet. |
| 404 If the previous frame size was 20 ms, using four 20 ms frames |
| 405 followed by three 5 ms frames requires 4 bytes (plus an extra byte |
| 406 of Ogg lacing overhead), but allows the PLC to use its well-tested steady |
| 407 state behavior for as long as possible. |
| 408 The total bitrate of the latter approach, including Ogg overhead, is about |
| 409 0.4 kbps, so the impact on file size is minimal. |
| 410 </t> |
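<t>
The second strategy in this example can be sketched in Python as follows.
This is a simplified illustration, not a complete muxer: it assumes a single
Opus stream per Ogg packet and, purely for brevity, that the previous packet
used the CELT mode at fullband, whereas a real muxer would copy the mode and
audio bandwidth of the previous packet.
All names are hypothetical.
Each synthesized packet carries zero-byte frames, using code 0 for a single
frame and CBR code 3 otherwise.
</t>
<figure>
<artwork><![CDATA[
```python
# TOC config numbers for CELT fullband, keyed by frame size in samples.
CELT_FB_CONFIG = {120: 28, 240: 29, 480: 30, 960: 31}
MAX_PACKET_SAMPLES = 5760  # 120 ms at 48 kHz


def plc_packets(frame_size, count, stereo=False):
    """Build packets of zero-byte frames covering `count` frames."""
    toc = (CELT_FB_CONFIG[frame_size] << 3) | (0x04 if stereo else 0)
    per_packet = MAX_PACKET_SAMPLES // frame_size
    packets = []
    while count > 0:
        n = min(count, per_packet)
        if n == 1:
            packets.append(bytes([toc]))            # code 0
        else:
            packets.append(bytes([toc | 0x03, n]))  # CBR code 3
        count -= n
    return packets


def fill_gap(gap, previous_frame_size, stereo=False):
    """Return packets that request PLC for `gap` samples."""
    if gap < 0 or gap % 120:
        raise ValueError("only multiples of 2.5 ms are repairable")
    # Keep the previous frame size for as long as possible.
    packets = plc_packets(previous_frame_size,
                          gap // previous_frame_size, stereo)
    remainder = gap % previous_frame_size
    if remainder:
        # Largest frame size that divides what is left.
        size = max(s for s in CELT_FB_CONFIG if remainder % s == 0)
        packets += plc_packets(size, remainder // size, stereo)
    return packets
```
]]></artwork>
</figure>
<t>
Under these assumptions, a 95 ms gap (4560 samples) after a 20 ms frame yields
two packets of two bytes each: four 20 ms frames followed by three 5 ms frames.
</t>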
| 411 |
| 412 <t> |
| 413 Changing modes is discouraged, since this causes some decoder implementations |
| 414 to reset their PLC state. |
| 415 However, SILK and Hybrid mode frames cannot fill gaps that are not a multiple |
| 416 of 10 ms. |
| 417 If switching to CELT mode is needed to match the gap size, a muxer SHOULD do |
| 418 so at the end of the gap to allow the PLC to function for as long as possible. |
| 419 </t> |
| 420 |
| 421 <t> |
| 422 In the example above, if the previous frame was a 20 ms SILK mode frame, |
| 423 the better solution is to synthesize a packet describing four 20 ms SILK |
| 424 frames, followed by a packet with a single 10 ms SILK |
| 425 frame, and finally a packet with a 5 ms CELT frame, to fill the 95 ms |
| 426 gap. |
| 427 This also requires four bytes to describe the synthesized packet data (two |
| 428 bytes for a CBR code 3 and one byte each for two code 0 packets) but three |
| 429 bytes of Ogg lacing overhead are needed to mark the packet boundaries. |
| 430 At 0.6 kbps, this is still a minimal bitrate impact over a naive, low quality |
| 431 solution. |
| 432 </t> |
| 433 |
| 434 <t> |
| 435 Since medium-band audio is an option only in the SILK mode, wideband frames |
| 436 SHOULD be generated if switching from that configuration to CELT mode, to |
| 437 ensure that any PLC implementation which does try to migrate state between |
| 438 the modes will be able to preserve all of the available audio bandwidth. |
| 439 </t> |
| 440 |
| 441 </section> |
| 442 |
| 443 <section anchor="preskip" title="Pre-skip"> |
| 444 <t> |
| 445 There is some amount of latency introduced during the decoding process, to |
| 446 allow for overlap in the CELT mode, stereo mixing in the SILK mode, and |
| 447 resampling. |
| 448 The encoder might have introduced additional latency through its own resampling |
| 449 and analysis (though the exact amount is not specified). |
| 450 Therefore, the first few samples produced by the decoder do not correspond to |
| 451 real input audio, but are instead composed of padding inserted by the encoder |
| 452 to compensate for this latency. |
| 453 These samples need to be stored and decoded, as Opus is an asymptotically |
| 454 convergent predictive codec, meaning the decoded contents of each frame depend |
| 455 on the recent history of decoder inputs. |
| 456 However, a player will want to skip these samples after decoding them. |
| 457 </t> |
| 458 |
| 459 <t> |
| 460 A 'pre-skip' field in the ID header (see <xref target="id_header"/>) signals |
| 461 the number of samples that SHOULD be skipped (decoded but discarded) at the |
| 462 beginning of the stream, though some specific applications might have a reason |
| 463 for looking at that data. |
| 464 This amount need not be a multiple of 2.5 ms, MAY be smaller than a single |
| 465 packet, or MAY span the contents of several packets. |
| 466 These samples are not valid audio. |
| 467 </t> |
| 468 |
| 469 <t> |
| 470 For example, if the first Opus frame uses the CELT mode, it will always |
| 471 produce 120 samples of windowed overlap-add data. |
| 472 However, the overlap data is initially all zeros (since there is no prior |
| 473 frame), meaning this cannot, in general, accurately represent the original |
| 474 audio. |
| 475 The SILK mode requires additional delay to account for its analysis and |
| 476 resampling latency. |
| 477 The encoder delays the original audio to avoid this problem. |
| 478 </t> |
| 479 |
| 480 <t> |
| 481 The pre-skip field MAY also be used to perform sample-accurate cropping of |
| 482 already encoded streams. |
| 483 In this case, a value of at least 3840 samples (80 ms) provides |
| 484 sufficient history to the decoder that it will have converged |
| 485 before the stream's output begins. |
| 486 </t> |
| 487 |
| 488 </section> |
| 489 |
| 490 <section anchor="pcm_sample_position" title="PCM Sample Position"> |
| 491 <t> |
| 492 The PCM sample position is determined from the granule position using the |
| 493 formula |
| 494 </t> |
| 495 <figure align="center"> |
| 496 <artwork align="center"><![CDATA[ |
| 497 'PCM sample position' = 'granule position' - 'pre-skip' . |
| 498 ]]></artwork> |
| 499 </figure> |
| 500 |
| 501 <t> |
| 502 For example, if the granule position of the first audio data page is 59,971, |
| 503 and the pre-skip is 11,971, then the PCM sample position of the last decoded |
| 504 sample from that page is 48,000. |
| 505 </t> |
| 506 <t> |
| 507 This can be converted into a playback time using the formula |
| 508 </t> |
| 509 <figure align="center"> |
| 510 <artwork align="center"><![CDATA[ |
| 511                   'PCM sample position' |
| 512 'playback time' = --------------------- . |
| 513                         48000.0 |
| 514 ]]></artwork> |
| 515 </figure> |
| 516 |
| 517 <t> |
| 518 The initial PCM sample position before any samples are played is normally '0'. |
| 519 In this case, the PCM sample position of the first audio sample to be played |
| 520 starts at '1', because it marks the time on the clock |
| 521 <spanx style="emph">after</spanx> that sample has been played, and a stream |
| 522 that is exactly one second long has a final PCM sample position of '48000', |
| 523 as in the example here. |
| 524 </t> |
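<t>
The two formulas above translate directly into code.
The following Python sketch uses hypothetical function names and reproduces the
worked example from this section.
</t>
<figure>
<artwork><![CDATA[
```python
def pcm_sample_position(granule_position, pre_skip):
    """PCM sample position for a given granule position."""
    return granule_position - pre_skip


def playback_time(granule_position, pre_skip):
    """Playback time in seconds; granule positions count 48 kHz samples."""
    return pcm_sample_position(granule_position, pre_skip) / 48000.0
```
]]></artwork>
</figure>
<t>
With a granule position of 59,971 and a pre-skip of 11,971, the PCM sample
position is 48,000 and the playback time is one second.
</t>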
| 525 |
| 526 <t> |
| 527 Vorbis streams use a granule position smaller than the number of audio samples |
| 528 contained in the first audio data page to indicate that some of those samples |
| 529 are trimmed from the output (see <xref target="vorbis-trim"/>). |
| 530 However, to do so, Vorbis requires that the first audio data page contains |
| 531 exactly two packets, in order to allow the decoder to perform PCM position |
| 532 adjustments before needing to return any PCM data. |
| 533 Opus uses the pre-skip mechanism for this purpose instead, since the encoder |
| 534 might introduce more than a single packet's worth of latency, and since very |
| 535 large packets in streams with a very large number of channels might not fit |
| 536 on a single page. |
| 537 </t> |
| 538 </section> |
| 539 |
| 540 <section anchor="end_trimming" title="End Trimming"> |
| 541 <t> |
| 542 The page with the 'end of stream' flag set MAY have a granule position that |
| 543 indicates the page contains less audio data than would normally be returned by |
| 544 decoding up through the final packet. |
| 545 This is used to end the stream somewhere other than an even frame boundary. |
| 546 The granule position of the most recent audio data page with completed packets |
| 547 is used to make this determination, or '0' is used if there were no previous |
| 548 audio data pages with a completed packet. |
| 549 The difference between these granule positions indicates how many samples to |
| 550 keep after decoding the packets that completed on the final page. |
| 551 The remaining samples are discarded. |
| 552 The number of discarded samples SHOULD be no larger than the number decoded |
| 553 from the last packet. |
| 554 </t> |
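<t>
The end-trimming computation can be sketched in Python as follows.
The function name is hypothetical, and clamping the result to the range of
decoded samples, rather than rejecting the stream, is a choice made by this
sketch and not a requirement of this document.
</t>
<figure>
<artwork><![CDATA[
```python
def end_trim(final_granule, previous_granule, decoded_samples):
    """Return (keep, discard) sample counts for the final page.

    final_granule:    granule position of the 'end of stream' page.
    previous_granule: granule position of the most recent audio data
                      page with completed packets, or 0 if none.
    decoded_samples:  samples decoded from the packets that complete
                      on the final page.
    """
    keep = final_granule - previous_granule
    # Choice made by this sketch: clamp instead of rejecting.
    keep = max(0, min(keep, decoded_samples))
    return keep, decoded_samples - keep
```
]]></artwork>
</figure>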
| 555 </section> |
| 556 |
| 557 <section anchor="start_granpos_restrictions" |
| 558 title="Restrictions on the Initial Granule Position"> |
| 559 <t> |
| 560 The granule position of the first audio data page with a completed packet MAY |
| 561 be larger than the number of samples contained in packets that complete on |
| 562 that page; however, it MUST NOT be smaller, unless that page has the 'end of |
| 563 stream' flag set. |
| 564 Allowing a granule position larger than the number of samples allows the |
| 565 beginning of a stream to be cropped or a live stream to be joined without |
| 566 rewriting the granule position of all the remaining pages. |
| 567 This means that the PCM sample position just before the first sample to be |
| 568 played MAY be larger than '0'. |
| 569 Synchronization when multiplexing with other logical streams still uses the PCM |
| 570 sample position relative to '0' to compute sample times. |
| 571 This does not affect the behavior of pre-skip: exactly 'pre-skip' samples |
| 572 SHOULD be skipped from the beginning of the decoded output, even if the |
| 573 initial PCM sample position is greater than zero. |
| 574 </t> |
| 575 |
| 576 <t> |
| 577 On the other hand, a granule position that is smaller than the number of |
| 578 decoded samples prevents a demuxer from working backwards to assign each |
| 579 packet or each individual sample a valid granule position, since granule |
| 580 positions are non-negative. |
| 581 An implementation MUST treat any stream as invalid if the granule position |
| 582 is smaller than the number of samples contained in packets that complete on |
| 583 the first audio data page with a completed packet, unless that page has the |
| 584 'end of stream' flag set. |
| 585 It MAY defer this action until it decodes the last packet completed on that |
| 586 page. |
| 587 </t> |
| 588 |
| 589 <t> |
| 590 If that page has the 'end of stream' flag set, a demuxer MUST treat any stream |
| 591 as invalid if its granule position is smaller than the 'pre-skip' amount. |
| 592 This would indicate that there are more samples to be skipped from the initial |
| 593 decoded output than exist in the stream. |
| 594 If the granule position is smaller than the number of decoded samples produced |
| 595 by the packets that complete on that page, then a demuxer MUST use an initial |
| 596 granule position of '0', and can work forwards from '0' to timestamp |
| 597 individual packets. |
| 598 If the granule position is larger than the number of decoded samples available, |
| 599 then the demuxer MUST still work backwards as described above, even if the |
| 600 'end of stream' flag is set, to determine the initial granule position, and |
| 601 thus the initial PCM sample position. |
| 602 Both of these will be greater than '0' in this case. |
| 603 </t> |
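<t>
The rules in this section can be collected into one function.
The Python sketch below is one possible arrangement, with hypothetical names;
it raises an exception where this document says a stream is to be treated as
invalid, and otherwise returns the initial granule position.
</t>
<figure>
<artwork><![CDATA[
```python
def initial_granule_position(page_granule, samples_on_page,
                             pre_skip, end_of_stream):
    """Validate the first audio data page with a completed packet.

    page_granule:    granule position of that page.
    samples_on_page: samples in the packets that complete on it.
    pre_skip:        pre-skip value from the ID header.
    end_of_stream:   whether the page has 'end of stream' set.
    Returns the granule position before the first decoded sample.
    """
    if not end_of_stream:
        if page_granule < samples_on_page:
            raise ValueError("granule position smaller than page contents")
        return page_granule - samples_on_page
    if page_granule < pre_skip:
        raise ValueError("granule position smaller than pre-skip")
    if page_granule < samples_on_page:
        return 0  # end-trimmed: work forwards from zero
    # Larger than or equal to the samples available: work backwards.
    return page_granule - samples_on_page
```
]]></artwork>
</figure>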
| 604 </section> |
| 605 |
| 606 <section anchor="seeking_and_preroll" title="Seeking and Pre-roll"> |
| 607 <t> |
| 608 Seeking in Ogg files is best performed using a bisection search for a page |
| 609 whose granule position corresponds to a PCM position at or before the seek |
| 610 target. |
| 611 With appropriately weighted bisection, accurate seeking can be performed in |
| 612 just one or two bisections on average, even in multi-gigabyte files. |
| 613 See <xref target="seeking"/> for an example of general implementation guidance. |
| 614 </t> |
| 615 |
| 616 <t> |
| 617 When seeking within an Ogg Opus stream, an implementation SHOULD start decoding |
| 618 (and discarding the output) at least 3840 samples (80 ms) prior to |
| 619 the seek target in order to ensure that the output audio is correct by the |
| 620 time it reaches the seek target. |
| 621 This 'pre-roll' is separate from, and unrelated to, the 'pre-skip' used at the |
| 622 beginning of the stream. |
| 623 If the point 80 ms prior to the seek target comes before the initial PCM |
| 624 sample position, an implementation SHOULD start decoding from the beginning of |
| 625 the stream, applying pre-skip as normal, regardless of whether the pre-skip is |
| 626 larger or smaller than 80 ms, and then continue to discard samples |
| 627 to reach the seek target (if any). |
| 628 </t> |
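<t>
The choice of where to begin decoding for a seek can be sketched as follows.
This Python fragment works in PCM sample positions and uses hypothetical names;
locating the page that contains the returned position, for example by
bisection, is left to the caller.
</t>
<figure>
<artwork><![CDATA[
```python
PRE_ROLL_SAMPLES = 3840  # 80 ms at 48 kHz


def decode_start_position(seek_target, initial_position=0):
    """Return the PCM sample position at which to start decoding.

    Returns None when the pre-roll would reach before the initial PCM
    sample position; the caller then decodes from the beginning of
    the stream, applies pre-skip as normal, and keeps discarding
    output until the seek target.
    """
    start = seek_target - PRE_ROLL_SAMPLES
    if start < initial_position:
        return None
    return start
```
]]></artwork>
</figure>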
| 629 </section> |
| 630 |
| 631 </section> |
| 632 |
| 633 <section anchor="headers" title="Header Packets"> |
| 634 <t> |
| 635 An Ogg Opus logical stream contains exactly two mandatory header packets: |
| 636 an identification header and a comment header. |
| 637 </t> |
| 638 |
| 639 <section anchor="id_header" title="Identification Header"> |
| 640 |
| 641 <figure anchor="id_header_packet" title="ID Header Packet" align="center"> |
| 642 <artwork align="center"><![CDATA[ |
| 643  0                   1                   2                   3 |
| 644  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 |
| 645 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 646 |      'O'      |      'p'      |      'u'      |      's'      | |
| 647 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 648 |      'H'      |      'e'      |      'a'      |      'd'      | |
| 649 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 650 |  Version = 1  | Channel Count |           Pre-skip            | |
| 651 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 652 |                     Input Sample Rate (Hz)                    | |
| 653 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 654 |   Output Gain (Q7.8 in dB)    | Mapping Family|               | |
| 655 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               : |
| 656 |                                                               | |
| 657 :               Optional Channel Mapping Table...               : |
| 658 |                                                               | |
| 659 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 660 ]]></artwork> |
| 661 </figure> |
| 662 |
| 663 <t> |
| 664 The fields in the identification (ID) header have the following meaning: |
| 665 <list style="numbers"> |
| 666 <t>Magic Signature: |
| 667 <vspace blankLines="1"/> |
| 668 This is an 8-octet (64-bit) field that allows codec identification and is |
| 669 human-readable. |
| 670 It contains, in order, the magic numbers: |
| 671 <list style="empty"> |
| 672 <t>0x4F 'O'</t> |
| 673 <t>0x70 'p'</t> |
| 674 <t>0x75 'u'</t> |
| 675 <t>0x73 's'</t> |
| 676 <t>0x48 'H'</t> |
| 677 <t>0x65 'e'</t> |
| 678 <t>0x61 'a'</t> |
| 679 <t>0x64 'd'</t> |
| 680 </list> |
| 681 Starting with "Op" helps distinguish it from audio data packets, as this is an |
| 682 invalid TOC sequence. |
| 683 <vspace blankLines="1"/> |
| 684 </t> |
| 685 <t>Version (8 bits, unsigned): |
| 686 <vspace blankLines="1"/> |
| 687 The version number MUST always be '1' for this version of the encapsulation |
| 688 specification. |
| 689 Implementations SHOULD treat streams where the upper four bits of the version |
| 690 number match that of a recognized specification as backwards-compatible with |
| 691 that specification. |
| 692 That is, the version number can be split into "major" and "minor" version |
| 693 sub-fields, with changes to the "minor" sub-field (in the lower four bits) |
| 694 signaling compatible changes. |
| 695 For example, an implementation of this specification SHOULD accept any stream |
| 696 with a version number of '15' or less, and SHOULD assume any stream with a |
| 697 version number '16' or greater is incompatible. |
| 698 The initial version '1' was chosen to keep implementations from relying on this |
| 699 octet as a null terminator for the "OpusHead" string. |
| 700 <vspace blankLines="1"/> |
| 701 </t> |
| 702 <t>Output Channel Count 'C' (8 bits, unsigned): |
| 703 <vspace blankLines="1"/> |
| 704 This is the number of output channels. |
| 705 This might be different from the number of encoded channels, which can change |
| 706 on a packet-by-packet basis. |
| 707 This value MUST NOT be zero. |
| 708 The maximum allowable value depends on the channel mapping family, and might be |
| 709 as large as 255. |
| 710 See <xref target="channel_mapping"/> for details. |
| 711 <vspace blankLines="1"/> |
| 712 </t> |
| 713 <t>Pre-skip (16 bits, unsigned, little |
| 714 endian): |
| 715 <vspace blankLines="1"/> |
| 716 This is the number of samples (at 48 kHz) to discard from the decoder |
| 717 output when starting playback, and also the number to subtract from a page's |
| 718 granule position to calculate its PCM sample position. |
| 719 When cropping the beginning of existing Ogg Opus streams, a pre-skip of at |
| 720 least 3,840 samples (80 ms) is RECOMMENDED to ensure complete |
| 721 convergence in the decoder. |
| 722 <vspace blankLines="1"/> |
| 723 </t> |
| 724 <t>Input Sample Rate (32 bits, unsigned, little |
| 725 endian): |
| 726 <vspace blankLines="1"/> |
| 727 This is the sample rate of the original input (before encoding), in Hz. |
| 728 This field is <spanx style="emph">not</spanx> the sample rate to use for |
| 729 playback of the encoded data. |
| 730 <vspace blankLines="1"/> |
| 731 Opus can switch between internal audio bandwidths of 4, 6, 8, 12, and |
| 732 20 kHz. |
| 733 Each packet in the stream can have a different audio bandwidth. |
| 734 Regardless of the audio bandwidth, the reference decoder supports decoding any |
| 735 stream at a sample rate of 8, 12, 16, 24, or 48 kHz. |
| 736 The original sample rate of the audio passed to the encoder is not preserved |
| 737 by the lossy compression. |
| 738 <vspace blankLines="1"/> |
| 739 An Ogg Opus player SHOULD select the playback sample rate according to the |
| 740 following procedure: |
| 741 <list style="numbers"> |
| 742 <t>If the hardware supports 48 kHz playback, decode at 48 kHz.</t> |
| 743 <t>Otherwise, if the hardware's highest available sample rate is a supported |
| 744 rate, decode at this sample rate.</t> |
| 745 <t>Otherwise, if the hardware's highest available sample rate is less than |
| 746 48 kHz, decode at the next higher Opus supported rate above the highest |
| 747 available hardware rate and resample.</t> |
| 748 <t>Otherwise, decode at 48 kHz and resample.</t> |
| 749 </list> |
| 750 However, the 'Input Sample Rate' field allows the muxer to pass the sample |
| 751 rate of the original input stream as metadata. |
| 752 This is useful when the user requires the output sample rate to match the |
| 753 input sample rate. |
| 754 For example, when not playing the output, an implementation writing PCM format |
| 755 samples to disk might choose to resample the audio back to the original input |
| 756 sample rate to reduce surprise to the user, who might reasonably expect to get |
| 757 back a file with the same sample rate. |
| 758 <vspace blankLines="1"/> |
| 759 A value of zero indicates 'unspecified'. |
| 760 Muxers SHOULD write the actual input sample rate or zero, but implementations |
| 761 which do something with this field SHOULD take care to behave sanely if given |
| 762 crazy values (e.g., do not actually upsample the output to 10 MHz if |
| 763 requested). |
| 764 Implementations SHOULD support input sample rates between 8 kHz and |
| 765 192 kHz (inclusive). |
| 766 Rates outside this range MAY be ignored by falling back to the default rate of |
| 767 48 kHz instead. |
| 768 <vspace blankLines="1"/> |
| 769 </t> |
| 770 <t>Output Gain (16 bits, signed, little endian): |
| 771 <vspace blankLines="1"/> |
| 772 This is a gain to be applied when decoding. |
| 773 It is 20*log10 of the factor by which to scale the decoder output to achieve |
| 774 the desired playback volume, stored in a 16-bit, signed, two's complement |
| 775 fixed-point value with 8 fractional bits (i.e., |
| 776 Q7.8 <xref target="q-notation"/>). |
| 777 <vspace blankLines="1"/> |
| 778 To apply the gain, an implementation could use |
| 779 <figure align="center"> |
| 780 <artwork align="center"><![CDATA[ |
| 781 sample *= pow(10, output_gain/(20.0*256)) , |
| 782 ]]></artwork> |
| 783 </figure> |
| 784 where output_gain is the raw 16-bit value from the header. |
| 785 <vspace blankLines="1"/> |
| 786 Players and media frameworks SHOULD apply it by default. |
| 787 If a player chooses to apply any volume adjustment or gain modification, such |
| 788 as the R128_TRACK_GAIN (see <xref target="comment_header"/>), the adjustment |
| 789 MUST be applied in addition to this output gain in order to achieve playback |
| 790 at the normalized volume. |
| 791 <vspace blankLines="1"/> |
| 792 A muxer SHOULD set this field to zero, and instead apply any gain prior to |
| 793 encoding, when this is possible and does not conflict with the user's wishes. |
| 794 A nonzero output gain indicates the gain was adjusted after encoding, or that |
| 795 a user wished to adjust the gain for playback while preserving the ability |
| 796 to recover the original signal amplitude. |
| 797 <vspace blankLines="1"/> |
| 798 Although the output gain has enormous range (+/- 128 dB, enough to amplify |
| 799 inaudible sounds to the threshold of physical pain), most applications can |
| 800 only reasonably use a small portion of this range around zero. |
| 801 The large range serves in part to ensure that gain can always be losslessly |
| 802 transferred between OpusHead and R128 gain tags (see below) without |
| 803 saturating. |
| 804 <vspace blankLines="1"/> |
| 805 </t> |
| 806 <t>Channel Mapping Family (8 bits, unsigned): |
| 807 <vspace blankLines="1"/> |
| 808 This octet indicates the order and semantic meaning of the output channels. |
| 809 <vspace blankLines="1"/> |
| 810 Each currently specified value of this octet indicates a mapping family, which |
| 811 defines a set of allowed channel counts, and the ordered set of channel names |
| 812 for each allowed channel count. |
| 813 The details are described in <xref target="channel_mapping"/>. |
| 814 </t> |
| 815 <t>Channel Mapping Table: |
| 816 This table defines the mapping from encoded streams to output channels. |
| 817 Its contents are specified in <xref target="channel_mapping"/>. |
| 818 </t> |
| 819 </list> |
| 820 </t> |
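<t>
As a non-normative illustration of the 'output gain' field, the following
Python sketch decodes the two raw octets and converts the Q7.8 value in dB
into a linear scale factor, equivalent to the pow() expression given above.
The function name is an assumption made for this example.
</t>
<figure align="center">
<artwork align="left"><![CDATA[
```python
import struct


def output_gain_to_scale(raw_field):
    """Convert the 2-octet 'output gain' field to a linear scale factor.

    'raw_field' is assumed to hold exactly the two octets of the field
    as they appear in the ID header.
    """
    (q78,) = struct.unpack("<h", raw_field)  # signed 16-bit, little endian
    gain_db = q78 / 256.0                    # Q7.8: 8 fractional bits
    return 10.0 ** (gain_db / 20.0)          # dB to amplitude ratio
```
]]></artwork>
</figure>
<t>
For example, a raw value of 1536 corresponds to +6 dB, which is a scale
factor of roughly 2, and a raw value of 0 leaves the decoder output unchanged.
</t>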
| 821 |
| 822 <t> |
| 823 All fields in the ID header are REQUIRED, except for the channel mapping |
| 824 table, which MUST be omitted when the channel mapping family is 0, but |
| 825 is REQUIRED otherwise. |
| 826 Implementations SHOULD treat a stream as invalid if it contains an ID header |
| 827 that does not have enough data for these fields, even if it contains a valid |
| 828 Magic Signature. |
| 829 Future versions of this specification, even backwards-compatible versions, |
| 830 might include additional fields in the ID header. |
| 831 If an ID header has a compatible major version, but a larger minor version, |
| 832 an implementation MUST NOT treat it as invalid for containing additional data |
| 833 not specified here, provided it still completes on the first page. |
| 834 </t> |
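<t>
The field layout and the validity rules above can be sketched in Python as
follows.
This is a non-normative illustration: the function name, the returned
dictionary keys, and the use of ValueError to signal an invalid stream are
assumptions of this example, and not every constraint in this specification
is checked.
Trailing data after the known fields is ignored, as permitted for compatible
minor versions.
</t>
<figure align="center">
<artwork align="left"><![CDATA[
```python
import struct


def parse_id_header(packet):
    """Parse an 'OpusHead' packet; raise ValueError if it is invalid."""
    # 8 octets of magic plus 11 octets of fixed fields.
    if len(packet) < 19 or packet[:8] != b"OpusHead":
        raise ValueError("bad magic signature or truncated ID header")
    version, channels, pre_skip, rate, gain, family = struct.unpack_from(
        "<BBHIhB", packet, 8)
    if version > 15:
        raise ValueError("incompatible major version")
    if channels == 0:
        raise ValueError("channel count must not be zero")
    if family == 0:
        # No table is coded: one stream, stereo if and only if C == 2.
        if channels > 2:
            raise ValueError("family 0 allows only 1 or 2 channels")
        streams, coupled = 1, channels - 1
        mapping = list(range(channels))
    else:
        if len(packet) < 21 + channels:
            raise ValueError("truncated channel mapping table")
        streams, coupled = packet[19], packet[20]
        mapping = list(packet[21:21 + channels])
        if streams == 0 or coupled > streams or streams + coupled > 255:
            raise ValueError("invalid stream counts")
    return {"version": version, "channels": channels,
            "pre_skip": pre_skip, "input_sample_rate": rate,
            "output_gain": gain, "mapping_family": family,
            "streams": streams, "coupled": coupled, "mapping": mapping}
```
]]></artwork>
</figure>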
| 835 |
| 836 <section anchor="channel_mapping" title="Channel Mapping"> |
| 837 <t> |
| 838 An Ogg Opus stream allows mapping one number of Opus streams (N) to a possibly |
| 839 larger number of decoded channels (M + N) to yet another number of |
| 840 output channels (C), which might be larger or smaller than the number of |
| 841 decoded channels. |
| 842 The order and meaning of these channels are defined by a channel mapping, |
| 843 which consists of the 'channel mapping family' octet and, for channel mapping |
| 844 families other than family 0, a channel mapping table, as illustrated in |
| 845 <xref target="channel_mapping_table"/>. |
| 846 </t> |
| 847 |
| 848 <figure anchor="channel_mapping_table" title="Channel Mapping Table" |
| 849 align="center"> |
| 850 <artwork align="center"><![CDATA[ |
| 851  0                   1                   2                   3 |
| 852  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 |
| 853                                                 +-+-+-+-+-+-+-+-+ |
| 854                                                 | Stream Count  | |
| 855 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 856 | Coupled Count |              Channel Mapping...               : |
| 857 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 858 ]]></artwork> |
| 859 </figure> |
| 860 |
| 861 <t> |
| 862 The fields in the channel mapping table have the following meaning: |
| 863 <list style="numbers" counter="8"> |
| 864 <t>Stream Count 'N' (8 bits, unsigned): |
| 865 <vspace blankLines="1"/> |
| 866 This is the total number of streams encoded in each Ogg packet. |
| 867 This value is necessary to correctly parse the packed Opus packets inside an |
| 868 Ogg packet, as described in <xref target="packet_organization"/>. |
| 869 This value MUST NOT be zero, as without at least one Opus packet with a valid |
| 870 TOC sequence, a demuxer cannot recover the duration of an Ogg packet. |
| 871 <vspace blankLines="1"/> |
| 872 For channel mapping family 0, this value defaults to 1, and is not coded. |
| 873 <vspace blankLines="1"/> |
| 874 </t> |
| 875 <t>Coupled Stream Count 'M' (8 bits, unsigned): |
| 876 This is the number of streams whose decoders are to be configured to produce |
| 877 two channels (stereo). |
| 878 This MUST be no larger than the total number of streams, N. |
| 879 <vspace blankLines="1"/> |
| 880 Each packet in an Opus stream has an internal channel count of 1 or 2, which |
| 881 can change from packet to packet. |
| 882 This is selected by the encoder depending on the bitrate and the audio being |
| 883 encoded. |
| 884 The original channel count of the audio passed to the encoder is not |
| 885 necessarily preserved by the lossy compression. |
| 886 <vspace blankLines="1"/> |
| 887 Regardless of the internal channel count, any Opus stream can be decoded as |
| 888 mono (a single channel) or stereo (two channels) by appropriate initialization |
| 889 of the decoder. |
| 890 The 'coupled stream count' field indicates that the decoders for the first M |
| 891 Opus streams are to be initialized for stereo (two-channel) output, and the |
| 892 remaining (N - M) decoders are to be initialized for mono (a single |
| 893 channel) only. |
| 894 The total number of decoded channels, (M + N), MUST be no larger than |
| 895 255, as there is no way to index more channels than that in the channel |
| 896 mapping. |
| 897 <vspace blankLines="1"/> |
| 898 For channel mapping family 0, this value defaults to (C - 1) |
| 899 (i.e., 0 for mono and 1 for stereo), and is not coded. |
| 900 <vspace blankLines="1"/> |
| 901 </t> |
| 902 <t>Channel Mapping (8*C bits): |
| 903 This contains one octet per output channel, indicating which decoded channel |
| 904 is to be used for each one. |
| 905 Let 'index' be the value of this octet for a particular output channel. |
| 906 This value MUST either be smaller than (M + N), or be the special |
| 907 value 255. |
| 908 If 'index' is less than 2*M, the output MUST be taken from decoding stream |
| 909 ('index'/2) as stereo and selecting the left channel if 'index' is even, and |
| 910 the right channel if 'index' is odd. |
| 911 If 'index' is 2*M or larger, but less than 255, the output MUST be taken from |
| 912 decoding stream ('index' - M) as mono. |
| 913 If 'index' is 255, the corresponding output channel MUST contain pure silence. |
| 914 <vspace blankLines="1"/> |
| 915 The number of output channels, C, is not constrained to match the number of |
| 916 decoded channels (M + N). |
| 917 A single index value MAY appear multiple times, i.e., the same decoded channel |
| 918 might be mapped to multiple output channels. |
| 919 Some decoded channels might not be assigned to any output channel, as well. |
| 920 <vspace blankLines="1"/> |
| 921 For channel mapping family 0, the first index defaults to 0, and if |
| 922 C == 2, the second index defaults to 1. |
| 923 Neither index is coded. |
| 924 </t> |
| 925 </list> |
| 926 </t> |
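<t>
The interpretation of a single channel mapping octet can be sketched in
Python as follows.
This is a non-normative illustration; the function name and the convention
of returning None for the silent channel are assumptions of this example.
</t>
<figure align="center">
<artwork align="left"><![CDATA[
```python
def resolve_channel(index, streams, coupled):
    """Map one channel mapping octet to its source.

    'streams' is N and 'coupled' is M from the channel mapping table.
    Returns (stream, channel), where channel 0 is mono or left and
    channel 1 is right, or None when the output channel is silent.
    """
    if index == 255:
        return None                    # pure silence
    if index >= streams + coupled:
        raise ValueError("index exceeds the number of decoded channels")
    if index < 2 * coupled:
        # Coupled (stereo) streams come first: even = left, odd = right.
        return index // 2, index % 2
    # Remaining indices address the mono streams.
    return index - coupled, 0
```
]]></artwork>
</figure>
<t>
For instance, with N = 4 and M = 2, index 3 selects the right channel of
stream 1, and index 5 selects mono stream 3.
</t>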
| 927 |
| 928 <t> |
| 929 After producing the output channels, the channel mapping family determines the |
| 930 semantic meaning of each one. |
| 931 There are three defined mapping families in this specification. |
| 932 </t> |
| 933 |
| 934 <section anchor="channel_mapping_0" title="Channel Mapping Family 0"> |
| 935 <t> |
| 936 Allowed numbers of channels: 1 or 2. |
| 937 RTP mapping. |
| 938 This is the same channel interpretation as <xref target="RFC7587"/>. |
| 939 </t> |
| 940 <t> |
| 941 <list style="symbols"> |
| 942 <t>1 channel: monophonic (mono).</t> |
| 943 <t>2 channels: stereo (left, right).</t> |
| 944 </list> |
| 945 Special mapping: This channel mapping value also |
| 946 indicates that the content consists of a single Opus stream that is stereo if |
| 947 and only if C == 2, with stream index 0 mapped to output |
| 948 channel 0 (mono, or left channel) and stream index 1 mapped to |
| 949 output channel 1 (right channel) if stereo. |
| 950 When the 'channel mapping family' octet has this value, the channel mapping |
| 951 table MUST be omitted from the ID header packet. |
| 952 </t> |
| 953 </section> |
| 954 |
| 955 <section anchor="channel_mapping_1" title="Channel Mapping Family 1"> |
| 956 <t> |
| 957 Allowed numbers of channels: 1...8. |
| 958 Vorbis channel order (see below). |
| 959 </t> |
| 960 <t> |
| 961 Each channel is assigned to a speaker location in a conventional surround |
| 962 arrangement. |
| 963 Specific locations depend on the number of channels, and are given below |
| 964 in order of the corresponding channel indices. |
| 965 <list style="symbols"> |
| 966 <t>1 channel: monophonic (mono).</t> |
| 967 <t>2 channels: stereo (left, right).</t> |
| 968 <t>3 channels: linear surround (left, center, right).</t> |
| 969 <t>4 channels: quadraphonic (front left, front right, rear left, rear right).</t> |
| 970 <t>5 channels: 5.0 surround (front left, front center, front right, rear left, rear right).</t> |
| 971 <t>6 channels: 5.1 surround (front left, front center, front right, rear left, rear right, LFE).</t> |
| 972 <t>7 channels: 6.1 surround (front left, front center, front right, side left, side right, rear center, LFE).</t> |
| 973 <t>8 channels: 7.1 surround (front left, front center, front right, side left, side right, rear left, rear right, LFE).</t> |
| 974 </list> |
| 975 </t> |
| 976 <t> |
| 977 This set of surround options and speaker location orderings is the same |
| 978 as those used by the Vorbis codec <xref target="vorbis-mapping"/>. |
| 979 The ordering is different from the one used by the |
| 980 WAVE <xref target="wave-multichannel"/> and |
| 981 Free Lossless Audio Codec (FLAC) <xref target="flac"/> formats, |
| 982 so correct ordering requires permutation of the output channels when decoding |
| 983 to or encoding from those formats. |
| 984 'LFE' here refers to a Low Frequency Effects channel, often mapped to a |
| 985 subwoofer with no particular spatial position. |
| 986 Implementations SHOULD identify 'side' or 'rear' speaker locations with |
| 987 'surround' and 'back' as appropriate when interfacing with audio formats |
| 988 or systems which prefer that terminology. |
| 989 </t> |
| 990 </section> |
| 991 |
| 992 <section anchor="channel_mapping_255" |
| 993 title="Channel Mapping Family 255"> |
| 994 <t> |
| 995 Allowed numbers of channels: 1...255. |
| 996 No defined channel meaning. |
| 997 </t> |
| 998 <t> |
| 999 Channels are unidentified. |
| 1000 General-purpose players SHOULD NOT attempt to play these streams. |
| 1001 Offline implementations MAY deinterleave the output into separate PCM files, |
| 1002 one per channel. |
| 1003 Implementations SHOULD NOT produce output for channels mapped to stream index |
| 1004 255 (pure silence) unless they have no other way to indicate the index of |
| 1005 non-silent channels. |
| 1006 </t> |
| 1007 </section> |
| 1008 |
| 1009 <section anchor="channel_mapping_undefined" |
| 1010 title="Undefined Channel Mappings"> |
| 1011 <t> |
| 1012 The remaining channel mapping families (2...254) are reserved. |
| 1013 A demuxer implementation encountering a reserved channel mapping family value |
| 1014 SHOULD act as though the value is 255. |
| 1015 </t> |
| 1016 </section> |
| 1017 |
| 1018 <section anchor="downmix" title="Downmixing"> |
| 1019 <t> |
| 1020 An Ogg Opus player MUST support any valid channel mapping with a channel |
| 1021 mapping family of 0 or 1, even if the number of channels does not match the |
| 1022 physically connected audio hardware. |
| 1023 Players SHOULD perform channel mixing to increase or reduce the number of |
| 1024 channels as needed. |
| 1025 </t> |
| 1026 |
| 1027 <t> |
| 1028 Implementations MAY use the matrices in |
| 1029 Figures <xref target="downmix-matrix-3" format="counter"/> |
| 1030 through <xref target="downmix-matrix-8" format="counter"/> to implement |
| 1031 downmixing from multichannel files using |
| 1032 <xref target="channel_mapping_1">Channel Mapping Family 1</xref>. |
| 1033 These matrices are known to give acceptable results for stereo. |
| 1034 Matrices for 3 and 4 channels are normalized so each coefficient row sums |
| 1035 to 1 to avoid clipping. |
| 1036 For 5 or more channels they are normalized to 2 as a compromise between |
| 1037 clipping and dynamic range reduction. |
| 1038 </t> |
| 1039 <t> |
| 1040 In these matrices the front left and front right channels are generally |
| 1041 passed through directly. |
| 1042 When a surround channel is split between both the left and right stereo |
| 1043 channels, coefficients are chosen so their squares sum to 1, which |
| 1044 helps preserve the perceived intensity. |
| 1045 Rear channels are mixed more diffusely or attenuated to maintain focus |
| 1046 on the front channels. |
| 1047 </t> |
| 1048 |
| 1049 <figure anchor="downmix-matrix-3" |
| 1050 title="Stereo downmix matrix for the linear surround channel mapping" |
| 1051 align="center"> |
| 1052 <artwork align="center"><![CDATA[ |
| 1053 L output = ( 0.585786 * left + 0.414214 * center ) |
| 1054 R output = ( 0.414214 * center + 0.585786 * right ) |
| 1055 ]]></artwork> |
| 1056 <postamble> |
| 1057 Exact coefficient values are 1 and 1/sqrt(2), multiplied by |
| 1058 1/(1 + 1/sqrt(2)) for normalization. |
| 1059 </postamble> |
| 1060 </figure> |
| 1061 |
| 1062 <figure anchor="downmix-matrix-4" |
| 1063 title="Stereo downmix matrix for the quadraphonic channel mapping" |
| 1064 align="center"> |
| 1065 <artwork align="center"><![CDATA[ |
| 1066 /          \   /                                     \ / FL \ |
| 1067 | L output |   | 0.422650 0.000000 0.366025 0.211325 | | FR | |
| 1068 | R output | = | 0.000000 0.422650 0.211325 0.366025 | | RL | |
| 1069 \          /   \                                     / \ RR / |
| 1070 ]]></artwork> |
| 1071 <postamble> |
| 1072 Exact coefficient values are 1, sqrt(3)/2 and 1/2, multiplied by |
| 1073 1/(1 + sqrt(3)/2 + 1/2) for normalization. |
| 1074 </postamble> |
| 1075 </figure> |
| 1076 |
| 1077 <figure anchor="downmix-matrix-5" |
| 1078 title="Stereo downmix matrix for the 5.0 surround mapping" |
| 1079 align="center"> |
| 1080 <artwork align="center"><![CDATA[ |
| 1081                                                          / FL \ |
| 1082 /   \   /                                              \ | FC | |
| 1083 | L |   | 0.650802 0.460186 0.000000 0.563611 0.325401 | | FR | |
| 1084 | R | = | 0.000000 0.460186 0.650802 0.325401 0.563611 | | RL | |
| 1085 \   /   \                                              / | RR | |
| 1086                                                          \    / |
| 1087 ]]></artwork> |
| 1088 <postamble> |
| 1089 Exact coefficient values are 1, 1/sqrt(2), sqrt(3)/2 and 1/2, multiplied by |
| 1090 2/(1 + 1/sqrt(2) + sqrt(3)/2 + 1/2) |
| 1091 for normalization. |
| 1092 </postamble> |
| 1093 </figure> |
| 1094 |
| 1095 <figure anchor="downmix-matrix-6" |
| 1096 title="Stereo downmix matrix for the 5.1 surround mapping" |
| 1097 align="center"> |
| 1098 <artwork align="center"><![CDATA[ |
| 1099                                                                 /FL \ |
| 1100 / \   /                                                       \ |FC | |
| 1101 |L|   | 0.529067 0.374107 0.000000 0.458186 0.264534 0.374107 | |FR | |
| 1102 |R| = | 0.000000 0.374107 0.529067 0.264534 0.458186 0.374107 | |RL | |
| 1103 \ /   \                                                       / |RR | |
| 1104                                                                 \LFE/ |
| 1105 ]]></artwork> |
| 1106 <postamble> |
| 1107 Exact coefficient values are 1, 1/sqrt(2), sqrt(3)/2 and 1/2, multiplied by |
| 1108 2/(1 + 1/sqrt(2) + sqrt(3)/2 + 1/2 + 1/sqrt(2)) |
| 1109 for normalization. |
| 1110 </postamble> |
| 1111 </figure> |
| 1112 |
| 1113 <figure anchor="downmix-matrix-7" |
| 1114 title="Stereo downmix matrix for the 6.1 surround mapping" |
| 1115 align="center"> |
| 1116 <artwork align="center"><![CDATA[ |
| 1117 /                                                                \ |
| 1118 | 0.455310 0.321953 0.000000 0.394310 0.227655 0.278819 0.321953 | |
| 1119 | 0.000000 0.321953 0.455310 0.227655 0.394310 0.278819 0.321953 | |
| 1120 \                                                                / |
| 1121 ]]></artwork> |
| 1122 <postamble> |
| 1123 Exact coefficient values are 1, 1/sqrt(2), sqrt(3)/2, 1/2 and |
| 1124 sqrt(3)/2/sqrt(2), multiplied by |
| 1125 2/(1 + 1/sqrt(2) + sqrt(3)/2 + 1/2 + |
| 1126 sqrt(3)/2/sqrt(2) + 1/sqrt(2)) for normalization. |
| 1127 The coefficients are in the same order as in <xref target="channel_mapping_1" />, |
| 1128 and the matrices above. |
| 1129 </postamble> |
| 1130 </figure> |
| 1131 |
| 1132 <figure anchor="downmix-matrix-8" |
| 1133 title="Stereo downmix matrix for the 7.1 surround mapping" |
| 1134 align="center"> |
| 1135 <artwork align="center"><![CDATA[ |
| 1136 /                                                                 \ |
| 1137 | .388631 .274804 .000000 .336565 .194316 .336565 .194316 .274804 | |
| 1138 | .000000 .274804 .388631 .194316 .336565 .194316 .336565 .274804 | |
| 1139 \                                                                 / |
| 1140 ]]></artwork> |
| 1141 <postamble> |
| 1142 Exact coefficient values are 1, 1/sqrt(2), sqrt(3)/2 and 1/2, multiplied by |
| 1143 2/(2 + 2/sqrt(2) + sqrt(3)) for normalization. |
| 1144 The coefficients are in the same order as in <xref target="channel_mapping_1" />, |
| 1145 and the matrices above. |
| 1146 </postamble> |
| 1147 </figure> |
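<t>
As a non-normative check of the normalization described above, the following
Python sketch rebuilds the 5.1 matrix of
<xref target="downmix-matrix-6"/> from its exact coefficient values and
applies it to one frame of samples.
The function names and the representation of a frame as a list of six
samples in Channel Mapping Family 1 order are assumptions of this example.
</t>
<figure align="center">
<artwork align="left"><![CDATA[
```python
import math


def downmix_5_1_matrix():
    """Return the 2x6 stereo downmix matrix for 5.1 surround.

    Column order: FL, FC, FR, RL, RR, LFE.
    """
    center = 1 / math.sqrt(2)
    rear_near = math.sqrt(3) / 2   # rear channel on the same side
    rear_far = 0.5                 # rear channel on the opposite side
    lfe = 1 / math.sqrt(2)
    # Each row is normalized so that its coefficients sum to 2.
    norm = 2 / (1 + center + rear_near + rear_far + lfe)
    left = [1, center, 0, rear_near, rear_far, lfe]
    right = [0, center, 1, rear_far, rear_near, lfe]
    return [[norm * c for c in left], [norm * c for c in right]]


def downmix(matrix, frame):
    """Apply a downmix matrix to one frame (one sample per channel)."""
    return [sum(c * s for c, s in zip(row, frame)) for row in matrix]
```
]]></artwork>
</figure>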
| 1148 |
| 1149 </section> |
| 1150 |
| 1151 </section> <!-- end channel_mapping_table --> |
| 1152 |
| 1153 </section> <!-- end id_header --> |
| 1154 |
| 1155 <section anchor="comment_header" title="Comment Header"> |
| 1156 |
| 1157 <figure anchor="comment_header_packet" title="Comment Header Packet" |
| 1158 align="center"> |
| 1159 <artwork align="center"><![CDATA[ |
| 1160  0                   1                   2                   3 |
| 1161  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 |
| 1162 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 1163 |      'O'      |      'p'      |      'u'      |      's'      | |
| 1164 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 1165 |      'T'      |      'a'      |      'g'      |      's'      | |
| 1166 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 1167 |                     Vendor String Length                      | |
| 1168 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 1169 |                                                               | |
| 1170 :                        Vendor String...                       : |
| 1171 |                                                               | |
| 1172 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 1173 |                   User Comment List Length                    | |
| 1174 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 1175 |                 User Comment #0 String Length                 | |
| 1176 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 1177 |                                                               | |
| 1178 :                   User Comment #0 String...                   : |
| 1179 |                                                               | |
| 1180 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 1181 |                 User Comment #1 String Length                 | |
| 1182 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 1183 :                                                               : |
| 1184 ]]></artwork> |
| 1185 </figure> |
| 1186 |
| 1187 <t> |
| 1188 The comment header consists of a 64-bit magic signature, followed by data in |
| 1189 the same format as the <xref target="vorbis-comment"/> header used in Ogg |
| 1190 Vorbis, except (like Ogg Theora and Speex) the final "framing bit" specified |
| 1191 in the Vorbis spec is not present. |
| 1192 <list style="numbers"> |
| 1193 <t>Magic Signature: |
| 1194 <vspace blankLines="1"/> |
| 1195 This is an 8-octet (64-bit) field that allows codec identification and is |
| 1196 human-readable. |
| 1197 It contains, in order, the magic numbers: |
| 1198 <list style="empty"> |
| 1199 <t>0x4F 'O'</t> |
| 1200 <t>0x70 'p'</t> |
| 1201 <t>0x75 'u'</t> |
| 1202 <t>0x73 's'</t> |
| 1203 <t>0x54 'T'</t> |
| 1204 <t>0x61 'a'</t> |
| 1205 <t>0x67 'g'</t> |
| 1206 <t>0x73 's'</t> |
| 1207 </list> |
| 1208 Starting with "Op" helps distinguish it from audio data packets, as this is an |
| 1209 invalid TOC sequence. |
| 1210 <vspace blankLines="1"/> |
| 1211 </t> |
| 1212 <t>Vendor String Length (32 bits, unsigned, little endian): |
| 1213 <vspace blankLines="1"/> |
| 1214 This field gives the length of the following vendor string, in octets. |
| 1215 It MUST NOT indicate that the vendor string is longer than the rest of the |
| 1216 packet. |
| 1217 <vspace blankLines="1"/> |
| 1218 </t> |
| 1219 <t>Vendor String (variable length, UTF-8 vector): |
| 1220 <vspace blankLines="1"/> |
| 1221 This is a simple human-readable tag for vendor information, encoded as a UTF-8 |
| 1222 string <xref target="RFC3629"/>. |
| 1223 No terminating null octet is necessary. |
| 1224 <vspace blankLines="1"/> |
| 1225 This tag is intended to identify the codec encoder and encapsulation |
| 1226 implementations, for tracing differences in technical behavior. |
| 1227 User-facing applications can use the 'ENCODER' user comment tag to identify |
| 1228 themselves. |
| 1229 <vspace blankLines="1"/> |
| 1230 </t> |
| 1231 <t>User Comment List Length (32 bits, unsigned, little endian): |
| 1232 <vspace blankLines="1"/> |
| 1233 This field indicates the number of user-supplied comments. |
| 1234 It MAY indicate there are zero user-supplied comments, in which case there are |
| 1235 no additional fields in the packet. |
| 1236 It MUST NOT indicate that there are so many comments that the comment string |
| 1237 lengths would require more data than is available in the rest of the packet. |
| 1238 <vspace blankLines="1"/> |
| 1239 </t> |
| 1240 <t>User Comment #i String Length (32 bits, unsigned, little endian): |
| 1241 <vspace blankLines="1"/> |
| 1242 This field gives the length of the following user comment string, in octets. |
| 1243 There is one for each user comment indicated by the 'user comment list length' |
| 1244 field. |
| 1245 It MUST NOT indicate that the string is longer than the rest of the packet. |
| 1246 <vspace blankLines="1"/> |
| 1247 </t> |
| 1248 <t>User Comment #i String (variable length, UTF-8 vector): |
| 1249 <vspace blankLines="1"/> |
| 1250 This field contains a single user comment encoded as a UTF-8 |
| 1251 string <xref target="RFC3629"/>. |
| 1252 There is one for each user comment indicated by the 'user comment list length' |
| 1253 field. |
| 1254 </t> |
| 1255 </list> |
| 1256 </t> |
| 1257 |
| 1258 <t> |
| 1259 The vendor string length and user comment list length are REQUIRED, and |
| 1260 implementations SHOULD treat a stream as invalid if it contains a comment |
| 1261 header that does not have enough data for these fields, or that does not |
| 1262 contain enough data for the corresponding vendor string or user comments they |
| 1263 describe. |
| 1264 Making this check before allocating the associated memory to contain the data |
| 1265 helps prevent a possible Denial-of-Service (DoS) attack from small comment |
| 1266 headers that claim to contain strings longer than the entire packet or more |
| 1267 user comments than could possibly fit in the packet. |
| 1268 </t> |
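<t>
The length checks described above can be sketched in Python as follows.
This is a non-normative illustration: the function name, the return value,
and the use of ValueError to signal an invalid stream are assumptions of this
example.
Each claimed length is compared against the octets actually remaining in the
packet before any string is extracted, and any data following the user
comment list is ignored.
</t>
<figure align="center">
<artwork align="left"><![CDATA[
```python
import struct


def parse_comment_header(packet):
    """Parse an 'OpusTags' packet into (vendor, [comments])."""
    # 8 octets of magic, vendor string length, comment list length.
    if len(packet) < 16 or packet[:8] != b"OpusTags":
        raise ValueError("bad magic signature or truncated header")
    pos = 8

    def read_u32():
        nonlocal pos
        if len(packet) - pos < 4:
            raise ValueError("truncated length field")
        (value,) = struct.unpack_from("<I", packet, pos)
        pos += 4
        return value

    def read_string():
        nonlocal pos
        length = read_u32()
        if length > len(packet) - pos:
            raise ValueError("string longer than the rest of the packet")
        text = packet[pos:pos + length].decode("utf-8")
        pos += length
        return text

    vendor = read_string()
    count = read_u32()
    # Every comment needs at least its own 4-octet length field.
    if count > (len(packet) - pos) // 4:
        raise ValueError("more comments than could fit in the packet")
    comments = [read_string() for _ in range(count)]
    return vendor, comments
```
]]></artwork>
</figure>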
| 1269 |
| 1270 <t> |
| 1271 Immediately following the user comment list, the comment header MAY |
| 1272 contain zero-padding or other binary data which is not specified here. |
| 1273 If the least-significant bit of the first octet of this data is 1, then editors |
| 1274 SHOULD preserve the contents of this data when updating the tags, but if this |
| 1275 bit is 0, all such data MAY be treated as padding, and truncated or discarded |
| 1276 as desired. |
| 1277 This allows informal experimentation with the format of this binary data until |
| 1278 it can be specified later. |
| 1279 </t> |
| 1280 |
| 1281 <t> |
| 1282 The comment header can be arbitrarily large and might be spread over a large |
| 1283 number of Ogg pages. |
| 1284 Implementations MUST avoid attempting to allocate excessive amounts of memory |
| 1285 when presented with a very large comment header. |
| 1286 To accomplish this, implementations MAY treat a stream as invalid if it has a |
| 1287 comment header larger than 125,829,120 octets (120 MB), and MAY |
| 1288 ignore individual comments that are not fully contained within the first |
| 1289 61,440 octets of the comment header. |
| 1290 </t> |
| 1291 |
| 1292 <section anchor="comment_format" title="Tag Definitions"> |
| 1293 <t> |
| 1294 The user comment strings follow the NAME=value format described by |
| 1295 <xref target="vorbis-comment"/> with the same recommended tag names: |
| 1296 ARTIST, TITLE, DATE, ALBUM, and so on. |
| 1297 </t> |
| 1298 <t> |
| 1299 Two new comment tags are introduced here: |
| 1300 </t> |
| 1301 |
| 1302 <t>First, an optional gain for track normalization:</t> |
| 1303 <figure align="center"> |
| 1304 <artwork align="left"><![CDATA[ |
| 1305 R128_TRACK_GAIN=-573 |
| 1306 ]]></artwork> |
| 1307 </figure> |
| 1308 <t> |
| 1309 representing the volume shift needed to normalize the track's volume |
| 1310 during isolated playback, in random shuffle, and so on. |
| 1311 The gain is a Q7.8 fixed point number in dB, as in the ID header's 'output |
| 1312 gain' field. |
| 1313 This tag is similar to the REPLAYGAIN_TRACK_GAIN tag in |
| 1314 Vorbis <xref target="replay-gain"/>, except that the normal volume |
| 1315 reference is the <xref target="EBU-R128"/> standard. |
| 1316 </t> |
| 1317 <t>Second, an optional gain for album normalization:</t> |
| 1318 <figure align="center"> |
| 1319 <artwork align="left"><![CDATA[ |
| 1320 R128_ALBUM_GAIN=111 |
| 1321 ]]></artwork> |
| 1322 </figure> |
| 1323 <t> |
| 1324 representing the volume shift needed to normalize the overall volume when |
| 1325 played as part of a particular collection of tracks. |
| 1326 The gain is also a Q7.8 fixed point number in dB, as in the ID header's |
| 1327 'output gain' field. |
| 1328 The values '-573' and '111' given here are just examples. |
| 1329 </t> |
| 1330 <t> |
| 1331 An Ogg Opus stream MUST NOT have more than one of each of these tags, and if |
| 1332 present their values MUST be an integer from -32768 to 32767, inclusive, |
| 1333 represented in ASCII as a base 10 number with no whitespace. |
| 1334 A leading '+' or '-' character is valid. |
| 1335 Leading zeros are also permitted, but the value MUST be represented by |
| 1336 no more than 6 characters. |
| 1337 Other non-digit characters MUST NOT be present. |
| 1338 </t> |
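The syntax rules above for the gain tags can be checked with a small hand-written parser. The sketch below is illustrative only: the names parse_r128_gain and q8_to_db are hypothetical, and it assumes the caller passes the text after the '=' together with its length, since comment strings are not NUL-terminated. It counts a leading sign as one of the 6 permitted characters.

```c
#include <stddef.h>

/* Hypothetical helper: parses the value of an R128_TRACK_GAIN or
 * R128_ALBUM_GAIN tag. Returns 0 and stores the Q7.8 gain in *gain_q8
 * on success, or -1 if the value violates the rules above. */
int parse_r128_gain(const char *s, size_t len, int *gain_q8)
{
    size_t i = 0;
    int negative = 0;
    long value = 0;

    if (len == 0 || len > 6)
        return -1;                  /* at most 6 characters */
    if (s[0] == '+' || s[0] == '-') {
        negative = (s[0] == '-');
        i = 1;
    }
    if (i == len)
        return -1;                  /* a sign with no digits */
    for (; i < len; i++) {
        if (s[i] < '0' || s[i] > '9')
            return -1;              /* whitespace or other non-digit */
        value = value * 10 + (s[i] - '0');
    }
    if (negative)
        value = -value;
    if (value < -32768 || value > 32767)
        return -1;                  /* outside the signed 16-bit range */
    *gain_q8 = (int)value;
    return 0;
}

/* Converts a Q7.8 fixed-point gain to decibels. */
double q8_to_db(int gain_q8)
{
    return gain_q8 / 256.0;
}
```

With the example values used earlier, '-573' corresponds to roughly -2.24 dB and '111' to roughly +0.43 dB.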
| 1339 <t> |
| 1340 If present, R128_TRACK_GAIN and R128_ALBUM_GAIN MUST correctly represent |
| 1341 the R128 normalization gain relative to the 'output gain' field specified |
| 1342 in the ID header. |
| 1343 If a player chooses to make use of the R128_TRACK_GAIN tag or the |
| 1344 R128_ALBUM_GAIN tag, it MUST apply those gains |
| 1345 <spanx style="emph">in addition</spanx> to the 'output gain' value. |
| 1346 If a tool modifies the ID header's 'output gain' field, it MUST also update or |
| 1347 remove the R128_TRACK_GAIN and R128_ALBUM_GAIN comment tags if present. |
| 1348 A muxer SHOULD place the gain it wants other tools to use by default into the |
| 1349 'output gain' field, and not the comment tag. |
| 1350 </t> |
| 1351 <t> |
| 1352 To avoid confusion with multiple normalization schemes, an Opus comment header |
| 1353 SHOULD NOT contain any of the REPLAYGAIN_TRACK_GAIN, REPLAYGAIN_TRACK_PEAK, |
| 1354 REPLAYGAIN_ALBUM_GAIN, or REPLAYGAIN_ALBUM_PEAK tags, unless they are only |
| 1355 to be used in some context where there is guaranteed to be no such confusion. |
| 1356 <xref target="EBU-R128"/> normalization is preferred to the earlier |
| 1357 REPLAYGAIN schemes because of its clear definition and adoption by industry. |
| 1358 Peak normalizations are difficult to calculate reliably for lossy codecs |
| 1359 because of variation in excursion heights due to decoder differences. |
| 1360 In the authors' investigations they were not applied consistently or broadly |
| 1361 enough to merit inclusion here. |
| 1362 </t> |
| 1363 </section> <!-- end comment_format --> |
| 1364 </section> <!-- end comment_header --> |
| 1365 |
| 1366 </section> <!-- end headers --> |
| 1367 |
| 1368 <section anchor="packet_size_limits" title="Packet Size Limits"> |
| 1369 <t> |
| 1370 Technically, valid Opus packets can be arbitrarily large due to the padding |
| 1371 format, although the amount of non-padding data they can contain is bounded. |
| 1372 These packets might be spread over a similarly enormous number of Ogg pages. |
| 1373 When encoding, implementations SHOULD limit the use of padding in audio data |
| 1374 packets to no more than is necessary to make a variable bitrate (VBR) stream |
| 1375 constant bitrate (CBR), unless they have no reasonable way to determine what |
| 1376 is necessary. |
| 1377 Demuxers SHOULD treat audio data packets as invalid (treat them as if they were |
| 1378 malformed Opus packets with an invalid TOC sequence) if they are larger than |
| 1379 61,440 octets per Opus stream, unless they have a specific reason for |
| 1380 allowing extra padding. |
| 1381 Such packets necessarily contain more padding than needed to make a stream CBR. |
| 1382 Demuxers MUST avoid attempting to allocate excessive amounts of memory when |
| 1383 presented with a very large packet. |
| 1384 Demuxers MAY treat audio data packets as invalid or partially process them if |
| 1385 they are larger than 61,440 octets in an Ogg Opus stream with channel |
| 1386 mapping families 0 or 1. |
| 1387 Demuxers MAY treat audio data packets as invalid or partially process them in |
| 1388 any Ogg Opus stream if the packet is larger than 61,440 octets and also |
| 1389 larger than 7,680 octets per Opus stream. |
| 1390 The presence of an extremely large packet in the stream could indicate a |
| 1391 memory exhaustion attack or stream corruption. |
| 1392 </t> |
| 1393 <t> |
| 1394 In an Ogg Opus stream, the largest possible valid packet that does not use |
| 1395 padding has a size of (61,298*N - 2) octets. |
| 1396 With 255 streams, this is 15,630,988 octets and can |
| 1397 span up to 61,298 Ogg pages, all but one of which will have a granule |
| 1398 position of -1. |
| 1399 This is of course a very extreme packet, consisting of 255 streams, each |
| 1400 containing 120 ms of audio encoded as 2.5 ms frames, each frame |
| 1401 using the maximum possible number of octets (1275) and stored in the least |
| 1402 efficient manner allowed (a VBR code 3 Opus packet). |
| 1403 Even in such a packet, most of the data will be zeros as 2.5 ms frames |
| 1404 cannot actually use all 1275 octets. |
| 1405 </t> |
| 1406 <t> |
| 1407 The largest packet consisting of entirely useful data is |
| 1408 (15,326*N - 2) octets. |
| 1409 This corresponds to 120 ms of audio encoded as 10 ms frames in either |
| 1410 SILK or Hybrid mode, but at a data rate of over 1 Mbps, which makes little |
| 1411 sense for the quality achieved. |
| 1412 </t> |
| 1413 <t> |
| 1414 A more reasonable limit is (7,664*N - 2) octets. |
| 1415 This corresponds to 120 ms of audio encoded as 20 ms stereo CELT mode |
| 1416 frames, with a total bitrate just under 511 kbps (not counting the Ogg |
| 1417 encapsulation overhead). |
| 1418 For channel mapping family 1, N=8 provides a reasonable upper bound, as it |
| 1419 allows for each of the 8 possible output channels to be decoded from a |
| 1420 separate stereo Opus stream. |
| 1421 This gives a size of 61,310 octets, which is rounded up to a multiple of |
| 1422 1,024 octets to yield the audio data packet size of 61,440 octets |
| 1423 that any implementation is expected to be able to process successfully. |
| 1424 </t> |
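The size bounds in this section are simple functions of the stream count N. The sketch below restates them in C so the arithmetic can be checked; all function names are hypothetical, and N is assumed to be at least 1.

```c
/* Largest valid packet that does not use padding: 61,298*N - 2. */
unsigned long max_unpadded_packet_size(unsigned n)
{
    return 61298ul * n - 2;
}

/* Largest packet consisting of entirely useful data: 15,326*N - 2. */
unsigned long max_useful_packet_size(unsigned n)
{
    return 15326ul * n - 2;
}

/* The "more reasonable" limit: 7,664*N - 2. */
unsigned long reasonable_packet_size(unsigned n)
{
    return 7664ul * n - 2;
}

/* Rounds x up to the next multiple of m (m must be nonzero). */
unsigned long round_up(unsigned long x, unsigned long m)
{
    return (x + m - 1) / m * m;
}

/* Returns 1 if a packet is larger than 61,440 octets per Opus stream,
 * the point at which a demuxer SHOULD treat it as invalid. */
int packet_exceeds_per_stream_limit(unsigned long packet_len, unsigned n)
{
    return packet_len > 61440ul * n;
}
```

For N=255 the first function gives 15,630,988 octets, and for N=8 the third gives 61,310 octets, which rounds up to 61,440 with a multiple of 1,024.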
| 1425 </section> |
| 1426 |
| 1427 <section anchor="encoder" title="Encoder Guidelines"> |
| 1428 <t> |
| 1429 When encoding Opus streams, Ogg muxers SHOULD take into account the |
| 1430 algorithmic delay of the Opus encoder. |
| 1431 </t> |
| 1432 <t> |
| 1433 In encoders derived from the reference |
| 1434 implementation <xref target="RFC6716"/>, the number of samples can be |
| 1435 queried with: |
| 1436 </t> |
| 1437 <figure align="center"> |
| 1438 <artwork align="center"><![CDATA[ |
| 1439 opus_encoder_ctl(encoder_state, OPUS_GET_LOOKAHEAD(&delay_samples)); |
| 1440 ]]></artwork> |
| 1441 </figure> |
| 1442 <t> |
| 1443 To achieve good quality in the very first samples of a stream, implementations |
| 1444 MAY use linear predictive coding (LPC) extrapolation to generate at least 120 |
| 1445 extra samples at the beginning to avoid the Opus encoder having to encode a |
| 1446 discontinuous signal. |
| 1447 For more information on linear prediction, see |
| 1448 <xref target="linear-prediction"/>. |
| 1449 For an input file containing 'length' samples, the implementation SHOULD set |
| 1450 the pre-skip header value to (delay_samples + extra_samples), encode |
| 1451 at least (length + delay_samples + extra_samples) |
| 1452 samples, and set the granule position of the last page to |
| 1453 (length + delay_samples + extra_samples). |
| 1454 This ensures that the encoded file has the same duration as the original, with |
| 1455 no time offset. The best way to pad the end of the stream is to also use LPC |
| 1456 extrapolation, but zero-padding is also acceptable. |
| 1457 </t> |
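The bookkeeping described above can be written down directly. This is a sketch under stated assumptions, not a reference implementation: the struct and function names are hypothetical, all counts are assumed to be expressed at the same sampling rate (48 kHz), and the pre-skip field is assumed to be the 16-bit unsigned field of the ID header.

```c
#include <stdint.h>

/* Hypothetical container for the values a muxer needs. */
struct ogg_opus_timing {
    uint16_t pre_skip;           /* value for the ID header pre-skip field */
    uint64_t samples_to_encode;  /* minimum number of samples to encode */
    uint64_t final_granulepos;   /* granule position of the last page */
};

/* `length` is the number of input samples, `delay_samples` is the encoder
 * lookahead, and `extra_samples` is the number of samples generated by
 * extrapolation at the start. Returns 0 on success, or -1 if the
 * pre-skip would not fit in 16 bits. */
int compute_timing(uint64_t length, uint32_t delay_samples,
                   uint32_t extra_samples, struct ogg_opus_timing *out)
{
    uint64_t skip = (uint64_t)delay_samples + extra_samples;

    if (skip > UINT16_MAX)
        return -1;
    out->pre_skip = (uint16_t)skip;
    out->samples_to_encode = length + skip;
    out->final_granulepos = length + skip;
    return 0;
}
```

Because the decoder discards pre-skip samples and stops at the final granule position, the decoded duration is (final_granulepos - pre_skip) = length, which matches the original.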
| 1458 |
| 1459 <section anchor="lpc" title="LPC Extrapolation"> |
| 1460 <t> |
| 1461 The first step in LPC extrapolation is to compute linear prediction |
| 1462 coefficients <xref target="lpc-sample"/>. |
| 1463 When extending the end of the signal, order-N (typically with N ranging from 8 |
| 1464 to 40) LPC analysis is performed on a window near the end of the signal. |
| 1465 The last N samples are used as memory to an infinite impulse response (IIR) |
| 1466 filter. |
| 1467 </t> |
| 1468 <t> |
| 1469 The filter is then applied on a zero input to extrapolate the end of the signal. |
| 1470 Let a(k) be the kth LPC coefficient and x(n) be the nth sample of the signal. |
| 1471 Each new sample past the end of the signal is computed as: |
| 1472 </t> |
| 1473 <figure align="center"> |
| 1474 <artwork align="center"><![CDATA[ |
| 1475          N |
| 1476         --- |
| 1477  x(n) =  \   a(k)*x(n-k) |
| 1478         / |
| 1479         --- |
| 1480         k=1 |
| 1481 ]]></artwork> |
| 1482 </figure> |
| 1483 <t> |
| 1484 The process is repeated independently for each channel. |
| 1485 It is possible to extend the beginning of the signal by applying the same |
| 1486 process backward in time. |
| 1487 When extending the beginning of the signal, it is best to apply a "fade in" to |
| 1488 the extrapolated signal, e.g. by multiplying it by a half-Hanning window |
| 1489 <xref target="hanning"/>. |
| 1490 </t> |
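The recurrence x(n) = sum over k of a(k)*x(n-k) translates into a short loop. The sketch below covers only forward extrapolation of one channel, and assumes the coefficients have already been computed by some LPC analysis, which is not shown. The function name and the argument layout are hypothetical.

```c
#include <stddef.h>

/* Extends a signal past its end with an order-`order` linear predictor.
 * `x` must have room for len + extra samples, of which the first `len`
 * are valid input. `a[k - 1]` holds the coefficient a(k) for
 * k = 1..order. Returns 0 on success, or -1 if there are fewer than
 * `order` samples available to use as filter memory. */
int lpc_extrapolate(double *x, size_t len, size_t extra,
                    const double *a, size_t order)
{
    size_t n, k;

    if (len < order)
        return -1;
    for (n = len; n < len + extra; n++) {
        double sum = 0.0;
        for (k = 1; k <= order; k++)
            sum += a[k - 1] * x[n - k];   /* a(k) * x(n - k) */
        x[n] = sum;
    }
    return 0;
}
```

As a quick sanity check, the order-2 predictor a(1)=2, a(2)=-1 continues a straight line, so the samples 1, 2 are extended with 3, 4, 5. Extending the beginning of a signal would apply the same loop to a time-reversed copy, followed by the fade-in mentioned above.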
| 1491 |
| 1492 </section> |
| 1493 |
| 1494 <section anchor="continuous_chaining" title="Continuous Chaining"> |
| 1495 <t> |
| 1496 In some applications, such as Internet radio, it is desirable to cut a long |
| 1497 stream into smaller chains, e.g. so the comment header can be updated. |
| 1498 This can be done simply by separating the input streams into segments and |
| 1499 encoding each segment independently. |
| 1500 The drawback of this approach is that it creates a small discontinuity |
| 1501 at the boundary due to the lossy nature of Opus. |
| 1502 A muxer MAY avoid this discontinuity by using the following procedure: |
| 1503 <list style="numbers"> |
| 1504 <t>Encode the last frame of the first segment as an independent frame by |
| 1505 turning off all forms of inter-frame prediction. |
| 1506 De-emphasis is allowed.</t> |
| 1507 <t>Set the granule position of the last page to a point near the end of the |
| 1508 last frame.</t> |
| 1509 <t>Begin the second segment with a copy of the last frame of the first |
| 1510 segment.</t> |
| 1511 <t>Set the pre-skip value of the second stream in such a way as to properly |
| 1512 join the two streams.</t> |
| 1513 <t>Continue the encoding process normally from there, without any reset to |
| 1514 the encoder.</t> |
| 1515 </list> |
| 1516 </t> |
| 1517 <t> |
| 1518 In encoders derived from the reference implementation, inter-frame prediction |
| 1519 can be turned off by calling: |
| 1520 </t> |
| 1521 <figure align="center"> |
| 1522 <artwork align="center"><![CDATA[ |
| 1523 opus_encoder_ctl(encoder_state, OPUS_SET_PREDICTION_DISABLED(1)); |
| 1524 ]]></artwork> |
| 1525 </figure> |
| 1526 <t> |
| 1527 For best results, this implementation requires that prediction be explicitly |
| 1528 enabled again before resuming normal encoding, even after a reset. |
| 1529 </t> |
| 1530 |
| 1531 </section> |
| 1532 |
| 1533 </section> |
| 1534 |
| 1535 <section anchor="implementation" title="Implementation Status"> |
| 1536 <t> |
| 1537 A brief summary of major implementations of this draft is available |
| 1538 at <eref target="https://wiki.xiph.org/OggOpusImplementation"/>, |
| 1539 along with their status. |
| 1540 </t> |
| 1541 <t> |
| 1542 [Note to RFC Editor: please remove this entire section before |
| 1543 final publication per <xref target="RFC6982"/>, along with |
| 1544 its references.] |
| 1545 </t> |
| 1546 </section> |
| 1547 |
| 1548 <section anchor="security" title="Security Considerations"> |
| 1549 <t> |
| 1550 Implementations of the Opus codec need to take appropriate security |
| 1551 considerations into account, as outlined in <xref target="RFC4732"/>. |
| 1552 This is just as much a problem for the container as it is for the codec itself. |
| 1553 Malicious payloads and/or input streams can be used to attack codec |
| 1554 implementations. |
| 1555 Implementations MUST NOT overrun their allocated memory nor consume excessive |
| 1556 resources when decoding payloads or processing input streams. |
| 1557 Although problems in encoding applications are typically rarer, this still |
| 1558 applies to a muxer, as vulnerabilities would allow an attacker to attack |
| 1559 transcoding gateways. |
| 1560 </t> |
| 1561 |
| 1562 <t> |
| 1563 Header parsing code is the most likely place for potential overruns. |
| 1564 It is important for implementations to ensure their buffers contain enough |
| 1565 data for all of the required fields before attempting to read them (for example, |
| 1566 for all of the channel map data in the ID header). |
| 1567 Implementations would do well to validate the indices of the channel map, also, |
| 1568 to ensure they meet all of the restrictions outlined in |
| 1569 <xref target="channel_mapping"/>, in order to avoid attempting to read data |
| 1570 from channels that do not exist. |
| 1571 </t> |
| 1572 |
| 1573 <t> |
| 1574 To avoid excessive resource usage, we advise implementations to be especially |
| 1575 wary of streams that might cause them to process far more data than was |
| 1576 actually transmitted. |
| 1577 For example, a relatively small comment header may contain values for the |
| 1578 string lengths or user comment list length that imply that it is many |
| 1579 gigabytes in size. |
| 1580 Even computing the size of the required buffer could overflow a 32-bit integer, |
| 1581 and actually attempting to allocate such a buffer before verifying it would be |
| 1582 a reasonable size is a bad idea. |
| 1583 After reading the user comment list length, implementations might wish to |
| 1584 verify that the header contains at least the minimum amount of data for that |
| 1585 many comments (4 additional octets per comment, to indicate each has a |
| 1586 length of zero) before proceeding any further, again taking care to avoid |
| 1587 overflow in these calculations. |
| 1588 If allocating an array of pointers to point at these strings, the size of the |
| 1589 pointers may be larger than 4 octets, potentially requiring a separate |
| 1590 overflow check. |
| 1591 </t> |
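The checks described above can be done without allocating anything, by always comparing a claimed length against the number of octets that remain rather than adding it to an offset. The following sketch assumes the whole comment header is already in memory and uses the layout defined for the comment header: the 8-octet "OpusTags" signature, a 32-bit little-endian vendor string length, the vendor string, a 32-bit little-endian comment count, and then one length-prefixed string per comment. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Verifies that every length field fits in the data actually present.
 * Returns 0 and stores the comment count on success, -1 otherwise. */
int validate_comment_header(const unsigned char *buf, size_t len,
                            uint32_t *count_out)
{
    size_t pos = 8;
    uint32_t n, count, i;

    if (len < 8 + 4 || memcmp(buf, "OpusTags", 8) != 0)
        return -1;
    n = read_le32(buf + pos);           /* vendor string length */
    pos += 4;
    if (n > len - pos)                  /* never compute pos + n */
        return -1;
    pos += n;
    if (len - pos < 4)
        return -1;
    count = read_le32(buf + pos);       /* user comment list length */
    pos += 4;
    if (count > (len - pos) / 4)        /* 4 octets minimum per comment */
        return -1;
    for (i = 0; i < count; i++) {
        if (len - pos < 4)
            return -1;
        n = read_le32(buf + pos);       /* length of this comment */
        pos += 4;
        if (n > len - pos)
            return -1;
        pos += n;
    }
    *count_out = count;
    return 0;
}
```

Only after this pass succeeds would a parser allocate storage, and an array of `count` pointers still needs its own overflow check because a pointer can be wider than 4 octets.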
| 1592 |
| 1593 <t> |
| 1594 Another bug in this class we have observed more than once involves the handling |
| 1595 of invalid data at the end of a stream. |
| 1596 Often, implementations will seek to the end of a stream to locate the last |
| 1597 timestamp in order to compute its total duration. |
| 1598 If they do not find a valid capture pattern and Ogg page from the desired |
| 1599 logical stream, they will back up and try again. |
| 1600 If care is not taken to avoid re-scanning data that was already scanned, this |
| 1601 search can quickly devolve into something with a complexity that is quadratic |
| 1602 in the amount of invalid data. |
| 1603 </t> |
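One way to keep the backward search linear is to make each scan window end where the previous one began. The sketch below operates on an in-memory buffer for simplicity and only looks for the 4-octet "OggS" capture pattern. A real demuxer would read from a file and would also verify the page checksum and serial number before accepting a match. The function name is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Returns the offset of the last "OggS" capture pattern in buf, or -1 if
 * there is none. The buffer is scanned backward in windows of `chunk`
 * octets. Each candidate start offset is examined exactly once: a window
 * covers start offsets [start, end), reading up to 3 octets past `end`
 * so that a pattern straddling the boundary is still found. */
long find_last_capture(const unsigned char *buf, size_t len, size_t chunk)
{
    size_t end = len;   /* start offsets at or beyond `end` are done */

    if (chunk == 0)
        return -1;
    while (end > 0) {
        size_t start = end > chunk ? end - chunk : 0;
        size_t limit = end + 3 < len ? end + 3 : len;
        size_t i;

        /* i is one past the last octet of the candidate pattern. */
        for (i = limit; i >= start + 4; i--) {
            if (memcmp(buf + i - 4, "OggS", 4) == 0)
                return (long)(i - 4);
        }
        end = start;    /* never rescan what was just covered */
    }
    return -1;
}
```

The total work is proportional to the buffer length regardless of how much invalid data precedes the end of the stream, because `end` only ever moves toward zero.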
| 1604 |
| 1605 <t> |
| 1606 In general when seeking, implementations will wish to be cautious about the |
| 1607 effects of invalid granule position values, and ensure all algorithms will |
| 1608 continue to make progress and eventually terminate, even if these are missing |
| 1609 or out-of-order. |
| 1610 </t> |
| 1611 |
| 1612 <t> |
| 1613 Like most other container formats, Ogg Opus streams SHOULD NOT be used with |
| 1614 insecure ciphers or cipher modes that are vulnerable to known-plaintext |
| 1615 attacks. |
| 1616 Elements such as the Ogg page capture pattern and the magic signatures in the |
| 1617 ID header and the comment header all have easily predictable values, in |
| 1618 addition to various elements of the codec data itself. |
| 1619 </t> |
| 1620 </section> |
| 1621 |
| 1622 <section anchor="content_type" title="Content Type"> |
| 1623 <t> |
| 1624 An "Ogg Opus file" consists of one or more sequentially multiplexed segments, |
| 1625 each containing exactly one Ogg Opus stream. |
| 1626 The RECOMMENDED mime-type for Ogg Opus files is "audio/ogg". |
| 1627 </t> |
| 1628 |
| 1629 <t> |
| 1630 If more specificity is desired, one MAY indicate the presence of Opus streams |
| 1631 using the codecs parameter defined in <xref target="RFC6381"/> and |
| 1632 <xref target="RFC5334"/>, e.g., |
| 1633 </t> |
| 1634 <figure> |
| 1635 <artwork align="center"><![CDATA[ |
| 1636 audio/ogg; codecs=opus |
| 1637 ]]></artwork> |
| 1638 </figure> |
| 1639 <t> |
| 1640 for an Ogg Opus file. |
| 1641 </t> |
| 1642 |
| 1643 <t> |
| 1644 The RECOMMENDED filename extension for Ogg Opus files is '.opus'. |
| 1645 </t> |
| 1646 |
| 1647 <t> |
| 1648 When Opus is concurrently multiplexed with other streams in an Ogg container, |
| 1649 one SHOULD use one of the "audio/ogg", "video/ogg", or "application/ogg" |
| 1650 mime-types, as defined in <xref target="RFC5334"/>. |
| 1651 Such streams are not strictly "Ogg Opus files" as described above, |
| 1652 since they contain more than a single Opus stream per sequentially |
| 1653 multiplexed segment, e.g. video or multiple audio tracks. |
| 1654 In such cases the '.opus' filename extension is NOT RECOMMENDED. |
| 1655 </t> |
| 1656 |
| 1657 <t> |
| 1658 In either case, this document updates <xref target="RFC5334"/> |
| 1659 to add 'opus' as a codecs parameter value with char[8]: 'OpusHead' |
| 1660 as Codec Identifier. |
| 1661 </t> |
| 1662 </section> |
| 1663 |
| 1664 <section anchor="iana" title="IANA Considerations"> |
| 1665 <t> |
| 1666 This document updates the IANA Media Types registry to add .opus |
| 1667 as a file extension for "audio/ogg", and to add itself as a reference |
| 1668 alongside <xref target="RFC5334"/> for "audio/ogg", "video/ogg", and |
| 1669 "application/ogg" Media Types. |
| 1670 </t> |
| 1671 <t> |
| 1672 This document defines a new registry "Opus Channel Mapping Families" to |
| 1673 indicate how the semantic meanings of the channels in a multi-channel Opus |
| 1674 stream are described. |
| 1675 IANA is requested to create a new name space of "Opus Channel Mapping |
| 1676 Families". |
| 1677 This will be a new registry on the IANA Matrix, and not a subregistry of an |
| 1678 existing registry. |
| 1679 Modifications to this registry follow the "Specification Required" registration |
| 1680 policy as defined in <xref target="RFC5226"/>. |
| 1681 Each registry entry consists of a Channel Mapping Family Number, which is |
| 1682 specified in decimal in the range 0 to 255, inclusive, and a Reference (or |
| 1683 list of references). |
| 1684 Each Reference must point to sufficient documentation to describe what |
| 1685 information is coded in the Opus identification header for this channel |
| 1686 mapping family, how a demuxer determines the Stream Count ('N') and Coupled |
| 1687 Stream Count ('M') from this information, and how it determines the proper |
| 1688 interpretation of each of the decoded channels. |
| 1689 </t> |
| 1690 <t> |
| 1691 This document defines three initial assignments for this registry. |
| 1692 </t> |
| 1693 <texttable> |
| 1694 <ttcol>Value</ttcol><ttcol>Reference</ttcol> |
| 1695 <c>0</c><c>[RFCXXXX] <xref target="channel_mapping_0"/></c> |
| 1696 <c>1</c><c>[RFCXXXX] <xref target="channel_mapping_1"/></c> |
| 1697 <c>255</c><c>[RFCXXXX] <xref target="channel_mapping_255"/></c> |
| 1698 </texttable> |
| 1699 <t> |
| 1700 The designated expert will determine if the Reference points to a specification |
| 1701 that meets the requirements for permanence and ready availability laid out |
| 1702 in <xref target="RFC5226"/> and that it specifies the information |
| 1703 described above with sufficient clarity to allow interoperable |
| 1704 implementations. |
| 1705 </t> |
| 1706 </section> |
| 1707 |
| 1708 <section anchor="Acknowledgments" title="Acknowledgments"> |
| 1709 <t> |
| 1710 Thanks to Ben Campbell, Joel M. Halpern, Mark Harris, Greg Maxwell, |
| 1711 Christopher "Monty" Montgomery, Jean-Marc Valin, Stephan Wenger, and Mo Zanaty |
| 1712 for their valuable contributions to this document. |
| 1713 Additional thanks to Andrew D'Addesio, Greg Maxwell, and Vincent Penquerc'h for |
| 1714 their feedback based on early implementations. |
| 1715 </t> |
| 1716 </section> |
| 1717 |
| 1718 <section title="RFC Editor Notes"> |
| 1719 <t> |
| 1720 In <xref target="iana"/>, "RFCXXXX" is to be replaced with the RFC number |
| 1721 assigned to this draft. |
| 1722 </t> |
| 1723 </section> |
| 1724 |
| 1725 </middle> |
| 1726 <back> |
| 1727 <references title="Normative References"> |
| 1728 &rfc2119; |
| 1729 &rfc3533; |
| 1730 &rfc3629; |
| 1731 &rfc5226; |
| 1732 &rfc5334; |
| 1733 &rfc6381; |
| 1734 &rfc6716; |
| 1735 |
| 1736 <reference anchor="EBU-R128" target="https://tech.ebu.ch/loudness"> |
| 1737 <front> |
| 1738 <title>Loudness Recommendation EBU R128</title> |
| 1739 <author> |
| 1740 <organization>EBU Technical Committee</organization> |
| 1741 </author> |
| 1742 <date month="August" year="2011"/> |
| 1743 </front> |
| 1744 </reference> |
| 1745 |
| 1746 <reference anchor="vorbis-comment" |
| 1747 target="https://www.xiph.org/vorbis/doc/v-comment.html"> |
| 1748 <front> |
| 1749 <title>Ogg Vorbis I Format Specification: Comment Field and Header |
| 1750 Specification</title> |
| 1751 <author initials="C." surname="Montgomery" |
| 1752 fullname="Christopher &quot;Monty&quot; Montgomery"/> |
| 1753 <date month="July" year="2002"/> |
| 1754 </front> |
| 1755 </reference> |
| 1756 |
| 1757 </references> |
| 1758 |
| 1759 <references title="Informative References"> |
| 1760 |
| 1761 <!--?rfc include="http://xml.resource.org/public/rfc/bibxml/reference.RFC.3550.xml"?--> |
| 1762 &rfc4732; |
| 1763 &rfc6982; |
| 1764 &rfc7587; |
| 1765 |
| 1766 <reference anchor="flac" |
| 1767 target="https://xiph.org/flac/format.html"> |
| 1768 <front> |
| 1769 <title>FLAC - Free Lossless Audio Codec Format Description</title> |
| 1770 <author initials="J." surname="Coalson" fullname="Josh Coalson"/> |
| 1771 <date month="January" year="2008"/> |
| 1772 </front> |
| 1773 </reference> |
| 1774 |
| 1775 <reference anchor="hanning" |
| 1776 target="https://en.wikipedia.org/w/index.php?title=Window_function&amp;oldid=703074467#Hann_.28Hanning.29_window"> |
| 1777 <front> |
| 1778 <title>Hann window</title> |
| 1779 <author> |
| 1780 <organization>Wikipedia</organization> |
| 1781 </author> |
| 1782 <date month="February" year="2016"/> |
| 1783 </front> |
| 1784 </reference> |
| 1785 |
| 1786 <reference anchor="linear-prediction" |
| 1787 target="https://en.wikipedia.org/w/index.php?title=Linear_predictive_coding&amp;oldid=687498962"> |
| 1788 <front> |
| 1789 <title>Linear Predictive Coding</title> |
| 1790 <author> |
| 1791 <organization>Wikipedia</organization> |
| 1792 </author> |
| 1793 <date month="October" year="2015"/> |
| 1794 </front> |
| 1795 </reference> |
| 1796 |
| 1797 <reference anchor="lpc-sample" |
| 1798 target="https://svn.xiph.org/trunk/vorbis/lib/lpc.c"> |
| 1799 <front> |
| 1800 <title>Autocorrelation LPC coeff generation algorithm |
| 1801 (Vorbis source code)</title> |
| 1802 <author initials="J." surname="Degener" fullname="Jutta Degener"/> |
| 1803 <author initials="C." surname="Bormann" fullname="Carsten Bormann"/> |
| 1804 <date month="November" year="1994"/> |
| 1805 </front> |
| 1806 </reference> |
| 1807 |
| 1808 <reference anchor="q-notation" |
| 1809 target="https://en.wikipedia.org/w/index.php?title=Q_%28number_format%29&amp;oldid=697252615"> |
| 1810 <front> |
| 1811 <title>Q (number format)</title> |
| 1812 <author><organization>Wikipedia</organization></author> |
| 1813 <date month="December" year="2015"/> |
| 1814 </front> |
| 1815 </reference> |
| 1816 |
| 1817 <reference anchor="replay-gain" |
| 1818 target="https://wiki.xiph.org/VorbisComment#Replay_Gain"> |
| 1819 <front> |
| 1820 <title>VorbisComment: Replay Gain</title> |
| 1821 <author initials="C." surname="Parker" fullname="Conrad Parker"/> |
| 1822 <author initials="M." surname="Leese" fullname="Martin Leese"/> |
| 1823 <date month="June" year="2009"/> |
| 1824 </front> |
| 1825 </reference> |
| 1826 |
| 1827 <reference anchor="seeking" |
| 1828 target="https://wiki.xiph.org/Seeking"> |
| 1829 <front> |
| 1830 <title>Granulepos Encoding and How Seeking Really Works</title> |
| 1831 <author initials="S." surname="Pfeiffer" fullname="Silvia Pfeiffer"/> |
| 1832 <author initials="C." surname="Parker" fullname="Conrad Parker"/> |
| 1833 <author initials="G." surname="Maxwell" fullname="Greg Maxwell"/> |
| 1834 <date month="May" year="2012"/> |
| 1835 </front> |
| 1836 </reference> |
| 1837 |
| 1838 <reference anchor="vorbis-mapping" |
| 1839 target="https://www.xiph.org/vorbis/doc/Vorbis_I_spec.html#x1-810004.3.9"> |
| 1840 <front> |
| 1841 <title>The Vorbis I Specification, Section 4.3.9 Output Channel Order</title> |
| 1842 <author initials="C." surname="Montgomery" |
| 1843 fullname="Christopher &quot;Monty&quot; Montgomery"/> |
| 1844 <date month="January" year="2010"/> |
| 1845 </front> |
| 1846 </reference> |
| 1847 |
| 1848 <reference anchor="vorbis-trim" |
| 1849 target="https://xiph.org/vorbis/doc/Vorbis_I_spec.html#x1-132000A.2"> |
| 1850 <front> |
| 1851 <title>The Vorbis I Specification, Appendix A: Embedding Vorbis |
| 1852 into an Ogg stream</title> |
| 1853 <author initials="C." surname="Montgomery" |
| 1854 fullname="Christopher &quot;Monty&quot; Montgomery"/> |
| 1855 <date month="November" year="2008"/> |
| 1856 </front> |
| 1857 </reference> |
| 1858 |
| 1859 <reference anchor="wave-multichannel" |
| 1860 target="http://msdn.microsoft.com/en-us/windows/hardware/gg463006.aspx"> |
| 1861 <front> |
| 1862 <title>Multiple Channel Audio Data and WAVE Files</title> |
| 1863 <author> |
| 1864 <organization>Microsoft Corporation</organization> |
| 1865 </author> |
| 1866 <date month="March" year="2007"/> |
| 1867 </front> |
| 1868 </reference> |
| 1869 |
| 1870 </references> |
| 1871 |
| 1872 </back> |
| 1873 </rfc> |