xz/doc/xz-file-format.txt - Issue 2869016: Add an unpatched version of xz, XZ Utils, to /trunk/deps/third_party

Unified Diff: xz/doc/xz-file-format.txt

Issue 2869016: Add an unpatched version of xz, XZ Utils, to /trunk/deps/third_party (Closed) Base URL: svn://svn.chromium.org/chrome/trunk/deps/third_party/

Patch Set: Created 10 years, 6 months ago

Use n/p to move between diff chunks; N/P to move between comments. Draft comments are only viewable by you.

Jump to:

View side-by-side diff with in-line comments

Download patch

Index: xz/doc/xz-file-format.txt

===================================================================

--- xz/doc/xz-file-format.txt (revision 0)

+++ xz/doc/xz-file-format.txt (revision 0)

@@ -0,0 +1,1150 @@

+The .xz File Format

+===================

+Version 1.0.4 (2009-08-27)

+ 0. Preface

+ 0.1. Notices and Acknowledgements

+ 0.2. Getting the Latest Version

+ 0.3. Version History

+ 1. Conventions

+ 1.1. Byte and Its Representation

+ 1.2. Multibyte Integers

+ 2. Overall Structure of .xz File

+ 2.1. Stream

+ 2.1.1. Stream Header

+ 2.1.1.1. Header Magic Bytes

+ 2.1.1.2. Stream Flags

+ 2.1.1.3. CRC32

+ 2.1.2. Stream Footer

+ 2.1.2.1. CRC32

+ 2.1.2.2. Backward Size

+ 2.1.2.3. Stream Flags

+ 2.1.2.4. Footer Magic Bytes

+ 2.2. Stream Padding

+ 3. Block

+ 3.1. Block Header

+ 3.1.1. Block Header Size

+ 3.1.2. Block Flags

+ 3.1.3. Compressed Size

+ 3.1.4. Uncompressed Size

+ 3.1.5. List of Filter Flags

+ 3.1.6. Header Padding

+ 3.1.7. CRC32

+ 3.2. Compressed Data

+ 3.3. Block Padding

+ 3.4. Check

+ 4. Index

+ 4.1. Index Indicator

+ 4.2. Number of Records

+ 4.3. List of Records

+ 4.3.1. Unpadded Size

+ 4.3.2. Uncompressed Size

+ 4.4. Index Padding

+ 4.5. CRC32

+ 5. Filter Chains

+ 5.1. Alignment

+ 5.2. Security

+ 5.3. Filters

+ 5.3.1. LZMA2

+ 5.3.2. Branch/Call/Jump Filters for Executables

+ 5.3.3. Delta

+ 5.3.3.1. Format of the Encoded Output

+ 5.4. Custom Filter IDs

+ 5.4.1. Reserved Custom Filter ID Ranges

+ 6. Cyclic Redundancy Checks

+ 7. References

+0. Preface

+ This document describes the .xz file format (filename suffix

+ ".xz", MIME type "application/x-xz"). It is intended that this

+ this format replace the old .lzma format used by LZMA SDK and

+ LZMA Utils.

+0.1. Notices and Acknowledgements

+ This file format was designed by Lasse Collin

+ <lasse.collin@tukaani.org> and Igor Pavlov.

+ Special thanks for helping with this document goes to

+ Ville Koskinen. Thanks for helping with this document goes to

+ Mark Adler, H. Peter Anvin, Mikko Pouru, and Lars Wirzenius.

+ This document has been put into the public domain.

+0.2. Getting the Latest Version

+ The latest official version of this document can be downloaded

+ from <http://tukaani.org/xz/xz-file-format.txt>.

+ Specific versions of this document have a filename

+ xz-file-format-X.Y.Z.txt where X.Y.Z is the version number.

+ For example, the version 1.0.0 of this document is available

+ at <http://tukaani.org/xz/xz-file-format-1.0.0.txt>.

+0.3. Version History

+ Version Date Description

+ 1.0.4 2009-08-27 Language improvements in Sections 1.2,

+ 2.1.1.2, 3.1.1, 3.1.2, and 5.3.1

+ 1.0.3 2009-06-05 Spelling fixes in Sections 5.1 and 5.4

+ 1.0.2 2009-06-04 Typo fixes in Sections 4 and 5.3.1

+ 1.0.1 2009-06-01 Typo fix in Section 0.3 and minor

+ clarifications to Sections 2, 2.2,

+ 3.3, 4.4, and 5.3.2

+ 1.0.0 2009-01-14 The first official version

+1. Conventions

+ The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD",

+ "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this

+ document are to be interpreted as described in [RFC-2119].

+ Indicating a warning means displaying a message, returning

+ appropriate exit status, or doing something else to let the

+ user know that something worth warning occurred. The operation

+ SHOULD still finish if a warning is indicated.

+ Indicating an error means displaying a message, returning

+ appropriate exit status, or doing something else to let the

+ user know that something prevented successfully finishing the

+ operation. The operation MUST be aborted once an error has

+ been indicated.

+1.1. Byte and Its Representation

+ In this document, byte is always 8 bits.

+ A "null byte" has all bits unset. That is, the value of a null

+ byte is 0x00.

+ To represent byte blocks, this document uses notation that

+ is similar to the notation used in [RFC-1952]:

+ +-------+

+ | Foo | One byte.

+ +-------+

+ +---+---+

+ | Foo | Two bytes; that is, some of the vertical bars

+ +---+---+ can be missing.

+ +=======+

+ | Foo | Zero or more bytes.

+ +=======+

+ In this document, a boxed byte or a byte sequence declared

+ using this notation is called "a field". The example field

+ above would be called "the Foo field" or plain "Foo".

+ If there are many fields, they may be split to multiple lines.

+ This is indicated with an arrow ("--->"):

+ +=====+

+ | Foo |

+ +=====+

+ ---> | Bar |

+ +=====+

+ The above is equivalent to this:

+ +=====+=====+

+ | Foo | Bar |

+ +=====+=====+

+1.2. Multibyte Integers

+ Multibyte integers of static length, such as CRC values,

+ are stored in little endian byte order (least significant

+ byte first).

+ When smaller values are more likely than bigger values (for

+ example file sizes), multibyte integers are encoded in a

+ variable-length representation:

+ - Numbers in the range [0, 127] are copied as is, and take

+ one byte of space.

+ - Bigger numbers will occupy two or more bytes. All but the

+ last byte of the multibyte representation have the highest

+ (eighth) bit set.

+ For now, the value of the variable-length integers is limited

+ to 63 bits, which limits the encoded size of the integer to

+ nine bytes. These limits may be increased in the future if

+ needed.

+ The following C code illustrates encoding and decoding of

+ variable-length integers. The functions return the number of

+ bytes occupied by the integer (1-9), or zero on error.

+ #include <stddef.h>

+ #include <inttypes.h>

+ size_t

+ encode(uint8_t buf[static 9], uint64_t num)

+ {

+ if (num > UINT64_MAX / 2)

+ return 0;

+ size_t i = 0;

+ while (num >= 0x80) {

+ buf[i++] = (uint8_t)(num) | 0x80;

+ num >>= 7;

+ }

+ buf[i++] = (uint8_t)(num);

+ return i;

+ }

+ size_t

+ decode(const uint8_t buf[], size_t size_max, uint64_t *num)

+ {

+ if (size_max == 0)

+ return 0;

+ if (size_max > 9)

+ size_max = 9;

+ *num = buf[0] & 0x7F;

+ size_t i = 0;

+ while (buf[i++] & 0x80) {

+ if (i >= size_max || buf[i] == 0x00)

+ return 0;

+ *num |= (uint64_t)(buf[i] & 0x7F) << (i * 7);

+ }

+ return i;

+ }

+2. Overall Structure of .xz File

+ A standalone .xz files consist of one or more Streams which may

+ have Stream Padding between or after them:

+ +========+================+========+================+

+ +========+================+========+================+

+ The sizes of Stream and Stream Padding are always multiples

+ of four bytes, thus the size of every valid .xz file MUST be

+ a multiple of four bytes.

+ While a typical file contains only one Stream and no Stream

+ Padding, a decoder handling standalone .xz files SHOULD support

+ files that have more than one Stream or Stream Padding.

+ In contrast to standalone .xz files, when the .xz file format

+ is used as an internal part of some other file format or

+ communication protocol, it usually is expected that the decoder

+ stops after the first Stream, and doesn't look for Stream

+ Padding or possibly other Streams.

+2.1. Stream

+ +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+ +=======+

+ +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+ +=======+

+ +=======+-+-+-+-+-+-+-+-+-+-+-+-+

+ ---> | Index | Stream Footer |

+ +=======+-+-+-+-+-+-+-+-+-+-+-+-+

+ All the above fields have a size that is a multiple of four. If

+ Stream is used as an internal part of another file format, it

+ is RECOMMENDED to make the Stream start at an offset that is

+ a multiple of four bytes.

+ Stream Header, Index, and Stream Footer are always present in

+ a Stream. The maximum size of the Index field is 16 GiB (2^34).

+ There are zero or more Blocks. The maximum number of Blocks is

+ limited only by the maximum size of the Index field.

+ Total size of a Stream MUST be less than 8 EiB (2^63 bytes).

+ The same limit applies to the total amount of uncompressed

+ data stored in a Stream.

+ If an implementation supports handling .xz files with multiple

+ concatenated Streams, it MAY apply the above limits to the file

+ as a whole instead of limiting per Stream basis.

+2.1.1. Stream Header

+ +---+---+---+---+---+---+-------+------+--+--+--+--+

+ | Header Magic Bytes | Stream Flags | CRC32 |

+ +---+---+---+---+---+---+-------+------+--+--+--+--+

+2.1.1.1. Header Magic Bytes

+ The first six (6) bytes of the Stream are so called Header

+ Magic Bytes. They can be used to identify the file type.

+ Using a C array and ASCII:

+ const uint8_t HEADER_MAGIC[6]

+ = { 0xFD, '7', 'z', 'X', 'Z', 0x00 };

+ In plain hexadecimal:

+ FD 37 7A 58 5A 00

+ Notes:

+ - The first byte (0xFD) was chosen so that the files cannot

+ be erroneously detected as being in .lzma format, in which

+ the first byte is in the range [0x00, 0xE0].

+ - The sixth byte (0x00) was chosen to prevent applications

+ from misdetecting the file as a text file.

+ If the Header Magic Bytes don't match, the decoder MUST

+ indicate an error.

+2.1.1.2. Stream Flags

+ The first byte of Stream Flags is always a null byte. In the

+ future, this byte may be used to indicate a new Stream version

+ or other Stream properties.

+ The second byte of Stream Flags is a bit field:

+ Bit(s) Mask Description

+ 0-3 0x0F Type of Check (see Section 3.4):

+ ID Size Check name

+ 0x00 0 bytes None

+ 0x01 4 bytes CRC32

+ 0x02 4 bytes (Reserved)

+ 0x03 4 bytes (Reserved)

+ 0x04 8 bytes CRC64

+ 0x05 8 bytes (Reserved)

+ 0x06 8 bytes (Reserved)

+ 0x07 16 bytes (Reserved)

+ 0x08 16 bytes (Reserved)

+ 0x09 16 bytes (Reserved)

+ 0x0A 32 bytes SHA-256

+ 0x0B 32 bytes (Reserved)

+ 0x0C 32 bytes (Reserved)

+ 0x0D 64 bytes (Reserved)

+ 0x0E 64 bytes (Reserved)

+ 0x0F 64 bytes (Reserved)

+ 4-7 0xF0 Reserved for future use; MUST be zero for now.

+ Implementations SHOULD support at least the Check IDs 0x00

+ (None) and 0x01 (CRC32). Supporting other Check IDs is

+ OPTIONAL. If an unsupported Check is used, the decoder SHOULD

+ indicate a warning or error.

+ If any reserved bit is set, the decoder MUST indicate an error.

+ It is possible that there is a new field present which the

+ decoder is not aware of, and can thus parse the Stream Header

+ incorrectly.

+2.1.1.3. CRC32

+ The CRC32 is calculated from the Stream Flags field. It is

+ stored as an unsigned 32-bit little endian integer. If the

+ calculated value does not match the stored one, the decoder

+ MUST indicate an error.

+ The idea is that Stream Flags would always be two bytes, even

+ if new features are needed. This way old decoders will be able

+ to verify the CRC32 calculated from Stream Flags, and thus

+ distinguish between corrupt files (CRC32 doesn't match) and

+ files that the decoder doesn't support (CRC32 matches but

+ Stream Flags has reserved bits set).

+2.1.2. Stream Footer

+ +-+-+-+-+---+---+---+---+-------+------+----------+---------+

+ +-+-+-+-+---+---+---+---+-------+------+----------+---------+

+2.1.2.1. CRC32

+ The CRC32 is calculated from the Backward Size and Stream Flags

+ fields. It is stored as an unsigned 32-bit little endian

+ integer. If the calculated value does not match the stored one,

+ the decoder MUST indicate an error.

+ The reason to have the CRC32 field before the Backward Size and

+ Stream Flags fields is to keep the four-byte fields aligned to

+ a multiple of four bytes.

+2.1.2.2. Backward Size

+ Backward Size is stored as a 32-bit little endian integer,

+ which indicates the size of the Index field as multiple of

+ four bytes, minimum value being four bytes:

+ real_backward_size = (stored_backward_size + 1) * 4;

+ If the stored value does not match the real size of the Index

+ field, the decoder MUST indicate an error.

+ Using a fixed-size integer to store Backward Size makes

+ it slightly simpler to parse the Stream Footer when the

+ application needs to parse the Stream backwards.

+2.1.2.3. Stream Flags

+ This is a copy of the Stream Flags field from the Stream

+ Header. The information stored to Stream Flags is needed

+ when parsing the Stream backwards. The decoder MUST compare

+ the Stream Flags fields in both Stream Header and Stream

+ Footer, and indicate an error if they are not identical.

+2.1.2.4. Footer Magic Bytes

+ As the last step of the decoding process, the decoder MUST

+ verify the existence of Footer Magic Bytes. If they don't

+ match, an error MUST be indicated.

+ Using a C array and ASCII:

+ const uint8_t FOOTER_MAGIC[2] = { 'Y', 'Z' };

+ In hexadecimal:

+ 59 5A

+ The primary reason to have Footer Magic Bytes is to make

+ it easier to detect incomplete files quickly, without

+ uncompressing. If the file does not end with Footer Magic Bytes

+ (excluding Stream Padding described in Section 2.2), it cannot

+ be undamaged, unless someone has intentionally appended garbage

+ after the end of the Stream.

+2.2. Stream Padding

+ Only the decoders that support decoding of concatenated Streams

+ MUST support Stream Padding.

+ Stream Padding MUST contain only null bytes. To preserve the

+ four-byte alignment of consecutive Streams, the size of Stream

+ Padding MUST be a multiple of four bytes. Empty Stream Padding

+ is allowed. If these requirements are not met, the decoder MUST

+ indicate an error.

+ Note that non-empty Stream Padding is allowed at the end of the

+ file; there doesn't need to be a new Stream after non-empty

+ Stream Padding. This can be convenient in certain situations

+ [GNU-tar].

+ The possibility of Stream Padding MUST be taken into account

+ when designing an application that parses Streams backwards,

+ and the application supports concatenated Streams.

+3. Block

+ +==============+=================+===============+=======+

+ +==============+=================+===============+=======+

+3.1. Block Header

+ +-------------------+-------------+=================+

+ | Block Header Size | Block Flags | Compressed Size |

+ +-------------------+-------------+=================+

+ +===================+======================+

+ ---> | Uncompressed Size | List of Filter Flags |

+ +===================+======================+

+ +================+--+--+--+--+

+ ---> | Header Padding | CRC32 |

+ +================+--+--+--+--+

+3.1.1. Block Header Size

+ This field overlaps with the Index Indicator field (see

+ Section 4.1).

+ This field contains the size of the Block Header field,

+ including the Block Header Size field itself. Valid values are

+ in the range [0x01, 0xFF], which indicate the size of the Block

+ Header as multiples of four bytes, minimum size being eight

+ bytes:

+ real_header_size = (encoded_header_size + 1) * 4;

+ If a Block Header bigger than 1024 bytes is needed in the

+ future, a new field can be added between the Block Header and

+ Compressed Data fields. The presence of this new field would

+ be indicated in the Block Header field.

+3.1.2. Block Flags

+ The Block Flags field is a bit field:

+ Bit(s) Mask Description

+ 0-1 0x03 Number of filters (1-4)

+ 2-5 0x3C Reserved for future use; MUST be zero for now.

+ 6 0x40 The Compressed Size field is present.

+ 7 0x80 The Uncompressed Size field is present.

+ If any reserved bit is set, the decoder MUST indicate an error.

+ It is possible that there is a new field present which the

+ decoder is not aware of, and can thus parse the Block Header

+ incorrectly.

+3.1.3. Compressed Size

+ This field is present only if the appropriate bit is set in

+ the Block Flags field (see Section 3.1.2).

+ The Compressed Size field contains the size of the Compressed

+ Data field, which MUST be non-zero. Compressed Size is stored

+ using the encoding described in Section 1.2. If the Compressed

+ Size doesn't match the size of the Compressed Data field, the

+ decoder MUST indicate an error.

+3.1.4. Uncompressed Size

+ This field is present only if the appropriate bit is set in

+ the Block Flags field (see Section 3.1.2).

+ The Uncompressed Size field contains the size of the Block

+ after uncompressing. Uncompressed Size is stored using the

+ encoding described in Section 1.2. If the Uncompressed Size

+ does not match the real uncompressed size, the decoder MUST

+ indicate an error.

+ Storing the Compressed Size and Uncompressed Size fields serves

+ several purposes:

+ - The decoder knows how much memory it needs to allocate

+ for a temporary buffer in multithreaded mode.

+ - Simple error detection: wrong size indicates a broken file.

+ - Seeking forwards to a specific location in streamed mode.

+ It should be noted that the only reliable way to determine

+ the real uncompressed size is to uncompress the Block,

+ because the Block Header and Index fields may contain

+ (intentionally or unintentionally) invalid information.

+3.1.5. List of Filter Flags

+ +================+================+ +================+

+ +================+================+ +================+

+ The number of Filter Flags fields is stored in the Block Flags

+ field (see Section 3.1.2).

+ The format of each Filter Flags field is as follows:

+ +===========+====================+===================+

+ | Filter ID | Size of Properties | Filter Properties |

+ +===========+====================+===================+

+ Both Filter ID and Size of Properties are stored using the

+ encoding described in Section 1.2. Size of Properties indicates

+ the size of the Filter Properties field as bytes. The list of

+ officially defined Filter IDs and the formats of their Filter

+ Properties are described in Section 5.3.

+ Filter IDs greater than or equal to 0x4000_0000_0000_0000

+ (2^62) are reserved for implementation-specific internal use.

+ These Filter IDs MUST never be used in List of Filter Flags.

+3.1.6. Header Padding

+ This field contains as many null byte as it is needed to make

+ the Block Header have the size specified in Block Header Size.

+ If any of the bytes are not null bytes, the decoder MUST

+ indicate an error. It is possible that there is a new field

+ present which the decoder is not aware of, and can thus parse

+ the Block Header incorrectly.

+3.1.7. CRC32

+ The CRC32 is calculated over everything in the Block Header

+ field except the CRC32 field itself. It is stored as an

+ unsigned 32-bit little endian integer. If the calculated

+ value does not match the stored one, the decoder MUST indicate

+ an error.

+ By verifying the CRC32 of the Block Header before parsing the

+ actual contents allows the decoder to distinguish between

+ corrupt and unsupported files.

+3.2. Compressed Data

+ The format of Compressed Data depends on Block Flags and List

+ of Filter Flags. Excluding the descriptions of the simplest

+ filters in Section 5.3, the format of the filter-specific

+ encoded data is out of scope of this document.

+3.3. Block Padding

+ Block Padding MUST contain 0-3 null bytes to make the size of

+ the Block a multiple of four bytes. This can be needed when

+ the size of Compressed Data is not a multiple of four. If any

+ of the bytes in Block Padding are not null bytes, the decoder

+ MUST indicate an error.

+3.4. Check

+ The type and size of the Check field depends on which bits

+ are set in the Stream Flags field (see Section 2.1.1.2).

+ The Check, when used, is calculated from the original

+ uncompressed data. If the calculated Check does not match the

+ stored one, the decoder MUST indicate an error. If the selected

+ type of Check is not supported by the decoder, it SHOULD

+ indicate a warning or error.

+4. Index

+ +-----------------+===================+

+ | Index Indicator | Number of Records |

+ +-----------------+===================+

+ +=================+===============+-+-+-+-+

+ ---> | List of Records | Index Padding | CRC32 |

+ +=================+===============+-+-+-+-+

+ Index serves several purposes. Using it, one can

+ - verify that all Blocks in a Stream have been processed;

+ - find out the uncompressed size of a Stream; and

+ - quickly access the beginning of any Block (random access).

+4.1. Index Indicator

+ This field overlaps with the Block Header Size field (see

+ Section 3.1.1). The value of Index Indicator is always 0x00.

+4.2. Number of Records

+ This field indicates how many Records there are in the List

+ of Records field, and thus how many Blocks there are in the

+ Stream. The value is stored using the encoding described in

+ Section 1.2. If the decoder has decoded all the Blocks of the

+ Stream, and then notices that the Number of Records doesn't

+ match the real number of Blocks, the decoder MUST indicate an

+ error.

+4.3. List of Records

+ List of Records consists of as many Records as indicated by the

+ Number of Records field:

+ +========+========+

+ | Record | Record | ...

+ +========+========+

+ Each Record contains information about one Block:

+ +===============+===================+

+ | Unpadded Size | Uncompressed Size |

+ +===============+===================+

+ If the decoder has decoded all the Blocks of the Stream, it

+ MUST verify that the contents of the Records match the real

+ Unpadded Size and Uncompressed Size of the respective Blocks.

+ Implementation hint: It is possible to verify the Index with

+ constant memory usage by calculating for example SHA-256 of

+ both the real size values and the List of Records, then

+ comparing the hash values. Implementing this using

+ non-cryptographic hash like CRC32 SHOULD be avoided unless

+ small code size is important.

+ If the decoder supports random-access reading, it MUST verify

+ that Unpadded Size and Uncompressed Size of every completely

+ decoded Block match the sizes stored in the Index. If only

+ partial Block is decoded, the decoder MUST verify that the

+ processed sizes don't exceed the sizes stored in the Index.

+4.3.1. Unpadded Size

+ This field indicates the size of the Block excluding the Block

+ Padding field. That is, Unpadded Size is the size of the Block

+ Header, Compressed Data, and Check fields. Unpadded Size is

+ stored using the encoding described in Section 1.2. The value

+ MUST never be zero; with the current structure of Blocks, the

+ actual minimum value for Unpadded Size is five.

+ Implementation note: Because the size of the Block Padding

+ field is not included in Unpadded Size, calculating the total

+ size of a Stream or doing random-access reading requires

+ calculating the actual size of the Blocks by rounding Unpadded

+ Sizes up to the next multiple of four.

+ The reason to exclude Block Padding from Unpadded Size is to

+ ease making a raw copy of Compressed Data without Block

+ Padding. This can be useful, for example, if someone wants

+ to convert Streams to some other file format quickly.

+4.3.2. Uncompressed Size

+ This field indicates the Uncompressed Size of the respective

+ Block as bytes. The value is stored using the encoding

+ described in Section 1.2.

+4.4. Index Padding

+ This field MUST contain 0-3 null bytes to pad the Index to

+ a multiple of four bytes. If any of the bytes are not null

+ bytes, the decoder MUST indicate an error.

+4.5. CRC32

+ The CRC32 is calculated over everything in the Index field

+ except the CRC32 field itself. The CRC32 is stored as an

+ unsigned 32-bit little endian integer. If the calculated

+ value does not match the stored one, the decoder MUST indicate

+ an error.

+5. Filter Chains

+ The Block Flags field defines how many filters are used. When

+ more than one filter is used, the filters are chained; that is,

+ the output of one filter is the input of another filter. The

+ following figure illustrates the direction of data flow.

+ v Uncompressed Data ^

+ | Filter 0 |

+ Encoder | Filter 1 | Decoder

+ | Filter n |

+ v Compressed Data ^

+5.1. Alignment

+ Alignment of uncompressed input data is usually the job of

+ the application producing the data. For example, to get the

+ best results, an archiver tool should make sure that all

+ PowerPC executable files in the archive stream start at

+ offsets that are multiples of four bytes.

+ Some filters, for example LZMA2, can be configured to take

+ advantage of specified alignment of input data. Note that

+ taking advantage of aligned input can be beneficial also when

+ a filter is not the first filter in the chain. For example,

+ if you compress PowerPC executables, you may want to use the

+ PowerPC filter and chain that with the LZMA2 filter. Because

+ not only the input but also the output alignment of the PowerPC

+ filter is four bytes, it is now beneficial to set LZMA2

+ settings so that the LZMA2 encoder can take advantage of its

+ four-byte-aligned input data.

+ The output of the last filter in the chain is stored to the

+ Compressed Data field, which is is guaranteed to be aligned

+ to a multiple of four bytes relative to the beginning of the

+ Stream. This can increase

+ - speed, if the filtered data is handled multiple bytes at

+ a time by the filter-specific encoder and decoder,

+ because accessing aligned data in computer memory is

+ usually faster; and

+ - compression ratio, if the output data is later compressed

+ with an external compression tool.

+5.2. Security

+ If filters would be allowed to be chained freely, it would be

+ possible to create malicious files, that would be very slow to

+ decode. Such files could be used to create denial of service

+ attacks.

+ Slow files could occur when multiple filters are chained:

+ v Compressed input data

+ | Filter 1 decoder (last filter)

+ | Filter 0 decoder (non-last filter)

+ v Uncompressed output data

+ The decoder of the last filter in the chain produces a lot of

+ output from little input. Another filter in the chain takes the

+ output of the last filter, and produces very little output

+ while consuming a lot of input. As a result, a lot of data is

+ moved inside the filter chain, but the filter chain as a whole

+ gets very little work done.

+ To prevent this kind of slow files, there are restrictions on

+ how the filters can be chained. These restrictions MUST be

+ taken into account when designing new filters.

+ The maximum number of filters in the chain has been limited to

+ four, thus there can be at maximum of three non-last filters.

+ Of these three non-last filters, only two are allowed to change

+ the size of the data.

+ The non-last filters, that change the size of the data, MUST

+ have a limit how much the decoder can compress the data: the

+ decoder SHOULD produce at least n bytes of output when the

+ filter is given 2n bytes of input. This limit is not

+ absolute, but significant deviations MUST be avoided.

+ The above limitations guarantee that if the last filter in the

+ chain produces 4n bytes of output, the chain as a whole will

+ produce at least n bytes of output.

+5.3. Filters

+5.3.1. LZMA2

+ LZMA (Lempel-Ziv-Markov chain-Algorithm) is a general-purpose

+ compression algorithm with high compression ratio and fast

+ decompression. LZMA is based on LZ77 and range coding

+ algorithms.

+ LZMA2 is an extension on top of the original LZMA. LZMA2 uses

+ LZMA internally, but adds support for flushing the encoder,

+ uncompressed chunks, eases stateful decoder implementations,

+ and improves support for multithreading. Thus, the plain LZMA

+ will not be supported in this file format.

+ Filter ID: 0x21

+ Size of Filter Properties: 1 byte

+ Changes size of data: Yes

+ Allow as a non-last filter: No

+ Allow as the last filter: Yes

+ Preferred alignment:

+ Input data: Adjustable to 1/2/4/8/16 byte(s)

+ Output data: 1 byte

+ The format of the one-byte Filter Properties field is as

+ follows:

+ Bits Mask Description

+ 0-5 0x3F Dictionary Size

+ 6-7 0xC0 Reserved for future use; MUST be zero for now.

+ Dictionary Size is encoded with one-bit mantissa and five-bit

+ exponent. The smallest dictionary size is 4 KiB and the biggest

+ is 4 GiB.

+ Raw value Mantissa Exponent Dictionary size

+ 0 2 11 4 KiB

+ 1 3 11 6 KiB

+ 2 2 12 8 KiB

+ 3 3 12 12 KiB

+ 4 2 13 16 KiB

+ 5 3 13 24 KiB

+ 6 2 14 32 KiB

+ ... ... ... ...

+ 35 3 27 768 MiB

+ 36 2 28 1024 MiB

+ 37 3 29 1536 MiB

+ 38 2 30 2048 MiB

+ 39 3 30 3072 MiB

+ 40 2 31 4096 MiB - 1 B

+ Instead of having a table in the decoder, the dictionary size

+ can be decoded using the following C code:

+ const uint8_t bits = get_dictionary_flags() & 0x3F;

+ if (bits > 40)

+ return DICTIONARY_TOO_BIG; // Bigger than 4 GiB

+ uint32_t dictionary_size;

+ if (bits == 40) {

+ dictionary_size = UINT32_MAX;

+ } else {

+ dictionary_size = 2 | (bits & 1);

+ dictionary_size <<= bits / 2 + 11;

+ }

+5.3.2. Branch/Call/Jump Filters for Executables

+ These filters convert relative branch, call, and jump

+ instructions to their absolute counterparts in executable

+ files. This conversion increases redundancy and thus

+ compression ratio.

+ Size of Filter Properties: 0 or 4 bytes

+ Changes size of data: No

+ Allow as a non-last filter: Yes

+ Allow as the last filter: No

+ Below is the list of filters in this category. The alignment

+ is the same for both input and output data.

+ Filter ID Alignment Description

+ 0x04 1 byte x86 filter (BCJ)

+ 0x05 4 bytes PowerPC (big endian) filter

+ 0x06 16 bytes IA64 filter

+ 0x07 4 bytes ARM (little endian) filter

+ 0x08 2 bytes ARM Thumb (little endian) filter

+ 0x09 4 bytes SPARC filter

+ If the size of Filter Properties is four bytes, the Filter

+ Properties field contains the start offset used for address

+ conversions. It is stored as an unsigned 32-bit little endian

+ integer. The start offset MUST be a multiple of the alignment

+ of the filter as listed in the table above; if it isn't, the

+ decoder MUST indicate an error. If the size of Filter

+ Properties is zero, the start offset is zero.

+ Setting the start offset may be useful if an executable has

+ multiple sections, and there are many cross-section calls.

+ Taking advantage of this feature usually requires usage of

+ the Subblock filter, whose design is not complete yet.

+5.3.3. Delta

+ The Delta filter may increase compression ratio when the value

+ of the next byte correlates with the value of an earlier byte

+ at specified distance.

+ Filter ID: 0x03

+ Size of Filter Properties: 1 byte

+ Changes size of data: No

+ Allow as a non-last filter: Yes

+ Allow as the last filter: No

+ Preferred alignment:

+ Input data: 1 byte

+ Output data: Same as the original input data

+ The Properties byte indicates the delta distance, which can be

+ 1-256 bytes backwards from the current byte: 0x00 indicates

+ distance of 1 byte and 0xFF distance of 256 bytes.

+5.3.3.1. Format of the Encoded Output

+ The code below illustrates both encoding and decoding with

+ the Delta filter.

+ // Distance is in the range [1, 256].

+ const unsigned int distance = get_properties_byte() + 1;

+ uint8_t pos = 0;

+ uint8_t delta[256];

+ memset(delta, 0, sizeof(delta));

+ while (1) {

+ const int byte = read_byte();

+ if (byte == EOF)

+ break;

+ uint8_t tmp = delta[(uint8_t)(distance + pos)];

+ if (is_encoder) {

+ tmp = (uint8_t)(byte) - tmp;

+ delta[pos] = (uint8_t)(byte);

+ } else {

+ tmp = (uint8_t)(byte) + tmp;

+ delta[pos] = tmp;

+ }

+ write_byte(tmp);

+ --pos;

+ }

+5.4. Custom Filter IDs

+ If a developer wants to use custom Filter IDs, he has two

+ choices. The first choice is to contact Lasse Collin and ask

+ him to allocate a range of IDs for the developer.

+ The second choice is to generate a 40-bit random integer,

+ which the developer can use as his personal Developer ID.

+ To minimize the risk of collisions, Developer ID has to be

+ a randomly generated integer, not manually selected "hex word".

+ The following command, which works on many free operating

+ systems, can be used to generate Developer ID:

+ dd if=/dev/urandom bs=5 count=1 | hexdump

+ The developer can then use his Developer ID to create unique

+ (well, hopefully unique) Filter IDs.

+ Bits Mask Description

+ 0-15 0x0000_0000_0000_FFFF Filter ID

+ 16-55 0x00FF_FFFF_FFFF_0000 Developer ID

+ 56-62 0x3F00_0000_0000_0000 Static prefix: 0x3F

+ The resulting 63-bit integer will use 9 bytes of space when

+ stored using the encoding described in Section 1.2. To get

+ a shorter ID, see the beginning of this Section how to

+ request a custom ID range.

+5.4.1. Reserved Custom Filter ID Ranges

+ Range Description

+ 0x0000_0300 - 0x0000_04FF Reserved to ease .7z compatibility

+ 0x0002_0000 - 0x0007_FFFF Reserved to ease .7z compatibility

+ 0x0200_0000 - 0x07FF_FFFF Reserved to ease .7z compatibility

+6. Cyclic Redundancy Checks

+ There are several incompatible variations to calculate CRC32

+ and CRC64. For simplicity and clarity, complete examples are

+ provided to calculate the checks as they are used in this file

+ format. Implementations MAY use different code as long as it

+ gives identical results.

+ The program below reads data from standard input, calculates

+ the CRC32 and CRC64 values, and prints the calculated values

+ as big endian hexadecimal strings to standard output.

+ #include <stddef.h>

+ #include <inttypes.h>

+ #include <stdio.h>

+ uint32_t crc32_table[256];

+ uint64_t crc64_table[256];

+ void

+ init(void)

+ {

+ static const uint32_t poly32 = UINT32_C(0xEDB88320);

+ static const uint64_t poly64

+ = UINT64_C(0xC96C5795D7870F42);

+ for (size_t i = 0; i < 256; ++i) {

+ uint32_t crc32 = i;

+ uint64_t crc64 = i;

+ for (size_t j = 0; j < 8; ++j) {

+ if (crc32 & 1)

+ crc32 = (crc32 >> 1) ^ poly32;

+ else

+ crc32 >>= 1;

+ if (crc64 & 1)

+ crc64 = (crc64 >> 1) ^ poly64;

+ else

+ crc64 >>= 1;

+ }

+ crc32_table[i] = crc32;

+ crc64_table[i] = crc64;

+ }

+ uint32_t

+ crc32(const uint8_t *buf, size_t size, uint32_t crc)

+ {

+ crc = ~crc;

+ for (size_t i = 0; i < size; ++i)

+ crc = crc32_table[buf[i] ^ (crc & 0xFF)]

+ ^ (crc >> 8);

+ return ~crc;

+ }

+ uint64_t

+ crc64(const uint8_t *buf, size_t size, uint64_t crc)

+ {

+ crc = ~crc;

+ for (size_t i = 0; i < size; ++i)

+ crc = crc64_table[buf[i] ^ (crc & 0xFF)]

+ ^ (crc >> 8);

+ return ~crc;

+ }

+ int

+ main()

+ {

+ init();

+ uint32_t value32 = 0;

+ uint64_t value64 = 0;

+ uint64_t total_size = 0;

+ uint8_t buf[8192];

+ while (1) {

+ const size_t buf_size

+ = fread(buf, 1, sizeof(buf), stdin);

+ if (buf_size == 0)

+ break;

+ total_size += buf_size;

+ value32 = crc32(buf, buf_size, value32);

+ value64 = crc64(buf, buf_size, value64);

+ }

+ printf("Bytes: %" PRIu64 "\n", total_size);

+ printf("CRC-32: 0x%08" PRIX32 "\n", value32);

+ printf("CRC-64: 0x%016" PRIX64 "\n", value64);

+ return 0;

+ }

+7. References

+ LZMA SDK - The original LZMA implementation

+ http://7-zip.org/sdk.html

+ LZMA Utils - LZMA adapted to POSIX-like systems

+ http://tukaani.org/lzma/

+ XZ Utils - The next generation of LZMA Utils

+ http://tukaani.org/xz/

+ [RFC-1952]

+ GZIP file format specification version 4.3

+ http://www.ietf.org/rfc/rfc1952.txt

+ - Notation of byte boxes in section "2.1. Overall conventions"

+ [RFC-2119]

+ Key words for use in RFCs to Indicate Requirement Levels

+ http://www.ietf.org/rfc/rfc2119.txt

+ [GNU-tar]

+ GNU tar 1.21 manual

+ http://www.gnu.org/software/tar/manual/html_node/Blocking-Factor.html

+ - Node 9.4.2 "Blocking Factor", paragraph that begins

+ "gzip will complain about trailing garbage"

+ - Note that this URL points to the latest version of the

+ manual, and may some day not contain the note which is in

+ 1.21. For the exact version of the manual, download GNU

+ tar 1.21: ftp://ftp.gnu.org/pub/gnu/tar/tar-1.21.tar.gz

Property changes on: xz/doc/xz-file-format.txt

___________________________________________________________________

Added: svn:eol-style

+ LF

« no previous file with comments | « xz/doc/lzma-file-format.txt ('k') | xz/dos/Makefile » ('j') | no next file with comments »