Notes from reading the gzip-1.2.4 source (1)

王朝other · Author unknown · 2006-01-08


(The project was cancelled halfway through, so these notes are all there is.)

Partial translation of the algorithm.doc file

2 bytes  magic header: 0x1f, 0x8b (\037 \213)
1 byte   compression method (0..7 reserved, 8 = deflate)
1 byte   flags:
           bit 0 set: file is probably ASCII text
           bit 1 set: continuation of a multi-part gzip file
           bit 2 set: extra field present
           bit 3 set: original file name present
           bit 4 set: file comment present
           bit 5 set: file is encrypted
           bit 6,7:   reserved
4 bytes  file modification time (Unix time)
1 byte   extra flags, which depend on the compression method:
           2: maximum compression, slowest algorithm
           4: fastest algorithm
1 byte   operating system on which compression took place:
           0 - FAT filesystem (MS-DOS, OS/2, NT/Win32)
           1 - Amiga
           2 - VMS (or OpenVMS)
           3 - Unix
           4 - VM/CMS
           5 - Atari TOS
           6 - HPFS filesystem (OS/2, NT)
           7 - Macintosh
           8 - Z-System
           9 - CP/M
           10 - TOPS-20
           11 - NTFS filesystem (NT)
           12 - QDOS
           13 - Acorn RISCOS
           255 - unknown
2 bytes  optional part number (second part=1)
2 bytes  optional extra field length
? bytes  optional extra field
? bytes  optional original file name, zero terminated
? bytes  optional file comment, zero terminated
         (not interpreted; meant for human readers)
12 bytes optional encryption header
? bytes  compressed data
4 bytes  crc32 of the uncompressed data
4 bytes  uncompressed input size modulo 2^32
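To make the layout above concrete, here is a minimal sketch in Python that decodes the 10 fixed bytes at the start of a gzip member. The function name `parse_fixed_header` and the flag labels in `FLAG_NAMES` are my own choices for illustration, not identifiers from the gzip source.

```python
import struct

# Labels for the flag bits listed above (names are illustrative only).
FLAG_NAMES = {
    0: "ascii_text",
    1: "continuation",
    2: "extra_field",
    3: "orig_name",
    4: "comment",
    5: "encrypted",
}

def parse_fixed_header(data: bytes) -> dict:
    """Decode the 10 fixed bytes that start a gzip member."""
    if len(data) < 10:
        raise ValueError("need at least 10 bytes")
    # '<' selects little-endian with no padding:
    # 2 magic bytes, method, flags, 4-byte mtime, extra flags, OS code.
    id1, id2, method, flags, mtime, xflags, os_code = struct.unpack(
        "<BBBBIBB", data[:10]
    )
    if (id1, id2) != (0x1F, 0x8B):
        raise ValueError("not a gzip file")
    return {
        "method": method,
        "flags": {name for bit, name in FLAG_NAMES.items() if flags & (1 << bit)},
        "mtime": mtime,          # a value of 0 means "no time stamp"
        "extra_flags": xflags,
        "os": os_code,
    }
```

The optional fields that follow the fixed part can only be located after the flags byte has been examined, which is why this sketch stops at byte 10.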

The format was designed to allow single-pass compression without any
backward seek, and without a priori knowledge of the uncompressed
input size or the available size on the output media. If input does
not come from a regular disk file, the file modification time is set
to the time at which compression started.
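The single-pass property follows from putting the crc32 and the size at the end of the member rather than in the header. The sketch below illustrates this with Python's `zlib` module; it is not how gzip 1.2.4 itself is written, and the name `gzip_stream` and the choice of OS code 255 are assumptions made for the example.

```python
import struct
import time
import zlib

def gzip_stream(chunks, mtime=None):
    """Yield the bytes of one gzip member, reading `chunks` exactly once."""
    if mtime is None:
        # Input is not a regular disk file: use the compression start time.
        mtime = int(time.time())
    # Fixed header: magic, method 8 (deflate), no flags, mtime,
    # extra flags 0, OS 255 (unknown).
    yield struct.pack("<BBBBIBB", 0x1F, 0x8B, 8, 0, mtime, 0, 255)
    # Negative wbits asks zlib for a raw deflate stream (no zlib wrapper).
    deflater = zlib.compressobj(6, zlib.DEFLATED, -15)
    crc = 0
    size = 0
    for chunk in chunks:
        crc = zlib.crc32(chunk, crc)   # running crc32 of the uncompressed data
        size += len(chunk)
        out = deflater.compress(chunk)
        if out:
            yield out
    yield deflater.flush()
    # The trailer is only known once all input has been seen.
    yield struct.pack("<II", crc, size & 0xFFFFFFFF)
```

Because nothing already written ever needs to be patched, the output can go to a pipe or a tape just as well as to a seekable file.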

The time stamp is useful mainly when a gzip file is transferred over
a network. In this case it would not help to keep ownership
attributes. In the local case, the ownership attributes are preserved
by gzip when compressing/decompressing the file. A time stamp of zero
is ignored.

Bit 0 in the flags is only an optional indication, which can be set by
a small lookahead in the input data. In case of doubt, the flag is
cleared, indicating binary data. For systems which have different
file formats for ASCII text and binary data, the decompressor can
use the flag to choose the appropriate format.
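One way to picture the "small lookahead" is the sketch below. It is a deliberately simple heuristic of my own, not the test gzip 1.2.4 actually performs: the 512-byte window, the set of accepted bytes, and the function names are all assumptions.

```python
# Printable ASCII plus tab, line feed and carriage return.
TEXT_BYTES = frozenset(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}

def probably_ascii_text(data: bytes, lookahead: int = 512) -> bool:
    """Guess from the first `lookahead` bytes whether the input is ASCII text."""
    sample = data[:lookahead]
    if not sample:
        return False  # nothing to go on: in case of doubt, report binary
    return all(b in TEXT_BYTES for b in sample)

def flags_byte(data: bytes) -> int:
    """Return a flags byte with only bit 0 decided; other bits stay clear."""
    return 0x01 if probably_ascii_text(data) else 0x00
```

A single unexpected byte in the window is enough to clear the flag here, which matches the rule that doubt resolves to binary.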

The extra field, if present, must consist of one or more subfields,
each with the following format:

subfield id   : 2 bytes
subfield size : 2 bytes (little-endian format)
subfield data

The subfield id can consist of two letters with some mnemonic value.
Please send any such id to jloup@chorus.fr. Ids with a zero second
byte are reserved for future use. The following ids are defined:

Ap (0x41, 0x70) : Apollo file type information

The subfield size is the size of the subfield data and does not
include the id and the size itself. The field 'extra field length' is
the total size of the extra field, including subfield ids and sizes.
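The relation between the two lengths is easy to get wrong, so here is a small sketch that walks the subfields and also builds an extra field. The function names are hypothetical, and the 3-byte payload used in the usage note is made up; the actual content of an Apollo subfield is not described in these notes.

```python
import struct

def parse_extra_field(extra: bytes) -> list:
    """Split an extra field (without its 2-byte length prefix) into (id, data) pairs."""
    subfields = []
    pos = 0
    while pos < len(extra):
        if pos + 4 > len(extra):
            raise ValueError("truncated subfield header")
        sub_id = extra[pos:pos + 2]
        # Subfield size is little-endian and counts only the data bytes.
        (size,) = struct.unpack_from("<H", extra, pos + 2)
        start = pos + 4
        end = start + size
        if end > len(extra):
            raise ValueError("subfield runs past the end of the extra field")
        subfields.append((sub_id, extra[start:end]))
        pos = end
    return subfields

def build_extra_field(subfields) -> bytes:
    """Serialize (id, data) pairs, prefixed with the 'extra field length'."""
    body = b"".join(
        sub_id + struct.pack("<H", len(data)) + data for sub_id, data in subfields
    )
    # The prefix counts everything: ids, sizes and data.
    return struct.pack("<H", len(body)) + body
```

For example, `build_extra_field([(b"Ap", b"\x01\x02\x03")])` produces a length prefix of 7 (2 id bytes, 2 size bytes, 3 data bytes), while the subfield size stored inside is 3.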

It must be possible to detect the end of the compressed data with any
compression format, regardless of the actual size of the compressed
data. If the compressed data cannot fit in one file (in particular for
diskettes), each part starts with a header as described above, but
only the last part has the crc32 and uncompressed size. A decompressor
may prompt for additional data for multipart compressed files. It is
desirable but not mandatory that multiple parts be extractable
independently so that partial data can be recovered if one of the
parts is damaged. This is possible only if no compression state is
kept from one part to the other. The compression-type dependent flags
can indicate this.

If the file being compressed is on a file system with case insensitive
names, the original name field must be forced to lower case. There is
no original file name if the data was compressed from standard input.
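The sketch below shows how a reader could locate the original name, following the field order given in these notes (part number, then extra field, then name). The function names and the Latin-1 encoding are assumptions for the example. Note also that files written under the later RFC 1952 give bit 1 a different meaning, so this is not a general-purpose parser.

```python
import struct

def original_name(data: bytes):
    """Return the stored original file name, or None if the header has none."""
    if data[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip file")
    flags = data[3]
    pos = 10  # first byte after the fixed header
    if flags & 0x02:  # continuation: skip the 2-byte part number
        pos += 2
    if flags & 0x04:  # extra field: 2-byte length, then that many bytes
        (xlen,) = struct.unpack_from("<H", data, pos)
        pos += 2 + xlen
    if not flags & 0x08:
        return None  # e.g. the data was compressed from standard input
    end = data.index(b"\x00", pos)  # the name is zero terminated
    return data[pos:end].decode("latin-1")

def stored_name(file_name: str, case_insensitive_fs: bool) -> bytes:
    """Encode a name for the header, lower-casing it when the rule above applies."""
    name = file_name.lower() if case_insensitive_fs else file_name
    return name.encode("latin-1") + b"\x00"
```

The caller has to say whether the file system is case insensitive; this sketch makes no attempt to detect that.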

Compression is always performed, even if the compressed file is
slightly larger than the original. The worst case expansion is
a few bytes for the gzip file header, plus 5 bytes every 32K block,
or an expansion ratio of 0.015% for large files. Note that the actual
number of used disk blocks almost never increases.
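The 0.015% figure is simply 5 bytes divided by 32768 bytes. The sketch below spells out the arithmetic. The fixed overhead of 18 bytes (the 10-byte fixed header plus the 8-byte crc32 and size trailer, with no optional fields) is my reading of "a few bytes", and the function names are invented for the example.

```python
import math

BLOCK_SIZE = 32 * 1024     # one 32K block
PER_BLOCK_OVERHEAD = 5     # worst-case bytes added per block
FIXED_OVERHEAD = 10 + 8    # assumed: fixed header + crc32 + size, no optional fields

def worst_case_size(n: int) -> int:
    """Upper bound on the gzip output size for n input bytes, under the assumptions above."""
    blocks = max(1, math.ceil(n / BLOCK_SIZE))
    return n + FIXED_OVERHEAD + PER_BLOCK_OVERHEAD * blocks

def expansion_percent(n: int) -> float:
    """Worst-case growth as a percentage of the input size (n must be > 0)."""
    return 100.0 * (worst_case_size(n) - n) / n
```

For large inputs the fixed overhead becomes negligible and `expansion_percent` approaches 100 * 5 / 32768, which is about 0.015.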

The encryption is that of zip 1.9. For the encryption check, the
last byte of the decoded encryption header must be zero. The time
stamp of an encrypted file might be set to zero to avoid giving a clue
about the construction of the random header.