
CHM file format (chm format)

王朝 other · Author unknown  2006-01-09


Microsoft's HTML Help (.chm) format

Preface

This is documentation on the .chm format used by Microsoft HTML Help. This format has been reverse engineered in the past, but as far as I know this is the first freely available documentation on it. One Usenet message indicates that these .chm files are actually IStorage files documented in the Microsoft Platform SDK. However, I have not been able to locate such documentation.

Note

The word "section" is badly overloaded in this document. Sorry about that.

All numbers are in hexadecimal unless otherwise indicated in the text. Except in tabular listings, this will be indicated by $ or 0x as appropriate. All values within the file are Intel byte order (little endian) unless indicated otherwise.

The overall format of a .chm file

The .chm file begins with a short ($38 byte) initial header. This is followed by the header section table and the offset to the content. Collectively, this is the "header".

The header is followed by the header sections. There are two header sections. One header section is the file directory, the other contains the file length and some unknown data. Immediately following the header sections is the content.

The Header

The header starts with the initial header, which has the following format

0000: char[4] 'ITSF'

0004: DWORD 3 (Version number)

0008: DWORD Total header length, including header section table and following data.

000C: DWORD 1 (unknown)

0010: DWORD A timestamp. Considered as a big-endian DWORD, it appears to contain seconds (MSB) and fractional seconds (second byte). The third and fourth bytes may contain even more fractional bits. The 4 least significant bits in the last byte are constant.

0014: DWORD Windows Language ID. The two values I have seen are:

$0409 = LANG_ENGLISH/SUBLANG_ENGLISH_US

$0407 = LANG_GERMAN/SUBLANG_GERMAN

0018: GUID {7C01FD10-7BAA-11D0-9E0C-00A0C922E6EC}

0028: GUID {7C01FD11-7BAA-11D0-9E0C-00A0C922E6EC}

Note: a GUID is $10 bytes, arranged as 1 DWORD, 2 WORDs, and 8 BYTEs.

It is followed by the header section table, which is 2 entries, where each entry is $10 bytes long and has this format:

0000: QWORD Offset of section from beginning of file

0008: QWORD Length of section

Following the header section table are 8 bytes of additional header data. In Version 2 files, this data is absent and the content section starts immediately after the directory.

0000: QWORD Offset within file of content section 0
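The header layout above can be read with a few `struct` calls. The following is a minimal sketch, not a reference implementation: the names `ItsfHeader` and `parse_itsf_header` are made up for illustration, the timestamp is read big-endian as suggested above, and the Version 2 fallback (content begins where header section 1 ends) is an assumption based on the statement that the content starts immediately after the directory.

```python
import struct
import uuid
from typing import List, NamedTuple, Tuple


class ItsfHeader(NamedTuple):  # hypothetical container, not part of the format
    version: int
    header_length: int
    timestamp: int
    language_id: int
    guid1: uuid.UUID
    guid2: uuid.UUID
    sections: List[Tuple[int, int]]  # (offset, length) of header sections 0 and 1
    content_offset: int


def parse_itsf_header(data: bytes) -> ItsfHeader:
    """Parse the initial header, the header section table and the content offset."""
    if data[:4] != b"ITSF":
        raise ValueError("missing 'ITSF' signature")
    version, header_length, _unknown = struct.unpack_from("<3I", data, 0x04)
    # The text describes the timestamp as a big-endian DWORD.
    (timestamp,) = struct.unpack_from(">I", data, 0x10)
    (language_id,) = struct.unpack_from("<I", data, 0x14)
    # bytes_le matches the "1 DWORD, 2 WORDs, 8 BYTEs" GUID layout.
    guid1 = uuid.UUID(bytes_le=bytes(data[0x18:0x28]))
    guid2 = uuid.UUID(bytes_le=bytes(data[0x28:0x38]))
    # Two table entries of $10 bytes each: QWORD offset, QWORD length.
    sections = [struct.unpack_from("<2Q", data, 0x38 + 0x10 * i) for i in range(2)]
    if version >= 3:
        (content_offset,) = struct.unpack_from("<Q", data, 0x58)
    else:
        # Assumption: in Version 2 the content follows header section 1.
        content_offset = sections[1][0] + sections[1][1]
    return ItsfHeader(version, header_length, timestamp, language_id,
                      guid1, guid2, sections, content_offset)
```

Reading the first $60 bytes of a Version 3 file is enough for this function; header section 1 (the directory) then starts at `sections[1][0]`.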

The Header Sections

Header Section 0

This section contains the total size of the file, and not much else

0000: DWORD $01FE (unknown)

0004: DWORD 0 (unknown)

0008: QWORD File Size

0010: DWORD 0 (unknown)

0014: DWORD 0 (unknown)

Header Section 1: The Directory Listing

The central part of the .chm file: A directory of the files and information it contains.

Directory header

The directory starts with a header; its format is as follows:

0000: char[4] 'ITSP'

0004: DWORD Version number 1

0008: DWORD Length of the directory header

000C: DWORD $0a (unknown)

0010: DWORD $1000 Directory chunk size

0014: DWORD "Density" of quickref section, usually 2.

0018: DWORD Depth of the index tree: 1 if there is no index, 2 if there is one level of PMGI chunks.

001C: DWORD Chunk number of root index chunk, -1 if there is none (though at least one file has 0 despite there being no index chunk, probably a bug).

0020: DWORD Chunk number of first PMGL (listing) chunk

0024: DWORD Chunk number of last PMGL (listing) chunk

0028: DWORD -1 (unknown)

002C: DWORD Number of directory chunks (total)

0030: DWORD Windows language ID

0034: GUID {5D02926A-212E-11D0-9DF9-00A0C922E6EC}

0044: DWORD $54 (This is the length again)

0048: DWORD -1 (unknown)

004C: DWORD -1 (unknown)

0050: DWORD -1 (unknown)
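As an illustration of the directory header, here is a small sketch that unpacks the documented fields and locates a directory chunk. `ItspHeader`, `parse_itsp_header` and `chunk_offset` are hypothetical names; the chunk position formula assumes that chunks are stored back to back right after the directory header, which is how the next part of the text describes them.

```python
import struct
from typing import NamedTuple


class ItspHeader(NamedTuple):  # hypothetical container, not part of the format
    version: int
    header_length: int
    chunk_size: int
    density: int
    depth: int
    root_index_chunk: int  # -1 if there is no index chunk
    first_pmgl: int
    last_pmgl: int
    num_chunks: int
    language_id: int


def parse_itsp_header(data: bytes, offset: int = 0) -> ItspHeader:
    """Unpack the twelve DWORDs between the 'ITSP' tag and the GUID at $34."""
    if data[offset:offset + 4] != b"ITSP":
        raise ValueError("missing 'ITSP' signature")
    # Fields documented as possibly -1 are read as signed DWORDs.
    (version, header_length, _unk_0a, chunk_size, density, depth,
     root_index_chunk, first_pmgl, last_pmgl, _unk_m1,
     num_chunks, language_id) = struct.unpack_from("<6I4i2I", data, offset + 0x04)
    return ItspHeader(version, header_length, chunk_size, density, depth,
                      root_index_chunk, first_pmgl, last_pmgl,
                      num_chunks, language_id)


def chunk_offset(dir_offset: int, header: ItspHeader, chunk_number: int) -> int:
    """File offset of a directory chunk, assuming chunks follow the header."""
    return dir_offset + header.header_length + chunk_number * header.chunk_size
```

Here `dir_offset` is the file offset of header section 1 taken from the header section table.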

The Listing Chunks

The header is directly followed by the directory chunks. There are two types of directory chunks -- index chunks, and listing chunks. The index chunk will be omitted if there is only one listing chunk. A listing chunk has the following format:

0000: char[4] 'PMGL'

0004: DWORD Length of free space and/or quickref area at end of directory chunk

0008: DWORD Always 0.

000C: DWORD Chunk number of previous listing chunk when reading directory in sequence (-1 if this is the first listing chunk)

0010: DWORD Chunk number of next listing chunk when reading directory in sequence (-1 if this is the last listing chunk)

0014: Directory listing entries (to quickref area). Sorted by filename; the sort is case-insensitive.

The quickref area is written backwards from the end of the chunk. One quickref entry exists for every n entries in the file, where n is calculated as 1 + (1 << quickref density). So for density = 2, n = 5.

Chunklen-0002: WORD Number of entries in the chunk

Chunklen-0004: WORD Offset of entry n from entry 0

Chunklen-0008: WORD Offset of entry 2n from entry 0

Chunklen-000C: WORD Offset of entry 3n from entry 0

...

The format of a directory listing entry is as follows

ENCINT: length of name

BYTEs: name (UTF-8 encoded)

ENCINT: content section

ENCINT: offset

ENCINT: length

The offset is from the beginning of the content section the file is in, after the section has been decompressed (if appropriate). The length also refers to length of the file in the section after decompression.

There are two kinds of file represented in the directory: user data files and format-related files. Format-related files have names which begin with '::'; user data files have names which begin with "/".
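The listing chunk and entry formats can be walked as in the sketch below. The names `DirEntry`, `read_encint` and `parse_pmgl_chunk` are illustrative only, and the code assumes that the entries end where the free space/quickref area (whose length is the DWORD at 0004) begins. The ENCINT reader follows the encoding described under "Encoded Integers".

```python
import struct
from typing import List, NamedTuple, Tuple


class DirEntry(NamedTuple):  # hypothetical container, not part of the format
    name: str
    section: int
    offset: int
    length: int


def read_encint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Read one ENCINT at pos; return (value, position after it)."""
    value = 0
    while True:
        byte = buf[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:  # high bit clear: last byte
            return value, pos


def parse_pmgl_chunk(chunk: bytes) -> Tuple[int, int, List[DirEntry]]:
    """Return (previous chunk, next chunk, entries) for one listing chunk."""
    if chunk[:4] != b"PMGL":
        raise ValueError("missing 'PMGL' signature")
    free_len, _zero, prev_chunk, next_chunk = struct.unpack_from("<2I2i", chunk, 4)
    end = len(chunk) - free_len  # entries stop where the free/quickref area starts
    pos = 0x14
    entries = []
    while pos < end:
        name_len, pos = read_encint(chunk, pos)
        name = chunk[pos:pos + name_len].decode("utf-8")
        pos += name_len
        section, pos = read_encint(chunk, pos)
        offset, pos = read_encint(chunk, pos)
        length, pos = read_encint(chunk, pos)
        entries.append(DirEntry(name, section, offset, length))
    return prev_chunk, next_chunk, entries
```

A caller can then separate the two kinds of file with `e.name.startswith("::")` for format-related files and `e.name.startswith("/")` for user data.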

The Index Chunk

An index chunk has the following format

0000: char[4] 'PMGI'

0004: DWORD Length of quickref/free area at end of directory chunk

0008: Directory index entries (to quickref/free area)

The quickref area in a PMGI chunk is the same as in a PMGL chunk.

The format of a directory index entry is as follows

ENCINT: length of name

BYTEs: name (UTF-8 encoded)

ENCINT: directory listing chunk which starts with name

When higher-level indexes exist (when the depth of the index tree is 3 or higher), presumably the upper-level indexes will contain the numbers of lower-level index chunks rather than listing chunks.

Encoded Integers

An ENCINT is a variable-length integer. The high bit of each byte indicates "continued to the next byte". Bytes are stored most significant to least significant. So, for example, $EA $15 is (((0xEA&0x7F)<<7)|0x15) = 0x3515.
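A short sketch of this encoding in both directions may help; `decode_encint` and `encode_encint` are names chosen here for illustration, and the encoder is simply the inverse of the decoding rule described above.

```python
from typing import Tuple


def decode_encint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one ENCINT starting at pos; return (value, position after it)."""
    value = 0
    while True:
        byte = buf[pos]
        pos += 1
        # Seven payload bits per byte, most significant group first.
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:  # high bit clear: this was the last byte
            return value, pos


def encode_encint(value: int) -> bytes:
    """Encode a non-negative integer as an ENCINT."""
    if value < 0:
        raise ValueError("ENCINT cannot represent negative numbers")
    groups = [value & 0x7F]  # last byte, high bit clear
    value >>= 7
    while value:
        groups.append((value & 0x7F) | 0x80)  # earlier bytes carry the continue bit
        value >>= 7
    return bytes(reversed(groups))
```

With the example from the text, `decode_encint(bytes([0xEA, 0x15]))` yields the value 0x3515 and the position 2, and `encode_encint(0x3515)` gives back the two bytes $EA $15.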

The Content

In Version 3, the content typically immediately follows the header sections, and is at the location indicated by the QWORD following the header section table. In Version 2, the content immediately follows the header. All content section 0 locations in the directory are relative to that point. The other content sections are stored WITHIN content section 0.

The Namelist file

There exists in content section 0 and in the directory a file called "::DataSpace/NameList". This file contains the names of all the content sections. The format is as follows:

0000: WORD Length of file, in words

0002: WORD Number of entries in file

Each entry:

0000: WORD Length of name in words, excluding terminating null

0002: WORD Double-byte characters

xxxx: WORD 0

Yes, the names have a length word AND are null terminated; sort of a belt-and-suspenders approach. The coding system is likely UTF-16 (little endian).
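The NameList layout is simple enough to read in a few lines. This sketch uses the hypothetical name `parse_namelist` and assumes the names are UTF-16 little endian, which the text above considers likely but does not confirm.

```python
import struct
from typing import List


def parse_namelist(data: bytes) -> List[str]:
    """Return the content section names stored in ::DataSpace/NameList."""
    _file_words, count = struct.unpack_from("<2H", data, 0)
    pos = 4
    names = []
    for _ in range(count):
        (name_words,) = struct.unpack_from("<H", data, pos)
        pos += 2
        raw = data[pos:pos + 2 * name_words]
        names.append(raw.decode("utf-16-le"))  # assumed coding system
        pos += 2 * name_words + 2  # skip the characters and the terminating null
    return names
```

The position of a name in the returned list is the content section number used in the directory entries, so index 0 is the name of section 0.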

The section names seen so far are

Uncompressed

MSCompressed

"Uncompressed" is self-explanatory. The section "MSCompressed" is compressed with Microsoft's LZX algorithm.

The Section Data

For each section other than 0, there exists a file called '::DataSpace/Storage/<Section Name>/Content'. This file contains the compressed data for the section. So, conceptually, getting a file from a nonzero section is a multi-step process. First you must get the content file from section 0. Then you decompress (if appropriate) the section. Then you get the desired file from your decompressed section.
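The multi-step lookup can be sketched as follows. Everything here is illustrative: `extract_file` is a made-up name, `directory` is assumed to be a dict mapping each name to its (section, offset, length) triple as parsed from the listing chunks, and `decompress` is a caller-supplied function, since an LZX decoder is outside the scope of this sketch.

```python
from typing import Callable, Dict, List, Tuple


def extract_file(chm: bytes,
                 content_offset: int,
                 directory: Dict[str, Tuple[int, int, int]],
                 section_names: List[str],
                 name: str,
                 decompress: Callable[[str, bytes], bytes]) -> bytes:
    """Fetch one file, going through the section's Content file if needed."""
    section, offset, length = directory[name]
    if section == 0:
        # Section 0 is stored directly, relative to the content offset.
        start = content_offset + offset
        return chm[start:start + length]

    # Step 1: get the section's Content file, which lives in section 0.
    section_name = section_names[section]
    content_name = "::DataSpace/Storage/%s/Content" % section_name
    c_section, c_offset, c_length = directory[content_name]
    if c_section != 0:
        raise ValueError("Content file is expected to be in section 0")
    start = content_offset + c_offset
    stored = chm[start:start + c_length]

    # Step 2: decompress the section (decoder supplied by the caller).
    plain = decompress(section_name, stored)

    # Step 3: offsets and lengths refer to the decompressed section.
    return plain[offset:offset + length]
```

A real reader would avoid decompressing the whole section for every request, for example by caching the decompressed data or by using the reset table to start near the wanted offset.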

Other section format-related files

There are several other files associated with the sections

::DataSpace/Storage/<SectionName>/ControlData

This file contains $20 bytes of information on the compression. The information is partially known:

0000: DWORD Number of DWORDs following 'LZXC', must be 6 if version is 2

0004: ASCII 'LZXC' Compression type identifier

0008: DWORD Version (Must be <=2)

000C: DWORD The LZX reset interval

0010: DWORD The window size

0014: DWORD The cache size

0018: DWORD 0 (unknown)

Reset interval, window size, and cache size are in bytes if version is 1, $8000-byte blocks if version is 2.
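A sketch of reading ControlData and normalising the sizes to bytes is shown below. `parse_control_data` is a hypothetical name, and deriving the LZX window code as the base-2 logarithm of the window size in bytes is an assumption made for illustration (it gives 16 for a window of two $8000-byte blocks).

```python
import struct
from typing import Dict


def parse_control_data(data: bytes) -> Dict[str, int]:
    """Read the known ControlData fields and convert sizes to bytes."""
    (n_dwords, tag, version,
     reset_interval, window_size, cache_size) = struct.unpack_from("<I4s4I", data, 0)
    if tag != b"LZXC":
        raise ValueError("compression type is not 'LZXC'")
    if version > 2:
        raise ValueError("unsupported LZXC version %d" % version)
    if version == 2 and n_dwords != 6:
        raise ValueError("expected 6 DWORDs after 'LZXC' for version 2")
    # Version 1 stores bytes; version 2 stores counts of $8000-byte blocks.
    scale = 0x8000 if version == 2 else 1
    window_bytes = window_size * scale
    return {
        "version": version,
        "reset_interval_bytes": reset_interval * scale,
        "window_bytes": window_bytes,
        "cache_bytes": cache_size * scale,
        # Assumption: the LZX window code is log2 of the window size in bytes.
        "window_bits": window_bytes.bit_length() - 1,
    }
```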

::DataSpace/Storage/<SectionName>/SpanInfo

This file contains a quadword containing the uncompressed length of the section.

::DataSpace/Storage/<SectionName>/Transform/List

It appears this file was intended to contain a list of GUIDs belonging to methods of decompressing (or otherwise transforming) the section. However, it actually contains only half of the string representation of a GUID, apparently because it was sized for characters but contains wide characters.

Appendix: The Compression

The compressed sections are compressed using LZX, a compression method Microsoft also uses for its cabinet files. To verify this, check the second DWORD of compression info in the ControlData file for the section; it should be 'LZXC'. To decompress, first read the file "::DataSpace/Storage/<SectionName>/Transform/{7FC28940-9D31-11D0-9B27-00A0C91E9C7C}/InstanceData/ResetTable". This reset table has the following format:

0000: DWORD 2 unknown (possibly a version number)

0004: DWORD Number of entries in reset table

0008: DWORD 8 Size of table entry (bytes)

000C: DWORD $28 Length of table header (area before table entries)

0010: QWORD Uncompressed Length

0018: QWORD Compressed Length

0020: QWORD 0x8000 block size for locations below

0028: QWORD 0 (zeroth entry of table)

0030: QWORD location in compressed data of 1st block boundary in uncompressed data

Repeat to end of file
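The reset table can be read as in this sketch. `ResetTable` and `parse_reset_table` are illustrative names; the code trusts the header length field at 000C to find the first entry rather than hard-coding $28.

```python
import struct
from typing import List, NamedTuple


class ResetTable(NamedTuple):  # hypothetical container, not part of the format
    uncompressed_length: int
    compressed_length: int
    block_size: int
    entries: List[int]  # compressed offsets of each block boundary


def parse_reset_table(data: bytes) -> ResetTable:
    """Read the ResetTable header and its QWORD entries."""
    _version, n_entries, entry_size, header_len = struct.unpack_from("<4I", data, 0)
    if entry_size != 8:
        raise ValueError("unexpected reset table entry size %d" % entry_size)
    uncompressed_len, compressed_len, block_size = struct.unpack_from("<3Q", data, 0x10)
    entries = [struct.unpack_from("<Q", data, header_len + 8 * i)[0]
               for i in range(n_entries)]
    return ResetTable(uncompressed_len, compressed_len, block_size, entries)
```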

Now you can finally obtain the section (from its Content file). The window size for the LZX compression is 16 (decimal) on all the files seen so far. This is specified by the DWORD at $10 in the ControlData file (but note that this DWORD gives the window size in 0x8000-byte blocks, not the LZX code for the window size).

The rule that the input bit-stream is to be re-aligned to a 16-bit boundary after $8000 output characters have been processed IS in effect, despite this LZX not being part of a CAB file. The reset table tells you when this was done, though there is no need for that during decompression; you can just keep track of the number of output characters. Furthermore, while this does not appear to be documented in the LZX format, the uncompressed stream is padded to an $8000 byte boundary.

There is one change from LZX as defined by Microsoft: After each LZX reset interval (defined in the ControlData file, but in practice equal to the window size) of compressed data is processed, the LZX state is fully reset, as if an entirely new file was being encoded. This allows semi-random access to the compressed data; you can start reading on any reset interval boundary using the reset interval size and the reset table.
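To make the semi-random access concrete, the sketch below works out where to start decoding for a given uncompressed offset. `locate_reset_point` is a made-up name, and the code assumes that the reset table holds one entry per block of uncompressed data (the block size field at 0020) and that the reset interval is a whole number of such blocks.

```python
from typing import List, Tuple


def locate_reset_point(offset: int,
                       reset_interval_bytes: int,
                       block_size: int,
                       reset_entries: List[int]) -> Tuple[int, int]:
    """Return (compressed start offset, uncompressed bytes to skip after it)."""
    if reset_interval_bytes % block_size:
        raise ValueError("reset interval must be a multiple of the block size")
    blocks_per_interval = reset_interval_bytes // block_size
    # Round down to the reset interval that contains the wanted offset.
    interval_index = offset // reset_interval_bytes
    first_block = interval_index * blocks_per_interval
    compressed_start = reset_entries[first_block]
    # The decoder starts with fresh state here and must discard this many
    # bytes of output before reaching the wanted offset.
    skip = offset - first_block * block_size
    return compressed_start, skip
```

For example, with a reset interval of $10000 bytes and $8000-byte blocks, an uncompressed offset of $1A000 falls in the second interval, so decoding starts at the compressed location stored in table entry 2 and the first $A000 bytes of output are discarded.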

Note:

Earlier versions of this document stated that the reset interval only reset the Huffman tables and required outputting the 1-bit header again. This was erroneous. The Lempel Ziv state is reset as well. In practice, a decoder works just as well with the incorrect assumption, but encoding a file with match positions which refer to a time before the most recent LZX reset causes crashes on decoding.

Acknowledgements

The following people (in no particular order) have submitted information which has helped correct and close the gaps in this document.

Peter Ferrie (peter_ferrie at hotmail.com)

Pabs (pabs at zip.to) who also runs the CHM Spec page.

And others I have not been able to reach.

Copyright 2001-2003 Matthew T. Russotto

You may freely copy and distribute unmodified copies of this file, or copies where the only modification is a change in line endings, padding after the html end tag, coding system, or any combination thereof. The original is in ASCII with Unix line endings.

 
 
 