Specification of PDIC Format (Version 5.0, BOCU-1 Compression)

PDIC (Personal Dictionary) is dictionary software made by TaN.
It is one of the best-known dictionary programs and data formats in Japan.

A Japanese specification of the PDIC format is available here, but there is no English version, so I have translated it into English for readers all over the world :)




Introduction

This specification was translated by Toshiyuki UMEDA from the Japanese PDF (version 0.8, 2003.4.28). The copyright of the original Japanese specification belongs to TaN. This description covers only the version 5.0 format with BOCU-1 compression.




PDIC Format

PDIC composition

Header part
Fixed size, 256 bytes
Extended Header part
Variable size, up to 4 GB (currently not used)
Index part
Variable size, n (0 to 32767) x 256 bytes
Data part
Variable size; one block is 256 bytes





Header part structure (256 byte)

// byte, ushort and ulong are unsigned 8-, 16- and 32-bit types.
// short is 16 bits and long is 32 bits, so the fields add up to 256 bytes.
struct HEADER {
    char headername[100];      // Dictionary header title
    char dictitle[40];         // Dictionary name
    short version;             // Dictionary version
    short lword;               // Max length of index word
    short ljapa;               // Max length of translation
    short block_size;          // Block size (fixed at 256)
    short index_block;         // Number of blocks in the index part
    short header_size;         // Header size (bytes)
    unsigned short index_size; // Size of index (not used)
    short empty_block;         // Number of the first empty block (-1 if none)
    short nindex;              // Number of index elements (not used)
    short nblock;              // Number of total blocks (not used)
    unsigned long nword;       // Number of records
    byte dicorder;             // Dictionary order
    byte dictype;              // Dictionary type
    byte attrlen;              // Length of attribution
    byte os;                   // OS
    long olenumber;            // Serial number for OLE
    ushort lid_word;           // ID of index word
    ushort lid_japa;           // ID of translation
    ushort lid_exp;            // ID of example
    ushort lid_pron;           // ID of pronunciation code
    ushort lid_other;          // Other ID
    byte index_blkbit;         // 0: 16 bit, 1: 32 bit
    byte dummy0;
    ulong extheader;           // Size of extended header
    long empty_block2;         // Number of the first empty block
    long nindex2;              // Number of index elements
    long nblock2;              // Number of used blocks
    byte reserved[8];
    ulong update_count;        // Update count
    byte dummy00[4];
    byte dicident[8];          // Dictionary identifier
    char dummy[32];            // Dummy
};

Name          Type       Explanation
headername    char[100]  Nothing special
dictitle      char[40]   NULL (BOCU-1 encoded)
version       short      0x500
lword         short      Not used
ljapa         short      Not used
block_size    short      Size of a block: 256
index_block   short      Number of blocks in the index part. One block is 256 bytes.
                         index size = index_block * block_size
header_size   short      Header size: 256
index_size    short      Not used
empty_block   short      Not used (use empty_block2)
nindex        short      Not used (use nindex2)
nblock        short      Not used (use nblock2)
nword         ulong      Number of records
dicorder      byte       Registration order of the index
                         0: Code order
                         1: Upper case and lower case sort in the same order
                         2: Dictionary order
                         3: Descending order
dictype       byte       Attributes of the dictionary
                         0x01: binary compression
                         0x08: BOCU-1 compression
                         0x40: password required to use the dictionary
                         0x80: TreeView dictionary
attrlen       byte       Length of the attribution (always 1)
os            byte       OS
                         0x20: BOCU encoding
olenumber     long       Latest OLE object number
lid_word      ushort     Not used
lid_japa      ushort     Not used
lid_exp       ushort     Not used
lid_pron      ushort     Not used
lid_other     ushort     Not used
index_blkbit  byte       Size of a block number in the index part (0: 16 bit, 1: 32 bit)
dummy0        byte
extheader     ulong      Extended header size (bytes)
empty_block2  long       First empty block number (-1 (0xffffffff) if none)
nindex2       ulong      Number of index elements
nblock2       ulong      Number of used blocks
cypt          8 bytes    Crypt code (declared as reserved[8] in the struct above)
update_count  ulong      Update count (only for LAN)
dummy00       4 bytes    Reserved
dicident      8 bytes    Random 8 bytes
dummy         32 bytes   Dummy
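
To make the header layout concrete, here is a minimal sketch in C that reads the fields version 5.0 actually uses out of a 256-byte buffer. The byte offsets are simply the running sum of the field sizes in the struct above. PdicHeaderInfo, rd16, rd32 and pdic_parse_header are names I made up for this example, and little-endian byte order is my assumption (the specification does not state it).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Subset of the header that version 5.0 uses. Hypothetical type, not part of PDIC. */
typedef struct {
    uint16_t version;       /* offset 140, expected 0x500 */
    uint16_t block_size;    /* offset 146, expected 256   */
    uint16_t index_block;   /* offset 148                 */
    uint16_t header_size;   /* offset 150, expected 256   */
    uint32_t nword;         /* offset 160                 */
    uint8_t  dicorder;      /* offset 164                 */
    uint8_t  dictype;       /* offset 165                 */
    uint8_t  os;            /* offset 167                 */
    uint8_t  index_blkbit;  /* offset 182                 */
    uint32_t extheader;     /* offset 184                 */
    uint32_t nindex2;       /* offset 192                 */
    uint32_t nblock2;       /* offset 196                 */
} PdicHeaderInfo;

/* Read integers byte by byte; little-endian order is an assumption. */
static uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Fill *out from a 256-byte header.
   Returns 0 on success, -1 if the fixed values from the table do not match. */
int pdic_parse_header(const uint8_t buf[256], PdicHeaderInfo *out)
{
    assert(buf != NULL && out != NULL);
    out->version      = rd16(buf + 140);
    out->block_size   = rd16(buf + 146);
    out->index_block  = rd16(buf + 148);
    out->header_size  = rd16(buf + 150);
    out->nword        = rd32(buf + 160);
    out->dicorder     = buf[164];
    out->dictype      = buf[165];
    out->os           = buf[167];
    out->index_blkbit = buf[182];
    out->extheader    = rd32(buf + 184);
    out->nindex2      = rd32(buf + 192);
    out->nblock2      = rd32(buf + 196);
    if (out->version != 0x500 || out->block_size != 256 ||
        out->header_size != 256)
        return -1;
    return 0;
}
```

Reading by explicit offset avoids depending on how a particular compiler sizes long or pads the struct.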




Index Part

Name                 Size             Explanation
Block number         2 or 4 bytes     Size depends on index_blkbit (0: 2 bytes, 1: 4 bytes)
First word of block  Variable         NULL-terminated.
                                      This word can be compressed with BOCU-1
Block termination    4 bytes or more  All 0

File position of the first word = header size (256) + extended header size (0) + block_size x index_block + block_size x (block number)
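
The index entry layout and the position formula above can be sketched in C as follows. This is only an illustration: the function names are hypothetical, the little-endian byte order is my assumption, and placing the index part directly after the header and extended header is inferred from the composition list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* File offset of the index part (assumed to follow the two headers). */
uint64_t pdic_index_offset(uint32_t header_size, uint32_t extheader)
{
    return (uint64_t)header_size + extheader;
}

/* File offset of data block `block_number`, following the formula above. */
uint64_t pdic_data_block_offset(uint32_t header_size, uint32_t extheader,
                                uint32_t block_size, uint32_t index_block,
                                uint32_t block_number)
{
    assert(block_size > 0);
    return (uint64_t)header_size + extheader
         + (uint64_t)block_size * index_block
         + (uint64_t)block_size * block_number;
}

/* Parse one index entry starting at p.
   Returns the number of bytes consumed, or 0 if the entry is truncated.
   *word points at the NULL-terminated (still BOCU-1 encoded) first word. */
size_t pdic_read_index_entry(const uint8_t *p, size_t avail, int index_blkbit,
                             uint32_t *block_number, const uint8_t **word)
{
    size_t width = index_blkbit ? 4 : 2;   /* 0: 16 bit, 1: 32 bit */
    size_t i;

    assert(p != NULL && block_number != NULL && word != NULL);
    if (avail < width)
        return 0;
    *block_number = (uint32_t)p[0] | ((uint32_t)p[1] << 8);
    if (width == 4)
        *block_number |= ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    for (i = width; i < avail; i++) {
        if (p[i] == 0) {                   /* terminator of the first word */
            *word = p + width;
            return i + 1;
        }
    }
    return 0;                              /* no terminator in the buffer */
}
```

With the default sizes and a 3-block index, data block 0 starts at 256 + 0 + 256 x 3 = 1024.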






Data Part

Name                   Size          Explanation
Number of data blocks  2 bytes       If it is 0, the block is empty
Field Data             Variable      One field has one index word
Termination            2 or 4 bytes  The size depends on the length of the field data


Field Data (ordinary type)
Name             Size      Explanation
Field Length     2 bytes   Size from the index word to the end of the translation.
                           It does not include Field Length, Compress Length or Attribution.
                           This size depends on the top bit of "Number of data block" in the Data part.
Compress Length  1 byte    Compression length of the index word, in bytes.
Attribution      1 byte
Index word       Variable  NULL-terminated. BOCU-1 compressed.
Translation      Variable  Not NULL-terminated. BOCU-1 compressed.
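
As an illustration of the ordinary field layout, here is a minimal sketch in C that splits one field into its parts. It assumes the 2-byte form of Field Length in little-endian order (both are my assumptions), and PdicField and pdic_read_field are hypothetical names. The word and translation are returned as raw bytes, so neighbour expansion and BOCU-1 decoding still have to be done afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One ordinary field, pointing into the caller's buffer. Hypothetical type. */
typedef struct {
    uint8_t        compress_len;     /* Compress Length byte            */
    uint8_t        attribution;      /* Attribution byte                */
    const uint8_t *word;             /* stored index word, NULL-terminated */
    size_t         word_len;         /* length without the terminator   */
    const uint8_t *translation;      /* not NULL-terminated             */
    size_t         translation_len;
} PdicField;

/* Parse one field at p. Returns the bytes consumed (4 + Field Length),
   or 0 if the field is truncated or has no terminated index word. */
size_t pdic_read_field(const uint8_t *p, size_t avail, PdicField *out)
{
    size_t field_len, i;

    assert(p != NULL && out != NULL);
    if (avail < 4)
        return 0;
    field_len = (size_t)p[0] | ((size_t)p[1] << 8);
    if (avail - 4 < field_len)
        return 0;
    out->compress_len = p[2];
    out->attribution  = p[3];
    out->word         = p + 4;

    /* Field Length counts from the index word to the end of the translation. */
    for (i = 0; i < field_len; i++) {
        if (out->word[i] == 0)
            break;
    }
    if (i == field_len)
        return 0;                    /* index word terminator not found */
    out->word_len        = i;
    out->translation     = out->word + i + 1;
    out->translation_len = field_len - i - 1;
    return 4 + field_len;
}
```

The return value is the distance to the next field, so a caller can step through a data block by adding it to its read position.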


The index word is compressed as follows.

Example of index compression (neighbour compression)

     Index  Compression Length  Index word (after compression)
1st  ABC    0                   ABC
2nd  ABDEF  2                   DEF
3rd  ABDGD  3                   GD
4th  AGDE   1                   GDE

Normally an index word compresses well against its neighbouring word.
The index word is compressed with both the neighbour compression shown above and BOCU-1.
To read it, first reconstruct the index word from the neighbour compression, and then decompress it with BOCU-1.
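
The reconstruction step can be sketched in C like this. It keeps the first Compression Length bytes of the previous word and appends the stored part. pdic_expand_word is a hypothetical name, and the example uses the plain letters from the table; in a real file the bytes are BOCU-1 encoded and are decoded only after this step.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* On entry `word` holds the previous full index word (or "" for the first).
   On success it holds the reconstructed word. Returns 0 on success, -1 if
   compress_len exceeds the previous word or the result does not fit in cap. */
int pdic_expand_word(char *word, size_t cap, size_t compress_len,
                     const char *suffix)
{
    size_t suffix_len;

    assert(word != NULL && suffix != NULL);
    suffix_len = strlen(suffix);
    if (compress_len > strlen(word))
        return -1;
    if (compress_len + suffix_len + 1 > cap)
        return -1;
    /* Keep the shared prefix, overwrite the rest (terminator included). */
    memcpy(word + compress_len, suffix, suffix_len + 1);
    return 0;
}
```

Calling it with (0, "ABC"), (2, "DEF"), (3, "GD") and (1, "GDE") in turn on the same buffer rebuilds ABC, ABDEF, ABDGD and AGDE from the table.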

Attribution
Value  Explanation
0x80   obligation (/Termination)
0x10   Extended flag
0x20   Important word for study
0x40   Changed word
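
For illustration, the Attribution byte can be tested against these values as bit masks. Treating them as combinable flags is my assumption, and the names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values from the Attribution table; the names are made up for this sketch. */
enum {
    PDIC_ATTR_EXTENDED   = 0x10,  /* Extended flag             */
    PDIC_ATTR_IMPORTANT  = 0x20,  /* Important word for study  */
    PDIC_ATTR_CHANGED    = 0x40,  /* Changed word              */
    PDIC_ATTR_OBLIGATION = 0x80   /* obligation (/Termination) */
};

typedef struct {
    int extended;
    int important;
    int changed;
    int obligation;
} PdicAttr;

/* Split an Attribution byte into separate yes/no values (1 or 0). */
void pdic_decode_attribution(uint8_t attribution, PdicAttr *out)
{
    assert(out != NULL);
    out->extended   = (attribution & PDIC_ATTR_EXTENDED)   != 0;
    out->important  = (attribution & PDIC_ATTR_IMPORTANT)  != 0;
    out->changed    = (attribution & PDIC_ATTR_CHANGED)    != 0;
    out->obligation = (attribution & PDIC_ATTR_OBLIGATION) != 0;
}
```

A reader would check the extended member first, because it decides which of the two Field Data layouts applies.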

If the Attribution contains the Extended flag, the Field Data is as follows.

Field Data (extended type)
Name                   Size      Explanation
Field Length           2 bytes   Size from the index word to the end of the translation.
                                 It does not include Field Length, Compression Length or Attribution.
                                 This size depends on the top bit of "Number of data block" in the Data part.
Length of compression  1 byte    Compression length of the index word
Attribution            1 byte
Translation            Variable  NULL-terminated
Extended Attribution   1 byte
Data                   Variable
Extended Attribution   1 byte    Any number of Extended Attribution and Data pairs can be stored.
Data                   1 byte
Termination            1 byte    0x80


Extended Attribution
Value        Explanation
0x01         Example
0x02         Pronunciation
0x03         not used
0x04         link data
0x05 - 0x0f  not used
0x10         binary data
0x20         not yet
0x40         compression flag
0x80         Termination
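
A minimal sketch in C of how an Extended Attribution byte might be taken apart. I am assuming here that the low four bits (0x01 to 0x0f) select the kind of data and that 0x10, 0x40 and 0x80 are independent flags on top of it; the table does not say this explicitly. PdicExtAttr and pdic_decode_ext_attribution are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Kinds from the table (low four bits, by assumption). */
enum {
    PDIC_EXT_EXAMPLE       = 0x01,
    PDIC_EXT_PRONUNCIATION = 0x02,
    PDIC_EXT_LINK          = 0x04
};

typedef struct {
    unsigned kind;            /* value & 0x0f                */
    int      is_binary;       /* 0x10: binary data           */
    int      is_compressed;   /* 0x40: compression flag      */
    int      is_termination;  /* 0x80: end of extended items */
} PdicExtAttr;

/* Split one Extended Attribution byte into its kind and flags. */
void pdic_decode_ext_attribution(uint8_t value, PdicExtAttr *out)
{
    assert(out != NULL);
    out->kind           = value & 0x0fu;
    out->is_binary      = (value & 0x10) != 0;
    out->is_compressed  = (value & 0x40) != 0;
    out->is_termination = (value & 0x80) != 0;
}
```

This only classifies the byte. It does not read the Data that follows, because the layout of the Data field is not described in this document.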


The description of the Data field is omitted. If you need it, please let me know by e-mail.




E-mail umeda@tele.ucl.ac.be
http://www.tele.ucl.ac.be/MEMBERS/Umeda_Toshiyuki_e.html(CV)
TEL: +32 10 47 80 74 (Office/Belgium)