增刊第十八期

EBPIG	6探索杂志6	MHJDQ

	知识共享J资源共享J资料共享

【发行时间】2000-11-07

【期刊号码】增刊十八

【网站地址】http://programhunter.com

此杂志由程式猎人编辑、制作及发行；杂志可以自由转载、分发和传播；任何个人或团体不得在未经本人授权的情况下修改杂志的外观及内容；杂志的解释权归程式猎人所有。

【编辑寄语】

    
   {~._.~} 
    ( Y )  
   ()~*~() 
   (_)-(_)

�这期杂志没有什么可以说的，就是延续上周的增刊的内容。所以今天不写什么内容了，大家好好看一看杂志的内容就可以了。

【目 �� 录】

��&破解心得
	J……Part VIII: Burrows - Wheeler - Transformation (BWT)	Joa Koester
��%初学天地
��O问题答疑
��4网站介绍
��,杂志信箱

&【破解心得】

 Part VIII: Burrows - Wheeler - Transformation (BWT)

 By Joa Koester (JoKo2000@HotMail.Com)
This essay is free to be copied and read and spread as long as
you are decent enough to leave my name in it.
If you have corrections, comments, better examples than the ones used
in this essay etc. - drop me a line.

 hello, hello,

well, shave my legs and call me Grandpa if that wasn't a long delay, but i still
have some serious difficulties accessing the Internet recently. I hope that i
can access you more frequently in the near future. And to sweet up the time till
then a little bit here my next essay about crunching methods:

 THE BURROWS - WHEELER - TRANSFORMATION (tataaa :)


As we discussed last time the mechanism of Arithmetic Crunching some of you
asked me about the source (the input of the cruncher) and that sometimes it
would take the cruncher a long time to adjust to the source data. Yes, there is
a problem with this kind of crunching: the source itself should be transformed
to be better crunchable. But how do we transform a data to be better crunchable
with a fast algorithm?

Well, the arithmetic cruncher would love to crunch data like (all hex):

00, 01, 05, 02, 08, c2, 03, 01, c0, 0a, 12, 01, 02....


but how do we transform a sequence like {0a, f2, f2, 3c, 0a, 77, 44, 0a} into
such cruncherfriendly data?

Well, luckily for us, there is an algorithm available for us that does exactly
what we need. This alg is called Move-To-Front (MTF).


The MTF-Algorithm

MTF is - once you had the idea - a pretty simple algorithm. It transforms the
input data into some other data, it is reversible and it is pretty fast. The
basic idea is that we don't output the data itself but the index of the char
in some table (in our case the table has the usual possible 256 entries).
After the output we save the byte, shift the rest below one position up and
MOVE the saved byte TO the FRONT of the table. A simple char-Table will show
you what i mean:


Let's presume the input is {0a, f2, f2, 3c, 0a, 77, 44, 0a}

That's 8 entries. But we have only 5 distinct values. As we want a reference
table, we only need the distinct values, so for the sake of our example
we take a reference table with only 5 entries (which are -  for the example -
filled with the distinct input values as starting values. In a real case the
table would have 256 entries filled with the 256 possible values for start):

Table   Value
[ ]
 0        0a
 1        f2
 2        3c
 3        77
 4        44

 Now let's transform the first byte: 0a

We search our table for the value 0a and we find it at offset 00. And that's our
output: 00. Now we shift the rest one position upwards and move the 0a at offset
00 to offset 00. Luckily for us the loop terminates immediately and our table
looks now exactly like before:

Table   Value
[ ]
 0        0a
 1        f2
 2        3c
 3        77
 4        44


And now the second byte: f2
f2 is found at position 01 and this is our output: 01. Now we save the f2 at
position 01, move all other entries below the position 01 one up and move the
saved f2 to position 00. Let's watch the manipulation of the table in slow
motion:

Table   Value                           Table   Value                   Table   Value
[ ]                                     [ ]                             [ ]
 0        0a      f2 is saved. Copy      0        0a    insert the   ->  0        f2
 1        f2  <-  the entries below  ->  1        0a    saved f2 at      1        0a
 2        3c      one position up:       2        3c    offset 00:       2        3c
 3        77                             3        77                     3        77
 4        44                             4        44                     4        44

The next byte is again: f2
As we find it at position 00 our output is 00. Our moving loop terminates
immediately and now let's have a look at our table and our output so far:

Table   Value
[ ]
 0        f2
 1        0a
 2        3c
 3        77
 4        44

Output so far: 00, 01, 00


Oh, ah, fascinating. Our input has surely been transformed into something more
likely to be crunched by an Arithmetic Cruncher system.
Let's continue with the next bytes:

Inputbyte is 3c
3c is found at offset 02. So we save the 3c at offset 02, move the rest below
position 02 one up and move 3c to offset 00:

Table   Value                           Table   Value                   Table   Value
[ ]                                     [ ]                             [ ]
 0        f2      3c is saved. Copy      0        f2    insert the   ->  0        3c
 1        0a      the entries below      1        f2    saved f2 at      1        f2
 2        3c  <-  one position up:   ->  2        0a    offset 00:       2        0a
 3        77                             4        77                     4        77
 4        44                             5        44                     5        44

Output so far: 00, 01, 00, 02

The next input byte is 0a.
It is found at offset 02 which is our output. Now we move the 0a at position 02 to the front:

Table   Value                           Table   Value                   Table   Value
[ ]                                     [ ]                             [ ]
 0        3c      0a is saved. Copy      0        3c    insert the   ->  0        0a
 1        f2      the entries below      1        3c    saved 0a at      1        3c
 2        0a  <-  one position up:   ->  2        f2    offset 00:       2        f2
 3        77                             4        77                     4        77
 4        44                             5        44                     5        44

Output so far: 00, 01, 00, 02, 02


Now for the next input byte: 77. 77 is found at offset 03 which is output. This
is how our table is altered:

Table   Value                           Table   Value                   Table   Value
[ ]                                     [ ]                             [ ]
 0        0a      77 is saved. Copy      0        0a    insert the   ->  0        77
 1        3c      the entries below      1        0a    saved 77 at      1        0a
 2        f2      one position up:       2        3c    offset 00:       2        3c
 3        77  <-                     ->  4        f2                     4        f2
 4        44                             5        44                     5        44

Output so far: 00, 01, 00, 02, 02, 03


After the next input byte (44) our output is 04 and our table looks like this:

Table   Value
[ ]
 0        44
 1        77
 2        0a
 4        3c
 5        f2

Output so far: 00, 01, 00, 02, 02, 03, 04


After the final input byte (0a) our output is 02 and our table looks like this:

Table   Value
[ ]
 0        0a
 1        44
 2        77
 4        3c
 5        f2


Output so far: 00, 01, 00, 02, 02, 03, 04, 02


As you can see these output bytes are far more crunchable than the original
bytes. But is it really reversible? Let's retransform the first few bytes
together. At first we build the basic transformation table which is the same as
the cruncher uses (remember how cruncher and decruncher have to use the same
model?):

Table   Value
[ ]
 0        0a
 1        f2
 2        3c
 3        77
 4        44

Again, in a real version the table would be filled with the standard 256
possible values of the ASCII-TABLE.

Now let's retransform the first byte: 00

We look for the 00th byte in our table, find a 0a and output it. Then we take
the 00 as starting point for the shifting and moving and proceed as we already
know from the first pass. As the loop terminates immediately we can continue
with the next input byte.

The next input byte is a 01. Now, at offset 01 we find a f2, output it and
proceed: Save the char at offset 01, shift all entries below offset 01 one
position up and move the saved byte into position 00. After that our table looks
like this:

Table   Value
[ ]
 0        f2
 1        0a
 2        3c
 3        77
 4        44

As you can already see the table looks exactly like the table when we had our
first pass and transformed the char f2. If we would continue to scan thru the
rest of the input (the former output) we would see that the transformation
table would be exactly like the one in the first pass.


So, let's recapitulate: the MTF-Algorithm is fast, easy to implement and a good
preprocessor for Huffman or (even better) Arithmetic Crunching. It's also a nice
basic idea for crypting data.


Yes, but, ehm...


Yes, Watson?


Well, i thought, that this essay was about the Burrows-Wheeler-Transformation,
not about the MTF-algorithm.


You are right. But the authors of the BWT, Michael Burrows and David Wheeler
recommended that data should be crunched in such an order:

Data -> (maybe a simple RLE cruncher ->) BWT -> MTF -> Huffman or Arithmetic Cruncher


Ah, i see.


Yes, and now as the basic premises are all clear let's examine the BWT.

What is the BWT? Well, a short answer would be: an alg that sorts data, saves a
certain aspekt of the sorted data, saves a starting index and that's it.

What? No compression?

No, Watson, no compression. The transformation is just what it is: A
transformation. But think about it for a moment. A reversible transformation has
a big advantage for designers: You can inspect and/or manipulate the transformed
data in another way than you can with the original data. Take the MTF for
example. It transforms the input data into some other data. But this other data
is in most cases much better crunchable than the original data. And as the
process is completely reversible you don't have any disadvantage. You could
write a special cruncher sensitive to the output of MTF processes with which you
could achieve better results than achieved with normal data. You could say that
this kind of transforming data, manipulating that other data and retransforming
the data to the original kind is some form of indirection. Indirection is very
crucial for programmers (and philosophers); it enables you to solve and 
discuss problems on a much more comfortable level than the original problem 
stated.

Keep this in your mind: if you have some problems, viewing the problem from
a differenct angle gives you much more possibilities than trying to solve the 
problem with brute force. But back to the BWT.

So, the BWT is just some kind of indirection. And it has to be reversible. But
how does this work together with sorting?

Well, the BWT algorithm gives you the possibility to unsort your data...

Yes, you read right. With BWT you can unsort the saved data you constructed by
BWT. OK, the data you save itself is not sorted, but is has some interesting
features:

1. It has in some way the FOLLOWING context of the original data.
2. It enables you to unsort the data and rebuild the original data.
3. The bigger the chunks of data you sort, the better your results for the
   crunching stage.

So, speedwise it's a matter of how fast your sorting algorithm is.
crunchwise it's a matter of how much memory you have.


So, for example let's transform the word CRACKER.

To do this we need an array to all possible shifted forms of the word.
If you take the word CRACKER, it has 7 letters. So we can shift it
7 times, from shift 0 to shift 6. The result is something like this:

CRACKER
RACKERC
ACKERCR
CKERCRA
KERCRAC
ERCRACK
RCRACKE

How do arrange this? Well, one easy way would be to duplicate the source data
in RAM and just install enough charpointers (7 in that case).

    <  RAM-DATA  >
    
    CRACKERCRACKER
S0  ^^^^^^^
S1   ^^^^^^^
S2    ^^^^^^^
S3     ^^^^^^^
S4      ^^^^^^^
S5       ^^^^^^^
S6        ^^^^^^^

When we later would sort the data we would have a very simple compare logic.
Another way would be just 7 charpointers and the compare part of the sort logic 
would have to wrap around if it reached the end of the block:

   
   
S0 CRACKER
S1 RACKERC
S2 ACKERCR
S3 CKERCRA
S4 KERCRAC
S5 ERCRACK
S6 RCRACKE

I assume the latter for now. But it's a matter of taste, nothing else.


However, what we do next is to sort the data. As pointers we take our 7
charpointer and we should only swap the string pointers as swapping some 100
KByte could take a looong time if done some 100 times :). Again: if you sort 
that way i do, your compare function has to recognize the end of the block to 
wrap around. Otherwise your sorting will fail.

(You should save the pointer of S1, as you will need it later.)

So, after sorting our new set of strings looks like this:

   
S2 ACKERCR
S3 CKERCRA
S0 CRACKER
S5 ERCRACK
S4 KERCRAC
S1 RACKERC
S6 RCRACKE

Now, what next?


Ok, let's take a closer look to the first and last column of the strings:

   <   block   >
S2 A   CKERC   R
S3 C   KERCR   A
S0 C   RACKE   R
S5 E   RCRAC   K
S4 K   ERCRA   C
S1 R   ACKER   C
S6 R   CRACK   E

The first column ist SORTED. Well, we can expect it to be sorted, as we have
just sorted the data :) And what about the last column? Well, the least we can
say - as we are having a wrap-around look - is

 - that the first and the last column both contain the word (of course) in some 
   more or less strange order and
 - the letter in the last column is the prefix letter of the letter in the first
   column in the same row.

Now, what we save to disk is the LAST column and an index. The index is the
position of the first letter of the original word in the saved data (our last
column).

When we rotate the last column by 90 degrees it looks like: RARKCCE. And the
'C' of CRACKER is at position 5 (zero-based). It is important to take the
correct offset of the correct 'C' as it is the starting point for the 
re-transformation.

There is an easy way to find the right 'C'. There are two 'C's in our little
example. But we need the 'C' which starts the word CRACKER. If you look at the
table, it is the 'C' of the last column of S1. Why? Because S0 is the original 
(shifted 0 times). And S1 is the original shifted once:

S0 C   RACKE   R
S1 R   ACKER   C

That's why i'd save the pointer S1. As we want the first char of the word, but
have to use the last column, we only can take the char in the last colum of S1.
So, all, we have to do is to count thru our charpointers until we reach S1:

       <   block   >
0:  S2 A   CKERC   R
1:  S3 C   KERCR   A
2:  S0 C   RACKE   R
3:  S5 E   RCRAC   K
4:  S4 K   ERCRA   C
5:  S1 R   ACKER   C    <- BINGO
6:  S6 R   CRACK   E

There comes our index: 5. We output it and are done with the first pass of the
transformation.


However, there are two naturally rasing questions:

How do we unsort RARKCCE and how do we rebuild our original data CRACKER?
                      ^


Well, to restore our original data we need some kind of re-transformation 
table. In this table the offsets of the next char would be saved and the only 
thing we would have to do is to follow the jumps in the table (starting with our 
saved index) and output the char found at the given offset in our data. We are 
basically building a sequence table, only that it the given position is also the 
offset in our source data to be output. For our data RARKCCE, how would such 
table look like?

Well, the table would look like this:


Table
[ ]
 1
 4
 5
 6
 3
 0   <- Start here
 2

Now we include an offset, add the source data to the table to the view and watch 
them parallel:

Offset   Table   Data
         [ ]
0         1        R
1         4        A
2         5        R
3         6        K
4         3        C
5         0        C   <- Start here
6         2        E

Let's rebuild the data with this table:

we start at offset 5, our given saved index and output the char 'C'. The entry 
in the table sends us to position 0. There we find an 'R' and output it. The 
table offsets at position 0 says 1, so we goto position 1 and find an 'A' there. 
The offset at position 1 sends us to position 4. There awaits us a 'C'. From 
there we are sent to the position 3 and can fetch a 'K' there. Now we jump to 
offset 6 and can harvest an 'E'. From position 6 we are sent to pos 2, output 
the 'R' from there and are sent to our starting position, which tells us, that 
we are finished. Our original data is revuild.

Astounding, isn't it?

Nice table, all right, but how do we build this table?

Well, here comes the (un)sorting part into the game.

Again, let's watch our first and our last column (last column first):

    L F
0:  R A
1:  A C
2:  R C
3:  K E
4:  C K
5:  C R
6:  E R

Well, if we take the first char in the column (F), we find an 'A'. Where
do we find the first 'A' in column (L)? At offset 01. Hm. And the next
char is a 'C'. Where do we find this in the column (L)? At offset 4.
Let's have a look at the table again after filling in these values:

Offset   Table   Data
         [ ]
0         1*       R
1         4*       A
2         5        R
3         6        K
4         3        C
5         0        C   <- Start here
6         2        E

Wow, the first two offsets are correct. And watch the two columns again:

    L F
0:  R A
1:  A C
2:  R C
3:  K E
4:  C K
5:  C R
6:  E R


You start at the offset 5 (C R) (* = (already) used)

Offset   Table   Data  Data-Sort
         [ ]
0         1        R       A
1         4        A       C
2         5        R       C
3         6        K       E
4         3        C       K
5         0 *      C *     R     <- Start here
6         2        E       R


and jump to the first offset of an 'R' (offset 0) you see a (R A).

Offset   Table   Data  Data-Sort
         [ ]
0         1 *      R *     A     <- ActPos
1         4        A       C
2         5        R       C
3         6        K       E
4         3        C       K
5         0 *      C *     R
6         2        E       R


The first 'A' is at offset 1 (A C).

Offset   Table   Data  Data-Sort
         [ ]
0         1 *      R *     A
1         4 *      A *     C     <- ActPos
2         5        R       C
3         6        K       E
4         3        C       K
5         0 *      C *     R
6         2        E       R

The NEXT AVAILABLE 'C' ist at offset 4 (C K):

Offset   Table   Data  Data-Sort
         [ ]
0         1 *      R *     A
1         4 *      A *     C
2         5        R       C
3         6        K       E
4         3 *      C *     K     <- ActPos
5         0 *      C *     R
6         2        E       R

from where you can reach the first 'K' at offset 3 (K E).

Offset   Table   Data  Data-Sort
         [ ]
0         1 *      R *     A
1         4 *      A *     C
2         5        R       C
3         6 *      K *     E     <- ActPos
4         3 *      C *     K
5         0 *      C *     R
6         2        E       R

The first 'E' is at offset 6 (E R).

Offset   Table   Data  Data-Sort
         [ ]
0         1 *      R *     A
1         4 *      A *     C
2         5        R       C
3         6 *      K *     E
4         3 *      C *     K
5         0 *      C *     R
6         2 *      E *     R     <- ActPos

The NEXT AVAILABLE 'R' for us is at offset 2 (R C).

Offset   Table   Data  Data-Sort
         [ ]
0         1 *      R *     A
1         4 *      A *     C
2         5 *      R *     C     <- ActPos
3         6 *      K *     E
4         3 *      C *     K
5         0 *      C *     R
6         2 *      E *     R

As the next available 'C' is at the starting offset (5) we are finished.

To program a table-building part like this:
An easy way to build the array is to have a sorted copy of the data ready and a 
table buffer big enough. Now you start with the first char in the sorted buffer. 
In our case an 'A'. Now as this is our first run, we simply search in our 
original data for an 'A' and find it at offset 01. As our searchloop is still at 
index 0 we write our offset (the 01) into the table at index [0].

Index    Table   Data    Data-Sort
         [ ]
0         1        R         A
1                  A         C
2                  R         C
3                  K         E
4                  C         K
5                  C         R
6                  E         R

Now we increment our index to 01 and in the sorted data we find a 'C'. Now, 
maybe we already had a 'C' searched before. We could have cached the last char 
and the last char pos and would check this out now. Or we could look one 
position back. Let's assume the latter. When we look one position back, we see 
an 'A' and recognize that we are indeed the first 'C'. So we search the original 
data for the first 'C' to come and find it at offset 04. Our searchloop index is 
01 so we write our offset 04 to the table at index [1].

Index    Table   Data    Data-Sort
         [ ]
0         1        R         A
1         4        A         C
2                  R         C
3                  K         E
4                  C         K
5                  C         R
6                  E         R

Now we increment from 01 to 02 and find again a 'C'. As we look one position 
back we see that we already had searched for a 'C'. We fetch the saved offset 
(04) and start to search from 04 + 1. As we are on position 5 we immediately 
find another 'C' and are done. We write the offset 5 into the table at index [2] 
and go on.

Index    Table   Data    Data-Sort
         [ ]
0         1        R         A
1         4        A   +---- C
2         5        R   ! +-- C
3                  K   ! !   E
4                  C <-+ !   K
5                  C <---+   R
6                  E         R

The 'E' and 'K' are treated like the 'A' (no former search done, start 
searching from the beginning):

Index    Table   Data    Data-Sort
         [ ]
0         1        R         A
1         4        A         C
2         5        R         C
3         6        K         E
4         3        C         K
5                  C         R
6                  E         R

The first and the second 'R' are done like the 'C': 
Are we the first one? 
If so, start to look from the beginning of the original data, 
else get the last known position and start searching from this last position +1.

When we are ready, our table looks like this:

Offset   Table   Data    Data-Sort
         [ ]
0         1        R         A
1         4        A         C
2         5        R         C
3         6        K         E
4         3        C         K
5         0        C         R
6         2        E         R

Yep, that's the table we need to rebuild our really original data from the data 
we saved.

And where do we get the sorted copy of the data? Well, there's no way around it: 
You have to reserve another block of memory, copy the saved data to it and SORT 
the saved data there. As we just transformed the data there is no byte missing.

The inner trick of the BWT is the rebuilding of the original data with the 
intervals between the sorted column of the data and the prefix column (our saved 
data). Neat, eh?


Let's repeat: To restore our original data from the saved data, we arrange a 
second copy of the data, sort it, build the table from it and with the table we 
can restore our original data (starting with the saved starting index).

To transform the data we sort it using pointers, save the last column of the 
sorted data and also save the offset of S1 as starting index.

Either in the net or on my harddisk is a complete set of sources doing the job 
for you. If i am successfull you should find an attachment to this essay 
(and if everser+ allows so :).

If you find the attachment, take your compiler and compile the progs.
The correct sequence for crunching / decrunching is:

Crunch:
SourceProg -> RLE -> BWT -> MTF -> ARI -> CrunchProg

Decrunch:
CrunchProg -> UNARI -> UNMTF -> UNBWT -> UNRLE -> SourceProg.

Anyway, as always i'm open to suggestions, questions etc. Have just a little 
patience, that's all (maybe i should get a private INet-Access anyway :).

Ok,

see you next time

Greetings from a programmer

Joa

%【初学天地】

O【问题答疑】

4【网站介绍】

,【杂志信箱】

投稿信箱：discoveredit@china.com

答疑信箱：discoveranswer@china.com

斑竹信箱：programhunter@china.com