Exploration Magazine
Knowledge Sharing / Resource Sharing / Material Sharing
[Publication Date] 2000-10-25
[Issue] Supplement No. 16
[Website] http://programhunter.com
[Copyright Notice]
This magazine is edited, produced and published by Program Hunter. It may be freely reposted, distributed and spread. No individual or group may modify its appearance or content without the author's authorization. The right of final interpretation belongs to Program Hunter.

[Editor's Note]

    
   {~._.~} 
    ( Y )  
   ()~*~() 
   (_)-(_) 
I don't know how everyone liked the three articles from last week. Time really flies - it is Wednesday again, so once more I am sending out articles on this topic. This time I will send two of them, because the third one is too long and cannot be sent together with these; the remaining two will follow next week. I suggest you read the English versions first; the Chinese translations will come out a little later - I have already finished translating the first three parts and will definitely send the Chinese versions in the future.
 
[Contents]
Cracking Notes
……Part IV: Leaving Huffman and entering the realms of R'lyeh, eh, RLE
……Part V: Interlude and the Mystery of Lempel-Ziv Part I
Beginners' Corner
Q&A
Website Introduction
Magazine Mailbox
 
[Cracking Notes]
Little essay about the various methods and viewpoints of crunching.

Part IV: Leaving Huffman and entering the realms of R'lyeh, eh, RLE



By Joa Koester (JoKo2000@HotMail.Com) 
This essay is free to be copied and read and spread as long as
you are decent enough to leave my name in it. 
If you have corrections, comments, better examples than the ones used 
in this essay, etc. - drop me a line.


OK, let's dive into the world of crunchers and munchers...


Last time we discussed the possible uses of the Huffman technique and
the possibility to use it in an adaptive way. I would like
to conclude this with the remark that the Huffman way is pretty
useless for 'normal' progs. You achieve about 5% - 10% if you
are lucky. Used on normal texts, containing only characters with
ASCII codes below 128, it will yield at least a 25% ratio. Normally there
will be a ratio of about 30%-35%. But there you have a limited
alphabet (the letters in lower and upper case and the numbers
plus some extra symbols). What we want are algorithms usable
for more general (random) data. One of these algorithms (or
better, one of these principles) is RLE.

What does RLE mean?

RLE, Dr. Watson, stands for Run Length Encoding. Its varying
implementations are mostly used when speed is important.
In the Amiga days there was a nice picture format - IFF.
It had an RLE mode (or was RLE always on? Can't remember anymore...)
where the single lines of a picture were crunched with an RLE
algorithm, leading to a space saving of about 30% to 80% depending
on the picture. Nice, isn't it?

But what is the basic idea of RLE?

Ah, Watson, always so eager, aren't you? Well, your desire will
be satisfied... 

Imagine a file with some data and some zeros:

abcdefg00000000000000000000

(you will see data like this a lot in uncompressed bitmaps)

Now, we want to crunch it. We would like to crunch the RUN of the
zeros. For this we have to tell the decoder the LENGTH  of the run. 
One obvious idea would be:

abcdefg{0x14}0

where {0x14} would be a byte which tells the
decompressor how often it has to copy the following byte,
which would give us 9 bytes instead of 27 bytes: a saving of 66.66%.
Now you know where the name RLE comes from. Believe me, this is
one of the most primitive ways of thinking up a crunch method.
But as with most primitive ways it is extremely fast and
astoundingly stable.
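
To make the count-byte idea concrete, here is a minimal sketch in C (my own
illustration, not code from the essay). Fed with the file above it writes
abcdefg{0x14}0 - the 9 bytes we just counted. It also shares the problem that
is discussed next: the decoder cannot tell a count byte from an ordinary data byte.

#include <stdio.h>

/* naive RLE sketch: a run of equal bytes becomes {count}{byte},
   everything else is copied verbatim (which is exactly the ambiguity
   the text talks about below) */
void naive_rle_encode(const unsigned char *src, int len, FILE *out)
{
    int i = 0;
    while (i < len)
    {
        int run = 1;
        while (i + run < len && src[i + run] == src[i] && run < 255)
            run++;

        if (run > 3)                      /* crunch the run            */
        {
            fputc(run, out);              /* the LENGTH of the run     */
            fputc(src[i], out);           /* the byte that is repeated */
        }
        else                              /* too short: copy verbatim  */
        {
            int t;
            for (t = 0; t < run; t++)
                fputc(src[i], out);
        }
        i += run;
    }
}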

--- Little private note ----
As a programmer you don't win a prize for the most complicated
algorithm you can think of, because you aren't paid (enough) for it
and in most cases there is no time for it (except in
the field of security, where complicated algs are expected.
And yet - a programmer has to debug his own routines, so he won't
use tricks he won't understand a few weeks later). So you don't do it.
You program on a more stable level: KISS. Keep It Simple, Stupid!
If there is an easier way of doing it, choose it. Use preformulated
libraries like the C++ STL or (yuk) MFC or Borland VCL. That's the
way programmers program their applications. Hugh!
--- Little private note end ---

One immediate question of course comes up:
What if the zeros had appeared 0x61 times (the value of 'a')?
The output would have been:
abcdefg{0x61}0 = abcdefga0

How does the decruncher know when to activate a copy loop (crunch) or 
just copy the actual bytes (no-crunch)?

We have to build up a rule that both the coder and the decoder
use without exceptions. The coder has to build the output in a way that
the decoder can decode its input without problems.

So, what we need is a kind of SWITCH. A switch which the encoder sends.
Linked with the decision of implementing the switch is automatically 
the question: WHEN do i crunch? 
Do i crunch when there are 10 equal bytes following in a run? Certainly.
There will be enough ways to tell the decoder that a run is coming up.
Do i crunch when there are 5 equal bytes following in a run? Yep. Same answer.
Do i crunch when there are 4 equal bytes following in a run? Yep. Same answer.
Do i crunch when there are 3 equal bytes following in a run? Hm, let me think of it.
Do i crunch when there are 2 equal bytes following in a run? Eh, (panic) don't know. HELP!

Ok, ok. What you need is a little overview of some ideas i observed while 
analysing some RLE methods. In most crunching algorithms there will be telling
bits or bytes, telling the decruncher to switch between crunch and normal mode. After
this switch the following bits/bytes are interpreted completely differently.

- IFF like.
  The IFF format was built upon the idea of signed chars. I don't remember the 
  algorithm in all detail, but enough to explain the idea. If someone has the
  original IFF-Readme's he should correct me here. But the basic idea is the 
  following:
  There are TELLING bytes. These telling bytes are either signed or unsigned.
  The crunched file starts with a telling byte.
  The switch was the 8th bit (the sign bit) of this char. 
  When the decoder encountered a char with the sign bit set it knew that, by masking
  with 0x7f, it would get the runlength of the following byte. It copied the following
  byte (decoded value + 2) times and went on with the next byte. +2 because there had to
  be at least 2 bytes, so you could add 2 in mind, giving you a length of a
  crunch-run from 2 (encoded with %1 + (2 - 2 = 0 = %000 0000) = 0x80) 
  to 129 (encoded with %1 + (129 - 2 = 127 = %111 1111) = 0xff).
  If the sign bit was not set, that would mean that the next X + 1 bytes were to 
  be copied as they were. X + 1 because there HAS to be at least one byte, so 
  you can add 1 in mind, giving us a range 
  from 1 byte (encoded with %0 + (1 - 1 = 0 = %000 0000) = 0x00) 
  to 128 (encoded with %0 + (128 - 1 = 127 =  %111 1111) = 0x7f). 
  The minimum crunching point was 2 equal following bytes. They would be
  coded as two bytes (0x80,0x??). Yes, i know, there is no saving here, but the other
  consequence would have been to encode these two bytes as no-crunch, leading
  to a three byte coding (0x01, 0x??, 0x??). 
  In cases like this it is most important to keep the garbage down, or else you are
  increasing the filesize with it.
  One example:
  
  abcdefg00000000000000000000 (= abcdefg + 20 '0')
  
  would be coded as:
  
  {0x06}abcdefg{0x92}0. 
  0x06 because we have 7 bytes here and the decoder adds one for you.
  0x92 = %10010010 = %1 0010010 = 128 + 18. 18 because the decoder adds 2 for you.

  It is important that you store the original filelength with the crunched file or your 
  decoder doesn't know when to stop. 

  If you have a file of total random data (like an already crunched file) you will
  add (filesize / 128) bytes to it. So, if your Word '95 with 3.845.120 bytes were
  total random data it would be enlarged by 30.040 bytes. But run the alg on uncompressed
  bitmaps and watch them shrink.

  Remark: IFF is a very good and simple to implement algorithm and it's a shame that
          the Amiga is dead nowadays. I see IFF only on AIFF sound formats. ;(
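
  A decoder for this IFF-like scheme fits into a few lines. The following C sketch
  is my own reconstruction from the rules above (the real IFF code may differ in
  details), assuming the original filelength is stored along with the crunched data:

/* IFF-like RLE decoder (reconstruction):
   telling byte with sign bit set   -> run:      copy next byte (low 7 bits)+2 times
   telling byte with sign bit clear -> verbatim: copy the next (value)+1 bytes       */
void iff_like_decode(const unsigned char *src, long srclen,
                     unsigned char *dst, long origlen)
{
    long in = 0, out = 0;

    while (out < origlen && in < srclen)
    {
        unsigned char telling = src[in++];

        if (telling & 0x80)                   /* crunch: a run follows  */
        {
            int count = (telling & 0x7f) + 2; /* decoder adds 2 for you */
            unsigned char c = src[in++];
            while (count-- > 0)
                dst[out++] = c;
        }
        else                                  /* no-crunch: verbatim    */
        {
            int count = telling + 1;          /* decoder adds 1 for you */
            while (count-- > 0)
                dst[out++] = src[in++];
        }
    }
}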


- one for eight
  We have a telling byte which is viewed as a row of eight status bits. 1 means we have
  a crunch, 0 means we have garbage (or the other way round - your choice).
  Crunch and Garbage have unsigned counting bytes giving us a range from 1 to 256.
  After we worked thru a sequence of all 8 bits, the next telling byte is read (thus
  forming a package, remember?)
  The crunched file starts with such a telling byte.
  One example:

  abcdefg00000000000000000000xyz111222333444abcdefg = 49 bytes

  We step thru the first 7 bytes and consider them Garbage. 
  The next 20 bytes are Crunch.
  Then 3 bytes Garbage.
  Then 3 bytes Crunch.
  Then 3 bytes Crunch.
  Then 3 bytes Crunch.
  Then 3 bytes Crunch.
  Then 7 bytes Garbage.

  So our telling byte would be: %0 1 0 1 1 1 1 0 = 0x5e.
  The counting bytes would be (decremented by one, remember): 
  0x06, 0x13, 0x02, 0x02, 0x02, 0x02, 0x02, 0x06

  The crunching algorithm can already start crunching at two equal bytes, and we should
  do it, or else we would have to emit them as garbage: one counting byte plus those two bytes = 3 bytes!

  It is important that you store the original filelength with the crunched file or your 
  decoder doesn't know when to stop. 

  The whole package would look like this:
  {0x5e}{0x06}abcdefg{0x13}0{0x02}xyz{0x02}1{0x02}2{0x02}3{0x02}4{0x06}abcdefg = 31 bytes

  The IFF-coder would have produced:
  {0x06}abcdefg{0x92}0{0x02}xyz{0x81}1{0x81}2{0x81}3{0x81}4{0x06}abcdefg = 30 bytes

  While the algorithm is a little bit more complicated than the IFF-alg, it has the advantage
  of having a range of 256 instead of 129 bytes. In big files this can lead to some significant
  savings.

  I lifted this algorithm from the bit level to the byte level. I saw both implementations and
  from my observations the byte level is much faster. On the bit level you read a telling-BIT. 
  If it's a one (crunch) you read 8 bits as a counter, add one, read the next 8 bits as the char 
  and copy your decoded byte ?-times into your destination buffer. 
  If it's a zero you read the next 8 bits 
  as a counter, add one, and reading ? times 8 bits as chars you copy the next ? bytes. It's the
  same mechanism, but implemented on bytes it's sooooo much faster. The bit level is just more
  pseudo-crypting, because you can't read anything anymore :-).

  If you have a file of total random data (like an already crunched file) you will
  add (filesize / 256) + ( (filesize / 256) / 8) bytes to it. 
  So, if your Word '95 with 3.845.120 bytes were total random data it would be 
  enlarged by 15.020 + 1878 = 16898 bytes. 
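
  For comparison, a decoder for the 'one for eight' scheme could look like the
  following sketch (again my own reading of the rules above; here the most
  significant bit of the telling byte describes the first of the eight segments):

/* 'one for eight' decoder sketch: one telling byte carries eight status
   bits, every segment has an unsigned counting byte (stored minus one). */
void one_for_eight_decode(const unsigned char *src, long srclen,
                          unsigned char *dst, long origlen)
{
    long in = 0, out = 0;

    while (out < origlen && in < srclen)
    {
        unsigned char telling = src[in++];
        int bit;

        for (bit = 7; bit >= 0 && out < origlen; bit--)
        {
            int count = src[in++] + 1;        /* counting byte, +1       */

            if (telling & (1 << bit))         /* 1 = crunch: repeat byte */
            {
                unsigned char c = src[in++];
                while (count-- > 0)
                    dst[out++] = c;
            }
            else                              /* 0 = garbage: copy as is */
            {
                while (count-- > 0)
                    dst[out++] = src[in++];
            }
        }
    }
}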
  

- No Garbage
  This one is most interesting. As you saw above, one problem arising is the coding of garbage.
  This idea now implements a way of RLE without the problems of garbage, with the
  disadvantage of only being able to crunch from 3 bytes in a row upwards instead of from 2
  like the others.
  The idea is that the decoder can do something about recognizing the garbage itself. If the
  decoder could somehow recognize that we have a crunch here, it would just have to read a
  counter byte and could start the copy routine. 

  Well, the easiest way of making the decoder recognize a byte-run is to let the run partially 
  STAY in the source. That means that we have to let at least two bytes in a row stay in the
  source to make the decoder able to recognize a crunch. Then we add a counting byte with a range
  from 1 to 256 of how many of this byte will FOLLOW and there are no garbage bytes anymore.
  1 to 256 because we will crunch starting at three equal bytes in a row, so the byte will at
  least be copied once more, thus adding one to the range of a byte.
  In practice this means you will have a run with a certain length. When the length is more than
  two and shorter than 259 you subtract 3 from it. When you have a length of 3 you subtract 3,
  giving a zero as output. As the decoder will at least copy one byte the run is perfect. If you
  have a run of 258 you subtract 3, giving 255 as output. Exactly what can be put into an unsigned
  char. As two bytes will be coded the normal way and the decoder will add one to the counter we
  have our 258 runlength encoded.

  One example:
  abcdefg00000000000000000000xyz111222333444abcdefg = 49 bytes

  abcdefg               is encoded as is
  00000000000000000000  is encoded as crunch: 20 - 3 = 17 goes into the counting byte, so 00{0x11}
  xyz                   is encoded as is
  111                   is encoded as crunch: 3  - 3 = 0  goes into the counting byte, so 11{0x00}
  222                   is encoded as crunch: 3  - 3 = 0  goes into the counting byte, so 22{0x00}
  333                   is encoded as crunch: 3  - 3 = 0  goes into the counting byte, so 33{0x00}
  444                   is encoded as crunch: 3  - 3 = 0  goes into the counting byte, so 44{0x00}
  abcdefg               is encoded as is

  makes altogether:

  abcdefg00{0x11}xyz11{0x00}22{0x00}33{0x00}44{0x00}abcdefg = 32 bytes


  If you have a file of total random data (like an already crunched file) you will
  add 0 (!) bytes to it. That's an important feature. This algorithm can be
  let loose on any data you like. In the worst case the output will be the same
  as the input and no harm is done.
  So, if your Word '95 with 3.845.120 bytes were total random data it would 
  still be 3.845.120 bytes.
  The disadvantage is of course the inability to crunch two consecutive bytes.
  But hey, you can't have it all, can you?
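
  A sketch of the encoder side of this 'No Garbage' scheme (my own illustration in
  C; the decoder mirrors it by watching for two equal bytes in its output). Be
  warned that, as the interlude before Part V will show, a verbatim pair of equal
  bytes such as '1122' confuses that decoder, so this simple sketch inherits
  exactly the flaw discussed there:

#include <stdio.h>

/* 'No Garbage' RLE encoder sketch: short runs (1 or 2 bytes) are copied
   as they are; from 3 equal bytes on, two bytes stay in the stream and a
   counting byte follows holding (number of following bytes - 1), i.e.
   run - 3, which covers runs of 3..258 bytes. */
void no_garbage_encode(const unsigned char *src, long len, FILE *out)
{
    long i = 0;

    while (i < len)
    {
        long run = 1;
        while (i + run < len && src[i + run] == src[i] && run < 258)
            run++;

        if (run >= 3)
        {
            fputc(src[i], out);               /* the two bytes stay ...   */
            fputc(src[i], out);
            fputc((int)(run - 3), out);       /* ... then the counter     */
        }
        else                                  /* 1 or 2 bytes: verbatim   */
        {
            long t;
            for (t = 0; t < run; t++)
                fputc(src[i], out);
        }
        i += run;
    }
}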


There are of course a lot more possibilities of coding RLE. I hope you get the idea.
Don't let the statistics deceive you. It strongly depends on the input data which
of the above mentioned algorithms is best suited for the task. One good strategy 
would be to load about 10%-20% of the file (if it's big) and test-crunch it with 
ALL your algs. Then choose the best one and crunch the whole file. This will not
work on smaller files, where you should load the whole file into a buffer (say, 64K
or 128K or so). For most RLE-algs speed is the most critical factor. Keep this 
always in mind when you invent your own RLE-crunchers.

When i started writing little crunchers on my own i started with the DECruncher.
That always worked fine for me. First think of a file and a possible coding. Write
it down. Create a file with this coding. Write a decruncher. And then think of a 
way to generate such a code. The central point of all crunch algorithms will be
the pointers to your source buffer and to internal arrays and/or to some
history tables. Try it. It's easier than you may think right now.


Eh, i have observed something important, i think.

What is it, Watson?

In your example (abcdefg00000000000000000000xyz111222333444abcdefg) the 
sequence 'abcdefg' happens to appear two times. After having observed this
i examined some files and texts and i realized that sequences (and, or, in...)
(re)appear a lot more often than equal-byte runs in files. 
Can we crunch these sequences, too?

Oh, yes, Watson. We can! Jacob Ziv and Abraham Lempel published essays dealing
exactly with this problem in 1977 and 1978. The theories described there are
the basics of what we call today the LZxxx-algorithms. Programs like the
UNIX-Compress or ZIP or RAR are based on these theories. But this is a 
little bit more complicated and better dealt with in a chapter of its own.



I hope you enjoyed this chapter. 
There will be more - i promise.
Just have a little patience.

Greetings from a programmer

Joa


 Little essay about the various methods and viewpoints of crunching.
Part V: Interlude and the Mystery of Lempel-Ziv Part I

---Joa's little essay on performing the art of compressing data - little big interlude---
(Pre-Chapter V)
And YET Garbage... ;((
In chapter IV ;) i told you some ways of compressing runs of bytes with certain 
techniques. But the last one, the No Garbage technique, has a little problem.
(thanks go to rudeboy for clearly pointing that out! Your EMail has brought me
some new ideas. Thanks again.)

Consider the following bytes:
xyz1112
As the coder would start upon a run of at least three bytes and would store the
number of following bytes minus one, we would get the following sequence:
xyz11{0x00}2.
The decoder would interpret the file in the following way:
actual   last byte  crunch 
x                    false
y        x           false
z        y           false
1        z           false
1        1           true
0x00     1           true    
copy '1' one time
continue with reading
                     false
2        1
(EOF)
done
and now consider the following byte-sequence:

xyz1122
The coder would code this verbatim:
xyz1122
The decoder would interpret this in the following way:
actual   last byte   crunch
x                     false
y        x            false
z        y            false
1        z            false
1        1            true
{0x32}   1            true
[CRASH] - our decoder writes bytes incorrectly!!!
Hey what's going on? Last time i told you this - now i tell you that.
Well, to be honest, i didn't implement this last one (trusting the
basic sample i received) and so didn't test it out. For this i officially
apologize  :(.
Now the least i can do is to tell you what you can do, as there are a 
lot of capable +crackers out there, trying their own implementations and
they are just asking: How do i code this and that in an INTELLIGENT
(= bits-saving) way?
This is a very good question. What can we (i) do in our actual situation?
Well, some possible ways would be:

- adding an entry list with all those 'dangerous' addresses. With this
we would only have to enhance our coder with a security check whether
the third byte would be the same as the second one or not. If not (our
dangerous case) we would add this address to the 'dangerous list' and
the decoder could look up, whether the actual decoding address would
be dangerous or not. To save space we would have to write a little
management (for example: was the last address one byte in the past, or
one word or doubleword?) but we want to keep it simple. Anyway, on the
byte level we would add a big chunk of garbage to the file.
 

- coding the real values and not subtracting one frees up the zero,
thus enabling us to code the dangerous sequence in a way like this:
xyz1122 -> xyz11{0x00}22{0x00}
As you can clearly see, we are adding garbage to the file. And as you can
imagine, two-byte-runs are so common, our cruncher would become useless.
One way out of this misery would be to lower the quality of the cruncher
and start up by 4 bytes instead of 3. Still we would add garbage!
 
- to be honest, on the bytelevel i have no idea how to implement an 
easy, garbageless cruncher (yet), but when we leave the bytelevel
and enter the bitlevel we have a universe of possibilities ready for us
(still with garbage but much less garbage than expected).
 

So let's start with the first solution on the bitlevel:
(a bitwasting version - optimizing comes later)
%'?' means that a char is added (8 bit binary)
Ok, here we go:
crunchable file:
xyz1112
the first three bytes cannot be crunched, so:
%'x'%'y'%'z'
the next two bytes, too,  are coded normally:
%'1'%'1'
this is what we normally would do: 
we count the following '1's, subtract one, AND with 0xff for 8bit-coding.
After that we set the crunch flag to false to code the
next byte(s) normally. If we have more than 256 bytes in a run, we lose
two bytes every block of 256. But except for bitmaps RLE will seldom meet
runs of more than 256.
The counter gives us 1, so the counter is coded:
%0000 0000   <- space for optimization here
we encounter '2' and code it normally:
%'2'.
Looking back we have gained nothing :(

And here the dangerous file:
xyz1122
The first three chars are coded:
%'x'%'y'%'z'

The next two chars are also coded normally:
%'1'%'1'
Now the Encoder would recognize that the next character is NOT a '1' and we have
to solve this problem. One way of doing this would be to count the number of equal
chars following and if the counter was zero add a signal bit and code the run.
But if we code a signal bit, we have to code it in normal files, too. So, if
the run was no run, we would add a %0, if we could run a crunch we send a %1.
So for now we add a %0:
%0
After that we code the next two bytes normally:
%'2'%'2'
and as the last run isn't a crunch, we add a %0:

%0
So, the whole file would look like this (SPACEs inserted for clarity):
%'x'%'y'%'z' %'1'%'1'%0 %'2'%'2'%0
So for the cases where we couldn't crunch we would only add one bit. But hey, we
 forgot to adapt this new format to the crunchable data. Let's check out the new
 format on a crunchable run:
xyz1112 becomes
%'x'%'y'%'z' %'1'%'1' %1 %00000001 (+1 as we DO have a crunch - so makes a counter of 1) %'2'

Ooops. What happened? Now we enlarged our crunchable file by one bit. Damn. We want to be sure
that a three byte run doesn't create garbage, not even one bit. As a good solution we would
reduce the maximum run length from 256 (255 +1) down to 128 (127 +1)  which can be encoded in
7 bits. Done that we look again on our file:
%'x'%'y'%'z' %'1'%'1' %1 %0000000 (+1 as we DO have a crunch - so makes a counter of 1) %'2'
Good. We crunched the file. The filelength hasn't changed, but we didn't produce garbage. And
when we enter runs greater than 3 we have a win situation. We can only hope that we have more
wins than (garbage-bits / 8).
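
Here is a sketch of the encoder side for the bit format we just worked out (my
own illustration; the BitWriter helper and the function names are assumptions,
not something from the essay). Every literal byte costs 8 bits; after two equal
bytes in a row one telling bit follows, and a %1 is followed by 7 bits holding
(number of further identical bytes - 1):

#include <stdio.h>

/* tiny MSB-first bit writer */
typedef struct { FILE *f; int acc; int nbits; } BitWriter;

static void put_bit(BitWriter *bw, int bit)
{
    bw->acc = (bw->acc << 1) | (bit & 1);
    if (++bw->nbits == 8) { fputc(bw->acc, bw->f); bw->acc = 0; bw->nbits = 0; }
}

static void put_bits(BitWriter *bw, unsigned value, int count)
{
    while (count-- > 0)
        put_bit(bw, (value >> count) & 1);
}

static void flush_bits(BitWriter *bw)         /* pad the last byte with zeros */
{
    while (bw->nbits != 0)
        put_bit(bw, 0);
}

/* bit-level 'no garbage' RLE: a telling bit is only spent after a pair of
   equal bytes, so random data grows by at most one bit per such pair. */
void bit_rle_encode(const unsigned char *src, long len, BitWriter *bw)
{
    long i = 0;

    while (i < len)
    {
        put_bits(bw, src[i], 8);                  /* literal byte            */

        if (i > 0 && src[i] == src[i - 1])        /* two equal bytes in a row */
        {
            long extra = 0;                       /* further identical bytes  */
            while (i + 1 + extra < len && src[i + 1 + extra] == src[i]
                   && extra < 128)
                extra++;

            if (extra == 0)
                put_bit(bw, 0);                   /* dangerous case: no run   */
            else
            {
                put_bit(bw, 1);                   /* crunch                   */
                put_bits(bw, (unsigned)(extra - 1), 7);
                i += extra;                       /* the run is consumed      */
            }
        }
        i++;
    }
    flush_bits(bw);
}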
But, ahem...
Yes, Watson?
I would like to implement another way...
Yes?
What about this dangerous address-list, you mentioned?
Well, yes, we could implement one. All right. The secret to this method
is that the intervals run in one direction only: the distances are
always positive. And when the file has a lot of dangerous addresses, the
distances are relatively short. Therefore the distances can be expressed
in only a few bits and not in a word or long.
The first 'dangerous' address is encoded as a long in the crunched file.
But 'how?' do you ask? Well, it's clear that we would have to add some kind
of switch to the dangerous-address list, isn't it?
How would a crunchable file and our dangerous file look like?

First, we would encode dangerous runs as garbage:
xyz11abcd334455 would be encoded verbatim. 
But we would start a 'dangerous' list:
The encoder recognizes the dangerous run (by checking, if the third byte of the 
possible run is the same as the second or first) and initializes/updates the list.
The decoder, on the other side reads one entry (if available) AHEAD and knows the
next dangerous address from decoding it. But back to the encoder.
When the first run is to be done, we should code it as a long:
The "11" is found on address 0x00000003 (zero-based of course) and should be coded
as long. Then we go along, code "abcd" and enter "33". As this is a dangerous run,
we have to code it (the difference that is), but we want to code it with as
few bits as possible. And here the great sparks of creativity are fanned.
We could possibly do:
Code two signal bits for four different modes: 
%00 for long
%01 for three bytes
%10 for word
%11 for byte
and let the specified number of bits (32, 24, 16, 8) follow those headerbits. Yes, this means,
we have to implement a bitreading routine for our decruncher. But hey, those are
very useful for a lot of purposes. Don't be lazy. Once you've written a pair of
bitswriting/bitsreading-routines, you'll be amazed about the new possibilities of
addressing things.
Another way would be to do a Huffman-Version of these two bits:
%0   for byte (should be the most happening intervall)
%10  for word
%110 for three bytes
%111 for long
followed by the appropriate number of bits.
If you are crazy enough you could run a counter along and
calculate adaptively an always optimized Huffman-tree for these two bits ;)
Next possibility, please:
if you are sure that the distances will be VERY short - you could do a variable coding:
divide the distance by 3 and save the remainder. In a loop running from 1 to the quotient add
%11 to your dangerous-list. Then add the remainder in binary, but with two bits. Examples:
Interval    added bits:
0           %00
1           %01
2           %10
3           %11 00
4           %11 01
5           %11 10
6           %11 11 00
7           %11 11 01
8           %11 11 10
9           %11 11 11 00
10          %11 11 11 01
...
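
A small sketch of emitting this divide-by-three code (my own illustration; it
just prints the bits as text so the output can be checked against the table
above - the matching decoder is the 'read two bits, every %11 adds 3' loop that
shows up again in the decruncher pseudocode below):

#include <stdio.h>

/* divide-by-three code: for a distance d write d/3 times the pair %11,
   then the remainder d%3 in two bits */
void emit_div3_code(unsigned distance)
{
    unsigned q = distance / 3;
    unsigned r = distance % 3;

    while (q-- > 0)
        printf("11 ");
    printf("%u%u\n", (r >> 1) & 1, r & 1);   /* remainder in two bits */
}

/* emit_div3_code(9)  prints  11 11 11 00
   emit_div3_code(10) prints  11 11 11 01   (matching the table above) */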
 

To complicate things, i thought: what, if i want to create a flexible and variable code, giving
me good ranges without exploding numbers of bits when i reach higher numbers? I searched my
library and came up with an idea i'll explain now.
But first the useful, but expensive unary code:
The unary code has as condition that the number to be encoded is greater than 0.
Imagine you want to code a number with bits. Imagine you don't have a single idea what the
binary system is. You could come up with the idea of writing %1 one time less than your number
indicates. After that you would send a %0 as delimiter. Examples:
value      bits (all binary):
1          0
2          10
3          110
4          1110
5          11110
6          111110
7          1111110
8          11111110
9          111111110
10         1111111110
...
For normal cases this coding method is not useful, but for some cases it is the
foundation. And for values between 1 and 3 nothing is better. Now imagine you
would use the number of %1 you read as SHL-argument. You would get a power of 2.
But what about the lower bits? When i code %1110 i read three %1 (plus %0 as
delimiter) and know that i have a basic value of 2^3 = 8. If i code %11110 i get
2^4 = 16. What about the rest? Well, it just has to be inserted, that's all. The
remainder is just added in binary. This method is called the y-code (the y stands
for the Greek gamma character). You tell the decoder what he still has to read and
you also know the power of 2 that these other bits are to be added to.
Let's create a y-code for the number 9:
First we have to find the powers of 2 our number lies in:
The number 9 is between 8 and 16, so we use 8 as base. 8 is 2^3, so we code the
exponent 3 as three %1 followed by the %0 delimiter: %1110. Then we subtract 8 from 9,
giving us a rest of 1. As our power of 2 is three, the rest of the code MUST be
expressible in 3 bits. And it is. We code the remaining 1 in three bits: %001. So the
whole code for 9 would be: %1110 001. Here are the numbers 1 to 10 (decimal):
Number      Code (in binary)
1           0
2           10 0
3           10 1
4           110 00
5           110 01
6           110 10
7           110 11
8           1110 000
9           1110 001
10          1110 010
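
A small sketch that emits this y-code as text bits (my own illustration,
assuming n > 0), so it can be checked against the table:

#include <stdio.h>

/* y-code (gamma code): k ones plus a %0 delimiter give the power of two
   the number lies in (2^k <= n < 2^(k+1)), then the remainder n - 2^k
   follows in exactly k bits. */
void emit_gamma_code(unsigned n)
{
    int k = 0;
    while ((2u << k) <= n)                   /* find the right power of 2 */
        k++;

    for (int i = 0; i < k; i++)              /* unary part: k ones ...    */
        putchar('1');
    putchar('0');                            /* ... plus the %0 delimiter */

    for (int i = k - 1; i >= 0; i--)         /* remainder in k bits       */
        putchar('0' + (int)((n >> i) & 1));
    putchar('\n');
}

/* emit_gamma_code(9)   prints 1110001           (as in the table above)
   emit_gamma_code(129) prints 111111100000001   (8 + 7 = 15 bits)       */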

And now we could combine the above mentioned divide-by-three method with the last
part of the y-code, that is, exchange the first, the unary part, with the
divide-by-three method.
Let's code the number 9 again:
9 is 2^3 + 1. So our power of two is 3. To code 3 we do:
3/3 = 1 remainder 0. So 1 x %11 and remainder %00 -> %11 00.
And after that we would encode the rest in three bits: %001. 
So the whole encoding would look like:
%1100 %001.
Now let's encode 129 with both methods:
129 = 2^7 + 1 = %11111110 0000001  (y code)       (8 + 7 = 15 bits)
129 = 2^7 + 1. 7 = %11 11 01 -> %111101 %0000001  (6 + 7 = 13 bits)

Of course you are free to combine all mentioned methods together. It may be a good
idea to pre-create a dummy entry, counting the number of bits needed for writing
the entry. By looping thru all your available methods you could switch to the
best method for the actual interval. The only premise is that the decoder uses
the same routines as the encoder.
I hope that helped. Maybe i will refer to these methods in later chapters, so
study them and take your time. For more information on coding bits - go to your
local library and lend: "Managing Gigabytes" by Ian H. Witten, Alistair Moffat,
Timothy C. Bell. But be warned: This is not an easy book to read! Quite the
contrary: I would be happy if i understood half of the book with its thousands
of formulas ;)
But now let's pass to chapter 5
---Joa's little essay on performing the art of compressing data - little big interlude END---

By Joa Koester (JoKo2000@HotMail.Com)
This essay is free to be copied and read and spread as long as
you are decent enough to leave my name in it.
If you have corrections, comments, better examples than the ones used
in this essay, etc. - drop me a line.
OK, let's dive into the world of crunchers and munchers...

Lempel - Ziv
A Milestone in crunching history

Pease porridge hot, pease porridge cold,
Pease porridge in the pot,
Nine days old.
Some like it hot, some like it cold,
Some like it in the pot,
Nine days old.
(...now he has gone totally nuts ;)
*Ahem*
No, i'm not nuts. This is just a textbase suitable for crunching. Just look at
it. Aren't there a lot of identical bytesequences? Shouldn't there be an elegant
way of crunching this? Yes, of course there is. The secret lies in the observation
that bytes (symbols) that appear right now have ALREADY appeared in the recent
past. Consider the second line as our actual line. The bytesequence "Pease
porridge " is the same as in line one. In the sixth line the bytesequence "Nine
days old.{0x0d, 0x0a}" is identical to the third. But if we already HAVE the
bytes that we need, we don't need to write them anymore. Just for the sixth line,
we could just write a reference to the file, like:
Pease porridge hot, pease porridge cold,
Pease porridge in the pot,
Nine days old.
Some like it hot, some like it cold,
Some like it in the pot,
[ ->Line 3, 16 bytes]
If we could express this [ ->Line 3, 16 bytes] in fewer bytes than 16 we 
would have crunched a good portion of our file.
In the first line, we could put a reference to the bytes of "ease porridge"
as the initial 'p' is not identical to the 'P' of the first "Pease". But
with a reference we would write something like:
Pease porridge hot, p[->Line 1, Byte 1, 14 bytes]cold. 
If we write this sequence in less than 14 bytes we crunch. 
Some thought like that must have been spinning thru the heads of Jacob Ziv and
Abraham Lempel when in 1977 and 1978 they came up with totally new ways of
compressing data. The '77 version is exactly what i'm explaining here. The output
of your crunch is something with a reference and a counter. But how do you come
up with it?
Well, the original LZ77 alg uses something called a 'window' and the source is
sliding thru the window, scanning for repeatings. Well, this window is - in the
end - nothing but a loop into the past of the data you are crunching. You have a
current point. This point points to your actual data to be crunched. What you do
is to look for occurences of the bytes you are looking at right now in your past.
Somehow these occurences are in the near past (a good guess: about 8 kbyte)
and the more you look into the past the more improbable a crunch gets. Let's have
an example:
abcdxyz000abcd123
YOU recognize immediately the second appearance of "abcd", but how 
does the algorithm?
Well, let's imagine your point was already on the last '0' and now 
we are entering the 'a' of "abcd".
          *
abcdxyz000abcd123
         
   
We must have a past pointer ready. As a past pointer it starts one
byte 'behind' our actual pointer:
          *
abcdxyz000abcd123
         ^
  
Now we compare - and fail. So we move our past pointer down by one
back into our past:
          *
abcdxyz000abcd123
        ^
   
Now we compare - and fail. So we move our past pointer down by one
back into our past:

          *
abcdxyz000abcd123
       ^
... This is what Lempel-Ziv meant with their window: A pointer that 
scans the past.
When we finally reach the 'a', our compare flags:

          *
abcdxyz000abcd123
^
Jippieh! Hit! And now? Well, we make a copy of our pointers and install runpointers.
From then on we run back into the future, until we reach the end of the data or the
compare fails - whichever comes first.

          *'
          *
abcdxyz000abcd123
^
^'

flags until...

              *'
          *
abcdxyz000abcd123
^
    ^'
Well, the compare fails after four bytes. That means, we have a possible crunch
of four bytes ahead. That is information number one. Information number two is
the distance between our two pointers. A simple subtraction gives us this
distance. Now all we have to do is to encode a format with the distance and the
number of crunched bytes and we will have our own implementation of LZ77. I think
the original format was 12 bits fix for the distance and 4 bits fix for the number
of bytes crunched. I don't think i'm going to surprise you if i tell you that in a
normal file 16 equal sequence-bytes are very seldom. Most occuring byteruns are
2, 3, 4, 5+. In that order. And you don't need 12 bits fix. Use your imagination.
There are a lot of ways of creating such a distance format. One good beginning
would be to connect the runlength and the distance. It is most reasonable to allow
a bigger distance for a 5+ run and to truncate the distance for the more occuring
2-runs, thus giving you less bits to write for the most occuring crunches. As you
are crunching 2 bytes you will only have 16 bits space (2 bytes = 2 * 8 = 16). As
you want to gain something here too, i suggest a distance of 9 bits maximum.
When you manage to stay under 15 bits for the 2crunch, you will have a good
format.
A format i used (with a scan of 2048 byte), was:
2crunch (16 bits original): %0%000000000 (%0 for 2crunch, 9bits distance)
3crunch (24 bits original): %10%0000000000 (%10 for 3crunch, 10bits distance)
4crunch (32 bits original): %110%00000000000 (%110 for 4crunch, 11bits distance)
5+crunch (40+ bits original): %111%(variable) (%111 for 5crunch plus sequencing bits
                                               for coding distance and number of bytes)
You are of course free to use whatever comes into your mind. And do yourself a favor 
and DO IT! It is good fun and you learn a lot.
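
Before the worked example, here is a brute-force sketch of the window scan just
described (my own C illustration, without any of the speed-up tricks that come
later): for every position we walk backwards through the window, let pointer
copies run forward on a hit, and keep the longest match (greedy evaluation):

/* returns the length of the best match for position 'pos' (1 = no match,
   i.e. garbage) and puts its distance into *dist */
long find_match(const unsigned char *buf, long pos, long len,
                long window, long *dist)
{
    long best_len = 1;                /* the Crunch-Counter starts at 1  */
    long best_dist = 0;
    long start = (pos >= window) ? pos - window : 0;

    for (long back = pos - 1; back >= start; back--)   /* the past pointer */
    {
        long run = 0;                 /* the pointer copies run forward   */
        while (pos + run < len && buf[back + run] == buf[pos + run])
            run++;

        if (run > best_len)           /* longer (= better) match found    */
        {
            best_len = run;
            best_dist = pos - back;
        }
    }

    *dist = best_dist;
    return best_len;
}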
But back to our scanning algorithm. We have a few problems to examine:
What happens if i come up with a sequence like: abcdefg123abcd###abcdef  ???
What about the start and the ending of our data (beginning and ending of scanning)?
What about speed?
What about garbage?
all right, all right.
We will crunch the above mentioned bytes together. Then your questions will be answered.
Important points: 
We install a crunch-counter, telling us, how many bytes we have crunched.
If the counter remains at 1 we will have not crunched anything and the
current byte is to be written as garbage.
We also install a Crunch-Distance for comparisons.

Let's begin.
Obviously we start at the first byte:
*
abcdefg123abcd###abcdef
Now, as we don't have a window yet, we have to write this as garbage. And as we
now recognize that we don't know how many bytes of garbage we will have, we have
two options: write the garbage away binary reversed and build the decruncher to
read your file from behind back to the front
OR
write the garbage into some temporary memory/file/whatever and when you reach
your crunch you know how many bytes of garbage you had, code this number and
append the garbagebytes from your tmp. The first solution would also mean that
the scan-alg would have to scan the actual future (which would be the decruncher's
past), but there's no real difference between an incremental scan or a decremental
scan.
Solution ONE would look like: (when we reach the second 'abcd')
321gfedcba (not completely binary reversed for clarity)
  then we have the count of our garbage (0x0a) and i code it for example using
  the 'divide-by-three' method: 10 / 3 = 3 rem 1 -- %11 %11 %11 %01
  this would make -- binary reversed -- :
%10 %11 %11 %11
  plus a %0 for garbage --
%0
and if we add this:
(321gfedcba (not completely binary reversed for clarity) ) %10 %11 %11 %11 %0
As in this variation the decoder would read the bits from behind it 
would get the codes in correct order:
%0     for garbage
%11 %11 %11 %01   counter of garbage
abcdefg123            and then the garbage bytes re-reversed to original state
The other way would be to save the byte somewhere.
So when we reach the crunch at 'abcd'
we would have:
[tmp]abcdefg123
we would code 
the %0 for garbage, the counter %11 %11 %11 %01
and the bytes.
In this version the decruncher would read the file from the start
and not from behind. Both versions have their advantages and
disadvantages and it's just a matter of taste not of difficulty.
For this example i assume a [tmp]-solution, so we write the 'a' to our [tmp]-area.
After that we increment our pointer and start anew.
The Crunch-Counter is set back to 1.

 *
abcdefg123abcd###abcdef

Now we can have a look into the past.
We install our PastPointer one byte behind the actual pointer and
run a compare.

 *
abcdefg123abcd###abcdef
^
Whew. 'b' and 'a' don't match. So we'll continue our scan either until we reach
the start of the file or we reach the end of our scan-reach. As we have reached
the start of the file we look at our Crunch-Counter: still 1. So this byte is also
garbage. Away with it ;)
Reset your Crunch-Counter (redundant in this case, i know, but when you code the
alg you'll see that an If-Then construct is more timewasting than the move
operation) and increment your actual pointer by the Crunch-Counter (1):
  *
abcdefg123abcd###abcdef
 
Install the Scan-Pointer:

  *
abcdefg123abcd###abcdef
 ^
And start scanning. 
As the Byte at distance -1 is not equal to the actual byte we'll decrement
our pointer by one and do this until we:
 - reach an equal byte,
 - reach the start of the file
 - reach the end of our scan-reach.
 
Our Scan-Pointer is at the 'a' and the start of the file is reached.
As we look at our Crunch-Counter, we still have a 1. So away with it.
This goes on until we reach the second 'a' at position {0x0a} (zero-based).
We'll continue the situation one byte BEFORE the crunch:
         *
abcdefg123abcd###abcdef
^
After this last comparison we write the '3' as garbage. Our [tmp] looks like:
[tmp]abcdefg123, we have a garbagecounter of 0x0a and a Crunch-Counter of 1. Now
we go ahead and increment our crunch-pointer by the Crunch-Counter (1).
          *
abcdefg123abcd###abcdef
         ^

The scan will fail until the very first byte of the file:

          *
abcdefg123abcd###abcdef
^
Now, as written some paragraphs above, we install copies of our 
pointers and let the copies run until either we reach the end of
the file or the cmp fails.
          *'
          *
abcdefg123abcd###abcdef   is ok.
^
^'
           *'
          *
abcdefg123abcd###abcdef   is ok.
^
 ^'
            *'
          *
abcdefg123abcd###abcdef   is ok.
^
  ^'
             *'
          *
abcdefg123abcd###abcdef   is ok.
^
   ^'
              *'
          *
abcdefg123abcd###abcdef   fails.
^
    ^'
We compute the runlength and compare it with our Crunch-Counter (currently 1).
As 4 is greater than 1 we have a crunch and continue happily. We also set the
Crunch-Counter to our new value (4) and adjust the Crunch-Distance to 0x0a.
So we have 0x0e (the pos of our crunch-point-copy) - 0x0a (the pos of our
crunch-point) = 4 as runlength and 0x0e - 0x04 = 0x0a as distance.
We want to encode this. But first we have to get rid of the garbage. As i assumed
we had the bytes in some kind of [tmp] and now, before we encode the crunch, we
encode the garbage:
%0            for garbage
%11111101     the counter
abcdefg123    the bytes themselves
Now, as we have a telling bit for garbage and crunch 
(0 for garbage, 1 for crunch), we have to encode the
telling bit for the crunch, right?

Think for a minute...
.....
 
Back?
Did you get it?
Yes, in this case you don't need a telling bit for crunch.
The explanation goes as follows:
These are the cases we have (theoretically):
garbage, garbage
garbage, crunch
crunch, garbage
crunch, crunch

Can we have a crunch after a crunch? Yes.
Can we have crunch followed by garbage? of course
Can we have garbage followed by crunch? Yep.
Can we have garbage after garbage? In this system we implemented - NO!!!
That means that, after coding garbage, we don't need to output a 
telling-bit for crunch. This saves us a lot of bits, believe me.
So after the garbage we code the crunch-sequence:
(%1)   omitted due to using brain ;)
%110   for 4crunch
%00000001010  for distance
We increment our original crunch-pointer by the Crunch-Counter (4),
reset the Crunch-Counter back to 1 and continue our normal scan.
              
              *
abcdefg123abcd###abcdef   
             ^
   
The first '#' will be emitted as garbage. Our Crunch-Counter is reset to 1.
But the next one is quite interesting:
    
               *
abcdefg123abcd###abcdef
              ^
    
Just one byte behind we find a crunch. And now? Well, we continue as before.
We install copies of our actual pointer and let them run.
                *'
               *
abcdefg123abcd###abcdef    is ok
              ^
               ^'
     
     
                 *'    
               *
abcdefg123abcd###abcdef    fails.
              ^
                ^'
As we compute the run length (2) and compare it with our Crunch-Counter (1) 
we see that it is greater. So we have a crunch.
We compute the difference (1) and set the Crunch-Counter to our new run-length (2).
We also set the Crunch-Distance to the right difference (in this case 1)
THEN WE CONTINUE OUR SCAN!!! 
But as we don't find another '#' further back in the file we stay with the values
we just computed.
As we just had a garbage out, we don't emit the telling bit for crunch and 
send:
%0           for 2crunch
%000000001   for distance.
After that we increment the Crunch-Pointer by the Crunch-Counter (2), reset the 
Crunch-Counter to 1 and continue.

                 *
abcdefg123abcd###abcdef   
                ^

The scan will fail until the first 'a' on position 0x0a:

                 *
abcdefg123abcd###abcdef   
          ^

We react as we are now used to:
Install pointer-copies and let them run.
                  *'
                 *
abcdefg123abcd###abcdef    is ok
          ^
           ^'

                   *'
                 *
abcdefg123abcd###abcdef    is ok
          ^
            ^'
   
                    *'
                 *
abcdefg123abcd###abcdef    is ok
          ^
             ^'
                     *'
                 *
abcdefg123abcd###abcdef    fails.
          ^
              ^'
So, we compute the runlength (4) and compare it with our actual Crunch-Counter (1).
As it is greater (= better) we save it into our Crunch-Counter. We also save the
distance in the Crunch-Distance.
After that, we continue.

                 *
abcdefg123abcd###abcdef    fails.
         ^
We will have lots of fails until the second 'a':

                 *
abcdefg123abcd###abcdef    fails.
^
Now, we have a crunch again. You already know what's happening, so let's watch the
movie :)
                  *'
                 *
abcdefg123abcd###abcdef    is ok.
^
 ^'
                   *'
                 *
abcdefg123abcd###abcdef    is ok.
^
  ^'
                    *'
                 *
abcdefg123abcd###abcdef    is ok.
^
   ^'
                     *'
                 *
abcdefg123abcd###abcdef    is ok.
^
    ^'
                      *'
                 *
abcdefg123abcd###abcdef    is ok.
^
     ^'
unfortunately, after this step we cannot continue, as our file is at its end. So
we stop our loop and compute our crunch-len: 6. We compare it with our
Crunch-Counter (4) and find it greater (=better). So we update our values and set
the Crunch-Counter to 6 and the Crunch-Distance to 0x11.
Then we continue our loop.
Hm, the crunch-pointer is at EOF, so we look, what we have:
Crunch-Counter > 1, in fact even 6, so we have a great crunch here.
As the last action was a crunch, we send our telling bit for crunch 
and after that we encode the Crunch-Counter and the distance
%1
%111
%(your code here)
And then we are finished. Flush all streams, close all files, delete all mem and exit.

This way of scanning for the best (=longest) match is called 'greedy evaluation', btw.
 
The decoder is much easier to program:
// assuming that the original length of the file is known and
// we have an output, which is incremented after each byte written.
// we also must have a pointer to our actual address to be written,
// or else we could not have references to our past.
// get_bit() and get_bits(n) are the bit-reading routines mentioned earlier.

while (Actual_Address < ShouldBe_Pos)
{
 bit = get_bit();
 if (bit == 0)
 {
  // garbage
  sum = 0;
  x = get_bits(2);
  while (x == 3)                // %11: another full block of three
  {
   sum += x;                    // add 3 to sum
   x = get_bits(2);
  }
  sum += x;                     // add remainder to sum

  for (int t = 0; t < sum; t++)
  {
   char c = get_bits(8);
   output(c);                   // output char and increment Actual_Address
  }

  // we know that after garbage ALWAYS follows crunch
  Do_Crunch();                  // slightly redundant call-structure
 }
 else
 {
  // clean crunch
  Do_Crunch();                  // slightly redundant call-structure
 }
}
return;

Do_Crunch()
{
 mode = 0;
 if (get_bit() == 0)
 {
  mode = 2;
 }
 else                           // first bit was one
 {
  if (get_bit() == 0)
  {
   mode = 3;
  }
  else                          // third bit decides
  {
   if (get_bit() == 0)
    mode = 4;
   else
    mode = 5;
  }
 }

 // only one mode explained. The rest is either analogous or depends on your codes
 switch (mode)
 {
  case 2:
  {
   int distance = get_bits(9);

   // as we assume a forward cruncher, this decruncher references the past
   // and writes into its actual future.
   pointer back = Actual_Address - distance;

   // we have already written out these bytes, so we can just get them.
   // if we are decrunching into memory, that's child's play, but if we write
   // into a file, you really should buffer things up. As you have a
   // certain scan-reach, you should buffer exactly the amount of this
   // reach for faster access. After processing you can then increment
   // your pointer and write the first bytes of your buffer to the
   // destination file. You will have to flush your buffer before you
   // exit the decruncher!!!
   // Here decrunching into memory is assumed.
   for (int t = 0; t < 2; t++)
   {
    char c = *back;             // get char
    output(c);                  // output it (plus incrementing Actual_Address)
    back = back + 1;            // increment pointer
   }
   break;
  }
 }
}

The Scan-Method i showed you is of course pretty slow. Every time you scan thru
an 8 kbyte area you read the bytes x-thousand times. A lot of tricks were invented
to ease this problem.
One trick is to save the byte following the actual byte. When you then scan into
your past and you get to the same byte, you can immediately check whether the
second byte is also equal. If it is, you will at least have a 2crunch. Example:
 
              *
abcd12345a6789abcd
             ^
   
Scanning thru your memory with this trick would give you a
Crunch-Second-Guess-Byte with the value 'b' in this case.
Now after some scanning we reach the following situation:
              *
abcd12345a6789abcd
         ^
We reached an 'a', so we normally enter a compare loop - but not now. Before this
we compare our saved Guess-Byte ('b') with the second byte after the scan-byte
('6'). As the compare fails, we continue our scan without having to enter a
time-consuming loop.
The evolution of this trick is to check not the second byte but the
[Crunch-Counter]-byte. As in the normal loop it will be 1, we have exactly our
second byte. But if we already had a crunch, we save the last byte of the crunch
into this Guess-Byte. When we reach another byte signalling a possible crunch, we
check ahead the Scan-Pointer[Guess-Byte]-byte and see if it's worth it.
When you think it over, you realize that we can then exterminate the Guess-Byte
again and just compare to the [Crunch-Counter]-offset. Example:
abcdefg..abcX..abcd..abcdefg
Now, when we reach the following situation:
                     *
abcdefg..abcX..abcd..abcdefg
                    ^
     
We start to scan into our past. The first promising compare comes up:

                     *
abcdefg..abcX..abcd..abcdefg
               ^

We now compare the ^[Crunch-Counter]-byte with the *[Crunch-Counter]-byte and as the
Crunch-Counter is 1 we compare 'b' with 'b', so we continue.
Our loops stops with a run of 4. We now set our Crunch-Counter to 4 and continue
our scan. Then the next 'a' appears:

                     *
abcdefg..abcX..abcd..abcdefg
         ^
We now again compare the ^[Crunch-Counter]-byte with the *[Crunch-Counter]-byte (should
really be saved into a register) and see, we compare a 'X' with a 'd'. As the compare
fails we just ignore this byte and continue our scan into our past. Soon we reach the
last 'a':
                     *
abcdefg..abcX..abcd..abcdefg
^
When we now compare the [Crunch-Counter]-bytes we have a positive flag and can begin
to crunch.
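
In code the trick boils down to one extra compare before the inner loop. The
following variant of the brute-force scanner from before is my own sketch of it:
a candidate can only beat the current best match if its [Crunch-Counter]-offset
byte matches too, so that byte (and the first byte) is tested before entering the
time-consuming loop:

/* like find_match, but with the check-ahead trick described above */
long find_match_guess(const unsigned char *buf, long pos, long len,
                      long window, long *dist)
{
    long best_len = 1, best_dist = 0;
    long start = (pos >= window) ? pos - window : 0;

    for (long back = pos - 1; back >= start; back--)
    {
        if (buf[back] != buf[pos])
            continue;                         /* not even a one-byte hit    */
        if (pos + best_len < len && buf[back + best_len] != buf[pos + best_len])
            continue;                         /* cannot beat the best match */

        long run = 0;                         /* only now the full compare  */
        while (pos + run < len && buf[back + run] == buf[pos + run])
            run++;

        if (run > best_len)
        {
            best_len = run;
            best_dist = pos - back;
        }
    }

    *dist = best_dist;
    return best_len;
}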
 

Another method of speeding up the main-scanning alg is to implement a jump-table.
The table would look like this (excerpt for the values 'a' to 'd'):
[a] [b] [c] [d]
 0   4   4   4
for the following source:
abcd.bcd...
The DISTANCE between the first 'b' and the second 'b' is stated in this table. 
Now our scan algorithm would just have to look at the actual byte, look it up
in the table and if the value is > 0 send the scan directly there. If the
value IS 0, we have a garbage byte. Sounds pretty fast, doesn't it? Yes it is.
But the building of such a table is a little bit difficult. I'll explain ONE way
of creating such a table. For the sake of clarity i will explain it for BYTES
but it's not so difficult to enhance this mechanism for WORDS.
Ok, what do we need? Watson?
Here, Sir. Well, i think we need something the size of our scan-area, namely the
distance-table. And then i would think that for constructing it a 256-entry table
would be useful.
Very good, Watson. That's exactly what i wanted to hear. There will be just
a little change, but don't sweat it.
Yes, we have a table for the scanner and
for constructing we have a temporary table for the 256 possible Byte-possibilities.
Now, what do we do?
Assume we have the following source:
abcd..bbc
and we want to create our jump-table. 
At first we would initialize all tables with zeros. Then, in a loop, we read a byte
and from the 256-table we read the last address. We subtract this from the actual address
and, if the entry in the 256-table was not 0x00000000 before, we write this difference
into our jump-table.
Applied to the example, this would be something like:
012345678 (Address)
abcd..bbc (Source)
We read 'a'. We look up 'a' and come up with 0x00000000 as we just initialized the table.
Our actual address is 0x00000000, so the difference is also 0x00000000. As our scanreach
is just under 8 KByte, we can save the difference as short (16 bit). So our tables
would look like:
char 256-Bytetable  Adresspointer Jumptable
a 0x00000000  0x00000000 0x0000
b 0x00000000    0x0000
c 0x00000000    0x0000
d 0x00000000    0x0000
. 0x00000000    0x0000  ('.' just for the clarity)
. 0x00000000    0x0000  ('.' just for the clarity)
b      0x0000
b      0x0000
c      0x0000

We compute the next char: 'b'.
The Adresspointer is now 0x00000001. We compute the difference to the stored
0x00000000, but as the entry was zero before, we don't write it into our jump
table. This is to prevent a jump to an address where the actual char should be,
but isn't. Imagine we would write a jump-table entry with the calculated offset.
We would then indicate that the next 'b' in reach would be 0x0001 bytes in our
past. But when you look at address 0x00000001 - 0x00000001 = 0x00000000 you see
an 'a' there and not a 'b'. Therefore we only write into the jumptable when we
already have an offset. This only happens until the char is reached for the
first time.
The tables would look like this now:
char 256-Bytetable  Adresspointer Jumptable
a 0x00000000  0x00000001 0x0000
b 0x00000001    0x0000
c 0x00000000    0x0000
d 0x00000000    0x0000
. 0x00000000    0x0000  ('.' just for the clarity)
. 0x00000000    0x0000  ('.' just for the clarity)
b      0x0000
b      0x0000
c      0x0000
Now with 'c', 'd' and the noninteresting '.' (just imagined as 
random bytes) we have the same situations. After processing the last '.'
we have the following table:

char 256-Bytetable  Adresspointer Jumptable
a 0x00000000  0x00000005 0x0000
b 0x00000001    0x0000
c 0x00000002     0x0000
d 0x00000003     0x0000
. 0x00000004     0x0000  ('.' just for the clarity)
. 0x00000005    0x0000  ('.' just for the clarity)
b      0x0000
b      0x0000
c      0x0000

And now we come to the next char: 'b'. We look it up in our table and we find a
non-zero entry. We get 0x00000001 and we have an actual Adresspointer of 0x00000006.
We subtract this and get a difference of 0x0005.
We write this into our jump table at the entry [Adresspointer] of our Jumptable
- 0x00000006 -,
we write the Addresspointer into our 256-Byte-Table
and we get the following states:

char 256-Bytetable  Adresspointer Jumptable
a 0x00000000  0x00000006 0x0000
b 0x00000006    0x0000
c 0x00000002     0x0000
d 0x00000003     0x0000
. 0x00000004     0x0000  ('.' just for the clarity)
. 0x00000005    0x0000  ('.' just for the clarity)
b      0x0005
b      0x0000
c      0x0000

As we continue with the next byte, we get another 'b'. We look it up in our 256-byte-table
and receive a 0x00000006. As our actual Addresspointer is 0x00000007 we get a difference
of 0x0001. And as there is a non-zero entry in the 256 byte-table we can write the
difference into the jumptable at the 7th position of our Jumptable:
 
char 256-Bytetable  Adresspointer Jumptable
a 0x00000000  0x00000007 0x0000
b 0x00000007    0x0000
c 0x00000002     0x0000
d 0x00000003     0x0000
. 0x00000004     0x0000  ('.' just for the clarity)
. 0x00000005    0x0000  ('.' just for the clarity)
b      0x0005
b      0x0001
c      0x0000
Now the next char is a 'c'. And as we can see we have a non-zero entry in our 256-table.
We take our actual Addresspointer - 0x00000008 - and
subtract the value of the entry   - 0x00000002 - and
finally come up with a jump of:         0x0006.
We write the entries down and have our final jump-table ready for use:

char 256-Bytetable  Adresspointer Jumptable
a 0x00000000  0x00000008 0x0000
b 0x00000007    0x0000
c 0x00000008     0x0000
d 0x00000003     0x0000
. 0x00000004     0x0000  ('.' just for the clarity)
. 0x00000005    0x0000  ('.' just for the clarity)
b      0x0005
b      0x0001
c      0x0006
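
A sketch of building such a jump table in C (my own illustration; a separate
'seen' flag is used instead of treating address 0x00000000 as 'never seen',
which avoids the corner case at the very first byte):

#include <string.h>

/* last[c] remembers the last position at which byte c appeared (the
   256-entry table), jump[i] receives the distance from position i back
   to the previous occurrence of the same byte (0 = none = garbage
   candidate). Distances fit into a short as long as the scan reach
   stays under 8 KByte, as assumed in the text. */
void build_jump_table(const unsigned char *buf, long len, unsigned short *jump)
{
    long last[256];
    char seen[256];

    memset(seen, 0, sizeof(seen));

    for (long i = 0; i < len; i++)
    {
        unsigned char c = buf[i];

        if (seen[c])
            jump[i] = (unsigned short)(i - last[c]);
        else
            jump[i] = 0;              /* first occurrence: no jump        */

        last[c] = i;                  /* update the 256-entry byte table  */
        seen[c] = 1;
    }
}

For the source abcd..bbc this produces the non-zero jump entries 5, 1 and 6 at
the positions of the second 'b', the third 'b' and the second 'c', just like the
table above (with the small difference that two real '.' bytes would also
reference each other, while the text imagines them as distinct random bytes).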

Now, how do we search with this kind of table? Let's do it together. You will be
amazed: We initialize our Crunch-Counter with 1 and our Crunch-Distance with 0.
Then we get our first entry FROM THE JUMP_TABLE!!! We fetch a 0x0000 and
immediately write out garbage. Now that's fast!!! We increment our Jump_Table
offset by the number of the Crunch-Counter (in that case 0x0001) and continue.
The bytes 'b', 'c', 'd', '.', '.' are also written out as garbage. And now we
enter our first non-zero entry. We get a 0x0005. We install copies of our actual
pointers, subtract 0x0005 from the copies, subtract 0x0005 from the
Jump_Table-Offset copy and start the second compare-loop. Our Jump_Table copy
would point to the 1st offset in the jump_table, our address-pointer copy would
point to the 1st byte of the area we are crunching.
Unfortunately the compare fails immediately after the first byte. So we look at
the entry at the now-actual Jump_Table position: 0x0000. Garbage. We return,
write out one byte of garbage, return to our normal pointers, increment the
Jump_Table pointer by the number of the Crunch-Counter (1 in that case) and
continue. We reach a further non-zero entry. We install copies of our pointers,
subtract the fetched value from them (0x0001) and let a compare-loop run.
Our copies would look at address 0x00000006 and 0x00000007.
Unfortunately we fail after the first byte.
Now we fetch the Jump_Table-offset at the now-actual byte and we get a 0x0005.
We jump further 0x0005 bytes into the past and start a compare loop.
Our copies would look at address 0x00000001 and 0x00000007.
Now we fail after two bytes. Our Crunch-Counter should now be two.
We save the distance to the real actual address as well (0x0006).
After the outer-loop finished we have a two byte-crunch ready.
Now as you can see this kind of comparing is a zillion times faster than the one
explained before. But it is most important that you understand how the basic
mechanism was implemented to understand the following improvements.
There is just one thing to be explained now:
We have a Jump_Table. We scan thru our file using this Jump_Table. How can i keep
this table as small as my scanreach is?
Well, the answer is: swapping!
When you scan thru the file and you create a table you want to use the whole reach 
of the table, right?
Imagine you would use a table with a reach of 8 bytes. You would have a reach of
8 bytes until you come to position 7. When you then go to address 8 you have to
recreate your table. Now, there is nothing wrong with this. But imagine that on
position 8, when scanning thru the bytes, you should have a reference to bytes on
position 6. But you haven't, because you cut all references to the past and create
a new 8-byte wide area. This would work, but you would lose some crunches. And
there is an easy way of staying in the flow of references: You double your
scanreach in the Jump_Table! With this trick you still have all references without
any problems.
And now you can do your swapping easily: When you reach the final end of the
Jump_Table area you copy the upper half into the lower half of the Jump_Table
memory and create the upper half anew:
(Source viewed in 8 byte blocks)
0  1    2       3
0123456712345670123456701234567.....
 

(jumptable)
0       1           1       2                2       3
0123456701234567 becomes  0123456701234567 becomes 0123456701234567 becomes...
 
You only have to set your Jump-Table pointer into the lower half of the table again.
With the next increment it will enter the upper half again and will have a full
reach ready.
Phew! That was a long essay. But i hope i could bring the subject a little nearer to you. 
Next time i'll explain LZ78 and LZW, the basic algorithm of RAR!
 
Till then,
Joa

P.S.
(delran: Your sorting idea is welcome and i will see if i can come back to
this idea when i explain the Burrows-Wheeler-Transformation)
 
[Beginners' Corner]
                 
[Q&A]
 
[Website Introduction]
 
 
[Magazine Mailbox]
Submissions: discoveredit@china.com
Questions: discoveranswer@china.com
Moderator: programhunter@china.com