Exploration Magazine (探索杂志)
Knowledge Sharing / Resource Sharing / Material Sharing
[Publication Date] 2000-10-18
[Issue] Supplement Issue 15
[Website] http://programhunter.com
[Copyright Notice]
This magazine is edited, produced, and published by Program Hunter. It may be freely reposted, distributed, and circulated. No individual or group may modify its appearance or content without the author's permission. The right of final interpretation belongs to Program Hunter.

[Editor's Note]

    
   {~._.~} 
    ( Y )  
   ()~*~() 
   (_)-(_) 
This supplement issue introduces three English articles on cracking methods and basic knowledge. If your English is not good, there is no way around that - don't rely on other people, believe in yourself and study hard; there is no other way. I hope everyone studies these three articles carefully; I believe they will be a great help in raising your cracking skills - take a look if you don't believe it. The next supplement will introduce another three articles, eight in total. Don't be greedy!
 
[Contents]
    Cracking Insights
      ……Part I: Introduction  (Joa Koester)
      ……Part II: Probability and Crunching (Huffman isn't the only way)
      ……Part III: Adapting to the sources (Huffman revisited)
    Beginner's Corner
    Q&A
    Website Introductions
    Magazine Mailbox
 
[Cracking Insights]
         Little essay about the various methods and viewpoints of crunching.

                                    Part I: Introduction


By Joa Koester (JoKo2000(at)hotmail(point)com) 
This essay is free to be copied and read and spread as long as
you are decent enough to leave my name in it.
If you have corrections, comments, better examples than the ones used 
in this essay, etc. - drop me a line.


But on with the essay...

When I recently visited everser+'s page on the net, I was amazed by the
knowledge packed together there, by the philosophy and the attitude contained
in the writings. But I missed the **other side** a little bit.
That is, us programmers, condemned to write software for banks,
insurance companies etc., so they can make a lot of money ripping
a lot of people off. These companies are often serious about data
hiding and are always eager to have their delicate data secured.
One way of securing data is crunching (packing) it. This has two valid
points:

 - the data is secure (because the crunching algorithm is not
       made public)
 - the data is more compact, meaning it can be transferred more easily;
       the risk of a disk producing a read/write error and vaporising
       personal data is definitely lower when the data only
       takes 50 KByte on a disk rather than 200 KByte (of course,
       if a read/write error happens exactly in these 50 KByte,
       the file is also gone ;)

This brings us to the question:

WHAT IS CRUNCHING?


Well, a pseudo-mathematical answer would be:
everything that reduces a certain amount of data in its
SPACE, but not in its INFORMATION (lossless crunching,
that is; there are also some quality- (information-) reducing
algorithms out there - JPEG, for example).


So we have some data IN:

AAAAACCCCCDDDDDDCCCCCefg


and we have a black box:

  /-----------\
  | Holy crap |
  | happening |
  |   here    |
  \-----------/

and we have more or less unreadable data OUT:

@$/)!%A3yfg


So, what's the principle of the black box?
Is there one Super-Algorithm that you are not allowed
to know ("Behold my son, this knowledge is not
for your eyes")?

Of course not. There are principles. And there are lots
of algorithms, not just one.
And you ALREADY know a lot of these principles
from everyday life.


When you have to cross a busy road, you more or
less wait for the traffic light to turn green. You stand
there and see a red light. Maybe the light is shaped like
a standing person or a cross or something else. But it is red,
and you already know that this means: hey, stop, don't go.
If you go, you are violating at least three different traffic laws,
obstructing the cars and provoking the next world war. Besides,
you put your life in danger ;)
And all this information is in just the little red light on
the other side of the street.
Are all red lights this way? No.
If you, for instance, are a musician and you are about to
record a song, you will press record & play on your
recorder and a RED light will show up, telling you that
you had better not make thousands of mistakes and record
what you are doing properly. The red light can be a circle,
a square, even a text, it doesn't matter. But it will be
red!!!


Dr. Watson, what do you think?

Well...
What we have here is a case of crunching:
The various pieces of information are condensed into as few different
symbols as possible. In both examples, only one symbol
(the red light) is needed to get the MEANING (the information)
transmitted.

Right, Watson, right. And could we swap the information
contained in the symbols? That is, make the red light on
the recorder tell us when to stop before a traffic light?

No. They are both red lights, that's true. But the red light
in the recorder has nothing to do with crossing a road and
the traffic light has nothing to do with us recording a song.
The CONTEXT is different. The only thing that is similar is
the color.

Hm, hm.
Condensing information (many source symbols -> fewer destination
symbols) and keeping the CONTEXT in mind. Sounds pretty
good to me.

Kind of 
	switch (context)
	{
		case traffic:
			if (red_Light) {...} else {...}
			break;

		case recording_music:
			if (red_Light) {...} else {...}
			break;

		default:
			No_Condensed_Symbols();
			break;
	}
ain't it?

In all crunching we will always have something that tells
us which context we have to switch to, and because
of this we will know how the following symbol(s) are
to be interpreted.


Dr. Watson, do all interpretations depend on only one symbol?

Hm, I would say no. There may be cases where this is true,
but in most cases there is more than one symbol defining
exactly what's going on. There are crossroads where streets
lead straight ahead and to the right, and there are
crossroads where cars will drive left or straight ahead or
right. It will depend on which part of the crossroad the
car stands on, so that the traffic going straight ahead can go
while the traffic turning right still has to wait for ITS
specific traffic light to switch from red to green. Another
example would be the position of the light. If the positions
of the red light and the green light were switched,
there would be some chaos, I bet.

Sounds reasonable, Watson. You say that there are symbols
for a general context which are fine-tuned thru other symbols
defining the exact context to be interpreted?

Exactly.

But how do you think it is possible that all people
know that they have to stop on a red light and go on a green
one?

Well, I would say that they know because someone told them.
The parents, perhaps. Or they are taught so in school.


In fact, to crunch and decrunch information correctly, both
the sender and the receiver have to use the same way of
interpreting the data. Society has ruled that a red traffic
light means STOP. And so traffic lights will switch to red
when it is time to stop the traffic for one direction. And
on the other side YOU are taught by society that you had
better not run across the street on a red light, or else
you play 'Frogger' for real...
So, put in one sentence - both the sender and the receiver
use the same MODEL of data interpretation.


Dr. Watson, what if I would like to crunch the information
transmitted in the red traffic light?

This would be nearly impossible. The whole meaning of
the traffic light is already emitted in only
one symbol (the red light, I mean now). There is a point
where the amount of information can't be reduced any
more without getting complications elsewhere.
Imagine one would pack all three lights (red, green
and yellow) into one light that would change its color,
depending on its current state. OK, you would have less
metal, only one light to look at and less glass to clean.
The normal case would be condensed - not in interpretation
but in material. The routine of Green - Yellow - Red -
Yellow - Green... would stay. So far so good. But traffic
lights have the ability to send more complex signals, too.
When, for example, there is a traffic jam ahead and the
police notice this, they can (at least where I live) arrange
that the green and yellow lights blink together
to signal an upcoming jam, so that the drivers can react to
this signal. If all lights were built into one casing,
one would have to think of a special flashing / flashing
at a special speed or something like that. Not very
good. Especially for older drivers whose reaction times may
be slower - they would concentrate more on interpreting the
flashing signal than on the traffic itself, increasing the
risk of producing an accident. Another point would be the
shape of the light. A standing man in red and a walking man
in green would mean a complex shape of glass illuminated by
a complex circuitry. This would mean that if one part activated
falsely, you would have, for example, a red man standing with
one green leg walking. Very confusing, eh? So condensing one thing
beyond its information content (also known as
ENTROPY) leads to enlarging other parts, giving
them biiiig overhead. How do we know whether this process is worth
doing at all?

Well, a certain student once came up with exactly this question
and he answered it himself: it depends on the probability of
the individual symbols. If some symbols appear statistically so often
in our stream of perception (analyzing, reading buffer data, etc.)
that we can condense them enough that, even with the enlargement
of the other symbols (which are not so frequent), we get an
overall reduction, then it's worth it. The name of the student
was Huffman...
For example, you have:
aaaaaaaaaabbbbbcccdd (10 x a, 5 x b, 3 x c, 2 x d) 20 chars

then you would have 20 x 8 bits = 160 bits.

If you now would transmit the 'a' with 7 bits and the 'c' and 'd'
chars with 9 bits, you would have:
	10 x 7 =  70
	 5 x 8 =  40
	 3 x 9 =  27
	 2 x 9 =  18
	         ===
	         155 bits

So we would save 5 bits. Not much, but better than nothing.
Of course you have to come up with an optimized coding for
these values, as you wouldn't want to work out by hand which
char should be coded with which number of bits without clashing
with the handling of the other chars. But Huffman found a perfect
algorithm giving you the optimized value table for your chars
(but this is for the next part ;)
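Just to make the arithmetic above concrete, here is a minimal C sketch (an addition of mine, not from the original essay; the frequencies and the assumed code lengths are simply the ones from the example):

#include <stdio.h>

/* Minimal sketch: given the symbol frequencies and an assumed code
   length per symbol, compute how many bits the crunched stream would
   take compared to plain 8-bit storage. */
int main(void)
{
    const int freq[]    = { 10, 5, 3, 2 };  /* counts for a, b, c, d    */
    const int codelen[] = {  7, 8, 9, 9 };  /* assumed bits per symbol  */
    int plain = 0, crunched = 0;

    for (int i = 0; i < 4; i++) {
        plain    += freq[i] * 8;
        crunched += freq[i] * codelen[i];
    }
    printf("plain: %d bits, crunched: %d bits, saved: %d bits\n",
           plain, crunched, plain - crunched);  /* prints 160, 155, 5   */
    return 0;
}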



To condense the intro a little bit:

- Crunching is a way to transmit a certain amount of information
    in fewer symbols than normally needed.
- How you crunch/decrunch depends on the actual context of
	the data actually received.
- Both the sender and the receiver will build up the same
	way of interpreting the data, building up the same model.
- When transforming long information symbols into shorter packages of
	symbols and thus reducing the output, we will face the case
	that some (hopefully rare) symbols get transformed
	into LONGER symbols. If we have totally random data, the crunching
	result is also totally random, making our efforts nil.
	That is, BTW, the reason why packing an already packed zip or rar
	file is in almost all cases useless - the data written by those
	packers is nearly perfectly random.


I hope you enjoyed this intro. If you signal me that this was
understandable and that you want more, you will get more - I promise.

Greetings from a programmer

Joa




 Little essay about the various methods and viewpoints of crunching.
Part II: Probability and Crunching (Huffman isn't the only way)
 
By Joa Koester (JoKo2000@HotMail.Com) 
This essay is free to be copied and read and spread as long as
you are decent enough to leave my name in it.
If you have corrections, comments, better examples than the ones used
in this essay, etc. - drop me a line.

OK, let's dive into the world of crunchers and munchers...
In the intro I told you that crunching is a process of reducing
the SPACE data needs to be stored in, but not the MEANING (the
information transmitted in the symbols).
The term 'entropy' is used here much as in physics: the higher the
entropy of a source, the MORE information its symbols carry on
average - and the less redundancy there is left to squeeze out.
Symbols that carry little information (because they are very
predictable) are mostly redundancy - meaning a big 'crunch me,
crunch me' for us. So we should concentrate on those redundant,
low-information symbols. How we calculate an entropy for our
purposes depends on the data we want to crunch
and on the algorithm we want to crunch it with.
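For the curious, here is a small sketch (mine, not the author's) that computes the standard Shannon entropy, -sum(p * log2(p)), for the aaaaaaaaaabbbbbcccdd example from the intro. The result, about 1.74 bits per symbol, is the lower bound any lossless cruncher has to respect:

#include <stdio.h>
#include <math.h>   /* link with -lm on some systems */

/* Shannon entropy of the intro example (10 x a, 5 x b, 3 x c, 2 x d). */
int main(void)
{
    const double freq[] = { 10.0, 5.0, 3.0, 2.0 };
    const double total  = 20.0;
    double entropy = 0.0;

    for (int i = 0; i < 4; i++) {
        double p = freq[i] / total;
        entropy -= p * log2(p);
    }
    printf("entropy: %.3f bits/symbol -> at least %.1f bits for 20 symbols\n",
           entropy, entropy * total);   /* about 1.743 and 34.9 */
    return 0;
}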
I also told you that a certain student called Huffman once came
up with an algorithm to compress data. His idea is based on the
fact that in data (especially speech) some symbols happen
to occur more often than others. Here is one overstated example:
aaaaaaaaaabbbbbcccdd
What we have here is an algorithm based on probability. That means
that some symbols are not transmitted the way they used to be,
because the algorithm will create new symbols according to the
probabilities of the old ones. Some of these new symbols will be
shorter. And the less frequently appearing symbols will be transformed
into longer symbols:
Before:
a = 61h = %0110 0001
b = 62h = %0110 0010
c = 63h = %0110 0011
d = 64h = %0110 0100
and 
after:
????
How do we compute such a new symbol table? OK, before we explore
this algorithm let me change the data from ASCII to the lower
nibble. That should be enough to explain the algorithm:
Before:
a = %0001
b = %0010
c = %0011
d = %0100
and we have the probabilities (frequencies):
a = 10
b =  5
c =  3
d =  2
   ===
    20 <- Overall value. Important when calculating entropy
To save this original sequence to disk we would need: 
10 * 4 = 40
 5 * 4 = 20
 3 * 4 = 12
 2 * 4 =  8
        ===
         80 bits

Let's crunch this:
The original algorithm works with a tree built up from the ground
to the top. You can of course try to optimize it in another way. But
here's the algorithm as Huffman intended it:
Step 1:
10   5   3   2       How often the symbol appears
                     Tree (empty at first)
10   5   3   2       The added sum of the actual part of the tree
(a) (b) (c) (d)      Symbol

Step 2:
Now we take the two lowest entries of the tree (that is, lowest in the
sum of the value) and add their values, thus giving a new knot with the
former parts as its left and right branches:
10   5   3   2       How often the symbol appears
                     Tree
10   5   3   2       The added sum of the actual part of the tree
(a) (b) (c) (d)      Symbol
         ^   ^
         ^   ^
         Two lowest values (3 and 2).
         Take them and add the values of the two actual
         parts of the tree and give the sum to a new knot. The two former
         parts become a left and a right tree branch.
10   5   3   2       How often the symbol appears
10   5     5         The tree
(a) (b)   / \
         3   2
        (c) (d)
This Step 2 repeats until no two values are left:

 a   b   c   d
10   5   3   2       How often the symbol appears
10   5     5         The tree
(a) (b)   / \
         3   2
        (c) (d)
     ^     ^
     ^     ^
         Two lowest values (5 and 5).
         Take them and add the values of the two actual
         parts of the tree and give the sum to a new knot. The two former
         parts become a left and a right tree branch.

 a   b   c   d
10   5   3   2       How often the symbol appears
10      10
(a)    / \
     5     5
    (b)   / \
         3   2
        (c) (d)
^       ^
^       ^
         Two lowest values (10 and 10).
         Take them and add the values of the two actual
         parts of the tree and give the sum to a new knot. The two former
         parts become a left and a right tree branch.
 a   b   c   d
10   5   3   2       How often the symbol appears
     20              The tree
    /  \
  10   10
  (a)  / \
     5     5
    (b)   / \
         3   2
        (c) (d)

Step 3:
We have a tree, so what?
What we now have to do is to transform the tree into
sequences of 1s and 0s. To do that we start at the root and give
- all left branches a 0
and
- all right branches a 1:
                   20              The tree
                  /  \
                 /    \
            [0] /      \ [1]
               10       10
               (a)     /  \
                      /    \
                 [0] /      \ [1]
                    5        5
                   (b)      / \
                           /   \
                      [0] /     \ [1]
                         3       2
                        (c)     (d)
(Of course you could give all left branches a 1 and all right
branches a 0. Just stay consistent with the directions, that's
all.)
So we have the following codes for our symbols:
a: %0
b: %10
c: %110
d: %111
Well, two things are to be mentioned here:
This tree is optimized in a way that you cannot produce a more
condensed one. It is not possible to create a tree that gives
more respect to all symbols in the table, calculating in whole
bits (that's no joke: there is an algorithm which calculates with
fractions of a bit and is far more powerful than Huffman.
I'll explain this, too). Try it for yourself.
Important:
The tree is value-unique (prefix-free), meaning that you can't interpret the
beginning of a certain symbol as another (shorter) symbol:
(Not unique and therefore a wrong table)
a: %00
b: %10
c: %101
d: %001
If you read a '00', what would this be? An 'a' or the beginning
of a 'd'? With this table it is not possible to answer the question,
giving only the solution of a GPF.
It is of utmost importance that a tree is completely unique.
Therefore a complete symbol cannot be the beginning of another symbol!!!
If you try to build up simple trees the unique way you'll quickly
realize that the Huffman algorithm is the method of your choice.
 
Our sequence
  aaaaaaaaaabbbbbcccdd
would now need:
10 * 1 = 10
 5 * 2 = 10
 3 * 3 =  9
 2 * 3 =  6
        ===
         35 bits. Not bad, eh?
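To see those 35 bits come out of a program, here is a minimal sketch (mine, not the essay's; it simply hard-codes the table read off the tree above instead of building it):

#include <stdio.h>
#include <string.h>

/* Encode the sample string with the code table derived above
   (a=0, b=10, c=110, d=111) and count the emitted bits. */
int main(void)
{
    const char *input = "aaaaaaaaaabbbbbcccdd";
    const char *code[128] = { 0 };
    int bits = 0;

    code['a'] = "0"; code['b'] = "10"; code['c'] = "110"; code['d'] = "111";

    for (const char *p = input; *p; p++) {
        fputs(code[(unsigned char)*p], stdout);        /* emit the code as text */
        bits += (int)strlen(code[(unsigned char)*p]);
    }
    printf("\n%d bits instead of %d\n", bits, (int)strlen(input) * 8);
    return 0;
}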
 
So that means that we can optimize codes that happen to show up
more frequently than others. Big Deal. Where does this sort of
thing happen, eh? Dr. Watson?
Here, Sir. 
Well we have a lot of interesting and probably crunchable data
in SPEECH. In a normal written english text for example the letter
'e' appears the most. If we were able to compress a normal text
using the Huffman-method we would save a lot of diskspace.
Izzatso? Well, yes, Watson is right. The letter 'e' is dominating
(not only) the English language. Here are the last two paragraphs with the
letter 'e' deleted:
---------------------------------------------------------------
So that mans that w can optimiz cods that happn to show up
mor frquntly than othrs. Big Dal. Whr dos this sort of
thing happn, h? Dr. Watson?
Hr, Sir. 
Wll w hav a lot of intrsting and probably crunchabl data
in SPCH. In a normal writtn nglish txt for xampl th ltter
'' appars th most. If w wr abl to comprss a normal txt
using th Huffman-mthod w would sav a lot of diskspac.
---------------------------------------------------------------
While some words are still easy to read, others have become
totally senseless.
All we have to do is count all the letters, sort them
by their frequency and apply the Huffman algorithm to the
table. Then we would use this tree to encode our data
anew and - voila. There we are.
But...
Yes, Watson?
In your last chapter (the intro) you mentioned that BOTH the
sender AND the receiver must have the same MODEL of data
(de)coding in order to understand each other. If we don't
transmit the Huffman-tree with the data, how does the
program on the other side know our tree?
Good question, Watson. It can't!!! Yes, if the receiver
of the message has no knowledge of the data-specific
Huffman-tree, all he will see is total garbage (and most
probably a GPF will follow).
The receiver MUST use the same tree. That means that
either one has communicated/transmitted the tree some
other way, or one uses a more general Huffman-tree,
which is of course not optimized for the specific file
we have transmitted. A third way will be explained now.
In the early days of the C64 you could tell that a program
used a Huffman-like tree when you recognized a block consisting
of all 256 possible chars, sorted in a strange way, at the
beginning of the file. That meant that the BASIS for the
complete tree (all 256 chars in their sorted order) was transmitted WITHIN
the crunched file (which of course was now the crunched size + 256 bytes ;)
The BASIS?
Yes, the basis.

Hm, I have a little problem...
What is it, Watson?
I think I got the theoretical aspect. But how do I program this
in reality? And what is this basis you are talking about?

All right.
Well, we are dealing with trees, you know. And the main thing about
trees is the link to the other parts of the tree. When you want
to program trees, you had better get acquainted with languages
supporting pointers.
A typical C/C++ struct would look like this:
struct Tree
{
    Tree* zero;   // pointer to struct for zeros
    Tree* one;    // pointer to struct for ones
    long  value;  // value of the leaf
    long  sum;    // when getting linked and
                  // becoming a knot, the sum goes here; else 0
    char  symbol; // my symbol
};
You should build a tree consisting of 258 entries, with entry
257 as special EOF-signal and entry 258 as dummy for later
services.
One of the most important steps of the alg is the counting
of the bytes. With this the whole system stands or falls.
We also have to consider the DEcoding part. We can't transmit
1024 bytes, just because ONE character has to be encoded with
10 bits, leaving 500 bytes completely zero.
And always think of it: Coder and DEcoder MUST use the same
model or else...GPF.
So, to ensure that both use the same model, the process with the
smaller data entry dictates how the coder has to work.
We decide that a table of 256 bytes, transmitted with the
crunched file, is acceptable overhead. What is in these 256
bytes, you ask? It's simple: it's the frequencies of the
symbols, scaled down to 8 bits. Why?
Well, the general problem is: how do coder and decoder know
that, for example, the symbol 'y' is to be encoded with 10
bits and the symbol 'e' with 6 bits?
The best way is to let the coder and the decoder BOTH build
up a tree. When this happens, we can be sure that both use
the same model and everybody is happy ("ha- vannagila,
havannagila, havannagila..." :-).
The only thing that both the coder and the decoder have in
common is the frequency of the symbols. So the only thing
we can transmit is the frequencies. And to save space, we
just scale them down to 8 bits. Beware of having a symbol
with a frequency of zero: it will never be encoded or decoded.
One good way of scaling down the table is the following:
Count which symbol appeared the most and put its count into maxcounter
if (maxcounter == 0) maxcounter++;
maxcounter /= 255;
maxcounter++;
in a loop, divide all entries by maxcounter
(if an entry which had a positive value before has now
been zeroed - move a 1 into it).
For example:
maxcounter is 32767;
maxcounter = 32767 / 255 = 128;
maxcounter++ = 129;
freq[0] = 1234;
freq[1] = 23456;
freq[2] = 12;
freq[0] = 1234 / 129 = 9
freq[1] = 23456 / 129 = 181
freq[2] = 12 / 129 = 0; must be made 1;
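As a minimal C sketch of that scale-down (my own wording of it; freq[] is assumed to hold the raw counts of all 256 byte values):

/* Scale the raw frequency counts down so every entry fits in 8 bits,
   keeping formerly-seen symbols alive with at least a 1. */
void scale_down(long freq[256])
{
    long maxcounter = 0;

    for (int i = 0; i < 256; i++)
        if (freq[i] > maxcounter) maxcounter = freq[i];

    if (maxcounter == 0) maxcounter++;
    maxcounter /= 255;
    maxcounter++;                       /* dividing by this yields 0..255 */

    for (int i = 0; i < 256; i++) {
        long scaled = freq[i] / maxcounter;
        if (scaled == 0 && freq[i] > 0) /* was positive, got zeroed?      */
            scaled = 1;                 /* move a 1 into it               */
        freq[i] = scaled;
    }
}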

So if we transmit the scaled-down frequencies to the decoder,
all the decoder has to do is create the Huffman-tree from the
frequencies and start decoding the input stream.
Yes. That means that both coder and decoder use the same
routines. If you can reverse the algorithm of the decoder,
you know at least half of the coding algorithm. In the Huffman
case, you'll know ALL the important stuff. This behaviour will
be found in every decrunching routine. To decrunch something,
the decruncher has to - at least partly - reverse the steps that
the cruncher did. Step successfully thru a decruncher and you
have - with a little imagination - the crunch algorithm (at least
the theoretical aspect of it) reversed. I never found a cruncher
where this was not the case, so I suspect that all packers have
this weak spot. Weak, because a lot of crunching algorithms work in
packets. That means that after some data a new cycle of crunched data
begins. If you manage to isolate a single cycle (best hint: in the
last part of the file), you can study the algorithm
while decrunching just about 10-20 bytes. Good, eh?
 

These are some steps which should give you the general idea of
creating a Huffman-tree (a compact sketch in C follows after these steps):
- Build up a table consisting of 256 longs.
- Count the bytes you want to crunch (the whole file, for example),
  i.e. count the frequency of all the bytes in your
  source. As there are only 256 possibilities, there are no probs.
- Scale the frequencies down to 8 bits.
- Build up a second array consisting of 256 chars. Fill it
  with the chars 0...255. You will have the char 0 in entry [0],
  char 1 in entry [1], the char 'A' in entry [0x41] and so on.
- Sort the frequency table, but when you swap values,
  also swap the values in the second array holding the
  chars. With that you'll create a sorted model of your input
  data.
- After sorting, proceed as follows:
   - create an array of 257 entries of type Tree.
     (If you have another service ready, use 258.)
   - fill the .value of each entry with the frequency of the
     symbol. Fill the .symbol with the entry of the symbol char
     array. Fill the .zero and .one with NULL to mark them
     as leaves. This will separate them from the knots, where
     there will be the addresses of your next branches/leaves.
     Fill .value of entry[256] (the 257th) with 1, as you will
     only want one EOF, right?
   - recursively build up the tree. Don't forget that you
     may have to add a last knot to link the last two partial trees
     together. One way to do this:
     - Have a counter set to 257.
     - repeat:
     - take the two entries with the lowest values.
     - add up their values.
     - create a new entry of type Tree.
     - link the two entries to the new tree.
     - give the .sum of the new entry the added value.
     - delete the two entries from the list (NOT FROM THE MEMORY) and add the new one.
     - reduce the counter by 1.
     - resort the trees by the criterion .sum.
     - while counter > 1
   - create a new array with the struct:
     char symbol, long bitsvalue, long number_of_bits_to_be_written
   - recurse thru the tree to get to every char and fill up the new array.
     For every char you reach, fill in the entries for symbol, bitsvalue and
     number_of_bits_to_be_written.
   - nearly finished.
   - Your encoding array is ready. All you have to do now is read a char from the input
     and write the corresponding bit value to the output stream with the number of bits
     given by the array entry. Send the EOF at the end and you are done.
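Here is the compact sketch promised above. It is my own condensed version of these steps, shrunk to the four-symbol example (no EOF entry, no scaling, indices instead of pointers); ties and the left/right choice may come out differently than in the worked example, but the code lengths are the same:

#include <stdio.h>

#define NSYM 4

struct Node {
    long sum;               /* frequency, or sum of the two children */
    int  zero, one;         /* child indices, -1 for a leaf          */
    char symbol;
};

static struct Node nodes[2 * NSYM];   /* leaves plus inner knots     */
static int nnodes;

/* walk the finished tree and print the bit string for every symbol */
static void walk(int n, char *code, int depth)
{
    if (nodes[n].zero < 0) {                  /* leaf reached         */
        code[depth] = '\0';
        printf("%c: %%%s\n", nodes[n].symbol, code);
        return;
    }
    code[depth] = '0'; walk(nodes[n].zero, code, depth + 1);
    code[depth] = '1'; walk(nodes[n].one,  code, depth + 1);
}

int main(void)
{
    const char sym[NSYM]  = { 'a', 'b', 'c', 'd' };
    const long freq[NSYM] = { 10, 5, 3, 2 };
    int  alive[2 * NSYM];            /* entries still waiting to be linked */
    int  nalive = NSYM;
    char code[32];

    for (int i = 0; i < NSYM; i++) {
        nodes[i].sum = freq[i]; nodes[i].zero = nodes[i].one = -1;
        nodes[i].symbol = sym[i];
        alive[i] = i;
    }
    nnodes = NSYM;

    while (nalive > 1) {
        /* find the two entries with the lowest sums */
        int lo1 = 0, lo2 = 1;
        if (nodes[alive[lo2]].sum < nodes[alive[lo1]].sum) { lo1 = 1; lo2 = 0; }
        for (int i = 2; i < nalive; i++) {
            if (nodes[alive[i]].sum < nodes[alive[lo1]].sum) { lo2 = lo1; lo1 = i; }
            else if (nodes[alive[i]].sum < nodes[alive[lo2]].sum) lo2 = i;
        }
        /* link them under a new knot carrying the added sum */
        nodes[nnodes].sum  = nodes[alive[lo1]].sum + nodes[alive[lo2]].sum;
        nodes[nnodes].zero = alive[lo1];
        nodes[nnodes].one  = alive[lo2];
        /* remove the two from the list, add the new knot */
        alive[lo1] = nnodes++;
        alive[lo2] = alive[--nalive];
    }
    walk(alive[0], code, 0);          /* the root is the last surviving entry */
    return 0;
}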
For decoding, the process is similar:
Create the tree. (That's the similar part.)
unsigned char symbol = 0x00;
startknot = knot(0)       // the root of the tree (of type Tree)
knot = startknot          // (of type Tree)
while (symbol != EOF)
   clear symbol
   knot = startknot       // start at the root for every new symbol
   while (not symbol)
      get bit
      if ( (bit == 0) && (knot->zero != NULL) )
         knot = knot->zero
      else
         if (knot->one != NULL) knot = knot->one
      if (knot->one == NULL)  // is enough; knot->zero is NULL, too
         // symbol found. Great. To the output with it
         symbol = knot->symbol
      end if
   end while
   if (symbol != EOF) Do_Output_Char(symbol)
end while
Done.
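And a matching decode sketch in C (again my own, built on the index-based Node array from the sketch above; get_bit is a hypothetical callback that returns the next input bit):

/* Walk the tree bit by bit until a leaf is reached, then return its
   symbol - the loop form of the pseudocode above. */
static char decode_symbol(const struct Node nodes[], int root,
                          int (*get_bit)(void))
{
    int knot = root;

    while (nodes[knot].zero >= 0) {           /* still on a knot? */
        if (get_bit())
            knot = nodes[knot].one;
        else
            knot = nodes[knot].zero;
    }
    return nodes[knot].symbol;                /* leaf: the symbol */
}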

        
Hm, I have another little problem...
What is it, Watson?

I did as you told me. I took a table of 256 longs, counted the 256 chars,
sorted them and let the computer calculate an optimized Huffman-tree.
But all I got was a tree with all leaves having a length of 8 - all chars
are encoded with 8 bits and there is no crunching in that. What happened?
Ah, Dr. Watson,
you did not remember my sentence about entropy. This file you examined
had all byte values nearly equally spread over the length of the file.
So - statistically speaking - the file is equalized. No surprises here. With
a Huffman-tree that considers the whole file you won't get any
crunching here, I fear.
But what do I do now?
Well, there are three scenarios that will satisfy your desire:
- We use a different algorithm
- We use a brutal tree
- We use an adapting technique
??? Tell me about the brutal tree.
OK, I got this one from the 1001 packer on the C64 (anyone ringing a
bell? 1001 Crew? No? Where have the old days gone...)
They used a tree I can only label a 'brutal tree'. Here's the idea:
When we have data that is so evenly distributed that no
Huffman-tree will help us, we help ourselves. We take all the
byte counters and sort them. Then we create a table of the 256 chars
ordered by the sorted counters, so that the most frequent
chars appear at the top. Now we have the 256 possible bytes sorted.
So far so good.
Let's think about a byte. How do we read a byte? Isn't it:
#33 = $21 = %0010 0001  ?
What if we changed our view of the binary to a 2 + 6 split:
#33 = $21 = %00 100001
Our 256 chars would now be represented as 4 blocks (00, 01, 10, 11)
of 64 entries with 6 bits each. What if we now 'huffman' those first two bits to:
0
10
110
111
So we would have %0 100001 instead of %00 100001. Every byte that
is in the first 64-byte block is coded with 7 bits, while the others
are coded either with 8 or with 9 bits. We can only hope that there
are enough bytes in the file from the first block to outweigh the chars
coded with 9 bits. But it is amazing how well this little trick works. And
the speed! It's so easy to implement and it is so fast. Even on the C64 it
took only fractions of a second to decode a program as big as the main memory
area (other packers took several seconds for the same size).
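Here is a minimal sketch of that 2 + 6 trick (my own illustration, not the 1001 code; rank[] is assumed to map a byte to its position 0..255 in the frequency-sorted table, and put_bits here just prints the bits so the sketch stays self-contained):

#include <stdio.h>

static void put_bits(unsigned value, unsigned nbits)
{
    while (nbits--)
        putchar((value >> nbits) & 1 ? '1' : '0');
}

/* The top two bits of the rank pick one of four 64-entry blocks and are
   replaced by the prefix codes 0, 10, 110, 111; the low six bits go out
   unchanged -> 7, 8 or 9 bits per byte. */
static void encode_byte(unsigned char c, const unsigned char rank[256])
{
    static const unsigned block_code[4] = { 0x0, 0x2, 0x6, 0x7 };
    static const unsigned block_bits[4] = { 1,   2,   3,   3   };

    unsigned r     = rank[c];
    unsigned block = r >> 6;      /* which 64-entry block */
    unsigned low   = r & 0x3F;    /* the remaining 6 bits */

    put_bits(block_code[block], block_bits[block]);
    put_bits(low, 6);
    putchar(' ');
}

int main(void)
{
    unsigned char rank[256];
    for (int i = 0; i < 256; i++) rank[i] = (unsigned char)i; /* dummy order */

    encode_byte(0x21, rank);   /* rank 0x21 -> block 0 -> 7 bits: 0 100001  */
    encode_byte(0xFF, rank);   /* rank 0xFF -> block 3 -> 9 bits            */
    putchar('\n');
    return 0;
}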

Wow, cool. But what about your different algorithms
and this 'adapting technique' ?

Well, this is another story and will be told in the next chapter...
 

I hope you enjoyed this chapter.
There will be more - i promise.
Just have a little patience.
Greetings from a programmer
Joa
 Little essay about the various methods and viewpoints of crunching.
Part III: Adapting to the sources (Huffman revisited)
 
By Joa Koester (JoKo2000@HotMail.Com) 
This essay is free to be copied and read and spread as long as
you are decent enough to leave my name in it.
If you have corrections, comments, better examples than the ones used
in this essay, etc. - drop me a line.

OK, let's dive into the world of crunchers and munchers...
In the last chapter Dr. Watson stumbled over the problem of big
files with so many characters in them that the frequencies of all
256 possible chars are - statistically speaking - nearly the same.
Running a Huffman algorithm on such files leads to trees with
an 8-bit encoding for each char - thus giving us no crunching
efficiency.
But on the other hand, everyone with a hex editor in his
toolbox has noticed that files are
somehow separated (chunked).
Some areas seem to hold code, others appear as vast deserts of
zeros, and again others have up- and down-counting values.
On the PC you have, for example, code and resource areas, which are
created separately and just linked together (hence the Borland
Resource Workshop 4.5).
But how do we take advantage of this circumstance?
Well, the key to our problem is the tree. Once it is created,
we are doomed with it, unable to change it, only able to
encode our data with this static, inflexible tree, right?
Not quite! If we could adjust our tree to the actual data somehow,
we would be able to generate several trees at runtime, each
tree nearly optimized for the actual data coming up. But how do
we adjust our tree? And what about the decoder? And how do we
build up the first tree?
Yes, the first tree. What about it? We are about to be flexible, right?
Could we take the pain of the first data not being perfectly encoded?
What if we could? What if we built up a tree with all
frequencies set to one? Each symbol would be written with 8 bits.
But the more we read from the file, the more we would adapt
to the data read and the better we could cope with sudden changes in
the data flow.
So the basic idea would be:
Cruncher:
initialize_tree_with_all_frequencies_to_1
while not eof
{
    read char c
    encode c, Output
    frequencyarray[c]++
    adapt tree, c
}
Decruncher:
initialize_tree_with_all_frequencies_to_1
Mychar = 0
while Mychar != EOF
{
    Mychar = decode(Input)  // read bits until the next symbol is complete
    put Mychar, Output
    frequencyarray[Mychar]++
    adapt tree, Mychar
}
This would work and would be stable. Unfortunately, even programmed
in assembler we would have to wait too many hours to crunch a simple
file. The resorting of the tree after each byte read/written eats up
too many memory reads/writes. Imagine crunching a file like
Winword '95 (Ver 7.0) with its 3.845.120 bytes. Imagine a tree struct
taking about 20 bytes in memory. Multiply that by 257 = 5140 bytes,
not necessarily stored close together or array-like (so read-ahead
memory chips can't show off their abilities). Imagine these
5140 bytes getting sorted 3.845.120 times = 1.97639168exp10 bytes
shifted thru your memory just for Micro$oft
(assuming you would have only 1 sorting step while building the tree)...

There must be a better way, right?
Yes, Watson, there is a better way. And it's not so difficult to implement
for our purposes. We have to implement a watcher!
This watcher can be an intelligent algorithm or a not-so-intelligent counter
with a conditional rebuild. Let's examine the latter method first:
As long as normal files are not bigger than 4 GByte we can use a simple
32-bit integer for our counter. Let's name it "Adjustcounter".
This counter will be incremented and tested afterwards:
Cruncher:
initialize_tree_with_all_frequencies_to_1
Adjustcounter = 0
while not eof
{
    read char c
    encode c, Output
    frequencyarray[c]++
    Adjustcounter = Adjustcounter + 1
    if Adjustcounter > 4096
        adapt tree
        set all of frequencyarray[] to 1
        Adjustcounter = 0
    end if
}
Decruncher:
initialize_tree_with_all_frequencies_to_1
Mychar = 0
Adjustcounter = 0
while Mychar != EOF
{
    Mychar = decode(Input)  // read bits until the next symbol is complete
    put Mychar, Output
    frequencyarray[Mychar]++
    Adjustcounter = Adjustcounter + 1
    if Adjustcounter > 4096
        adapt tree
        set all of frequencyarray[] to 1
        Adjustcounter = 0
    end if
}
This will reduce our horror scenario from 1.97639168exp10 bytes
down to 4.825.175 bytes swirled thru our memory. Where did I get the
4096 from, you ask? To be honest, I pulled it out of my hat. There is
no special reason for it - it's just a page size I encountered a lot,
that's all. If you experiment with this algorithm, try it with
other sizes. Keep track of the kinds of data you crunch and
which re-adjusting size gets you which results.
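For completeness, the counter method above as a C sketch (my own; rebuild_tree and encode_symbol are hypothetical helpers standing in for the tree-building and bit-writing routines from the last chapter):

#include <stdio.h>

/* hypothetical helpers, not shown here: */
void rebuild_tree(const long freq[257]);   /* build a Huffman-tree from freq[]     */
void encode_symbol(int symbol);            /* write its code with the current tree */

#define REBUILD_EVERY 4096                 /* the experiment-with-me constant      */

void crunch(FILE *in)
{
    long freq[257];
    long adjustcounter = 0;
    int  c;

    for (int i = 0; i < 257; i++) freq[i] = 1;   /* all frequencies to 1 */
    rebuild_tree(freq);

    while ((c = fgetc(in)) != EOF) {
        encode_symbol(c);                  /* with the CURRENT tree      */
        freq[c]++;
        if (++adjustcounter > REBUILD_EVERY) {
            rebuild_tree(freq);            /* adapt to the recent data   */
            for (int i = 0; i < 257; i++) freq[i] = 1;
            adjustcounter = 0;
        }
    }
    encode_symbol(256);                    /* the special EOF symbol     */
}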

The other way would be an intelligent algorithm watching the
input stream and giving the alarm when the data rendered in the tree
is outdated by the incoming new data. What could such an algorithm
look like? Well, there are a lot of possibilities. One basic
mechanism would be:
- Have the 'best' few codes of our tree ready.
- Check the incoming byte.
- Update a second statistic.
- Compare it with the 'best' codes of our tree:
  if the second statistic has no more 'best' bytes
  in common with the tree's 'best' bytes, then raise the alarm and adapt.
Let's examine the steps:
- Have the 'best' few codes of our tree ready.
That's easy: we just iterate thru our tree and keep track
of the best, say, 10 codes in a little array.
As we execute this routine only when we (re)build our tree,
it can be written a little bit more sloppily
(always remember: about 80% of the time is spent
in only 20% of the code, so don't optimize where
it is not appropriate).
- Check the incoming byte.
Well, that's easy, too. After we have received the byte from
the I/O routine, we have it ready for dinner.
- Update a second statistic.
Hm, that's not so easy. I would advise a shuffle algorithm.
For this we initialize a 256-byte statistic array with
the corresponding char values (array[0] = 0x00; array[1] = 0x01 etc.)
When the first byte is read, we locate it
in this array WITH A LOOP!!!
Here one example (array shortened to 4 entries):
array[0] = 0x00
array[1] = 0x01
array[2] = 0x02
array[3] = 0x03
incoming data: 0x02. Now we look for the byte in our array, 
fetching the INDEX. The char 0x02 is found at the position
2 in the array. Now we shuffle all bytes below one position
up:
array[0] = 0x00  >- keeps its value
array[1] = 0x00  >- gets overwritten
array[2] = 0x01  >- gets overwritten
array[3] = 0x03
Then we patch array-pos[0] with the char we wanted to find (0x02):
array[0] = 0x02  >- patched
array[1] = 0x00
array[2] = 0x01
array[3] = 0x03

Next byte could be a 0x02 again.
We would find it on position 0. We would shuffle all bytes from
position 0 till position 0 up by one (that is, in this case
the loop terminates immediately, as the condition fails
before the body of the loop can be executed) and we patch
position 0 with the byte we wanted to find (the same byte).
If you implement this algorithm, this is a point where a
little if-clause could help save CPU time.

Next byte could be a 0x03.
We would find it on position 3. All bytes are shuffled one
position up from 3 till 0:
array[0] = 0x02  >- keeps its value
array[1] = 0x02  >- gets overwritten
array[2] = 0x00  >- gets overwritten
array[3] = 0x01  >- gets overwritten
Array-Pos 0 is patched with the byte we wanted to find (0x03):
array[0] = 0x03  >- patched
array[1] = 0x02
array[2] = 0x00
array[3] = 0x01

Next byte could be a 0x02 again.
We would find it on position 1. All bytes are shuffled one
position up from 1 till 0:
array[0] = 0x03  >- keeps its value
array[1] = 0x03  >- gets overwritten
array[2] = 0x00 
array[3] = 0x01 
Array-Pos 0 is patched with the byte we wanted to find (0x02):
array[0] = 0x02  >- patched
array[1] = 0x03
array[2] = 0x00
array[3] = 0x01
etc.
This algorithm guarantees that the most recently appeared bytes
are in the first places of the array (a small sketch follows below).
With this we have our second statistic updated. Note that there is
a lot of counting and memory reading going on here. This could easily
lead to a new horror scenario. An intelligent MemCpy, taking the
adjacent bytes of the array into account, copying first in
longs, then in bytes, would be appropriate here.
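Here is the promised sketch of the shuffle-to-front statistic (my own plain-loop version; memmove stands in for the optimized copy mentioned above):

#include <stdio.h>
#include <string.h>

/* Locate the byte, shuffle everything above it one position up, patch
   position 0 with the byte. Returns the index where it was found,
   i.e. how 'recent' it was. */
static int shuffle_to_front(unsigned char array[256], unsigned char c)
{
    int i = 0;
    while (array[i] != c)                  /* locate the byte WITH A LOOP */
        i++;
    memmove(&array[1], &array[0], (size_t)i);
    array[0] = c;                          /* patch position 0            */
    return i;
}

int main(void)
{
    unsigned char array[256];
    for (int i = 0; i < 256; i++) array[i] = (unsigned char)i;

    /* the example sequence from the text: 0x02, 0x02, 0x03, 0x02 */
    const unsigned char input[] = { 0x02, 0x02, 0x03, 0x02 };
    for (size_t k = 0; k < sizeof input; k++)
        printf("0x%02X found at position %d\n",
               input[k], shuffle_to_front(array, input[k]));
    return 0;
}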

- Compare it with the 'best' codes of our tree:
  if the second statistic has no more 'best' bytes
  in common with the tree's 'best' bytes, then raise the alarm and adapt.
Now, that's easy again, after having our statistic updated
so eloquently. We just compare each of the first, say, 10 entries of
our second statistic with the best 10 tree entries. If there is no
hit any more, we rebuild the tree anew and reset the tree statistic.
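The 'alarm' check itself could look like this little sketch (mine; tree_best[] is assumed to hold the 10 most frequent symbols of the current tree, mtf[] is the shuffle-to-front array from above):

/* Returns 1 when none of the 10 most recent bytes is among the 10
   'best' bytes of the current tree - time to rebuild. */
static int tree_is_outdated(const unsigned char tree_best[10],
                            const unsigned char mtf[256])
{
    for (int i = 0; i < 10; i++)          /* at most a 10*10 loop       */
        for (int j = 0; j < 10; j++)
            if (mtf[i] == tree_best[j])
                return 0;                 /* still a hit: keep the tree */
    return 1;                             /* no hit any more: adapt     */
}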
If we calculate what amount of bytes we read/write:

3.845.120 times adding one to a frequency array   =     3.845.120 +
3.845.120 times searching in the shuffle array
  (times 32, because the most recent bytes WILL
  be near the front of the array and therefore
  be found faster)                                =   123.043.840 +
3.845.120 times shuffling the shuffle array
  (times 32*31)                                   = 3.814.359.040 +
3.845.120 times patching the first entry          =     3.845.120 +
3.845.120 times executing a 10*10 loop looking
  for one topmost byte in the other topmost
  array (statistically leaving the loop halfway,
  so only 50 steps instead of 100)                =   192.256.000 +
(plus the resorting of the arrays and the
  rebuilding of the tree; for the sake of the
  demonstration this is not calculated here)
Makes                                             = 4.137.349.120
That is not quite 4 gigabytes of memory reading and writing here. Is this
worth the effort? You decide. I prefer the counter method, but you'll
notice that the shuffle-to-front algorithm can be useful for more than
this kind of work. Indeed, we'll meet it again when we come to
Burrows-Wheeler compression.

I have a question.
Yes, Watson?
This adaptive technique - is it only used in combination with the
Huffman algorithm?
No, the principle of making a statistical algorithm more flexible 
thru the adaptive approach is not limited to Huffman. When we come
to the arithmetic crunching technique we'll come back to the
adaptive approach again, believe me.
 
But the next things we will discuss are some non-statistical things.
Till then... happy cracking :-)
 
I hope you enjoyed this chapter. 
There will be more - i promise.
Just have a little patience.
Greetings from a programmer
Joa
[Beginner's Corner]
                 
[Q&A]
 
[Website Introductions]
 
 
[Magazine Mailbox]
Submissions mailbox: discoveredit@china.com
Questions mailbox: discoveranswer@china.com
Moderator's mailbox: programhunter@china.com