EBPIG
6̽Ë÷ÔÓÖ¾6
MHJDQ
֪ʶ¹²ÏíJ×ÊÔ´¹²ÏíJ×ÊÁϹ²Ïí
¡¾·¢ÐÐʱ¼ä¡¿2000-11-01
¡¾ÆÚ¿¯ºÅÂë¡¿Ôö¿¯Ê®Æß
¡¾ÍøÕ¾µØÖ·¡¿http://programhunter.com
¡¾°æȨÉùÃ÷¡¿
´ËÔÓÖ¾ÓɳÌʽÁÔÈ˱༭¡¢ÖÆ×÷¼°·¢ÐУ»ÔÓÖ¾¿ÉÒÔ×ÔÓÉתÔØ¡¢·Ö·¢ºÍ´«²¥£»ÈκθöÈË»òÍÅÌå²»µÃÔÚδ¾­±¾ÈËÊÚȨµÄÇé¿öÏÂÐÞ¸ÄÔÓÖ¾µÄÍâ¹Û¼°ÄÚÈÝ£»ÔÓÖ¾µÄ½âÊÍȨ¹é³ÌʽÁÔÈËËùÓС£

¡¾±à¼­¼ÄÓï¡¿

    
   {~._.~} 
    ( Y )  
   ()~*~() 
   (_)-(_) 
ÿ½ñÌìµÄÔÓÖ¾ÄÚÈÝ»¹ÊÇͬǰ¶þÖܵÄÔö¿¯Ò»Ñù£¬Ïò´ó¼Ò½éÉÜÓÉJoaдµÄ¹ØÓÚÆƽⷽÃæµÄÎÄÕ£¬Ï£Íû´ó¼ÒÄܹ»¶ÔËüºÃºÃµÄÑо¿Ò»Ï£¬ÕâÑù¾ÍÓÐÖúÓÚÌá¸ß´ó¼ÒµÄÆƽâˮƽ¡£ºÃÁË£¬½ñÌì¾Í½éÉܵ½ÕâÀï°É¡£
 
¡¾Ä¿ ÿÿ ¼¡¿
ÿÿÿÿ&ÆƽâÐĵÃ
J¡­¡­Part VI: The secret of Zip and Rar  
K¡­¡­Part VII: Arithmetic Crunching  
L¡­¡­  
ÿÿÿÿ%³õѧÌìµØ
ÿÿÿÿOÎÊÌâ´ðÒÉ
ÿÿÿÿ4ÍøÕ¾½éÉÜ
ÿÿÿÿ,ÔÓÖ¾ÐÅÏä
 
&¡¾ÆƽâÐĵá¿
 Part VI: The secret of Zip and Rar (LZW explained)
 
By Joa Koester (JoKo2000@HotMail.Com)
This essay is free to be copied and read and spread as long as
you are decent enough to leave my name in it.
If you have corrections, comments, better examples than the ones used
in this essay etc. - drop me a line.

Last time i showed you how you could reach enormous crunch-ratios using the
LZ77-technique. I wouldn't wonder if someone would build a protection system
upon it. You'd just have to allocate some stackspace for your routine, decrunch
it into the stack and execute it. You'd would have to switch to a
decrunch-from-behind version and just by copying the decrunching routine into
the stack the final crunches would overwrite the last decrunch-loop part
thus immediately executing the decrunched routine afterwards.
But back to crunching. Soon after publishing the LZ77 alg Lempel and Ziv
fumbled a little bit around and came up with another table-oriented crunching
alg. While LZ77 used a Offset-Length pair to implement a reference to some
already passed data, LZ78 walked a completely different path. Lempel and Ziv
came up with an idea of directly addressing a table. In this table there would
be the strings they passed. The only thing you'd have to do would be to
derefernce the table entries down to the strings and you would have a crunch.
The table would start with a null-string. One Byte would be read ahead. One
acutal string would be initialized with this read-ahead byte. Then you'd check
if you already had this string in your table. If not, you would output it and
afterwards you'd add this string to your table. If you already had this string
in your table you'd read another byte and add it to your string. It is
important to notice that the last string to be output build up from a known
string in the table plus a new char.
Mark Nelson, a good source for crunching (next to Charles Bloom),
shows this alg quite clear:
for (;;)    file://until the end of the file
{
  act_phrase = 0;   file://the zero string
  act_len = 0;
  memset (test_string, 0x00, MAX_STRING);
  //...........................................................................
  for (;;)
  {
    test_string[act_len++] = getc(input);
    new_phrase = search_phrase_in_table(test_string); // is test_string in
                                                      // table?
    if (new_phrase == -1)   // no
    {
      break;                // then break out of inner loop
    }
    file://else
    act_phrase = new_phrase // else set new actual phrase
  }
  //...........................................................................
  file://Here we breaked out of the inner loop. This could happen, because of EOF or
  file://because a string could not be found in the table (in case of EOF this will
  file://happen, too, of course ;)
  output(act_phrase);            file://output the reference of the table
  output(test_string[act_len -1] ); file://afterwards output the difference char
  add_string_to_table(test_string); file://and actualize table
}
Let's crunch the following term together:
0123456789a iwobiwoiwoi

Our table is empty in the beginning except for a null-string at pos 1.
pos:  0
act_phrase = 0;
act_len = 0;
test_string = [0x00, 0x00, 0x00...]

Ok, we enter the first char: i
                ...test_string[act_len++] = getc(input);...
char: 'i'
pos:  0
act_phrase = 0;
act_len = 1;
test_string = ['i', 0x00, 0x00...]
table = [ {"", 0} ]

                After the search, new_phrase is -1, so we break

  output(act_phrase);            file://output the reference of the table
  output(test_string[act_len -1] ); file://afterwards output the difference char
  add_string_to_table(test_string); file://and actualize table
our output is: {0}i
We output the act_phrase (0) and output the 'i' afterwards. Then we update the
table with the 'i'-entry and jump back to our loop.
pos:  1
act_phrase = 0;
act_len = 0;
test_string = [0x00, 0x00, 0x00...]
enter char (w)
char: 'w'
pos:  1
act_phrase = 0;
act_len = 1;
test_string = ['w', 0x00, 0x00...]
table = [ {"", 0}, {"i", 1} ]
                After the search, new_phrase is -1, so we break
  output(act_phrase);            file://output the reference of the table
  output(test_string[act_len -1] ); file://afterwards output the difference char
  add_string_to_table(test_string); file://and actualize table
We output the act_phrase (0) and output the 'w' afterwards. Then we update the
table with the 'w'-entry and jump back to our loop.
[...]
for the chars 'o' and 'b' it's the same. We reenter on the next 'i'
pos:  4
act_phrase = 0;
act_len = 0;
test_string = [0x00, 0x00, 0x00...]
enter char (i)
char: 'i'
pos:  4
act_phrase = 0;
act_len = 1;
test_string = ['i', 0x00, 0x00...]
table = [ {"", 0}, {"i", 1}, {"w", 2}, {"o", 3}, {"b", 4} ]
    new_phrase = search_phrase_in_table(test_string); // is test_string in
                                                      // table?
    yes, it is, so
            act_phrase = new_phrase (1 in this case)
    and back to our inner loop.
char: 'w'
pos:  5
act_phrase = 1;
act_len = 2;
test_string = ['i', 'w', 0x00...]
table = [ {"", 0}, {"i", 1}, {"w", 2}, {"o", 3}, {"b", 4} ]
    Do we have a 'iw' in our table? don't think so. That means next step we
    break.
  output(act_phrase);               // (1 in this case)
  output(test_string[act_len -1] ); // (w in this case)
  add_string_to_table(test_string);
and to the next char...

char: 'o'
pos:  6
act_phrase = 0;
act_len = 1;
test_string = ['o', 0x00, 0x00...]
table = [ {"", 0}, {"i", 1}, {"w", 2}, {"o", 3}, {"b", 4}, {"iw", 5} ]
    well, we have an 'o' in our table, so the string is extended with the
    next char: 'i'.
char: 'i'
pos:  7
act_phrase = 3;
act_len = 1;
test_string = ['o', 'i', 0x00...]
table = [ {"", 0}, {"i", 1}, {"w", 2}, {"o", 3}, {"b", 4}, {"iw", 5} ]

    Some 'oi' here? No, sir. no 'oi' here. We break and write:
  output(act_phrase);               // (3 in this case)
  output(test_string[act_len -1] ); // (i in this case)
  add_string_to_table(test_string);
the next char is the 'w':
char: 'w'
pos:  8
act_phrase = 0;
act_len = 1;
test_string = ['w', 0x00, 0x00...]
table = [ {"", 0}  , {"i", 1}, {"w", 2}, {"o", 3}, {"b", 4}, {"iw", 5},
          {"oi", 6} ]
after adding the next char (coz 'w' is in our table) we have:
char: 'o'
pos:  9
act_phrase = 3;
act_len = 1;
test_string = ['w', 'o', 0x00...]
table = [ {"", 0}, {"i", 1}, {"w", 2}, {"o", 3}, {"b", 4}, {"iw", 5},
          {"oi", 6} ]
    Some 'wo' here? Nope. We break and write:
  output(act_phrase);               // (2 in this case)
  output(test_string[act_len -1] ); // (o in this case)
  add_string_to_table(test_string);
The final char (plus EOF) is confronted with the following table:
table = [ {"", 0}  , {"i", 1}, {"w", 2}, {"o", 3}, {"b", 4}, {"iw", 5},
          {"oi", 6}, {"wo", 7} ]
then the i+EOF is added to the table.

The output we produced so far (spaces inserted for clarity):
{0}i {0}w {0}o {0}b {1}w {3}i {2}o {1}EOF
Hmm. This doesn't look like an algorithm which produces good results in small
files. But think about this: if there would have been another "iwo" at the end
of the file instead the 'i' we would have coded {5}o thus saving us two bytes.
And for the next 'iwo' we would have had a new table entry ready. This means
that the algorithm manages inequally long strings with just one entry
reference. If our reference is shorter than the string, we crunch. Normally we
achieve an increasing crunch-ratio after ca. 100 bytes. From that moment on our
table begins to fill with strings longer than one byte and from that moment on
we save bits with every reference we write.
I have a question, sir...
Yes, Watson?
This table-reference - how big should that table be and how do we code the
reference? And while i'm at the point: How do we search the table? This will be
the most time-consuming part, i think. And, oh yes, what if the table is full?

You are absolutely right, Watson.
Good questions!
OK!
How do we search the table FAST? Well, to be honest, i do not prefer one method
over the other. One good method would be to implement a hash-table. In C++ you
would implment a Set of Strings and hope that the implementation from your
compiler vendor is fast enough to handle some 100 Kbyte Strings. Another way
would be to allocate a biiiiig amount of memory (premise: size MOD 256 == 0),
divide it by 256 and reserve these chunks for the according ascii-values). With
this method you would have equally long maximumsized strings ready for
binary-searching and overwriting. These chunks would then be filled with the
string and a timestamp. if our chunk was full, the next time we would have to
add a string we would delete the earliest one (or maybe two or three) and could
add the new string.
I know that this method screams "RAM WASTE!!! RAM WASTE!!!", but let me tell
you that the ACB-cruncher from George Buyanovsky needs 16 MByte RAM in one
chunk to work. We will deal with 256 MByte RAM on every average PC within the
next 5 years. Not just because RAM gets cheaper every minute, but because the
average user wants to scan and digitize things (in 1998: buy three CDs or a
scanner. What do you think we will be in 5 years?).
A good tablesearch-algorithm is the stand-or-fall part of the cruncher.
As far as i'm concerned: Let the cruncher take as much memory as possible.
If you have 32 MB reserved in one piece and you want a maximum stringlen of 128
Byte per string you will have 33554432 / 256 =  131072 Bytes per chunk.
And 131072 / 128 = 1024 Strings per chunk. The more you reduce your stringlen,
the more strings you can pack into the chunk. Reducing the maximum stringlen
down to 64 brings you 4096 strings per chunk. This is quite good. 4096 strings
just for one ascii-slot.
Well, the way we code the reference is completely ours. We should take a look
back to our LZ77 essay and let our phantasy roam. From the most stupid 8+12 bit
reference (gives us crunching only at minimum 4 byte strinlen) to unary +
your_stuff - everything is possible. However it should be fast.

The coding part is pretty easy, isn't it? You could theoretically write a LZ78
cruncher in Visual Basic in a few hours and it would actually work.
And what about the decoding part? Well, no problems here, too. The decruncher
starts with the same emptied model (emptied table) as the cruncher:

  New_String = Take the first code from decoding a bit_input
  output NewString
  take the next char
  output it
  add char to New_String
  add New_String into table.
repeat until EOF
 
//.............................................................................
While this method is pretty good and pretty easy to implement, 1984 Terry
Welch came up with a fundamental improvement of this algorithm. Since then the
term LZW is also common. Welch had the idea that one could pre-create a table
(with the first 256 possible chars) and that the algorithm would only output
table-references.
With this trick one could create a huffman-tree for all those codes one writes
thus letting us write even fewer bits (there is a method of writing even fewer
bits than Huffman - arithmetic coding, but we'll come to this later).
So, how does this LZW work? Let's have a look:
The basic LZW-algorithm:
file://Read ahead because we definitively know every single char
oldString[0] = getc(input);
oldString[1] = 0x00;
while (! eof(input))
{
  c = getc(input);                file://We get the second string from our input
  strcpy(newString, oldString);   file://and save the oldstring into the search
                                  file://string. The first time the oldstring is
                                  file://our very first char of our input. But
                                  file://later it will be longer strings.
  strncat (newString, &c, 1);     file://add the actual read char to the
                                  file://actual string
  if (stringInTable(newString))   file://when the string is already in the table
  {
    strcpy (oldString, newString) file://take it as oldString and read the next char
  }
  else                            file://else
  {
    code = searchTable(oldString);//get the code for the last known String
    output (code);                file://and output it
    addStringToTable(newString);  file://add the new String for the next time
    oldString[0] = c;             file://and start again with the latest unknown
    oldString[1] = 0x00;          file://char
  }
}
Looks pretty easy, ain't it? Well, it is. Let's crunch the our little 
source again (this time a little longer):

                                  0123456789abcde
                                  iwobiwoiwoiwoix

          
(We have a table, where the first 256 (0 - 255) entries are occupied with the
corresponding values)
Let's trace the steps of the cruncher. Maybe you should print the above
typed source for checking.
  oldstring = ( 'i', 0x00 )
 
(we enter the loop) (Pos = 0)
  c = 'w'
  newString = ( 'i', 0x00 )
  newString = ( 'iw', 0x00 )
 
(no, we don't have 'iw' in our table yet)
  code = searchTable(oldString)   file://oldString = 'i', 0x00
  output (code)                   file://i
  addString (newString)
  oldString[0] = 'w'
  oldString[1] = 0x00
 
(back to the top of the loop) (Pos = 1)
(our table has the 256 entries plus [ {256, 'iw'} ]
  oldstring = ( 'w', 0x00 )
  c = 'o'
  newString = ( 'w', 0x00 ) 
  newString = ( 'wo', 0x00 ) 
(no, we don't have 'wo' in our table yet)
  code = searchTable(oldString)   file://oldString = 'w', 0x00
  output (code)                   file://w
  addString (newString)
  oldString[0] = 'o'
  oldString[1] = 0x00
(back to the top of the loop) (Pos = 2)
(our table has the 256 entries plus [ {256, 'iw'}, {257, 'wo'} ]
  oldstring = ( 'o', 0x00 )
  c = 'o'
  newString = ( 'b', 0x00 ) 
  newString = ( 'ob', 0x00 ) 
(no, we don't have 'ob' in our table yet)
  code = searchTable(oldString)   file://oldString = 'o', 0x00
  output (code)                   file://o
  addString (newString)
  oldString[0] = 'b'
  oldString[1] = 0x00
(back to the top of the loop) (Pos = 3)
(our table has the 256 entries plus [ {256, 'iw'}, {257, 'wo'}, {258, 'ob'} ]
  oldstring = ( 'b', 0x00 )
  c = 'i'
  newString = ( 'b', 0x00 ) 
  newString = ( 'bi', 0x00 ) 
(no, we don't have 'bi' in our table yet)
  code = searchTable(oldString)   file://oldString = 'b', 0x00
  output (code)                   file://i
  addString (newString)
  oldString[0] = 'i'
  oldString[1] = 0x00
(Now we will have something to crunch :)
(back to the top of the loop) (Pos = 4)
(our table has the 256 entries
plus [ {256, 'iw'}, {257, 'wo'}, {258, 'ob'}, {259, 'bi'}  ]
  oldstring = ( 'i', 0x00 )
  c = 'w'
  newString = ( 'i', 0x00 ) 
  newString = ( 'iw', 0x00 ) 
(YES, we have 'iw' in our table )
  oldString = ( 'iw', 0x00)
(back to the top of the loop)
  oldString = ( 'iw', 0x00)
  c = 'o' 
  newString = ( 'iw', 0x00 )  
  newString = ( 'iwo', 0x00 ) 

(no, we don't have 'iwo' in our table yet)
  code = searchTable(oldString)   file://oldString = 'iw', 0x00
  output (code)                   file://[256]
  addString (newString)
  oldString[0] = 'o'
  oldString[1] = 0x00

(back to the top of the loop) (Pos = 6)
(our table has the 256 entries
plus [ {256, 'iw'}, {257, 'wo'}, {258, 'ob'}, {259, 'bi'}, {260, 'iwo'}  ]
  oldstring = ( 'o', 0x00 )
  c = 'i'
  newString = ( 'o', 0x00 ) 
  newString = ( 'oi', 0x00 ) 
(no, we don't have 'oi' in our table yet)
  code = searchTable(oldString)   file://oldString = 'o', 0x00
  output (code)                   file://o
  addString (newString)
  oldString[0] = 'i'
  oldString[1] = 0x00

(back to the top of the loop) (Pos = 7)
(our table has the 256 entries
plus [
       {256, 'iw'}, {257, 'wo'}, {258, 'ob'}, {259, 'bi'}, {260, 'iwo'},
       {261, 'oi'} 
     ] 
  oldstring = ( 'i', 0x00 )
  c = 'w'
  newString = ( 'i', 0x00 ) 
  newString = ( 'iw', 0x00 ) 
(YES, we have a 'iw' in our table)
  oldString = ( 'iw', 0x00)
(back to the top of the loop)
  oldString = ( 'iw', 0x00)
  c = 'o' 
  newString = ( 'iw', 0x00 )  
  newString = ( 'iwo', 0x00 ) 
(YES, we have a 'iwo' in our table)
  oldString = ( 'iwo', 0x00)
(back to the top of the loop)
  oldString = ( 'iwo', 0x00)
  c = 'i' 
  newString = ( 'iwo', 0x00 )  
  newString = ( 'iwoi', 0x00 ) 

(no, we don't have 'iwoi' in our table yet)
  code = searchTable(oldString)   file://oldString = 'iwo', 0x00
  output (code)                   file://260
  addString (newString)
  oldString[0] = 'i'
  oldString[1] = 0x00

(back to the top of the loop) (Pos = a)
(our table has the 256 entries
plus [
       {256, 'iw'}, {257, 'wo'}, {258, 'ob'}, {259, 'bi'}, {260, 'iwo'},
       {261, 'oi'}, {262, 'iwoi'}   
     ] 
  oldstring = ( 'i', 0x00 )
  c = 'w'
  newString = ( 'i', 0x00 ) 
  newString = ( 'iw', 0x00 ) 
(YES, we have a 'iw' in our table)
  oldString = ( 'iw', 0x00)
(back to the top of the loop)
  oldString = ( 'iw', 0x00)
  c = 'o' 
  newString = ( 'iw', 0x00 )  
  newString = ( 'iwo', 0x00 ) 
(YES, we have a 'iwo' in our table)
  oldString = ( 'iwo', 0x00)
(back to the top of the loop)
  oldString = ( 'iwo', 0x00)
  c = 'i' 
  newString = ( 'iwo', 0x00 )  
  newString = ( 'iwoi', 0x00 ) 
(YES, we have a 'iwoi' in our table)
  oldString = ( 'iwoi', 0x00)
(back to the top of the loop)
  oldString = ( 'iwoix', 0x00)
  c = 'x' 
  newString = ( 'iwoi', 0x00 )  
  newString = ( 'iwoix', 0x00 ) 

(no, we don't have 'iwoix' in our table yet)
  code = searchTable(oldString)   file://oldString = 'iwo', 0x00
  output (code)                   file://262
  addString (newString)
  oldString[0] = 'x'
  oldString[1] = 0x00
So, our output would look something like this:
 iwob[256]o[260][262]

Hey, we have two codes directly. Does this make trouble when decrunching?
Well, let's see how we would decrunch our little baby:

file://Well, we know that the first character is uncrunchable, so let's output it
file://directly
oldString[0] = decode(input())
oldString[1] = 0x00;
putString(oldString);      
newCode = input()                         file://We get a new inputCode
while (newCode != EOF)
{
  newString = searchInTable(newCode);     file://We have to have it in our table
  putString(newString);                   file://then we output it
  strncat (oldString, &newString[0], 1);  file://the oldstring is concatenated by
                                          file://the first char of the freshly read
                                          file://string
  addString (oldString);                  file://this string is then added to the
                                          file://table
  strcpy (oldString, newString);          file://then the freshly read string is
                                          file://from now on our new base
  newCode = input()                       file://We get a new inputCode
}

OK, let's trace this alg when we decrunch our crunch:
 iwob[256]o[260][262]
Our table has the table entries 0 - 255 filled with the corresponding values.
oldString[0] = 'i';
oldString[1] = 0x00;
putString(oldString);       file://i is output
newCode = 'w';
(enter loop)
  newString = 'w'
  Output 'w'                file://w is output
 
  oldString = 'iw'          file://build up new String
  //                        file://and add it to the table
  oldString = 'w'    
  newCode = 'o'
our table has the 256 entries 
plus [
       {256, 'iw'}
     ] 
(enter loop)
  newString = 'o'
  Output 'o'                file://o is output
 
  oldString = 'wo'          file://build up new String
  //                        file://and add it to the table
  oldString = 'o'    
  newCode = 'b'
our table has the 256 entries 
plus [
      {256, 'iw'}, {257, 'wo'}
     ] 
(enter loop)
  newString = 'b'
  Output 'b'                file://b is output
 
  oldString = 'ob'          file://build up new String
  //                        file://and add it to the table
  oldString = 'b'    
  newCode = [256]
our table has the 256 entries 
plus [
      {256, 'iw'}, {257, 'wo'}, {258, 'ob'}
     ] 
(enter loop)
  newString = 'iw'
  Output 'iw'               file://iw is output
 
  oldString = 'bi'          file://build up new String (just the first char from the
                            file://new String)
  //                        file://and add it to the table
  oldString = 'iw'    
  newCode = 'o'

our table has the 256 entries
plus [
      {256, 'iw'}, {257, 'wo'}, {258, 'ob'}, {259, 'bi'}
     ] 
(enter loop)
  newString = 'o'
  Output 'o'                file://o is output
 
  oldString = 'iwo'         file://build up new String
  //                        file://and add it to the table
  oldString = 'o'    
  newCode = [260]
 
our table has the 256 entries 
plus [
      {256, 'iw'}, {257, 'wo'}, {258, 'ob'}, {259, 'bi'}, {260, 'iwo'}
     ] 
(enter loop)
  newString = 'iwo'
  Output 'iwo'              file://o is output
 
  oldString = 'oi'          file://build up new String (just the first char)
  //                        file://and add it to the table
  oldString = 'iwo'    
  newCode = [262]
 
our table has the 256 entries 
plus [
      {256, 'iw'}, {257, 'wo'}, {258, 'ob'}, {259, 'bi'}, {260, 'iwo'}
      {261, 'oi'}
     ] 
(enter loop)
  newString = CRASH !!! CRASH
 
Damn. I knew it. The cruncher added a code and used it immediately and the
decruncher can't use it as it doesn't know how to create it. Or does it???
Hmmmm....
Ok, ok, i lied. I knew it all the time that this constellation was deadly for 
our decruncher. This behaviour appears, whenever the cruncher meets a
constellation like (iwo b iwo iwo iwo) where the last character of a crunch is
the beginning of the next sequence. This leads to creating table-entries which
the decruncher can not foresee. But lucky us - this is the only case when this
happens. A little if-clause will just catch the null-pointer and just fetch the
oldString and copy it to newString. Then we add the first char of newString to
the end of newString. Then we continue as always. The if-clause could look like
this:
  if (newString == null)
  {
    char c[2]
    file://we still know oldString
    strcpy (newString, oldString)
    char c[0] = newString[0];
    char c[1] = 0x00;
    strcat (newString, c);   
  }
With this clause let's restart from the last part:
our table has the 256 entries 
plus [
      {256, 'iw'}, {257, 'wo'}, {258, 'ob'}, {259, 'bi'}, {260, 'iwo'}
     ] 
(enter loop)
  newString = 'iwo'
  Output 'iwo'              file://o is output
 
  oldString = 'oi'          file://build up new String (just the first char)
  //                        file://and add it to the table
  oldString = 'iwo'    newCode = [262]

our table has the 256 entries
plus [
      {256, 'iw'}, {257, 'wo'}, {258, 'ob'},
      {259, 'bi'}, {260, 'iwo'} {261, 'oi'}
     ] 
(enter loop)
  newString = null
  if (newString = null)
  {
    newString = 'iwo'
    newString = 'iwoi'      file://First character added again
  }
  Output 'iwoi'             file://iwoi is output
 
  oldString = 'iwoi'        file://build up new String (just the first char)
  //                        file://and add it to the table
...
and then we have our entry 262.
Phew.
You see when you try this on your own, that there is a lot of room for 
improvement.
But this is the basic algorithm and as now know the secrets of LZW you know
what RAR and ZIP are doing. I have some sources i can mail you, if you want.

All right, next time i'll explain something related to Huffman -
                The Arithmetic Crunching Method (tataaa... :-)
                (how can i crunch 0.74 bits ???)

Greetings from a programmer,
Joa
 Part VII: Arithmetic Crunching (crunching bits apart...)
 
By Joa Koester (JoKo2000@HotMail.Com)
This essay is free to be copied and read and spread as long as
you are decent enough to leave my name in it.
If you have corrections, comments, better examples than the ones used
in this essay etc. - drop me a line.

hello, hello,
well, shave my legs and call me Grandpa if that wasn't a long delay, but i 
had some serious difficulties accessing the Internet recently. I hope that
i can access you more frequently in the near future. And to sweet up the time
till then a little bit here my next essay about crunching methods:

Arithmetic Crunching!

What is Arithmetic Crunching?
Arithmetic Crunching is based on the observation that a certain floating 
point number decrements to a much smaller degree the higher the factor is with
which you multiply it. For us that means, that that arithmetic crunching a based
on probabilities like Huffman. But as you will see, much better.
Wanna example? Well, imagine a floating point number. Say, we imagine 1,00.
Then imagine we have a multiplicating factor like 0,9.
And now watch what happens when we multiply the 1,00 with 0,9 and then 
multiply the result with 0,9 again and again and ... and that some times:
1,00 * 0,9 = 0,9
0,9  * 0,9 = 0,81
0,81 * 0,9 = 0,729
...
as we continue we see that the value of the number decrements. But it
decrements SLOWLY. And it decrements with the factor 0,9. This is the first key
element for arithmetic crunching: you multiply a number with a factor. The
higher the factor the slower the number decrements, the better the crunching.

You want a practical example? Coming up... Have a look at the following byte
sequence:

aaaaaaaaaabbbbcccddd
which consists of four chars. Counting the appearences of the char reveals the 
following:
10 x a
 4 x b
 3 x c
 3 x d
-------
20 bytes
As we know that Arithmetic Crunching has something to do with probabilities, 
let's build up a table with the probabilites of the chars:
a = 50%
b = 20%
c = 15%
d = 15%
And to explain the mechanics, let's start at the most basic implementation. The 
basic idea is to transform the probabilities into slots, starting - for now - in
floating point mode in the range from 0,0 to 1,0.
So let's scale this table into a range from 0.0 to 1.0, giving the lowest
value the starting range 0.0:
Byte |  %  |  Low  | High
a    | 50% |  0.50 | 0.999999(999...)
b    | 20% |  0.30 | 0.499999(999...)
c    | 15% |  0.15 | 0.299999(999...)
d    | 15% |  0.00 | 0.149999(999...)
So we have the char 'a' with the range from 0.5 till 0.99999. But what does
that bring us?
Well, we have now TWO values and we automatically have a RANGE, namely the 
range of the two actual calculated values. That means we have three values that
are calculated: the actual range, the actual Low and the actual High. The actual
range is the range that is later saved bytewise to disk, the actual High and Low
are the parts that change according to the crunched byte.
Just watch as we step thru the byte-sequence:
(We start with Low = 0.0 and High with 1.0)
Low:   0.0
High:  1.0
Range: (High - Low) = 1.0
char:  a
LowOfChar: 0.5
HighOfChar: 0.99999(999...)
High= Low + (Range * HighOfChar)= 0.0 + (1.0 * 0.99999) = 0.99999
Low = Low + (Range * LowOfChar) = 0.0 + (1.0 * 0.5)     = 0.5

What happened here? Well at first we calculated the actual RANGE. This range is
our arithmetic number, yielding the values of the chars we are crunching. Then
we calculate the High and then the Low.
These are build upon the slots (the ranges) of the actual char. Because we are 
having TWO values we have to calculate also two new values. The High is to be
calculated with the high position of the actual slot (0.99999 for the char a,
0.49999 for the char b, etc.).                
We multiply the high and low position with the actual range, because this 
range is recreated dynamically with every byte we crunch. At first the range is
1,0. But with every byte the range becomes smaller and smaller until a byte or
so has to emitted to give space again and the range is shifted upwards again.
If for example your range is 0.00025 that would mean that your High and Low
are only 0.00025 away. At this point we have to emit the first number(s) of Low
to disk and shift the rest up until we are behind the decimal point again.
So, if Low was 0.12325 and High was 0.12350, our RANGE would be 0.00025. Too 
small to continue. First we have to emit something and then shift the rest
upwards again. We emit for example 0.12 to disk and subtract this from Low and
High.
Low and High would then be 0.00325 / 0.00350. Then we have to shift the 
values up again until they are behind the decimal point: 0.325 / 0.350.
The next bytes you crunch will more or less slowly bring the values nearer again 
and then we emit again some values, probably 0.3 and shift the rest up again,
and so on.
How many decimal points you emit to disk depends on your personal ideas. You 
could send as much decimal numbers to disk until there is a difference in High
and Low (so in the example 0.12350 and 0.12325 we could have sent 0.123 to
disk because these are the values that aren't different) or you could send
a certain number of numbers to disk and shift upwards until a certain difference
is ensured.
Or something completely different. It's up to you.
We calculate the High first, because the High-Value is calculated on the actual 
Low-Value. If we would calculate the Low-Value first, we couldn't calculate our
High-Value anymore. When we calculate these values we take the actual range,
multiply it with the factor(s) of the actual char and give the result back into
High or Low.
 
What that mechanism will do will become clearer when we watch the crunching of
the second byte:

Low:   0.5
High:  0.99999
Range: (High - Low) = 0.49999
char:  a
LowOfChar: 0.5
HighOfChar: 0.999999
High= Low + (Range * HighOfChar)= 0.5 + (0.49999 * 0.99999) = 0.999985
Low = Low + (Range * LowOfChar) = 0.5 + (0.49999 * 0.5)     = 0.749995
What you see is that the number is still in the range of the char 'a'.
What will happen when we crunch the next 8 a's ? I reckon that the number will
get nearer to each other.
But still it will be in the range of 0.5 - 0.99999.
What we could use now is an example consisting of just 4 chars:
aabc
Ok. Now build the probability table:
a = 50%
b = 25%
c = 25%
And next the transforming-table from the percentages into the range of 0.0 - 1.0.
Byte |  %  |  Low  | High
a    | 50% |  0.50 | 0.99999
b    | 25% |  0.25 | 0.49999
c    | 25% |  0.00 | 0.24999
Now again, watch the ranges as we skip the first two crunching steps as the
result is identical with the first example. Let's watch what happens when we now
reach the 'b':
Low:   0.749995
High:  0.999985
Range: (High - Low) = 0.24999
char:  b
LowOfChar: 0.25
HighOfChar: 0.49999
High= Low + (Range * HighOfChar)= 0.749995 + (0.24999 * 0.49999) = 0.8749875
Low = Low + (Range * LowOfChar) = 0.749995 + (0.24999 * 0.25)    = 0.8124925

Phew. As you can see the values made some heavy jumps. The High sank from
0.999985 down to 0.8749875, while the Low jumped from 0,749995 up to 0,8124925.
I think i will not surprise you when i say the the next step will go in the same
direction:
Low:   0.8124925
High:  0.8749875
Range: (High - Low) = 0.062495
char:  b
LowOfChar: 0.0
HighOfChar: 0.24999
High= Low + (Range * HighOfChar)= 0.8124925 + (0.062495 * 0.24999) = 0.828115625
Low = Low + (Range * LowOfChar) = 0.8124925 + (0.062495 * 0.0)     = 0.8124925
Oh boy, oh boy, oh boy. Look, what we have here. The Low and the High are damn 
near to each other. But the bytes are thru. The crunching is over. And what now?
Well, all we have to do now is to emit the Low and that's all. We are done. The
sequence aabc is coded into 0.8124925! It is most important to understand that we
are not emitting a number, but a collection of RANGES!!! When we decrunch the
number it will become even clearer:
(For now the decoder is supposed to know the model)
Byte |  %  |  Low  | High
a    | 50% |  0.50 | 0.99999
b    | 25% |  0.25 | 0.49999
c    | 25% |  0.00 | 0.24999
0.8124925 is in the range of the char a. So we output the char 'a'. And 
then? It's simple. We subtract the Low of the char and shift the rest upwards
by multiplying it with the RANGE:
.99999    -   .5                   = .49999      = Range
.8124925  -   .5 (=LowOfChar)      = .3124925
.3124925  /   .49999 (=Range)      = .6249975
(Remember: dividing by a fraction is the same as multiplying with it's reciprocal.
 Dividing by .49999 is nearly the same as multiplying with 2.0, meaning for us
 the shifting we searched for.)
.6249975 is in the range of the char a. So we output the char 'a'. And proceed as 
before:
.99999   -   .5                    = .49999      = Range
.6249975 -   .5 (=LowOfChar)       = .1249975
.1249975 /   .49999 (=Range)       = .25
Well, 0.25 is clearly in the range of b and so we emit a 'b'. Then we proceed:
.49999   -   .25                   = .24999      = Range
.25      -   .25 (=LowOfChar)      = .0
.0       /   .24999 (=Range)       = .0
0.0 is in the range of c and we emit a 'c'. After proceeding we recognize that 
we are finished and stop working. Again: It is most important to notice that
in the crunching process we emit RANGES, or maybe better formulated,
interpretations of ranges. Therefore we emit only the Low of the range in
question. As a consequence when we decrunch the 'number' we subtract the
LowOfRange of the char in question. After that we can shift the 'number' up by
the RANGE of the char.
Now two problems are emerging: Is it necessary to emit the model to the 
decruncher and second: How do we implement a floating point number that can be
as long as some hundreds of kilobyte when we only have registers of maybe 80 or
128 bits?
To answer the first question: Well, what about an adaptive model? It will not 
crunch efficiently in the first few hundreds bytes but then we won't crunch just
texts of 200 bytes, right?
The answer to the second question is a little bit more complex. Of course we 
can't emit a floating point number consisting of thousands of decimal numbers.
But with a mathematical trick we can fake this process. I'm sure that nearly all
of you once tried to programm some 3D stuff. And after some time you came up
with the idea to shift up the floating point numbers by, say, 16384, and
calculate then with these now integer values. After the calculation you'd then
shift the result down again by 16384 and would have speeded up your floating
point calculations by the factor 1000 or so. Now, based on this we could also
say that the number 1.0 could be same as 0x010000 and 0.0 could be 0x0000.
We would then transform the whole calculations into the realm of integer values.
We can only hope that 16 bits are enough to simulate the floating point
calculations, but let's think our example thru again:
Byte |  %  |  Low        | High
a    | 50% |  0x008000 | 0x00ffff
b    | 25% |  0x004000 | 0x007fff
c    | 25% |  0x000000 | 0x003fff

First byte:
Low:   0x000000
High:  0x010000
Range: (High - Low) = 0x010000
char:  a
LowOfChar:  0x008000
HighOfChar: 0x00ffff
High= Low + (Range * HighOfChar)= 0 + (0x010000 * 0x00ffff) = BANG
Ooops. It seems that we have a little overload here. Hm, but there must be a way 
to calculate those values in integer format. It it most necessary for to
output single bytes to disk. What do we do?
Well, there are certain ways out of this. I will tell you one i find extremely 
elegant and also has the advantage of being extensible and able to crunch
SIMILAR values, too. This solution comes from
Gordon V. Cormack  
University of Waterloo
cormack@uwaterloo.ca
and is build upon the idea that you can see the bytes you crunch also as a 
sequence of BITS you crunch. Now when you observe the 8 bits of a byte and would
log the countings of the 1-bit and 0-bit states of the byte you would have a
High and a Low. They would just be transformed into 256 entries of two arrays.
In C/C++ these array would be declared: int one[256], int zero[256].
When we then would examine one single bit of a byte we would then have to fetch 
the correct entry of the array and calculate the actual fraction with the values
of the entries in the arrays one and zero. What we would get out of this
calculation would be the factor to multiply with the actual range. Well, i
include the sources for both the cruncher and the decruncher at the end of this
essay (they were not formatted by me).
Let's examine the source together:
int max = 0x1000000,
    min = 0,
    mid,
    index,
    c,
    i,
    bytes = 0,
    obytes = 3;
int bit;
int one[256], zero[256];
for (i=0;i<256;i++) {
   one[i] = 1;
   zero[i] = 1;
   }
Ok, some ints are created and the arrays are build and initialized. Why they are 
initialized with 1 is due to the fact that they are used to be divided with.
Were they not initialized it would create an error "Division by Zero".

for(;;){
   c = getchar();
   if (c == EOF) {
      min = max-1;
      fprintf(stderr,"compress done bytes in %d bytes out %d ratio %f\n",
                      bytes,obytes, (float)obytes/bytes);
      break;
   }
   ...
So, forever (or until End Of File) we read a char and if it's EOF we break the 
loop and land here:
putchar(min>>16);
putchar((min>>8) & 0xff);
putchar(min & 0x00ff);
}
So we at least output three bytes. Again you can see that we output from the 
min-value (which is of course our Low). But that is the starting and the end of
the loop. What happens in the normal course:

   bytes++;
   for (i=0;i<8;i++){
      bit = (c << i) & 0x80;
      index = (1<> (8-i));
      mid = min + ((max-min-1)*((float)zero[index]) / (one[index]+zero[index]));
      if (mid == min) mid++;
      if (mid == (max-1)){ /* should never happen, but with floating pt? */
         mid--;
      }
      if (bit) {
         min = mid;
         one[index]++;
         }
      else {
         max = mid;
         zero[index]++;
         }
      while ((max-min) < 256) {
         max--;
         putchar(min >> 16);
         obytes++;
         min = (min << 8) & 0xffff00;
         max = ((max << 8) & 0xffff00) ;
         if (min >= max) max = 0x1000000;
         }
      }
 
Uh, looks complicated. Let's examine it step by step and first up reformat the source:
   
   
   /*-----------------------------------------------------------------------------*/
   bytes++;
  
   for (i=0;i<8;i++)
   {
      bit = (c << i) & 0x80;
      index = (1<> (8-i));
      /*--------------------------------------------------------------------------*/
      mid = min + ((max-min-1)*((float)zero[index]) / (one[index]+zero[index]));
      if (mid == min) mid++;
      if (mid == (max-1))
      {   /* should never happen, but with floating pt? */
         mid--;
      }  
      /*--------------------------------------------------------------------------*/
     
      if (bit)
      {
         min = mid;
         one[index]++;
      }
      else
      {
         max = mid;
         zero[index]++;
      }
      /*--------------------------------------------------------------------------*/
      while ((max-min) < 256) 
      {
         max--;
         putchar(min >> 16);
         obytes++;
         min = (min << 8) & 0xffff00;
         max = ((max << 8) & 0xffff00) ;
        
         if (min >= max)
         {
           max = 0x1000000;
         }
      }
      /*--------------------------------------------------------------------------*/
   }

Ok, let's examine the source from easy to heavy:
      while ((max-min) < 256) 
      {
         max--;
         putchar(min >> 16);
         obytes++;
         min = (min << 8) & 0xffff00;
         max = ((max << 8) & 0xffff00) ;
        
         if (min >= max)
         {
           max = 0x1000000;
         }
      }
should mean that, if the difference between High and Low is less than 256 we 
emit the highest byte and shift up the rest. For the case that this difference
is still present after one shifting the code is packaged into a loop. max is
ensured to be greater than min.
Next one:
      if (bit) 
      {
         min = mid;
         one[index]++;
      }
      else
      {
         max = mid;
         zero[index]++;
      }
means that if the actual bit of the actual byte is set, the Low is set to the 
variable mid and the specified entry of the array one is incremented by one. If
however the bit is not set we set the High down to mid and increment the
specified entry in the array zero.
The value for mid is calculated here: 
   for (i=0;i<8;i++)
   {
      bit = (c << i) & 0x80;
      index = (1<> (8-i));
      /*--------------------------------------------------------------------------*/
      mid = min + ((max-min-1)*((float)zero[index]) / (one[index]+zero[index]));
      if (mid == min) mid++;
      if (mid == (max-1))
      {   /* should never happen, but with floating pt? */
         mid--;
      }  
     
     
      /* ... */     
   }
Again, let's examine this code backwards:

      if (mid == min) mid++;
      if (mid == (max-1))
      {   /* should never happen, but with floating pt? */
         mid--;
      }  
are just insurances that further calculations won't go beyond the boundaries of  
the cruncher.

      mid = min + ((max-min-1)*((float)zero[index]) / (one[index]+zero[index]));
based on min (Low) we calculate a new value. The formula becomes clearer when we 
do it in two steps:
      float factor = ((float)zero[index]) / (one[index] + zero[index]);
      mid = min + (max-min-1) * factor;
So the factor is the actual count of the zero-bits of this actual bit of the 
actual char and some similar chars divided by the sum of the actual count of the
zero-bits plus the one-bits of the actual bit of the actual char and similar
chars. Sounds heavy, eh? This factor is one key-element of the actual implementation.
If you would create another formula for a better factor, you could reach a
better compression, as the prof already said. But back to the code. This factor
is multiplied with the actual range. Then this result is added to the actual min
and is put into the variable called mid which is then used to alter the value of
min or max as shown above.

      bit = (c << i) & 0x80;
      index = (1<> (8-i));
are examples why C/C++ is loved and hated by it's programmers / opponents. 
Well, the first line is pretty easy. It's a bit-test to check, whether the
i. bit of the char is set or not. In Assembler you could do a
simple "bt reg, i" - however after that instruction the variable bit
contains a 1 or a 0.
The second line is a little bit more complicated. It can only really be
understood if we simulate a char in the loop.
Let's say we crunch the char 'a' and we would only have this line in our loop. 
The value for index would change as follows:
char          | i | index
              |   |
01000001 = 65 | 0 | (1<<0) - 1 + (65 >> (8-0))  =  0   + (65 >> (8)) = 0
01000001 = 65 | 1 | (1<<1) - 1 + (65 >> (8-1))  =  1   + (65 >> (7)) = 1
01000001 = 65 | 2 | (1<<2) - 1 + (65 >> (8-2))  =  3   + (65 >> (6)) = 4
01000001 = 65 | 3 | (1<<3) - 1 + (65 >> (8-3))  =  7   + (65 >> (5)) = 9
01000001 = 65 | 4 | (1<<4) - 1 + (65 >> (8-4))  =  15  + (65 >> (4)) = 19
01000001 = 65 | 5 | (1<<5) - 1 + (65 >> (8-5))  =  31  + (65 >> (3)) = 39
01000001 = 65 | 6 | (1<<6) - 1 + (65 >> (8-6))  =  63  + (65 >> (2)) = 79
01000001 = 65 | 7 | (1<<7) - 1 + (65 >> (8-7))  =  127 + (65 >> (1)) = 159
a 'b' would deliver the following results:
char          | i | index
              |   |
01000010 = 65 | 0 | (1<<0) - 1 + (66 >> (8-0))  =  0   + (66 >> (8)) = 0
01000010 = 66 | 1 | (1<<1) - 1 + (66 >> (8-1))  =  1   + (66 >> (7)) = 1
01000010 = 66 | 2 | (1<<2) - 1 + (66 >> (8-2))  =  3   + (66 >> (6)) = 4
01000010 = 66 | 3 | (1<<3) - 1 + (66 >> (8-3))  =  7   + (66 >> (5)) = 9
01000010 = 66 | 4 | (1<<4) - 1 + (66 >> (8-4))  =  15  + (66 >> (4)) = 19
01000010 = 66 | 5 | (1<<5) - 1 + (66 >> (8-5))  =  31  + (66 >> (3)) = 39
01000010 = 66 | 6 | (1<<6) - 1 + (66 >> (8-6))  =  63  + (66 >> (2)) = 79
01000010 = 66 | 7 | (1<<7) - 1 + (66 >> (8-7))  =  127 + (66 >> (1)) = 160
a 'c' would deliver the following results:
char          | i | index
              |   |
01000011 = 67 | 0 | (1<<0) - 1 + (67 >> (8-0))  =  0   + (67 >> (8)) = 0
01000011 = 67 | 1 | (1<<1) - 1 + (67 >> (8-1))  =  1   + (67 >> (7)) = 1
01000011 = 67 | 2 | (1<<2) - 1 + (67 >> (8-2))  =  3   + (67 >> (6)) = 4
01000011 = 67 | 3 | (1<<3) - 1 + (67 >> (8-3))  =  7   + (67 >> (5)) = 9
01000011 = 67 | 4 | (1<<4) - 1 + (67 >> (8-4))  =  15  + (67 >> (4)) = 19
01000011 = 67 | 5 | (1<<5) - 1 + (67 >> (8-5))  =  31  + (67 >> (3)) = 39
01000011 = 67 | 6 | (1<<6) - 1 + (67 >> (8-6))  =  63  + (67 >> (2)) = 79
01000011 = 67 | 7 | (1<<7) - 1 + (67 >> (8-7))  =  127 + (67 >> (1)) = 160
and a 'd' would give us this:
char          | i | index
              |   |
01000100 = 68 | 0 | (1<<0) - 1 + (68 >> (8-0))  =  0   + (68 >> (8)) = 0
01000100 = 68 | 1 | (1<<1) - 1 + (68 >> (8-1))  =  1   + (68 >> (7)) = 1
01000100 = 68 | 2 | (1<<2) - 1 + (68 >> (8-2))  =  3   + (68 >> (6)) = 4
01000100 = 68 | 3 | (1<<3) - 1 + (68 >> (8-3))  =  7   + (68 >> (5)) = 9
01000100 = 68 | 4 | (1<<4) - 1 + (68 >> (8-4))  =  15  + (68 >> (4)) = 19
01000100 = 68 | 5 | (1<<5) - 1 + (68 >> (8-5))  =  31  + (68 >> (3)) = 39
01000100 = 68 | 6 | (1<<6) - 1 + (68 >> (8-6))  =  63  + (68 >> (2)) = 80
01000100 = 68 | 7 | (1<<7) - 1 + (68 >> (8-7))  =  127 + (68 >> (1)) = 161
Now you see why i emphasized so much on the similar chars. The index is in a lot 
of cases identical and gives us a great range of possibilities. In combination
with the (non-)setting of the corresponding bit the zero or the one-array is
incremented exactly at this index. So, the more letters we have the more the
entries 0, 1, 4, 9 and 19 will increment. The rest will be incremented according
to the actual char value. But the values in these entries will grow and grow and
lead to a big fraction - and so to a slower nearing of the Low and High values.
If you for example crunch a normal text then the bit 7 will never be set because
of the ASCII-Value of the normal letter values.
Personally this way of scanning thru bytes inspired me a lot. I hope it does the 
same for you.
This implementation is pretty neat and you should give it a try in a debugger. 
It's very interesting watching the value of mid, min and max grow and shrink
until the first byte of min is emitted and the whole stuff shifted upwards. The
decruncher is also extremely interesting as it recreates the first byte from the
settings of the first three bits emitted. It's a little bit like magic, but
purely mathematics.
It can't be stressed enough that the Artithmetic Crunching just emits the lower 
interpretation of a range and that this range is then shifted upwards to give
enough 'space' again for the next bytes to be crunched. So, we are not dealing
with a number but with a folded and shifted heap of RANGES of certain chars.

Well, uhm, i have a question...
Yes, Watson?
If Arithmetic Crunching is superior to Huffman, why isn't it more often used? I 
only see Huffman but never saw AC somewhere in file formats or so.
Good question, Watson. The answer is: the basic algorithm is patented. If you 
ever implement a file format using AC and this file format is used in a
commercial way you have to pay license fees to certain patent holders. If
this was not the case our JPGs would be around 10% to 20% smaller. But, what the
heck, this here is just for research purposes, isn't it? :)

OK, enough talking,
next time we talk about the most recent developement in crunching business:
the Burrows-Wheeler-Transformation (tataaa)
Till then
hasta la pasta,
Joa
 
/*----------------------------------------------------------------------------*/
/*   Arithmetic Coding Implementation  -  Compression version 0.0.0
 
     Copyright Feb. 1993
             
     Gordon V. Cormack  Feb. 1993
     University of Waterloo
     cormack@uwaterloo.ca
 
     All rights reserved.
 
     This code and the algorithms herein are the property of Gordon V. Cormack.
 
     Neither the code nor any algorithm herein may be included in any software,
     device, or process which is sold, exchanged for profit, or for which a
     licence or royalty fee is charged.
     Permission is granted to use this code for educational, research, or
     commercial purposes, provided this notice is included, and provided this
     code is not used as described in the above paragraph.
 
*/
/* 
     This code uses a one-byte finite state predictor to drive an arithmetic
     coder for data compression.  It should give compression nearly identical
     to one-byte huffman coding.
     Find a better predictor, and you'll have a better compressor!
     It handles end-of-file properly, which requires more than 5 minutes
     thought.
*/
 
#include 
main(){
int max = 0x1000000,
    min = 0,
    mid,
    index,
    c,
    i,
    bytes = 0,
    obytes = 3;
int bit;
int one[256], zero[256];
for (i=0;i<256;i++) {
   one[i] = 1;
   zero[i] = 1;
   }
for(;;){
   c = getchar();
   if (c == EOF) {
      min = max-1;
      fprintf(stderr,"compress done bytes in %d bytes out %d ratio %f\n",
                      bytes,obytes, (float)obytes/bytes);
      break;
   }
   bytes++;
   for (i=0;i<8;i++){
      bit = (c << i) & 0x80;
      index = (1<> (8-i));
      mid = min + ((max-min-1)*((float)zero[index]) / (one[index]+zero[index]));
      if (mid == min) mid++;
      if (mid == (max-1)){ /* should never happen, but with floating pt? */
         mid--;
      }
      if (bit) {
         min = mid;
         one[index]++;
         }
      else {
         max = mid;
         zero[index]++;
         }
      while ((max-min) < 256) {
         max--;
         putchar(min >> 16);
         obytes++;
         min = (min << 8) & 0xffff00;
         max = ((max << 8) & 0xffff00) ;
         if (min >= max) max = 0x1000000;
         }
      }
   }
putchar(min>>16);
putchar((min>>8) & 0xff);
putchar(min & 0x00ff);
}

/*   Arithmetic Coding Implementation  -  Expansion  version 0.0.0
 
 
     Copyright Feb. 1993
 
     Gordon V. Cormack  Feb. 1993
     University of Waterloo
     cormack@uwaterloo.ca
 
 
     All rights reserved.
 
     This code and the algorithms herein are the property of Gordon V. Cormack.
 
     Neither the code nor any algorithm herein may be included in any software,
     device, or process which is sold, exchanged for profit, or for which a
     licence or royalty fee is charged.
 
     Permission is granted to use this code for educational, research, or
     commercial purposes, provided this notice is included, and provided this
     code is not used as described in the above paragraph.
 
*/
 
#include
main(){
int max = 0x1000000,
    min = 0,
    mid,
    val,
    index,
    i;
int bit;
char c;
int one[256], zero[256];
for (i=0;i<256;i++){
   one[i] = 1;
   zero[i] = 1;
   }
val = getchar()<<16;
val += getchar()<<8;
val += getchar();
while(1) {
   c = 0;
   if (val == (max-1)) {
      fprintf(stderr,"expand done\n");
      break;
   }
   for (i=0;i<8;i++){
      index = (1<= mid) {
         bit = 1;
         min = mid;
         one[index]++;
         }
      else {
         bit = 0;
         max = mid;
         zero[index]++;
         }
      c = c + c + bit;
      while ((max-min) < 256) {
         max--;
         val = (val << 8) & 0xffff00 | (getchar()& 0xff);
         min = (min << 8) & 0xffff00;
         max = ((max << 8) & 0xffff00) ;
         if (min >= max) max = 0x1000000;
         }
      }
   putchar(c);
   }
}
 
%¡¾³õѧÌìµØ¡¿
                 
O¡¾ÎÊÌâ´ðÒÉ¡¿
 
4¡¾ÍøÕ¾½éÉÜ¡¿
 
 
,¡¾ÔÓÖ¾ÐÅÏä¡¿
Ͷ¸åÐÅÏ䣺discoveredit@china.com
´ðÒÉÐÅÏ䣺discoveranswer@china.com
°ßÖñÐÅÏ䣺programhunter@china.com