[data compression experiment] implementation and analysis of LZW encoding and decoding algorithm

I Experimental purpose

Master the basic principle of dictionary coding, program LZW decoder with C/C++/Python and other languages, and analyze the encoding and decoding algorithm

II Experimental principle

1.LZW coding principle

Core idea:

The dictionary is initialized to contain all single characters, and the current prefix string P is empty.

Enter the next character C in the data stream,

Judge whether P+C is in the dictionary

(1) Exist

Assign P+C as a new prefix string to P

(2) Does not exist

Output the codeword CW corresponding to P, enter P+C into the dictionary, and then assign C to P as a new prefix string

2.LZW decoding principle

The decoding algorithm mainly involves two variables: CW and PW

Core idea:

(1) The first codeword read into the codeword stream is assigned to CW

(2) And assign Dictionary[CW] to the character stream to be output (decoding)

(3) Assign CW to PW as the prefix string, and CW accepts the next new codeword in the codeword stream

(4) Assign Dictionary[PW] to P and Dictionary[CW] to C

(5) Judge whether P+C exists in the dictionary. If it does not exist (and the dictionary has enough space), write P+C as a new term

(6) Assign CW to PW as a new prefix string, and repeat the above steps

exceptional case:

What should I do if the current codeword does not exist in the dictionary? Why?

First, analyze the reasons why the current codeword does not exist in the dictionary:

In the encoding process, when a new string is encountered, write the new string into the dictionary, and then read the next string. If the next string happens to be the newly written string, the decoder cannot update the index number of the new string synchronously due to the delay problem, so the current codeword does not exist in the dictionary.

However, through observation, we will find that in the above special case, the new entry has an obvious feature: the first character is the same as the last character.

Therefore, at the decoding end, we can copy and paste the first symbol of the previous codeword to the position of the last symbol, and the new codeword is the current codeword we want to decode.

LZW decoding code is as follows:

void LZWDecode(BITFILE* bf, FILE* fp)
{
	int character;
	int new_code, last_code;
	int phrase_length;
	unsigned long file_length;
	file_length = BitsInput(bf, 4 * 8);
	if (-1 == file_length) file_length = 0;
	InitDictionary();//Initialize dictionary
	last_code = -1;//pw=-1, as the old character in the string
	while (0 < file_length)
	{
		new_code = input(bf);//Read in a new character, cw
		if (new_code >= next_code)//The index number of the read character is greater than the maximum index number of the initialization dictionary, indicating that the character does not exist in the dictionary
		{ // this is the case CSCSC( not in dict)
			d_stack[0] = character;//character is the string represented by pw
			phrase_length = DecodeString(1, last_code);//decode
		}
		else//The index number of the newly read character exists in the initialization dictionary
		{
			phrase_length = DecodeString(0, new_code);//Direct decoding
		}
		character = d_stack[phrase_length - 1];//Save the first character of cw into character
		while (0 < phrase_length)
		{
			phrase_length--;
			fputc(d_stack[phrase_length], fp);//Output the string corresponding to the current codeword
			file_length--;
		}
		if (MAX_CODE > next_code)//If the dictionary still has space
		{// add the new phrase to dictionary
			AddToDictionary(character, last_code);//Save a string that does not exist in the original dictionary
		}
		last_code = new_code;//Number of updated entries
	//Need to fill
	}
}
int DecodeString( int start, int code)//Direct decoding of codeword index number in dictionary
{
	int count;
	count = start;
	while (0 <= code) 
	{
		d_stack[count] = dictionary[code].suffix;//Incoming new character
		code = dictionary[code].parent;//Assign old characters
		count++;
	}
	return count;
	//Need to fill
}

III Experimental steps

1. First debug the LZW encoding program, take a text file as the input, and get the output LZW encoding file

 2. Taking the encoding file obtained in experimental step 1 as the input, the decoding program of LZW is written. When writing the decoding program, you need to annotate the key statements and explain what to do. In the experimental report, it focuses on how to deal with and explain the reasons when the current codeword does not exist in the dictionary.

The decoding program, comments and how to deal with the current codeword when it does not exist in the dictionary are described in the decoding principle.

It can be seen that the content of the original txt file is the same, and the decoding is successful.

3. Select at least ten different format types of files and compress them with LZW encoder to obtain the output compressed bitstream file. Analyze the compression efficiency of files in different formats.

It can be seen that RGB files and YUV files are reduced after compression, while other file formats become larger after compression.

IV Complete code

bitio.h

/*
 * Declaration for bitwise IO
 *
 * vim: ts=4 sw=4 cindent
 */
#ifndef __BITIO__
#define __BITIO__

#include <stdio.h>

typedef struct{
	FILE *fp;
	unsigned char mask;
	int rack;
}BITFILE;

BITFILE *OpenBitFileInput( char *filename);
BITFILE *OpenBitFileOutput( char *filename);
void CloseBitFileInput( BITFILE *bf);
void CloseBitFileOutput( BITFILE *bf);
int BitInput( BITFILE *bf);
unsigned long BitsInput( BITFILE *bf, int count);
void BitOutput( BITFILE *bf, int bit);
void BitsOutput( BITFILE *bf, unsigned long code, int count);
#endif	// __BITIO__

bitio.c

/*
 * Definitions for bitwise IO
 *
 * vim: ts=4 sw=4 cindent
 */

#include <stdlib.h>
#include <stdio.h>
#include "bitio.h"
BITFILE *OpenBitFileInput( char *filename){
	BITFILE *bf;
	bf = (BITFILE *)malloc( sizeof(BITFILE));
	if( NULL == bf) return NULL;
	if( NULL == filename)	bf->fp = stdin;
	else bf->fp = fopen( filename, "rb");
	if( NULL == bf->fp) return NULL;
	bf->mask = 0x80;
	bf->rack = 0;
	return bf;
}

BITFILE *OpenBitFileOutput( char *filename){
	BITFILE *bf;
	bf = (BITFILE *)malloc( sizeof(BITFILE));
	if( NULL == bf) return NULL;
	if( NULL == filename)	bf->fp = stdout;
	else bf->fp = fopen( filename, "wb");
	if( NULL == bf->fp) return NULL;
	bf->mask = 0x80;
	bf->rack = 0;
	return bf;
}

void CloseBitFileInput( BITFILE *bf){
	fclose( bf->fp);
	free( bf);
}

void CloseBitFileOutput( BITFILE *bf){
	// Output the remaining bits
	if( 0x80 != bf->mask) fputc( bf->rack, bf->fp);
	fclose( bf->fp);
	free( bf);
}

int BitInput( BITFILE *bf){
	int value;

	if( 0x80 == bf->mask){
		bf->rack = fgetc( bf->fp);
		if( EOF == bf->rack){
			fprintf(stderr, "Read after the end of file reached\n");
			exit( -1);
		}
	}
	value = bf->mask & bf->rack;
	bf->mask >>= 1;
	if( 0==bf->mask) bf->mask = 0x80;
	return( (0==value)?0:1);
}

unsigned long BitsInput( BITFILE *bf, int count){
	unsigned long mask;
	unsigned long value;
	mask = 1L << (count-1);
	value = 0L;
	while( 0!=mask){
		if( 1 == BitInput( bf))
			value |= mask;
		mask >>= 1;
	}
	return value;
}

void BitOutput( BITFILE *bf, int bit){
	if( 0 != bit) bf->rack |= bf->mask;
	bf->mask >>= 1;
	if( 0 == bf->mask){	// eight bits in rack
		fputc( bf->rack, bf->fp);
		bf->rack = 0;
		bf->mask = 0x80;
	}
}

void BitsOutput( BITFILE *bf, unsigned long code, int count){
	unsigned long mask;

	mask = 1L << (count-1);
	while( 0 != mask){
		BitOutput( bf, (int)(0==(code&mask)?0:1));
		mask >>= 1;
	}
}
#if 0
int main( int argc, char **argv){
	BITFILE *bfi, *bfo;
	int bit;
	int count = 0;

	if( 1<argc){
		if( NULL==OpenBitFileInput( bfi, argv[1])){
			fprintf( stderr, "fail open the file\n");
			return -1;
		}
	}else{
		if( NULL==OpenBitFileInput( bfi, NULL)){
			fprintf( stderr, "fail open stdin\n");
			return -2;
		}
	}
	if( 2<argc){
		if( NULL==OpenBitFileOutput( bfo, argv[2])){
			fprintf( stderr, "fail open file for output\n");
			return -3;
		}
	}else{
		if( NULL==OpenBitFileOutput( bfo, NULL)){
			fprintf( stderr, "fail open stdout\n");
			return -4;
		}
	}
	while( 1){
		bit = BitInput( bfi);
		fprintf( stderr, "%d", bit);
		count ++;
		if( 0==(count&7))fprintf( stderr, " ");
		BitOutput( bfo, bit);
	}
	return 0;
}
#endif

lzw.c

/*
 * Definition for LZW coding 
 *
 * vim: ts=4 sw=4 cindent nowrap
 */
#include <stdlib.h>
#include <stdio.h>
#include "bitio.h"
#define MAX_CODE 65535

struct {
	int suffix;
	int parent, firstchild, nextsibling;
} dictionary[MAX_CODE+1];
int next_code;
int d_stack[MAX_CODE]; // stack for decoding a phrase

#define input(f) ((int)BitsInput( f, 16))
#define output(f, x) BitsOutput( f, (unsigned long)(x), 16)

int DecodeString( int start, int code);
void InitDictionary( void);
void PrintDictionary( void)//Print Dictionary
{
	int n;
	int count;
	for( n=256; n<next_code; n++)//The initialization dictionary contains only 256 entries (8bit by default)
	{
		count = DecodeString( 0, n);
		printf( "%4d->", n);
		while( 0<count--) printf("%c", (char)(d_stack[count]));
		printf( "\n");
	}
}

int DecodeString( int start, int code)//Direct decoding of codeword index number in dictionary
{
	int count;
	count = start;
	while (0 <= code) 
	{
		d_stack[count] = dictionary[code].suffix;//Incoming new character
		code = dictionary[code].parent;//Assign old characters
		count++;
	}
	return count;
	//Need to fill
}
void InitDictionary( void){
	int i;

	for( i=0; i<256; i++){
		dictionary[i].suffix = i;
		dictionary[i].parent = -1;
		dictionary[i].firstchild = -1;
		dictionary[i].nextsibling = i+1;
	}
	dictionary[255].nextsibling = -1;
	next_code = 256;
}
/*
 * Input: string represented by string_code in dictionary,
 * Output: the index of character+string in the dictionary
 * 		index = -1 if not found
 */
int InDictionary( int character, int string_code){
	int sibling;
	if( 0>string_code) return character;
	sibling = dictionary[string_code].firstchild;
	while( -1<sibling){
		if( character == dictionary[sibling].suffix) return sibling;
		sibling = dictionary[sibling].nextsibling;
	}
	return -1;
}

void AddToDictionary( int character, int string_code){
	int firstsibling, nextsibling;
	if( 0>string_code) return;
	dictionary[next_code].suffix = character;
	dictionary[next_code].parent = string_code;
	dictionary[next_code].nextsibling = -1;
	dictionary[next_code].firstchild = -1;
	firstsibling = dictionary[string_code].firstchild;
	if( -1<firstsibling){	// the parent has child
		nextsibling = firstsibling;
		while( -1<dictionary[nextsibling].nextsibling ) 
			nextsibling = dictionary[nextsibling].nextsibling;
		dictionary[nextsibling].nextsibling = next_code;
	}else{// no child before, modify it to be the first
		dictionary[string_code].firstchild = next_code;
	}
	next_code ++;
}

void LZWEncode( FILE *fp, BITFILE *bf)
{
	int character;
	int string_code;
	int index;
	unsigned long file_length;

	fseek( fp, 0, SEEK_END);
	file_length = ftell( fp);
	fseek( fp, 0, SEEK_SET);
	BitsOutput( bf, file_length, 4*8);
	InitDictionary();
	string_code = -1;
	while( EOF!=(character=fgetc( fp))){
		index = InDictionary( character, string_code);
		if( 0<=index){	// string+character in dictionary
			string_code = index;
		}else{	// string+character not in dictionary
			output( bf, string_code);
			if( MAX_CODE > next_code){	// free space in dictionary
				// add string+character to dictionary
				AddToDictionary( character, string_code);
			}
			string_code = character;
		}
	}
	output( bf, string_code);
}

void LZWDecode(BITFILE* bf, FILE* fp)
{
	int character;
	int new_code, last_code;
	int phrase_length;
	unsigned long file_length;
	file_length = BitsInput(bf, 4 * 8);
	if (-1 == file_length) file_length = 0;
	InitDictionary();//Initialize dictionary
	last_code = -1;//pw=-1, as the old character in the string
	while (0 < file_length)
	{
		new_code = input(bf);//Read in a new character, cw
		if (new_code >= next_code)//The index number of the read character is greater than the maximum index number of the initialization dictionary, indicating that the character does not exist in the dictionary
		{ // this is the case CSCSC( not in dict)
			d_stack[0] = character;//character is the string represented by pw
			phrase_length = DecodeString(1, last_code);//decode
		}
		else//The index number of the newly read character exists in the initialization dictionary
		{
			phrase_length = DecodeString(0, new_code);//Direct decoding
		}
		character = d_stack[phrase_length - 1];//Save the first character of cw into character
		while (0 < phrase_length)
		{
			phrase_length--;
			fputc(d_stack[phrase_length], fp);//The output string corresponding to the current codeword
			file_length--;
		}
		if (MAX_CODE > next_code)//If the dictionary still has space
		{// add the new phrase to dictionary
			AddToDictionary(character, last_code);//Save a string that does not exist in the original dictionary
		}
		last_code = new_code;//Number of updated entries
	//Need to fill
	}
}



int main( int argc, char **argv){
	FILE *fp;
	BITFILE *bf;

	if( 4>argc){
		fprintf( stdout, "usage: \n%s <o> <ifile> <ofile>\n", argv[0]);
		fprintf( stdout, "\t<o>: E or D reffers encode or decode\n");
		fprintf( stdout, "\t<ifile>: input file name\n");
		fprintf( stdout, "\t<ofile>: output file name\n");
		return -1;
	}
	if( 'E' == argv[1][0]){ // do encoding
		fp = fopen( argv[2], "rb");
		bf = OpenBitFileOutput( argv[3]);
		if( NULL!=fp && NULL!=bf){
			LZWEncode( fp, bf);
			fclose( fp);
			CloseBitFileOutput( bf);
			fprintf( stdout, "encoding done\n");
		}
	}else if( 'D' == argv[1][0]){	// do decoding
		bf = OpenBitFileInput( argv[2]);
		fp = fopen( argv[3], "wb");
		if( NULL!=fp && NULL!=bf){
			LZWDecode( bf, fp);
			fclose( fp);
			CloseBitFileInput( bf);
			fprintf( stdout, "decoding done\n");
		}
	}else{	// otherwise
		fprintf( stderr, "not supported operation\n");
	}
	return 0;
}

Posted by PDXDesigner on Wed, 13 Apr 2022 20:05:27 +0930