TUTORIAL HTML : Improving HTML Compression
Sample of the PDF document:
2.1 HTML description:
HTML is a language that describes the structure of text-based information in a document. It denotes certain text as links, headings, paragraphs, and lists. It also supplements text with embedded images and other objects. XHTML is a new, XML-based version of HTML.
An HTML document consists of elements. An HTML element always has a start tag (e.g. <elementname>) and may have an end tag (e.g. </elementname>), in contrast to XML or XHTML, where the end tag is required. Elements may have two basic properties: attributes, contained in the start tag (e.g. <elementname attribute="value">), and content, located between the tags (e.g. <elementname>Content</elementname>).
If there is no content, the start tag and the end tag can be written in the short form <elementname/>. Comments in HTML are delimited by the <!-- and --> sequences. Some HTML elements, for example <br>, do not have any content and must not have a closing tag.
The following example [10] contains a document title (<title> element), a heading (<h1> element), paragraphs (<p> elements), and links (<a> elements):
<html>
<head><title>About the Test Data</title></head>
<body>
<h1 align="center">About the Test Data</h1>
<p align="center">Matt Mahoney<br>
Last update: Dec. 17, 2006.
<a href="text.html#history">History</a>
<p>The test data for the <a href="text.html">Large Text Compression Benchmark</a> is the first 10<sup>9</sup> bytes of the English Wikipedia dump on Mar. 3, 2006.
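The element structure described above can be observed with a small parser. The following is a minimal sketch using Python's standard html.parser module; the class name TagLister and the helper list_tags are hypothetical names chosen for illustration. It records start tags with their attributes and end tags in document order, which makes the optional end tag of <p> and the missing end tag of <br> visible.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Collects start tags (with attributes) and end tags in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag, {}))


def list_tags(html_text):
    """Parse html_text and return the recorded tag events."""
    parser = TagLister()
    parser.feed(html_text)
    parser.close()
    return parser.events


# Fragment taken from the example above: <p> is never closed, <br> has no end tag.
sample = '<h1 align="center">About the Test Data</h1><p align="center">Matt Mahoney<br>'
events = list_tags(sample)
for kind, tag, attrs in events:
    print(kind, tag, attrs)
```

Only <h1> produces an end-tag event here; the parser reports what is written in the document and does not insert the omitted end tags.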
2.2 HTTP protocol:
Hypertext Transfer Protocol (HTTP) is a communications protocol used to transfer information on the World Wide Web. HTTP is a request/response protocol between a client and a server. The client makes an HTTP request to the server, which delivers HTML files, images, and other resources.
HTTP compression [13] is the technology used to compress content on a web server (an HTTP server) and to decompress it in a user's browser.
HTTP compression is a recommendation of the HTTP 1.1 protocol specification as it reduces network traffic and improves page download time on slow networks [15].
It is especially useful when the web pages are large.
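As a rough illustration of HTTP compression, the sketch below compresses a response body with gzip only when the client's Accept-Encoding header lists it. The function name maybe_compress is hypothetical, and the header parsing is deliberately simplified: quality values such as "gzip;q=0" are ignored, which a real server would have to honor.

```python
import gzip


def maybe_compress(body: bytes, accept_encoding: str):
    """Return (payload, headers), using gzip only if the client offers it.

    Simplification (assumption): q-values in Accept-Encoding are ignored.
    """
    offered = [token.split(";")[0].strip().lower()
               for token in accept_encoding.split(",")]
    if "gzip" in offered:
        payload = gzip.compress(body)
        headers = {"Content-Encoding": "gzip",
                   "Content-Length": str(len(payload))}
        return payload, headers
    return body, {"Content-Length": str(len(body))}


# A repetitive page stands in for a large HTML document.
html = b'<p align="center">Large Text Compression Benchmark</p>\n' * 200
payload, headers = maybe_compress(html, "gzip, deflate")
print(len(html), "->", len(payload), headers)
```

The client side reverses the step with gzip.decompress(payload) when the response carries Content-Encoding: gzip.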
The experiments conducted by Wan [21] showed that HTTP compression can be improved by using the files previously requested in a browsing session as a dictionary, but this idea has not been incorporated into the HTTP protocol to date.
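The idea of reusing an earlier file as a dictionary can be approximated with the preset-dictionary feature of zlib. This is not the method evaluated by Wan [21], only a sketch of the general principle; the helper names deflate and inflate and the two sample pages are assumptions made for illustration.

```python
import zlib


def deflate(data: bytes, zdict: bytes = b"") -> bytes:
    """Compress data, optionally priming the compressor with a preset dictionary."""
    if zdict:
        comp = zlib.compressobj(level=9, zdict=zdict)
    else:
        comp = zlib.compressobj(level=9)
    return comp.compress(data) + comp.flush()


def inflate(blob: bytes, zdict: bytes = b"") -> bytes:
    """Decompress blob; the same dictionary must be supplied if one was used."""
    decomp = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return decomp.decompress(blob) + decomp.flush()


# Two hypothetical pages from one browsing session that share most of their markup.
previous_page = (b'<html><head><title>About the Test Data</title></head><body>'
                 b'<h1 align="center">About the Test Data</h1>'
                 b'<p align="center">Matt Mahoney<br>')
current_page = (b'<html><head><title>History</title></head><body>'
                b'<h1 align="center">History</h1>'
                b'<p align="center">Matt Mahoney<br>')

without_dict = deflate(current_page)
with_dict = deflate(current_page, zdict=previous_page)
print(len(current_page), len(without_dict), len(with_dict))
```

Both sides must hold an identical copy of the earlier page, which is the part that would need protocol support.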
The popular LZ77-based gzip was intended to be the HTTP compression algorithm. Currently, HTTP servers and clients also support the LZ77-based deflate format.
The Lighttpd server also supports BWT-based bzip2 compression, but this format is only supported by lynx and some other console text browsers. Deflate, gzip, and bzip2, however, are general-purpose compression algorithms, and much better results can be achieved with a compression algorithm specialized for HTML documents.
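To get a baseline for the three general-purpose formats mentioned above, the following sketch measures their output sizes on the same input with Python's standard library. The function name compressed_sizes and the sample data are assumptions; zlib.compress is used here as a stand-in for the HTTP deflate coding, and no specialized HTML compressor is included.

```python
import bz2
import gzip
import zlib


def compressed_sizes(data: bytes) -> dict:
    """Return the size in bytes of data under each general-purpose format."""
    return {
        "raw": len(data),
        "deflate (zlib)": len(zlib.compress(data, 9)),
        "gzip": len(gzip.compress(data, compresslevel=9)),
        "bzip2": len(bz2.compress(data, compresslevel=9)),
    }


# Repetitive markup stands in for a real page; actual ratios depend on the content.
sample = b'<p align="center"><a href="text.html">Large Text Compression Benchmark</a></p>\n' * 500
sizes = compressed_sizes(sample)
for name, size in sizes.items():
    print(f"{name:15s} {size}")
```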
2.3 Word-based compression:
StarNT [20] is a dictionary-based scheme, which replaces natural language words with references to an external dictionary.
A word in the StarNT dictionary is a sequence of symbols over the alphabet [a..z]. There is no need to use uppercase letters in the dictionary, as there are two one-byte flags (reserved symbols), f_cl and f_uw, in the output alphabet to indicate that either a given word starts with a capital letter while the following letters are all lowercase, or a given word consists of capitals only.
Another flag, f_or, precedes an unknown word.
Finally, there is a collision-handling flag, f_esc, used for encoding occurrences of the flags f_cl, f_uw, f_or, and f_esc in the text.
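The role of the four flags can be sketched as follows. The byte values chosen for the flags, the function name encode_word, the treatment of mixed-case words as unknown, and the tiny codeword table are all assumptions made for illustration; the source does not specify them, and decoding is omitted.

```python
# Hypothetical one-character flag values; the real reserved symbols are not given in the text.
F_CL, F_UW, F_OR, F_ESC = "\x01", "\x02", "\x03", "\x04"
FLAGS = {F_CL, F_UW, F_OR, F_ESC}


def escape_flags(text: str) -> str:
    """Prefix every flag symbol that occurs in the text itself with F_ESC."""
    return "".join(F_ESC + ch if ch in FLAGS else ch for ch in text)


def encode_word(word: str, codewords: dict) -> str:
    """Encode one word against a lowercase dictionary {word: codeword}."""
    lower = word.lower()
    if lower not in codewords:
        return F_OR + escape_flags(word)      # unknown word, stored verbatim
    if word == lower:
        return codewords[lower]               # plain lowercase dictionary word
    if len(word) > 1 and word.isupper():
        return F_UW + codewords[lower]        # all capitals
    if word[0].isupper() and word[1:] == lower[1:]:
        return F_CL + codewords[lower]        # first letter capital, rest lowercase
    return F_OR + escape_flags(word)          # mixed case: assumed to be unknown


codewords = {"the": "a", "data": "b"}         # toy dictionary
for w in ["the", "The", "DATA", "Wikipedia"]:
    print(repr(w), "->", repr(encode_word(w, codewords)))
```

Storing only lowercase forms keeps the dictionary small, at the cost of one extra flag byte for each capitalized occurrence.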
The ordering of words in the dictionary D, as well as the mapping of words to unique codewords, is important for compression effectiveness. StarNT uses the following rules:
• The most popular words are stored at the beginning of the dictionary. This group has 312 words.
• The remaining words are stored in D according to their increasing lengths. Words of the same length are sorted according to their frequency of occurrence in some training corpus.
• Only letters [a..zA..Z] are used to represent the codeword (with the intention of achieving better compression performance with the backend compressor).
Each word in D is assigned a corresponding codeword...
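The ordering rules listed above can be sketched in a few lines. The function names are hypothetical, sorting words of equal length from most to least frequent is an assumption, and the codeword mapping shown (shortest strings over [a..zA..Z] assigned to the earliest dictionary positions) is only one plausible scheme, since the excerpt breaks off before describing the actual assignment.

```python
from string import ascii_lowercase, ascii_uppercase

ALPHABET = ascii_lowercase + ascii_uppercase   # the 52 letters [a..zA..Z]


def order_dictionary(frequencies: dict, top_n: int = 312) -> list:
    """Order words: the top_n most frequent first, the rest by length, then frequency."""
    by_freq = sorted(frequencies, key=lambda w: (-frequencies[w], w))
    head = by_freq[:top_n]
    # Assumption: within one length, more frequent words come first.
    tail = sorted(by_freq[top_n:], key=lambda w: (len(w), -frequencies[w], w))
    return head + tail


def index_to_codeword(index: int) -> str:
    """Hypothetical mapping: position 0 -> 'a', 51 -> 'Z', 52 -> 'aa', and so on."""
    letters = []
    index += 1
    while index > 0:
        index, rem = divmod(index - 1, len(ALPHABET))
        letters.append(ALPHABET[rem])
    return "".join(reversed(letters))


# Toy training-corpus counts; top_n is reduced so that both rules are visible.
freqs = {"the": 50, "compression": 40, "of": 30, "html": 20, "data": 10, "a": 5}
ordered = order_dictionary(freqs, top_n=2)
codewords = {word: index_to_codeword(i) for i, word in enumerate(ordered)}
print(ordered)
print(codewords)
```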