A LOOK AT NET FILE FORMATS
by Mageshkumar


Abstract

The last three years have seen rapid growth in the use of the World Wide Web, in terms of the number of users of the Internet in industry and in other social and cultural fields. It has now established itself as one of the core technologies of the 1990s. Businesses and companies, from news and entertainment to financial services, manufacturing and engineering, have accepted this new technology to revolutionize the way they communicate with their customers and offer their products and services via the web. This can be attributed to the inexpensive and easy access to, and distribution of, digital data.

Consequently, this article is a description of that digital data. The many types of file formats used to store the different kinds of data, such as text and images, are discussed by expanding on each kind to describe its characteristics. Furthermore, the need to convert unknown formats to a known and accessible format during interaction between the client and server is highlighted.


The web can be generalised as consisting of browsers, which are programs or user tools, and servers, which are programs running on remote computers on the net. Using the browser, a user can request information from the server, or ask it to run special programs called CGI (Common Gateway Interface) scripts on his/her behalf. The browser can understand and display HTML documents.

When the user clicks on a highlighted piece of text used as a hypertext link to another page, or on a clickable image, the browser sends a request to a web server; for a clickable image, the coordinates of the click position are transferred to a CGI script running on the server. This communication between the browser and the server is by means of the Hypertext Transfer Protocol (HTTP). HTTP is a client-server protocol in which the user (client) and the server engage in two-way communication. The CGI script uses these coordinates to retrieve the URL (Uniform Resource Locator) of the required internet resource, and then decides what page to send back to the browser in return.

The data received by the browser can be a file of any format, e.g. a PostScript file, a plain text file, a GIF file etc. Browsers like Lynx are text-only browsers, but most modern browsers, like Mosaic, Netscape and Microsoft Explorer, can handle a variety of file formats. Occasionally, a hypertext link causes the downloading of a file whose format the browser cannot display 'inline'. In such cases, the browser uses special application programs called 'helper' applications to convert the unknown file format to one of several suitable formats that the browser can display inline.

The server, as part of the HTTP protocol, tells the browser the type of data being sent. It does this by transmitting a special message called a MIME content-type header to the browser just before the transfer of the actual data. For a GIF image file transfer the message would be 'Content-Type: image/gif', and for an MPEG movie 'Content-Type: video/mpeg'. If a file is downloaded from an FTP server, the browser has to guess the type of data in it. It does so by checking the file extension against a database available to the browser, in order to determine the MIME type of the file. The database can be updated to add new file types (usually through a pull-down menu option in most MS-Windows and Macintosh browsers).
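The extension-to-MIME-type lookup described above can be sketched with Python's standard mimetypes module. This is only an illustration: the file names, the helper function content_type_header and the '.xmpl' type are invented for the example, and a real browser or server keeps its own table.

```python
import mimetypes

def content_type_header(filename, default="application/octet-stream"):
    """Build the Content-Type line a server might send for a file name."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    # Fall back to a generic binary type when the extension is unknown.
    return "Content-Type: " + (mime_type or default)

# The table can be extended with new types, much as a browser's database can.
# Both the type name and the extension here are hypothetical.
mimetypes.add_type("application/x-example", ".xmpl")

for name in ("logo.gif", "clip.mpeg", "notes.txt", "data.xmpl", "mystery"):
    print(name, "->", content_type_header(name))
```

A file with no recognisable extension, like 'mystery' here, is exactly the case where a browser would hand the data to a helper application or ask the user what to do.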

---------------------------------------------------------------------------
The different types of files that a URL can refer to can be classified as:

  1. text files
  2. picture or graphic image files
  3. sound files
  4. movie files
---------------------------------------------------------------------------

Each one is discussed separately below.

1. (Plain) Text files(ASCII format, extension .txt)

Every browser in use today has the text file format as one of its basic file formats (i.e. 80-column text). Text files are represented as 7-bit ASCII (American Standard Code for Information Interchange) characters, and can be produced by all word processors (basically, any text editor). Text files are used for very simple documents where the data to be displayed need not be semantically characterised. The list below states some of the text file formats that could be used on the net.

a. SGML & HTML files.

SGML (Standard Generalised Markup Language) is a metalanguage used to define markup languages. It is a complex standard defined by the International Organization for Standardization (ISO), and is suited to a broad range of publishing, from conventional single-medium publishing on paper to online multimedia/hypermedia publishing. Hence, with SGML, a user can 'mark up' a piece of text to his/her liking. Marking up refers to adding pieces of text, called tag elements, to the data in a file; these describe the way the data is presented semantically or logically, rather than physically (e.g. 10-point Times Roman). A word processor, by contrast, uses proprietary codes to indicate, for example, how pieces of text should be printed, what font is to be used, how paragraphs are aligned, where lines break etc.

HTML (Hypertext Markup Language) is one instance of SGML. It is the language used to mark up text for display on the web. The SGML definition of HTML syntax is contained in a special SGML document called a Document Type Definition (DTD), e.g. html.dtd. This file is used in association with SGML parsing programs, like sgmls written by James Clark, to check for markup errors.

HTML ------------> Extension - .html (UNIX)
.htm (PC/Windows and Macintosh)

There are numerous editors aimed at HTML. Several are available for various computing environments (Mac, MS-Windows, UNIX etc.). For a comprehensive list, a good starting place is at URL:
http://union.ncsa.uiuc.edu/HyperNews/get/www/html/editors.html

b. Bookmaster files.

Bookmaster is a text markup language that is used mainly in publishing. In particular, it is an advanced version of the Generalized Markup Language (GML). Bookmaster enables high quality publications with visual appeal, clarity, and elegance. Furthermore, it allows flexibility and control over the publications and reduces the turnaround time for publishing. As with HTML, the writers simply specify the elements of their text. Bookmaster takes care of most of the publishing work, like text formatting and typesetting.

Bookmaster -----------> Extension - .bkm
Used by IBM and other large corporations for high
quality, larger scale publishing from memos to
technical documentation.

c. RTF (Rich Text Format)

RTF was developed by Microsoft as an open format for exchanging text and graphics between different personal computers and operating systems. It is another document formatting language, but is richer than HTML in that RTF files can contain a vast variety of objects such as footnotes, headers, multilingual text, geometric and raster graphics, symbol sets, tables etc.

RTF ---------------> Extension - .rtf (UNIX and Mac)
Any word processing or publishing package can be used
to create and edit RTF files, e.g. MS-Word,
WordPerfect, Professional Write etc. Rtftohtml is a filter
used to translate RTF to HTML for display on the web.

Note on Macintosh text file formats

Most of the files available for the Macintosh on the net are encoded in a file format known as BinHex. This is due to the unique structure of files on the Mac. When Mac files are downloaded, they cannot be directly viewed or edited. BinHexing a Mac file encodes it into a plain text file in ASCII format; thus, to use it, it has to be decoded. The most widely used and freely available software tool for this (it can be pulled off the net) is StuffIt Expander. This is a decompression program that can expand not only BinHex files, but also various other encoded files from different platforms.

Binhex ------------> Extension - .hqx
BinHex 4.0 encodes Mac files into .hqx.
BinHex13 is used in DOS/Windows to decode.

d. PostScript files

PostScript is a page description language developed by John Warnock of Adobe Systems; it appeared in 1985 in the Apple LaserWriter. It is a programming language optimized for printing graphics and text (on paper or VDU). Instead of the usual method of transmitting graphics and character information to a printer, telling it where each dot should be placed on the page, PostScript gives the (laser) printer a way to interpret mathematically how the shapes and curves fit the page. In other words, the page to be printed is described in a device-independent way, so that the same document can be printed on any PostScript printer (whether a Linotron or a LaserWriter, for example) if it is in a PostScript file. Device independence means that the description makes no reference to specific features of the target printer, such as its resolution. [In practice, however, some PostScript files are known to make certain assumptions about the target device.]

The PostScript language is interpreted and stack-based, like an RPN calculator: operands (numbers) are pushed onto the stack, and an operator (*, + etc.) takes the top two elements of the stack, performs some mathematical operation on them, and pushes the result back onto the top of the stack.
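The stack discipline described above can be sketched in a few lines of Python. This is a toy evaluator, not a PostScript interpreter: it assumes a program consisting only of numbers and the four arithmetic operator names add, sub, mul and div, and the function name run_postfix is invented for the example.

```python
def run_postfix(program):
    """Evaluate a tiny PostScript-like program of numbers and arithmetic operators."""
    operators = {
        "add": lambda a, b: a + b,
        "sub": lambda a, b: a - b,
        "mul": lambda a, b: a * b,
        "div": lambda a, b: a / b,
    }
    stack = []
    for token in program.split():
        if token in operators:
            b = stack.pop()  # the top of the stack is the second operand
            a = stack.pop()
            stack.append(operators[token](a, b))  # push the result back
        else:
            stack.append(float(token))  # operands are simply pushed
    return stack

# (3 + 4) * 2 written in postfix order
print(run_postfix("3 4 add 2 mul"))
```

Note that the operands come first and the operator last, which is why no brackets are needed to express the order of evaluation.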

Postscript ---------> Extension - .ps
Text files but not human readable, and different from
.pdf postscript files, which are specific to Adobe Acrobat.
Use the public-domain PostScript file viewer called
Ghostscript (Windows). For the Mac, the .ps files
have to be decompressed first.

e. Other text files

  • Source program files, e.g. .C (C files) etc. Can be downloaded via FTP
  • Integrated application software files such as Ability (offers spreadsheets, word processing, database, business graphics, communications, and presentation graphics)
  • DIF files (data interchange format)
  • Certain database files like dBase II
  • And many more......
--------------------------------------------------------------------------

The advantage of text files in ASCII format is that they are portable between different electronic mail systems. [File formats such as DIF - marketed by Lotus Corp. - aim to provide a program-independent method for storing data.] Generally, conversion of text files (or any files) into other formats would result
-----------------------------------------------------------------------------


2. Picture or graphic images(Binary format)

The majority of web pages today contain many kinds of embedded images. The graphic images are of two types.

  1. Simple images - which are just embedded on the pages for display or decoration purposes.
  2. Clickable imagemaps - which are actually picture icons representing thumbnail sketches.

There are very many formats for storing the digital data corresponding to graphic images, each with its own advantages and disadvantages. Most web browsers (excluding text-only ones like Lynx) can display only a handful of these formats inline. The format of about 90% of these files is binary. The four most universally accepted formats for 2-D graphics are:

  1. Graphics Interchange Format - GIF (GIF87a and GIF89a)
  2. Joint Photographic Experts Group - JPEG
  3. X-Bitmap/X-Pixelmap
  4. Portable Network Graphics - PNG

Each one is discussed separately below.

a. GIF

This is the most widely used raster graphics file format on the web. It was developed by CompuServe and stores images in a compressed format. All graphical browsers can display GIF inline.

GIF Colour encoding format

GIF files are limited to a maximum of 256 colours (8-bit colour). Within that limit, the graphic can be a black and white, grey scale, or colour image. A GIF image is first analysed by an image analysis algorithm to determine the set of 256 or fewer colours that best describes the colours in the image. A colour table is then created, and each pixel colour is mapped to a number in the range 0-255 representing the colour in the table closest to the actual colour. The resulting GIF image consists of an array of these colour indices plus a colour map with the desired mappings.
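The mapping of pixel colours to colour-table indices can be sketched as follows. This is a minimal illustration under stated assumptions: the four-entry palette and the sample pixels are invented, 'closest' is taken to mean smallest squared distance in RGB space, and the step that chooses the palette itself is not shown.

```python
def nearest_index(pixel, palette):
    """Return the index of the palette colour closest to an (r, g, b) pixel."""
    def distance(colour):
        # Squared Euclidean distance in RGB space (an assumed closeness measure).
        return sum((p - c) ** 2 for p, c in zip(pixel, colour))
    return min(range(len(palette)), key=lambda i: distance(palette[i]))

def index_image(pixels, palette):
    """Replace every pixel by the index of its closest colour-table entry."""
    return [nearest_index(p, palette) for p in pixels]

# A hypothetical 4-entry colour table: black, white, red, blue.
palette = [(0, 0, 0), (255, 255, 255), (255, 0, 0), (0, 0, 255)]
pixels = [(250, 10, 5), (3, 3, 3), (20, 30, 240), (240, 240, 240)]

indices = index_image(pixels, palette)
print(indices)
```

The image is then stored as the list of small index numbers plus the colour table, rather than as three colour bytes per pixel.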

GIF file compression

The 'raw' GIF file (after encoding) is large. It can be compressed into a smaller file using the LZW (Lempel-Ziv-Welch) method of compression, in which repeated sequences of colour indices are encoded using shorter codes. As a simplified illustration of the idea, a run of 50 pixels of the same colour could be encoded as 50R, where R is the code for the colour, instead of RRR.....RRRR (LZW itself builds a table of recurring sequences rather than counting runs). Compression ratios of 1.5:1 to 2:1 can be successfully achieved.
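The table-building idea behind LZW can be sketched as below. This is a simplified version for illustration only: it works on a plain list of colour indices and returns a list of integer codes, and it leaves out the details a real GIF encoder needs (clear and end-of-information codes, variable code widths, and packing the codes into bytes). The function name lzw_compress is an assumption of the example.

```python
def lzw_compress(indices, table_size=256):
    """Compress a sequence of colour indices (0..table_size-1) into LZW codes."""
    # Start with one table entry per single colour index.
    dictionary = {(i,): i for i in range(table_size)}
    next_code = table_size
    current = ()
    codes = []
    for value in indices:
        candidate = current + (value,)
        if candidate in dictionary:
            current = candidate  # keep extending a sequence already in the table
        else:
            codes.append(dictionary[current])  # emit the longest known sequence
            dictionary[candidate] = next_code  # remember the new, longer sequence
            next_code += 1
            current = (value,)
    if current:
        codes.append(dictionary[current])
    return codes

row = [7] * 50  # fifty pixels of the same colour index
codes = lzw_compress(row)
print(len(row), "indices ->", len(codes), "codes")
```

Because each new table entry is one index longer than the last, long runs and repeated patterns are covered by progressively fewer codes.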

There are two common versions of GIF (extension .gif):

  • GIF87a
  • GIF89a

GIF89a has more features than GIF87a, for example colour transparency, which allows the background to show through.

Software- Lview Pro can be used in the Windows environment.(It supports
both versions, provides conversion to GIF, and also image editing)
Macgzip for mac to extract and view gifs.

b. JPEG

JPEG is a standard method of storing photographic images in a compressed digital form. It is an extremely sophisticated image format that supports full 24-bit colour (over 16 million colours) instead of only 256 colours. In general, JPEG is far better than GIF for either full-colour or greyscale images of natural, real-world scenes. It works well on photographs, naturalistic artwork, and similar material. Note that JPEG handles only still images. Most graphical browsers in use today support JPEG.

JPEG Compression

JPEG uses a 'lossy' compression method. This means that the decompressed file will not be exactly the same (bit for bit) as the original picture. This is acceptable because the human eye perceives small colour changes less accurately than small changes in brightness. In other words, the idea is to exploit the limitations of the human eye. The degree of lossiness can be varied by adjusting compression parameters in the software. This useful property of JPEG enables a trade-off between file size and output image quality: the resulting file can be made extremely small at the cost of poor quality, or it can be made to have high quality, in which case the compression is less. Similarly, JPEG decoders can trade off decoding speed against image quality by using fast but inaccurate approximations to the required calculations. Furthermore, there are two types of JPEG image analysis for compression. They are:

  1. Simple or Baseline JPEG - The most widely used form. Does one top-to-bottom scan of the picture.
  2. Progressive JPEG - This divides the file into a series of scans. The first scan produces a low-quality image which occupies very little space. Subsequent scans add more data to this first scan, thus increasing its size. This is not as widely used as baseline JPEG, but is more suitable for compressing images in real time.

The current JPEG specification defines two different coding methods for the final output of the compressed data. They are:
  • Huffman Coding
  • Arithmetic coding (patent owned by IBM, AT&T, and Mitsubishi)

The choice has no effect on output image quality, but arithmetic coding produces a smaller compressed file (approx. 5% to 10% smaller than the file produced by Huffman coding). The state-of-the-art image compression techniques are fractals and wavelets.

Hardware requirements- JPEG represents uncompressed full colour images in
normally 24 bits/pixel. So an SVGA video adapter that
is VESA compatible or greater needs to be used. This
has a resolution of up to 1280 by 1024 with 16,777,216
colours. Various resolutions call for different amounts
of video RAM:

Resolution          Colour depth    VRAM
640X480             24-bit          1M
800X600             24-bit          2M
1024X768 & above    24-bit          4M
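The VRAM figures in the table follow from simple arithmetic: width times height times 3 bytes per pixel at 24-bit colour. The sketch below works this out, under the assumption (not stated in the table) that video memory comes in power-of-two sizes of 1M, 2M, 4M and so on; the function names are invented for the example.

```python
def framebuffer_bytes(width, height, bits_per_pixel=24):
    """Bytes needed to hold one full screen at the given colour depth."""
    return width * height * bits_per_pixel // 8

def vram_needed_mib(width, height, bits_per_pixel=24):
    """Smallest power-of-two amount of VRAM (in MiB) that holds the frame buffer."""
    size = framebuffer_bytes(width, height, bits_per_pixel)
    mib = 1
    while mib * 2**20 < size:
        mib *= 2  # assumed: memory is fitted in 1M, 2M, 4M, ... steps
    return mib

for width, height in [(640, 480), (800, 600), (1024, 768)]:
    size = framebuffer_bytes(width, height)
    print(f"{width}x{height}: {size} bytes -> {vram_needed_mib(width, height)}M of VRAM")
```

For example, 640X480 needs 921,600 bytes, which just fits in 1M, while 800X600 needs 1,440,000 bytes and therefore calls for the next size up.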

It is claimed that using a maths coprocessor during compression/decompression of JPEG images speeds things up. But JPEG uses only integer arithmetic, so an FPU chip would not do much. DSP chips do speed up repetitive integer arithmetic. Hence, programming a DSP chip for JPEG can yield significant speed-ups.

Video processor

Many of the graphics accelerator boards on the market today use a fixed-function accelerator chip. The circuitry on the card does many of the time-consuming video tasks such as drawing lines, circles etc., but the CPU still directs the card by passing graphics primitive commands from applications. The latest trend, and one which is more suitable for JPEG, is to employ a chip technology called coprocessing. In this case, the video card includes its own dedicated processor, freeing the CPU to carry out other tasks. The preferred bus system is PCI; its extension, the mezzanine bus, is preferable although not essential. PCI video cards with plug and play support require little configuration.

JPEG --------------> Extension - .jpg (DOS/Windows)
.jpeg
.jfif

JPEGView for Mac to view PICT(Mac JPEG compressed
file), GIF, JPEG, TIFF.
LviewPro, PolyView for Windows.

c. X-Bitmap/X-Pixelmap(.xbm/.xpm)

X-Bitmaps are a common format on UNIX platforms, and are often found in older image and icon libraries. Here, one bit is used to represent each pixel of the graphic. Consequently, only black and white images are supported. Transparency is possible, as in GIF, since the white portion is treated as the colour of the underlying background (making way for attractive designs). X-Pixelmap is the colour equivalent of X-Bitmap. Here, 8 bits represent each pixel (256 colours). Both formats are very inefficient in terms of the storage space required, and they are quite uncommon outside the UNIX environment.

d. PNG(.png)

This format is now gaining popularity but is not yet universally supported. It was designed to be a public-domain successor to GIF. It provides the GIF features of transparency, interlacing, and image compression, and adds further features such as improved transparency, so that images can fade in, and colour and gamma correction.

Brief comparison between GIF and JPEG for the WWW

  1. JPEG (full colour or grey scale) is very efficient for storing photographic, realistic images where there is a continuous variation in colour. GIF is suitable for representing simple pictures like line drawings, thumbnail sketches, simple cartoons etc. For such pictures, GIF can compress much better than JPEG.
  2. JPEG compression is a lossy method but GIF is a lossless method. Furthermore, JPEG losses accumulate with consecutive compressions and decompressions due to continuous editing. An image should be converted to JPEG for publishing only if it can be guaranteed that there will be no further modifications of the image.
  3. Using JPEG, a trade-off between image file size and image quality is possible. This is useful when -
    • the hardware available cannot support 24-bit colour (less VRAM) or the modem is not fast enough.
    • the transmission of the file over the network has to be faster (this calls for smaller file sizes).
    • the image file is to be archived.
    • uniform viewing of the image is required. Different users have displays with different capabilities. This is essential to avoid quantisation loss.
    • the complexity of the decoder needs to be reduced.
    • storage space is limited.
  4. Various compression ratios can be achieved depending on the use, e.g. 2:1 (lossless), 30:1, 50:1, etc. There is also software that enables GIF file size to be reduced (by using the fact that a pixel need not always be represented by 8 bits), e.g. BatchMaster (Windows), Debabelizer (Macs) etc.

Other types of binary image files
  1. .cgm files - the ISO 1987 standard computer graphics metafile based on$ graphical language.
  2. .pdf files - Adobe Acrobat postscript file. Use Adobe Acrobat reader w$ file for Windows/Mac platforms.
  3. .tiff/.tif files - Tagged Image File Format. An extension of the Aldus TIFF format. This was an attempt to define a standard JPEG base file format. It is a high quality format which records all the characteristics of the image.
  4. .rgb files - the RED GREEN BLUE file format of Silicon Graphics used by most visualization software packages as the internal image format.
  5. .bmp files - Windows Bitmap files that can be created using Ms-Windows Paintbrush.
  6. .ppt files - Windows Powerpoint 3.0 graphics files from Microsoft.

    There are many other image files in binary format representing 2-D raster and 3-D graphics.
=============================================================================

    3. Sound files

    There are several file formats for storing and editing digitized sound, pertaining to different platforms, that can be used on the net. Sound files are binary files. In general, sound files can be categorized into two types:


    • Self-describing formats - These files contain information about the device parameters and encoding method as header data. Consequently, one such format can support a family of encodings. Typical header information would be, e.g., parameters of the sampling device, human-readable descriptions of the sound, a copyright notice etc. Encoding data would describe how the sound samples are actually stored in the file, e.g. short int or long int, signed or unsigned, big endian or little endian etc.

    • 'Raw' formats -These are headerless files and thus the device parameters and encoding are fixed.
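As an illustration of a self-describing format, the sketch below uses Python's standard wave module to write one second of silence as a .wav file held in memory, and then reads the recording parameters back from the header alone. The particular values (mono, 16-bit samples, 8000 samples/s) are arbitrary choices made for the example.

```python
import io
import wave

# Write one second of 16-bit mono silence into an in-memory .wav file.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)      # mono
    out.setsampwidth(2)      # 2 bytes = 16 bits per sample
    out.setframerate(8000)   # samples per second
    out.writeframes(b"\x00\x00" * 8000)

# A player needs no outside knowledge: the header describes the data.
buffer.seek(0)
with wave.open(buffer, "rb") as snd:
    header = (snd.getnchannels(), snd.getsampwidth(),
              snd.getframerate(), snd.getnframes())

print("channels, bytes/sample, samples/s, frames:", header)
```

With a 'raw' file the same four values would have to be agreed in advance, since nothing in the file records them.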

    Sound files are large files. Hence, transmission over the net would always be slow. For example, one minute of audio saved to a .wav file requires 2.5M to 10M of disk space, depending on recording options. Compressed sound files have also been used, employing methods such as Huffman encoding or simple silence deletion.
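The 2.5M to 10M range quoted for one minute of .wav audio can be checked with a little arithmetic. The recording options used below (16-bit samples, 22,050 samples/s mono at the low end and 44,100 samples/s stereo at the high end) are assumed typical settings, not figures taken from this article, and the small file header is ignored.

```python
def pcm_bytes(seconds, sample_rate, bits_per_sample, channels):
    """Size in bytes of uncompressed (PCM) audio."""
    return seconds * sample_rate * (bits_per_sample // 8) * channels

low = pcm_bytes(60, 22050, 16, 1)   # assumed: 22.05 kHz, 16-bit, mono
high = pcm_bytes(60, 44100, 16, 2)  # assumed: 44.1 kHz, 16-bit, stereo

print(f"one minute, low setting:  {low} bytes ({low / 2**20:.1f}M)")
print(f"one minute, high setting: {high} bytes ({high / 2**20:.1f}M)")
```

Halving the sampling rate or dropping from stereo to mono each halves the file size, which is why the recording options matter so much.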

    Types of sound files used on the net


    1. .au or .snd files - first used on the Sun. Played using Waveform Hold and Modify for Windows; Sound player, SoundApp for Macs.
    2. .wav files - Windows sound files
    3. MIDI files - Musical Instrument Digital Interface files hold programs written in the MIDI language, which lets the user store, edit, and play back music in tandem with a MIDI-compatible electronic musical instrument like a keyboard synthesizer. They take much less space than .wav files (typically, 1M->500K).

    The only problem is that different users on the web could be using different makes of sound cards. The absence of an industry-standard sound card adds to this problem. For example, the quality of MIDI reproduction varies between sound cards using wavetable synthesis for MIDI reproduction. Further, some sound cards use adaptive differential pulse code modulation to reduce file size. Any 16-bit sound card would provide good quality reproduction of sound.

    4. Movie Files

    A few years ago, live TV on the computer was still at a developing stage. Today, the availability of fast and sophisticated hardware plus complex software algorithms has enabled live TV to become a reality via the WWW. The standard movie file format for the net is MPEG (Moving Picture Experts Group). MPEG is an ISO standard technique for compressing digital data. The MPEG standards that have been defined are:


    • MPEG I - widely available format
    • MPEG II & IV

    Note: Digital motion video can be accomplished with JPEG, if the hardware is fast enough to process 30 images/s. This is usually called M-JPEG and has no defined standard.

    MPEG Compression

    MPEG uses many of the same techniques as JPEG, but adds interframe compression to exploit the similarities that usually exist between successive frames. This exploitation of redundancy between successive frames to achieve maximum compression is a feature not possessed by JPEG-based real-time video compression. This feature makes MPEG more suitable for digital video compression.

    MPEG uses intraframe compression for I frames, and interframe compression for P and B frames. Intraframe compression (I pictures) means that pictures are coded independently of any other picture, considering only the redundant information within the current picture. For example, if a section of a video picture or frame shows an object that is of just one colour, then every pixel corresponding to that object need not fully specify that colour. Thus, this section could be defined as containing colour x, rather than specifying colour x for each of the object's pixels many times over. Interframe compression covers P pictures (predicted pictures) and B pictures (interpolated pictures). Here, if frame n has an object in it that is in the exact same position in frame n+1, then the information for that object need not be transmitted in its entirety for both frames. For B pictures, which draw on both an earlier and a later frame, this process is sometimes referred to as bidirectional prediction. Interframe compression applies between P frames, B frames, and I frames. B frames require the most computation, but their importance stems from the fact that they enable the high compression attainable under MPEG.
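The basic idea of interframe compression, sending only what changed since the previous frame, can be sketched as below. This is a deliberately simplified illustration and not the MPEG algorithm: real MPEG works on blocks of pixels with motion compensation, whereas this example treats a frame as a short list of pixel values and records the positions that differ. The function names and sample frames are invented.

```python
def encode_delta(previous, current):
    """List (position, new_value) for every pixel that changed since the previous frame."""
    return [(i, new) for i, (old, new) in enumerate(zip(previous, current)) if old != new]

def apply_delta(previous, delta):
    """Rebuild the current frame from the previous frame plus the recorded changes."""
    frame = list(previous)
    for position, value in delta:
        frame[position] = value
    return frame

# Two successive 8-pixel 'frames': a two-pixel object (value 9) moves one step right.
frame_n = [5, 5, 5, 9, 9, 5, 5, 5]
frame_n1 = [5, 5, 5, 5, 9, 9, 5, 5]

delta = encode_delta(frame_n, frame_n1)
print("changes to send:", delta)
print("rebuilt frame matches:", apply_delta(frame_n, delta) == frame_n1)
```

Only two of the eight pixels need to be transmitted for the second frame; the unchanged background costs nothing. It also shows why editing is awkward: the second frame cannot be decoded without the first.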

    Advantages of using MPEG standard file for net


    1. Platform independence
      Many graphics and video cards from manufacturers support MPEG playback. Windows 95 and NT support MPEG playback from within the OS.
    2. Device independence
      Several vendors manufacture MPEG encoding solutions. This drives down prices and ensures continued supply of encoders.
    3. Quality
      MPEG-compressed video is the highest quality available at bit rates below 200 Kb/s.

    Its disadvantages


    1. needs a lot more computation than JPEG, especially with B pictures
    2. editing MPEG frames is a problem due to interframe compression. This disadvantage has enabled other methods like M-JPEG to become popular for video files.

    The MPEG II extension of MPEG adds support for interlaced video as well as other improvements. This is used for video production and television (HDTV).

    Hardware support

    MPEG offers a varied range of video resolutions and data rates. Optimized data rates of 1.2 Mbps (the CD-ROM data rate) have been approached. At 30 frames/s and a resolution of 352X240 pixels, it is claimed that the quality of the video is comparable to VHS. To view MPEG playback incorporating .avi files, i.e. audio video interleave, the basic PC station on the net must have a minimum specification similar to that defined by MPC standard level 3:


    • processor: 75MHz pentium
    • memory: 8M
    • hard disk: 540M
    • drive: 1.44M 3.5"
    • CD-ROM: quad speed
    • video: 16-bit, 640X480, 1M
    • other I/O: serial, parallel, MIDI, game port
    • OS: Ms Windows 3.1
    • modem: dual standard V.32 or higher (speeds close to 14400 bps preferred)
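To put the 1.2 Mbps data rate mentioned under hardware support in perspective, the sketch below estimates the uncompressed data rate of 352X240 video at 30 frames/s and the compression ratio needed to reach that target. It assumes 24 bits per pixel for the uncompressed picture and takes 1.2 Mbps to mean 1,200,000 bits/s; both are assumptions of the example.

```python
def raw_video_bits_per_second(width, height, frames_per_second, bits_per_pixel=24):
    """Data rate of uncompressed video at the given size, frame rate and colour depth."""
    return width * height * bits_per_pixel * frames_per_second

raw = raw_video_bits_per_second(352, 240, 30)  # assumed 24-bit colour
target = 1_200_000                             # 1.2 Mbps, the CD-ROM data rate

print(f"uncompressed: {raw} bits/s ({raw / 1e6:.1f} Mbps)")
print(f"compression ratio needed: about {raw / target:.0f}:1")
```

A ratio of this order is well beyond what compressing each frame on its own would comfortably give, which is where interframe compression earns its keep.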

      Examples of video cards aimed at MPEG are Intel smart video board, Broadway(tm), etc.

      MPEG ----------> Extension - .mpg
      .mpeg

      Software - Microsoft video for windows for
      capturing, compressing, and playing videos.
      VMPEG playback of .avi on net
      Apple Quicktime for Windows/mac
      Quicktime VR player for Macs.


      References:

      1. Title: HTML Sourcebook(3rd ed.)
        Author(s): Ian S. Graham
        Publisher: John Wiley & Sons, Inc.

      2. Title: Common internet file formats
        Location: http://www.matisse.net/files/formats.html

      3. Title: Graphics viewers, editors, utilities and...
        Location: http://www2.ncsu.edu/bae/people/faculty/walker/hotlist/graphics.html