Bootstrap Script

A Scripting Pattern

Problem

How can one efficiently represent and process external data using a scripting language?

Context

Because of the high-level nature of scripting languages, it is easy to build and manipulate data structures using scripts. However, scripting languages are often two or more orders of magnitude slower than the low-level programming language in which they are implemented, which can make data processing prohibitively inefficient.

Forces

  1. It is easy to build and manipulate data structures in a high-level scripting language.
  2. It is slow to process data structures in a high-level scripting language.

Solution

Write a "bootstrap" script that examines the external data and builds another script that represents the external data using scripting commands and their arguments as a data structure.

Write an implementation for each command in the data-structure script to perform the appropriate processing for the data element that the command represents.

Evaluate the script to process the data.
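
The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the comma-separated record format, the command name item, and the totals dictionary are all hypothetical. It shows only the shape of the pattern. Whether evaluating the generated script is actually faster than walking a script-level data structure depends on the interpreter in question.

```python
# Step 1: the bootstrap script examines the external data and builds
# another script whose commands and arguments represent that data.
def bootstrap(lines):
    commands = []
    for line in lines:
        name, qty, price = line.strip().split(",")
        # repr() quotes the name so it stays an argument and never becomes code.
        commands.append(f"item({name!r}, {int(qty)}, {float(price)})")
    return "\n".join(commands)

# Step 2: an implementation for each command in the generated script.
totals = {"count": 0, "value": 0.0}

def item(name, qty, price):
    totals["count"] += qty
    totals["value"] += qty * price

# Step 3: evaluate the generated script to process the data.
# Only the item command is visible to the generated script.
data = ["widget,3,2.50", "gadget,1,10.00"]  # hypothetical external data
script = bootstrap(data)
exec(script, {"__builtins__": {}, "item": item})
```

Supplying a different implementation of item (for example, one that prints a report line) processes the same generated script in a different way without touching the bootstrap step.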

Consequences

A Bootstrap Script represents data to be processed by using the low-level data structures used by the script interpreter to represent the scripting language itself. Therefore, instead of being processed by an interpreted script manipulating script-level data structures, the data is processed by the compiled code of the language interpreter manipulating low-level data structures directly. This removes the levels of indirection that slow down the execution of scripts, and so results in significant performance improvements.

Scripts created by Bootstrap Scripts can be difficult to debug because they exist only at run-time and their eventual behaviour is determined by code scattered throughout the source of the Bootstrap Script.

One must be careful that a script created by the Bootstrap Script cannot breach system security. For example, certain old HTTP servers could be attacked by sending a CGI program data that contained a semicolon followed by any UNIX command. The servers used the system shell to spawn the CGI process with the received data as arguments to the process. Because the semicolon is the statement separator in the shell language, the shell started two processes, one being the CGI script and the other being whatever the attacker specified after the semicolon (for example, rm -rf /).
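
The same weakness can be reproduced in a generated script. The sketch below is illustrative only: the command names record and attack and the hostile input are hypothetical. It contrasts pasting data directly into the generated script with quoting it first using repr(). The semicolon is a statement separator in Python too, so the hostile value plays the same role as in the shell example.

```python
def naive_command(value):
    # Unsafe: the data is pasted into the script text unquoted.
    return f"record('{value}')"

def quoted_command(value):
    # Safer: repr() emits a string literal, so the data stays an argument.
    return f"record({value!r})"

def run(script):
    seen = []    # values passed to the legitimate record command
    calls = []   # evidence that the attacker's command was executed
    env = {
        "__builtins__": {},
        "record": seen.append,
        "attack": lambda: calls.append("attack ran"),
    }
    exec(script, env)
    return seen, calls

# Hostile data that closes the string early and injects a second statement.
hostile = "x'); attack(); record('y"

naive_seen, naive_calls = run(naive_command(hostile))
quoted_seen, quoted_calls = run(quoted_command(hostile))
```

With the naive command the generated script contains three statements and attack is called. With the quoted command the whole hostile value arrives as a single argument to record and nothing else runs.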

Known Uses

Stephen Uhler's HTML parsing library for Tcl/Tk translates an HTML document into a Tcl script made up of commands that represent the structure of the document. The user of the library can define these commands to process the contents of the document when the script is executed.
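
The general idea of turning a document into a script of user-definable commands can be sketched with Python's standard html.parser module. This is not Uhler's library or its API. The command names start, end and text, and the sample document, are assumptions made for illustration.

```python
from html.parser import HTMLParser

class ScriptBuilder(HTMLParser):
    """Translate an HTML document into a script of start/end/text commands."""

    def __init__(self):
        super().__init__()
        self.commands = []

    def handle_starttag(self, tag, attrs):
        self.commands.append(f"start({tag!r}, {dict(attrs)!r})")

    def handle_endtag(self, tag):
        self.commands.append(f"end({tag!r})")

    def handle_data(self, data):
        if data.strip():
            self.commands.append(f"text({data.strip()!r})")

builder = ScriptBuilder()
builder.feed('<p>Hello <a href="x.html">world</a></p>')
builder.close()
html_script = "\n".join(builder.commands)

# The user of the generated script defines the commands. This set collects
# link targets and ignores everything else.
links = []

def start(tag, attrs):
    if tag == "a" and "href" in attrs:
        links.append(attrs["href"])

def ignore(*args):
    pass

exec(html_script, {"__builtins__": {}, "start": start, "end": ignore, "text": ignore})
```

A different set of command definitions, such as one that renders the text, could be evaluated against the same generated script.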

The author wrote a network management tool in [Incr Tcl] that started off as a single instance of a single class. This instance listed the distributed objects stored in a single directory of a name server and created new classes and instances to represent those objects. This continued recursively when the user queried the sub-objects of an object, and so built a local model of the real objects running in the network.