Department of Computing Imperial College
More information on Linux and UNIX

This section of the guide contains more advanced information on using Linux. Further sources of information are the online manual pages, guides available from the CSG web pages, and books in the CSG Technical Library, whose catalogue can be searched from our web page.

The UNIX story started in 1969 at Bell Laboratories, when a group of researchers who needed a set of modern computing tools to help them with their projects borrowed a spare PDP-7 computer that came with an assembler, a loader and not much else. Since the PDP-7 was not the easiest machine to use, they decided to start by writing a new operating system for it. This was the original UNIX. Inevitably the group outgrew the PDP-7, and obtained an upgrade by persuading the Bell patent office to use the system for its text preparation. UNIX was re-written in C for the new machine, except for a small process control area which was kept in assembler for speed. Because UNIX was written in a high-level language it was very portable, and as AT&T could not market it under the US anti-trust laws, UNIX was given to anyone who wanted it, provided they did not re-sell it themselves. As a result, UNIX was distributed free to many universities and other educational establishments, which made it very popular. Added to that, the fact that UNIX was developed in a research environment and enhanced in a university setting has made it a very powerful software development tool.

UNIX was first introduced to DoC in 1980, when the Department acquired a PDP-11/34 running Version 6 UNIX. Teaching under UNIX didn't begin until 1982, and then only for final year undergraduates. Since 1985, however, most teaching and research activities in DoC have run under UNIX in one form or another. The system we currently use is called Linux, which is really a UNIX-like operating system rather than a version of UNIX. Linux was developed by a Finnish undergraduate student called Linus Torvalds; many other programmers around the world have subsequently extended the code and provided extra functionality. Linux can be freely distributed, and therefore students can install it on their home PCs if they wish.

The Shell

This section provides a brief overview of the Linux shell, describing some of the concepts that you will need to know in order to get the most out of it.

The shell is a program that provides the user interface to the operating system. At one level it is a command interpreter; users give commands to the shell and the shell makes the proper calls to the system. The shell is also, however, a programming language in its own right: it provides the programmer with variables, control flow, subroutines and interrupt handling. This combination of command interpreter and programming language, allows users to build up their own commands to fit the way they work.

The shell is an ordinary program and therefore it can be replaced on a per-user basis. In other words, there are different kinds of shells.

The most commonly used shells are sh (the Bourne Shell), csh (the C Shell) and tcsh (an extended version of csh). There are a number of others, such as bash (the Bourne-Again SHell), an extension of sh, and ksh (the Korn Shell, not available on CSG supported systems). Many experienced Linux users use csh or tcsh as their interactive command interpreter, and sh as their preferred shell programming language. While a number of different shells are available at DoC, we only guarantee support of sh, csh and tcsh, since these are commonly available across almost all Linux and UNIX systems. This document describes the C-Shell, since that is the shell given by default to all user accounts. If you want to change your shell, please email your request to help@doc.

Metacharacters

Some characters have a special meaning to the shell. These are known as shell metacharacters.

In general, most characters which are neither letters nor digits are metacharacters. For instance, the following are examples of metacharacters:

	& > | *
(Don't worry about what these characters mean just yet -- we'll come to them in due course.)

Quoting

To pass metacharacters to a command without the shell interpreting them specially, they must be quoted. The recommended mechanism for placing characters which are neither letters, digits, `/', `.' nor `-' in an argument word to a command is to enclose them between a pair of single quote characters, as in:

	active12% echo '*'
	*
	active12%
Of course there is the problem of `escaping' the single quote character itself to the shell. This is done by using the backslash `\' character. The backslash is also used to escape the `!' (exclamation mark) of the history mechanism.
	active12% echo \'\!
	'!
	active12%

In general:

\c escapes the single character c following the `backslash'

'cde' escapes cde between the pair of single quotes

"string" protects string, but allows, for instance, the value of variables (such as TERM) to be substituted.

Filename expansion

For shorthand purposes it is often useful to refer to filenames using a set of metacharacters known as wild card characters: * and ?

The * matches any sequence of zero or more characters -- so to list all files beginning with `ch':

	active12% ls ch*
	ch1 ch1.1 ch2 ch4.2 chapter4 chi
	active12%
The ? character matches any single character:
	active12% ls ch?
	ch1 ch2 chi
	active12%

Another pattern matching mechanism is the use of square brackets: [string]. Any one of the characters in string will be matched:

	active12% ls ch[23]
	ch2
	active12%
A range of consecutive letters or digits can also be specified, as in:
	active12% ls ch[1-4].[1-4]
	ch1.1 ch4.2
	active12%

Commands

The shell acts as a user interface, executing commands passed to it from the keyboard or from a file. Most commands are executed via a process known as forking: when a user issues a command, the shell creates (forks) a new process, passes the options and arguments to it, and executes the program in that process.

Linux command syntax

The Linux command structure looks like this:

	command arguments

There are typically two kinds of arguments: options, and a file name or list of file names. An option (also known as a flag or flag argument) usually follows the command and is separated from it by a blank. The convention is that an option is preceded by a hyphen (`-'), commonly called `minus'. Thus ls with the minus l option will provide a long listing of the files in the current directory.

	ls -l

The way commands deal with options is not uniform. For instance, some commands allow multiple options to be grouped together after a single hyphen

	command -abc

while others expect each option to be preceded by its own hyphen:

	command -a -b -c

The man pages will describe which commands take which options.

Command grouping

Multiple commands may be strung together on a single line by separating them with semicolons; the shell executes the commands one after the other, displaying its prompt when the last command terminates.

	cd News; ls

This will change to a new working directory, News, and list its contents.

There is another way of grouping commands, and that is to enclose the command list in parentheses:

	(cd News; ls)

The contents of the directory News are listed, but the working directory of the invoking shell is not changed. What happens is that a new shell is forked which executes the commands in the parentheses.
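
A short session illustrates the difference (the home directory and file names shown are illustrative):

	active12% pwd
	/homes/nuu
	active12% (cd News; ls)
	article1 article2
	active12% pwd
	/homes/nuu
	active12%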

Command aliases

The C-Shell has simple string substitution facilities that can provide a kind of shorthand. Called aliasing, it can be used to provide short names for commands, to provide default arguments, and to define new short commands out of one or more other commands. An alias can be set from the command line, but it is more usual (and more convenient) for aliases to be placed in the .cshrc file. This way, your aliases will be set up each time you log in.

For instance, you may regularly search through your directory tree looking for files not accessed for more than two weeks, and then compressing them, thus keeping your file space within your quota limit. The command:

	find ~ -type f -atime +14 -exec gzip {} \;

would do it (Note the `\' escaping the `;'). To save time, though, it would be easier to alias that long command to something more memorable and quicker to type. So, the line

	alias squash 'find ~ -type f -atime +14 -exec gzip {} \;'

in your .cshrc file means that in future, you need only issue

	squash

Note the single quotes (' ') surrounding the entire find command: these ensure that the command is stored whole, without any of its metacharacters being interpreted by the shell when the alias is defined.

Aliases can also contain multiple commands, pipelines and accept the arguments of the original command by clever use of the history mechanism.

	alias cd 'cd \!* ; ls'

does an ls command after every change directory (cd) command. Within an alias, `!*' translates to `all the arguments of the current command', i.e. whatever you typed after cd. The `\' escapes the `!', preventing it from being interpreted when the alias definition is first read by the shell.

Command history

The C-Shell maintains a history list into which it places the words of previous commands (tcsh and bash also maintain a history list). There is a special notation which allows you to reuse commands or words from commands in forming new commands. The metacharacter which invokes this history mechanism is the ! character (`bang' or `pling' in computerese). The following sequence gives an example of its use:

	active12% mv mbox old_mail
	active12% ls -la !$
	-rw------- 1 nuu 2905 Aug 28 10:41 old_mail 
	active12%

The `!$' translates to "the last argument of the last command". Other useful history metacharacter sequences are:

!! repeat the last command

!^ the first argument of the last command

!* all the arguments of the last command

Typing history lists all the commands stored in your history list along with the command number of each command. The history command also takes arguments:

	active12% history 5
	    8 mv mbox old_mail
	    9 ls -la old_mail
	    10 compress old_mail
	    11 quota -v
	    12 history 5
	active12%

To re-issue, say, the 9th command:

	active12% !9
	ls -la old_mail
	-rw------- 1 nuu 	2905 Aug 28 10:41 old_mail
	active12%

is the same as requesting the last ls command:

	active12% !ls
	ls -la old_mail
	-rw------- 1 nuu 	2905 Aug 28 10:41 old_mail
	active12%

See also: man sh, man csh, man tcsh

Job Control

The shell treats any command, or sequence of commands separated by semicolons, as a job. Each job is assigned an identifying job number. Jobs can be run in the background by typing an ampersand (`&') at the end of the command line:

	active12% find ~/course_work -type f -atime +14 -exec gzip {} \; &
	 [2] 8083
	active12%

The number in square brackets is the job number (the example above assumes you already have another job running, hence the number 2), and the next number is the process number of the executing find command. Running a job in the background means the shell does not wait for it to complete, but immediately returns the prompt, ready for another command.

Ctrl-z will suspend a job that is running in the foreground. Suspending a job stops it from executing any further instructions. A job running in the background can be suspended by issuing the stop command with the job number (prefixed by `%') as its argument. To stop the job started above:

	stop %2

To make it run in the background again:

	bg
and to bring it to the foreground:
	fg

All the job control commands described above will take a job number, prefixed by `%', as an argument, or enough of the command name to identify the job.

	fg %2

is the same as

	fg %find

The list of the jobs running in the background or suspended can be obtained with the jobs command.

	active12% jobs
	[1]+ Stopped vi Second.mi
	[2]- Running find ~/course_work -type f -atime +14 -exec gzip {} \; &
	active12%
The `+' and `-' tell you which job is current (`+') and which is previous (`-'). Job control commands without arguments act on the current job. To discontinue a job:
	active12% kill %2
	[2] Killed find ~/course_work -type f -atime +14 -exec gzip {} \;
	active12%

kill also works with a process number as an argument.

Input/Output Redirection

One of the most useful features of Linux is the ability to redirect output from commands to a file, or join commands together using a pipe.

Every Linux process has three streams connected to it: standard input (stdin), standard output (stdout) and standard error output (stderr). Usually, during an interactive terminal session, stdin comes from the keyboard, and both stdout and stderr go to the screen.

However, these streams can be redirected, so that the output of a command can be sent to a file, or another command, and the input to a command can come from a file or from the output of another command.

The following is a list of the metacharacters which control this redirection:

> file directs stdout to file

>> file appends stdout to file

< file accepts stdin from file

>& file redirects stdout and stderr to file

command1 | command2 pipes stdout from command1 to stdin of command2
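
For example, you could save the list of logged-in users in a file, count its lines, and then achieve the same effect in one step with a pipe (the counts shown are illustrative):

	active12% who > users
	active12% wc -l < users
	12
	active12% who | wc -l
	12
	active12%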

Variables

There are a number of variables maintained by the shell. These are used to store values which the shell may refer to from time to time. For instance, the variable TERM stores your terminal type. One of the more important variables is PATH, which contains a list of directory names where the shell searches for commands.

Most of the special variables are defined in your .cshrc file. Typing:

	set

will give a list of the variables which have been set, and the values associated with them.

Variables are assigned with the command set. So:

	set path=($path ~/bin/myscripts)

will add the directory bin/myscripts in your home directory to your search path. If you add a new command to this directory, you will need to issue the command:

	rehash
which causes the shell to re-compute its internal table of command locations, so that it can find the new command in the directory. Of course, to make the change permanent you will need to add ~/bin/myscripts to the path list in your .cshrc file (see /usr/local/etc/default.cshrc).

Environment variables can be listed with the command

	env
Environment variables are different from shell variables in that they are exported to any new process. For instance, when you invoke the screen editor vi, the program looks for the environment variable TERM in order to find out what your terminal type is, so that it knows how to draw the screen. To set or change an environment variable use the command setenv.
	setenv TERM vt100
sets the environment variable TERM for a vt100 terminal. Generally, however, TERM, like the other environment variables PATH and SHELL, is automatically exported from the corresponding shell variables term, path and shell. (Note that environment variables are by convention uppercase while shell variables are generally lowercase.)
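
For instance, to set the EDITOR environment variable, which many programs consult when they need to start an editor, and confirm its value (the choice of vi here is just an example):

	active12% setenv EDITOR vi
	active12% echo $EDITOR
	vi
	active12%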

Shell Procedures

The shell can read commands from a file. Files containing shell commands are called shell scripts or shell procedures. These procedures can have arguments associated with them which are referred to in the file using the positional parameters $1, $2 . . .

If, for example, the file First.sh contains the commands:

	who | grep $1
then
	sh First.sh nuu
is equivalent to issuing the following on the command line:
	who | grep nuu
It is also possible to invoke the script in another way. If you set execute status on the file First.sh using the chmod command:
	chmod +x First.sh
then the command
	First.sh nuu
is equivalent to
	sh First.sh nuu
The shell also provides an if/then/else construct for conditional branching, and a case statement for multiway branching.
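
As a sketch of the if/then/else construct, the following hypothetical script, Second.sh, uses the exit status of grep to report whether a given user is logged in (building on First.sh above):

	#!/bin/sh
	# Second.sh -- is the named user logged in?
	if who | grep $1 > /dev/null
	then
		echo "$1 is logged in"
	else
		echo "$1 is not logged in"
	fi

Invoking sh Second.sh nuu prints one message or the other, depending on whether grep finds a match.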

Utilities

There are many utilities available under Linux and this section attempts to indicate what some of them do. We have defined `utility' very broadly to encompass text formatters, filters, programming and project management tools.

TeX

TeX is a command-driven text formatting package. It is a large and complicated program that goes to extraordinary lengths to produce attractive typeset material. However, it is easy to typeset simple text using TeX. It is particularly useful for creating mathematical texts, as its fonts include mathematical symbols. TeX has a number of `macro' packages available, the best known of which is LaTeX. LaTeX has a number of different style `templates' for documents such as articles, reports, books and letters. For most text processing purposes, LaTeX is recommended for its simplicity.

Filters

The word filter is derived from the analogy with the kinds of filters used in plumbing or electronics. Formally, a filter is a command that reads its standard input, transforms it in some way, and prints the result on its standard output. The input may come from a file or from the output of another command. Most Linux commands can be used as filters. Some of the more useful ones are described below. For further details of their options and usage please refer to the relevant man pages.

grep

grep is a command which searches one or more files, line by line, for lines containing a pattern. Suppose you have a variable called testflag in your Turing program, saved under the name First.t. To search for all occurrences of this variable, type the following:

	grep -i -n testflag First.t

All the lines in your program that contain this variable will be printed, along with their line numbers (the -n flag). The -i option makes the search case-insensitive: it causes both uppercase and lowercase letters to match a pattern containing lowercase letters.

If you only need to know the number of lines on which the variable testflag appears in your program, then type the following:

	grep -c testflag First.t

Other useful options:

-v prints all the lines that do not contain the pattern

-l prints the name of each file that contains one or more matches

Finally, it is worth mentioning that there are two other commands called fgrep and egrep which perform functions similar to the grep command.

sort

sort is a utility program which sorts and/or merges the contents of one or more text files in alphabetic or numeric order. Like grep, there are a number of options which can be used for sorting lists with different sort-fields. Suppose that you have a file called telephone, which contains the names of various people along with their telephone numbers.

Lucy Houseman 123 4567
Sunil Shah 456 7891
Jenny Ho 987 6543

To sort the file in alphabetical order, type the following:

	sort telephone
The output of this command would be the list of people sorted into alphabetical order by their first names. To sort the list by their last names, we need to tell sort to skip the first field. This is done thus:
	sort +1 telephone
You can also tell sort to treat the characters in a field as numbers, and to sort them in numeric order. So to sort the list by telephone numbers you can type:
	sort -n +2 telephone
More useful options:

-r reverses the sense of the sort
-f causes sort to consider all uppercase letters to be lowercase letters
-m causes multiple sorted files to be merged
-u removes duplicate lines from the output

Bear in mind, however, that the ordering is actually ASCII order -- in other words LUCY, Lucy and lucy will not necessarily appear sequentially. Use the -f option to make the sort case-insensitive.

sed

sed is a non-interactive text editor. It is a lineal descendant of the standard editor ed, and is ideal for performing multiple `global' editing functions efficiently in one pass through the input. It will take commands from a script or from the command line; for simple usage the latter is preferable.

Continuing with our example file telephone (containing the names and telephone numbers of various people), suppose we wanted to change the name Ho to Harris. This can be done by typing:

	sed s/Ho/Harris/ telephone > newtelephone

The s command stands for substitute, and the result of this operation will be that every occurrence of the name Ho is replaced by Harris and the result saved in the file newtelephone. This also means that Houseman will become Harrisuseman -- an effect which is probably not desirable. You should always be careful to specify the pattern-match precisely:

	sed 's/Jenny Ho/Jenny Harris/' telephone >newtelephone
To delete from the file all lines on which the name Jenny occurs, simply type:
	sed /Jenny/d telephone > newtelephone
Note: this deletes every line on which the string Jenny occurs -- not simply the string itself.

Some other options --

-f causes sed to take the edit commands from a named file
-n causes sed to selectively display parts of a file

awk

awk is a programming language whose basic function is to search files for lines that contain certain patterns. When a line matching any of those patterns is found, awk performs specified actions on that line. Like sed, awk takes its input from a script of requests or from the command line. You can write simple, short awk programs on the command line, without having to create a separate file. An awk program consists of one or more program lines containing a pattern and/or action. Actions must be enclosed within braces so that awk can differentiate them from patterns. An awk pattern can be one of many things -- from a simple search string to a regular expression. If you do not specify a pattern then awk will perform the specified action(s) on all the lines in the file.

Continuing with the same example involving the file telephone, if you wanted to print the first name and phone number of everyone with Shah as their last name, you can specify a pattern in the awk command followed by the print action:

	awk '/Shah/ {print $1, $3, $4}' telephone
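
If no pattern is given, the action is applied to every line; for example, to print just the last names from the same file:

	awk '{print $2}' telephone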

In many ways awk is a more powerful filter than either grep or sed and certainly has many more features than can be dealt with here.

perl

Perl is a more modern programming language designed by Larry Wall, intended to replace shell scripts using sed, awk and many other filters with a single unified interpreted language. Learn it now!

Duncan White gives an annual Perl lecture course in December; the latest set of notes is available.

Other useful filters

There are many other filters available; the following table lists some of the more useful:

colrm	removes columns
fold	folds long lines into fixed lengths
join	joins lines in two files that have identical fields
tail	prints the last few lines of a file
head	prints the first few lines of a file
tr	translates characters, for example from uppercase to lowercase and vice versa
uniq	removes (and can count) adjacent duplicate lines
wc	counts characters, words and lines

Project Management

Revision Control System

The Revision Control System (RCS) manages multiple revisions of text files. It is useful for text that is revised frequently, for example programs, documentation, etc. Since RCS stores and retrieves multiple revisions of programs, one can maintain one or more releases while developing the next release, with a minimum of space overhead. RCS maintains a complete history of changes and resolves access conflicts. This can prove to be very useful in group projects when two or more members of the group wish to modify the same revision of the program.

Basic use of RCS is simple; to submit a file (e.g. First.t) to RCS you first have to `check in' the file:

	ci First.t

This command creates a file First.t,v, stores First.t into it as revision 1.1, and deletes First.t. Files ending in ,v are called RCS files (`v' stands for `versions'); the others are called working files. It is a good idea to create a directory called RCS, in which case all the ,v files will automatically be saved in that directory. You can get the working file back by typing the check-out command:

	co -l First.t

This command extracts the latest revision from First.t,v and writes it into First.t. Note the -l option, which locks the revision and sets write permission on the working file. You can now edit it and check it back in by invoking:

	ci -u First.t

The -u option unlocks the file, and ci increments the revision number appropriately.

You can also retrieve any previous revision and edit it. For example, if you wanted revision 1.2 of your program, you could retrieve it by typing:

	co -l -r1.2 First.t

By default RCS uses strict locking: even the owner of the RCS file must lock a revision before depositing changes. If you are the only person who is going to deposit revisions, you may turn strict locking off with the command:

	rcs -U First.t

and turn it back on with:

	rcs -L First.t

RCS offers many more features for good project management. A full description is beyond the scope of this introductory document.

See also: man rcs, man ci, man co

make

make is a useful software maintenance utility. It streamlines the process of generating and maintaining object files and executable programs. It helps one to compile programs consistently, and eliminates the unnecessary compilation of modules that are unaffected by source code changes. It reads a file called the makefile, which is created by the programmer. The makefile can be regarded as a recipe. It contains information about what files to build and how to build them. Each file to build is called a target.
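
As an illustration, a minimal makefile for a hypothetical program first, built from two C source files, might look like this (note that each command line must begin with a tab):

	# link the program from its object files
	first: main.o utils.o
		cc -o first main.o utils.o

	# re-compile each module only when its source changes
	main.o: main.c
		cc -c main.c
	utils.o: utils.c
		cc -c utils.c

Typing make then rebuilds only those targets whose prerequisites have changed since the last build.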

make has a number of pre-defined macros which eliminate the need to edit makefiles when the underlying compilation environment changes. make commands can also be nested. make has evolved into a powerful and flexible tool for consistently processing files that stand in a hierarchical relationship to one another.

Programming Tools

lex

lex is a Lexical Analyser Generator: a program generator designed for lexical processing of character input streams. lex accepts a high-level, problem-oriented specification for character string matching, and produces a program in a general-purpose language which recognises regular expressions. lex source is a table of regular expressions and corresponding program fragments. The table is translated into a program which reads an input stream, copying it to an output stream and partitioning the input into strings which match the given expressions. As each such string is recognised, the corresponding program fragment is executed. lex can also be used with a parser generator to perform the lexical analysis phase; in particular, lex has been designed to work in close harmony with yacc parsers.
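
For a flavour of what a lex source looks like, the following hypothetical specification partitions its input into numbers and words, printing a tag for each and discarding everything else:

	%%
	[0-9]+		printf("NUMBER\n");
	[a-zA-Z]+	printf("WORD\n");
	.|\n		;

It can be turned into a running program with lex followed by cc, linking in the lex library (-ll), which supplies a default main routine.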

Yacc

yacc stands for Yet Another Compiler-Compiler. It provides a general tool for imposing structure on the input to a computer program. The yacc programmer specifies the structure of the input process. The specification includes rules describing the input structure, code to be invoked when these rules are recognised, and a low-level routine to do the basic input. yacc then turns such a specification into a subroutine that handles the input process. The input subroutine produced by yacc is called the parser. The parser calls the programmer-supplied low-level input routine (the lexical analyser) to pick up the basic items, called tokens, from the input stream. These tokens are organised according to the input structure rules, called grammar rules; when one of these rules is recognised, the corresponding code is invoked. In some cases yacc fails to produce a parser from a given set of specifications, so although yacc cannot handle all possible specifications, its power compares favourably with similar systems.

gdb

The purpose of a debugger such as GDB is to allow you to see what is going on `inside' another program while it executes -- or what another program was doing at the moment it crashed. GDB can do four main kinds of things (plus other things in support of these) to help you catch bugs in the act:

start your program, specifying anything that might affect its behaviour

make your program stop on specified conditions

examine what has happened when your program has stopped

change things in your program, so that you can experiment with correcting the effects of one bug and go on to learn about another
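
To use GDB you would typically compile your program with debugging information (the -g flag) and then run gdb on the resulting executable; the file names here are only illustrative:

	active12% gcc -g -o first first.c
	active12% gdb first

See also: man gdb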

© CSG / 1999