Welcome to Duncan White's Practical Software Development (PSD) Pages.
I'm Duncan White, an experienced and professional programmer, and have been programming for well over 30 years, mainly in C and Perl, although I know many other languages. In that time, despite my best intentions:-), I just can't help learning a thing or two about the practical matters of designing, programming, testing, debugging, running projects etc. Back in 2007, I thought I'd start writing an occasional series of articles, book reviews, more general thoughts etc, all focussing on software development without all the guff.
Individual Safety Tips or "Shared Norms" of C Programming?
Background to this Draft Article
Following a discussion on the "Plain Ordinary C" LinkedIn group, I started wondering whether C Programmers (or programmers more generally) develop a shared sense of normal and safe ways of programming, rather in the sense of "Social Norms" in psychology - heuristics that help drive behaviour in specialist groups, and avoid dangerous behaviours etc. A less highbrow term for these heuristics and Norms might be safety tips. Such practical safety tips as I have in mind might alert you intuitively to something "not being right" just because it's very abnormal, and by doing so keep you in the safe parts of C, and away from dangerous less explored areas, once marked "Here be Dragons" on nautical maps.
The specific trigger for these thoughts was a C newbie, writing a recursive call to main(). In fact, worse than that, main() called function f() which called main() again, i.e. it was Mutual Recursion involving main(). This seemed so weird and evil to me that I knew it was a very confused and bad example - before even adding in the fact that the recursive calls had no conditions, so it was an infinite mutual recursion that would not terminate until stack overflow occurred. So Do not call main(): that's the runtime system's job became my first safety tip - since reworded as Thou Shalt Not call main().
Having thought about this for a few days, I wrote the initial version of this document. Then I started a fresh discussion on the "Plain Ordinary C" LinkedIn group to discuss this idea and to solicit new tips (and revisions to the tips herein). Note that this is the first time that I've gone public with one of these articles before judging it finished: this is a wisdom of crowds experiment for me. So far, the LinkedIn discussion has had over 400 (overwhelmingly positive) comments in about 3 weeks, and I have incorporated many suggestions into this article with full attribution. The comments are still coming in; in fact there's a growing backlog of suggestions to incorporate! I'd like to thank everyone on LinkedIn who has contributed to this article. I'm extremely grateful for all the suggestions, disagreements and arguments: this article is gradually improving through their help.
ANSI C: Safety, Danger and Power
Now, just how dangerous is modern C? C used to be widely viewed as a powerful language with fewer-than-average protections against common mistakes made by unwary programmers. It used to be said that (for example): "Writing in C is like running a chainsaw with all the safety guards removed" (attributed to Bob Gray).
Howard Brodale of LinkedIn makes the important point that C isn't as dangerous as it once was, saying today's C compilers now hold our hands much more with prototypes and better parameter datatype checking. I completely agree with Howard that C is much less dangerous now than it once was, but I should add that C still contains many pitfalls, and the purpose of this article is to outline practical tips to avoid many of the remaining dangers in modern ANSI C.
Jack Purdum of LinkedIn adds that danger is not necessarily a bad thing, pointing out that all C programmers have implicitly agreed to trade some elements of safety (eg, array boundary checking, pointer confusions) for the performance improvements and economy of expression. I agree that (counterintuitively) C's dangers are a good thing - one reason why we all love C so much and continue to use it is because it occupies a sweet spot in which some safety is traded off for a lot of efficiency.
Safety Tips - what do I mean?
As a first step, before we can judge whether useful "Shared Norms" exist in C programming, we need to see some individual C programmers' collections of personal "safety tips". So, to get the ball rolling, here's my collection, classified into several areas:
- It seems certain (to me) that every individual C programmer develops their own personal sense of what's normal and safe as their experience grows, but the more interesting question is whether these individual senses combine into a Shared Norm, or whether C programmers have very disjoint views about this. I don't know the answer here. It's quite possible that I have mine, you have yours, etc, and we'll fight to the death over them. But it's interesting to see!
- I should point out that these safety tips are distinct from low-level questions of style and layout (eg. the eternal controversy over where to put {}s, variable naming conventions, how many spaces to indent, whether to use tabs or spaces to indent, etc), and are also distinct from higher-level architectural and design considerations (eg. loose coupling, information hiding, abstraction, cohesiveness, design patterns etc). Like Goldilocks, I want to exclude both extremes and investigate the middle bowl of porridge - the one that's just right.
- After discussions on LinkedIn, with respect to questions of style and layout, just to say it once and for all: I believe that the details of the style that an individual programmer chooses are unimportant, but that consistency of style within a project/team is hugely important. Some programmers indent by 2 spaces, some 3, some 4, some 8 (and some of the latter use a hard tab to do it). Some programmers lay {} braces out as K&R did, others in the BSD style, or several other variations. Some programmers use Hungarian Notation for variable and function names, or MixedCase, or mixedCaseWithInitialLowerCase or lowercase_with_underscores. Some like short variable names like i for an index, others long like currentIndex.
Many programmers have strongly held opinions about such questions of style, and are not going to change their minds. Thus, discussions of these questions often descend into unending religious wars, which contribute nothing and should be avoided (IMHO). Note that the CERT Secure C Coding Standard (discussed below) agrees, saying Coding style issues are subjective, and it has proven impossible to develop a consensus on appropriate style guidelines. In summary: for each stylistic choice within a project: make your choice and stick to it. In a team adopt their team style.
- Several rather similar lists of C gotchas exist, eg:
- The andromeda.com/people/ddyer/topten.html.
- The C Programming Substance Guidelines.
- The Guidelines for Safer C code (PDF).
and some books are relevant too:
- Andrew Koenig's wonderful C Traps and Pitfalls book, written about 20 years ago, investigates many of the worst pitfalls in C - and how to avoid them.
- Also, the wonderful The Pragmatic Programmer book (see my Review here) contains a collection of pithy tips which distill decades of pragmatic software engineering experience down for you. One of the most famous tips (not original to the authors, Hunt and Thomas) is the DRY principle which finds expression throughout design and programming:
DRY---Don't Repeat Yourself: Every piece of knowledge must have a single, unambiguous, authoritative representation within a system.
However, in my mind at least, none of these are quite the same as my idea of safety tips (although there is some overlap). I'm sure some suggestions from those sources appear here - especially C Traps and Pitfalls and The Pragmatic Programmer.
- Several people on LinkedIn mentioned that various heavyweight "safety-critical programming in C" methodologies exist - ones from NASA, MISRA-C, CERT, JPL, IBM etc. I should admit upfront that I haven't read them (well, I did glance at the NASA guide recently, and hated it when it told me not to use a potentially infinite loop. That may make sense for NASA in the context of navigational systems for spacecraft, but it's (IMHO) ridiculously restrictive in general C practice). I am about to read the MISRA-C 2012 guide, so hope to be more informed about it soon. Of course it's very important for me to be fair about these guides, and argue from a position of some knowledge. I realised that I am a programming libertarian or anarchist, reacting badly to methodologies that tell me what not to do, without letting me exercise my judgement. I believe as programmers it's our professional duty to think about such issues, and exercise choice - i.e. to make our own minds up.
- The people who mentioned MISRA-C et al asked me how a collection of safety tips would differ from these methodologies - would my idea tend to slide into the space they occupy? If so, that would seriously undercut this effort! After some thought and discussion, I think there is a difference and perhaps it's this: I'm talking about a BOTTOM-UP view from the (anarchistic) trenches, as opposed to a TOP-DOWN industrial strength guide designed to constrain the programmer in the name of safety. So I'm into practical tips that really help, discovered and shared by individual programmers, not a coding methodology that must be obeyed. Some of those same people felt that this venture would have no value without enforcement. Here I completely disagree: I have no interest in forcing anyone to obey any of these tips. Read them and make up your minds! That's kind of the point:-)
Dave Aronson of LinkedIn suggested that The CERT Secure C Coding Standard may be the most similar to my "safety tips" idea of the above guides and standards: it's freely accessible online and started as a community developed collection of security/safety tips. You can also buy it in book form through Amazon. Clayton Weimer of LinkedIn adds that the CERT website contains useful cross-links to other standards such as MISRA-C. From a quick inspection of The CERT Secure C Coding website I can see that Dave and Clayton are right, there is considerable overlap, and a lot of work has gone into it. I'll investigate that further and report back.
- Ken Gregg of LinkedIn added if something can catch a real-world class of problems early (ideally, at compile time), then it's worth having on the "safety tip" list. So, mine has been built up mostly from personal experiences with specific bits of code or debugging saga - sort of my own "institutional knowledge" repository. I agree completely, that's exactly the sort of "safety tip" I have in mind.
Simplicity and clarity
Note that all these safety tips are guidelines rather than strict rules - as in Bill Murray's quote from Ghostbusters, actually, it's more of a guideline than a rule. As a professional programmer, you should not only have the freedom to choose which safety tips you like and don't like in general, but you should also keep the freedom to break your own rules in a specific case. Until you're replaced by an AI - you are the expert. Trust yourself. When you do deliberately break one of your own rules, this should be surprising to you - so it's worth adding a comment saying why you did. Some future maintainer of your code - you in the future, for instance - might thank you for it.
- Write code primarily for simplicity and clarity (the KISS principle). This is the core principle of all these PSD articles - see the cover of Kernighan and Pike's wonderful book 'The Practice of Programming' (see my review), in which a sausage dog is pointing at a blackboard on which are written the 3 cardinal principles: simplicity, clarity, generality. Speed is much less important while developing your programs than simplicity and clarity - and of course correctness! As Kernighan and Plauger said: first make it work, then make it right, and, finally, make it fast. (Thanks to Rick Marshall of LinkedIn for reminding me of the "Make it work, then make it fast" saying and Carl Reynolds of LinkedIn for the attribution of it to K&P). When the speed of a working program becomes important, see my later comments on optimization and - especially - code profiling.
- Favour simple techniques over complex ones - but not to the complete exclusion of more complex techniques, use complex techniques when you judge they are necessary on grounds of clarity, simplicity or efficiency.
- Favour iteration over recursion: Do not use recursion when a simple iterative solution is as clear as, or clearer than, the recursive solution. Save recursion for cases where it really helps to reduce complexity - some algorithms (eg Quicksort) are naturally written recursively, and Quicksort is one of the fastest sort algorithms, so it helps to dispel the myth that recursion is always slow.
Note: I should make it clear that I love recursion, and functional programming (which uses recursion almost to the exclusion of everything else). What I hate is all those contrived, artificial, STUPID recursion examples - that you find in books, articles and Wikipedia - that use recursion when a simple for loop will do the job. New programmers can often feel that recursion is a way of making a loop harder to understand. For example, on the Wikipedia Mutual Recursion page, the first "basic example" of mutual recursion is how to calculate whether a number is even or odd via 2 mutually recursive functions. This doesn't need to use mutual recursion. It doesn't even need to use plain or direct recursion (a function calling itself). For goodness sake, it doesn't even need a for loop: have they not heard of the modulus operator, written in C as:
    #define is_even(n) ((n)%2==0)
    #define is_odd(n)  ((n)%2!=0)

(is_odd tests against 0 rather than 1 because, in C, n%2 is -1 for a negative odd n.)

Aside: as well as being an application of Occam's razor, you can think of this argument as a form of strength reduction. In the same way that << and + are (usually) cheaper arithmetic operations than *, so x*2 is often implemented by compilers as x+x or x<<1, iteration is cheaper (and usually simpler) than plain recursion, which in turn is cheaper than mutual recursion. Don't create an arbitrarily-deep collection of stack frames, potentially causing your program to run out of memory and die, when you don't need to.
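One caveat when testing oddness with the modulus operator: since C99, integer division truncates toward zero, so n % 2 is -1 (not 1) for a negative odd n. The sketch below uses function names of my own choosing (is_even_int and is_odd_int are hypothetical, not from any library) and compares the remainder against zero only, which holds for every int.

```c
#include <stdbool.h>

/* Hypothetical helper names, chosen for illustration only. */

/* n is even exactly when the remainder is zero, whatever the sign of n. */
static bool is_even_int(int n)
{
    return n % 2 == 0;
}

/* Test against zero: for a negative odd n, n % 2 is -1, never 1. */
static bool is_odd_int(int n)
{
    return n % 2 != 0;
}
```

Writing these as functions rather than macros also means the argument is evaluated exactly once.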
- Thou shalt not call main() - that's the OS or run time system's job. Thanks to Chris Ryan of LinkedIn for the wording of this one; much better than what I originally wrote, which was the rather dull Do not call main(). Chris adds that the ANSI C specification says that calling main() is unspecified behaviour.
- Favour double over float. Float/double type conversions contain many gotchas, so don't mix them. You may think it's easy to avoid mixing float and double, but it isn't. For example:
    float x = 3.14;

takes a double and converts it to a float - because 3.14 is a double literal. Perhaps you should have written:

    float x = 3.14f;

(but who does that?). So I recommend using double in preference to float in most circumstances - because of the extra precision, and to sidestep the double-literal problem above. Personally, I don't think that I've used float in the last 20 years - so in practice, I go further: I only use double. As ever, your mileage may vary - and for a counter-case, see the later discussion of when you might break your own guidelines. Search for rule-breaking.
- Be careful with floating point comparisons: avoid them where possible - be aware that C, and most other programming languages, only store floating point values approximately (suggested by Nigel Evans of LinkedIn).
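Both floating point cautions above can be seen in a short sketch. A float initialised from a double literal no longer compares equal to that literal, and a tolerance-based comparison is one common alternative to ==. The helper names (nearly_equal, float_matches_double_literal) are my own assumptions, and choosing a sensible tolerance depends on your application.

```c
#include <stdbool.h>

/* Hypothetical helper: true if a and b differ by at most tol. */
static bool nearly_equal(double a, double b, double tol)
{
    double diff = a - b;
    if (diff < 0) {
        diff = -diff;
    }
    return diff <= tol;
}

/* Reports whether a float set from the double literal 3.14 still
 * compares equal to that literal once promoted back to double. */
static bool float_matches_double_literal(void)
{
    float x = 3.14;      /* the double literal is rounded to float precision */
    return x == 3.14;    /* x is promoted to double for the comparison */
}
```

With the usual IEEE 754 float and double types, float_matches_double_literal() returns false, and 0.1 + 0.2 == 0.3 is also false even though nearly_equal(0.1 + 0.2, 0.3, 1e-9) is true.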
- Howard Brodale added a good point about floating point gotchas: parsing floating point values from strings via atof() works well. But if you accidentally use atoi() by mistake, this causes silent truncation with no warnings, as in:
    double x = atoi("3.14159");
    printf( "x=%f\n", x );

which prints x=3.000000. This single character typo is very hard to track down. Rick Marshall explained exactly why this occurs (atoi() only parses the integer prefix and returns an int; assigning an int to a double converts the int to double, without needing a cast), but I agree with Howard that a one-character typo with such unexpected results, that cannot be easily detected, is well worth having as a safety tip, especially as Howard says that he's seen this occur in practice.
- Use typedef to simplify type definitions and to hide irrelevant "implementation details" that should not be exposed (in case they change later). This is a form of information hiding or abstraction. Zoltan Kocsi of LinkedIn added: I like to write code on the need-to-know basis; if a module doesn't need to know what's inside a particular abstract object, it shouldn't see it in the header either, because that's temptation to make shortcuts. Similarly, Bill Moyer of LinkedIn wrote: I find typedefs handy for abstracting away the actual type, so that if I change the underlying type (say, from a struct to a struct pointer), I only have to change my code in the typedef and in functions which need to know the details of the type's implementation. Carefully designing one's data structures first can help avoid the need, but sometimes a little wiggle room is convenient. I couldn't have made both points better myself:-)
- As an example, suppose you define a variable:
    struct intlistnode *todo;

Ask yourself: is this as clear as it could be? Or is there some clearer, simpler conceptual name for a pointer to an intlistnode structure, that doesn't expose quite so much about its implementation? Alternatively, ask what is todo?
- In this case, our first thought might be to call the type intlistnodepointer, but that's still exposing implementation details to anyone who can read the separate words, and frankly it's not much better than the original. As you think about it a bit more, you probably realise: it's a list of integers. So why not define (note the presence of a nice clear comment):

    /* an intlist is a linked list of intlistnodes */
    typedef struct intlistnode *intlist;

Now, your variable definition is simply:

    intlist todo;

This is much clearer. Adding a one line comment to tell us what logical data the todo list stores would make it perfect:

    intlist todo; /* stores the list of values still to process */

- Having said all that, it's possible to overuse typedef. Personally, I don't see the point of aliasing existing types, so I wouldn't write:
    typedef double length;

Similarly, I'm not sure I'd even use typedef to define a pointer-to-an-existing type, as in:

    typedef double *doubleptr;

In particular, the fact that I can't think of a better conceptual type name than doubleptr is an indication that there's no point. Perhaps on rare occasions I might do this - especially if a suitable conceptual name made it sound like a good idea.

But note, contradicting myself, that I used to regularly place:

    typedef char BOOL;
    #define TRUE 1
    #define FALSE 0

into a bool.h header file which I included everywhere, to work around the omission of a standard boolean type from C.
- If something is available via the standard headers/libraries, use it. Ken Gregg of LinkedIn pointed out that I'm out of date in my comment about booleans: C99 defines a boolean type (called bool with values true and false) in the standard header file:

    #include <stdbool.h>

I shall immediately start using it, thanks Ken! Having been tripped up by an application of this rule just now, how can I avoid adding it?
- To hide * in typedef, or not to hide * in typedef: make your own mind up.
There is a clear division in the C community on LinkedIn as to whether it is a good idea to hide "*" (pointer to) in typedef definitions. I nearly always do, as in my example above ("intlist todo") which hides the "*". I argued above that the concept is the important thing: it's a list of integers, so the user should be encouraged to think of it as such, and use operations that we provide - treating the list as an implementation of an Abstract Data Type (ADT). The users of our intlists don't need to know that an intlist is a pointer when writing their code, in fact they should be encouraged to ignore this via abstraction, in case they start to fiddle with the internals:-)
But many others equally passionately argue that they want to write intlist *todo. Note that C's standard library has an obvious example of this style, when we work with filehandles, we write:
    FILE *f = fopen(..

Even though we revere the godlike powers of K&R who were geniuses who designed C and gave us all gainful employment for decades, my question is - if an open file is a FILE *, what separate concept does a FILE represent? In practice FILE is an implementation dependent structure definition, deliberately left opaque. But that's not a separate conceptual type, it's an implementation detail. I have to say that this is one of the few occasions where I (gasp) disagree with K&R - burn him, burn him - if I had designed <stdio.h> I would have defined the user-visible filehandle type as FILE, and hidden the "*" inside FILE's typedef definition. Then I would have written the "not quite C code":

    FILE f = fopen(..

and a tiny bit of nasty complexity ("a file is a pointer? a pointer to what? can I dereference it?") would have been hidden away. But clearly FILE * is ingrained in a million programmers' minds - including mine - and will not go away.

Giving up on FILE *, thinking about the more general case - to hide "*" or not - this is a situation where I don't think either side will convince the other - so each programmer should make their own mind up: Pick one convention, stick to it, and try not to declare a holy war on the Infidels who disagree with you (just like tabs vs spaces:-)).
I did wonder whether there's a distinction to be made between when we store one thing, vs when we store a collection of things (a list or an array or a queue). But in practice I make no distinction: ie. I hide the "*" in both cases.
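To make the "hide the *" side of the argument concrete, here is a minimal single-file sketch of the intlist type treated as an Abstract Data Type. The field names and the three operations (intlist_cons, intlist_length, intlist_free) are hypothetical, invented for this illustration. In a real module the struct definition would probably live in the .c file, with only the typedef and the prototypes in the header.

```c
#include <stdlib.h>

/* an intlist is a linked list of intlistnodes; NULL is the empty list */
typedef struct intlistnode *intlist;

struct intlistnode {
    int     head;   /* the value stored in this node */
    intlist tail;   /* the rest of the list */
};

/* Hypothetical operation: build a new list with value in front of rest.
 * Returns NULL if malloc() fails; rest is left untouched in that case. */
intlist intlist_cons(int value, intlist rest)
{
    intlist node = malloc(sizeof *node);
    if (node != NULL) {
        node->head = value;
        node->tail = rest;
    }
    return node;
}

/* Hypothetical operation: count the elements, iteratively. */
int intlist_length(intlist l)
{
    int n = 0;
    for (; l != NULL; l = l->tail) {
        n++;
    }
    return n;
}

/* Hypothetical operation: free every node of the list exactly once. */
void intlist_free(intlist l)
{
    while (l != NULL) {
        intlist next = l->tail;   /* save the link before freeing the node */
        free(l);
        l = next;
    }
}
```

Client code can then write intlist todo = intlist_cons(1, intlist_cons(2, NULL)); without ever mentioning a pointer.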
- Favour using standard library functions like qsort(), bsearch() - and even strlen(), strcat(), strcpy() et al - rather than writing your own routines. Sort and binary search algorithms are notoriously hard to get right, and the library versions have been extremely well debugged - and optimized - for you. Plus, as Steve Jobs said: "The line of code you don't write is the line of code you never have to debug". Howard Brodale said much the same: The less software there is the more safe it is.
Against this viewpoint Dave Ryland of LinkedIn suggested that you should only use bounded operations, and hence functions like strcpy, strcat, sprintf, etc should be banned. By bounded, I think he means: guaranteed to terminate in known amounts of time. While ensuring that termination occurs (ie. that programs end:-)) is very important, deciding never to use strcpy() - because it won't terminate properly if you give it a nonterminated string, ie. a character array that does not contain a '\0' character - seems like an extreme position that I can't agree with.
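As a sketch of leaning on the library rather than hand-rolling a sort and a binary search, the following wraps qsort() and bsearch() for an array of int. The wrapper names (compare_ints, sort_ints, find_int) are my own, for illustration only; qsort() and bsearch() themselves are the standard <stdlib.h> functions.

```c
#include <stdlib.h>

/* Comparison callback for qsort()/bsearch(): orders ints ascending.
 * Written as two comparisons to avoid the overflow that a - b can cause. */
static int compare_ints(const void *pa, const void *pb)
{
    int a = *(const int *)pa;
    int b = *(const int *)pb;
    return (a > b) - (a < b);
}

/* Hypothetical helper: sorts values[0..n-1] in place. */
static void sort_ints(int *values, size_t n)
{
    qsort(values, n, sizeof values[0], compare_ints);
}

/* Hypothetical helper: returns a pointer to an element equal to key
 * in the sorted array values[0..n-1], or NULL if there is none. */
static int *find_int(const int *values, size_t n, int key)
{
    return bsearch(&key, values, n, sizeof values[0], compare_ints);
}
```

Note that bsearch() is only meaningful on an array already sorted with the same comparison function.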
- Favour using idiomatic C constructs, because everyone is used to them and will recognise them at a glance. For example, an N-times loop is most idiomatically written:
    for( x=0; x<n; x++ ) { }

If you instead write:

    for( x=0; x<=n; x++ ) { }

that'll cause confusion: it's an N+1 times loop! It's almost certain to cause an off-by-one error! You probably meant to write the other idiomatic N times loop form:

    for( x=1; x<=n; x++ ) { }

- Mike Thompson of LinkedIn added that there's a more critical reason for writing idiomatic code: erroneous code does "not look right", making it easier to find defects. Correct code should have a harmonious feel about it; defective code "sounds" off-key, causing the reader to look more closely for the cause.
As a corollary, if you need to do something out of the ordinary, add a comment. The unusual code will disrupt the reader's "flow". In the best case, it will cause a closer examination, leading to the conclusion the code is correct. In the worst case, someone will "fix" the "broken" code. A good example of this is falling through from one switch case to the next: add:

    /* no break */

or the old lint convention (anyone remember lint?):

    /* FALLTHRU */

Note that Kernighan & Pike support Mike (in the Practice of Programming): bugs often lurk in non-idiomatic stylistic constructs that skilled programmers don't recognise.
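Here is a small sketch of a deliberate fall-through marked with the lint-style comment. The function steps_remaining and its numbering of steps are invented purely for illustration.

```c
/* Hypothetical example: how many setup steps remain, given the step
 * we are about to start from. Each case deliberately runs on into
 * the next one, so every fall-through is marked. */
static int steps_remaining(int start_step)
{
    int remaining = 0;

    switch (start_step) {
    case 1:
        remaining++;
        /* FALLTHRU */
    case 2:
        remaining++;
        /* FALLTHRU */
    case 3:
        remaining++;
        break;
    default:
        break;      /* unknown step: nothing remains */
    }
    return remaining;
}
```

The comment is aimed at human readers first, but some compilers also recognise it: GCC's -Wimplicit-fallthrough warning, for example, is silenced by comments of this form.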
What sort of rule-breaking might we want to do? for example, we talked about favouring double over float above. But a float typically occupies half as much memory as a double. If you're working on a memory constrained system such as an embedded system, and storing millions of floating point values, declaring a large array of float instead of double may make the difference between your code fitting in available memory or not!
Avoid Undefined Behaviour and Gotchas
ANSI C clearly defines certain things as unspecified, undefined or implementation-defined behaviour. To quote the much missed Douglas Adams: C has rigidly defined areas of doubt and uncertainty. Do your best to avoid these. There are too many to list (eg. order of evaluation of parameters and expressions with side-effects) - note that the CERT Secure C Coding and the MISRA-C-2012 guides go into a lot of them. But here are a few positive tips worth mentioning:
- Set all variables before using them. Unlike some other languages, C does not initialise variables unless there is an explicit initialisation. Using the value of an uninitialised variable causes major problems. You know it's your responsibility to set variables before using them - so get used to it. Always ensure a variable is set on all code paths before it is used. (Thanks to Mike Thompson for the improved wording of this point after a useful discussion on LinkedIn. I'd said Initialise all variables but others thought that I meant always use an initializer at the point of defining a variable. This was not what I meant).
- Enable maximum compiler warnings: Compilers are getting better at warning about undefined behaviour (like detecting uninitialised variables) so let them!
- Having enabled maximum warnings, understand and fix every warning that you receive. Rick Marshall of LinkedIn adds that, as a result your program will be better, safer, and you will understand it and C better.
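A minimal sketch of "set on all code paths": the variable below has no initializer, but every branch assigns it before it is read. The function name clamp_to_percent is hypothetical. If the final else branch were deleted, result would be read unset for in-range values, which is the kind of mistake that warnings (commonly enabled with -Wall -Wextra on GCC or Clang) can often catch.

```c
/* Hypothetical example: clamp n into the range 0..100. */
static int clamp_to_percent(int n)
{
    int result;             /* deliberately no initializer */

    if (n < 0) {
        result = 0;
    } else if (n > 100) {
        result = 100;
    } else {
        result = n;         /* without this branch, result would be
                             * unset whenever 0 <= n <= 100 */
    }
    return result;          /* set on every path above */
}
```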
- Always put {} around single statements in if, while and for loops, to avoid the "wrong scope" problem: C allows you to write:
    if( x>1 )
        x++;

Which is fine. However, if you later add some debugging:

    if( x>1 )
        printf( "debug: x>1\n" );
        x++;

Then the indentation makes you believe the printf and the increment are the "then" part, but they're not (except in Python which is just bizarre). In C, the above code actually means:

    if( x>1 ) {
        printf( "debug: x>1\n" );
    }
    x++;

That's probably not what you meant. Avoid it by writing:

    if( x>1 ) {
        x++;
    }

from the start. You may be happy to allow the one-line variation as a special case:

    if( x>1 ) x++;

- Another classic gotcha of this type involves empty bodies in for loops:

    for( i=0; s[i]!='\0'; i++ )
        /* do nothing */
    return i;

which, due to the lack of {} or a semi-colon, does not mean what the programmer wanted. It means:

    for( i=0; s[i]!='\0'; i++ )
        return i;

Which is equivalent to:

    i=0;
    if( s[0]!='\0' ) {
        return 0;
    }

To avoid this danger, write:

    for( i=0; s[i]!='\0'; i++ ) {
        /* do nothing */
    }
    return i;

Note: this is the one tip here that you may classify as a stylistic rule. But it's a genuine safety tip because it can save you hours of confusion. Also, I'm not telling you how to lay out the {}, just suggesting that you put them in:-)
- Carl Reynolds adds a tip I hadn't consciously thought of - but thinking about it, I automatically follow. If-then-else: shortest case first: When writing an if-then-else-statement, structure your logical expression so the true statement will be the shorter of the two code blocks. I know this may sound like a stylistic rule, but if I have two blocks one with 20 lines and the other with a single line, and the logical expression is set up so the 20 line block comes first, I can forget what is going on with the statement by the time I get to the else part. If, contrariwise, the logical is structured so the single line block comes first, I can read the logical expression, understand how it leads to the single line block, then while the logical is still on the screen, I can move on to the else block and refer back to the logical if I need to without losing my train of thought. It makes the code much easier to read, and is not just a stylistic rule.
- In a similar way (avoiding future problems), put a trailing ',' into every array literal: While C allows you to write:
    char *x[] = {
        "hello",
        "there",
        "how",
        "are",
        "you"
    };

I recommend adding the trailing comma:

    char *x[] = {
        "hello",
        "there",
        "how",
        "are",
        "you",
    };

because it makes it easier to add an extra element at the end without the risk of forgetting to add the ',' to the previous line. C allowing the trailing ',' also makes various tools that generate C easier to write, because those tools don't need a special case "add a comma between every two entries but not after the last".
- Use the C pre-processor as little as you need to, and be extremely careful when you do. The C pre-processor is brilliant, I'm so glad that C has it, you can do many amazing things with it. But you can also confuse yourself utterly. For example, consider the classic:

    #define square(a) a*a
    square(x+1)

This expands to:

    x+1*x+1

Or 2x+1 as it's known to its friends. The solution to this particular problem is round brackets (aka "parentheses"); lots and lots of round brackets:

    #define square(a) ((a)*(a))

Of course, even this doesn't defend against the horror that occurs if the expression you're squaring has side effects:

    square( x++ );

(There's no solution to this one, apart from don't use macros at all).
- Nigel Evans adds another macro-related tip: Use __FILE__/__LINE__ in macros for logging and tracing: This is an excellent point. Very few other languages have a macro-processor embedded in them - and many have criticised C's inclusion of its pre-processor. But features like __FILE__ and __LINE__ are incredibly powerful.
For one use of them, see my 2nd C-Tools lecture: there's a section starting at slide 5 about detecting memory leaks, using a home brew library from a Dr Dobbs Journal article back in the early 1990s called libmem. The associated tarball contains the complete source code of libmem and examples of how to use it. How does libmem work its magic? By redefining malloc(), free(), exit() and so forth: the redefined malloc() allocates a bigger block than the user wanted, links all the allocated blocks together into a linked list, and records the source filename and line number - using __FILE__ and __LINE__. The redefined exit() then reports on any allocated blocks that weren't free()ed, reporting (for each block) the size of the block, and the source filename and source line no where the allocation took place. This makes it easy to search for corresponding missing free() calls elsewhere. libmem simply could not be written without the pre-processor.
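To illustrate the general idea (this is not libmem's code, just a minimal sketch under my own assumptions), a macro can capture the caller's __FILE__ and __LINE__ and hand them to an ordinary function. The names format_location, LOG_LOCATION and TRACE are hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: writes "file:line: msg" into buf (at most size
 * bytes, always NUL-terminated when size > 0). Returns what snprintf()
 * returns: the length the full message needs. */
static int format_location(char *buf, size_t size,
                           const char *file, int line, const char *msg)
{
    return snprintf(buf, size, "%s:%d: %s", file, line, msg);
}

/* The macro is expanded at the call site, so __FILE__ and __LINE__
 * describe the caller, not the helper function above. */
#define LOG_LOCATION(buf, size, msg) \
    format_location((buf), (size), __FILE__, __LINE__, (msg))

/* A simpler variant that traces straight to stderr. */
#define TRACE(msg) \
    fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, (msg))
```

A function alone could not do this: only the pre-processor knows where in the source the call was written.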
- Use round brackets in order to make expressions clearer - but not too many: In moderation, extra () can be a good thing, especially with obscure things like bitwise operators. For example, in x&7<<5, how many of us can remember whether that means (x&7)<<5 which we perhaps hoped it meant, or x&(7<<5)? Actually, due to one of the few poor historical coping-with-change decisions K&R made and later regretted, it means the latter, and since 7<<5 is a constant == 224, it means x&224. I once got badly bitten by this sort of case. Much better to write (x&7)<<5 to make sure you're right. The question of what constitutes sensible bracketing, and what is overuse, is a style question.
- Learn operator precedences: Slightly supporting, and slightly contradicting, the above tip, Rick Marshall of LinkedIn gives us this tip, saying: This is really important. If you understand precedence the code is clearer and far less likely to have errors. I know some of the rules are hard to remember, but not that hard. I'd certainly agree that round-bracketing extremes like (3+(4*5)) should be avoided; IMHO, if a programmer doesn't know the priorities of + and * in C, they should not be programming in C!
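The x&7<<5 example from the bracketing tip can be checked directly. In this sketch the two function names are hypothetical; the first keeps the unbracketed expression exactly as written, the second adds the brackets that say what was probably intended.

```c
/* Unbracketed: << binds tighter than &, so this is x & (7 << 5),
 * i.e. x & 224. */
static unsigned mask_unbracketed(unsigned x)
{
    return x & 7 << 5;
}

/* Bracketed: take the low three bits of x, then shift them up. */
static unsigned mask_then_shift(unsigned x)
{
    return (x & 7) << 5;
}
```

For x = 3 the first returns 0 (3 has no bits in common with 224) while the second returns 96, so the two readings really do differ.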
- Nigel Evans suggested the tip: Limited use of globals often indicates good design. Generally, I agree with Nigel, but please note that this is not the same as never use globals (not that Nigel suggested that). I certainly try to avoid having publicly-visible globals in modules, unless there's a really good reason why one makes sense. I'm considerably more relaxed about static globals in modules: private but long-lived per-module state.
Pointers and Storage Management
One of the most complex and error-prone parts of C programming is using pointers safely, getting memory allocation and storage management right. Here are a few safety tips in this area:
- Develop a good respect for pointers: know how many different ways you can use them, and know that the even more numerous ways of misusing them can furrow up your code and your brow. (Suggested by Laura Nass of LinkedIn).
- Always free() what you malloc() exactly once: Make sure that you free() everything that you (or a library you use) malloc()s. Failing to do this means that you leak memory. Use tools such as valgrind (or my personal favourite, a homebrew library called libmem from an ancient Dr Dobbs Journal article) to solve this problem. Learn how to use such a tool, and then use it regularly. This topic is sufficiently deep that I intend to write a future PSD article exploring it, with examples.
- Write defensive code so you don't overrun array bounds: Overrunning the bounds of an array is another common and hard to debug case. While we might wish that C compilers would warn you about the blatant error:
    int x[4]=...;
    int i;
    for( i=0; i<10; i++ )
    {
        ... access x[i]
    }

They won't. Of course, the obvious way of avoiding this problem (following the DRY principle) is to define the number of elements via a #define:
    #define NELEMENTS 4
    int x[NELEMENTS]=...;
    int i;
    for( i=0; i<NELEMENTS; i++ )
    {
        ... access x[i]
    }

Chris Ryan of LinkedIn pointed out another technique to prevent these simple compile-time bounds errors - that I should have remembered myself (duh!):

    #define countof(x) (sizeof(x)/sizeof(*(x)))
    for(i=0; i<countof(x); i++)
    {
        ... access x[i] safely
    }

This even works when your array definition has an initializer but no explicit size (and then copes with future changes to the number of elements in the initializer):
    #define countof(x) (sizeof(x)/sizeof(*(x)))
    int x[] = { 10, 20, 30, 40, 50 };
    for(i=0; i<countof(x); i++)
    {
        ... access x[i] safely
    }

This technique is great, and I recommend you use it as much as possible. But note one danger with it: when an array is passed into a function, or an array name is used alone in an expression, what is really passed is a basal pointer, and then inside that function the countof() trick doesn't work. See the later discussion of Understanding the difference between arrays and pointers (in the section on Functions) for more information on this.

In addition, countof() doesn't address the sorts of array boundary overruns which can occur at run-time. To handle these, write defensive code that checks (at run-time) that array indexes are in bounds. To allow this inside a function, pass a separate argument: how many items there are in the array. Tools like valgrind can also help debug bounds-checking problems, but it remains mainly your responsibility.
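As a small sketch of both halves of this tip, the caller below uses countof() where the array is still an array, and the functions receive the element count as a separate argument so that they can check indexes at run-time. The function names sum_ints() and get_checked() are hypothetical, invented for this illustration.

```c
#include <stddef.h>

#define countof(x) (sizeof(x)/sizeof(*(x)))

/*
 * int total = sum_ints( a, nel );
 *     Return the sum of the nel integers starting at a.
 *     Inside here a is only a pointer, so the caller must say
 *     how many elements there are.
 */
int sum_ints( const int *a, size_t nel )
{
    int total = 0;
    for( size_t i = 0; i < nel; i++ )
    {
        total += a[i];
    }
    return total;
}

/*
 * int ok = get_checked( a, nel, i, &value );
 *     Defensive accessor: if i is a valid index into the nel
 *     elements at a, store a[i] in *out and return 1.
 *     Otherwise leave *out alone and return 0.
 */
int get_checked( const int *a, size_t nel, size_t i, int *out )
{
    if( a == NULL || out == NULL || i >= nel ) return 0;
    *out = a[i];
    return 1;
}
```

A typical call is sum_ints( x, countof(x) ), made in the scope where x was defined as an array. Whether an out-of-range index should return a failure code, as here, or trip an assert() is a design choice; the failure code suits indexes that come from user input.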
- Write defensive code so you don't dereference NULL. At least one platform (eg BSD Unix on a VAX, back in the 1980s) allowed C code to dereference NULL - and get NULL. This is a source of latent bugs. Avoid them by checking where possible: If you know that it is logically impossible at a particular point in your program for a variable to be NULL, check it by:
    #include <assert.h>
    ...
    assert( variable != NULL );

(or some other checking mechanism if you don't want the program to abort at runtime when the condition that must be true isn't).

- Understand variable lifetimes. Any attempt to take &localvariable should ring alarm bells: if you return it, or build it into a long-lived data structure, and then return from the current function invocation, the local variable has been destroyed. Note that an array name automatically turns into a pointer, without an explicit &, so the following is a classic example of this problem:
    char *badfunction( void )
    {
        char localbuffer[100];
        strcpy( localbuffer, "string" );
        return localbuffer;
    }

- Be careful with shared pointers and deep/shallow copies: Personally, the most troublesome storage management issue for me, even after 30 years of programming in C, is the thorny problem of shared pointers and shallow vs deep copying of pointer structures. It's often efficient to store one pointer in two or more different data structures. But then the question becomes: who is responsible for free()ing that shared pointer? As C doesn't have a built-in reference counting system, there's no simple answer to this - it's depressingly easy to either forget to free() the shared pointer, or to try to free() it twice! Both are equally bad. Again, tools like valgrind can at least tell you when this has happened, leaving you to figure out why and how to fix it.
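Returning to the badfunction() example under the variable lifetimes tip, here are two common ways of getting the same string back to the caller without returning a pointer to a dead local. This is only a sketch: the names fill_buffer() and make_string() are hypothetical, and which option is better depends on who you want to own the storage.

```c
#include <stdlib.h>
#include <string.h>

/*
 * char *result = fill_buffer( buf, buflen );
 *     Option 1: the caller owns the storage. Copy the string into
 *     buf (which has room for buflen chars) and return buf, or
 *     return NULL if buf is NULL or too small.
 */
char *fill_buffer( char *buf, size_t buflen )
{
    const char *text = "string";
    if( buf == NULL || strlen(text) + 1 > buflen ) return NULL;
    strcpy( buf, text );
    return buf;
}

/*
 * char *s = make_string();
 *     Option 2: heap storage, which outlives this call. Returns a
 *     malloc()ed copy of the string, or NULL if malloc() fails.
 *     The caller is responsible for free()ing the result.
 */
char *make_string( void )
{
    const char *text = "string";
    char *copy = malloc( strlen(text) + 1 );
    if( copy != NULL ) strcpy( copy, text );
    return copy;
}
```

Note how each contract comment states who owns the storage. That is the same question the shared pointers tip asks, answered up front.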
Functions
- Give each function a contract specification comment: When you start to define a function, before you even start working out how to program it, think of it as a black box with a label that tells you what it does. My personal technique is to write down a typical call to it, and explain clearly what that call will do for the user (the caller), not how it will do it. Make this the function's top-level comment. This forms the Contract that the function makes with the user - see Design by Contract. For example:
    /*
     * int found = find_something( searchvalue, c );
     *     Search within a collection (c) for a given
     *     searchvalue (a string). If you find it, return the
     *     position in the collection - an integer >= 0.
     *     If the searchvalue is not present in c,
     *     return -1.
     */
    int find_something( char *searchvalue, collection c )
    {
        /* STUB: implement me! */
    }

This gives us a beautifully defined function that, once we've implemented it and tested it, will be a pleasure to use. Note that you can copy the example call (and tweak the parameters) into a place where you want to invoke the function.

BTW, this fits very nicely into Test Driven Development, which I wrote an article about recently. First you write the contract above, with a stub implementation (like "return -1;"), then you copy the example call into your testsuite and decide how many calls to this new function you need, what parameters each call should have, what the correct answer should be to each call, and how to test that you get the correct answer. Then you run your testsuite and see some/all of your tests of find_something() fail. Now go back and actually implement it.
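Here is one way the contract-then-tests-then-implementation sequence might end up. The article never defines the collection type, so the struct below (an array of strings plus a count) is purely an assumption made for this sketch, as is the test_find_something() function.

```c
#include <string.h>

/* ASSUMPTION: collection is not defined in the article; for this
   sketch it is an array of strings plus a count. */
typedef struct {
    const char **items;
    int          nitems;
} collection;

/*
 * int found = find_something( searchvalue, c );
 *     Search within a collection (c) for a given
 *     searchvalue (a string). If you find it, return the
 *     position in the collection - an integer >= 0.
 *     If the searchvalue is not present in c,
 *     return -1.
 */
int find_something( const char *searchvalue, collection c )
{
    for( int i = 0; i < c.nitems; i++ )
    {
        if( strcmp( c.items[i], searchvalue ) == 0 ) return i;
    }
    return -1;
}

/* Hypothetical testsuite fragment, written before the real body
   above existed: returns the number of failed checks. */
int test_find_something( void )
{
    const char *words[] = { "apple", "banana", "cherry" };
    collection c     = { words, 3 };
    collection empty = { NULL, 0 };
    int failures = 0;

    if( find_something( "apple",  c ) != 0 )      failures++;
    if( find_something( "cherry", c ) != 2 )      failures++;
    if( find_something( "durian", c ) != -1 )     failures++;
    if( find_something( "apple",  empty ) != -1 ) failures++;
    return failures;
}
```

With the original stub body ("return -1;") the first two checks fail and the last two pass, which is the some-tests-fail stage described above. With the real body all four pass.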
What if your functions don't look anything like this? Well they should look as clear as this! You know how you were always told that comments were important, and you nodded sagely and then ignored the advice? Contract comments are the most important type of comment, IMHO! Note that I'm not saying that my layout and style of comment is the only one, just that you should think about it, pick whatever style of comment you like, and use it consistently. But whatever style you choose, it should be as clear and helpful to the client as that shown above. The client, btw, is the programmer who wants to determine whether (and how) to call your function.
The core point here is: Give a function some form of comment that tells the programmer who wants to call it what it does. A function with no comment at all - with no form of specification contract - is virtually useless IMHO. How is a programmer to decide whether they want to call your function or not? All they have to go on "outside the box" is the function and parameter names, and that's not enough to call it safely.
The "But What Does it Do" Corollary: if you struggle to clearly define what the function does - without explaining how it does it - then stop and think some more. How can you implement something without knowing what it does? Even if you can, based on some vague intuition of what you mean, how is another programmer to know that your function is the one to do what they need to do? Try to explain it to someone else (or, failing that, to the Rubber Duck or Teddy Bear on your desk - see this article of mine for an explanation of what I'm talking about here). In extreme cases, change your mind and delete the function!
- Carl Reynolds added the following useful heuristic about function contract comments: If it takes more than one or two sentences (of reasonable length) to describe what a function does, it's probably too long and should be broken up and Jack Purdum agreed, saying: Carl: Exactly my definition of a cohesive function!
- Implement a function by opening the box: When you've written a contract comment as discussed above, and you decide to start implementing the function, invert the picture - open the black box, ignore everything outside (except the label, it's on the boundary between the inside and outside, ok?), and use the contract comment to remind you what the function must do, guiding you while you implement it. I find this mental discipline extremely helpful, and recommend it strongly.
- As you can see above, Design by Contract (DbC) is a very powerful technique that I favour. Chapter 4 of the Pragmatic Programmer describes DbC better than I ever could. Clayton Weimer also recommended 2 excellent articles that cover DbC thoroughly:
Jack Ganssle's DbC article, and
Miro Samek's DbC for Embedded Software.

I really like the "assertions are like fuses" point from both articles, and the strong defense of leaving assertions turned on in (most) production code, because - after all - you wouldn't remove the fuses from an electronic circuit in production. However, after a depressingly lengthy discussion of this on LinkedIn, I can only summarise the argument as:
- Check your preconditions and postconditions (or some of them): Earlier, we showed a defensive check using assert():
    assert( parameter != NULL );

This may be a special case of checking a precondition - a condition which must be true when a function is entered, in order for it to work properly. Similarly, a postcondition specifies a condition which must be true on exit from a particular function (these are less common). Normally, these conditions are just written as comments. But if you like, you can make some/all of them active checks by using assert() as above. Adding assertions, especially while you are developing your code, is another defensive programming technique, and can really help you to test and debug your code. Here, we are following the Pragmatic Programmer tip: Crash Early: A dead program normally does a lot less damage than a crippled one.

Similarly, assertions can check a loop invariant: a condition which must be true at the top of every iteration of some kind of loop. However, assertions have runtime cost, so try not to clutter things up unnecessarily - for example don't check a property inside a tight loop, especially if that property is guaranteed by the structure of a loop over a whole array or collection.
Note that the standard assert() can be disabled by defining NDEBUG before assert.h is included. For example, defining this in a Makefile (via -DNDEBUG) when compiling for production will turn all assertions off. (Thanks to Ken Gregg for reminding me of this). Note that this brings up a very subtle point: if you turn off assertions in production (which many programmers recommend), what will happen if (say) a precondition is violated? Answer: something undefined will happen;-). Often an uninitialized result will be returned, containing random old-stack data - and then your production code will carry on running, using a piece of uninitialized data in unforeseen ways. That's not going to turn out well either! We're still discussing this on LinkedIn (not that we're going to solve it, it's unsolvable IMHO).
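As an illustration of active precondition and postcondition checks, here is a small sketch. The function max_of() is hypothetical, invented for this example; the assert() calls are the standard ones from assert.h.

```c
#include <assert.h>
#include <stddef.h>

/*
 * int m = max_of( a, nel );
 *     Return the largest of the nel integers starting at a.
 *     Precondition:  a != NULL and nel > 0.
 *     Postcondition: the result is >= every element of a.
 */
int max_of( const int *a, size_t nel )
{
    /* preconditions: cheap to check, so check them */
    assert( a != NULL );
    assert( nel > 0 );

    int best = a[0];
    for( size_t i = 1; i < nel; i++ )
    {
        if( a[i] > best ) best = a[i];
    }

    /* postcondition: only a partial check (first and last element),
       because re-scanning the whole array would double the cost */
    assert( best >= a[0] && best >= a[nel-1] );
    return best;
}
```

Compiled normally, calling max_of( NULL, 0 ) aborts at the first assert() with the file name and line number. Compiled with -DNDEBUG, both checks vanish and the same call goes on to read a[0] through a NULL pointer, which is exactly the production dilemma discussed above.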
- Favour functions with single exit points: Carl Reynolds suggested this tip, and Peter Hanely discussed it: In C, a function can have any number of return statements. Value-returning functions may return the answer as soon as they know it, in some special case or perhaps deep inside one or more for loops. Void functions (or "procedures" as we used to call them) may get to a point in the algorithm where they wish to return immediately from the call.
A function with a single exit point (either via an explicit return statement, or an implicit "fall off the end") is generally easier to understand than one with several returns. However, multiple return statements - exit points - can be useful in some circumstances, in fact judicious use of them may increase clarity, so you should feel free to use them when you judge that they increase clarity rather than reduce it. For example, searching for an element in an array (or other collection) and returning its position in the array, or -1 if it's not found, is most clearly written:
    int i;
    for( i=0; i<NELEMENTS; i++ )
    {
        if( a[i] == target ) return i;    /* found target at position i */
    }
    return -1;    /* target not found */

Note in passing that "return result if condition" is one of the few occasions when I permit myself not to add {} around the then part.

- Simplify function definitions using typedef: We saw earlier in the Simplicity and Clarity section that typedef could simplify type and variable definitions. This applies with increased force to function definitions and declarations.
C allows you to write gobsmackingly complex function definitions like the following:
    struct intlistnode *listmap( struct intlistnode *list, int (*f)(struct intlistnode *) ) { ...

Can you understand that? Can I? Doesn't it give you a headache? It needs simplifying to increase its clarity. Earlier, we decided that a struct intlistnode * is an intlist, and wrote:

    typedef struct intlistnode *intlist;

So using that our function definition becomes:

    intlist listmap( intlist list, int (*f)(intlist) ) { ...

Now, that's simpler, but not simple enough for me - of course, you may well find this simple enough for you, i.e. as always, your mileage may vary. If you agree with me, define a pointer-to-function type:

    /* a pointer to a function taking an intlist and returning an int */
    typedef int (*intlist2intfunc)(intlist);

Now the function definition becomes:

    intlist listmap( intlist list, intlist2intfunc f ) { ...

That's much simpler.

Now add the contract comment and you get:
    /*
     * intlist newlist = listmap( list, transform );
     *     listmap() takes an intlist (list) and an element
     *     transformation function (transform), which takes
     *     a list and returns an integer (usually based only
     *     on the first item of the list).
     *
     *     listmap() applies the transform function to each
     *     element in list, building a new list of the transformed
     *     element results. It returns the new list.
     *
     *     For example:
     *         int double_element( intlist list )
     *         {
     *             return 2 * list->head;
     *         }
     *         intlist newlist = listmap( [1,5,7], &double_element );
     *     will build a new list [2,10,14]
     */
    intlist listmap( intlist list, intlist2intfunc f ) { ...

Once again, we now have a beautifully defined function that will be a pleasure to use. Well, it will be once we've implemented it:-)

Note: depending on the context, there may be a more intuitive name than intlist2intfunc; you should choose the most intuitive conceptual name (as we said earlier). In this context, perhaps a better type name might be transformelementfunc?
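For completeness, here is one possible implementation of listmap() that honours the contract comment. It is only a sketch under stated assumptions: the article shows list->head but never shows the rest of struct intlistnode, so the tail field, the freelist() helper, and the choice to return NULL if malloc() fails are all mine.

```c
#include <stdlib.h>

/* ASSUMPTION: only the head field appears in the article; the
   tail field is a guess at the rest of the node. */
struct intlistnode {
    int                 head;
    struct intlistnode *tail;
};
typedef struct intlistnode *intlist;

/* a pointer to a function taking an intlist and returning an int */
typedef int (*intlist2intfunc)(intlist);

/* Hypothetical helper: free every node of list. */
void freelist( intlist list )
{
    while( list != NULL )
    {
        intlist next = list->tail;
        free( list );
        list = next;
    }
}

/*
 * intlist newlist = listmap( list, transform );
 *     As in the contract above. In this sketch, additionally:
 *     an empty list maps to an empty list (NULL), and NULL is
 *     also returned if malloc() fails part way through.
 */
intlist listmap( intlist list, intlist2intfunc f )
{
    intlist  result = NULL;
    intlist *endp   = &result;  /* where the next new node is attached */

    for( ; list != NULL; list = list->tail )
    {
        intlist node = malloc( sizeof(*node) );
        if( node == NULL )
        {
            freelist( result );
            return NULL;
        }
        node->head = f( list );
        node->tail = NULL;
        *endp = node;           /* append, preserving element order */
        endp  = &node->tail;
    }
    return result;
}

/* The example transform from the contract comment. */
int double_element( intlist list )
{
    return 2 * list->head;
}
```

Note how the body reads almost as a restatement of the contract: walk the list, apply f, build a new list. The caller owns the new list and must eventually pass it to freelist().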
- Try to keep function bodies reasonably short: a function body is undoubtedly easier to understand if it's reasonably short. (This was suggested by Takuya Tokiwa of LinkedIn). Personally, I don't want to try to define what reasonably short means - many programmers (including Takuya) like their functions to fit on a single screenful, or printed page (for oldies like me who still make, gasp, printouts on occasion). The obvious technique to use when a function body exceeds your personal length tolerance is to pick a discrete section of the body, and refactor it into a separate function - taking the opportunity to generalise it, as in the (tasteless but funny) programmer's joke that a programmer would never write a function called BombBaghdad, they would write a function called BombCity which takes the name of the city to bomb as a parameter.
- Understand the difference between arrays and pointers: This is an important one. C has a tendency to turn an array name into a basal pointer at the drop of a hat (when using the name of an array variable in an expression, assigning an array to a variable, or when passing an array into a function - which is the same thing as assigning it to a variable). Note that I've put this tip under Functions, but it applies to plain assignments too. C's "turn it into a pointer" tendency trips lots of new C programmers up - especially as C doesn't fully expose this tendency. So, even if you write:
    void wibble( int a[] )
    {
    }

and then call:

    int x[4]={10,20,-4,5};
    wibble( x );

The array x is not passed to wibble. A pointer to the first element of x is passed instead (i.e. &x[0]). It's equivalent to the assignment:

    int *a = &x[0];

or the canonical form (which means the same thing):

    int *a = x;

For this reason, for decades I've made this "turn into a pointer" behaviour explicit in function definitions. Rather than writing the above, I write:

    void wibble( int *a )
    {
    }

This is another application of the KISS principle: given that C passes an array as a pointer whether or not I wanted that to happen, I document the reality in my code. I don't pretend that C has passed a generic "array of ints", because it really hasn't. But then, inside wibble(a), it's perfectly safe for you to use the pointer a as an array, because of pointer/array equivalence in expressions.

While we're at it, we're probably going to need a way of determining how big the array that was passed in was (unless the array is terminated by a sentinel, eg. a string (array of characters) terminated by '\0', or an array of strings terminated by NULL), so in most cases we need to pass the array size explicitly:
    void wibble( int *a, int nel )
    {
        int i;
        for( i=0; i<nel; i++ )
        {
            access a[i]
        }
    }

To provide this information in the call, we might use the countof() macro that we mentioned earlier:

    wibble( x, countof(x) );

(this assumes that in the caller's scope x was an array, not already a basal pointer by earlier assignments/parameter passing).

Aside: on the rare occasions when you must pass a whole array, not a basal pointer, into a function, you can wrap it into a structure (because structs are the only large things that are copied in function calls and variable assignments) - but that's really inefficient as the whole array is copied - and because it's a copy the function cannot change the array. So this trick is rarely used. If you don't want to change the array inside a function, just don't change it:-) And write a comment to say "the array will not be changed". Alternatively, this may be a case when you want to use const.
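The array-versus-pointer distinction shows up most visibly in sizeof, which is why countof() stops working once an array has crossed a function boundary. The sketch below makes that measurable, and also shows the const alternative mentioned in the aside. The function names size_seen_inside() and count_negatives() are hypothetical.

```c
#include <stddef.h>

#define countof(x) (sizeof(x)/sizeof(*(x)))

/*
 * size_t n = size_seen_inside( a );
 *     Return sizeof(a) as seen from inside a function. Here a is
 *     a pointer, so this is the size of an int *, no matter how
 *     big an array the caller passed.
 */
size_t size_seen_inside( int *a )
{
    return sizeof(a);
}

/*
 * int n = count_negatives( a, nel );
 *     Return how many of the nel integers starting at a are < 0.
 *     The const documents, and lets the compiler enforce, that
 *     the caller's array will not be changed.
 */
int count_negatives( const int *a, size_t nel )
{
    int n = 0;
    for( size_t i = 0; i < nel; i++ )
    {
        if( a[i] < 0 ) n++;
    }
    return n;
}
```

In the caller, with int x[4] in scope, sizeof(x) is 4*sizeof(int) and countof(x) is 4, whereas size_seen_inside(x) only reports sizeof(int *). That difference is the whole reason for passing nel explicitly.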
Note that C makes no distinction between a pointer to one thing and a pointer to the first of many (a collection like our array). This is one reason the contract comment is so important: it documents whether wibble() takes a pointer to a single integer as in:
    int n = 100;
    wibble( &n, 1 );

or whether it really takes an array (turning it into a basal pointer) - or malloc()ed chunk which often serves as a dynamic array. The C compiler cannot tell these cases apart. Actually, as long as you claim that the "array basal pointer" &n has only 1 element, &n can safely act as an "array of 1 element" inside wibble(). So in that sense, an int is like an "array of 1 int", sort of.

- Stepping outside ANSI standard C for a moment, I recommend that you don't use nested functions even in languages that allow you to: they sound like a good idea but really aren't. One of the things I loved most about C compared to Pascal, Modula-2 and Simula-67 many years ago was that C doesn't allow you to nest functions. I learned to distrust nested functions when I briefly maintained a Hope interpreter written in Pascal. The main interpreter function was approximately 8000 lines long. 7990 of those lines comprised some shared variables followed by about 50 nested functions (that were obviously only called from the interpreter function). The body of the interpreter function, when I eventually found it - on a physical print out which I had to scan through, btw - nearly 8000 lines further down the enormous single source file, was 10 lines long. (Sadly Pascal didn't have standard separate compilation facilities at the time). No, I thought, nested functions are unnecessary and can be massively abused. C's model is that a source file is a linear sequence of functions. Simple. Minimalist. Lovely.
Note also that nested functions allow "outer scope" shared variables that "inner functions" can freely modify. This complicates compiler code generation (as well as a frame pointer pointing to the previous stack frame, you need an outer frame pointer to the textually enclosing function's newest frame. Yuck!). They also make variable lifetime problems much more complex to understand and detect. It's particularly ironic that some have proposed adding nested functions to C (in a "what should be added to C" wishlist discussion on LinkedIn recently). Clearly, I'm not going to agree with that idea:-) Some existing C compilers (including some versions of gcc) support nested functions as a non-standard C extension. My advice is: Nested functions: just say no.
Separate Compilation
C has always supported compilation of separate C source files or compilation units, and provides external declarations, header files and linking support to glue compilation units together. However, these low-level facilities are so flexible that they allow you to get really confused, encounter linking errors etc. To avoid most of these problems, a set of conventions have emerged over time, enabling you to use separate compilation in a highly standardised and structured way - specifically, to achieve modular programming in C. Before we get to that, there is an important distinction we need to make, leading to our next tip:

- Understand the difference between definitions and declarations: An amazing proportion of C programmers use the terms definition and declaration interchangeably in C. Declaration is the more general term, a definition is a special kind of declaration - we'll see the difference later. For local variables and function parameters there's no difference - they're all declarations. But for global variables and functions (what the standard calls external linkage variables and functions) declarations and definitions are different:
Ok, so what's the difference?
- A global declaration such as
    extern int x;
    extern int *hello( char *, int );

says that this thing exists, defined somewhere else, and I'd like to use it here - in the first case, the thing in question is an integer global variable called x, in the second case it's a function called hello taking a char * and an int and returning an int *. It's rather like an "import this single thing for me to use it" command. Note that in the case of function prototypes, the extern keyword is optional, as in:

    int *hello( char *, int );

Aside: note that parameter names are allowed in prototypes, and are a useful aid to documentation, but are totally ignored by the compiler. Some people prefer to write:

    extern int *hello( char *str, int n );

- By contrast, a global definition like
    int x;
    int y = 100;
    int *hello( char *str, int n ) { ... }

in the global scope of a C source file - outside a function - says I wish to define the integer global variable called x, or the function called hello (with its parameters and return type) right here and now. Storage for x is allocated right now (it has no explicit initializer, so C sets it to zero - but silently relying on that is often a bad thing), similarly storage for y is allocated - and its initial value set to 100 - and similarly the function hello has a body which is compiled and made callable.

The keyword static, when used outside a function, as in:
    static int x;
    static int y = 100;
    static int *hello( char *s, int n ) { ... }

defines a private global variable called x with no explicit initializer, a private initialized global variable called y with value 100, and a private function called hello, allocating storage as for externally visible definitions. The only difference is that static (private) variables and functions can only be accessed by code in the current compilation unit (source file) - and furthermore, due to C's one pass nature, code that is written textually below the definition.

Note that there's no way of declaring a static variable without defining it. A static function like hello can be defined as we've seen, but there is a way of declaring it as well. To do this, write a static prototype like:
    static int *hello( char *, int );

near the top of the same compilation unit. This allows you to call hello() from anywhere below the prototype, even if you place the true definition of hello() right at the end of the compilation unit.

- Most linker errors in C are caused by people not understanding the simple rule of definitions in C: there must be exactly one definition of each public function and public global variable in the collection of C source files that are compiled and linked together. There can be any number of declarations - but only one definition. Private (i.e. static) functions and global variables don't count - they are private to the C source file that contains them: they have internal linkage, so their names are never matched up between source files by the linker, and only that source file can use them. Thus, you could have two C source files which link together perfectly successfully, in which each source file defines a static function or variable called wibble.
- Consider, for example, two C files. First a.c:

      double d = 7.4;
      extern int global1;
      extern int func3( int );
      static double hello( double f, int i ) { d *= func3(i); return f + d; }
      int func1( int x ) { return global1 * hello(d,x); }
      int func2( int x ) { return 10 * x; }

  and then b.c:

      extern int func1( int );
      extern int func2( int );
      extern double d;
      int global1 = 10;
      static int hello( int n, int m ) { return func1(n) + func2(m); }
      int func3( int x ) { return 4 + hello(x,3); }
      int main( void ) { hello( 10, 20 ); }

- Here a.c defines:
  - A public global variable double d.
  - A private double function hello().
  - A public int function func1().
  - A second public int function func2().

  And declares that it wants to use the following things defined somewhere else:
  - A public global variable int global1.
  - A public int function func3().
- b.c defines:
  - A public global variable int global1.
  - A private int function hello().
  - A public int function func3().
  - A public int function main().

  And declares that it wants to use the following things defined somewhere else:
  - A public global variable double d.
  - A public int function func1().
  - A public int function func2().
- Note that calls to hello() from within a.c call the private double function, whereas calls to hello() from within b.c call the private int function. These two C files will successfully compile and link together, because everything that is used is available (either by definition or declaration), and everything that is used is defined exactly once - including main() to get things started.
But note that this approach to separate compilation is hopelessly intricate and fragile: not only do we have to make sure that everything we need to use is available, but we also have to keep all the definitions and declarations synchronized. For example, if we change the definition of func1()'s return type from int to double in a.c, then we must not forget to alter the declaration in b.c. If we forget to update anything, even once, the source files will still compile and link together with no warnings, but all calls to func1 will corrupt the return value silently. Similarly, if a single type definition is repeated in two compilation units that both use it, but the two definitions get out of sync, then a different form of data corruption will occur.
This is horrid. We need a simpler and more manageable way of using C's separate compilation and linking features. We'll come back to this later.
- Note that C also allows you to place an extern declaration for a variable or function inside a function body, as in:
    size_t wibble( char *s )
    {
        extern size_t strlen(const char *);
        return 1 + strlen(s);
    }

Or, from our previous example:

    static double hello( double f, int i )
    {
        extern double d;
        d *= func3(i);
        return f + d;
    }

K&R use this style all the time, and seem to come perilously close at one point to suggesting that it's the only style that can be used safely. If used rigorously throughout a C project, it does have the advantage of documenting every external function and variable that a particular function uses. This would be quite useful if you could ask the C compiler to complain at any use of a global variable or function used inside another function body without such a "local extern declaration" - but you can't!

Despite K&R's approval, I think these local extern declarations are messy - I recommend placing the extern declaration outside the function definition, as in:
    extern size_t strlen(const char *);
    size_t wibble( char *s )
    {
        return 1 + strlen(s);
    }

Note that the word "extern" is optional, being the default. But I think it's clearer to write it explicitly.

Of course, in this specific example, strlen() is declared in string.h so we should just include it:
    #include <string.h>
    ..
    size_t wibble( char *s )
    {
        return 1 + strlen(s);
    }

My personal preference is: Don't use these nested declarations: put them in global scope (i.e. outside functions), or in a #include file. But many programmers like the nested declarations.

Modular Programming in C
The above section really defines all the separate compilation rules that exist in C. As long as every C source file contains a definition of each type, public function and global variable that other C files should be able to use, and declarations of every public function or global variable belonging to another C file that it wants to use, it will compile and link successfully. However, we pointed out the fragility - that if any declaration is out of sync with the definition, then the program will not work correctly: parameters or return values will be corrupted in some complex way. Not only is such separate compilation code painful to write and maintain, but debugging and fixing such declaration/definition mismatch problems when they occur is horribly difficult.
Hence techniques have emerged that use C's separate compilation facilities in a standardised way to provide modular programming in C. So, our next tip is: Learn and use one of the techniques to provide modular programming in C: because the raw separate compilation rules are simply too unsafe to deal with. There are several such techniques, which vary slightly among themselves. It almost doesn't matter which specific technique or variation you use - what's important is that you pick one of them and use it.
The Common Header File
- A first step towards modular programming, as recommended by K&R for medium size projects, is to move all the extern declarations into a single header file, we might call it extern.h:
    extern int global1;
    extern double d;
    extern int func1( int );
    extern int func2( int );
    extern int func3( int );

Note that the extern keyword is optional on function declarations (prototypes) so this could just be written:

    extern int global1;
    extern double d;
    int func1( int );
    int func2( int );
    int func3( int );

Personally I prefer the consistency of the first version, using "extern" for every declaration in the header file.

- Now our two C files, a.c and b.c, include extern.h instead of writing their own extern declarations:
    /* a.c */
    #include "extern.h"

    double d = 7.4;

    static double hello( double f, int i )
    {
        d *= func3(i);
        return f + d;
    }

    int func1( int x )
    {
        return global1 * hello(d,x);
    }

    int func2( int x )
    {
        return 10 * x;
    }

    /* b.c */
    #include "extern.h"

    int global1 = 10;

    static int hello( int n, int m )
    {
        return func1(n) + func2(m);
    }

    int func3( int x )
    {
        return 4 + hello(x,3);
    }

    int main( void )
    {
        hello( 10, 20 );
    }

Everything will be ok here if each external declaration in extern.h is defined compatibly in exactly one of the two C files, as we can see here:
| Declaration              | Defined in | Definition              |
|--------------------------|------------|-------------------------|
| extern int global1;      | b.c        | int global1;            |
| extern double d;         | a.c        | double d;               |
| extern int func1( int ); | a.c        | int func1( int n ) {...} |
| extern int func2( int ); | a.c        | int func2( int n ) {...} |
| extern int func3( int ); | b.c        | int func3( int n ) {...} |
Compared to the chaotic use of extern declarations scattered through both source files, this is already much clearer and easier to maintain. Only two files have to be changed when (say) func1()'s parameter is changed from int to double: func1()'s definition lives in a.c, and its extern declaration is in extern.h. Suppose you edited the definition in a.c but forgot to edit the prototype in extern.h. Because a.c includes extern.h, when compiling a.c, the C compiler would see:

    extern int func1( int );
    ..
    int func1( double x ) { .. }

and would generate compile time error messages about func1()'s definition and declaration being incompatible. This is precisely what we want: the compiler detects our error and forces us to fix the prototype. Having fixed the prototype, every call to func1() will be checked against the correct prototype.
- Note that, if our program defined its own constants with #define, or types via enumerations, struct/union declarations and typedef, these could also be placed in the common extern.h header file. Again, this gives us a single place to store and edit these declarations.
Modules: a .c file and a .h file
- Going beyond the single header file technique, to the full-blown concept of Modular Programming, we can implement the abstract concept of a Module called M as a matched pair of files:
- A header file called M.h which defines the public interface of the module, and
- A C file called M.c which implements all the public functions mentioned in the interface, and additional private helper functions if needed.
- So, for example, a module called intlist (that might wrap up our earlier linked list of integers data type) would be implemented as an intlist.h header file that defines the public interface, and an intlist.c file that implements it.
- Some modules simply present a collection of related functions, possibly sharing some private state data. A specialised type of module, called an Abstract Data Type (ADT), defines an Abstract Type and a set of public functions that implement all the Operations that can be applied to an object of that Abstract Type. ADT modules are the closest that a non-Object Oriented language like C gets to objects and classes - an ADT is very like a class without inheritance. Our intlist example module is an ADT.
- There's a pretty good article Modular Programming in C, written by John Hayes in 2001, on the embedded.com website, which explains the concepts of Modular Programming and shows one particular variation. It explains most things well - although I strongly recommend that you ignore his suggestion of redefining the module's public function names from Fun1() to Foo_Fun1() via macros (Foo being the name of the example module). If you decide to include the module name in the function names - which is a good idea to avoid name clashes as your programs get larger - then rather than naming a function Fun1() and then renaming it via an unnecessarily clever macro, why not simply name the function Foo_Fun1() in the first place!
- Some of the basic consequences of such an approach are as follows:
- You'll structure your C code as some number of loosely-coupled modules, some of which use (depend on) other modules. It's best to reduce the number of inter-module dependencies as far as possible. Each module will probably have a unit test program - a C source file (perhaps called testM.c) which contains a main program which runs a series of tests on the module.
- As well as these pairs of files (each pair comprising a module), you may have a small number of isolated header files, such as defns.h containing some basic type definitions for the whole project that many modules need to include.
- Also, you'll have any number of main programs, including the module unit test programs we've already mentioned, additional multi-module integration test programs - and of course one main program (or more than one) that forms the actual application we're building!
- A module's .h file contains: suitable comments describing what the module as a whole does, public constants and well-commented type definitions, extern declarations of module-global variables (in rare cases where there are any) and (finally) prototypes of public functions, i.e. extern declarations of the public functions the module provides.
- A module's .c file should always include its own interface header - in exactly the same way as each C source file included extern.h - for prototype-definition compatibility checks. So intlist.c might start:

    #include <stdio.h>   /* other stdlib includes needed */
    #include "intlist.h"
- Then the implementation should proceed to define each publicly declared function from the interface, in any order, with suitable comments and testing. Additional private (static) helper functions, variables and types may also be placed in the implementation C file. These private helper functions will presumably be called by the public functions - if they aren't, they are unused and should be deleted!
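To make the interface/implementation split concrete, here is a minimal sketch of what an intlist module might look like. It is shown as one listing for brevity, with comments marking where intlist.h would end and intlist.c would begin. All the function names (intlist_create(), intlist_push() and so on) and the internal struct layout are hypothetical, chosen for illustration, and are not taken from any earlier intlist code.

```c
/* ---- intlist.h (sketch): the public interface ---- */
typedef struct intlist_s *intlist;            /* opaque ADT type */

extern intlist intlist_create( void );
extern int     intlist_push( intlist l, int value );  /* 1 = ok, 0 = out of memory */
extern int     intlist_sum( intlist l );
extern int     intlist_length( intlist l );
extern void    intlist_free( intlist l );

/* ---- intlist.c (sketch): the implementation ---- */
#include <stdlib.h>
/* in the real two-file layout, #include "intlist.h" goes here */

struct node { int value; struct node *next; };
struct intlist_s { struct node *head; int length; };

/* private helper: static, so invisible outside this file */
static struct node *new_node( int value, struct node *next )
{
    struct node *n = malloc( sizeof *n );
    if( n != NULL ) { n->value = value; n->next = next; }
    return n;
}

intlist intlist_create( void )
{
    intlist l = malloc( sizeof *l );
    if( l != NULL ) { l->head = NULL; l->length = 0; }
    return l;
}

int intlist_push( intlist l, int value )
{
    struct node *n = new_node( value, l->head );
    if( n == NULL ) return 0;
    l->head = n;
    l->length++;
    return 1;
}

int intlist_sum( intlist l )
{
    int total = 0;
    for( struct node *n = l->head; n != NULL; n = n->next )
    {
        total += n->value;
    }
    return total;
}

int intlist_length( intlist l )
{
    return l->length;
}

void intlist_free( intlist l )
{
    struct node *n = l->head;
    while( n != NULL )
    {
        struct node *next = n->next;
        free( n );
        n = next;
    }
    free( l );
}
```

Because clients only ever see the typedef of a pointer to an incomplete struct, they cannot reach inside the list; the node layout can change without any client needing to be edited.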
- Client C source files - any other module or main program that wants to use a module's types or public functions - must also include that module's interface header. Suppose a test program wants to use intlist's facilities. Just like intlist.c itself, that test program must:
    #include <stdio.h>   /* and other standard library headers */
    #include "intlist.h"

Module Interface Dependencies: to use Include Guards or not
- So far so good. But things get a bit trickier when a module's interface depends on another module. This typically happens with ADTs, when an ADT type defined by one module is used in another module's interface (e.g. embedded into a structure or typedef, or used as a parameter type in a public function). For example, suppose we already have a freqtable module to store a collection of (word,frequency_of_word) pairs - K&R present a very nice simple implementation of this ADT, btw. freqtable.h might read:
    /*
     * freqtable.h:
     *   a frequency table stores (string,int) pairs: associating an integer
     *   frequency count with each distinct string. Looking up the associated
     *   frequency count, or modifying it, is a very fast operation.
     */

    typedef struct freqtable_s *freqtable;   /* opaque data type */

    /* A freqtable callback function gets a (name,freq) pair and an extra value */
    typedef void (*freqtablecb)( char *name, int freq, void *extra );

    /* prototypes of freqtable operations, such as.. */
    extern freqtable freqtable_create( void );
    extern void freqtable_add( freqtable f, char *name, int count );
    extern int freqtable_get( freqtable f, char *name );
    extern void freqtable_foreach( freqtable f, freqtablecb cb, void *extra );
    extern void freqtable_free( freqtable f );

- Then we want to build a module that deals with arrays-of-freqtables, so freqtablearray.h might read:
    #define MAXFTAELEMENTS 1000
    typedef freqtable freqtablearray[MAXFTAELEMENTS];
    /* (prototypes of freqtablearray operations) */

Another way a similar situation can occur is when one of the public functions in the higher level module needs to take or return a freqtable - you might have a readfreqtable.h interface declaring a function that reads words from an open file and builds and returns a new freqtable:
    extern freqtable count_wordfrequencies( FILE *in );

Now, clearly wherever either of these C snippets (freqtablearray.h or readfreqtable.h) is compiled, the type freqtable must already be declared. It's going to be a gross compile time error if it's not! Furthermore, it needs to be defined correctly for the code to work once everything's linked together.
- The obvious way of giving freqtablearray.h or readfreqtable.h knowledge of the freqtable data type is to add a nested include - making freqtablearray.h read:
    #include "freqtable.h"
    #define MAXFTAELEMENTS 1000
    typedef freqtable freqtablearray[MAXFTAELEMENTS];
    /* (prototypes of freqtablearray operations) */

and similarly, making readfreqtable.h read:

    #include "freqtable.h"
    extern freqtable count_wordfrequencies( FILE *in );

However, nested includes such as these cause some very tricky problems. On the one hand, a client program wishing to use an array of freqtables can simply write:

    #include "freqtablearray.h"

    int main( void )
    {
        freqtablearray fta;
        int i;
        for( i=0; i<MAXFTAELEMENTS; i++ )
        {
            fta[i] = freqtable_create();
        }
        ...
    }

where freqtable_create() is assumed to be one of the freqtable operations, declared in freqtable.h, which is included implicitly into this client program via the explicit include of freqtablearray.h. So far so good.
- However, what if our client wants to combine using freqtablearrays and the count_wordfrequencies() function from the readfreqtable module?
    #include <assert.h>
    #include <stdio.h>
    #include "freqtablearray.h"
    #include "readfreqtable.h"

    int main( int argc, char **argv )
    {
        freqtablearray fta;
        int i;
        assert( argc < MAXFTAELEMENTS );
        for( i=1; i<argc; i++ )
        {
            /* count_wordfrequencies() takes an open file, not a filename */
            FILE *in = fopen( argv[i], "r" );
            assert( in != NULL );
            fta[i] = count_wordfrequencies( in );
            fclose( in );
        }
        ...
    }

Unfortunately, this doesn't compile. Why not? Because freqtable.h is included twice into the above program - by both freqtablearray.h and readfreqtable.h - so every typedef in it is declared twice (C standards before C11 treat a repeated typedef as an error).

To solve this nested include problem, you can use the old Include Guard trick - see https://en.wikipedia.org/wiki/Include_guard for details. Specifically, we might add an Include Guard to freqtable.h as follows:
    #ifndef FREQTABLE_INCLUDED
    #define FREQTABLE_INCLUDED

    /*
     * freqtable.h:
     *   a frequency table stores (string,int) pairs: associating an integer
     *   frequency count with each distinct string. Looking up the associated
     *   frequency count, or modifying it, is a very fast operation.
     */

    typedef struct freqtable_s *freqtable;   /* opaque data type */

    /* A freqtable callback function gets a (name,freq) pair and an extra value */
    typedef void (*freqtablecb)( char *name, int freq, void *extra );

    /* prototypes of freqtable operations, such as.. */
    extern freqtable freqtable_create( void );
    extern void freqtable_add( freqtable f, char *name, int count );
    extern int freqtable_get( freqtable f, char *name );
    extern void freqtable_foreach( freqtable f, freqtablecb cb, void *extra );
    extern void freqtable_free( freqtable f );

    #endif

Note that, when you use Include Guards, you usually have to use them on pretty much every single module's header file. In this example, you might get away with only adding an Include Guard to freqtable.h, but this is rare.
- So, Include Guards are a complete and popular solution that allows you to nest includes inside .h files which are themselves included. Many C programmers love Include Guards.
- But, I have a different view: Include Guards only sort out the problems caused by Nested Includes. So: if you don't allow Nested Includes, then you don't need to use Include Guards - this happens to be my preference! Instead you decide that every client of freqtable must explicitly write:

    #include "freqtable.h"

Similarly every client of freqtablearray must explicitly write:

    #include "freqtable.h"
    #include "freqtablearray.h"

to record the fact that freqtablearray.h cannot be compiled without freqtable.h. Similarly every client of readfreqtable must explicitly write:

    #include "freqtable.h"
    #include "readfreqtable.h"

to record the fact that readfreqtable.h cannot be compiled without freqtable.h. Now, a client of both freqtablearray and readfreqtable must explicitly write the includes in the right order:

    #include "freqtable.h"
    #include "freqtablearray.h"
    #include "readfreqtable.h"

to record the fact that neither freqtablearray.h nor readfreqtable.h can be compiled without freqtable.h. Effectively, here we are describing the header dependencies via explicit include statements. So here's another tip: Choose whether to use Nested Includes and Include Guards - or Neither. You can do it either way - it's your choice!
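For anyone weighing up the two options, the mechanism behind an Include Guard can be seen in a single file. The sketch below is not a real project layout: it simply pastes the text of a guarded header twice, which is roughly what the preprocessor produces after two nested includes of the same file. The enum constant FREQTABLE_DEFAULT_BUCKETS and the function default_buckets() are hypothetical, added only so that there is something to call.

```c
/* ---- what the compiler sees for the first #include "freqtable.h" ---- */
#ifndef FREQTABLE_INCLUDED
#define FREQTABLE_INCLUDED
typedef struct freqtable_s *freqtable;        /* opaque data type */
enum { FREQTABLE_DEFAULT_BUCKETS = 64 };      /* hypothetical constant */
#endif

/* ---- what it sees for the second, nested, include ---- */
#ifndef FREQTABLE_INCLUDED                    /* already defined: block skipped */
#define FREQTABLE_INCLUDED
typedef struct freqtable_s *freqtable;
enum { FREQTABLE_DEFAULT_BUCKETS = 64 };
#endif

/* a client function using the declarations, which now exist exactly once */
int default_buckets( void )
{
    return FREQTABLE_DEFAULT_BUCKETS;
}
```

If you delete the four guard lines from both copies, the second enum is a redeclaration of FREQTABLE_DEFAULT_BUCKETS and the file no longer compiles. Under the no-nested-includes convention the second copy would simply never appear.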
Interface Afterthoughts
- One tension I've never fully resolved in my own mind relates to those contract comments I insisted we give each function to define what it does. In a 2-part module, should these comments go in the interface or the implementation?
- The answer that feels more correct intuitively is that they are logically part of the interface, so should go into the .h file, not the .c file. But this would mean that each single line function prototype would be preceded by 10-20 lines of comment, making it harder to see all the functions in the public interface at a glance.
- Although it feels wrong, in practice I prefer putting the contract comments in the implementation part (the .c file), with each contract immediately above the definition of the function it applies to. That way, the contracts are also immediately available when you start to implement the function body, to remind you what the function is supposed to do (but not how). Then I use a tool I wrote a few years ago to automatically generate the prototypes in the header file, so they're always up to date. So the user of a function in one of my modules has to look at the implementation of that module in order to find the contract that the function promises to implement. Hmm.. I think there's no perfect solution here.
General Tips
- Avoid parts of the language you don't like - or don't understand very well yet. You don't have to be an expert on every feature of a language to be productive in it (like me in Ruby:-)), and it's better not to force yourself to use features that you dislike or don't understand. Every programmer essentially works in a personal subset of every language they know, it just depends what proportion of the whole language each programmer's subset occupies!
One personal example of this: I am confused by the const qualifier, so I choose not to use it in code I write. Does that make me a bad C programmer? No! To be honest, my confusion about const is largely historical - it didn't exist in C when I first learned it, it's not well described even in the latest K&R, and I've never bothered to learn it properly due to being lazy:-). Not even laziness makes me a bad programmer - in fact, according to Larry Wall's famous saying that's one of the 3 cardinal virtues of a programmer!
But IMHO there's also a genuine possibility for confusion - for instance, what on earth does the following actually mean, is it even valid?
    const int *const x;

and (if it is valid) how does its meaning differ from:

    const int const *x;

The important point is that both of the above examples boil down to a pointer to int if you delete all the const keywords:

    int *x;

Update: Discussions on LinkedIn have shown me that simple uses of const may have genuine safety-tip value, so I will review const and come back to this section. Maybe I've not been fair to const - we'll see.
- Another part of C I tend to avoid is bitfields. They're not useful for the obvious thing you might want to use them for (replacing all that bit twiddling when decoding a structured binary word, via union'ing the word with a bitfield struct), so I just don't see the point. But I'm not trying to convince you to avoid the parts of C that I don't like - just convince you that you have the freedom to avoid those parts of C that you don't like!
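Coming back to the const declarations above for a moment: as far as I can tell, the useful distinction is between a const on the left of the *, which protects the int being pointed at, and a const on the right of the *, which protects the pointer itself. The sketch below is only an illustration of that reading; the function names sum_readonly() and const_demo() are made up for the example. (In C99 and later a repeated qualifier behaves as if it appeared once, so const int const *x should mean the same as const int *x.)

```c
#include <stddef.h>

/* p is a pointer to const int: the function promises not to modify the array */
int sum_readonly( const int *p, size_t n )
{
    int total = 0;
    for( size_t i = 0; i < n; i++ )
    {
        total += p[i];          /* reading through p is fine */
        /* p[i] = 0;               writing through p would not compile */
    }
    return total;
}

int const_demo( void )
{
    int a = 1, b = 2;

    const int *p = &a;          /* pointer to const int */
    p = &b;                     /* ok: the pointer itself may change */
    /* *p = 5;                     error: the int is read-only via p */

    int *const q = &a;          /* const pointer to int */
    *q = 10;                    /* ok: the int may change via q */
    /* q = &b;                     error: the pointer itself is read-only */

    const int *const r = &b;    /* const pointer to const int: neither may change */

    return *p + *q + *r;        /* b + a + b, with a now 10 */
}
```

The simplest practical use is the first one: marking pointer parameters as pointing to const documents, and lets the compiler check, that a function only reads the caller's data.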
- The complexity of the const declarations above takes us neatly to a rather general tip (inspired by the brilliant Pragmatic Programmer book): build a toolbox of tools that help you to write less code and become expert in using each tool. One tool that every C programmer should have installed is called cdecl - it can answer questions such as this. You run it, and then enter lines like:
    int x;
    int *x;
    int x[10];
    typedef int *intp;
    const int *x;
    const int *const x;
    const int const *x;
    struct intlistnode *maplist( struct intlistnode *list, int (*f)(struct intlistnode *) );

It parses each line as a C declaration/definition and tells you what it means, in structured English:

    declare x as int;
    declare x as pointer to int;
    declare x as array[10] of int;
    declare intp as typedef pointer to int;
    declare x as pointer to const int;
    declare x as const pointer to const int;
    declare x as pointer to const int const;
    declare maplist as function that expects (list as pointer to struct intlistnode, f as pointer to function that expects (pointer to struct intlistnode) returning int) returning pointer to struct intlistnode;

Better still, a companion tool goes the other way: how do I write a static array of 10 integers? Run cundecl and type:

    declare x as static array 10 of int;

and it responds:

    static int x[10];

Brilliant! Nowadays, there's a web version of this tool as well: cdecl.org.
- Aside: if you can understand either the maplist function declaration or its English translation, I withdraw my objection to you writing such a complex definition, because you're obviously a genius:-) Actually, I don't, because others will read the code, and they need to understand it too - be generous to them. One reason to be generous to future readers is given by the apocryphal advice: comment your code as if it will be read by a homicidal maniac who knows where you live.
Plus, you should bear in mind that the homicid.. sorry, future reader in question may be a future version of you. After all, the most obvious person who knows where you live today is a future version of you:-)
- There are many other powerful C tools you may wish to become familiar with; many of them are discussed in my C Tools lectures. Note that I don't just mean use tools others have written: I mean occasionally you should build a tool yourself because you desperately want it! One such tool I wrote many years ago is called datadec and is the subject of my recent Bringing Recursive Data Types to C PSD article.
- It's also worth building a library of useful data structure modules (lists, queues, stacks, hashes, sets, bags, trees etc) and reusing them aggressively. OO zealots will tell you that data structure libraries, and code reuse in general, are only possible with objects and classes. That's simply not true - you can easily build reusable modules in C.
- Design software to be tested: Nigel Evans made the sensible point that "Software that can be easily tested is far safer than software that can't be easily tested". Quite often it's simply a question of design and planning (compare this with the Pragmatic Programmer tip 48: Design to test).
- One point emerging from this is that since GUI software is much harder to test than non-GUI, even when you have a GUI you should keep it very loosely coupled with the rest of the code, so at least you can test all the rest - including the non-GUI parts of the actions that the GUI invokes as callbacks.
- This applies whether or not you are using Test Driven Development - see my recent PSD article 5.
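As a small illustration of keeping the GUI loosely coupled, the sketch below separates a piece of non-GUI logic (a bounded counter) from the callback that a GUI toolkit would invoke. Everything here is hypothetical: the counter type, the function names, and the callback signature, which merely has the void * "user data" shape that many C toolkits use. No real toolkit is involved.

```c
/* ---- non-GUI logic: plain data and functions, easy to unit test ---- */
typedef struct
{
    int count;
    int limit;
} counter;

void counter_init( counter *c, int limit )
{
    c->count = 0;
    c->limit = limit;
}

/* returns 1 if the counter was incremented, 0 if the limit was already reached */
int counter_increment( counter *c )
{
    if( c->count >= c->limit ) return 0;
    c->count++;
    return 1;
}

/* ---- GUI glue: as thin as possible ---- */
typedef void (*click_callback)( void *userdata );

/* the callback a button's "clicked" event would be wired to */
void on_increment_clicked( void *userdata )
{
    counter *c = userdata;
    (void) counter_increment( c );
    /* a real GUI would now redraw a label from c->count */
}
```

A unit test can exercise counter_increment() directly, and can even call on_increment_clicked() with a pointer to a counter, without ever opening a window.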
Making it run fast: Optimization and Profiling
- We talked, right at the start, about building simple and clear code and getting it working correctly, before considering how quickly it runs, as in Kernighan and Plauger's "first make it work, then make it right, and, finally, make it fast".
- So, what do you do when you need to make your (believed correct) code run faster? Spend time analysing and optimizing it.
- Your first step is to ask the C compiler to optimize your code: Most compilers have optimization flags (gcc has the -On flags which enable more and more optimizations as n increases). So of course it's worth enabling them. But this is not a silver bullet - in my experience, you tend to get a 5-10% speed increase by using -O3 on gcc.
- What can you do beyond this? Inexperienced programmers (or those under time pressure) have a tendency to dive into the codebase and start tweaking bits, almost at random. Unsurprisingly, this does not tend to work. Usually it ends up breaking the program! As Donald Knuth said: premature optimization is the root of all evil.
- Don't guess which bits of your program are slowing it down: because you'll nearly always be wrong. Use the proper tool for the job - a code profiler - to measure the actual run-time performance of your program - and identify the hot spots (the 5% of your program that takes 95% of the run-time).
- To profile your code with gcc you recompile with the -pg flag, which generates an instrumented version of your program. Run this program with its usual input files - or interact with it if you must. You'll find that it runs slightly slower than usual (yes, this is ironic - to make it go faster, first make it go slower!) but it uses that time to gather information on the time your code spends in every function, how many times each function is called, and the call graph - which function A calls which other function B. This profiling information is written to a file called gmon.out. Then you run a separate tool called gprof to analyse the profile: gprof your_program gmon.out. This generates a "top 10 hot spots" report on screen.
- If you haven't used profiling yet, I strongly recommend that you give it a try on some of your existing code, I find it works best on code that runs for 30-60 seconds performing some reasonably complex algorithm on some reasonably sized data. I can almost guarantee that the profiling results will surprise you. Why is this? I have gradually realised that no programmer ever really understands the run-time behaviour of their own programs, let alone anyone else's. I've certainly found major surprises every time I've profiled my own code!
- Most recently, I profiled a hash table implementation that I'd used for several years - trusted production code, that for some reason I'd never profiled - and found that it was spending over 50% of its time ignoring empty hash buckets in a large array, while copying or free()ing entire hash tables. It turns out that if you write (pseudocode):

    foreach element in large array
    {
        process( element );
    }

and process(element) starts with:

    if( element == NULL ) return;

then billions of calls to process() that return immediately take significant amounts of time: over 50% of the total program run-time in my case. Making the trivial change:

    foreach element in large array
    {
        if( element != NULL )
        {
            process( element );
        }
    }

halved the total run-time of the program. There's simply no way that anyone could have known that without profiling - not even me, who wrote it!
- Once you've identified the hot spots, you selectively optimize them and then you profile again. Iterative profiling-driven optimization is very powerful, and can deliver impressive speedups. 10x is common, 100x is not uncommon.
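The empty-bucket change described above might look something like this in real C. This is a sketch under assumptions, not the original hash table code: the entry type, the summing done by process(), and the call counter (there purely so the effect can be observed) are all invented for illustration.

```c
#include <stddef.h>

typedef struct entry
{
    int value;
    struct entry *next;
} entry;

static long process_calls = 0;    /* instrumentation for this sketch only */

/* sum the values in one bucket's chain; returns at once for an empty bucket */
static long process( const entry *e )
{
    process_calls++;
    if( e == NULL ) return 0;
    long sum = 0;
    for( ; e != NULL; e = e->next )
    {
        sum += e->value;
    }
    return sum;
}

/* before: every bucket, empty or not, costs a function call */
long sum_table_before( entry *const *table, size_t nbuckets )
{
    long total = 0;
    for( size_t i = 0; i < nbuckets; i++ )
    {
        total += process( table[i] );
    }
    return total;
}

/* after: empty buckets are skipped in the loop itself */
long sum_table_after( entry *const *table, size_t nbuckets )
{
    long total = 0;
    for( size_t i = 0; i < nbuckets; i++ )
    {
        if( table[i] != NULL )
        {
            total += process( table[i] );
        }
    }
    return total;
}

long get_process_calls( void )    { return process_calls; }
void reset_process_calls( void )  { process_calls = 0; }
```

Whether this makes a measurable difference in any particular program depends on things like whether the compiler inlines process(), which is exactly why it needs to be measured with the profiler rather than guessed.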
- Nigel Evans wondered if Kernighan & Plauger's first make it work, then make it right, and, finally, make it fast idea could lead to bad design, especially under release pressure from management. I'm sure this often happens in the software industry so I can see his point, and I always admire someone who will go against Brian Kernighan, but I don't really agree with Nigel. I replied: If you must optimize as you go, one member of a team could profile the codebase once or twice a week, significant profiling-led optimizations can be delivered with tiny tweaks in a few hours work. Rick Marshall pointed out that Engineers live in a world where we design and design and design again so we should always expect, in line with Fred Brooks, to: plan to throw one away; you will, anyhow. The only question in my mind is whether one is enough to throw away!
- Finally, after you've exhausted micro-optimizations of individual hot spots, the most amazing speedups are obtained by macro-optimizations of the whole program, still driven by profiling. In the discussion with Nigel, I wrote: Use the profiler again and think about ways of speeding up the whole runtime behaviour of the program - as shown by the whole top 5 or 10 functions. Often, building an intermediate data structure, or changing a critical algorithm, can revolutionise performance. The latter point is very important, but that's a topic for a whole separate article.
- Juha Aaltonen of LinkedIn made the point: The simpler code you write, the better chances the optimizer has for optimizing your code.
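To give a flavour of the macro-optimizations mentioned above, here is a deliberately tiny sketch of "building an intermediate data structure": counting the distinct values in an array, first with a nested loop, then by sorting a copy and scanning it once. The problem and both function names are invented for illustration; in a real program the profiler would tell you whether such a function is worth the effort.

```c
#include <stdlib.h>
#include <string.h>

/* straightforward version: compare each element with all earlier ones, O(n*n) */
size_t count_distinct_quadratic( const int *v, size_t n )
{
    size_t distinct = 0;
    for( size_t i = 0; i < n; i++ )
    {
        int seen = 0;
        for( size_t j = 0; j < i; j++ )
        {
            if( v[j] == v[i] ) { seen = 1; break; }
        }
        if( !seen ) distinct++;
    }
    return distinct;
}

/* comparison function for qsort(), written to avoid integer overflow */
static int cmp_int( const void *a, const void *b )
{
    int x = *(const int *) a;
    int y = *(const int *) b;
    return (x > y) - (x < y);
}

/* intermediate data structure: a sorted copy, then one linear scan, O(n log n) */
size_t count_distinct_sorted( const int *v, size_t n )
{
    if( n == 0 ) return 0;
    int *copy = malloc( n * sizeof *copy );
    if( copy == NULL )
    {
        return count_distinct_quadratic( v, n );   /* fall back if out of memory */
    }
    memcpy( copy, v, n * sizeof *copy );
    qsort( copy, n, sizeof *copy, cmp_int );
    size_t distinct = 1;
    for( size_t i = 1; i < n; i++ )
    {
        if( copy[i] != copy[i-1] ) distinct++;
    }
    free( copy );
    return distinct;
}
```

Both functions give the same answers, which is what lets you swap one for the other safely and then profile again to see whether the change was worth making.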
Final Words
Ok, that's the end (for now) of my personal list of safety tips for C programming. Given the shared nature of the discussion on LinkedIn, it seems appropriate that I should not have the last word. Chris Ryan summed up quite a lot of my thoughts on the role of simplicity and clarity in building software:

Simple engineering is just easier to understand.
You are less likely to make mistakes if your code and your algorithms are simple.
Simpler code is also normally much leaner and faster.
Simple code is normally faster to write, debug and maintain.
Simple code is also easier for the compiler to optimize.

Thanks very much Chris, I couldn't have said it better myself! (Ok, so I did have the last word:-))
d.white@imperial.ac.uk Written: August-September 2015