DistribSort: A Multi-Level Array Sorting Method in 4D Code

  by Naranker Dulay (nd@doc.ic.ac.uk),  October 1999
  
Acknowledgements

  DistribSort uses code & documentation from the earlier ShellSort.
  
  Coding ideas are also taken from the sorts of Michael Weisenberger and Tony
  Ringsmuth.
  

Introduction

  As seasoned 4D programmers will know, 4D does not provide a multi-level SORT
  ARRAY command.  One solution would be to use a plug-in. Another is to
  implement a multi-level sorting algorithm directly in 4D code.  This note
  documents one such method called DistribSort.  We hope that some of you find
  it useful.


Uses

  Here are a few suggestions on when you might wish to consider using
  DistribSort:
  
  0) When you don't wish to purchase or deploy a plug-in for multi-level array
     sorts (or can't find one to buy!)
     
  1) When you require sorts (single or multi-level) on built arrays.
  
  2) When DistribSort happens to outperform other sorting methods.
  
  3) When you wish to modify or extend (or debug) the DistribSort source code.
    
  DistribSort is reasonably quick. Performance however, depends on many
  factors, such as the sorting algorithm, the processor speed, the sizes of
  processor caches, the types and sizes of the arrays being sorted, the
  pre-sortedness of the data, the number of distinct values in the data, the
  quality of 4D compiled code.
 

The Interface(s)

  There are two interfaces for DistribSort:
  
  Call Interface 1 (modelled on da_sort from Foresight's DynArray plug-in)
  ----------------
  
    DistribSort (sort directions; arraypointer1 {;...; arraypointerN})

    The parameters for this call are:
      
    $1 (sort directions)
       a text string consisting of '>' (ascending) and '<' (descending)
       characters.  The number of sort levels (i.e. sort keys), K, is given by
       length($1).
       
       At least one sort direction character must be specified.
          
    $2; ...; $N
       pointers to the arrays to be sorted.  Arrays can be of type INTEGER,
       LONGINT, REAL, STRING, TEXT, DATE or BOOLEAN.  PICTURE and POINTER
       arrays can be passed as non-key arrays.
     
       At least one array pointer must be passed.
     
  Call Interface 2
  ----------------
  
    DistribSort (sort directions + "A"; builtarraypointer)

    The parameters for this call are:
      
    $1 (sort directions) 
       a text string consisting of '>' (ascending) and '<' (descending)
       characters followed by the letter A (uppercase or lowercase). The
       number of sort levels (sort keys), K, is given by length($1)-1 for this
       call interface.
     
       At least one sort direction character must be specified.
          
    $2 (build array pointer)
       a pointer to an array of pointers to the arrays to be sorted. The
       arrays to be sorted can be of type INTEGER, LONGINT, REAL, STRING,
       TEXT, DATE or BOOLEAN.  PICTURE and POINTER arrays can be passed as
       non-key arrays.
     
       The built array must have at least one element, i.e. there must be at
       least one array to sort.
     
  Description
  -----------
    
    The first K arrays act as key arrays. Any additional arrays are sorted
    relative to the final sort order of the K key arrays.  The ordering of
    each key array is defined by the corresponding sort direction character in
    $1.
    
  Example 1:
  ----------
    
    DistribSort ("<>>"; ->Age; ->Surname; ->Firstname; ->Score; ->Class)
       
    This sorts the specified arrays by Age (descending), then by Surname
    (ascending), and then by Firstname (ascending). Score and Class are sorted
    relative to the first three arrays.
  
  Example 2:
  ----------
  
    DistribSort ("<>>A"; ->BuiltArray)
    
    This sorts the specified arrays by BuiltArray {1}-> (descending), then by
    BuiltArray {2}-> (ascending), then by BuiltArray {3}-> (ascending).
    Additional arrays (i.e. BuiltArray {4}-> onwards) are sorted relative to
    the first three arrays
    

The Caveats

  0) To sort arrays of size N, DistribSort uses 1 longint array of size N and
     1 "position" array of size N and type longint or string or text. This
     position array holds elements that have log10(N)*(number of sort levels)
     digits encoded either into a longint or string or text value.
     
     For example to sort 3 or more 40,000 element arrays on 3 levels requires
     one 40,000 element longint array (=160,000 bytes) plus one 40,000 element
     string array (=40,000x5*3=600,000 bytes).
     
     The array type used for the position array is determined by keylength: a
     longint array is used for keylengths <= 9, a string12 array is used for
     keylengths <= 12, a string16 array is used for keylengths <= 16, a
     string20 array is used for keylengths <= 20, a text array is used for
     keylengths > 20. In addition, for speed, a blob is used for temporarily
     saving and restoring key arrays.

  1) Do not sort on PICTURE or POINTER arrays.  Elements of such arrays cannot
     be ordered in 4D.  PICTURE and POINTER arrays can be passed as non-key
     arrays however.
          
  2) Do not pass arrays of different sizes.
  
  3) The number of arrays that can be passed to DistribSort is limited only by
     the number of parameters that 4D will allow a user-defined method to
     have. This limit is not defined in 4D's reference manuals (we suspect the
     limit is 24 for interpreted 4D and 64 for compiled 4D). If you pass a
     large number of arrays consider changing the stack size of the calling
     process.  When more than 10 arrays are passed, DistribSort uses 4D's
     EXECUTE command to build a SORT ARRAY command of all passed arrays and
     the key array that DistribSort builds. Again the number of parameters
     that can be passed to SORT ARRAY is not defined, nor are its stack
     requirements known.
   
  4) Do not pass the same array via two different parameters for the same call
     or via two elements of a built array sort i.e. pointers to passed arrays
     must be distinct.
   
  5) Sorts on string & text arrays with embedded '@' wildcard characters may
     cause unexpected results. DistribSort uses 4D's SORT ARRAY command, which
     may treat the '@' character as a wildcard character.
        
  6) DistribSort does not sort the zero'th element of arrays.

  7) Do not pass pointers to local arrays.
        
  8) Do not pass pointers to 2-dimensional arrays. Direct sorting on
     2-dimensional arrays is not supported by 4D's SORT ARRAY command. You can
     however pass pointers to the 1st dimension of a 2-dimensional array.
     
  9) Do not rely on a sensible result from DistribSort if a sort is performed
     on a REAL array that holds NaN (Not-A-Number) or INFINITY values.
       
 10) DistribSort declares a number of process arrays and variables:
     DS_BUILTARRAY, DS_KEY12, DS_KEY16, DS_KEY16, DS_KEYTEXT, DS_KEYPTR,
     DS_ARRAYPTR.  Do not use these identifiers in your own code.
     
 11) Finally, when developing and testing using interpreted 4D, keep the
     arrays to be sorted as small as possible.
     

The Code
 
  The sorting algorithm we have chosen to implement is Distribution Sort (aka
  Radix Sort).  Distribution Sort is potentially a very fast linear time
  algorithm.  In our implementation, we use 4D's SORT ARRAY command on each
  sort level to build a concatenated position key.  This key is then used to
  sort all arrays into the correct order.  Thus sorting is actually performed
  by multiple calls to the SORT ARRAY command (which is thought to use the
  Quicksort algorithm internally).
  
  The method for DistribSort is long.  We apologise in advance to readers who
  dislike lengthy methods - much of the code is devoted to checking the
  parameters passed to DistribSort.
  
  
Installation

  Copy and paste the DistribSort method from the distribsort database into
  your own database (you will need to create a new (empty) datafile when the
  distribsort structure file is first opened).  If you use compiler methods
  copy and paste the method Compiler_DistribSort as well. The distribsort
  database also includes a small test method Test_DistribSort which can be
  executed from the User Environment to gauge the performance of DistribSort.


Some Exercises for ACI

  0) Extend the SORT ARRAY command to support sorts on multiple-levels, as
     well as "built" array sorts.
      
  1) Speed up multi-level sorting of record selections (it's currently too
     slow).  Provide a "built" option for ORDER BY.
            
  2) Implement pointers to local variables and local arrays.  Of course such
     pointers could be dangerous when left in the hands of inexperienced
     programmers, but 4D needs to cater for experienced programmers as well.
     Ensure also, that such array pointers when dereferenced, can be passed to
     plug-in methods.
     
  3) Add to 4D Compiler the capability to produce compiled code libraries that
     can hold generic methods like DistribSort.  Allow such libraries to be
     used in interpreted databases.


Finally
     
  Happy sorting!


Revision History

  July 1996: ShellSort written (plus experimental versions of distribsort,
             quicksort and heapsort).
             
  Sept 1996: Minor cleanup of code and documentation. Made available on ftp.
  
  Dec  1996: Added support for built array sorts. Code to move elements to
             final positions rewritten.  Diminishing gaps changed to
             (5n-1)/11.

  Oct  1998: ShellSort mentioned on NUG-list.  ShellSort made available on ftp
             in v3 and v6, Macintosh and Windows versions. This note edited to
             use more v6 terminology.

  Oct  1999: Sorting algorithm changed to original Distribution Sort.
             Enhancements from the sorts of Michael Weisenberger and Tony
             Ringsmuth incorporated.
             
             Distribsort is very similar to Tony's Radixsort.  The idea to use
             a blob to save and restore key arrays is taken from Tony's code,
             as well the use of Command Name (229) to get the name of the SORT
             ARRAY command for use in the EXECUTE command.
             
             Michael's Radixsort probably pre-dates both this and Tony's sort.
             It has a similar-ish approach to Distribution Sort but different
             memory requirements. The idea of using a longint key array to
             encode small keys is taken from Michael's sort.


