DistribSort: A multi-level array sorting method for 4D

by N. Dulay, nd@doc.ic.ac.uk, October 1999.

Download: Windows, Mac (4D v6.0 versions)


Acknowledgements

Introduction

As seasoned 4D programmers will know, 4D does not provide a multi-level SORT ARRAY command. One solution would be to use a plug-in. Another is to implement a multi-level sorting algorithm directly in 4D code. This note documents one such method called DistribSort, and we hope that some of you find it useful.

Uses

Here are a few suggestions on when you might wish to consider using DistribSort:

DistribSort is reasonably quick. Performance however, depends on many factors, such as the sorting algorithm, the processor speed, the sizes of processor caches, the types and sizes of the arrays being sorted, the pre-sortedness of the data, the number of distinct values in the data, the quality of 4D compiled code.


The Interfaces

There are two ways of calling DistribSort. The first is modelled on Foresight's defunct dynarray plugin. The second supports sorts on built arrays.

Call Interface 1

	DistribSort (sort directions; arraypointer1 {;...; arraypointerN})
The parameters for this call are:

Call Interface 2

	DistribSort (sort directions + "A"; builtarraypointer)
The parameters for this call are:

Description

The first K arrays act as key arrays. Any additional arrays are sorted relative to the final sort order of the K key arrays. The ordering of each key array is defined by the corresponding sort direction character in $1.

Example 1:

	DistribSort ("<>>"; ->Age; ->Surname; ->Firstname; ->Score; ->Class)

This sorts the specified arrays by Age (descending), then by Surname (ascending), and then by Firstname (ascending). Score and Class are sorted relative to the first three arrays.

Example 2:

    DistribSort ("<>>A"; ->BuiltArray)
This sorts the specified arrays by BuiltArray {1}-> (descending), then by BuiltArray {2}-> (ascending), then by BuiltArray {3}-> (ascending). Additional arrays (i.e. BuiltArray {4}-> onwards) are sorted relative to the first three arrays

The Caveats

  1. To sort arrays of size N, DistribSort uses 1 longint array of size N and 1 "position" array of size N and type longint or string or text. This position array holds elements that have log10(N)*(number of sort levels) digits encoded either into a longint or string or text value.

    For example to sort 3 or more 40,000 element arrays on 3 levels requires one 40,000 element longint array (=160,000 bytes) plus one 40,000 element string array (=40,000x5*3=600,000 bytes).

    The array type used for the position array is determined by keylength: a longint array is used for keylengths <= 9, a string12 array is used for keylengths <= 12, a string16 array is used for keylengths <= 16, a string20 array is used for keylengths <= 20, a text array is used for keylengths > 20. In addition, for speed, a blob is used for temporarily saving and restoring key arrays.

  2. Do not sort on PICTURE or POINTER arrays. Elements of such arrays cannot be ordered in 4D. PICTURE and POINTER arrays can be passed as non-key arrays however.

  3. Do not pass arrays of different sizes.

  4. The number of arrays that can be passed to DistribSort is limited only by the number of parameters that 4D will allow a user-defined method to have. This limit is not defined in 4D's reference manuals (we suspect the limit is 24 for interpreted 4D and 64 for compiled 4D). If you pass a large number of arrays consider changing the stack size of the calling process. When more than 10 arrays are passed, DistribSort uses 4D's EXECUTE command to build a SORT ARRAY command of all passed arrays and the key array that DistribSort builds. Again the number of parameters that can be passed to SORT ARRAY is not defined, nor are its stack requirements known.

  5. Do not pass the same array via two different parameters for the same call or via two elements of a built array sort i.e. pointers to passed arrays must be distinct.

  6. Sorts on string & text arrays with embedded '@' wildcard characters may cause unexpected results. DistribSort uses 4D's SORT ARRAY command, which may treat the '@' character as a wildcard character.

  7. DistribSort does not sort the zero'th element of arrays.

  8. Do not pass pointers to local arrays.

  9. Do not pass pointers to 2-dimensional arrays. Direct sorting on 2-dimensional arrays is not supported by 4D's SORT ARRAY command. You can however pass pointers to the 1st dimension of a 2-dimensional array.

  10. Do not rely on a sensible result from DistribSort if a sort is performed on a REAL array that holds NaN (Not-A-Number) or INFINITY values.

  11. DistribSort declares a number of process arrays and variables: DS_BUILTARRAY, DS_KEY12, DS_KEY16, DS_KEY16, DS_KEYTEXT, DS_KEYPTR, DS_ARRAYPTR. Do not use these identifiers in your own code.

  12. Finally, when developing and testing using interpreted 4D, keep the arrays to be sorted as small as possible.

The Code

The sorting algorithm we have chosen to implement is Distribution Sort (aka Radix Sort). Distribution Sort is potentially a very fast linear time algorithm. In our implementation, we use 4D's SORT ARRAY command on each sort level to build a concatenated position key. This key is then used to sort all arrays into the correct order. Thus sorting is actually performed by multiple calls to the SORT ARRAY command (which is thought to use the Quicksort algorithm internally).

The method for DistribSort is long. We apologise in advance to readers who dislike lengthy methods - much of the code is devoted to checking the parameters passed to DistribSort.

Installation

Download: Windows, Mac (4D v6.0 versions)

Copy and paste the DistribSort method from the DistribSort database into your own database (you will need to create a new (empty) datafile when the DistribSort structure file is first opened). If you use compiler methods copy and paste the method Compiler_DistribSort as well. The DistribSort database also includes a small test method Test_DistribSort which can be executed from the User Environment to gauge the performance of DistribSort.


Some Exercises for ACI

  1. Extend the SORT ARRAY command to support sorts on multiple-levels, as well as "built" array sorts.

  2. Speed up multi-level sorting of record selections (it's currently too slow). Provide a "built" option for ORDER BY.

  3. Implement pointers to local variables and local arrays. Of course such pointers could be dangerous when left in the hands of inexperienced programmers, but 4D needs to cater for experienced programmers as well. Ensure also, that such array pointers when dereferenced, can be passed to plug-in methods.

  4. Add to 4D Compiler the capability to produce compiled code libraries that can hold generic methods like DistribSort. Allow such libraries to be used in interpreted databases.

Finally

Happy sorting!


Revision History

Oct 1999:
Sorting algorithm changed to original distribution sort and renamed DistribSort. Enhancements from the sorts of Michael Weisenberger and Tony Ringsmuth incorporated.

DistribSort is very similar to Tony's Radixsort. The idea to use a blob to save and restore key arrays is taken from Tony's code, as well as the use of Command Name (229) to get the name of the SORT ARRAY command for use in the EXECUTE command.

Michael's Radixsort probably pre-dates both this and Tony's sort. It has a similar-ish approach to DistribSort but different memory requirements. The idea of using a longint key array to encode small keys is taken from Michael's sort.

Oct 1998:
ShellSort mentioned on NUG-list. ShellSort made available on ftp in v3 and v6, Macintosh and Windows versions. This note edited to use more v6 terminology.

Dec 1996:
Added support for built array sorts. Code to move elements to final positions rewritten. Diminishing gaps changed to (5n-1)/11.

Sept 1996:
Minor cleanup of code and documentation. Made available on ftp.

July 1996:
ShellSort written (plus experimental versions of distribution sort, quicksort and heapsort).