<!--
 vim: tabstop=4 expandtab shiftwidth=4 softtabstop=4 filetype=mkd spell
-->

# Introduction

Boost.Locale is a library that provides high-quality localization facilities in a C++ way.
It uses `std::locale` and `std::locale` facets to provide localization that is transparent to the user and
C++ aware.

C++ has quite a good base for localization in its existing locale facets: `std::num_put`, `std::ctype`, `std::collate`, etc. But
they are very limited and sometimes buggy by design. Localization support varies between
operating systems, compilers and standard libraries, and is frequently incompatible between them.

On the other hand, there is ICU: a great, well debugged, high-quality, widely used library that gives all of the goodies. But
it has a very old API that mimics Java behavior, it completely ignores the STL, and it provides a useful API only
for UTF-16 encoded text, ignoring other popular Unicode encodings like UTF-8 and UTF-32 and limited but still popular
national character sets like Latin1.

Boost.Locale provides the natural glue between the C++ locales framework, iostreams and the powerful ICU library in the following areas:

- Correct case conversion, case folding and normalization.
- Collation, including support of 4 Unicode collation levels.
- Date, time, timezone and calendar manipulations, formatting and parsing, including transparent support of calendars other than Gregorian.
- Boundary analysis for characters, words, sentences and line-breaks.
- Number formatting, spelling and parsing.
- Monetary formatting and parsing.
- Powerful message formatting, including support of plural forms, using GNU catalogs.
- Character set conversion.
- Transparent support of 8-bit character sets like Latin1.
- Support of `char`, `wchar_t` and C++0x `char16_t`, `char32_t` strings and streams.


# Tutorial

## Introduction to C++ Standard Library localization support

The C++ standard library provides a simple and powerful way to provide locale-specific information. It is done
via the `std::locale` class, a container that holds all the required information about a specific culture, such as number formatting
patterns, date and time formatting, currency, case conversion, etc.

All this information is provided by facets: special classes derived from the `std::locale::facet` base class. Such facets are
packed into the `std::locale` class and allow you to provide arbitrary information about the locale. The `std::locale` class keeps
reference counters on installed facets and can be efficiently copied.

Each facet installed into a `std::locale` object can be fetched using the `std::use_facet` function. For example, the
`std::ctype<Char>` facet provides rules for case conversion, so you can convert a character to upper case as follows:

    std::ctype<char> const &ctype_facet = std::use_facet<std::ctype<char> >(some_locale);
    char upper_a = ctype_facet.toupper('a');
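
As a self-contained illustration of the `std::use_facet` call above, the following sketch uses only the standard library and the classic "C" locale, so it does not depend on which named locales are installed. The helper name `upper_with_ctype` is hypothetical and exists only for this example.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: upper-case every char of `text` using the
// std::ctype<char> facet installed in `loc`.
std::string upper_with_ctype(std::string text, std::locale const &loc)
{
    std::ctype<char> const &ctype_facet = std::use_facet<std::ctype<char> >(loc);
    if (!text.empty()) {
        // The range overload converts [low, high) in place.
        ctype_facet.toupper(&text[0], &text[0] + text.size());
    }
    return text;
}
```

Calling `upper_with_ctype("hello", std::locale::classic())` yields `"HELLO"`. The conversion is still performed character by character, which is exactly the limitation of `std::ctype`.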

A locale can be imbued into an `iostream` so that it formats information according to the locale's needs:

    cout.imbue(std::locale("en_US.UTF-8"));
    cout << 1345.45 << endl;
    cout.imbue(std::locale("ru_RU.UTF-8"));
    cout << 1345.45 << endl;

Would display something like this (the exact output depends on the platform's locale data):

    1,345.45
    1.345,45

You can also create your own facets and install them to existing locale class. For example:

    class measure : public std::locale::facet {
    public:
        static std::locale::id id; // required: identifies the facet type within std::locale
        typedef enum { inches, ... } measure_type;
        measure(measure_type m,size_t refs=0);
        double from_metric(double value) const;
        std::string name() const;
        ...
    };

And now you can simply provide such information to locale:

    std::locale::global(std::locale(std::locale("en_US.UTF-8"),new measure(measure::inches)));
    /// Create default locale built from en_US locale and add the measure facet.


Now you can print a distance according to the correct locale:

    void print_distance(std::ostream &out,double value)
    {
        measure const &m = std::use_facet<measure>(out.getloc());
        // Fetch locale information from stream
        out << m.from_metric(value) << " " << m.name();
    }
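
The fragments above can be assembled into a complete program-sized sketch. The details the text elides are assumptions made here for illustration: values are taken to be centimetres, a second `metric` enumerator is added, and `format_distance` is a hypothetical wrapper that installs the facet into a string stream. The static `id` member is what the standard library requires of every facet type.

```cpp
#include <cstddef>
#include <locale>
#include <ostream>
#include <sstream>
#include <string>

class measure : public std::locale::facet {
public:
    static std::locale::id id;                 // identifies this facet type
    enum measure_type { metric, inches };      // `metric` is an assumption of this sketch
    explicit measure(measure_type m, std::size_t refs = 0)
        : std::locale::facet(refs), type_(m) {}
    // Assumption: input values are centimetres.
    double from_metric(double cm) const { return type_ == inches ? cm / 2.54 : cm; }
    std::string name() const { return type_ == inches ? "in" : "cm"; }
private:
    measure_type type_;
};

std::locale::id measure::id;

void print_distance(std::ostream &out, double cm)
{
    // Fetch the facet from the locale imbued in the stream.
    measure const &m = std::use_facet<measure>(out.getloc());
    out << m.from_metric(cm) << " " << m.name();
}

// Hypothetical wrapper: format a distance with a locale that carries the facet.
std::string format_distance(double cm, measure::measure_type type)
{
    std::ostringstream out;
    // The locale takes ownership of the facet because refs == 0.
    out.imbue(std::locale(std::locale::classic(), new measure(type)));
    print_distance(out, cm);
    return out.str();
}
```

With these assumptions, `format_distance(25.4, measure::inches)` produces `10 in` and `format_distance(5.0, measure::metric)` produces `5 cm`.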

This technique was adopted by the Boost.Locale library in order to provide powerful and correct localization. However, instead of using
the very limited standard library C++ facets, it creates its own facets that use ICU under the hood to make them much more powerful.

## Locale Generation

Each locale is defined by a specific locale identifier that contains a mandatory part, Language, and optional parts: Country, Variant, keywords
and the character encoding of `std::string`.

The class `boost::locale::generator` provides a tool to generate the locales we need. The simplest way to use the generator is to create a locale
and set it as the global one:

    #include <boost/locale.hpp>
    
    using namespace boost::locale;
    int main()
    {
        generator gen;
        // Create locale generator 
        std::locale::global(gen(""));
        // Set system default global locale
    }

Of course we can specify locale manually, using default system encoding:

    std::locale loc = gen("en_US"); 
    // Use English, United States locale

Or we can specify the locale and the encoding independently, or use a POSIX locale specifier that includes both the locale 
information and the encoding:

    std::locale loc = gen("ja_JP","UTF-8"); 
    // Separation of locale and encoding
    std::locale loc = gen("ja_JP.UTF-8");
    // POSIX locale name with encoding

When you generate more than one locale, you may specify the default encoding used
for `std::string` by calling the `octet_encoding` member function of `generator`. For example:

    generator gen;
    gen.octet_encoding("UTF-8");

    std::locale en=gen("en_US");
    std::locale ja=gen("ja_JP");

**Note:** Even if your application uses wide strings everywhere, it is recommended to specify the
8-bit encoding that would be used for all wide stream IO operations like `wcout` or `wfstream`.

**Tip:** Prefer using UTF-8 Unicode encoding over 8-bit encodings like ISO-8859-X ones.

By default, locales are generated for all supported categories and character types. However, if your
application uses strictly 8-bit encodings, uses only wide character encodings, or uses
only specific parts of the localization tools, you can limit facet generation to specific categories
and character types by calling the `categories` and `characters` member functions of the `generator` class.

For example:

    generator gen;
    gen.characters(wchar_t_facet);
    gen.categories(collation_facet | formatting_facet);
    std::locale::global(gen("de_DE.UTF-8"));


## Collation 

Boost.Locale provides a `collator` class, derived from `std::collate`, that extends it with support of comparison levels:
primary (the default one), secondary, tertiary, quaternary and identical. They can be approximately defined as:

1. Primary -- ignore accents and character case, comparing base letters only. For example, "facade" and "Façade" are the same.
2. Secondary -- ignore character case but consider accents: "facade" and "façade" are different, but "Façade" and "façade" are the same.
3. Tertiary -- consider both case and accents: "Façade" and "façade" are different. Punctuation is ignored.
4. Quaternary -- consider case, accents and punctuation. The words must be identical in terms of Unicode representation.
5. Identical -- as quaternary, but compare code points as well.

There are two ways of using the `collator` facet: directly, by calling its member functions `compare`, `transform` and `hash`, or indirectly, by
using the `comparator` template class in STL algorithms.

For example:

    wstring a=L"Façade", b=L"facade";
    bool eq = 0 == use_facet<collator<wchar_t> >(loc).compare(collator_base::secondary,a,b);
    wcout << a <<L" and "<<b<<L" are " << (eq ? L"identical" : L"different")<<endl;

`std::locale` is designed to be useful as a comparison class in STL collections and algorithms.
To get similar functionality with the addition of comparison levels, use the `comparator` class.

    std::map<std::string,std::string,comparator<char,collator_base::secondary> > strings;
    // Now strings uses default system locale for string comparison
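
For reference, the plain `std::locale` comparison mentioned above (without comparison levels) can be sketched with the standard library alone. `std::locale::operator()` compares two strings through the locale's `std::collate` facet, so a locale object can be passed directly as a comparator. The helper name `sorted_with` is hypothetical.

```cpp
#include <algorithm>
#include <locale>
#include <string>
#include <vector>

// Hypothetical helper: return `words` sorted by the collation rules of `loc`.
std::vector<std::string> sorted_with(std::locale const &loc, std::vector<std::string> words)
{
    // std::locale is itself a valid comparison object for strings.
    std::sort(words.begin(), words.end(), loc);
    return words;
}
```

With `std::locale::classic()` the order is plain code-unit order, so "Zebra" sorts before "apple". A culture-aware locale would typically order them the other way round.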

You can also set a specific locale or level when creating and using the `comparator` class:

    comparator<char> comp(some_locale,some_level);
    std::map<std::string,std::string,comparator<char> > strings(comp);

## Conversions

There is a set of functions that perform basic string conversion operations: upper, lower and title case conversions, case folding
and Unicode normalization. The functions are called `to_upper`, `to_lower`, `to_title`, `fold_case` and `normalize`.

You may notice that there are existing functions `to_upper` and `to_lower` in the Boost.StringAlgo library. What is the difference?
The difference is that these functions operate over the entire string instead of performing incorrect character-by-character conversions.

For example:

    std::wstring gruben = L"grüßen";
    std::wcout << boost::algorithm::to_upper_copy(gruben) << " " << boost::locale::to_upper(gruben) << std::endl;

Would give the output:

> GRÜßEN GRÜSSEN

where the letter "ß" was not converted correctly to double-S in the first case because of a limitation of the `std::ctype` facet.

**Notes:**

-   `normalize` operates only on Unicode encoded strings, i.e. UTF-8, UTF-16 and UTF-32, according to the character width. So be
    careful when using non-UTF encodings in the program: they may be treated incorrectly.
-   `fold_case` is generally a locale-independent operation; however, it receives a locale as a parameter in order to determine
    the 8-bit encoding.
-   All functions can work with an STL string, a NUL-terminated string, or a range defined by two pointers. They always
    return a newly created STL string.
-   The length of the string may change; see the example above.

## Numbers, Time and Currency formatting and parsing

All formatting and parsing is performed via the standard `iostream` library. Each of the above information types is represented as a number.
The formatting information is set using iostream manipulators. All manipulators are placed in the `boost::locale::as` namespace.

For example:

    cout << as::currency << 123.45 << endl;
    // display 123.45 in local currency representation.
    cin >> as::currency >> x ;
    // Parse currency representation and store it in x

There is a special manipulator, `as::posix`, that unsets locale-specific settings and returns to ordinary, default `iostream` formatting
and parsing methods. Please note that such formats may still be localized by the default `std::num_put` and `std::num_get` facets.

### Numbers and number manipulators

These are manipulators for number formatting:

-   `as::number` -- format the number according to local specifications. It takes into account various `std::ios_base` flags like scientific
    format and precision.

-   `as::percent` -- format number as "percent" format. For example:

        cout << as::percent << 0.25 <<endl;
    
    Would create an output that may look like this:

    > 25%

-   `as::spellout` -- spell the number. For example, under an English locale 103 may be displayed as "one hundred three". _Note:_ not all locales
    provide rules for spelling numbers; in such cases the number is displayed in decimal format.

-   `as::ordinal` -- display the order of an element. For example, "2" would be displayed as "2nd" under an English locale. As in the above case, not all locales
    provide ordinal rules.

### Currency formatting

These are manipulators for currency formatting:

-   `as::currency` -- set format to currency mode.
-   `as::currency_iso` -- change currency format to international like "USD" instead of "$". This flag is supported when using ICU 4.2 and above.
-   `as::currency_national` -- change currency format to national like "$".
-   `as::currency_default` -- return to default currency format (national)

Note that the `as::currency_XYZ` manipulators do not affect general formatting, only the format of the currency. It is necessary to use both manipulators
in order to use a non-default format.

### Date and Time formatting

Dates and times are represented as POSIX time. When date-time formatting is turned on in the `iostream`, each number is treated as
POSIX time. The number may be integer, or double.

There are four major manipulators of Date and Time formatting:

-   `as::date` -- display date only
-   `as::time` -- display time only
-   `as::datetime` -- display both date and time
-   `as::ftime` -- parametrized manipulator that allows specification of time in the format used by the `strftime` function. _Note:_ not all formatting
    flags of `strftime` are supported.

For example:

    double now=time(0);
    cout << "Today is "<< as::date << now << " and tomorrow is " << now+24*3600 << endl;
    cout << "Current time is "<< as::time << now << endl;
    cout << "The current weekday is "<< as::ftime("%A") << now << endl;

More fine-grained control of date-time formatting is also available:

-   `as::time_default`, `as::time_short`, `as::time_medium`, `as::time_long`, `as::time_full` -- change time formatting.
-   `as::date_default`, `as::date_short`, `as::date_medium`, `as::date_long`, `as::date_full` -- change date formatting.

These manipulators, when used together with `as::date`, `as::time`, `as::datetime` manipulators change the date-time representation.
The default format is medium.


By default, the date and time are shown in the local time zone. This behavior may be changed using the following manipulators:

-   `as::gmt` -- display date and time in GMT.
-   `as::local_time` -- display in local time format (default).
-   `as::time_zone` -- parametrized manipulator that sets the time zone ID for date-time formatting and parsing. It receives as a parameter either a string
    that represents the time zone ID or a `boost::locale::time_zone` object.

For example:

    double now=time(0);
    cout << as::datetime << as::local_time << "Local time is: "<< now << endl;
    cout << as::gmt << "GMT Time is: "<< now <<endl;
    cout << as::time_zone("EST") << "Eastern Standard Time is: "<< now <<endl;


The list of all available time zone IDs can be obtained as a `set<string>` using the `all_zones` static member function of the `boost::locale::time_zone` class.

The following `strftime` flags are supported:

-   `%a` -- Abbreviated  weekday (Sun.)
-   `%A` -- Full weekday (Sunday)
-   `%b` -- Abbreviated month (Jan.)
-   `%B` -- Full month (January)
-   `%c` -- Locale date-time format. **Note:** prefer using `as::datetime`
-   `%d` -- Day of Month [01,31]
-   `%e` -- Day of Month [1,31]
-   `%h` -- Same as `%b`
-   `%H` -- 24 clock hour [00,23]
-   `%I` -- 12 clock hour [01,12]
-   `%j` -- Day of year [1,366]
-   `%m` -- Month [01,12]
-   `%M` -- Minute [00,59]
-   `%n` -- New Line
-   `%p` -- AM/PM in locale representation
-   `%r` -- Time with AM/PM, same as `%I:%M:%S %p`
-   `%R` -- Same as `%H:%M`
-   `%S` -- Second [00,61]
-   `%t` -- Tab character
-   `%T` -- Same as `%H:%M:%S`
-   `%x` -- Local date representation. **Note:** prefer using `as::date`
-   `%X` -- Local time representation. **Note:** prefer using `as::time`
-   `%y` -- Year [00,99]
-   `%Y` -- 4 digits year. (2009)
-   `%Z` -- Time Zone
-   `%%` -- Percent symbol

Unsupported `strftime` flags are: `%C`, `%u`, `%U`, `%V`, `%w`, `%W`. Also `O` and `E` modifiers are not supported.
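
To get a feel for what these flags mean, the sketch below feeds a few of them to the C library's own `std::strftime` on a UTC broken-down time. It deliberately does not use `as::ftime`: Boost.Locale output depends on the imbued locale and time zone, while this example uses the default "C" locale. The helper name `format_utc` is hypothetical.

```cpp
#include <ctime>
#include <string>

// Hypothetical helper: format a POSIX time as UTC using the C library's strftime.
std::string format_utc(std::time_t when, char const *pattern)
{
    std::tm const *parts = std::gmtime(&when);   // not thread safe; fine for a sketch
    if (!parts)
        return std::string();
    char buffer[128];
    std::size_t written = std::strftime(buffer, sizeof(buffer), pattern, parts);
    return std::string(buffer, written);
}
```

For POSIX time 0, `format_utc(0, "%Y-%m-%d %H:%M:%S")` gives `1970-01-01 00:00:00`, and `format_utc(0, "%A")` gives `Thursday` as long as the C locale has not been changed with `setlocale`.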


**General recommendations:**

- Prefer using generic date-time manipulators rather than specifying the full format using `as::ftime`.
- Remember that the current calendar may not be Gregorian.


### Internals

All formatting information is stored in the stream class by using the `xalloc`, `pword`, and `register_callback` member functions
of `std::ios_base`. All the information is stored and managed using a special object bound to the `iostream`; all manipulators just
change its state.

When a number is written to the stream or read from it, a custom Boost.Locale facet accesses this object and checks the required formatting
information. Then it creates a special object that actually formats the number and caches it in the `iostream`. The
next time a number is written to the stream, the same formatter is used unless some flags have changed and the formatter object is
invalid.
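
The general mechanism can be sketched with the standard library alone. The example below is not Boost.Locale's implementation: it stores a single integer flag with `iword` instead of an object with `pword`, omits `register_callback`, and all names in the `sketch` namespace are hypothetical. It only shows how a manipulator can change per-stream state that a later output operation reads.

```cpp
#include <ios>
#include <ostream>
#include <sstream>
#include <string>

namespace sketch {

// One slot index shared by all streams, allocated once via xalloc().
inline int percent_slot()
{
    static int const slot = std::ios_base::xalloc();
    return slot;
}

// Manipulators only change the per-stream state; they format nothing themselves.
inline std::ios_base &as_percent(std::ios_base &ios) { ios.iword(percent_slot()) = 1; return ios; }
inline std::ios_base &as_plain(std::ios_base &ios)   { ios.iword(percent_slot()) = 0; return ios; }

// Hypothetical value type whose output depends on the stored state.
struct ratio {
    explicit ratio(double v) : value(v) {}
    double value;
};

inline std::ostream &operator<<(std::ostream &out, ratio const &r)
{
    if (out.iword(percent_slot()) == 1)
        return out << r.value * 100 << "%";
    return out << r.value;
}

// Convenience wrapper used to show the effect of the manipulators.
inline std::string render(double value, bool percent)
{
    std::ostringstream out;
    if (percent)
        out << as_percent;
    else
        out << as_plain;
    out << ratio(value);
    return out.str();
}

} // namespace sketch
```

Here `sketch::render(0.25, true)` returns `25%` and `sketch::render(0.25, false)` returns `0.25`. The state lives in the stream, so two streams can use different settings at the same time.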

## Messages Formatting

### Introduction

Message formatting is probably the most important part of localization: making your application speak the user's language.

Boost.Locale uses the [GNU Gettext](http://www.gnu.org/software/gettext/) localization model.
It is recommended to read the general [documentation](http://www.gnu.org/software/gettext/manual/gettext.html) of GNU Gettext, which is
beyond the scope of this document.

The model is as follows:

-   First, our application `foo` is prepared for localization by calling the `translate` function for each message used in the user interface.

    For example:

        cout << "Hello World" << endl;

    Is converted to

        cout << translate("Hello World") << endl;

-   Then all messages are extracted from the source code and a special `foo.po` file is generated that contains all the original English strings.

        ...
        msgid "Hello World"
        msgstr ""
        ...

-   The `foo.po` file is translated for the supported target locales, for example `de.po`, `ar.po`, `en_CA.po`, `he.po`.
        
        ...
        msgid "Hello World"
        msgstr "שלום עולם"
    
    And then compiled to the binary `mo` format and stored in the following file structure:

        de
        de/LC_MESSAGES
        de/LC_MESSAGES/foo.mo
        en_CA/
        en_CA/LC_MESSAGES
        en_CA/LC_MESSAGES/foo.mo
        ...
    
    When the application starts, it loads the required dictionaries. When the `translate` function is called and the message is written
    to an output stream, a dictionary lookup is performed and the localized message is written out.


### Loading dictionaries

All the dictionaries are loaded by the `generator` class. So, in order to use localized strings in the application, you need to specify the following:

1. The search path of the dictionaries
2. The application domain (or name)

This is done by calling the following member functions of the `generator` class:

-   `void add_messages_path(std::string const &path)` -- add the root path where the dictionaries are placed.
    
    For example: if the dictionary is placed at `/usr/share/locale/ar/LC_MESSAGES/foo.mo`, then path should be `/usr/share/locale`.

-   `void add_messages_domain(std::string const &domain) ` -- add the domain (name) of the application. In the above case it would be "foo".

At least one domain and one path should be specified in order to load dictionaries.

For example, our first fully localized program:

    #include <boost/locale.hpp>
    #include <iostream>

    using namespace std;
    using namespace boost::locale;

    int main()
    {
        generator gen;

        // Specify location of dictionaries
        gen.add_messages_path(".");
        gen.add_messages_domain("hello");

        // Generate locales and imbue them to iostream
        locale::global(gen(""));
        cout.imbue(locale());

        // Display a message using current system locale
        cout << translate("Hello World") << endl;
    }




### Message Translation

These are the basic translation functions:

-   `message translate(char const *msg)` -- create a localized message from id `msg`. `msg` is **not** copied.
-   `message translate(std::string const &msg)` -- create a localized message from id `msg`. `msg` is copied.
-   `message translate(char const *single,char const *plural,int n)` -- create a localized plural message with `single` and `plural` forms for number `n`. 
    Strings `single` and `plural` are **not** copied.
-   `message translate(std::string const &single,std::string const &plural,int n)` -- create a localized plural message with `single` 
    and `plural` forms for number `n`. Strings `single` and `plural` are copied.


These functions return a special proxy object of type `message`. It holds all the information required for string formatting.
When this object is written to an output `iostream`, it performs a dictionary lookup of the id using the locale imbued in the `iostream`.
If the message is found in the dictionary, the translation is written to the output stream; otherwise the original string is written.

**Notes:**

-   `message` can be implicitly converted to each type of supported string (i.e. `std::string`, `std::wstring` etc.) using the
    global locale:

        std::wstring msg = translate("Do you want to open the file?");

-   `message` can be explicitly converted to a string using the `str<CharType>` member function with a specific locale:

        std::wstring msg = translate("Do you want to open the file?").str<wchar_t>(some_locale);


This allows postponing translation of the message to the place where the translation is actually needed, even for different
target locales.

    std::ofstream en,ja,he,de,ar;
    std::wfstream w_ar;

    // Send single message to multiple streams
    void send_to_all(message const &msg)
    {
        en << msg;
        ja << msg;
        he << msg;
        de << msg;
        ar << msg;
        w_ar << msg;
    }

    int main()
    {
        ...
        send_to_all(translate("Hello World"));
    }

### Plural Forms

GNU Gettext catalogs have simple, robust and yet powerful plural forms support. It is recommended to read the 
original GNU documentation [here](http://www.gnu.org/software/gettext/manual/gettext.html#Plural-forms).

Let's try to solve a simple problem, displaying a message to the user:

    if(files == 1)
        cout << translate("You have 1 file in the directory") << endl;
    else
        cout << format(translate("You have {1} files in the directory")) % files << endl;

This quite simple task becomes quite complicated when we deal with languages other than English. Many languages have more
than two plural forms. For example, in Hebrew there are special forms for single, double, plural, and plural above 10.
They can't be distinguished by the simple rule "`n` is 1 or not".

The correct solution is:

    cout << format(translate("You have 1 file in the directory",
                            "You have {1} files in the directory",files)) % files << endl;

Here `translate` receives the single and plural forms of the original string and the number it should be formatted for.
On the other side, a special entry in the dictionary specifies the rule for choosing the correct plural form in the specific language.
For example, the Slavic language family has 3 plural forms, which can be chosen using the following equation:

    plural=n%10==1 && n%100!=11 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2;

Such an equation is written in the dictionary and is evaluated during translation to supply the correct form.
For more detailed information please refer to GNU Gettext: [11.2.6 Additional functions for plural forms](http://www.gnu.org/software/gettext/manual/gettext.html#Plural-forms).
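
To see how the Slavic rule quoted above behaves, it can be transcribed directly into a C++ function. This is only a sketch for experimenting with the rule; the function name `slavic_plural_index` is hypothetical, and in practice the expression lives in the catalog and is evaluated by the runtime.

```cpp
// Hypothetical helper: index of the plural form selected by the Slavic rule
//   plural=n%10==1 && n%100!=11 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2;
int slavic_plural_index(unsigned long n)
{
    return n % 10 == 1 && n % 100 != 11 ? 0
         : n % 10 >= 2 && n % 10 <= 4 && (n % 100 < 10 || n % 100 >= 20) ? 1
         : 2;
}
```

The rule selects form 0 for 1, 21 and 101, form 1 for 2, 3, 4 and 22, and form 2 for 0, 5, 11, 12 and 112. The simple test "`n` is 1 or not" cannot express this.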

### Adding context information

The GNU Gettext model assumes that the same source message is always translated to exactly the same localized message, but this
may be wrong. For example, in German a button label "open" is translated to "öffnen" in the context of opening a file, or to 
"aufbauen" in the context of opening an internet connection.

In such cases it is useful to add some context information to the original string by adding a comment.

    button->setLabel(translate("#File#open"));

The comment is placed between the first hash symbol ('#') and the following one. The comment is always extracted from
the original string and not displayed; however, it is part of the string identification. The translator should 
discard the comment and translate only the "open" string.

For example, this is how the `po` file is expected to look:
    
    msgid "#File#open"
    msgstr "öffnen"
    
    msgid "#Internet Connection#open"
    msgstr "aufbauen"

In order to insert '#' as the first symbol you may just use a double hash, for example:

    cout<< translate("$ - Dollar symbol") << endl
        << translate("## - Hash symbol") << endl;

**Note:** Hash-based comments are an extension of the GNU Gettext library.
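
The rules above can be sketched as a small function that computes the text to display when no translation is found. This is an illustration of the convention as described here, not the library's code: the name `strip_context_comment` is hypothetical, and the handling of a lone '#' with no closing hash is an assumption of the sketch.

```cpp
#include <string>

// Hypothetical helper: return the displayable part of a message id.
std::string strip_context_comment(std::string const &id)
{
    if (id.size() >= 2 && id[0] == '#' && id[1] == '#')
        return id.substr(1);              // "##..." stands for a literal leading '#'
    if (!id.empty() && id[0] == '#') {
        std::string::size_type end = id.find('#', 1);
        if (end != std::string::npos)
            return id.substr(end + 1);    // drop the "#comment#" prefix
    }
    return id;                            // no comment: display the id unchanged
}
```

Under these assumptions `"#File#open"` becomes `"open"`, `"## - Hash symbol"` becomes `"# - Hash symbol"`, and `"$ - Dollar symbol"` is left untouched.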


### Working with multiple messages domains

In some cases it is useful to work with multiple domains. For example, if an application consists of several independent modules, it may
have several domains. If the application consists of modules "foo" and "bar", it is possible to specify which dictionary should be used.

There are two ways of using non-default domains:

-   When working with `iostream`, it is possible to use parametrized manipulator `as::domain(std::string const &)` that allows switching domains
    in streams:

        cout << as::domain("foo") << translate("Hello") << as::domain("bar") << translate("Hello");
        // First translation is taken from dictionary foo and other from dictionary bar

-   It is possible to specify the domain explicitly when converting a `message` object to a string:

        std::wstring foo_msg = translate("Hello World").str<wchar_t>("foo");
        std::wstring bar_msg = translate("Hello World").str<wchar_t>("bar");

### Extracting messages from the source code

There are many tools that allow you to extract messages from the source code into the `.po` file format. The most
popular and "native" tool is `xgettext`, which is installed by default on most Unix systems and is freely downloadable
for Windows.

For example, we have a source file called `dir.cpp` that prints:

    cout << format(translate("Listing of catalog {1}:")) % file_name << endl;
    cout << format(translate("Catalog {1} contains 1 file","Catalog {1} contains {2,num} files",files_no)) 
            % file_name % files_no << endl;

Now we run:

    xgettext --keyword=translate:1,1t --keyword=translate:1,2,3t dir.cpp

And a file called `messages.po` is created that looks approximately like this:

    #: dir.cpp:1
    msgid "Listing of catalog {1}:"
    msgstr ""
    
    #: dir.cpp:2
    msgid "Catalog {1} contains 1 file"
    msgid_plural "Catalog {1} contains {2,num} files"
    msgstr[0] ""
    msgstr[1] ""

This file can be given to a translator to adapt it to a specific language.

We used the `--keyword` parameter of `xgettext` to make it suitable for extracting messages from
source localized with Boost.Locale: it searches for `translate()` function calls instead of the default `gettext()` and `ngettext()` ones.
The first parameter, `--keyword=translate:1,1t`, provides the template for a basic message: a `translate` function called with 1 
argument (1t), whose first argument is taken as the key. The second one, `--keyword=translate:1,2,3t`, is used for plural forms. 
It tells `xgettext` to use a `translate()` function call with 3 parameters (3t) and take the 1st and 2nd parameters as keys.

### Questions and Answers

-   Do I need GNU Gettext to use Boost.Locale?
    
    Boost.Locale provides a run-time environment to load and use GNU Gettext message catalogs, but it does
    not provide tools for generation, translation, compilation and management of these catalogs.
    Boost.Locale only reimplements the GNU Gettext libintl runtime.
    
    You would probably need:
    
    1.  Boost.Locale itself -- for runtime.
    2.  A tool for extracting strings from source code and managing them: GNU Gettext provides good tools, but other
        implementations are available as well.
    3.  A good translation program like [Poedit](http://www.poedit.net/) or [KBabel](http://kbabel.kde.org/).
    

-   Is there any reason to prefer the Boost.Locale implementation to the original GNU Gettext runtime library?
    In any case I would probably need some of the GNU tools.
    
    There are two important differences between the GNU Gettext runtime library and the Boost.Locale implementation:
    
    1.  The GNU Gettext runtime supports only one locale per process. It is not thread safe to use multiple locales
        and encodings in the same process. This is perfectly fine for applications that interact directly with a 
        single user, like most GUI applications, but it is very problematic for services.
    2.  The GNU Gettext API supports only 8-bit encodings, making it irrelevant in environments that natively use
        wide strings.

## Code-page conversions

Boost.Locale provides the `to_utf` and `from_utf` functions, placed in the `boost::locale::conv` namespace. They are simple functions
that convert strings between UTF-8/16/32 and other encodings.

For example:

    std::string utf8_string = to_utf<char>(latin1_string,"Latin1");
    std::wstring wide_string = to_utf<wchar_t>(latin1_string,"Latin1");
    std::string latin1_string = from_utf(wide_string,"Latin1");
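
To make the idea of such a conversion concrete, here is a hand-written sketch of the simplest case, Latin1 (ISO-8859-1) to UTF-8, using only the standard library. It is not how `to_utf` is implemented; the function name `latin1_to_utf8` is hypothetical. It works because every Latin1 byte value equals its Unicode code point, so bytes below 0x80 are copied and the rest become two-byte UTF-8 sequences.

```cpp
#include <string>

// Hypothetical helper: convert ISO-8859-1 bytes to UTF-8.
std::string latin1_to_utf8(std::string const &latin1)
{
    std::string utf8;
    utf8.reserve(latin1.size());
    for (std::string::size_type i = 0; i < latin1.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(latin1[i]);
        if (c < 0x80) {
            utf8 += static_cast<char>(c);                   // ASCII is unchanged
        } else {
            utf8 += static_cast<char>(0xC0 | (c >> 6));     // lead byte: 110xxxxx
            utf8 += static_cast<char>(0x80 | (c & 0x3F));   // trail byte: 10xxxxxx
        }
    }
    return utf8;
}
```

For example, the Latin1 byte `0xE9` ("é") becomes the two UTF-8 bytes `0xC3 0xA9`, which also shows that the length of the string may change during conversion.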

These functions may take an explicit encoding name like "Latin1" or "ISO-8859-8", or take a `std::locale` as a parameter and fetch this information from it.
They also receive a policy parameter that directs how to behave if the conversion can't be performed (an illegal or unsupported character is found).
By default these functions skip all illegal characters and try to do the best they can; however, it is possible to ask them to throw a `conversion_error`
exception by passing the `stop` flag:

    std::wstring s=to_utf<wchar_t>("\xFF\xFF","UTF-8",stop); 
    // Throws because this string is illegal in UTF-8


Boost.Locale provides stream code-page conversion facets based on the `std::codecvt` facet.
This allows converting between wide character encodings and 8-bit encodings like UTF-8, ISO-8859 or Shift-JIS.

Most compilers provide such facets, but:

-   Under Windows, MSVC does not support UTF-8 encodings at all.
-   Under Linux, the encodings are supported only if the required locales are generated. For example,
    it may be impossible to create a `he_IL.CP1255` locale even when the `he_IL` locale is available.

Thus Boost.Locale provides an option to generate code-page conversion facets for use with 
Boost.Iostreams filters or `std::wfstream`. For example:

    std::locale loc= generator().get("he_IL.UTF-8");
    std::wofstream file;
    file.imbue(loc);
    file.open("hello.txt");
    file << L"שלום!" << endl;

This would create a file `hello.txt` encoded as UTF-8 with "שלום!" (shalom) in it.

**Important Note**

Boost.Locale code-page conversion facets do not support UTF-16 text outside of the BMP (i.e. they support only UCS-2).
So if you want to provide full Unicode support, do not use wide strings on platforms where `sizeof(wchar_t)==2` (i.e. Windows),
or do not use these facets for character I/O.

**Limitations:**

1.  The standard does not provide any useful information about the `std::mbstate_t` type that should be used for saving 
    intermediate code-page conversion states. It leaves the definition to the compiler implementation, making it
    impossible to reimplement `std::codecvt<wchar_t,char,mbstate_t>` for any stateful encoding.
    Thus, the Boost.Locale `codecvt` facet implementation may be used only with stateless encodings like UTF-8,
    ISO-8859 or Shift-JIS, but not with stateful encodings like UTF-7 or SCSU.

2.  The standard requires that code-page translation can be done by translating each wide character independently.
    This is not a problem for most fixed-width encodings like the ISO-8859 family, and it is not a problem
    when `wchar_t` represents a single code point, i.e. `sizeof(wchar_t)==4`, which is true for most POSIX platforms.
    
    But under Windows, `sizeof(wchar_t)==2`, so a `wchar_t` can represent only a single character
    in the Basic Multilingual Plane (BMP), while characters with code points above `0xFFFF` are represented
    using surrogate pairs. Because the conversion must be stateless (see the limitation above) and
    `wchar_t` can't represent every Unicode character, only the UCS-2 encoding is supported: `codecvt`
    would fail on surrogate characters of UTF-16 strings.
    
    The same is valid for C++0x `char16_t` based streams.
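
The surrogate pair issue in the second limitation can be made concrete with a short sketch of the standard UTF-16 encoding step. The names `utf16_unit` and `encode_utf16` are hypothetical, and `unsigned short` merely stands in for a 2-byte `wchar_t`. A code point inside the BMP fits in one unit, while anything above `0xFFFF` needs two units that cannot be converted independently of each other.

```cpp
#include <cassert>
#include <vector>

typedef unsigned short utf16_unit;   // stand-in for a 16-bit wchar_t

// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
std::vector<utf16_unit> encode_utf16(unsigned long code_point)
{
    assert(code_point <= 0x10FFFF);                          // valid Unicode range
    assert(code_point < 0xD800 || code_point > 0xDFFF);      // surrogates are not characters
    std::vector<utf16_unit> units;
    if (code_point < 0x10000) {
        units.push_back(static_cast<utf16_unit>(code_point));           // BMP: one unit
    } else {
        unsigned long v = code_point - 0x10000;                         // 20 significant bits
        units.push_back(static_cast<utf16_unit>(0xD800 + (v >> 10)));   // high surrogate
        units.push_back(static_cast<utf16_unit>(0xDC00 + (v & 0x3FF))); // low surrogate
    }
    return units;
}
```

For instance, U+05E9 (the Hebrew letter shin) is the single unit `0x05E9`, whereas U+1F600 becomes the pair `0xD83D 0xDE00`. A facet that must translate each 16-bit unit on its own has no valid output for either half of that pair.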

So, if your system supports required encoding, it would be better to use it directly instead of Boost.Locale
facet.

**General Recommendation:** Prefer Unicode UTF-8 encoding for `char` based strings and files in your application.

## Boundary analysis

Boost.Locale provides a boundary analysis tool that allows you to split text into characters, words and sentences, and to find appropriate
places for line breaks.

**Note:** Characters are not equivalent to Unicode code points. For example, the Hebrew word Shalom -- "שָלוֹם" -- consists
of 4 characters and 6 Unicode code points, where two code points are used for vowels (diacritical marks).
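The gap between storage units, code points and characters can be checked with a few lines of standard C++. The sketch below is not part of Boost.Locale; `count_code_points` and `shalom` are hypothetical names, and the byte sequence is a transcription of the word above as UTF-8 (U+05E9 U+05B8 U+05DC U+05D5 U+05B9 U+05DD), assumed to be well formed.

```cpp
#include <cstddef>
#include <string>

// Count Unicode code points in a UTF-8 string: every byte that is not a
// continuation byte (bit pattern 10xxxxxx) starts a new code point.
// Assumption: the input is well-formed UTF-8.
std::size_t count_code_points(std::string const &utf8)
{
    std::size_t n = 0;
    for (std::string::size_type i = 0; i < utf8.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(utf8[i]);
        if ((c & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

// "Shalom" with vowel points, spelled out as explicit UTF-8 bytes.
std::string const shalom = "\xD7\xA9\xD6\xB8\xD7\x9C\xD7\x95\xD6\xB9\xD7\x9D";
```

Here `shalom.size()` is 12 bytes and `count_code_points(shalom)` is 6, yet a reader sees 4 characters. Getting that last number right requires real boundary analysis rather than byte or code point counting.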

Boost.Locale provides 3 major classes that are used for boundary analysis:

- `boost::locale::boundary::mapping` -- the special map that holds the boundary points of the text.
- `boost::locale::boundary::token_iterator` -- the iterator that returns chunks of text that were split by text boundaries.
- `boost::locale::boundary::break_iterator` -- the iterator that returns an iterator to the original text.

In order to perform boundary analysis, we first create a boundary mapping of the text we want to work with.

    using namespace boost::locale::boundary;
    std::string text="To be, or not to be?";
    // Create mapping of text for token iterator using default locale.
    mapping<token_iterator<std::string::const_iterator> > map(word,text.begin(),text.end());
    // Print all "words" -- chunks of text between word boundaries
    for(token_iterator<std::string::const_iterator> it=map.begin(),e=map.end();it!=e;++it)
        std::cout <<"`"<< * it << "'"<< std::endl;

This would print the list: "To", " ", "be", ",", " ", "or", " ", "not", " ", "to", " ", "be", "?"

You can also provide filters for better selection of the text chunks or boundaries you are interested in. For example:

    map.mask(word_letters);
    // Tell newly created iterators to select words that contain letters only.
    for(token_iterator<std::string::const_iterator> it=map.begin(),e=map.end();it!=e;++it)
        std::cout <<"`"<< * it << "'"<< std::endl;

This would now print only the words "To", "be", "or", "not", "to", "be", ignoring all non-words such as punctuation.
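To see what "chunks" and "letters only" mean without any library, the following is a deliberately naive, ASCII-only stand-in written in standard C++. It is not how Boost.Locale or ICU find boundaries (they follow Unicode rules and the locale); the names `naive_chunks` and `naive_words` are hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Split text into chunks: a run of ASCII letters/digits is one chunk,
// every other character is a chunk of its own.
std::vector<std::string> naive_chunks(std::string const &text)
{
    std::vector<std::string> out;
    std::string::size_type i = 0;
    while (i < text.size()) {
        std::string::size_type j = i + 1;
        if (std::isalnum(static_cast<unsigned char>(text[i]))) {
            while (j < text.size() && std::isalnum(static_cast<unsigned char>(text[j])))
                ++j;
        }
        out.push_back(text.substr(i, j - i));
        i = j;
    }
    return out;
}

// Keep only chunks that start with a letter or digit, similar in spirit
// to masking the mapping so that only words are selected.
std::vector<std::string> naive_words(std::string const &text)
{
    std::vector<std::string> chunks = naive_chunks(text);
    std::vector<std::string> out;
    for (std::size_t k = 0; k < chunks.size(); ++k)
        if (std::isalnum(static_cast<unsigned char>(chunks[k][0])))
            out.push_back(chunks[k]);
    return out;
}
```

For "To be, or not to be?" this yields 13 chunks and 6 words, matching the lists above. It breaks down as soon as the text contains non-ASCII letters or a language that does not separate words with spaces, which is exactly where proper boundary analysis is needed.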

The break iterator has a different role: instead of returning text chunks, it returns the underlying iterator used for iterating over the source text. For example,
you can select the first two sentences as follows:

    using namespace boost::locale::boundary;
    std::string const text="First sentence. Second sentence! Third one?";
    // Create a sentence boundary mapping and set the mask of boundaries
    // to select sentence terminators only, like "?", "." ignoring new lines.
    typedef break_iterator<std::string::const_iterator> iterator;
    mapping<iterator> map(sentence,text.begin(),text.end(),sentence_term);
    iterator p=map.begin();
    // Advance p by two steps, making sure p is still valid.
    for(int i=0;i<2 && p!=map.end();i++)
        ++p;
    std::cout << "First two sentences are " << std::string(text.begin(),*p) << std::endl;

This would print the first two sentences: "First sentence. Second sentence!"

## Localized Text Formatting

The `iostream` manipulators are very useful, but when we create messages for the user, sometimes we need something
like good old `printf` or `boost::format`.

Unfortunately `boost::format` has several limitations in the context of localization:

1.  It renders all parameters using the global locale rather than the target `ostream` locale.
2.  It knows nothing about the new Boost.Locale manipulators.
3.  Its `printf`-like syntax is very limited for formatting complex localized data, not allowing
    formatting of dates, times or currency.


Thus a new class, `boost::locale::format`, was introduced. For example:

    wcout << wformat(L"Today {1,date} I would meet {2} at home") % time(0) % name <<endl;

Each format specifier is enclosed within `{}` brackets. The parameters inside a specifier are separated with a comma "," and
each may have an additional option after the symbol '='. The option may be simple ASCII text or localized text quoted with
single quotes "'". If a single quote should be inserted into the text, it is represented by two consecutive single quotes.

For example, the format string:

    "Ms. {1} had shown at {2,ftime='%I o''clock'} at home. Exact time is {2,time=full}"

The syntax can be described with the following grammar:

    format : '{' parameters '}' ;
    parameters : parameter | parameter ',' parameters ;
    parameter : key ["=" value] ;
    key : [0-9a-zA-Z<>]+ ;
    value : ascii-string-excluding-"}"-and-"," | local-string ;
    local-string : quoted-text | quoted-text local-string ;
    quoted-text : '[^']*' ;
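As an illustration of this grammar, here is a small hand-written parser in standard C++ for the text between `{` and `}`. It is only a sketch, not the parser used by `boost::locale::format`; the names `parameters` and `parse_specifier` are hypothetical, and malformed input is handled leniently rather than reported.

```cpp
#include <string>
#include <utility>
#include <vector>

typedef std::vector<std::pair<std::string, std::string> > parameters;

// Parse the inside of one format specifier into key/value pairs.
// A value is either plain text up to the next ',' or a local-string made
// of quoted pieces, where two adjacent pieces ('a''b') mean a literal quote.
parameters parse_specifier(std::string const &spec)
{
    parameters result;
    std::string::size_type i = 0, n = spec.size();
    while (i < n) {
        std::string key, value;
        while (i < n && spec[i] != '=' && spec[i] != ',')
            key += spec[i++];
        if (i < n && spec[i] == '=') {
            ++i; // skip '='
            if (i < n && spec[i] == '\'') {
                while (i < n && spec[i] == '\'') {
                    ++i; // opening quote
                    while (i < n && spec[i] != '\'')
                        value += spec[i++];
                    if (i < n)
                        ++i; // closing quote
                    if (i < n && spec[i] == '\'')
                        value += '\''; // another piece follows: literal quote
                }
            } else {
                while (i < n && spec[i] != ',')
                    value += spec[i++];
            }
        }
        result.push_back(std::make_pair(key, value));
        if (i < n && spec[i] == ',')
            ++i; // skip the separator
    }
    return result;
}
```

Applied to `2,ftime='%I o''clock'` it produces the pair (`2`, empty) followed by (`ftime`, `%I o'clock`).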

The following format key-value pairs are supported:

-   `[0-9]+` -- digits, the index of formatted parameter -- mandatory key.
-   `num` or `number` -- format a number. Optional values are:

    -   `hex` -- display hexadecimal number
    -   `oct` -- display in octal format
    -   `sci` or `scientific` -- display in scientific format
    -   `fix` or `fixed` -- display in fixed format
        
    For example `number=sci`
-   `cur` or `currency` -- format currency. Optional values are:

    -   `iso` -- display using ISO currency symbol.
    -   `nat` or `national` -- display using national currency symbol.

-   `per` or `percent` -- format percent value.
-   `date`, `time`, `datetime` or `dt` -- format date, time, or date and time. Optional values are:

    -   `s` or `short` -- display in short format
    -   `m` or `medium` -- display in medium format.
    -   `l` or `long` -- display in long format.
    -   `f` or `full` -- display in full format.

-   `ftime` with a (quoted) string parameter -- display as with `strftime`; see the `as::ftime` manipulator.
-   `spell` or `spellout` -- spell the number.
-   `ord` or `ordinal` -- format an ordinal number (1st, 2nd, etc.).
-   `left` or `<` -- align to left.
-   `right` or `>` -- align to right.
-   `width` or `w` -- set field width (requires parameter).
-   `precision` or `p` -- set precision (requires parameter).
-   `locale` (requires parameter) -- switch the locale for the current operation. This command generates a locale
    with formatting facets, giving more fine-grained control of formatting. For example:

        cout << format("This article was published at {1,date=l} (Gregorian) {1,locale=he_IL@calendar=hebrew,date=l} (Hebrew)") % date;


The constructor of the `format` class may receive an object of type `message`, allowing easier integration with localized messages.
For example:

    cout<< format(translate("Adding {1} to {2}, we get {3}")) % a % b % (a+b) << endl;

The formatted string can be fetched directly using the `str(std::locale const &loc=std::locale())` member function. For example:

    std::wstring de = (wformat(translate("Adding {1} to {2}, we get {3}")) % a % b % (a+b)).str(de_locale);
    std::wstring fr = (wformat(translate("Adding {1} to {2}, we get {3}")) % a % b % (a+b)).str(fr_locale);

**Important Note:**

There is one significant difference between `boost::format` and `boost::locale::format`: Boost.Locale format converts its parameters
only when it is written to an `ostream` or when the `str()` member function is called. Until then it only stores references to the objects that
will be written to the stream.

This is generally not a problem when all operations are done in one statement, as in:

    cout << format("Adding {1} to {2}, we get {3}") % a % b % (a+b);

This works because the temporary value of `(a+b)` exists until the end of the full statement, by which time the format has been written to the stream. But the following code is wrong:

    format fmt("Adding {1} to {2}, we get {3}");
    fmt % a;
    fmt % b;
    fmt % (a+b);
    cout << fmt;

It is wrong because the temporary value of `(a+b)` no longer exists when `fmt` is written to the stream. The correct solution would be:

    format fmt("Adding {1} to {2}, we get {3}");
    fmt % a;
    fmt % b;
    int a_and_b = a+b;
    fmt % a_and_b;
    cout << fmt;
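The effect of storing references instead of values can be demonstrated safely with a toy class in standard C++. This is only a sketch of the idea, not the implementation of `boost::locale::format`; `deferred_ints` is a hypothetical name and it handles `int` arguments only.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Toy formatter that remembers where its arguments live and reads them
// only when str() is called.
class deferred_ints {
public:
    deferred_ints &operator%(int const &value)
    {
        args_.push_back(&value); // store the address, do not copy the value
        return *this;
    }
    std::string str() const
    {
        std::ostringstream out;
        for (std::size_t i = 0; i < args_.size(); ++i) {
            if (i != 0)
                out << ' ';
            out << *args_[i]; // the value is read now, not at operator% time
        }
        return out.str();
    }
private:
    std::vector<int const *> args_;
};
```

If `a` and `b` are bound with `fmt % a % b` and `a` is changed afterwards, `fmt.str()` shows the new value of `a`. By the same mechanism, binding a temporary such as `(a+b)` in its own statement would leave the formatter pointing at an object that no longer exists.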

## Working with dates, times, timezones and calendars

One of the important flaws of most libraries that provide operations over dates is the fact that they support only the Gregorian calendar.
This is true for `boost::date_time`, and it is true for `std::tm` and standard functions like `localtime` and `gmtime`, which
assume that we use the Gregorian calendar.

Boost.Locale provides generic `date_time` and `calendar` classes that allow you to perform operations on dates and times
for non-Gregorian calendars like the Hebrew, Islamic or Japanese calendars.

-   `calendar` -- the class that represents generic information about the calendar, independent of a specific time point. For example, you can get the maximal number of days in a month for this calendar.
-   `date_time` -- represents a time point. It is constructed from a calendar and allows us to manipulate various time periods.
-   `boost::locale::period` -- holds an enumeration of various periods like month, year, day and hour that allows us to
    manipulate dates. You can add periods, multiply them by integers, get them, set them, or add them to `date_time` objects.

For example:

    using namespace boost::locale;
    date_time now; // Create date_time class with default calendar initialized to current time;
    date_time tomorrow = now + period::day;
    cout << "Let's meet tomorrow at " << as::date << tomorrow << endl;
    date_time some_point = period::year * 1995 + period::january + period::day*1;
    // Set some_point's date to 1995-Jan-1.
    cout << "The "<<as::date << some_point << " is " 
        << as::ordinal << some_point / period::day_of_week_local << " day of week"  << endl;

You can calculate the difference between two dates, expressed in a given period, by dividing the difference by that period:

    date_time now;
    cout << " There are " << (now + 2 * period::month - now) / period::day << " days "
            "between " << as::date << now << " and " << now + 2*period::month << endl;

`date_time` provides the member functions `minimum` and `maximum` to get the minimal and maximal
possible values of a certain period for a specific time.

For example, `maximum(period::day)` would be 31 for January, and 28 or 29 for February depending on whether the year is a leap year.

**Note:** be very careful with assumptions about calendars. For example, in the Hebrew calendar the
number of months depends on whether the current year is a leap year or not.
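The following standard C++ sketch shows the kind of calendar-specific arithmetic that sits behind such minimum and maximum values. It is independent of Boost.Locale and ICU; the function names are hypothetical. The Hebrew rule used is the usual 19-year cycle with 7 leap years, expressed as `(7*y + 1) % 19 < 7`.

```cpp
// Gregorian leap year rule.
bool gregorian_is_leap(int year)
{
    return (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
}

// Largest day-of-month value for a Gregorian month (1 = January).
// Assumption: month is in the range 1..12.
int gregorian_max_day(int year, int month)
{
    static int const days[12] = {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};
    if (month == 2 && gregorian_is_leap(year))
        return 29;
    return days[month - 1];
}

// Number of months in a Hebrew year: 13 in the 7 leap years of every
// 19-year cycle, 12 in the other years.
int hebrew_months_in_year(int hebrew_year)
{
    return ((7 * hebrew_year + 1) % 19 < 7) ? 13 : 12;
}
```

For example, Hebrew year 5770 has 12 months while 5771 has 13, so code that assumes "a year has 12 months" is already wrong for this calendar.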

It is recommended to take a look at the `calendar.cpp` example provided with this library to get an understanding of how
to manipulate dates and times using these classes.

In order to convert between dates in various calendars, you may get the POSIX time of a `date_time` via its `time` member function
and use it to construct a `date_time` with another calendar. For example:

    using namespace boost::locale;
    using namespace boost::locale::period;
    generator gen;
    // Create locales with Hebrew and Gregorian (default) calendars.
    std::locale l_hebrew=gen("en_US@calendar=hebrew");
    std::locale l_gregorian=gen("en_US");
    
    // Create Gregorian date from fields
    date_time greg(2010*year + february + 5*day,l_gregorian);
    // Assign time point taken from Gregorian date to date_time with
    // Hebrew calendar
    date_time heb(greg.time(),l_hebrew);
    // Now we can query the Hebrew year.
    std::cout << "Hebrew year is " << heb / year << std::endl;


## Getting information about current locale

The `std::locale::name` function provides quite limited information about a locale. Thus an additional facet was created to give
more precise information: `boost::locale::info`. It has the following member functions:

-   `std::string language()` -- get the language code of the current locale, for example "en".
-   `std::string country()` -- get the country code of the current locale, for example "US".
-   `std::string variant()` -- get the variant of the current locale, for example "euro".
-   `std::string encoding()` -- get the charset used for `char` based strings, for example "UTF-8".
-   `bool utf8()` -- fast way to check if the encoding is UTF-8.
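These four pieces of information correspond to the parts of a POSIX-style locale name, `language[_COUNTRY][.encoding][@variant]`. The sketch below splits such a name using only standard C++. It is not how the `info` facet is implemented; `locale_parts` and `split_locale_name` are hypothetical names.

```cpp
#include <string>

struct locale_parts {
    std::string language, country, encoding, variant;
};

// Split a name of the form language[_COUNTRY][.encoding][@variant].
// Missing parts are left empty. No validation of the parts is done.
locale_parts split_locale_name(std::string name)
{
    locale_parts p;
    std::string::size_type pos = name.find('@');
    if (pos != std::string::npos) {
        p.variant = name.substr(pos + 1);
        name.erase(pos);
    }
    pos = name.find('.');
    if (pos != std::string::npos) {
        p.encoding = name.substr(pos + 1);
        name.erase(pos);
    }
    pos = name.find('_');
    if (pos != std::string::npos) {
        p.country = name.substr(pos + 1);
        name.erase(pos);
    }
    p.language = name;
    return p;
}
```

For `en_US.UTF-8@euro` this gives language `en`, country `US`, encoding `UTF-8` and variant `euro`.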

## Working with multiple locales

Boost.Locale allows you to work safely with multiple locales in the same process. As we mentioned before, the locale
generation process is not cheap. Thus, when working with multiple locales, it is recommended to create all the locales
you need at the beginning and then use them.

The `generator` class has a member function `preload` that allows you to create a locale and put it into a cache. Then, the next time
you create a locale, if it exists in the cache it is fetched from the preloaded locale set.


For example:

    generator gen;
    gen.octet_encoding("UTF-8");
    gen.preload("en_US");
    gen.preload("de_DE");
    gen.preload("ja_JP");
    // All three locales above are now created and cached

    std::locale en=gen("en_US"); 
    // Fetched the existing locale from the cache
    std::locale ar=gen("ar_EG");
    // Because ar_EG is not in the cache, a new locale is generated (but not cached)

**Note:** generating a locale does not put it in the cache; only `generator::preload` does this.

These locales can then be imbued into `iostreams` or used directly as parameters in various functions.
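The caching rule in the note above (only `preload` fills the cache, plain generation does not) can be modelled with a toy class in standard C++. This is a sketch of the described behaviour, not the real `generator`; `toy_generator` is a hypothetical name, and the expensive locale construction is replaced by a string and a counter.

```cpp
#include <map>
#include <string>

// Toy model of the preload/generate behaviour described above.
class toy_generator {
public:
    toy_generator() : builds_(0) {}

    // Build the object once and keep it for later lookups.
    void preload(std::string const &id)
    {
        if (cache_.find(id) == cache_.end())
            cache_[id] = build(id);
    }

    // Return the cached object if present, otherwise build a new one
    // without storing it.
    std::string operator()(std::string const &id)
    {
        std::map<std::string, std::string>::const_iterator it = cache_.find(id);
        if (it != cache_.end())
            return it->second;
        return build(id);
    }

    // Number of (simulated) expensive constructions so far.
    int builds() const { return builds_; }

private:
    std::string build(std::string const &id)
    {
        ++builds_;
        return "locale:" + id;
    }
    std::map<std::string, std::string> cache_;
    int builds_;
};
```

After `preload("en_US")`, any number of `gen("en_US")` calls cost nothing extra, while every `gen("ar_EG")` call pays the construction cost again. That is why preloading all required locales up front is recommended.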

# Recommendations and Myths

## Recommendations

-   The first and most important recommendation: prefer UTF-8 encoding for narrow strings. It represents all
    supported Unicode characters and is more convenient for general use than other encodings like Latin1.
-   Remember that there are many different cultures, and you may assume very little about the user's language. A calendar
    may not have "January". It may not be possible to convert integer numbers using a simple `atoi`, because
    the text may not use the "ordinary" digits 0..9 at all. You may not assume that "space" characters are frequent,
    because in Chinese a space does not separate words. The text may be written from right to left or
    from top to bottom, and so on.
-   When using message formatting, try to provide as much context information as you can. Prefer translating entire
    sentences over single words. When translating words, **always** add some context information.


## Myths

**In order to use Unicode in my application I should use wide strings everywhere.**

Unicode is not limited to wide strings. In fact, both `std::string` and `std::wstring`
are absolutely fine for holding and processing Unicode text. More than that, the semantics of `std::string`
are much cleaner in multi-platform applications, because if the string is a "Unicode" string then
it is UTF-8. "Wide" strings, on the other hand, may be UTF-16 or UTF-32 encoded, depending
on the platform.

So wide strings may be even less convenient when dealing with Unicode than `char` based strings.

**UTF-16 is the best encoding to work with.**

There is a common assumption that UTF-16 is one of the best encodings for storing information because it gives the "shortest" representation
of strings.

In fact, it is probably the most error-prone encoding to work with. The biggest issue is code points lying outside of the BMP, which
are represented with surrogate pairs. These characters are very rare, and many applications are not tested with them.

For example:

-   Qt3 could not deal with characters outside of the BMP.
-   Editing a character with a code point above 0xFFFF in Windows Notepad exposes an unpleasant bug: in order to erase such a character
    you have to press backspace twice.

So, UTF-16 can be used for dealing with Unicode (in fact ICU and many other applications use UTF-16 as their internal Unicode representation), but
you should be very careful and never assume one-code-point == one-utf16-character.
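The "backspace twice" symptom comes from removing one 16-bit unit instead of one code point. A surrogate-aware version takes only a few lines of standard C++. The sketch below is illustrative only; `erase_last_code_point` and the helper names are hypothetical, and combining sequences (several code points forming one visible character) are deliberately not handled.

```cpp
#include <cstdint>
#include <vector>

bool is_high_surrogate(std::uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
bool is_low_surrogate(std::uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Remove the last code point from UTF-16 text: two units if the text ends
// with a surrogate pair, one unit otherwise.
void erase_last_code_point(std::vector<std::uint16_t> &text)
{
    if (text.empty())
        return;
    std::uint16_t last = text.back();
    text.pop_back();
    if (is_low_surrogate(last) && !text.empty() && is_high_surrogate(text.back()))
        text.pop_back();
}
```

With the text `'a', 0xD834, 0xDD1E` (the letter "a" followed by U+1D11E), one call removes both surrogate units and leaves just `'a'`.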


# Design Rationale 

**Why is it needed?**

Why do we need a localization library? Standard C++ facets (should) provide most of the required functionality:

- Case conversion is done using the `std::ctype` facet.
- Collation is supported by `std::collate` and has nice integration with `std::locale`.
- There are `std::num_put`, `std::num_get`, `std::money_put`, `std::money_get`, `std::time_put` and `std::time_get` for number,
    time and currency formatting and parsing.
- There is the `std::messages` class that supports localized message formatting.


So why do we need such a library if we have all the functionality within the standard library?

Almost every(!) facet has some flaws in its design:

-   `std::collate` supports only one level of collation, not allowing you to choose whether case-sensitive or accent-sensitive
    comparison should be performed.

-   `std::ctype`, which is responsible for case conversion, assumes that conversion can be done on a per-character basis. This is
    probably correct for many languages but it isn't correct in the general case.

    1. Case conversion may change the string length. For example, the German word "grüßen" should be converted to "GRÜSSEN" in upper
    case: the letter "ß" should be converted to "SS", but the `toupper` function works on a single-character basis.
    2. Case conversion is context sensitive. For example, the Greek word "ὈΔΥΣΣΕΎΣ" should be converted to "ὀδυσσεύς", where the Greek letter
    "Σ" is converted to "σ" or to "ς", according to its position in the word.
    3. Case conversion can not assume that a single `char` or `wchar_t` holds a complete code point. This assumption is incorrect for the popular UTF-8 encoding under
    Linux and the UTF-16 encoding under Windows, where each code point is represented by up to 4 `char`s in UTF-8 and up to two `wchar_t`s under
    Windows. This makes `std::ctype` totally useless with UTF-8 encodings.

-   `std::numpunct` and `std::moneypunct` do not specify the code points used for digit representation at all.
    Thus it is impossible to format a number using the digits used under Arabic locales. For example,
    the number "103" is expected to be displayed as "١٠٣" under the `ar_EG` locale.

    `std::numpunct` and `std::moneypunct` assume that the thousands separator can be represented using a single character. This is untrue
    for the UTF-8 encoding, where only the Unicode range 0-0x7F can be represented as a single `char`. As a result, localized numbers can't be
    represented correctly under locales that use the Unicode "EN SPACE" character for the thousands separator, like the Russian locale.

    This actually causes real bugs under GCC and SunStudio compilers, where formatting numbers under a Russian locale creates invalid
    UTF-8 sequences.


-   `std::time_put` and `std::time_get` have several flaws:

    1. They assume that the required calendar is the Gregorian calendar, by using `std::tm` for time representation, ignoring the fact that in many countries
    dates may be displayed using different calendars.
    2. They always use the global time zone, not allowing specification of a time zone for formatting -- actually the standard `std::tm` does not include
    a timezone field.
    3. `std::time_get` is not symmetric with `std::time_put`, not allowing parsing of dates and times created with `std::time_put`. This issue is addressed
    in C++0x and in some STL implementations like the Apache standard C++ library.

-   `std::messages` does not provide support for plural forms, making it impossible to correctly localize even simple strings like
    "There are X files in directory".

Also, many features are not really supported by `std::locale` at all: the timezones mentioned above, text boundary analysis, number spelling and many
others. So it is clear that standard C++ locales are very problematic for real-world internationalization and localization.
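The per-character limitation of `std::ctype` listed above is easy to observe with the standard library alone. The sketch below upper-cases a UTF-8 string one `char` at a time using the classic "C" locale; `upper_per_char` is a hypothetical helper name, and the German word is written as explicit UTF-8 bytes.

```cpp
#include <locale>
#include <string>

// Upper-case a string one `char` at a time, which is all that a
// per-character interface allows. The classic "C" locale is used so the
// result does not depend on the environment.
std::string upper_per_char(std::string text)
{
    std::locale const &c_locale = std::locale::classic();
    for (std::string::size_type i = 0; i < text.size(); ++i)
        text[i] = std::toupper(text[i], c_locale);
    return text;
}
```

Called with "grüßen" encoded as UTF-8 (`"gr\xC3\xBC\xC3\x9F" "en"`), it returns `"GR\xC3\xBC\xC3\x9F" "EN"`: the ASCII letters are converted, but the bytes of "ü" and "ß" pass through unchanged, so the result is neither "GRÜSSEN" nor any other correct upper-case form.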


**Why use an ICU wrapper instead of ICU?**

ICU is a very good localization library but it has several serious flaws:

- It is absolutely unfriendly to the C++ developer. It ignores most popular C++ idioms: STL, RTTI, exceptions, etc. Instead
it mostly mimics the Java API.
- It provides support for only one kind of string: UTF-16 strings, while some users may want to use other Unicode encodings.
For example, for XML or HTML processing UTF-8 is much more convenient, and UTF-32 is easier to use. Also there is no support for
"narrow" encodings that are still very popular, like the ISO-8859 family of encodings.

In contrast, Boost.Locale provides direct integration with `iostream`, allowing a more natural way of formatting data. For example:

    cout << "You have "<<as::currency << 134.45 << " at your account at "<<as::datetime << std::time(0) << endl;

**Why is the ICU API not exposed to the user?**

It is true that the whole ICU API is hidden behind opaque pointers and the user has no access to it. This is done for several reasons:

- At some point, better localization tools may be accepted into future C++ standards, and they may not use ICU directly.
- At some point, it should be possible to switch the underlying localization engine to another one, for example to use the native operating
system API or some other toolkit like GLib or Qt that provides similar functionality.
- Not all localization is done within ICU. For example, message formatting uses GNU Gettext message catalogs. In the future more functionality
may be taken from ICU and reimplemented directly in the Boost.Locale library.

**Why use GNU Gettext catalogs for message formatting?**

There are many available localization formats. The most popular so far are: OASIS XLIFF, GNU gettext po/mo files, POSIX catalogs, Qt ts/tm files, Java properties and Windows resources. However, the last three are each popular only in their specific area, and POSIX catalogs are too simple and limited, so there are two quite reasonable options:

1. Standard localization format OASIS XLIFF.
2. GNU Gettext binary catalogs.

The first one generally seems like the more correct localization solution, but it requires XML parsing for loading documents, it is a very complicated
format, and even ICU requires preliminary compilation of it into ICU resource bundles.

On the other hand:

- GNU Gettext binary catalogs have a very simple, robust and yet very useful file format.
- It is so far the most popular and de-facto standard localization format (at least in the Open Source world).
- It has very simple and very powerful support for plural forms.
- It uses the original English text as the key, making the process of internationalization much easier, because at least
one basic translation is always available.
- There are many tools for editing and managing gettext catalogs, like Poedit, kbabel, etc.
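To give an idea of what plural form support involves, gettext catalogs carry a small per-language expression that maps a number to the index of the translation to use. The sketch below writes the commonly published rules for English and Russian as ordinary C++ functions. The function names are hypothetical, and this is not Boost.Locale code, which evaluates the expression stored in the catalog instead of hard-coding it.

```cpp
// English: two forms. Index 0 is used for exactly one item, index 1 otherwise.
int plural_index_en(unsigned long n)
{
    return n != 1 ? 1 : 0;
}

// Russian: three forms, selected by the last one or two digits.
// 1, 21, 31...    -> 0
// 2-4, 22-24...   -> 1
// everything else -> 2 (including 0, 5-20, 11-14)
int plural_index_ru(unsigned long n)
{
    if (n % 10 == 1 && n % 100 != 11)
        return 0;
    if (n % 10 >= 2 && n % 10 <= 4 && (n % 100 < 10 || n % 100 >= 20))
        return 1;
    return 2;
}
```

So "There are X files" needs two translations in English but three in Russian, and the choice for 21 differs from the choice for 11. A message facility without this mechanism can not express that.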

So, even though the GNU Gettext mo catalog format is not an officially approved file format:

- It is the de-facto standard and the most popular one.
- Its implementation is much easier and does not require XML parsing and validation.


**Note:** Boost.Locale does not use any GNU Gettext code; it just
reimplements the tool for reading and using mo-files, getting rid of the current biggest GNU Gettext flaw -- thread safety
when using multiple locales.

**Why is a plain number used for representation of date-time instead of Boost.DateTime date or Boost.DateTime ptime?**

There are several reasons:

1.  A Gregorian date by definition can't be used for representation of locale-independent dates, because not all
    calendars in use are Gregorian.
2.  `ptime` definitely could be used, if it did not have several problems:

    -   It is created in GMT or local time, whereas `time()` gives a representation that is independent of time zone,
        usually GMT time, and only then should it be represented in the time zone that the user requests.

        The timezone is not a property of time itself, but rather a property of time formatting.

    -   `ptime` already defines `operator<<` and `operator>>` for time formatting and parsing.

    -   The existing facets for `ptime` formatting and parsing were not designed in a way that lets the user override their behavior.
        The major formatting and parsing functions are not virtual. This makes it impossible to reimplement the formatting and
        parsing functions of `ptime` unless the developers of the Boost.DateTime library decide to change them.

        Also, the facets of `ptime` are not "correctly" designed in terms of the division between formatting information and
        locale information. Formatting information should be stored within `std::ios_base`, while information about how
        to format according to the locale should be stored in the facet itself.

        The user of the library should not have to create new facets in order to change formatting information such as whether to display only
        the date or both date and time.

Thus, at this point, `ptime` is not supported for formatting localized date and time.

# Appendix

## Glossary

-   **Basic Multilingual Plane (BMP)** -- a part of the _Universal Character Set_ with code points in the range U+0000--U+FFFF. Most commonly
    used UCS characters lie in this plane, including Western, Cyrillic, Hebrew, Thai and Arabic characters and the most common CJK characters. However, there are many
    characters that lie outside of the BMP, and supporting them is absolutely required for correct support of East Asian languages.
-   **Code Point** -- a unique number that represents a "character" in the Universal Character Set. Code points lie in the range 0-0x10FFFF and are usually
    displayed as U+XXXX or U+XXXXXX, where X is a hexadecimal digit.
-   **Collation** -- a definition of the sorting order of text, usually alphabetical. It differs between languages and countries even for the same
    characters.
-   **Encoding** -- a representation of a character set. Some encodings, like UTF-8, are capable of representing the full UCS, and some represent
    only a subset of it: ISO-8859-8 represents only a small subset of about 250 UCS characters.

    Non-Unicode encodings are still very popular. For example, the Latin-1 (or ISO-8859-1) encoding covers most of the characters needed for
    Western European languages and significantly simplifies text processing for applications designed to handle such languages only.

    In Boost.Locale you should provide the octet (`std::string`) encoding as a part of the locale code name, for example `en_US.UTF-8` or `he_IL.cp1255`.

    `UTF-8` is the recommended one.
-   **Facet** -- or `std::locale::facet` -- the base class from which every object that describes a specific locale property is derived. Facets can be
    added to a locale to provide additional culture information.
-   **Formatting** -- representation of various values according to locale preferences. For example, the number 1234.5 (C) should be displayed as
    1,234.5 in the US locale and 1 234,5 in the Russian locale. The date November 1st, 2005 would be represented as 11/01/2005 in the United States, and
    01.11.2005 in Russia. This is an important part of localization, allowing various values to be represented correctly.

    For example: does "You have to bring 134,230 kg of rice on 04/01/2010" mean "134 tons of rice on the 1st of April" or "134 kg 230 g of rice on
    January 4th"? That is quite a difference.
-   **Gettext** -- the GNU localization library used for message formatting. Today it is the de-facto standard localization library in the Open Source
    world. Boost.Locale message formatting is built entirely on Gettext message catalogs.
-   **Locale** -- a set of parameters that define specific preferences for users in different cultures. It is generally defined by language,
    country, variant and encoding, and provides information like collation order, date-time formatting, message formatting, number formatting
    and many others. The `std::locale` class is used in `C++` for representation of _Locale_ information.
-   **Message Formatting** -- representation of the UI in the user's language. Generally, translation of UI strings is done
    using a dictionary provided by the program's translator.
-   **Message Domain** -- in _gettext_ terms, the keyword that represents a message catalog. Usually this is the application name. When _gettext_
    and Boost.Locale search for a specific message catalog, they search the specified path for a file named after the domain.
-   **Normalization** -- Unicode normalization is the process of converting strings to a standard form suitable for text processing and comparison.
    For example, the character "ü" can be represented using a single code point or a combination of the character "u" and the diaeresis "¨". Normalization
    is an important part of Unicode text processing.

    Normalization is not locale dependent, but because it is an important part of Unicode processing it is included in the Boost.Locale library.
-   **UCS-2** -- a fixed width Unicode encoding which is capable of representing code points in the _Basic Multilingual Plane (BMP)_ only.
    It is a legacy encoding and is not recommended for use.
-   **Unicode** -- an industry standard that defines representation and manipulation of text suitable for most languages and countries.
    It should not be confused with the _Universal Character Set_; it is a much wider standard that also defines algorithms like bidirectional
    display order, Arabic shaping, etc.
-   **Universal Character Set (UCS)** -- an international standard that defines a set of characters for many scripts and their _code points_.
-   **UTF-8** -- a variable width Unicode transformation format. Each UCS code point is represented as a sequence of 1 to 4 octets
    that can be easily distinguished. It includes ASCII as a subset. It is the most popular Unicode encoding for web applications, data transfer
    and storage, and it is the de-facto standard encoding for most POSIX operating systems.
-   **UTF-16** -- a variable width Unicode transformation format. Each UCS code point is represented as a sequence of one or two 16-bit words.
    It is a very popular encoding on various platforms: Win32 API, Java, C#, Python, etc. However, it is frequently confused with _UCS-2_,
    a fixed width limited encoding which is suitable for representation of characters in the _Basic Multilingual Plane (BMP)_ only.

    This encoding is used for `std::wstring` under the Win32 platform, where `sizeof(wchar_t)==2`.
-   **UTF-32/UCS-4** -- a fixed width Unicode transformation format, where each code point is represented as a single 32-bit word. It has
    the advantage of a simple code point representation but is quite wasteful in terms of memory usage. It is used for `std::wstring` encoding
    on most POSIX platforms, where `sizeof(wchar_t)==4`.
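The "1 to 4 octets" in the UTF-8 entry above can be made concrete with a short encoder in standard C++. This is a minimal sketch, not code from Boost.Locale; `utf8_encode` is a hypothetical name, and the input is assumed to be a valid code point (no surrogates, not above 0x10FFFF).

```cpp
#include <string>

// Encode one code point as UTF-8.
// Assumption: cp is a valid code point; no validation is performed.
std::string utf8_encode(unsigned long cp)
{
    std::string out;
    if (cp < 0x80) {                       // 1 octet: ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 octets
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 octets: rest of the BMP
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 octets: outside the BMP
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, a comma (U+002C) is one octet, while U+2002 (EN SPACE) becomes the three octets `E2 80 82`. Any interface that hands back a single `char` for a character therefore can not carry such a value in a UTF-8 locale.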

## Reference

The full, Doxygen-generated reference can be found here:

- [Main Page](index.html)
- [Modules](modules.html)
- [Classes](annotated.html)
- [Namespaces](namespaces.html)
