(Under development for Boost)
Boost.Locale is a library that brings high-quality localization facilities to C++ in an idiomatic way. It uses std::locale and std::locale facets to provide localization that is transparent to the user and fits naturally into C++.
C++ has quite a good base for localization in its existing locale facets: std::num_put, std::ctype, std::collate and so on. But they are very limited and sometimes broken by design. Localization support varies between operating systems, compilers and standard libraries, and is frequently incompatible between them.
On the other hand, there is the ICU library: well debugged, high quality and widely used, it provides all of the goodies. However, it has a very old API that mimics Java behavior, it completely ignores the STL, and it provides a useful API only for UTF-16 encoded text, ignoring other popular Unicode encodings such as UTF-8 and UTF-32 as well as limited but still popular national character sets such as Latin1.
Boost.Locale provides the natural glue between the C++ locale framework, iostreams and the powerful ICU library in the following areas:
char, wchar_t and C++0x char16_t, char32_t strings and streams.
The C++ standard library provides a simple and powerful way to supply locale-specific information. This is done via the std::locale class, a container that holds all required information about a specific culture, such as number formatting patterns, date and time formatting, currency, case conversion and so on.
All this information is provided by facets: special classes derived from the std::locale::facet base class. Such facets are packed into the std::locale class and allow you to provide arbitrary information about the locale. The std::locale class keeps reference counters on installed facets and can be copied efficiently.
Each facet that was installed into a std::locale object can be fetched using the std::use_facet function. For example, the std::ctype<Char> facet provides rules for case conversion, so you can convert a character to upper case as follows:
std::ctype<char> const &ctype_facet = std::use_facet<std::ctype<char> >(some_locale);
char upper_a = ctype_facet.toupper('a');
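As a minimal, self-contained sketch of the same lookup, the helper below fetches std::ctype<char> from whatever locale it is given. The name upper_in is hypothetical; it can be tried with the classic "C" locale, which is always available.

```cpp
#include <locale>

// Hypothetical helper: convert one character to upper case using the
// std::ctype<char> facet installed in the given locale.
char upper_in(std::locale const &loc, char c)
{
    std::ctype<char> const &facet = std::use_facet<std::ctype<char> >(loc);
    return facet.toupper(c);
}
```

For instance, upper_in(std::locale::classic(), 'a') yields 'A', while characters without an upper-case form are returned unchanged.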
A locale can be imbued into an iostream so that it formats information according to the locale's conventions:
cout.imbue(std::locale("en_US.UTF-8"));
cout << 1345.45 << endl;
cout.imbue(std::locale("ru_RU.UTF-8"));
cout << 1345.45 << endl;
Would display (the exact separators depend on the platform's locale data):
1,345.45
1.345,45
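The output of the example above depends on which named locales are installed on the system. As a minimal sketch that does not rely on any named locale, the following code installs a hand-written std::numpunct facet into the classic locale to get English-style digit grouping. The names grouped_punct and format_grouped are hypothetical and exist only for this illustration.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: groups digits by three with ',' as the separator.
class grouped_punct : public std::numpunct<char> {
protected:
    char do_thousands_sep() const { return ','; }
    std::string do_grouping() const { return "\3"; }
};

// Format a number with a stream that has the custom facet imbued.
std::string format_grouped(double value)
{
    std::ostringstream out;
    out.imbue(std::locale(std::locale::classic(), new grouped_punct));
    out << value;
    return out.str();
}
```

The stream itself is unchanged; only the facet found in its locale decides how the digits are grouped.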
You can also create your own facets and install them into an existing locale object. For example:
class measure : public std::locale::facet {
public:
typedef enum { inches, ... } measure_type;
static std::locale::id id; // every facet needs its own id so that std::use_facet can find it
measure(measure_type m,size_t refs=0);
double from_metric(double value) const;
std::string name() const;
...
};
And now you can simply install this facet into a locale:
std::locale::global(std::locale(std::locale("en_US.UTF-8"),new measure(measure::inches)));
/// Create default locale built from en_US locale and add the measure facet.
Now you can print a distance according to the correct locale:
void print_distance(std::ostream &out,double value)
{
measure const &m = std::use_facet<measure>(out.getloc());
// Fetch locale information from stream
out << m.from_metric(value) << " " << m.name();
}
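The fragments above leave out some details. The following is one possible complete version, offered as a sketch only: the set of units, the conversion factor, the unit names and the helper format_distance are assumptions made for this illustration.

```cpp
#include <cstddef>
#include <locale>
#include <sstream>
#include <string>

// A complete custom facet. Units and conversion are illustrative assumptions.
class measure : public std::locale::facet {
public:
    enum measure_type { meters, inches };
    static std::locale::id id; // lets std::use_facet locate this facet

    explicit measure(measure_type m, std::size_t refs = 0)
        : std::locale::facet(refs), type_(m) {}

    double from_metric(double value) const
    {
        return type_ == inches ? value / 0.0254 : value;
    }
    std::string name() const { return type_ == inches ? "in" : "m"; }

private:
    measure_type type_;
};

std::locale::id measure::id;

// Format a distance given in meters using the facet found in the locale.
std::string format_distance(std::locale const &loc, double meters_value)
{
    std::ostringstream out;
    out.imbue(loc);
    measure const &m = std::use_facet<measure>(out.getloc());
    out << m.from_metric(meters_value) << " " << m.name();
    return out.str();
}
```

A locale carrying the facet can be built with std::locale(std::locale::classic(), new measure(measure::inches)); the locale takes ownership of the facet and deletes it when the last copy goes away.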
This technique was adopted by the Boost.Locale library in order to provide powerful and correct localization. However, instead of using the very limited standard library facets, it provides its own facets that use ICU under the hood, which makes them much more powerful.
Each locale is defined by a locale identifier that consists of a mandatory part, the language, and optional parts: country, variant, keywords and the character encoding of std::string.
The class boost::locale::generator provides a tool for generating the locales we need. The simplest way to use the generator is to create a locale and set it as the global one:
#include <boost/locale.hpp>
using namespace boost::locale;
int main()
{
generator gen;
// Create locale generator
std::locale::global(gen(""));
// Set system default global locale
}
Of course we can specify locale manually, using default system encoding:
std::locale loc = gen("en_US");
// Use English, United States locale
Or specify the locale and the encoding separately, or use a POSIX locale specifier that includes both:
std::locale loc = gen("ja_JP","UTF-8");
// Separation of locale and encoding
std::locale loc = gen("ja_JP.UTF-8");
// POSIX locale name with encoding
When you generate more than one locale, you may specify the default encoding used for std::string by calling the octet_encoding member function of generator. For example:
generator gen;
gen.octet_encoding("UTF-8");
std::locale en=gen("en_US");
std::locale ja=gen("ja_JP");
Note: Even if your application uses wide strings everywhere, it is recommended to specify the 8-bit encoding that would be used for all wide stream IO operations such as wcout or wfstream.
Tip: Prefer the UTF-8 Unicode encoding over 8-bit encodings such as the ISO-8859-X ones.
By default, the locale is generated for all supported categories and character types. However, if your application uses strictly 8-bit encodings, uses only wide character encodings, or uses only specific parts of the localization tools, you can limit facet generation to specific categories and character types by calling the categories and characters member functions of the generator class.
For example:
generator gen;
gen.characters(wchar_t_facet);
gen.categories(collation_facet | formatting_facet);
std::locale::global(gen("de_DE.UTF-8"));
Boost.Locale provides a collator class, derived from std::collate, that extends it with support for comparison levels: primary (the default one), secondary, tertiary, quaternary and identical. They can be approximately defined as:
There are two ways of using the collator facet: directly, by calling its member functions compare, transform and hash, or indirectly, by using the comparator template class in STL algorithms.
For example:
wstring a=L"Façade", b=L"facade";
bool eq = 0 == use_facet<collator<wchar_t> >(loc).compare(collator_base::secondary,a,b);
wcout << a <<L" and "<<b<<L" are " << (eq ? L"identical" : L"different")<<endl;
std::locale is designed to be usable as a comparison class in STL collections and algorithms. To get similar functionality with the addition of comparison levels, use the comparator class.
std::map<std::string,std::string,comparator<char,collator_base::secondary> > strings;
// Now strings uses default system locale for string comparison
You can also set specific locale or level when creating and using comparator
class:
comparator<char> comp(some_locale,some_level);
std::map<std::string,std::string,comparator<char> > strings(comp);
There is a set of functions that perform basic string conversion operations: upper, lower and title case conversions, case folding and Unicode normalization. The functions are called to_upper, to_lower, to_title, fold_case and normalize.
You may notice that the Boost.StringAlgo library already has to_upper and to_lower functions. The difference is that the Boost.Locale functions operate on the entire string instead of performing incorrect character-by-character conversions.
For example:
std::wstring gruben = L"grüßen";
std::wcout << boost::algorithm::to_upper_copy(gruben) << " " << boost::locale::to_upper(gruben) << std::endl;
Would give in output:
GRÜßEN GRÜSSEN
In the first case the letter “ß” was not converted correctly to a double S because of a limitation of the std::ctype facet.
Notes:
normalize operates only on Unicode encoded strings, i.e. UTF-8, UTF-16 and UTF-32 according to the character width. So be careful when using non-UTF encodings in the program: they may be treated incorrectly.
fold_case is generally a locale-independent operation; however, it receives a locale as a parameter in order to determine the 8-bit encoding.
All formatting and parsing is performed via the iostream STL library. Each one of the above information types is represented as a number. The formatting information is set using iostream manipulators. All manipulators are placed in the boost::locale::as namespace.
For example:
cout << as::currency << 123.45 << endl;
// display 123.45 in local currency representation.
cin >> as::currency >> x ;
// Parse currency representation and store it in x
There is a special manipulator, as::posix, that unsets locale-specific settings and returns to the ordinary, default iostream formatting and parsing methods. Please note that such formats may still be localized by the default std::num_put and std::num_get facets.
These are the manipulators for number formatting:
as::number: format a number according to local specifications; it takes into account various std::ios_base flags such as scientific format and precision.
as::percent: format a number in “percent” format. For example:
cout << as::percent << 0.25 <<endl;
Would create an output that may look like this:
25%
as::spellout: spell the number. For example, under an English locale 103 may be displayed as “one hundred three”. Note: not all locales provide rules for spelling numbers; in such a case the number is displayed in decimal format.
as::ordinal: display the order of an element. For example, “2” would be displayed as “2nd” under an English locale. As in the above case, not all locales provide ordinal rules.
These are manipulators for currency formatting:
as::currency: set format to currency mode.
as::currency_iso: change the currency format to international, like “USD” instead of “$”. This flag is supported when using ICU 4.2 and above.
as::currency_national: change the currency format to national, like “$”.
as::currency_default: return to the default currency format (national).
Note: the as::currency_XYZ manipulators do not affect general formatting, only the format of currency; it is necessary to use both manipulators in order to use a non-default format.
Dates and times are represented as POSIX time. When date-time formatting is turned on in the iostream
, each number is treated as POSIX time. The number may be integer, or double.
There are four major manipulators for date and time formatting:
as::date: display date only
as::time: display time only
as::datetime: display both date and time
as::ftime: parametrized manipulator that allows specification of time in the format used by the strftime function. Note: not all formatting flags of strftime are supported.
For example:
double now=time(0);
cout << "Today is "<< as::date << now << " and tomorrow is " << now+24*3600 << endl;
cout << "Current time is "<< as::time << now << endl;
cout << "The current weekday is "<< as::ftime("%A") << now << endl;
More fine-grained control of date-time formatting is also available:
as::time_default, as::time_short, as::time_medium, as::time_long, as::time_full: change time formatting.
as::date_default, as::date_short, as::date_medium, as::date_long, as::date_full: change date formatting.
These manipulators, when used together with the as::date, as::time and as::datetime manipulators, change the date-time representation. The default format is medium.
By default, the date and time are shown in the local time zone; this behavior may be changed using the following manipulators:
as::gmt: display date and time in GMT.
as::local_time: display in local time format (default).
as::time_zone: parametrized manipulator that sets the time zone ID for date-time formatting and parsing. It receives as a parameter a string that represents a time zone ID, or a boost::locale::time_zone object.
For example:
double now=time(0);
cout << as::datetime << as::local_time << "Local time is: "<< now << endl;
cout << as::gmt << "GMT Time is: "<< now <<endl;
cout << as::time_zone("EST") << "Eastern Standard Time is: "<< now <<endl;
The list of all available time zone IDs can be received as set<string>
using all_zones
static member function of boost::locale::time_zone
class.
The following strftime flags are supported:
%a: Abbreviated weekday (Sun.)
%A: Full weekday (Sunday)
%b: Abbreviated month (Jan.)
%B: Full month (January)
%c: Locale date-time format. Note: prefer using as::datetime
%d: Day of Month [01,31]
%e: Day of Month [1,31]
%h: Same as %b
%H: 24 clock hour [00,23]
%I: 12 clock hour [01,12]
%j: Day of year [1,366]
%m: Month [01,12]
%M: Minute [00,59]
%n: New Line
%p: AM/PM in locale representation
%r: Time with AM/PM, same as %I:%M:%S %p
%R: Same as %H:%M
%S: Second [00,61]
%t: Tab character
%T: Same as %H:%M:%S
%x: Local date representation. Note: prefer using as::date
%X: Local time representation. Note: prefer using as::time
%y: Year [00,99]
%Y: 4 digits year. (2009)
%Z: Time Zone
%%: Percent symbol
Unsupported strftime flags are: %C, %u, %U, %V, %w, %W. The O and E modifiers are not supported either.
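To get a feeling for these flags without Boost.Locale, the following sketch formats a POSIX time with the standard C function std::strftime. It only illustrates the flag syntax that as::ftime borrows; the helper name format_gmt is hypothetical, and the padding and locale behavior of Boost.Locale's own implementation may differ from the C library.

```cpp
#include <cstddef>
#include <ctime>
#include <string>

// Hypothetical helper: format a POSIX time stamp in GMT with a strftime pattern.
std::string format_gmt(std::time_t t, char const *pattern)
{
    std::tm const *parts = std::gmtime(&t);
    if(!parts)
        return std::string();
    char buf[128];
    std::size_t n = std::strftime(buf, sizeof(buf), pattern, parts);
    return std::string(buf, n);
}
```

For example, format_gmt(0, "%Y-%m-%d %H:%M:%S") describes the start of the POSIX epoch.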
General recommendations:
as::ftime.
All formatting information is stored in the stream class by using the xalloc, pword and register_callback member functions of std::ios_base. All the information is stored and managed by a special object bound to the iostream; the manipulators just change its state.
When a number is written to the stream or read from it, a custom Boost.Locale facet accesses this object and checks the required formatting information. It then creates a special object that actually formats the number and caches it in the iostream. The next time a number is written to the stream, the same formatter is used, unless some flags have changed and the formatter object is no longer valid.
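The general mechanism of keeping formatting state inside the stream can be sketched with the standard library alone. The example below is not Boost.Locale's implementation: it uses xalloc with iword (a simpler relative of pword) to store one flag per stream, and all names in it (percent_slot, as_percent, as_plain, ratio, show) are hypothetical.

```cpp
#include <ios>
#include <ostream>
#include <sstream>
#include <string>

// One slot index shared by all streams, reserved once via xalloc().
inline int percent_slot()
{
    static int const index = std::ios_base::xalloc();
    return index;
}

// Manipulators only change the state stored in the stream.
inline std::ostream &as_percent(std::ostream &out)
{
    out.iword(percent_slot()) = 1;
    return out;
}
inline std::ostream &as_plain(std::ostream &out)
{
    out.iword(percent_slot()) = 0;
    return out;
}

// A wrapper type whose output depends on the state found in the stream.
struct ratio {
    double value;
};

inline std::ostream &operator<<(std::ostream &out, ratio const &r)
{
    if(out.iword(percent_slot()) == 1)
        return out << r.value * 100 << "%";
    return out << r.value;
}

// Convenience function that shows the effect of each manipulator.
inline std::string show(ratio r, bool percent)
{
    std::ostringstream out;
    out << (percent ? as_percent : as_plain) << r;
    return out.str();
}
```

The state travels with the stream, so the same value can be written differently to two streams without any global settings.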
Message formatting is probably the most important part of localization: making your application speak the user's language.
Boost.Locale uses the GNU Gettext localization model. It is recommended to read the general GNU Gettext documentation, as much of it is out of the scope of this document.
The model is as follows:
First of all, our application foo is prepared for localization by calling the translate function for each message used in the user interface.
For example:
cout << "Hello World" << endl;
Is converted to
cout << translate("Hello World") << endl;
Then all messages are extracted from source code and a special foo.po
file is generated that contains all original English strings.
...
msgid "Hello World"
msgstr ""
...
The foo.po file is translated for the supported target locales: for example de.po, ar.po, en_CA.po, he.po.
...
msgid "Hello World"
msgstr "שלום עולם"
And then compiled to the binary mo format and stored in the following file structure:
de
de/LC_MESSAGES
de/LC_MESSAGES/foo.mo
en_CA/
en_CA/LC_MESSAGES
en_CA/LC_MESSAGES/foo.mo
...
When the application starts, it loads the required dictionaries. When the translate function is called and the message is written to an output stream, a dictionary lookup is performed and the localized message is written out.
All the dictionaries are loaded by the generator class. So, in order to use localized strings in the application you need to specify the following:
This is done by calling the following member functions of the generator class:
void add_messages_path(std::string const &path)
—add the root path where the dictionaries are placed.
For example: if the dictionary is placed at /usr/share/locale/ar/LC_MESSAGES/foo.mo
, then path should be /usr/share/locale
.
void add_messages_domain(std::string const &domain)
—add the domain (name) of the application. In the above case it would be “foo”.
At least one domain and one path should be specified in order to load dictionaries.
For example, our first fully localized program:
#include <boost/locale.hpp>
#include <iostream>
using namespace std;
using namespace boost::locale;
int main()
{
generator gen;
// Specify location of dictionaries
gen.add_messages_path(".");
gen.add_messages_domain("hello");
// Generate locales and imbue them to iostream
locale::global(gen(""));
cout.imbue(locale());
// Display a message using current system locale
cout << translate("Hello World") << endl;
}
These are the basic translation functions:
message translate(char const *msg): create a localized message from id msg. msg is not copied.
message translate(std::string const &msg): create a localized message from id msg. msg is copied.
message translate(char const *single,char const *plural,int n): create a localized plural message with single and plural forms for number n. Strings single and plural are not copied.
message translate(std::string const &single,std::string const &plural,int n): create a localized plural message with single and plural forms for number n. Strings single and plural are copied.
These functions return a special proxy object of type message. It holds all the information required for string formatting. When this object is written to an output iostream, it performs a dictionary lookup of the id using the locale imbued in the iostream. If the message is found in the dictionary, it is written to the output stream; otherwise the original string is written to the stream.
Notes:
message can be implicitly converted to each of the supported string types (i.e. std::string, std::wstring etc.) using the global locale:
std::wstring msg = translate("Do you want to open the file?");
message can be explicitly converted to a string using the str<CharType> member function with a specific locale.
std::wstring msg = translate("Do you want to open the file?").str<wchar_t>(some_locale);
This allows postponing translation of the message to the place where the translation is actually needed, even to different locale targets.
std::ofstream en,ja,he,de,ar;
std::wfstream w_ar;
// Send single message to multiple streams
void send_to_all(message const &msg)
{
en << msg;
ja << msg;
he << msg;
de << msg;
ar << msg;
w_ar << msg;
}
int main()
{
...
send_to_all(translate("Hello World"));
}
GNU Gettext catalogs have simple, robust and yet powerful plural forms support. It is recommended to read the original GNU documentation on this topic.
Let’s try to solve a simple problem: displaying a message to the user:
if(files == 1)
cout << translate("You have 1 file in the directory") << endl;
else
cout << format(translate("You have {1} files in the directory")) % files << endl;
This quite simple task becomes quite complicated when we deal with languages other than English. Many languages have more than two plural forms. For example, in Hebrew there are special forms for single, double, plural, and plural above 10. They can’t be distinguished by the simple rule “n is 1 or not”.
The correct solution is:
cout << format(translate("You have 1 file in the directory",
"You have {1} files in the directory",files)) % files << endl;
Here translate receives the singular and plural forms of the original string and the number it should be formatted for. On the other side, a special entry in the dictionary specifies the rule for choosing the correct plural form in the specific language. For example, the Slavic language family has 3 plural forms, which can be chosen using the following equation:
plural=n%10==1 && n%100!=11 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2;
Such an equation is written in the dictionary and is evaluated during translation to supply the correct form. For more detailed information please refer to GNU Gettext: 11.2.6 Additional functions for plural forms.
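To make the rule concrete, the equation above can be transcribed directly into a C++ function. This is only an illustration of how the expression evaluates; the function name slavic_plural_form is hypothetical, and at run time the rule is read from the dictionary rather than compiled into the program.

```cpp
// Index of the plural form (0, 1 or 2) chosen for the number n,
// transcribed from the plural= expression quoted above.
int slavic_plural_form(unsigned long n)
{
    if(n % 10 == 1 && n % 100 != 11)
        return 0;
    if(n % 10 >= 2 && n % 10 <= 4 && (n % 100 < 10 || n % 100 >= 20))
        return 1;
    return 2;
}
```

So 1, 21 and 101 select form 0; 2, 4 and 22 select form 1; and 0, 5, 11 and 12 select form 2.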
The GNU Gettext model assumes that the same source messages are translated to exactly the same localized messages, but this may be wrong. For example, in German a button label “open” is translated to “öffnen” in the context of opening a file, or to “aufbauen” in the context of opening an internet connection.
In such cases it is useful to add some context information to the original string by adding a comment.
button->setLabel(translate("#File#open"));
The comment is placed between the first hash symbol ‘#’ and the one that follows it. The comment is always extracted from the original string and not displayed; however, it is part of the string identification. The translator should discard such a comment and translate only the “open” string.
For example, this is how the po file is expected to look:
msgid "#File#open"
msgstr "öffnen"
msgid "#Internet Connection#open"
msgstr "aufbauen"
In order to insert ‘#’ as the first symbol you may just use a double hash, for example:
cout<< translate("$ - Dollar symbol") << endl
<< translate("## - Hash symbol") << endl;
Note: Hash based comments are extension of the GNU Gettext library.
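The rules for hash comments can be sketched as a small function that computes the text shown when no translation is found. This is not Boost.Locale's code: the name display_text is hypothetical, and the handling of a leading double hash (shown as a single ‘#’) and of an unterminated comment (left untouched) are assumptions based on the description above.

```cpp
#include <string>

// Hypothetical helper: the text to display for an untranslated message id.
std::string display_text(std::string const &id)
{
    if(id.empty() || id[0] != '#')
        return id;                  // no comment at all
    if(id.size() > 1 && id[1] == '#')
        return id.substr(1);        // assumption: "##" stands for a literal '#'
    std::string::size_type end = id.find('#', 1);
    if(end == std::string::npos)
        return id;                  // assumption: unterminated comment is kept
    return id.substr(end + 1);      // drop "#comment#"
}
```

The full id, including the comment, would still serve as the dictionary key; only the displayed fallback text is shortened.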
In some cases it is useful to work with multiple domains. For example, if an application consists of several independent modules, it may have several domains: if it consists of the modules “foo” and “bar”, it is possible to specify which dictionary should be used.
There are two ways of using non-default domains:
When working with iostream
, it is possible to use parametrized manipulator as::domain(std::string const &)
that allows switching domains in streams:
cout << as::domain("foo") << translate("Hello") << as::domain("bar") << translate("Hello");
// First translation is taken from dictionary foo and other from dictionary bar
It is possible to specify domain explicitly when converting a message
object to string:
std::wstring foo_msg = translate("Hello World").str<wchar_t>("foo");
std::wstring bar_msg = translate("Hello World").str<wchar_t>("bar");
There are many tools that allow you to extract messages from the source code to .po
file format. The most popular and “native” tool is xgettext
which is installed by default on most Unix systems and freely downloadable for Windows.
For example, we have a source file called dir.cpp that prints:
cout << format(translate("Listing of catalog {1}:")) % file_name << endl;
cout << format(translate("Catalog {1} contains 1 file","Catalog {1} contains {2,num} files",files_no))
% file_name % files_no << endl;
Now we run:
xgettext --keyword=translate:1,1t --keyword=translate:1,2,3t dir.cpp
And a file called messages.po is created that looks approximately like this:
#: dir.cpp:1
msgid "Listing of catalog {1}:"
msgstr ""
#: dir.cpp:2
msgid "Catalog {1} contains 1 file"
msgid_plural "Catalog {1} contains {2,num} files"
msgstr[0] ""
msgstr[1] ""
This file can be given to a translator to adapt it to a specific language.
We used the --keyword parameter of xgettext in order to make it suitable for extracting messages from source localized with Boost.Locale: it searches for translate() function calls instead of the default gettext() and ngettext() ones. The first parameter, --keyword=translate:1,1t, provides the template for a basic message: a translate function called with 1 argument (1t), whose first argument is taken as the key. The second one, --keyword=translate:1,2,3t, is used for plural forms. It tells xgettext to look for translate() function calls with 3 parameters (3t) and take the 1st and 2nd parameters as keys.
Do I need GNU Gettext to use Boost.Locale?
Boost.Locale provides a run-time environment to load and use GNU Gettext message catalogs, but it does not provide tools for generation, translation, compilation and management of these catalogs. Boost.Locale only reimplements GNU Gettext libintl.
You would probably need:
Is there any reason to prefer Boost.Locale implementation to original GNU Gettext runtime library? In any case I would probably need some of GNU tools.
There are two important differences between GNU Gettext runtime library and Boost.Locale implementation:
Boost.Locale provides to_utf
and from_utf
functions placed in boost::locale::conv
namespace. They are simple functions to convert string to and from UTF–8/16/32 strings and strings using other encodings.
For example:
std::string utf8_string = to_utf<char>(latin1_string,"Latin1");
std::wstring wide_string = to_utf<wchar_t>(latin1_string,"Latin1");
std::string latin1_string = from_utf(wide_string,"Latin1");
These functions may take an explicit encoding name like “Latin1” or “ISO-8859-8”, or take a std::locale as a parameter and fetch this information from it. They also receive a policy parameter that tells them how to behave if the conversion can’t be performed (an illegal or unsupported character is found). By default these functions skip all illegal characters and try to do the best they can; however, it is possible to ask them to throw a conversion_error exception by passing the stop flag:
std::wstring s=to_utf<wchar_t>("\xFF\xFF","UTF-8",stop);
// Throws because this string is illegal in UTF-8
Boost.Locale provides stream codepage conversion facets based on std::codecvt
facet. This allows converting between wide characters encoding and 8-bit encodings like UTF–8, ISO–8859 or Shift-JIS encodings.
Most compilers provide such facets, but:
he_IL.CP1255 locale even when he_IL locale is available.
Thus Boost.Locale provides an option to generate code-page conversion facets for use with Boost.Iostreams filters or std::wfstream. For example:
std::locale loc= generator().get("he_IL.UTF-8");
std::wofstream file;
file.imbue(loc);
file.open("hello.txt");
file << L"שלום!" << endl;
Would create file hello.txt
encoded as UTF–8 with “שלום!” (shalom) in it.
Important Note
Boost.Locale codepage conversion facets do not support UTF–16 text outside of BMP (i.e. it supports only UCS–2). So if you want to provide full Unicode support do not use wide strings under platforms where sizeof(wchar_t)==2
(i.e. Windows) or do not use these facets for character I/O.
Limitations:
The standard does not provide any useful information about the std::mbstate_t type that should be used for saving intermediate code-page conversion states. It leaves the definition to the compiler implementation, making it impossible to reimplement std::codecvt<wchar_t,char,mbstate_t> for any stateful encoding. Thus the Boost.Locale codecvt facet implementation may be used only with stateless encodings like UTF-8, ISO-8859 and Shift-JIS, but not with stateful encodings like UTF-7 or SCSU.
The standard requires that code page translation can be done by translating each wide character independently. This is not a problem for most fixed-width encodings like the ISO-8859 family, and it is not a problem when wchar_t represents a single code point, i.e. sizeof(wchar_t)==4, which is true for most POSIX platforms.
But under Windows, sizeof(wchar_t)==2, so a wchar_t can represent only a single character in the Basic Multilingual Plane (BMP), while characters with code points above 0xFFFF are represented using surrogate pairs. Because the conversion must be stateless (see the limitation above) and wchar_t can’t represent every Unicode character, only the UCS-2 encoding is supported: codecvt would fail on surrogate characters of UTF-16 strings.
The same holds for C++0x char16_t based streams.
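The following sketch shows why a single 16-bit unit is not enough outside the BMP. It encodes one code point as UTF-16 using the standard surrogate-pair arithmetic; the function name to_utf16 is hypothetical, and the input is assumed to be a valid code point (at most 0x10FFFF and not itself a surrogate).

```cpp
#include <vector>

// Encode one Unicode code point as a sequence of 16-bit UTF-16 units.
std::vector<unsigned short> to_utf16(unsigned long cp)
{
    std::vector<unsigned short> units;
    if(cp < 0x10000) {
        // BMP: one unit is enough
        units.push_back(static_cast<unsigned short>(cp));
    } else {
        // Above the BMP: split the remaining 20 bits into a surrogate pair
        cp -= 0x10000;
        units.push_back(static_cast<unsigned short>(0xD800 + (cp >> 10)));
        units.push_back(static_cast<unsigned short>(0xDC00 + (cp & 0x3FF)));
    }
    return units;
}
```

A converter that must handle each 16-bit unit independently cannot produce or consume the second case, which is the UCS-2 restriction described above.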
So, if your system supports required encoding, it would be better to use it directly instead of Boost.Locale facet.
General Recommendation: Prefer Unicode UTF–8 encoding for char
based strings and files in your application.
Boost.Locale provides a boundary analysis tool that allows splitting text into characters, words and sentences, and finding appropriate places for line breaks.
Note: Characters are not equivalent to Unicode code points. For example, the Hebrew word Shalom, “שָלוֹם”, consists of 4 characters and 6 Unicode code points, where two code points are used for vowels (diacritical marks).
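The code point count in the note above can be checked with a few lines of standard C++. The sketch below counts code points in a UTF-8 string by skipping continuation bytes; it assumes well-formed input, and the name count_code_points is hypothetical. Counting user-perceived characters is the harder problem that needs real boundary analysis.

```cpp
#include <cstddef>
#include <string>

// Count Unicode code points in a well-formed UTF-8 string:
// every byte that is not a continuation byte (10xxxxxx) starts a code point.
std::size_t count_code_points(std::string const &utf8)
{
    std::size_t n = 0;
    for(std::string::size_type i = 0; i < utf8.size(); ++i) {
        unsigned char byte = static_cast<unsigned char>(utf8[i]);
        if((byte & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For the word Shalom written with its two vowel marks, the UTF-8 text is 12 bytes long and the function reports 6 code points, although a reader sees 4 characters.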
Boost.Locale provides 3 major classes that are used for boundary analysis:
boost::locale::boundary::mapping: the special map that holds the boundary points of the text.
boost::locale::boundary::token_iterator: the iterator that returns chunks of text that were split by text boundaries.
boost::locale::boundary::break_iterator: the iterator that returns an iterator to the original text.
In order to perform boundary analysis, we first create a boundary mapping for the text we want to work with.
using namespace boost::locale::boundary;
std::string text="To be, or not to be?";
// Create mapping of text for token iterator using default locale.
mapping<token_iterator<std::string::const_iterator> > map(word,text.begin(),text.end());
// Print all "word" -- chunks of word boundary
for(token_iterator<std::string::const_iterator> it=map.begin(),e=map.end();it!=e;++it)
std::cout <<"`"<< * it << "'"<< std::endl;
Would print the list: “To”, “ ”, “be”, “,”, “ ”, “or”, “ ”, “not”, “ ”, “to”, “ ”, “be”, “?”.
You can also provide filters for better selection of the text chunks or boundaries you are interested in. For example:
map.mask(word_letters);
// Tell newly created iterators to select words that contain letters only.
for(token_iterator<std::string::const_iterator> it=map.begin(),e=map.end();it!=e;++it)
std::cout <<"`"<< * it << "'"<< std::endl;
Would now print only: “To”, “be”, “or”, “not”, “to”, “be” words ignoring all non-words—like punctuation.
A break iterator has a different role: instead of returning text chunks, it returns the underlying iterator used for text source iteration. For example, you can select the first two sentences as follows:
using namespace boost::locale::boundary;
std::string const text="First sentence. Second sentence! Third one?";
// Create a sentence boundary mapping and set the mask of boundaries
// to select sentence terminators only, like "?", "." ignoring new lines.
typedef break_iterator<std::string::const_iterator> iterator;
mapping<iterator> map(sentence,text.begin(),text.end(),sentence_term);
iterator p=map.begin();
/// Advance p by two steps, make sure p is still valid;
for(int i=0;i<2 && p!=map.end();i++)
++p;
std::cout << "First two sentences are " << std::string(text.begin(),*p) << std::endl;
Would print: “First sentence. Second sentence!”
The iostream manipulators are very useful, but when we create messages for the user, sometimes we need something like good old printf or boost::format.
Unfortunately boost::format has several limitations in the context of localization:
ostream locale.
printf like syntax is very limited for formatting of complex localized data, not allowing formatting of dates, time or currency
Thus a new class, boost::locale::format, was introduced. For example:
wcout << wformat(L"Today {1,date} I would meet {2} at home") % time(0) % name <<endl;
Each format specifier is enclosed within {} brackets. The parameters within a specifier are separated by a comma “,” and each may have an additional option after the symbol ‘=’. The option may be simple ASCII text or localized text quoted with single quotes “’”. If a quote should be inserted into the text, it may be represented by doubling it.
For example, format string:
"Ms. {1} had shown at {2,ftime='%I o''clock'} at home. Exact time is {2,time=full}"
The syntax can be described with following grammar:
format : '{' parameters '}'
parameters: parameter | parameter ',' parameters;
parameter : key ["=" value] ;
key : [0-9a-zA-Z<>] ;
value : ascii-string-excluding-"}"-and="," | local-string ;
local-string : quoted-text | quoted-text local-string;
quoted-text : '[^']*' ;
The following format key-value pairs are supported:
[0-9]+: digits, the index of the formatted parameter; mandatory key.
num or number: format a number. Optional values are:
  hex: display hexadecimal number
  oct: display in octal format
  sci or scientific: display in scientific format
  fix or fixed: display in fixed format
  number=sci
cur or currency: format currency. Optional values are:
  iso: display using ISO currency symbol.
  nat or national: display using national currency symbol.
per or percent: format percent value.
date, time, datetime or dt: format date, time or date and time. Optional values are:
  s or short: display in short format
  m or medium: display in medium format.
  l or long: display in long format.
  f or full: display in full format.
ftime with string (quoted) parameter: display as with strftime; see the as::ftime manipulator
spell or spellout: spell the number.
ord or ordinal: format ordinal number (1st, 2nd… etc)
left or <: align to left.
right or >: align to right.
width or w: set field width (requires parameter).
precision or p: set precision (requires parameter).
locale: with parameter; switch locale for the current operation. This command generates a locale with formatting facets, giving more fine-grained control of formatting. For example:
cout << format("This article was published at {1,date=l} (Gregorian) {1,locale=he_IL@calendar=hebrew,date=l} (Hebrew)") % date;
The constructor of the format class may receive an object of type message, allowing easier integration with localized messages. For example:
cout<< format(translate("Adding {1} to {2}, we get {3}")) % a % b % (a+b) << endl;
The formatted string can be fetched directly using the str(std::locale const &loc=std::locale()) member function. For example:
std::wstring de = (wformat(translate("Adding {1} to {2}, we get {3}")) % a % b % (a+b)).str(de_locale);
std::wstring fr = (wformat(translate("Adding {1} to {2}, we get {3}")) % a % b % (a+b)).str(fr_locale);
Important Note:
There is one significant difference between boost::format and boost::locale::format: Boost.Locale format converts its parameters only when it is written to an ostream or when the str() member function is called. Until then it only stores references to the objects that can be written to a stream.
This is generally not a problem when all operations are done in one statement, such as:
cout << format("Adding {1} to {2}, we get {3}") % a % b % (a+b);
This works because the temporary value of (a+b) exists until the format is actually written to the stream. But the following code is wrong:
format fmt("Adding {1} to {2}, we get {3}");
fmt % a;
fmt % b;
fmt % (a+b);
cout << fmt;
It is wrong because the temporary value of (a+b) no longer exists when fmt is written to the stream. The correct solution would be:
format fmt("Adding {1} to {2}, we get {3}");
fmt % a;
fmt % b;
int a_and_b = a+b;
fmt % a_and_b;
cout << fmt;
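The effect of storing references instead of copies can be illustrated without Boost.Locale. The sketch below uses a toy class, deferred_list (a hypothetical name, not the library's implementation), that records the address of each argument and reads it only when str() is called. It shows both the safe pattern with a named variable and the fact that the value is read late.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Toy stand-in for a formatter that defers conversion: it remembers where
// its arguments live and reads them only in str().
// Hypothetical class for illustration; not the Boost.Locale implementation.
class deferred_list {
public:
    deferred_list& operator%(const int& value)
    {
        args_.push_back(&value);   // keeps a pointer, not a copy
        return *this;
    }
    std::string str() const
    {
        std::ostringstream out;
        for (std::size_t i = 0; i < args_.size(); ++i) {
            if (i != 0) out << ' ';
            out << *args_[i];      // the argument must still be alive here
        }
        return out.str();
    }
private:
    std::vector<const int*> args_;
};

// Safe: every argument is a named object that outlives the call to str().
std::string safe_usage()
{
    int a = 2, b = 3;
    int a_and_b = a + b;
    deferred_list list;
    list % a;
    list % b;
    list % a_and_b;
    return list.str();
}

// The value is read when str() runs, so a later change is visible.
std::string late_read()
{
    int a = 2;
    deferred_list list;
    list % a;
    a = 7;
    return list.str();
}

void demo()
{
    assert(safe_usage() == "2 3 5");
    assert(late_read() == "7");
}
```

Passing a temporary such as (a+b) in a separate statement would leave a dangling pointer in this toy class, which is the same hazard the note above describes.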
One of the important flaws of most libraries that provide operations over dates is that they support only the Gregorian calendar. This is true for boost::date_time, and it is true for std::tm and standard functions like localtime and gmtime, which assume that we use the Gregorian calendar.
Boost.Locale provides generic date_time and calendar classes that allow you to perform operations on dates and times for non-Gregorian calendars such as the Hebrew, Islamic or Japanese calendars.
calendar: the class that represents generic information about the calendar, independent of a specific time point. For example, you can get the maximal number of days in a month for this calendar.
date_time: represents a point in time. It is constructed from a calendar and allows manipulation of various time periods.
boost::locale::period: holds an enumeration of various periods such as month, year, day and hour that allow us to manipulate dates. You can add periods, multiply them by integers, get or set them, or add them to date_time objects.
For example:
using namespace boost::locale;
date_time now; // Create a date_time object with the default calendar, initialized to the current time
date_time tomorrow = now + period::day;
cout << "Let's meet tomorrow at " << as::date << tomorrow << endl;
date_time some_point = period::year * 1995 + period::january + period::day*1;
// Set some_point's date to 1995-Jan-1.
cout << "The " << as::date << some_point << " is "
<< as::ordinal << some_point / period::day_of_week_local << " day of week" << endl;
You can calculate the distance between two dates in units of a given period by dividing their difference by that period:
date_time now;
cout << " There are " << (now + 2 * period::month - now) / period::day << " days "
"between " << as::date << now << " and " << now + 2*period::month << endl;
date_time provides the member functions minimum and maximum to get the minimal and maximal possible values of a certain period for a specific time.
For example, for February maximum(period::day) would be 28, or 29 if the year is a leap year, and for January it would be 31.
Note: be very careful with assumptions about calendars. For example, in the Hebrew calendar the number of months changes depending on whether the current year is a leap year.
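The Hebrew calendar case can be made concrete with a small standalone sketch. It does not use Boost.Locale; hebrew_leap_year and hebrew_months_in_year are hypothetical helper names, and the code relies on the well-known 19-year cycle of the Hebrew calendar rather than on anything the library exposes.

```cpp
#include <cassert>

// Leap years in the Hebrew calendar follow a 19-year cycle in which
// years 3, 6, 8, 11, 14, 17 and 19 of the cycle are leap years.
// Hypothetical helper for illustration; not a Boost.Locale function.
bool hebrew_leap_year(int year)
{
    return (7 * year + 1) % 19 < 7;
}

// A leap year has 13 months (an additional Adar); a common year has 12.
int hebrew_months_in_year(int year)
{
    return hebrew_leap_year(year) ? 13 : 12;
}

void demo()
{
    assert(hebrew_months_in_year(5770) == 12);  // common year
    assert(hebrew_months_in_year(5771) == 13);  // leap year
}
```

Code that hard-codes twelve months per year would therefore give wrong results for such a calendar, which is why the maximum of a period should be queried rather than assumed.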
It is recommended to look at the calendar.cpp example provided with this library to understand how to manipulate dates and times using these classes.
To convert between dates in different calendars, you can get the current POSIX time via the time member function. For example:
using namespace boost::locale;
using namespace boost::locale::period;
generator gen;
// Create locales with Hebrew and Gregorian (default) calendars.
std::locale l_hebrew=gen("en_US@calendar=hebrew");
std::locale l_gregorian=gen("en_US");
// Create Gregorian date from fields
date_time greg(2010*year + february + 5*day,l_gregorian);
// Assign time point taken from Gregorian date to date_time with
// Hebrew calendar
date_time heb(greg.time(),l_hebrew);
// Now we can query the Hebrew year.
std::cout << "Hebrew year is " << heb / year << std::endl;
The std::locale::name function provides quite limited information about a locale. Thus an additional facet was created to give more precise information: boost::locale::info. It has the following member functions:
std::string language(): get the language code of the current locale, for example “en”.
std::string country(): get the country code of the current locale, for example “US”.
std::string variant(): get the variant of the current locale, for example “euro”.
std::string encoding(): get the charset used for char based strings, for example “UTF-8”.
bool utf8(): a fast way to check whether the encoding is UTF-8.
Boost.Locale allows you to work safely with multiple locales in the same process. As mentioned before, locale generation is not cheap. Thus, when working with multiple locales, it is recommended to create all the locales you use at the beginning and then reuse them.
The generator class has a member function preload that allows you to create a locale and put it into a cache. The next time you create that locale, if it exists in the cache, it is fetched from the preloaded locale set.
For example:
generator gen;
gen.octet_encoding("UTF-8");
// Create all locales and put them into the cache
gen.preload("en_US");
gen.preload("de_DE");
gen.preload("ja_JP");
// Fetch existing locale from cache
std::locale en=gen("en_US");
// Because ar_EG is not in the cache, a new locale is generated (but not cached)
std::locale ar=gen("ar_EG");
Note: generating a locale does not put it in the cache; only generator::preload does this.
Then these locales can be imbued into iostreams or used directly as parameters in various functions.
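The caching rule described above (preload stores, plain generation does not) can be sketched with standard containers. Everything in this example is hypothetical: fake_locale stands in for an expensive locale object, and caching_generator only mimics the described behaviour; it is not how the Boost.Locale generator is implemented.

```cpp
#include <cassert>
#include <map>
#include <string>

// Stand-in for a locale object that is expensive to build (hypothetical).
struct fake_locale {
    std::string name;
};

class caching_generator {
public:
    // Build the locale once and remember it.
    void preload(const std::string& id)
    {
        if (cache_.count(id) == 0) {
            cache_.emplace(id, build(id));
        }
    }

    // Return the cached locale if there is one; otherwise build a new one
    // without storing it.
    fake_locale operator()(const std::string& id)
    {
        const auto it = cache_.find(id);
        return it != cache_.end() ? it->second : build(id);
    }

    int builds() const { return builds_; }

private:
    fake_locale build(const std::string& id)
    {
        ++builds_;                 // counts how often the costly step runs
        return fake_locale{id};
    }

    std::map<std::string, fake_locale> cache_;
    int builds_ = 0;
};

void demo()
{
    caching_generator gen;
    gen.preload("en_US");
    assert(gen.builds() == 1);
    gen("en_US");                  // served from the cache
    assert(gen.builds() == 1);
    gen("ar_EG");                  // not cached: built now
    gen("ar_EG");                  // and built again
    assert(gen.builds() == 3);
}
```

The counter makes the cost visible: a locale that is requested repeatedly without being preloaded is rebuilt on every request.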
atoi because they may not use “ordinary” digits 0..9 at all. You may not assume that “space” characters are frequent, because in Chinese spaces do not separate words. The text may be written from right to left or from top to bottom, and so on.
In order to use Unicode in my application I should use wide strings everywhere.
Unicode is not limited to wide strings; in fact both std::string and std::wstring are absolutely fine for holding and processing Unicode text. Moreover, the semantics of std::string are much cleaner in a multi-platform application because, if the string is a “Unicode” string, then it is UTF-8. “Wide” strings may be UTF-16 or UTF-32 encoded, depending on the platform.
So wide strings may be even less convenient than char based strings when dealing with Unicode.
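As a small illustration that a plain std::string can carry Unicode text, the sketch below counts the code points in a UTF-8 encoded string. The function utf8_code_points is a hypothetical helper, not part of Boost.Locale, and it assumes that the input is valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count code points in a UTF-8 string by skipping continuation bytes,
// which always have the bit pattern 10xxxxxx.
// Hypothetical helper; assumes the input is valid UTF-8.
std::size_t utf8_code_points(const std::string& text)
{
    std::size_t count = 0;
    for (unsigned char byte : text) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

void demo()
{
    // The Hebrew word "shalom": four letters, two bytes each in UTF-8.
    const std::string shalom = "\xD7\xA9\xD7\x9C\xD7\x95\xD7\x9D";
    assert(shalom.size() == 8);
    assert(utf8_code_points(shalom) == 4);
}
```

The same bytes have the same meaning on every platform, whereas the width of wchar_t, and so the encoding of std::wstring, differs between platforms.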
UTF–16 is the best encoding to work with.
There is a common assumption that it is one of the best encodings for storing information because it gives the “shortest” representation of strings.
In fact, it is probably the most error-prone encoding to work with. The biggest issue is code points lying outside the BMP, which are represented with surrogate pairs. These characters are very rare, and many applications are not tested with them.
For example:
So UTF-16 can be used for dealing with Unicode; in fact ICU and many other applications use UTF-16 as their internal Unicode representation. But you should be very careful and never assume one-code-point == one-utf16-character.
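The surrogate pair issue is easy to reproduce with the C++11 char16_t string type. In this sketch, utf16_code_points is a hypothetical helper that assumes well-formed UTF-16 input; the character used is U+1D11E (MUSICAL SYMBOL G CLEF), which lies outside the BMP.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count code points in UTF-16 by skipping low (trailing) surrogates,
// so that each surrogate pair is counted once.
// Hypothetical helper; assumes well-formed UTF-16.
std::size_t utf16_code_points(const std::u16string& text)
{
    std::size_t count = 0;
    for (char16_t unit : text) {
        if (unit < 0xDC00 || unit > 0xDFFF) {
            ++count;
        }
    }
    return count;
}

void demo()
{
    const std::u16string clef = u"\U0001D11E";  // one code point outside the BMP
    assert(clef.size() == 2);                   // stored as two 16-bit units
    assert(utf16_code_points(clef) == 1);
}
```

Code that indexes or truncates such a string by 16-bit units can split the pair in the middle and produce invalid text.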
Why is it needed?
Why do we need a localization library when standard C++ facets (should) provide most of the required functionality?
The std::ctype facet for case conversion.
std::collate for collation, which has nice integration with std::locale.
std::num_put, std::num_get, std::money_put, std::money_get, std::time_put and std::time_get for numbers, time and currency formatting and parsing.
The std::messages class that supports localized message formatting.
So why do we need such a library if we have all this functionality within the standard library?
Almost every(!) facet has some flaws in its design:
std::collate supports only one level of collation, so it is not possible to choose whether case-sensitive or accent-sensitive comparison should be performed.
std::ctype, which is responsible for case conversion, assumes that conversion can be done on a per-character basis. This is probably correct for many languages, but it isn't correct in the general case.
  The toupper function works on a single-character basis.
  A single code point may require several char's in UTF-8 and up to two wchar_t's on the Windows platform. This makes std::ctype totally useless with UTF-8 encodings.
std::numpunct and std::moneypunct do not specify the code points used for digits at all. Thus it is impossible to format a number using the digits used under Arabic locales; for example, the number “103” is expected to be displayed as “١٠٣” under the ar_EG locale.
std::numpunct and std::moneypunct assume that the thousands separator can be represented using a single character. This is untrue for the UTF-8 encoding, where only the Unicode range 0-0x7F can be represented as a single character. As a result, localized numbers can't be represented correctly under locales that use the Unicode “EN SPACE” character as the thousands separator, like the Russian locale.
This actually causes real bugs under the GCC and SunStudio compilers, where formatting numbers under a Russian locale creates invalid UTF-8 sequences.
std::time_put and std::time_get have several flaws:
  They use std::tm for time representation, ignoring the fact that in many countries dates may be displayed using different calendars.
  std::tm does not include a timezone field.
  std::time_get is not symmetric with std::time_put, so it does not allow parsing dates and times created with std::time_put. This issue is addressed in C++0x and in some STL implementations like the Apache standard C++ library.
std::messages does not provide support for plural forms, making it impossible to correctly localize such simple strings as “There are X files in the directory”.
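The per-character limitation of std::ctype mentioned in the list above can be seen with the standard library alone. The sketch below upper-cases a string one char at a time; to_upper_per_char is a hypothetical helper, and the input is the German word "straße" encoded in Latin-1, where ß is the single byte 0xDF. Because the facet maps one char to one char, the result can never grow to the expected "STRASSE".

```cpp
#include <cassert>
#include <locale>
#include <string>

// Upper-case a string with std::ctype<char>, one char at a time.
// Hypothetical helper for illustration.
std::string to_upper_per_char(const std::string& text, const std::locale& loc)
{
    const std::ctype<char>& facet = std::use_facet<std::ctype<char> >(loc);
    std::string result;
    for (char c : text) {
        result += facet.toupper(c);  // one char in, exactly one char out
    }
    return result;
}

void demo()
{
    const std::string in = "stra\xDF" "e";  // "strasse" written with sharp s, Latin-1
    const std::string out = to_upper_per_char(in, std::locale::classic());
    assert(out.size() == in.size());        // the length can never change
    assert(out.substr(0, 4) == "STRA");
}
```

The interface itself, not the choice of locale, is the limit here: no locale can make toupper return two characters for one.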
Also, many features are not supported by std::locale at all: the timezones mentioned above, text boundary analysis, number spelling and many others. So it is clear that standard C++ locales are very problematic for real-world internationalization and localization.
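The single-character thousands separator of std::numpunct, one of the flaws listed earlier, can be shown with a custom facet. In this sketch space_grouping and format_grouped are hypothetical names. The facet returns an ASCII space, which works; a separator that needs more than one byte in UTF-8 cannot be returned at all, because the return type of thousands_sep for char streams is a single char.

```cpp
#include <cassert>
#include <locale>
#include <sstream>
#include <string>
#include <type_traits>
#include <utility>

// The separator is exactly one char; a multi-byte UTF-8 sequence does not fit.
static_assert(
    std::is_same<decltype(std::declval<std::numpunct<char> >().thousands_sep()),
                 char>::value,
    "numpunct<char>::thousands_sep returns a single char");

// Facet that groups digits by three and separates groups with a space.
// Hypothetical class for illustration.
class space_grouping : public std::numpunct<char> {
protected:
    char do_thousands_sep() const override { return ' '; }
    std::string do_grouping() const override { return "\3"; }
};

std::string format_grouped(long value)
{
    std::ostringstream out;
    // The locale takes ownership of the facet and deletes it when unused.
    out.imbue(std::locale(std::locale::classic(), new space_grouping));
    out << value;
    return out.str();
}

void demo()
{
    assert(format_grouped(1234567) == "1 234 567");
}
```

A locale that wants a non-ASCII space as its separator has no correct way to express that through this interface when the stream encoding is UTF-8.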
Why use an ICU wrapper instead of ICU?
ICU is a very good localization library, but it has several serious flaws:
Boost.Locale, for example, provides direct integration with iostream, allowing a more natural way of formatting data:
cout << "You have "<<as::currency << 134.45 << " at your account at "<<as::datetime << std::time(0) << endl;
Why is the ICU API not exposed to the user?
It is true that the whole ICU API is hidden behind opaque pointers and the user has no access to it. This is done for several reasons:
Why use GNU Gettext catalogs for message formatting?
There are many localization formats available. The most popular so far are OASIS XLIFF, GNU gettext po/mo files, POSIX catalogs, Qt ts/tm files, Java properties and Windows resources. However, the last three are each popular only in their specific area, and POSIX catalogs are too simple and limited, so there are two quite reasonable options:
The first one generally seems like the more correct localization solution, but it requires XML parsing to load documents, it is a very complicated format, and even ICU requires preliminary compilation of it into ICU resource bundles.
On the other hand:
So, even though the GNU Gettext mo catalog format is not an officially approved file format:
Note: Boost.Locale does not use any GNU Gettext code; it just reimplements the tool for reading and using mo-files, getting rid of the biggest current GNU Gettext flaw: thread safety when using multiple locales.
Why is a plain number used for the representation of date-time instead of Boost.DateTime date or Boost.DateTime ptime?
There are several reasons:
ptime definitely could be used if it did not have several problems:
It is created in the GMT or local time clock, whereas time() gives a representation that is independent of time zone (usually GMT time), and only then should it be represented in the time zone the user requests.
The timezone is not a property of time itself, but rather a property of time formatting.
ptime already defines operator<< and operator>> for time formatting and parsing.
The existing facets for ptime formatting and parsing were not designed in a way that lets the user override their behavior. The major formatting and parsing functions are not virtual. This makes it impossible to reimplement the formatting and parsing functions of ptime unless the developers of the Boost.DateTime library decide to change them.
Also, the facets of ptime are not “correctly” designed in terms of the division between formatting information and locale information. Formatting information should be stored within std::ios_base, whereas information about how to format according to the locale should be stored in the facet itself.
The user of the library should not have to create new facets in order to change formatting information such as whether to display only the date or both date and time.
Thus, at this point, ptime is not supported for formatting localized date and time.
Encoding: a representation of a character set. Some encodings, like UTF-8, are capable of representing the full UCS, and some represent only a subset of it: ISO-8859-8 represents only a small subset of about 250 UCS characters.
Non-Unicode encodings are still very popular. For example, the Latin-1 (or ISO-8859-1) encoding covers most of the characters needed to represent Western European languages and significantly simplifies text processing for applications designed to handle only such languages.
In Boost.Locale you should provide the octet (std::string) encoding as part of the locale code name, for example en_US.UTF-8 or he_IL.cp1255.
UTF-8 is the recommended one.
std::locale::facet: a base class from which every object that describes a specific locale is derived. Facets can be added to a locale to provide additional culture information.
Formatting: representation of various values according to locale preferences. For example, the number 1234.5 (C) should be displayed as 1,234.5 in the US locale and 1.234,5 in the Russian locale. The date November 1st, 2005 would be represented as 11/01/2005 in the United States and 01.11.2005 in Russia. This is an important part of localization, allowing various values to be represented correctly.
For example: does “You have to bring 134,230 kg of rice at 04/01/2010” mean “134 tons of rice on the 1st of April” or “134 kg 230 g of rice on January 4th”? That is quite different.
The std::locale class is used in C++ for representation of locale information.
Normalization: Unicode normalization is the process of converting strings to a standard form suitable for text processing and comparison. For example, the character “ü” can be represented using a single code point or a combination of the character “u” and the diaeresis “¨”. Normalization is an important part of Unicode text processing.
Normalization is not locale dependent, but because it is an important part of Unicode processing it is included in the Boost.Locale library.
UTF-16: variable-width Unicode transformation format. Each UCS code point is represented as a sequence of one or two 16-bit words. It is a very popular encoding on various platforms: Win32 API, Java, C#, Python, etc. However, it is frequently confused with UCS-2, a limited fixed-width encoding that is suitable for representing characters in the Basic Multilingual Plane (BMP) only.
This encoding is used for std::wstring on the Win32 platform, where sizeof(wchar_t)==2.
UTF-32/UCS-4: fixed-width Unicode transformation format, where each code point is represented as a single 32-bit word. It has the advantage of a simple code point representation but is quite wasteful in terms of memory usage. It is used for std::wstring encoding on most POSIX platforms, where sizeof(wchar_t)==4.
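The normalization entry above can be illustrated with plain byte strings. The sketch stores both representations of “ü” as UTF-8 and compares them with the ordinary std::string comparison; the constant and function names are hypothetical. No normalization is performed here, which is the point: without it, the two spellings of the same character do not compare equal.

```cpp
#include <cassert>
#include <string>

// "ü" as one precomposed code point, U+00FC, in UTF-8.
const std::string precomposed = "\xC3\xBC";

// "ü" as "u" followed by U+0308 COMBINING DIAERESIS, in UTF-8.
const std::string decomposed = "u\xCC\x88";

// Plain byte-wise comparison, with no normalization applied.
bool same_bytes(const std::string& a, const std::string& b)
{
    return a == b;
}

void demo()
{
    assert(precomposed.size() == 2);
    assert(decomposed.size() == 3);
    assert(!same_bytes(precomposed, decomposed));  // same character, different bytes
}
```

Converting both strings to the same normalization form before comparing removes this difference.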
Full, Doxygen generated reference can be found: