gcc/libstdc++-v3/doc/html/manual/codecvt.html - Issue 3050029: [gcc] GCC 4.5.0=>4.5.1

Unified Diff: gcc/libstdc++-v3/doc/html/manual/codecvt.html

Issue 3050029: [gcc] GCC 4.5.0=>4.5.1 (Closed) Base URL: ssh://git@gitrw.chromium.org:9222/nacl-toolchain.git

Patch Set: Created 10 years, 5 months ago

Index: gcc/libstdc++-v3/doc/html/manual/codecvt.html

diff --git a/gcc/libstdc++-v3/doc/html/manual/codecvt.html b/gcc/libstdc++-v3/doc/html/manual/codecvt.html

deleted file mode 100644

index c65f960febd485b174b0218249d5c2ba690a7a33..0000000000000000000000000000000000000000

--- a/gcc/libstdc++-v3/doc/html/manual/codecvt.html

+++ /dev/null

@@ -1,379 +0,0 @@

-<?xml version="1.0" encoding="UTF-8" standalone="no"?>

-<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

-<html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>codecvt</title><meta name="generator" content="DocBook XSL Stylesheets V1.74.0" /><meta name="keywords" content="
 ISO C++
 , 
 codecvt
 " /><meta name="keywords" content="
 ISO C++
 , 
 library
 " /><link rel="home" href="../spine.html" title="The GNU C++ Library Documentation" /><link rel="up" href="facets.html" title="Chapter 15. Facets aka Categories" /><link rel="prev" href="facets.html" title="Chapter 15. Facets aka Categories" /><link rel="next" href="messages.html" title="messages" /></head><body><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="3" align="center">codecvt</th></tr><tr><td width="20%" align="left"><a accesskey="p" href="facets.html">Prev</a> </td><th width="60%" align="center">Chapter 15. Facets aka Categories</th><td width="20%" align="right"> <a accesskey="n" href="messages.html">Next</a></td></tr></table><hr /></div><div class="sect1" lang="en" xml:lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="manual.localization.facet.codecvt"></a>codecvt</h2></div></div></div><p>

-The standard class codecvt attempts to address conversions between

-different character encoding schemes. In particular, the standard

-attempts to detail conversions between the implementation-defined wide

-characters (hereafter referred to as wchar_t) and the standard type

-char that is so beloved in classic “<span class="quote">C</span>” (which can now be

-referred to as narrow characters.) This document attempts to describe

-how the GNU libstdc++ implementation deals with the conversion between

-wide and narrow characters, and also presents a framework for dealing

-with the huge number of other encodings that iconv can convert,

-including Unicode and UTF8. Design issues and requirements are

-addressed, and examples of correct usage for both the required

-specializations for wide and narrow characters and the

-implementation-provided extended functionality are given.

-</p><div class="sect2" lang="en" xml:lang="en"><div class="titlepage"><div><div><h3 class="title"><a id="facet.codecvt.req"></a>Requirements</h3></div></div></div><p>

-Around page 425 of the C++ Standard, this charming heading comes into view:

-</p><div class="blockquote"><blockquote class="blockquote"><p>

-22.2.1.5 - Template class codecvt

-</p></blockquote></div><p>

-The text around the codecvt definition gives some clues:

-</p><div class="blockquote"><blockquote class="blockquote"><p>

-<span class="emphasis"><em>

--1- The class codecvt<internT,externT,stateT> is for use when

-converting from one codeset to another, such as from wide characters

-to multibyte characters, between wide character encodings such as

-Unicode and EUC.

-</em></span>

-</p></blockquote></div><p>

-Hmm. So, in some unspecified way, Unicode encodings and

-translations between other character sets should be handled by this

-class.

-</p><div class="blockquote"><blockquote class="blockquote"><p>

-<span class="emphasis"><em>

--2- The stateT argument selects the pair of codesets being mapped between.

-</em></span>

-</p></blockquote></div><p>

-Ah ha! Another clue...

-</p><div class="blockquote"><blockquote class="blockquote"><p>

-<span class="emphasis"><em>

--3- The instantiations required in the Table ??

-(lib.locale.category), namely codecvt<wchar_t,char,mbstate_t> and

-codecvt<char,char,mbstate_t>, convert the implementation-defined

-native character set. codecvt<char,char,mbstate_t> implements a

-degenerate conversion; it does not convert at

-all. codecvt<wchar_t,char,mbstate_t> converts between the native

-character sets for tiny and wide characters. Instantiations on

-mbstate_t perform conversion between encodings known to the library

-implementor. Other encodings can be converted by specializing on a

-user-defined stateT type. The stateT object can contain any state that

-is useful to communicate to or from the specialized do_convert member.

-</em></span>

-</p></blockquote></div><p>

-At this point, a couple of points become clear:

-</p><p>

-One: The standard clearly implies that attempts to add non-required

-(yet useful and widely used) conversions need to do so through the

-third template parameter, stateT.</p><p>

-Two: The required conversions, by specifying mbstate_t as the third

-template parameter, imply an implementation strategy that is mostly

-(or wholly) based on the underlying C library, and the functions

-mbsrtowcs and wcsrtombs in particular.</p></div><div class="sect2" lang="en" xml:lang="en"><div class="titlepage"><div><div><h3 class="title"><a id="facet.codecvt.design"></a>Design</h3></div></div></div><div class="sect3" lang="en" xml:lang="en"><div class="titlepage"><div><div><h4 class="title"><a id="codecvt.design.wchar_t_size"></a><span class="type">wchar_t</span> Size</h4></div></div></div><p>

- The simple implementation detail of wchar_t's size seems to

- repeatedly confound people. Many systems use a two byte,

- unsigned integral type to represent wide characters, and use an

- internal encoding of Unicode or UCS2. (See AIX, Microsoft NT,

- Java, others.) Other systems use a four byte, unsigned integral

- type to represent wide characters, and use an internal encoding

- of UCS4. (GNU/Linux systems using glibc, in particular.) The C

- programming language (and thus C++) does not specify a specific

- size for the type wchar_t.

- </p><p>

- Thus, portable C++ code cannot assume a byte size (or endianness) either.

- </p></div><div class="sect3" lang="en" xml:lang="en"><div class="titlepage"><div><div><h4 class="title"><a id="codecvt.design.unicode"></a>Support for Unicode</h4></div></div></div><p>

- Probably the most frequently asked question about code conversion

- is: "So dudes, what's the deal with Unicode strings?"

- The dude part is optional, but apparently the usefulness of

- Unicode strings is pretty widely appreciated. Sadly, this specific

- encoding (and other useful encodings like UTF8, UCS4, ISO 8859-10,

- etc.) is not mentioned in the C++ standard.

- </p><p>

- A couple of comments:

- </p><p>

- The thought that all one needs to convert between two arbitrary

- codesets is two types and some kind of state argument is

- unfortunate. In particular, encodings may be stateless. The naming

- of the third parameter as stateT is unfortunate, as what is really

- needed is some kind of generalized type that accounts for the

- issues that abstract encodings will need. The minimum information

- that is required includes:

- </p><div class="itemizedlist"><ul type="disc"><li><p>

- Identifiers for each of the codesets involved in the

- conversion. For example, using the iconv family of functions

- from the Single Unix Specification (what used to be called

- X/Open) hosted on the GNU/Linux operating system allows

- bi-directional mapping between far more than the following

- tantalizing possibilities:

- </p><p>

- (An edited list taken from <code class="code">`iconv --list`</code> on a

- Red Hat 6.2/Intel system:

- </p><div class="blockquote"><blockquote class="blockquote"><pre class="programlisting">

-8859_1, 8859_9, 10646-1:1993, 10646-1:1993/UCS4, ARABIC, ARABIC7,

-ASCII, EUC-CN, EUC-JP, EUC-KR, EUC-TW, GREEK-CCITT, GREEK, GREEK7-OLD,

-GREEK7, GREEK8, HEBREW, ISO-8859-1, ISO-8859-2, ISO-8859-3,

-ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8,

-ISO-8859-9, ISO-8859-10, ISO-8859-11, ISO-8859-13, ISO-8859-14,

-ISO-8859-15, ISO-10646, ISO-10646/UCS2, ISO-10646/UCS4,

-ISO-10646/UTF-8, ISO-10646/UTF8, SHIFT-JIS, SHIFT_JIS, UCS-2, UCS-4,

-UCS2, UCS4, UNICODE, UNICODEBIG, UNICODELITTLE, US-ASCII, US, UTF-8,

-UTF-16, UTF8, UTF16).

-</pre></blockquote></div><p>

-For iconv-based implementations, string literals for each of the

-encodings (e.g. "UCS-2" and "UTF-8") are necessary,

-although for other,

-non-iconv implementations a table of enumerated values or some other

-mechanism may be required.

-</p></li><li><p>

- Maximum length of the identifying string literal.

-</p></li><li><p>

- Some encodings require explicit endian-ness. As such, some kind

- of endian marker or other byte-order marker will be necessary. See

- "Footnotes for C/C++ developers" in Haible for more information on

- UCS-2/Unicode endian issues. (Summary: big endian seems most likely,

- however implementations, most notably Microsoft, vary.)

-</p></li><li><p>

- Types representing the conversion state, for conversions involving

- the machinery in the "C" library, or the conversion descriptor, for

- conversions using iconv (such as the type iconv_t.) Note that the

- conversion descriptor encodes more information than a simple encoding

- state type.

-</p></li><li><p>

- Conversion descriptors for both directions of encoding. (i.e., both

- UCS-2 to UTF-8 and UTF-8 to UCS-2.)

-</p></li><li><p>

- Something to indicate if the conversion requested is valid.

-</p></li><li><p>

- Something to represent if the conversion descriptors are valid.

-</p></li><li><p>

- Some way to enforce strict type checking on the internal and

- external types. As part of this, the size of the internal and

- external types will need to be known.

-</p></li></ul></div></div><div class="sect3" lang="en" xml:lang="en"><div class="titlepage"><div><div><h4 class="title"><a id="codecvt.design.issues"></a>Other Issues</h4></div></div></div><p>

-In addition, multi-threaded and multi-locale environments also impact

-the design and requirements for code conversions. In particular, they

-affect the required specialization codecvt<wchar_t, char, mbstate_t>

-when implemented using standard "C" functions.

-</p><p>

-Three problems arise, one big, one of medium importance, and one small.

-</p><p>

-First, the small: mbsrtowcs and wcsrtombs may not be multithread-safe

-on all systems required by the GNU tools. For GNU/Linux and glibc,

-this is not an issue.

-</p><p>

-Of medium concern, in the grand scope of things, is that the functions

-used to implement this specialization work on null-terminated

-strings. Buffers, especially file buffers, may not be null-terminated,

-thus giving conversions that end prematurely or are otherwise

-incorrect. Yikes!

-</p><p>

-The last, and fundamental problem, is the assumption of a global

-locale for all the "C" functions referenced above. For something like

-C++ iostreams (where codecvt is explicitly used) the notion of

-multiple locales is fundamental. In practice, most users may not run

-into this limitation. However, as a quality of implementation issue,

-the GNU C++ library would like to offer a solution that allows

-multiple locales and/or simultaneous usage with computationally

-correct results. In short, libstdc++ is trying to offer, as an

-option, a high-quality implementation, damn the additional complexity!

-</p><p>

-For the required specialization codecvt<wchar_t, char, mbstate_t>,

-conversions are made between the internal character set (always UCS4

-on GNU/Linux) and whatever the currently selected locale for the

-LC_CTYPE category implements.

-</p></div></div><div class="sect2" lang="en" xml:lang="en"><div class="titlepage"><div><div><h3 class="title"><a id="facet.codecvt.impl"></a>Implementation</h3></div></div></div><p>

-The two required specializations are implemented as follows:

-</p><p>

-<code class="code">

-codecvt<char, char, mbstate_t>

-</code>

-</p><p>

-This is a degenerate (i.e., does nothing) specialization. Implementing

-this was a piece of cake.

-</p><p>

-<code class="code">

-codecvt<wchar_t, char, mbstate_t>

-</code>

-</p><p>

-This specialization, by specifying all the template parameters, pretty

-much ties the hands of implementors. As such, the implementation is

-straightforward, involving mbsrtowcs for conversions from char

-to wchar_t and wcsrtombs for conversions from wchar_t to char.

-</p><p>

-Neither of these two required specializations deals with Unicode

-characters. As such, libstdc++ implements a partial specialization

-of the codecvt class with an iconv wrapper class, encoding_state, as the

-third template parameter.

-</p><p>

-This implementation should be standards conformant. First of all, the

-standard explicitly points out that instantiations on the third

-template parameter, stateT, are the proper way to implement

-non-required conversions. Second of all, the standard says (in Chapter

-17) that partial specializations of required classes are a-ok. Third

-of all, the requirements for the stateT type elsewhere in the standard

-(see 21.1.2 traits typedefs) only indicate that this type be copy

-constructible.

-</p><p>

-As such, the type encoding_state is defined as a non-templatized, POD

-type to be used as the third type of a codecvt instantiation. This

-type is just a wrapper class for iconv, and provides an easy interface

-to iconv functionality.

-</p><p>

-There are two constructors for encoding_state:

-</p><p>

-<code class="code">

-encoding_state() : __in_desc(0), __out_desc(0)

-</code>

-</p><p>

-This default constructor sets the internal encoding to some default

-(currently UCS4) and the external encoding to whatever is returned by

-nl_langinfo(CODESET).

-</p><p>

-<code class="code">

-encoding_state(const char* __int, const char* __ext)

-</code>

-</p><p>

-This constructor takes as parameters string literals that indicate the

-desired internal and external encoding. There are no defaults for

-either argument.

-</p><p>

-One of the issues with iconv is that the string literals identifying

-conversions are not standardized. Because of this, the thought of

-mandating and/or enforcing some set of pre-determined valid

-identifiers seems iffy: thus, a more practical (and non-migraine

-inducing) strategy was implemented: end-users can specify any string

-(subject to a pre-determined length qualifier, currently 32 bytes) for

-encodings. It is up to the user to make sure that these strings are

-valid on the target system.

-</p><p>

-<code class="code">

-void

-_M_init()

-</code>

-</p><p>

-Strangely enough, this member function attempts to open conversion

-descriptors for a given encoding_state object. If the encoding

-names are not valid, the conversion descriptors returned will

-not be valid and subsequent calls to the codecvt conversion

-functions will return an error.

-</p><p>

-<code class="code">

-bool

-_M_good()

-</code>

-</p><p>

-Provides a way to see if the given encoding_state object has been

-properly initialized. If the string literals describing the desired

-internal and external encoding are not valid, initialization will

-fail, and this will return false. If the internal and external

-encodings are valid, but iconv_open could not allocate conversion

-descriptors, this will also return false. Otherwise, the object is

-ready to convert and will return true.

-</p><p>

-<code class="code">

-encoding_state(const encoding_state&)

-</code>

-</p><p>

-As iconv allocates memory and sets up conversion descriptors, the copy

-constructor can only copy the member data pertaining to the internal

-and external code conversions, and not the conversion descriptors

-themselves.

-</p><p>

-Definitions for all the required codecvt member functions are provided

-for this specialization, and usage of codecvt<internal character type,

-external character type, encoding_state> is consistent with other

-codecvt usage.

-</p></div><div class="sect2" lang="en" xml:lang="en"><div class="titlepage"><div><div><h3 class="title"><a id="facet.codecvt.use"></a>Use</h3></div></div></div><p>A conversion involving a string literal.</p><pre class="programlisting">

- typedef codecvt_base::result result;

- typedef unsigned short unicode_t;

- typedef unicode_t int_type;

- typedef char ext_type;

- typedef encoding_state state_type;

- typedef codecvt<int_type, ext_type, state_type> unicode_codecvt;

- const ext_type* e_lit = "black pearl jasmine tea";

- int size = strlen(e_lit);

- int_type i_lit_base[24] =

- { 25088, 27648, 24832, 25344, 27392, 8192, 28672, 25856, 24832, 29184,

- 27648, 8192, 27136, 24832, 29440, 27904, 26880, 28160, 25856, 8192, 29696,

- 25856, 24832, 2560

- };

- const int_type* i_lit = i_lit_base;

- const ext_type* efrom_next;

- const int_type* ifrom_next;

- ext_type* e_arr = new ext_type[size + 1];

- ext_type* eto_next;

- int_type* i_arr = new int_type[size + 1];

- int_type* ito_next;

- // construct a locale object with the specialized facet.

- locale loc(locale::classic(), new unicode_codecvt);

- // sanity check the constructed locale has the specialized facet.

- VERIFY( has_facet<unicode_codecvt>(loc) );

- const unicode_codecvt& cvt = use_facet<unicode_codecvt>(loc);

- // convert between const char* and unicode strings

- unicode_codecvt::state_type state01("UNICODE", "ISO_8859-1");

- initialize_state(state01);

- result r1 = cvt.in(state01, e_lit, e_lit + size, efrom_next,

- i_arr, i_arr + size, ito_next);

- VERIFY( r1 == codecvt_base::ok );

- VERIFY( !int_traits::compare(i_arr, i_lit, size) );

- VERIFY( efrom_next == e_lit + size );

- VERIFY( ito_next == i_arr + size );

-</pre></div><div class="sect2" lang="en" xml:lang="en"><div class="titlepage"><div><div><h3 class="title"><a id="facet.codecvt.future"></a>Future</h3></div></div></div><div class="itemizedlist"><ul type="disc"><li><p>

- a. things that are sketchy, or remain unimplemented:

- do_encoding, max_length and length member functions

- are only weakly implemented. I have no idea how to do

- this correctly, and in a generic manner. Nathan?

-</p></li><li><p>

- b. conversions involving std::string

- </p><div class="itemizedlist"><ul type="circle"><li><p>

- how should operators != and == work for string of

- different/same encoding?

- </p></li><li><p>

- what is equal? A byte by byte comparison or an

- encoding then byte comparison?

- </p></li><li><p>

- conversions between narrow, wide, and unicode strings

- </p></li></ul></div></li><li><p>

- c. conversions involving std::filebuf and std::ostream

-</p><div class="itemizedlist"><ul type="circle"><li><p>

- how to initialize the state object in a

- standards-conformant manner?

- </p></li><li><p>

- how to synchronize the "C" and "C++"

- conversion information?

- </p></li><li><p>

- wchar_t/char internal buffers and conversions between

- internal/external buffers?

- </p></li></ul></div></li></ul></div></div><div class="bibliography"><div class="titlepage"><div><div><h3 class="title"><a id="facet.codecvt.biblio"></a>Bibliography</h3></div></div></div><div class="biblioentry"><a id="id415012"></a><p><span class="title"><i>

- The GNU C Library

- </i>. </span><span class="author"><span class="firstname">Roland</span> <span class="surname">McGrath</span>. </span><span class="author"><span class="firstname">Ulrich</span> <span class="surname">Drepper</span>. </span><span class="copyright">Copyright © 2007 FSF. </span><span class="pagenums">Chapters 6 Character Set Handling and 7 Locales and Internationalization. </span></p></div><div class="biblioentry"><a id="id514935"></a><p><span class="title"><i>

- Correspondence

- </i>. </span><span class="author"><span class="firstname">Ulrich</span> <span class="surname">Drepper</span>. </span><span class="copyright">Copyright © 2002 . </span></p></div><div class="biblioentry"><a id="id416284"></a><p><span class="title"><i>

- ISO/IEC 14882:1998 Programming languages - C++

- ISO/IEC 9899:1999 Programming languages - C

- System Interface Definitions, Issue 6 (IEEE Std. 1003.1-200x)

- The Open Group/The Institute of Electrical and Electronics Engineers, Inc.. </span><span class="biblioid">

- <a class="ulink" href="http://www.opennc.org/austin/docreg.html" target="_top">

- </a>

- . </span></p></div><div class="biblioentry"><a id="id459755"></a><p><span class="title"><i>

- The C++ Programming Language, Special Edition

- </i>. </span><span class="author"><span class="firstname">Bjarne</span> <span class="surname">Stroustrup</span>. </span><span class="copyright">Copyright © 2000 Addison Wesley, Inc.. </span><span class="pagenums">Appendix D. </span><span class="publisher"><span class="publishername">

- Addison Wesley

- . </span></span></p></div><div class="biblioentry"><a id="id430786"></a><p><span class="title"><i>

- Standard C++ IOStreams and Locales

- </i>. </span><span class="subtitle">

- Advanced Programmer's Guide and Reference

- . </span><span class="author"><span class="firstname">Angelika</span> <span class="surname">Langer</span>. </span><span class="author"><span class="firstname">Klaus</span> <span class="surname">Kreft</span>. </span><span class="copyright">Copyright © 2000 Addison Wesley Longman, Inc.. </span><span class="publisher"><span class="publishername">

- Addison Wesley Longman

- . </span></span></p></div><div class="biblioentry"><a id="id407191"></a><p><span class="title"><i>

- A brief description of Normative Addendum 1

- </i>. </span><span class="author"><span class="firstname">Clive</span> <span class="surname">Feather</span>. </span><span class="pagenums">Extended Character Sets. </span><span class="biblioid">

- <a class="ulink" href="http://www.lysator.liu.se/c/na1.html" target="_top">

- </a>

- . </span></p></div><div class="biblioentry"><a id="id394112"></a><p><span class="title"><i>

- The Unicode HOWTO

- </i>. </span><span class="author"><span class="firstname">Bruno</span> <span class="surname">Haible</span>. </span><span class="biblioid">

- <a class="ulink" href="ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO.html" target="_top">

- </a>

- . </span></p></div><div class="biblioentry"><a id="id394140"></a><p><span class="title"><i>

- UTF-8 and Unicode FAQ for Unix/Linux

- </i>. </span><span class="author"><span class="firstname">Markus</span> <span class="surname">Kuhn</span>. </span><span class="biblioid">

- <a class="ulink" href="http://www.cl.cam.ac.uk/~mgk25/unicode.html" target="_top">

- </a>

- . </span></p></div></div></div><div class="navfooter"><hr /><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="facets.html">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="facets.html">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="messages.html">Next</a></td></tr><tr><td width="40%" align="left" valign="top">Chapter 15. Facets aka Categories </td><td width="20%" align="center"><a accesskey="h" href="../spine.html">Home</a></td><td width="40%" align="right" valign="top"> messages</td></tr></table></div></body></html>
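
Note on the removed page: its "Other Issues" section points out that the C functions behind codecvt<wchar_t, char, mbstate_t> work on null-terminated strings, while buffers (file buffers in particular) may not be null-terminated. The facet's in() member takes an explicit [first, last) range, so a caller does not need a terminator. The following is a minimal sketch of that idea, assuming only the standard facet interface. The helper name widen_range is hypothetical and is not part of libstdc++, and the one-wide-character-per-byte buffer sizing is an assumption that holds for multibyte-to-wide conversion.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>

// Hypothetical helper (not part of libstdc++): widen the narrow range
// [first, last) with the required facet codecvt<wchar_t, char, mbstate_t>
// taken from loc. The range does not need a terminating null.
// Returns true only if the entire range was converted.
bool widen_range(const std::locale& loc,
                 const char* first, const char* last,
                 std::wstring& out)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> cvt_type;

    assert(first <= last);
    out.clear();
    if (first == last)
        return true;

    const cvt_type& cvt = std::use_facet<cvt_type>(loc);
    std::mbstate_t state = std::mbstate_t();  // initial conversion state

    // Assumption: each wide character consumes at least one input byte,
    // so one output slot per input byte is enough room.
    std::wstring buf(static_cast<std::size_t>(last - first), L'\0');
    const char* from_next = first;
    wchar_t* to_begin = &buf[0];
    wchar_t* to_next = to_begin;

    std::codecvt_base::result r =
        cvt.in(state, first, last, from_next,
               to_begin, to_begin + buf.size(), to_next);

    // Treat partial conversions and errors alike as failure.
    if (r != std::codecvt_base::ok || from_next != last)
        return false;

    out.assign(to_begin, to_next);
    return true;
}
```

Called in the classic locale with a three-byte array holding 't', 'e', 'a' and no terminator, the helper should fill the output with the wide string L"tea". Which encodings other locales accept depends on the C library, as the page itself stresses.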
