This is the mail archive of the cygwin mailing list for the Cygwin project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: Bug in libiconv?


On Jan 29 10:21, Eric Blake wrote:
> On 01/29/2011 09:01 AM, Corinna Vinschen wrote:
> >> So, using UTF-16 surrogate encodings for characters outside the basic
> >> plane violates POSIX, but it's the best we can do for those characters.
> > 
> > Right, and we discussed this already on this list.  Or the developer
> > list, I don't remember.  Maybe we should have stick to the base plane
> > and only use UCS-2 to be more POSIX compatible.
> 
> The burden is on the application, not on cygwin.  If the application
> wants POSIX behavior, then they obey __STDC_ISO_10646__ and use ONLY
> characters from the basic plane (no surrogates), at which point their
> use of wchar_t fits the POSIX definition (one wchar_t per character).
> The moment they pass a surrogate, they are no longer honoring the
> restriction documented by __STDC_ISO_10646__ so they are no longer under
> the rules of POSIX, and then cygwin can do whatever it wants (and in

Erm... hang on.  __STDC_ISO_10646__ and the POSIX requirement are two
different beasts.  I still think that __STDC_ISO_10646__ does not
restrict a 2 byte wchar_t to UCS-2.  Per the definition UTF-16 is a
valid coded representation of characters from ISO/IEC 10646.

So, to say it with your words, the moment applications pass a surrogate,
they are no longer under the rules of POSIX, but they still honor the
restriction documented by __STDC_ISO_10646__.

However, *usually* an application shouldn't really notice that a
surrogate has been used, at least as long as they only manipulate entire
strings.

> this case, QoI demands that we honor surrogates to the best of our
> ability for full UTF-16 support, and you can have multi-wchar_t
> characters just as you already have multi-byte UTF-8 char characters).
> In other words, cygwin IS being POSIX-compliant by advertising only the
> Unicode 4.0 character set in the __STDC_ISO_10646__, while still
> supporting Unicode 5.2 (should we upgrade to Unicode 6.0?) as an
> extension when you no longer care about POSIX.
> 
> > However, the POSIX definition doesn't contradict what I said about the
> > definition of __STDC_ISO_10646__ as far as I'm concerned.
> 
> Yep - I think we're in violent agreement :)

Hmm, I'm not quite sure, see above.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]