This is the mail archive of the cygwin-developers mailing list for the Cygwin project.

Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)

From: Andy Koppe <andy dot koppe at gmail dot com>
To: cygwin-developers at cygwin dot com
Date: Sun, 27 Sep 2009 07:32:14 +0100
Subject: Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)

> The __utf8_wctomb function could just create the corresponding
> UCS-2 values if no first half has been encountered before. The
> __utf8_mbtowc function could simply allow these UCS-2 values again.
>
> That works (I just tested it) and is a small change, but is it really
> desirable to allow UCS-2 values in UTF-8 strings?

I don't know. The Wikipedia UTF-8 article is in two minds on the issue:

"UTF-8 may only legally be used to encode valid Unicode scalar values.
According to the Unicode standard the high and low surrogate halves
used by UTF-16 (U+D800 through U+DFFF) and values above U+10FFFF are
not legal Unicode values, and the UTF-8 encoding of them is an invalid
byte sequence and should be treated as described above.

"Whether an actual application should treat these as invalid is
questionable. Allowing them allows lossless conversion of an invalid
UTF-16 string and allows CESU encoding (described below) to be
decoded. There are other code points that are far more important to
detect and reject, such as the reversed-BOM U+FFFE, or the codes
U+0080..U+00AF which may indicate improperly translated CP1252 or
double-encoded UTF-8."

The pragmatic approach is tempting though, and we do have reasonable
grounds for it given the 16-bit wchar_t. But I think it would need to
work for both low and high surrogates.
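To make the pragmatic approach concrete, here is a minimal sketch of what "allowing" a lone surrogate amounts to: pushing the 16-bit value through the generic three-byte UTF-8 pattern without rejecting the range 0xD800..0xDFFF. The helper names encode3_permissive() and decode3_permissive() are made up for illustration and are not the newlib functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not newlib code: encode one 16-bit value in the
   range 0x0800..0xFFFF with the generic three-byte UTF-8 pattern.
   Surrogates (0xD800..0xDFFF) are deliberately not rejected. */
static size_t encode3_permissive(uint16_t wc, unsigned char out[3])
{
    assert(wc >= 0x0800);
    out[0] = (unsigned char)(0xE0 | (wc >> 12));
    out[1] = (unsigned char)(0x80 | ((wc >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (wc & 0x3F));
    return 3;
}

/* Matching hypothetical decoder: accepts any well-formed three-byte
   sequence, including those that decode to a surrogate value.
   Returns 3 on success, -1 on a malformed or overlong sequence. */
static int decode3_permissive(const unsigned char in[3], uint16_t *wc)
{
    uint16_t v;

    assert(wc != NULL);
    if ((in[0] & 0xF0) != 0xE0 ||
        (in[1] & 0xC0) != 0x80 ||
        (in[2] & 0xC0) != 0x80)
        return -1;
    v = (uint16_t)(((in[0] & 0x0F) << 12) |
                   ((in[1] & 0x3F) << 6) |
                   (in[2] & 0x3F));
    if (v < 0x0800)
        return -1;              /* overlong encoding */
    *wc = v;
    return 3;
}
```

With this, a lone 0xD800 becomes ED A0 80 and a lone 0xDC00 becomes ED B0 80, and both survive the round trip. A strict decoder would reject exactly those sequences.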

Regarding high surrogates, __utf8_wctomb() currently writes the first
byte of a four-byte sequence when it sees one, which of course it can't
take back if the following wchar_t isn't a low surrogate. This is a
problem even if lone high surrogates aren't going to be supported,
because that byte on its own is invalid UTF-8.
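As an illustration of why the lead byte can be produced early at all: the top bits of the final code point depend only on the high surrogate, so the first byte of the four-byte sequence is already determined before the low surrogate arrives. The function below is a hypothetical sketch of that calculation, not the actual newlib code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: compute the lead byte of the four-byte UTF-8
   sequence that a surrogate pair starting with `hi` will produce.
   The low surrogate only contributes the low 10 bits of the code
   point, so it cannot affect bits 18 and above. */
static unsigned char lead_byte_for_high_surrogate(uint16_t hi)
{
    uint32_t cp_base;

    assert(hi >= 0xD800 && hi <= 0xDBFF);
    cp_base = 0x10000u + ((uint32_t)(hi - 0xD800) << 10);
    return (unsigned char)(0xF0 | (cp_base >> 18));
}
```

The result is always in the range F0..F4. A byte in that range announces three continuation bytes, so if it has been written and no low surrogate follows, the output is left with a truncated sequence.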

As I read the POSIX spec, however, wctomb() is allowed to write nothing,
return zero, and leave the entire high surrogate to be dealt with on
the next call. It just says "wctomb() shall [...] return the number of
bytes that constitute the character corresponding to the value of
wchar", and unlike with mbtowc(), a return value of zero is not
defined to have special meaning.

There's also room to deal with a lone high surrogate at string end:
"If wchar is 0, a null byte shall be stored, preceded by any shift
sequence needed to restore the initial shift state, and wctomb() shall
be left in the initial shift state."
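A rough sketch of that deferred scheme, under my own assumptions: the pending high surrogate is kept in a small state struct (enc_state and encode_unit() are hypothetical names), nothing is written when a high surrogate arrives, and a pending high surrogate that turns out to be alone is flushed as a three-byte sequence in front of whatever comes next, including the terminating null.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical conversion state: pending high surrogate, or 0 if none. */
typedef struct { uint16_t pending_hi; } enc_state;

/* Generic UTF-8 encoder; surrogate values are not rejected. */
static size_t put_utf8(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Encode one 16-bit unit. Returns the number of bytes written to out,
   which is 0 when a high surrogate has merely been remembered.
   out must have room for 6 bytes (flushed lone high + 3-byte char). */
static size_t encode_unit(enc_state *st, uint16_t wc, unsigned char *out)
{
    size_t n = 0;

    assert(st != NULL && out != NULL);
    if (st->pending_hi != 0) {
        uint16_t hi = st->pending_hi;
        st->pending_hi = 0;
        if (wc >= 0xDC00 && wc <= 0xDFFF) {
            /* Proper pair: combine and emit four bytes at once. */
            uint32_t cp = 0x10000u + ((uint32_t)(hi - 0xD800) << 10)
                          + (uint32_t)(wc - 0xDC00);
            return put_utf8(cp, out);
        }
        /* Lone high surrogate: flush it as a three-byte sequence. */
        n = put_utf8(hi, out);
    }
    if (wc >= 0xD800 && wc <= 0xDBFF) {
        st->pending_hi = wc;    /* write nothing yet */
        return n;
    }
    /* Ordinary unit, lone low surrogate, or the terminating 0. */
    return n + put_utf8(wc, out + n);
}
```

Feeding 0xD83D then 0xDE00 yields 0 and then 4 bytes (F0 9F 98 80), while 0xD800 followed by 0 yields 0 and then ED A0 80 00 with the state back to initial. One caveat of this sketch is that a single call can produce up to six bytes, which a real wctomb() would have to reconcile with MB_CUR_MAX.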

Andy

Follow-Ups:
- Re: Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)
  - From: Andy Koppe
- Re: Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)
  - From: Corinna Vinschen