"C" character set (again)

Thomas Wolff towo@towo.net
Fri Jan 8 14:45:00 GMT 2010


I just wrote and forgot two things:
> Corinna Vinschen wrote:
>> On Jan  8 12:12, Thomas Wolff wrote:
>>  
>>> Andy Koppe wrote:
>>>    
>>>> There's an important distinction here between the C locale and the
>>>> default locale. The C locale is what you get if you don't call
>>>> setlocale at all, whereas the default locale is what you get if you
>>>> call setlocale(LC_FOO, "") and the relevant environment variables are
>>>> all unset or empty.
>>>>
>>>> The default locale uses UTF-8, and I most certainly agree that this
>>>> should stay as is. The charset of the filesystem and the console are
>>>> both controlled by the default locale (unless overridden in the
>>>> environment). They are independent of the C locale's charset or
>>>> whether an application calls setlocale.
>>>>
>>>> No, this is about the C locale only. Lots of people and programs make
>>>> assumptions about the C locale which may not be valid according to
>>>> POSIX, but which nevertheless hold true for Linux and most (if not
>>>> all) other Unices, including Cygwin 1.5. The most important assumption
>>>> is that the C locale is 8-bit clean.
>>>>       
>>> And byte-transparent, right?
>>> Which gets me back to this printf issue; actually your point here
>>> seems to support my arguments there, if only I had explicitly
>>> restricted them to the C locale.
>>> Could you agree that functions like sprintf should handle their char
>>> * arguments byte-transparently if acting in the C locale?
>>>     
>> It does!  ...
> I couldn't reproduce this for an hour until I noticed why, and 
> suddenly all arguments seem to blend well together:
> My sample program (attached) ...
:-[
> as well as the sample from the other thread do not even work if Cygwin 
> runs in an 8-bit locale.
> This is surprising - a user cannot rectify the problem using the 
> locale mechanism although it is supposed to provide the feature of 
> proper adjustment.
> The program can be made to do what's expected only if setlocale is 
> invoked explicitly to set an 8-bit locale
> (included in a comment of my program).
> The reason is probably that programs always start in the "C" locale (I 
> think that's required by POSIX?). If that's UTF-8, however, the 
> behaviour of locale-agnostic programs is not as expected. This 
> actively breaks legacy compatibility.
> So, actually, reconsidering your response above, no it does not. If 
> running in the C locale, whether explicitly or implicitly,
> sprintf is not byte-transparent in 1 of 3 cases (of my sample 
> program), and printf is not byte-transparent in 2 of 3 cases (which is 
> another surprising inconsistency, between printf and sprintf).
>
> Some of the details have been noted before (sorry), but for me, this 
> summary results in a clearer picture now,
> and the best and easiest solution IMHO would be to indeed change the C 
> locale back to 8 bit, byte-transparent, and not even plan to change 
> it again later.
> (That's why I'm discussing it here, not in the sprintf thread.)
>
>> The problem occurs in the *format* string.  ...
> [Maybe this should be discussed in the other thread but let's keep it 
> together for now.]
> Yes, and I doubted (in the other thread) that it should occur, putting 
> it more precisely now, because in
> http://www.opengroup.org/onlinepubs/9699919799/functions/sprintf.html
> the condition that "a wide-character code that does not correspond to 
> a valid character has been detected" is only mentioned as a condition 
> for the EILSEQ error.
> While Andy had a valid point in finding *format* to be described as a 
> "character string" and relating that to a generic POSIX definition of 
> character,
I doubt any implicit consequence was intended by the authors of this 
POSIX manpage, considering that error conditions are otherwise 
extensively and carefully described,
and
> this certainly does not justify the current behaviour of silent 
> dropping and reporting partial success because that is not one of the 
> options in the "RETURN VALUE" section;
> also I don't see what Andy's claim "Including invalid bytes in the 
> format string is undefined behaviour." is based on.
>
> So I'd like to encourage you to apply your patch to vprintf (I don't 
> see a need to feel uneasy about it) in any case - whether or not the C 
> locale gets changed;
> there is an additional consideration in favour of it:
> The printf functions, especially fprintf and sprintf, are not 
> necessarily preparing text output, esp. to a terminal. They can also 
> be used to prepare binary data for output into a file which is totally 
> locale-agnostic and shouldn't be broken.
>
> ------
> Thomas
>

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: p.c
URL: <http://cygwin.com/pipermail/cygwin-developers/attachments/20100108/653d44b1/attachment.c>
