CYGWIN=codepage? Or LC_CTYPE=foo?

Sun Apr 6 11:14:00 GMT 2008

On Apr  6 15:59, Kazuhiro Fujieda wrote:
> >>> On Thu, 03 Apr 2008 17:54:48 +0200
> >>> Corinna Vinschen said:
> 
> > That means, in theory there's no reason anymore to keep the
> > CYGWIN=codepage setting in the environment.  We could use the LC_CTYPE
> > setting, just as on other systems.  Right now, we need the LC_CTYPE
> > set to "C-UTF-8" anyway when using the codepage:utf8 setting, otherwise
> > the wcstombs and mbstowcs conversions in newlib will be broken.
> >
> > But there's a problem.  The newlib conversion functions don't know
> > anything about Windows codepages, and the Windows conversion functions
> > used in the Cygwin functions sys_wcstombs and sys_mbstowcs don't know
> > anything about LC_CTYPE. 
> 
> By specification, LC_CTYPE controls the character handling of C
> library functions, not of system calls. I believe the Cygwin DLL
> should use sys_wcstombs and sys_mbstowcs with CYGWIN=codepage, and
> not depend on userland functions.

Isn't that somewhat error-prone?  Right now, if you define codepage:utf8
and don't define LC_CTYPE='C-UTF-8', you will probably still have
working file names most of the time, but you get garbled console
output because the strings sent by the application are incorrectly
evaluated by the console code.  That's one reason I hoped that we
wouldn't need two places to define language/codepage stuff.
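
To illustrate why both settings have to agree, here is a small sketch
in plain C.  The function names are made up for this example; none of
this is Cygwin code.  It counts the characters in the same byte
sequence twice, once assuming a single-byte codepage and once assuming
UTF-8 (and, for simplicity, well-formed input).

```c
#include <stddef.h>

/* Count characters assuming one byte per character, as in an
   ISO-8859-x or single-byte Windows ANSI codepage.  */
size_t
count_chars_single_byte (const unsigned char *s, size_t n)
{
  (void) s;
  return n;
}

/* Count characters assuming UTF-8: every byte that is not a
   continuation byte (bit pattern 10xxxxxx) starts a new character.
   Assumes the input is well-formed UTF-8.  */
size_t
count_chars_utf8 (const unsigned char *s, size_t n)
{
  size_t i, count = 0;

  for (i = 0; i < n; i++)
    if ((s[i] & 0xC0) != 0x80)
      count++;
  return count;
}
```

For the UTF-8 encoding of "für" (bytes 0x66 0xC3 0xBC 0x72), the first
function reports 4 characters and the second reports 3.  A consumer
working under the wrong assumption sees a different string than the
one the application meant to send.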

Another is that Cygwin is not using any function which really requires a
codepage.  The codepage is needed for applications calling Windows ANSI
functions.  But Cygwin doesn't call these functions, so the focus of
language and character set support has moved from the Cygwin->OS
interface to the application->Cygwin interface.

So, given my vague understanding of this language stuff, the conversion
from wide char to multibyte string *can* be based on the notion the
applications have of the language/codepage.  Which sounds to me as if
using LANG/LC_CTYPE would also make sense for Cygwin's internal
conversions.

Does that make sense?  I don't know.  No. 5: "More input, please!"

> The Cygwin DLL, however, contains both system calls and userland
> functions. Controlling both of them through LC_CTYPE is not a
> bad idea.
> 
> To achieve this, it is necessary to make functions related to
> character handling know about the mapping between locale names
> and Windows codepages. For example, if LC_CTYPE is set to
> de_DE@ISO-8859-15, they should know it designates codepage 28605.
> 
> The current implementations of mbstowcs and wcstombs do not work
> at all in this scenario. We must replace them with implementations
> based on MultiByteToWideChar and WideCharToMultiByte. The
> emulation will come at a small cost. The Cygwin DLL should also
> use sys_wcstombs and sys_mbstowcs in this scenario.

I would basically be fine with that.  We just have to replace the
newlib functions _mbtowc_r and _wctomb_r.  All other conversions are
based on these.  What we still need as well is a good conversion
function from LANG/LC_CTYPE to Windows codepage.
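
As a starting point, a minimal sketch of such a lookup in plain C could
look like the following.  The function name locale_to_codepage, the
table layout, and the choice to return 0 for unknown names are
assumptions for illustration, not existing Cygwin or newlib code.  It
accepts the POSIX form language_TERRITORY.codeset[@modifier], the
"C-UTF-8" style name mentioned above, and also a codeset written after
'@' as in the quoted example.  The table lists only a handful of
well-known Windows codepage identifiers.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

struct charset_entry
{
  const char *name;        /* codeset name as used in locale strings */
  unsigned int codepage;   /* Windows codepage identifier */
};

/* Deliberately tiny table; a real one would need many more entries
   and aliases (e.g. "utf8" for "UTF-8").  */
static const struct charset_entry charset_table[] = {
  { "UTF-8",       65001 },
  { "ISO-8859-1",  28591 },
  { "ISO-8859-2",  28592 },
  { "ISO-8859-15", 28605 },
  { "CP1252",      1252 },
  { "SJIS",        932 },
};

/* Return 1 if the LEN bytes at S equal NAME, ignoring ASCII case.  */
static int
charset_equal (const char *s, size_t len, const char *name)
{
  size_t i;

  if (strlen (name) != len)
    return 0;
  for (i = 0; i < len; i++)
    if (tolower ((unsigned char) s[i]) != tolower ((unsigned char) name[i]))
      return 0;
  return 1;
}

/* Map a LANG/LC_CTYPE value to a Windows codepage.  Returns 0 if no
   codeset is present or the codeset is not in the table.  */
unsigned int
locale_to_codepage (const char *locale)
{
  const char *cs, *end;
  size_t i, len;

  if (locale == NULL)
    return 0;
  if (strncmp (locale, "C-", 2) == 0)
    cs = locale + 2;                      /* "C-UTF-8" style */
  else if ((cs = strchr (locale, '.')) != NULL)
    cs++;                                 /* "de_DE.ISO-8859-15" */
  else if ((cs = strchr (locale, '@')) != NULL)
    cs++;                                 /* "de_DE@ISO-8859-15" */
  else
    return 0;

  /* Strip a trailing "@modifier", if any.  */
  end = strchr (cs, '@');
  len = end ? (size_t) (end - cs) : strlen (cs);

  for (i = 0; i < sizeof charset_table / sizeof charset_table[0]; i++)
    if (charset_equal (cs, len, charset_table[i].name))
      return charset_table[i].codepage;
  return 0;
}
```

With this sketch, locale_to_codepage ("de_DE@ISO-8859-15") yields 28605
and locale_to_codepage ("C-UTF-8") yields 65001.  A spelling such as
"en_US.utf8" is not matched, which shows that the hard part is the
completeness of the table rather than the parsing.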

And here's the problem:  I don't think I understand this stuff well
enough.  Does anybody have the time and inclination to come up with that?

Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat