This is the mail archive of the cygwin-developers mailing list for the Cygwin project.

Re: default charset for imlicit locale specificatio

From: Corinna Vinschen <corinna-cygwin at cygwin dot com>
To: cygwin-developers at cygwin dot com
Date: Wed, 20 Jan 2010 11:07:06 +0100
Subject: Re: default charset for imlicit locale specificatio
References: <20100119181535.GI2402@calimero.vinschen.de> <4B561006.2020000@cygwin.com> <416096c61001192329m228d832coe957d848550d9c79@mail.gmail.com>
Reply-to: cygwin-developers at cygwin dot com

On Jan 20 07:29, Andy Koppe wrote:
> However, as Thomas Wolff mentioned previously, there's a de-facto
> standard for the charset used with each language when none is
> specified explicitly, so implementing that instead is worth
> considering. 

The problem is that this information isn't provided by Windows.  I can
fetch the ANSI or OEM codepage, but not the ISO-8859 compatible codepage
for a language, if such a codepage exists.

Further testing shows that only a handful of codepages are used as
default ANSI codepages for languages.  This would make a very small
transition table:

  874	ANSI/Thai		-> CP874 (== ISO-IR-166 used on Linux)
  932	SJIS			-> SJIS
  936	GB2312			-> GBK
  949	ANSI/Korean		-> EUCKR
  950	Big-5			-> Big-5
 1250	ANSI/Central European	-> ISO-8859-2
 1251	ANSI/Cyrillic		-> ISO-8859-5
 1252	ANSI/Latin 1		-> ISO-8859-1
 1253	ANSI/Greek		-> ISO-8859-7
 1254	ANSI/Turkish		-> ISO-8859-9
 1255	ANSI/Hebrew		-> ISO-8859-8
 1256	ANSI/Arabic		-> ISO-8859-6
 1257	ANSI/Baltic		-> ISO-8859-4
 1258	ANSI/Vietnamese		-> UTF-8
65001	UTF-8			-> UTF-8

Is that a valid transition?
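
For illustration, the table above could be kept as a small static array
with a linear lookup.  This is only a sketch and not the actual Cygwin
code: the names cp_map, cp_to_charset and default_charset_for_codepage
are made up for this example, and the charset strings are copied
verbatim from the table.

```c
#include <stddef.h>

/* One table row: Windows default ANSI codepage -> default charset name.
   The struct and array names are hypothetical. */
struct cp_map
{
  unsigned int codepage;
  const char *charset;
};

static const struct cp_map cp_to_charset[] =
{
  {   874, "CP874"      },	/* ANSI/Thai */
  {   932, "SJIS"       },	/* SJIS */
  {   936, "GBK"        },	/* GB2312 */
  {   949, "EUCKR"      },	/* ANSI/Korean */
  {   950, "Big-5"      },	/* Big-5 */
  {  1250, "ISO-8859-2" },	/* ANSI/Central European */
  {  1251, "ISO-8859-5" },	/* ANSI/Cyrillic */
  {  1252, "ISO-8859-1" },	/* ANSI/Latin 1 */
  {  1253, "ISO-8859-7" },	/* ANSI/Greek */
  {  1254, "ISO-8859-9" },	/* ANSI/Turkish */
  {  1255, "ISO-8859-8" },	/* ANSI/Hebrew */
  {  1256, "ISO-8859-6" },	/* ANSI/Arabic */
  {  1257, "ISO-8859-4" },	/* ANSI/Baltic */
  {  1258, "UTF-8"      },	/* ANSI/Vietnamese */
  { 65001, "UTF-8"      }	/* UTF-8 */
};

/* Return the default charset for a codepage, or NULL if the codepage
   is not in the table.  Hypothetical helper for this sketch. */
const char *
default_charset_for_codepage (unsigned int codepage)
{
  size_t i;

  for (i = 0; i < sizeof cp_to_charset / sizeof cp_to_charset[0]; i++)
    if (cp_to_charset[i].codepage == codepage)
      return cp_to_charset[i].charset;
  return NULL;
}
```

With only 15 entries a linear search is good enough.  A caller would
have to decide what to fall back to when NULL is returned.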

What's missing is a mapping to ISO-8859-15 for languages that use the
euro currency sign.  I assume that's done by adding the @euro modifier?

>  But at least the Windows-based solution should come
> fairly close to it, because many of the Windows codepages are largely
> compatible to their ISO equivalents. And it uses data that's already
> there, avoiding the need for maintaining a mapping table.

That's what I like most.  Windows has (almost) all the information
we need.  Why not just use it?

> Btw, just out of curiosity, how do you find the Windows locale for a
> given POSIX locale? Do you have to iterate through all the Windows
> locales until finding one with the correct ISO language and territory
> codes?

Starting with Windows Vista, Windows uses (almost) POSIX-compatible
locale strings rather than numerical LCIDs to specify a locale.  For
instance, "German (Germany)" has the locale string "de-DE".  The only
difference is the dash instead of the underscore.  Windows also knows
languages without a territory, like "de".  There's a new call,
LocaleNameToLCID(), which converts the (almost) POSIX-compatible locale
string to an LCID, so I can keep using LCIDs for everything that
follows.  On systems before Vista I have to iterate through the LCIDs,
but that's quickly done since the valid range is small.
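
The string conversion described above can be sketched as follows.  This
is a minimal, portable illustration and not the actual Cygwin code: the
function name posix_to_windows_locale_name is made up, and cutting the
string at the first '.' or '@' (charset and modifier) is an assumption
of this example.

```c
#include <stddef.h>

/* Convert a POSIX locale string such as "de_DE.UTF-8" into the Windows
   style "de-DE".  The charset and modifier parts are dropped, and the
   underscore becomes a dash.  Returns 0 on success, -1 if the input is
   empty or the output buffer is too small.  Hypothetical helper. */
int
posix_to_windows_locale_name (const char *posix, char *out, size_t outlen)
{
  size_t i;

  if (posix == NULL || out == NULL || outlen == 0)
    return -1;
  for (i = 0; posix[i] != '\0' && posix[i] != '.' && posix[i] != '@'; i++)
    {
      if (i + 1 >= outlen)
	return -1;	/* no room for this char plus the trailing NUL */
      out[i] = (posix[i] == '_') ? '-' : posix[i];
    }
  if (i == 0)
    return -1;		/* nothing before the charset or modifier */
  out[i] = '\0';
  return 0;
}
```

On Vista and later, the result would still have to be converted to a
wide-character string before it can be handed to LocaleNameToLCID(),
which takes an LPCWSTR and a flags argument.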

Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

Follow-Ups:
- Re: default charset for imlicit locale specificatio
  - From: Corinna Vinschen

References:
- default charset for imlicit locale specificatio
  - From: Corinna Vinschen
- Re: default charset for imlicit locale specificatio
  - From: Larry Hall (Cygwin Developers)
- Re: default charset for imlicit locale specificatio
  - From: Andy Koppe
