This is the mail archive of the cygwin-developers mailing list for the Cygwin project.
Re: charset changes
- From: Corinna Vinschen <corinna-cygwin at cygwin dot com>
- To: cygwin-developers at cygwin dot com
- Date: Fri, 5 Feb 2010 22:50:47 +0100
- Subject: Re: charset changes
- References: <416096c61001230305x20619d39x55e3a46b428ba@mail.gmail.com> <4B6C474B.7090600@towo.net>
- Reply-to: cygwin-developers at cygwin dot com
On Feb 5 17:28, Thomas Wolff wrote:
> On 23.01.2010 12:05, Andy Koppe wrote:
> >I'm in awe at Corinna's latest locale changes. Getting closer and
> >closer to the real thing.
> Me too.
Thanks. I'm still mulling over the LC_MESSAGES problem. The
information is just not available in Windows so I assume we need a
file-based solution. But that's certainly nothing for 1.7.2.
> I found the following inconsistencies, and since the agreed strategy
> seems to be to prefer Linux compatibility over Windows mapping,
> I think especially the first group of a few incompatible mappings
> should be fixed before the 1.7.2 release.
>
> ------------------------------------------------------------------------
> These locales have inconsistent encodings:
> Locale Linux Cygwin
> et_EE ISO-8859-1 ISO-8859-15
In the latest glibc-2.11, per the localedata/locales/et_EE file,
there's only one charset, ISO-8859-15. I have no glibc-2.11 based
system running; on Fedora 11 (glibc-2.10) it's ISO-8859-1. Hmm.
> ka_GE GEORGIAN-PS UTF-8
We don't have an implementation of GEORGIAN-PS. If you provide one for
newlib, I'll take it. The problem is, how to integrate it into the
existing model which only has ISO and CPxxx codeset arrays for ctype and
wide char conversion? Faking a non-existent Windows codepage? That's
probably the easiest solution.
> kk_KZ PT154 ISO-8859-5
Same here.
> sr_CS ISO-8859-5 UTF-8
Doesn't exist in newer glibc since it's superseded by sr_RS and sr_ME.
I just mapped it to sr_RS. Is it really worth special-casing, given
that it's outdated? Incidentally, on Fedora 11 you get ANSI_X3.4-1968.
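To illustrate the kind of mapping meant here, a minimal sketch of an alias lookup follows. The names `LOCALE_ALIASES` and `resolve_alias` are hypothetical and this is not the code in Cygwin; the only entry taken from this message is sr_CS -> sr_RS. Any charset or modifier suffix is passed through unchanged.

```python
# Outdated locale names mapped to their current equivalent.
# Only sr_CS -> sr_RS comes from this message; the table is illustrative.
LOCALE_ALIASES = {"sr_CS": "sr_RS"}

def resolve_alias(name):
    """Replace an aliased language_TERRITORY part, keep the rest as is."""
    # The base name ends at the first '.' (charset) or '@' (modifier).
    cut = len(name)
    for sep in ".@":
        pos = name.find(sep)
        if pos != -1:
            cut = min(cut, pos)
    base, rest = name[:cut], name[cut:]
    return LOCALE_ALIASES.get(base, base) + rest
```

A later entry such as no_NO -> nn_NO would be one more line in the table, if that mapping turns out to make sense.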
> uz_UZ ISO-8859-1 UTF-8
Thanks, fixed.
> zh_HK BIG5-HKSCS BIG5
> - zh_HK is the dedicated Hongkong locale, so should use the Hongkong
> extension
How? This special variation of Big5 charset isn't supported by Windows
and we need Windows support for multibyte charsets other than UTF-8.
Per MSDN (http://msdn.microsoft.com/en-us/library/dd317756%28VS.85%29.aspx)
codepage 950 is "ANSI/OEM Traditional Chinese (Taiwan; Hong Kong SAR,
PRC); Chinese Traditional (Big5)". That has to be sufficient, unless
you provide a Big5/Big5-HKSCS multibyte <-> Unicode conversion with a
Cygwin-compatible license.
> - With respect to other differences above, linux has these two
> distinguished locales:
> et_EE.iso885915 ISO-8859-15
All the .charset variants are automatically available. There's no
special code required. If you like, just specify de_DE.KOI8-U.
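For reference, the layout of such a name is `language_TERRITORY.charset@modifier`, with the last two parts optional. The sketch below only pulls those parts apart; `split_locale` is a hypothetical helper, not how newlib parses the string.

```python
def split_locale(name):
    """Split 'lang_TERR.charset@modifier' into its three parts.

    Parts that are absent come back as None.
    Illustration only; this is not newlib's parser.
    """
    # The modifier, if any, follows the first '@'.
    base, _, modifier = name.partition("@")
    # The charset, if any, follows the first '.' of what is left.
    base, _, charset = base.partition(".")
    return base, charset or None, modifier or None
```

With this, "de_DE.KOI8-U" splits into the base "de_DE" and the charset "KOI8-U", while "uz_UZ@cyrillic" has no charset and the modifier "cyrillic".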
> uz_UZ@cyrillic UTF-8
Added and documented (together with tt_RU@iqtelif).
> - getlocale -a lists the following twice, without indicating a difference:
> sr_SP
> sr_BA
> az_AZ
> se_FI
> uz_UZ (see above)
Yeah, that's how the very simple mechanism works. On W7 there are even
more duplicates. I didn't want to make it more complicated than
necessary. After all its job is just to provide what's available on
the system.
> ------------------------------------------------------------------------
> Also, some generic encoding suffixes are not handled:
> - .iso885915 and .iso8859-15 (cygwin only recognizes .iso-8859-15
> and its capital)
> - .koi8r (cygwin only recognizes .koi8-r and .KOI8-R)
> - .koi8u (cygwin only recognizes .koi8-u and .KOI8-U)
I just applied code to newlib that allows specifying iso-8859 and koi8
charsets without dashes.
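The effect can be pictured with the following sketch, which accepts the dashed and dashless spellings listed above and returns one canonical form. It is an assumption-laden illustration, not the newlib change itself: `normalize_charset` is a hypothetical name and the regular expressions are simply one way to express the rule.

```python
import re

def normalize_charset(suffix):
    """Return a canonical spelling for ISO-8859-x and KOI8-x suffixes.

    Any other name is only upper-cased. Hypothetical helper, not newlib code.
    """
    s = suffix.upper()
    # iso885915, iso8859-15 and iso-8859-15 all become ISO-8859-15.
    m = re.fullmatch(r"ISO-?8859-?(\d{1,2})", s)
    if m:
        return "ISO-8859-" + m.group(1)
    # koi8r and koi8-r both become KOI8-R (likewise for KOI8-U).
    m = re.fullmatch(r"KOI8-?([A-Z])", s)
    if m:
        return "KOI8-" + m.group(1)
    return s
```

For example, `normalize_charset("koi8r")` and `normalize_charset("KOI8-R")` both give "KOI8-R".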
> - .tcvn (in vi_VN.tcvn)
Codeset not supported.
> - .gb18030 (in zh_CN.gb18030)
Ditto. However, it's supported by Windows XP and later. Maybe we
should add it after 1.7.2?
> - .eucjp (in ja_JP.eucjp)
This one *is* recognized by newlib, same as euckr/euc-kr.
> - .euctw (in zh_TW.euctw)
Codeset not supported. Wikipedia claims that EUC-TW isn't widely used
and Big5 is much more common in TW.
> (Maybe the latter lack Windows support or depend on Windows
> configuration...)
> - .koi8t
> - .armscii8
> - .big5hkscs
> - .gb2312
> - .georgianps
> - .pt154
None of them is supported. Yet! As long as they are single-byte charsets,
we should be able to add them easily by providing ctype and widechar
conversion tables to newlib. Care to contribute?
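As a rough picture of what a table-driven single-byte conversion involves, here is a sketch using ISO-8859-15, whose eight differences from ISO-8859-1 are well known. The names (`LATIN9_OVERRIDES`, `TO_WIDE`, `FROM_WIDE`, `sb_decode`, `sb_encode`) are hypothetical, and the layout is not that of newlib's actual tables, which this message does not show. A contribution for KOI8-T, PT154 or GEORGIAN-PS would need the corresponding byte-to-Unicode assignments from their own specifications.

```python
# The eight positions where ISO-8859-15 differs from ISO-8859-1.
LATIN9_OVERRIDES = {
    0xA4: 0x20AC,  # EURO SIGN
    0xA6: 0x0160,  # LATIN CAPITAL LETTER S WITH CARON
    0xA8: 0x0161,  # LATIN SMALL LETTER S WITH CARON
    0xB4: 0x017D,  # LATIN CAPITAL LETTER Z WITH CARON
    0xB8: 0x017E,  # LATIN SMALL LETTER Z WITH CARON
    0xBC: 0x0152,  # LATIN CAPITAL LIGATURE OE
    0xBD: 0x0153,  # LATIN SMALL LIGATURE OE
    0xBE: 0x0178,  # LATIN CAPITAL LETTER Y WITH DIAERESIS
}

# Byte -> code point table: start from the Latin-1 identity mapping,
# then patch in the differing positions.
TO_WIDE = list(range(256))
for byte, code_point in LATIN9_OVERRIDES.items():
    TO_WIDE[byte] = code_point

# Reverse table: code point -> byte.
FROM_WIDE = {code_point: byte for byte, code_point in enumerate(TO_WIDE)}

def sb_decode(data):
    """Convert single-byte encoded bytes to a Unicode string."""
    return "".join(chr(TO_WIDE[b]) for b in data)

def sb_encode(text):
    """Convert a Unicode string to bytes; raises KeyError if unmappable."""
    return bytes(FROM_WIDE[ord(ch)] for ch in text)
```

Supporting another single-byte charset in this scheme means supplying a different 256-entry table; the conversion functions stay the same.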
> - .ujis (-> EUC-JP)
That's just another name for euc-jp? Let's ignore that for now.
> ------------------------------------------------------------------------
> These locales are not known or handled on cygwin at all:
As documented, starting with 1.7.2 we support only locales which are
also supported by the underlying Windows (with the weird sr_SP/CS/RS/ME
exception). It doesn't make sense to support locales for which the
underlying Windows has no locale-specific LC_COLLATE/LC_MONETARY/
LC_NUMERIC/LC_TIME information available. That's the reason I provided
getlocale.exe, so that you can find out which locales are supported by
your Windows.
And, btw., your list is not quite correct. I don't know on which
Windows you tested that, but on Windows 7 the following locales *are*
supported:
> am_ET UTF-8
> bn_BD UTF-8
> bo_CN UTF-8
> br_FR ISO-8859-1
> en_IN UTF-8
> en_SG ISO-8859-1
> es_US ISO-8859-1
> ga_IE ISO-8859-1
> gd_GB ISO-8859-15
> ha_NG UTF-8
> hsb_DE ISO-8859-2
> ig_NG UTF-8
> iu_CA UTF-8
> kl_GL ISO-8859-1
> km_KH UTF-8
> lo_LA UTF-8
> ne_NP UTF-8
> no_NO ISO-8859-1
no_NO is not, but nn_NO is. If no_NO really makes sense, we could map it to nn_NO.
> nso_ZA UTF-8
> oc_FR ISO-8859-1
> or_IN UTF-8
> rw_RW UTF-8
> si_LK UTF-8
> tg_TJ KOI8-T
> tk_TM UTF-8
> ug_CN UTF-8
> wo_SN UTF-8
> yo_NG UTF-8
>
> ------------------------------------------------------------------------
> And finally, some systems (e.g. Fedora) maintain a number of
> full-word locales (locale aliases?) that are not known on cygwin
> either (maybe not harmful):
That's something for the far future, I guess.
Thanks for your input,
Corinna
--
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat