This is the mail archive of the cygwin-developers mailing list for the Cygwin project.
Re: Suffixes in non-western charsets
- From: Igor Peshansky <pechtcha at cs dot nyu dot edu>
- To: cygwin-developers at cygwin dot com
- Date: Mon, 28 Jan 2008 09:14:49 -0500 (EST)
- Subject: Re: Suffixes in non-western charsets
- References: <20080128124211.GH30866@calimero.vinschen.de>
- Reply-to: cygwin-developers at cygwin dot com
On Mon, 28 Jan 2008, Corinna Vinschen wrote:
> Hi,
>
> sorry for my ignorance, but I found that I have no idea how file
> suffixes are handled when working in a non-western charset environment.
> What I'm up to is this:
>
> When you're using a Latin-character-based charset like ASCII or
> ISO-8859-1, the suffixes used, for instance, for executables or
> shortcuts are always the same. An executable has ".exe" or ".com", a
> shortcut has ".lnk", a batch file ".bat", and so on.
>
> How does that work in non-Latin charsets such as Cyrillic, Chinese, or
> Japanese? Are these suffixes in some way translated into the non-Latin
> charset? If so, how?
>
> Given that NTFS uses UTF-16, it would be possible to keep the Latin
> characters as part of the filename. So, if I try to find out whether a
> path name is a batch file, the comparison with L".bat" would still be
> valid. But does it actually work this way?
>
> FAT uses the system OEM charset. Many applications are still using
> single/multi-byte functions. So, how does it work? Are the suffixes
> fixed by always using the same byte values, regardless of the meaning of
> those byte values in the charset in use? Or are they translated to
> characters which have some similarity to the Latin characters the
> suffixes are based on? Would the "usual" comparison work after
> converting the filename to UTF-16 (as for L".bat")?
>
> Can anybody enlighten me here?

As far as I know, most 8-bit charsets share the 7-bit ASCII portion and
differ only in the upper 128 characters. With the Windows-1251 (Cyrillic)
charset, the suffixes fall within the ASCII subset and are therefore
unchanged. I have also seen other 8-bit charsets in use (1252, 1255), and
suffixes were not translated there either.

I'm not certain that the CJK charsets share this property.
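
One rough way to check the ASCII-subset property is to encode the suffixes
in each code page and compare the bytes with plain ASCII. The sketch below
is in Python and rests on a few assumptions: the codec names (cp1251,
cp932, and so on) are Python's own tables standing in for the Windows code
pages, the helper names are made up for illustration, and the result says
nothing about how FAT or any particular application stores the names on
disk.

```python
# Sketch only: Python codecs stand in for the Windows code pages.
SUFFIXES = [".exe", ".com", ".lnk", ".bat"]

# 8-bit code pages mentioned above, plus some CJK code pages
# (the CJK list is an assumption chosen for illustration).
CODE_PAGES = ["cp1251", "cp1252", "cp1255", "cp932", "cp936", "cp949", "cp950"]


def suffix_is_ascii_stable(suffix, codec):
    """Return True if `suffix` encodes to the same bytes as in ASCII
    and those bytes decode back to the same suffix in `codec`."""
    raw = suffix.encode(codec)
    return raw == suffix.encode("ascii") and raw.decode(codec) == suffix


def has_suffix(raw_name, codec, suffix):
    """Decode a byte-string filename first, then compare the tail
    case-insensitively (hypothetical helper, not a Cygwin function)."""
    return raw_name.decode(codec).lower().endswith(suffix)


if __name__ == "__main__":
    for codec in CODE_PAGES:
        ok = all(suffix_is_ascii_stable(s, codec) for s in SUFFIXES)
        print(codec, "ASCII-stable" if ok else "differs")
```

The has_suffix helper decodes before comparing, which mirrors the idea of
converting the name to UTF-16 and then comparing with L".bat". That avoids
having to reason about raw bytes in the multi-byte code pages, where a byte
value in the ASCII range can also be the second byte of a two-byte
character.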
Igor
--
http://cs.nyu.edu/~pechtcha/
|\ _,,,---,,_ pechtcha@cs.nyu.edu | igor@watson.ibm.com
ZZZzz /,`.-'`' -. ;-;;,_ Igor Peshansky, Ph.D. (name changed!)
|,4- ) )-,_. ,\ ( `'-' old name: Igor Pechtchanski
'---''(_/--' `-'\_) fL a.k.a JaguaR-R-R-r-r-r-.-.-. Meow!
"That which is hateful to you, do not do to your neighbor. That is the whole
Torah; the rest is commentary. Go and study it." -- Rabbi Hillel