This is the mail archive of the cygwin mailing list for the Cygwin project.

Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line


On Sat, May 30, 2009 at 1:04 AM, Edward Lam <edward@sidefx.com> wrote:
> Alexey Borzenkov wrote:
>> It might be safe for you, but not for other people. If you have a
>> Russian default codepage and ever need to work with Chinese/Japanese
>> filenames and cygwin uses default codepage for filesystem operations
>> (as in 1.5 right now), then you are really screwed. In my opinion
>> utf-8 is a silver bullet here, and I'm very glad it went that way.
> I must be missing something here. Suppose you have a default Russian code
> page, with LANG unset (i.e. cygwin 1.7 uses UTF-8). Now, if you're using any
> non-Unicode, non-CodePage aware, native application to create a Russian
> filename, isn't Windows going to convert the filename from the Russian code
> page into UTF-16 for storage in NTFS? If that is the case, and then you do
> an ls from cygwin 1.7, aren't you going to get the wrong filename displayed?
> i.e. interoperability with non-Unicode, non-CodePage aware native
> applications will be broken for you too with the current default cygwin 1.7
> behaviour.
>
> Or is this not a case that you care about, and you *only* use cygwin
> applications?

No, it is precisely because I care about both ends of interoperability.
Here is a hypothetical situation:

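# glob instead of `ls`, and quote the variable, so that names
# containing spaces are passed as a single argument
for filename in *; do
  someprogram "$filename"
done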

Here, when I use Russian Windows and don't have LANG set (or when I
have LANG=en_US.UTF-8), filename will be a UTF-8 multibyte string. So
Russian, European, Chinese and Japanese filenames will all be valid.
Now there are three possibilities:

1) someprogram is a cygwin application: then $filename must be passed
as is, without any conversion.
2) someprogram is a Unicode application: then it will receive a correct
Unicode argument.
3) someprogram is an ANSI application: then Windows (cygwin has
nothing to do with it) will convert its Unicode arguments to the
system's codepage (cp1251 for Russian), and any character that can't
be encoded will be replaced with a question mark. This is solely
someprogram's fault and cygwin has nothing to do with it.
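
To illustrate case 3, here is a rough Python sketch of what the lossy
conversion looks like. It is only an approximation: to_ansi is a
hypothetical helper, the file names are made up, and Python's
errors="replace" stands in for whatever Windows does internally (the
real conversion may also apply best-fit mappings that this ignores).

```python
def to_ansi(arg: str, codepage: str = "cp1251") -> bytes:
    """Approximate the Unicode -> ANSI conversion an ANSI program sees.

    Hypothetical helper: characters the codepage cannot represent are
    replaced with b"?" (Python's errors="replace" behaviour).
    """
    return arg.encode(codepage, errors="replace")


russian = "файл.txt"       # representable in cp1251
japanese = "ファイル.txt"   # not representable in cp1251

# The Russian name survives: decoding the bytes gives the name back.
print(to_ansi(russian).decode("cp1251") == russian)   # True

# The Japanese name is lost: each katakana character becomes "?".
print(to_ansi(japanese))                               # b'????.txt'
```

The Russian name arrives intact, while the Japanese one is already
destroyed by the time the ANSI program sees its arguments.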

All I'm trying to say is that on Windows (since WinNT) arguments are
always in Unicode. It just so happens that when an ANSI application
calls another ANSI application with a sequence of bytes, that sequence
is first converted to Unicode, then back to ANSI, and you get the same
sequence of bytes. But the arguments are always characters, not bytes.
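
The round trip can be sketched in Python as follows. Again this is only
a model under assumptions: ansi_to_unicode and unicode_to_ansi are
hypothetical names, cp1251 is assumed to be the system codepage on both
sides, and the sample word is arbitrary.

```python
CODEPAGE = "cp1251"  # assumed system ANSI codepage (Russian Windows)


def ansi_to_unicode(raw: bytes) -> str:
    """Caller side: the ANSI bytes are widened to Unicode characters."""
    return raw.decode(CODEPAGE)


def unicode_to_ansi(text: str) -> bytes:
    """Callee side: Unicode characters are narrowed back to ANSI bytes."""
    return text.encode(CODEPAGE, errors="replace")


# Bytes an ANSI caller would pass for a Russian word.
original = "привет".encode(CODEPAGE)

# What is actually carried between the processes is characters.
as_characters = ansi_to_unicode(original)

# The ANSI callee gets its bytes back unchanged, because every
# character involved exists in the shared codepage.
round_tripped = unicode_to_ansi(as_characters)

print(as_characters == "привет")    # True
print(round_tripped == original)    # True
```

The bytes only look like they were passed through untouched because
both ends agree on the codepage; what travels in between is characters.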
