This is the mail archive of the cygwin mailing list for the Cygwin project.
Re: [1.7] codepage:utf removal and python
- From: Corinna Vinschen <corinna-cygwin at cygwin dot com>
- To: cygwin at cygwin dot com
- Cc: Jason Tishler <jason at tishler dot net>
- Date: Thu, 2 Apr 2009 10:40:38 +0200
- Subject: Re: [1.7] codepage:utf removal and python
- References: <49D3EB8D.3040802@acm.org>
- Reply-to: cygwin at cygwin dot com
On Apr 1 15:32, David Rothenberger wrote:
> When codepage:utf was supported, this worked fine. Now, it fails, even
> when I have LANG=en_US.UTF-8 in my environment. It all boils down to
> this python code:
>
> import os
> os.listdir('.')
>
> (That's an example I run from within the directory.) This fails with an
> error
>
> OSError: [Errno 138] Invalid or incomplete multibyte or wide
> character: '.'
>
> unless one does this first:
>
> import locale
> locale.setlocale(locale.LC_ALL, '')
That's always the better approach; otherwise the application runs
in the C locale.
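For illustration, a minimal Python 3 sketch of that approach follows. The helper name list_directory and the fallback on locale.Error are assumptions made for this example; they are not taken from rdiff-backup or from the original report.

```python
import locale
import os


def list_directory(path="."):
    """Return the entries of `path` after adopting the environment's locale."""
    try:
        # An empty string selects the locale named by LANG / LC_* variables.
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        # Assumption: if the environment names a locale the system does not
        # provide, keep running in the default C locale instead of aborting.
        pass
    return os.listdir(path)


if __name__ == "__main__":
    for name in list_directory("."):
        print(name)
```

Calling setlocale once, early at program startup, is the usual pattern, since the locale is process-wide state.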
> I've patched rdiff-backup to do this, but I'm still wondering if this is
> the correct thing to do. I know that on my Linux machine, I don't have
> to do this, but I'm not sure if that's because there's some default
> locale that's being picked up by Python from somewhere other than the
> environment.
The basic problem is that Windows stores filenames in UTF-16 while Linux
and other OSes store the filename as a simple, zero-terminated
bytestream. A simple bytestream is always valid. OTOH, a UTF-16 to
single-byte conversion always has characters that can't be converted.
To work around that, I created the filename conversion method explained in
http://cygwin.com/1.7/cygwin-ug-net/using-specialnames.html#pathnames-unusual
I'm not sure why this doesn't work in your simple case. The locale is C
because the application didn't use setlocale. The resulting charset is
ASCII. The filename should have been converted to use the ASCII SO/UTF-8
sequence for the non-readable characters.
[...time passes...]
And it works as designed in your above testcase.
I tested with a filename containing a Euro sign (Unicode 0x20ac), in
HTML speak "qq&euro;". Cygwin converted it to "qq\016\342\202\254".
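The byte sequence above can be reproduced with a short Python 3 sketch. This only illustrates the encoding described in this message (ASCII SO, octal 016, followed by the UTF-8 bytes of the character); it is not Cygwin's actual implementation, and the function names to_ascii_with_so and octal_escape are made up for the example.

```python
SO = b"\x0e"  # ASCII "Shift Out" control character, octal 016


def to_ascii_with_so(name):
    """Illustrative only: mimic the SO/UTF-8 fallback for an ASCII charset."""
    out = bytearray()
    for ch in name:
        if ord(ch) < 0x80:
            # Plain ASCII characters pass through unchanged.
            out += ch.encode("ascii")
        else:
            # Assumption: each unconvertible character becomes
            # SO followed by its UTF-8 representation.
            out += SO + ch.encode("utf-8")
    return bytes(out)


def octal_escape(data):
    """Show printable ASCII bytes as-is and all other bytes as \\ooo."""
    return "".join(
        chr(b) if 0x20 <= b < 0x7F else "\\%03o" % b for b in data
    )


print(octal_escape(to_ascii_with_so("qq\u20ac")))  # qq\016\342\202\254
```

The Euro sign U+20AC is E2 82 AC in UTF-8, which is 342 202 254 in octal, matching the converted filename quoted above.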
The strace looks perfectly normal. I have no idea what python complains
about!
Jason, can you shed some light on this problem?
Corinna
--
Corinna Vinschen Please, send mails regarding Cygwin to
Cygwin Project Co-Leader cygwin AT cygwin DOT com
Red Hat