This is the mail archive of the cygwin mailing list for the Cygwin project.



Re: [1.7] codepage:utf removal and python


On Apr  1 15:32, David Rothenberger wrote:
> When codepage:utf was supported, this worked fine. Now, it fails, even  
> when I have LANG=en_US.UTF-8 in my environment. It all boils down to  
> this python code:
>
>   import os
>   os.listdir('.')
>
> (That's an example I run from within the directory.) This fails with an  
> error
>
>   OSError: [Errno 138] Invalid or incomplete multibyte or wide character: '.'
>
> unless one does this first:
>
>   import locale
>   locale.setlocale(locale.LC_ALL, '')

That's always the better approach; otherwise the application runs in
the C locale.
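
For illustration, here is a minimal sketch of that approach. The helper
name list_directory is made up for this example, and ignoring
locale.Error is an assumption about what a caller might want, not
something rdiff-backup is known to do:

```python
import locale
import os


def list_directory(path="."):
    """List a directory after adopting the locale from the environment."""
    try:
        # An empty string means "use LANG / LC_* from the environment"
        # instead of staying in the default C locale.
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        # The environment names a locale this system does not provide;
        # in that case the process simply stays in its current locale.
        pass
    return os.listdir(path)


if __name__ == "__main__":
    for name in list_directory("."):
        print(name)
```

Calling setlocale once at program startup is usually enough; it does
not need to be repeated before every listdir call.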

> I've patched rdiff-backup to do this, but I'm still wondering if this is  
> the correct thing to do. I know that on my Linux machine, I don't have  
> to do this, but I'm not sure if that's because there's some default  
> locale that's being picked up by Python from somewhere other than the  
> environment.

The basic problem is that Windows stores filenames in UTF-16, while Linux
and other OSes store the filename as a simple, zero-terminated byte
stream.  A simple byte stream is always valid.  OTOH, a conversion from
UTF-16 to a single-byte charset always has characters which can't be
converted.
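
The difference can be seen with a small Python 3 sketch.  The helper
can_encode is hypothetical and only probes Python's own codecs; it
says nothing about how Cygwin itself performs the conversion:

```python
def can_encode(name, charset):
    """Return True if every character of name exists in charset."""
    try:
        name.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


name = "qq\u20ac"  # "qq" followed by the Euro sign

# The UTF-16 form (little-endian, as used on Windows) always exists.
utf16_bytes = name.encode("utf-16-le")

# Whether a narrower charset can represent the name depends on the charset.
ascii_ok = can_encode(name, "ascii")      # False: no Euro sign in ASCII
latin1_ok = can_encode(name, "latin-1")   # False: no Euro sign in ISO-8859-1
utf8_ok = can_encode(name, "utf-8")       # True: UTF-8 covers all of Unicode
```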

To work around that, I created the filename conversion method explained in
http://cygwin.com/1.7/cygwin-ug-net/using-specialnames.html#pathnames-unusual

I'm not sure why this doesn't work in your simple case.  The locale is C
because the application didn't use setlocale.  The resulting charset is
ASCII.  The filename should have been converted to use the ASCII SO/UTF-8
sequence for the non-readable characters.

[...time passes...]

And it works as designed in your above testcase.

I tested with a filename containing a Euro sign (Unicode 0x20ac), in
HTML speak "qq&euro;".  Cygwin converted it to "qq\016\342\202\254".
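
The byte sequence above can be reproduced with a short Python 3 sketch.
This only approximates the scheme for the ASCII charset as described in
this mail (SO, i.e. 0x0e, followed by the UTF-8 bytes of the character);
the function name is made up and this is not Cygwin's actual code:

```python
SO = b"\x0e"  # ASCII "Shift Out" control character, octal \016


def to_ascii_with_so_escapes(name):
    """Encode name as ASCII, escaping other characters as SO + UTF-8."""
    out = bytearray()
    for ch in name:
        if ord(ch) < 0x80:
            # Plain ASCII characters pass through unchanged.
            out += ch.encode("ascii")
        else:
            # Anything else becomes SO followed by its UTF-8 bytes.
            out += SO + ch.encode("utf-8")
    return bytes(out)


converted = to_ascii_with_so_escapes("qq\u20ac")
```

Here converted compares equal to the bytes literal
b"qq\016\342\202\254", since the Euro sign is e2 82 ac in UTF-8, which
is 342 202 254 in octal.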

The strace looks perfectly normal.  I have no idea what python complains
about!

Jason, can you shed some light on this problem?


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
