This is the mail archive of the cygwin mailing list for the Cygwin project.



Windows NTFS UCS2 characters


Hi Cygwin folks,

I have a Windows file on NTFS named (using \uXXXX representation):
xxx_\u212B_A\u030A_\u00C5_xxx.txt

# ls -alb xxx_*_xxx.txt

ls: xxx_\305_A\260_\305_xxx.txt: No such file or directory

Windows sees it just fine.  The bash *-expansion is expanding it to
/something/, just not to a name that ls can then find, it appears.

I can select the file in Explorer, and I can double-click it to edit it.  I
can use MS-Notepad to put some text in it and save it.  (Shudder.  Cygwin's
Vim can't see the file either, whether passed on the command line or through
Vim's explorer; I don't have a Windows-native Vim/gVim to test.)

But Cygwin / bash / ls finds that filename unpalatable.  Hmmm.

# echo -n xxx_*_xxx.txt | xxd -g 1

78 78 78 5F C5 5F 41 B0 5F C5 5F 78 78 78 2E 74 78 74
 x  x  x  _ Ao  _  A ^o  _ Ao  _  x  x  x  .  t  x  t

(The character representation line was typed in by me, not xxd.  I'm using Ao
to represent the A-with-overcircle, and ^o the combining overcircle.)

I presume Cygwin's bash operates using UTF8 encoded POSIX filenames.  I
expect the name should have been expanded as:

78 78 78 5F E2 84 AB 5F 41 CC 8A 5F C3 85 5F 78 78 78 2E 74 78 74
            ^^^^^^^^       ^^^^^    ^^^^^

E2 84 AB is UTF8 for \u212B
CC 8A is UTF8 for \u030A
C3 85 is UTF8 for \u00C5
(Assuming I didn't mess up)

Hmmm.  Yep, it appears that xxx_*_xxx.txt is expanding funny.

# ls -alb -n xxx_$'\xE2\x84\xAB'_A$'\xCC\x8A'_$'\xC3\x85'_xxx.txt

ls: xxx_\342\204\253_A\314\212_\303\205_xxx.txt: No such file or directory

Drat.  Still no love.  So even if hand fed the UTF8 representation, ls is
not able to digest the name.  (Assuming I didn't mess up.)

Is there some sort of UCS2, UTF8 or Unicode compatibility setting I need
to set for Cygwin to work in Windows' NTFS environment when some
filenames contain arbitrary UCS2 (Unicode 1.x, of course) characters?

I presume that somewhere something is set to CP1252 and causing grief.

Hmmm, I have neither LANG nor LC_ALL (nor any other LC_xxx) set.  Maybe that's
my problem.  [Tries it.]  Nope -- or I didn't do it correctly.

I can always fall back to CMD.EXE scripts to manipulate these files,
but I'd rather be able to do it in my Bash shell scripts.

Please don't suggest Interix, SFU or MKS alternatives.  Those are fine
products, I'm sure, but I'm not interested.

Thanks,
--Eljay

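/* MSVS8: cl test.c */
#include <windows.h>
#include <stdio.h>

int main(void)
{
  /* Create a file whose name Cygwin does not like:
     U+212B ANGSTROM SIGN, 'A' + U+030A COMBINING RING ABOVE,
     and U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE. */
  HANDLE h = CreateFileW(
    L"xxx_\u212B_A\u030A_\u00C5_xxx.txt",
    GENERIC_READ | GENERIC_WRITE,
    0,            /* no sharing */
    NULL,         /* default security attributes */
    OPEN_ALWAYS,  /* open the file, creating it if needed */
    0,            /* no special flags or attributes */
    NULL);        /* no template file */

  if (h == INVALID_HANDLE_VALUE)
  {
    /* Report the Win32 error code so the failure can be diagnosed. */
    fprintf(stderr, "Invalid handle (error %lu)\n",
            (unsigned long)GetLastError());
    return 1;
  }

  fprintf(stderr, "Successfully opened\n");
  CloseHandle(h);
  return 0;
}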



