This is the mail archive of the cygwin mailing list for the Cygwin project.



Re: bug/deficiency in zip: non-ascii chars in file names work, but fail in directory names


Doug Henderson wrote:
    "You need to add the -r option to recurse into directories:"


You are 100% correct; my oversight.


Actually, it was a copy and paste error: the real code that I want to test does use -r, but when I tried to adapt that code to a simpler format for my email, I accidentally dropped the -r.


The code that I really want to test fails with a different error, so you solved a mystery that was really bugging me: why the console code in my email behaved differently from the test code I really care about.



I returned to analysing my real test code more carefully, and I still see a problem with cygwin's unzip: it fails to extract zip files with unicode names that are produced by OTHER programs (i.e. some other program besides cygwin zip).


In particular, one part of my test code creates a zip archive using Java (ZipOutputStream and ZipEntry), and then confirms that the archive can be extracted and exactly reproduced by multiple other means.

The first extraction method is to again use Java (ZipFile and ZipEntry); this works perfectly, as it should.
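For anyone who wants to reproduce the Java side, here is a minimal sketch of the round trip described above. It is not my actual test code: the class name, the assumption that the file sits inside the directory, and the 2048 zero bytes of content are all made up for illustration. The entry names are written as escapes so that the source file encoding cannot interfere.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical reproduction class; not the original test code.
final class ZipNameRoundTrip {
    // Directory name from the report (zip directory entries end with '/').
    static final String DIR_NAME = "\u00E5\u00D8\u00E2\u00E9\u00F1/";
    // Assumption: the file lives inside that directory.
    static final String FILE_NAME =
            DIR_NAME + "\u3400\u4E01\u9FA6\uF900\uFA30_file#2_length2048.txt";

    // Write one directory entry and one file entry, encoding names as UTF-8.
    static void writeArchive(Path zip) throws IOException {
        try (OutputStream out = Files.newOutputStream(zip);
             ZipOutputStream zos = new ZipOutputStream(out, StandardCharsets.UTF_8)) {
            zos.putNextEntry(new ZipEntry(DIR_NAME));
            zos.closeEntry();
            zos.putNextEntry(new ZipEntry(FILE_NAME));
            zos.write(new byte[2048]); // assumed content: 2048 zero bytes
            zos.closeEntry();
        }
    }

    // Read the entry names back with ZipFile, decoding names as UTF-8.
    static List<String> readNames(Path zip) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zf = new ZipFile(zip.toFile(), StandardCharsets.UTF_8)) {
            for (ZipEntry entry : Collections.list(zf.entries())) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        Path zip = Files.createTempFile("unicode-names", ".zip");
        writeArchive(zip);
        List<String> names = readNames(zip);
        System.out.println("directory name intact: " + names.contains(DIR_NAME));
        System.out.println("file name intact:      " + names.contains(FILE_NAME));
        System.out.println("archive left at: " + zip); // try other extractors on it
    }
}
```

The archive is deliberately left on disk so that it can then be handed to unzip, 7-zip, or WinZip for comparison.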

The second extraction method is to use cygwin's unzip; this fails: IT MANGLES THE NAMES.  In particular:
    1) the directory should be åØâéñ (\u00E5\u00D8\u00E2\u00E9\u00F1)
    2) the file should be <5 CJK chars>_file#2_length2048.txt (the 5 chars are \u3400\u4E01\u9FA6\uF900\uFA30)
but what cygwin unzip actually produces during extraction is
    1) the directory is +ï++ï+ï+ï
    2) the file is ïïïïïïÚïïïÇïï_file#2_length2048.txt

To rule out Java as being non-standard, I manually took the zip archive it produced and extracted it using the latest 7-zip (9.20), which worked perfectly (the directory and file names came out exact).  To further verify, I also temporarily installed the latest WinZip (19.0 build 11293) and once again, it extracted Java's zip file with non-ASCII names perfectly.  If anyone wants to verify these claims, I am attaching the zip file produced by Java (and extractable by 7zip and WinZip, but NOT by cygwin unzip) to this email.  [UPDATE: my original email yesterday had this attachment, but I do not see it showing up on the mailing list.  I take it that cygwin mailing lists auto reject emails with attachments?]


So, I reckon that cygwin unzip is the odd man out.


Oh, when I try to view this zip file using Windows 7's integrated zip viewer in Windows Explorer, it displays mangled directory and file names that are different again from what cygwin unzip produced.  This link
    https://www.jam-software.com/treesize/online_manual/EN/unicode_zip_files.html

claims that Windows 7 does not really support unicode names, so this is perhaps expected.

Also, I found that this inter-program incompatibility is limited to cygwin unzip: cygwin zip seems to produce archives with unicode names that other programs can extract just fine.



I did some web research, and the most relevant link that I could find about cygwin unzip and unicode is this old announcement from 2009:
    https://cygwin.com/ml/cygwin-announce/2009-08/msg00006.html

That announcement contains this ominous text:
    Currently, on Windows the UTF-8 handling is limited to the character subset
    contained in the configured non-unicode "system code page".
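To make that limitation concrete, here is a rough sketch that asks whether a name can be represented in a given code page at all. The class and method names are hypothetical, and windows-1252 is only an assumed stand-in for the "system code page"; the actual value depends on the Windows configuration.

```java
import java.nio.charset.Charset;

// Hypothetical helper; windows-1252 is an assumed example code page.
final class CodePageCheck {
    // True if every character of the name can be encoded in the given charset.
    static boolean fitsCodePage(String name, String codePage) {
        return Charset.forName(codePage).newEncoder().canEncode(name);
    }

    public static void main(String[] args) {
        String dir = "\u00E5\u00D8\u00E2\u00E9\u00F1";
        String file = "\u3400\u4E01\u9FA6\uF900\uFA30_file#2_length2048.txt";
        System.out.println(fitsCodePage(dir, "windows-1252"));  // prints "true"
        System.out.println(fitsCodePage(file, "windows-1252")); // prints "false"
    }
}
```

Under that assumption the CJK file name falls outside the code page, while the Latin-1 directory name fits inside it, so the code page limit by itself might not account for the mangled directory name.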

Is it possible that the deficiency mentioned above has simply not been fixed in the last 5 years?


