This is the mail archive of the cygwin mailing list for the Cygwin project.



Re: Unicode width data inconsistent/outdated


On 07.08.2017 at 11:28, Corinna Vinschen wrote:
On Aug  5 21:06, Thomas Wolff wrote:
On 04.08.2017 at 19:01, Corinna Vinschen wrote:
On Aug  3 21:44, Thomas Wolff wrote:
My attempt would be to base the functions on a common table of character categories instead.
...Keep in mind that the table is not loaded into memory on demand, as on
Linux.  Rather, it will be part of the Cygwin DLL and, worse, in the case of
newlib, part of any target using the wctype functions.
Maybe we could change that (load on demand, or put them in a shared library
perhaps), but...
That won't work for embedded targets, especially small ones.

If you want to go that route, you would have to extend struct __locale_t
or lc_ctype_T (in newlib/libc/locale/setlocale.h) to contain pointers to
conversion tables (Cygwin-only), and the __set_lc_ctype_from_win function
or a new function inside Cygwin (but called from __ctype_load_locale)
could load the tables.

Then you could create new iswXXX, towXXX, and wcwidth functions inside
Cygwin using these tables, rather than relying on the newlib code.
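
A minimal sketch of the per-locale table idea described above. Every name here (struct sketch_locale, sketch_load_tables, sketch_towupper_l) is hypothetical and does not match the real newlib or Cygwin declarations; the one-entry table only stands in for data a loader would provide.

```c
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Hypothetical case-mapping entry: a codepoint range plus a signed offset. */
struct sketch_case_entry {
    uint32_t first;
    uint32_t last;
    int32_t  delta;
};

/* Hypothetical stand-in for a locale structure extended with table pointers. */
struct sketch_locale {
    const struct sketch_case_entry *toupper_tab;
    size_t toupper_len;
};

/* Tiny illustrative table: a..z map to A..Z. */
static const struct sketch_case_entry sketch_upper[] = {
    { 0x0061, 0x007A, -32 },
};

/* Hypothetical loader, standing in for a hook called when the locale is set. */
void sketch_load_tables(struct sketch_locale *loc)
{
    loc->toupper_tab = sketch_upper;
    loc->toupper_len = sizeof sketch_upper / sizeof sketch_upper[0];
}

/* Table-driven towupper; returns c unchanged if no table entry matches. */
wint_t sketch_towupper_l(wint_t c, const struct sketch_locale *loc)
{
    for (size_t i = 0; i < loc->toupper_len; i++) {
        const struct sketch_case_entry *e = &loc->toupper_tab[i];
        if ((uint32_t)c >= e->first && (uint32_t)c <= e->last)
            return (wint_t)((int32_t)c + e->delta);
    }
    return c;
}
```

With a locale whose tables were never loaded (null pointer, length 0), the function simply returns its argument, which is roughly the opt-in behaviour discussed for small targets.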

Alternatively, if RTEMS is interested as well, we may strive for a
newlib solution which is opt-in.  Loading tables (or even big tables at
all) isn't a good solution for very small targets.

The idea here is that the tables take less space than a full-fledged
category table.  The tables in utf8print.h and utf8alpha.h and the code
in iswalpha and iswprint combined are 10K, code and data of the
tolower/toupper functions are 7K, wcwidth 3K, so a total of 20K,
covering Unicode 5.2 with 107K codepoints.

A category table would have to contain the category bits for the entire
Unicode codepoint range.  The number of potential bits is > 8 as far as I
know so it needs 2 bytes per char, but let's make that 1 byte for now.
For Unicode 5.2 only the table would be at least 107K, and that would
only cover the iswXXX functions.
I have a working version now, and it uses much less as the category table is
range-based.
Another table is needed for case conversion. Size estimates are as follows
(based on Unicode 5.2 for a fair comparison, going up a little bit for 10.0
of course):

Categories: 2313 entries (10.0: 2715)
each entry needs 9 bytes, total 20817 bytes
I don't know whether that expands by some word-alignment.
I could pack entries to 7 bytes, or even 6 bytes if that helps (total 16191
or 13878).
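
To illustrate the range-based approach and the alignment question, here is a rough sketch. The field layouts are assumptions, not the layout of the working version: 4 + 4 + 1 bytes is merely one way to arrive at 9 bytes per entry, and the 6-byte form is one possible packing. The category codes and the four-entry table are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical category codes; a real table would carry all general categories. */
enum sketch_cat { CAT_NONE = 0, CAT_LU, CAT_LL, CAT_ND };

/* Unpacked entry: 9 bytes of payload, but typical ABIs pad the struct to 12. */
struct sketch_range {
    uint32_t first;
    uint32_t last;
    uint8_t  cat;
};

/* One possible packed form: 3-byte start, 2-byte run length, 1-byte category.
   Byte arrays have alignment 1, so no padding is added. */
struct sketch_packed {
    uint8_t first[3];
    uint8_t len[2];
    uint8_t cat;
};

/* Illustrative table, sorted by first codepoint, ranges not overlapping. */
static const struct sketch_range sketch_table[] = {
    { 0x0030, 0x0039, CAT_ND },   /* 0..9 */
    { 0x0041, 0x005A, CAT_LU },   /* A..Z */
    { 0x0061, 0x007A, CAT_LL },   /* a..z */
    { 0x0391, 0x03A1, CAT_LU },   /* Greek capital Alpha..Rho */
};

/* Binary search over the ranges; CAT_NONE if c falls in no range. */
enum sketch_cat sketch_category(uint32_t c)
{
    size_t lo = 0;
    size_t hi = sizeof sketch_table / sizeof sketch_table[0];

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (c < sketch_table[mid].first)
            hi = mid;
        else if (c > sketch_table[mid].last)
            lo = mid + 1;
        else
            return (enum sketch_cat)sketch_table[mid].cat;
    }
    return CAT_NONE;
}
```

Comparing sizeof(struct sketch_range) with sizeof(struct sketch_packed) on the target shows whether word-alignment inflates the unpacked table; the packed form costs a few shifts per lookup to reassemble the start codepoint.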

Case conversion: 2062 entries (10.0: 2621)
each entry needs 12 bytes, total 24744
packed 8 bytes, total 16496

The Categories table could be boiled down to 1223 entries (penalty: double
runtime for iswupper and iswlower)
The Case conversion table could be transformed to a compact form
Case conversion compact: 1201 entries
each entry needs 16 bytes, total 19216
packed 12 or 11 (or even 10), total 14412 (or 12010)
So I think the increase is acceptable for the benefit of simple and
automatic generation
So we're at 40K+ plus code then.
No, if I implement the packed versions, it's 19.3K, so even smaller than currently.
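
For reference, the size estimates above can be re-derived mechanically. The message does not say which combination gives 19.3K; one reading that matches is the reduced Categories table (1223 entries) at 6 bytes plus the compact Case conversion table (1201 entries) at 10 bytes. The function names below are only illustrative.

```c
#include <stddef.h>

/* Size of a table with a fixed number of bytes per entry. */
size_t sketch_table_bytes(size_t entries, size_t bytes_per_entry)
{
    return entries * bytes_per_entry;
}

/* Assumed combination behind the 19.3K figure:
   1223 * 6  =  7338  (reduced Categories table, packed)
   1201 * 10 = 12010  (compact Case conversion table, packed)
   sum       = 19348  bytes */
size_t sketch_packed_total(void)
{
    return sketch_table_bytes(1223, 6) + sketch_table_bytes(1201, 10);
}
```

The same helper reproduces the other quoted totals, for example 2313 entries at 9 bytes (20817) and 2062 entries at 12 bytes (24744).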

newlib: embedded targets, looking for small sized solutions.  Simple
and automatic generation is not the main goal.

and also more efficient processing by some of the
functions. Also they would apply to more functions, e.g. iswdigit which
would confirm all Unicode digits, not just the ASCII ones.
Don't do that.  There's a collision with C99 if you define other
characters than ASCII digits to return nonzero from iswdigit.  ...
OK.
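
A minimal sketch of the ASCII-only behaviour agreed on here, assuming wide characters use Unicode (or at least ASCII-compatible) values. The name sketch_iswdigit is hypothetical and this is not the newlib implementation.

```c
#include <wchar.h>

/* Nonzero only for the ten decimal digits '0'..'9'.
   Other Unicode digits, such as U+0663 ARABIC-INDIC DIGIT THREE, yield 0. */
int sketch_iswdigit(wint_t c)
{
    return c >= L'0' && c <= L'9';
}
```

Unicode digits outside this range could still be reported through other classifications (iswalnum or a category lookup, say) without touching iswdigit.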

Issue 3 is the special conversion jp2uc which seems to be half-bred; there
is no such handling for Chinese or Korean.
This shouldn't matter to you, just keep it in place.  It's a historical,
low-footprint conversion for Japanese characters without pulling in the
Unicode stuff.  Not used on Cygwin, so just ignore it.
I had noticed meanwhile that this is not active in Cygwin, but it's broken
anyway for multiple reasons:
    * platforms for which wchar_t is not Unicode should be explicitly listed
    * if used, the transformation needs to be applied to all non-Unicode
locales (also Chinese, Korean, and even 8-bit locales such as *.CP1252)
    * for towupper and towlower, the result must be back-transformed into the
respective locale encoding
    * particularly the locale-specific _l functions inconsistently do not use
the transformation but have this note:
No, no, no.  The functionality is restricted to certain use-cases and
always was.  It was a paid-for customer extension back in the day and it
was *sufficient* for the use-cases.  It's not clear how many newlib
users are still using it, but it's not a good idea to remove it without
checking first.  That means, ask on the newlib mailing list how many are
using the historical jp2uc code, and if we don't get a reply within,
say, a month, we can probably nuke it.
OK, let's make such a request after holiday time.
But even if this is to persist as a special solution, it's still broken and should be fixed. Can we then replace the current table with calls to the iconvdata functions? In that case, as I said, the back-conversion would be available too, and I could fix that and add the missing handling of the _l functions, for a consistent solution.

Thomas


