This is the mail archive of the cygwin mailing list for the Cygwin project.



Re: LC_COLLATE vs. egrep -- bug or (non-)feature?


On 10/11/2011 01:20 PM, Henry S. Thompson wrote:
Is this a feature, or a bug associated with the current ongoing
discussion about locales:

(mis)-feature, and not necessarily a cygwin bug. Historically, POSIX 1992 _required_ that regular expression ranges expand out to all characters in Collation Element Order between the two end points. The intent was to let accented characters common in some languages be picked up automatically, so that [a-z] would also match accented vowels. But it backfired, with several unintended consequences:

1) In locales that collate case-insensitively, you are collating via either aAbBcC... or AaBbCc..., so that [a-b] now means [aAb] or [aBb], which adds unwanted capital letters into your range. And although you can write a locale definition where collation element order is sane (all lowercase, followed by all uppercase, followed by collation rules that merge the two sets), it is not as easy to do (the naive locale definition writes the collation rules first, intermixing upper and lower case).

2) Even if you write the locale definition in a sane collation element order, do you put the accents first or last? That is, [a-e] is liable to pick up all accented a's but no accented e's, even though [a-z] picks up all accented lower case vowels.
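
To make the effect of collation element order concrete, here is a minimal sketch in plain POSIX shell. It does not query any real locale: expand_range is a hypothetical helper, and the strings passed to it are toy collation sequences chosen to mirror the orderings described above.

```shell
# expand_range SEQ LO HI
# Hypothetical helper: print the characters a range LO-HI would cover if it
# expanded in the order given by SEQ (a toy stand-in for a locale's collation
# element order). Assumes single characters, each in SEQ, with LO not after HI.
expand_range() {
  seq=$1 lo=$2 hi=$3
  if [ "$lo" = "$hi" ]; then
    printf '%s\n' "$lo"
    return
  fi
  rest=${seq#*"$lo"}     # drop everything up to and including LO
  mid=${rest%%"$hi"*}    # keep only what precedes HI
  printf '%s%s%s\n' "$lo" "$mid" "$hi"
}

expand_range 'aAbBcC' a b    # prints aAb
expand_range 'AaBbCc' a b    # prints aBb
expand_range 'abcABC' a b    # prints ab  (all lowercase first: the "sane" order)
```

The first two calls show how an interleaved order drags a capital letter into [a-b]; the third shows that a lowercase-first order would not.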


POSIX 2001 and 2008 "fixed" things by saying that the use of range expressions in regular expressions is undefined in all but the C locale, but the cat is already out of the bag, and you are stuck with existing behavior. glibc refuses to change their regex library, preferring to stick to POSIX 1992 behavior, and claiming that the "bug" instead lies with any locale definition that still uses naive ordering. Cygwin could behave differently than glibc here and still comply with POSIX, but then we'd get bug reports for "why does cygwin not emulate Linux".

Meanwhile, several GNU apps are sick of bug reports about the unintuitive nature of ranges, and are introducing what is called native ordering, where range expressions _always_ mean the C locale expansion, even when not in the C locale; but given glibc behavior, this means adding code on top of glibc, for all programs that understand regex (awk, bash, sed, grep, m4, etc.). So don't expect that to save you any time soon; likewise, that only helps you on GNU systems (Solaris will still continue to suffer from the confusion).

So, your only safe way to work around it is to request LC_COLLATE=C up front.
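
As a rough illustration of that workaround, the sketch below runs the pattern from this thread with LC_COLLATE=C. The three-word list is a stand-in for /usr/share/dict/words (an assumption, taken from the output quoted in the question), and grep -E is used as the POSIX spelling of egrep.

```shell
# Sample word list standing in for /usr/share/dict/words (hypothetical data,
# reusing the three words from the output quoted in this thread).
words='aldern
Alleen
Alleyn'

# Run in a subshell with LC_ALL unset, because a set LC_ALL would override
# LC_COLLATE. With C collation, [a-b] covers only the two bytes a and b.
matches=$(
  unset LC_ALL
  printf '%s\n' "$words" | LC_COLLATE=C grep -E '^[a-b]l[dl]e.n$'
)
printf '%s\n' "$matches"    # prints aldern
```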


> LC_ALL= egrep '^[a-b]l[dl]e.n$' /usr/share/dict/words
aldern
LC_COLLATE= egrep '^[a-b]l[dl]e.n$' /usr/share/dict/words
aldern
Alleen
Alleyn

If it's a feature, how do I set LC_COLLATE w/o changing the other
aspects of my locale?

LANG=preferred LC_COLLATE=C


and don't set LC_ALL.
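
The reason for leaving LC_ALL alone is precedence: LC_ALL overrides every LC_* variable, which in turn override LANG. A small sketch using sort, which also honors LC_COLLATE, shows both cases. The value en_US.UTF-8 is only an example of a preferred locale and is an assumption here; it does not need to be installed for the output below to hold.

```shell
letters='b
A
a
B'

# LANG names the general preference; LC_COLLATE=C overrides it for collation
# alone, so uppercase sorts before lowercase (byte order).
sorted=$(
  unset LC_ALL
  printf '%s\n' "$letters" | LANG=en_US.UTF-8 LC_COLLATE=C sort | tr -d '\n'
)
printf '%s\n' "$sorted"        # prints ABab

# LC_ALL outranks LC_COLLATE, so the LC_COLLATE value here is ignored.
overridden=$(
  printf '%s\n' "$letters" | LC_ALL=C LC_COLLATE=en_US.UTF-8 sort | tr -d '\n'
)
printf '%s\n' "$overridden"    # prints ABab
```

The second case is the trap: had LC_ALL named a non-C locale, the LC_COLLATE=C request would have been ignored in the same way.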

--
Eric Blake   eblake@redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

