This is the mail archive of the cygwin mailing list for the Cygwin project.



Re: Problem with Bash regex test case sensitivity


On 12/4/10, Lee Rothstein <lee@ > wrote:
> On 12/4/2010 10:06 AM, Corinna Vinschen wrote:
>
>  > On Dec  4 10:05, Lee wrote:
>
>  >> On 12/3/10, Eric Blake <eblake@ > wrote:
>  >>> Read the FAQ.  http://www.faqs.org/faqs/unix-faq/shell/bash/, E9.
>
>  >> Which says the en_US locale collates the upper and lower case
>  >> letters like this:
>  >>     AaBb...Zz
>
>  >> I got that much :)  What I don't get is why someone would _want_ the
>  >> collating sequence to be AaBb... or why that sequence was picked for
>  >> en_US instead of using the natural order of A-Za-z.
>
>  > It's not the "natural" order, it's an arbitrary order which was
>  > chosen back in 1963 when the ASCII code was defined.  It's not used
>  > as "natural" order outside of computer systems and it's not even the
>  > natural order on some computer systems (see EBCDIC).
>
>  > If you take a look into a hardcopy encyclopedia written in English,
>  > you'll probably find that the words are ordered lexicographically
>  > rather than by ASCII code.  Needless to say, the ordering
>  > criteria for non-English languages may contain more characters in the
>  > sequence, in German for instance
>
>  >   "AaäBb...Ooö...Ssß...Uuü...Zz"
>
>  > So, let's reiterate:
>
>  > - If I need the order for the computer language, I say so:
>
>  >    LC_COLLATE=C.UTF-8
>
>  > - Otherwise, if I need the order for the natural language, I
>  >   say so:
>
>  >    LC_COLLATE=en_US.UTF-8
>  >    LC_COLLATE=de_DE.UTF-8
>  >    ...
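
A minimal sketch of the first option above, assuming the only goal is to ask "does this string contain an ASCII capital?" no matter which locale the caller has set. The function name has_ascii_capital is made up for illustration. LC_ALL is used rather than LC_COLLATE because LC_ALL, when set, overrides the individual LC_* categories:

```shell
#!/bin/bash
# Hypothetical helper (not from the thread): test for an ASCII capital
# letter using C collation, whatever LANG/LC_* the caller has set.
# The function body is a subshell "( ... )", so the LC_ALL assignment
# cannot leak into the calling shell.
has_ascii_capital () (
   LC_ALL=C
   [[ "$1" =~ [A-Z] ]]
)

for word in dfgh Dfgh; do
   if has_ascii_capital "$word"; then
      echo "Contains Capital Letters: $word"
   else
      echo "Doesn't Contain Capital Letters: $word"
   fi
done
```

The subshell costs a fork per call, which is slow on Cygwin; for a script that never needs natural-language ordering, exporting the setting once at the top is probably the simpler choice.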
>
> Here's my takeaway, given Corinna's interesting and complete
> context, and my intents. (My intentions, BTW, are for my scripts
> to have as much generality as possible [given my limited skills
> ;-|].)
>
> Therefore, instead of using '[A-Z]' to represent caps, I should
> have used (?) the Posixly Correct, '[:upper:]'.

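Close; you should have used '[[:upper:]]'.  The class name [:upper:]
is only recognized inside a bracket expression, hence the doubled
brackets.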

$ cat t_regex
#!/bin/bash
# t_regex: Test test regex
# By Lee Rothstein, 2010-12-03, 16:27:38

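regex_test () {

   # Range expression: which letters fall between A and Z depends on
   # the collating order of the current locale.
   echo -n "[A-Z] test: "
   if [[ "$1" =~ [A-Z] ]] ; then
      echo "Contains Capital Letters: $1"
   else
      echo "Doesn't Contain Capital Letters: $1"
   fi

   # POSIX character class inside a bracket expression: matches
   # uppercase letters regardless of collating order.
   echo -n "[:upper:] test: "
   if [[ "$1" =~ [[:upper:]] ]] ; then
      echo "Contains Capital Letters: $1"
   else
      echo "Doesn't Contain Capital Letters: $1"
   fi

}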

unset LC_COLLATE
export LANG="C.UTF-8"
echo "=== LANG=$LANG"
regex_test dfgh
regex_test Dfgh

echo
echo

export LANG="en_US.UTF-8"
echo "=== LANG=$LANG"
regex_test dfgh
regex_test Dfgh


 ~/src
$ ./t_regex
=== LANG=C.UTF-8
[A-Z] test: Doesn't Contain Capital Letters: dfgh
[:upper:] test: Doesn't Contain Capital Letters: dfgh
[A-Z] test: Contains Capital Letters: Dfgh
[:upper:] test: Contains Capital Letters: Dfgh


=== LANG=en_US.UTF-8
[A-Z] test: Contains Capital Letters: dfgh
[:upper:] test: Doesn't Contain Capital Letters: dfgh
[A-Z] test: Contains Capital Letters: Dfgh
[:upper:] test: Contains Capital Letters: Dfgh

 ~/src
$
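
To see which lowercase letters the current locale places inside the A-Z range, a quick probe along these lines may help. It is only a sketch: the function name lower_in_AZ is invented here, it is not part of t_regex, and the result depends on the locale data installed on the system.

```shell
#!/bin/bash
# Hypothetical probe (not part of t_regex): print every lowercase ASCII
# letter that the regex range [A-Z] matches under the current locale.
lower_in_AZ () {
   local ch hits=""
   for ch in {a..z}; do
      if [[ "$ch" =~ [A-Z] ]]; then
         hits+="$ch"
      fi
   done
   echo "$hits"
}

echo "LANG=${LANG:-unset}: lowercase letters matched by [A-Z]: '$(lower_in_AZ)'"
```

Under a C locale the list should come back empty. Under a locale that collates as AaBb...Zz, most of the lowercase alphabet would be expected to show up, which is consistent with the en_US.UTF-8 result for dfgh above.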


Lee


