This is the mail archive of the cygwin mailing list for the Cygwin project.



Re: awk gsub problem


On 19.09.2010 22:33, Lee wrote:

Thank you - I appreciate the follow-up.


Was the reply from the upstream maintainer posted on a mailing list
(and if so, which one)?  I'd like to understand the problem they're
solving.  I get the idea of "[[:lower:]]" working regardless of the
collating order of the current character set, but how "[a-z]" gets
translated to something like "[aAbBcCdD...zZ]" boggles my mind.  It
seems like they had to have gone out of their way to translate [a-z]
into a case-insensitive RE.

But regardless, it still seems broken to me. From the gawk man page:

    The various command line options control how gawk interprets
characters in regular expressions.

    --traditional
       Traditional Unix awk regular expressions are matched.  The GNU
operators are not special, interval expressions are not available, and
neither are the POSIX character classes ([[:alnum:]] and so on).

The way I read it, I can change the line in my .bashrc from
   export AWK="/usr/bin/gawk.exe"
to
   export AWK="/usr/bin/gawk.exe --traditional"
and not have to change any scripts that use $AWK.  If "--traditional"
meant one was no longer able to write a case-sensitive RE ("[a-z]" gets
translated into "[aAbB...zZ]" and "[[:lower:]]" isn't interpreted as a
lowercase character class), I'd expect that to be highlighted in the man
page.  But as I said in my initial message, --traditional doesn't fix
the problem:

$ cat test.awk
awk --traditional '
BEGIN {
   s="Serial0"
   gsub("[a-z]","",s)
   printf("s= ::%s::  should = ::S0::\n", s)
   exit
} '

$ export LANG=en_US.UTF-8

$ sh test.awk
s= ::0::  should = ::S0::


What you really want is this:
s/really want/have to do/

   BEGIN {
     s="Serial0"
     gsub("[[:lower:]]","",s)
     printf("s= ::%s::  should = ::S0::\n", s)
     exit
   }

The "[[:lower:]]" expression always catches all valid lowercase letters,
independent of the language, territory, and charset used.
At least for the short term, my work-around is not setting LANG.
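
A possible variant of that work-around, sketched here as an assumption rather than something tested on Cygwin: keep LANG as it is and force the C locale for awk only, through the same $AWK variable the scripts already use. The plain "awk" below is a stand-in for /usr/bin/gawk.exe, and the "env LC_ALL=C" prefix is the hypothetical part.

```shell
# Hypothetical variant of the AWK setting from .bashrc: force the C locale
# for awk only, leaving LANG untouched for every other program.
# "awk" stands in for /usr/bin/gawk.exe here.
AWK="env LC_ALL=C awk"

# $AWK is deliberately unquoted so that sh/bash split it into
# the command name and its arguments.
result=$($AWK 'BEGIN { s = "Serial0"; gsub("[a-z]", "", s); print s }')
echo "s= ::$result::  should = ::S0::"
```

Scripts that call $AWK would not need to change, since the locale override travels with the variable. Shells that do not word-split unquoted variables (zsh, for instance) would need a function or an array instead.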

Thanks again,
Lee

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple




Hello Lee,


You hit a well-known problem with differing character sets.
It usually goes unnoticed, because the standard character sets on UNIX, Linux, and Windows systems
store the characters "abcdefghijklmnopqrstuvwxyz" as one contiguous sequence. That is not the case for all character sets;
*EBCDIC* is one example of a character set where it is not.
Differing character sets are a major problem when porting programs from one system to another.


The gawk man page is not complete. Many GNU programs keep their full (or better) documentation in the info pages.
The documentation relevant to your problem can be displayed with the following command:
info gawk character list


The relevant text is the first paragraph of that info page:

2.4 Using Character Lists
=========================

Within a character list, a "range expression" consists of two
characters separated by a hyphen.  It matches any single character that
sorts between the two characters, using the locale's collating sequence
and character set.  For example, in the default C locale, `[a-dx-z]' is
equivalent to `[abcdxyz]'.  Many locales sort characters in dictionary
order, and in these locales, `[a-dx-z]' is typically not equivalent to
`[abcdxyz]'; instead it might be equivalent to `[aBbCcDdxXyYz]', for
example.  To obtain the traditional interpretation of bracket
expressions, you can use the C locale by setting the `LC_ALL'
environment variable to the value `C'.
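
To make the "collating sequence" point concrete, here is a small sketch that uses sort rather than awk, on the assumption that sort follows the same locale collation rules. It orders four letters under the C locale and then under whatever locale is currently active. The variable names are only illustrative.

```shell
# Sort four letters under the C locale, where order follows byte values.
c_order=$(printf '%s\n' b A a B | LC_ALL=C sort | tr -d '\n')
echo "C locale order:       $c_order"   # uppercase before lowercase: ABab

# For comparison, the same sort under the currently active locale.
# The result depends on the system; dictionary-order locales
# typically interleave upper and lower case.
cur_order=$(printf '%s\n' b A a B | sort | tr -d '\n')
echo "current locale order: $cur_order"
```

If the second line shows the cases interleaved, a range such as `[a-z]` evaluated by collating order can include uppercase letters, which would match the behaviour reported at the start of this thread.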

Regards
Dirk
