This is the mail archive of the cygwin mailing list for the Cygwin project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: Bogus assumption prevents d2u/u2d/conv/etal working on mixed files.


On Fri, 2 Apr 2004, Charles Wilson wrote:

> Dave Korn wrote:
>
> >   I was pretty stunned to find d2u didn't have the same effect as tr -d.  A
> > few seconds work in the debugger, however, made it clear.
> >
> >   Right inside conv.c, in the main convert (...) function, there's an
> > attempted optimisation.  After opening the file for conversion, it reads a
> > char at a time until it finds the first '\n' or '\r' in the whole file.  If
> > a '\n' comes first, it assumes the file is in Unix format; if a '\r' comes
> > first, it assumes the file must be in DOS format.
> >
> >   Now, these assumptions are reasonable enough ways of guessing the file
> > format if it hasn't been specified by the command name or command line
> > switch, and therefore of deducing which kind of translation is required.
> >
> >   But then it checks to see if the guessed format matches the format you've
> > asked it to convert into.  If so, it attempts to 'optimise' the conversion
> > by simply not performing it: it closes the file and leaves it untouched.
>
> This is not just a performance enhancement.  It also has the following
> properties:
>    (1) the access time on files which are not actually modified does not
> get updated
>
>    (2) it's an attempt to prevent users from permanently scrogging
> binary files.  See: d2u, on a binary file, is an irreversible operation.
>   So, if you do "d2u *" you'll probably kill something deep inside some
> binary file, and you can't fix it -- unless some minimal safeguards are
> in place.

IMO, (2) sounds like an important reason to keep existing default
behavior.  In the spirit of open source, though, we should provide a way
for the user to shoot themselves in the foot if they really want to,
though, so the "--force" option sounds marginally useful for cases like
the above.

>    u2d MAY be reversible -- IF there were no pre-exising \r\n
> combinations in the file to begin with -- so when (OMG-fixit-)d2u is
> run, obviously the first '\n' is preceeded by a (newly-added) '\r\n', so
> the prog merrily replaces ALL '\r\n' with '\n'...which MAY fix your
> oops, but maybe not.

I was about to say that u2d is fully reversible, since any \n will get
converted into a \r\n, regardless of whether it was already preceded by \r
or not (so, a \r\n combination will become \r\r\n), but it seems that it's
doing something smarter (i.e., it's not "perl -i -pe 's/$/\r/'").
Incidentally, "sed -i 's/$/\r/'" does NOT work as expected; it works more
like u2d.

The problem with d2u (and u2d) is that after it's done, you don't know
whether a particular \n (\r\n) was the result of a conversion, or was
there initially.

> So, with the current code, if you snarf the first "line" -- all chars
> until the first '\n' -- if it's a binary file the odds are pretty low
> that the immediately-preceeding character is a '\r' -- so d2u as
> currently coded will bail out, and no harm is done.

Well, if the first "line" of a binary ended in \r\r\n, you're still
screwed, but that's a different story.

> It doesn't work so well in the other direction -- by the same logic
> above, you'll almost never bail out early if you run 'u2d' on a binary
> file -- but if you immediately do a 'd2u' you MIGHT be able to recover.)

Only if the binary didn't contain any \r\n combinations before u2d...

> >   Unfortunately, there is an extra unstated assumption in between deducing
> > the file type from the first EOL in the file and deducing that you don't
> > need to perform a conversion, and that assumption is that every other line
> > in the file has the same EOL as the first line.  And that assumption is
> > bogus, and it means that d2u/u2d and friends are no use on files which have
> > mixed EOL types, unless by good chance the very first line has the EOL type
> > that you wish to convert away from.
> >
> >   My attached patch simply removes the attempted optimisation.  Like I say,
> > I think it's an invalid shortcut to assume that every line in a file has the
> > same EOL type.  I could imagine a case could be made for keeping the
> > 'optimisation' and perhaps providing a command-line switch "-f" or "--force"
> > to force full processing of files even if they seem to already be in the
> > right mode;  OTOH I'd say that even if you wanted to keep the optimisation
> > in some cases, it's a dangerous optimisation that can lead to incorrect
> > output, and therefore it should only be switched on when the user
> > deliberately adds a command-line option, rather than being on by default and
> > disableable.
>
> While I admit that the 'safety' aspect is not foolproof, it is a
> valuable mini-feature.  To make the default behavior un-"safe" kinda
> defeats the purpose.  Further, why do you need an extra commandline
> option at all?  Obvious, the first EOL is either one or the other.

The commandline option would help those who want to shoot themselves in
the foot do so.  It would also simplify tech support, e.g., the transcript
would be something like "-Was the --force option supplied? -Well, yes...
*Click*".

> So, if you have a mixed-mode file, and want to go to UNIX EOL, then this
> will always work:
>
> u2d myFile.txt && d2u myFile.txt
>
> I'm not shooting down your patch, but I'd like further discussion on
> list.  I think the "safety" features of the current behavior have been
> overlooked, and I hesistate to remove them without a thorough
> discussion.  'cause as soon as I remove it, some poor slob is going to
> do 'd2u /cygdrive/c/WINNT/nt.dat' and invent new swear words using my
> name...

FYI:
$ cp -p /cygdrive/c/WINNT/nsreg.dat . && d2u nsreg.dat && ls -l nsreg.dat /cygdrive/c/WINNT/nsreg.dat
nsreg.dat: done.
-rwx------+   1 Administ None        22428 Apr  2 08:16 /cygdrive/c/WINNT/nsreg.dat*
-rwx------    1 Administ None        22365 Apr  3 12:49 nsreg.dat*

IOW, the safeguard doesn't work anyway.
	Igor
-- 
				http://cs.nyu.edu/~pechtcha/
      |\      _,,,---,,_		pechtcha@cs.nyu.edu
ZZZzz /,`.-'`'    -.  ;-;;,_		igor@watson.ibm.com
     |,4-  ) )-,_. ,\ (  `'-'		Igor Pechtchanski, Ph.D.
    '---''(_/--'  `-'\_) fL	a.k.a JaguaR-R-R-r-r-r-.-.-.  Meow!

"I have since come to realize that being between your mentor and his route
to the bathroom is a major career booster."  -- Patrick Naughton

--
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple
Problem reports:       http://cygwin.com/problems.html
Documentation:         http://cygwin.com/docs.html
FAQ:                   http://cygwin.com/faq/


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]