This is the mail archive of the cygwin mailing list for the Cygwin project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: Bogus assumption prevents d2u/u2d/conv/etal working on mixed files.


Dave Korn wrote:

  I was pretty stunned to find d2u didn't have the same effect as tr -d.  A
few seconds work in the debugger, however, made it clear.

  Right inside conv.c, in the main convert (...) function, there's an
attempted optimisation.  After opening the file for conversion, it reads a
char at a time until it finds the first '\n' or '\r' in the whole file.  If
a '\n' comes first, it assumes the file is in Unix format; if a '\r' comes
first, it assumes the file must be in DOS format.

  Now, these assumptions are reasonable enough ways of guessing the file
format if it hasn't been specified by the command name or command line
switch, and therefore of deducing which kind of translation is required.

  But then it checks to see if the guessed format matches the format you've
asked it to convert into.  If so, it attempts to 'optimise' the conversion
by simply not performing it: it closes the file and leaves it untouched.

This is not just a performance enhancement. It also has the following properties:
(1) the access time on files which are not actually modified does not get updated


(2) it's an attempt to prevent users from permanently scrogging binary files. See: d2u, on a binary file, is an irreversible operation. So, if you do "d2u *" you'll probably kill something deep inside some binary file, and you can't fix it -- unless some minimal safeguards are in place.

u2d MAY be reversible -- IF there were no pre-exising \r\n combinations in the file to begin with -- so when (OMG-fixit-)d2u is run, obviously the first '\n' is preceeded by a (newly-added) '\r\n', so the prog merrily replaces ALL '\r\n' with '\n'...which MAY fix your oops, but maybe not.


So, with the current code, if you snarf the first "line" -- all chars until the first '\n' -- if it's a binary file the odds are pretty low that the immediately-preceeding character is a '\r' -- so d2u as currently coded will bail out, and no harm is done.


It doesn't work so well in the other direction -- by the same logic above, you'll almost never bail out early if you run 'u2d' on a binary file -- but if you immediately do a 'd2u' you MIGHT be able to recover.)


Unfortunately, there is an extra unstated assumption in between deducing the file type from the first EOL in the file and deducing that you don't need to perform a conversion, and that assumption is that every other line in the file has the same EOL as the first line. And that assumption is bogus, and it means that d2u/u2d and friends are no use on files which have mixed EOL types, unless by good chance the very first line has the EOL type that you wish to convert away from.

  My attached patch simply removes the attempted optimisation.  Like I say,
I think it's an invalid shortcut to assume that every line in a file has the
same EOL type.  I could imagine a case could be made for keeping the
'optimisation' and perhaps providing a command-line switch "-f" or "--force"
to force full processing of files even if they seem to already be in the
right mode;  OTOH I'd say that even if you wanted to keep the optimisation
in some cases, it's a dangerous optimisation that can lead to incorrect
output, and therefore it should only be switched on when the user
deliberately adds a command-line option, rather than being on by default and
disableable.

While I admit that the 'safety' aspect is not foolproof, it is a valuable mini-feature. To make the default behavior un-"safe" kinda defeats the purpose. Further, why do you need an extra commandline option at all? Obvious, the first EOL is either one or the other.


So, if you have a mixed-mode file, and want to go to UNIX EOL, then this will always work:

u2d myFile.txt && d2u myFile.txt

I'm not shooting down your patch, but I'd like further discussion on list. I think the "safety" features of the current behavior have been overlooked, and I hesistate to remove them without a thorough discussion. 'cause as soon as I remove it, some poor slob is going to do 'd2u /cygdrive/c/WINNT/nt.dat' and invent new swear words using my name...

--
Chuck




-- Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple Problem reports: http://cygwin.com/problems.html Documentation: http://cygwin.com/docs.html FAQ: http://cygwin.com/faq/


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]