1

Closed

Invalid characters in file are not ignored

description

Example:
jquery.caret.js - http://code.google.com/p/jcaret

Compressing this using the 'official' jar version of YUI Compressor works without error.
On the .net port, it throws syntax error.

There's a hex character at the start of (function

Would be good if the .net port could strip out unsupported characters before trying to compress?

Blogged about it here:
http://www.alexjamesbrown.com/blog/development/syntax-error-using-yahoo-yui-compressor-net-port-works-using-java-version/
Closed Jun 27, 2013 at 3:56 PM by freeranger
Not a YUI Issue

comments

freeranger wrote Jan 15, 2013 at 2:26 PM

Hi,

hmm... The problem character appears to be a BOM.
I would suggest that the very short term solution here is to remove the offending character.
From our side, I think it would be appropriate to handle these correctly (e.g. stripping out the BOM before we pass the script to the compressor, or opening the file in such a way that the BOM is detected and not presented as part of the text...though we need to consider if a BOM in the input should be preserved in the output) and successfully compress the files

I don't think removing invalid/unsupported characters is a desirable feature as it may indicate a problem with the file itself and/or that the resulting compressed file content is not equivalent to the input file.

Correct handling of BOM markers IS a desirable feature though

alexjamesbrown wrote Jan 15, 2013 at 2:30 PM

Yes, I agree... (that's what I did in fact)

I've submitted a failing unit test as a patch to get started on this.

freeranger wrote Jan 15, 2013 at 3:24 PM

"I've submitted a failing unit test as a patch" - what a top man, if only all our users were as kind!
Now if you want to go ahead and fix it... Just Kidding ;)
<br/>
We need to think about whether the BOM should be preserved in the output......and my thinking goes like this:<br/>
If it was one file in, one file out then it would seem a simple "yes" may be the correct answer - BOM is preserved, but since it will usually be multiple files in -> one file out, each input file could have a different BOM, so which one do you preserve? There is no "Correct" one to use in this case and I wouldn't want tob e inconsistent by saying "BOM is preserved if only one input file is presented, otherwise it is stripped out.<br/>

That aside, BOM is used to indicate both byte order (which, lets face it, is going to be little endian here) and encoding used. We already have the Encoding parameter to specify encoding....so again it seems like the BOM should not be preserved in the output in any instance.<br/>

Does that sound about right?

purekrome wrote Jan 15, 2013 at 11:05 PM

When i read the suggestion about explicitly checking for BOM chars, my warning spider sense started tingling.

I'm thinking: why are those chars there in the first place? meaning ... is there a better / proper way in which we should be reading in the files instead? Imaging if we just add a check for BOM .. but then there's another thing .. then another.. then .. u get the idea.

Quick google came up with this blog post help: http://andrewmatthewthompson.blogspot.com.au/2011/02/byte-order-mark-found-using-net.html

So maybe we should be investigating how we're reading in the file, instead of just opening it and then doing a hard coded explicit BOM check.

(i'm also on holidays for another fortnight before I go home).

alexjamesbrown wrote Jan 15, 2013 at 11:33 PM

Yeah...
I think (maybe) the problem in the example file is the BOM character isn't at the very start of the file...
it's several chars in (after all the comments)
I'll try editing the file (to make it at the start of the file) and see if that makes a difference.

If that's the case, our "check" for the BOM char can just ensure it's at the start of the file?

alexjamesbrown wrote Jan 15, 2013 at 11:53 PM

Ok....
So, if the BOM char is at the start of the file (i edited the jquery.caret file i included for unit tests) it works.

As expected, it's just if the BOM char is NOT at the start of the file (where it's supposed to be) that it fails.
Patch submitted (again) with a passing unit test, and renamed the failing one

With this info, I guess, if the BOM char isn't at the start of the file, we should remove it?
I realise this is a severe edge case, but it happened to me today (and took an hour to figure out what was going on...!)

Plus, it works with the official YUI compressor, either at the start, or BOM char in the middle...
WOuld be good to look at what that does maybe?

freeranger wrote Jan 16, 2013 at 6:15 AM

@PK, keep your shirt on man ;) The rest of that sentence went " or opening the file in such a way that the BOM is detected and not presented as part of the text".

We definitely don't want to be dealing with lots of different characters that shouldn't be in the file, but the BOM is a bit special.

We read a file in at the moment using File.ReadAllText and pass in the Encoding type...which raises a question of what should happen when then Encoding Type is specified as X but there is a BOM and it says Y?


We could try to be smart and, if there is a BOM, respect that...or we could barf and report that the BOM says X but it is actually Y....or we ignore the BOM and read it as X anyway (which is essentially what we are doing today).


None of that helps with the issue here which is the BOM in the wrong place, which is a completely separate issue really. My first instinct would be to say "tough, it is an illegal character, therefore it should correctly throw a wobbler.....BUT if the java YUI manages to deal with it then we should "deal" with it too as we do try to produce equivalent output to them.


Thoughts?

freeranger wrote Jan 16, 2013 at 6:18 AM

Oh and i have added a comment to the issue at the jcaret site to ask them how the BOM ended up in the wrong place. It is unlikely, though possible, that this is a valid scenario so we should understand that to be able to deal with it

freeranger wrote Jan 16, 2013 at 7:12 AM

Having said that, I just noticed that the original issue was raised in Jul 2010 and the last code release was May 2010, and none of the issues has had a response from the developer, so I suspect this is a dead project & we'll never know the answer...ah well...

purekrome wrote Jan 16, 2013 at 7:22 AM

AH! i totally missed / misread the bit that said: it's not at the -start- of the file, but -within- the file.

If it's not the start of the file, then it's not really a BOM. It's just some illegal chars within the file.

I vote to say: bad luck. Issue is with the source file. You're comment at jcaret site i say, is the correct solution here. aka: fix up the file, jcaret team.

freeranger wrote Jan 16, 2013 at 7:54 AM

"I vote to say: bad luck." - My only problem with that is the Yahoo implementation deals with a dodgy BOM (or perhaps it deals with any dodgy character) so, we are trying to match them output for output then we should consider dealing with this too....
Ordinarily I would agree with you - wrong character in the wrong place at the wrong time, tough!

alexjamesbrown wrote Jan 16, 2013 at 8:32 AM

Yes, it's the correct solution to get the jcaret Dev to fix the file (or creator of any invalid file)
However....
The only reason I created this issue is running this exact file through the official compressor works, while running it through this .net port doesn't.

purekrome wrote Jan 16, 2013 at 8:47 AM

I wonder if this is an issue of java vs .net and how they handle reading in files?

I'd love to know which line(s) in java does the reading in of the file and then seeing how the java does it and how the java is handling for illegal chars within a file.

freeranger wrote Jan 16, 2013 at 9:21 AM

If the BOM were the first characters in the file then it could well be java vs .NET, but since they are not the first characters in this instance, it seems more likely they are doing something specific.

No time to check it out now, hopefully at the weekend....

freeranger wrote Jan 28, 2013 at 6:57 AM

Hi, well I'm no java programmer but I've had a bit of a look at the source code to the original YUI and the compressor just passes a stream reader to the Rhino parser which tokenises the stream.
Somewhere in here, in getToken, it detects unicode sequences and decides what to do with them.
A quick step through the code passing in data with a BOM in the middle would tell us what's what exactly, but I don't know enough about java to be able to do that I'm afraid....
This is where PK steps up to the plate, not only as the original porter (who presumably understands this new fangled java stuff :)) but also, our equivalent of Rhino is the ECMAScript project with PK maintains on github here: https://github.com/PureKrome/EcmaScript.NET
It is this library which should handle dodgy characters in the same way the original YUI implementation does and then we will benefit from that work.
I have patched the project with the two tests you kindly supplied - one with a BOM at the start and the other with it in the middle (currently Ignored) in readiness for PK patching up the ECMAScript library.

freeranger wrote Feb 15, 2013 at 11:27 AM

I think this issue should be closed as it is not an issue with the YUICompressor as such, but an issue with the ECMAScript library - PK, what say you?