Java Programming

Help removing BOM character from beginning of UTF-8 XML file

807580Feb 3 2010 — edited Feb 3 2010

I am trying to automate an application that processes XML documents generated by a third party. We hit this issue once before with a different type of output file from them, and after some struggle they resolved it by removing the <?xml> declaration from that output file entirely. We have recently started processing other types of output files from them, and these turn out to have the same issue, yet the company is not willing to help or even apply the same fix they made in the past. They would rather see me restart a two-year project from scratch than simply remove the <?xml> declaration from these files! They insist no other users have complained, but I am fairly sure no other users are building an application to batch-process the files (most users process individual records through a web site).

In any case, the problem is that their output files begin with a BOM (byte order mark), and I am trying to find a way to remove it programmatically before I parse the file. I am using Castor 1.2 XML to unmarshal the document (I need to extract the data from the elements), and unmarshalling fails with the error "Content not allowed in prolog". When I open the document in a hex editor I see the BOM: ï»¿ (in hex, EF BB BF). Of course I can remove the BOM by hand and the file then processes fine, but our objective is to automate this, since output files arrive at various times and in various quantities.

I've tried various forms of the code below. The first strange thing is that when the program reads the first three bytes, they come back as the negative numbers listed in the code rather than the hex values I see in a hex editor. (I could be wrong, but I thought I read online that this may mean the document is labeled as UTF-8 but is actually encoded as ANSI or ISO-8859-1?)
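On the negative numbers: Java's byte type is signed (-128 to 127), so the bytes EF BB BF print as -17, -69 and -65. That is the same data the hex editor shows and says nothing about the file's real encoding. A minimal sketch (the class and method names are made up for illustration):

```java
class SignedByteDemo {
    // Convert a signed Java byte to its unsigned value (0..255).
    static int toUnsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte[] bom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
        for (byte b : bom) {
            // Shows the signed value next to its unsigned hex form,
            // e.g. signed=-17 unsigned=0xEF for the first byte.
            System.out.printf("signed=%d unsigned=0x%02X%n", b, toUnsigned(b));
        }
    }
}
```

So comparing against -17, -69 and -65 is equivalent to comparing against (byte) 0xEF, (byte) 0xBB and (byte) 0xBF.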

I found the code by searching mailing list archives; the person who posted it claimed it resolved this exact issue. However, although the code executes successfully, it doesn't remove the BOM, and the file still fails when my application tries to unmarshal it. After reading the API documentation for PushbackInputStream, I get the impression that removing bytes is not what this class does...
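That impression about PushbackInputStream can be checked in isolation. The sketch below uses an in-memory array as a stand-in for the file (an assumption, as are the class and method names): it reads three bytes, calls unread() on them, and then reads again. The pushed-back bytes are simply returned by the next read, and the underlying data is never modified.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;

class UnreadDemo {
    // Reads up to three bytes, pushes them back, and returns the next byte read.
    static int firstByteAfterUnread(byte[] data) {
        try (PushbackInputStream pis =
                 new PushbackInputStream(new ByteArrayInputStream(data), 3)) {
            byte[] head = new byte[3];
            int n = pis.read(head, 0, 3);
            if (n > 0) {
                pis.unread(head, 0, n); // the bytes go back to the front of the stream
            }
            return pis.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, (byte) '<'};
        // The BOM's first byte is still the first byte read.
        System.out.println(Integer.toHexString(firstByteAfterUnread(data))); // prints ef
    }
}
```

In other words, unread() is only useful for peeking: it affects what that one stream object returns next, not the file on disk.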

How can I actually REMOVE the BOM before I pass the file to the Unmarshaller so that it can be processed successfully? I have seen examples that read the file and write out a new one omitting the first three bytes, but I am wondering whether there is a cleaner way...

Hope this is enough info (may be too much =) ). Thanks in advance.

try {
    File file = new File(props.getProperty("lan_work_dir") + xmlFileName);
    byte[] buf = new byte[(int) file.length()];
    PushbackInputStream pis = new PushbackInputStream(new FileInputStream(file), buf.length);
    pis.read(buf, 0, 3);
    log.info("buf[0]=" + buf[0] + " buf[1]=" + buf[1] + " buf[2]=" + buf[2]);

    // Note that bytes are not being recognized as 0x00EF, 0x00BB, or 0x00BF as expected.
    // As per the previous log statement, bytes are being recognized as the negative numbers below.
    if ((buf[0] == -17) && (buf[1] == -69) && (buf[2] == -65)) {
        log.info("Found matching BOM");
        pis.unread(buf, 0, 3);
    }
    pis.close();
} catch (IOException ex) {
    ex.printStackTrace();
    log.error("Could not remove BOM character from ACK file. Ignore and continue");
}
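One possible way to do this without rewriting the file is sketched below. It assumes the parser can be given an InputStream (or a Reader wrapped around one) rather than a file path. The names BomSkipper, skipUtf8Bom and firstByteAfterSkip are hypothetical. It differs from the snippet above in two ways: the bytes are pushed back only when they are NOT a BOM, and the stream is kept open and returned so that the parser reads from it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;

class BomSkipper {
    // Returns a stream that starts after a leading UTF-8 BOM (EF BB BF), if present.
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pis = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until 3 bytes or EOF.
        while (n < 3) {
            int r = pis.read(head, n, 3 - n);
            if (r < 0) break;
            n += r;
        }
        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && n > 0) {
            pis.unread(head, 0, n); // not a BOM: put the bytes back untouched
        }
        return pis; // the caller keeps this open and hands it to the parser
    }

    // Hypothetical demo helper: the first byte seen after BOM handling.
    static int firstByteAfterSkip(byte[] data) {
        try (InputStream in = skipUtf8Bom(new ByteArrayInputStream(data))) {
            return in.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, (byte) '<'};
        byte[] withoutBom = {(byte) '<', (byte) '?', (byte) 'x'};
        System.out.println((char) firstByteAfterSkip(withBom));    // prints <
        System.out.println((char) firstByteAfterSkip(withoutBom)); // prints <
    }
}
```

With a real file, the idea would be to call skipUtf8Bom(new FileInputStream(file)) and pass the returned stream, for example wrapped in an InputStreamReader with the UTF-8 charset, to the unmarshalling call in place of the file itself.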

Locked on Mar 3 2010

Added on Feb 3 2010
