Hi
I'm trying to write some code that checks whether an InputStream contains only bytes that are valid in a given encoding. I'm using java.nio for that. For tests, I downloaded some character set examples from http://www.columbia.edu/kermit/csettables.html
When creating the CharsetDecoder, I configure it to report all errors instead of silently replacing or ignoring them:
Charset charset = Charset.forName( encoding );
CharsetDecoder decoder = charset.newDecoder();
decoder.onMalformedInput( CodingErrorAction.REPORT );
decoder.onUnmappableCharacter( CodingErrorAction.REPORT );
I then read the InputStream and try to decode it. If decoding fails, the stream can't be in the desired encoding:
boolean isWellEncoded = true;
ByteBuffer inBuffer = ByteBuffer.allocate( 1024 );
ReadableByteChannel channel = Channels.newChannel( inputStream );
while ( channel.read( inBuffer ) != -1 )
{
    CharBuffer decoded = null;
    try
    {
        inBuffer.flip();
        decoded = decoder.decode( inBuffer );
    }
    catch ( CharacterCodingException ex )
    {
        // Also covers MalformedInputException and UnmappableCharacterException,
        // which are both subclasses of CharacterCodingException.
        isWellEncoded = false;
    }
    if ( decoded != null )
    {
        LOG.debug( decoded.toString() );
    }
    if ( !isWellEncoded )
    {
        break;
    }
    inBuffer.compact();
}
channel.close();
return isWellEncoded;
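One thing I'm unsure about in the loop above: the one-argument decode( ByteBuffer ) treats each buffer as a complete input, so a multi-byte sequence that is split across two reads would probably be reported as malformed. A rough sketch of the same check using the three-argument decode( in, out, endOfInput ) might look like this. The class name EncodingCheck and the byte[] convenience overload are made up for illustration, and I have not tried it beyond small inputs:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper, not part of the original code.
final class EncodingCheck {

    /** Returns true if the whole stream decodes without a reported error. */
    static boolean isWellEncoded(InputStream inputStream, String encoding) throws IOException {
        CharsetDecoder decoder = Charset.forName(encoding).newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.allocate(1024);
        CharBuffer out = CharBuffer.allocate(1024);

        try (ReadableByteChannel channel = Channels.newChannel(inputStream)) {
            boolean endOfInput = false;
            while (!endOfInput) {
                endOfInput = channel.read(in) == -1;
                in.flip();
                CoderResult result;
                do {
                    out.clear(); // decoded chars are discarded, only errors matter
                    result = decoder.decode(in, out, endOfInput);
                    if (result.isError()) {
                        return false;
                    }
                } while (result.isOverflow());
                in.compact(); // keep an incomplete trailing sequence for the next read
            }
            // Let the decoder finish any internal state after the last decode call.
            CoderResult result;
            do {
                out.clear();
                result = decoder.flush(out);
                if (result.isError()) {
                    return false;
                }
            } while (result.isOverflow());
            return true;
        }
    }

    /** Convenience overload for in-memory data, mainly for quick experiments. */
    static boolean isWellEncoded(byte[] bytes, String encoding) {
        try {
            return isWellEncoded(new ByteArrayInputStream(bytes), encoding);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }
}
```

With this version, an incomplete sequence at the end of one read stays in the buffer (via compact) and is completed by the next read, and it only counts as an error if the stream really ends in the middle of a character.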
Now I want to check whether a file containing Windows-1252 characters is valid ISO-8859-1. From my point of view, the code above should fail when it gets to the Euro symbol (decimal 128, hex 80), since that's not defined in ISO-8859-1.
But all I get is a ? character instead:
(}) 125 07/13 175 7D RIGHT CURLY BRACKET, RIGHT BRACE
(~) 126 07/14 176 7E TILDE
[?] 128 08/00 200 80 EURO SYMBOL
[?] 130 08/02 202 82 LOW 9 SINGLE QUOTE
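To test my assumption about byte 128 in isolation, a small probe like the following could be used. The names Latin1Probe and decodeSingleByte are made up for illustration. As far as I can tell, Java's ISO-8859-1 decoder maps every byte 0x00 to 0xFF to the code point with the same value (so 0x80 becomes the control character U+0080) and never reports an error, whereas US-ASCII rejects anything above 0x7F. If that is right, the ? may come from the logger or console charset not being able to display U+0080, not from the decoder:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper, not part of the original code.
final class Latin1Probe {

    /**
     * Decodes a single byte with a strict (REPORT) decoder.
     * Returns the resulting code unit, or -1 if the decoder reports an error.
     */
    static int decodeSingleByte(String charsetName, int byteValue) {
        try {
            return Charset.forName(charsetName).newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(new byte[] { (byte) byteValue }))
                    .get(0);
        } catch (CharacterCodingException ex) {
            return -1;
        }
    }

    public static void main(String[] args) {
        // Print numeric values instead of the characters themselves,
        // so the console encoding cannot hide what was decoded.
        System.out.println(Integer.toHexString(decodeSingleByte("ISO-8859-1", 0x80)));
        System.out.println(decodeSingleByte("US-ASCII", 0x80));
    }
}
```

Printing the code point as a number instead of the character avoids any second conversion on the way to the log or terminal.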
I also tried to replace the faulty character, using
decoder.onUnmappableCharacter( CodingErrorAction.REPLACE );
decoder.replaceWith("!");
but I still get the question marks.
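My understanding is that the replacement string is only used when the decoder actually hits a malformed or unmappable input, so if ISO-8859-1 accepts byte 128 the "!" would never show up. A small sketch to compare two charsets (ReplacementProbe and decodeWithBang are made-up names, and I set REPLACE for malformed input as well, because US-ASCII reports bytes above 0x7F as malformed, not as unmappable):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper, not part of the original code.
final class ReplacementProbe {

    /** Decodes the bytes, substituting "!" wherever the decoder hits an error. */
    static String decodeWithBang(String charsetName, byte[] bytes) {
        CharsetDecoder decoder = Charset.forName(charsetName).newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith("!");
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException ex) {
            // Not expected with REPLACE, but decode() declares it.
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        byte[] data = { 0x41, (byte) 0x80 };
        // US-ASCII has no mapping for 0x80, so the replacement is used.
        System.out.println(decodeWithBang("US-ASCII", data));
        // ISO-8859-1 decodes 0x80 to U+0080; print its length and code point
        // so the console encoding does not get in the way.
        String latin1 = decodeWithBang("ISO-8859-1", data);
        System.out.println(latin1.length() + " " + Integer.toHexString(latin1.charAt(1)));
    }
}
```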
I'm probably doing something fundamentally wrong, but I don't get it :-)
Any help is greatly appreciated!
Eric