Parsing HTML entities
800366 Jan 26 2009 (edited Jan 29 2009)
Hello all,
I've been writing an application that modifies XML, where the data is parsed using a SAX parser and the information is placed into a list.
I understand that when SAX parses, it unescapes HTML entities. So for *&quot;* in the data, it becomes *"* when I generate my list. I need this to stop happening, or at least to correct it before I recompile the XML. I've been using regular expressions to find *"*, *©*, etc. and replace them with their entity names, which is working well. However, the main problem is that a character like *Ã* is being converted against my will into *ø* by the parser before my code can handle it, so in the end my regex converts the wrong value.
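To illustrate the behaviour I am describing, here is a minimal, self-contained sketch. It is not my application code: the class name SaxEntityDemo, the helper readValue, and the sample XML are made up for this post. I used a numeric character reference for the copyright sign because, as far as I know, only the five XML entities (amp, apos, quot, lt, gt) are predefined when no DTD is present.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxEntityDemo {

    // Hypothetical helper: parses the XML and returns the "value"
    // attribute of the first element reported by the parser.
    public static String readValue(String xml) throws Exception {
        final String[] result = new String[1];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                if (result[0] == null) {
                    // The parser has already resolved entity and
                    // character references by the time we get here.
                    result[0] = attributes.getValue("value");
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return result[0];
    }

    public static void main(String[] args) throws Exception {
        // Sample input (an assumption, not my real data).
        String xml = "<item value=\"&quot;A&amp;B&quot; &#169;\"/>";
        String value = readValue(xml);
        // The entity names are gone; only the literal characters remain.
        System.out.println(value.contains("&quot;")); // prints false
        System.out.println(value.length());           // prints 7
    }
}
```

So by the time startElement runs, the handler only ever sees the literal characters, never the original entity names.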
When the *startElement* method (of SAX) is called, the data is grabbed using an *attribute.getValue("value")* call, after which the value is passed into an +EntityConversion+ object. Nothing outrageous, just two parallel arrays holding values as follows...
{code}
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class EntityConversion {
//entity characters
String entityCharacters[]={
"&", "'", "\"", "<", ">", "¡", "¢", "£", "¤", "¥", "¦", "§",
"¨", "©", "ª", "«", "¬", "\u00AD", "®", "¯", "°", "±", "²", "³", "´",
"µ", "¶", "·", "¸", "¹", "º", "»", "¼", "½", "¾", "¿", "×", "÷",
"À", "Á", "Â", "Ã", "Ä", "Å", "Æ", "Ç", "È", "É", "Ê", "Ë", "Ì",
"Í", "Î", "Ï", "Ð", "Ñ", "Ò", "Ó", "Ô", "Õ", "Ö", "Ø", "Ù", "Ú",
"Û", "Ü", "Ý", "Þ", "ß", "à", "á", "â", "ã", "ä", "å", "æ", "ç",
"è", "é", "ê", "ë", "ì", "í", "î", "ï", "ð", "ñ", "ò", "ó", "ô",
"õ", "ö", "ø", "ù", "ú", "û", "ü", "ý", "þ", "ÿ"};
//entity names to replace entity characters
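String entityNames[]={
"&amp;", "&apos;", "&quot;", "&lt;", "&gt;", "&iexcl;",
"&cent;", "&pound;", "&curren;", "&yen;", "&brvbar;", "&sect;", "&uml;",
"&copy;", "&ordf;", "&laquo;", "&not;", "&shy;", "&reg;", "&macr;",
"&deg;", "&plusmn;", "&sup2;", "&sup3;", "&acute;", "&micro;", "&para;",
"&middot;", "&cedil;", "&sup1;", "&ordm;", "&raquo;", "&frac14;", "&frac12;",
"&frac34;", "&iquest;", "&times;", "&divide;", "&Agrave;", "&Aacute;", "&Acirc;",
"&Atilde;", "&Auml;", "&Aring;", "&AElig;", "&Ccedil;", "&Egrave;", "&Eacute;",
"&Ecirc;", "&Euml;", "&Igrave;", "&Iacute;", "&Icirc;", "&Iuml;", "&ETH;",
"&Ntilde;", "&Ograve;", "&Oacute;", "&Ocirc;", "&Otilde;", "&Ouml;", "&Oslash;",
"&Ugrave;", "&Uacute;", "&Ucirc;", "&Uuml;", "&Yacute;", "&THORN;", "&szlig;",
"&agrave;", "&aacute;", "&acirc;", "&atilde;", "&auml;", "&aring;", "&aelig;",
"&ccedil;", "&egrave;", "&eacute;", "&ecirc;", "&euml;", "&igrave;", "&iacute;",
"&icirc;", "&iuml;", "&eth;", "&ntilde;", "&ograve;", "&oacute;", "&ocirc;",
"&otilde;", "&ouml;", "&oslash;", "&ugrave;", "&uacute;", "&ucirc;", "&uuml;",
"&yacute;", "&thorn;", "&yuml;"};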
/*****************************************************
*CONSTRUCTOR
****************************************************/
public EntityConversion(){
}
/*****************************************************
*CONVERT ENTITY CHARACTERS
****************************************************/
public String convertEntityCharacters(String tempIdref){
System.out.println("Attempting conversion...");
//temp replacement string
String replaceStr=null;
/*
* for the given string, loop through each entity
* character and replace each occurrence with the
* matching entity name
*/
for(int x=0;x<entityCharacters.length;x++){
// Compile regular expression
String patternStr = "("+entityCharacters[x]+")";
//System.out.print("Find "+patternStr+"\t"); //testing system.out
//value to replace within string
replaceStr=entityNames[x];
// System.out.print("\tReplace with "+replaceStr); //testing system.out
//compile the pattern
Pattern pattern = Pattern.compile(patternStr);
// Match every occurrence of the entity character in the string
Matcher matcher = pattern.matcher(tempIdref);
//replace with entity name
tempIdref = matcher.replaceAll(replaceStr);
// System.out.print("\tnew Results:\t"+tempIdref+"\n"); //testing system.out
}
//return newly modified string
return tempIdref;
}
}
{code}
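For comparison, the same lookup could probably be done in a single pass over the string instead of one regex pass per entity. The sketch below is only an illustration and is not what my application runs: the class name SinglePassEntityEncoder and the shortened table are assumptions, and the full table would mirror the two parallel arrays.

```java
import java.util.HashMap;
import java.util.Map;

public class SinglePassEntityEncoder {

    // Abbreviated, hypothetical table: character -> entity name.
    private static final Map<Character, String> NAMES = new HashMap<>();
    static {
        NAMES.put('&', "&amp;");
        NAMES.put('\'', "&apos;");
        NAMES.put('"', "&quot;");
        NAMES.put('<', "&lt;");
        NAMES.put('>', "&gt;");
        NAMES.put('\u00A9', "&copy;");   // copyright sign
        NAMES.put('\u00F8', "&oslash;"); // o with stroke
    }

    // Walks the input once; each character is either copied as-is
    // or replaced by its entity name. Because the output is never
    // rescanned, an inserted "&" cannot be escaped a second time.
    public static String encode(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String name = NAMES.get(c);
            if (name != null) {
                out.append(name);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("\"A&B\" \u00A9")); // prints &quot;A&amp;B&quot; &copy;
    }
}
```

This does not solve the underlying problem, since the input has already been altered by the parser before either version sees it.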
An assumption of a constant conversion between *Ã* and *ø* is not feasible as this XML data is used for linking in a larger application. Thanks for your help in advance!
Bests,
Goob