Parsing HTML entities

800366, Jan 26 2009 (edited Jan 29 2009)

Hello all,

I've been writing an application that modifies XML. The data is parsed using a SAX parser and the information is placed into a list.

I understand that when SAX parses the document it resolves (unescapes) HTML entities. So *&quot;* in the data becomes *"* when I generate my list. I need this to stop happening, or at least to correct it before I rebuild the XML. I've been using regular expressions to find *"*, *©*, etc. and replace them with their entity names, which is working well. However, the main problem is that a character like *Ã* is being converted against my will into *ø* by the parser before my code can handle it, so in the end my regex converts the wrong value.
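To make the behavior described above concrete, here is a minimal sketch of a SAX handler reading an attribute. The class name `SaxEntityDemo`, the helper `firstValue`, and the sample `<item value="..."/>` document are assumptions made for illustration and are not part of my application. It only shows that predefined entities and numeric character references in an attribute are already resolved by the time *startElement* sees them.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original application.
class SaxEntityDemo {

    // Parses the XML and returns the "value" attribute of the first element seen.
    public static String firstValue(String xml) throws Exception {
        final String[] found = new String[1];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                if (found[0] == null) {
                    // The parser has already replaced &quot;, &amp;, &#169; etc.
                    found[0] = attributes.getValue("value");
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                handler);
        return found[0];
    }

    public static void main(String[] args) throws Exception {
        // Sample input (an assumption): one element with escaped characters.
        String xml = "<item value=\"say &quot;hi&quot; &#169; &amp; co\"/>";
        System.out.println(firstValue(xml));
    }
}
```

The handler never sees the text *&quot;*; it receives the literal quote character, which is why any re-escaping has to happen after parsing.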

When the *startElement* method (of SAX) is called, the data is grabbed using an *attribute.getValue("value")* call, after which the value is passed into an +EntityConversion+ object. Nothing outrageous, just two parallel arrays holding values as follows...

{code}

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityConversion {
    // entity characters (the soft hyphen is written as an escape so it is visible)
    String[] entityCharacters = {
        "&", "'", "\"", "<", ">", "¡", "¢", "£", "¤", "¥", "¦", "§",
        "¨", "©", "ª", "«", "¬", "\u00AD", "®", "¯", "°", "±", "²", "³", "´",
        "µ", "¶", "·", "¸", "¹", "º", "»", "¼", "½", "¾", "¿", "×", "÷",
        "À", "Á", "Â", "Ã", "Ä", "Å", "Æ", "Ç", "È", "É", "Ê", "Ë", "Ì",
        "Í", "Î", "Ï", "Ð", "Ñ", "Ò", "Ó", "Ô", "Õ", "Ö", "Ø", "Ù", "Ú",
        "Û", "Ü", "Ý", "Þ", "ß", "à", "á", "â", "ã", "ä", "å", "æ", "ç",
        "è", "é", "ê", "ë", "ì", "í", "î", "ï", "ð", "ñ", "ò", "ó", "ô",
        "õ", "ö", "ø", "ù", "ú", "û", "ü", "ý", "þ", "ÿ"};
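    // entity names to replace entity characters (same order as above)
    String[] entityNames = {
        "&amp;", "&apos;", "&quot;", "&lt;", "&gt;", "&iexcl;",
        "&cent;", "&pound;", "&curren;", "&yen;", "&brvbar;", "&sect;", "&uml;",
        "&copy;", "&ordf;", "&laquo;", "&not;", "&shy;", "&reg;", "&macr;",
        "&deg;", "&plusmn;", "&sup2;", "&sup3;", "&acute;", "&micro;", "&para;",
        "&middot;", "&cedil;", "&sup1;", "&ordm;", "&raquo;", "&frac14;", "&frac12;",
        "&frac34;", "&iquest;", "&times;", "&divide;", "&Agrave;", "&Aacute;", "&Acirc;",
        "&Atilde;", "&Auml;", "&Aring;", "&AElig;", "&Ccedil;", "&Egrave;", "&Eacute;",
        "&Ecirc;", "&Euml;", "&Igrave;", "&Iacute;", "&Icirc;", "&Iuml;", "&ETH;",
        "&Ntilde;", "&Ograve;", "&Oacute;", "&Ocirc;", "&Otilde;", "&Ouml;", "&Oslash;",
        "&Ugrave;", "&Uacute;", "&Ucirc;", "&Uuml;", "&Yacute;", "&THORN;", "&szlig;",
        "&agrave;", "&aacute;", "&acirc;", "&atilde;", "&auml;", "&aring;", "&aelig;",
        "&ccedil;", "&egrave;", "&eacute;", "&ecirc;", "&euml;", "&igrave;", "&iacute;",
        "&icirc;", "&iuml;", "&eth;", "&ntilde;", "&ograve;", "&oacute;", "&ocirc;",
        "&otilde;", "&ouml;", "&oslash;", "&ugrave;", "&uacute;", "&ucirc;", "&uuml;",
        "&yacute;", "&thorn;", "&yuml;"};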

    /*****************************************************
     * CONSTRUCTOR
     ****************************************************/
    public EntityConversion() {
    }

    /*****************************************************
     * CONVERT ENTITY CHARACTERS
     ****************************************************/
    public String convertEntityCharacters(String tempIdref) {
        System.out.println("Attempting conversion...");

        /*
         * For the given string, loop through each entity
         * character and replace every occurrence with the
         * matching entity name. "&" is first in the array,
         * so ampersands are escaped before any "&...;" names
         * are written into the string.
         */
        for (int x = 0; x < entityCharacters.length; x++) {
            // quote the character so the regex matches it literally
            Pattern pattern = Pattern.compile(Pattern.quote(entityCharacters[x]));
            Matcher matcher = pattern.matcher(tempIdref);
            // quote the replacement so '$' and '\' are not treated specially
            tempIdref = matcher.replaceAll(Matcher.quoteReplacement(entityNames[x]));
        }
        // return newly modified string
        return tempIdref;
    }
}

{code}
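For comparison, the same character-to-name replacement can be sketched as a single pass over the string with a lookup table instead of one regex per entity. This is only an illustration under stated assumptions: the class `SinglePassEscaper`, the method `escape`, and the abbreviated table are hypothetical, and the names are assumed to be the standard HTML entity names. A full table would mirror the two parallel arrays above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical alternative to the regex loop; not the code used in the application.
class SinglePassEscaper {

    // Abbreviated table (an assumption): a real one would hold every entity.
    private static final Map<Character, String> NAMES = new HashMap<>();
    static {
        NAMES.put('&', "&amp;");
        NAMES.put('"', "&quot;");
        NAMES.put('<', "&lt;");
        NAMES.put('>', "&gt;");
        NAMES.put('\u00A9', "&copy;");   // copyright sign
        NAMES.put('\u00F8', "&oslash;"); // o with stroke
    }

    // Walks the input once, so text already written to the output is never rescanned.
    public static String escape(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String name = NAMES.get(c);
            if (name != null) {
                out.append(name);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a & b < \"c\"")); // prints a &amp; b &lt; &quot;c&quot;
    }
}
```

Because each input character is examined exactly once, the result does not depend on the order of the table entries, unlike the loop above, where *&* has to be handled first.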

Assuming a constant conversion between *Ã* and *ø* is not feasible, as this XML data is used for linking in a larger application. Thanks in advance for your help!

Bests,
Goob
