Regular expressions for xml parsing

807580, Nov 19 2009 (edited Nov 19 2009)

I have an XML parsing problem that I have to solve using regular expressions. It's not possible for me to use any method other than regular expressions. But there is a problem that I cannot seem to wrap my head around. I want to extract the contents of a tag, but this tag occurs several times in the XML file and I only want the contents of one particular occurrence. Basically the problem is as follows:

I want to extract

<bp:NAME ***stuff***>(I want this part)</bp:NAME>

This tag can occur in several places. For example here:

<bp:ORGANISM>
***bunch of tags***
<bp:NAME ***stuff***>***stuff***</bp:NAME>
***bunch of tags***
</bp:ORGANISM>

or here:

<bp:DATABASE>
***bunch of tags***
<bp:NAME ***stuff***>***stuff***</bp:NAME>
***bunch of tags***
</bp:DATABASE>

I do not want the content of those tags. I want the content of the <NAME> tag that is not between either the <ORGANISM> tags or the <DATABASE> tags. These tags can be in any order. For the life of me, I cannot figure this problem out. I tried several different approaches. For example, I tried the following regex:

(?:<bp:NAME [^>]*>([^<]*).*?<bp:ORGANISM>.*?</bp:ORGANISM>|
<bp:ORGANISM>.*?</bp:ORGANISM>.*?<bp:NAME [^>]*>([^<]*))

This kind of works: the information I want is in either the first captured group or the second one, so I just check which group is not empty and use that one. But this only works if there is only one other tag containing the name tag (in this particular regular expression, the organism tag). Since there is another tag (the database tag) that I have to work around, and these tags can be in any order, the regular expression becomes three times as large and there are six different groups in which the information I want can occur. This does not seem like a good idea to me. There has to be another way to do this. So I tried using the following regex:

(?:</bp:ORGANISM>)?.*?(?:</bp:DATABASE>)?.*?<bp:NAME [^>]*>([^<]*)

I thought this would get rid of any occurrences of the other tags in front of the name tag, but it doesn't work either. It seems like it is not greedy enough. Well, I think you get the point. I don't know what to try next, so I really need some help.
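To make the behaviour of the two attempts above concrete, here is a minimal sketch that runs both patterns against two tiny inputs, one for each tag order. The class name `AttemptDemo`, the helper method names, the shortened inputs, and the use of `Pattern.DOTALL` are assumptions made for illustration; the post does not say which flags were used. In the second pattern, both optional groups and both `.*?` parts can match nothing, so the pattern appears to simply stop at the first `<bp:NAME ` it reaches, wherever that is.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AttemptDemo {
    // First attempt from the post: the wanted text lands in group 1 or group 2.
    static final Pattern FIRST = Pattern.compile(
        "(?:<bp:NAME [^>]*>([^<]*).*?<bp:ORGANISM>.*?</bp:ORGANISM>|"
        + "<bp:ORGANISM>.*?</bp:ORGANISM>.*?<bp:NAME [^>]*>([^<]*))",
        Pattern.DOTALL); // DOTALL is an assumption so that '.' can cross line breaks

    // Second attempt from the post, copied unchanged.
    static final Pattern SECOND = Pattern.compile(
        "(?:</bp:ORGANISM>)?.*?(?:</bp:DATABASE>)?.*?<bp:NAME [^>]*>([^<]*)",
        Pattern.DOTALL);

    // Returns whichever of the two groups took part in the match, or null.
    static String firstAttempt(String xml) {
        Matcher m = FIRST.matcher(xml);
        if (!m.find()) {
            return null;
        }
        return m.group(1) != null ? m.group(1) : m.group(2);
    }

    // Returns the text captured by the first match of the second pattern, or null.
    static String secondAttempt(String xml) {
        Matcher m = SECOND.matcher(xml);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Shortened, hypothetical inputs modelled on the data in the post.
        String wanted = "<bp:NAME rdf:datatype=\"xsd:string\">Progesterone receptor</bp:NAME>\n";
        String organism = "<bp:ORGANISM>\n"
            + "<bp:NAME rdf:datatype=\"xsd:string\">Homo sapiens</bp:NAME>\n"
            + "</bp:ORGANISM>\n";

        System.out.println(firstAttempt(wanted + organism));   // Progesterone receptor
        System.out.println(firstAttempt(organism + wanted));   // Progesterone receptor
        System.out.println(secondAttempt(wanted + organism));  // Progesterone receptor
        System.out.println(secondAttempt(organism + wanted));  // Homo sapiens
    }
}
```

The last line shows the failure: when the organism block comes first, the second pattern returns the nested name instead of the wanted one.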

Here is an example of the type of data I will run into. The tags can be in any order and they do not always have to occur. In the example below the <DATABASE> tag is not part of the data, and the name tag I want just happens to be in front of the organism tag, but this is not always the case. The name tag I want is the first name tag in the file, namely:

<bp:NAME rdf:datatype="xsd:string">Progesterone receptor</bp:NAME>

So I don't want the name tag that is in between the organism tags.

<bp:protein rdf:ID="CPATH-27885">
<bp:COMMENT rdf:datatype="xsd:string">
Belongs to the nuclear hormone receptor family. NR3 subfamily. SIMILARITY: Contains 1 nuclear receptor DNA-binding domain. WEB RESOURCE: Name=NIEHS-SNPs; URL="http://egp.gs.washington.edu/data/pgr/"; WEB RESOURCE: Name=Wikipedia; Note=Progesterone receptor entry; URL="http://en.wikipedia.org/wiki/Progesterone_receptor"; GENE SYNONYMS: NR3C3. COPYRIGHT:  Protein annotation is derived from the UniProt Consortium (http://www.uniprot.org/).  Distributed under the Creative Commons Attribution-NoDerivs License.
</bp:COMMENT>
<bp:SYNONYMS rdf:datatype="xsd:string">Nuclear receptor subfamily 3 group C member 3</bp:SYNONYMS>
<bp:SYNONYMS rdf:datatype="xsd:string">PR</bp:SYNONYMS>
<bp:NAME rdf:datatype="xsd:string">Progesterone receptor</bp:NAME>
<bp:ORGANISM>
<bp:bioSource rdf:ID="CPATH-LOCAL-112384">
<bp:NAME rdf:datatype="xsd:string">Homo sapiens</bp:NAME>
<bp:TAXON-XREF>
<bp:unificationXref rdf:ID="CPATH-LOCAL-112385">
<bp:DB rdf:datatype="xsd:string">NCBI_TAXONOMY</bp:DB>
<bp:ID rdf:datatype="xsd:string">9606</bp:ID>
</bp:unificationXref>
</bp:TAXON-XREF>
</bp:bioSource>
</bp:ORGANISM>
<bp:SHORT-NAME rdf:datatype="xsd:string">PRGR_HUMAN</bp:SHORT-NAME>
<bp:XREF>
<bp:relationshipXref rdf:ID="CPATH-LOCAL-112386">
<bp:DB rdf:datatype="xsd:string">ENTREZ_GENE</bp:DB>
<bp:ID rdf:datatype="xsd:string">5241</bp:ID>
</bp:relationshipXref>
</bp:XREF>
<bp:XREF>
<bp:unificationXref rdf:ID="CPATH-LOCAL-112387">
<bp:DB rdf:datatype="xsd:string">UNIPROT</bp:DB>
<bp:ID rdf:datatype="xsd:string">P06401</bp:ID>
</bp:unificationXref>
</bp:XREF>
<bp:XREF>
<bp:unificationXref rdf:ID="CPATH-LOCAL-112388">
<bp:DB rdf:datatype="xsd:string">UNIPROT</bp:DB>
<bp:ID rdf:datatype="xsd:string">A7X8B0</bp:ID>
</bp:unificationXref>
</bp:XREF>
<bp:XREF>
<bp:relationshipXref rdf:ID="CPATH-LOCAL-112389">
<bp:DB rdf:datatype="xsd:string">GENE_SYMBOL</bp:DB>
<bp:ID rdf:datatype="xsd:string">PGR</bp:ID>
</bp:relationshipXref>
</bp:XREF>
<bp:XREF>
<bp:relationshipXref rdf:ID="CPATH-LOCAL-112390">
<bp:DB rdf:datatype="xsd:string">REF_SEQ</bp:DB>
<bp:ID rdf:datatype="xsd:string">NP_000917</bp:ID>
</bp:relationshipXref>
</bp:XREF>
<bp:XREF>
<bp:unificationXref rdf:ID="CPATH-LOCAL-112391">
<bp:DB rdf:datatype="xsd:string">UNIPROT</bp:DB>
<bp:ID rdf:datatype="xsd:string">Q9UPF7</bp:ID>
</bp:unificationXref>
</bp:XREF>
<bp:XREF>
<bp:unificationXref rdf:ID="CPATH-LOCAL-113580">
<bp:DB rdf:datatype="http://www.w3.org/2001/XMLSchema#string">CPATH</bp:DB>
<bp:ID rdf:datatype="http://www.w3.org/2001/XMLSchema#string">27885</bp:ID>
</bp:unificationXref>
</bp:XREF>
</bp:protein>
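
For data shaped like the example above, one possible regex-only approach is to let a single alternation consume whole <ORGANISM> and <DATABASE> blocks and capture a name tag only when it is reached outside such a block. The sketch below is an illustration, not a verified solution for the full file: the class name `NameExtractor`, the method `extractName`, and the contents of the sample <DATABASE> block are made up, and it assumes that the <bp:ORGANISM> and <bp:DATABASE> opening tags carry no attributes and are never nested inside a block of the same kind.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NameExtractor {
    // Alternative 1 swallows a whole ORGANISM or DATABASE block (group 1 is the tag name).
    // Alternative 2 captures the text of a NAME tag found outside such a block (group 2).
    static final Pattern SKIP_OR_NAME = Pattern.compile(
        "<bp:(ORGANISM|DATABASE)>.*?</bp:\\1>|<bp:NAME [^>]*>([^<]*)",
        Pattern.DOTALL);

    // Returns the text of the first NAME tag outside the skipped blocks, or null if none.
    static String extractName(String xml) {
        Matcher m = SKIP_OR_NAME.matcher(xml);
        while (m.find()) {
            if (m.group(2) != null) {
                return m.group(2);
            }
            // Otherwise a whole block was consumed; keep scanning after it.
        }
        return null;
    }

    public static void main(String[] args) {
        String wanted = "<bp:NAME rdf:datatype=\"xsd:string\">Progesterone receptor</bp:NAME>\n";
        String organism = "<bp:ORGANISM>\n"
            + "<bp:NAME rdf:datatype=\"xsd:string\">Homo sapiens</bp:NAME>\n"
            + "</bp:ORGANISM>\n";
        // Hypothetical DATABASE block; the post does not show a real one.
        String database = "<bp:DATABASE>\n"
            + "<bp:NAME rdf:datatype=\"xsd:string\">Some database</bp:NAME>\n"
            + "</bp:DATABASE>\n";

        System.out.println(extractName(wanted + organism));            // Progesterone receptor
        System.out.println(extractName(organism + database + wanted)); // Progesterone receptor
        System.out.println(extractName(database + wanted + organism)); // Progesterone receptor
    }
}
```

Because each unwanted block is matched as a unit, there is a single capturing group for the wanted text regardless of the order in which the tags occur, so the pattern does not grow with every extra container tag beyond one more name in the first alternative.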

Edited by: Dani3ll3 on Nov 19, 2009 2:51 AM
