As part of my work as a data analyst in NLP, I get XML metadata and XML-ish data to scrape (after running it through tidy). People keep their data files close to their hearts, and most see and treat them as plain text. You have no way to enforce data-handling policies on customers ...
I have been using SAX parsers with ContentHandlers to extract the data I need, but you basically have to write a new ContentHandler for each kind of data you get. At some point I started to think about, and implement, code that reads a bunch of XPath expressions from an input file and feeds them to a generic ContentHandler. Do you know of such algorithmic ideas and/or implementations, even if they are not coded in Java?
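To make the idea concrete, here is a minimal sketch of the data-driven approach using the standard `javax.xml.xpath` API instead of SAX. It assumes the documents fit in memory (DOM); the expressions would normally come from the input file, one per line, but are hard-coded here for illustration. A true streaming version would need to compile each expression into a state machine matched inside the ContentHandler.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XPathDriven {
    // Evaluate each XPath expression against the document and collect
    // "expression -> text content" strings for every matching node.
    public static List<String> evaluateAll(String xml, List<String> expressions)
            throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        List<String> results = new ArrayList<>();
        for (String expr : expressions) {
            NodeList nodes = (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);
            for (int i = 0; i < nodes.getLength(); i++) {
                results.add(expr + " -> " + nodes.item(i).getTextContent());
            }
        }
        return results;
    }

    public static void main(String[] args) throws Exception {
        // In practice the expressions would be read from a config file.
        String xml = "<doc><a>one</a><b><a>two</a></b></doc>";
        for (String line : evaluateAll(xml, List.of("//a", "/doc/b/a"))) {
            System.out.println(line);
        }
    }
}
```

Swapping the expression list then changes what gets extracted without touching any handler code, which is the point of the exercise.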
In the same way that, say, a piece of software has to pass the ECMAScript conformance tests in order to call itself an ECMAScript engine, is there some sort of "XPath compliance" test? Or, where could I find a comprehensively large test set of XML files and the corresponding XPath expressions to test against?
In order to get a sense of how much the structure of some semi-structured text has changed (not the content itself), I would like to run some code that prints out all the XPath expressions occurring in it, in the order in which they appear. Do you know of such a thing?
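If nothing off the shelf turns up, a plain SAX DefaultHandler gets surprisingly close: keep a stack of element names and emit the joined path on every `startElement`. This is only a sketch; it prints simple location paths without predicates or positional indices (`/doc/a` rather than `/doc/a[2]`), which may already be enough for diffing structure between two versions of a file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class PathLister extends DefaultHandler {
    private final StringBuilder path = new StringBuilder();
    private final List<String> paths = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        // Push the element name and record the current location path.
        path.append('/').append(qName);
        paths.add(path.toString());
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Pop the element name (and its leading '/') off the path.
        path.setLength(path.length() - qName.length() - 1);
    }

    // Parse the document and return every element path in document order.
    public static List<String> listPaths(String xml) throws Exception {
        PathLister handler = new PathLister();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                       handler);
        return handler.paths;
    }

    public static void main(String[] args) throws Exception {
        for (String p : listPaths("<doc><a/><b><a/></b></doc>")) {
            System.out.println(p);
        }
    }
}
```

Running this on `<doc><a/><b><a/></b></doc>` prints `/doc`, `/doc/a`, `/doc/b`, `/doc/b/a`, one per line; comparing two such listings gives a rough structural diff.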