Which data structures are used to represent duplicates in secondary DBs?
May 7 2010 (edited May 11 2010)

Hello,
I am currently running an evaluation comparing two methods of representing textual data: Oracle BDB XML versus an approach of my own that relies on Berkeley DB.
I use a secondary database (DB type HASH) with DUP_SORT to index the words of texts, in order to allow fast search and retrieval. As far as I understand, the hash is used to look up the hits of (in my case) a given word. Or, more precisely: it maps from a word to a set/list of keys in the primary database. Since a word can appear multiple times, I must use a secondary database which allows duplicates. So far so good.
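To make the setup concrete, here is a minimal sketch in plain Python of the idea (this is not the Berkeley DB API; the dict/list names are placeholders for the two databases): the "primary database" maps document keys to text, and the "secondary database" maps each word to a sorted list of primary keys, i.e. sorted duplicate entries under one word key, as with DUP_SORT.

```python
from collections import defaultdict
from bisect import insort

primary = {}                    # "primary DB": doc_key -> text
secondary = defaultdict(list)   # "secondary DB": word -> sorted doc_keys

def put(doc_key, text):
    """Insert into the primary DB and maintain the word index."""
    primary[doc_key] = text
    for word in set(text.split()):
        # keep the duplicate entries for each word sorted (like DUP_SORT)
        insort(secondary[word], doc_key)

def lookup(word):
    """Return all primary keys ('hits') for a word."""
    return secondary.get(word, [])

put(1, "the quick brown fox")
put(2, "the lazy dog")
put(3, "quick quick dog")
print(lookup("quick"))   # -> [1, 3]
print(lookup("dog"))     # -> [2, 3]
```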
Compared to a similar setting where I use the indexing mechanisms of Oracle BDB XML, I discovered that the time to extract all hits of a lookup scales pretty badly with my Berkeley DB based approach; it looks something like O(log n) per hit. In contrast, Oracle BDB XML needs roughly the same time to retrieve 5,000, 30,000, or even 900,000 hits.
My question (to the developers, I guess): which data structures are used to store the duplicates in secondary databases, and which approach does Oracle BDB XML use?
Peter