The present invention is in the general field of accessing data, including but not limited to eXtensible Markup Language (XML) documents.
There follows a glossary of conventional terms. The meanings of the terms are generally known per se; accordingly, the definitions below are provided for clarity and should not be regarded as binding.
Glossary of Terms
Data: Information that one wants to store and/or manipulate.
Database: A collection of data organized by some set of rules.
Attribute: A feature or characteristic of specific data, represented e.g. as "columns" in a relational database. A record representing a person might have an attribute "age" that stores the person's age. Each column represents an attribute. In XML (XML is defined below), there is an "attribute" that exists as part of a "tag."
Column: In a relational database, columns represent attributes for particular rows in a relation. For example, a single row might contain a complete mailing address. The mailing address would have four columns ("attributes"): street address, city, state, and zip code.
Record: A single entry in a database. Often referred to as a "tuple" or "row" in a relational database.
Tuple: See "record."
Row: See "record."
Table: See "relation."
Relation: A way of organizing data into a table consisting of logical rows and columns. Each row represents a complete entry in the table. Each column represents an attribute of the row entries. Frequently referred to as a "table."
Relational database: A database that consists of one or more "relations" or "tables."
Database administrator: A person (or persons) responsible for optimizing and maintaining a particular database.
Schema: The organization of data in a database. In a relational database, all new data that comes into the database must be consistent with the schema, or the database administrator must change the schema (or reject the new data).
Index: Extra information about a database used to reduce the time required to find specific data in the database. It provides access to particular rows based on a particular column or columns.
Path: A series of relationships among data elements. For instance, a path from a grandson to grandfather would be two steps: from son to father, and from father to grandfather.
Structure: The embodiment of paths in particular documents or data. For example, in a "family tree," the structure of the data is hierarchical: it is a tree with branches from parents to children. Data without a hierarchical structure is often referred to as "flat."
Query: A search for information in a database.
Range query: A search for a range of data values, like "all employees aged 25 to 40."
I/O: A read from a physical device, such as a fixed disk (hard drive). I/Os take a significant amount of time compared to memory operations: usually hundreds and even thousands of times (or more) longer.
Block read: Reading a fixed-sized chunk of information for processing. A block read implies an "I/O" if the block is not in memory.
Tree: A data structure that is either empty or consists of a root node linked by means of d (d ≥ 0) pointers (or links) to d disjoint trees called subtrees of the root. The roots of the subtrees are referred to as "child nodes" of the root node of the tree, and nodes of the subtrees are "descendent nodes" of the root. A node in which all the subtrees are empty is called a "leaf node." The nodes in the tree that are not leaves are designated as "internal nodes."
In the context of the invention, leaf nodes are also nodes that are associated with data.
Nodes and trees should be construed in a broad sense. Thus, the definition of tree also encompasses a tree of blocks wherein each node constitutes a block. In the same manner, descendent blocks of a said block are all the blocks that can be accessed from the block. For a detailed definition of "tree," also refer to the book by Lewis and Denenberg, "Data Structures and Their Algorithms."
B-tree: A tree structure that can be used as an index in a database. It is useful for exact match and range queries. B-trees frequently require multiple block reads to access a single record. A more complete description of B-trees can be found on pages 473-479 of The Art of Computer Programming, volume 3, by Donald Knuth (© 1973, Addison-Wesley).
Hash table: A structure that can be used as an index in a database. It is useful for exact match queries. It is not useful for range queries. Hash tables generally require one block read to access a single record. A more complete description of hash tables can be found on e.g. pages 473-479 of The Art of Computer Programming, volume 3, by Donald Knuth (© 1973, Addison-Wesley).
Inverted list: A structure that can be used as an index in a database. It is a set of character strings that points to records that contain particular strings. For example, an inverted list may have an entry "hello." The entry "hello" points to all database records that have the word "hello" as part of the record. A more complete description of inverted lists can be found on e.g. pages 552-559 of The Art of Computer Programming, volume 3, by Donald Knuth (© 1973, Addison-Wesley).
Semi-structured data: Data that does not conform to a fixed schema. Its format is often irregular or only loosely defined.
Data mining: Searching for useful, previously unknown patterns in a database.
Object: An object is some quantity of data. It can be any piece of data, a single path in a document path, or some mixture of structure and data. An object can be a complete record in a database, or formed "on the fly" out of a portion of a record returned as the result of a query.
Markup: In computerized document preparation, a method of adding information to the text indicating the logical components of a document, or instructions for layout of the text on the page or other information which can be interpreted by some automatic system. (From the Free On-Line Dictionary of Computing, www.foldoc.ic.ac.uk)
Markup Language: A language for applying markup to text documents to indicate formatting and logical contents. Markup languages are increasingly being used to add logical structure information to documents to enable automated or semi-automated processing of such documents. Many such languages have been proposed, ranging from generic ones such as SGML and XML, to industry or application-specific versions.
SGML: A specific example of Markup Language, Standard Generalized Markup Language. SGML is a means of formally describing a language, in this case, a markup language. A markup language is a set of conventions used together for encoding (e.g., HTML or XML).
XML: A specific example of Markup Language, eXtensible Markup Language. A language used to represent semi-structured data. It is a subset of SGML. XML documents can be represented as trees.
Key: An identifier used to refer to particular rows in a database. In the context of a relational database, keys represent column information used to identify rows. For instance, "social security number" could be a key that uniquely identifies each individual in a database. Keys may or may not be unique.
Join: A method of matching portions of two or more tables to form a (potentially much larger) unified table. This is generally one of the most expensive relational database operations, in terms of space and execution time.
Key search: The search for a particular value or data according to a key value. This search is usually performed by an index.
Search: In the context of data, searching is the process of locating relevant or desired data from a (typically much larger) set of data based on the content and/or structure of the data. Searching is often done as a batch process, in which a request is submitted to the system, and after processing the request, the system returns the data or references to the data that match the request. Typical (yet not exclusive) examples of searching are the submission of a query to a relational database system, or the submission of key words to a search engine on the World Wide Web.
Path search: The search for a particular path in the database. A "path" is a series of relationships among data elements. For instance, part of an invoice might have the "buyer," and that buyer has an "address" on the invoice. A search for the address of all buyers is really for the path "invoice to buyer to address." This is a search for a particular structure, which is different from key search (the search for particular values). Path search and key search may be combined.
Browsing: In the context of data, browsing is the process of interactively locating relevant or desired data by wandering or navigating through a (typically much larger) set of data. Browsing can be done based on data content, structure, or a combination of these. A common example of browsing is the traversal of hyperlinks in the World Wide Web in order to locate relevant web pages.
Access: In the context of data, access is the process of obtaining data, typically through searching, browsing, or following references.
Sibling: Elements of a tree that share the same parent are siblings. This is the same sense in which brothers and sisters are siblings.
Tag: An XML tag represents structural information in an XML document. A tag may or may not surround data and may or may not contain other tags. All tags have a parent, except the first tag. Additionally see "markup."
Parent-child: In a tree, a child is an element that branches from its parent. In XML, if "tag1" immediately surrounds "tag2," then "tag1" is the parent of "tag2." "Tag2" is the child of "tag1."
Token: A short pattern used to represent another pattern.
Complete-key indexing: An indexing method that stores the key as part of the index. This provides an exact "hit or miss" result when using the index, but the index is very large when the keys are large. This is contrasted with "compressed-key indexing."
Compressed-key indexing: A compressed-key index does not store the entire key in the index, and thus can be significantly smaller than a complete-key index (for the same keys). However, it may provide "false positives" (that can be removed later). It should not miss relevant records ("false negatives"). This is contrasted with "complete-key indexing." Compressed-key indexing is described e.g. in U.S. Pat. No. 6,175,835.
Encoding: Transforming one representation into a different, equivalent representation. For example, representing the Roman numeral "VII" as the decimal number "7" is a form of encoding.
Sibling Order: Semi-structured data stored in files have a specific "order" associated with the data. In a race, finishers are ordered based on their order of appearance across the finish line: "first," "second," "third," etc. With semi-structured data, siblings can be ordered by their appearance in the document.
Semantic information: "Of or relating to meaning, especially meaning in language." (The American Heritage Dictionary of the English Language, Third Edition, copyright 1996, 1992 by Houghton Mifflin Company) The difference between the word "orange" used to represent a color and the word "orange" used to represent a fruit is a "semantic" difference. "Semantic information" is information about the meaning of tags and data.
Syntactic information: Syntax is the study of the rules whereby words or other elements of sentence structure are combined to form proper sentences. "Syntactic information" in semi-structured data represents the tags and data, without information regarding the meaning of the tags and data.
Homonym: A word that is used to designate several different things. The word "bow" represents a stringed weapon, the front of a ship, and a loop of ribbon, among other things. When used with more than one semantic meaning, "bow" would be an example of a homonym.
Synonym: A word having the same or nearly the same meaning as another word in a language. Words like "top," "peak," and "apex" are synonyms in English.
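By way of non-limiting illustration only, the trade-off named in the glossary entry for compressed-key indexing (a smaller index that may return false positives but should not miss relevant records) may be sketched in Python as follows. The hash-prefix scheme and the names `short_token`, `CompressedKeyIndex`, and `lookup` are assumptions made for this sketch; they are not the method of U.S. Pat. No. 6,175,835 and do not appear in this specification.

```python
import hashlib
from collections import defaultdict


def short_token(key: str, nbytes: int = 1) -> bytes:
    """Compress a key to a few bytes (assumed scheme: prefix of a SHA-256 digest)."""
    return hashlib.sha256(key.encode("utf-8")).digest()[:nbytes]


class CompressedKeyIndex:
    """Stores only short tokens, never the full keys, so distinct keys may collide."""

    def __init__(self, nbytes: int = 1):
        self.nbytes = nbytes
        self.buckets = defaultdict(list)  # token -> list of record ids

    def add(self, key: str, record_id: int) -> None:
        self.buckets[short_token(key, self.nbytes)].append(record_id)

    def candidates(self, key: str) -> list:
        # Every record stored under this key is returned (no false negatives);
        # records of colliding keys may be returned too (false positives).
        return list(self.buckets.get(short_token(key, self.nbytes), []))


# Hypothetical records: record id -> full key, kept outside the index.
records = {0: "apple", 1: "banana", 2: "cherry", 3: "damson"}
index = CompressedKeyIndex()
for record_id, key in records.items():
    index.add(key, record_id)


def lookup(key: str) -> list:
    """Remove false positives by checking each candidate against the stored record."""
    return [rid for rid in index.candidates(key) if records[rid] == key]
```

The final check against the stored record is what removes false positives "later"; a complete-key index would instead keep the full key inside the index and skip that step, at the cost of a larger index.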
The most popular database today is the relational database. In a relational database, data is stored in relations (or "tables"). Tables have columns and rows. The rows are often referred to as "records" and consist of a single related group of data, like a complete mailing address. The columns in the tables represent attributes of the rows. A column in a mailing address table might be "Zip Code," just one part of a row.
Relations are defined by a database administrator, and have a fixed format called a "schema." For instance, the schema for a mailing address relation might be (name, address, city, state, zip), which is a "name" followed by an "address" followed by a "city," etc. Each mailing address that appears in the table has to have that exact format. Changes to the schema are quite expensive, and result in significant "downtime" for the database.
The database administrator also builds initial indexes. An index is a data structure that provides very quick access to particular rows, based on particular data values. For instance, a database administrator could make an index over the "zip" column that would make searching for zip codes very fast. Building new indexes over an existing relation can be quite expensive (insofar as space and processing time are concerned).
Relational databases generally index the data using variants of B-trees, hash tables, and inverted lists. These indexes provide good performance when the schema is fixed and when the database administrator has created specific indexes for expected queries. It is generally quite expensive when users wish to perform a query that cannot use an existing index.
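For illustration only, the difference between these index types, and the cost of a query that no index serves, may be sketched in Python. In this sketch a dictionary stands in for a hash table and a sorted list searched with `bisect` stands in for an ordered index such as a B-tree; the mailing-address rows and all function names are hypothetical.

```python
import bisect

# Hypothetical (name, zip) rows of a mailing-address relation.
rows = [("Ann", "02139"), ("Bob", "10001"), ("Cy", "60614"),
        ("Di", "94105"), ("Ed", "10001")]

# Hash-style index over "zip": supports exact match only.
hash_index = {}
for pos, (_, zip_code) in enumerate(rows):
    hash_index.setdefault(zip_code, []).append(pos)

# Ordered index over "zip" (stand-in for a B-tree): sorted (zip, row position) pairs.
ordered_index = sorted((zip_code, pos) for pos, (_, zip_code) in enumerate(rows))


def exact(zip_code):
    """Exact-match query answered by the hash-style index."""
    return [rows[pos][0] for pos in hash_index.get(zip_code, [])]


def zip_range(lo, hi):
    """Range query lo <= zip <= hi answered by the ordered index."""
    start = bisect.bisect_left(ordered_index, (lo, -1))
    end = bisect.bisect_right(ordered_index, (hi, len(rows)))
    return [rows[pos][0] for _, pos in ordered_index[start:end]]


def scan_by_name(name):
    """No index exists over "name", so every row must be inspected."""
    return [zip_code for n, zip_code in rows if n == name]


print(exact("10001"))               # ['Bob', 'Ed']
print(zip_range("10000", "70000"))  # ['Bob', 'Ed', 'Cy']
print(scan_by_name("Di"))           # ['94105']
```

The hash-style index cannot answer the range query without visiting every bucket, and the query by name falls back to a full scan; both correspond to the expensive cases described above.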
As specified above, data that do not conform to a fixed schema are referred to as semi-structured. This type of data is often irregular and only loosely defined. Even in the previous example of a mailing address, one can see how semi-structured data could be used. Imagine a database that takes global mailing addresses. Some addresses would have cities and states, some would include country and country designator, some would have numeric zip codes, some alphanumeric postal codes, and many would include extra information like "office number." They would be very different, depending on where they originated. In all cases, though, even though they do not look the same, they are still instances of a "mailing address."
Perhaps the best example of semi-structured data is the content available on the Web. On the Web, semi-structured data is the rule, not the exception. Just storing and searching simple web pages is difficult because they are all so different. Semi-structured data also appears in business and scientific applications that access (or generate) heterogeneous data stores. Even a simple task like data mining in a database of customer purchases, where the data comes from a seemingly structured source, has at its core a semi-structured data representation: the "shopping basket." Every shopping basket is different and a collection of such baskets is a semi-structured source.
Unlike the relational data stored in a typical database, semi-structured data does not have a strict schema and does not have to conform to a preset format. Because of this, semi-structured data is more interesting at this time, since its storage and searching mechanisms are fairly new by industry standards.
In searching semi-structured data, queries often include information about the structure of the data, not just field contents. For instance, genealogists may care about the grandchildren of a particular historical figure. Such data paths (e.g., the path from "grandparent" to "grandchild") are often explicit in the semi-structured data, but are not stored explicitly in a relational database. At the same time, semi-structured data may be queried over particular attributes, independent of the structure that surrounds those attributes (e.g. key word search).
Also, queries may contain structural constraints about the objects returned. For instance, one may be interested only in "restaurant objects" that minimally have a name and address associated with them. These are not queries solely associated with semi-structured data, but the types of queries that may be encountered and that an effective index should account for. A more detailed discussion of paths in semi-structured data can be found in, e.g. "Keys for XML," Buneman, et al., August 2000.
Relational databases that store structured data are going through a process of incremental refinement, realizing gains in performance at an ever-slowing pace. Semi-structured storage solutions at this time are neither mature nor standardized. Most storage and indexing systems shoehorn semi-structured data into some well-understood structural form such as relational format. This is far from being a practical, scalable solution and there is accordingly room for significant improvement.
There follows a discussion in connection with a specific instance of semi-structured data, i.e. the popular XML (eXtensible Markup Language) data. Various academic papers and emerging products focus on the generation, storage, and search of XML. XML has been considered the "lingua galactica" for content exchange over the Internet because it can be used to encode any arbitrary type and quantity of information. As the consideration of XML has progressed, it has become apparent that it is also an important way to store and search information, not just to transmit it. This importance is reflected in significant industry attention as evidenced e.g. in "Oracle8i - The XML Enabled Data Management System," by Banerjee, et al. and "XML and DB2" by Cheng and Xu.
XML is a well-understood subset of SGML (Standard Generalized Markup Language) that has received significant attention in recent years as the candidate for a clear standard representation of semi-structured data. With a growing proportion of electronic business traffic moving to XML, a reliable, scalable, efficient storage and retrieval solution is becoming critical.
The need to handle semi-structured data is growing in step with increasing business outsourcing, systems integration, mergers and acquisitions, and the explosive growth in information available on the Internet. The new e-commerce era sets new demands on data and information exchange. Wherever any two non-identical data sources come together, the choice is costly translation and mediation, specific to those two sources, or an effective semi-structured search and storage system.
In the past, business data generally had to have a significant amount of structure in order to be stored and searched efficiently. That is not to say semi-structured data is in any way a new idea; rather, it was frequently avoided before because it can be expensive and difficult to deal with.
Within an organization, some designating authority can specify standard data representations, and all of the data creators and users expect and understand the canonical data format. Within larger organizations, multiple standards may emerge within departments or divisions, and expensive custom translators can be built and maintained to transfer data between organizational units. Between organizations, however, the problem becomes intractable. There are simply too many formats to deal with.
Traditional storage and querying then happened only at the organizational level, and often with just a fraction of an organization's data. There were significant restrictions on how this data could be organized, as well, if it was stored in a commercial database. These restrictions mean that every object in a relation has the same organization, and is just as likely to be over-specified or under-specified, just to fit it into the current mold.
If the data do not have a fixed schema, however, it can be nearly impossible to get good storage and query results from a relational database. The hitherto known solutions for storing XML and other semi-structured data typically use relational databases and generally require breaking the XML apart into chunks that can fit into a table, with non-conforming pieces going into overflow bins of some sort. For XML that is quite different from current database data, new relations are created. For each new relation created, search becomes slower and more difficult.
Even with this approach, intelligent search is still often difficult. The tables are flat structures, in rows and columns, with little obvious indication about the underlying data's structure. The data in a relation is like logs on a woodpile: so many rows high, so many columns wide. But XML is like a complete tree, with paths and branches and leaves. Imagine searching the woodpile for the "third branch on the left, 17' high in the tree." You could do it by piecing the tree back together from the woodpile, rebuilding the tree until you found your answer, then stacking the wood back on the pile when you were done. This simplified example illustrates how relational databases storing XML data work.
The treatment of both structured and semi-structured data is usually by index schemes that A) have uniform key structure, and B) do not store complete relationships. Because relational databases break documents apart into tables, some information is no longer explicit in the database and must be reconstructed for each search. For example, an invoice might have a fixed set of fields (i.e., "shipping address" and "billing address") that get stored in one table, and an arbitrary set of fields (i.e., the items on the invoice) that get stored in another table. This is frequently done because it can save a considerable amount of storage space. However, when there is a search that uses one of the relationships that is a natural part of the original invoice, but has been split into multiple tables to save space, an expensive reconstruction (i.e., a "join") must occur. This leads to two significant problems: large index size and expensive search for structure.
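The invoice example above may be illustrated, under stated assumptions, with Python's standard `sqlite3` module. The table names, column names, and rows below are hypothetical and chosen only for this sketch. The invoice is split into one table of fixed fields and one table of items, so a question that spans both parts of the original invoice requires a join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Fixed fields of each invoice live in one table ...
    CREATE TABLE invoice (
        id INTEGER PRIMARY KEY,
        shipping_address TEXT,
        billing_address TEXT
    );
    -- ... and the variable-length item list lives in another.
    CREATE TABLE invoice_item (
        invoice_id INTEGER,
        description TEXT,
        quantity INTEGER
    );
    INSERT INTO invoice VALUES (1, '1 Main St', '9 Bank Rd');
    INSERT INTO invoice VALUES (2, '5 Oak Ave', '5 Oak Ave');
    INSERT INTO invoice_item VALUES (1, 'widget', 3);
    INSERT INTO invoice_item VALUES (1, 'gasket', 10);
    INSERT INTO invoice_item VALUES (2, 'widget', 1);
""")

# "Where are widgets being shipped?" touches both halves of the original
# invoice, so the relationship must be reconstructed with a join.
rows = conn.execute("""
    SELECT i.shipping_address, it.description, it.quantity
    FROM invoice AS i
    JOIN invoice_item AS it ON it.invoice_id = i.id
    WHERE it.description = 'widget'
    ORDER BY i.id
""").fetchall()

print(rows)  # [('1 Main St', 'widget', 3), ('5 Oak Ave', 'widget', 1)]
```

In the original invoice document the item and the shipping address sit under the same root, so the relationship that the join recomputes was explicit before the document was split.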
Searching for structure ("path search") is necessary in certain applications, but prohibitively expensive in relational databases. Imagine a genealogy database that stores census records and information about family trees. Each person would likely be represented as an object in a relation, and that person would have a set of two biological parents. Since the number of children a person has can vary, either the children of a person are simply not stored in the same record or, in accordance with another solution, a large number of slots are allocated in order to accommodate (possibly) many children. In most cases, where a person has an average or smaller number of children, most of the slots would be empty. The first solution (not storing children) would be good in terms of database size, since one can tell person A's children by seeing who has person A as a parent. The second option (having many mostly empty children slots) makes searching for children quicker, but at the expense of disk space (and database size).
Using this genealogy database, try to find "all of George Washington's living descendents." First, George Washington is found in the database, hopefully very quickly, and his children are determined. Then his children's records are searched in order to find out who their children were. This type of search is continued, again and again, repeatedly inspecting data (the tree is rebuilt from the woodpile). Every time the next generation is encountered, the problem requires an expensive join or difficult successive query rewrites. The resulting living descendents are eventually obtained after significant work and many I/O operations.
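The generation-by-generation search described above may be sketched in Python as follows, for illustration only. The rows use hypothetical placeholder names rather than real genealogical data, and each full pass over the `people` list stands in for one self-join (or one rewritten query) against a relation that stores only each person's parent.

```python
# Hypothetical (person, parent, living) rows; children are not stored with
# the parent, so they can only be found by looking for rows naming that parent.
people = [
    ("root", None, False),
    ("c1", "root", False),
    ("c2", "root", True),
    ("g1", "c1", True),
    ("g2", "c1", False),
    ("gg1", "g2", True),
]


def living_descendants(ancestor):
    """Return (sorted living descendants, number of passes over the table)."""
    found = []
    frontier = {ancestor}
    passes = 0
    while frontier:
        passes += 1
        # One pass per generation: find everyone whose parent is in the frontier.
        next_generation = [(name, living)
                           for name, parent, living in people
                           if parent in frontier]
        found.extend(name for name, living in next_generation if living)
        frontier = {name for name, _ in next_generation}
    return sorted(found), passes


print(living_descendants("root"))  # (['c2', 'g1', 'gg1'], 4)
```

The number of passes grows with the depth of the family tree, not with the size of the answer: three generations of descendants cost four passes here, the last one only to discover that no further generation exists.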
With the popular available indexes for relational databases (e.g. B-trees, hash tables, inverted lists), searching for XML paths will always be difficult because the document structure is lost in the translation to relations. Every significant new document format requires a new relation. And in order to search this new type of relation, the database administrator must come in and explicitly create new indexes.
A relational mapping of XML into a database is not the only possibility. It is also feasible to map XML data to objects in an object-oriented (OO) database management system. There are clear reasons to store and search semi-structured sources using the relatively ineffective mechanisms available in relational systems over object-oriented databases in many instances. There are also reasons to favor OO databases over their relational counterparts.
In choosing an OO solution, there is significant overhead associated with storing the data. First, XML data elements must be tagged with unique object identifiers (xe2x80x9cOIDsxe2x80x9d) so that they can be later found in the database. This overhead can swell the size of the data. This increase in size occurs well before any indexes to aid search are built. Once indexes are built to aid the search, the OO solution requires even more space.
Furthermore, OO databases generally use index structures that are quite suitable for relational systems, and support relational style queries. B-trees, hash tables, and inverted lists are commonly used to index the data in OO databases. As such, even though the structured data may be stored easily, querying the data suffers the identical problems of relational systems. B-trees quickly become quite large and unwieldy. Hash tables prevent range queries. Data relationships are not captured by the indexes. Thus, OO databases, as they currently stand, are not an adequate solution for storing XML data.
As organizations trade increasing amounts of information, the number of data formats that must be understood grows unfettered. Existing database solutions do not scale well to meet this demand. When dealing with semi-structured data, it is important to be able to map and maintain the relationships present in the data. Buneman, Fan, and Weinstein note in "Query Optimization for Semi-structured Data using Path Constraints in a Deterministic Data Model" that paths represent important semantic clues about semi-structured data. If these clues are lost or not easily accessible because of the database and its indexes, the database has failed in its tasks.
There are more tasks for an effective database and indexing system for searching and storing semi-structured data. Simple text searches are required of any relational and semi-structured database. The ability to do path searches is an important additional burden placed on semi-structured databases. A path search is especially useful when the type of data sought is known, but its exact location in the database is not. For instance, a query like "find all addresses of all buyers of all invoices" is a search for the path "invoice→buyer→address." An example of this type of path is illustrated in FIG. 2. In addition to searching for particular paths, one should be able to search for particular structures within the semi-structured data, like a complete set of "buyer" information, which includes the buyer's name and address.
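As a non-limiting illustration of the two query types just described, the following Python sketch uses the standard `xml.etree.ElementTree` module on a small hypothetical document. The document is an assumption made for this sketch and is not the document of FIG. 1 or FIG. 2. The first query is a path search for "invoice→buyer→address"; the second is a structure search for buyers having both a name and an address.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: two invoices, the second with an incomplete buyer.
doc = """
<Invoices>
  <Invoice>
    <Buyer><Name>Acme</Name><Address>1 Main St</Address></Buyer>
    <Seller><Name>Bolt Co</Name><Address>7 Dock Rd</Address></Seller>
  </Invoice>
  <Invoice>
    <Buyer><Name>Zenith</Name></Buyer>
  </Invoice>
</Invoices>
"""
root = ET.fromstring(doc)

# Path search: follow Invoice -> Buyer -> Address; seller addresses are not matched.
buyer_addresses = [e.text for e in root.findall("./Invoice/Buyer/Address")]

# Structure search: keep only buyers that carry both a Name and an Address.
complete_buyers = [
    buyer.findtext("Name")
    for buyer in root.findall("./Invoice/Buyer")
    if buyer.find("Name") is not None and buyer.find("Address") is not None
]

print(buyer_addresses)  # ['1 Main St']
print(complete_buyers)  # ['Acme']
```

Here the whole document is held in memory and walked directly; the sketch shows what the queries mean, not how an index would answer them over millions of documents.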
Finally, since the data is semi-structured, and its organization is not always known, the ability to "browse" the data is also important. When browsing the data, users hop from elements to children, or back to an element's parent. It is a way to "wander" through the data in a database. Looking again at the document in FIG. 1, once the user is at "Invoice," she can see that there are "Buyers," "Sellers," "Numbers," and "ItemLists" available. The user can then follow one of these branches, say "Buyer," to find that "Buyers" have "Name" and "Address" fields associated with them.
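Browsing of this kind may be sketched, for illustration only, with Python's standard `xml.etree.ElementTree` module. The small document below is hypothetical and merely resembles the invoice described above; the helper names `child_tags` and `parent_tag` are assumptions of this sketch. Because `ElementTree` elements do not record their parent, a child-to-parent map is built once so that the user can also hop back up.

```python
import xml.etree.ElementTree as ET

# Hypothetical invoice with Buyer, Seller, Number, and ItemList branches.
root = ET.fromstring(
    "<Invoice>"
    "<Buyer><Name>Acme</Name><Address>1 Main St</Address></Buyer>"
    "<Seller><Name>Bolt Co</Name></Seller>"
    "<Number>17</Number>"
    "<ItemList/>"
    "</Invoice>"
)

# Map every element to its parent so browsing can move upward as well.
parent_of = {child: parent for parent in root.iter() for child in parent}


def child_tags(element):
    """Branches available when standing at this element, in document order."""
    return [child.tag for child in element]


def parent_tag(element):
    """Tag of the element one step up, or None at the root."""
    parent = parent_of.get(element)
    return None if parent is None else parent.tag


buyer = root.find("Buyer")
print(child_tags(root))   # ['Buyer', 'Seller', 'Number', 'ItemList']
print(child_tags(buyer))  # ['Name', 'Address']
print(parent_tag(buyer))  # Invoice
```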
The requisite task set for effectively indexing and searching semi-structured data is significantly larger than for structured data sources. Quite simply, the lack of a schema makes the problem much harder than before. Our proposed structures and techniques solve the problems and overcome the new burdens presented by XML and other forms of semi-structured data.
There is accordingly a need in the art to provide for a technique that facilitates encoding of semi-structured data which facilitates indexing of the data for efficient search.
There is another need in the art to provide for a technique of the kind specified that is particularly useful for indexing and searching Markup Language (ML) data in general and extensible Markup Language in particular.
There is still another need in the art to provide a technique which facilitates text search, path search, and browsing in semi-structured data in general and ML data in particular.
In the context of the invention, efficient access means the ability to locate relevant data in a short amount of time or with only a small number (e.g., several) of disk accesses, even though the data items stored may number in the millions or more, or even though the total size of the data items is such that they cannot all fit in the memory available to the search process. Efficient access refers to the overall or typical behavior of the system. In other words, while individual accesses may sometimes be inefficient, the access behavior observed over time is efficient.
As is well known, semi-structured data includes as an instance the Markup Languages. The Markup languages include numerous instances including but not limited to LaTeX, TeX, RDF, SVG, SML, XCML, XML, XHTML, SGML, HTML, VRML, DHTML, VML, JAML, QAML, DML, MRML, SSML, ThML, TDML, OML, CKML, STML, PHML, CML, DCML, CFML, PGML, MGML, CellML, AnatML, FieldML, XGMML.
For convenience of explanation only, the invention is described with reference to XML. Those versed in the art will readily appreciate that the invention is, likewise, applicable to other instances of Markup Language, and more generally to any semi-structured data.
The invention provides for a method for encoding semi-structured data, comprising:
a) providing a semi-structured data input;
b) obtaining an encoded semi-structured data by selectively encoding at least part of said semi-structured data into strings of arbitrary length in a way that (i) maintains non-structural and structural information associated with the semi-structured data, and (ii) the so encoded semi-structured data can be indexed for efficient access.
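Purely as a hedged illustration of what encoding into strings of arbitrary length might look like, and not as the encoding of the invention, the following Python sketch turns each data-bearing element of a hypothetical XML document into one string that carries both its tag path (structural information) and its text (non-structural information). The separator `SEP`, the function `encode`, and the sample document are all assumptions of this sketch.

```python
import xml.etree.ElementTree as ET

SEP = "\x00"  # assumed separator between the path part and the data part


def encode(element, prefix=""):
    """Yield one string per element that holds text: tag path, SEP, text."""
    path = f"{prefix}/{element.tag}"
    text = (element.text or "").strip()
    if text:
        yield f"{path}{SEP}{text}"
    for child in element:
        # The parent's path is carried down, so structure is kept in every string.
        yield from encode(child, path)


doc = (
    "<Invoice>"
    "<Buyer><Name>Acme</Name><Address>1 Main St</Address></Buyer>"
    "<Number>17</Number>"
    "</Invoice>"
)
keys = sorted(encode(ET.fromstring(doc)))

for key in keys:
    print(key.replace(SEP, " = "))
# /Invoice/Buyer/Address = 1 Main St
# /Invoice/Buyer/Name = Acme
# /Invoice/Number = 17
```

Because the path is a prefix of every string, sorting the strings groups all data that share a path, which is one reason string keys of this general shape lend themselves to indexing. The sketch ignores XML attributes, sibling order, and mixed content.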
The invention further provides for a method for constructing a metadata dictionary in respect of semi-structured data, comprising:
a) providing a semi-structured data input;
b) constructing a metadata dictionary that facilitates compressed encoding of at least part of said semi-structured data into strings of arbitrary length in a way that at least maintains non-structural and structural information associated with the semi-structured data.
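As a hedged illustration of how a dictionary may facilitate compressed encoding, the following sketch assigns a small integer code to each distinct tag name and rewrites paths using those codes. The functions `build_dictionary` and `compress`, the dot-separated code format, and the sample paths are all assumptions made for this example and are not the dictionary layout of the invention.

```python
def build_dictionary(paths):
    """Assign an integer code to each distinct tag name, in first-seen order."""
    codes = {}
    for path in paths:
        for tag in path.strip("/").split("/"):
            codes.setdefault(tag, len(codes))
    return codes

def compress(path, codes):
    """Rewrite a path as dot-separated tag codes (hypothetical format)."""
    return ".".join(str(codes[tag]) for tag in path.strip("/").split("/"))

paths = ["/library/book/title", "/library/book/author", "/library/journal/title"]
codes = build_dictionary(paths)
short = [compress(p, codes) for p in paths]
print(codes)
print(short)
```

Because the mapping from tag name to code is one-to-one, the structural information of each path remains recoverable from the shorter encoded form.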
Still further, the invention provides for a method for encoding and indexing semi-structured data, comprising:
a) providing a semi-structured data input;
b) obtaining an encoded semi-structured data by selectively encoding at least part of said semi-structured data into strings of arbitrary length in a way that (i) maintains non-structural and structural information associated with the semi-structured data;
c) indexing the encoded semi-structured data using a layered index; the layered index includes a basic partitioned index structure; said layered index maintains a balanced structure of blocks.
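The notion of layers of blocks arranged in a balanced structure may be illustrated, in a simplified manner, by the following sketch. It is a generic multi-level index over sorted string keys and is not the layered index or the basic partitioned index of the invention; the block capacity `BLOCK_SIZE`, the function names, and the sample keys are hypothetical. Each layer above the lowest one holds the first key of every block in the layer below, so every lookup reads the same number of blocks.

```python
from bisect import bisect_right

BLOCK_SIZE = 3  # hypothetical block capacity, kept small for illustration

def split(items):
    """Partition a sorted list into consecutive blocks of BLOCK_SIZE."""
    return [items[i:i + BLOCK_SIZE] for i in range(0, len(items), BLOCK_SIZE)]

def build_layers(sorted_keys):
    """Build layers bottom-up; assumes a non-empty, sorted list of keys."""
    layers = [split(sorted_keys)]
    while len(layers[-1]) > 1:
        firsts = [block[0] for block in layers[-1]]
        layers.append(split(firsts))
    return layers

def lookup(layers, key):
    """Return (found, blocks_read); one block is read per layer."""
    block_no = 0
    for depth in range(len(layers) - 1, 0, -1):
        block = layers[depth][block_no]
        slot = max(bisect_right(block, key) - 1, 0)
        block_no = block_no * BLOCK_SIZE + slot
    return key in layers[0][block_no], len(layers)

keys = sorted(["/a/b=1", "/a/c=2", "/a/d=3", "/b/a=4", "/b/c=5", "/c/a=6", "/c/d=7"])
layers = build_layers(keys)
print(lookup(layers, "/b/c=5"))
print(lookup(layers, "/b/z=0"))
```

In this simplified form the keys themselves are stored in the upper layers, so the memory used by those layers grows with key length; the sketch is intended only to convey the balanced, block-oriented arrangement.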
The invention further provides for a method for encoding and indexing Markup Language (ML) data, comprising:
a) providing an ML data input;
b) obtaining an encoded ML data by selectively encoding at least part of said ML data into strings of arbitrary length in a way that (i) maintains non-structural and structural information associated with the semi-structured data;
c) indexing the encoded semi-structured data using a layered index; the layered index includes a basic partitioned index structure; said layered index maintains a balanced structure of blocks.
Yet further, the invention provides for a method for encoding and indexing semi-structured data, comprising:
(a) providing a semi-structured data input;
(b) selectively encoding at least part of said semi-structured data into keys of arbitrary length in a way that (i) maintains non-structural and structural information associated with the semi-structured data;
(c) creating a balanced index structure over the arbitrary-length keys.
The invention further provides for a method for indexing semi-structured data, comprising:
(a) providing a semi-structured data input that includes data items;
(b) indexing keys of the data items of the said semi-structured data such that with about no more than 25,000 bytes of internal memory per 1 million data items it is possible to locate an address of any such said data item with no more than 2 I/Os, irrespective of the size of the key.
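For clarity, the stated memory bound may be expressed per data item. The following arithmetic uses only the figures recited above (about 25,000 bytes of internal memory per 1 million data items):

```latex
\frac{25{,}000\ \text{bytes}}{10^{6}\ \text{items}}
  = 0.025\ \text{bytes per item}
  = 0.2\ \text{bits per item}
```

That is, roughly one byte of internal memory for every 40 data items. A budget of this size is smaller than the keys themselves, which is consistent with the bound being stated as independent of the size of the key.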
The invention provides for a method for indexing markup language (ML) data, comprising:
(a) providing an ML data input that includes data items;
(b) indexing keys of the data items of the said ML data such that with about no more than 25,000 bytes of internal memory per 1 million data items it is possible to locate an address of any such said data item with no more than 2 I/Os, irrespective of the size of the key.
The invention further provides for a system for encoding semi-structured data, comprising:
storage for storing a semi-structured data input;
processor node configured to construct an encoded semi-structured data by selectively encoding at least part of said semi-structured data into strings of arbitrary length in a way that (i) maintains non-structural and structural information associated with the semi-structured data, and (ii) the so encoded semi-structured data can be indexed for efficient access.
The invention provides for a system for encoding and indexing semi-structured data, comprising:
storage for storing a semi-structured data input;
processor node configured to construct an encoded semi-structured data by selectively encoding at least part of said semi-structured data into strings of arbitrary length in a way that (i) maintains non-structural and structural information associated with the semi-structured data;
processor node configured to construct an indexing of the encoded semi-structured data using a layered index; the layered index includes a basic partitioned index structure; said layered index maintains a balanced structure of blocks.
The invention further provides for a system for encoding and indexing Markup Language (ML) data, comprising:
storage for storing an ML data input;
processor node configured to construct an encoded ML data by selectively encoding at least part of said ML data into strings of arbitrary length in a way that (i) maintains non-structural and structural information associated with the semi-structured data;
processor node configured to construct an indexing of the encoded semi-structured data using a layered index; the layered index includes a basic partitioned index structure; said layered index maintains a balanced structure of blocks.
Yet further, the invention provides for a system for encoding and indexing semi-structured data, comprising:
storage for storing a semi-structured data input;
processor node configured to selectively encode at least part of said semi-structured data into keys of arbitrary length in a way that (i) maintains non-structural and structural information associated with the semi-structured data;
processor node configured to create a balanced index structure over the arbitrary-length keys.
The invention provides for a storage medium storing data indicative of encoded semi-structured data that includes strings of arbitrary length, such that (i) the strings maintain non-structural and structural information associated with the semi-structured data, and (ii) the so encoded semi-structured data can be indexed for efficient access.
The invention further provides for, in a computer system having a storage medium of at least an internal memory and an external memory:
a data structure that includes an index over the keys of the data items; the index is arranged in blocks, such that with about no more than 25,000 bytes of internal memory per 1 million data items it is possible to locate an address of any such said data item with no more than 2 I/O accesses to the external memory, irrespective of the size of the key.