Electronic data resides in numerous different forms and formats. Data can be well structured, such as when stored in the form of tables in relational databases, or unstructured, such as when stored as plain text or emails. Much data is generally irregular and loosely defined and does not adhere to a strict schema or conform to a preset format. Semi-structured data contains both structured and unstructured components. Some examples of semi-structured data include:
    Product catalogs: Catalogs typically have structured data fields such as price, make and feature specifications but also have some unstructured data such as a product description in the form of text.
    Call-center records: Such records typically contain details of the customer, the call-taker, and descriptive text summarizing the call.
    Content managers: Documents in a repository typically include meta-data such as the date of creation, the author, the originating department, etc., in addition to the actual content of the document which comprises unstructured data.
    Publication databases: Databases such as PUBMED and DBLP contain various details of articles such as a date of publication, the names of the author/s and the journal/conference name in addition to a title and an abstract which comprise unstructured data.
A need exists to provide improved methods and systems for handling semi-structured data for a variety of reasons. One such reason is the explosive growth of information available on the World Wide Web (WWW), which is a high volume data source that cannot be constrained by a rigid schema. Another reason is the need for exchanging data between disparate systems and databases, which demands an extremely flexible format for representing the data. Yet another reason is the integration of several heterogeneous data sources, notwithstanding the individual data sources being highly structured.
Drivers of the growth of semi-structured data include:
    The use of XML as a standard for information exchange over the Internet.
    Advances in Natural Language Processing (NLP) and annotator tools, which have resulted in the conversion of a substantial amount of unstructured data to semi-structured data.
    The semantic web and annotations.
As the volume of semi-structured data is growing exponentially, it is becoming increasingly necessary to organize this data in a comprehensible and navigable manner. Exponential growth of text data and unstructured data posed similar problems.
Web directories such as YAHOO, GOOGLE and Dmoz have shown that a hierarchical arrangement of documents is very useful for browsing a document collection. The Dmoz directory was manually created by about 52 thousand editors. Manually generated directories, although more comprehensible and accurate than automatically generated directories, are not always feasible and require much time and effort to maintain in a dynamic world. Therefore, Automatic Taxonomy Generation (ATG) methods are useful for automatically arranging documents into hierarchies.
Summarization of web search results is an important application of ATG. Internet searches typically return thousands of results, and the ranked lists returned by search engines do not handle users' browsing needs efficiently. Most users respond by viewing only a few results and may thus miss much relevant information. Moreover, the criterion used to rank the search results may not reflect a user's need. Organizing the search results in concept hierarchies summarizes the results and helps users in browsing those search results. However, predefined hierarchies and categories may not be useful in organizing query results, whether the hierarchies are generated automatically or manually. Post-retrieval document clustering provides superior results when query results are clustered to generate concept hierarchies.
Clustering of documents is thus an important part of ATG. The nodes at each level of a hierarchy of documents can be viewed as a clustering of the documents. Monothetic clustering algorithms assign documents to a cluster based on a single feature, whereas polythetic clustering algorithms assign documents to clusters based on multiple features. Known document clustering algorithms include the so-called K-means algorithm and its variants, hierarchical agglomerative clustering (HAC) methods and, more recently, graph partitioning methods. For K-means algorithms, the best performing similarity measure between documents is the cosine measure between two document vectors. HAC algorithms start with singleton documents as clusters, and iteratively merge the two most similar clusters. They differ in their choice of similarity measure between clusters. Once the documents are clustered, the next important step is to assign proper labels to the clusters to render them comprehensible.
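The HAC procedure described above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not a reproduction of any particular published algorithm: the function names (tf_vector, cosine, hac), the whitespace tokenizer, the raw term-frequency weighting and the average-link choice of cluster similarity are all assumptions made for this example.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector (assumed whitespace tokenizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine measure between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hac(docs, num_clusters):
    """Start with singleton clusters and repeatedly merge the most similar pair."""
    vectors = [tf_vector(d) for d in docs]
    clusters = [[i] for i in range(len(docs))]

    def cluster_sim(c1, c2):
        # Average-link similarity: an assumed choice; single-link or
        # complete-link would be equally valid variants.
        total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > num_clusters:
        pairs = (
            (x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        x, y = max(pairs, key=lambda p: cluster_sim(clusters[p[0]], clusters[p[1]]))
        clusters[x] = sorted(clusters[x] + clusters[y])
        del clusters[y]
    return clusters


docs = [
    "laptop battery life review",
    "laptop battery replacement",
    "garden tomato planting guide",
    "tomato planting season",
]
print(hac(docs, 2))  # prints [[0, 1], [2, 3]]
```

Each inner list holds the indices of the documents in one cluster. The sketch stops at a fixed number of clusters; recording the sequence of merges instead would yield the full hierarchy.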
Polythetic ATG algorithms such as K-Means and HAC and monothetic ATG algorithms such as CAARD, DSP and Discover have been applied to unstructured data to automatically generate taxonomies. The VIVISIMO Content Integrator provides federated search or meta-search capabilities to public and private organizations. A federated search capability enables users to perform multiple searches at the same time through as many diverse informational sources as needed, whether the sources comprise internal documents, intranets, partner extranets, web sources, subscription services and databases, syndicated news feeds, or intelligence portals such as HOOVERS. VIVISIMO also provides a product called Clustering Engine which automatically clusters or organizes search results into categories that are intelligently selected from the words and phrases contained in the results or documents themselves.
Some of the more commonly used techniques for analysis and summary of structured data are multidimensional navigation and OLAP. ENDECA search and guided navigation technology enables multidimensional navigation of search results, identifies important dimensions or attributes for a current set of results and groups the results into relevant categories along each dimension. However, ENDECA neither ranks the various dimensions or attributes nor clusters text or unstructured attributes.
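The general idea of grouping a result set along each dimension can be illustrated with a short Python sketch. It does not represent ENDECA's implementation; the function names (facet_counts, refine), the record fields and the sample values are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def facet_counts(results, dimensions):
    """For each dimension, count how many results carry each value."""
    counts = defaultdict(Counter)
    for record in results:
        for dim in dimensions:
            if dim in record:
                counts[dim][record[dim]] += 1
    return {dim: dict(counter) for dim, counter in counts.items()}


def refine(results, dim, value):
    """Narrow the result set to records with the chosen value on one dimension."""
    return [r for r in results if r.get(dim) == value]


# Hypothetical structured search results.
results = [
    {"make": "Acme", "category": "laptop", "price_band": "high"},
    {"make": "Acme", "category": "tablet", "price_band": "low"},
    {"make": "Zenith", "category": "laptop", "price_band": "low"},
]

facets = facet_counts(results, ["make", "category"])
print(facets)
print(refine(results, "make", "Acme"))
```

The sketch handles structured attributes only. Consistent with the limitation noted above, it neither ranks the dimensions nor clusters free-text fields.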
Storage, indexing and searching of semi-structured data pose new challenges. U.S. Pat. No. 6,804,677, entitled “Encoding semi-structured data for efficient search and browsing”, issued to Shadman et al. on Oct. 12, 2004 and is assigned to Ori Software Development Ltd. The patent relates to a method for encoding XML tree data that includes the step of encoding semi-structured data into strings of arbitrary length in a way that maintains non-structural and structural information about the XML data, and enables indexing the encoded XML data in a way that facilitates efficient search and browsing.
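One generic way to turn an XML tree into strings that keep both structure and content is to emit one path-plus-value string per text-bearing element. The Python sketch below shows only that generic idea and is not the encoding disclosed in the cited patent; the function name encode_paths, the "path=value" string layout and the sample catalog record are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def encode_paths(xml_text):
    """Encode each text-bearing element as a 'path=value' string."""
    root = ET.fromstring(xml_text)
    encoded = []

    def walk(node, prefix):
        path = f"{prefix}/{node.tag}"      # structural information
        text = (node.text or "").strip()   # non-structural information
        if text:
            encoded.append(f"{path}={text}")
        for child in node:
            walk(child, path)

    walk(root, "")
    # Sorting places strings that share a path prefix next to each other,
    # so a plain ordered index could serve prefix lookups.
    return sorted(encoded)


# Hypothetical product-catalog record.
catalog = (
    "<product><make>Acme</make><price>499</price>"
    "<description>Light laptop</description></product>"
)
keys = encode_paths(catalog)
print(keys)
print([k for k in keys if k.startswith("/product/price=")])
```

The sketch ignores attributes, repeated sibling order and mixed content, all of which a practical encoding would need to address.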
Searching a large volume of semi-structured data such as the Internet returns a large set of data that is not easily browsed and navigated. Automatic organization of search results into concept hierarchies assists in browsing and navigating the search results. Such taxonomies advantageously also summarize the search results. U.S. Pat. No. 6,606,620, entitled “Method and system for classifying semi-structured documents”, issued to Sundaresan et al. on Aug. 12, 2003 and is assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION. The method and system disclosed in the patent require predefined classes and training data for learning, which may be expensive and may not be exhaustive. Furthermore, as data in a repository evolves, a need may arise to form new classes, which is not feasible if done manually.
Recent advancements in technology have made the storage, retrieval, search and handling of semi-structured data more feasible. However, predefined taxonomies are not of any real assistance for semi-structured data. Hence, for semi-structured data, a need exists for methods and systems that automatically discover or generate taxonomies.