The present invention relates to methods for hierarchically parsing and indexing text.
Parsing and indexing text are paramount concerns in the search and retrieval industry. Text improperly parsed cannot be indexed properly, and a poorly constructed index will yield poor search results and correspondingly poor answer-set accuracy. Moreover, parsing and indexing text efficiently and accurately impacts a wide variety of information technologies, such as data mining, data abstracting, data extracting, data linking, data compression, data presentation, data visualization, data intelligence, and the like.
Search efficiency is often measured in terms of search performance and search accuracy. Because users desire nearly instantaneous results to their searches, user search queries are often conducted against small organized indexes, rather than against a composite body of text in its native format. Indexes improve search performance because search engines can detect matches or hits in structured indexes much more efficiently than in a body of natively formatted text documents.
Search accuracy is primarily governed by the rules employed during index generation. Some rules reduce the amount of information in the index to improve search speed and/or reduce index size. Other rules add information to the index to improve the quality of search results.
An indexer can increase speed and reduce index size by excluding certain words, symbols, and characters from the index. The excluded words are typically those that occur frequently, like ‘the’ and ‘and’. These excluded words are sometimes referred to as stop words. Punctuation and capitalization, as well as symbol characters like the dollar “$”, percent “%” and pound “#”, and other characters considered to be non-word characters are also typically ignored. While these exclusionary rules do serve to improve search engine response time, search accuracy is sacrificed. Under these rules, a search engine may not be able to respond at all to certain queries, like “to be, or not to be”, which may be comprised entirely of non-indexed text.
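The effect of such exclusionary rules can be illustrated with a minimal sketch. The stop word list and the function name index_terms are hypothetical and chosen only for illustration; an actual indexer may apply different rules.

```python
import re

# Hypothetical stop word list, chosen only for illustration.
STOP_WORDS = {"to", "be", "or", "not", "the", "and"}

def index_terms(text):
    """Return the terms that a stop-word-excluding indexer would keep."""
    # Lowercase the text and keep only runs of ASCII letters and digits, so
    # capitalization, punctuation, and symbols such as $, %, and # are lost.
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Drop every word that appears in the stop word list.
    return [word for word in words if word not in STOP_WORDS]

print(index_terms("To be, or not to be"))
print(index_terms("The price is $5 and rising"))
```

Under these assumed rules the first call leaves nothing to index, so a query for that phrase cannot be answered, and the second call loses the dollar sign.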
Existing parsing and indexing techniques partially deal with the organization and representation of text at various levels. Some examples include web site domain names, file system organizations, and documents decomposed as chapters, pages, and paragraphs. Some of these levels are linguistically oriented, such as the representation of noun phrases and grammatical constructs, and other levels focus on the character strings themselves, and may identify how sequences of characters of different types are grouped together into strings and sub-strings. Little progress has been made with respect to the parsing of strings and substrings, which has made search and retrieval particularly problematic and correspondingly less accurate.
Search accuracy is increasingly important as the body of available information continues to expand. The accuracy sacrificed by excluding certain words and characters is a cause of growing frustration for search engine users. As computer processing power and storage capacity increase, the cost of increasing search accuracy decreases.
To improve accuracy, the text being indexed and the search queries themselves are often parsed to identify character strings representing words. Identifying word boundaries presents a number of problems for software implemented parsers and linguistic analyzers. Word boundary parsing software will typically divide words when a symbol character is encountered in a character string. In certain cases, such as “CAD/CAM”, the parser will decompose the compound word “CAD/CAM” into two individual words, “CAD” and “CAM”. As a result of this lack of ability to detect these compound words as single entities, queries for them yield slightly less precise results.
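The loss of the compound word can be shown with a short sketch of a word boundary parser that divides on any symbol character. The function name split_on_symbols and the ASCII-only definition of a word character are assumptions made for illustration.

```python
import re

def split_on_symbols(text):
    """Divide text into words at every character that is not an ASCII letter or digit."""
    return re.findall(r"[A-Za-z0-9]+", text)

# The compound word and two unrelated adjacent words parse identically,
# so an index built from these words cannot tell the two apart.
print(split_on_symbols("CAD/CAM"))
print(split_on_symbols("CAD CAM"))
```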
The changing nature of the text being indexed and searched may also impact search accuracy. Symbol characters typically ignored by indexers are becoming increasingly prominent. Consider the elements of an e-mail or World Wide Web address. The “at” (@) sign, the dot (.), the colon (:), and the slash (/) have all become commonplace. A search for someone's e-mail address, for example, will yield much more accurate results if the indexer does not ignore the at “@” symbol and the dot “.”.
When a search engine returns a hit to the user, it is useful to be able to see the surrounding text (the sentence or paragraph, for example) which contains the words that matched the query. This is referred to as “showing the hit in context”. The context of the hit may either be derived from the index itself, or re-extracted from the original document. It is typically less costly to reconstruct a portion of a document from the index, which is already at hand, than it is to locate and retrieve it from the original document.
Any text excluded during index generation would naturally be unavailable for reconstructing the context of a hit. The more completely the text is indexed, the more closely a reconstructed portion of the text will match the original.
Even an indexer that does not exclude any words or symbols from the index may not be able to reconstruct a hit in context. This is the case when the indexer records only a reference to the source document for each piece of text indexed, and not their relative positions in the document. Knowing the relative positions of each piece of text is required so that they may be reassembled in the proper order.
For example, if an indexer of this type encountered the phrase “apple pie” while indexing a document called “Mom's Recipes”, it would generate two entries, “apple” and “pie”. Each entry would be stored with a reference to “Mom's Recipes”, but with no indication that “apple” came before “pie” or that in fact, the two entries were adjacent to one another.
This lack of knowledge with respect to the ordering of text pieces in the source also disallows proximity searches, where the query specifies that certain terms must occur within a certain distance of each other.
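By way of contrast, a minimal sketch of an index that does record relative positions is given below, together with a proximity test that such positions make possible. The function names, the whitespace-based word splitting, and the sample documents are assumptions made for illustration only.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each word to {document name: [word positions in that document]}."""
    index = defaultdict(lambda: defaultdict(list))
    for name, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word][name].append(position)
    return index

def within(index, first, second, max_distance):
    """Return the documents in which the two words occur within max_distance words."""
    hits = set()
    for name, first_positions in index[first].items():
        second_positions = index[second].get(name, [])
        if any(abs(a - b) <= max_distance
               for a in first_positions for b in second_positions):
            hits.add(name)
    return hits

docs = {
    "Mom's Recipes": "bake the apple pie slowly",
    "Notes": "apple trees and a cherry pie",
}
index = build_positional_index(docs)
print(within(index, "apple", "pie", 1))
```

Because each entry carries its position, the sketch can determine that the two words are adjacent in one document and five words apart in the other; an index holding only document references could not.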
Current computing resources enable indexers to include previously excluded words and symbols and their positions in the source, but that is not all. Indexers may also augment the index with additional information to improve search intelligence. For example, using currently available linguistics technology, an indexer may associate thesauri terms, morphological word roots and forms, phonetic and soundex representations, and alternate spellings with the words being indexed. Advanced indexers may also associate concepts, classifications, and categories with the indexed words, permitting more advanced searches and improving the overall quality and relevance of the search results.
Accordingly, an object of the invention is to provide methods of hierarchically parsing and indexing text. By parsing and indexing text at a level above what is ordinarily considered a word, and including stop words, symbol characters, and formatting characters along with the hierarchical relationships between the various text pieces, searching, retrieving, mining, abstracting, extracting, visualizing, and presenting the text become more useful and accurate.
Additional objectives, advantages and novel features of the invention will be set forth in the description that follows and, in part, will become apparent to those skilled in the art upon examining or practicing the invention. The objects and advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the appended claims. To achieve the foregoing and other objects and in accordance with the purpose of the present invention, methods of hierarchically parsing and indexing text are provided.
A method of indexing text using a set of executable instructions is provided, comprising receiving one or more characters and recognizing the characters as a first level text entity. Further, lower level text entities are recognized as sub-parts of their parent first level text entities. A first level value is assigned to each first level text entity which is operative to determine the position of the first level text entity in a source text. Moreover, one or more lower level values, which are subordinate to the first level value, are assigned to the lower level text entities. Additionally, the lower level values are operative to determine one or more positions of the lower level text entities in the source text.
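One possible reading of the indexing method is sketched below. The sketch assumes, purely for illustration, that a first level text entity is a whitespace-delimited string, that lower level text entities are its runs of ASCII letters and digits and its individual symbol characters, and that the values are simple ordinal positions. The function name index_hierarchically is hypothetical, and the method is not limited to these choices.

```python
import re

def index_hierarchically(text):
    """Return (entity, level, first_level_value, lower_level_value) entries."""
    entries = []
    for first_value, token in enumerate(text.split(), start=1):
        # The first level value gives the position of the entity in the source.
        entries.append((token, 1, first_value, None))
        # Assumed rule: sub-parts are alphanumeric runs and single symbols.
        parts = re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]", token)
        if len(parts) > 1:
            for lower_value, part in enumerate(parts, start=1):
                # The lower level value is subordinate to the parent's value
                # and gives the position of the sub-part within the parent.
                entries.append((part, 2, first_value, lower_value))
    return entries

for entry in index_hierarchically("use CAD/CAM"):
    print(entry)
```

In this sketch “CAD/CAM” is kept as a single first level entity at position 2, while “CAD”, “/”, and “CAM” are recorded beneath it at subordinate positions 1, 2, and 3, so both the compound and its parts remain searchable and the original order can be reassembled.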
Further, a method of data retrieval using a set of executable instructions is provided, comprising parsing a search query into a search criterion, one or more first level text entities, and one or more lower level text entities. One or more index sources, derived from one or more original sources, are searched in accordance with the search criterion for substantial matches on at least one of the first level text entities and one or more of the lower level text entities. Next, at least one of a reference to a matching source, a rendition of the matching source, a portion of the matching source, and the matching source is returned when a substantial match is determined, wherein the matching source corresponds to one or more of the original sources.
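A minimal sketch of such a retrieval is given below. Everything specific in it is an assumption made for illustration: the hard-coded INDEX contents and source names, the convention that a leading ANY or ALL keyword is the search criterion, the treatment of an exact string match as a substantial match, and the choice to return references to matching sources rather than renditions or portions of them.

```python
import re

# Hypothetical index source: entity -> [(source, first level value,
# lower level value or None)]. An indexer would normally build this.
INDEX = {
    "CAD/CAM": [("manual.txt", 2, None)],
    "CAD": [("manual.txt", 2, 1)],
    "/": [("manual.txt", 2, 2)],
    "CAM": [("manual.txt", 2, 3), ("video.txt", 5, None)],
}

def parse_query(query):
    """Split a query into a criterion, first level and lower level entities."""
    tokens = query.split()
    criterion = "ALL"  # assumed default when no keyword is given
    if tokens and tokens[0] in ("ANY", "ALL"):
        criterion = tokens.pop(0)
    lower = []
    for token in tokens:
        parts = re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]", token)
        if len(parts) > 1:
            lower.extend(parts)
    return criterion, tokens, lower

def search(query, index=INDEX):
    """Return references to the sources matching the query's first level entities."""
    criterion, first_level, _lower = parse_query(query)
    # Each query entity may match an index entry of either level.
    per_entity = [{src for src, _, _ in index.get(e, [])} for e in first_level]
    if not per_entity:
        return set()
    combine = set.union if criterion == "ANY" else set.intersection
    return combine(*per_entity)

print(search("CAD/CAM"))
print(search("CAM"))
```

A query for the compound matches only the source containing it as a whole, while a query for one of its parts also reaches that source through the lower level entry.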
Finally, a method of warehousing index terms using a set of executable instructions is provided, comprising receiving one or more target sources, parsing each target source into one or more first level text entities and one or more lower level text entities, and associating one or more first level values with each first level text entity, where the first level values are operative to determine one or more positions of each first level text entity within the target sources. Further, one or more lower level values, subordinate to the first level values, are associated with the lower level text entities when one or more of the lower level text entities are subsets of the first level text entities; the lower level values are operative to determine one or more positions for each lower level text entity within one or more of the target sources. Next, all of the entities and their corresponding values are stored in one or more index sources.
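The warehousing steps can be sketched as follows, using an in-memory SQLite table to stand in for an index source. The table layout, the function name warehouse, and the parsing rules (whitespace-delimited first level entities, with alphanumeric runs and single symbols as lower level entities) are assumptions made for illustration only.

```python
import re
import sqlite3

def warehouse(targets):
    """Parse each target source and store its entities in an in-memory table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE entries ("
        "source TEXT, entity TEXT, first_value INTEGER, lower_value INTEGER)"
    )
    for source, text in targets.items():
        for first_value, token in enumerate(text.split(), start=1):
            # First level entity: lower_value is NULL.
            db.execute("INSERT INTO entries VALUES (?, ?, ?, NULL)",
                       (source, token, first_value))
            parts = re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]", token)
            if len(parts) > 1:
                for lower_value, part in enumerate(parts, start=1):
                    # Lower level entity, stored under its parent's value.
                    db.execute("INSERT INTO entries VALUES (?, ?, ?, ?)",
                               (source, part, first_value, lower_value))
    db.commit()
    return db

db = warehouse({"a.txt": "use CAD/CAM"})
print(db.execute(
    "SELECT first_value, lower_value FROM entries WHERE entity = 'CAM'"
).fetchall())
```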
Still other aspects of the present invention will become apparent to those skilled in the art from the following description of an exemplary embodiment, which is, by way of illustration, one of the best modes contemplated for carrying out the invention. As will be realized, the invention is capable of other different and obvious aspects, all without departing from the invention. Accordingly, the drawings and descriptions are illustrative in nature and not restrictive.