Information technology (IT) continues to evolve rapidly, and with this evolution comes greater complexity. As new technologies are introduced into enterprise networks, the need for them to interoperate with existing legacy technologies becomes a growing concern and necessity. Enterprises are intrinsically multi-functional in nature, yet applications and systems technologies tend to be single-function entities with closed architectures and proprietary internals. This core incongruence results in disparate, incompatible legacy systems of various kinds, incompatible hardware systems and devices, and heterogeneous platform systems that are mutually incomprehensible. This phenomenon has been referred to as Enterprise Application Dysintegration, or EAD. As a result, the function of IT is increasingly becoming the integration of heterogeneous components. Currently, no automated means are available, and the integration is effected manually by human agents, which is costly, slow, and inefficient. Indeed, implementing a new technology is understood to bring, in addition to the implementation issues associated solely with the new technology, the downtime, cost, and disruption of re-architecting and re-building currently useful legacy functionality for the sake of integration.
With new applications and systems also come new data and information that need to be stored and managed, along with the need to integrate legacy data for use by the new technologies. Disparate systems inter-operate effectively through well-defined interfaces. To facilitate this inter-operability, heterogeneous syntactic formats need to be translated into well-known intermediary formats understood by all systems in the exchange. This is often referred to as syntactic transformation, for which XML is being proposed as the universal intermediary for data exchange. Beyond syntax lies the meaning of terms, a problem commonly referred to as semantic reconciliation. To address semantic reconciliation, a formal agreement is typically made between communicating systems about the meaning of terms in a particular domain of knowledge and application.
There are many robust technologies for data-level integration, including database-specific Call Level Interfaces (CLIs), Open DataBase Connectivity (ODBC), and Java DataBase Connectivity (JDBC). However, these interface technologies require sophisticated user knowledge and are quite tedious to implement and update.
Although there are database-to-database integration technologies currently available, there is no standard methodology for reusing legacy information with newly introduced technologies. A primary objective is to integrate systems and data without disturbing them. Minimizing any type of data conversion plays to this concept of being non-invasive.
In addition to the problem of integrating new technologies with legacy information is the problem of how to manage and access the explosively growing amount of data. Increased memory and remote electronic data storage capacity offer access to large amounts of data in a very convenient form and physical size. Data may be available on diskette, CD-ROM, magnetic tape, and on line to a centrally located computer and memory storage medium. On line access to such stored data is primarily provided by business data networks and the world wide web, hereinafter referred to as the Internet. By 1993, the Internet had approximately 130 sites that could be hyper-linked together with keywords. The Internet has grown quickly since then. Sites on the Internet have increased from approximately 1.6 million at the end of 1997 to 9.6 million at the end of 1999. Today, multiple technologies are available to access and manage data presented on the Internet. The challenge remains to extract information from the data simply and efficiently and to have confidence that all relevant items have been uncovered. To focus on relevant database records, search engines generally use keywords, categorization, segment limitations, Boolean logic, and hit counts. More complex search engines can also employ hierarchical categorization and multifaceted searching.
Keywords are the basis of most searches. A simple keyword search, such as that found in most word processors under the “Find” command, will locate the occurrence of a text string within a document or a record. Misspellings, synonyms, or different tenses of a given text string will not be located. The searcher must be cautious to truncate the text string to a word's root. A search for the text string “graphical”, for instance, will not locate instances of the text string “graphics.” The searcher must also not choose commonly occurring words, as such a search would result in a high number of search results. Keywords are commonly combined with categorization, segment limitations, Boolean logic operators, advanced keywording, date operators and numeric operators to create a more effective search.
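The limitation of a simple keyword search described above can be sketched as follows. This is a minimal illustration only, assuming a plain case-insensitive substring match; the function name `keyword_search` and the sample records are hypothetical and do not represent any particular search engine.

```python
def keyword_search(records, text):
    """Return the records that contain the exact text string (case-insensitive)."""
    needle = text.lower()
    return [record for record in records if needle in record.lower()]


# Hypothetical sample records.
records = [
    "A graphical interface for the editor",
    "Notes on computer graphics",
    "Spreadsheet basics",
]

# The full word "graphical" misses the record that uses "graphics".
full_word_hits = keyword_search(records, "graphical")

# Truncating to the root "graphic" locates both records.
root_hits = keyword_search(records, "graphic")
```

Truncating to a root widens the match, which is also why very common roots tend to return an unmanageable number of hits.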
Categorization is a technique used to focus the scope of a search. A category is a subset of records. By conducting a search only within this subset of records, fewer irrelevant hits result. Lexis-Nexis™ and Dialog™, two online searchable databases with proprietary search engines, are examples of categorized databases. Prior to conducting a keyword search within the Lexis-Nexis™ or Dialog™ database, the searcher must select from an extensive list of categories. Some categories are broader than others. If the searcher selects an overly broad category, his or her search will result in too many irrelevant hits and the searcher will waste time sorting through the undesired search result records looking for relevant hits. If the searcher selects an overly narrow category, his or her search results will not include some of the desired records. Selection of an appropriate category, therefore, is of vital importance.
Searches can be further focused with the use of segment limitations. Such a search is also commonly referred to as a parametric search. “Segments” are similar to categories in that they are domain specific. Category classifications are used to divide multiple records into subsets, or “fields”. Segment classifications are used to divide individual records into specific groupings of information. Using segments, or parameters, keyword searches can be targeted at certain fields of a record, such as a record's title or author. Search engines distributed by Lexis-Nexis™ and Dialog™, two online searchable database providers, are well-adapted to such targeted searches, often using dozens of segments for each category of records. A news article record, for instance, is typically broken down into separate fields for byline, date, publisher, abstract, and body. To find a news article with the word “elephant” in the title (or headline) using the classical interface of the Lexis-Nexis™ search engine, the following syntax would be needed: “HEADLINE(elephant)”.
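A segment-limited (parametric) search of the kind described above might be sketched as follows. The record layout, the segment names, and the function `segment_search` are assumptions made for illustration; they are not the Lexis-Nexis™ or Dialog™ implementation.

```python
def segment_search(records, segment, keyword):
    """Return records whose named segment (field) contains the keyword."""
    needle = keyword.lower()
    return [r for r in records if needle in r.get(segment, "").lower()]


# Hypothetical news article records, each divided into segments.
articles = [
    {
        "headline": "Elephant born at city zoo",
        "byline": "A. Writer",
        "body": "The calf was born on Tuesday.",
    },
    {
        "headline": "Zoo attendance rises",
        "byline": "B. Reporter",
        "body": "Visitors came to see the elephant.",
    },
]

# Rough analogue of a query such as HEADLINE(elephant).
headline_hits = segment_search(articles, "headline", "elephant")

# The same keyword targeted at a different segment finds a different record.
body_hits = segment_search(articles, "body", "elephant")
```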
Keyword searching may not be very helpful if the user is not familiar with the standard terminology for the information sought. Further, there may be many appropriate ways to describe that information. A concept expressed by a standard term in one industry may be expressed by a different term in another industry. A keyword search would therefore require searching all of the synonyms in use in order to ensure a complete and accurate result.
When a user of a searching/retrieval system enters a keyword search query into the system, the query is parsed. Based on the parsed query, a listing of documents relevant to the query is provided to the user. In the prior art, it is also known to use semantic networks when parsing a query. The number of words used to search the database is then expanded by including in the search instructions the corresponding or associated words identified by the semantic network. This expansion can be based on any one or a combination of stems or roots of terms, sound-alike words, wildcard words, or any other appropriate semantic technique.
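Query expansion with a semantic network can be sketched roughly as follows. Here the semantic network is assumed to be a simple dictionary mapping a term to its associated terms; real semantic networks are considerably richer, and the names `expand_query`, `search`, and `SEMANTIC_NETWORK` are hypothetical.

```python
# Assumed toy semantic network: each term maps to a set of associated terms.
SEMANTIC_NETWORK = {
    "doctor": {"physician"},
}


def expand_query(terms, network):
    """Return the original terms plus any associated terms from the network."""
    expanded = set()
    for term in terms:
        key = term.lower()
        expanded.add(key)
        expanded |= network.get(key, set())
    return expanded


def search(records, terms, network):
    """Return records containing at least one term of the expanded query."""
    expanded = expand_query(terms, network)
    return [r for r in records if expanded & set(r.lower().split())]


records = [
    "The physician examined the patient",
    "A doctor was called",
    "The lawyer filed a brief",
]

expanded_hits = search(records, ["doctor"], SEMANTIC_NETWORK)
unexpanded_hits = search(records, ["doctor"], {})
```

With the empty network, only the record that literally contains the query term is returned, which illustrates what the expansion step adds.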
Boolean operators, such as “AND”, “OR” and “MINUS”, are used to enhance the capabilities of a search engine. The basic format of Boolean queries is well known in the art and generally takes on the form of “X OR Y”, where X and Y are two distinct keywords. Because search requests are processed by a computer, syntax rules must be strictly followed when drafting a Boolean keyword search. In many search engines the logical operators “AND” and “OR” must be capitalized. Some search engines allow additional syntax that indicates requisite proximity of keywords or hierarchy within a specific Boolean query. Hierarchy within a Boolean query is usually designated with the use of parentheses. The “(A OR B) AND (C OR D)” query, for instance, finds a first set of records containing “A OR B” and a second set of records containing “C OR D”, then finds records included in both the first set and the second set.
Using the Boolean operator “AND” in a search expression such as “X AND Y,” will yield records which include both X and Y in the record. Using the Boolean operator “OR” in a search expression such as “X OR Y,” will yield records which include either X or Y in the record. Using the Boolean operator “MINUS” in a search expression such as “MINUS X” will yield records which do not include the term X in the record.
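The Boolean operators described above correspond naturally to set operations on the record identifiers matched by each keyword. The following is a minimal sketch under that assumption; the inverted index, the helper names, and the four sample records are hypothetical.

```python
def build_index(records):
    """Map each word to the set of record ids whose text contains it."""
    index = {}
    for record_id, text in records.items():
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(record_id)
    return index


# Hypothetical records keyed by record id.
records = {
    1: "cat doctor",
    2: "dog veterinarian",
    3: "cat toy",
    4: "doctor visit",
}
index = build_index(records)
all_ids = set(records)


def hits(term):
    """Return the set of record ids containing the term."""
    return index.get(term.lower(), set())


and_result = hits("cat") & hits("doctor")   # X AND Y: set intersection
or_result = hits("cat") | hits("dog")       # X OR Y: set union
minus_result = all_ids - hits("cat")        # MINUS X: records without X

# Parentheses in the query become grouping of the set expressions.
nested_result = (hits("cat") | hits("dog")) & (hits("doctor") | hits("veterinarian"))
```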
A query that is too narrow will result in fewer than the desired number of records. Correspondingly, a query that is too broad will result in more than the desired number of records. Immediate user feedback on a specific query helps the searcher construct a better subsequent query. Hit count is perhaps the most effective form of feedback for constructing a better query. If a query is too narrow, the hit count will be very low, possibly even zero. If a query is too broad, the hit count will be very high. Hit count information is used with selected viewing of search results to alert the searcher to mistakes, such as an incorrect category or segment choice, or otherwise to assist the searcher in drafting more effective queries. Hit counts are generally displayed after a given query is executed. Hit counts are more useful when provided for each search term and each combination of search terms. Boolean Representation One, illustrated below in Table I, demonstrates how individual hit counts can be used for the Boolean keyword search for “(cat OR dog) AND (doctor OR veterinarian)”.
TABLE I: Boolean Representation One
In the above example, the hit counts are as follows: in the database the term “cat” is included in 280 records; in the database the term “dog” is included in 494 records; in the database the term “veterinarian” is included in 34 records; in the database the term “doctor” is included in 194 records; in the database the term “cat” or “dog” is included in 774 records; in the database the term “veterinarian” or “doctor” is included in 228 records; and in the database the Boolean query for the Boolean expression “(cat OR dog) AND (doctor OR veterinarian)” results in the location of 4 records. If the Boolean expression is altered by the replacement of “dog” with “cow”, the hit count change ripples through the Boolean expression's representation as shown in Boolean Representation Two, illustrated below in Table II.
TABLE II: Boolean Representation Two
Feedback from individual hit counts gives the searcher access to information normally hidden. Viewing individual hit counts, a searcher is better able to identify search terms that are too specific, too broad, or misspelled.
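Reporting a hit count for each search term and each combination of terms can be sketched as a recursive evaluation of the Boolean expression. The expression encoding, the function `evaluate`, and the toy index below are assumptions for illustration; the counts are made up and are not the figures of Table I or Table II.

```python
def evaluate(expr, index, report):
    """Evaluate a term or an (operator, left, right) tuple.

    Appends a (label, hit count) pair to report for every term and every
    sub-expression, and returns the label and the matching record ids.
    """
    if isinstance(expr, str):
        label, ids = expr, index.get(expr, set())
    else:
        op, left, right = expr
        left_label, left_ids = evaluate(left, index, report)
        right_label, right_ids = evaluate(right, index, report)
        ids = left_ids & right_ids if op == "AND" else left_ids | right_ids
        label = f"({left_label} {op} {right_label})"
    report.append((label, len(ids)))
    return label, ids


# Hypothetical index: term -> set of record ids containing the term.
index = {
    "cat": {1, 2, 3},
    "dog": {4, 5},
    "doctor": {1, 6},
    "veterinarian": {4},
}

query = ("AND", ("OR", "cat", "dog"), ("OR", "doctor", "veterinarian"))
report = []
evaluate(query, index, report)
counts = dict(report)
```

Replacing one term in `query` and evaluating again changes the count of that term and of every sub-expression that contains it, which is the ripple effect described above.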
An additional search tool is hierarchical categorization. Instead of classifying records into separate categories, hierarchical categories classify records into both broad groupings and progressively narrower groupings. An example of hierarchical categorization is found in biology, where organisms are organized, from broadest to narrowest, by kingdom, phylum, class, order, family, genus, and species. Hierarchical categorization is commonly used in conventional Internet search engines, such as those found at the Yahoo!™ and Altavista™ websites. To find information about a specific topic, a search engine user navigates from a list of broad categories through increasingly specific lists of categories. Once the first category is selected, a search engine typically displays a lower level screen with another list of alternatives. Such navigation continues down through the various menus of alternatives having decreasing priority levels. At any point of the category navigation, a keyword or Boolean search can be performed upon the records in that category. Search results are only obtained from records located within the category searched. Most search engines only allow searches in one category at a time. To search a second category, the searcher must navigate up the hierarchical category tree and then down to the second category.
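Restricting a keyword search to one branch of a category hierarchy can be sketched as follows. Each record is assumed to carry a single category path from broadest to narrowest; the category names, records, and function `search_in_category` are hypothetical.

```python
# Hypothetical records, each assigned to exactly one category path.
RECORDS = [
    {"title": "Elephant migration patterns", "path": ("Science", "Biology", "Mammals")},
    {"title": "Elephant polo rules", "path": ("Recreation", "Sports")},
    {"title": "Sparrow nesting habits", "path": ("Science", "Biology", "Birds")},
]


def search_in_category(records, category_path, keyword):
    """Keyword-search only the records located at or below category_path."""
    prefix = tuple(category_path)
    needle = keyword.lower()
    return [
        r["title"]
        for r in records
        if r["path"][: len(prefix)] == prefix and needle in r["title"].lower()
    ]


# Searching within Science > Biology ignores the record filed under Recreation.
biology_hits = search_in_category(RECORDS, ("Science", "Biology"), "elephant")

# Reaching the other record requires a separate search in a second category.
recreation_hits = search_in_category(RECORDS, ("Recreation",), "elephant")
```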
Multifaceted classification attempts to address the limits of the hierarchical categorization method. Instead of assigning a record to a single category, multifaceted classification allows a record to belong to multiple categories. The multiple categories become part of a record's description, along with standard information for the record such as the title, the abstract (or keywords), the date, and author. Multifaceted classification improves the likelihood of locating relevant records. First, the searcher can take several different paths to locate the same record. Using the analogy of books in a library, multifaceted classification is able to place a single book on more than one shelf. Second, the multiple categories can be subjected to a Boolean query. Records relating to sports medicine could be found by searching for records included in both the sports category and the medicine category.
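The sports medicine example above can be sketched by storing a set of categories with each record and applying a Boolean AND across categories. The record layout and the function `in_all_categories` are assumptions for illustration.

```python
# Hypothetical records; each may belong to several categories at once.
records = [
    {"title": "Treating knee injuries in runners", "categories": {"sports", "medicine"}},
    {"title": "History of the marathon", "categories": {"sports", "history"}},
    {"title": "Advances in cardiology", "categories": {"medicine"}},
]


def in_all_categories(records, *categories):
    """Return titles of records that belong to every named category."""
    wanted = set(categories)
    return [r["title"] for r in records if wanted <= r["categories"]]


# Records in both the sports category AND the medicine category.
sports_medicine = in_all_categories(records, "sports", "medicine")

# The same record is also reachable through either single category.
sports_only = in_all_categories(records, "sports")
```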
Boolean logic, segment limitations, hit counts, hierarchical categorization, and multifaceted classification help the searcher create more effective queries, but at the cost of increased complexity. Often instruction manuals or a software program's help menu must be consulted to draft a query. Dialog™, for instance, publishes a “Bluebook” that contains detailed lists of segment codes for each of their many databases. Lexis-Nexis™ goes so far as to provide free online access and training seminars for students to overcome their search engine's initial learning curve.
New generations of technology and methodologies continue to be developed to improve search accuracy and efficiency. Where one generation fails to meet all demands, another generation arises looking to fill the gaps. Each generation has been partially effective; however, no generation to date has been entirely effective. In most cases, current technology applies a single technique to access and organize information, which at certain times is productive and efficient in accomplishing the intended task. However, all too frequently, the user uncovers no positive search result or receives hundreds, and sometimes thousands, of search results. In some instances one technology will yield no positive result while another may solve the research task. What is needed is an approach which allows users to employ a simplified means to access, organize, and manage information contained on the Internet and within business data systems. This approach should combine the best search methodologies on the market to provide the most complete solution possible.
What is also needed is a methodology that takes existing, legacy information and allows users to redefine and reorganize the information without requiring a data conversion, thus improving the flow of data.