1. Field of the Invention
This invention relates to a system and method for searching digitally stored information.
More particularly, the invention relates to a system and method for searching for and retrieving the information and then displaying the information in a form which is easy for a user to study.
2. Description of the Related Art
Various computer programs for searching digitally stored data on a computer system are known. In addition, there are also computer programs which display the data retrieved as a result of a search. These include, for example, search engines, find features, data visualization programs, and visual search applications.
Search Engines
Computer programs are known that search and retrieve documents relevant to a query submitted by a computer user on a single computer system and on computer systems networked to that individual's computer. These programs are commonly known as "search engines" and may be found on the Internet, in digital catalog systems in libraries, and other very large data repositories.
Search engines generally employ the following parts:
1) a pre-compiled index of the data repository being searched, allowing for greater search performance with respect to both speed and storage size;
2) a user/query interface, which accepts the query input of the user and deciphers (or parses) that query;
3) one or more search mechanisms or algorithms which are used to search the pre-compiled index; and
4) an output or display interface, which is a mechanism to represent the results of a user's search.
The power of a search engine is its ability to seemingly extract documents relevant to a query from a global scope of information. This power can vary greatly, depending on the level of sophistication of the parts listed above. The search sophistication of search engines typically corresponds to the complexity of their creation.
Boolean, statistical, metric, or conceptual based search methodologies are the most popular, due to the greater success rates of these methods over others in assessing the target information desired by a user.
Once a search has been invoked and completed, the most common method for the ordering and display of relevant documents is in a relevancy sorted bibliographic listing. A bibliographic listing lists the most relevant documents and may also contain two or three lines of excerpt, abstract or other associated information which may be, for example, extracted from the retrieved document, or a citation to the point within the document thought to contain the relevant information, or the line numbers on which a queried keyword appears within the document. More sophisticated search engines can additionally display such listings as "hyper-links" which associate passages in one document with another document which can be displayed when the "link" is clicked on with a mouse.
To attain meaningful search results, search engines generally require the user to have at least some knowledge of how to formulate proper queries, and others require a detailed understanding of complex query languages. Such query languages may be difficult for new users to grasp and usually take some time to master. As a result, alternative search engines have been designed which simplify the process of creating queries by allowing queries to be formulated in seemingly "natural language." The trade-off is that the parsing mechanism (the mechanism responsible for processing and deciphering the query) is significantly more complex than standard Boolean or statistical keyword query parsers. Additionally, the complex nature of a query parsing mechanism almost always prohibits the mechanism, and therefore the search engine, from being used with foreign languages, unless substantial additional programming is provided.
As noted above, many search engines require a pre-compiled index of the repository being searched. As the information in the repository is added to or changed, the index may need to be updated. This generally means that the complete search process cannot be accomplished in real-time, due to the time consuming nature of compiling the index. This may present a serious impediment to searching large amounts of real-time data.
By design, most search engines return and display a listing of documents in response to a query. When searching a large data repository, the search engine may return a list of hundreds or even thousands of relevant documents. Even if the user knows that the returned listing shows the documents with the greatest chance of satisfying the query, the user probably has no comprehensive way of easily determining the density or accuracy of this information. The result is that the ability to easily analyze information within the retrieved documents is often lost, and the user must browse all the returned information, or even each document, in order to ascertain the location and relevancy of the information returned in response to the query. This usually proves to be a very time consuming process.
Furthermore, the user is subject to the ambiguities inherent in a search engine's algorithmic logic. As a result, the information sought may not necessarily be the information searched for, or the format of the desired query may not be allowed.
Find Feature or Command
"Find" features or commands may be found, for example, in computer programs commonly dealing with organized data such as word processors, spreadsheets, and databases. A "find" feature or command is generally a program or routine which searches and retrieves information from within a document that matches a keyword based query submitted by a user of the program. Typically, find features employ the following parts:
1) a user/query interface, which functions to accept the query input of the user and decipher the query;
2) one or more search mechanisms which generally employ syntactic pattern matching algorithms or semantic (e.g., a spreadsheet formula) pattern recognition algorithms; and
3) an output/display interface which presents the results of a user search (this is usually just the document itself).
As contrasted with search engines, find features do not usually require a document to be indexed. This allows for a complete search process to be accomplished in real-time, even on constantly changing data. More sophisticated find features may employ the use of case, partial word, and homonym matching. Results of "find" type searches are generally returned by "jumping" to the location of the match and highlighting or flagging it in some manner. These search mechanisms are characterized by their speed as well as their simple creation process and ease of use.
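As a rough illustration of the unindexed, scan-on-demand behavior described above, the following Python sketch searches the current text directly each time it is called and reports the location of each match. It is a minimal sketch only: the function name find_matches and the options match_case and whole_word are hypothetical, homonym matching is not attempted, and the whole-word option assumes the keyword begins and ends with a letter, digit, or underscore.

```python
import re
from typing import Iterator, Tuple


def find_matches(text: str, keyword: str, *, match_case: bool = False,
                 whole_word: bool = True) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) character offsets of each match.

    The text is scanned directly on every call; no index is built,
    so the results always reflect the text as it currently stands.
    """
    pattern = re.escape(keyword)  # treat the keyword literally
    if whole_word:
        # Word-boundary anchors exclude partial-word matches.
        pattern = r"\b" + pattern + r"\b"
    flags = 0 if match_case else re.IGNORECASE
    for match in re.finditer(pattern, text, flags):
        yield match.start(), match.end()


# Illustrative input (not taken from any real document).
document = "Index the index. Reindexing is slow."
hits = list(find_matches(document, "index"))
partial_hits = list(find_matches(document, "index", whole_word=False))
```

A caller would typically "jump" to each returned offset and highlight the span between start and end. Because nothing is pre-compiled, editing the text and calling the function again is sufficient to obtain current results.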
Find features found within common applications typically provide a bare minimum of searching capability and are inadequate for serious search analysis.
First, the scope of these algorithms typically does not extend beyond the bounds of the document.
Second, these algorithms are usually limited to simple, single keyword or phrase based searches.
In other words, there is an inability to: specify complex or compound queries, search for multiple topics at once, or broaden the scope of the mechanism beyond the current document or group of pre-identified documents.
Though both search engines and find features are standard in their respective areas, they are generally cryptic in nature and do not employ immediately intuitive input or output interfaces.
As demonstrated within the data visualization field, visual cues or indicators are generally acknowledged to be significantly more intuitive for data analysis and extraction. However, neither search engines nor find features typically make use of visual cues to any great extent.
Data Visualizers
In an attempt to address the limitations of the foregoing, computer programs have been created which make use of visual cues to extract information contained in one or more documents. One example of such an application is Seesoft (described in Eick, Steffen, & Sumner, Seesoft: A Tool for Visualizing Line Oriented Software Statistics, IEEE Transactions on Software Engineering, 1992, pp. 957-968), which may be used to find information about computer programming files. It relies heavily on a multiple phase process. First, it obtains the desired information in a statistical form. This is accomplished by running computer programs on the "target" files which accumulate statistics about the desired information. The values extracted from a file or files are grouped together into a single set of data, referred to as a "data set." These data sets are static and contain the statistical values as well as corresponding information identifying the lines of text, and the files, in which the statistics occurred. The data set is then fed into a program commonly referred to as a data visualizer. Typically, the data visualizer creates a "thumbnail" view of the documents being analyzed and color codes the lines associated with pieces of data in the data set. This application primarily uses color to represent the range of values associated with a statistic.
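The color coding step described above can be sketched roughly as follows. This is not Seesoft's actual implementation. It is a minimal Python sketch under stated assumptions: the data set is modeled as a mapping from (file name, line number) to a statistic, and the four-color palette, the "gray" default, and the names color_for and thumbnail are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical pre-compiled data set: (file name, line number) -> statistic.
DataSet = Dict[Tuple[str, int], float]

# Illustrative palette ordered from low to high values (an assumption).
PALETTE = ["blue", "green", "yellow", "red"]


def color_for(value: float, lo: float, hi: float) -> str:
    """Map a statistic onto the palette by its position in the range [lo, hi]."""
    if hi == lo:
        return PALETTE[0]
    fraction = (value - lo) / (hi - lo)
    index = min(int(fraction * len(PALETTE)), len(PALETTE) - 1)
    return PALETTE[index]


def thumbnail(file_name: str, line_count: int, data: DataSet) -> List[str]:
    """Return one color per line; lines with no statistic stay "gray"."""
    if not data:
        return ["gray"] * line_count
    lo, hi = min(data.values()), max(data.values())
    return [
        color_for(data[(file_name, n)], lo, hi) if (file_name, n) in data else "gray"
        for n in range(1, line_count + 1)
    ]


# Illustrative values only; the range spans every file in the data set.
data_set: DataSet = {("main.c", 1): 0.0, ("main.c", 3): 5.0, ("util.c", 2): 10.0}
main_view = thumbnail("main.c", 4, data_set)
```

Note that the thumbnail is computed entirely from the data set. If the source file changes after the data set is compiled, the colors no longer correspond to the file's current lines until a new data set is produced.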
A further example of a data visualization application is a visualization tool within DEC Fuse, which is a commercially available product from Digital Equipment Corp. This visualization tool makes use of multiple data sets. Instead of having a range of values represented by different colors, each data set is represented by a different color.
One problem with using data visualization mechanisms with information searches is that they usually require indexes, i.e. data sets, to color code information within the data visualizer which, like the indexes used with search engines, have to be pre-compiled by other applications. Thus, as with search engines, if the source files are changed, the current data sets are no longer applicable. In other words, if the source files are changed the user must go through the process of creating new data sets. Therefore, because of the static nature of a data set these applications suffer from similar drawbacks as search engines in that there is no ability to act on continuously changing information.
Furthermore, although the user may have a choice regarding which data sets are visualized, and in what color, these applications generally lack intuitive user interfaces. There is typically no ability to interact with the visualized data, or interface, with the express purpose of creating new data sets.
Most importantly, the data sets are an integral component of the process even though they exist as separate components. This means that if a data set or index is corrupted, lost, or changed, the capacity to find information is destroyed and the user must restart the data collection and visualization process.
Visual Search Application
Computer programs have also been written which allow for visual searching within a single document at a given time. Like the data visualizers mentioned above, such an application marks points of interest within a single file on a representation of that file, which is typically displayed as a thumbnail view (i.e., in reduced size) on the user's screen. Unlike the standard data visualizers discussed above, this application integrates a "find" feature or command as discussed above with the ability to visualize the results. It also allows a user to decide, to some extent, what information is shown within the thumbnail representation of the document.
In particular, the application allows a user to interact with a full scale representation of the document, which is shown in a separate window from the thumbnail view, to create queries (and delete them after creation). The results of the queries can then be represented both in the full scale view and the thumbnail or reduced size view of the document. For example, a user can choose a word to be searched by clicking on that word in the full scale view of the document, thereby associating a color with that word. In response, the system highlights all occurrences of the selected word in the document. This highlighting appears on the user's screen both in the full scale representation of the document and in the thumbnail representation of the document. Colors of the user's choosing (among a selection of six colors) may be associated with the queried words for visualizing the representations in a meaningful manner.
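The query interaction described above (choose a word, associate a color, highlight every occurrence, optionally delete the query) can be sketched roughly as follows. This is a minimal Python sketch, not the prior art application's code: the class name VisualSearch, its method names, and the six color names are assumptions, since the source states only that six colors are available.

```python
import re
from typing import Dict, List, Tuple

# Six selectable colors; the specific names are assumed for illustration.
COLORS = ["red", "green", "blue", "yellow", "cyan", "magenta"]


class VisualSearch:
    """Single-document word queries, each associated with a color."""

    def __init__(self, text: str) -> None:
        self.lines = text.splitlines()
        self.queries: Dict[str, str] = {}  # queried word (lowercase) -> color

    def add_query(self, word: str, color: str) -> None:
        """Associate a color with a word, as if the word had been clicked."""
        if color not in COLORS:
            raise ValueError(f"unsupported color: {color}")
        self.queries[word.lower()] = color

    def remove_query(self, word: str) -> None:
        """Delete a previously created query, if present."""
        self.queries.pop(word.lower(), None)

    def highlights(self) -> List[Tuple[int, str, str]]:
        """Return (line number, word, color) for every queried occurrence."""
        result = []
        for number, line in enumerate(self.lines, start=1):
            for token in re.findall(r"\w+", line):
                color = self.queries.get(token.lower())
                if color is not None:
                    result.append((number, token, color))
        return result


# Illustrative document and queries.
search = VisualSearch("The cat sat.\nA dog ran.\nThe cat ran.")
search.add_query("cat", "red")
search.add_query("ran", "blue")
marked = search.highlights()
```

The same list of highlights could drive both the full scale view and the thumbnail view, since each entry carries the line on which the colored word occurs.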
A drawback to this application is that each running copy of the program can look at only a single document at a time. In other words, to view two documents, two copies of the application must be running. The data visualizers mentioned above can handle multiple files within one global thumbnail view and within a single program framework. With the visual search application, by contrast, a user who wishes to compare or search multiple files must open a new copy of the application for each file.
An additional drawback to the visual search application is that it makes poor use of random access memory ("RAM"). For example, it is estimated that for the application to process a document which has a size of X kilobytes, the application requires an amount of RAM equal to approximately 25 times X. This is due to the creation of a memory index of all words in the document which is placed in RAM.
Referring to FIG. 12A, the memory structure 1202 for the visual search application is shown. In memory, for each word in the document, the application stores the word itself 1204 and the word's corresponding color 1206. Additionally, it stores substantial amounts of additional information 1208, 1209 pertaining to the word. This additional information 1208, 1209 is part of the data structure 1202 for each word. Indeed, although FIG. 12A has been illustrated with two "blocks" of additional information 1208, 1209, the prior art application actually stores substantially more than two blocks. Finally, the data structure 1202 contains a pointer 1210 to the data structure 1212 of the next word 1214 in the document. As a result, each word within the document has a very large memory "footprint" when the document is loaded into the application. Additionally, these color representations are stored with the word only once, when the user initially queries a word. Therefore, the application is not capable of handling dynamic source information (i.e., information which is constantly changing in real-time).
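The per-word structure of FIG. 12A can be modeled roughly as a singly linked list with one node per word. The Python sketch below is illustrative only: the reference numerals in the comments follow the description above, while the field names, the 16-byte block size, and the helpers build_chain and words are assumptions rather than details of the prior art application.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WordNode:
    """One per-word record, loosely modeled on data structure 1202."""
    word: str                                          # 1204: the word itself
    color: Optional[str] = None                        # 1206: color set when queried
    extra: List[bytes] = field(default_factory=list)   # 1208, 1209: additional blocks
    next: Optional["WordNode"] = None                  # 1210: pointer to the next word


def build_chain(text: str, extra_blocks: int = 2) -> Optional[WordNode]:
    """Create one node per word and link the nodes in document order."""
    head: Optional[WordNode] = None
    tail: Optional[WordNode] = None
    for word in text.split():
        # The 16-byte placeholder blocks stand in for the additional information.
        node = WordNode(word, extra=[bytes(16) for _ in range(extra_blocks)])
        if tail is None:
            head = node
        else:
            tail.next = node
        tail = node
    return head


def words(head: Optional[WordNode]) -> List[str]:
    """Walk the chain from the first word to the last."""
    out = []
    node = head
    while node is not None:
        out.append(node.word)
        node = node.next
    return out


chain = build_chain("find the word")
```

Every word, including repeated and unqueried words, carries its own color slot, additional blocks, and pointer. This is consistent with the large per-word footprint described above, although the sketch does not attempt to reproduce the estimated factor of 25.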
In the visual search application, queries are saved in a data file that is separate from the original document, and nothing prevents the data file from being separated from the document. When the user retrieves the document in a later session, the application must use both files (the document and the data file) to retain the color coded information. If the document and data file have become separated, the user must go through the query process all over again.
Furthermore, the user interface of the visual search application does not have a single coherent standard for interaction with the document being analyzed. The interface is also typically non-intuitive and may be confusing to many users. Necessary interface components for window navigation such as scrolling and window labeling (for easy window differentiation) may also be lacking. This visual search application is also unsuitable for handling media other than simple text.
Therefore, it would be desirable to be able to provide a dynamic visual search and retrieval system and method for digital data which is usable with multiple documents of differing media, which contains a consistent, easy to use querying interface in which queries may be formulated and executed, and in which the results may be retrieved visually and in real-time.
It would also be desirable to have a system and method for visual search and retrieval that allows multiple queries to be submitted, categorized in real-time by the user, and combined visually so that the queries may be interpreted by the user as (or to simulate) Boolean, statistical, and conceptual bodies of information.
It would also be desirable to be able to provide in such a system a method for user navigation, and user demarcations of points of interest within the retrieved material.