This invention relates to data analysis and, more specifically, to a method and system for searching a database for relevant items of information, as well as to an article of manufacture embodying computer program instructions that enable implementations of such a method and system.
As the quantity of data accessible via public and private networks increases, there is a growing need for efficient searching to identify and locate relevant items of information. Currently, most searching is performed using various forms of queries, content matching techniques, artificial intelligence, or other forms of data analysis to obtain targeted information sought by a user. These systems also enable a user to refine an initial query based on results fed back to the user.
Present-day query methods and systems, however, may not provide adequate feedback for compound queries, i.e., queries specifying two or more characteristics or search parameters (e.g., queries including a Boolean combination of two or more keywords or phrases). Most present-day systems and methods also lack overviews of results in a form that allows a user to identify and explore patterns that emerge during the search or analysis.
For a given compound query, a search engine such as the Lycos online search tool returns a total number of matching data objects found and an ordered list of hyperlinks to matching data resources. The list typically starts with the closest matches, followed by more distant matches. Each link is prefixed with a sequence number. No indication, however, is provided as to the extent to which the identified link matches each of the terms in the compound query.
Similarly, given a multi-keyword query, another query engine, provided by the Delphion Research Intellectual Property Network Service, returns a prioritized list of links to pieces of intellectual property (IP), where each link is followed by a percentage value indicating how close Delphion calculates the associated piece of IP to be to the given set of keywords. Again, the query engine does not provide the user with an indication of how close each item of IP is to each keyword in the compound query.
Google also returns an ordered list of links to matching data objects, ordered from the closest to the most distant, using its own metric for closeness. Google allows the user to have each link indicate which of the keywords occur in its title. It does this by color-coding the query keywords, and then showing the color-coded keywords in the link titles. Thus, if the query were “(American apple pie)”, which implicitly conjoins the keywords, Google would highlight all instances of “American” with yellow, “apple” with blue, and “pie” with purple. This approach does not provide an accurate indication of the extent to which each link's data object matches the given query terms for at least two reasons. The first is that since “American” is longer than either “apple” or “pie,” the resulting page will show more yellow than either blue or purple. Thus, a user obtains an incorrect indication that the results very closely match “American.”
Second, the words in a given link's title are not the only criteria to determine the link's position in the ordering (e.g., the data object's content is also used). For example, in the list of links returned by Google there is a link to an animation (www.markfiore.com/animation/looting.swf) concerning “American as apple pie” in which the link itself does not include any of the keywords. Here again, color-coding of the link title does not accurately reflect the extent to which the referenced data object matches the given query terms.
The Glass Engine (see http://www.philipglass.com/glassengine/# for details) provides an abstract graphical user interface (GUI) with which a user may explore musical works of composer Philip Glass. In addition to a detailed listing of the composer's works, the GUI provides an abstract graphical representation indicating the extent to which each work possesses one of five characteristics (specifically, joy, sorrow, intensity, density, and velocity). Upon selection of a particular musical work, the extent to which the selected work matches each of the five characteristics is shown abstractly and graphically using a bar-chart-like method. This provides an indication of how the selected work's extent of match compares with that of other works. The Glass Engine also allows users to specify match-extent ranges for each of the characteristics. So, for example, a user may specify a desire only to explore compositions whose intensity is in the low-to-high range and whose joy is in the medium (i.e., medium-to-medium) range.
Although the Glass Engine provides an indication of the extent to which data objects (i.e., musical works by Philip Glass) match given characteristics: (1) a user cannot compare the matching level of two works side by side because the engine only displays the matching extent of a single work at a time; (2) the method does not provide an overview of the extent of match of multiple works at once; and (3) both the data objects and the characteristics are predefined, i.e., the user cannot add to either (e.g., no additional data objects, such as works by Bach; or characteristics, such as complexity).
Prior methods also exist that categorize the results of a given compound query and provide abstract graphic representations of the results to end-users. After finding all data objects matching a given compound query, Grokker (see www.grokker.com) automatically determines a grouping of the objects into one or more categories. It then provides the user with an abstract graphical user interface through which the user may navigate these categories and their associated data objects. A system and method provided by U.S. patent application Publication US 2003/0225755 A1 is similar to Grokker, except that, rather than providing groupings of categorized data objects, it provides links to the data objects that indicate how closely each associated data object belongs to each of the automatically derived categories. In both cases, though, the user is not provided with any indication of how closely each discovered data object matches each of the compound query's search terms.
Query Previews (for details see Doan, K., Plaisant, C., and Shneiderman, B., “Query Previews in Networked Information Systems”, Proc. of the Third Forum on Research and Technology Advances in Digital Libraries, ADL '96, Washington, D.C., May 13-15, 1996, IEEE CS Press, 120-129) provides a two-phase dynamic query method designed to facilitate a user's search while reducing the time spent awaiting data returns from the network. In a first phase, the Query Preview phase, users develop an initial query and obtain a graphic representation indicating the number of matching data objects without ever retrieving the full content of each of the objects. In this way, the user may adjust his or her query to avoid retrieving too many objects (e.g., thousands) or too few objects (e.g., zero). Once the user has developed a query with a reasonable number of matches, the user proceeds to a second phase, i.e., a Query Refinement phase, in which the user bases further query modifications on the content of the retrieved data objects. This method still does not provide any graphic representation indicating how each discovered data object matches each of the compound query's search terms.
History Flow (http://web.media.mit.edu/~fviegas/papers/history_flow.pdf) is a GUI that provides a representation of wiki versioning over time. It is a collaborative surveillance tool that helps wiki participants monitor content changes of a wiki. It offers a method for community analysis by showing patterns of site revisions through the wiki's history.
In contrast, an embodiment of the present invention provides a GUI that enables a user to make compound queries and displays a visualization of query results that shows the extent to which the parts of the queries match the overall compound query. In addition, such a visualization exhibits query returns over any data set, whether the Internet, an intranet, wikis, blogs, or any other data source. Also, the visualization displays results organized according to relevance of match, as opposed to History Flow's organization according to time. Advantageously, the present invention enables complex querying rather than community surveillance.
TileBars (for details, see Marti Hearst, TileBars: Visualization of Term Distribution Information in Full Text Information Access, Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 59-66, Denver, Colo., May 1995) are similar in that they provide a visualization that helps users make judgments about the potential relevance of retrieved documents. TileBar querying allows multiple explicit search terms and visualizes them in a way that shows what role the query terms played in the ranking of the retrieved documents. TileBars use the text structure (document length, query term frequency, and query term distribution) of the retrieved documents to build their visualization.
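By way of illustration only, the notion of a term-distribution display may be sketched as a grid with one row per query term and one column per segment of a document, each cell holding the number of times the term occurs in that segment. The sketch below is not the TileBars implementation; the function name term_grid, the parameter segment_size, and the use of fixed-length word segments are assumptions made for the example.

```python
import re


def term_grid(text, terms, segment_size=5):
    """Count occurrences of each query term in consecutive text segments.

    Hypothetical helper: segments are fixed-length runs of words, an
    assumption made for illustration rather than a documented method.
    """
    # Lowercase the text and split it into alphanumeric word tokens.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Cut the token list into consecutive segments of segment_size words.
    segments = [
        tokens[i:i + segment_size]
        for i in range(0, len(tokens), segment_size)
    ]
    # One row per term; each entry is that term's count in one segment.
    return {t: [seg.count(t.lower()) for seg in segments] for t in terms}


grid = term_grid(
    "apple pie is sweet and apple tart is sour but pie wins",
    ["apple", "pie"],
)
```

In a display built from such a grid, a row with counts spread across many segments would suggest a term discussed throughout the document, while a row with a single non-zero cell would suggest a passing mention.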
In view of the current state of the art, there remains a need for a system or method that takes a compound query, determines the matching data objects, and then provides not only access to the matching data objects, but also an indication of how closely each data object matches each of the compound query's search terms. There also remains a need for a system or method that provides an overview of all of the matching data objects, the overview providing an indication of how closely each data object matches each of the compound query's search terms.
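The kind of per-term indication described above may be illustrated with a minimal sketch. It assumes, purely for the example, that the extent of match between a data object and a search term is the fraction of the object's words equal to that term, and that an object matches the query if at least one term occurs in it. The names tokenize and per_term_match are hypothetical and do not describe any particular embodiment.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def per_term_match(query_terms, documents):
    """Return {doc_id: {term: score}} for documents matching any term.

    Assumption for illustration: score is the share of a document's
    tokens that equal the term (a simple relative frequency).
    """
    results = {}
    for doc_id, text in documents.items():
        tokens = tokenize(text)
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero on empty text
        scores = {t: counts[t.lower()] / total for t in query_terms}
        # Keep only documents in which at least one term occurs.
        if any(scores.values()):
            results[doc_id] = scores
    return results


docs = {
    "a": "American apple pie recipe",
    "b": "apple apple tart",
    "c": "nothing relevant here",
}
matches = per_term_match(["american", "apple", "pie"], docs)
```

Unlike a single closeness percentage per result, the returned structure keeps one value per search term for each data object, which is the information an overview would need in order to show how each object relates to each term.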
To address the needs of the art, one embodiment of the present invention provides an overview of the query results to indicate overall matches in a large data set. Such an overview provides a visualization that orders results horizontally (rather than stacking them vertically) to provide a user with a more direct and swift visual comparison of the list of retrieved data objects. Further, the visualization may also place retrieved data objects with similar matches next to each other to show overall patterns across the set of retrieved objects. Other embodiments of the invention allow iterative querying while in the midst of a query: the user may easily signify which terms are most important by dragging the results indicator that corresponds to a term toward the middle, thereby reordering the entire visualization according to the user's shifting priorities.
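The reordering according to shifting priorities may be illustrated with a minimal sketch. It assumes, for the example only, that each retrieved data object already carries one match score per search term, that the importance the user assigns to a term is modeled as a numeric weight, and that objects are ordered by the weighted sum of their scores. The function name order_results and the weighting scheme are hypothetical and are not taken from any particular embodiment.

```python
def order_results(scores, weights):
    """Order document ids by weighted per-term match, best first.

    scores:  {doc_id: {term: match score}}
    weights: {term: importance}; terms without a weight default to 1.0.
    Both the weighted sum and the default weight are assumptions.
    """
    def weighted(doc_id):
        return sum(
            weights.get(term, 1.0) * score
            for term, score in scores[doc_id].items()
        )

    # Sort by descending weighted score; break ties by document id.
    return sorted(scores, key=lambda doc_id: (-weighted(doc_id), doc_id))


scores = {
    "a": {"apple": 0.25, "pie": 0.5},
    "b": {"apple": 0.5, "pie": 0.125},
}
initial = order_results(scores, {})
# Raising the weight of "apple" stands in for the user signaling
# that this term has become more important.
reprioritized = order_results(scores, {"apple": 3.0})
```

With equal weights, object "a" is placed first; once "apple" is given more weight, object "b" moves ahead of it, which mirrors the effect of reordering the whole display when the user's priorities change.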