The present invention relates generally to processing techniques of one or more objects from one or more information sources. More particularly, the present invention provides methods and systems for processing information using a thematic based technique, including processes, systems, and user interfaces to perform, for example, web searching. Merely by way of example, the invention has been applied to searching web sites, but it would be recognized that the invention has a much broader range of applicability. The invention can be applied to searching web sites, information associated with URLs, data sources resident on a local computer system or network storage system, intranets, patent documents (e.g., patents, publications), electronic archives, online-journals, blogs, news articles, any natural language based (electronic) information available in electronic format (textual), electronic messages, any combination of these, and the like.
The amount of electronically produced and stored information has exploded over the years with the proliferation of computers, information devices (e.g. PDAs, cell phones), computer networks, storage devices, shopping, security/privacy databases, and the like. Such electronic information includes, among others, on-line newspapers, magazines, advertisements, web sites, blogs, user forums, commercial publications, commercial databases, as well as electronic information generated by individuals and organizations, such as e-mail messages, transactional information/security, word processing documents, presentation documents, spreadsheet documents, and the like. By way of a world wide network of computers, such as the Internet, billions of pieces of information are now available through operating-system neutral “browser” programs such as Internet Explorer by Microsoft, Firefox by the Mozilla Foundation, Navigator by Netscape, or the like.
As the amount of information continually increases, a user's ability to access specific and targeted information becomes more important. Information retrieval engines such as those made by Yahoo! Corporation, Google, Microsoft, and others present information sources to a user primarily by using a key-word indexing technique. These indexing techniques often rely upon performing a full or partial text indexing of data sources, noting which key words are used more frequently, and associating those key words to the data sources. Yet other techniques, such as those provided by Google Inc., use associations to a selected website and information stored in meta-data, links, references, etc.
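The key-word indexing described above can be illustrated with a minimal sketch. It is not the method of any named retrieval engine; the function names (`tokenize`, `build_index`, `search`), the sample documents, and the ranking by raw term count are all assumptions made for illustration only.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercase the text and split it into alphanumeric key words.
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    # Map each key word to the data sources containing it, with a use count.
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index

def search(index, query):
    # Rank data sources by how often the query key words occur in them.
    scores = Counter()
    for term in tokenize(query):
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    return [doc_id for doc_id, _ in scores.most_common()]

# Hypothetical data sources, invented for this example.
docs = {
    "a": "Jaguar speed and jaguar habitat in the wild",
    "b": "The Jaguar car dealership opens today",
    "c": "Conservation of big cats in South America",
}
index = build_index(docs)
print(search(index, "jaguar"))
```

In this sketch a query for "jaguar" returns both the wildlife document and the car dealership document, since both contain the key word, and it does not return the document about big cats, which never uses the word.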
As noted above and further emphasized, conventional techniques are often based on syntactic or lexical analysis, utilizing various techniques to relate information on keywords. Keyword searching, however, is generally ineffective due to the use of multiple terms and varying grammatical structures representing the same concepts in different documents. Keyword searching is also generally ineffective because the same words and grammatical structures may represent many different concepts, as distinguished only by the greater context of the discourse. The result is poor precision (unsatisfactorily high levels of incorrect or irrelevant results, i.e., false positives) and poor recall (large numbers of missed associations, i.e., false negatives). In addition, these systems cannot relate information to a task and are thus limited to providing lists ranked on some non-task related means (such as anchors, links, number of references, number of hits, advertising dollars, etc.; see link analysis systems, e.g., PageRank, HITS, etc.). This limitation reduces current solutions to offering tools for assisting in various steps in a business process, a consumer task, or the like; such solutions cannot enable the automation of those steps or offer assistance in completing a task from start to finish.
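The precision and recall measures referred to above can be made concrete with a short sketch using their standard definitions: precision is the fraction of retrieved results that are relevant, and recall is the fraction of relevant items that are retrieved. The function name `precision_recall` and the sample identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    # Treat both inputs as sets of document identifiers.
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # Precision falls as irrelevant results (false positives) are returned.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    # Recall falls as relevant items (false negatives) are missed.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical search: three results returned, two documents truly relevant.
precision, recall = precision_recall(["a", "b", "d"], ["a", "c"])
print(precision, recall)
```

Here only one of the three returned results is relevant and one of the two relevant documents is missed, so both measures are low.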
Full text index searching has been another way to retrieve information in conventional retrieval engines. Unfortunately, such full text searching is plagued with many problems. For example, a user of such searching often retrieves thousands of documents or hits for related documents which simply contain one or more of the keywords somewhere in their content. Since the mere inclusion of the keyword in a discourse provides very limited insight into the subject matter, purpose, and focus of the discourse, the results of the search are not precise. Such searching often requires refinement using a hit or miss strategy, which is often cumbersome, time consuming, and inefficient. Despite best efforts of users or ranking systems, it may be impossible to identify a set of keywords that locates the best and most related material and/or to effectively rank those results without any semantic level understanding or any user interest based ranking. Accordingly, full text searching has much room for improvement.
Natural language techniques are examples of other attempts for searching large quantities of information. Such natural language techniques often use simple logical forms, pattern matching systems, ontologies and/or rule based systems, which are difficult to scale efficiently and lack precision when applied to large quantities of information. For example, conventional natural language techniques often cause what is known as “combinatorial explosion” when the number of logical forms that are stored as templates grows. Adding to their complexity, they typically require a very high degree of accuracy in order to provide any useful results. For example, speech-to-text conversion often employs NLP techniques and, due to the amount of nonsensical output, is generally considered unusable with accuracies below 98%. Accordingly, natural language techniques have not been able to scale for large complex information systems. Additionally, the inventors recognize that words in any language have meanings that are very context sensitive and written communication is governed by very loose rules of grammar.
In general, various approaches have been attempted to automatically extract human level understanding from digital content, and almost all have involved an a priori approach. The systems are based on building a model of human understanding ahead of time and then performing some form of pattern matching between the model and the data. This is typically done by building some form of ontology; for example, a set of rules that model human behavior or some form of taxonomy to organize information based on a pre-set categorization system. However, this approach has proven to be unscalable and, at best, beyond the limits of the capabilities and scope of our current technology. A fundamental problem with ontologies is that they try to capture “truths.” Such inferences of “truth” are defined and established through reference to the conventions associated with a particular semiotic system in which concepts/truths are represented. More importantly, they may exist and be valid only in the context of a particular representation. In other words, a “truth” may be temporal and exist only for the concept in which it is represented. Thus, any a priori ontology of truth, while valid for some concept, may be invalid for another, thereby making the ontology impossible to predefine.
Beyond their form of implementation (expert systems, taxonomies, etc.), ontological rules can be based on any of several different approaches ranging from emulating human cognition to defining formal processes for interpreting grammar. Systems that rely on accurately interpreting grammar suffer from the imperfections in natural languages that tend to cloud rather than disclose the thoughts which they expose. The clouding of thought resulting from the non-formal nature of natural languages makes it difficult for humans to communicate effectively. Thus, it is difficult to accurately infer consistent interpretations of concepts conveyed in discourses. Since the current state of artificial intelligence is far more primitive than even the most basic human capability and since grammatical interpretation taxes the most advanced human cognitive abilities, any system that relies on interpreting grammar is likely decades and several orders of magnitude of technological advancements away from success. This indicates to the inventors that any practical system must avoid creating a solution that requires capturing the essence of how the individual mind grasps the meanings of linguistic expressions and acquires linguistic competence. It is by no means obvious how a mental entity such as an idea can be determined.
Additionally, the inventors of the present invention understand that the gap between the generation of information and the knowledge gained/extracted is further widening. Most data is unclassified and unstructured, making it unsuitable for current data management technology solutions such as traditional data mining and OLAP. While conventional search and categorization technologies can process unstructured data and, given keywords or phrases, identify lists of documents referencing those phrases or build basic taxonomies based on statistical clustering of keywords or phrases, they are ineffective in complex information extraction and analysis. The conventional technologies are based on the assumption that there are naturally occurring groups of words or phrases that can be identified by users and that are more commonly referred to in documents containing related themes than in other unrelated material. Due to the nature of language itself, this is generally not a valid assumption and thus the relationships identified by search systems are often misguided. Further, attempts to categorize search results through automated clustering tend to produce taxonomies with mathematical relations that often have no natural interpretation and limited practical use. More importantly, however, many if not all of these conventional systems put the burden of analysis on the user to wade through possibly irrelevant results. Thus, with conventional technology, the only effective way for users to achieve a task, such as preparing a legal brief from diverse information sources and multiple formats, is to do so manually. The manual process is slow, expensive, error prone, and does not scale.
Other approaches to the problem that do not rely on pre-defined ontologies include latent semantic analysis (LSA). This process does involve semantics, but focuses primarily on counting frequencies of keywords. A drawback of such an analysis is that, without any semantic knowledge, it is impossible to perform any type of analysis beyond simple relationship establishment based on word frequency counting. As evidenced by the industry's standardization on inverted list keyword indexing systems (such as Google, Yahoo!, MSN, Amazon, etc.), in most practical applications the advantages of LSA have proven to be marginal. Additionally, LSA is inherently more complex than keyword indexing systems. LSA is currently used in determining the similarity of meaning of words and passages by analysis of large bodies of text rather than higher level cognitive analysis.
As with many systems based on simplification of idealized systems, controlled experimental results of LSA do not accurately reflect the results of real-world applications. Specifically, LSA's strength, namely its lack of need for knowledge from perceptual information about the physical world, from instinct, or from experimental intercourse, feelings, or intentions, also limits its success in practical applications. Because LSA induces its representation of the meaning of words and passages solely from analysis of text, it is incapable of identifying related concepts in a generalized case. This results from the fact that grammatical representations of similar expressions of thoughts can vary dramatically, making it impossible to identify a single set of words which will be used consistently to express the same thought in any given context.
As can be seen from the above, techniques for improving searching of information are highly desirable.