The veritable explosion of the Internet has created a problem of altogether too much information. The user is overwhelmed by the simplest of searches. Every website owner strives to have their site at the top of the search results. Few web users look beyond the first few pages, or roughly 50 sites, of result sets that can run into the millions. The problem stems, in part, from the use of ambiguous words to drive the search queries. Additionally, the sheer number of websites continues to increase the difficulty of finding the right information.
One alternative approach has been to build directories. The difficulty of the directories is still the issue of ambiguity. These directories are by no means an attempt to search the Internet but rather a way to organize a small selection of the billions of web pages currently available. These handpicked sites are very limited in absolute terms or numbers. More importantly, the Internet is growing at such a rapid rate that static directories are, by their very nature, outdated. There needs to be a way that even the brand new pages can be organized.
There are many drawbacks with current Internet search methods such as Google and Yahoo. Many relatively robust search engines exist today. All that Google does is search, and yet, they have results that are full of ambiguity and have not yet integrated a method of drill down to reach search results. These companies all continue to refine the use of algorithms dependent upon interpretations of the user's keystrokes or weighting the records based on complex calculations of proximity, frequency, and position.
Google and the pack of search engines have engaged in a race to the finish line trying to solve the frustrating problem of relevance. There is no way that the computer can consistently and reliably determine the intent of the user. In other words, the keystrokes of the user have been analyzed in conjunction with other queries to attempt to understand, or anticipate, the user's intention. But the user may have an active mind and be able to shift between many diverse subjects. Therefore, the computer is constantly baffled by this problem. These companies have invested millions to develop Artificial Intelligence to solve this problem and to make the text box interface effective, but to no apparent avail.
This is especially difficult when so many words are ambiguous. In particular, the more common words tend to have multiple meanings. It is for this reason that more educated users have a clear advantage when using the standard text box combined with a modicum of skill in Boolean logic. The educated user has a broader vocabulary and can thereby express their objective in a more precise manner.
Language-based searches have various unsolved problems: children are exposed to inappropriate material; words have more than one meaning; keystroke errors result in totally wrong information; keyboard entry requires skills that are not universal; users must remember words and names; users must read to understand results; users need an extensive vocabulary to assess results; and international use of the Internet spans many languages.
Oftentimes, when using words for search parameters, the user is faced with sorting through disparate results. Currently, search results present websites that contain the selected word but whose subject matter may be completely unrelated to the searcher's objective. For instance, if a user searches for flamingo they see the following results:
1) Flamingo Hotel and Resort, Las Vegas;
2) Flamingo, Scientific Classification;
3) Harper Collins Publishers, UK;
4) Flamingo Gardens, Florida;
5) Flamingo Land Theme Park and Zoo, United Kingdom;
6) Flamingo Table tennis, located in Gouda, Netherlands;
7) Flamingo World, for free online coupons; and finally,
8) xxx.com, in which the word flamingo appears but which features erotic stories of bondage.
If a teacher asks a young student to research Flamingo birds on the Internet, the unfortunate student has to read through the mass of unrelated sites to find one site that offers some appropriate information. Even so, the best and most useful sites are not found in the first 20 results; they tend to show up after 50 sites or more. In particular, a somewhat illiterate student is stymied when words are the exclusive means of understanding the multitude of website hits.
Similarly, a search using “Hilton Paris” results in stories about Paris Hilton (including her personal tapes) and Paris, France (however, the latter is presented in a lower priority due to lower interest, or current popularity). Young people are very fond of Paris Hilton.
The World Wide Web is cluttered with everything imaginable. Now, web surfers are deluged with links to sites that have nothing to do with their target subject matter. Ironically, the very abundance of results is the main limitation of text-based searches. It is unfortunate that such a marvelous opportunity is dramatically diminished by the inability to exclude unrelated information. And the searching experience is, all too often, contaminated with unwanted material.
Parents, understandably, have serious concerns about their children's Internet surfing experience. Few solutions are available that effectively restrict access to inappropriate websites. There have been many heated debates about freedom of speech and inappropriate websites, which are easily accessible to children. Governments have great difficulty enforcing any constraints on website materials or how these sites restrict or prevent access by children.
U.S. Pat. No. 6,868,525 to Szabo, issued Mar. 15, 2005, discusses much of the same background to this searching problem as follows. The Internet presents a vast, relatively unstructured repository for information, leading to a need for Internet search engines and access portals based on Internet navigation. The Internet's very popularity is based on its “universal” access, low access and information distribution costs, and suitability for conducting commercial transactions. However, this popularity, in conjunction with the non-standardized methods of presenting data and fantastic growth rate, has made locating desired information and navigating through the vast space difficult. Thus, improvements in human consumer interfaces for relatively unstructured data sets are desirable, wherein subjective improvements and wholesale adoption of new paradigms may both be valuable, including improved methods for searching and navigating the Internet.
Generally speaking, search engines for the World Wide Web (WWW, or simply “Web”) aid users in locating resources among the estimated present one billion addressable sites on the Web. Search engines for the Web generally employ a type of computer software called a “spider” to scan a proprietary database that is a subset of the resources available on the Web. All the search engines and metasearch engines, which are servers, operate with the aid of a browser, which is a client, and deliver to the client a dynamically generated web page that includes a list of hyperlinked universal resource locators (URLs) through which the web browser can directly access the referenced documents themselves.
A Uniform Resource Identifier (URI) is the name for the standard generic object in the World Wide Web. Internet space is inhabited by many points of content. A URI is the way you identify any of those points of content, whether it be a page of text, a video or sound clip, a still or animated image, or a program. The most common form of URI is the Web page address, which is a particular form or subset of URI called a URL. A URI typically describes: the mechanism used to access the resource; the specific computer that the resource is housed in; and the specific name of the resource (a file name) on the computer.
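The three parts of a URI described above can be illustrated with a minimal sketch. It assumes Python's standard `urllib.parse` module; the function name `describe_uri` and the example address are hypothetical and chosen only for illustration.

```python
from urllib.parse import urlsplit


def describe_uri(uri: str) -> dict:
    """Split a URI into the three parts named in the text (illustrative only)."""
    parts = urlsplit(uri)
    return {
        "mechanism": parts.scheme,   # how the resource is accessed, e.g. http
        "computer": parts.hostname,  # the machine that houses the resource
        "resource": parts.path,      # the name (path) of the resource on it
    }


# Hypothetical example address, not taken from the source text.
info = describe_uri("http://www.example.com/birds/flamingo.html")
print(info)
```

Here the scheme corresponds to the access mechanism, the host name to the specific computer, and the path to the name of the resource on that computer.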
The structure of the World Wide Web includes multiple servers at distinct nodes of the Internet, each of which hosts a web server which transmits a web page in hypertext markup language (HTML) or extensible markup language (XML) (or a similar scheme) using the Hypertext Transfer Protocol (HTTP). Each web page may include embedded hypertext linkages, which direct the client browser to other web pages, which may be hosted within any server on the network. A domain name server translates a top-level domain (TLD) name into an Internet protocol (IP) address, which identifies the appropriate server. Thus, Internet web resources, which are typically the aforementioned web pages, are typically referenced with a URL, which provides the TLD or IP address of the server, as well as a hierarchical address for defining a resource of the server, e.g., a directory path on a server system.
A hypermedia collection may be represented by a directed graph having nodes that represent resources and arcs that represent embedded links between resources. Typically, a user interface, such as a browser, is utilized to access hyperlinked information resources. The user interface displays information “pages” or segments and provides a mechanism by which that user may follow the embedded hyperlinks. Many user interfaces allow selection of hyperlinked information via a pointing device, such as a mouse. Once selected, the system retrieves the information resource corresponding to the embedded hyperlink.
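The directed-graph view of a hypermedia collection can be sketched as follows. This is only an illustration under stated assumptions: the page names in `links` are hypothetical, and the breadth-first traversal simply models a user following every embedded hyperlink from a starting page.

```python
from collections import deque

# Hypothetical collection: each key is a resource (node) and each list holds
# the resources its embedded hyperlinks point to (arcs).
links = {
    "home": ["birds", "hotels"],
    "birds": ["flamingo"],
    "hotels": ["home"],
    "flamingo": [],
}


def reachable(start, graph):
    """Return the pages reachable from `start` by following links, in visit order."""
    seen = [start]
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in graph.get(page, []):
            if target not in seen:  # skip pages already visited (handles cycles)
                seen.append(target)
                queue.append(target)
    return seen


print(reachable("home", links))
```

Because arcs are directed, a page such as "flamingo" can be reached from "home" without any link leading back.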
One approach to assisting users in locating information of interest within a collection is to add structure to the collection. For example, information is often sorted and classified so that a large portion of the collection need not be searched. However, this type of structure often requires some familiarity with the classification system, to avoid elimination of relevant resources by improperly limiting the search to a particular classification or group of classifications. Another approach used to locate information of interest to a user is to couple resources through cross-referencing. Conventional cross-referencing of publications using citations provides the user enough information to retrieve a related publication, such as the author, title of publication, date of publication, and the like. However, the retrieval process is often time-consuming and cumbersome. A more convenient, automated method of cross-referencing related documents utilizes hypertext or hyperlinks. Hyperlink systems allow authors or editors to embed links within their resources to other portions of those resources or to related resources in one or more collections that may be locally accessed, or remotely accessed via a network. Users of hypermedia systems can then browse through the resources by following the various links embedded by the authors or editors. These systems greatly simplify the task of locating and retrieving the documents when compared to a traditional citation, since the hyperlink is usually transparent to the user. Once selected, the system utilizes the embedded hyperlink to retrieve the associated resource and present it to the user, typically in a matter of seconds. The retrieved resource may contain additional hyperlinks to other related information that can be retrieved in a similar manner.
A well-recognized problem with existing search engines is the tendency to return hits for a query that are so incredibly numerous, sometimes in the hundreds, thousands, or even millions, that it is impractical for users to wade through them and find relevant results. Many users, probably the majority, would say that the existing technology returns far too much “garbage” in relation to pertinent results. This has led to the desire among many users for an improved search engine, and in particular an improved Internet search engine.
In response to the garbage problem, search engines have sought to develop unique proprietary approaches to gauging the relevance of results in relation to a user's query. Such technologies employ algorithms that limit the records returned in the selection process (the search) and/or sort selected results from the database according to a rank or weighting, which may be predetermined or computed on the fly. The known techniques include counting the frequency or proximity of keywords, measuring the frequency of user visits to a site or the persistence of users on that site, using human librarians to estimate the value of a site and to quantify or rank it, measuring the extent to which the site is linked to other sites through ties called “hyperlinks” (see, Google.com and Clever.com), measuring how much economic investment is going into a site (Thunderstone.com), taking polls of users, or even ranking relevance in certain cases according to advertisers' willingness to bid the highest price for good position within ranked lists. As a result of relevance testing procedures, many search engines return hits in presumed rank order of relevance, and some place a percentage next to each hit which is said to represent the probability that the hit is relevant to the query, with the hits arranged in descending percentage order.
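The simplest of the techniques listed above, counting keyword frequency, can be sketched as follows. This is a toy illustration and not the method of any named search engine; the function `rank_by_frequency` and the three sample documents are hypothetical.

```python
def rank_by_frequency(query, documents):
    """Rank document names by how often the query terms occur in each text."""
    terms = query.lower().split()
    scored = []
    for name, text in documents.items():
        words = text.lower().split()
        score = sum(words.count(term) for term in terms)
        if score > 0:  # the "search" step: drop records with no match at all
            scored.append((score, name))
    # the "sort" step: highest score first, ties broken alphabetically by name
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]


# Hypothetical sample records.
docs = {
    "hotel": "flamingo hotel las vegas",
    "bird": "the flamingo is a wading bird and the flamingo is pink",
    "coupons": "free online coupons",
}
ranking = rank_by_frequency("flamingo", docs)
print(ranking)
```

The sketch also shows the limitation discussed in this section: the hotel page is still returned for a query about the bird, because a word count alone cannot distinguish the two meanings.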
However, despite the apparent sophistication of many of the relevance testing techniques employed, the results typically fall short of the promise. Thus, there remains a need for a search engine for uncontrolled databases that provides the user with results that accurately correspond to the information sought.
Therefore, the art requires improved searching strategies and tools to provide increased efficiency in locating a user's desired content, while preventing dilution of the best records with those that are redundant, off-topic or irrelevant, or directed to a different audience.
As the amount of information available to a computer user increases, the problem of coherently presenting the range of available information to the computer user in a manner which allows the user to comprehend the overall scope of the available information becomes more significant. Furthermore, coherent presentation of the relationship between a chosen data unit of the available information to the rest of the available information also becomes more significant with the increase of information available to the user. Most of the existing methods utilize lists (e.g., fundamentally formatted character-based output), not graphic models, to indicate the structure of the available information. The main problem associated with the use of lists is the difficulty of indicating the size and complexity of the database containing the available information. In addition, because the lists are presented in a two-dimensional format, the manner of indicating the relationship between various data units of the available information is restricted to the two-dimensional space. Furthermore, because presentation of the lists normally requires a significant part of the screen, the user is forced to reduce the amount of screen occupied by the list when textual and visual information contained in the database is sought to be viewed. When this occurs, the user's current “position” relative to other data units of the available information is lost. Subsequently, when the user desires to reposition to some other data unit (topic), the screen space occupied by the lists must be enlarged. The repeated sequence of adjusting the screen space occupied by the lists tends to distract the user, thereby reducing productivity.
A user's knowledge of the subject represented in the hypermedia is a particularly important user feature for adaptive hypermedia systems. Many adaptive presentation techniques rely on a model of the user's knowledge of the subject area as a basis for adaptation. This means that an adaptive hypermedia system that relies on an estimate of the user's knowledge should update the user model when the user has presumably learned new things. Further, a preferred user model according to the present invention preferably also models decay of memory.
There are two common ways of representing users' knowledge in an adaptive hypermedia system. The most often used model is the overlay model that divides the hypermedia universe into different subject domains. For each subject domain in the hypermedia universe, the user's knowledge is specified in some way. The user's knowledge of a particular subject domain can be given the value known or unknown, or for instance a fuzzy semantic variable such as good, average or poor. On the other hand, a numeric or continuous metric may be provided. The user's knowledge may also be represented as a value of the probability that the user knows the subject. An overlay model of the user's knowledge can then be represented as a set of concept-value pairs, one pair for each subject.
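An overlay model expressed as concept-value pairs can be sketched as a simple mapping. The subject names, the probability values, and the thresholds used to derive a fuzzy label are all assumptions made for illustration; the text does not prescribe any of them.

```python
# Hypothetical overlay model: one concept-value pair per subject domain,
# where the value is the estimated probability that the user knows the subject.
overlay = {"birds": 0.9, "hotels": 0.4, "table tennis": 0.0}


def knowledge_label(probability):
    """Map a probability to a fuzzy label (thresholds are arbitrary assumptions)."""
    if probability >= 0.7:
        return "good"
    if probability >= 0.3:
        return "average"
    return "poor"


labels = {subject: knowledge_label(p) for subject, p in overlay.items()}
print(labels)
```

The same mapping could instead hold binary known/unknown values or any other numeric metric, as the paragraph above describes.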
The other approach, apart from the overlay model, is the stereotype user model, in which every user is classified as one of a number of stereotypes concerning a particular subject or area. There can be several subareas or subjects, so one user can be classified as a different stereotype for different subjects. For instance, a novice stereotype, an intermediate stereotype and an expert stereotype can be defined for one subject in a system, and every user is therefore classified as one of an expert, novice or intermediate on that particular subject. This scheme is much simpler to implement but carries the disadvantage of not being able to tailor the appearance of the system to every individual user.
In some adaptive hypermedia systems, the user's background is considered relevant. The user's background means all information related to the user's previous experience, generally excluding the subject of the hypermedia system, although this exclusion is not necessary in all cases. This background includes the user's profession, experience of work in related areas and also the user's point of view and perspective.
The user's experience in the given hypermedia system means how familiar the user is with the appearance and structure of the hyperspace, and how easily the user can navigate in it. The user may have used the system before, but not have deep knowledge of the subject. On the other hand, the user can know a lot about the subject, but have little experience of the hypermedia system. Therefore it is wise to distinguish between the user's knowledge and the user's experience, since optimal adaptations for each factor may differ.
The user's preferences are used in adaptive information retrieval systems mostly where they are the only stored data in the user model. Users' preferences are considered special among user modeling components, since they cannot be deduced by the system itself. The user has to inform the system directly, or by giving simple feedback to the system's actions. This suggests that users' preferences are more useful in adaptable systems than in adaptive systems. However, users' preferences can be used by adaptive hypermedia systems as well. Some researchers have found that adaptive hypermedia systems can generalize the user's preferences and apply them to new contexts. Preferences are often stored as numeric values in the user profile, contrary to the case for other data, which is often represented symbolically. This makes it possible to combine several users' preferences, in order to formulate group user models. Group models are useful when creating a starting model for a new user, where this user can define his or her preferences, and then a user model is created based on the user models of other users who are in the same “preference group”.
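Combining numeric preferences into a group user model can be sketched as a per-topic average. Averaging is only one possible way to combine profiles and is an assumption here, as are the function name `group_model` and the sample values.

```python
from statistics import mean


def group_model(profiles):
    """Average each topic's numeric preference over the profiles that rate it."""
    topics = set().union(*(profile.keys() for profile in profiles))
    return {
        topic: mean(p[topic] for p in profiles if topic in p)
        for topic in sorted(topics)
    }


# Hypothetical profiles of two users in the same preference group.
existing_users = [
    {"birds": 1.0, "travel": 0.25},
    {"birds": 0.5},
]
group = group_model(existing_users)

# A new user starts from a copy of the group model and can then adjust it.
new_user = dict(group)
new_user["travel"] = 0.75
print(group, new_user)
```

Copying the group model before editing keeps the new user's adjustments from altering the shared starting point.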
Machine learning and the use of intelligent agents are useful techniques for adapting the user interface to different users' needs. The reason for this is that the same user can have different needs at different times, and therefore the system must respond to the user, and examine the user's actions, in order to understand what the user needs. In other systems that use user modeling, for instance, in film recommending systems, the system already knows what the user wants and the interaction with the user is not as important.