This invention relates generally to computer network data operations and, more particularly, to an apparatus for generating and updating databases for the retrieval of information.
The Internet is a vast collection of documents that is accessible to a very large number of users worldwide. The Internet is constantly in flux, as new documents are added and older documents are removed. The documents are typically written in hypertext mark-up language (HTML) and can include a mixture of text, graphic, audio and video elements. These documents comprise what is referred to as the “World Wide Web” and are also called web pages. Internet users can utilize a wide variety of Internet search engines that can be accessed with web browsers to locate and retrieve web pages that provide useful information. A user provides a search query, usually a string of words on a topic of interest, to a search engine, which then applies the search query to a database of web pages. Links to matching pages are returned to the user, typically ranked according to a similarity score. Some of the currently popular search engines include “Alta Vista™”, “Lycos™”, “Yahoo™”, “Google™” and “Infoseek™”.
The database searched by each search engine is usually a proprietary database, created by the search engine operator. Often, the search engine database comprises a reverse-lookup table of individual words with links to the web documents in which they are found. A web page that contains multiple instances of the words in a search query has a higher similarity score than a web page that contains fewer instances of those words. Likewise, a web page that contains all the words from a search query will have a higher similarity score than a page that does not contain all the words from the search query. Although this type of matching will generally lead to valid results, such search techniques can locate a fair number of duplicate and irrelevant documents.
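The reverse-lookup table and similarity score described above can be illustrated with a minimal sketch. It is not the implementation of any particular search engine; the page addresses, the page text, and the function names `build_index` and `search` are hypothetical, and the scoring rule (pages matching more distinct query words rank first, with ties broken by total occurrences) is an assumption chosen to agree with the two statements above.

```python
from collections import defaultdict


def build_index(pages):
    """Build a reverse-lookup table: word -> {page address: occurrence count}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word][url] = index[word].get(url, 0) + 1
    return index


def search(index, query):
    """Return page addresses ranked by an assumed similarity score."""
    matched = defaultdict(int)  # number of distinct query words found on the page
    total = defaultdict(int)    # total occurrences of query words on the page
    for word in set(query.lower().split()):
        for url, count in index.get(word, {}).items():
            matched[url] += 1
            total[url] += count
    # Pages matching more query words rank first; ties go to more occurrences.
    return sorted(matched, key=lambda u: (-matched[u], -total[u], u))


# Hypothetical pages used only for illustration.
pages = {
    "a.example/wildlife": "bears and wolves roam north american forests while bears hibernate",
    "b.example/stocks": "bulls and bears drive the stock market",
    "c.example/coffee": "java coffee house",
}
index = build_index(pages)
results = search(index, "bears forests")
print(results)
```

In this sketch the stock market page is still returned for a wildlife query because it contains the word "bears", which is the kind of irrelevant match that pure word matching can produce.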
Most search engines rely on programs called “crawlers” or “spiders” that search the Internet for new documents that are made accessible to Internet users by storage at a web server computer. The contents of such documents are read for their word content, and links to these documents (their Internet addresses) are automatically added to the reverse-lookup database of the search engine. Alternatively, humans can review the documents and make a determination of categories into which the documents should be indexed. The search engine database is then modified to include the reviewed documents, so that links are inserted into the database according to the categories decided upon. In this way, the respective search engines include virtually all of the documents that may be found on the web.
Users can then access the search engine and provide a query. The search engine applies the query against the database and returns matches to the user. Unfortunately, the search results can easily become over-inclusive and return irrelevant links. For example, a search for information on North American wildlife may return links to discussions of stock market “bulls” and “bears”. A search for Java™ programming developments may return links to coffee houses. This type of over-inclusion requires reviewing the search results and discarding the links that are identified as irrelevant, which can be a very inefficient use of time. As the number of links to the web increases, an over-inclusive search can result in inadvertent obfuscation rather than elucidation of the sought-after relevant information.
One way to increase the relevancy of Internet documents located by a search engine is to limit the breadth of the search that is conducted. For example, a search may be limited to web pages found at a particular web site or Internet domain name. This technique works well if one is searching only for a web page at a particular site. The technique is not particularly useful if a more generalized subject matter search is desired, as the search will then be under-inclusive and many relevant documents will be missed.
Aside from being an ever-growing repository for information, the Internet environment, and the World Wide Web in particular, has become a nexus for commercial activity. A key factor for commercial success in the Internet environment is the ability of a web site to attract the web surfer. Recent trends and activity have seen development of a business strategy based on Vertical Portals. A Vertical Portal or “vortal” is a web site that is focused on a specific topic or several topics. The commercial advantage of such a site is that it provides the web advertiser with a narrow and well-defined audience to which it can present its products and/or services. The commercial success of vortals such as ZDNet™ and eTrade™ has demonstrated the viability of this strategy. One of the features that attract the defined audience to continually return to a vortal is often the accessibility of a database that is focused on a specific area of interest. Vortals are increasingly receiving more traffic and repeat traffic, demonstrating that users are indeed in search of better, more relevant information. A further indication of the success of vortals is their ability to attract and charge higher advertising rates, due to their well-defined audience. New vertical portals are projected to launch in vast numbers in the future.
From the discussion above, it should be apparent that there is a need for a database search technique that will provide relevant search results without unduly limiting the scope of the search. In addition, with the increasing number of vortals and commercial enterprises on the web there is a continuing need for an efficient method of generating and managing online databases. The present invention fulfills these needs and others.
An automated method of creating or updating a database of resumes and related documents, the method comprising:
a) entering at least one example document that is relevant to a subject taxonomy in a retrieval priority list and, if there is a plurality of example documents stored in the retrieval priority list, ranking the example documents according to the relevancy of the example documents to the subject taxonomy;
b) retrieving a document from a network of documents, where the document is the most relevant document to the subject taxonomy stored in the retrieval priority list;
c) harvesting information from specified fields of the document;
d) classifying the information into one or more classes according to specified categories of the subject taxonomy;
e) storing the information into a database;
f) determining whether the information includes links to other documents;
g) ranking the links according to relevancy to the subject taxonomy, and storing the links in the retrieval priority list according to the relevancy;
h) terminating the method, provided the method's stop criteria have been met; and
i) repeating steps b) through h), provided the method's stop criteria have not been met.
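Steps a) through i) can be sketched as a loop over a priority queue. This is only an illustrative sketch under stated assumptions, not the claimed implementation: the subject taxonomy is assumed to be a mapping from category names to term sets, relevancy is assumed to be the count of distinct taxonomy terms found in a text, links are assumed to be ranked by their anchor text, the stop criterion is assumed to be a document limit, and the names `crawl`, `fetch`, `relevance`, `classify`, and the in-memory `WEB` network are all hypothetical.

```python
import heapq


def tokens(text):
    return set(text.lower().split())


def relevance(text, taxonomy):
    """Assumed relevancy score: count of distinct taxonomy terms found in text."""
    words = tokens(text)
    return sum(len(terms & words) for terms in taxonomy.values())


def classify(text, taxonomy):
    """Return the taxonomy categories whose terms appear in text."""
    words = tokens(text)
    return sorted(cat for cat, terms in taxonomy.items() if terms & words)


def crawl(seeds, fetch, taxonomy, fields, max_docs):
    # a) Enter example documents in the priority list, ranked by relevancy.
    #    heapq is a min-heap, so scores are negated to pop the highest first.
    queue = [(-relevance(desc, taxonomy), url) for url, desc in seeds]
    heapq.heapify(queue)
    seen = {url for _, url in queue}
    database = []
    # h), i) Assumed stop criteria: empty priority list or document limit reached.
    while queue and len(database) < max_docs:
        _, url = heapq.heappop(queue)  # b) retrieve the most relevant document
        doc = fetch(url)
        if doc is None:
            continue
        # c) Harvest information from the specified fields only.
        harvested = {f: doc["fields"].get(f, "") for f in fields}
        text = " ".join(harvested.values())
        # d), e) Classify the information and store it in the database.
        database.append({"url": url, "info": harvested,
                         "classes": classify(text, taxonomy)})
        # f), g) Rank any links and store them in the priority list.
        for link_url, anchor in doc.get("links", []):
            if link_url not in seen:
                seen.add(link_url)
                heapq.heappush(queue, (-relevance(anchor, taxonomy), link_url))
    return database


# Hypothetical in-memory network of documents, used only for illustration.
WEB = {
    "r1": {"fields": {"title": "resume python java", "body": "university degree",
                      "footer": "chess"},
           "links": [("r2", "python developer resume"),
                     ("x1", "coffee house menu"),
                     ("r3", "java university degree resume")]},
    "r2": {"fields": {"title": "python"}, "links": []},
    "r3": {"fields": {"title": "java", "body": "degree"}, "links": []},
    "x1": {"fields": {"title": "espresso"}, "links": []},
}
taxonomy = {"skills": {"python", "java"}, "education": {"university", "degree"}}
db = crawl([("r1", "python java degree")], WEB.get, taxonomy,
           fields=("title", "body"), max_docs=3)
print([entry["url"] for entry in db])
```

Because the priority list is ordered by relevancy, the link with the most taxonomy terms in its anchor text is retrieved before less relevant links, and the unrelated coffee house page is never retrieved before the document limit is reached.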