1. Field of the Invention
The invention relates to researching and organizing information from a plurality of sources. More particularly, the invention relates to computer-assisted mining and organization of information from electronic sources.
2. Brief Description of the Prior Art
Never in the history of humanity has there been so much information available to so many people. The advent of the World Wide Web in the early 1990s created the ability to access information stored in computer databases all over the world from any computer connected to the PSTN (public switched telephone network). According to the Online Computer Library Center, Inc. (http://wcp.oclc.org/), there were approximately 2,851,000 web sites in 1998 and approximately 8,712,000 in 2002. Although growth has slowed, the number of websites is still increasing every year.
Many websites contain little or no useful information. However, there are also many websites which contain a wealth of valuable information. The difficulty is in locating and organizing the available information. Many so-called search engines attempt to organize the content of the World Wide Web. The best known are, perhaps, Yahoo and Google. While these search engines are helpful for the casual user, they are incomplete and often inaccurate. Moreover, information retrievable from the Internet is not formatted in a standard uniform structure. For example, data may be in HTML format, PDF format, Microsoft Word (.doc) format, tab-delimited format, XML format, etc. Even information found in the same document format is often presented differently from one source to another. For example, data may be tabulated in some sources and described in free text in others. Additionally, different lexicons are often used to describe the same features. Thus, in order to mine information for use in a queryable database, the information must be restructured to a uniform view.
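By way of illustration only, the restructuring problem described above can be sketched in a few lines of Python. The sketch is not part of the prior art or of the invention; the lexicon, the field names, and the two input layouts are hypothetical and were chosen solely to show the same facts, arriving in different formats and vocabularies, being mapped to one uniform record.

```python
import csv
import io
import re

# Hypothetical lexicon: different source vocabularies mapped to one canonical field name.
LEXICON = {
    "name": "product",
    "item": "product",
    "product": "product",
    "cost": "price",
    "msrp": "price",
    "price": "price",
}


def canonical(field):
    """Return the canonical name for a source field, or None if it is unknown."""
    if field is None:
        return None
    return LEXICON.get(field.strip().lower())


def from_tab_delimited(text):
    """Parse tab-delimited text (first row is a header) into uniform records."""
    records = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        record = {}
        for field, value in row.items():
            key = canonical(field)
            # Skip unknown columns and cells missing from short rows.
            if key is not None and isinstance(value, str):
                record[key] = value.strip()
        records.append(record)
    return records


def from_free_text(text):
    """Extract 'field: value' pairs separated by semicolons into one uniform record."""
    record = {}
    for field, value in re.findall(r"(\w+)\s*:\s*([^;]+)", text):
        key = canonical(field)
        if key is not None:
            record[key] = value.strip()
    return record


if __name__ == "__main__":
    tabular = "Name\tMSRP\nWidget\t9.99\n"
    prose = "Item: Widget; Cost: 9.99"
    print(from_tab_delimited(tabular))
    print(from_free_text(prose))
```

Both inputs yield a record with the same two keys, `product` and `price`, even though the sources use different layouts and different words for the same features. Real sources would of course require far more robust parsing than this sketch assumes.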
Businesses have always recognized that accurate, precise, coherent data is a powerful tool for making sound business decisions. Many businesses have realized that extremely valuable information can be mined from the World Wide Web as well as from other Internet resources such as “news groups” and “ftp sites” and from their own electronic data. However, successfully retrieving and organizing this information is costly and time-consuming. The state-of-the-art approach is to employ skilled data and domain experts to manually extract, classify, structure and categorize data. This process can take up to an hour for a single data entry. In addition to mining information from the Internet, it would be desirable to integrate that information with existing “legacy data” in a company's own electronic file system. Much of this data is only semi-structured, e.g. tabular data in a text document, or completely unstructured, e.g. free-flowing text.