Information, particularly technical information, is becoming ever more available. For example, full-text U.S. patents are available on the World Wide Web, and patent text data or images are available commercially. Some scientific journals that had been available only in technical libraries are now available on the World Wide Web. Trade journals, formerly sent in printed form only to area specialists, are now available on the World Wide Web. Trade organizations have websites that are information rich. Search engines make retrieving raw information from the World Wide Web fast and easy. Many newspapers and magazines that were previously delivered only in print format are now available in electronic formats. Further, commercial sources of similar data are also available. For example, marketing, technical and legal information can be purchased in paper and electronic form from organizations such as Derwent or Lexis-Nexis, among others. The problem librarians, researchers, and managers now face is organizing information in such a way that it can be easily assimilated, interpreted, and acted upon, particularly for business decisions where speed and accuracy are critical.
Document Information Sources and Uses
Patents are becoming ever more central to the success of technology-related businesses. Patenting activity in almost all areas of technology has exploded in recent years. Some have suggested that 90–95% of all the world's inventions can be found in patent documents [1]. Patents are considered to be a key in translating inventions to commercial use [2]. Innovation is considered to be the driving force behind competition, economic growth and the creation of jobs [3]. Patents play a central role in recording innovation and in business strategy development. FIG. 1 illustrates the unique position of patents in the information flow pathway from basic science through patents to market launch [4]. The existence of a strong linkage between patenting activity and basic science has been postulated [5].
Because patents record innovation and are keys to protecting products, they contain a wide variety of information that is useful in making business investment and technology development decisions. The description of methods, procedures, synthetic routes, compounds, devices, and other information that is included in a patent provides a picture of the invention itself. A patent typically also provides the author's vision of the context in which the invention is meaningful. Hence, a patent typically contains information on other “similar” inventions and the status of the basic science that supports the invention. A patent can also contain a description of the market conditions that make the patent useful. Because a patent is highly structured, relevant information is located in predictable locations within the patent, a fact that can facilitate analysis with automated or semi-automated methods.
Scientific journal articles chronicle the development of an idea, theory, concept or product. Kuhn describes one theory of scientific development in his book, The Structure of Scientific Revolutions [6]. Because scientific journal articles are typically peer reviewed, some consider journal articles to be accurate representations of the current understanding of the laws of science.
Scientific journal articles are typically highly structured documents. Titles convey the topic discussed in the article, often in significant detail. An abstract provides a summary of the scientific methods used and of the experimental results. Often, an abstract contains a one or two sentence conclusion that describes the implications of the results. In most journals, the experimental and data analysis methods are described in sufficient detail for one skilled in the art to reproduce the experimental protocol. The results are summarized and the conclusions state the implications of the results. Finally, a journal article contains references to previous work by others, a rich source that describes the author's vision of the area's historically important documents. Because the scientific journal article is highly structured, relevant information is located in approximately predictable locations, facilitating analysis with automated or semi-automated methods.
Trade journal articles are typically organized similarly to scientific journal articles. Trade journals chronicle the commercialization of ideas, methods and products, and can contain a more detailed picture of the commercial or market situation than might be expected in a scientific publication. The title of a trade journal article conveys the main topic discussed in the article. Abstracts provide summary information, and the body of the article typically provides details about the methods and instruments used, together with the results. The article often provides a summary of the results that is similar to an abstract. Trade journals often cite some relevant literature, giving the reader a window on the literature that the author considers to be relevant. The structure of a trade journal article is typically similar to that of a scientific journal article, and in trade journals, relevant information is located in approximately predictable locations, facilitating analysis with automated or semi-automated methods.
Newspapers, magazine articles, newsletters, advertisements and other popular publications can provide a picture of the popular perception of an area of business or technology. Newspaper and magazine articles and other publications are usually structured. The articles usually have a title or a headline that describes the content of the article, and a body that describes the area. Typically, newspaper and magazine articles do not provide summary abstracts or references. Because newspapers, magazine articles, newsletters, advertisements and other popular publications are structured, relevant information is located in approximately predictable locations, facilitating analysis with automated or semi-automated methods. An automated or semi-automated analytical method that could help link popular perception about an area with technical or patent information would be a valuable assessment tool.
Marketing research information or reports document public perception or participation in a market area. Certain technical documents, especially patents and trade journals, can be expected to use the same or similar vocabulary that is found in market research reports. An automated or semi-automated analytical method that could link market research and technical information would be a valuable business asset.
Web sites posted on the World Wide Web by various organizations can be a useful source of information. The structure of websites and the quantity and accuracy of information contained in them is highly variable. Yet, a basic structure that permits analysis still exists. Most websites have a title that contains some information describing the site contents. Some sites provide summaries or indexes to their sites. Methods that could identify common vocabulary and help link web sites with common information would be beneficial tools for business and technology investment decisions.
A Web search engine can also provide information that can be analyzed to better understand the focuses of an area that has been searched. For example, the results returned by Alta Vista or Google, among other search engines, are organized into titles and phrases that can be analyzed to show relationships among various sites. The vocabulary retrieved by the web search engine can help link abstracts or hits with common content, allowing rapid indexing and organization of data retrieved from a common web search.
Document Analysis Methods
The document sources cited contain substantial information on technology, competitors, and market status indications. Collections of documents can be studied to provide a picture of business, technology or market area development. A variety of methods have been employed to attempt to extract business, technology or market information from document collections including expert analysis, patent or document counting statistics, patent class analysis, bibliographic methods such as co-occurrence analysis, including co-word or co-citation analysis, technology roadmapping, and methods from artificial intelligence such as genetic algorithms, Bayesian learning methods, Markov models and the like. The historical development and technical aspects of bibliographic methods have been summarized [7, 8, 9, 10].
Co-word analysis is a bibliographic method that allows an exploration of the vocabulary used in a document set in order to identify major themes within the document collection. In co-word analysis the frequency of appearance of selected keywords or phrases is measured. A major premise of co-word analysis is that the co-word pairs that are used frequently indicate major topics that run throughout a set of documents.
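As a rough illustration of the idea, the following Python sketch counts how often pairs of selected keywords appear together in the same document. It is a minimal sketch under stated assumptions, not the method of any cited reference: the function name co_word_counts, the sample documents, the keyword list, and the choice to count co-occurrence once per document (rather than per sentence or per window of words) are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_word_counts(documents, keywords):
    """Count, for each keyword pair, the number of documents containing both.

    Assumption: co-occurrence is measured at the whole-document level and
    words are separated by whitespace. Real collections would need proper
    tokenization and phrase handling.
    """
    keyword_set = {k.lower() for k in keywords}
    pair_counts = Counter()
    for text in documents:
        words_in_doc = set(text.lower().split())
        # Sort so each pair is always stored in the same order.
        present = sorted(words_in_doc & keyword_set)
        pair_counts.update(combinations(present, 2))
    return pair_counts


# Hypothetical sample data for illustration only.
docs = [
    "polymer coating improves battery electrode life",
    "battery electrode with polymer binder",
    "polymer coating for steel pipe",
]
counts = co_word_counts(docs, ["polymer", "coating", "battery", "electrode"])

# Pairs that co-occur most often suggest candidate themes.
for pair, n in counts.most_common():
    print(pair, n)
```

Under these assumptions, keyword pairs with the highest counts would be the candidates for major themes in the collection.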
Co-word analysis can be implemented with a simple database [11, 12], and linkage maps can reveal overall structure of a research area [13]. However, published maps do not appear to reveal the details of the area that are needed for business opportunity assessments.
Co-citation analysis is a bibliographic method that measures the frequency with which two references appear together in the references of a scientific journal article. Co-citation analysis has been successfully applied to studying the structure of science through references in scientific journal articles [14]. Co-citation analysis with U.S. patent references has been used to assist in corporate licensing decisions [15], but has not been successfully applied to scientific references in patents because the formats of the references are highly heterogeneous [14].
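The co-citation count described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name co_citation_counts, the article identifiers, and the reference labels are hypothetical, and the sketch assumes that references have already been normalized to consistent identifiers, which is precisely the step that heterogeneous reference formats make difficult.

```python
from collections import Counter
from itertools import combinations


def co_citation_counts(reference_lists):
    """Count how many citing documents list each pair of references together.

    Assumption: each reference is already a normalized identifier, so two
    citations of the same work compare equal.
    """
    pair_counts = Counter()
    for refs in reference_lists:
        # De-duplicate and sort so each pair has one canonical ordering.
        unique_refs = sorted(set(refs))
        pair_counts.update(combinations(unique_refs, 2))
    return pair_counts


# Hypothetical citing articles and their reference lists.
articles = {
    "article_1": ["ref_A", "ref_B", "ref_C"],
    "article_2": ["ref_A", "ref_B"],
    "article_3": ["ref_B", "ref_C"],
}
co_cited = co_citation_counts(articles.values())

for pair, n in co_cited.most_common():
    print(pair, n)
```

In this sketch, a high count for a pair indicates that citing authors frequently treat the two references as related.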
Methods from artificial intelligence, data mining and knowledge discovery can also be used to reveal relationships among words. These fields employ methods such as genetic algorithms, Bayesian learning, neural networks, Markov models, hidden Markov models, partial least squares, and principal component analysis, among others.
Many artificial intelligence methods have elements in common. Most methods start by building mathematical models that describe the document collection, including document vocabulary and sometimes document structure. The data from which a model is constructed is often called a training set, and often must include examples of all the situations that the model is expected to be able to find or detect. After the mathematical model has been constructed, the model can be used to identify new structures, phrases, images and the like that are similar to the structures, phrases, and images used in developing the mathematical model. Careful choice of the training set is required. Otherwise, the model can respond with unwelcome surprises or provide answers that are actually incorrect.
Document Analysis Method Examples
In U.S. Pat. No. 5,991,751, Aurigin discloses a system of multiple databases that correlates patent information with bill of materials information, personnel information, R&D spending information among many other information types. Citation analysis is the primary bibliometric method disclosed.
Aurigin, in collaboration with Sandia Laboratories and the Institute for Scientific Information, has produced topographical maps of technology for a given point in time. While these maps provide a good visual representation of linkages among scientific and technological endeavors, they do not provide the resolution or time perspective needed for most business decisions.
Co-word analysis of documents has also been used. The Secretary of the Navy (U.S. Pat. No. 5,440,481) describes a process by which repeated themes can be identified if the words composing the themes are adjacent to one another.
Xerox (U.S. Pat. No. 6,038,574) describes a co-citation analysis method in which hyperlinks in web pages are viewed as references, and the relationships found have been used to help create a web page index. None of the methods provides adequate detail for the analysis of business or technical information.
Artificial intelligence methods have been used to identify images or military targets. Artificial intelligence methods, particularly Markov models, have been used in voice recognition, handwriting recognition and identification of genetic sequences. The application of Markov models to the analysis of sequences, including word sequences and gene sequences, is well known in the art. Several patents that describe voice recognition and text to speech methods apparently employ Markov models. U.S. Pat. Nos. 6,003,005 and 5,230,037 provide examples of speech recognition applications of Markov models. With appropriate training, a reasonable, but not perfect, voice to text conversion can be achieved with popular software such as Dragon Naturally Speaking™, and others. The National Technical Information Service offers software (order number AD-M000 099) that claims to be able to separate multiple simultaneous conversations after appropriate training of a Markov Model.
Artificial intelligence methods can be used to study words and vocabulary contained in documents. With artificial intelligence methods, words or groups of words can be treated as vectors in multidimensional space. Mathematical manipulation of the vector space can reveal relationships among the words or groups of words in a document or a set of documents. In common with most artificial intelligence methods, a mathematical model must be constructed from example data, a training set. A key limitation of artificial intelligence methods is the need to provide an all-encompassing training set, a requirement that can substantially limit the ability of the methods to discover new and unexpected relationships within the data.
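To make the notion of words as vectors concrete, the Python sketch below represents each word by its occurrence counts across a small set of documents and compares words by cosine similarity. It illustrates only the vector representation, not any trained model or any method from the cited patents; the function names term_document_vectors and cosine_similarity, the sample documents, and the use of raw counts as vector components are assumptions made for illustration.

```python
import math


def term_document_vectors(documents, vocabulary):
    """Represent each word as a vector of its counts in each document.

    Assumption: one dimension per document, raw counts as components,
    whitespace tokenization.
    """
    tokenized = [doc.lower().split() for doc in documents]
    return {word: [tokens.count(word) for tokens in tokenized]
            for word in vocabulary}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical sample data for illustration only.
docs = [
    "battery electrode coating",
    "battery electrode binder",
    "steel pipe coating",
]
vectors = term_document_vectors(docs, ["battery", "electrode", "coating"])

# Words used in the same documents point in similar directions.
print(cosine_similarity(vectors["battery"], vectors["electrode"]))
print(cosine_similarity(vectors["battery"], vectors["coating"]))
```

In this sketch, words that appear in the same documents have vectors pointing in similar directions, so their cosine similarity is close to one, while words that share few documents score lower.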