The invention relates to the field of information science. It is particularly pertinent to the creation of historical maps of scientific and scholarly publications and, through such maps, to the illustration of the histories of the theories, ideas, hypotheses and discoveries that advance science and scholarly pursuits in the interest of mankind.
To those who practice the art and science of mapping the historical development of scientific and scholarly ideas, principles and discoveries, a most useful tool is a map of the evolutionary history of a single thread of insight expressed in a publication. By mapping the citation history of an article forward and back, it is possible to discover the primordial (that is, the original) expression of the theory or concept and trace its development over time in succeeding papers on that subject and others closely related to it. These maps are called historiographs (or historiograms). Historiographs aid the study of the contemporary history of scientific topics, because history and bibliography are intimately linked.
There have been many different types of "mapping" exercises performed on a small scale, particularly with respect to clustered files of bibliographic information. The clustering required mainframe computers. These ideas were later extended to creating small cluster maps in the SciMap system developed by Henry Small at the Institute for Scientific Information (ISI). In that mainframe system a starting paper is used to seed the creation of a cluster map.
In spite of the many mapping and visualization techniques available, none of them was applicable to the creation of historiographs. Indeed, no one considered the relationship between historical display and its use in the evaluation of large data sets retrieved from publicly accessible databases such as the Science Citation Index (SCI), Medline and PubSci, to name a few. Government and privately maintained databases can also serve as sources for historiographic analysis, for example the U.S. Patent and Trademark Office patents database and the American Chemical Society publication database.
Even in the early stages of developing the idea of programmed algorithmic historiography, it was considered only in terms of seeding the process by selecting one or more primordial papers. Then the Science Citation Index (SCI) would be used to trace forward in time all the papers that had cited the starting reference. This is the fundamental notion involved in doing a traditional cited reference search.
Indeed, since the idea of creating an historiograph is to display the chronological development of science from the primordial paper forward, it was assumed that searching would be done one year at a time. This was also influenced by the fact that published indexes appear annually both in print and on CD-ROM. In contrast, literature searches are traditionally focused on retrieving the most current material and then working backward. Using the annual CD-ROM version of the SCI, the inventors' initial experiments involved a cited reference search on a single starting paper; all papers that cited it in that one-year file were retrieved. Then a further search was done on those citing papers. The search process was then iterated for as many years of the literature as necessary.
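The year-by-year procedure described above can be sketched as follows. This is a minimal illustration and not the inventors' implementation: the record layout (dictionaries with "id", "year" and "cited" keys) and the function name trace_forward are assumptions made for the example, and each annual file is modeled simply as the subset of records published in that year.

```python
from collections import defaultdict


def trace_forward(seed_ids, records, first_year, last_year):
    """Collect papers that cite the seeds, directly or indirectly, year by year.

    records: list of dicts with hypothetical keys 'id', 'year' and 'cited'
    (the list of ids that the paper cites).
    Returns the set of collected ids and the sorted citation links among them.
    """
    # Group the records into one "annual file" per publication year.
    by_year = defaultdict(list)
    for rec in records:
        by_year[rec["year"]].append(rec)

    collected = set(seed_ids)
    for year in range(first_year, last_year + 1):
        # Re-scan the same annual file until no new citing paper is found,
        # so that papers citing other papers of the same year are picked up.
        changed = True
        while changed:
            changed = False
            for rec in by_year.get(year, []):
                if rec["id"] in collected:
                    continue
                if any(cited in collected for cited in rec["cited"]):
                    collected.add(rec["id"])
                    changed = True

    # Keep only the links whose two ends are both in the collection.
    links = sorted(
        (rec["id"], cited)
        for rec in records
        if rec["id"] in collected
        for cited in rec["cited"]
        if cited in collected
    )
    return collected, links


sample_records = [
    {"id": "A", "year": 1990, "cited": []},
    {"id": "C", "year": 1991, "cited": ["B"]},
    {"id": "B", "year": 1991, "cited": ["A"]},
    {"id": "D", "year": 1992, "cited": ["C", "X"]},
    {"id": "E", "year": 1992, "cited": ["X"]},
]
collected, links = trace_forward({"A"}, sample_records, 1990, 1992)
```

In this sample, paper E is left out because it cites only X, which lies outside the collection seeded by A.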
However, it became apparent that one could also feed in a group of papers by an author and then, by extension, larger groups of papers selected by institution or by key word. Thus the output of any conventional search can be input to the system so as to produce its map and identify the core papers.
The production of the various tables or lists from these procedures is of course separate from the problem of visualizing these data in the form of maps or graphs. These artifacts aid in the visual perception of the interrelationships between citing and cited papers. Creating maps of related documents presents problems in display due to the limitations of space and the restrictions of an 8×11 piece of paper. Visualization is aided by using larger sheets, such as tabloid size of 12×15. However, the advent of computer display screens means that one can create a display page of infinite size. Segments of a map can be shown in a movable display. Using mouse clicks and pop-up windows, one can first show a condensed version of a large map in which the main nodes are visible but intermediate nodes are not.
Thus a map of several hundred nodal papers would first be seen in a condensed version in which only 25 to 50 nodes are shown, perhaps the most cited papers in the collection. The full map could be observed in chronological sections from top to bottom or from left to right. Essentially one goes from a standard two-dimensional display to a moving interactive multi-dimensional display. The combination of computer with human selection permits the algorithmic real-time visualization of the historical connections between literatures on a micro or macro level.
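One way to derive such a condensed view is sketched below. It assumes, for illustration only, that the main nodes are chosen by citation count within the collection and that only links joining two visible nodes are drawn. The function name condense and the data layout (a dictionary of paper id to year, and a list of citing/cited pairs) are hypothetical.

```python
from collections import Counter


def condense(papers, links, top_n):
    """Return the top_n most cited papers in chronological order,
    together with the links that connect two of those papers.

    papers: dict mapping paper id -> publication year (assumed layout)
    links:  list of (citing_id, cited_id) pairs within the collection
    """
    # Count how often each paper is cited inside the collection.
    cited_counts = Counter(cited for _, cited in links)

    # Rank by descending citation count; break ties by year, then by id.
    ranked = sorted(papers, key=lambda p: (-cited_counts[p], papers[p], p))
    main_nodes = set(ranked[:top_n])

    # Hide every link that touches a node not shown in the condensed view.
    visible_links = [(a, b) for a, b in links if a in main_nodes and b in main_nodes]

    # Chronological order supports a top-to-bottom or left-to-right layout.
    ordered_nodes = sorted(main_nodes, key=lambda p: (papers[p], p))
    return ordered_nodes, visible_links


sample_papers = {"A": 1990, "B": 1991, "C": 1992, "D": 1993}
sample_links = [("B", "A"), ("C", "A"), ("D", "A"), ("C", "B"), ("D", "B"), ("D", "C")]
nodes, shown = condense(sample_papers, sample_links, top_n=2)
```

Raising top_n step by step reveals the intermediate nodes, which corresponds to expanding the condensed map toward the full one.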
The invention is a process, adaptable to implementation in software and operation in a computer system having a visual display device, which analyzes a large collection of related publications and organizes the documents into an historiograph of the subject matter. The process of the invention starts with a randomly organized file of input documents or their descriptions. An output database is then created which permits one to identify a series of the most significant nodes and links in tables and graphs. Thus the user can quickly perceive the historical connections between the documents. The system may include classification tags or research front identifiers that would permit one to recognize the larger cluster or category of which each paper is a part.
While the described embodiment of the invention processes source documents containing citation index tags, entire texts of documents could also be included. In that way it would be possible to observe the contextual significance of each citation.
In a first embodiment, the invention produces five basic indexes based on author, institution, journal, year, and citation frequency. Other tables may be added, such as tables based on title words or key words so as to identify the most often used terminology in the subject.
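The five basic indexes can be pictured as frequency tables built from the source records. The sketch below is illustrative only: the record fields ("id", "authors", "institution", "journal", "year", "cited") and the function name build_indexes are assumptions, and citation frequency is counted here only for papers that are themselves in the collection.

```python
from collections import Counter


def build_indexes(records):
    """Build ranked author, institution, journal, year and citation indexes.

    records: list of dicts with hypothetical keys 'id', 'authors',
    'institution', 'journal', 'year' and 'cited'.
    Each index is a list of (entry, count) pairs, most frequent first.
    """
    ids = {rec["id"] for rec in records}
    counters = {
        "author": Counter(a for rec in records for a in rec["authors"]),
        "institution": Counter(rec["institution"] for rec in records),
        "journal": Counter(rec["journal"] for rec in records),
        "year": Counter(rec["year"] for rec in records),
        # Citations received from other papers in the same collection.
        "citation_frequency": Counter(
            c for rec in records for c in rec["cited"] if c in ids
        ),
    }
    # Sort by descending count, then by entry so the order is repeatable.
    return {
        name: sorted(counter.items(), key=lambda kv: (-kv[1], str(kv[0])))
        for name, counter in counters.items()
    }


sample_records = [
    {"id": "A", "authors": ["Smith"], "institution": "Univ X",
     "journal": "J1", "year": 1990, "cited": []},
    {"id": "B", "authors": ["Smith", "Jones"], "institution": "Univ X",
     "journal": "J2", "year": 1991, "cited": ["A"]},
    {"id": "C", "authors": ["Jones"], "institution": "Univ Y",
     "journal": "J1", "year": 1991, "cited": ["A", "B"]},
]
indexes = build_indexes(sample_records)
```

A title-word or key-word table could be added in the same way by supplying one more counter.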
The system also produces frequency-ranked indexes of cited papers that fall outside the basic core collection. The user can examine these candidate papers and decide whether to include them in the core analysis. For example, a highly cited book or patent might appear that is not part of the original source database, so that a source record would have to be created for it. Some of these items may in fact have been published prior to the starting reference.
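A minimal sketch of this candidate list follows, under the assumption that each record carries an "id" and a list of "cited" reference strings. The names outer_references and min_count are hypothetical, and the decision to add a candidate to the core remains with the user.

```python
from collections import Counter


def outer_references(records, min_count=1):
    """Rank cited items that are not themselves records in the collection.

    records: list of dicts with hypothetical keys 'id' and 'cited'.
    Returns (reference, count) pairs, most frequently cited first.
    """
    ids = {rec["id"] for rec in records}
    counts = Counter(
        ref for rec in records for ref in rec["cited"] if ref not in ids
    )
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # The threshold lets the user review only the well-cited candidates.
    return [(ref, n) for ref, n in ranked if n >= min_count]


sample_records = [
    {"id": "A", "cited": ["BOOK-1"]},
    {"id": "B", "cited": ["A", "BOOK-1", "PATENT-7"]},
    {"id": "C", "cited": ["A", "B", "BOOK-1"]},
]
candidates = outer_references(sample_records, min_count=2)
```

Here BOOK-1 is cited three times but has no source record, so it surfaces as a candidate for which a record would have to be created.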
It is well known that authors cite references with many variant spellings or make errors in one or more parts of the reference, such as the volume or page. Such errors can cause an important paper to be missed in the process of accumulating related papers. These "missing" references are identified in a separate table in the process of the invention. As part of the procedures invoked, the process seeks out the closest matching document in the collection and suggests candidates that the reader is asked to examine. This can be done manually or by an expert system which, for example, adds a missing volume number to a citation that is otherwise identical for author, journal, year and page. In a large number of references the page cited will not be the first page, as in a typical chemical paper where a chemical compound is mentioned. If the citation frequency is high enough, the user may wish to treat such a citation separately or include it as a subset of the fully paginated reference.
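The matching step can be sketched as a field-by-field comparison. This is a simplified illustration, not the expert system itself: the field names, the exact-equality comparison and the functions suggest_matches and fill_missing_volume are assumptions, and a practical matcher would also have to tolerate variant spellings of author and journal names.

```python
FIELDS = ("author", "journal", "year", "volume", "page")


def suggest_matches(citation, collection, max_mismatches=1):
    """List documents that differ from the citation in only a few fields.

    Returns (document id, mismatched field names) pairs for the reader to
    examine; exact matches are skipped because they are not "missing".
    """
    suggestions = []
    for doc in collection:
        mismatched = [f for f in FIELDS if citation.get(f) != doc.get(f)]
        if 0 < len(mismatched) <= max_mismatches:
            suggestions.append((doc["id"], mismatched))
    return suggestions


def fill_missing_volume(citation, collection):
    """Add the volume when author, journal, year and page all agree."""
    if citation.get("volume") is not None:
        return citation
    for doc in collection:
        if all(citation.get(f) == doc.get(f)
               for f in ("author", "journal", "year", "page")):
            # Return a corrected copy and leave the input untouched.
            return {**citation, "volume": doc["volume"]}
    return citation


sample_collection = [
    {"id": "D1", "author": "SMITH J", "journal": "J CHEM", "year": 1985,
     "volume": 12, "page": 101},
    {"id": "D2", "author": "JONES K", "journal": "J CHEM", "year": 1985,
     "volume": 12, "page": 230},
]
no_volume = {"author": "SMITH J", "journal": "J CHEM", "year": 1985,
             "volume": None, "page": 101}
inner_page = {"author": "SMITH J", "journal": "J CHEM", "year": 1985,
              "volume": 12, "page": 104}
```

The inner_page citation models a reference to a page other than the first; it is suggested as a near match for D1 on the page field alone, leaving the user to decide whether to merge it with the fully paginated reference or treat it separately.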