1. Field of the Invention
The invention is addressed to apparatus and methods for interactively studying very large bodies of data in general and more specifically to apparatus and methods for studying similarities of values in such very large bodies of data.
2. Description of the Prior Art
Computers have made it relatively easy to collect, store, and access large collections of data; however, the view of the data which has been provided by the computer has typically been limited to the 50 or 100 lines which can be shown on a display terminal. The most modern graphical interface technology has made possible the display of information for up to about 50,000 lines of code in a display, as set forth in U.S. Pat. Application 07/802,912, S. G. Eick, Information Display Apparatus and Methods, filed Dec. 6, 1991 and assigned to the assignee of the present patent application. While the techniques of Eick are a great improvement over the display of 50 or 100 lines, they still cannot deal with bodies of data which consist of millions of entities such as records or lines. Further, the techniques of Eick do not address a frequent problem in dealing with such large bodies of data: namely, being able to visualize the relationship between similar or duplicate information and the body of data as a whole.
This kind of visualization is important in areas as diverse as the study of DNA, the automated manipulation of bodies of text, the detection of copyright infringement, and the maintenance of large bodies of code for computer systems. A technique which has been used in DNA research for such visualization is the dot plot, as explained in Maizel, J. and Lenk, R., "Enhanced graphic matrix analysis of nucleic acid and protein sequences," Proc. Natl. Acad. Sci. USA, 78:12, 7665-7669. As employed in DNA research, the dot plot is used to compare a sequence of n nucleotides with itself. A representation of an n × n matrix is created in a computer system. Each element (i,j) of the matrix represents a comparison between nucleotide (i) of the sequence and nucleotide (j) of the sequence; if they are the same, a mark is placed in the matrix element. The dot plot is then output to a printer or plotter. The patterns of marks in the dot plot provide an overview of the similarities in the sequence of DNA. In order to make significant patterns more visible, the dot plot is filtered; various compression techniques further make it possible to display a dot plot for a sequence of up to 10,000 nucleotides on a single piece of paper.
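By way of illustration only, the self-comparison underlying such a dot plot may be sketched as follows. The sketch assumes a short sequence held in memory as a string and omits the filtering and compression mentioned above; the names dot_plot and render are hypothetical and are not taken from the cited reference.

```python
def dot_plot(sequence):
    """Build an n x n matrix for a sequence compared with itself.

    Element (i, j) is True when item i and item j of the sequence are
    the same, which corresponds to placing a mark in that element.
    """
    n = len(sequence)
    return [[sequence[i] == sequence[j] for j in range(n)] for i in range(n)]


def render(matrix):
    """Return a text rendering: '*' for a marked element, '.' otherwise."""
    return "\n".join(
        "".join("*" if marked else "." for marked in row) for row in matrix
    )


if __name__ == "__main__":
    # Illustrative input only; any sequence of comparable items would do.
    print(render(dot_plot("GATTACA")))
```

Because every element matches itself, the main diagonal is always fully marked, and the matrix is symmetric about that diagonal. The sketch requires on the order of n × n comparisons and storage, which is acceptable for short sequences but not for the much larger bodies of data discussed below.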
Although dot plots have been used in DNA research for over 10 years to display the results of comparisons between sequences of DNA, their use has not spread to other areas where similarities need to be studied. Problems with applying such techniques in these other areas include the much greater complexity of the data being compared, the need to deal with much larger sets of data, the need to make the techniques fast enough to be used interactively, and the need to develop a user interface which permits the user of the dot plot both to interact easily with the data which underlies the dot plot and to control easily the filtering used in the dot plot. It is an object of the present invention to overcome the above problems with dot plots and thereby to make them available for use in any area in which similarities in large bodies of data are of interest.