1. The Field of the Invention
The present invention relates to systems and methods for analyzing data. More particularly, the present invention relates to systems and methods for viewing, searching, and navigating large data sets such as textual data, files, or databases and, more specifically, to producing a focused data set from the original data set.
2. Background and Relevant Art
Computers and computer-related technologies such as software are becoming increasingly sophisticated. Computers that used to run at a few megahertz are now capable of operating at gigahertz speeds. Computers that offered a few hundred kilobytes of memory now offer hundreds of megabytes of memory. Software development, of course, has adapted to the ever-improving technology. Whereas computer programs were often delivered to consumers on a couple of floppy disks that held relatively little data, most computer programs are now delivered on CD-ROMs that store hundreds of megabytes of data. It is easy to see that software has grown from thousands of lines of code to millions of lines of code. One of the side effects of larger programs is that they are potentially more difficult to debug because the programmer is looking at significantly more text.
A similar problem occurs in applications or programs that generate a large amount of output. Data sets such as log files are examples of files that may contain a large amount of text that represents actions that have occurred, for example, in a computer, a network, or a web site. Operating systems generate log files, Internet servers generate log files, and debugging programs generate log files. Other applications may store large amounts of data in other formats, but the same problems apply to these formats as well.
The data sets that are generated in these and other situations can often provide valuable information that can be used in various ways. The problem with these types of data sets is that their size (measured in number of entries, size of a single entry, etc.) makes it difficult to find and view the specific data that is of interest to a user. For example, log files can be used to determine the events that occurred just before a problem crashed a system or terminated an application. Finding and examining the entries corresponding to these events in the log files can then be used to prevent this type of problem from recurring. However, the sheer size of the log file makes it very difficult to examine the log file and find the entries or text associated with the system crash or with the terminated application. When the appropriate entry (or group of entries) is found in the log file, it may provide some idea as to why the system crashed or why the application terminated improperly. With this information, a user may be able to fix the problem so that it does not cause similar failures in the future.
In these types of situations, it is difficult to extract useful information from a data set that has a significant amount of extraneous data because the data of interest is often interspersed among the extraneous data. There are many standard text editors that provide a basic find functionality, but this capability is inadequate when it is necessary to compare two lines of text that are widely separated in the log file or in the data set. Other editors approach this problem by allowing a user to mark certain lines within the data set. While this can be beneficial, it is often not enough to help find the appropriate lines of text. One of the reasons is that these more sophisticated editors are not able to provide context with respect to certain lines of text.
Another potential solution to this problem is to use global regular expression print (GREP), a function or utility that searches for a certain string of text and outputs any line that contains the specified string. The problem with GREP is that the output of one GREP search cannot be temporally reconciled with the output of other GREP searches. The output of one GREP search cannot be easily combined with the output of a second GREP search because the temporal relationship between the two respective outputs is unknown. In addition, the output of a GREP search does not provide the desired context for lines of interest.
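By way of illustration only, the limitation described above may be sketched in Python. The sketch assumes a simplified GREP-like function that returns only the text of matching lines, and a hypothetical log whose entries carry no timestamps or line numbers; the function name `grep` and the sample entries are assumptions made for this example and do not come from any particular utility or system.

```python
import re


def grep(pattern, lines):
    """Return every line containing a match for pattern, in encounter order.

    Simplified stand-in for a GREP-like search: only the text of each
    matching line is returned, with no position or surrounding context.
    """
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


# Hypothetical log entries (illustrative only), with no per-entry timestamps.
LOG = [
    "service started",
    "connection opened by client A",
    "warning: buffer nearly full",
    "connection opened by client B",
    "error: buffer overflow",
    "service terminated",
]

connections = grep(r"connection", LOG)
buffer_events = grep(r"buffer", LOG)

# Combining the two outputs by concatenation groups the entries by search,
# not by the order in which they occurred in the log.
naive_merge = connections + buffer_events

# Recovering the true order requires going back to the original data set,
# because neither output records where its lines came from.
true_order = [
    line for line in LOG if line in connections or line in buffer_events
]

print(naive_merge)
print(true_order)
```

In this sketch, the concatenated output places both connection entries ahead of both buffer entries, whereas in the original log they are interleaved. Neither search result, taken alone, retains the position of its lines or the entries surrounding them, so the relative order and the context can be restored only by consulting the original data set again.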