The human brain discovers new information and relationships by sorting through vast quantities of information, discarding information that is of no interest and then analyzing and organizing the remaining important information into new constructs. This requires the use of short-term memory that allows a person to focus narrowly on the newly organized information and newly created constructs. Long-term memory allows a human being to recall information that was organized previously and enables comparison of the recalled information to new information. These long-term and short-term memory operations allow human beings to discover and report new relationships among recalled and new information.
Three actions are assumed critical to the processes of discovering and reporting new information: 1) collating and processing important information while discarding unimportant information, 2) rapidly reviewing and comparing large quantities of new and old information, and 3) organizing new insights into new information constructs.
A recent article in Discover magazine states, “The human brain has evolved different modes for concentrating on a single thing versus jumping from one thing to another . . . . And the cost is that it takes several minutes to shift back . . . that's the way we're wired.” (Does e-mail make you dumber?, Anne Casselman, Discover, August 2005, p. 8) While this article discusses the effect of the loss of focus in the context of approaching tigers, phone calls and new e-mails, the same basic considerations apply to data analysis and scientific discovery. Operations that require a researcher to change his or her focus from analysis and report writing to the mechanics of loading, locating, processing, changing views and transferring information from raw data to finished reports can be at least slightly detrimental and possibly severely debilitating to the discovery and reporting processes.
Consequently, a combination of automated or semi-automated data processing, configurable result viewers and result transfer functionalities may be advantageous to the researcher. Such tools may allow the researcher to maintain focus when dealing with large amounts of data having many diverse forms and originating from many different source files. By using such tools a researcher's brain is allowed to maintain a sharp focus on the selected items of interest. One technology that exemplifies a system that addresses precisely this need for the maintenance of focus and processed data organization is the heads-up display.
In the realm of scientific research and sample measurements, scientists and engineers must develop protocols for storing raw data and then later retrieving that raw data. They typically must develop analysis tools to process the stored data. Analysis of the stored data, including organizing measured results, enables scientific discovery. Scientists and engineers must also develop protocols for transferring the results of the analyses into reports for documentation and dissemination.
Often, the data of interest that must be processed and analyzed is unique to a particular measurement system, hardware configuration, and measurement sample. While not all of the raw data may be necessary in later analyses, often the researcher is not certain which data will be important for future post-measurement analysis. As a consequence, it is the safest practice to save all information about each measurement in a single data file and later extract only data entries of interest for each analysis. This, however, is rarely done, because it is far easier to save only the simplest data that is of the most interest at the time. The shortcut persists because it is often too difficult to extract the information of interest using standard packages when the data is buried in data files with data of mixed types.
The full set of data may be quite complex, for example, including some or all of the following: the date and time of the measurement, an identification code of the sample under test, a system operator identification code, comments on the conditions of the experiment, measurement protocols such as data sampling parameters, identification of multiple and varied hardware components within the system, the settings of multiple and varied hardware components within the system, the actual measurement data itself, the filenames of other files that contain relevant reference, calibration, or experimental data, and the reference, calibration or experimental data contained in those other files. Each of the aforementioned measurement and hardware data items may be text or numeric. In addition, data in each entry field may be of various dimensions, including: scalar (e.g., single entry), vector (e.g., one-dimensional [1-D] row-wise or column-wise array of elements), or matrix (e.g., a two-dimensional [2-D] row and column-wise array of elements).
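By way of illustration only, the mixture of text and numeric entries of scalar, vector and matrix dimensionality described above might be represented in memory as sketched below. This is a minimal sketch under stated assumptions: the field names, the values and the helper `dimensionality` are hypothetical and are not taken from any particular measurement system or software package.

```python
from typing import Union

# Hypothetical value types: each entry is text or numeric, and is a
# scalar, a vector (1-D list), or a matrix (list of rows).
Scalar = Union[str, float]
Vector = list[Scalar]
Matrix = list[list[Scalar]]
FieldValue = Union[Scalar, Vector, Matrix]


def dimensionality(value: FieldValue) -> str:
    """Classify a field value as 'scalar', 'vector', or 'matrix'."""
    if isinstance(value, list):
        # A non-empty list whose elements are all lists is treated as a matrix.
        if value and all(isinstance(row, list) for row in value):
            return "matrix"
        return "vector"
    return "scalar"


# Illustrative record; every field name and value here is made up.
record: dict[str, FieldValue] = {
    "timestamp": "2005-08-01 14:30:00",
    "sample_id": "S-0042",
    "operator_id": "OP7",
    "comments": "ambient conditions",
    "sampling_rate_hz": 1000.0,
    "detector_gains": [1.0, 2.5, 4.0],
    "measurement": [[0.0, 0.12], [0.001, 0.15], [0.002, 0.11]],
    "calibration_file": "cal_example.txt",
}

# Map each field name to its dimensionality.
summary = {name: dimensionality(value) for name, value in record.items()}
```

A single record of this kind holds entries of every dimensionality side by side, which is the property that makes a plain rectangular matrix an awkward container for such data.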
Adding to this complexity is the fact that the form of the data file structures may vary considerably. For example, each dimension in a vector or matrix entry may be of arbitrary length, the order and layout of the different sections of data in multiple files may vary, and new types of hardware may be introduced into the system during the development process, requiring the ability to add or modify the data layout in subsequent data files. Furthermore, a scientist may want to process data from many (perhaps hundreds of) diverse data files, one file at a time.
A scientist may also want to collate, process, view and compare data and results from various fields across a set of many data files which may also number in the hundreds. These data files may be contained in single or multiple folders or directories (i.e., locations). These folders may also contain other types of data files as well as non-data files. The data may need to be analyzed multiple times over the course of days, weeks, months or years using new analysis protocols that are under development.
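The collation task described above may be sketched as follows. This is a hedged example rather than a description of any existing package: the `label = value` line layout, the `.dat` extension, the field name `peak` and the helpers `read_field` and `collate` are all assumptions introduced for illustration. The sketch gathers one named field from every data file in a folder and skips files of other types.

```python
import tempfile
from pathlib import Path


def read_field(path, field):
    """Return the text after 'field =' in a file, or None if absent."""
    for line in path.read_text().splitlines():
        label, sep, value = line.partition("=")
        if sep and label.strip() == field:
            return value.strip()
    return None


def collate(folder, field, pattern="*.dat"):
    """Collect one numeric field from every matching file, keyed by filename."""
    results = {}
    for path in sorted(Path(folder).glob(pattern)):
        value = read_field(path, field)
        if value is not None:
            results[path.name] = float(value)
    return results


# Build a small throwaway folder holding two data files and one non-data file.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "run1.dat").write_text("# run 1\npeak = 1.5\nnotes = ok\n")
    (folder / "run2.dat").write_text("notes = ok\npeak = 2.5\n")
    (folder / "readme.txt").write_text("not a data file\n")
    peaks = collate(folder, "peak")
```

Because each value is located by its label rather than by its row, the two data files may list their fields in different orders without affecting the collated result.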
Consequently, it is highly desirable to be able to open any of the raw data files and inspect the contents using, for example, a simple text editor during both the development of the measurement systems and when processing the data. The inspection process is simplified when the data in the files are stored in an organized and commented text file format with a mixture of identifying text and data that is laid out in an aesthetically pleasing, easy-to-read format.
Another issue important to a researcher is that, during the initial phase of the R&D process, most aspects of the measurement systems and the data are evolving. Hardware systems that are easily and flexibly modified allow the rapid integration of new hardware and/or the creation of new experimental variations for the purposes of discovery. These modifications translate directly to the data files in that each data file set may need to contain varying output data structures, lengths, and layouts. Because a state of flux typically exists in an experimental environment, it is important that a researcher have the ability to rapidly develop creative and new experiments, to generate new data types, and to develop and modify data analysis algorithms on-the-fly. In addition, the development of new analysis algorithms and searching for new data relationships may necessitate re-processing of recently acquired and old data files using such evolving analysis algorithms.
Once an analysis software package has been chosen, a user must adapt his or her way of saving data, processing data, developing algorithms, and reporting results using the tools and protocols supplied by that chosen analysis software package. Many of the currently available data analysis packages provide advanced computational functions that address scientific analysis needs. However, none of the currently available packages provides an integrated set of tools that enables a researcher to rapidly develop a flexibly configured results region and then automatically or semi-automatically import and analyze multiple complex text data files while monitoring a results focus region.
Currently, state-of-the-art data analysis software includes a number of well-developed packages such as Microsoft Excel®, Mathematica®, MatLab®, and Mathcad®. All of these software packages provide excellent sets of computational functions that enable development of complex and multi-step data analysis algorithms. Such packages typically also provide graphical templates that users may insert into the analysis programming environment. Users may then enter their specific data or analysis results into graphs by pasting, using wizards, or using programming-style commands.
The aforementioned data analysis packages provide simple fixed field and fixed format text data file importation functions. Fixed field means that all entries (fields) in the file must have the same column width. Fixed format requires that all data entries must have the same delimiter separating each field (e.g., comma-separated values, or CSV, files) in the data file. During the importation of data from the files, the currently available fixed importation commands allow the start and ending rows of data in the file to be specified.
The analysis of data files using such rigid importation methods may be problematic when processing more than one data file. This is because such fixed methods will load multiple data fields from each data file into formulas in the analysis program based only upon their precise locations in the data file. When using fixed format importation, the data in multiple text files must have precisely the same location in every file and the entries must be of a predetermined number and dimensionality. Such fixed format data importation methods lead to an indirect association of the data with the variables and/or computational functions. This approach often gives rise to computational errors when multiple, complex data files having different layouts are processed. Computational errors may occur when a data field exists in one file but not in the other, when the order of the data fields changes, the number of entries in a data field changes, or the dimensionality of a data field changes between files.
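The indirect association problem described above can be illustrated with a short sketch. The two helper functions, the field names and the file contents are hypothetical and stand in for no particular package's importation commands. The first helper reads a value purely by its row position, as a fixed-format import would; the second searches for the field by its label.

```python
def import_by_position(lines, row):
    """Fixed-format style import: take whatever value sits on the given row."""
    return float(lines[row].split(",")[1])


def import_by_name(lines, field):
    """Name-based import: find the labeled field wherever it appears."""
    for line in lines:
        label, _, value = line.partition(",")
        if label.strip() == field:
            return float(value)
    raise KeyError(field)


# Two made-up files holding the same fields in a different order.
file_a = ["gain,2.0", "offset,0.5"]
file_b = ["offset,0.5", "gain,2.0"]

# Intended quantity in both cases: the "gain" entry.
positional = [import_by_position(f, 0) for f in (file_a, file_b)]
named = [import_by_name(f, "gain") for f in (file_a, file_b)]
```

In this sketch the positional import silently returns the offset of the second file in place of its gain and raises no error, whereas the name-based lookup returns the gain from both files. A missing field would likewise go unnoticed by the positional import but raises `KeyError` in the name-based one.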
If the data file layout varies between files, as is frequently the case in R&D environments, the only known solution to the problem has heretofore been to manually alter each and every one of the affected data processing algorithms within the analysis program to accommodate the specific layout of each data file. This approach requires that each data analysis program must be stored and used only with a particular associated set of data files, which greatly increases the complexity of analyzing and processing multiple data files from evolving measurements and measurement systems. The actions that must be taken by the user to monitor and compensate for indirect association errors and to maintain numerous analysis programs may cause a severe loss of focus on the part of the researcher when analyzing multiple files.
Considerable effort has recently been expended on developing platform integration whereby software packages may pass data between application interfaces and may call one another's routines for computation. Recently, efforts have also been directed toward enabling integration of the analysis software directly into user configurable report templates. The integration of analysis software with the reporting document and the use of report templates is a great aid in the semi-automatic generation of reports. Such an approach enables consistency of reporting style, useful particularly from a quality control perspective. In addition, the integration of analysis programs and documents enables the generation of interactive educational tools. However, during the R&D phase, it makes little sense to spend time developing these interfaces or methods because the results of interest, the organization of those results, and the display of those results are in a continual state of flux. In other words, interfaces developed today may be of no use tomorrow and data exchange protocols between software packages may have to be continually re-programmed.
Because of the constantly changing environment of the initial R&D phase, a less structured reporting format is often acceptable during reporting, it generally being understood that the presenter has not “fully polished” the analysis methods and/or the results in his or her reports. Because this practice is widely accepted in the R&D community, the development of inter-package data analysis algorithms and sophisticated user interfaces is viewed as more of a distraction than an aid during the initial R&D phase. Therefore, most of the recent efforts by software companies to provide inter-operability of analysis packages and fixed format styles of report generation are of little use and little interest to scientists and engineers involved in the first stages of R&D or exploratory data analysis.
An additional problem with developing protocols that transfer data between packages is the loss of focus that may occur when changing display windows. Loss of focus may also occur when the person who is developing the data analysis routines must switch between the use of completely different languages and algorithm development interfaces.
Another problem with data exchange between various software packages is that it typically requires the purchase of multiple software packages for installation on each computer where the software is used. Data analysis software packages are frequently quite expensive.
Other efforts have been directed toward collating data from networked computer environments, where there is a need to gather information from many computers and to collate that data for later or dynamic processing. Software and systems that meet these needs include many web-based applications and data collating search engines. Such an approach is described in U.S. Pat. No. 6,917,972 for PARSING NAVIGATION INFORMATION TO IDENTIFY OCCURRENCES CORRESPONDING TO DEFINED CATEGORIES, issued Jul. 12, 2005 to Basko et al. Such approaches, however, do not typically address the unique needs of scientists and engineers who are developing measurement systems in an exploratory research mode.
In addition to the aforementioned deficiencies in data importation, deficiencies exist in data analysis packages of the prior art. For example, spreadsheet programs, such as Excel, have a number of additional shortcomings with regard to data processing and analysis of single and multiple files. These additional shortcomings include the fact that the most prominent items in the viewing space are the data in individual data cells. In general, a research scientist, engineer, or other such user does not care to view a specific value in a specific cell, but is more interested in the relationships and computations between vectors (columns or rows) or matrices of data. Also, the computational formula for a group of data is contained in a small bar, typically hidden until the user manually selects the cells that receive the results of the computation. Computations proceed by indirectly referencing data by cell addresses and not by directly associating the data with easily identifiable variable names. Computational flow is completely unstructured in that computations may be made on indirect references to data variables whose cell locations can lie anywhere in one or more spreadsheets.
The locations of those specific cells within the workspace are typically not obvious or not easily determined. Graphs must be placed on separate windows or on top of the work sheet cells. This then requires the user to switch his or her focus between the graphical display window, the computational formulas and the data areas. This frequent shifting of focus may interrupt the user's train of thought. Changes to items in the graph must be performed by opening a wizard which allows the manual selection of a group of cells for plotting on each axis. In addition, common spreadsheet programs do not provide a native data file importation mechanism that can be activated by a simple index change or key press. Finally, there is no native ability to easily batch file process or organize imported data from multiple data files.
Consequently, spreadsheet programs require many detailed manual operations to load data files and to make changes to the data processing algorithms as well as to display results from multiple files. All such manual operations required for processing single and multiple data files may also cause a loss of focus for the user. As a result, spreadsheet programs are not typically appropriate for analyses containing many computation steps or the importation of complex data from sets of multiple data files.
Programming-style data analysis packages such as Matlab or Mathematica provide a higher level of data analysis sophistication than do spreadsheets. Such programs provide the ability to directly associate variable names with the various data types for use in later computations. However, the data importation facilities typically provided by these analysis packages are still problematic. Data is usually read from fixed format files and the association is made by the data's location in the file. Consequently, the problems caused by data importation methods that use indirect association described for spreadsheet programs are still present with these analysis packages.
Moreover, some programming-style languages, such as Matlab, do not use standard mathematical symbols for computations. This is a detriment to engineers and scientists who are skilled in efficient mathematical language. In Matlab, numerous windows are used to contain information on disparate items. One window is needed for interactive program statements. This window is not static; it scrolls as each new command is entered and the output is displayed. Other windows are needed for programmatic data analysis algorithms, data variables, filename directories and graphical data.
In addition, the windows must be resized, opened, closed or stacked to change or enlarge views. The need for manual window re-sizing and the inflexible methods of computation and result organization hinder a user's important ability to rapidly and easily organize new and arbitrary information into computational and visual constructs. Human beings think and develop analyses in an ordered and sequential computational format that proceeds from the top down and from left to right (or, in some languages, from right to left). Neither spreadsheets nor the multi-windowed Matlab interface provides this capability and, consequently, both may lead to confusion and loss of focus during their use.
Mathematica does allow embedding graphs within a single workspace and sequential display of analysis algorithms. However, the programming style user interface may seem cryptic in that extremely complicated and non-intuitive programming statements are required to perform data analysis, graphing, and organizing single or multiple graphs. These complicated analysis statements and non-intuitive graphing methods still require a considerable amount of focus to develop and review. Mathematica does provide functions that allow the user to develop very general text file parsing capabilities. However, Mathematica does not provide guidelines for formatting and then importing complex data files; the user must develop his or her own text parsing formats and routines. Without guidelines, the data files are often organized by the researcher to appear more like spreadsheet formats, which becomes problematic when attempting to extract information in an automated fashion. Even with Mathematica's ability to parse text data files, it is not intuitively obvious to a user how to structure data files and then present commands that can import data from multiple and complex data files with varying layouts in the data entries.
It is also non-obvious how data from the same field in multiple files may be collated and processed in Mathematica. Mathematica does not solve all of the problems regarding complex data file handling, simple configuration of user interfaces and flexible results organization that are needed by the research scientist.
Another method that has been used to handle text data files involves importing data files into databases. Databases can provide an intermediate step for extracting and converting data types from text files, organizing that information across single and multiple data files and storing it for later retrieval in analysis software. However, using databases for organizing information introduces new problems to the researcher. The database requires additional steps as it must be programmed to load, collate and organize the data files. Thereafter, methods must be developed to retrieve that data from the database for loading into the data analysis software. This may be acceptable if the files always have the same layout. However, the method is typically unacceptable if the data files evolve during the development phase of an R&D project. If new fields of information are added to the text data files, then the databases must be re-programmed to handle that new data. In addition, databases require the database software to be installed or available on each computer where the data must be viewed and processed.
Databases do allow efficient storage of data as compared to text files, but this is generally of no importance to the user unless the data files or data sets are extremely large. As computer processing and storage capacity increases, the need for efficient data storage diminishes for most users.
The most intuitive program currently available for flexibility in computational algorithm development and maintenance of focus is embodied in a program called Mathcad. Mathcad provides a virtual “whiteboard” on a computer display, allowing the user to develop complex analyses in an intuitive manner that mimics how that person might perform computations at a whiteboard or chalkboard as described in U.S. Pat. No. 5,189,633 for APPARATUS AND METHOD FOR INTERACTIVELY MANIPULATING MATHEMATICAL EQUATIONS issued Feb. 23, 1993 to Allan R. Bonadio. Mathcad allows the user to enter computations in a structured, top-down, left-right sequential format in a static and editable display window. The user can place true mathematical notation and re-sizable graphs anywhere in the work area, which greatly aids in developing efficient computations and in developing and reviewing an analysis. Graphing is accomplished by pasting a graph into the document and then filling variable names into placeholders. This is similar to the manner in which a user might draw a graph on a whiteboard. Within Mathcad, the graphs may be re-sized and moved using simple mouse operations.
Unfortunately, Mathcad has major shortcomings when used as a scientific data file processing engine. First, Mathcad does not provide general text data file viewing or parsing capability for even a single, complex, text data file. The Mathcad file importation functions require a fixed format, which creates the same problems described hereinabove with regard to spreadsheets. Complex data files may be imported into matrices for further processing; however, in practice each measurement produces data of many different types (numeric and text; scalar, vector and matrix) which ideally are stored in a single complex data file, not just as a large matrix. Therefore, the file importation tools provided by Mathcad have the aforementioned deficiencies of complexity in use and indirect data association of the imported data with variable names in the workspace.
One mechanism for importing data files is to use file read/write components. However, using read/write components requires the user to change focus from an area of interest, move to the top of a worksheet area, manually select and load a single data file, and then return to the area of focus to view the updated results.
Another mechanism for data file importation in Mathcad is through its read functions. However, the Mathcad read functions do not parse the data files, but instead operate by use of a fixed format data importation method. The Data Analysis Extension pack provided by Mathsoft does provide commands that enable the importation of text data files into matrices; however, this does not allow direct parsing of the data files.
Neither the file read/write component nor the read functions can search out and load a selected data field from a file having an aesthetically pleasing mixture of text, data and white space. Because Mathcad cannot parse even a single, complex data file, it certainly cannot be used to process multiple text data files with a complex and varied mixture of data entries that may evolve over time. Because of these limitations, Mathcad by itself cannot be used to enable the automatic or semi-automatic parsing of single or multiple files or batch file parsing.
Some vendors offer data file parsing programs that may be used to convert raw text data files into new data file constructs for subsequent importation into data analysis packages. Other vendors offer data file parsing programs that provide commands that allow the user to develop programming statements that can be called to import data from files directly into named variables. Software Techniques, Inc., 773069 RR#2, Proton Station, Ontario, NOC 1L0 provides Parsing Tools and Guy Software, 1752 Duchess Avenue, West Vancouver, British Columbia V7V 1P9 provides ParseRat™. Parsing Tool provides a user the ability to import data from single or multiple files into named variables in a programming style analysis environment. Subsequently, the imported data may be analyzed using packages such as MatLab, C, C++ or Visual Basic. While Parsing Tool is a useful tool, each data parsing and import operation requires writing specific “for-loops” and/or using parsing profiles that are written and stored in a database. As a result, the user must develop the sequence of programming statements that imports the data into each of the named variables and must also maintain a database of parsing profiles. The user must also work in a multi-windowed, programming-centric environment where the region of focus for imported data, analysis algorithms and results cannot be easily or flexibly configured into a single workspace and a single focus region.
ParseRat does not provide the ability to integrate data imported from files directly into the analysis package. ParseRat is a tool for converting one file format into another and generating new files that may then be loaded into the analysis package using the protocols provided by that software analysis package. This additional step increases the complexity of data analysis and requires that at least two sets of files must be managed: the original file and the converted file that will be imported into the analysis package.
When processing large numbers of complex text data files, it is generally not sufficient to provide some aspects of automated data processing while ignoring other aspects. A computer program that does not use true mathematical notation or does not adhere to top-down analysis flow, but provides text file parsing and graphs alone will not provide the general purpose, convenient interface that is required for enabling a scientist or engineer to maintain optimal focus. Similarly, a computer program that provides mathematical notation, but which ignores the need for automated parsing and importing of text files will require that the user engage in numerous manual operations to import data from multiple files into the analysis package which may also cause a loss of focus. A program which does not allow automatic updating of computations in a static display, or scrolls the display of interactive commands in the viewing area, or forces the use of multiple windows for viewing different types of information, also creates an undesirable loss of focus to the user. Loss of focus also occurs if the user must develop and maintain protocols or programming statements that transfer data and results between formats and/or between different software packages.
What is needed is a single computer software system that provides a graphical user interface (GUI) that is integrated with a simple and fully automated data file parsing capability and a flexibly configured results region. Many scientists and engineers who must process multiple and complex data files will utilize such a tool. However, they will probably not switch to a software package that does not provide a complete multi-file processing solution or is cumbersome to program. Such a limitation would merely present a new set of deficiencies to the user. The ideal software package would, therefore, singularly handle data analysis of multiple files ranging from complex raw data through to the display of results using a single whiteboard-style interface and would allow a simple means for transferring those results into reports.