A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the xerographic reproduction by anyone of the patent document or the patent disclosure in exactly the form it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
The present invention relates to a general-purpose data parsing and analysis system and, in particular, a common system and method for analyzing any data composed of interrelated data structures similar to the protocols found within network frames.
Data search processors perform a number of functions such as data matching, filtering, statistics gathering, converting and bracket matching. Data search processors or tools are typically associated with a specific data editor and are limited to recognizing embedded control characters that are associated with the particular data editor. Data search tools that function independently, meaning that they are not associated with a specific data editor, are not currently able to recognize embedded control characters for any data editor. A discussion of a data editor independent search tool may be found in B. Kernighan, et al., “Regular Expressions,” Dr. Dobb's Journal, April, 1999, pp. 19-22, the contents of which are incorporated herein by reference.
Data parsing refers to the ability to categorize data into components based on the characteristics of the data values. A practical example of data parsing would consist of the following procedures: (1) efficiently searching for text words in a document that consisted of both text and graphics; (2) identifying the components of the document that are graphical; and (3) skipping over the graphical components rather than searching through them character by character as though they were text or control characters.
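The three procedures above can be sketched in a few lines of Python. This is a minimal illustration only, assuming a hypothetical container layout that is not part of the disclosure: each segment carries a one-byte type tag (`T` for text, `G` for graphics) and a two-byte big-endian length, followed by its payload. The names `make_segment` and `find_word` are likewise illustrative.

```python
import struct

def make_segment(kind: bytes, payload: bytes) -> bytes:
    """Build one segment: 1-byte tag, 2-byte big-endian length, payload (hypothetical layout)."""
    return kind + struct.pack(">H", len(payload)) + payload

def find_word(doc: bytes, word: bytes) -> list:
    """Return absolute offsets of `word`, searching text segments only."""
    hits = []
    pos = 0
    while pos + 3 <= len(doc):
        kind = doc[pos:pos + 1]
        (length,) = struct.unpack_from(">H", doc, pos + 1)
        body_start = pos + 3
        if kind == b"T":
            # Text component: search it character by character.
            body = doc[body_start:body_start + length]
            idx = body.find(word)
            while idx != -1:
                hits.append(body_start + idx)
                idx = body.find(word, idx + 1)
        # Graphic (or unknown) component: skipped as a whole using its length.
        pos = body_start + length
    return hits

doc = (
    make_segment(b"T", b"the cat sat")
    + make_segment(b"G", b"\x00cat\xff\x10")   # bytes that only look like text
    + make_segment(b"T", b"cat again")
)
hits = find_word(doc, b"cat")
```

Because the graphic segment is stepped over by its declared length, the byte sequence `cat` that happens to occur inside it is never reported as a match.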
Filtering of data files is typically implemented using a value for comparison, and, in some cases, “wildcard” characters within the value. Filtering of data files typically comprises doing a search on the data file, and then taking an action based upon the search results. For example, a filter might search for all instances of a particular data expression, and then provide a count of the total number of instances found.
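A conventional wildcard filter of this kind might look like the following sketch, which uses the shell-style wildcards of Python's standard `fnmatch` module and takes "count the matches" as the action. The function name `count_matches` and the sample text are assumptions made for illustration.

```python
from fnmatch import fnmatchcase

def count_matches(text: str, pattern: str) -> int:
    """Count whitespace-separated words that match a wildcard pattern."""
    return sum(1 for word in text.split() if fnmatchcase(word, pattern))

text = "protocol parser processes each protocol field"
count = count_matches(text, "proto*")   # '*' matches any run of characters
```

Note that such a filter treats every byte of the input as text; it has no notion of control sequences or embedded graphics.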
For multiple value filters, the result from each filter is logically combined together to obtain an overall result. Therefore, each additional result adds to the processing required to filter on that value. Conventional filtering does not typically include a provision to identify embedded graphic images so that the images may be either intentionally examined or skipped over in a data search.
A practical example of data filtering would be to search all components of a document for company proprietary information, and filter out the proprietary information, in order to prevent its unwanted disclosure. Such information might be embedded, or hidden, in the control characters of the data editor format.
Existing data search, filtering and statistical tools are either specific to a particular version of a particular data editor, for example the “Find” or “Word Count” functions typically found in popular word processors, or must parse through files character-by-character without being able to differentiate among data, document format control characters or graphic characters. Thus, the existing tools are either limited by their inability to work across various editors, or, for those tools that are not editor-dependent, their inability to efficiently parse files containing data, document control characters and graphic characters.
Although CPUs available today can execute hundreds of millions or even billions of instructions per second, to achieve the necessary processing rates for most filtering, vendors often must provide dedicated hardware assistance and/or front-end processors with hand-coded assembly language routines. This solution typically requires hardware and/or software modifications whenever changes are made to the number of supported features or editors.
In a conventional data search engine, a string of characters is specified, and the engine searches for the string of characters in the data editor file(s). For an ASCII character set file, the file may contain:
a. alphanumeric characters, such as a-z, A-Z, 0-9;
b. delimiters, such as punctuation characters and spaces;
c. graphics, such as bit maps;
d. control character sequences, such as a sequence of characters that will cause the data editor to show words underlined or in bold print, or change the size of the font; and
e. “junk strings” of characters, such as control character sequences appearing consecutively with different values for the same control or duplicate control character strings, that may be generated by automatic conversions performed on a file to change it from one data editor format to another, for instance a conversion from Word document format to Rich Text Format.
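As an illustration of categories a, b and d, the sketch below splits a string into alphanumeric runs, delimiter runs and control sequences. The control syntax (a backslash, lowercase letters, an optional number and an optional trailing space) is a simplified, RTF-like assumption chosen for the example; graphics and junk strings are not handled, and a backslash that does not start a control word is simply ignored.

```python
import re

# One named alternative per category; the control syntax is a hypothetical,
# simplified RTF-like form and not the syntax of any particular editor.
TOKEN_RE = re.compile(
    r"(?P<control>\\[a-z]+\d* ?)"      # d. control character sequence
    r"|(?P<alnum>[A-Za-z0-9]+)"        # a. alphanumeric characters
    r"|(?P<delim>[^A-Za-z0-9\\]+)"     # b. delimiters (punctuation, spaces)
)

def classify(text: str) -> list:
    """Return (category, value) pairs in the order they appear."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]

tokens = classify(r"\b Hello\b0 , world 42")
words = [value for kind, value in tokens if kind == "alnum"]
```

With the categories separated, a search for text words can consult only the `alnum` tokens and never mistakes the letters of a control word for document text.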
Conventional data search engines cannot be configured to: a) recognize or identify values as i) elements for the control syntax of a data, spreadsheet or other kind of editor, or ii) part of a graphic image; or b) to modify the use of a value by specifying the characteristics associated with the value.
Thus, it is desirable to have a configurable search, filter, statistics, and conversion capability, with common control logic that: a) is applicable to many different data editors or character sets, b) provides field based operations, and c) can be implemented in either hardware or software. By using common control logic, the system can be reconfigured to support the variety of existing data editors, document formats and character sets and to support future data editors, document formats, and character sets without the need for hardware or software modifications. Moreover, the added ability to provide filtering and to collect statistics in hardware may significantly improve performance.
The present invention is directed to improved systems and methods for parsing, searching, filtering, gathering statistics, and converting data files generated by any data editor, using character sets and editor controls definitions that can be programmably defined. A single logic control module, implemented in either hardware or software, is used to perform a number of data manipulation functions, such as parsing, filtering, statistics gathering, and data conversion. The module is based on one or more programmably configurable protocol descriptions that may be stored in and retrieved from an associated memory.
By using common control logic, meaning a single logic control module, and programmably configurable character-set characteristics and data editor control protocol descriptions, changes can be made to existing data editor control protocol descriptions and support for new data editor control protocol descriptions can be added to a system entirely through user reconfiguration, without the need for hardware or software system modifications. Thus, those skilled in the art will appreciate that a data file manipulation system in accordance with the present invention may be configured and reconfigured in a highly efficient and cost effective manner to implement numerous data manipulation functions, such as parsing, and to accommodate substantial data editor modifications, such as the use of different editors, editor versions, or editor formats, without requiring substantial system changes.
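The idea of common control logic driven by configurable descriptions can be sketched as follows. The description format shown here (a list of named fields with a byte offset and size) and the two editor layouts are hypothetical and serve only to show that supporting a new layout means adding data, not changing the extraction routine.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Field:
    name: str
    offset: int   # byte offset of the field within the header
    size: int     # field width in bytes

# Hypothetical protocol descriptions, one per supported editor format.
DESCRIPTIONS = {
    "editor_a": [Field("version", 0, 1), Field("body_len", 1, 2)],
    "editor_b": [Field("body_len", 0, 4), Field("flags", 4, 1)],
}

def extract_fields(data: bytes, fields) -> dict:
    """Common control logic: identical code for every description."""
    return {
        f.name: int.from_bytes(data[f.offset:f.offset + f.size], "big")
        for f in fields
    }

a = extract_fields(bytes([2, 0x01, 0x00]), DESCRIPTIONS["editor_a"])
b = extract_fields(bytes([0, 0, 0, 5, 1]), DESCRIPTIONS["editor_b"])
```

Adding a third format in this sketch would require only a new entry in `DESCRIPTIONS`; `extract_fields` itself stays unchanged.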
In a preferred embodiment, the system employs a CPU or other hardware-implemented method as a processing unit for analyzing files in response to selectively programmed parsing, filtering, statistics gathering, and display requests. The embodiment may be incorporated in a device, including a CPU and a plurality of input devices, storage devices, and output devices wherein files are received from the input devices, stored in the storage devices, processed by the CPU based upon one or more programmably configurable protocol descriptions also stored in the storage devices, and displayed on the output devices. The protocol descriptions may take the form of one or more protocol descriptions for each supported data editor control defined therein.
A preferred embodiment of the logic control module includes logic for:
a) extracting field values from a particular file and making parsing decisions based upon field values and information in protocol descriptions;
b) filtering a subset of files or data from the input or storage devices that satisfies a filter criterion based upon information stored in a protocol description;
c) filtering a subset of files or data from the input or storage devices that satisfies a filter criterion based upon information stored in a Data-Filter-Object criteria;
d) filtering files or data that satisfy a filter request which includes several filter criteria joined together by Boolean operators, wherein the system creates an intermediate filter result for each criterion representing a filter/don't filter decision for each field;
e) analyzing a filter request by breaking the request into its component criteria to determine whether the result from evaluating particular filter request criteria, when combined with results from earlier criteria, can be used to filter a particular file or data;
f) collecting statistics based upon extracted field values that satisfy a statistics criterion based upon information stored in a protocol description;
g) determining the next protocol description structure required to continue analyzing a file;
h) determining a file length, individual protocol header lengths, and embedded lengths from extracted field values in a file;
i) determining display formats based on information contained in protocol descriptions;
j) evaluating individual field values and making parsing decisions based on the values; and
k) converting files by altering field contents based on information contained in protocol descriptions.
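Items d) and e) above can be read as an early-decision evaluation of Boolean-joined criteria. The sketch below is one possible interpretation, not the disclosed implementation: it assumes a filter request is a flat list of (field name, expected value) pairs all joined by a single operator, keeps a running intermediate result, and stops as soon as the remaining criteria can no longer change the outcome. The function name and the sample field values are illustrative.

```python
def evaluate_filter(fields: dict, criteria, joiner: str = "and"):
    """Return (result, number_of_criteria_evaluated).

    `criteria` is a list of (field_name, expected_value) pairs joined by
    a single Boolean operator, either "and" or "or".
    """
    if joiner not in ("and", "or"):
        raise ValueError("joiner must be 'and' or 'or'")
    result = (joiner == "and")          # identity element of the operator
    evaluated = 0
    for name, expected in criteria:
        evaluated += 1
        hit = fields.get(name) == expected          # intermediate result
        result = (result and hit) if joiner == "and" else (result or hit)
        # Stop once later criteria cannot alter the decision.
        if (joiner == "and" and not result) or (joiner == "or" and result):
            break
    return result, evaluated

fields = {"version": 2, "flags": 1, "body_len": 5}
criteria = [("version", 3), ("flags", 1), ("body_len", 5)]
and_result = evaluate_filter(fields, criteria, "and")
or_result = evaluate_filter(fields, criteria, "or")
```

In the AND case the first failed criterion settles the decision after one comparison; in the OR case the second criterion does so after two, and the third is never examined.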
The system gains a distinct advantage in size and maintainability over conventional data search/analysis/filter devices by implementing analysis capabilities for each data editor, data editor character, and data editor embedded control set, using common control logic. Furthermore, the system gains a distinct advantage in speed and efficiency over conventional data analysis devices when the control logic is implemented in hardware or a front-end processor, without requiring additional hardware and/or software development when data editors or data editor versions change.
Accordingly, it is the object of the present invention to provide improved systems, methods and machine implemented processes for data file analysis:
a) wherein the elements of the character set and the elements of the data editor controls that exist in the file are determined, also referred to herein as parsing, using a common control logic combined with configurable protocol descriptions and configurable character sets;
b) wherein the control logic may be implemented in hardware as well as software;
c) wherein each supported analysis capability is configurable even when the control logic is implemented in hardware;
d) that determine if a particular data file includes a field that satisfies a particular filter criterion;
e) that determine if a particular data file includes a field that satisfies a particular statistics gathering criterion; and
f) that convert data files from the format and characteristics of one data editor's selected protocol descriptions to those of another data editor.
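Conversion between editor formats, as in item f), can be illustrated with a table-driven substitution of control sequences. The two control syntaxes and the mapping table below are invented for the example; in the terms used above, the table would be derived from the protocol descriptions of the source and target editors.

```python
import re

def convert(text: str, control_map: dict) -> str:
    """Rewrite each source control sequence to its target equivalent."""
    if not control_map:
        return text
    # Longest keys first so that a longer control is never shadowed by a prefix.
    keys = sorted(control_map, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: control_map[m.group()], text)

# Hypothetical mapping from "editor A" controls to "editor B" controls.
A_TO_B = {
    "{B+}": "[bold]", "{B-}": "[/bold]",
    "{U+}": "[u]",    "{U-}": "[/u]",
}
converted = convert("plain {B+}strong{B-} and {U+}under{U-}", A_TO_B)
```

Only the control sequences listed in the table are altered; ordinary text passes through unchanged.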