1. Field
The present invention pertains to extracting information included within a file. More specifically, the present invention relates to a method and apparatus for extracting and cataloging text, together with the characteristics of that text, from within a file.
2. Background Information
A network may be defined as a group of connected computers which allows users to share information and equipment. The group of connected computers typically includes one or more client computers connected to a server. The server stores information which may be accessed by the connected clients. The information stored on the server may include thousands of documents and files pertaining to a wide variety of subjects. Unfortunately, because many of the documents and files stored on the server are usually created using different application programs (such as Microsoft® Word, WordPerfect®, Excel®, and PowerPoint®), there are few effective methods which enable a network user to search these stored documents and identify all the documents which pertain to a particular subject. Similarly, there are few effective methods to search documents stored on one or more storage devices within a single computer when the documents are created using different application programs.
Currently, each application program generally employs its own file format. Thus, a searching application which enables a user to search for a particular subject within files or documents created using different application programs must be able to recognize the file format for each of the different application programs. Some application programs include methods for searching for specific text strings within documents created using that particular application program. For example, Microsoft® Word includes a document find function. Using the find function, a user may either highlight text within a particular document or enter text into a “find file” dialog box and then search for similar text within other documents created using Microsoft® Word.
There are several drawbacks associated with functions similar to Microsoft® Word's find function. First, these functions work only with documents created using the application program with which the function is associated. Second, the application program must be up and running to implement the function. Third, a significant amount of user interaction is required because the user must type in the specific text string which will be the subject of the search. Fourth, these functions are typically not capable of cataloging the entire document. Cataloging, as used herein, refers to storing the contents of a document in a manner which allows the document contents to be subsequently searched by a user. Cataloging the entire document expedites future searches of that document.
Another drawback of functions similar to Microsoft® Word's find function is that these functions are unable to rank search results according to hierarchy information. In this context, hierarchy information includes print characteristics of a string of text (e.g., font size, bold, underline, and position) which help determine whether the string of text is found within a title, within a paragraph sub-heading, or within a paragraph of a document. Hierarchy information helps the user determine the extent to which a particular string of text is addressed within a located document because text strings located in titles are likely to receive more extensive coverage within a document than text strings located within a sentence of a paragraph. When using a find function without a facility for ranking search results according to hierarchy information, a target word or text string within a sentence of a document appears to be just as important as the same target word or text string within a title of a document.
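By way of illustration only, ranking according to hierarchy information may be sketched as follows. The sketch assumes that text has already been extracted from each file and labeled with a hierarchy level. The names CatalogEntry, HIERARCHY_WEIGHTS, and rank_files, the numeric weights, and the sample file names are hypothetical and are not taken from any particular application program.

```python
from dataclasses import dataclass

# Hypothetical weights: a match in a title counts more than a match in a
# sub-heading, which in turn counts more than a match in a paragraph.
HIERARCHY_WEIGHTS = {"title": 3.0, "subheading": 2.0, "paragraph": 1.0}


@dataclass(frozen=True)
class CatalogEntry:
    """One extracted string of text and where it was found."""
    file_name: str
    text: str
    level: str  # "title", "subheading", or "paragraph"


def rank_files(catalog, term):
    """Return (file_name, score) pairs for files containing term, best first."""
    needle = term.lower()
    scores = {}
    for entry in catalog:
        hits = entry.text.lower().count(needle)
        if hits:
            weight = HIERARCHY_WEIGHTS[entry.level]
            scores[entry.file_name] = scores.get(entry.file_name, 0.0) + hits * weight
    # Highest score first; ties are broken alphabetically by file name.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Illustrative catalog covering files from different application programs.
CATALOG = [
    CatalogEntry("report.doc", "Quarterly Budget Review", "title"),
    CatalogEntry("report.doc", "Travel costs rose this quarter.", "paragraph"),
    CatalogEntry("memo.wpd", "Staff Meeting Notes", "title"),
    CatalogEntry("memo.wpd", "The budget was mentioned briefly.", "paragraph"),
    CatalogEntry("plan.xls", "Headcount targets for next year.", "paragraph"),
]

if __name__ == "__main__":
    for file_name, score in rank_files(CATALOG, "budget"):
        print(file_name, score)
```

In this sketch, a file whose title contains the target word is ranked ahead of a file that mentions the word only within a sentence, which is the distinction a find function lacking hierarchy information cannot make.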
Accordingly, there is a need for a method and apparatus which enables a user to extract the contents of files created using different application programs and catalog the extracted contents in a manner which permits the user to subsequently search the cataloged contents to identify files which include information related to a particular subject. The search result should rank identified files according to the extent to which the files are likely to address the subject.