The present disclosure involves methods for developing full text searches for searching multiple file types which are distributed on a CD CACHE ROM.
In present-day commercial practice, many software development and computer companies deliver documentation to their customers in a number of different formats. The documentation may be on paper, for example, or in Adobe Acrobat Portable Document Format (PDF) files, Windows Help files, Hypertext Markup Language (HTML) files, or HTML Help files.
The documentation provided to recipients, such as customers, is distributed and made available on, for example, paper, CD ROMs, and Web servers.
Of course, it is desirable for a recipient or user to be able to make a full-text search of the received documents. However, users cannot perform full-text searches on paper documents, except through long, laborious reading and surveys of the documents. There are, however, software programs designated as “search engines” that exist in digital technology in order to search files that are distributed on CD ROMs.
However, these search engines are limited in a number of ways in providing search capability when the document or CD ROM involves multiple file types. Most of the existing search engines are designed only to search files of one particular format.
In this situation, it would be necessary to convert all files in the documentation or on the CD ROM into a common format, namely the format compatible with the particular search engine available.
However, when files are converted into a format different from that in which they were originally created, much of the functionality for searching the original file is lost, and this includes navigating through the file and finding certain content in the file.
There are other types of search engines which are capable, in a limited way, of searching multiple file types in the documentation or on the CD ROM. However, these are unable to open all the file types at the locations where the search terms appear and then move from one such location to the next within the document.
Thus, these other types of search engines require that the user first search with one preferred engine and then refine the search using another search engine designed for the particular file type.
One example of a standard (not full-text) search is the find function in a product such as Word. The operator tells Word to find a text string. Word then reads the text of the document one word at a time, beginning at a specified location, and compares the text against the string that was entered. When Word finds a “hit” (match), it highlights the text and stops searching. If the operator chooses the “Find Next” option, Word repeats the process and continues the search beginning just past the current hit. However, this is a brute-force and slow process.
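By way of illustration only, the sequential search described above can be sketched as follows. The function name `find_next` and the sample text are hypothetical; the sketch compares characters at each position rather than reproducing the actual behavior of Word.

```python
def find_next(text, needle, start=0):
    """Scan forward from `start`; return the offset of the first match, or -1."""
    last_start = len(text) - len(needle)
    for pos in range(start, last_start + 1):
        # Compare the document text at this position against the search string.
        if text[pos:pos + len(needle)] == needle:
            return pos
    return -1


document = "The server starts. Restart the server if the server hangs."

# First "Find": start at the beginning of the document.
first = find_next(document, "server")

# "Find Next": resume the scan just past the current hit.
second = find_next(document, "server", first + len("server"))
```

Every request rescans the text from the resume point, which is why the cost of this approach grows with the amount of text to be read.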
A “full text” search, however, searches a collection of files at one time. It accomplishes this by using an auxiliary collection of files that was created ahead of time and then distributed with the files that are to be searched. If, for example, the operator wished to search 450 files for the word “server,” the software would read the auxiliary files, which already record all occurrences and locations of the word “server.” The software would then present the operator with a “hit list,” built from the information in the auxiliary files, of all files that contain the word. If the operator elects to open any of these files, the software will open the file, move to the first location of the word in the file (which it already knows from the auxiliary files), and then highlight the word. It may be noted that none of the files are directly searched or scanned. By using such auxiliary files, the operator or user can utilize advanced features such as wild cards (“install*”) and Boolean operators (“installation and not printers”).
There are a number of ways to create these auxiliary files. Such a process may take several hours for most releases made on CD-ROM. The success of a “search engine” can be measured by how efficiently the desired files are generated and accessed.
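By way of illustration only, the role of the auxiliary files can be sketched as a small in-memory index that maps each word to the files and offsets where it occurs. The names `build_index` and `files_matching`, the tokenization rule, and the three sample files are assumptions made for this sketch; a commercial search engine stores considerably more information.

```python
import re
from collections import defaultdict
from fnmatch import fnmatchcase


def build_index(files):
    """Map each lowercased word to {filename: [character offsets]}."""
    index = defaultdict(dict)
    for name, text in files.items():
        for match in re.finditer(r"[A-Za-z0-9]+", text):
            word = match.group().lower()
            index[word].setdefault(name, []).append(match.start())
    return index


def files_matching(index, pattern):
    """Return the set of files containing any indexed word that matches a wildcard pattern."""
    hits = set()
    for word, postings in index.items():
        if fnmatchcase(word, pattern.lower()):
            hits.update(postings)  # adds the filenames (the dict keys)
    return hits


files = {
    "setup.txt": "Install the server, then start the installation wizard.",
    "print.txt": "Installation of printers requires the print server.",
    "notes.txt": "Release notes for this CD-ROM.",
}
index = build_index(files)  # built once, ahead of time

# Hit list for "server": read from the index, no file is scanned.
server_hits = sorted(index["server"])
# Location of the first occurrence in one file, also read from the index.
first_offset = index["server"]["setup.txt"][0]
# Wild card: install*
wildcard_hits = files_matching(index, "install*")
# Boolean: installation and not printers
boolean_hits = files_matching(index, "installation") - files_matching(index, "printers")
```

The cost of reading the files is paid once, when the index is built; each later query consults only the index.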
The present invention provides for the use of an existing search engine that is designed to support the searching of one particular file format (PDF, or Adobe® Acrobat® files). This can then be extended to allow the searching of virtually any other type of file format such as HTML, HTML Help, or Windows Help. The method and system accomplish this by creating a PDF file “duplicate” consisting of the text from the file that the operator wants to search, so that the search engine can find the text in the duplicate. A link is then provided from each page in the PDF duplicate into the corresponding location in the file of the other format, so that the user-operator has essentially performed a full-text search in that file.
The present method and system involve a technique that is used to search Portable Document Format (PDF) files that contain the text extracted from files in other formats such as Windows Help, Hypertext Markup Language (HTML) Help, and HTML.
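By way of illustration only, the relationship between the duplicate and the original file can be modeled as a list of pages, each holding extracted text and a link target into the original. The class `DuplicatePage`, the functions `make_duplicate` and `search_duplicate`, and the `file#anchor` target form are hypothetical; the sketch does not produce actual PDF output.

```python
from dataclasses import dataclass


@dataclass
class DuplicatePage:
    text: str    # text extracted from one section of the original file
    target: str  # assumed link target inside the original file


def make_duplicate(original_name, sections):
    """Build one duplicate page per (anchor, text) section of the original."""
    return [
        DuplicatePage(text=text, target=f"{original_name}#{anchor}")
        for anchor, text in sections
    ]


def search_duplicate(pages, term):
    """Return (page number, link target) for each page whose text contains the term."""
    term = term.lower()
    return [
        (number, page.target)
        for number, page in enumerate(pages, start=1)
        if term in page.text.lower()
    ]


sections = [
    ("intro", "Overview of the product."),
    ("install", "Run setup to install the server."),
    ("trouble", "If the server fails, check the log."),
]
pages = make_duplicate("guide.htm", sections)

# The search runs against the duplicate; each hit carries a link to the original.
hits = search_duplicate(pages, "server")
```

Because each page keeps its own target, a hit found in the duplicate can be resolved to a location in the original file without converting or searching the original.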
On each page of the PDF file there are hyperlinks that the user can select to open the original file at the corresponding location.
The method enables the user to search the collection of PDF files, including both files that were created as PDF files as well as the PDF files created from the text extracted from the files of other formats. The method uses the search engine from Verity that is distributed by Adobe® in order to search the Adobe® Acrobat® portable document format files on a CD ROM. If the search targets include files of formats other than PDF, then the user is presented with pages within the PDF copy of the file in which the target text appears.
The user can navigate within the PDF copy using the “next hit” and “previous hit” program options. The text is visible to the user and is sufficient to help the user determine whether it is necessary or helpful to access the original file.
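By way of illustration only, navigation between hits can be sketched as a cursor over an ordered hit list. The class `HitCursor` and the (page, offset) form of each hit are assumptions of this sketch and do not describe the internal behavior of Acrobat.

```python
class HitCursor:
    """Step forward and backward through an ordered list of hits."""

    def __init__(self, hits):
        self.hits = list(hits)
        self.pos = -1  # positioned before the first hit

    def next_hit(self):
        """Advance to the next hit; stay on the last hit at the end of the list."""
        if self.pos + 1 < len(self.hits):
            self.pos += 1
        return self.hits[self.pos] if self.hits else None

    def previous_hit(self):
        """Move back one hit; stay on the first hit at the start of the list."""
        if self.pos > 0:
            self.pos -= 1
        return self.hits[self.pos] if self.pos >= 0 else None


# Each hit is an assumed (page number, character offset) pair in the PDF copy.
cursor = HitCursor([(2, 14), (2, 80), (5, 3)])
a = cursor.next_hit()
b = cursor.next_hit()
c = cursor.previous_hit()
```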
Each page of the PDF file carries a “button” that, when selected, opens the document in the original format at the location corresponding to the location displayed in the PDF copy. Both the PDF copy and the original file are accessible at the same time, so it is possible to identify the location of the hits within the file and to find additional hits in the complete collection of files.
The indicated method includes software that extracts the text from Windows Help, HTML, and HTML Help files and creates from that text new files that can be converted by the standard Adobe software into PDF files. Every page of these PDF files carries a corresponding explanatory message and a button to support linking into the corresponding location within the original file.
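By way of illustration only, the extraction step for an HTML file can be sketched with the Python standard library class `html.parser.HTMLParser`. The class `AnchoredTextExtractor`, the function `extract_pages`, the wording of the explanatory message, and the use of named anchors as link targets are assumptions of this sketch. It handles neither Windows Help nor HTML Help files, does not filter script or style content, and stops short of the PDF conversion itself.

```python
from html.parser import HTMLParser


class AnchoredTextExtractor(HTMLParser):
    """Collect visible text, grouped under the most recent named anchor."""

    def __init__(self):
        super().__init__()
        # Each section is [anchor name, list of text fragments].
        self.sections = [["", []]]

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        name = attributes.get("name") or attributes.get("id")
        if tag == "a" and name:
            self.sections.append([name, []])

    def handle_data(self, data):
        if data.strip():
            self.sections[-1][1].append(data.strip())


def extract_pages(filename, html):
    """Return one page record per anchored section that contains text."""
    parser = AnchoredTextExtractor()
    parser.feed(html)
    parser.close()
    pages = []
    for anchor, parts in parser.sections:
        if not parts:
            continue
        pages.append({
            # Hypothetical explanatory message placed on the page.
            "message": f"This text was extracted from {filename}.",
            # Hypothetical button target: the matching location in the original.
            "button_target": f"{filename}#{anchor}" if anchor else filename,
            "text": " ".join(parts),
        })
    return pages


html = (
    '<html><body><a name="install"></a><h1>Installing</h1>'
    '<p>Run <b>setup</b> first.</p>'
    '<a name="remove"></a><h1>Removing</h1><p>Use the control panel.</p>'
    '</body></html>'
)
pages = extract_pages("guide.htm", html)
```

Each record holds what one page of the duplicate would need: the searchable text, the message, and the target that the button on that page would open.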
This method then provides the ability to link from the hits displayed in Adobe Acrobat into the corresponding locations within the original files.