In the prior art, it is well known that computer systems can be used to store the records of databases. Many techniques are known for parsing the information in databases. However, parsing large databases that store information in various encoding modalities presents special problems.
In recent years, a unique distributed database has emerged in the form of the World-Wide-Web (Web). The database records of the Web are in the form of pages accessible via the Internet. Here, tens of millions of pages are accessible by anyone having a communications link to the Internet.
The pages are dispersed over millions of different computer systems all over the world. Users of the Internet constantly desire to locate specific pages containing information of interest. The pages can be expressed in any number of different languages and character sets, such as English, French, German, Spanish, Cyrillic, Katakana, and Mandarin. In addition, the pages can include specialized components, such as embedded "forms," executable programs, Java applets, and hypertext.
Moreover, the pages can be constructed using various formatting conventions, for example, ASCII text, PostScript files, HTML files, and Acrobat files. The pages can include links to multimedia information content other than text, such as audio, graphics, and moving pictures. As a further complication, the Web can be characterized as a database subject to unpredictable, random updates, insertions, and deletions, with a constantly changing morphology.
Prior art parsing technologies are not always suited to handle the formidable task of parsing pages of the Web. Most parsers are designed to be context dependent. In addition, most parsers handle information having distinct attributes, such as titles, authors, and locations, separately from the information to which those attributes apply. This can be a severe limitation, particularly in an environment which permits the creation of pages in other linguistic and grammatical constructs.
It is desired to provide a parsing technique which can parse a large database having content and content attributes stored in a number of different forms. The parsing should provide indexable items which can be stored in a single unified index structure.