1. Technical Field
This invention generally relates to the usability of Internet/intranet sites by users of web portals and/or search engines. The inter-relations among users, sites, search engines and the invention are summarised in FIG. 1 below.
2. Background Art
Software products such as Lotus Notes and Lotus Domino (both produced by Lotus Development Inc., a wholly-owned subsidiary of IBM Inc.) store information in structured or unstructured records, also called "notes", within files. With regard to Lotus products in particular, the files commonly have the extension ".nsf", which stands for Notes Storage Facility. Whether Internet search engines are able to locate information housed within such ".nsf" files depends upon how a particular Lotus system is configured. For instance, the default configuration of the Lotus Domino product makes ".nsf" files completely invisible to web search engine crawlers. As a result, users of search engines cannot obtain any information they might be searching for within ".nsf" files housed on a Lotus Domino system with a default configuration.
In an attempt to make ".nsf" files visible to Internet search engines, many Lotus site owners activate the "friendly URL format" configuration options found in the current Lotus products. However, activating these options typically causes the publishing of infinite loops or of redundant, duplicate Uniform Resource Locators (URLs) for ".nsf" files. Mixed configurations of the Lotus Domino product normally produce a wide range of similarly unsatisfactory intermediate results, including the involuntary publishing by site owners of copyrighted documentation and material included in the standard installation of the Lotus Domino software package. Operators or owners of Internet sites based on Lotus Domino do not normally have legal rights to re-publish this material, nor are they "authoritative sites" for that documentation.
Thus, Internet users who search for files over the Internet via search engines can become frustrated at not being able to find the most relevant documents stored in .NSF containers. This is due to the mismatch between search engine crawlers and the files created by Lotus Domino (.NSF files). And when a document created by Lotus Domino (a .NSF file) does turn up among search engine results, Internet users are further frustrated when the direct link to that document is "unframed" (meaning that it lacks the useful contextual information required by navigational tools, such as container and/or site context).
Lotus Domino is estimated, at the time of filing of this application, to be in use by approximately 50 million users worldwide, who use it to create documents (content), a significant portion of which is intended to be made available over the Internet via the HTTP protocol (popularly known as the WWW). Thus, there is a need to provide better management of ".nsf" files for use on the Internet to facilitate location of relevant information, as well as to properly manage secured and copyrighted information.
The present invention discloses an apparatus and method to support authoritative registration, location, de-duplication, security-clearance, sanitization, submission, indexation, and robotic/human reference-contextualized retrieval of Uniform Resource Locators (URLs) and/or contents of database servers. The method includes the provision of a worldwide, publicly available "Library" (a technical term for a listed set of databases) that lists (upon request or automatically) the authoritative site or sites for each registered ReplicaID. The method further includes a native crawler that finds the complete list of URLs by natively asking the system (with a Lotus Domino/Notes "NRPC" API call) for the exhaustive list of databases and of the records within each database.
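As an illustration only, the "Library" described above can be pictured as a lookup table from a ReplicaID to the site or sites registered as authoritative for it. The following minimal Python sketch is not the disclosed implementation: the class name `Library`, its methods, the case-insensitive treatment of ReplicaIDs and the example host name are all assumptions introduced for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Library:
    """Hypothetical in-memory registry: ReplicaID -> authoritative sites."""

    _sites: Dict[str, List[str]] = field(default_factory=dict)

    def register(self, replica_id: str, site: str) -> None:
        # Assumption: ReplicaIDs are compared case-insensitively.
        sites = self._sites.setdefault(replica_id.upper(), [])
        if site not in sites:  # ignore repeated registrations
            sites.append(site)

    def authoritative_sites(self, replica_id: str) -> List[str]:
        # Return a copy so callers cannot mutate the registry.
        return list(self._sites.get(replica_id.upper(), []))


library = Library()
library.register("852562f3007abfd6", "www.example.com")
library.register("852562F3007ABFD6", "www.example.com")  # duplicate, ignored
```

A query for a ReplicaID that was never registered returns an empty list, which a caller could treat as "no authoritative site known".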
The native crawler creates a "minimal" URL from just these two pieces of information (the database and the record). For each potential URL, a verification is made, using native Lotus Domino/Notes, to determine the likelihood that the document is "publicly available." Then the actual accessibility of the document is confirmed, and the contents of that URL are retrieved, by checking each URL with non-native, standards-based HTTP access (via a browser system call to port 80). Further, the methods of the present invention generate lists of unique URLs, each of which is marked as static, so the engines do not need to follow any non-static link. In addition, the resulting list is deduplicated, optimized and sanitized. The methods of the present invention use standard Internet domain name resolution (DNS) to virtualize each ReplicaID as a subsite of the general virtual site, which in turn is built with a combination of A, CNAME and NS records that include "wildcarded (*)" elements.
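The URL-building and de-duplication steps can be sketched as follows. This is a hedged illustration, not the claimed method: the URL path layout, the function names, the upper-casing used to normalize identifiers, and the scheme of using the lower-cased ReplicaID as a host label under a wildcarded base domain are all assumptions made for the example. The native accessibility check and the HTTP confirmation step are omitted.

```python
from typing import Iterable, List, Tuple


def minimal_url(host: str, replica_id: str, doc_id: str) -> str:
    """Build a URL from only the database (ReplicaID) and record identifiers.

    The path layout is an assumption chosen for illustration.
    """
    return f"http://{host}/__{replica_id.upper()}.nsf/0/{doc_id.upper()}"


def unique_urls(host: str, records: Iterable[Tuple[str, str]]) -> List[str]:
    """Return one URL per distinct (database, record) pair, in first-seen order."""
    seen = set()
    urls = []
    for replica_id, doc_id in records:
        url = minimal_url(host, replica_id, doc_id)
        if url not in seen:  # drop duplicates that differ only in letter case
            seen.add(url)
            urls.append(url)
    return urls


def virtual_host(replica_id: str, base_domain: str) -> str:
    """Name a per-ReplicaID subsite that a wildcard DNS record could resolve."""
    return f"{replica_id.lower()}.{base_domain}"


records = [
    ("852562F3007ABFD6", "ABC"),
    ("852562f3007abfd6", "abc"),  # same record, different letter case
    ("852562F3007ABFD6", "DEF"),
]
urls = unique_urls("www.example.com", records)
```

Normalizing identifiers before comparison is what lets the list collapse redundant URLs for the same record. A real deployment would also need to handle the other sources of duplication mentioned above, which this sketch does not attempt.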
The foregoing and other features and advantages of the invention will be apparent from the following more particular description of preferred embodiments of the invention, as illustrated in the accompanying drawings.