1. Field of the Invention
The present invention generally relates to computer systems, and more particularly to a method and system for searching computer files distributed across a network, such as hypertext markup language (HTML) pages on the World Wide Web of the Internet.
2. Description of Related Art
A generalized client-server computing network 2 is shown in FIG. 1. Network 2 has several nodes or servers 4, 6, 8 and 10 which are interconnected, either directly to each other or indirectly through one of the other servers. Each server is essentially a stand-alone computer system (having one or more processors, memory devices, and communications devices), but has been adapted (programmed) for one primary purpose, that of providing information to individual users at another set of nodes, or workstation clients 12. A client is a member of a class or group of computers or computer systems that uses the services of another class or group to which it is not related. Clients 12 can also be stand-alone computer systems (like personal computers, or PCs), or “dumber” systems adapted for limited use with network 2 (like network computers, or NCs). A single, physical computer can act as both a server and a client, although this implementation occurs infrequently.
The information provided by a server can be in the form of programs which run locally on a given client 12, or in the form of data such as files that are used by other programs. Users can also communicate with each other in real-time as well as by delayed file delivery, i.e., users connected to the same server can all communicate with each other without the need for the network 2, and users at different servers, such as servers 4 and 6, can communicate with each other via network 2. The network can be local in nature, or can be further connected to other systems (not shown) as indicated with servers 8 and 10.
The construction of network 2 is also generally applicable to the Internet. In the context of a computer network such as the Internet, a client is a process (i.e., a program or task) that requests a service which is provided by another program. The client process uses the requested service without having to “know” any working details about the other program or the service itself. Based upon requests by the user, a server presents filtered electronic information to the user as server responses to the client process.
Conventional protocols and services have been established for the Internet which allow the transfer of various types of information, including electronic mail, simple file transfers via FTP (file transfer protocol), remote computing via Telnet, “gopher” searching, Usenet newsgroups, and hypertext file delivery and multimedia streaming via the World Wide Web (WWW). A given server can be dedicated to performing one of these operations, or can run multiple services. Internet services are typically accessed by specifying a unique address, or uniform resource locator (URL). The URL has two basic components, the protocol to be used, and the object pathname. For example, the URL “http://www.uspto.gov” (home page for the United States Patent and Trademark Office) specifies a hypertext transfer protocol (“http”) and a pathname of the server (“www.uspto.gov”). The server name is associated with a unique numeric value (a TCP/IP address, or “domain”).
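The two URL components described above can be illustrated with a minimal sketch. It uses Python's standard `urllib.parse` module, which is an assumption made here for illustration only and is not part of the described system; the variable names are likewise hypothetical.

```python
from urllib.parse import urlsplit

# The example URL given in the text.
url = "http://www.uspto.gov"

parts = urlsplit(url)

# First component: the protocol to be used.
protocol = parts.scheme

# Second component: the pathname, here only the server name
# because the example URL names no object below the server root.
server = parts.netloc
object_path = parts.path

print(protocol, server, repr(object_path))
```

For a longer URL, `parts.path` would hold the portion of the pathname that follows the server name.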
The present invention relates to searching of computer files that are distributed on a network like the Internet, but is particularly applicable to the WWW, which provides files that are conveniently linked for user access. For example, as illustrated in FIG. 2, a group 14 of files or pages 16a-16h are interrelated by providing hypertext links in each of the files (group 14 may thus be considered a typical “web site”). A hypertext link is an image that is viewable on the workstation's display 18, which can be selected by the user (e.g., using a pointing device or “mouse”) and which then automatically instructs client workstation 12 to request another page associated with that particular hypertext link (i.e., issue another URL). A hypertext link may appear as a picture, or as a word or sentence, possibly underlined or otherwise accentuated to indicate that it is a link and not just normal, informative text.
A WWW page may have text, graphic (still) images, and even multimedia objects such as sound recordings or moving video clips. A hypertext page, if more than just text, is usually constructed by loading several separate files, e.g., the hypertext file “main.html” might include a reference to a graphic image file “picture.gif” or to a sound file “beep.wav”. When a client workstation 12 sends a request to a server for a page, such as page 16a, the server first transmits (at least partially) the main hypertext file associated with the page, and then loads, either sequentially or simultaneously, the other files associated with the page. A given file may be transmitted as several separate pieces via the TCP/IP protocol. The constructed page is then displayed on the workstation monitor 18 as shown in FIG. 2. A page may be “larger” than the physical size of the monitor screen (i.e., larger than the software-programmed “window” provided for viewing the page), and techniques such as scroll bars are used by the viewing software (the web browser) to view different portions of the page.
The increasing number of pages available on the Web, and their sometimes ephemeral nature, can make navigation of the WWW more difficult, requiring a user to follow many links before the desired information is finally found. For these and other reasons, it is often difficult to find a particular web page. Some Internet services such as Yahoo have organized links under topics of information, but many web pages are found using various types of search engines. Some search engines are local, that is, software set up and running on a client workstation, while other search engines are remote, running on a server. For the Internet, many search engines utilize wide area information services (WAIS) servers, which provide databases (similar to concordances) having information regarding the contents of web pages. Some network searching utilities are limited to queries relying on boolean operators (AND, OR, NOT), while others are more sophisticated and use intelligent programming to present a more natural interface to the user, and to narrow down or prioritize the search results.
One problem that frequently occurs when searching for files on large networks such as the Internet is that far too many files (i.e., URLs) are found when searching for textual content within a page. This problem is exacerbated by the nature of web pages, that is, their use of field-based languages such as the hypertext markup language (HTML). This language provides a protocol for transmitting formatted information and control codes used to construct the “complete” page that is ultimately displayed by the browser. Different fields within the main HTML file are defined to store the formatted information and control code parameters, using tags. Tags not only mark elements, such as text and graphics, but can also be used to construct graphical user interfaces within the web page (such as buttons that are “depressed” by selecting them using the graphical pointer). In HTML, a tag is a pair of angle brackets (< >) that contain one or more letters and numbers between the angle brackets. One pair of angle brackets is often placed before an element, and another pair placed after, to indicate where the element begins and ends. For example, the language “<B>TODAY ONLY</B>” uses the “B” tag to provide a boldface formatting code for the words “TODAY ONLY.”
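As a rough illustration of how tags delimit an element, the boldface example can be separated into its opening tag, enclosed text, and closing tag. The sketch below assumes Python's standard `html.parser` module and a hypothetical `TagLister` class; note that this parser reports tag names in lower case.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Hypothetical helper that records each tag and text run in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called for the pair of angle brackets placed before an element.
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        # Called for the pair of angle brackets placed after an element.
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Called for the text between tags; whitespace-only runs are skipped.
        if data.strip():
            self.events.append(("text", data.strip()))


lister = TagLister()
lister.feed("<B>TODAY ONLY</B>")
lister.close()
events = lister.events
print(events)
```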
HTML fields and tags complicate searching strategies because they introduce additional words into the document which may not actually be relevant to the substantive content of the document. A prime example is embedded links in a document. If a user wants to locate pages which pertain to a particular topic, a search engine may indicate that many pages relate to the topic when, in actuality, they have little relation to the topic but they happen to include a link to a page dealing with the topic. In this situation, the user is less likely to be interested in viewing the dozens of pages which contain such links, and is more likely to want to see the pages to which those links point. These latter pages may, however, be lost in the search engine's output among hundreds, or even thousands, of search “hits.” So-called intelligent search engines might even place a higher confidence level on less relevant pages, leaving the most relevant pages at the end of the search results, so that it takes much longer for the user to notice their existence.
Also, the information desired during a search is often only in one portion of a page. For example, a user may remember that a specific string of text can be found at a specific point within an HTML file (say, the title), but there is no method of limiting a search to a particular field. Another related, and sometimes aggravating, phenomenon is the inclusion of a particular page in search results due to text which does not even appear in the browser's display, but rather is hidden using tags. Search engines will identify such pages even though the main body of the page completely omits any reference to the keywords of the search. This limitation in the search engine allows unscrupulous commercial interests to peddle their web pages to anyone using search engines, by including long, hidden lists of keywords in the main HTML file, wherein the keywords may have absolutely nothing to do with the page, but are included only to provide matches for search engines.
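The hidden-keyword problem can be made concrete with a small sketch. It assumes that the hidden text is carried in a `meta` keywords tag, which is only one of several ways text can be kept off the display; the sample page, the function names, and the `BodyText` class are all hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical page: the keyword list has nothing to do with the body text.
PAGE = (
    "<html><head><title>Discount Widgets</title>"
    '<meta name="keywords" content="patents, trademarks, copyright">'
    "</head><body><p>Buy widgets today.</p></body></html>"
)


def naive_match(source, keyword):
    """Search the raw file, tags and attributes included."""
    return keyword.lower() in source.lower()


class BodyText(HTMLParser):
    """Collects only the text that lies inside the body element."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        if self.in_body:
            self.chunks.append(data)


def body_match(source, keyword):
    """Search only the text collected from the body."""
    parser = BodyText()
    parser.feed(source)
    parser.close()
    return keyword.lower() in " ".join(parser.chunks).lower()


print(naive_match(PAGE, "patents"), body_match(PAGE, "patents"))
```

A search over the raw file reports a hit for "patents" because of the hidden list, while a search confined to the body does not.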
In light of the foregoing, it would be desirable to devise a method of limiting the results of network file searches to more relevant files. It would be further advantageous if a user could, for example, retrieve only those HTML files that contain the search text within a given portion (field) of the file.
It is therefore one object of the present invention to provide an improved method of searching for files stored in a computer system.
It is another object of the present invention to provide such a method that is adapted to identify files that are distributed within a computer network, particularly a large network such as the Internet.
It is yet another object of the present invention to provide such a method which searches the contents of files that use tag structures to provide formatted information or control codes.
The foregoing objects are achieved in a method of searching for files located in a computer system, generally comprising the steps of constructing a plurality of files on the computer system wherein each file has at least one of a plurality of fields, creating a search query, selecting a subset of the fields for searching wherein the subset is selected independent of the search query, and processing the files by examining the content of only those fields included in the subset for matching against the query. The fields can be selected by using a default setting. Certain fields are restricted from being selected, i.e., they are always ignored during the processing step. An illustrative embodiment is adapted for use with hypertext markup language (HTML) files that are transmitted along the Internet's World Wide Web, and have a plurality of tags embedded in the files to define the fields. The user interface may include a pop-up window that displays a list of the tags which may be embedded in the files, and allows a user to individually select one or more tags so displayed for searching.
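The steps summarized above might be sketched as follows. This is a simplified illustration under stated assumptions, not the claimed implementation: it uses Python's standard `html.parser`, treats each tag name as a field, attributes text to the innermost open tag only, and uses hypothetical names throughout (`DEFAULT_FIELDS`, `RESTRICTED_FIELDS`, `FieldExtractor`, `search`, and the sample files).

```python
from html.parser import HTMLParser

DEFAULT_FIELDS = {"title"}               # hypothetical default setting
RESTRICTED_FIELDS = {"script", "style"}  # hypothetical always-ignored fields
VOID_TAGS = {"meta", "br", "img", "link", "hr", "input"}  # no closing tag


class FieldExtractor(HTMLParser):
    """Groups the text of a file by the innermost tag that encloses it."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to, and including, the matching opening tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.fields.setdefault(self.stack[-1], []).append(text)


def search(files, query, selected=None):
    """Return names of files whose selected fields contain the query."""
    chosen = DEFAULT_FIELDS if selected is None else set(selected)
    chosen = chosen - RESTRICTED_FIELDS  # restricted fields are never examined
    hits = []
    for name, source in files.items():
        extractor = FieldExtractor()
        extractor.feed(source)
        extractor.close()
        for field in chosen:
            texts = extractor.fields.get(field, [])
            if any(query.lower() in t.lower() for t in texts):
                hits.append(name)
                break
    return hits


FILES = {
    "a.html": "<html><head><title>Patent basics</title></head>"
              "<body><p>How to file.</p></body></html>",
    "b.html": "<html><head><title>Links</title></head>"
              '<body><p>See <a href="a.html">patent basics</a>.</p></body></html>',
}

print(search(FILES, "patent"))         # default field selection
print(search(FILES, "patent", {"a"}))  # link text only
```

The field subset is passed separately from the query string, so the same query can be rerun against a different selection of fields without being rewritten.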
The above as well as additional objectives, features, and advantages of the present invention will become apparent in the following detailed written description.