The present invention relates to the field of data processing, and particularly to a software system and associated method for use with a search engine, to search data maintained in systems that are linked together over an associated network such as the Internet. More specifically, this invention pertains to a computer software product for generating profile matches between a structured document and web documents.
The World Wide Web (WWW) is comprised of an expansive network of interconnected computers upon which businesses, governments, groups, and individuals throughout the world maintain inter-linked computer files known as web pages. Users navigate these pages by means of computer software programs commonly known as Internet browsers. Due to the vast number of WWW sites, many web pages have a redundancy of information or share a strong likeness in either function or title. The vastness of the WWW causes users to rely primarily on Internet search engines to retrieve information or to locate businesses. These search engines use various means to determine the relevance of a user-defined search to the information retrieved.
A typical search engine has an interface with a search window where the user enters an alphanumeric search expression or keywords. The search engine sifts through its index of web pages to locate the pages that match the user's search terms. The search engine then returns the search results in the form of HTML pages. Each set of search results includes a list of individual entries that have been identified by the search engine as satisfying the user's search expression. Each entry or "hit" includes a hyperlink that points to a Uniform Resource Locator (URL) location or web page.
A significant portion of the WWW documents today are authored in HTML, which is a mark-up language that describes how to display page information through a web browser and how to link documents to each other. HTML is an instance of SGML (Standard Generalized Markup Language) and is defined by a single document schema or Document Type Definition (DTD). The document schema puts forth a set of grammatical rules that define the allowed syntactical structure of an HTML document. The schema, or structure of HTML pages, is typically consistent from page to page.
Currently, Extensible Markup Language (XML) is gaining popularity. XML, which is a subset of SGML, provides a framework for WWW authors to define schemas for customized mark-up languages to suit their specific needs. For example, a shoe manufacturer might create a "shoe" schema to define an XML language to be used to describe shoes. The schema might define mark-up tags that include "color", "size", "price", "material", etc. Hence, XML documents written in this shoe language will embed semantic, as well as structural, information in the document. For example, a shoe XML document uses the mark-up tag "color" to indicate that the shoe is "blue".
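As a hedged illustration of the shoe example, the following Python sketch parses a small hypothetical shoe document with the standard library. The tag names follow the tags mentioned above; the values and the describe_shoe helper are assumptions introduced only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical shoe document; the tag names follow the example in the text,
# the values are invented for illustration.
SHOE_XML = """
<shoe>
  <color>blue</color>
  <size>9</size>
  <price>49.95</price>
  <material>leather</material>
</shoe>
"""

def describe_shoe(xml_text):
    """Return a dict mapping each child tag of <shoe> to its text value."""
    root = ET.fromstring(xml_text.strip())
    return {child.tag: (child.text or "").strip() for child in root}

shoe = describe_shoe(SHOE_XML)
print(shoe["color"])  # the semantic tag "color" carries the value "blue"
```

Because the tag itself names the attribute, a program can read the color directly instead of guessing it from surrounding display text.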
One advantage of XML is that it allows the efficient interchange of data from one business to another (or within the business itself). A business may send XML data that conforms to a predefined schema to another business. If the second business is aware of the first business's schema, it may use a computer program to efficiently process the data. To enable this efficient data interchange and processing, XML requires that standard and high-quality schemas be developed and conformed to by XML documents.
As noted, the XML framework allows for the definition of document schemas, which give the grammars of particular sets of XML documents (e.g. shoe schema for shoe-type XML documents, resume schema for resume-type XML documents, etc.). The XML framework also puts forth a set of structural rules that all XML documents must follow (e.g. open and close tags, etc.). Moreover, it is possible for an XML document to have no associated schema. If a document has an associated schema, the schema must be specified within the document itself or linked to by the document.
Information about the quality of an XML document may be inferred by its conformance with the rules put forth by this XML framework. An XML document is said to be "valid" if it has an associated schema and conforms to the rules of the schema. An XML document is said to be "well-formed" if it follows the general structural rules for all XML documents. Ultimately, a high-quality document has a higher probability of being both "valid" and "well-formed" than a low-quality document.
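The well-formedness test can be sketched as follows, assuming Python's standard xml.etree.ElementTree parser. This parser does not validate against a DTD or schema, so the sketch covers only the "well-formed" half of the distinction; checking validity would require a validating parser. The is_well_formed name is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Properly nested open and close tags parse without error.
print(is_well_formed("<shoe><color>blue</color></shoe>"))
# The <color> tag is never closed, so parsing fails.
print(is_well_formed("<shoe><color>blue</shoe>"))
```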
In addition, like HTML documents, XML documents form a hyperlinked environment in which each XML document that has an associated schema provides a link to the schema (if the schema is not defined within the document itself). Moreover, each XML document, using various mark-up structures, such as XLink or XPointer, may link up to other XML structures and XML documents. Unlike the HTML environment, however, the schemas of each hyperlinked document may vary from document to document. A document that satisfies one particular schema can point to a document that satisfies a different schema. Further, two documents with different schemas can point to a document with a third schema. The quality of each schema may vary significantly.
A search of web pages using keywords, in most cases, returns an over-abundance of search results. For example, a search for "Harvard" might result in an excessive number of web pages. Search engines face the challenge of matching these results to a profile provided by the user. Text-based matching alone will often miss some pages that are relevant to the search.
Harvest is a program that tries to solve the robotic copying problem by indexing each site rather than copying its entire contents. Using Harvest, a web site can automatically produce a concise representation of the information on its site. This informational snapshot is then provided to interested crawlers, which avoids congesting the server and slowing down the Internet. One Harvest concept of an automatically generated information snapshot index is known as metadata and is written in a language known as Summary Object Interchange Format (SOIF). SOIF extracts such details as title, author's name, data type, and, if one is available, the abstract from a web site. In the case of text files, the entire text is included.
Webcasting, or Internet push, automatically delivers information to the users based on user profiles. Information frequently updated and of regular interest to the users becomes a prime target for webcasting delivery such as headline news and stock quotes.
One of the main problems facing webcasting is the lack of sufficient support for personalization in that a subscribed channel often contains a significant amount of information irrelevant to the users' interest. For example, users cannot customize their subscription to receive only information about their favorite teams when subscribing to a sports channel. Moreover, the bandwidth wasted by delivering irrelevant content exacerbates the burden on network infrastructure, preventing widespread deployment.
Therefore, there still remains a need for a solution that enables users to filter subscribed channels according to their needs in an individualized profile and, more importantly, matches profiles against available content on the server side. Thus, only information pertaining to the user's personal interest needs to be displayed and delivered over the network, significantly enhancing usability while reducing network traffic.
The Grand Central Station (GCS) project is more than a search engine. GCS combines both information discovery and webcasting-based information dissemination into a single system. GCS builds a profile of the user and keeps him or her informed whenever something new and relevant appears on the digital horizon. The GCS system generally includes two main components. The first component constantly gathers and summarizes new information in the manner of a robotic crawler. The second component matches this information against the profiles of individual users and delivers it to a specified location, computer, or electronic device.
One aspect of GCS is that it is not limited to interacting with the user's desktop computer. GCS technology also pushes the information to devices such as Personal Digital Assistants (PDAs). As an example, a PDA owner might check the latest sports scores, traffic conditions and weather on the way home from work. The concept of having information available as needed is termed "just-in-time information", in analogy to the just-in-time (JIT) manufacturing concept. The search engines of GCS that look for information on sales figures, airport directions, patent citations and box scores are computer programs running on workstations; they are termed gatherers and are derived from the University of Colorado's Harvest archival computer indexing system. To handle the information growth, GCS splits up the task of searching among several gatherers.
The GCS Gatherer can gather information from most common sources such as HTTP, FTP, News, database, and CICS servers, and summarizes data in a variety of formats such as HTML, GIF, PowerPoint, PostScript, VRML, TAR, ZIP, JAR, Java Source, JavaBeans, and Java class files. Represented in the XML format, a GCS summary contains the metadata for each gathered item and its salient features that are useful for search purposes. This allows the users to search diverse information with uniform queries.
GCS broadens the scope of webcasting by making data from anywhere in any format available as channel content. It also provides fine-grain personalization capabilities for the users to specify filters in any subscribed channel. The heart of GCS webcasting is the profile engine, which maintains a large profile database and matches it against incoming data received from GCS collectors. Data satisfying certain profiles will be automatically delivered to the corresponding users. Users interact with the GCS client to subscribe to web channels, specify filters to personalize a subscribed channel, and display delivered information in various forms. The profile engine consults the channel database to automatically compile data into a hierarchy of channels. System administrators can define channels using the channel administration tool according to the specific needs of the site where the system is deployed.
The gatherers collect all the available information. Most of the search engines currently available on the Internet work in one of two ways. "Crawlers," such as AltaVista(copyright) and HotBot(copyright), try to visit every site on the web, indexing all the information they find. The information provided by searches on sites built by crawlers suffers from an overload syndrome, typically producing too much irrelevant data.
On the other hand, hierarchical engines may suffer from the opposite problem in that they may miss information that does not fit into their manicured schema. Hierarchical engines are akin to card catalogs. A staff of librarians constantly scans information collected about websites and places sites into an information hierarchy.
The GCS uses a crawler designed to retrieve obscure information that other search engines miss. The GCS crawler can communicate using most of the popular network protocols, which enables it to access information from a variety of data sources such as Web servers, FTP servers, database systems, news servers and even CICS transaction servers. CICS is an IBM application server that provides industrial-strength, online transaction management for mission-critical applications. The GCS crawler is designed to track file systems on machines in dozens of formats that are not commonly considered a part of the World Wide Web lexicon. This data can take the form of corporate presentations, database files, Java byte code, tape archives, etc.
The crawler passes the information that it discovers to the second stage of the gatherer. This stage is called the recognizer, and distinguishes the different kinds of information (i.e., database files, web documents, emails, graphics or sounds) the gatherer has unearthed. The recognizer filters the information to remove irrelevant material before transmitting it to the summarizer.
The summarizer is a collection of plug-in programs in which the appropriate program is "plugged in" to handle a particular data type. It takes each of the data types the recognizer can identify and produces a summary represented in a metadata format known as the Extensible Markup Language/Resource Description Framework (XML/RDF), an emerging standard for metadata representation. The metadata for a web page, for example, might contain its title, date of creation and an abstract if one is available, or the first paragraph of text if it is not. As new programs are developed that are programmed to understand document types, they may be incorporated into the open architecture of GCS.
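The recognizer and plug-in summarizer stages described above can be sketched as a small registry, assuming Python. This is not the GCS implementation: the extension-based recognizer, the SUMMARIZERS registry, and the dictionary summaries (used here in place of XML/RDF output) are all assumptions made for illustration.

```python
# Registry of summarizer plug-ins keyed by data type (hypothetical design).
SUMMARIZERS = {}

def summarizer(data_type):
    """Decorator that registers a plug-in for one data type."""
    def register(func):
        SUMMARIZERS[data_type] = func
        return func
    return register

def recognize(name):
    """Toy recognizer: infer the data type from the file extension."""
    if name.endswith((".html", ".htm")):
        return "html"
    if name.endswith(".txt"):
        return "text"
    return None  # unknown type, filtered out before summarization

@summarizer("html")
def summarize_html(content):
    # Use the <title> element if present; otherwise leave the title empty.
    start = content.find("<title>")
    end = content.find("</title>")
    title = content[start + 7:end] if 0 <= start < end else ""
    return {"type": "html", "title": title}

@summarizer("text")
def summarize_text(content):
    # Use the first non-empty line as a stand-in for an abstract.
    lines = [ln.strip() for ln in content.splitlines() if ln.strip()]
    return {"type": "text", "abstract": lines[0] if lines else ""}

def summarize(name, content):
    """Recognize the item, then dispatch to the matching plug-in."""
    plugin = SUMMARIZERS.get(recognize(name))
    return plugin(content) if plugin else None

print(summarize("index.html", "<html><title>Shoes</title></html>"))
```

Supporting a new document type in this sketch means registering one more function, which mirrors the open, plug-in architecture the text describes.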
Regardless of the data type, typically, all XML summaries look similar, which facilitates their collection, classification, and search. A Web server associated with each gatherer makes the XMLs available to a central component called the collector. From the XMLs, the collector creates a database that is essentially a map of the digital universe. The collector co-ordinates the work of the gatherers so as not to repeat work. For example, when the gatherer looking for information in North America comes across a link to Japan, it informs the collector, which passes this information on to the Japan gatherer. Gatherers may be assigned by a GCS administrator to specific domains in the digital universe, but over time they may migrate dynamically to distribute the overall load of the system.
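The collector's coordinating role can be sketched as follows, assuming Python. The assignment of gatherers by domain suffix, the Collector class, and its report_link method are hypothetical; the text states only that the collector passes links to the responsible gatherer and avoids repeated work.

```python
from urllib.parse import urlparse

class Collector:
    """Routes discovered links to the gatherer assigned to their domain."""

    def __init__(self, assignments):
        # assignments: domain suffix -> gatherer name (hypothetical mapping)
        self.assignments = assignments
        self.queues = {name: [] for name in assignments.values()}
        self.seen = set()  # URLs already dispatched, to avoid repeated work

    def report_link(self, url):
        """Forward url to the responsible gatherer; return its name or None."""
        if url in self.seen:
            return None
        host = urlparse(url).hostname or ""
        for suffix, gatherer in self.assignments.items():
            if host.endswith(suffix):
                self.seen.add(url)
                self.queues[gatherer].append(url)
                return gatherer
        return None  # no gatherer is assigned to this domain

collector = Collector({".jp": "japan", ".com": "north_america"})
# A link to a Japanese site found elsewhere is handed to the Japan gatherer.
print(collector.report_link("http://example.jp/page"))
```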
The gatherers and the collector make up the GCS search engine. The power of GCS lies in its ability to match information to the interests and needs of users. A program known as a profile engine performs this task. Starting with the user's queries, it constructs information profiles that it continuously matches against the incoming information. As relevant material is found, the profile engine distributes it to administration servers that deliver it to the client's desktop computer or PDA.
Commercially available systems push channels of information to a user's desktop using a browser available at http://www.entrypoint.com. However, those channels are predefined, broad and unfiltered. GCS users can create channels that are as narrow or as broad as they wish. As the user switches from channel to channel, the information scrolls by in "tickers," similar to the stock market ticker tapes.
The quality of the information delivered by GCS improves with use. This advance stems from a concept known as a relevance tracker. However, like all search engines, GCS inevitably delivers a lot of information that may be unrelated to the initial query. To address this problem, GCS includes a learning engine to analyze information that the user accepts and rejects, to refine queries and cut down on irrelevant provision of data.
Two forms of information transfer on the Internet are known as push and pull. A push is a one-time definition or query that elicits a vast number of results, forcing the questioner to spend time sifting through piles of irrelevant information in quest of the required answer. The technical definition of push is any automatic mechanism for getting information off the web from the user's perspective. A pull is a very specific query specification that may be too specific to pull in the precise information required.
Push means that new information is delivered or retrieved automatically from a remote computer to the user's computer. Information does not need to be updated manually on a regular basis. Grand Central Station technology is designed ultimately to allow users to both pull and push information on the web. Its advantage lies in the ability to tailor its searches to the requirements of individual users.
Unified messaging is another example of push-pull technology, and represents the convergence of e-mail, fax, and voice mail technology. A message can start as a fax and be converted into an e-mail message for delivery to the in-box of a mail server or an e-mail message can be transmitted to a fax number. Some services convert e-mails to voice messages so the messages can be heard over the telephone as a voice mail. This illustrates the multimedia nature of a push-pull style for information delivery through e-mail text, fax, images or audio presentation.
Java(copyright) is designed as a universal software platform which is currently being used to build streamlined applications that can easily be distributed across networks, including the Internet and corporate intranets. Appropriately equipped users download Java(copyright) "applets" and run them on their personal computers, workstations, or network computers.
GCS represents a good example of a Java(copyright)-developed tool, and an "intelligent agent" that crawls through all sections of the Internet searching for user-specified information. After automatically filtering, collecting and summarizing this information, GCS brings it to the attention of the user on a computer or a PDA.
Numerous indexing systems, such as freewais-sf, are available on the Internet. Freewais-sf has the ability to conduct field searching and documented relevance ranking. Harvest is another indexing system which is a modular system of applications consisting primarily of a "gatherer" and a "broker." Given URLs or file system specifications, the gatherer collects documents and summarizes them into a format called SOIF (Summary Object Interchange Format). SOIF is a meta-data structure. The broker's task is to actually index the SOIF data. In its present distribution, brokers can index the SOIF data using SWISH or WAIS techniques. Harvest's strength lies in its ability to easily gather and summarize a wide variety of file formats. Harvest provides indexed access to well and consistently structured HTML documents.
Profile matching enables the creation of an "ideal" personality profile against which job applicants are compared. Studies have shown that those job applicants who most closely match the "ideal" profile are the most productive workers and experience lower stress when performing the job. Psychologists comment on specific factors relevant to the job which should be considered when making employment decisions and can provide a list of interview prompts based on the differences between the "ideal" and candidate profiles. Profile matching is the most cost-effective method of checking a candidate's suitability for a given role, and is ideally suited to screening large numbers of applicants.
The market of web-based recruiting is expected to grow significantly. Websites like monster.com, hotjobs.com, and careercentral.com provide facilities to register, post resumes and jobs, and search for jobs, among other things. These portals provide facilities to tailor personal resumes in looking for job matches. Notification concerning job matches is typically performed through email. A centralized server is used to store both personal and job posting data. Job and personnel matching are believed to be performed through keyword matching. Personal data resides on a central server and the user exercises little or no control over the matching process.
There is therefore a long felt and still unsatisfied need for an enhanced profile matching system and method that provide accurate matches.
The present profile matching system and method satisfy this need by matching the path expressions (i.e., profile matching) in a structured or semi-structured document, such as an XML document (e.g. a resume with headings), to an indexed resource (i.e., an index). The system, having assigned weighting values to the elements in the index, maps the document path expressions and attempts to match them to the index elements according to a predetermined schema. If needed, the system converts the document schema to that of the index, in order to harmonize the schemas, thus facilitating the mapping and matching process.
The foregoing and other features of the present invention are realized by a profile matching system comprised of an indexing module that maps the document and identifies its content attributes; and a matching module that matches the document content attributes to weighted elements of an index.
As an example, the system considers the schema of a job applicant's resume (the document to be mapped) and the weighted index elements of the job posting. For every attribute in the resume schema, the system defines the attribute or set of attributes in the job schema that result in a match. The matching criteria are specified in a map specification file that specifies the specific qualification criteria for a job applicant seeking a particular job. This basically requires taking into account the important attributes of the job description and determining if the applicant possesses matching qualifications.
The indexing module uses the map specification information to produce efficient indices from the "resume" XML document. Another instance of the indexing component produces efficient indices from the "job" XML posting or document. The matching module is a driver based upon the map specification file that navigates the resume index document and the job index document to define matches.
The matching module uses a match specification language (MSL) and a match operator. Each rule in the MSL is a pair of path expressions: one for the source document (i.e., resume) and one for the target document or index (i.e., job). As an illustration, for a rule r in the MSL, a source path expression sr, a target path expression st, and a match operator m, a match occurs if m(sr, st)=true. In addition, each rule may have a weighting factor. The weighting factor specifies the weight of this rule against other rules in the specification file. The weighting factor can be a real number between 0 and 1. The matching process basically processes the rule specification against the index files of the source and the target documents, cumulatively weights the rule matches, and identifies an overall match criterion for each target document.
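The rule evaluation just described can be sketched as follows, assuming Python and its standard xml.etree.ElementTree path support. The Rule class, the overlaps operator, the sample documents, and the normalization of the cumulative weight by the total weight are assumptions for illustration; the text specifies only pairs of path expressions, a match operator m, and a weighting factor per rule.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Rule:
    source_path: str   # path expression sr into the source (resume)
    target_path: str   # path expression st into the target (job)
    match: Callable[[List[str], List[str]], bool]  # match operator m
    weight: float = 1.0  # weighting factor between 0 and 1

def values(doc, path):
    """Return the stripped text of every element selected by path."""
    return [(el.text or "").strip() for el in doc.findall(path)]

def overlaps(source_values, target_values):
    """Example operator: true if the two value lists share any entry."""
    return bool(set(source_values) & set(target_values))

def score(rules, source, target):
    """Weighted fraction of rules whose operator returns true."""
    total = sum(r.weight for r in rules)
    hit = sum(r.weight for r in rules
              if r.match(values(source, r.source_path),
                         values(target, r.target_path)))
    return hit / total if total else 0.0

# Hypothetical resume and job documents.
resume = ET.fromstring(
    "<resume><skills><skill>Java</skill><skill>XML</skill></skills>"
    "<degree>BS</degree></resume>")
job = ET.fromstring(
    "<job><requirements><skill>XML</skill></requirements>"
    "<education>MS</education></job>")

rules = [
    Rule("skills/skill", "requirements/skill", overlaps, weight=0.75),
    Rule("degree", "education", overlaps, weight=0.25),
]
print(score(rules, resume, job))  # prints 0.75: only the skill rule matches
```

Keeping the operator as a parameter of each rule lets different rules compare values differently, for example by exact overlap for skills and by an ordering for degrees.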
In the resume and job posting example, as a new job applicant submits his or her resume to a web site, the matching module matches the resume using the match specification file against all the available job postings. As new job postings are added, the matching module incrementally matches them to previously matched resumes. As new resumes are added, the matching module matches them against existing job postings. Every time the applicant logs in to the web site, the system provides him or her with a dynamically generated personalized listing of the most current top job postings matching his or her qualifications.
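The incremental behavior described here can be sketched as follows, assuming Python. The MatchStore class and the skill_overlap scorer are hypothetical; any scoring function, such as one driven by a match specification file, could be supplied in its place.

```python
class MatchStore:
    """Keeps resume/job scores up to date as either side grows."""

    def __init__(self, scorer):
        self.scorer = scorer  # callable(resume, job) -> float
        self.resumes = {}     # resume id -> resume data
        self.jobs = {}        # job id -> job data
        self.scores = {}      # (resume id, job id) -> score

    def add_resume(self, resume_id, resume):
        # A new resume is matched against every existing job posting.
        self.resumes[resume_id] = resume
        for job_id, job in self.jobs.items():
            self.scores[(resume_id, job_id)] = self.scorer(resume, job)

    def add_job(self, job_id, job):
        # A new posting is matched only against already stored resumes.
        self.jobs[job_id] = job
        for resume_id, resume in self.resumes.items():
            self.scores[(resume_id, job_id)] = self.scorer(resume, job)

    def top_jobs(self, resume_id, limit=10):
        """Return job ids for one resume, best score first."""
        ranked = sorted(
            ((s, j) for (r, j), s in self.scores.items() if r == resume_id),
            key=lambda pair: pair[0], reverse=True)
        return [job_id for _, job_id in ranked[:limit]]

def skill_overlap(resume, job):
    """Toy scorer: fraction of the job's skills found in the resume."""
    return len(resume & job) / len(job) if job else 0.0

store = MatchStore(skill_overlap)
store.add_job("j1", {"xml", "java"})
store.add_resume("r1", {"xml"})
store.add_job("j2", {"xml"})  # matched incrementally against r1
print(store.top_jobs("r1"))
```

Because each addition scores only the new pairs, earlier results are reused, and the personalized listing shown at login is a simple sorted lookup.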
Although the profile matching system and method are described in connection with resumes and job postings for illustration purposes only, it should be amply clear that the invention is not limited to this specific application and that the invention can be applied to, and adapted by, various other applications, including but not limited to applications where pairs of entities (e.g. books and book readers) need to be matched.