The present invention relates to the field of data extraction, more specifically to a system for collecting specific information from several sources of unstructured data. In a practical application, the invention may be used to extract specific information, such as business-related information, from the multiple pages of the World Wide Web (WWW).
With over one and a half billion pages, the WWW is one of the largest sources of information on the planet. Whether searching for corporate, educational, historical, social, current affairs, geographical or general-knowledge information, among many other types, the WWW offers the richest, most up-to-date bank of information in existence.
Unfortunately, the content of the WWW is extremely vast and unstructured, and navigating it may be difficult and even unsuccessful. In order to find and extract a few specific and relevant pieces of information, a Web user may have to personally search through many Web pages and immense quantities of disorganised information. This exhaustive searching of the WWW consumes an excessive amount of time and is often very frustrating for the Web user.
Present day technology provides the Web user with the capability to search the WWW for specific information, using a search engine to identify its probable location. However, once potential Web pages are found, the pages have to be thoroughly visited by the Web user in order to find and extract the relevant information, with no guarantee that the required information is even present in the potential Web pages. Further, where a structured compilation of the specific information is required, the Web user must personally create this compilation by identifying, extracting and formatting the relevant information from the WWW.
One system that is currently used for collecting specific information from the WWW involves the use of dedicated databases containing specific information, where the information contained in each dedicated database is associated with pages of the WWW, in a simplified example through cross-referencing. These dedicated databases are created and maintained by a human operator, for use by the system, and require constant maintenance and updating. Once a search of the WWW has identified possible relevant Web pages, the system accesses the appropriate database, determines the information contained therein that corresponds to the relevant Web pages and generates therefrom a structured compilation of the requested information. In a particular example, assume that the specific information being searched for is contact information for a particular company, a search of the WWW having identified several potentially relevant Web pages. In this case, the system accesses a dedicated database containing commercial information, including contact information, on various corporate entities and extracts therefrom the required contact information, on the basis of the Web pages revealed by the search.
Unfortunately, this system has many disadvantages. In particular, the specific information provided to the Web user in the structured compilation is only as up-to-date as the last time the dedicated database from which the specific information was taken was updated, and may lack information newly available on the WWW. Another, and greater, disadvantage is the need for human resources to create and continuously update the dedicated databases, as well as the potential for incorrect information stored in the dedicated databases due to human error. Finally, while certain specific information may be unpublished (unavailable) on the WWW but available elsewhere, such as in a private Intranet or in a set of data files on a workstation, the system is specifically designed to work only with the pages of the WWW.
The background information provided above clearly indicates that there exists a need in the industry to provide a novel system for extracting and structurally compiling specific information from unstructured digitized data, such as the Web pages of the WWW.
Under a broad aspect, the invention provides a system for collecting specific information from several sources of unstructured digitized data. The system has an input for receiving at least one instruction governing the collection of the specific information. In a specific, non-limiting example of implementation, the system receives an instruction conveying the location(s) where the collection is to take place. The system includes a processing unit that connects to a plurality of sources of unstructured digitized data from which the specific information is to be collected, at least in part on the basis of the instruction(s) received at the input. The processing unit is operative to analyse the contents of each source of unstructured digitized data to identify in each source the information elements relevant to the specific information. The processing unit extracts the identified information elements from each source of unstructured digitized data where information elements relevant to the specific information have been identified, and processes the extracted information elements for generating an output signal containing the specific information. The system further includes an output for releasing the output signal.
The advantages of this system are twofold. First of all, the sources of unstructured digitized data do not have to be personally searched in their entirety by a human operator in order to collect the specific information. Rather, the system analyzes the contents of each source of unstructured digitized data and automatically extracts therefrom the requested specific information. Secondly, the specific information collected by the system is the most up-to-date information available from the particular source(s) of unstructured digitized data from which the specific information originated, since the specific information is taken directly from the particular source(s) of unstructured digitized data.
In this specification, the term “source” in the expression “source of unstructured digitized data” refers to a broad category of facilities containing, storing or providing digitized data, including databases, servers, memory modules, text files, digitized documents, among other possibilities. The sources of unstructured digitized data may be of different, even incompatible, data formats.
In this specification, the term “unstructured” in the expression “source of unstructured digitized data” is defined with respect to the information being searched for in the source of digitized data, from the point of view of the searcher. More specifically, the searcher is unaware of any particular layout or structure organizing the information contained in the digitized data. Further, several sources of unstructured digitized data are considered to be “unstructured” since they share no common structure or layout for the information contained therein.
In a specific non-limiting example of implementation, the unstructured digitized data is the data contained in the many pages of the WWW and the specific information is business-related information, in particular sales lead information for prospective clients. Such sales lead information, also referred to herein as contact information, may include the business name, the postal address, the e-mail address, the telephone and fax numbers, the name and title of a contact person, the number of employees, etc. The system is software implemented and resides on a computing device, such as a server or a workstation. For the purposes of this specific example, the system resides on a workstation at which a system user can access and use the system. In particular, the processing unit includes an identification unit having an input for receiving at least one instruction that governs the collection of the contact information. In this specific example, the identification unit receives from the system user an instruction conveying the location of a remote WWW site, in the form of a machine-readable URL (Uniform Resource Locator) address, where the collection of the contact information is to take place. The unstructured digitized data to be searched is the data contained in the various Web pages connected to the URL address.
The identification unit is operative to establish a data connection with the Web site located at the URL address, from which starting point the identification unit can connect to the various Web pages connected to the URL address and import all of the unstructured digitized data contained therein. The identification unit is then operative to examine the data contained in each Web page connected to the URL address and to identify therein any information elements relevant to contact information, such as a telephone number, an e-mail address, a postal code, a name of a city, etc.
In a variant, the identification unit is operative to determine the particular Web pages connected to the URL address that are most likely to contain contact information. The identification unit will then examine only those particular Web pages in order to identify therein any relevant information elements, ignoring the other Web pages connected to the URL address. In a specific example, assume the URL address corresponds to the home or welcome page for a Web site. The identification unit first examines the home or welcome page in order to detect therein the various hyperlinks linking it to other, related Web pages. Assuming these hyperlinks are entitled: “Products”, “History”, “Contacts”, “Address” and “Innovations”, the identification unit may determine that the most likely pages to contain contact information are those linked to the “Contacts” and “Address” hyperlinks. The identification unit will then examine only the Web pages linked to the “Contacts” and “Address” hyperlinks for identifying relevant information elements, ignoring all of the other Web pages.
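The hyperlink selection described in this variant can be illustrated with a minimal sketch. It assumes the home page is HTML and that a simple keyword test on the hyperlink title is sufficient; the keyword list `CONTACT_HINTS`, the class `LinkCollector` and the function `likely_contact_links` are hypothetical names introduced only for illustration and are not part of the described system.

```python
from html.parser import HTMLParser

# Assumed keyword list; the text only names the "Contacts" and "Address" titles.
CONTACT_HINTS = ("contact", "address")


class LinkCollector(HTMLParser):
    """Collects (hyperlink title, href) pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only accumulate text while inside an <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def likely_contact_links(html):
    """Return the hrefs whose hyperlink title suggests contact information."""
    parser = LinkCollector()
    parser.feed(html)
    return [href for title, href in parser.links
            if any(hint in title.lower() for hint in CONTACT_HINTS)]


home_page = """
<a href="/products.html">Products</a>
<a href="/history.html">History</a>
<a href="/contacts.html">Contacts</a>
<a href="/address.html">Address</a>
<a href="/innovations.html">Innovations</a>
"""
print(likely_contact_links(home_page))
```

Only the pages behind the two selected hyperlinks would then be examined; a real implementation might use a richer scoring of link titles than this substring test.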
The processing unit also includes an extractor unit for extracting from the Web pages the information elements identified by the identification unit, as well as an aggregator unit for processing the extracted information elements for generating an output signal containing the contact information requested by the system user. In this specific example, the output signal includes a structured compilation, such as a list or a table, of all of the retrieved contact information, where this output signal is transmitted to the system user by display on the monitor of the workstation.
The identification unit relies on lexical analysis operations that are well known to persons skilled in the art, as well as on text interpretation rules, to identify and categorise the information elements relevant to the specific information, in this example sales lead information. The lexical analysis performed by the identification unit relies on one or more dictionaries. In a specific example, a first dictionary contains all the names of major cities of the world, a second dictionary contains all the names of major provinces and states of the world and a third dictionary contains all the names of major countries of the world. Possible categories for the identified information elements may include name of a city, name of a province or state, name of a country, telephone or fax number, e-mail address, street name, postal code, etc.
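As a rough illustration of the dictionary-based categorisation, the following sketch looks up place names in three very small dictionaries. The dictionary contents, the name `DICTIONARIES` and the function `categorise_places` are assumptions made for this example; the actual dictionaries would be far larger and the lookup strategy may differ.

```python
import re

# Tiny illustrative dictionaries (assumed contents, not from the specification).
DICTIONARIES = {
    "city": {"montreal", "paris", "new york"},
    "province_or_state": {"quebec", "ontario", "california"},
    "country": {"canada", "france", "united states"},
}


def categorise_places(text):
    """Return (category, matched text, offset) for every dictionary hit."""
    hits = []
    for category, names in DICTIONARIES.items():
        for name in names:
            # Whole-word, case-insensitive lookup of the dictionary entry.
            pattern = r"\b" + re.escape(name) + r"\b"
            for m in re.finditer(pattern, text, re.IGNORECASE):
                hits.append((category, m.group(), m.start()))
    return sorted(hits, key=lambda hit: hit[2])


sample = "Head office: 100 Main Street, Montreal, Quebec, Canada"
for category, value, offset in categorise_places(sample):
    print(category, value, offset)
```

Keeping the character offset of each hit is a design choice of this sketch: it allows a later step to relate elements by their distance on the page.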
The text interpretation rules are based on “regular expressions”, used to express and process different text patterns. The concept of “regular expressions” is well known to those skilled in the art and, as such, will not be described in further detail. Different regular expression processing tools, such as OROmatcher (trade-mark), can be used by the identification unit for interpreting the data of the Web pages in order to identify therein and categorise information elements relevant to the requested specific information. Note that different types of text interpretation systems could also be used by the identification unit, without departing from the scope of the present invention.
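By way of illustration only, the sketch below uses the regular expression module of the Python standard library rather than OROmatcher. The three patterns (a simplified e-mail address, a North American style telephone or fax number, and a Canadian style postal code) and the names `PATTERNS` and `find_elements` are assumptions chosen for the example; the specification does not give the actual rules.

```python
import re

# Simplified, assumed patterns; real text interpretation rules would be broader.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone_or_fax": re.compile(r"\(?\b\d{3}\)?[ .-]\d{3}[ .-]\d{4}\b"),
    "postal_code": re.compile(r"\b[A-Za-z]\d[A-Za-z] ?\d[A-Za-z]\d\b"),
}


def find_elements(text):
    """Return (category, matched text, offset) tuples in page order."""
    found = []
    for category, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            found.append((category, m.group(), m.start()))
    return sorted(found, key=lambda element: element[2])


page_text = ("Tel: (514) 555-0199 Fax: 514-555-0123 "
             "E-mail: sales@example.com H3A 1B2")
for element in find_elements(page_text):
    print(element)
```

Note that a pattern alone cannot tell a telephone number from a fax number; distinguishing them would need additional context, such as the preceding label.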
The aggregator unit relies on pre-determined clustering rules to correlate and establish relationships between the information elements identified in each Web page. Thus, for a particular Web page, the aggregator unit processes the information elements identified therein and, on the basis of distance between the identified information elements on the page and the different categories of the identified information elements, relates the identified information elements for compiling complete or incomplete contact information. Once the contact information for each Web page has been compiled, the aggregator unit is operative to aggregate the contact information compiled from each Web page on a page by page basis, as well as for the totality of the Web pages, in order to remove any similar or repetitive contact information. The aggregator unit is also capable of combining, if appropriate, incomplete contact information from a particular Web page with complementary incomplete contact information from a different Web page.
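The clustering and aggregation just described might be sketched as follows. The specification does not disclose the clustering rules, so everything here is an assumption: elements are grouped when they lie within `MAX_GAP` characters of the previous element and do not repeat a category, and records are merged when they agree on every category they share. `Element`, `cluster` and `merge_records` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Element:
    category: str
    value: str
    offset: int  # character position on the page


MAX_GAP = 80  # assumed distance threshold, in characters


def cluster(elements, max_gap=MAX_GAP):
    """Group the elements of one page into contact records (dicts)."""
    records, current, last = [], {}, None
    for el in sorted(elements, key=lambda e: e.offset):
        too_far = last is not None and el.offset - last > max_gap
        # Start a new record on a large gap or a repeated category.
        if current and (too_far or el.category in current):
            records.append(current)
            current = {}
        current[el.category] = el.value
        last = el.offset
    if current:
        records.append(current)
    return records


def merge_records(records):
    """Remove duplicates and combine complementary incomplete records."""
    merged = []
    for rec in records:
        for existing in merged:
            shared = rec.keys() & existing.keys()
            if shared and all(rec[k] == existing[k] for k in shared):
                existing.update(rec)  # same contact: fill in missing fields
                break
        else:
            merged.append(dict(rec))
    return merged


page_a = [Element("phone", "514-555-0199", 10),
          Element("city", "Montreal", 30),
          Element("phone", "416-555-0123", 400)]
page_b = [Element("phone", "514-555-0199", 5),
          Element("email", "sales@example.com", 25)]

contacts = merge_records(cluster(page_a) + cluster(page_b))
print(contacts)
```

In this example the incomplete record from the second page (telephone and e-mail) is combined with the record from the first page that shares the same telephone number, while the unrelated second telephone number remains a separate, incomplete record.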
In a different example of implementation, the system includes a prospector unit that cooperates with at least one search engine and acts as an interface between the system and a user of the system. The prospector unit prompts the system user for at least one key word, based on which the prospector unit formulates to the search engine a search query in order to prospect for contact information of potential clients available over the WWW. For example, assume a software publisher provides to the prospector unit the key words “software distributors”. On the basis of these key words, the prospector unit formulates a search query to the search engine, which searches the WWW for relevant Web sites/pages. The search results are returned by the search engine to the prospector unit, which is operative to feed the URL address of each relevant Web page returned by the search engine to the identification unit of the system. Next, the information elements relevant to contact information are identified in each Web page, extracted and compiled into contact information, as defined above.
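The flow through the prospector unit can be outlined with a short sketch. The search engine and the downstream identification, extraction and aggregation steps are replaced by stand-in functions (`fake_search` and `fake_collect`), since no particular search engine interface is named in the text; `prospect` and the example URLs are likewise hypothetical.

```python
def prospect(key_words, search, collect):
    """Formulate a query from the key words and collect from each result URL."""
    query = " ".join(key_words)
    compiled = []
    for url in search(query):          # URLs returned by the search engine
        compiled.extend(collect(url))  # contact information found at that URL
    return compiled


# Stand-in for a search engine: maps a query to a fixed list of URLs.
def fake_search(query):
    index = {
        "software distributors": [
            "http://distributor-one.example/",
            "http://distributor-two.example/",
        ],
    }
    return index.get(query, [])


# Stand-in for the identification, extraction and aggregation steps.
def fake_collect(url):
    return [{"source": url}]


leads = prospect(["software", "distributors"], fake_search, fake_collect)
print(leads)
```

Passing the search and collection steps in as functions keeps the prospector independent of any particular search engine, which matches its described role as an interface.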
In a variant, the prospector unit is capable of selecting, on the basis of the key word(s) input by the system user, one or more specific Web pages from the plurality of pages returned by the search engine, passing only the URL address(es) for the selected specific Web page(s) to the identification unit of the system. In a specific example, the system user inputs to the prospector unit the name of a company, based on which the prospector unit formulates a search query to the search engine. The search engine searches the WWW for pages containing or making reference to the name of the company, and returns to the prospector unit a plurality of potentially relevant Web pages/sites. The prospector unit is operative to select from the plurality of potentially relevant Web pages/sites returned by the search engine the particular Web page that constitutes the home page for the named company, if present. The prospector unit next discards all of the other Web pages/sites and feeds to the identification unit of the system only the URL address corresponding to the home page of the named company, where collection of the contact information will then take place, as described above.
In another aspect, the invention provides a computer readable storage medium containing a program element for execution by a computing apparatus to implement a system for collecting specific information from several sources of unstructured digitized data.
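For the home page selection performed by the prospector unit in the variant above, one possible heuristic is sketched here. The specification does not state how the home page is recognised, so the rule used (the company name appears in the host name and the path is the site root) is purely an assumption, as are the name `pick_home_page` and the example URLs.

```python
import re
from urllib.parse import urlparse

ROOT_PATHS = ("", "/", "/index.html")  # assumed forms of a site root


def pick_home_page(company, urls):
    """Return the first URL that looks like the company's home page, or None."""
    key = re.sub(r"[^a-z0-9]", "", company.lower())
    if not key:
        return None
    for url in urls:
        parts = urlparse(url)
        host_key = re.sub(r"[^a-z0-9]", "", parts.netloc.lower())
        if key in host_key and parts.path in ROOT_PATHS:
            return url
    return None


results = [
    "http://news.example.org/story-about-acme-widgets",
    "http://www.acmewidgets.example/products.html",
    "http://www.acmewidgets.example/",
]
print(pick_home_page("Acme Widgets", results))
```

Returning `None` when no candidate qualifies mirrors the "if present" condition: in that case no URL address would be passed on to the identification unit.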
In yet another aspect, the invention provides a data processing device for collecting specific information from several sources of unstructured digitized data, having an input for receiving at least one instruction governing the collection of the specific information. The data processing device includes an identification unit operative to connect to a plurality of sources of unstructured digitized data from which the specific information is to be collected, at least in part on the basis of the at least one instruction. The identification unit examines each source of unstructured digitized data in order to identify information elements relevant to the specific information. The data processing device also includes an extractor unit for extracting the identified information elements from each source of unstructured digitized data in which information elements were identified, and an aggregator unit operative to process the extracted information elements for generating an output signal containing the specific information. The data processing device includes an output for releasing the output signal from the data processing device.
The invention further provides a method for collecting specific information from several sources of unstructured digitized data.