The present invention discloses an apparatus and method for the automatic discovery of control events from the publication of documents over the Internet or other network.
A document published electronically on the Internet may have an important effect on the control of goods and services delivered in a global economy. Whether proposed or final, a new or modified law, regulation, standard, study, guideline, finding, health alert, product recall, or governmental or advisory announcement is an example of a publication that the producer or manufacturer of the good or service must be aware of as soon as possible if it influences the producer's business or interests. Such a publication represents a critical control event that may influence the producer's or manufacturer's actions and business process with regard to the good or service. For instance, if a new regulation is published that restricts the usage of a manufacturer's product or raw materials anywhere in the world, published in any language, the manufacturer's regulatory experts must be made aware of that publication immediately. Similarly, if a new toxicological study has found that a substance has an adverse environmental or health effect, the manufacturer has an important managerial duty to know of such a publication and to respond quickly, as well as potential liability and an economic investment at stake in doing so. Indeed, a product recall may be necessary as a result.
The need to be well informed of relevant control events applicable to producing goods and services in a global economy applies to all actors in a supply chain, whether manufacturer, distributor, customer, service provider or government. Accordingly it would be desirable to provide a system and way to discover automatically relevant information associated with control events that can be passed to all actors in a supply chain. Further it would be highly useful to pass the discovered information as signals to an enterprise system enabling early warning and transaction control.
Systems and methods for personalizing and aggregating news content are well known. An example of such a method is Google News™. Google News™ aggregates and classifies news stories authored by other news publishers in an automatic manner. In contrast, the present invention is not a news aggregation service but an apparatus and method to discover critical control events relevant to a supply chain using a novel discovery and classification technique, which may be linked to an enterprise system to pass signals and early warning for transaction control of goods and services. Other known methods have defined how to enable the enhancement of reference data taken from a variety of sources. For example, financial markets may receive press releases from companies identified by stock symbol and a variety of other types of information which can be advantageously rendered into a value-added and centralized view through methods previously defined. The present invention does not merely enhance the value of input reference data for which the relevance is known, but rather provides an apparatus and method to discover through classification the relevance of an unknown original publication and then to produce information essential to controlling goods or services in a supply chain.
Today the actors in a supply chain have a limited number of options available to them to become informed of publications that represent critical control events. They may: a) manually check relevant web-sites one-by-one; b) subscribe to specialized news publications that summarize the actions taken of importance; c) retain consultants to perform this as a service; d) join trade associations where committees seek to monitor such events; and e) perform searches in one of the major Internet search engines. These options are inefficient and involve delay. It would be desirable to have a centralized source of information, rather than relying on checking web sites individually, a labor-intensive activity requiring expert knowledge (for example, to know which web sites are important in a global economy). However, a shortcoming of centralized web-searching at present using the major web-search engines is that they are limited in their capability to search across technical and linguistic content. For instance, if the search involves a chemical substance, a search in any of the major engines, including Google™, for the English term “acetic acid” will return hits only for that term and will not return in the same search any translation such as “acide acétique” (French) or “Essigsäure” (German). A producer of a good or service dependent on acetic acid who searched for new regulations or studies using today's search engines would therefore find no hits for the French or German terms without entering each one individually, and to do so the user would have to know in advance the relevant translations for which to search. The same difficulty applies to a search for commodities, e.g., strawberry (English) vs. fragola (Italian) vs. Erdbeere (German). The same limitation also applies to toxins and diseases.
Finally, such web-services not only cannot address multi-lingual searching for publications of relevance to a supply chain, but they also do not classify the relevance of the publications to control events for a supply chain. The present invention provides an apparatus and method to establish a centralized as well as customized information service based on the automatic discovery of publications that must be made known to the producer of a good or service in a global economy.
Other resources are also available to the user: Information services are in wide use by which the user who wishes to be informed of an event may establish personalized alerting of news of interest by storing a query that is run with a frequency defined by the user, often referred to as “clipping services”. Lexis-Nexis has an Alert feature that is an example of such a service. The Federal Register also offers an alert service for registered users. However, there are limitations to such services: a) The user must know a priori the relevant search terms and queries that will find a relevant document; b) in a global context the definition of such terms and queries would be complicated by the need to translate the terms of the query across many languages for a given issue; c) the user may need to be aware of and subscribe to many such services in order to obtain the necessary coverage of international subject matter; and d) the user's knowledge to enter a term or query would not be capable of being optimized, improved by, or of including terms used by other experts or by a system that could calculate a more robust selection set from an information source. In brief, the user can easily miss important information given the limitations of today's clipping services.
To achieve centralized information sources across the Internet, Google™ and others have provided for page-ranking algorithms to index and make available web-pages collectively deemed important through a method of voting based on cross-page references. However, if the publication on a given web-site is very new and the need is urgent for the user to know that the publication has been released, there is not enough time for a voting or ranking algorithm based on other links to reference it and thus to rank it highly among many thousands of references in responding to a query by the user of an Internet search engine. In short, publications of high importance and recent release can easily be missed. It would be desirable to have a ranking system based on a broader pool of knowledge which would act in the place of the human expert carrying out and improving on the results possible through current methods, and which would be implemented using artificial intelligence functionality.
With regard to existing artificial intelligence methods, significant research has occurred in the area of text classification by automatic means. Computer-based systems and methods for finding rules of law in text have been defined. Document classifying systems have been developed to build classification decision trees. Further, systems have been invented for automatically and rapidly classifying or declassifying military, intelligence, government, and industrial documents to protect sensitive or classified information. Some of the techniques used in document classification include artificial neural networks, Bayesian learning, decision-tree learning, genetic algorithms, cluster, vector and other statistical techniques, and rule-based analysis. Typically these are applied to documents available to the user or transmitted to the user. For example, a spam filter by automatic means eliminates unwanted emails, or an automatic classification is carried out for a known library of documents into categories. It would be desirable to have an automatic system to search for and discover new publications that can then be passed to such a classification method that produces not only an ontological or hierarchical collection of terms, but more specifically a means to relate those classification outcomes to a system of control events that can then be related to critical control points in a business process.
A control event for the producer of a good or service may be considered by analogy to a system used to achieve food safety developed by the U.S. FDA, called Hazard Analysis and Critical Control Point Principles (HACCP). According to FDA, “HACCP is a management system in which food safety is addressed through the analysis and control of biological, chemical, and physical hazards from raw material production, procurement and handling, to manufacturing, distribution and consumption of the finished product.” http://www.cfsan.fda.gov/~comm/nacmcfp.html. The goal of a HACCP strategy is to define all of the critical control points in a process or supply chain that can be applied to prevent or eliminate a food safety hazard. A critical control point is a step at which control can be applied to prevent a food safety hazard. In one embodiment of the present invention, a control event is one relevant to a critical control point and is associated with the publication of one or more documents.
For example, if the manufacturer is engaged in the development and sale of food additives, then a critical control point would include the determination that compliance had been achieved with all relevant requirements of the jurisdictions in which the product was marketed and sold. A newly published regulation or toxicological study would, for example, represent a control event associated with such critical control points in the manufacturing process.
As used in this invention, control events are considered generally related to the production of any good or service and are defined as key elements or categorizations of the production of a good or service in a supply chain. Once control events are defined and are mapped to the control points of a specific business or manufacturing process, it becomes possible to automatically monitor and identify publications relevant to a particular control event. All subscribers to a centralized information service can then be quickly informed of events that affect the production, marketing, and distribution of the good or service in a defined informational structure. The control events automatically identified by the system and method of the present invention can then be passed in a structured way as signals to alert the management of a supply chain, in the same manner that control signals can be sent by an information system through a network to control a motor. To control a motor, a protocol must be defined for what information is sent to the motor and how it is sent; control events serve a similar purpose here with regard to alerting a supply chain.
The apparatus and method of the present invention involves a number of steps. One step is to define by manual or algorithmic methods the source web-sites that correspond to a plan for geographic and subject matter coverage of interest to the actors in a supply chain. These web-sites would be the sources of the publications to be monitored. Four example web-sites of immediate interest to the regulatory compliance of food processors are: The U.S. Federal Register, FDA's web-site of the Center for Food Safety and Applied Nutrition, the European Union, and European Food Safety Authority (EFSA). In an alternative embodiment an existing source of or index to documents in the Internet or in a network could be used as the source of the publications to be examined by the classification process described below.
Based on the plan of coverage, a diverse number of web-sites are monitored on a regular basis by a master control application running on a server computer to identify whether a new publication has been released and its type. For example, the Federal Register would be monitored on a daily basis to identify notices, proposed rules, and final rules as publication types. Taking another instance, the European Food Safety Authority (EFSA) might release a new scientific opinion of the Panel on Food Additives, Flavourings, Processing Aids and Materials in Contact with Food. A master control program routinely accesses each defined Internet or network source and seeks to identify the publications of interest using one or more methods according to the invention that result in a collection of publications or publication references.
The identification of candidate publications of possible interest within a web-site is carried out by one or more methods that include but are not limited to: applications running on the server computer that make a simple examination of each new publication, applications that examine the date of files stored on the target source; applications that include algorithmic methods for the selection of publications, and screening applications running to inspect the web-site based on selection criteria. The step concludes with a set of new candidate publications of possible interest for the time period defined.
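One of the screening methods described above, examining the date of each publication on a monitored source, can be illustrated with a minimal sketch. The `Listing` record, its field names, and the `select_candidates` function are hypothetical and are not defined by the invention; the sketch assumes the master control program has already retrieved a list of entries from each source.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Listing:
    """One entry seen on a monitored source (hypothetical structure)."""
    source: str      # e.g. "Federal Register"
    url: str
    published: date
    pub_type: str    # e.g. "notice", "proposed rule", "final rule"


def select_candidates(listings, seen_urls, since, wanted_types=None):
    """Return the listings that are new for the period starting at `since`.

    A listing is a candidate if it was published on or after `since`,
    was not seen in an earlier run, and (optionally) has one of the
    publication types of interest.
    """
    candidates = []
    for item in listings:
        if item.published < since:
            continue  # outside the monitoring period
        if item.url in seen_urls:
            continue  # already processed in an earlier run
        if wanted_types is not None and item.pub_type not in wanted_types:
            continue  # not a publication type of interest
        candidates.append(item)
    # Oldest first, with the URL as a tie-breaker for a stable order.
    return sorted(candidates, key=lambda i: (i.published, i.url))
```

The returned list corresponds to the set of new candidate publications for the defined time period, which would then be handed to the classifier.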
Each of the candidate publications is submitted to a classifier application running on a server computer connected to a database containing document elements—for example, word terms, fragments, phrases, spatial or formatted document sections, and images corresponding to a set of predefined control categories. The task of the classifier is to analyze the occurrence of such elements to determine whether the document contains a control event. This classification step provides an intelligent function based on a collective analysis of a large corpus of relevant publications that can optimize the possibility of finding a publication of interest by automatic means in order to save the time of the user and to improve the accuracy of the search. Further, in one embodiment of the present invention the classification method is a learning function that can improve the results of classification over time based on feedback. The classification step though cannot be carried out without the result of the master control program to identify what is new and to process the type of the publication.
A number of different classification methods are known and may be used. These include, but are not limited to, Bayesian Analysis, Inverse Term Frequency Analysis and other statistical methods that analyze the word content or other attributes of the publication document in order to assign the document to a particular control category. For example, in one embodiment, the database contains a list of word terms associated with a control category and the classifier application extracts and counts the occurrence of such words in the publication. Based on the classification method selected, an equation is applied to the cumulative frequencies of terms found in the document in order to produce a weighted score of whether the publication belongs to a control category. For example, in an Inverse Term Frequency Analysis, words strongly associated with a single category but weakly associated with the categories as a whole are given higher weight. Thus words like “regulation”, which may appear in all control categories of a set of documents that are published regulations, would receive low scores. In contrast, words and phrases like “pesticide residues” would receive higher scores because they are associated with a limited set of control categories dealing with the control of contaminants in foods. However, some web sources of publications may be pre-classified a priori and any publications from such a highly specific web-site can be assigned to a particular control category.
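The weighting idea described above can be sketched as follows. This is an illustration under stated assumptions, not the equation used by the invention: the term lists in `CATEGORY_TERMS` are invented for the example, and the weight of a term is assumed to be the logarithm of the number of categories divided by the number of categories whose list contains the term, so that a term present in every category contributes nothing.

```python
import math
import re
from collections import Counter

# Hypothetical term lists per control category (illustrative only).
CATEGORY_TERMS = {
    "contaminants": {"regulation", "pesticide", "residues"},
    "food contact": {"regulation", "sanitizing", "solutions"},
    "product recalls": {"regulation", "recall", "withdrawn"},
}


def term_weights(category_terms):
    """Weight each term by how few categories it is associated with."""
    n_categories = len(category_terms)
    spread = Counter(t for terms in category_terms.values() for t in terms)
    # A term listed in every category gets log(1) = 0.
    return {t: math.log(n_categories / k) for t, k in spread.items()}


def score_document(text, category_terms):
    """Return a weighted score per category from term counts in `text`."""
    weights = term_weights(category_terms)
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return {
        category: sum(counts[t] * weights[t] for t in terms)
        for category, terms in category_terms.items()
    }


def classify(text, category_terms, threshold=0.0):
    """Return the best-scoring category, or None if no score exceeds `threshold`."""
    scores = score_document(text, category_terms)
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else None
```

Under these assumptions a document mentioning only “regulation” is assigned to no category, while one that also mentions “pesticide” and “residues” scores highest for the contaminants category.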
In a preferred embodiment the classifier application runs on a server computer connected to a database that contains a collection of relevant terms together with their weighted scores, based on the analysis of a training set of multilingual publications. Depending on the method of analysis used, a score for the document is calculated from the occurrence of the terms in the document using the weighted scores associated with those terms in the database. To determine the accuracy of the weighted terms for a given control category, one method is to randomly select a test subset from among publications that have been previously classified by manual or expert means. In such an example, the randomly selected test set of publications would be withheld from the training run of known documents. Then the automatic classification of the random subset can be compared against the manual classification in order to determine whether the overall result is accurate.
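The accuracy check described above, withholding a random test subset and comparing automatic against manual classification, might be sketched as below. The function names, the 20% default test fraction, and the representation of a classifier as any callable from text to category are assumptions made for illustration; the training step itself is omitted.

```python
import random


def split_holdout(labelled_docs, test_fraction=0.2, seed=0):
    """Randomly withhold a test subset from manually classified documents.

    `labelled_docs` is a sequence of (text, category) pairs.
    Returns (training_docs, test_docs); the two lists do not overlap.
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    docs = list(labelled_docs)
    rng.shuffle(docs)
    n_test = max(1, int(len(docs) * test_fraction))
    return docs[n_test:], docs[:n_test]


def accuracy(classifier, test_docs):
    """Fraction of withheld documents whose automatic category matches
    the manual one. `classifier` is any callable mapping text to a category."""
    if not test_docs:
        raise ValueError("test set is empty")
    correct = sum(1 for text, label in test_docs if classifier(text) == label)
    return correct / len(test_docs)
```

The classifier would be trained on the first list only and scored with `accuracy` on the second, so that the reported figure reflects documents it has not seen.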
The output of the classifier application is a determination that a publication is relevant to one or more pre-defined control categories and the automatic preparation of a summary report that is then loaded to the database of a server. An example of a control category would be: “Product Recalls”. The information resulting from the classification process will, for example, include the determined control category and sub-categories, the citation, a hyperlink to the source document, and any summary statement.
In a preferred embodiment of the present invention, another classification step occurs in which a multi-lingual analysis of the document is made by an application program running on a server computer in order to identify any specific reference to a product, substance, food or foodstuff, biological agent, disease, or commodity. The application program uses an indexed database of multilingual terms that are sorted by a number of different methods that include but are not limited to an inverted order with the longest terms occurring first. A list of all found references relevant to goods and services is then added to the summary information about the publication determined from the previous classification step. In this preferred embodiment, any reference to “acetic acid” or “strawberry” would, for example, be found. As a result, the first classification step determines information about the control category, e.g., that FDA has published a proposed change in the regulation of sanitizing solutions in food processing equipment, which is assigned to a control category, “food contact”, and the second classification step determines that the proposed regulation applies to solutions containing “acetic acid”. Thus, the present invention provides a system and method to automatically discover and assign a control event or categorization to the newly published document and to analyze, in a multi-lingual manner and without geographic limitation, which product, substance, food or foodstuff, biological agent, disease, or commodity it might apply to.
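The longest-first ordering of multilingual terms mentioned above can be illustrated with a short sketch. The `TERM_INDEX` contents, the canonical names, and the use of a regular expression are assumptions for the example; the sketch also assumes that index keys are stored in lowercase.

```python
import re

# Hypothetical multilingual index: surface term -> (canonical name, language).
# Keys are assumed to be lowercase.
TERM_INDEX = {
    "acetic acid": ("acetic acid", "en"),
    "acide acétique": ("acetic acid", "fr"),
    "essigsäure": ("acetic acid", "de"),
    "strawberry": ("strawberry", "en"),
    "fragola": ("strawberry", "it"),
    "erdbeere": ("strawberry", "de"),
    "acid": ("acid", "en"),
}


def build_matcher(term_index):
    """Compile one pattern with the longest terms first, so that
    "acetic acid" is preferred over the shorter term "acid"."""
    terms = sorted(term_index, key=len, reverse=True)
    alternatives = "|".join(re.escape(t) for t in terms)
    # Lookarounds keep matches on whole words only.
    return re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE)


def find_references(text, term_index):
    """Return (canonical name, language, matched text) for each term found."""
    matcher = build_matcher(term_index)
    found = []
    for match in matcher.finditer(text):
        canonical, language = term_index[match.group(0).lower()]
        found.append((canonical, language, match.group(0)))
    return found
```

Because every surface form maps to one canonical name, a publication in French or German would be reported against the same substance as an English one.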
On the basis of the output reports stored within the web-based content management system, a centralized information system can be provided to subscribers within the supply chain or other interested parties.
According to a further feature of the invention, the output reports are structured in such a manner, using one or more protocols to format the information, so that the information may be loaded as data into an enterprise resource planning system (ERP), such as SAP, Oracle, or similar system. This would also apply to other specialized product life cycle management systems (PLM) or report generation systems used for product control. For example, a set of control events or categories can be associated with standard phrases that can be loaded into the enterprise system which can then be associated with reports to be generated from the enterprise system or transactions to be controlled by the enterprise system. The control event or information signal may be output from the centralized information service of the present invention and transmitted to the enterprise system in a number of different ways. The enterprise system can receive and consume the control event that can be passed using a structured information protocol, such as XML or as a web-service, or by a number of different methods. Such methods also permit customization of the information service and the mapping of the control events to the needs of the user of the enterprise system.
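As a rough illustration of passing a control event in a structured protocol such as XML, the sketch below serializes one event with Python's standard `xml.etree.ElementTree` module. The element names (`controlEvent`, `category`, `citation`, `source`, `references`, `summary`) and the function name are hypothetical; the invention does not prescribe a schema, and a real enterprise system would define its own.

```python
import xml.etree.ElementTree as ET


def control_event_to_xml(category, citation, url, references, summary=""):
    """Serialize one control event as an XML string (hypothetical schema)."""
    root = ET.Element("controlEvent")
    ET.SubElement(root, "category").text = category
    ET.SubElement(root, "citation").text = citation
    # Hyperlink to the source document, carried as an attribute.
    ET.SubElement(root, "source", {"href": url})
    refs = ET.SubElement(root, "references")
    for name in references:
        # One element per product, substance, commodity, etc. found.
        ET.SubElement(refs, "reference").text = name
    if summary:
        ET.SubElement(root, "summary").text = summary
    return ET.tostring(root, encoding="unicode")
```

A receiving system could parse the string with `ET.fromstring` and map the category to its own standard phrases or transaction rules.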
The output of the preferred embodiment includes the structured identification of one or more products, substances, foods or foodstuffs, biological agents, diseases, and commodities to which a document may apply. Further, the output of the preferred embodiment includes a control event associated with the document. These can be loaded as alerts to facilitate transaction control by the enterprise system. Thus the present invention provides a system and method to reduce the amount of time between the publication of a new or proposed law, regulation, standard, study, guideline, finding, health alert, product recall, governmental, commercial, or advisory announcement and the capability for an expert subscribing to the information service offered by the invention to act and, in addition, the capability for a subscribing enterprise system to load early warning signals for transaction control in a business-to-business exchange of information using any one of a number of data exchange protocols.
Other objects, advantages and novel features of the present invention will become apparent from the following detailed description when considered in conjunction with the accompanying drawings.