With computer-implemented word processing and mass data storage, the amount of information generated by mankind has risen dramatically and at an ever-quickening pace. As a result, there is a continuing and growing need to collect and store, identify, track, classify and catalogue, and link for retrieval and distribution this growing sea of information.
Much of the world's information or data is in the form of text, the majority of which is unstructured (lacking metadata, or having content that is asymmetrical and unpredictable, i.e., prose, rather than formatted in predictable data tables). Much of this textual data is available in digital form [either originally created in this form or converted to digital form by means of, for example, OCR (optical character recognition)] and is stored and available via the Internet or other networks. Unstructured text is difficult to handle effectively in large volumes, even when using state-of-the-art processing capabilities. The volume of content is outstripping the processing power needed to effectively manage and assimilate information from a variety of sources for refinement and delivery to users. Although advances have made it possible to investigate, retrieve, extract and categorize information contained in vast repositories of documents, files, or other text “containers,” systems are needed to more efficiently manage and classify the ever-growing volume of data generated daily and to more effectively deliver such information to consumers.
This proliferation of text-based information in electronic form has resulted in a growing need for tools that facilitate organization of the information and allow users to query systems for desired information. One such tool is information extraction software that, typically, analyzes electronic documents written in a natural language and populates a database with information extracted from such documents. Applied against a given textual document, the process of information extraction (IE) is used to identify entities of predefined types appearing within the text and then to list them (e.g., people, companies, geographical locations, currencies, units of time, etc.). IE may also be applied to extract other words or terms or strings of words or phrases.
Knowledge workers, such as scientists, lawyers, traders or accountants, have to deal with an ever greater amount of increasingly varied data. Their information needs are often focused on entities and their relations, rather than on documents. To satisfy these needs, information providers must pull information from wherever it happens to be stored and bring it together in a summary result. As a concrete example, suppose a user is interested in the companies with the highest operating profit in 2015 that are currently involved in Intellectual Property (IP) lawsuits. In order to answer this query, one needs to extract company entities from free text documents, such as financial reports and court documents, and then integrate the information extracted from different documents about the same company.
Content and enhanced experience providers, such as Thomson Reuters Corporation, identify, collect, analyze and process key data for use in generating content, such as news articles and reports, financial reports, scientific reports and studies, law related reports, articles, etc., for consumption by professionals and others. The delivery of such content and services may be tailored to meet the particular interests of certain professions or industries, e.g., wealth managers and advisors, fund managers, financial planners, investors, scientists, lawyers, etc. Professional services companies, like Thomson Reuters, continually develop products and services for use by subscribers, clients and other customers and with such developments distinguish their products and services over those offered by their competition.
Companies, such as Thomson Reuters—with many businesses involved in delivery of content and research tools to aid a wide variety of research and professional service providers—generate, collect and store a vast spectrum of documents, including news, from all over the world. These companies provide users with electronic access to a system of databases and research tools. Professional services providers also provide enhanced services through various techniques to augment content of documents and to streamline searching and more efficiently deliver content of interest to users. For example, Thomson Reuters structures documents by tagging them with metadata for use in internal processes and for delivery to users.
“Term” refers to single words or strings of highly-related or linked words or noun phrases. “Term extraction” (also term recognition or term mining) is a type of IE process used to identify or find and extract from a given document terms that have some relevance to the content of the document. Such activities are often referred to as “Named Entity Extraction,” “Named Entity Recognition” and “Named Entity Mining” and, in connection with additional processes, e.g., Calais “Named Entity Tagging” (or more generally special noun phrase tagger) and the like. There are differences in how these activities are performed. For example, term recognition might only require setting a flag when a certain expression is identified in a text span, while term extraction would involve identifying the expression and its boundaries and writing it out for storage in, for example, a database, noting exactly where in the text it came from. Techniques employed in term extraction may include linguistic or grammar-based techniques, natural language or pattern recognition, tagging or structuring, data visualizing and predictive formulae. For example, all names of companies mentioned in the text of a document can be identified, extracted and listed. Similarly, events (e.g., Exxon Valdez oil spill or BP Horizon explosion), sub-events related to events (e.g., cleanup effort associated with Exxon Valdez oil spill or BP Horizon explosion), names of people, products, countries, organizations, geographic locations, etc., are additional examples of “event” or “entity” type terms that are identified and may be included in a list or in database records. This IE process may be referred to as “event or entity extraction” or “event or entity recognition.” As implemented, known IE systems may operate in terms of “entity” recognition and extraction wherein “events” are considered a type of entity and are treated as an entity along with individuals, companies, industries, governmental entities, etc.
There are a variety of methods available for automatic event or entity extraction, including linguistic or semantic processors to identify, based on known terms or applied syntax, likely noun phrases. Filtering may be applied to discern true events or entities from unlikely events or entities. The output of the IE process is a list of events or entities of each type and may include pointers to all occurrences or locations of each event and/or entity in the text from which the terms were extracted. The IE process may or may not rank the events/entities, determine which events/entities are more “central” or “relevant” to the text or document, or compare terms against a collection of documents or “corpus” to further determine the relevancy of each term to the document.
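The distinction drawn above between term recognition and term extraction may be illustrated by the following minimal sketch. The pattern `COMPANY_PATTERN`, the functions `recognize` and `extract`, and the `Mention` record are hypothetical names introduced only for illustration; a small regular expression stands in for whatever lexicon or model an actual IE system would use.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical, tiny gazetteer of company names (illustrative assumption only).
COMPANY_PATTERN = re.compile(r"\b(?:Exxon|BP|Thomson Reuters)\b")


def recognize(text: str) -> bool:
    """Term recognition: only set a flag when a known term occurs in the text."""
    return COMPANY_PATTERN.search(text) is not None


@dataclass(frozen=True)
class Mention:
    term: str   # the matched expression
    start: int  # character offset where the term begins
    end: int    # character offset just past the end of the term


def extract(text: str) -> List[Mention]:
    """Term extraction: return each term with its boundaries in the text."""
    return [Mention(m.group(), m.start(), m.end())
            for m in COMPANY_PATTERN.finditer(text)]


sample = "BP and Exxon both faced cleanup costs."
flag = recognize(sample)    # a single yes/no answer for the whole span
mentions = extract(sample)  # records suitable for storage in a database
```

In this sketch, recognition yields a single flag for the text span, whereas extraction yields one record per occurrence, each noting exactly where in the text the term came from.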
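By way of illustration only, comparing terms against a corpus to estimate their relevancy to a document may be sketched with a TF-IDF style weighting. TF-IDF is one common technique and is an assumption here, not a method attributed to any particular IE system; the function name `tf_idf` and the sample data are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Set


def tf_idf(doc_terms: Counter, corpus: List[Set[str]]) -> Dict[str, float]:
    """Score each extracted term by frequency in the document, discounted by
    how many corpus documents also contain the term (smoothed IDF)."""
    n_docs = len(corpus)
    total = sum(doc_terms.values())
    scores = {}
    for term, count in doc_terms.items():
        # Number of corpus documents in which the term appears.
        df = sum(1 for terms in corpus if term in terms)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[term] = (count / total) * idf
    return scores


# Entities extracted from one document, with occurrence counts.
doc_terms = Counter({"Exxon": 3, "Alaska": 1})
# Sets of entities extracted from each document in a small corpus.
corpus = [{"Exxon", "Alaska"}, {"Alaska"}, {"Alaska", "BP"}]
scores = tf_idf(doc_terms, corpus)
```

Under these assumptions, a term that is frequent in the document but rare in the corpus ("Exxon") receives a higher score than one that appears throughout the corpus ("Alaska").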
Systems and methods for identifying risks, entities, relationships, supply chains, and for generating visualizations related to risks, entities, relationships, and supply chains are described in at least: SYSTEMS, METHODS, AND SOFTWARE FOR ENTITY EXTRACTION AND RESOLUTION COUPLED WITH EVENT AND RELATIONSHIP EXTRACTION, U.S. patent application Ser. No. 12/341,926, filed Dec. 22, 2008, Light et al.; SYSTEMS, METHODS, SOFTWARE AND INTERFACES FOR ENTITY EXTRACTION AND RESOLUTION AND TAGGING, U.S. patent application Ser. No. 12/806,116, filed Aug. 5, 2010, issued as U.S. Pat. No. 9,501,467, on Nov. 11, 2016, Light et al.; FINANCIAL EVENT AND RELATIONSHIP EXTRACTION, U.S. patent application Ser. No. 12/363,524, filed Jan. 30, 2009, Schilder et al.; SYSTEMS, METHODS, AND SOFTWARE FOR ENTITY RELATIONSHIP RESOLUTION, U.S. patent application Ser. No. 12/341,913, filed Dec. 22, 2008, issued as U.S. Pat. No. 9,600,509, on Mar. 1, 2017, Conrad et al.; METHODS AND SYSTEMS FOR MANAGING SUPPLY CHAIN PROCESSES AND INTELLIGENCE, U.S. patent application Ser. No. 13/594,864, filed Aug. 26, 2012, Siig et al.; METHODS AND SYSTEMS FOR GENERATING SUPPLY CHAIN REPRESENTATIONS, U.S. patent application Ser. No. 13/795,022, filed Mar. 12, 2013, Leidner et al.; and RISK IDENTIFICATION AND RISK REGISTER GENERATION SYSTEM AND ENGINE, U.S. patent application Ser. No. 15/181,194, filed Jun. 13, 2016, Leidner et al.; each and all of which are incorporated herein by reference in their entirety.
Thomson Reuters' Text Metadata Services group (“TMS”), known as ClearForest prior to its acquisition in 2007, is one exemplary IE-based solution provider offering text analytics software used to “tag,” or categorize, unstructured information and to extract facts about people, organizations, places or other details from news articles, Web pages and other documents. TMS's Calais is a web service that includes the ability to extract entities such as company, person or industry terms along with some basic facts and events. OpenCalais is an available community tool to foster development around the Calais web service. APIs (Application Programming Interfaces) are provided around an open rule development platform to foster development of extraction modules. Other providers include Autonomy Corp., Nstein and Inxight. Examples of Information Extraction software in addition to OpenCalais include: AlchemyAPI; CRF++; LingPipe; TermExtractor; TermFinder; and TextRunner. IE may be a separate process or a component or part of a larger process or application, such as business intelligence software.
Currently, the dominant technology for providing non-technical users with access to Linked Data is keyword-based search. This is problematic because keywords are often inadequate as a means for expressing user intent. In addition, while a structured query language can provide convenient access to the information needed by advanced analytics, unstructured keyword-based search cannot meet this extremely common need. This makes it harder than necessary for non-technical users to generate analytics.
What is needed is a natural language-based system that utilizes the benefits of structured query language capabilities to allow non-technical users to create well-formed questions.
Today, investment decisions in the financial markets require careful analysis of information available from multiple sources. To meet this challenge, financial institutions typically maintain very large datasets that provide a foundation for this analysis. For example, forecasting stock markets, currency exchange rates and bank bankruptcies, understanding and managing financial risk, trading futures, credit rating, loan management, bank customer profiling, and money laundering analyses all require large datasets of information for analysis. The datasets of information can be structured datasets as well as unstructured datasets.
Typically, the datasets of information are used to model one or more different entities, each of which may have a relationship with other entities. For example, a company entity may be impacted by, and thereby have a relationship with, any of the following entities: a commodity or natural resource (e.g., aluminum, corn, crude oil, sugar, etc.), a source of the commodity or natural resource, a currency (e.g., euro, sterling, yen, etc.), and one or more competitors, suppliers or customers. Any change in one entity can have an impact on another entity. For example, rising crude oil prices can impact a transportation company's revenues, which can affect the company's valuation. In another example, an acquisition of a supplier by a competitor puts an entity's supply chain at risk, as would political upheaval or natural disaster (e.g., tsunami, earthquake) affecting availability or operations of a supplier.
Given the quantity and nature of these datasets, each modeled entity tends to have multiple relationships with a large number of other entities. As such, it is difficult to identify which entities are more significant than others for a given entity.
Accordingly, there is a need for systems and techniques to automatically analyze all available supply chain related data to identify relationships and assign significance scores to entity relationships.
Event detection and relation extraction is an active field of academic research. State-of-the-art systems employ statistical machine learning models to identify and classify relations between entities mentioned in natural language texts. Recently, deep learning-based systems have been shown to achieve similar quality while requiring less feature engineering. Knowledge base building systems make use of known machine learning models to create or augment knowledge graphs, depicting relations between entities.
What is needed is a system configured to be applied to the identification of supply chain relationships between companies. Supply chain identification is still based on manual work and on extracting relations from structured data (financial reports, PIERS records, etc.).
Supplier-customer relations are very valuable to investors, among other interested classes of users, but are oftentimes hard to detect. Some information is available in structured data, but many more indications are available only in unstructured data, such as news stories, company SEC filings, blogs, and company and other web sites. A lot of highly informative data is publicly available, but it is too voluminous for manual processing to be a feasible means of systematically identifying supply chain relations.
Accordingly, what is needed is an automated system capable of processing the large volumes of available data to detect indications of supply chain relationships between companies and to aggregate these indications across data sources to generate a single confidence score for the relation between such companies.
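As a non-limiting illustration of aggregating indications into a single confidence score, the following minimal sketch combines per-source confidences for the same supplier-customer pair using a noisy-OR rule. The noisy-OR rule, which treats each indication as independent evidence, is an assumption chosen for simplicity and is not stated in this section; the function name `aggregate` and the company names are hypothetical.

```python
from collections import defaultdict
from math import prod
from typing import Dict, Iterable, Tuple

Pair = Tuple[str, str]  # (supplier, customer)


def aggregate(indications: Iterable[Tuple[str, str, float]]) -> Dict[Pair, float]:
    """Combine per-indication confidences in [0, 1] into one score per pair.

    Noisy-OR (an assumed rule): the pair is considered related unless every
    indication is wrong, so score = 1 - product(1 - confidence_i).
    """
    grouped = defaultdict(list)
    for supplier, customer, confidence in indications:
        grouped[(supplier, customer)].append(confidence)
    return {pair: 1.0 - prod(1.0 - c for c in confidences)
            for pair, confidences in grouped.items()}


# Hypothetical indications detected in different sources (news, filings, etc.).
indications = [
    ("Acme", "Globex", 0.5),     # e.g., from a news story
    ("Acme", "Globex", 0.5),     # e.g., from a company filing
    ("Initech", "Globex", 0.2),  # a single weak indication
]
scores = aggregate(indications)
```

Under this assumed rule, two moderate indications of the same relation reinforce one another (0.5 and 0.5 combine to 0.75), while a single weak indication retains its original confidence.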