CPC G06F 16/285 (2019.01) [G06F 7/14 (2013.01); G06F 16/245 (2019.01); G06N 5/01 (2023.01); G06N 20/20 (2019.01); G06Q 10/063112 (2013.01); G06Q 30/01 (2013.01); G06Q 30/015 (2023.01)] | 1 Claim |
1. A system for disambiguating company profiles, the system comprising:
an entity disambiguation computer comprising a memory, a processor, and a plurality of programming instructions, the plurality of programming instructions when executed by the processor cause the processor to:
receive a plurality of candidate company profiles with timed attributes from a database;
monitor the plurality of received candidate company profiles, and extract and ingest, at dynamic intervals, a plurality of timed metadata from the plurality of candidate company profiles, the timed metadata comprising individual company data, each timed metadata, of the plurality of timed company profiles, linked to a company profile, of the plurality of the candidate company profiles, associated to a time frame, wherein the individual company data for a company profile comprises at least location, geocodes, company name, employee attributes, average company headcount reporting data over a given time frame, company website and URL information, and company employment records;
disambiguate, from at least a portion of timed metadata, a plurality of locations and geocodes, for the time frame, using regular expressions to match components of unstructured text and cross-reference one or more identified location components against one or more geocode databases;
classify and disambiguate, from at least a portion of timed metadata, a company name component, for the time frame, from a company name associated with at least one company profile, of the plurality of candidate company profiles, to identify at least a base name, a connector, a function and/or industry, and a legal identifier associated with the company name, wherein a conditional random field (CRF) is used to disambiguate the company name component;
disambiguate, from at least a portion of timed metadata, employee attributes, for the time frame, by mapping skills to skill topics implemented with a Latent Dirichlet Allocation (LDA) topic model algorithm;
train a tree model for pre-selection of a plurality of candidate companies for the first company profile, wherein one or more manually annotated training examples are used to train the tree model, so as to facilitate identification of a pre-selection of the plurality of candidate companies, and wherein one or more algorithms used to train the tree model include Random Forest algorithm, Gradient Boosting algorithm, Decision Tree algorithm, or a combination thereof;
train a similarity model for comparison of the plurality of candidate companies based at least on the pre-selection of the plurality of candidate companies, wherein the similarity model is trained to determine whether two given candidate companies are the same, wherein the similarity model is trained using one of a Regression Algorithm, a Neural Network, or a Vector Similarity Algorithm paired with a learned threshold model; and
in response to a determination that at least two given candidate companies, of the plurality of candidate companies, are matched, merge the timed metadata associated with the matched candidate companies.
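
The location step of the claim (regular expressions matched against unstructured text, then cross-referenced against a geocode database) could be sketched roughly as follows. This is an illustrative sketch only, not the claimed implementation: the `GEOCODES` dictionary, the `LOCATION_RE` pattern, and the `disambiguate_locations` function are hypothetical names, and the in-memory dictionary is an assumed stand-in for a real geocode database.

```python
import re

# Hypothetical stand-in for a geocode database: (city, region) -> (lat, lon).
GEOCODES = {
    ("berlin", "de"): (52.52, 13.405),
    ("paris", "fr"): (48.8566, 2.3522),
    ("paris", "us-tx"): (33.6609, -95.5555),
}

# Assumed pattern: "City, REGION" fragments such as "Paris, FR" or "Paris, US-TX".
LOCATION_RE = re.compile(
    r"\b([A-Z][a-zA-Z]+(?: [A-Z][a-zA-Z]+)*),\s*([A-Z]{2}(?:-[A-Z]{2})?)\b"
)


def disambiguate_locations(text):
    """Extract location components and keep those found in the geocode table."""
    results = []
    for city, region in LOCATION_RE.findall(text):
        key = (city.lower(), region.lower())
        if key in GEOCODES:  # cross-reference against the geocode table
            results.append(
                {"city": city, "region": region, "geocode": GEOCODES[key]}
            )
    return results


if __name__ == "__main__":
    sample = "Headquartered in Paris, US-TX with an office in Berlin, DE."
    for hit in disambiguate_locations(sample):
        print(hit)
```

In this sketch the region code is what separates the two entries named "Paris"; a candidate whose city and region pair is absent from the table is simply dropped.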
|
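
The final two steps of the claim (deciding whether two candidate companies are the same, then merging their timed metadata) might look roughly like the sketch below. It assumes the "Vector Similarity Algorithm" option, using cosine similarity over character trigrams of the company name. The fixed `threshold` value is an assumption standing in for the learned threshold model recited in the claim, and the profile layout (`name`, `timed_metadata` keyed by time frame) and all function names are hypothetical.

```python
import math
from collections import Counter


def trigram_vector(name):
    """Represent a company name as a bag of character trigrams."""
    padded = f"  {name.lower().strip()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def is_same_company(p, q, threshold=0.6):
    """Assumed fixed threshold; the claim recites a learned threshold model."""
    score = cosine_similarity(trigram_vector(p["name"]), trigram_vector(q["name"]))
    return score >= threshold


def merge_timed_metadata(p, q):
    """Merge per-time-frame metadata; values from q win on key conflicts."""
    merged = dict(p["timed_metadata"])
    for time_frame, data in q["timed_metadata"].items():
        merged[time_frame] = {**merged.get(time_frame, {}), **data}
    return {"name": p["name"], "timed_metadata": merged}


if __name__ == "__main__":
    a = {"name": "ACME Robotics", "timed_metadata": {"2020": {"headcount": 10}}}
    b = {"name": "acme robotics", "timed_metadata": {"2021": {"headcount": 12}}}
    if is_same_company(a, b):
        print(merge_timed_metadata(a, b))
```

Keying the metadata by time frame keeps records from different periods side by side after the merge instead of overwriting one another.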