CPC G06F 16/382 (2019.01) [G06F 16/335 (2019.01)] | 18 Claims |
1. A method for identifying citations from documents and constructing enriched citation databases, the method comprising:
obtaining, by a processing device, a document comprising texts of a natural language;
constructing pre-processing filters comprising a first set of regular expressions matching non-citation text patterns;
applying the pre-processing filters to the document to generate a pre-processed document by removing the non-citation text patterns from the document;
constructing citation filters comprising a second set of regular expressions, wherein each of the second set of regular expressions matches at least one of a corresponding citation or a context associated with the citation, and a regular expression in the first set of regular expressions or in the second set of regular expressions is one of an atomic regular expression or a compound regular expression defined by one or more atomic regular expressions or other compound regular expressions;
applying the citation filters to the pre-processed document to identify one or more citations and corresponding contexts that match at least one of the second set of regular expressions; and
storing the one or more citations and corresponding contexts in an enriched citation database.
|
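The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the claimed implementation: every pattern, function name, the fixed-width context window, and the SQLite table layout below are assumptions chosen only to show how atomic expressions might be composed into compound ones, applied as pre-processing and citation filters, and stored.

```python
import re
import sqlite3

# Atomic regular expressions (hypothetical patterns, for illustration only).
YEAR = r"(?:19|20)\d{2}"
AUTHOR = r"[A-Z][a-z]+"

# Compound regular expressions defined by atomic or other compound ones.
AUTHORS = rf"{AUTHOR}(?: (?:and|&) {AUTHOR}| et al\.)?"
AUTHOR_YEAR = rf"\((?P<citation>{AUTHORS},? {YEAR})\)"
NUMERIC = r"\[(?P<citation>\d+(?:, ?\d+)*)\]"

# First set: assumed examples of non-citation text patterns to remove.
PRE_FILTERS = [
    re.compile(r"^Page \d+ of \d+$", re.MULTILINE),  # running page headers
    re.compile(r"https?://\S+"),                     # bare URLs
]

# Second set: citation filters.
CITATION_FILTERS = [re.compile(AUTHOR_YEAR), re.compile(NUMERIC)]


def preprocess(document):
    """Remove every match of the pre-processing filters from the text."""
    for pattern in PRE_FILTERS:
        document = pattern.sub("", document)
    return document


def find_citations(document, window=40):
    """Return (citation, context) pairs; context is an assumed character window."""
    results = []
    for pattern in CITATION_FILTERS:
        for match in pattern.finditer(document):
            start = max(0, match.start() - window)
            end = min(len(document), match.end() + window)
            # Collapse whitespace so the stored context is a single line.
            context = " ".join(document[start:end].split())
            results.append((match.group("citation"), context))
    return results


def store(rows, conn):
    """Persist citation/context pairs in a hypothetical 'citations' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS citations (citation TEXT, context TEXT)"
    )
    conn.executemany("INSERT INTO citations VALUES (?, ?)", rows)
    conn.commit()


def build_database(document, conn):
    """Run the whole pipeline: pre-process, match citations, store."""
    rows = find_citations(preprocess(document))
    store(rows, conn)
    return rows
```

In this sketch the context is taken as a fixed window of characters around each match. The claim also allows a regular expression to match the context directly, which would replace the window with an additional capture group.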