The present invention relates to managing data sources in a corpus, and more specifically, to identifying new data sources for ingestion into the corpus or determining whether a current data source stored in the corpus is stale.
Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human languages. To interact with humans, natural-language computing systems may use a data store (i.e., a corpus) that is parsed and annotated. For example, the computing system may use the corpus to identify an answer to a question posed by a human user by correlating the question to the annotations in the data store.
Before the NLP computing system is able to interact with a user, the corpus is populated with different text documents. In addition, annotators may parse the text in the corpus to generate metadata about the text. Using the metadata and the stored text, the NLP computing system can interact with the user to, for example, answer a posed question, diagnose an illness based on provided symptoms, evaluate financial investments, and the like. In a sense, the corpus acts like the “brain” of the natural-language computing system.
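By way of illustration only, the populate, annotate, and answer flow described above may be sketched as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the names `Corpus`, `annotate`, `ingest`, `answer`, and `STOP_WORDS` are hypothetical, the annotator merely extracts content words as metadata, and a question is correlated to a document by counting overlapping terms.

```python
import re
from dataclasses import dataclass, field

# Hypothetical stop-word list (an assumption); a real annotator would be far richer.
STOP_WORDS = {"a", "an", "the", "is", "of", "what", "and", "to", "in", "by", "are"}


def annotate(text):
    """Toy annotator: return the set of lowercase content words found in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if w not in STOP_WORDS}


@dataclass
class Corpus:
    documents: list = field(default_factory=list)    # stored text
    annotations: list = field(default_factory=list)  # metadata, one set per document

    def ingest(self, text):
        """Populate the corpus with a document and annotate it."""
        self.documents.append(text)
        self.annotations.append(annotate(text))

    def answer(self, question):
        """Return the document whose annotations best overlap the question, or None."""
        terms = annotate(question)
        best_index, best_score = None, 0
        for i, ann in enumerate(self.annotations):
            score = len(terms & ann)  # number of shared terms
            if score > best_score:
                best_index, best_score = i, score
        return None if best_index is None else self.documents[best_index]


corpus = Corpus()
corpus.ingest("Influenza symptoms include fever, cough, and fatigue.")
corpus.ingest("Index funds track a market benchmark at low cost.")
result = corpus.answer("What illness causes fever and cough?")
```

In this sketch, the question shares the terms "fever" and "cough" with the annotations of the first document, so that document is returned; a question sharing no terms with any annotation yields no answer.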