The present invention relates to extract, transform, and load (ETL) systems, and more specifically, to transforming arbitrary document formats into standard formats using natural language processing (NLP).
Generally, ETL systems extract data from multiple disparate sources, transform the data to fit given operational needs, and then load the data into an end target (e.g., a data store, data warehouse, etc.). Once loaded, other systems can access the data for specified purposes. For example, an analytics system can process the data and derive various metrics that may be of use for an organization. More generally, ETL systems play an important role in many fields.
For example, the medical field has a wealth of information spread across many sources. Research institutions publish medical papers and articles distributed to the medical community. Medical libraries archive textbooks and encyclopedias providing information about diseases, treatments, and the like. Some organizations desire to access this information to develop healthcare solutions, learning mechanisms, treatment decisions, and other beneficial techniques. One approach is to input the information (e.g., medical papers, articles, and texts) into a system that uses natural language processing techniques to parse each of the structured and unstructured texts input to the system.
However, one issue with ingesting documents from multiple sources is that documents from different sources may be formatted or structured differently. That is, data exists in many unstructured, semi-structured, and structured forms. For example, some organizations may organize text files in a structured XML format, while others may organize texts using some other markup language. Although standards bodies have recommended that data be presented in a certain publishing format (e.g., RSS/Atom), many organizations have not adopted such formats. As a result, an ETL administrator must manually examine the different formats of files and determine how the texts should be formatted for the end target system. Once determined, the ETL server can re-format the texts accordingly. An organization, however, may have texts in many formats. Therefore, to standardize the formats, the ETL administrator must identify each individual format provided and discern the relevant fields to extract from the texts. Such an approach quickly becomes time-consuming.
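The manual approach described above can be illustrated with a minimal Python sketch. It is illustrative only and is not taken from any particular ETL product: the target schema (`title`, `body`), the source field names (`content`, `headline`, `text`), and the function names are all hypothetical assumptions. The sketch shows that each source format needs its own hand-written field mapping before records can be loaded into a common target.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical target schema: every document becomes {"title": ..., "body": ...}.


def extract_xml(raw: str) -> dict:
    """Hand-written mapping for an assumed XML source layout."""
    root = ET.fromstring(raw)
    return {
        "title": root.findtext("title", default=""),
        "body": root.findtext("content", default=""),
    }


def extract_json(raw: str) -> dict:
    """Hand-written mapping for an assumed JSON source layout."""
    data = json.loads(raw)
    return {
        "title": data.get("headline", ""),
        "body": data.get("text", ""),
    }


# One entry per source format; an administrator must add a new
# extractor by hand for every additional format encountered.
EXTRACTORS = {"xml": extract_xml, "json": extract_json}


def run_etl(sources, target):
    """Extract each (format, raw_text) pair, transform it, and load it."""
    for fmt, raw in sources:
        extractor = EXTRACTORS.get(fmt)
        if extractor is None:
            raise ValueError(f"no extractor written for format: {fmt}")
        target.append(extractor(raw))
    return target
```

In this sketch, a document in any format without a registered extractor is rejected until someone writes a mapping for it, which is the per-format effort that makes the manual approach time-consuming.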