The present embodiments relate to natural language processing of raw text data. More specifically, the embodiments relate to optimal sentence boundary placement.
Natural language processing (NLP) systems extract information from documents intended for a human audience so that computers can understand the content of those documents. The extracted information is intended to provide a complete and accurate representation of the original content, and can be provided to other computer systems as a plain text output (e.g., raw text data). A classifier can use the plain text output to determine the meaning of the text, which in turn supports other computer systems and can trigger a programmatic function corresponding to that meaning.
Documents such as reports, newspapers, and magazines use stylistic devices, such as paragraph headers, address formatting, lists, and tables, to express content in a way that facilitates organization and understanding. However, such stylistic devices can be difficult to translate into a plain text output for use by a computing system. The translation may contain extraneous information or distorted text, for example headers and list items run together with adjacent sentences, which may limit the performance of downstream NLP.
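By way of a non-limiting illustration, the following Python sketch shows how a flattened plain text output can mislead a simple sentence splitter. The sample text, the `naive_split` helper, and its punctuation-based rule are hypothetical and are assumed here only for illustration; they are not taken from the embodiments.

```python
import re

# Hypothetical plain text output of a document containing an address block,
# a list header, two list items, and two ordinary sentences.
flattened = (
    "Shipping Address\n"
    "John Smith\n"
    "12 Main St.\n"
    "Items\n"
    "Widget\n"
    "Gadget\n"
    "Thank you for your order. It ships today."
)


def naive_split(text):
    """Split after '.', '!' or '?' followed by whitespace (deliberately naive)."""
    # Collapse line breaks and repeated spaces, as a plain text conversion might.
    joined = " ".join(text.split())
    return [s for s in re.split(r"(?<=[.!?])\s+", joined) if s]


sentences = naive_split(flattened)
for sentence in sentences:
    print(repr(sentence))
```

Under these assumptions, the abbreviation "St." is treated as a sentence end, and the list header and list items are merged into the sentence that follows them, so a downstream classifier would receive units that do not correspond to sentences in the original document.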