The present invention relates generally to the field of schema generation, and more particularly to using natural language processing to generate a schema for a corpus of data.
Natural language processing (NLP) integrates computer science, artificial intelligence, and linguistics to enable computer systems to accept inputs in a conventional human (natural) language. NLP enables software to derive meaning from natural language input, a capability referred to as natural language understanding.
An entity-relationship model is used to show and describe existing relationships within a set of data. Entities are linked with other entities by relationships that express the dependencies and requirements between the entities. For example, one building may be divided into zero or more apartments, but one apartment can only be located in one building. Entities may be characterized by various properties or attributes. Diagrams created to represent these entities, attributes, and relationships graphically are called entity-relationship diagrams. Entities can be thought of as nouns, for example, a computer, an employee, a song, or a mathematical theorem. Entities and relationships can both have one or more attributes. For example, an employee entity might have a Social Security Number (SSN) attribute.
A relationship captures how entities are related to one another. Relationships can be thought of as verbs, linking two or more nouns. Examples include an owns relationship between a company and a computer (e.g., company owns computer), a supervises relationship between an employee and a department (e.g., employee supervises department), a performs relationship between an artist and a song (e.g., artist performs song), and a proved relationship between a mathematician and a theorem (e.g., mathematician proved theorem).
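By way of illustration only, the entity-relationship concepts described above may be sketched in Python as follows. The names `Entity`, `Relationship`, `Model`, `relate`, `targets_of`, and the `contains` verb are hypothetical and chosen for this sketch; they are not drawn from any particular system. The sketch treats entities as nouns with attributes, relationships as verbs linking two entities, and enforces the building and apartment rule that one apartment can only be located in one building.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A noun-like item (e.g., an employee) with named attributes."""
    kind: str
    attributes: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """A verb-like link between two entities (e.g., 'owns')."""
    verb: str
    source: Entity
    target: Entity
    attributes: dict = field(default_factory=dict)


class Model:
    """Stores relationships and enforces a simple cardinality rule."""

    def __init__(self, single_source_verbs=()):
        # Assumption for this sketch: for these verbs, a target may have
        # at most one source (an apartment belongs to only one building).
        self.single_source_verbs = set(single_source_verbs)
        self.relationships = []

    def relate(self, verb, source, target, **attributes):
        """Record 'source verb target', rejecting a second source if restricted."""
        if verb in self.single_source_verbs:
            for rel in self.relationships:
                if rel.verb == verb and rel.target is target:
                    raise ValueError(
                        f"{target.kind} already has a '{verb}' source"
                    )
        rel = Relationship(verb, source, target, attributes)
        self.relationships.append(rel)
        return rel

    def targets_of(self, source, verb):
        """Return every entity that 'source' is linked to by 'verb'."""
        return [
            r.target
            for r in self.relationships
            if r.verb == verb and r.source is source
        ]


# One building may contain zero or more apartments.
model = Model(single_source_verbs={"contains"})
building = Entity("building")
apartment_a = Entity("apartment", {"number": "1A"})
apartment_b = Entity("apartment", {"number": "1B"})
model.relate("contains", building, apartment_a)
model.relate("contains", building, apartment_b)

# A verb-style relationship between two nouns: company owns computer.
company = Entity("company")
computer = Entity("computer")
model.relate("owns", company, computer)
```

In this sketch, attempting to relate a second building to `apartment_a` with the `contains` verb raises a `ValueError`, whereas the `owns` verb is unrestricted. Other cardinality rules would require a different check.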
An entity is a set of one or more details that are held about a real-world object such as a person, location, or bank account. An entity is a type of item. A link is an association between two entities, such as an ownership relationship between a person and a vehicle. A property is a single piece of information about an entity.
A schema is a way to define the structure, content, and to some extent, semantics of data. With a defined schema, a user can create instances of data structures that conform to that schema and exchange the data structures, such that others will be able to use that same schema to correctly interpret each instance.
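As a minimal sketch of how data may be checked for conformance with a defined schema, the following Python example declares a small schema of entity types, their properties, and link types, and then validates instances against it. The dictionary layout, the `person`, `vehicle`, and `owns` types, and the functions `validate_entity` and `validate_link` are assumptions made for illustration and do not represent a standard schema format.

```python
# Hypothetical schema: entity types with typed properties, and link types
# that name the entity types allowed at each end.
SCHEMA = {
    "entity_types": {
        "person": {"name": str},
        "vehicle": {"plate": str, "year": int},
    },
    "link_types": {
        "owns": ("person", "vehicle"),
    },
}


def validate_entity(schema, entity):
    """Return a list of problems; an empty list means the entity conforms."""
    properties = schema["entity_types"].get(entity.get("type"))
    if properties is None:
        return [f"unknown entity type: {entity.get('type')!r}"]
    problems = []
    values = entity.get("properties", {})
    for name, expected in properties.items():
        if name not in values:
            problems.append(f"missing property: {name}")
        elif not isinstance(values[name], expected):
            problems.append(f"property {name} should be {expected.__name__}")
    for name in values:
        if name not in properties:
            problems.append(f"unexpected property: {name}")
    return problems


def validate_link(schema, link, entities):
    """Check that the link type is known and both ends have the declared types."""
    ends = schema["link_types"].get(link.get("type"))
    if ends is None:
        return [f"unknown link type: {link.get('type')!r}"]
    problems = []
    for role, expected in zip(("from", "to"), ends):
        entity = entities.get(link.get(role))
        if entity is None:
            problems.append(f"{role} end refers to a missing entity")
        elif entity["type"] != expected:
            problems.append(f"{role} end should be a {expected}")
    return problems


# Instances that conform to the schema above (illustrative data only).
entities = {
    "p1": {"type": "person", "properties": {"name": "Alice"}},
    "v1": {"type": "vehicle", "properties": {"plate": "ABC123", "year": 2015}},
}
link = {"type": "owns", "from": "p1", "to": "v1"}

entity_problems = [validate_entity(SCHEMA, e) for e in entities.values()]
link_problems = validate_link(SCHEMA, link, entities)
```

Because both parties validate against the same `SCHEMA`, a recipient of `entities` and `link` can interpret each instance in the same way as the sender. A production system would more likely rely on an established schema language than on ad hoc dictionaries such as these.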
A corpus is a set of structured and/or unstructured text. Corpora are used, for example, for statistical analysis and hypothesis testing, checking occurrences, and validating linguistic rules within a specific language territory. Annotating corpora can increase their functionality.