The present invention provides a system for analysing online text repositories and, in particular, a system that can analyse online text and automatically extract text sequences that may be of interest to a party in accordance with predetermined criteria.
Social networking sites and other web 2.0 services allow users to create comments or content regarding the goods and services that they have used and for other users to then add further comments. For example, groups on Facebook or Twitter trends can quickly be established which may make unfavourable comments regarding a business's activities or products, leading to subsequent brand damage for the business concerned. Given the large number of social networking sites, blogs, message boards, etc. it is not a practical proposition for a business or enterprise to monitor all of the comments, messages, user generated content, etc., that is generated in order to be able to respond to untrue or unfavourable information.
A known system for monitoring social networks is provided by Radian6 (see http://www.radian6.com). The system extracts relevant posts from social networks, and other sources of user comments, which can be categorised, prioritised and then assigned to an agent for subsequent processing. The classification of a post as being relevant is based solely on a set of keywords.
Another system is provided by Dapper (see http://www.dapper.net), which has the capability to extract the contents of web pages. Dapper requires the user to select a number of sample web pages all residing on the same level of the hierarchy of a hierarchically organised web site. In the case of each page the user identifies those parts of the page contents that are of interest to him. Dapper is then able to extract those contents from other pages residing at the same level of the hierarchy as the selected sample web pages. Dapper is not able to traverse the hierarchy of a web site and to then extract content from multiple levels of the hierarchy.
According to a first aspect of the present invention there is provided a method of automatically extracting content from a data resource, the data resource comprising a plurality of hierarchical levels, each of the plurality of hierarchical levels comprising content defined using mark-up language, and the method comprising a training phase and a content extraction phase, wherein the training phase precedes the content extraction phase.
The training phase comprises the steps of:
i) defining one or more hierarchical levels of interest;
ii) defining an entity which is comprised within the or each hierarchical level of interest defined in step i) and one or more properties associated with that entity; and for each of the entities and the associated entity properties defined in step ii), executing a plurality of training instances, wherein each of the training instances comprises the steps of:
a) defining a value for each of the one or more properties associated with said entity;
b) for each of the property values, determining the containing element which provides a match to the property value and storing the containing element in an instance schema, the instance schema being associated with said entity; and
iii) comparing each of the plurality of instance schemas associated with said entity to generate a final schema for said entity.
The content extraction phase comprises the steps of:
I) comparing a data resource from which content is to be extracted with the composite schema;
II) identifying those entities and their associated properties within the data resource which match the containing elements specified in the composite schema; and
III) extracting those entities and their associated properties identified in step II) from the data resource.
The training phase may comprise the further step of: iv) storing the one or more final schemas in a composite schema which is associated with the data resource.
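By way of illustration only, step b) of the training phase and steps II) and III) of the content extraction phase might be sketched as follows. The sketch assumes well-formed mark-up that can be parsed with Python's xml.etree.ElementTree, represents a containing element simply by its path of tag names, and treats a single instance schema as the schema used for extraction. The function names containing_path, build_instance_schema and extract are hypothetical and do not appear in the method described above.

```python
import xml.etree.ElementTree as ET


def containing_path(root, value):
    """Return the tag path, from the root down, of the first element whose text equals value."""
    def walk(elem, path):
        here = path + (elem.tag,)
        if (elem.text or "").strip() == value:
            return here
        for child in elem:
            found = walk(child, here)
            if found is not None:
                return found
        return None
    return walk(root, ())


def build_instance_schema(root, property_values):
    """Step b): record, for each property, the containing element that matches its value."""
    schema = {}
    for name, value in property_values.items():
        path = containing_path(root, value)
        if path is None:
            raise ValueError(f"no containing element found for property {name!r}")
        schema[name] = path
    return schema


def extract(root, schema):
    """Steps II) and III): find elements matching each stored path and return their text."""
    extracted = {}
    for name, path in schema.items():
        if root.tag != path[0]:
            extracted[name] = []
            continue
        rest = "/".join(path[1:])
        elems = root.findall(rest) if rest else [root]
        extracted[name] = [(e.text or "").strip() for e in elems]
    return extracted


# One training instance: the user supplies a value for each property of the entity.
training_page = ET.fromstring("<ul><li><b>Alice</b><p>Great phone</p></li></ul>")
schema = build_instance_schema(
    training_page, {"author": "Alice", "comment": "Great phone"}
)

# Extraction from a different page that shares the same structure.
target_page = ET.fromstring(
    "<ul><li><b>Bob</b><p>Bad battery</p></li><li><b>Carol</b><p>Love it</p></li></ul>"
)
result = extract(target_page, schema)
```

In this simplified form a single training instance yields the schema directly; the comparison of several instance schemas in step iii) is omitted.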
In step iii) the first instance schema to be generated for an entity may be retained and assigned an occurrence count value of 1. Alternatively, if there is no adequate match between a first instance schema and a second instance schema then the second instance schema is retained and assigned an occurrence count value of 1. In a yet further alternative, if a first instance schema is identical to a second instance schema then the occurrence count of the first instance schema is incremented and the second instance schema is discarded.
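The occurrence counting described above might, purely as an example, be sketched as follows. The sketch assumes that instance schemas can be compared for identity with ==, and it treats any non-identical pair as having no adequate match, which is a simplification; the name tally_instance_schemas is hypothetical.

```python
def tally_instance_schemas(instance_schemas):
    """Return a list of [schema, occurrence_count] pairs in order of first appearance."""
    retained = []
    for schema in instance_schemas:
        for entry in retained:
            if entry[0] == schema:
                # Identical to a retained schema: increment its count, discard the new one.
                entry[1] += 1
                break
        else:
            # First schema seen, or no match (assumed here): retain with a count of 1.
            retained.append([schema, 1])
    return retained


schema_a = ("ul", "li", "b")
schema_b = ("ul", "li", "span")
tallies = tally_instance_schemas([schema_a, schema_a, schema_b])
```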
A derived instance schema may be created by merging a first instance schema with a second instance schema. The derived instance schema may be created in this way if there is an adequate degree of similarity between the first and second instance schemas. Such a derived instance schema may be created if the first and second instance schemas comprise: a) a common start-tag; b) identical sub-element hierarchies; and c) an equal number of property elements comprised within the sub-element hierarchies. The first derived instance schema to be generated for an entity may be retained and assigned an occurrence count value of 1. If a first derived instance schema is identical to a second derived instance schema then the occurrence count of the first derived instance schema is incremented and the second derived instance schema is discarded.
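A minimal sketch of the three merge conditions a) to c) is given below. It assumes a hypothetical InstanceSchema record in which the start-tag is held as a tag name, the sub-element hierarchy as a nested tuple of tag names, and the property elements as a tuple of paths. How the merged schema itself is constructed is not shown, since it is not specified above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceSchema:
    start_tag: str          # tag name of the containing element (assumed representation)
    hierarchy: tuple        # nested tuple of sub-element tag names
    property_paths: tuple   # one entry per property element


def mergeable(first, second):
    """Check conditions a), b) and c) for creating a derived instance schema."""
    return (
        first.start_tag == second.start_tag                            # a) common start-tag
        and first.hierarchy == second.hierarchy                        # b) identical hierarchies
        and len(first.property_paths) == len(second.property_paths)    # c) equal property count
    )


s1 = InstanceSchema("li", (("b",), ("p",)), (("li", "b"), ("li", "p")))
s2 = InstanceSchema("li", (("b",), ("p",)), (("li", "b"), ("li", "p")))
s3 = InstanceSchema("div", (("b",), ("p",)), (("div", "b"), ("div", "p")))
```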
Step iii) may comprise the step of determining which of the plurality of instance schemas and derived instance schemas has an occurrence frequency which exceeds a predetermined threshold value. The predetermined threshold value may be 60%. Three or more training instances may be executed for each of the entities.
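As an example only, the threshold test might be sketched as follows. The sketch assumes that the occurrence frequency of a schema is its occurrence count divided by the total of all counts for the entity, which is one possible reading; the name select_final_schemas is hypothetical.

```python
def select_final_schemas(tallies, threshold=0.6):
    """Return the schemas whose occurrence frequency exceeds the threshold."""
    total = sum(count for _, count in tallies)
    if total == 0:
        return []
    # Frequency is assumed to be count / total; only values strictly above the threshold pass.
    return [schema for schema, count in tallies if count / total > threshold]


# Three training instances, two of which produced the same schema.
selected = select_final_schemas([("schema_a", 2), ("schema_b", 1)])
# Three training instances that all disagree.
none_selected = select_final_schemas([("schema_a", 1), ("schema_b", 1), ("schema_c", 1)])
```

Under this reading, with three training instances and a threshold of 60%, two matching instance schemas (a frequency of about 67%) are sufficient to produce a final schema, whereas three differing schemas produce none.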
According to a second aspect of the present invention there is provided a data carrier for use in a computing device, the data carrier comprising computer executable code which, in use, performs a method as described above.
According to a third aspect of the present invention there is provided an apparatus comprising one or more central processing units, one or more data storage means and a network interface, the apparatus, in use, being configured to perform a method as described above.