The present invention relates to a system for analysing textual data and in particular to a system for analysing online content to identify textual data of interest.
Social networking sites and other Web 2.0 services allow users to create comments or content regarding the goods and services that they have used and for other users to then add further comments. For example, Facebook groups or Twitter trends can quickly become established in which unfavourable comments are made regarding a business's activities or products, leading to subsequent brand damage for the business concerned. Given the large number of social networking sites, blogs, message boards, etc., it is not practical for a business or enterprise to monitor all of the comments, messages, user generated content, etc., that are generated in order to be able to respond to untrue or unfavourable information.
A known system for monitoring social networks is provided by Radian6 (see http://www.radian6.com). The system extracts relevant posts from social networks, and other sources of user comments, which can be categorised, prioritised and then assigned to an agent for subsequent processing. The classification of a post as being relevant is based solely on a set of keywords.
According to a first aspect of the present invention, there is provided a method of automatically generating a mark-up language schema, the method comprising the steps of: a) receiving a plurality of training samples, the or each training sample identifying one or more mark-up language elements stored within an online data resource; b) for each of the plurality of received training samples, automatically generating a candidate mark-up language schema; c) for each of the plurality of candidate mark-up language schemas, comparing that candidate schema with the remainder of the candidate schemas to determine how many of the schemas match and selecting a candidate mark-up language schema if the proportion of matching candidate schemas exceeds a predetermined threshold; d) if none of the plurality of candidate mark-up language schemas matches a sufficient number of the other schemas, generating a further mark-up language schema and executing a further instance of step c); and e) reiterating step d) until one of the candidate schemas matches a sufficient number of the other schemas.
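By way of illustration only, steps a) to e) might be sketched as follows. The sketch makes several assumptions that are not taken from the method itself: a training sample is represented as a slash-separated path of element names, a schema is a tuple of element names, two schemas are taken to "match" when the shorter is a suffix of the longer, and the further schema of step d) is the suffix common to all current candidates. The function names and the `max_rounds` safeguard are likewise hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Schema = Tuple[str, ...]  # assumed representation: a path of element names


def candidate_from_sample(sample: str) -> Schema:
    """Step b): derive a candidate schema from one training sample (assumed format)."""
    return tuple(part for part in sample.split("/") if part)


def schemas_match(a: Schema, b: Schema) -> bool:
    """Assumed matching rule: the shorter schema is a suffix of the longer one."""
    shorter, longer = sorted((a, b), key=len)
    return longer[len(longer) - len(shorter):] == shorter


def common_suffix(candidates: Sequence[Schema]) -> Schema:
    """Assumed rule for step d): the element path shared by the end of every candidate."""
    suffix: List[str] = []
    for parts in zip(*(reversed(c) for c in candidates)):
        if len(set(parts)) != 1:
            break
        suffix.append(parts[0])
    return tuple(reversed(suffix))


def select_schema(candidates: Sequence[Schema], threshold: float) -> Optional[Schema]:
    """Step c): pick a candidate if the proportion of the others matching it exceeds the threshold."""
    for i, candidate in enumerate(candidates):
        others = list(candidates[:i]) + list(candidates[i + 1:])
        if not others:
            continue
        matching = sum(1 for other in others if schemas_match(candidate, other))
        if matching / len(others) > threshold:
            return candidate
    return None


def generate_schema(samples: Sequence[str], threshold: float = 0.5,
                    max_rounds: int = 5) -> Optional[Schema]:
    """Steps a) to e); max_rounds is an added safeguard against endless iteration."""
    candidates = [candidate_from_sample(s) for s in samples]
    for _ in range(max_rounds):
        chosen = select_schema(candidates, threshold)
        if chosen is not None:
            return chosen
        candidates.append(common_suffix(candidates))  # step d): a further schema
    return None


if __name__ == "__main__":
    samples = [
        "html/body/div/p",
        "html/body/section/div/p",
        "html/body/article/div/p",
    ]
    print(generate_schema(samples))
```

In this example none of the three initial candidates matches a sufficient number of the others, so a further, more general schema is generated and the comparison of step c) is repeated with it included.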
According to a second aspect of the present invention, there is provided a method of analysing mark-up language text, the method comprising the steps of: i) applying a mark-up language schema to an online data resource, the mark-up language schema comprising a plurality of mark-up language elements; ii) identifying one or more data elements comprised within the online data resource, the or each data element being associated with a particular mark-up language element; and iii) extracting those data elements identified in step ii), wherein the mark-up language schema is generated using a method as described above.
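Steps i) to iii) might, purely as a sketch, be realised along the following lines. It is assumed here that the online data resource is HTML text, that the schema is simply a set of element names, and that the data element associated with a mark-up language element is the text it encloses; the class and function names are hypothetical. Only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser
from typing import Iterable, List, Tuple


class SchemaExtractor(HTMLParser):
    """Collects the text enclosed by elements named in an assumed schema."""

    def __init__(self, schema_elements: Iterable[str]) -> None:
        super().__init__()
        self.schema_elements = {name.lower() for name in schema_elements}
        self._stack: List[Tuple[str, List[str]]] = []  # open schema elements
        self.extracted: List[Tuple[str, str]] = []     # (element, text) pairs

    def handle_starttag(self, tag, attrs):
        # Step ii): an element named in the schema has been found.
        if tag in self.schema_elements:
            self._stack.append((tag, []))

    def handle_data(self, data):
        # Text belongs to the innermost open schema element, if any.
        if self._stack:
            self._stack[-1][1].append(data)

    def handle_endtag(self, tag):
        # Step iii): extract the data element when its mark-up element closes.
        if self._stack and self._stack[-1][0] == tag:
            name, chunks = self._stack.pop()
            text = "".join(chunks).strip()
            if text:
                self.extracted.append((name, text))


def extract(schema_elements: Iterable[str], markup: str) -> List[Tuple[str, str]]:
    """Step i): apply the schema to the mark-up text and return the extracted data."""
    parser = SchemaExtractor(schema_elements)
    parser.feed(markup)
    parser.close()
    return parser.extracted


if __name__ == "__main__":
    page = ("<html><body><h1>Review</h1>"
            "<div><p>Great phone.</p><span>advert</span></div></body></html>")
    print(extract({"h1", "p"}, page))
```

Text outside the schema elements, such as the `span` content in the example, is ignored, so only the data elements associated with the schema are returned.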
Further aspects of the present invention provide a digital data carrier comprising computer executable code for performing one of the methods described above, and an apparatus comprising one or more processors and one or more memory means, the apparatus, in use, being configured to perform one of the methods described above.