E-commerce websites attempt to use relevant product descriptions to organize results and to improve user experience. However, product descriptions, particularly those supplied by sellers, tend to contain irrelevant information (e.g., information about the end-user or shipping information). A technique such as supervised text segmentation can be employed to identify the relevant segments of a description.
Text segmentation is the process of dividing text (e.g., a paragraph, page, or description) into meaningful units, such as breaking down a body of text into meaningful clauses or phrases. For a machine to perform text segmentation, units of text (e.g., words) are tagged to guide the machine. Segmenting text using tags previously defined by a user is referred to as supervised text segmentation.
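As an illustration of how word-level tags can define segments, the following minimal sketch merges consecutive words that carry the same tag into one segment. The tag names ("relevant", "irrelevant"), the sample description, and the function name segments_from_tags are assumptions made for this example only; they are not taken from any particular system.

```python
from itertools import groupby

# Hypothetical tagged description: each word carries a tag supplied by a user.
tagged_words = [
    ("Stainless", "relevant"),
    ("steel", "relevant"),
    ("water", "relevant"),
    ("bottle", "relevant"),
    ("Ships", "irrelevant"),
    ("within", "irrelevant"),
    ("two", "irrelevant"),
    ("days", "irrelevant"),
]


def segments_from_tags(tagged):
    """Merge consecutive words that share a tag into one (tag, text) segment."""
    return [
        (tag, " ".join(word for word, _ in group))
        for tag, group in groupby(tagged, key=lambda pair: pair[1])
    ]


segments = segments_from_tags(tagged_words)
for tag, text in segments:
    print(f"{tag}: {text}")
```

In this sketch a segment boundary falls wherever the tag changes, so the description splits into one relevant segment followed by one irrelevant segment.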
One example of a challenge that may exist for supervised text segmentation techniques is the expense of manually tagging data. Human annotators mark the relevant and irrelevant portions of a set of training cases. A system then attempts to learn a set of rules from the training cases in order to segment new cases. Such supervised techniques may require a large number of tagged training cases. Another example of a challenge that may exist is that training cases tagged for one type of product cannot be generalized to different types of products. As a result, supervised techniques may also face a challenge in terms of scalability.
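The generalization challenge can be illustrated with a deliberately simple sketch. It assumes, purely for illustration, that the "rules" learned from training cases are a lookup of the most frequent tag for each word; real supervised techniques use richer models. The training data, the "unknown" fallback tag, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def learn_word_tags(training_cases):
    """Learn, for each word, the tag annotators assigned to it most often."""
    counts = defaultdict(Counter)
    for case in training_cases:
        for word, tag in case:
            counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag_words(words, word_tags, default="unknown"):
    """Tag new words with the learned lookup; unseen words get the default tag."""
    return [(w, word_tags.get(w.lower(), default)) for w in words]


# Hypothetical training cases, all tagged for one product type (shirts).
training_cases = [
    [("cotton", "relevant"), ("shirt", "relevant"),
     ("free", "irrelevant"), ("shipping", "irrelevant")],
    [("blue", "relevant"), ("shirt", "relevant"),
     ("fast", "irrelevant"), ("shipping", "irrelevant")],
]

word_tags = learn_word_tags(training_cases)

# A new case of the same product type is covered by the learned lookup.
same_type = tag_words(["cotton", "shirt", "shipping"], word_tags)

# A case from a different product type contains words never seen in training.
other_type = tag_words(["lithium", "battery", "shipping"], word_tags)
```

Under these assumptions, every word of the shirt description receives a learned tag, while the product-specific words of the battery description fall back to "unknown". Covering the second product type would require additional manually tagged cases, which is the scalability cost described above.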