With the rapid development of network technology and various information technologies, the content that a user may access is increasingly free of the constraints of time and space. As a result, the likelihood that the user is exposed to content that is unhealthy or threatening to the client, such as pornography, violence and viruses, has increased significantly, which leads to an ever stronger requirement for screening of communication content. Existing screening technologies include list screening, keyword screening, template screening and categorization screening, among which categorization-based content screening is a research topic of considerable interest due to its flexibility and wide applicability. Meanwhile, over 50 years of development of automatic digest technology, the related fundamental technologies (such as automatic word segmentation) have also matured and given rise to a number of application systems. Automatic digest for western languages is particularly well developed. Research on video digest technology has also produced substantial results and is becoming well established.
Referring to FIG. 1, the current Categorization-Based Content Screening (CBCS) architecture is principally divided into two parts: a content screening component and a content categorization component. The content categorization component provides an interface CBCS-1, through which a content categorization requester (either the content screening component internal to the architecture or an external requester) may obtain the content category of the content to be categorized. The parameters that the content categorization requester can input include the content itself, content references such as Uniform Resource Identifiers (URIs), and/or other content-related information (such as the content provider).
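As an illustration only, the input parameters described above might be modeled as follows. The class name, the field names and the rule that at least one of the content or the URI must be present are assumptions made for this sketch; they are not defined by the CBCS-1 interface as described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CategorizationRequest:
    """Hypothetical container for the inputs a requester may pass over CBCS-1."""

    content: Optional[bytes] = None          # the content itself
    content_uri: Optional[str] = None        # a content reference (URI)
    content_provider: Optional[str] = None   # other content-related information

    def __post_init__(self) -> None:
        # Assumption for this sketch: the provider needs either the content
        # or a URI, since only these two describe the content directly.
        if self.content is None and self.content_uri is None:
            raise ValueError("either content or content_uri must be supplied")

    def identifies_by_reference(self) -> bool:
        """True when the request carries a content reference (URI)."""
        return self.content_uri is not None


# Example: a request that carries only a content reference.
by_reference = CategorizationRequest(content_uri="http://example.com/item")
```

A request built from the content alone, for example `CategorizationRequest(content=b"...")`, corresponds to the situation examined in the process of FIG. 2.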
FIG. 2 shows a prior-art process by which the content categorization requester obtains the content category of the content to be categorized. The process includes the following steps:
Step 1: The content categorization requester determines to request the content category by means of the content itself. For example, in case one, the content categorization requester is the content screening component, and a content screening request received by the content screening component contains only the content itself, without any content reference or pre-categorization information. In case two, the content categorization requester is the content screening component, and the content of a content screening request received by the content screening component is pre-categorized content, but the pre-categorization information is unreliable and there is no content reference. In case three, the content categorization requester is the content screening component, and a content screening request received by the content screening component contains the content itself and its content reference, but the content categorization provider (the content categorization component) cannot provide category information corresponding to the content reference. In case four, the content provider acts as the content categorization requester and requests the category information only in order to generate pre-categorized content, so that the content subsequently provided by the content provider carries its category, which can then be used directly. In case five, the content categorization requester is the content screening component, and a content screening request received by the content screening component contains the content itself and its content reference; however, the content itself is used directly to request the content category because the content screening component is configured in a mode that does not support obtaining the category based on the content reference.
Step 2: The content categorization requester formulates a content categorization request message which carries the content itself, and transmits the message to the content categorization provider.
Step 3: The content categorization provider extracts the content itself from the content categorization request message and applies an appropriate categorization algorithm to it.
Step 4: The content categorization provider formulates a response message and returns the content category to the content categorization requester.
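The four steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the function names, the dictionary message layout and the keyword table are hypothetical, and the keyword match merely stands in for the unspecified "appropriate algorithm" of Step 3. Case four (a content provider requesting a category for its own content) is not modeled in the Step 1 decision.

```python
# Hypothetical keyword table; a stand-in for a real categorization algorithm.
KEYWORDS = {
    "violence": ("violence", "assault"),
    "malware": ("virus", "trojan"),
}


def must_send_content(has_uri: bool, provider_knows_uri: bool,
                      reference_mode_enabled: bool) -> bool:
    """Step 1: decide whether the category must be requested by content."""
    if not has_uri:
        return True   # cases one and two: no content reference available
    if not reference_mode_enabled:
        return True   # case five: reference-based lookup is not supported
    return not provider_knows_uri   # case three: no category for this URI


def build_request(content: bytes) -> dict:
    """Step 2: the requester formulates a message carrying the content."""
    return {"type": "categorization-request", "content": content}


def categorize(content: bytes) -> str:
    """Placeholder algorithm: first category whose keyword appears."""
    text = content.decode("utf-8", errors="ignore").lower()
    for category, keywords in KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "uncategorized"


def handle_request(message: dict) -> dict:
    """Steps 3 and 4: extract the content, categorize it, build a response."""
    content = message["content"]
    return {"type": "categorization-response",
            "category": categorize(content)}


# Example round trip for a request that carries only the content itself.
if must_send_content(has_uri=False, provider_knows_uri=False,
                     reference_mode_enabled=True):
    response = handle_request(build_request(b"a report on street violence"))
```

Note that in this flow the entire content travels inside the request message, whatever its size.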
Among the input parameters, only the content itself and the URI corresponding to the content reflect the content directly. The URI is not always available. Moreover, the content categorization provider may not always be able to provide the content category corresponding to the URI (because, for example, the corresponding content category is not stored within the content categorization component and cannot be obtained by the content categorization component from an external source). In this case, the content categorization requester can provide only the content itself to the content categorization provider. However, the content may be very large and may need to be carried in a plurality of data packets partitioned from the content categorization request message. The content categorization provider must then not only parse the content to be categorized out of the request message, but also perform a large amount of buffering and reordering of the content. Only after that can the categorization be performed according to the categorization algorithm.
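The buffering and reordering burden described above can be illustrated with a small sketch. The packet layout (a sequence number paired with a payload), the function and class names and the packet size are all assumptions chosen for illustration; the source does not specify how a request message is partitioned.

```python
import random
from typing import Dict, List, Tuple


def partition(content: bytes, packet_size: int) -> List[Tuple[int, bytes]]:
    """Split the content into (sequence number, payload) packets."""
    return [
        (seq, content[offset:offset + packet_size])
        for seq, offset in enumerate(range(0, len(content), packet_size))
    ]


class Reassembler:
    """Provider-side buffer that holds packets until all have arrived."""

    def __init__(self, total_packets: int) -> None:
        self.total_packets = total_packets
        self.buffer: Dict[int, bytes] = {}

    def receive(self, seq: int, payload: bytes) -> None:
        self.buffer[seq] = payload   # packets may arrive in any order

    def complete(self) -> bool:
        return len(self.buffer) == self.total_packets

    def buffered_bytes(self) -> int:
        return sum(len(payload) for payload in self.buffer.values())

    def content(self) -> bytes:
        if not self.complete():
            raise ValueError("not all packets have been received")
        # Reordering: concatenate the payloads by sequence number.
        return b"".join(self.buffer[seq] for seq in range(self.total_packets))


# Example: a 10,240-byte content sent in 1,000-byte packets, out of order.
original = bytes(range(256)) * 40
packets = partition(original, packet_size=1000)
random.Random(0).shuffle(packets)

reassembler = Reassembler(total_packets=len(packets))
for seq, payload in packets:
    reassembler.receive(seq, payload)
restored = reassembler.content()
```

In this sketch the provider has to hold every byte of the content in its buffer before categorization can begin, so the memory and transmission cost grows with the size of the content.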
In the course of research, the inventor found the following: in the prior art, of the information provided by the content categorization requester to the content categorization provider when requesting the content category, only two input parameters, the content itself and the URI, reflect the content directly. There is no efficient processing method for the case in which the content itself is provided to the content categorization provider. This not only increases the processing burden on the content categorization provider, but also increases network transmission traffic, especially when an external entity requests the content category through the interface CBCS.