Electronic data (documents containing text, and the textual captions or tags of audio, video, images, etc.) usually contains ‘meta-data’, i.e. data describing data, generated to help readers understand what is described in the document. This meta-data is generated from the title of the document, the keywords used in the document, or some of the sub-titles/headings of the document. The meta-data can then be embedded in the document as a property (for example, Microsoft Word documents have a property that can store document-related information). However, the problem with this approach is that the keywords may not convey the full idea of the contents of the document.
Keywords/sub-titles may also mislead a human reviewer. For example, the human reviewer may infer from the keywords “Shakespeare” and “Hamlet” that a document discusses “Shakespeare's Hamlet”. However, the document may contain only a single sentence about “Shakespeare” and “Hamlet”, with the remaining text unrelated to either. Another problem with this approach is that a human author needs to identify the keywords and sub-titles associated with the document and add them as a property of the document. This must be done manually, which raises concerns such as consumption of human time, labor cost, and the possibility of manual errors.
For documents on the web (i.e. web pages), search engines extract all the words used in the web pages and index each page based on those words. In this way the words of the document become the meta-data for the document. This meta-data then works as an index for a user who wants to understand the document without going over its details. In this case, the web search engine may index the document under keywords that have little relevance to the context of the document. For example, a page may be dedicated to Shakespeare in general and have little relevance to Shakespeare's drama Hamlet, yet still be indexed under “Hamlet”. The onus of finding the correct web page hence rests on the human reader, who must not only provide the correct keywords while searching, but also go through (read and understand) the web pages returned by the search engine in order to find the web page that has the required information.
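The word-based indexing described above can be illustrated with a minimal sketch of an inverted index. This is an illustration only, not the implementation of any particular search engine; the function names (`tokenize`, `build_index`, `search`) and the two sample pages are hypothetical. It shows how a page that merely mentions “Hamlet” once is returned alongside a page that is actually about the play.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Lowercase the text and split it into alphanumeric words.
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    # Map every word to the set of page ids that contain it.
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_id)
    return index

def search(index, query):
    # Return the pages that contain every word of the query.
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical sample pages (assumed for illustration).
pages = {
    "hamlet_study": "Hamlet is a tragedy by Shakespeare. "
                    "Hamlet, prince of Denmark, seeks revenge.",
    "bard_biography": "Shakespeare was born in Stratford. "
                      "He wrote many plays, among them Hamlet. "
                      "He later retired to Stratford.",
}

index = build_index(pages)
hits = search(index, "Shakespeare Hamlet")
print(sorted(hits))
```

Both pages match the query because the index records only the presence of words, not how central they are to the page, so the reader still has to open each result to judge its relevance.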
Search engines also display the results of a search as snippets, which consist of the sentences that contain the searched keyword together with the sentences adjacent to them. This approach, though helpful in identifying the exact sentence and the block of text around the keyword, is not helpful in identifying the overall context of the document. Thus a web search results in a temporary ‘Denial of Information’, where the user may end up browsing a page in which a single sentence contains the keyword but which has no relation to the keyword in its overall context. For example, a search on ‘Beethoven’ may lead to a blog whose author has watched the movie of the same name, though the search was intended for ‘Beethoven the Composer’.
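The snippet approach described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of any actual search engine: sentences are split naively on terminal punctuation, and the function names (`split_sentences`, `snippet`), the `window` parameter, and the sample blog text are all hypothetical.

```python
import re

def split_sentences(text):
    # Naive split: break after '.', '!' or '?' followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]

def snippet(text, keyword, window=1):
    # Keep each sentence containing the keyword, plus `window`
    # neighbouring sentences on either side of it.
    sentences = split_sentences(text)
    keep = set()
    for i, sentence in enumerate(sentences):
        if keyword.lower() in sentence.lower():
            start = max(0, i - window)
            end = min(len(sentences), i + window + 1)
            keep.update(range(start, end))
    return [sentences[i] for i in sorted(keep)]

# Hypothetical blog post (assumed for illustration).
blog = (
    "I spent the weekend repainting the kitchen. "
    "On Sunday night we watched Beethoven, the movie about the dog. "
    "The kids loved it. "
    "Next week I will start on the garden fence."
)

for line in snippet(blog, "Beethoven"):
    print(line)
```

The snippet shows the matching sentence and its neighbours, but nothing in it indicates that the page as a whole is a home-improvement blog rather than a page about the composer.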
Some search engines also examine the structure of the web pages and provide the starting sentences of each paragraph of the web page. This approach, though helpful, does not capture the essence of the document, as it concentrates on the titles/headings and opening sentences and not on the semantics of the entire content of the document.
Certain systems exist for calculating a summary of a document based on the semantics of the document. However, the summary calculation depends on a small context of input documents and does not take into account the massive corpus of the Internet, and hence does not address the large-scale summarization required at that scale.
In the case of networks such as the Internet, the transfer of a large number of semantically irrelevant documents for consumption by humans results in a waste of network bandwidth.
Thus these systems do not prevent ‘Denial of Information’, where the human reader is flooded with information in the form of hundreds of documents or web pages that may not be relevant, resulting in wastage of user time, network bandwidth, and client/server computing time.
All these systems lack the ability to provide a more detailed document search that takes into account a large corpus of documents and provides a fast, concise, complete, and understandable summary of the document content, one that enables the human reader to quickly analyze the document semantically and move on to the next document.
Accordingly, a need exists for a method and system that provide semantically generated meta-data, i.e. a summary or semantic excerpts for a document, using a large corpus, which human readers can use effectively to quickly understand the context of the document, thus preventing a ‘Denial of Information’ and the loss of computing and network resources.