When documents are created, many decisions must be made as to style, content, layout, and the like. The text, images, and graphics must be organized and laid out in a two-dimensional format with the intention of providing a presentation to the viewer that will capture and preferably maintain their attention for a time sufficient to get the intended message across. Different style options are available for the various content elements and choices must be made. The best choices for style and layout depend upon content, intent, viewer interests, etc. In order to tell if a set of choices made as to the look and feel of the final version of the document were good or bad, one might request feedback from a set of viewers after they have viewed the document and compile the feedback into something meaningful from which the document's creators or developers can make alterations, changes, or other improvements. This cycle repeats until the document's owners are satisfied that the final version achieves the intended result. Alternatively, as will be discussed in more detail below, existing sets of documents may be analyzed to determine those that have a favorable style and/or layout so as to result in more frequent access to or citation of the document.
One factor that contributes to the quality and effectiveness of layout and style decisions for a document is the handling of groups of content elements, because style and layout choices affect groups of content. A group is a collection of content elements. Group membership is a property of the logical structure of the document. The neighborhood of groups can be considered a layout property. While layout structure often matches the logical structure, there is no requirement that it do so.
Preferably, one would like to have a quantitative measure of various value properties of the document (measures of the document “goodness”) based on properties inherent in the document itself. In this manner the document itself provides a level of quantitative feedback. For instance, one property that developers would like to be able to measure would be how easy it is to use a document. A measure for the ease of use of a document can be used in evaluating or making document design decisions.
One aspect of the ease of use of a document is one's ability to tell which elements belong to a group and which do not. The style and layout decisions that are made in the presentation of a document can affect the degree of group identity that it conveys. In evaluating a document's design for its ease of use, it is useful to have a measure of the degree of group identity. Considerations for ease-of-use with respect to groups include spatial coherence, spatial separation, alignment separation, heading separation, background separation, and/or style separation. Measures for various characteristics of content, feature, and the like could be weighted by intent, relevance, and other parameters and these could then be combined to obtain one or more overall measures for the document itself. If one had a method for evaluating properties inherent in the document itself then such a measure could be used during the document development process to help determine optimal presentation.
An aspect of the ease of use of a document is its searchability. Searchability can be defined as the degree to which the document structurally supports the finding of a desired content element. A document with high searchability provides aids that help in finding desired content. In general, a document with a high searchability measure is easier to use because it is easy to locate the portion of the document containing the information of interest.
Another aspect of a document's ease of use is the document's degree of distinguishability. The distinguishability of content can be defined as the ability to identify one particular content element from another content element within the document. Distinguishability is important in establishing the context for the information disclosed by the element. It can reduce confusion about what that element is and to what group or setting it belongs. It can also aid in locating a desired element. The distinguishability of the document elements is therefore a contributing factor to the ease of use of the document.
Another property that would be desirable to be able to quantitatively measure is the ability of the document to hold the viewer's attention and interest. While much of the document's ease of use depends upon the actual content and its relevance to the viewer, there can also be a contribution from the style with which that content is presented. If a measure of the effect of style decisions on ease of use could be defined it could be used in determining a measure of optimal presentation.
Documents can present content in ways that make it easier to locate individual items. This can be referred to as ‘locatability’. A way to distinguish one content object from another object is to evaluate the target object's locatability, i.e., how easy it is to find an object within the document. This is a little different from distinguishability, which tells how well an item can be differentiated from its neighbors. Structural aids such as the layout of tables or bullet lists help the document viewer to locate objects. Presenting content in a table allows its location to be identified by row or column. The presence of headings for the rows and columns can further increase the ease of locating items. Presenting content items in a list introduces an ordering that aids in locating them, and the use of list bullets or item numbers aids further. Separability and distinguishability contribute to the locatability of an object.
Measures for various aspects of content, features, and the like could be weighted by intent, relevance, and other parameters and these could then be combined to obtain one or more overall measures for the document itself. If one had a method for evaluating such properties inherent in the document itself then such a measure could be used during the document development process to help determine optimal presentation.
Therefore, it is desirable to provide a methodology to measure the quality of a document in a quantifiable way. Moreover, it is desirable to provide a quantifiable measurement of quality that is useable in evaluating the document and improving its quality so as to add value to the information being conveyed through the document.
For at least a century, there has been substantial consideration and investigation, within the document-using community, of the extent to which various document formatting and stylistic elements contribute to or detract from document effectiveness. Such stylistic elements include, for example, choice of typeface, type style (such as serif versus sans-serif, fixed pitch versus proportionally spaced), type size, number of text columns, right-justified versus ragged-right text, etc. Since these stylistic elements are mostly second-order contributors to document effectiveness (clearly secondary to document content), a large amount of data is required to support reliable conclusions in this area.
However, most previous considerations and investigations into document effectiveness have relied on, at most, manual data collection. Hence, prior investigations have been limited to fairly small amounts of data relative to that which would be required for reliable conclusions, and have yielded mostly speculative and unconvincing results. What remains as a need in the art is a method for a document user to determine and know, with confidence, what document style characteristics lead to improved or greater document effectiveness.
Heretofore, a number of patents and publications have disclosed methods for identifying related documents and citations.
U.S. Pat. No. 6,182,091 to J. Pitkow et al., issued Jan. 30, 2001, and hereby incorporated by reference in its entirety for its teachings, discloses a method and apparatus for identifying related documents in a collection of linked documents. In the method, the link structure of documents to other documents is analyzed. By analyzing only the link structure, a processing-intensive content analysis of the documents is avoided.
U.S. Pat. No. 6,038,574 to J. Pitkow et al., issued Mar. 14, 2000, and hereby incorporated by reference in its entirety for its teachings, teaches a method comprising: generating a document collection; for each document, determining the frequency of linkage, i.e., the number of times it is linked to by another document in the collection; thresholding the documents based on some minimum frequency of linkage; creating a list of pairs of documents that are linked to by the same document, so that each of the pairs of documents has a count of the number of times (the co-citation frequency) that they were both linked to by another document; and clustering the pairs using a suitable co-citation clustering technique.
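By way of illustration only, the co-citation steps summarized above might be sketched as follows. This is a minimal sketch and not the implementation of the referenced patent: the function names cocitation_counts and cluster_pairs, the thresholds, and the toy link data are all hypothetical, and the single-link grouping shown is merely one technique that could stand in for "a suitable co-citation clustering technique".

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Set


def cocitation_counts(links: Dict[str, Set[str]], min_inlinks: int) -> Counter:
    """Count how often each pair of sufficiently linked documents is co-cited.

    `links` maps a citing document to the set of documents it links to.
    """
    # Frequency of linkage: number of other documents linking to each document.
    inlinks: Counter = Counter()
    for source, targets in links.items():
        for target in targets:
            if target != source:
                inlinks[target] += 1

    # Threshold: keep only documents with at least `min_inlinks` inbound links.
    kept = {doc for doc, n in inlinks.items() if n >= min_inlinks}

    # Co-citation frequency: pairs of kept documents linked from the same source.
    pairs: Counter = Counter()
    for source, targets in links.items():
        cited = sorted(t for t in targets if t in kept and t != source)
        for a, b in combinations(cited, 2):
            pairs[(a, b)] += 1
    return pairs


def cluster_pairs(pairs: Counter, min_cocitations: int) -> List[Set[str]]:
    """Group documents joined by pairs co-cited at least `min_cocitations` times.

    Assumption: simple single-link grouping via union-find.
    """
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for (a, b), count in pairs.items():
        if count >= min_cocitations:
            parent[find(a)] = find(b)

    clusters: Dict[str, Set[str]] = {}
    for doc in list(parent):
        clusters.setdefault(find(doc), set()).add(doc)
    return list(clusters.values())


# Hypothetical toy collection: p1..p4 are citing pages, a..d are cited pages.
links = {
    "p1": {"a", "b", "c"},
    "p2": {"a", "b"},
    "p3": {"b", "c"},
    "p4": {"d"},
}
pairs = cocitation_counts(links, min_inlinks=2)
clusters = cluster_pairs(pairs, min_cocitations=2)
```

In this toy collection, document d is dropped by the linkage threshold, and the pair (a, c) is co-cited only once, so it does not by itself join a and c; they end up in the same group only through their shared co-citation with b.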
In “Online or invisible?” by S. Lawrence, published in Nature, Volume 411, Number 6837, p. 521, 2001, the author discusses research relating to the investigation of the impact of free online availability of publications by analyzing citation rates. Associated with this publication is an exemplary CiteSeer listing of the publication of “RCS A System for Version Control” by Walter F. Tichy (1991), where the abstract of the publication is listed along with CiteSeer details relating to the document's citations by other publications.
The present invention is directed to a method for determining a document's overall effectiveness or quality using an automated investigation and computation of document citation rate versus presentation elements such as style and layout. A document's citation rate is the number of citations of or references to that document from other documents. This is taken as an indicator of a document's overall effectiveness. This invention employs automated means to obtain, for a sample of documents, both presentation data and citation rate data. Presentation data is obtained, for each document in the sample, by automated inspection of the document, for stylistic elements. The citation rate for each document is based on the number of citations (e.g., hyperlinks) to that document from another set of documents, the larger the set the better. The present invention then computes the statistical correlation of document citation rate versus presentation elements used, in a straightforward manner to identify correlation between the citation rate and presentation element(s).
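As a hedged illustration of the correlation step, the following sketch computes a Pearson correlation coefficient between a presentation element and citation rate for a small sample. The choice of Pearson correlation, the function name pearson, the two-column feature, and all numeric values are assumptions made up for this example; they are not data or a prescribed statistic from the present description.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


# Made-up sample of six documents (illustrative values only).
# Presentation element: 1 if the document uses a two-column layout, else 0.
two_column = [1, 1, 0, 0, 1, 0]
# Citation rate: number of inbound citations or hyperlinks found for each document.
citation_rate = [12, 9, 3, 5, 10, 2]

r = pearson(two_column, citation_rate)
print(f"correlation between two-column layout and citation rate: {r:.3f}")
```

With a binary presentation element such as this one, the Pearson coefficient coincides with the point-biserial correlation. A sample this small would not support a reliable conclusion; in practice the sample would need to be large, consistent with the observation that stylistic elements are second-order contributors.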
In accordance with the present invention, there is provided a method for characterizing at least one target document's overall effectiveness amongst a set of documents, comprising: a) obtaining at least one indicator for said target document relating to its citation rate within the set of documents; b) characterizing at least one common presentation element for each of said documents in the set of documents; c) computing a statistical correlation between the indicator and the at least one common presentation element for said target document and the set of documents; and d) employing said correlation as an indicator of said target document's overall effectiveness.
In accordance with another aspect of the present invention, there is provided a method for quantifying a measure of quality of a document, comprising: (a) measuring a predetermined set of characteristics of the document; (b) quantizing the measured predetermined set of characteristics of the document; and (c) generating a quantized interest value for the document based on a predetermined combining function that includes a citation-correlation aspect, the predetermined combining function combining the quantized measured predetermined set of characteristics, the quantized interest value being a measure of quality of the document.
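One way the measuring, quantizing, and combining steps might fit together is sketched below. This is an assumption-laden illustration rather than the claimed method: the functions quantize and interest_value, the five-level quantization, the characteristic names, and the idea of weighting each characteristic by the magnitude of its correlation with citation rate (inverting characteristics with a negative correlation) are all hypothetical choices.

```python
from typing import Dict


def quantize(value: float, low: float, high: float, levels: int = 5) -> float:
    """Map a raw measurement in [low, high] to one of `levels` evenly spaced values in [0, 1]."""
    if high <= low or levels < 2:
        raise ValueError("need high > low and at least two levels")
    clamped = min(max(value, low), high)
    step = round((clamped - low) / (high - low) * (levels - 1))
    return step / (levels - 1)


def interest_value(quantized: Dict[str, float], correlation: Dict[str, float]) -> float:
    """Combine quantized characteristics into one value in [0, 1].

    Assumption: each characteristic is weighted by |correlation with citation rate|;
    a characteristic with negative correlation contributes (1 - value).
    """
    total_weight = sum(abs(correlation[name]) for name in quantized)
    if total_weight == 0:
        raise ValueError("at least one characteristic must have nonzero correlation")
    score = 0.0
    for name, q in quantized.items():
        r = correlation[name]
        score += abs(r) * (q if r >= 0 else 1.0 - q)
    return score / total_weight


# Hypothetical measured characteristics, already quantized to [0, 1].
quantized = {"heading_density": 0.75, "line_length": 0.25}
# Hypothetical correlations of each characteristic with citation rate.
correlation = {"heading_density": 0.6, "line_length": -0.2}

value = interest_value(quantized, correlation)
```

Here heading_density contributes 0.75 with weight 0.6, and line_length, being negatively correlated, contributes 1 - 0.25 = 0.75 with weight 0.2, so the combined value is 0.75.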
One aspect of the invention is based on the discovery that the citation rate of a document is indicative of the quality or effectiveness of the document at communicating the information therein. This discovery avoids problems that arise in attempting to characterize document quality and is believed to lead to an objective measurement and weighting of document presentation elements that can be used to characterize documents.
The techniques described herein are advantageous because they may be completed by automated systems and provide the capability of obtaining objective document quality feedback from existing and publicly available document databases. Some of the techniques can be used to identify presentation elements that have a significant impact on document effectiveness.
The present invention will be described in connection with a preferred embodiment(s); however, it will be understood that there is no intent to limit the present invention to the embodiments described herein. On the contrary, the intent is to cover all alternatives, modifications, and equivalents as may be included within the spirit and scope of the present invention, as defined by the appended claims.