Introductory material is presented in this section, covering (A) specific principles guiding language-based authorship attribution within the forensic setting; (B) general principles of authorship attribution as a pattern-recognition problem; (C) background information in authorship attribution, including variables, methods and results of others; and (D) principles of syntax, markedness and part-of-speech tagging which underlie embodiments of the present invention.
A. Language-Based Authorship Attribution in the Forensic Setting.
During the course of criminal investigations, documents come to light whose authorship is uncertain but yet can be legally significant. Authorship determination is important in situations such as: a ransom note in a kidnaping; a threatening letter; anonymous letters; suicide notes; interrogation and/or interview statements; locating missing persons; employment disputes; examination fraud; plagiarism; will contests; peer review of reports in various other situations; and other contested issues of authorship. In view of the current focus on terrorism and the search for persons involved in terrorist acts, making terroristic threats, or kidnaping of citizens, the determination of authorship also plays a significant role.
While in the past these documents were generally hand-written, increasingly they are being produced with the aid of computers, over electronic networks, or on printers or copiers, thus precluding the use of “standard” document analysis, which has typically focused on handwriting analysis or analysis of the imprints of typewriter keys. In situations involving printed, electronically-produced or facsimile-transmitted documents, rather than hand-written ones, the linguistic features of the document become important factors for determining its authorship.
In contrast to handwriting examination or typewriter analysis, language-based authorship attribution relies on linguistic characteristics as variable sets for differentiating and identifying authors. In the literature on authorship attribution, there are four linguistic-variable classes which have been used by others and are sometimes combined with each other. These linguistic-variable classes are: (1) lexical, (2) stylometric, (3) graphemic, and (4) syntactic.
Lexical variables include vocabulary richness and function word frequencies. (Function words in English are a closed set of words which specify grammatical functions, such as prepositions, determiners and pronouns.)
Stylometric variables include word length, sentence length, paragraph length, counts of short words, and such.
Graphemic variables include the counts of letters and punctuation marks in a text.
Syntactic variables include the counts of syntactic part-of-speech tags such as noun, verb, etc., and adjacent part-of-speech tags.
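As a rough illustration of the stylometric and graphemic classes described above, the following minimal sketch counts a few such variables for one text. The function name, the naive sentence splitter, and the three-letter cutoff for "short words" are assumptions made for the example; they are not taken from any of the cited studies. Lexical and syntactic variables are omitted because they require a word list or a part-of-speech tagger.

```python
import re
import string

def stylometric_and_graphemic_counts(text, short_word_max=3):
    """Return a few stylometric and graphemic measures for one text.

    Hypothetical helper: the sentence splitter and the short-word
    cutoff are simplifying assumptions, not a published definition.
    """
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Words are runs of ASCII letters and apostrophes.
    words = re.findall(r"[A-Za-z']+", text)
    return {
        # Stylometric measures
        "n_sentences": len(sentences),
        "n_words": len(words),
        "mean_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "mean_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        "short_words": sum(len(w) <= short_word_max for w in words),
        # Graphemic measures
        "letters": sum(ch.isalpha() for ch in text),
        "punctuation_marks": sum(ch in string.punctuation for ch in text),
    }

sample = "Pay now. Or else, you will regret it!"
features = stylometric_and_graphemic_counts(sample)
```

Each document (or sentence) would contribute one such row of counts to the data set that a classification algorithm later consumes.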
As will be shown in the specification and defined by the claims, new linguistic-variable sets are defined within these classes; these variable sets are specifically applicable to authorship attribution in both forensic and non-forensic settings.
Authorship attribution in the forensic setting must meet certain criteria in order to be admitted as scientific evidence or entertained seriously as investigative support. In Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579, 27 USPQ2d 1200 (1993), the Supreme Court set out guidelines which substantially changed the admissibility of scientific evidence within the federal court system, and which have become applicable in a number of state court jurisdictions as well. The criteria described herein are not those described in Daubert, but those that this inventor believes should guide the development of an authorship identification method, and which will later ensure the admissibility of such evidence. Accordingly, these criteria are linguistic defensibility, forensic feasibility, statistical testability, and reliability.
First, the method must be linguistically defensible. Basic assumptions about language structure, language use, and psycholinguistic processing should undergird the method. The linguistic variables which are ultimately selected should be related in a straightforward way to linguistic theory and psycholinguistics; the linguistic variables should be justifiable. For example, function words have been used in many lexical approaches to authorship attribution, perhaps most famously by Mosteller and Wallace (1984). Function words can be justified as a potential discriminator for two reasons: first, function words are a lexical closed class, and second, function words are often indicators of syntactic structure. Psycholinguistically, function words are known as a distinct class for semantic processing and the syntactic structures which function words shadow are known to be real. A method based on function words is linguistically defensible because there is a fairly obvious way for a linguist to relate this class of discriminators to what we already know about language structure and psycholinguistic processing.[1]
[1] However, function words may not be the most direct way to access the linguistic knowledge and behavior which function words apparently reflect.
Second, the method must be forensically feasible. Specifically, a forensically feasible method must be sensitive to the actual limitations of real data and the basis of expert opinion. Foremost, the method must be designed to work within the typical forensic situation of brevity and scarcity of texts. The importance of this criterion cannot be ignored because forensic feasibility will impact both the selection of linguistic variables and the selection of statistical procedures. Many of the lexical approaches which have been developed within literary studies have rightfully exploited the lexical richness and high word counts of such literary data, but these same approaches are not forensically feasible because the typical forensic data is too short or too lexically restricted. Further, statistical procedures which require hundreds of cases to fit a large number of variables are not always forensically feasible because in the typical forensic situation there are not hundreds of texts to be analyzed. Due to the scarcity of texts, either the texts can be separated into smaller units to provide additional cases or the linguistic variables can be collapsed. But in either text-decomposition or variable-reduction, linguistic defensibility must again be maintained. For example, it was once suggested that split-half reliability testing be performed at the word level: every other word of a document was extracted and that extracted portion was tested against the remainder of the original document (Miron 1983). While this kind of text-decomposition is understandable as a way of dealing with the scarcity of texts, this particular technique is linguistically indefensible because, by relying on a basic assumption that language is just a “bag of words” rather than a structured system, the approach totally ignores the fact that there is a linearized and syntactic structure in text which is psychologically real to the author of the document.
Another impact of the forensic feasibility criterion concerns the basis of expert opinion. In the forensic setting, the expert witness stakes his or her reputation on the accuracy of the data analysis. Therefore, any “black box” methods which are automatized to the extent that the analyst cannot supervise, error-correct or otherwise intervene in the basic data analysis may not be acceptable to forensic practitioners or linguists who do not wish to serve as mere technician-servants of the machine. On the other hand, automatization of many types of linguistic analysis provides a welcome way to avoid examiner bias and fatigue. The best approach, therefore, appears to be an interactive, user-assisted automatic computerized analysis, since the machine can provide objective, rule-based analysis and the human can correct any analytical errors the machine might make.
Third, the method must be statistically testable. Specifically, this criterion requires that the linguistic variables—even if they are categorical—can be operationally defined and reproduced by other linguists. This criterion does not reject categorical linguistic variables which may have their basis in qualitative analysis, but it does reject subjective reactions to style such as “sounds like a Clint Eastwood movie” or “not what a blue-collar worker would write”. These quotations are not facetious, but actual comments from experts whose reports this inventor has read.
Fourth, the method must be reliable, based on statistical testing. The level of reliability can be obtained through empirical testing. Naturally, the most accurate method is most welcome in the forensic setting, but even a method with an empirically-based, statistically-derived overall accuracy rate of only 85% or 90% is better than any method whose reliability is unproven, untested, anecdotal or simply hypothesized and then stated as accomplished fact.
If an authorship attribution method meets these scientific criteria, it will surely meet success within the legal arena under the Daubert-Joiner-Kumho criteria. Linguistic defensibility speaks to general acceptance among peers; linguists are certainly far more likely to accept any method which is based on standard techniques of linguistic theory as well as conceptions of language congruent with linguistic theory and psycholinguistic experimentation than one based on prescriptive grammar or literary sensibility. Forensic feasibility speaks to the appropriate application of the method to typical forensic data and the credibility of the testimony. Finally, both statistical testing and reliability speak to the error rate, and again, the credibility and weight of the testimony.
Given these criteria for developing a forensic method of determining authorship, many current proposals or methods are eliminated. For instance, vocabulary-richness methods requiring texts of more than 1000 words cannot be applied in the typical forensic situation; there is simply not enough data in forensically-relevant texts. Error analysis looks for errors in punctuation, spelling and word usage, based on the assumption that errors are idiosyncratic, and that the configuration of errors possessed by one person is a characteristic of that individual. However, errors are often so rare that they do not occur with enough frequency to be statistically testable (Koppel and Schler 2003; Chaski 2001). Syntax-based proposals are more promising, because every text contains phrases which contain syntactic structures, but some types of syntactic structures require more data than is forensically feasible.
The linguistic variables and method set forth in this application are defensible in terms of linguistics as a science, are forensically feasible because they can work on short texts, have been statistically tested and have been found to be reliable.
B. Authorship Attribution as a Pattern-Recognition Problem.
Authorship attribution is a pattern recognition problem. In any pattern recognition problem, the basic task is to determine the optimal fit between feature sets and algorithms. The interaction between features and classification procedures is an intricate dance that can only be completely understood through empirical testing. As in any pattern recognition problem, these two sides to the solution have to work together. The first side is the variables which quantify the textual data, and the second side is the algorithm which classifies the variables. The optimal solution consists of a variable set matched with a classification algorithm to achieve a correct attribution, as shown in FIG. 1, illustrating the Variable Sets being processed by Classification Algorithms which produce accuracy results through standard statistical methods.
The classification algorithms used with these variable sets are standard procedures, including discriminant function analysis, logistic regression, decision trees, and support vector machines. Each of these methods creates a model based on training data and then tests the model by predicting the correct author of a new document.
When small amounts of data are available, which is typically the situation in forensic authorship attribution, those skilled in the art in pattern-recognition problems utilize a cross-validation technique. Cross-validation is a way of testing how good the model is, based on all the data that is available. For example, in “leave-one-out” (“LOO”) cross-validation, one data row is left out during the model-building and its membership is predicted; it is then put back into the model-building, while the next data row is left out and its membership is predicted. Other cross-validation schemes are available, such as four-fold or ten-fold (where one-fourth or one-tenth of the data, respectively, is left out of the model-building on each pass and its membership is then predicted).
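The leave-one-out procedure can be sketched as follows. This is a minimal illustration only: the nearest-centroid classifier is a simple stand-in chosen to keep the example self-contained, not discriminant function analysis or any other algorithm named in this section, and the function names and toy data are hypothetical.

```python
def nearest_centroid_predict(train_rows, train_labels, row):
    """Assign `row` to the label whose centroid is closest (squared Euclidean).

    Stand-in classifier for illustration; any classification
    algorithm with the same call shape could be substituted.
    """
    sums, counts = {}, {}
    for features, label in zip(train_rows, train_labels):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    best_label, best_dist = None, float("inf")
    for label, acc in sums.items():
        centroid = [v / counts[label] for v in acc]
        dist = sum((a - b) ** 2 for a, b in zip(row, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def leave_one_out_accuracy(rows, labels, predict=nearest_centroid_predict):
    """Leave each data row out once, build the model on the rest, predict it."""
    correct = 0
    for i in range(len(rows)):
        train_rows = rows[:i] + rows[i + 1:]        # every row except row i
        train_labels = labels[:i] + labels[i + 1:]
        if predict(train_rows, train_labels, rows[i]) == labels[i]:
            correct += 1
    return correct / len(rows)

# Toy data: two features per text, three texts per author.
rows = [[1.0, 2.0], [1.2, 1.8], [0.9, 2.1], [3.0, 0.5], [3.2, 0.4], [2.9, 0.6]]
labels = ["A", "A", "A", "B", "B", "B"]
accuracy = leave_one_out_accuracy(rows, labels)
```

A k-fold scheme follows the same loop, except that a block of rows rather than a single row is withheld on each pass.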
Accuracy results include how many times the left-out documents are classified to the correct author, as well as how many times a new document is classified to the correct author when the model is tested. Since each classification algorithm has different assumptions and requirements for the data, any of these algorithms can be used with the variable sets described within the present invention if enough textual data which meets the requirements of the algorithm(s) is available.
Aspects of the pattern-recognition approach to authorship attribution are known. Generally, rather than focusing on the handwriting of the document, this language-based, pattern-recognition approach to determining the authorship of a document, or other textual work, such as a book, manuscript, or the like, involves the steps of tagging the documents for linguistic characteristics, counting the tags, and statistically testing the counts through a classification procedure. Within this paradigm, the methods differ in terms of the linguistic-variable sets employed, the classification algorithms and their overall accuracy results.
For example, recent studies in this paradigm such as those of Stamatatos et al. (2001), Baayen et al. (2002), Chaski (2004) and Tambouratzis et al. (2004) have examined lexical, syntactic and punctuation variables with discriminant function analysis, one of several statistical procedures for classifying and predicting group membership. As shown by these studies, combining different types of features (e.g. lexical with punctuation, or lexical, punctuation and syntactic) improved performance for the discriminant analysis. These studies provide some support for and are consistent with earlier findings that syntax and punctuation, in general, can reliably distinguish authors (Chaski 2001).
Discriminant function analysis consistently performs well as a classification procedure for authorship attribution. Baayen et al. (2002) demonstrated that discriminant analysis performed much better in their authorship attribution experiment than principal components analysis. In earlier work, Stamatatos et al. (2000) showed that discriminant analysis performed better than multiple regression at classifying documents by author and genre.
This application addresses the task of achieving feature-algorithm optimality within the forensic setting. Thus, the present application is directed towards the variable sets used to quantify the textual data, and a method and system for obtaining cross-validation in the classification algorithms.
This inventor has developed variable sets which can be used with several available classification algorithms and different amounts of textual data, using discriminant function analysis, logistic regression, decision trees, and support vector machines. To date, the best accuracy results have been obtained using the variable sets described herein with discriminant function analysis, decision trees and logistic regression (results are reported in Section C). Embodiments of the present invention employ, in contrast to known methods, both sentence-level and document-level data for use with the classification algorithms.
Embodiments of the present invention utilize cross-validated classification algorithms with sets of variables comprising syntactic and graphemic features. In contrast to previous methods, the method described herein has achieved an overall accuracy rate of 95%.
C. Prior Art Methods in Authorship Attribution in Contrast to Specification.
This section reviews work by Stamatatos et al. (2001), Tambouratzis et al. (2004) and Baayen et al. (2002). These studies illustrate the use of linguistic variables in the pattern-recognition paradigm, and they are in general similar to the specification, but they each differ fundamentally from the invention/specification in two ways. First, each of these studies uses standard, well-known linguistic variables which are different from the linguistic-variable sets specified in the invention/specification. Second, each of these studies uses standard, well-known cross-validation procedures for document-level data, which are different from the cross-validation procedure in the invention/specification. This section concludes with a brief summary of experimental results using the invented variables and method, demonstrating that the invention has achieved higher accuracy rates than previously obtained in the prior art.
Stamatatos et al. (2001) demonstrated that a totally automated analysis using syntactic and lexical variables obtains an accuracy rate ranging from 74% to 87%. The corpus consisted of 30 texts for each of 10 authors, newspaper columnists writing on a range of topics including biology, history, culture, international affairs and philosophy. The texts ranged in word length from less than 500 words to more than 1,500 words. In total, the corpus contained 333,744 words. Twenty texts of each author were used to train a linear discriminant function analysis; the remaining ten texts of each author were then classified according to the closest Mahalanobis distance from each of the groups' centroids.
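Classification by closest distance to group centroids can be illustrated with the sketch below. It assumes only two features so that the 2x2 covariance inverse can be written out by hand, and it assumes a pooled within-group covariance matrix, as is conventional for linear discriminant analysis. The function names and toy data are hypothetical and are not taken from the study.

```python
def centroid(rows):
    """Mean of each of the two features over a group's rows."""
    n = len(rows)
    return [sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n]

def pooled_covariance(groups):
    """Pooled within-group covariance for two features: [[sxx, sxy], [sxy, syy]]."""
    sxx = sxy = syy = 0.0
    dof = 0
    for rows in groups.values():
        cx, cy = centroid(rows)
        for x, y in rows:
            sxx += (x - cx) ** 2
            sxy += (x - cx) * (y - cy)
            syy += (y - cy) ** 2
        dof += len(rows) - 1  # degrees of freedom contributed by this group
    return [[sxx / dof, sxy / dof], [sxy / dof, syy / dof]]

def mahalanobis_squared(point, center, cov):
    """Squared distance d^T * inverse(cov) * d, with the 2x2 inverse written out."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = point[0] - center[0], point[1] - center[1]
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det

def classify(point, groups):
    """Return the group whose centroid is nearest in Mahalanobis distance."""
    cov = pooled_covariance(groups)
    return min(groups, key=lambda g: mahalanobis_squared(point, centroid(groups[g]), cov))

# Toy training data: two (hypothetical) feature values per text.
groups = {
    "author_1": [(1.0, 2.0), (1.5, 2.5), (2.0, 2.2), (1.2, 1.6)],
    "author_2": [(4.0, 5.0), (4.5, 5.6), (5.0, 5.1), (4.2, 4.4)],
}
predicted = classify((1.3, 2.1), groups)
```

With an identity covariance matrix the measure reduces to squared Euclidean distance, which is a convenient sanity check on the formula.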
The linguistic variables used in the linear discriminant function analysis included 50 lexical features and 22 syntactic features. The lexical features were the frequencies of the 50 most frequent words in the training texts normalized for text-length. Using these 50 lexical features, the average (or overall) accuracy (or correct classification) was 74%. The syntactic features included sentences/words (average sentence length), punctuation marks/words, detected versus potential sentence boundaries, length of phrasal chunks for noun, verb, adverb, preposition and conjunction, and information about parsing such as the number of words untagged for part-of-speech after a number of passes. None of the linguistic variables used by Stamatatos et al. (2001) are the same as the syntactic or graphemic variables described in embodiments of the present invention.
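The lexical features described above (frequencies of the most frequent training-set words, normalized for text length) might be computed along the following lines. This is a sketch under assumptions: the tokenizer, the function names, and the choice of relative frequency (count divided by total tokens) as the normalization are illustrative, since the exact normalization is not specified here.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase and split into word tokens (a simplifying assumption)."""
    return re.findall(r"\w+", text.lower())

def most_frequent_words(training_texts, k=50):
    """Return the k most frequent words across all training texts."""
    counts = Counter()
    for text in training_texts:
        counts.update(tokenize(text))
    return [word for word, _ in counts.most_common(k)]

def lexical_features(text, vocabulary):
    """Relative frequency of each vocabulary word in one text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # guard against empty texts
    return [counts[word] / total for word in vocabulary]

# Tiny illustration with k=2 instead of 50.
training = ["the cat and the dog", "the bird and a fish"]
vocab = most_frequent_words(training, k=2)
vector = lexical_features("the dog saw the cat", vocab)
```

Because the vocabulary is fixed from the training texts, every text is mapped to a vector of the same length, which is what the discriminant analysis requires.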
Using the 22 syntactic features, the average accuracy was 81%. When the lexical and syntactic features were combined into a 72-feature set, the highest accuracy rate of 87% was obtained. Most of the lexical variables are frequencies of Modern Greek function words (determiners, prepositions, pronouns, complementizers and so forth). Given that function words often shadow syntactic structure, the accuracy result may actually be due to the underlying syntactic structure signaled lexically. But what is especially interesting in Stamatatos et al.'s study is that direct syntactic measures improve on the accuracy rate based on the lexical measures.
In stated contrast to Stamatatos et al.'s work, Tambouratzis et al. (2004) focused on determining authorship within one register (as defined by general topic). Transcripts of speeches delivered in the Greek Parliament by five parliament members over the period 1996-2000 were extracted from a record prepared by the Greek Parliament Secretariat. The speeches ranged in length from less than 300 words to more than 5,000 words. With over 1,000 texts, the total corpus consisted of 1,292,321 words. The corpus for each speaker ranged in size from 463,680 words for Speaker A to 177,853 words for Speaker B. Further, the number of speeches given by each speaker ranged from 418 for Speaker A to 85 for Speaker B. Speakers C, D, and E's total number of speeches and total word count of speeches fell between the maximum of Speaker A and minimum of Speaker B.
Several variable sets of 46, 85, and 25 features were used for linear discriminant analysis. These sets included both lexical and syntactic variables. Lexical variables consisted of specific words. Syntactic variables included part-of-speech (POS) tags and morphological inflections, where POS includes noun, verb, adjective, adverb, and so forth. Other variables included word and sentence length as well as punctuation marks, and information about parsing, such as the number of tokens unidentified by the tagger. A forward stepwise discriminant analysis with 85 variables indicated that only 25 variables were actually used to generate the classification. The 25 variables included lexical, stylometric, syntactic and punctuation information. Lexical variables included frequencies of specific words (one, me, mister). Stylometric variables included average number of letters per word (word length). Syntactic variables included frequencies of adverbials, conjunctions, verbs, articles and other parts-of-speech. Punctuation variables included the frequencies of dashes and question marks. The combination of features used by Tambouratzis et al. (2004) is not the same as the syntactic or graphemic variables described in embodiments of the present invention.
Leave-one-out as well as ten-fold cross-validation was selected with discriminant function analysis. The average cross-validated accuracy rate for the five speakers, using the 85-variable set, and texts of any length was 85%. Speaker A's speeches were the most correctly classified at 92.3%, while Speaker F's speeches were the most difficult to classify with an accuracy rate of 78.3%. When the cases were restricted to speeches at least 500 words long, the accuracy rate improved to 89%. But now Speaker D's speeches obtained the highest accuracy rate at 93.7%, (Speaker A's rate having fallen to 90%), and the speeches of Speaker F were still the most difficult to classify at an accuracy rate of 83.3%.
Baayen et al. (2002) demonstrated that lexical and punctuation variables, using nine texts per author on two versions of discriminant function analysis (“DFA”) and two versions of cross-validation, obtain cross-validated accuracy rates from 49% to 88%. Baayen et al.'s (2002) experiments used eight “naive writers,” i.e. first- and fourth-year college students who each wrote three texts in each of three genres (fiction, argument and description). The students were specifically asked to write texts of around 1000 words. The 72 texts' average length was 908 words, so the total corpus can be estimated at approximately 65,000 words.
When only the lexical, function word variables were included and the cross-validation procedure included texts of all three genres, the standard pairwise discriminant function analyses resulted in an overall accuracy rate of 49%. When the cross-validation procedure was modified so that the genre of the holdout (or left-out) text was matched by the validation texts, the overall accuracy rate improved to 79%. Under the same modification to cross-validation, when the standard discriminant function analysis was enhanced by weighting the vectors by the entropy of the words (so that novel words across texts weigh more than redundant words), the overall accuracy increased to 82%.
The frequencies of eight punctuation marks constituted the punctuation mark variables. When these punctuation mark variables were added to the lexical function word variables, with the modified cross-validation procedure and the entropy-enhanced discriminant function analysis, the overall accuracy for the 28 author-pairs increased to 88%. None of the punctuation features used by Baayen et al. (2002) are the same as the syntactic or graphemic variables employed in embodiments of the present invention.
Table 1 summarizes the three studies described above.
TABLE 1
Summary of Recent Authorship Attribution Cross-Validation Results

| Study | Stamatatos | Tambouratzis | Baayen |
|---|---|---|---|
| Language | Modern Greek | Modern Greek | Dutch |
| Authors | 10 | 5 | 8 |
| Number of Texts | 30 | 1000 | 72 |
| Total Wordcount | 333,744 | 1,292,321 | ~65,000 |
| Statistical Procedure | Linear DFA (a) | Linear DFA | Linear DFA and entropy-enhanced DFA (“EDFA”) |
| Features | Lexical, Syntactic & Punctuation | Lexical, Syntactic & Punctuation | Lexical & Punctuation |
| Best Overall Accuracy Rates | 87% | 89% | LDFA: 57%; EDFA: 88% |
| Authors' Range of Accuracy Rates | unreported | 94%-83% | unreported |

(a) DFA = discriminant function analysis.
Some of the differences between the prior art methods described above (Table 1) and embodiments of the present invention include:
1. no lexical variables, i.e., no specific words, function words or word frequencies, are used in the present invention.
2. syntactic variables based on the combinatoric markedness of the parts-of-speech are used in the present invention, not counts of specific parts-of-speech.
3. graphemic variables based on the syntactic edges to which punctuation marks attach are used in the present invention, not counts of specific punctuation marks.
4. embodiments of the present invention allow for sentence-level data, as well as document-level data, to be used for model-building, not only document-level data.
5. embodiments of the present invention allow for sentence-level data for model-building and document-level cross-validation (called LODO cross-validation, detailed in the specification).
6. an embodiment of the present invention specifies a method of reducing the number of linguistic variables based on the markedness contrast, so that fewer documents can be used. (Markedness will be defined and discussed in Section D).
7. an embodiment of the present invention specifies a method of reducing the number of linguistic variables based on the nominal/predicative contrast, so that fewer documents can be used. (The nominal/predicative contrast will be defined and discussed in Section D).
8. an embodiment of the present invention specifies a set of part-of-speech tags which are the building blocks for the combinatoric markedness of phrases.
Both markedness and the nominal/predicative contrast will be defined and discussed in Section D: Background Information on Syntax, Markedness and Part-Of-Speech Tagging.
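The idea of building a model on sentence-level rows while cross-validating at the document level (leave one document out, or LODO) can be sketched as follows. This is an illustrative reading only: the actual LODO procedure is detailed in the specification, and the majority vote over a held-out document's sentences, the nearest-centroid stand-in classifier, and all names and data below are assumptions made for the example.

```python
from collections import Counter

def fit_centroids(rows):
    """rows: list of (author, features). Returns {author: centroid}."""
    sums, counts = {}, Counter()
    for author, features in rows:
        acc = sums.setdefault(author, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[author] += 1
    return {a: [v / counts[a] for v in acc] for a, acc in sums.items()}

def predict(centroids, features):
    """Stand-in classifier: nearest centroid by squared Euclidean distance."""
    return min(
        centroids,
        key=lambda a: sum((x - c) ** 2 for x, c in zip(features, centroids[a])),
    )

def lodo_accuracy(sentence_rows):
    """sentence_rows: list of (doc_id, author, features), one row per sentence."""
    doc_ids = sorted({doc for doc, _, _ in sentence_rows})
    correct = 0
    for held_out in doc_ids:
        # Withhold every sentence of one document; train on all other sentences.
        train = [(a, f) for d, a, f in sentence_rows if d != held_out]
        test = [(a, f) for d, a, f in sentence_rows if d == held_out]
        centroids = fit_centroids(train)
        # Assumption: the document is assigned by majority vote of its sentences.
        votes = Counter(predict(centroids, f) for _, f in test)
        if votes.most_common(1)[0][0] == test[0][0]:
            correct += 1
    return correct / len(doc_ids)

rows = [
    ("d1", "A", [1.0, 0.2]), ("d1", "A", [1.1, 0.1]),
    ("d2", "A", [0.9, 0.3]), ("d2", "A", [1.2, 0.2]),
    ("d3", "B", [0.1, 1.0]), ("d3", "B", [0.2, 1.1]),
    ("d4", "B", [0.3, 0.9]), ("d4", "B", [0.1, 1.2]),
]
accuracy = lodo_accuracy(rows)
```

Withholding whole documents, rather than single sentences, keeps sentences from the test document out of the training data, so the accuracy estimate reflects attribution of an unseen document.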
The basis of the present invention is that individuals pay more attention to meaning than to form. The formal combination of words into syntactic structures is both so habitual and so variable that it cannot be easily imitated or adopted by others, and therefore these highly individualized and unconsciously created patterns enable different authors to be reliably distinguished from each other.
Thus, by focusing on those features of language that are highly unconscious and individualizable one can seek to identify the author of a work such as a document, using these features with an appropriate classification procedure.
Experimental results (described below and in the Detailed Description of the Invention section) demonstrate that an embodiment of the present invention achieves higher accuracy results than those obtained in prior art methods.
In this example, ten authors were drawn from Chaski's Writing Sample Database, a collection of writings on particular topics designed to elicit several genres such as narrative, business letter, love letter and personal essay (Chaski 1997, 2001). The ten authors are five women and five men, all white adults who have completed high school up to three years of college at open-admission colleges. The authors range in age from 18 to 48. The authors all have extensive or lifetime experience in the Delmarva (Delaware, Maryland, Virginia) dialect of the mid-Atlantic region of the United States. The authors are “naive writers” (in terms of Baayen et al. 2002) with similar background and training. The authors volunteered to write, wrote at their leisure, and were compensated for their writings through grant funding from the National Institute of Justice, US Department of Justice.
The authors all wrote on similar topics, listed in Table 2.
TABLE 2
Topics in the Writing Sample Database

| Task ID | Topic |
|---|---|
| 1. | Describe a traumatic or terrifying event in your life and how you overcame it. |
| 2. | Describe someone or some people who have influenced you. |
| 3. | What are your career goals and why? |
| 4. | What makes you really angry? |
| 5. | A letter of apology to your best friend |
| 6. | A letter to your sweetheart expressing your feelings |
| 7. | A letter to your insurance company |
| 8. | A letter of complaint about a product or service |
| 9. | A threatening letter to someone you know who has hurt you |
| 10. | A threatening letter to a public official (president, governor, senator, councilman or celebrity) |
In order to have enough data for the statistical procedure to work, but in order to make this experiment as forensically feasible as possible, the number of documents for each author was determined by however many were needed to hit targets of approximately 100 sentences and/or about 2,000 words. One author needed only 4 documents to hit both targets, while two authors needed ten documents. Three authors needed 6 documents to hit the sentences target but only one of these three exceeded the words target. The exact details are shown in Table 3: Authors and Texts.
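TABLE 3
Authors and Texts

| Race, Gender | Topics by Task ID | Author ID Number | Number of Texts | Number of Sentences | Number of Words | Average Text Size (Min, Max) (a) |
|---|---|---|---|---|---|---|
| WF (b) | 1-4, 7, 8 | 16 | 6 | 107 | 2,706 | 430 (344, 557) |
| WF | 1-5 | 23 | 5 | 134 | 2,175 | 435 (367, 500) |
| WF | 1-10 | 80 | 10 | 118 | 1,959 | 195 (90, 323) |
| WF | 1-10 | 96 | 10 | 108 | 1,928 | 192 (99, 258) |
| WF | 1-3, 10 | 98 | 4 | 103 | 2,176 | 543 (450, 608) |
| WF Total | | | 35 | 570 | 10,944 | |
| WM (c) | 1-8 | 90 | 8 | 106 | 1,690 | 211 (168, 331) |
| WM | 1-6 | 91 | 6 | 108 | 1,798 | 299 (196, 331) |
| WM | 1-7 | 97 | 6 | 114 | 1,487 | 248 (219, 341) |
| WM | 1-7 | 99 | 7 | 105 | 2,079 | 297 (151, 433) |
| WM | 1-7 | 168 | 7 | 108 | 1,958 | 278 (248, 320) |
| WM Total | | | 34 | 541 | 9,012 | |
| Grand Total | | | 69 | 1,111 | 19,956 | |

(a) (Min, Max) = Minimum, Maximum.
(b) WF = White, Female.
(c) WM = White, Male.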
Authors are compared to each other in pairs. Comparing two authors at a time gets better results than comparing multiple authors. That is, higher accuracy rates for distinguishing the documents of different authors and assigning documents to the correct author are obtained with pairwise author testing.
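Pairwise testing means that every unordered pair of authors becomes its own two-class problem. The short sketch below enumerates those pairs; the helper name is hypothetical, and the author ID numbers are the ten listed in Table 3. For ten authors there are 45 pairs, and for eight authors there are 28, which matches the 28 author-pairs reported for the Baayen et al. (2002) study.

```python
from itertools import combinations
from math import comb

def author_pairs(author_ids):
    """Return every unordered pair of authors, in sorted order."""
    return list(combinations(sorted(author_ids), 2))

# Author ID numbers as listed in Table 3.
ids = [16, 23, 80, 96, 98, 90, 91, 97, 99, 168]
pairs = author_pairs(ids)
n_pairs = len(pairs)  # equals comb(10, 2)
```

Each pair would then be passed separately to the chosen classification algorithm, and the overall accuracy reported as a summary across pairs.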
Table 4 shows the performance of some of the proposed linguistic-variable sets with available classification algorithms using commercially available software, SPSS (Statistical Package for the Social Sciences, SPSS Inc., Chicago, Ill.). These proposed linguistic-variable sets enable these classification algorithms to achieve higher accuracy rates than have been previously reported in the literature (as summarized above, Table 1).
TABLE 4
Accuracy Rates Using Syntactic Analysis and Variable Sets

| Exp (a) | Variable Sets | Textual Units (for counts of variables) | Classification Algorithm | Overall Accuracy |
|---|---|---|---|---|
| 1. | 1 | Sentences, Mean of Document | Linear DFA (b) using LODO (c) cross-validation | Opposite Sex Pairs: 98%; All Pairs: 91% |
| 2. | 2 | Document | Linear DFA | 95% |
| 3. | 3 | Document | Linear DFA | 91% |
| 4. | 7 | Document | Logistic Regression | 96% |

(a) Exp = Example Experiment.
(b) DFA = discriminant function analysis.
(c) LODO = leave one document out.
Similar results have also been obtained using decision trees (in the commercially available software DTREG, available from its author through www.dtreg.com) and support vector machines (in the open source software LNKNET, from the Lincoln Laboratory, MIT, Cambridge, Mass.).
D. Background Information on Syntax, Markedness and Part-of-Speech Tagging.
Some basic ideas about syntax and markedness are presented here to assist one's understanding of this application. Part-of-speech tagging schemes are described so that the part-of-speech tagging scheme of the present invention can be distinguished and identified. The combination of these ideas has never been applied to authorship attribution to the best of the inventor's knowledge. Embodiments of the present invention utilize new variables for linguistically characterizing a text for authorship testing.
Syntax is the study of the possible combinations of word units into grammatical phrases. Grammatical combinations are also known as constituent structures since they are structures which are constituted of smaller structures and units. Discourse is the study of how sentential units are combined and how communicative effect is conveyed (e.g. how we recognize irony, agreement and other rhetorical effects).
In elementary school and foreign language instruction, one learns that there are different types of words which differ because they function in different ways. For instance, nouns label objects, persons, places and ideas and, in English, nouns follow other types of words such as determiners and adjectives. In grammar that is used for teaching purposes, words are thus classified into “parts-of-speech” (“POS”) categories.
In generative grammar, words are classified into two main categories, major and minor. Word types in the major categories can combine with other words to create phrases which function like single word units. For instance, noun is a major category because the combined phrasal unit “the beautiful tables” can function just like the single word unit “tables.” To illustrate this, compare the two sentences: “he bought tables at auction” and “he bought the beautiful tables at auction.”
Major categories are known as “heads” because they “head up” phrases when they combine with other words. Word types in minor categories combine with other words, but when minor word types combine their word type does not dominate the other words in the phrase or “head up” the phrase.
Within the major and minor categories, the POS categories are defined much as in school grammars, and described below.
MAJOR:
Noun: names person, place or thing (abstract or concrete). Pronouns replace nouns; they are like proper nouns.
Verb: names action or state-of-being.
Adjective: describes nouns or state-of-being.
Preposition: names relationship between noun-noun or verb-noun, usually spatial or temporal relation (on, over, above, beyond).
MINOR:
Determiners: specify nouns (the, a, this, those, that). Possessive pronouns are like determiners because they are very specific.
Complementizers: introduce embedded clause (that, for, whether, if).
Adverbs: specify the action/states named by verbs or act as intensifiers for the degree of an adjective (hardly, very); also known as Modifiers.
Particles: look like prepositions and are similar to adverbs, they specify verbs, but unlike adverbs (which are always modifiers) particles are required in certain verbs (look up, pick up, throw up, look down on, throw over).
Conjunction: conjoin phrases and sentences (and, but).
In generative grammar, the concept “head of the phrase” (also known as headedness) is structurally very important. The head of a phrase is the word which gives its function to the entire phrase. A phrase is a single word or combination of words which conveys a unit of information. For example, in the phrase “the alleged conspirator”, “the” (a determiner) specifies a particular person, “alleged” (an adjective) describes a state-of-being, and “conspirator” (a noun) labels a particular person. Since the entire group of words “the alleged conspirator” also labels a particular person, the head of this phrase is “conspirator”, a noun, and the phrase is designated as a noun phrase (“NP”).
The following series of phrases explains this nomenclature, wherein below each phrase the words therein are identified by their parts-of-speech.
(1) Tables
NOUN
(2) The tables
DETERMINER NOUN
(3) The antique tables
DETERMINER ADJECTIVE NOUN
(4) The antique tables which you found
DETERMINER ADJECTIVE NOUN RELATIVIZER PRONOUN VERB
(5) The antique tables for your sister
DETERMINER ADJECTIVE NOUN PREPOSITION PRONOUN NOUN
(6) The antique tables to give to Charlie
DETERMINER ADJECTIVE NOUN VERB PREPOSITION NOUN
All of these phrases are headed by a noun “tables” because the phrases (2) through (6) could stand in the same place as the phrase (1). For example, one could put the phrase “are beautiful” after any one of the phrases (1) through (6) above.
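The six tagged phrases above can be represented as simple data, as in the following illustrative sketch. The list layout and the helper name are hypothetical. The rule used, taking the first word tagged NOUN as the head, is only a heuristic that happens to hold for these six examples; it is not offered as a general head-finding procedure. Following the tag line given for phrase (6), “to give” is treated as a single VERB unit.

```python
# Each phrase is a list of (word, tag) pairs, mirroring phrases (1)-(6).
PHRASES = [
    [("Tables", "NOUN")],
    [("The", "DETERMINER"), ("tables", "NOUN")],
    [("The", "DETERMINER"), ("antique", "ADJECTIVE"), ("tables", "NOUN")],
    [("The", "DETERMINER"), ("antique", "ADJECTIVE"), ("tables", "NOUN"),
     ("which", "RELATIVIZER"), ("you", "PRONOUN"), ("found", "VERB")],
    [("The", "DETERMINER"), ("antique", "ADJECTIVE"), ("tables", "NOUN"),
     ("for", "PREPOSITION"), ("your", "PRONOUN"), ("sister", "NOUN")],
    [("The", "DETERMINER"), ("antique", "ADJECTIVE"), ("tables", "NOUN"),
     ("to give", "VERB"), ("to", "PREPOSITION"), ("Charlie", "NOUN")],
]


def first_noun(tagged_phrase):
    """Return the first word tagged NOUN (lower-cased), or None.

    Heuristic for these examples only: later nouns such as "sister"
    belong to embedded phrases and are therefore skipped.
    """
    for word, tag in tagged_phrase:
        if tag == "NOUN":
            return word.lower()
    return None


heads = [first_noun(phrase) for phrase in PHRASES]
print(heads)  # "tables" for every one of the six phrases
```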
In generative grammar, headedness also relates to the sequential ordering of words in a phrase. The head of a phrase is typically restricted to being the first or last word in the phrase. For example, in the English noun phrase “the alleged conspirator” the head is the last word or, as known to those skilled in the art, head-final. In the English verb phrase “conspired with the general,” the verb “conspired” is the head, in first, or head-initial, position in the phrase. But word order is not totally fixed, even in English, because a head noun can also occur in a medial position, as in the phrase “the alleged conspirator of the attorney general,” and sometimes the head noun can even occur in head-initial position, as in “the attorneys general.”
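The head-position contrast just described can be sketched in code as follows. The function name, its parameters, and the tags assigned to the example words are hypothetical. The sketch assumes that determiners are disregarded when locating the head, an assumption made here so that “the attorneys general” counts as head-initial as described above; it also assumes that a one-word phrase is reported as head-final.

```python
def head_position(tagged_phrase, head_index, ignored_tags=("DETERMINER",)):
    """Classify the head of a phrase as 'initial', 'medial', or 'final'.

    tagged_phrase: list of (word, tag) pairs.
    head_index:    index of the head word within tagged_phrase.
    ignored_tags:  tags skipped before positions are compared
                   (assumption: determiners do not count).
    """
    kept = [i for i, (_, tag) in enumerate(tagged_phrase)
            if tag not in ignored_tags]
    if head_index not in kept:
        raise ValueError("head word must not carry an ignored tag")
    position = kept.index(head_index)
    if position == len(kept) - 1:
        return "final"  # assumption: a one-word phrase counts as final
    if position == 0:
        return "initial"
    return "medial"


noun_phrase = [("the", "DETERMINER"), ("alleged", "ADJECTIVE"),
               ("conspirator", "NOUN")]
verb_phrase = [("conspired", "VERB"), ("with", "PREPOSITION"),
               ("the", "DETERMINER"), ("general", "NOUN")]
long_phrase = noun_phrase + [("of", "PREPOSITION"), ("the", "DETERMINER"),
                             ("attorney", "NOUN"), ("general", "ADJECTIVE")]
plural_phrase = [("the", "DETERMINER"), ("attorneys", "NOUN"),
                 ("general", "ADJECTIVE")]

print(head_position(noun_phrase, 2))    # final
print(head_position(verb_phrase, 0))    # initial
print(head_position(long_phrase, 2))    # medial
print(head_position(plural_phrase, 1))  # initial
```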
Such variations or options in language reveal markedness (described below) by demonstrating how some English syntactic patterns such as head-final noun phrases are much more usual and easy to understand than other syntactic patterns such as head-initial noun phrases. This particular binary contrast between head position (final/not-final) is simply one example of how binary contrasts organize language asymmetrically.
Markedness is the basic asymmetry in language which pervades the binary substructure of linguistic signs. Language is structured for binary contrasts such as voiced/unvoiced at the phonetic level, nominative/non-nominative at the morphological level, mass/count at the semantic level, recursive/nonrecursive at the syntactic level.
Yet even though language is structured for binary contrasts, the contrastive items are not equally interchangeable. For example, the binary contrast of the concept [age] is lexicalized in English as [young]/[old]. But the binary distinction between [young]/[old] is not symmetrical, not equal, as shown by the fact that these two terms are not interchangeable. When we are inquiring about age in English, we ask [how old are you?] for the unmarked use, while we can, in the marked use, ask [just how young are you?]. Similarly, the head-final noun phrase is unmarked, while the head-initial noun phrase is marked.
Another binary contrast in language is the distinction between the nominal and the predicative. Nominal or noun-like parts-of-speech can substitute for each other, but never for predicative or verb-like parts-of-speech. Predicative parts-of-speech relate nominals to other nominals and even require nominals, just as functions in logic require arguments, but nominal parts-of-speech do not. Although syntactic categories and part-of-speech labeling schemes can be extremely detailed and complex, this basic distinction between nouns and verbs is respected in all syntactic theories and substantiated in all languages.
The complexity and detail of part-of-speech tagging schemes are directly related to the purpose of the syntactic analysis. For example, fewer tags are needed for diagramming sentences than for generating sentences. There are six tagging schemes described in Manning and Schuetze (1999), ranging in size from 45 to 197 tags, far more POS tags than occur in the traditional, school grammar list of nouns, verbs, adjectives, adverbs, prepositions, determiners and conjunctions.
As will be shown in the present application, embodiments of the present invention employ a number of variable sets, which, while including two stylometric variables, focus on syntactic structure in ways not found in prior art methods. Further, embodiments of the present invention reduce the number of variables needed for authorship attribution, enabling the method to be used with smaller sized text samples (such as about 500 words) than had been used previously. These variable groups are briefly described below, and in more detail within the Detailed Description of the Invention section.
Briefly, an embodiment of the present invention is a method for authorship attribution with the linguistic-variable component implementing a specific POS tagging scheme; syntactic variables based on markedness and the nominal/predicative contrast; punctuation variables based on syntactic attachments, and stylometric variables; and the classification-algorithm component enabling both sentence-level and document-level data for model-building with cross-validation and classification at the document level.
In the method, which can be implemented in a computing environment, each word in each text is labeled according to the syntactic functions in this specific POS tagging scheme. In other words, each grammatical category (noun, verb, preposition, conjunction, modifier, adjective, determiner, subordinator, and interjection) is labeled in the document. The phrases which these grammatical categories create through headedness are classified into marked (“m”) or unmarked (“u”) types. These marked and unmarked syntactic phrases constitute seventeen syntactic variables. The method enables these seventeen variables to be collapsed into two variables based on markedness values or four variables based on the nominal/predicative contrast. Each punctuation mark in each text is classified by what type of syntactic edge it is marking, i.e., what type of syntactic edge the mark attaches to, as well as by discursive function, for a total of four punctuation variables. In addition to the syntactic and punctuation (also known as graphemic) variables, the method also includes two stylometric variables (word and paragraph length). A range of variable sets is available from the procedures for creating the variables; the variable sets contain as many as twenty-two or as few as six variables. When the largest variable sets are used, the method details how sentence-level data is used for model-building, while the classification cross-validates on, and predicts the authorship of, document-level data. When the small variable sets are used, the method employs the standard document-level data and cross-validation procedures of the prior art.
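The collapsing of marked and unmarked phrase counts into two markedness variables can be illustrated with the following sketch. The phrase labels used here (“mNP”, “uNP”, “mVP”, “uVP”) and the function name are hypothetical placeholders; the seventeen variables of the method are defined in the Detailed Description and are not reproduced here. The sketch assumes only that each label begins with “m” for marked or “u” for unmarked, as stated above.

```python
def collapse_by_markedness(phrase_counts):
    """Sum per-phrase-type counts into two totals: marked and unmarked.

    phrase_counts maps a phrase label to its count in a text. Labels are
    assumed to start with "m" (marked) or "u" (unmarked); the specific
    labels used below are hypothetical placeholders.
    """
    totals = {"marked": 0, "unmarked": 0}
    for label, count in phrase_counts.items():
        if label.startswith("m"):
            totals["marked"] += count
        elif label.startswith("u"):
            totals["unmarked"] += count
        else:
            raise ValueError(f"label has no markedness prefix: {label!r}")
    return totals


# Invented counts for one text, for illustration only.
counts = {"mNP": 3, "uNP": 10, "mVP": 2, "uVP": 7}
print(collapse_by_markedness(counts))  # {'marked': 5, 'unmarked': 17}
```

Collapsing in this manner reduces the number of input variables, which is consistent with the stated aim of working with smaller text samples.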
The inventor believes that the high accuracy rates result from the particular variables based on the specific POS tagging scheme, the principles of syntactic markedness, and syntactic edges, since any classification algorithm can only perform as well as its input variables allow. This point is discussed in greater detail within the Detailed Description of the Invention section.