The Problem
Open network systems like the Internet, and closed network systems such as those operated by cable television and telephone companies, deliver trillions of words and millions of hours of digitized audio and video to billions of computer and television screens. Systems exist that survey traffic on these networks to determine the behavior of consumers, and some can identify consumer behavior on the basis of the selection of a particular web page or a particular television program. No system exists, however, to analyze and/or survey statistics revealing the underlying interests (psychographic or psycholinguistic behavior) of those persons selecting particular content, or portions of that content, and to recommend related products, services and content that the consumer can consume or purchase. It would be highly beneficial to create markets, on a near real-time basis, for those products and services of interest to persons already recognized to be interested in a particular related subject.
Origins of the Solution
During and immediately following World War II, large-scale computing was first applied to the task of managing the explosion of information. Vannevar Bush, FDR's technology czar, laid out the problem in an article in the Atlantic Monthly called ‘As We May Think’, and imagined a solution, called the MEMEX, which was the precursor to the massively indexed databases and search engines in wide proliferation today. At roughly the same time, Claude Shannon of MIT and Bell Labs (Bush and Shannon knew each other and worked together in the design and deployment of the first computers) laid out ‘Information Theory’ and the conceptual framework for digital noise reduction, based on the fundamental precepts of Boolean logic.
Though cloaked in secrecy for decades, the National Security Agency (NSA) has made extensive use of massive-scale computing to perform traffic analysis on electronic/digital communications (telephony, telegraphy, RTTY, fax, email, etc.). The standard methodologies employ two different but complementary approaches, forecast by Bush and Shannon: filtering based on Boolean search techniques, and word frequency analysis. The first methodology takes impossibly large arrays of data and produces manageable subsets relevant to the search criteria (‘associative trails’ as imagined with Bush's MEMEX: “Wholly new forms of encyclopedias will appear, ready made with a mesh of associative trails running through them”); the second methodology identifies pervasive themes and/or subject matter within these manageable subsets (in effect, road maps). The resulting analysis can then be ‘fed back’ (feedback is a key concept in Information Theory) into the search process in order to refine and more precisely target the searches.
Massive computing and associated databasing began to impact the internal operations of big business and the military in the 1950's, somewhat lagging behind the intelligence agencies. In the 1960's, massive computing enabled large scale electronic transaction processing and billing, with consumers benefiting through the arrival of credit cards. For business, the resulting transaction databases enabled datamining for customer behavior profiles, and led to consumer targeting through direct mail and telemarketing. Using set-top boxes and diaries, Nielsen and other firms sought to sample consumer behaviors, and used computer-driven statistical analysis and inference to characterize consumer behavioral trends.
In the early 1980's massive computing became sufficiently inexpensive for academics to employ. Then, the first word frequency analysis projects were undertaken on very large samples of published English language prose, and by the late 1980's the results were commonly available in public literature.
In the early 1990's, the Office of Naval Research (ONR) embellished word frequency analysis techniques in order to automate the review of international science and technology literature, to create comprehensive conceptual roadmaps through the material.
The idea was to use machine analysis to figure out what the Russians, and other adversaries and allies, were doing in science and technology by applying computational linguistics to a closed system of published literature. The result is a technology called Database Tomography (DT), which automates:
- the retrieval of relevant documents
- the identification of technical infrastructure (who is citing whom, etc.)
- the identification of technical themes and relationships
- the discovery of non-obvious underlying themes in the literature
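The ‘who is citing whom’ step of such infrastructure analysis can be sketched as a simple citation tally. The following is a minimal, purely illustrative Python sketch; the paper names and record structure are hypothetical, not drawn from any actual DT implementation:

```python
from collections import defaultdict

# Hypothetical toy corpus: each record maps a paper to the papers it cites.
records = {
    "paperA": ["paperB", "paperC"],
    "paperB": ["paperC"],
    "paperD": ["paperA", "paperC"],
}

def citation_counts(records):
    """Count how often each paper is cited (who is citing whom)."""
    counts = defaultdict(int)
    for citer, cited_list in records.items():
        for cited in cited_list:
            counts[cited] += 1
    return dict(counts)

print(citation_counts(records))  # {'paperB': 1, 'paperC': 3, 'paperA': 1}
```

Papers with high citation counts mark the hubs of the technical infrastructure; the same tally, run over themes rather than citations, yields the thematic road maps described above.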
In the mid-nineties, a further embellishment of word frequency analysis evolved in the academic/technology community, called latent semantic indexing (LSI). LSI seeks to identify the underlying concepts in documents, and then to draw conclusions regarding their similarity/relevance to other documents by comparing the documents' thematic matrices.
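The core of LSI is a truncated singular value decomposition of a term-document matrix, after which documents are compared in the reduced ‘concept’ space. A minimal sketch follows, assuming a toy three-term, three-document matrix (the terms, counts, and choice of two concepts are all illustrative):

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents); values are
# raw term counts. Purely illustrative, not any real LSI deployment.
A = np.array([
    [2, 0, 1],   # "movie"
    [1, 0, 2],   # "film"
    [0, 4, 0],   # "basketball"
], dtype=float)

# Truncated SVD: keep only the k strongest latent concepts.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T   # each row: one document in concept space

def cosine(a, b):
    """Cosine similarity between two concept-space vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Documents 0 and 2 share movie/film vocabulary, so they land close together
# in concept space, while document 1 (basketball) lands far from both.
print(cosine(doc_vecs[0], doc_vecs[2]) > cosine(doc_vecs[0], doc_vecs[1]))
```

Note that the matrix factorization must be recomputed (or incrementally updated) as documents are added, which is the source of the computational burden contrasted with the Etronica system below.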
In the late 1990's, largely in response to the demands for improved search and ad targeting over the Internet, a number of search enhancement and content analysis techniques were in development.
Some of these systems required manual intervention. In one instance, Yahoo employed a large number of ontologists to develop a knowledge classification system with upwards of 30,000 nodes, in order to assist the search for related material. In another, a firm called Gotuit developed systems for adding additional data (metadata) to streaming audio and video that allowed the material to be ‘sliced and diced’, thus enabling search for specific segments.
Some of these systems were automatic. In one instance, Rulespace sought to duplicate Yahoo's ontological approach in an automated fashion. Autonomy, and other like firms, sought to automatically classify content according to extant advertising categories. Predictive Networks, and other like firms, sought to classify consumer behavior patterns by tracking consumers' use of clicks and keystrokes while using the Internet.
The system of this invention (called the Etronica system) directly tracks what consumers are interested in by sensing their search behavior.
Component Methodologies of the Etronica System
- Word frequency analysis on large corpora of English language prose to identify a base keyword set.
- Word frequency analysis on smaller ‘special’ corpora of English language prose (e.g., an Electronic Program Guide used in a cable television system, or a law citation database) in order to identify statistically frequent, and hence special, ‘terms of art’ for inclusion as extensions to the base keyword set.
- Automated assignment (metatagging) of keywords, drawn from a master keyword set, to individual documents or records within a database.
- Exploitation of the ‘tagged’ keywords to form effective Boolean ANDed searches.
- Exploitation of the ‘tagged’ keywords as indicators of consumers' territories of interest.
- Signaling consumer interests over a network for centralized accumulation in a datamining system for traffic analysis.
- Exploitation of statistically significant consumer patterns of interest for optimization of ad and merchandise sales and delivery of relevant content.
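The first two methodologies, deriving a base keyword set from a general corpus and extending it with ‘terms of art’ from a special corpus, can be sketched as a frequency comparison. This is a minimal illustration; the corpora, the frequency threshold, and the tokenization rule are all hypothetical:

```python
from collections import Counter
import re

def word_frequencies(text):
    """Lowercased word counts for a body of prose."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

# Toy stand-ins; in practice these would be a massive body of English prose
# and, e.g., an Electronic Program Guide. Entirely illustrative.
general_prose = "the movie was good and the book was good"
program_guide = "sitcom rerun sitcom premiere movie listing sitcom"

base = word_frequencies(general_prose)
special = word_frequencies(program_guide)

# Terms statistically frequent in the special corpus but absent from the
# base corpus become 'terms of art' extending the base keyword set.
extension = {w for w, n in special.items() if n >= 2 and base[w] == 0}
print(extension)  # {'sitcom'}
```

The same comparison, run in reverse against the accumulated Etronica keyword set, is the feedback loop for tuning the set to a new specialized corpus described under the superiorities below.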
For example, while a consumer is searching the Internet for an article on basketball, various basketball-related television programs or video-on-demand (pay-per-view) movies could be recommended, and various products, such as sports supplies, sports clothing, and books and magazines on the subject of basketball, could be suggested on the screen for purchase. If it could be determined that the searcher was particularly interested in professional basketball, the suggested products could be narrowed to be more relevant to that interest. Alternatively, while the consumer is watching a broadcast television program like WEST WING, various related politically-oriented television broadcasts in dramatic, news and documentary genres (for example, a documentary on the Secret Service) could be recommended, as well as related Pay Per View motion pictures (for example, a film such as In The Line Of Fire, through Video on Demand services), an array of related products and services, and related websites. Further, psychographically related products and services, related by coincident behavior rather than common themes of interest, might be incorporated into the recommendations.
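The metatagging and Boolean ANDed recommendation flow behind such examples can be sketched as follows. The master keyword set, the catalog, and the subset-matching rule are all hypothetical simplifications for illustration:

```python
# Hypothetical master keyword set derived from word frequency analysis.
MASTER_KEYWORDS = {"basketball", "professional", "politics", "sports"}

def metatag(text, keywords=MASTER_KEYWORDS):
    """Assign master-set keywords that literally occur in the text."""
    return keywords & set(text.lower().split())

# Toy content catalog: title -> descriptive text to be metatagged.
catalog = {
    "NBA Tonight":    "professional basketball highlights",
    "College Hoops":  "basketball tournament coverage",
    "Capital Report": "politics news roundup",
}

def recommend(consumer_text, catalog):
    """Recommend items whose metatags contain ALL of the consumer's tagged
    keywords, i.e. a Boolean ANDed query over the tagged keywords."""
    interest = metatag(consumer_text)
    return [title for title, desc in catalog.items()
            if interest and interest <= metatag(desc)]

print(recommend("professional basketball scores", catalog))  # ['NBA Tonight']
```

Because the match is a pure set operation over pre-assigned tags, the same query can be expressed verbatim as `basketball AND professional` against any Boolean-capable search backend.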
Superiorities of the Invention (the Etronica System)
1) It is founded on a broad model of human interests and activities, as empirically indicated by the keyword set derived from word frequency analysis of massive, non-specialized corpora of English language prose. The document-specific analysis of LSI and DT limits the reach of the analysis to the system of documents reviewed, and suffers from increasing complexity as documents are added to the system. The advertising-specific approach of Autonomy and others limits the analysis to a crude breakdown of advertising categories.

2) Unlike the numbers-based LSI (and other Neural Net systems), the Etronica system uses a set of tokens based on keywords whose meaning is clear, and easily understood and interpreted by humans.

3) Unlike computationally intensive systems like LSI (and other Neural Net systems), the Etronica system is fast and computationally highly efficient. The creation of the keyword set is already done, and the keyword matching to content is principally based on table lookup techniques. The computational requirements grow in a flat, linear fashion with the number and length of the documents or records, rather than exponentially, as with LSI and other matrix-analysis based systems.

4) Because virtually all digitally searchable bodies of content can be manipulated using Boolean search operators (AND, OR, NOT), the exploitation of the metatagged keywords in the Etronica system to form Boolean ANDed queries is naturally compatible with the de facto international API (application program interface) for search.

5) Because the Etronica system is founded on an empirically valid keyword set (see 1), tuning the keyword set to a new specialized corpus simply requires the identification of an extension to the base set of keywords, rather than the complete reformation of the set (as is required by most metatagging systems). This is quickly and easily accomplished by a word frequency analysis on the specialized corpus, and comparison of the results to the existing Etronica keyword set to determine the significant differences. This process is, in essence, a feedback loop for signal correction.

6) Because the Etronica system tracks consumers' interests, rather than their transactions (as in the case of Amazon's metatagging system, and many advertising-driven systems), no invasion of individual privacy as a result of the association of individual information with sensing data is either necessary or inevitable in the datamining/traffic analysis process.

7) Because the Etronica system exploits only the most commonly used words in the keyword set as second operands in Boolean ANDed queries and analysis, the synonymy problem suffered by most computational linguistics systems (including DT and LSI) is attenuated.

8) Because the Etronica system exploits keywords with non-ambiguous meanings (‘movie’, as opposed to ‘film’), the polysemy problem suffered by most computational linguistics systems (including DT and LSI) is attenuated.

9) Because the Etronica system is based on constant Traffic Analysis, rather than sampling and statistical inference (as practiced by Nielsen, Mediametrix and other consumer sensing systems), and senses human interests, rather than mouseclicks and keystrokes, the resulting profiling of behavior is far more accurate.

10) Because the Etronica system exploits a ‘flat’ set of keywords (where no word holds a parent-child hierarchical relationship to another, nor is any specific value-based weighting given one keyword over another), rather than the hierarchical systems employed by Yahoo and Rulespace, and derived by DT and LSI, the statistical occurrence of Etronica keywords can be viewed in a combinatorial fashion. In effect, two or more keywords co-occurring in a statistically significant fashion will describe a territory of consumer interest in a more precise fashion, because they have been Boolean ANDed together.

11) Because the distribution of the Etronica keywords is consistent, and the set of keywords is limited, the storage and transmission of consumer behavior data equipped with a payload of Etronica keywords requires a very small amount of data to be transferred, unlike most other consumer remote-sensing techniques.
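The combinatorial treatment of the flat keyword set, counting statistically significant co-occurrences of keyword pairs across accumulated consumer payloads, can be sketched as follows. The session payloads and keywords are hypothetical:

```python
from collections import Counter
from itertools import combinations

# Hypothetical stream of per-consumer keyword payloads arriving at the
# central datamining system; keywords are illustrative only.
sessions = [
    {"basketball", "professional"},
    {"basketball", "professional", "shoes"},
    {"politics", "news"},
    {"basketball", "college"},
]

def cooccurrence(sessions):
    """Count how often each keyword pair co-occurs within one payload,
    i.e. how often the pair is effectively Boolean ANDed together."""
    pairs = Counter()
    for tags in sessions:
        for pair in combinations(sorted(tags), 2):
            pairs[pair] += 1
    return pairs

print(cooccurrence(sessions).most_common(1))
# [(('basketball', 'professional'), 2)]
```

Because each payload is just a small set drawn from a limited, flat keyword vocabulary, the data transferred per session is a handful of short tokens, consistent with the compact-payload property claimed in point 11.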