A. Field of Invention
This application claims priority to U.S. Ser. No. 61/152,085, filed Feb. 12, 2009, which is incorporated herein by reference. This invention pertains to the art of methods and apparatuses regarding analyzing data sources and more specifically to apparatuses and methods regarding organization of data into themes.
B. Description of the Related Art
Government intelligence agencies use a variety of techniques to obtain information, ranging from secret agents (HUMINT—Human Intelligence) to electronic intercepts (COMINT—Communications Intelligence, IMINT—Imagery Intelligence, SIGINT—Signals Intelligence, and ELINT—Electronics Intelligence) to specialized technical methods (MASINT—Measurement and Signature Intelligence).
The process of taking known information about situations and entities of strategic, operational, or tactical importance, characterizing the known, and predicting, with appropriate statements of probability, the future actions in those situations and by those entities is called intelligence analysis. The descriptions are drawn from what may only be available in the form of deliberately deceptive information; the analyst must correlate the similarities among deceptions and extract a common truth. Although its practice is found in its purest form inside intelligence agencies, its methods are also applicable in fields such as business intelligence or competitive intelligence.
Intelligence analysis is a way of reducing the ambiguity of highly ambiguous situations, with the ambiguity often very deliberately created by highly intelligent people with mindsets very different from the analyst's. Analysts frequently reject high or low probability explanations because of the difficulty of obtaining evidence to support those explanations. Analysts may also apply their own standard of proportionality to the opponent's acceptance of risk, rejecting the possibility that the opponent may take an extreme risk to achieve what the analyst regards as a minor gain. Above all, the analyst must avoid the cognitive traps particular to intelligence analysis: projecting what she or he wants the opponent to think, and using available information to justify that conclusion.
Since the end of the Cold War, the intelligence community has contended with the emergence of new threats to national security from a number of quarters, including increasingly powerful non-state actors such as transnational terrorist groups. Many of these actors have capitalized on the still evolving effects of globalization to threaten U.S. security in nontraditional ways. At the same time, global trends such as the population explosion, uneven economic growth, urbanization, the AIDS pandemic, developments in biotechnology, and ecological trends such as the increasing scarcity of fresh water in several already volatile areas are generating new drivers of international instability. These trends make it extremely challenging to develop a clear set of priorities for collection and analysis.
Intelligence analysts are tasked with making sense of these developments, identifying potential threats to U.S. national security, and crafting appropriate intelligence products for policy and decision makers. They also will continue to perform traditional missions such as uncovering secrets that potential adversaries desire to withhold and assessing foreign military capabilities. This means that, besides using traditional sources of classified information, often from sensitive sources, they must also extract potentially critical knowledge from vast quantities of available open source information.
First, the process of globalization, empowered by the Information Revolution, will require a change of scale in the intelligence community's (IC) analytical focus. In the past, the IC focused on a small number of discrete issues that possessed the potential to cause severe destruction of known forms. The future will involve security threats of much smaller scale. These will be less isolated, less the actions of military forces, more diverse in type, and more widely dispersed throughout global society than in the past. Their aggregate effects might produce extremely destabilizing and destructive results, but these outcomes will not be obvious based on each event alone. Therefore, analysts increasingly must look to discern the emergent behavioral aspects of a series of events.
Second, phenomena of global scope will increase as a result of aggregate human activities. Accordingly, analysts will need to understand global dynamics as never before. Both information and analytical understanding of that information will be critical to understanding these new dynamics. The business of organizing and collecting information will have to be much more distributed than in the past, both among various U.S. agencies and across international communities. Information and knowledge sharing will be essential to successful analysis.
Third, future analysts will need to focus more on anticipation and prevention of security threats and less on reaction after they have arisen. For example, one feature of the medical community is that it is highly reactive. However, anyone who deals with infectious diseases knows that prevention is the more important reality. Preventing infectious diseases must become the primary focus if pandemics are to be prevented. Future analysts will need to incorporate this same emphasis on prevention into the analytic enterprise. It appears evident that in this emerging security environment the traditional methods of the intelligence community will be increasingly inadequate and increasingly in conflict with those methods that do offer meaningful protection. Remote observation, electromagnetic intercept, and illegal penetration were sufficient to establish the order of battle for traditional forms of warfare and to provide reasonable assurance that any attempt to undertake a massive surprise attack would be detected. There is no serious prospect that the problems of civil conflict and embedded terrorism, of global ecology, and of biotechnology can be adequately addressed by the same methods. To be effective in the future, the IC needs to remain a hierarchical structure in order to perform many necessary functions, but it must be able to generate collaborative networks for various lengths of time to provide intelligence on issues demanding interdisciplinary analysis.
The increased use of electronic communication, such as cell phones and e-mail, by terrorist organizations has led to increased, long-distance communication between terrorists, but also allows the IC to intercept transmissions. A system needs to be implemented that will allow automated analysis of the increasingly large amount of electronic data being retrieved by the IC.
Query languages are computer languages used to make queries into databases and information systems. A programming language is a machine-readable artificial language designed to express computations that can be performed by a machine, particularly a computer. Programming languages can be used to create programs that specify the behavior of a machine, to express algorithms precisely, or as a mode of human communication.
Broadly, query languages can be classified according to whether they are database query languages or information retrieval query languages. Examples include: .QL, a proprietary object-oriented query language for querying relational databases; Common Query Language (CQL), a formal language for representing queries to information retrieval systems such as web indexes or bibliographic catalogues; CODASYL; CxQL, the query language used for writing and customizing queries on CxAudit by Checkmarx; D, a query language for truly relational database management systems (TRDBMS); DMX, a query language for data mining models; Datalog, a query language for deductive databases; ERROL, a query language over the entity-relationship model (ERM) that mimics major natural language constructs (of the English language and possibly other languages) and is especially tailored for relational databases; Gellish English, a language that can be used for queries in Gellish English databases, for dialogues (requests and responses), and for information modeling and knowledge modeling; ISBL, a query language for PRTV, one of the earliest relational database management systems; LDAP, an application protocol for querying and modifying directory services running over TCP/IP; MQL, a cheminformatics query language for substructure search that allows numerical properties in addition to nominal properties; MDX, a query language for OLAP databases; OQL, the Object Query Language; OCL (Object Constraint Language), which despite its name is also an object query language and an OMG standard; OPath, intended for use in querying WinFS Stores; Poliqarp Query Language, a special query language designed to analyze annotated text and used in the Poliqarp search engine; QUEL, a relational database access language similar in most ways to SQL; SMARTS, the cheminformatics standard for substructure search; SPARQL, a query language for RDF graphs; SQL, a well-known query language for relational databases; SuprTool, a proprietary query language for SuprTool, a database access program used for accessing data in Image/SQL (TurboIMAGE) and Oracle databases; TMQL (Topic Map Query Language), a query language for Topic Maps; XQuery, a query language for XML data sources; XPath, a language for navigating XML documents; and XSQL, which combines the power of XML and SQL to provide a language-independent and database-independent means to store and retrieve SQL queries and their results.
The most common operation in SQL databases is the query, which is performed with the declarative SELECT keyword. SELECT retrieves data from a specified table, or multiple related tables, in a database. While often grouped with Data Manipulation Language (DML) statements, the standard SELECT query is considered separate from SQL DML, as it has no persistent effects on the data stored in a database. Note that there are some platform-specific variations of SELECT that can persist their effects in a database, such as the SELECT INTO syntax that exists in some databases.
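By way of illustration only, the read-only character of a standard SELECT query may be sketched as follows. The sketch assumes Python's standard sqlite3 module and an in-memory database; the table name intercepts, its columns, and the sample rows are hypothetical and are not taken from this application.

```python
import sqlite3


def demo_select_is_read_only():
    """Run a SELECT and report that it changed no stored rows."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table and sample data, for illustration only.
    conn.execute(
        "CREATE TABLE intercepts (id INTEGER PRIMARY KEY, source TEXT, theme TEXT)"
    )
    conn.executemany(
        "INSERT INTO intercepts (source, theme) VALUES (?, ?)",
        [("email", "logistics"), ("phone", "finance"), ("email", "finance")],
    )
    conn.commit()

    # total_changes counts rows inserted, updated, or deleted on this connection.
    changes_before = conn.total_changes
    rows = conn.execute(
        "SELECT source, theme FROM intercepts WHERE theme = ?", ("finance",)
    ).fetchall()
    changes_after = conn.total_changes

    row_count = conn.execute("SELECT COUNT(*) FROM intercepts").fetchone()[0]
    conn.close()
    return rows, changes_after - changes_before, row_count


if __name__ == "__main__":
    rows, changed, row_count = demo_select_is_read_only()
    print(rows, changed, row_count)
```

In this sketch the query returns the matching rows while the change counter and the table's row count stay the same, which is consistent with the statement that a standard SELECT has no persistent effect on stored data.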
SQL queries allow the user to specify a description of the desired result set, but it is left to the database management system (DBMS) to plan, optimize, and perform the physical operations necessary to produce that result set as efficiently as possible. An SQL query includes a list of columns to be included in the final result immediately following the SELECT keyword. An asterisk (“*”) can also be used as a “wildcard” indicator to specify that all available columns of a table (or multiple tables) are to be returned. SELECT is the most complex statement in SQL, with several optional keywords and clauses, including the following. The FROM clause indicates the source table or tables from which the data is to be retrieved, and can include optional JOIN clauses to join related tables to one another based on user-specified criteria. The WHERE clause includes a comparison predicate, which is used to restrict the number of rows returned by the query; it is applied before the GROUP BY clause and eliminates from the result set all rows for which the comparison predicate does not evaluate to True. The GROUP BY clause is used to combine, or group, rows with related values into elements of a smaller set of rows, and is often used in conjunction with SQL aggregate functions or to eliminate duplicate rows from a result set. The HAVING clause includes a comparison predicate used to eliminate rows after the GROUP BY clause is applied to the result set; because it acts on the results of the GROUP BY clause, aggregate functions can be used in the HAVING clause predicate. The ORDER BY clause is used to identify which columns are used to sort the resulting data, and in which order they should be sorted (ascending or descending). The order of rows returned by an SQL query is never guaranteed unless an ORDER BY clause is specified.
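The clauses described above may be sketched together in a single query. The following is a minimal example only, assuming Python's standard sqlite3 module; the tables sources and messages, their columns, the sample rows, and the function name flagged_themes_by_source are all hypothetical and introduced solely for illustration.

```python
import sqlite3


def flagged_themes_by_source(min_messages=2):
    """Count flagged messages per (source kind, theme), keeping larger groups."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema and data, for illustration only.
    conn.executescript(
        """
        CREATE TABLE sources (id INTEGER PRIMARY KEY, kind TEXT);
        CREATE TABLE messages (
            id INTEGER PRIMARY KEY,
            source_id INTEGER REFERENCES sources(id),
            theme TEXT,
            flagged INTEGER
        );
        INSERT INTO sources (id, kind) VALUES (1, 'email'), (2, 'phone');
        INSERT INTO messages (source_id, theme, flagged) VALUES
            (1, 'finance', 1),
            (1, 'finance', 1),
            (2, 'finance', 1),
            (1, 'logistics', 1),
            (2, 'logistics', 0),
            (2, 'travel', 1);
        """
    )
    query = """
        SELECT s.kind, m.theme, COUNT(*) AS n      -- columns in the result
        FROM messages AS m                         -- source table
        JOIN sources AS s ON s.id = m.source_id    -- join a related table
        WHERE m.flagged = 1                        -- filter rows before grouping
        GROUP BY s.kind, m.theme                   -- combine related rows
        HAVING COUNT(*) >= ?                       -- filter groups after grouping
        ORDER BY n DESC, s.kind ASC, m.theme ASC   -- make the row order explicit
    """
    rows = conn.execute(query, (min_messages,)).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(flagged_themes_by_source(2))
    print(flagged_themes_by_source(1))
```

The unflagged row is removed by WHERE before any grouping takes place, whereas HAVING removes whole groups based on the aggregate count. The ORDER BY clause is included because, as noted above, row order is otherwise not guaranteed.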