The data available to individuals and institutions that monitor the global financial markets is wide-ranging. Investment professionals responsible for monitoring a particular company or industry sector may receive thousands of individual information items each day. Some of these items arrive in well-structured, categorized formats from reliable and well-known sources, such as financial statements filed with a stock exchange or the Securities and Exchange Commission, whereas others take the form of informal correspondence such as email or instant messages, phone conversations, or face-to-face meetings. Furthermore, the application of numerous Internet communications technologies to the research and information publishing process over the last decade has increased both the volume of information available for analysis and the speed at which it is delivered. Often, investment opportunities based on such information may exist for only a short time. Moreover, the opportunity to act on information may not be concurrent with the arrival of the information itself. It is critical that investment professionals be able to monitor the numerous sources of information, distinguish pertinent information from irrelevant information, analyze it as quickly as possible, and base decisions on the information as it arrives. Investment professionals must therefore be able to analyze, within short windows of opportunity, historic information that is often difficult and time-consuming to recall or retrieve manually.
One challenge facing investment professionals is the accurate identification and classification of the information they receive. Although information categorization is a relatively mature field in systems research and many methods exist for the analysis and categorization of text, these methods do not provide the accuracy and speed that are crucial in fields such as investment management.
Typically, information categorization depends on the features (e.g., recognizable contents) of the source text used by categorization algorithms and on the definition of the categories into which the text is to be grouped. Feature selection is a key aspect of establishing an effective interpretation of the source material, and a well-chosen feature set can be used to sub-divide or cluster a sample set of information, much as Internet news websites group the day's headlines by topics such as business, sports, law, national news, and international news. However, clustering, even with the best feature set, does not provide adequate categorization. True categorization requires placing source material into meaningful destinations based on more than a general designation.
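The feature-based grouping described above can be illustrated with a minimal sketch. The category names, the feature terms, and the functions `extract_features` and `categorize` are hypothetical and chosen only for illustration; they do not represent any particular categorization algorithm. The sketch assumes that word counts serve as the features and that each category is defined by a small set of terms.

```python
import re
from collections import Counter

# Hypothetical categories and their defining feature terms (illustrative only).
CATEGORY_FEATURES = {
    "business": {"earnings", "revenue", "merger", "shares"},
    "sports": {"match", "season", "coach", "score"},
}


def extract_features(text):
    """Use lowercase word counts as the features of the source text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def categorize(text):
    """Return the category whose feature terms occur most often, or None."""
    features = extract_features(text)
    scores = {
        name: sum(features[term] for term in terms)
        for name, terms in CATEGORY_FEATURES.items()
    }
    best = max(scores, key=scores.get)
    # A score of zero means no category feature was found in the text.
    return best if scores[best] > 0 else None
```

A grouping of this kind assigns only a general topic to each item, which is the limitation noted above: it says nothing about which company, trade, or deal an item concerns.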
Category-oriented thinking and collaboration are common across many professions, including finance, medicine, business consulting, and pharmaceuticals. For many of these professions, systems of categorization (also referred to as “ontologies”) have evolved to a high level of effectiveness. Experts in such fields communicate, analyze, and make decisions implicitly using their established systems of categorization. For example, investment professionals understand “IBM” as a meaningful entity: as a corporation, as a topic of discussion, as an equity traded on the New York Stock Exchange, as the subject of a financial model, and/or as a competitor to Microsoft. Because its meaning is well established, “IBM” instantly conveys context. For these reasons, “IBM” is an extremely useful way to characterize the content of an email, a news story, a phone call, or a financial statement. Conventional ontologies used by finance professionals organize research items by company, industry, and sector. Other categorization schemas use “trades” or “deals” as the principal organizing unit.
Information extraction is another technique used as part of the document classification process. Essentially, information extraction refers to the conversion of unstructured or loosely structured data into structured formats such that it can be queried and processed (see, for example, ACM Queue November 2005: http://www.acmqueue.com/modules.php?name=Content&pa=showpage&pid=350). Unlike conventional information search methodologies that build new representations of the underlying data (i.e., the search index), information extraction attempts to build a structure for the source data based on the contents of the data itself.
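As a minimal sketch of the conversion described above, the following example turns a short piece of unstructured text into a structured record that can be queried. The record type `ExtractedRecord`, the two regular expressions, and the function `extract` are assumptions made for illustration, and the sample message and its prices are fictional; practical information extraction systems are considerably more involved.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ExtractedRecord:
    """Hypothetical structured form of one unstructured message."""
    tickers: list = field(default_factory=list)
    amounts: list = field(default_factory=list)


# Assumed conventions: tickers appear as "NYSE: XXX" or "NASDAQ: XXX",
# and monetary amounts appear as a dollar sign followed by a number.
TICKER_PATTERN = re.compile(r"\b(?:NYSE|NASDAQ):\s*([A-Z]{1,5})\b")
AMOUNT_PATTERN = re.compile(r"\$(\d+(?:\.\d+)?)")


def extract(text):
    """Build a structured record from the contents of the text itself."""
    return ExtractedRecord(
        tickers=TICKER_PATTERN.findall(text),
        amounts=[float(a) for a in AMOUNT_PATTERN.findall(text)],
    )


# Fictional message used only to exercise the sketch.
record = extract("IBM (NYSE: IBM) closed at $82.50, up from $81.10.")
```

Once the fields are populated, the record can be filtered or joined like any other structured data, for example by selecting every message whose `tickers` list contains a given symbol.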
Neither of these approaches, however, addresses the fundamental challenges that face investment professionals. Specifically, what is needed is a technique and supporting system that effectively and accurately categorizes information based on user-defined categories, that considers the type and source of the information as well as user-specific rulesets, and that can effectively determine the relevancy of a document to a particular individual based thereon.