1. Field of the Invention
Embodiments of the present invention relate generally to collecting data, and more particularly to systems, methods, and computer-program products for secure data collection.
2. Related Art
Many companies today have expanded (e.g., through acquisition of other companies, outsourcing, or diversification) such that parts of a company may be located in different locations around the world. In addition, company infrastructure has become much more complicated. As such, companies face an ever-increasing set of problems when they are required to collect information from within the company, for example, in the case of an audit or discovery in litigation. Discovery may refer to the collection of information stored in electronic form, for example, electronic discovery (eDiscovery) in the context of litigation. Discovery does not necessarily include the search and collection of email. Accordingly, there are numerous problems for conventional data collection systems. Some problems are listed below.
First, a company may face a serious set of problems if the company is obliged to discover and collect relevant information across its whole product range, for all versions and for all the years it has been producing the products. The problems here include those related to the complexity of getting all products (past and present), sheer volume over the years (and with the possibility of a large number of hardware platforms involved), and actually locating what is still available from the early years.
There may have been many products (or more accurately products and product groups) released over the years. Types of products may include, for example and without limitation, current products, products that are in a “maintenance” mode (which may still have a significant customer base), and “historical” or “dead” products. Current products may include products that are currently being developed. Maintenance mode products may include products that are no longer being developed but are still being sold. Historical or dead products may include products which are no longer sold or no longer available.
Current products and maintenance mode products may also contain numerous third party products, and in some cases a complex web of “internal” products which product groups often share. To add to this complexity of data identification and collection, major products may also have spin-off products.
Second, most companies run their daily business based on a “Living” environment, an environment where data may be constantly changing. Therefore, for many companies it is impractical to shut down systems while all the data within the systems is collected. This may be a significant issue if data has to be collected with an imposed deadline and may affect a company's FTO, becoming a significant burden on the company.
Third, companies, especially those involved in the Computer Software industry, often use Source Control Systems (also known as Revision Control Systems and Version Control Systems) to develop their products. Source Control Systems permit changes in data to be tracked over time. Changes are usually identified by a number or letter code, termed the “revision number,” “revision level,” or simply “revision.” For example, an initial set of files is “revision 1.” When the first change is made, the resulting set is “revision 2,” and so on. Each revision is associated with a timestamp and the person making the change. Revisions can be compared, restored, and with some types of files, merged.
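The revision model described above can be illustrated with a minimal sketch. It is not the interface of any particular Source Control System; the names `Revision`, `History`, `commit`, `compare`, and `restore` are hypothetical and chosen only for illustration. Each revision carries a number, a timestamp, and the person making the change, and revisions can be compared and restored.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
import difflib


@dataclass
class Revision:
    number: int          # revision number, starting at 1
    author: str          # person making the change
    timestamp: datetime  # when the change was recorded
    content: str         # full content of the tracked file at this revision


@dataclass
class History:
    """Hypothetical in-memory revision history for a single file."""
    revisions: list = field(default_factory=list)

    def commit(self, author: str, content: str) -> Revision:
        # Each change produces the next revision number.
        rev = Revision(len(self.revisions) + 1, author,
                       datetime.now(timezone.utc), content)
        self.revisions.append(rev)
        return rev

    def get(self, number: int) -> Revision:
        if not 1 <= number <= len(self.revisions):
            raise IndexError(f"no such revision: {number}")
        return self.revisions[number - 1]

    def compare(self, old: int, new: int) -> list:
        # Line-based comparison of two revisions as a unified diff.
        a = self.get(old).content.splitlines()
        b = self.get(new).content.splitlines()
        return list(difflib.unified_diff(
            a, b, f"revision {old}", f"revision {new}", lineterm=""))

    def restore(self, number: int, author: str) -> Revision:
        # Restoring records the old content as a new revision,
        # so the history itself is never rewritten.
        return self.commit(author, self.get(number).content)
```

In this sketch, restoring revision 1 after two commits yields revision 3 with the content of revision 1. Merging is omitted because it depends on the file type.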
Over time, these Source Control Systems are often replaced by other Source Control Systems for new product development. In addition, acquired companies may bring diverse development technology, often resulting in a mix of Source Control Systems in use by different groups within the same company. Additionally, “older” products that are still in the market but no longer being developed may be left in their original Source Control Systems. Identifying and extracting information in a standardized form from such a complex environment can be a laborious and costly undertaking when multiple Source Control Systems are involved.
Fourth, more and more companies have expanded and diversified the locations where company data is stored (e.g., through mergers or acquisitions, tax incentives, development costs, etc.). It is no longer the case that only large global companies have multiple development sites spread over different country locations. This geographical diversity may provide a considerable obstacle in obtaining the data within a limited timeframe.
Fifth, within most companies there are multiple organizations, such as Product Development/Manufacture, Quality Assurance, Marketing & Sales, Financial & Controlling, Professional Services, and Technical Alliances (Third Party). The presence of multiple organizations introduces a whole set of problems in a data collection process. Information from some groups (e.g., Financial) may be sensitive and highly confidential, and some groups (e.g., Marketing) tend not to maintain older information. Additionally, different groups may have different methods and structures for storing their own data.
Sixth, occasionally there is the need to identify and collect “Historical” information. This could be information about products that are “dead” or in a “maintenance” mode, but also information on older versions of current products that go back many years. Some documentation may be in a format no longer used within the company, and the tools that support the format may no longer be available. Also, only hardcopy versions of the documentation may be available. Another problem is that in some cases the person responsible for and knowledgeable about the documentation may no longer be with the company, and over time and with possible office changes, a hardcopy may not have been retained or may have been lost.
Seventh, in a company there is always reorganization, staff turnover, relocations, etc. These changes may provide a significant stumbling block in locating relevant information in any collection process, especially if the “completeness” of the data collection is an issue.
Eighth, there are some issues and problems that may be specific to companies that employ the use of Mainframes in their data storage. Particularly, these types of machines usually use Extended Binary Coded Decimal Interchange Code (EBCDIC) encoding rather than American Standard Code for Information Interchange (ASCII). In addition, product development in the past on such machines revolved around specialized environments rather than conventional Source Control Systems, making data extraction difficult.
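The encoding difference can be illustrated with a short sketch using Python's built-in codecs. It assumes the mainframe data uses code page 037 (`cp037`), which is only one of several EBCDIC variants; the actual code page must be confirmed for each source system. The function names are hypothetical.

```python
# Assumption: the source data uses EBCDIC code page 037. Mainframes use
# several EBCDIC variants, so the correct codec must be confirmed per system.
EBCDIC_CODEC = "cp037"


def ebcdic_to_text(raw: bytes, codec: str = EBCDIC_CODEC) -> str:
    """Decode EBCDIC-encoded bytes into a Python string."""
    return raw.decode(codec)


def text_to_ascii_bytes(text: str) -> bytes:
    """Encode text as ASCII, replacing characters ASCII cannot represent."""
    return text.encode("ascii", errors="replace")


def convert_record(raw: bytes, codec: str = EBCDIC_CODEC) -> bytes:
    """Convert one EBCDIC text record to ASCII bytes."""
    return text_to_ascii_bytes(ebcdic_to_text(raw, codec))


if __name__ == "__main__":
    # The same letters have different byte values in the two encodings.
    print("HELLO".encode(EBCDIC_CODEC).hex())  # EBCDIC bytes
    print("HELLO".encode("ascii").hex())       # ASCII bytes
```

A conversion of this kind applies only to text fields. Records that mix text with binary or packed-decimal fields cannot be converted byte for byte and need a field-level layout description.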
Ninth, almost all companies have some form of back-up/recovery system for their data. The problem is back-up/recovery systems can often be extremely large, the vast majority of the data being duplicates and as such irrelevant to most discovery activities. Additionally, there is a significant problem with cost-effectively searching such back-ups.
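The scale of the duplication problem can be illustrated by grouping items by a hash of their content. This is a minimal sketch of one common technique and is not taken from any back-up product; the function names and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib


def content_digest(data: bytes) -> str:
    """Return a SHA-256 hex digest identifying the content of an item."""
    return hashlib.sha256(data).hexdigest()


def group_by_content(items):
    """Group item names by content digest.

    `items` is an iterable of (name, data) pairs, where `data` is bytes.
    Returns a dict mapping each digest to the list of names that share
    that content; any list longer than one holds duplicates.
    """
    groups = {}
    for name, data in items:
        groups.setdefault(content_digest(data), []).append(name)
    return groups


def duplicate_count(groups) -> int:
    """Count items that are redundant copies of an earlier item."""
    return sum(len(names) - 1 for names in groups.values())
```

For example, three back-up entries of which two have identical content produce two groups and a duplicate count of one. Real back-up sets would be read in chunks from storage rather than held in memory.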
Tenth, in any company there is a significant amount of sensitive and confidential information (e.g., product source, financial details, strategy and future plans). If it is necessary to discover and collect this information, then there is the problem of how to identify, separate, and secure this data during the lifecycle of the collection. Another problem is identifying who will have what kind of access to the data. A further problem is how the data will be secured long term or securely destroyed once it is no longer needed.
Specialized data collection solutions in the market concentrate on searching all available data with sophisticated indexing and search algorithms.
One such solution is provided by Symantec Corporation and is described in Discovery Accelerator 8—Effective Searching, published in May 2010 and authored by Logan Sutterfield. The document offers an approach which collects information from specific repositories into a homogenized archive solution called an Enterprise Vault. This solution tailors an archiving method for common sources inside an organization (e-mail servers, collaborative editing servers, file archives, etc.), avoiding issues such as replication. All information is then stored and indexed in the Enterprise Vault. A tool called Discovery Accelerator then proceeds to search these indexes and retrieves the required documents based on the search parameters.
Once defined, the process runs automatically, collecting all data and storing it securely. However, since the amounts of collected data are huge in the typical enterprise scenario, a lot of non-relevant information is collected and must be sorted out by precise parameter searching. Relevant data is selected only at the time of the discovery. Because the sorting is highly dependent on the accuracy of the search criteria, there is a high risk of failing to find all relevant information, of finding a mix of relevant and irrelevant information, or both. In any event, it may be difficult to be assured that the returned data is in any way complete. Such an approach may be challenged either on whether it returns all relevant data (i.e., its completeness) or on whether it has been used to hide data (e.g., by choice of indexing). Also, this solution does not include an approach to manage data stored in EBCDIC format. In addition, the system is expensive to set up and complex to maintain (e.g., setting up and maintaining the indices).
Autonomy Corporation describes another solution in the document “Next Generation Archiving—White Paper,” using customized search mechanisms based on their IDOL (Intelligent Data Operating Layer) platform, which is claimed to understand over 1000 different file formats. This platform creates conceptual and contextual relationships between the stored data. Using their Enterprise Archive Solution (EAS), files are stored in a centralized repository which handles e-Mail and shared network folders separately. All stored files and e-Mails are indexed according to IDOL's methods and are searchable afterwards. However, this approach also suffers from the same issues and problems as the Symantec solution.
The types of solutions described above usually come with service level agreements, indicating their complexity. The solutions also have to be maintained over the years, again at very high cost. As a result, the poor cost-effectiveness of this approach and these mechanisms is one of the key reasons for avoiding them. In addition, these solutions, when implemented as offered, could very well disclose information which, while not relevant to the discovery itself, might be extremely sensitive.
Accordingly, there is a need for a system and method that can overcome the disadvantages of the previous systems and methods.