The present invention relates generally to extracting and summarizing data, and in particular to allowing different extraction, summarization, or other such methods to interact while operating on the same data source.
Extracting and summarizing data currently presents many problems. For example, multiple extraction methods exist, but each has its own limitations. One extraction method may perform well with a small volume of data but poorly with a large volume of data. Conversely, a second extraction method may perform well with a large volume of data but poorly with a small volume of data. Examples of such extraction solutions are discussed below.
In one previous solution called logging, entries are added to a log when data in a table is updated or inserted. Such log entries may then be extracted and reported on as desired. However, such a process requires joining a potentially large data source with a log table that may also grow large. Therefore, the logging solution does not perform well for high volume extractions.
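The logging approach described above can be illustrated with a minimal sketch. It assumes an in-memory SQLite database; the table names `orders` and `orders_log`, the trigger names, and the function `extract_logged_rows` are hypothetical and chosen only for illustration, not taken from any particular prior system.

```python
import sqlite3

# Hypothetical schema: "orders" is the data source, "orders_log" is the log.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE orders_log (order_id INTEGER NOT NULL);

    -- Every insert or update on the source adds an entry to the log.
    CREATE TRIGGER orders_log_insert AFTER INSERT ON orders
    BEGIN
        INSERT INTO orders_log (order_id) VALUES (NEW.id);
    END;
    CREATE TRIGGER orders_log_update AFTER UPDATE ON orders
    BEGIN
        INSERT INTO orders_log (order_id) VALUES (NEW.id);
    END;
""")


def extract_logged_rows(connection):
    """Return the rows that have log entries, then clear the log."""
    # The extraction must join the source table with the log table.
    rows = connection.execute(
        "SELECT DISTINCT o.id, o.amount "
        "FROM orders AS o JOIN orders_log AS l ON l.order_id = o.id "
        "ORDER BY o.id"
    ).fetchall()
    connection.execute("DELETE FROM orders_log")
    return rows


conn.execute("INSERT INTO orders (id, amount) VALUES (1, 10.0)")
conn.execute("INSERT INTO orders (id, amount) VALUES (2, 20.0)")
first_batch = extract_logged_rows(conn)    # both newly inserted rows

conn.execute("UPDATE orders SET amount = 25.0 WHERE id = 2")
second_batch = extract_logged_rows(conn)   # only the updated row
```

In this sketch the cost of each extraction depends on the size of both `orders` and `orders_log`, which is the join cost the paragraph above identifies as the weakness of logging at high volume.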
In a solution similar to the logging solution, called event-based logging, each log entry carries additional information beyond which row has been updated or inserted. This entry may contain functional information that allows the entry to be identified and individually processed. However, for reasons similar to those given for the logging solution, the event-based solution does not perform well for high volume extractions.
In another solution called flagging, a flag is used to mark areas of a data source; the flag identifies the data that has not been previously extracted. In some cases, the flag may take on a value of ‘N’ if the data has not been extracted and is updated to null when the data is extracted. However, because of the limitations of a flag, the flag must be updated to null immediately or else the incremental context is lost. Furthermore, when only a small portion of the data is functionally required for extraction, information other than the flag may be required, since the flag could be set to ‘N’ for a much larger data set than required.
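A minimal sketch of the flagging approach follows, again assuming an in-memory SQLite database. The table `orders`, the column `extract_flag`, and the function `extract_flagged_rows` are hypothetical names introduced only for this illustration.

```python
import sqlite3

# Hypothetical schema: each row carries a flag that is 'N' until extracted.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, amount REAL, extract_flag TEXT DEFAULT 'N')"
)


def extract_flagged_rows(connection):
    """Return rows whose flag is 'N', then set their flag to null."""
    rows = connection.execute(
        "SELECT id, amount FROM orders WHERE extract_flag = 'N' ORDER BY id"
    ).fetchall()
    # The flag is cleared immediately; once it is null, nothing else records
    # that these rows belonged to this incremental extraction.
    connection.execute(
        "UPDATE orders SET extract_flag = NULL WHERE extract_flag = 'N'"
    )
    return rows


conn.execute("INSERT INTO orders (id, amount) VALUES (1, 10.0)")
conn.execute("INSERT INTO orders (id, amount) VALUES (2, 20.0)")
first_batch = extract_flagged_rows(conn)    # both rows still carry 'N'

# An update must set the flag back to 'N' so the row is picked up again.
conn.execute("UPDATE orders SET amount = 25.0, extract_flag = 'N' WHERE id = 2")
second_batch = extract_flagged_rows(conn)   # only the re-flagged row
```

Because the only state is the single flag column, the sketch selects every row marked ‘N’, even if only a small subset of those rows is functionally required for a given extraction.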
In yet another solution, rather than actually extracting the data, a view is defined that takes the place of a summarization program. However, such a view may quickly become intractable in a summarization solution. Further, this solution does not perform well with larger amounts of data since it forces all of the data to be summarized on every extraction. Moreover, using a view precludes other beneficial aspects of a summarization program, such as the recoverability of extraction work that has already been completed.
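The view-based approach can be sketched as follows, assuming an in-memory SQLite database. The table `orders`, the view `order_totals`, and the function `read_summary` are hypothetical names used only for illustration.

```python
import sqlite3

# Hypothetical schema: the view takes the place of a summarization program.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL);

    CREATE VIEW order_totals AS
        SELECT region, SUM(amount) AS total
        FROM orders
        GROUP BY region;
""")


def read_summary(connection):
    """Query the view; the aggregate is recomputed over all rows each time."""
    return connection.execute(
        "SELECT region, total FROM order_totals ORDER BY region"
    ).fetchall()


conn.execute("INSERT INTO orders (id, region, amount) VALUES (1, 'east', 10.0)")
conn.execute("INSERT INTO orders (id, region, amount) VALUES (2, 'east', 20.0)")
conn.execute("INSERT INTO orders (id, region, amount) VALUES (3, 'west', 5.0)")
summary_before = read_summary(conn)

# Adding one row still causes the next query to summarize every row again.
conn.execute("INSERT INTO orders (id, region, amount) VALUES (4, 'west', 7.5)")
summary_after = read_summary(conn)
```

Nothing is stored between the two calls in this sketch, so no completed extraction work exists to recover, and each query pays the full cost of summarizing the source.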
In a solution similar to the view solution, a summarization program without incremental logic extracts all of the data every time the program is run. However, in addition to the problems of the view solution, this solution has the added overhead of clearing the data from wherever the summarization program finally leaves the summarized data.
Previously, different solutions such as the ones described above could not operate on the same source of data reliably. Attempting to utilize multiple solutions on the same source of data risked the corruption of the data. As such, developers were forced to design custom solutions targeted to specific scenarios with little flexibility or scalability. Developers' resources were consumed as they developed these custom solutions while customers' costs increased. Customers were forced to use only one solution, and such a single solution was not always the most efficient solution for all of their possible scenarios.
In another previous solution employing hybrid flow with persistent incremental tables, there is typically a single summarization flow that diverges in locations that require different types of tuning to optimize for bulk and incremental summarization methods. This approach tends to be implemented after a summarization flow is designed and during advanced coding stages and even implementation, when various portions of the flow need to be tuned differently for bulk and incremental data volumes. Without a meaningful separation of the bulk and incremental methods and an architectural way of fixing the data to ensure better performance, this approach can cause the data in the persistent tables to become fragmented over time.
In another solution, index-organized tables are used; these tables automatically maintain the physical placement of data in the table to minimize fragmentation. However, persisting data in these tables with frequent high volume updates results in undesirable end-user overhead due to the automatic maintenance of the data in the tables.
In yet another solution, data maintenance must be executed manually and periodically in order to improve the performance of the summarization methods. This typically requires more support from a development team than would be needed if a method for deferring and scheduling the maintenance existed.
In still another solution, it is possible to use a single code-path for all methods of summarization. However, this architecture does not have the flexibility to be optimized for both bulk and incremental methods when the tuning techniques differ between the two data volumes, and therefore the solution does not scale well.
Previously, different solutions such as the ones described above could not operate on the same source of data reliably. Attempting to utilize multiple solutions on the same source of data risked corruption of the data. As such, users were forced to use only one solution, and this single solution was not always the most efficient solution for certain scenarios. Further, developers and customers are increasingly seeking solutions that are more cost-effective, customizable, maintainable, and robust while data sources and customers' needs continue to become more complicated. Therefore, an improved data extraction approach and an improved data summarization approach are desirable.