1. Technical Field
The embodiments herein generally relate to a process of data integration and particularly relate to a method of synchronizing data in a process of data integration. The embodiments herein more particularly relate to a method and system for synchronizing data from a source to a destination in real time.
2. Description of the Related Art
A data integration process involves combining data residing in different sources and providing users with a unified view of the data. In this process, the data is frequently fetched (polled) from a source system so that the changes can be applied in a destination system. Data integration is significant in a variety of situations, in both commercial and scientific fields. In the commercial field, data integration plays a very important role in merging the databases of two similar companies. Similarly, in the scientific field, there is a need to integrate data when combining research results from different repositories.
One important aspect of data integration is the synchronization of data among multiple systems in real time. At present, there are many systems and methods for synchronizing data among systems. One important question in the synchronization process is how the data is written and how frequently the data is fetched, or polled, from the source system. Polling the data from the source system means reading the data from the source system; the data is written to the destination system after the reading process. The polling frequency indicates how frequently the data is read from the source system. The time difference between two consecutive polling operations is referred to as the polling interval. The polled data can be data of the same entity, of different entities, or of a new entity.
Sometimes the data needs to qualify against different business processes before it is written to the end systems (destination systems). On a few occasions, writing the data to the end systems may fail due to the unavailability of the end system or due to unqualified or unauthenticated data. Synchronization solutions provide for handling a failure in the synchronization process when the application is resumed. The writing of authenticated data to the end system is called processing of the data. A synchronization process is always expected to synchronize the data as soon as it is generated at the source system. A polling operation can be performed frequently to bring in the changes quickly. But this is not enough to bring the source system and the destination system into a synchronized state until the polled data is processed. Therefore, the processing of the data needs to be quick, in addition to frequent polling. This problem can be solved by performing a parallel processing operation. Processing the data in parallel is possible only when the data are independent in nature. Parallel processing results in either inconsistency or failure when the data to be processed are dependent on one another. The existing synchronization solutions are mostly based on either an event based trigger process or a scheduler based trigger process. In the event based trigger process, a trigger is enabled at the source system. Whenever a change occurs in the source system, it is raised as an event. The event is further processed during the synchronization process. Event based triggers result in parallel propagation, which can cause the dependency order to be missed. The scheduler based synchronization solution, in contrast, always searches for the changes in the source system.
During a data integration process, all the changes to the data in the source system are to be carried over to the corresponding data in the destination system while satisfying the interdependency of the data. It is to be noted that a solution in the integration process need not always reproduce the exact sequence of the changes in the destination system, but it must maintain the interdependency of the data. The data at the source system is generated by a plurality of users with different operations/responsibilities. New data is generated by creating an entry, updating the existing data, or deleting the existing data. When the changes to data in the source system are read in a random order during an integration process, the interdependency between two changes may be lost. This interdependency can take the form of a parent-child relationship between data. Therefore, it is important to preserve the order of the changes made to the data in the source system.
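The loss of a parent-child dependency when changes are read out of order can be illustrated with a minimal sketch. The `Change` record, the `apply_changes` function and the entity names are hypothetical and are introduced only for illustration; the toy destination is assumed to reject a child whose parent has not yet been written.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Change:
    timestamp: int             # time the change was generated at the source
    entity_id: str
    parent_id: Optional[str]   # None for an entity without a parent


def apply_changes(changes):
    """Write changes to a toy destination in the given order.

    Assumption: the destination rejects a child whose parent is absent.
    Returns the set of written entities and the list of rejected ones.
    """
    destination = set()
    failed = []
    for change in changes:
        if change.parent_id is not None and change.parent_id not in destination:
            failed.append(change.entity_id)
            continue
        destination.add(change.entity_id)
    return destination, failed


changes = [
    Change(1, "epic-1", None),
    Change(2, "story-1", "epic-1"),
    Change(3, "task-1", "story-1"),
]

# Order of generation at the source: every parent precedes its child.
in_order = sorted(changes, key=lambda c: c.timestamp)
ordered_result = apply_changes(in_order)

# Random read order: the task arrives before its parent story.
out_of_order = [changes[2], changes[0], changes[1]]
unordered_result = apply_changes(out_of_order)
```

In this sketch the ordered run writes all three entities, while the out-of-order run rejects `task-1` because its parent has not yet been written.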
The main problem in the integration process is to define a way to apply the changes in an order such that the interdependency of the data in the source system and the destination system is maintained. One obvious choice is to carry out the changes in an incremental order, i.e. by arranging the changes with respect to the time at which they were generated. As a result, the changes are polled and synchronized in the same order in which they were generated. When the changes are made in an incremental order, an event is generated for every change in the source data and notified to the synchronization process. The changes are then processed during the data integration process. This continues as long as the synchronization application is running and available. When the synchronization application is unavailable because of a network failure, a system failure or some other cause, there is a chance of missing the changes that are generated during the abovementioned down time. This results in an inconsistent state between the source system and the destination system. The currently available synchronization solutions generally poll the changes and sort them with respect to time. This method does not allow for a recovery of the changes lost during a polling process when the changes are made to a plurality of entities in the same place, or when the changes are made to the same entity at a plurality of locations at the same time.
A synchronizing process synchronizes the changes in the entities among incompatible distributed systems. A reliable synchronization process must be tolerant to faults. It needs to take care of all the failures that occur during a synchronization process. A failure may be an expected/anticipated failure or an unexpected/new failure. Examples of expected/anticipated failures are an unsuccessful transaction on the target side, non-availability of the end system, and inappropriate write privileges in the target system. In the case of an expected/anticipated failure, the system for synchronizing the data knows that the synchronization has failed and takes the necessary action on the failure after resuming its process. Examples of unexpected failures are a machine shutdown and a server crash. When an unexpected/sudden failure occurs, an integration process may be in progress, or a synchronization process may have reached a certain level, and the process is interrupted without any prior knowledge. The data may or may not have been written to the target system at the time of the failure. In such circumstances, it is not known when the synchronization application will next be active after the failure and the subsequent start-up of the system. More precisely, the state of processing after a failure is not known. It is therefore better to poll the changes incrementally and to store a time for the source system from which the polling process is restarted. All the changes made to the entity in the source system after this stored/waiting time are polled. The stored/waiting time is updated after each change is processed: when a change is written to the destination system, the stored/waiting time is updated with the time stamp of the processed event. This time is referred to as the last polled time.
The next synchronization process fetches/collects only those changes which are made after the last polled time. But the question arises as to what happens when multiple entities are updated at the same time (bulk changes). The existing processes support bulk changes, but they provide no way to recover bulk changes in a synchronization process. When bulk changes are synchronized, the existing solutions do not ensure that the bulk changes are synchronized in the correct order. Further, the existing synchronizing processes do not ensure appropriate synchronization of the changes in case of failures such as a system crash or an application crash.
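The recovery problem with bulk changes can be illustrated with a minimal sketch of a checkpoint based on the last polled time. The `poll` and `sync` functions, the `crash_after` parameter and the sample data are hypothetical; the sketch assumes that the checkpoint is a single time stamp and that a restart fetches only changes strictly newer than it.

```python
def poll(source, last_polled_time):
    """Return the changes strictly newer than the checkpoint, oldest first."""
    newer = [change for change in source if change[0] > last_polled_time]
    return sorted(newer, key=lambda change: change[0])


def sync(source, destination, last_polled_time, crash_after=None):
    """Write polled changes and advance the checkpoint after each write.

    crash_after simulates an unexpected failure after that many writes.
    Returns the checkpoint that would be found on restart.
    """
    written = 0
    for timestamp, entity in poll(source, last_polled_time):
        if crash_after is not None and written == crash_after:
            break  # simulated crash in the middle of the run
        destination.append(entity)
        last_polled_time = timestamp  # time stamp of the processed change
        written += 1
    return last_polled_time


# B, C and D form a bulk change: three entities updated at the same time.
source = [(10, "A"), (20, "B"), (20, "C"), (20, "D"), (30, "E")]
destination = []

checkpoint = sync(source, destination, last_polled_time=0, crash_after=2)
# The crash occurred after A and B were written; the checkpoint is now 20.
checkpoint = sync(source, destination, last_polled_time=checkpoint)
# The restart fetches only changes newer than 20, so C and D are never written.
```

Under these assumptions the destination ends up with A, B and E only. Restarting from a time stamp alone cannot distinguish the processed part of a bulk change from the unprocessed part.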
In the synchronization process, all the required changes made to an entity in the source system should be transferred to and carried out on the entity in the destination system in a sequential order, thereby satisfying the dependency among the changes to the entity. A change should not be transferred twice to the destination system, and no unwanted change should be transferred to the destination system. Until now, it has generally been assumed that all the changes in the source system are polled from a single place. However, the changes may be made to the entity in the source system from different locations. Also, there is no standard stipulating that all the changes for an entity need to be stored at the same place. If changes are made to the entity in the source system from different places, then there is no option other than to fetch the data from each location individually and process it in order during the data integration process. An integration module accesses one location at a time to poll the changes after some initial waiting time. The module then accesses another location to poll the changes after an initial waiting time. The polled changes are merged after being sorted with respect to the time of change. This process is repeated for all the locations. But there exists a critical situation in which changes are made at both locations just after the polling at the first location has finished but before the polling at the second location has started. By this time, the synchronization module has already polled the changes from the first location and starts polling from the next location. As a result, the integration module polls the changes from the first location but misses the last change made at the first location. The handling of changes from different locations of the same source also poses a problem when the change history is fragmented.
This is the case when the created data is stored at one place while the updates are stored at different places. A standard synchronization module polls the data from the fragmented places and prepares a list by merging the changes made at the different places with respect to time, but it does not maintain the dependency between the changes during the merging process.
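The critical situation in polling multiple locations can be illustrated with a minimal sketch. The `poll_location` function, the two in-memory lists and the change descriptions are hypothetical. The sketch further assumes that a single last polled time is shared by both locations and is advanced to the newest time stamp in the merged list, which is one way the missed change could arise and is not stated in this form above.

```python
def poll_location(location, since):
    """Return the changes at one location that are newer than the checkpoint."""
    return [change for change in location if change[0] > since]


# Fragmented history: the creation is stored at one place, updates at another.
location_1 = [(5, "create X")]
location_2 = [(6, "update X: title")]
last_polled_time = 0

# Cycle 1: the first location is polled first.
batch_1 = poll_location(location_1, last_polled_time)

# Changes arrive at both locations after the first poll has finished
# but before the second poll has started.
location_1.append((7, "update X: owner"))
location_2.append((8, "update X: status"))

batch_2 = poll_location(location_2, last_polled_time)

# Merge by time of change and advance the shared checkpoint (assumption).
merged = sorted(batch_1 + batch_2)
last_polled_time = merged[-1][0]

# Cycle 2: nothing newer than the checkpoint is found at either location,
# so the change made at time 7 at the first location is never synchronized.
cycle_2 = sorted(
    poll_location(location_1, last_polled_time)
    + poll_location(location_2, last_polled_time)
)
```

In this sketch the merged list of the first cycle contains the changes made at times 5, 6 and 8, and the second cycle returns nothing, so the change made at time 7 is lost.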
The synchronization module interacts with the systems using an authenticated login called an integration user. The integration module reads the changes from a source system and writes the data to a destination system. Consider a scenario of a bidirectional integration between two systems S1 and S2. The systems S1 and S2 are the source and the destination systems with respect to each other. The users U1 and U2 are the integration users for the systems S1 and S2 respectively. A change C1 is created by an integration user in the system S1. The synchronization module polls the change C1 and writes it to the system S2. This change in S2 is made by the user U2, and a change C2 is generated in S2. Now C2 is polled by the user U2 and written to the system S1 using the user U1. The changes to be written to S1 by the integration module are almost the same as the original set of changes. If the synchronization module allows the change to be written to the system S1, then a change is again generated in the system S1, which is polled in turn, and so on. Hence a loop is created and the synchronization module oscillates between the two systems. There are cases where the systems or the synchronization module do not allow the changes to be written when the changes have the same value. In this case, a different value can be expected while writing the changes back to the source. For example, this different value can be a traceability link, or a setting key of a destination in a source. This again generates a change in the source, but this change will be written because of a lack of a destination field. Hence, there is again a failure in the synchronization process. These cases result in an indefinite loop. The existing synchronization modules generally filter such data to avoid an indefinite loop. But the filtering is not performed accurately by these existing solutions.
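The oscillation between S1 and S2 can be illustrated with a minimal sketch. The `System` class, the `sync_once` and `run` functions, the user name `alice` and the bound on the number of rounds are hypothetical. The filter shown, which skips changes authored by the integration user of the system being polled, is only one simple form of the filtering mentioned above and is not asserted to be how any existing solution works.

```python
from dataclasses import dataclass, field


@dataclass
class System:
    name: str
    integration_user: str
    log: list = field(default_factory=list)  # (author, payload) entries
    cursor: int = 0                          # index of the next unread entry

    def write(self, author, payload):
        self.log.append((author, payload))

    def poll(self):
        """Return the entries added since the previous poll."""
        new_entries = self.log[self.cursor:]
        self.cursor = len(self.log)
        return new_entries


def sync_once(source, destination, skip_integration_user):
    """Copy new changes from source to destination; return the write count."""
    written = 0
    for author, payload in source.poll():
        if skip_integration_user and author == source.integration_user:
            continue  # the change was produced by the synchronization itself
        destination.write(destination.integration_user, payload)
        written += 1
    return written


def run(skip_integration_user, max_rounds=5):
    """Run a bounded number of bidirectional rounds; return total writes."""
    s1 = System("S1", "U1")
    s2 = System("S2", "U2")
    s1.write("alice", "C1")  # the original change in S1
    total = 0
    for _ in range(max_rounds):
        total += sync_once(s1, s2, skip_integration_user)
        total += sync_once(s2, s1, skip_integration_user)
    return total


writes_without_filter = run(skip_integration_user=False)
writes_with_filter = run(skip_integration_user=True)
```

Without the filter, every round writes the change once in each direction, so the write count grows with the number of rounds. With the filter, the change is written to S2 once and the loop stops.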
None of the currently available systems and methods uses both an event based trigger and a scheduler based trigger to synchronize data from a source to a destination.
Hence, there is a need for a method for synchronizing data from a source to a destination in real time. There is also a need for a method to address the problems of synchronization with incremental changes, bulk changes and changes from multiple locations of a source. Further, there is a need for a solution to the problem of the indefinite loop that occurs during a synchronization of data among multiple systems. Still further, there is a need for a method and a system that use both an event based trigger and a scheduler based trigger to synchronize data from a source to a destination.
The abovementioned shortcomings, disadvantages and problems are addressed herein, as will be understood by reading and studying the following specification.