Information can be acquired and integrated from many data sources. It is often useful to do so from data sources that are not under the direct control of the integrator. Where direct access to a data source is not possible, or where each accessible data source provides only a partial view of the information associated with a given subject area or domain, conventional information integration approaches may be impractical.
Many websites and other non-database sources publish information that is generated from a database lying behind the site. Information published on a webpage or in some other unstructured format may be less useful than data in a database, because end users may not be able to query it readily and must either accept it in the published format or find it through unstructured searching techniques. Furthermore, information on various subjects or domains may be distributed piecemeal across multiple websites and other information sources, so that uniform access to related information for a given domain may be difficult or impractical to obtain.
Various software tools and techniques exist to acquire or scrape information from websites and similar sources of unstructured information, using methods analogous to screen-scraping techniques historically used for enterprise system integration. Such techniques may require considerable manual labor, which may include custom programming for individual sources to be scraped. Such tools may be incapable of integrating scraped information into a comprehensive, consistently structured database without considerable effort.
Applications that make use of information acquired from data sources may perform repeated acquisition runs over time as information is periodically or continually updated on each source. Managing data acquired over time, especially when the data is acquired from multiple sources, can be difficult. For example, when data is edited after acquisition and later reacquired, the reacquired data may be inconsistent with the edited data. When data referring to the same entity is acquired from multiple sources, inconsistencies between the sources may lead to inconsistencies in the database.
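The two inconsistencies described above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory store keyed by entity identifier and a naive acquisition routine that simply overwrites stored records; the function names, record fields, and sample values are illustrative assumptions and do not describe any particular tool.

```python
# Hypothetical illustration only: the "database" is a plain dict keyed by
# entity identifier, and each source is a dict of records in the same shape.

def naive_acquire(database, source_records):
    """Copy every source record into the database, overwriting existing records."""
    for entity_id, record in source_records.items():
        database[entity_id] = dict(record)


def find_conflicts(records_a, records_b):
    """Return {entity_id: {field: (value_a, value_b)}} where two sources disagree."""
    conflicts = {}
    for entity_id in records_a.keys() & records_b.keys():
        a, b = records_a[entity_id], records_b[entity_id]
        differing = {
            field: (a[field], b[field])
            for field in a.keys() & b.keys()
            if a[field] != b[field]
        }
        if differing:
            conflicts[entity_id] = differing
    return conflicts


# Two assumed sources describing the same entity with different phone numbers.
source_a = {"acme": {"name": "Acme Corp", "phone": "555-0100"}}
source_b = {"acme": {"name": "Acme Corp", "phone": "555-0199"}}

database = {}
naive_acquire(database, source_a)        # first acquisition run
database["acme"]["phone"] = "555-0142"   # record edited after acquisition
naive_acquire(database, source_a)        # later run silently discards the edit
edit_lost = database["acme"]["phone"] == "555-0100"

# The same entity acquired from two sources carries conflicting field values.
conflicts = find_conflicts(source_a, source_b)
```

In this sketch the second acquisition run restores the source value and the edit is lost, while `find_conflicts` reports the phone field as disagreeing between the two sources. Neither outcome is resolved here; the sketch only makes the problem concrete.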