1. Technical Field
This invention relates to data ingesters and management of data collected by the data ingesters. More specifically, the invention pertains to employing a taxonomy for classifying ingested content and applying rules to filter ingested content based upon the taxonomy.
2. Description of Related Art
A data ingester is a program or script that collects content from data repositories. It browses locations within a distributed computer network in a methodical and automated manner in order to collect data from data sources. Data collection is known in two forms: the first is a continuous collection process, and the second is responsive to a specific query or request. In the continuous process, the ingester continues to collect new data and refresh old data. In general, a broad set of rules is applied to the ingester to control which locations it is allowed to visit. These rules include a set of allow rules and a set of forbid rules, and they are employed to attempt to ensure that the ingester visits only intended locations, directories, pages, and sites. In a distributed computer network, such as the Internet, the number of locations and sites grows continually, which in itself poses an increasing challenge to the ingester.
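By way of illustration only, the allow rules and forbid rules described above might be applied as in the following minimal sketch. The rule patterns, the function name may_visit, and the choice to let a forbid rule take precedence over an allow rule are all assumptions made for this example and are not drawn from any particular ingester.

```python
from fnmatch import fnmatchcase

# Hypothetical rule sets; the patterns are illustrative only.
ALLOW_RULES = ["https://example.com/docs/*", "https://example.org/*"]
FORBID_RULES = ["https://example.com/docs/private/*"]


def may_visit(location, allow_rules=ALLOW_RULES, forbid_rules=FORBID_RULES):
    """Return True if the location matches an allow rule and no forbid rule.

    Assumption: a forbid rule always overrides an allow rule.
    """
    if any(fnmatchcase(location, pattern) for pattern in forbid_rules):
        return False
    return any(fnmatchcase(location, pattern) for pattern in allow_rules)


if __name__ == "__main__":
    for url in (
        "https://example.com/docs/guide.html",
        "https://example.com/docs/private/notes.html",
        "https://unlisted.example.net/index.html",
    ):
        print(url, may_visit(url))
```

In this sketch, a location that matches neither rule set is not visited, so every newly created site requires a rule change before it can be ingested, which is the maintenance burden noted above.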
When data sources change the type of content being served, changes to the ingester are required to maintain appropriate data collection. Other solutions for addressing changes to content require the ingester to react to changes in ingestion requirements, discovery of new data sources, and changes in the nature of the data sources over time. For example, new content servers may come on-line while old content servers are removed or modified. An ingester that consumes data from the content servers ensures that its data ingestion policies take the new content servers into account. More specifically, the ingester ensures that the desired content continues to be ingested, that no undesired content proceeds through the ingestion process, and that any new content type is handled seamlessly and as intended.
One problem encountered in the related art is determining which data sources should be ingested and which data sources should be discarded. A known solution is to create a list of sources to ingest, and not to ingest any data from sources absent from the list. In a similar manner, another known solution is to create a list of sources to discard, and to ingest everything else. Both of these complementary processes are applied manually. As the list grows, a scaling challenge is encountered. More specifically, it becomes impossible to manually specify rules for every data source as the quantity of data sources increases.
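The two complementary list-based solutions can be contrasted in a short sketch. The function names and source names below are hypothetical and serve only to show how each solution treats a source that appears on neither list.

```python
def ingest_with_ingest_list(sources, ingest_list):
    """Keep only sources named on the list; everything else is discarded."""
    return [source for source in sources if source in ingest_list]


def ingest_with_discard_list(sources, discard_list):
    """Discard sources named on the list; everything else is ingested."""
    return [source for source in sources if source not in discard_list]


if __name__ == "__main__":
    # Hypothetical sources; "new-server" was never reviewed by an operator.
    discovered = ["news-feed", "spam-farm", "new-server"]

    print(ingest_with_ingest_list(discovered, {"news-feed"}))
    print(ingest_with_discard_list(discovered, {"spam-farm"}))
```

The unreviewed source is silently dropped under the first solution and silently ingested under the second, so either list must be edited by hand each time a new source appears.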
One variation on the solution described above is to specify a broad coverage of categories to ingest, with the understanding that the ingested content will include undesired content. A complementary variation is to specify a narrow coverage by explicitly defining sites, at the risk of filtering out potentially useful content. However, neither variation is desirable: the former introduces more content than desired, with the likelihood that the content will be polluted, and the latter does not allow access to all of the desired content.
Accordingly, there is a need for granular management of a data ingester that functions in an autonomous manner. Such an ingester should address which content is to be ingested and/or discarded, while minimizing manual intervention.