The Internet may provide a source of training data for machine learning (ML) applications. However, it has proven challenging to filter out unwanted or low-quality data efficiently and effectively while collecting desirable training data. Conventional approaches use a blacklist to handle unwanted training data, filtering out URLs that are deemed to point to unwanted content. However, utilizing such blacklists is inefficient and ineffective because a user must manually add URL entries to the blacklist in order to filter out newly discovered pages. As such, these conventional approaches may be laborious and reactive.
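The conventional blacklist approach described above can be sketched as a simple URL filter. This is a minimal illustration, not any particular system's implementation; the blacklist domains and the `filter_urls` helper are hypothetical, and the key limitation is visible in the code: the blacklist set must be updated by hand before any newly discovered unwanted page is filtered.

```python
from urllib.parse import urlparse

# Hypothetical blacklist of domains deemed to host unwanted content.
# A user must manually add new entries here as new pages are found.
BLACKLIST = {"spam.example.com", "junk.example.net"}

def filter_urls(urls, blacklist=BLACKLIST):
    """Keep only URLs whose host is not on the blacklist."""
    kept = []
    for url in urls:
        host = urlparse(url).netloc.lower()
        if host not in blacklist:
            kept.append(url)
    return kept

candidate_urls = [
    "https://data.example.org/corpus/page1",
    "https://spam.example.com/ads",
    "https://junk.example.net/clickbait",
]
print(filter_urls(candidate_urls))
# Only the URL not on the blacklist is kept for training data collection.
```

Any unwanted page hosted on a domain that is not yet blacklisted passes straight through, which is why the approach is characterized as laborious and passive.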