This disclosure relates generally to ranking genealogical records. Specifically, this disclosure relates to increasing the convergence speed of a machine learning model that can rank genealogical records.
A large-scale genealogical index can include billions of data records. Owing to the age of some of those records, the data records in a genealogical index are often obtained by digitizing various paper documents via optical character recognition (OCR) and indexing the scanned data into a database. Another source of data is users' manual input of family history information. The data in a genealogical index are often noisy due to mistakes in the original documents (especially dated documents), transcription errors, OCR errors, misreported or mistyped information, and the like.
A genealogical index allows users of a genealogical system to build their family trees, research their family history, and make meaningful discoveries about the lives of their ancestors. When users search a large collection of records for their ancestors, it is important for the genealogical system to return the most relevant records. However, a genealogical search differs from an ordinary web search in several aspects. First, genealogical queries are often short, including only names, a birth year, and a birth place. Second, the imbalance between the number of relevant samples and the number of irrelevant samples causes many state-of-the-art ranking models to fail. Third, the large number of typos and inaccurate values in records also often degrades the search results. For these reasons, a query often returns a long list of potentially relevant records, which makes the ranking of the search results particularly important, yet challenging.