The invention relates to computer systems and methods, and in particular to a citation record extraction system and method, and a program product.
This section is intended to introduce the reader to various aspects of the art, which may be related to various aspects of the present invention, which are described and/or claimed below. This discussion is believed to be helpful in providing the reader with background information to facilitate a better understanding of the various aspects of the present invention. Accordingly, it should be understood that these statements are to be read in this light, and not as admissions of prior art.
Citation records are essential to research communities. Researchers often create their own publication lists on the Web for various reasons, such as describing their research and contributions, or announcing new papers before they are formally published in journals or presented at conferences. Here, a Web page containing publication information is referred to as a publication list page. The challenges of extracting citation records from publication list pages arise from two aspects. First, many publication list pages are crafted manually by the researchers themselves, so their layouts can differ considerably. In addition, the citation records are usually accompanied by extraneous data, such as descriptive text about the corresponding citation record. This noise is a major obstacle to extracting citation records. Second, there are many rules and formats for representing citation records, so it is difficult to derive parsing rules that extract citation records directly.
Accordingly, an effective processing method for citation record extraction is needed.