The present invention relates generally to the field of information processing, and, more particularly, to using artificial intelligence to compile information.
To facilitate searching, sorting, combining, and various other functions, information may be stored electronically in a database. A database is generally structured as a set of records with each record containing one or more fields. Unlike a data structure, such as an array, in which all the array elements represent the same type of information, each field in a record typically represents a different type of information. A record may be accessed as a collection of fields or, alternatively, the various fields in a record may be accessed individually by name.
Although databases are generally characterized by their highly organized structure of records and fields, the information to be stored in a database may not be as highly organized. For example, consider a database for storing résumés for job candidates. Most résumés contain the following types of information: demographic information (e.g., name, address, telephone number, electronic mail address, etc.), education information, and job experience information. Nevertheless, while these various types of information are generally present in most résumés, they may not be arranged in a standardized format. As a result, it may be difficult to store candidate résumés in a database in a consistent manner such that a user may search, sort, or otherwise process the résumés according to some criterion.
Consequently, there exists a need for improvements in compiling and organizing information such that the information may be more readily accessed and processed when saved in, for example, a database.
Embodiments of the present invention may include methods, systems, and computer program products for compiling information into information categories using an expert system. For example, multiple information categories may be defined and, for each information category, a fact table may be provided that contains facts and rules associated with the respective information category. The information to be compiled may be encoded as multiple data strings and received as a digital data stream. An inference engine is then used to process the facts, the rules, and the data strings for at least one of the fact tables to associate one or more of the data strings with at least one of the information categories. The data strings that are associated with the information categories may then be arranged in a file based on their information category associations.
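By way of illustration only, the association step described above may be sketched as follows. The sketch covers only direct fact matching; the function name `associate`, the representation of a fact table as a set of lowercase strings, and the sample category names are assumptions made for this example and are not prescribed by the invention.

```python
def associate(data_strings, fact_tables):
    """Fact-match only: map each category to the data strings found in its fact table.

    `fact_tables` is a hypothetical dict of category name -> set of lowercase facts.
    """
    result = {category: [] for category in fact_tables}
    for s in data_strings:
        for category, facts in fact_tables.items():
            # Case-insensitive comparison against the category's facts
            if s.lower() in facts:
                result[category].append(s)
    return result


fact_tables = {
    "skill_set": {"python", "sql"},
    "education": {"bs", "university"},
}
associations = associate(["Python", "BS", "hiking", "SQL"], fact_tables)
```

In this sketch, a data string such as "hiking" that matches no fact is simply left unassociated.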
By using the inference engine and fact tables to associate data strings with information categories, non-standardized information may be organized by category and then arranged in a file based on these categories. The resulting file may be more readily processed by other applications because the information contained therein may be arranged in a consistent, predetermined manner.
The fact tables may be viewed as a knowledge base and the inference engine and fact tables together may be viewed as an expert system for associating information with information categories. Because rules may be developed for the expert system to account for various organizations of data strings in the received data stream, a programmatic approach to categorizing the data strings need not be followed. For example, when processing information from a résumé, the expert system need not rely on the candidate's name being at the beginning of the résumé or the use of specific subtitles, such as “EXPERIENCE” or “EDUCATION” in the body of the résumé.
In particular embodiments of the present invention, a determination may be made whether data strings are encoded using the American Standard Code for Information Interchange (ASCII) coding scheme. If the data strings are encoded using a non-ASCII coding scheme, then the data strings may be translated into ASCII to facilitate further processing.
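One possible form of the encoding check and translation is sketched below. The source does not specify which non-ASCII coding schemes may be encountered; the sketch assumes UTF-8 input by default and folds accented characters to their closest ASCII letters, which is only one of several reasonable translation policies. The function names are hypothetical.

```python
import unicodedata


def is_ascii(data: bytes) -> bool:
    """Return True if every byte falls in the 7-bit ASCII range."""
    return all(b < 0x80 for b in data)


def to_ascii(data: bytes, source_encoding: str = "utf-8") -> str:
    """Decode `data` and fold non-ASCII characters to ASCII approximations.

    Assumption: the caller knows (or guesses) the source encoding.
    """
    if is_ascii(data):
        return data.decode("ascii")
    text = data.decode(source_encoding)
    # NFKD splits an accented letter into a base letter plus a combining mark
    decomposed = unicodedata.normalize("NFKD", text)
    # Dropping the unencodable combining marks leaves the base letters
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

Characters with no ASCII decomposition are silently dropped in this sketch; an embodiment might instead substitute a placeholder or reject the input.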
In embodiments of the present invention, the facts may include, but are not limited to, names, words, phrases, acronyms, terms of art, number strings (e.g., zip codes, area codes), geographic names, etc. The rules may comprise fact match rules, pattern match rules, and proximity search rules.
In further embodiments of the present invention, the inference engine may process the facts, the fact match rules, and the data strings for one or more of the fact tables to associate data strings with the information categories. The inference engine may also process the pattern match rules and the data strings for one or more of the fact tables to associate data strings with the information categories. The pattern match rules may include rules related to sequences of data strings. Finally, the inference engine may process the proximity search rules and the data strings for one or more of the fact tables to associate data strings with the information categories. The proximity search rules may include rules related to the relative location of data strings in the data stream. For example, when processing information from a résumé, if the term “GPA” is located near the term “EDUCATION,” then it may be interpreted as “Grade Point Average” and may be associated with an education category. Alternatively, if the term “GPA” is located closer to the term “EXPERIENCE,” then it may be interpreted as an acronym for a skill, job responsibility, etc. and may be associated with an employment category.
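The “GPA” example above may be sketched as a simple proximity search rule. The sketch assumes that the data stream has already been split into whitespace-delimited tokens and that proximity is measured as the distance in tokens to the nearest heading; both choices, and the function names, are assumptions made for illustration.

```python
def nearest_heading(tokens, index, headings):
    """Return the heading closest (by token distance) to tokens[index], or None."""
    best, best_dist = None, None
    for i, tok in enumerate(tokens):
        if tok.upper() in headings:
            dist = abs(i - index)
            # On a tie, the heading that appears first in the stream wins
            if best_dist is None or dist < best_dist:
                best, best_dist = tok.upper(), dist
    return best


def categorize_gpa(tokens):
    """Associate each occurrence of "GPA" with the category of its nearest heading."""
    headings = {"EDUCATION": "education", "EXPERIENCE": "employment"}
    results = []
    for i, tok in enumerate(tokens):
        if tok.upper() == "GPA":
            heading = nearest_heading(tokens, i, headings)
            results.append((i, headings.get(heading)))
    return results


tokens = "EDUCATION GPA 3.8 BS Physics EXPERIENCE Managed GPA rollout".split()
gpa_categories = categorize_gpa(tokens)
```

Here the first “GPA” (token 1) lies one token from “EDUCATION” and is associated with the education category, while the second (token 7) lies two tokens from “EXPERIENCE” and is associated with the employment category.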
In particular embodiments of the present invention, the information categories may be tailored for compiling information from a résumé. Accordingly, the information categories may include a demographic category, a skill set category, an education and employment category, and a career progression category. The number of occurrences for each data string that is associated with the skill set category may be determined and the number of occurrences for each data string that is associated with the career progression category and corresponds to job position title information may be determined. These “hit counts” may be indicative of the relative importance of a particular candidate's skills and job titles.
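The hit counts described above may be computed, for example, as shown in the following sketch. It assumes that earlier processing has produced a list of (data string, category) pairs and that occurrences are counted case-insensitively; the further restriction of career progression counts to job position titles is omitted for brevity. The names used are hypothetical.

```python
from collections import Counter


def hit_counts(associations, category):
    """Count occurrences of each data string associated with `category`."""
    return Counter(s.lower() for s, c in associations if c == category)


associations = [
    ("Python", "skill_set"),
    ("python", "skill_set"),
    ("SQL", "skill_set"),
    ("Engineer", "career_progression"),
]
skill_hits = hit_counts(associations, "skill_set")
```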
In further embodiments of the present invention, a qualitative rank may be determined for each data string that is associated with the career progression category and corresponds to job position title information or job responsibility information. These qualitative rankings may be based on weights assigned to job position titles and job responsibilities in the fact tables. The weights assigned to the job position titles and job responsibilities in the fact tables may be dynamically set by a user based on the type of qualifications sought in a job candidate.
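A weight-based qualitative ranking of this kind may be sketched as follows. The weight values, the default weight of zero for unlisted strings, and the function name are assumptions made for this example; the source does not specify how weights are scaled or combined.

```python
def qualitative_ranks(strings, weights, default=0.0):
    """Pair each data string with its user-set weight, highest weight first."""
    ranked = [(s, weights.get(s.lower(), default)) for s in strings]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical weights a user might set when seeking management experience
weights = {"director": 0.9, "engineer": 0.5}
ranks = qualitative_ranks(["Engineer", "Director", "Intern"], weights)
```

Because the weights are held in an ordinary mapping, a user may change them between runs to reflect the qualifications sought for a different position.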
In still further embodiments of the present invention, in addition to the data strings that are associated with the information categories, the number of occurrences for each data string that is associated with the skill set category, the number of occurrences for each data string that is associated with the career progression category and corresponds to job position title information, and the qualitative rank for each data string that is associated with the career progression category and corresponds to job position title information or job responsibility information may also be arranged in a file based on the associations between the data strings and the information categories.
The file containing the data strings associated with the information categories may be an extensible markup language (XML) file. Advantageously, XML may allow the file to be described in terms of logical parts or elements. For example, the information categories and the various types of information that belong to each category may be represented in the XML file as specific elements.
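For instance, the categorized data strings may be written out as XML elements along the lines of the following sketch. The element names `resume` and `item`, and the use of category names as element tags, are assumptions made for illustration and do not reflect any schema defined by the source.

```python
import xml.etree.ElementTree as ET


def build_xml(categorized):
    """Serialize a dict of category -> list of data strings as an XML string."""
    root = ET.Element("resume")
    for category, strings in categorized.items():
        # Each information category becomes its own element
        category_el = ET.SubElement(root, category)
        for s in strings:
            ET.SubElement(category_el, "item").text = s
    return ET.tostring(root, encoding="unicode")


xml_text = build_xml({
    "demographic": ["Jane Doe"],
    "skill_set": ["Python", "SQL"],
})
```

Category names used as tags in this sketch must themselves be valid XML element names.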
In further embodiments of the present invention, the data strings may be added to the XML file in their received arrangement. For example, if the data strings comprise information from a résumé, then the entire résumé, without any processing or formatting performed thereon, may be added to the XML file. To facilitate processing by other applications, the XML file may be saved in a structured query language (SQL) database. In addition, the XML file may be sent to the originator of the digital data stream (e.g., the source of a résumé file or other information stream).
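Saving the XML file in an SQL database may be sketched, for example, with SQLite. The table name `resumes`, its two columns, and the use of a candidate identifier as the key are assumptions made for this example; the source does not specify a database product or schema.

```python
import sqlite3


def save_xml(conn, candidate_id, xml_text):
    """Store (or overwrite) the XML document for one candidate."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS resumes "
        "(candidate_id TEXT PRIMARY KEY, xml TEXT NOT NULL)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO resumes VALUES (?, ?)",
        (candidate_id, xml_text),
    )
    conn.commit()


def load_xml(conn, candidate_id):
    """Return the stored XML for `candidate_id`, or None if absent."""
    row = conn.execute(
        "SELECT xml FROM resumes WHERE candidate_id = ?", (candidate_id,)
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")  # in-memory database for demonstration
save_xml(conn, "c1", "<resume><raw>Jane Doe ...</raw></resume>")
```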
In other embodiments of the present invention, unknown data strings may be identified by removing those data strings that are either known to be uncorrelated with any of the information categories (e.g., “noise” terms) or are represented by a corresponding fact in the fact tables. Any data string that remains may be considered to be an “unknown” data string and may be added to a pending fact table. Moreover, the pending fact table may include multiple pending fact tables corresponding to the fact tables associated with the information categories.
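The filtering of known strings described above may be sketched as follows. The sketch assumes that noise terms and facts are stored as sets of lowercase strings and that matching is case-insensitive; the function name is hypothetical.

```python
def find_unknowns(data_strings, noise_terms, fact_tables):
    """Return data strings that are neither noise terms nor known facts."""
    known = set(noise_terms)
    for facts in fact_tables.values():
        known.update(facts)
    # Whatever survives the filter is a candidate for a pending fact table
    return [s for s in data_strings if s.lower() not in known]


unknowns = find_unknowns(
    ["Python", "the", "Kubernetes", "BS"],
    noise_terms={"the", "and"},
    fact_tables={"skill_set": {"python"}, "education": {"bs"}},
)
```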
In still other embodiments of the present invention, the number of occurrences for each data string in each one of the pending fact tables may be determined. These numbers of occurrences or “hit counts” may then be compared with thresholds that are defined for each of the pending fact tables. If the number of occurrences of a data string exceeds the threshold defined for a particular pending fact table, then that data string may be added to the fact table associated with the pending fact table. Thus, new facts may be “learned” when their frequency rises to a level that suggests that they may be used in connection with a particular information category.
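Threshold-based learning of this kind may be sketched as shown below. The sketch represents a pending fact table as a `Counter` of hit counts and a fact table as a set; removing a promoted string from the pending table is an assumption of this example, as is the function name.

```python
from collections import Counter


def promote_pending(pending, facts, threshold):
    """Move strings whose hit count exceeds `threshold` from pending to facts."""
    promoted = [s for s, n in pending.items() if n > threshold]
    for s in promoted:
        facts.add(s)
        del pending[s]  # assumption: a learned fact leaves the pending table
    return promoted


pending = Counter({"kubernetes": 5, "synergy": 2})
facts = {"python"}
promoted = promote_pending(pending, facts, threshold=3)
```

The comparison is strict, so a string whose hit count merely equals the threshold remains pending.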
In yet other embodiments of the present invention, the number of occurrences for each data string in each one of the pending fact tables may be determined and then the data strings for each of the pending fact tables may be ranked based on these numbers of occurrences or “hit counts.” The ranked data strings may be displayed on, for example, a display monitor to allow a user to select which of the data strings in each of the pending fact tables to add to the respectively associated fact tables. New facts may be “learned” by adding those data strings in the pending fact tables that are selected by the user to the appropriate corresponding fact tables. In addition, a user may identify those data strings in the pending fact tables that are uncorrelated with any of the information categories and, thus, may be treated as “noise” terms.
The present invention may be used to compile information that may be received as multiple data strings arranged in a variety of different formats into a structured arrangement or format by using an expert system to associate the data strings with predetermined information categories. For example, the present invention may be used to compile information from candidate résumés, which may be written in many different types of formats or styles, into a structured arrangement in which the information is organized based on a set of information categories that are typically associated with a résumé. Once the information has been arranged in a structured format, other applications may more readily access and process the information because of the uniformity in which the information is arranged and stored.
While the present invention has been described above primarily with respect to method aspects of the invention, it will be understood that the present invention may be embodied as methods, systems, and/or computer program products.