Business classification is used in many different ways for fiscal, financial, sales, marketing, and other purposes and activities. It helps businesses judge which companies should be targeted to become customers or vendors of a particular product or service. One popular use of business classifications is to build a company's sales pipeline by focusing on likely prospective customers.
Business classification systems evolve over time in response to business trends. For example, the development of computers led to a significant expansion of the Standard Industrial Classification (“SIC”) system used in the United States to cover multiple areas related to computing. Typically, governments require businesses to self-assign classification codes, a process that is prone to error and omission. This is especially the case if a company has multiple lines of business or if the primary focus of the business changes over time.
The Internet constitutes a new source of information for determining and assigning business classification codes for a company. However, some sources are better suited than others to serve this purpose. For example, when a company applies for a place in a business directory, it quite often provides a description of the company's line of business. A company web site is probably the richest and most detailed source of information for automatic classification code assignment.
The use of data mined from the Internet to determine and assign business classification codes has been known for a number of years. Such information is especially important for companies that provide business information to other companies. For example, InfoGroup has been doing manual and semi-automatic SIC and North American Industry Classification System (“NAICS”) code assignments using on-line company descriptions for a number of years. More recently, other types of businesses have started doing this, such as insurance companies that assess risk for business insurance based on a company's business classification.
For example, US Patent Publication No. US 20120290330 A1, entitled “System and method for web-based industrial classification”, describes methods for determining risk-related business classification using business information obtained from the Internet. That publication describes a method that combines manual classification code assignment with classic natural language processing techniques and machine learning based clusterization.
A key drawback of methods such as those described in the foregoing publication, however, arises from the complex nature of web pages on the Internet. In particular, to be useful for the code assignment task, the information contained on a web page must be properly attributed to a particular business entity, and contradictory information contained on different web pages must be resolved. For example, extraneous elements on web pages, such as advertisements, introduce a high level of noise. Unless such noise is resolved, the resulting clusterization and corresponding code assignments may be highly inaccurate.
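By way of illustration only, the kind of noise rejection discussed above may be sketched as follows. This is a minimal sketch, not the method of the foregoing publication or of any particular system. The names `AD_HINTS`, `AdStrippingParser`, and `extract_company_text` are hypothetical, and the assumption that advertisements can be recognized from `class` or `id` attribute tokens is a deliberate simplification. The sketch also assumes well-formed markup in which every non-void element is explicitly closed.

```python
from html.parser import HTMLParser

# Hypothetical attribute tokens that suggest advertising content.
# A practical system would need far richer rules than this short list.
AD_HINTS = ("ad", "ads", "advert", "sponsored", "banner")

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class AdStrippingParser(HTMLParser):
    """Collects page text while skipping scripts, styles, and ad-like elements."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside an element being discarded
        self.chunks = []      # text fragments that survived filtering

    def _looks_like_ad(self, attrs):
        # Split class/id values into tokens and compare against AD_HINTS.
        for name, value in attrs:
            if name in ("class", "id") and value:
                tokens = value.lower().replace("-", " ").replace("_", " ").split()
                if any(token in AD_HINTS for token in tokens):
                    return True
        return False

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Once inside a discarded element, every nested element deepens the skip.
        if self.skip_depth or tag in ("script", "style") or self._looks_like_ad(attrs):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.chunks.append(text)


def extract_company_text(html):
    """Return the text of a page with ad-like and non-content elements removed."""
    parser = AdStrippingParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample_page = (
    '<html><body>'
    '<div class="ad-banner"><p>Buy cheap tires!</p></div>'
    '<p>Acme makes industrial pumps.</p>'
    '<script>var x = 1;</script>'
    '</body></html>'
)
print(extract_company_text(sample_page))
```

In this sketch, only the text remaining after filtering would be passed on to a classifier. Attribute-based filtering of this kind does not address attribution of content to a particular business entity or resolution of contradictory information across pages, both of which remain open in the discussion above.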
Additional difficulties arise when a company has multiple different lines of business, which is typical for large corporations, especially multi-national corporations. The inability to account for the interference between descriptions of different lines of business creates an additional high level of noise that may result in unreliable business classification code assignments, especially where a machine learning technique is used. Accordingly, in order to distinguish one line of business from another, or one corporate division from another, a more in-depth analysis and classification of web pages is needed than is currently available using previously-known methods and systems. None of these drawbacks is addressed by previously-known computer-assisted business classification code assignment systems.
In view of the many drawbacks of previously-known systems and methods, it would be desirable to provide apparatus and methods that overcome such drawbacks. In particular, it would be desirable to provide a computer-assisted business classification code assignment system and methods that can mine data presented on an Internet web page and correctly attribute information relevant to the company of interest while rejecting extraneous information, such as advertising contained on the web page.
It further would be desirable to provide a computer-assisted business classification code assignment system and methods that can mine data presented on an Internet web page and differentiate and properly attribute information relevant to the business division of the company of interest from information relating to other divisions of the same company.