The Internet is a public, self-sustaining, worldwide system of computer networks. The most widely used part of the Internet is the World Wide Web, often referred to as just “the web”. The web is an Internet service that organizes information through the use of hypermedia, utilizing markup languages such as Hyper Text Markup Language (HTML) and Extensible Markup Language (XML).
In this context, an HTML file is a file that contains the source code for a particular web page. A web page is the image or collection of images that is displayed to a user when a particular HTML file is rendered by a browser application program. Unless specifically stated, an electronic or web document may refer to either the source code for a particular web page or the web page itself. Each page can contain embedded references to images, audio, video or other web documents. The most common type of reference used to identify and locate resources on the Internet is the Uniform Resource Locator, or URL. In the context of the web, a user, using a web browser, browses for information by following references that are embedded in each of the documents. The Hyper-Text Transfer Protocol (“HTTP”) is the protocol used to access a web document and the references that are based on HTTP are referred to as hyperlinks (formerly, “hypertext links”).
Many manufacturers (also referred to as brands) and retailers (also referred to as stores) of products post product information on web pages. Product information may be coded manually into web pages or populated automatically from a back-end data store through the use of templates in a Content Management System (CMS).
Search Engines.
It is estimated that the publicly indexable web provides access to over 11.5 billion pages of information. However, a significant drawback with using the web is that because there is so little organization to the web, it can be extremely difficult for users to locate the particular pages that contain the information that is of interest to them. To address this problem, a mechanism known as a “search engine” has been developed to index a large number of web pages and to provide an interface that can be used to search the indexed information by entering certain words or phrases to be queried. These search terms are often referred to as “keywords”.
Search engines, such as Google and Bing, generally employ a “crawler” (also referred to as a “web crawler”, “spider”, or “robot”) to “crawl” across the Internet in a methodical and automated manner to locate web documents. Upon locating a document, the crawler stores the document's URL and follows any hyperlinks associated with the document to locate other web documents. The search engines generally extract and index certain information about the documents that were located by the crawler. In general, index information is generated based on the contents of the HTML file associated with the document. The search engine stores the index information in large data stores that are made available for users to query through a user interface. For example, the search engine interface allows users to specify their search criteria pertaining to certain product information (e.g., keywords) and, after performing a search, the search engine provides an interface for displaying the search results.
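The crawl-and-index behavior described above can be illustrated with a minimal sketch. It assumes a small in-memory stand-in for the web (the `PAGES` dictionary, its URLs, and the function name `crawl_and_index` are hypothetical and used only for illustration); an actual crawler would fetch each document over HTTP and parse its HTML.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing hyperlinks).
PAGES = {
    "http://example.com/": ("camera store home", ["http://example.com/slr"]),
    "http://example.com/slr": ("digital slr camera", ["http://example.com/"]),
}

def crawl_and_index(seed, pages):
    """Visit pages breadth-first and build a keyword -> URLs index."""
    index = {}
    seen = {seed}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        text, links = pages[url]
        # Index information is derived from the contents of the document.
        for word in set(text.split()):
            index.setdefault(word, []).append(url)
        # Follow hyperlinks to locate other documents, visiting each once.
        for target in links:
            if target in pages and target not in seen:
                seen.add(target)
                queue.append(target)
    return index

index = crawl_and_index("http://example.com/", PAGES)
```

Looking up a keyword such as `index["slr"]` then returns the URLs of the documents in which that keyword was found.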
Since search engines are optimized for general search queries, it can be difficult for users to extract product information. Search engines do not provide structured data, search by category or specification attributes is not supported, results tend to be skewed to popular items, and the search engine generally returns URLs to web pages wherein the product information is not displayed uniformly.
Shopping search engines, such as Nextag and Froogle, are search services that attempt to address some of the deficiencies of search engines in locating product information. While these services do have some capability to search by category or specification attributes, these services may not include all relevant websites in their index and may be restricted in the degrees to which the specification attributes may be refined, resulting in incomplete results.
Social Networking Services.
A social networking service is an online service, platform, or site that focuses on building and reflecting social networks or social relations among people who, for example, share interests and/or activities. Social network services, such as Facebook and Twitter, essentially consist of a representation of each user (often a profile), the user's social links, and a variety of additional services. Most social network services are web based and provide means for users to interact over the Internet, such as e-mail and instant messaging, in-site messages displayed on users' home pages, location based messages, and multimedia sharing such as photos and videos. Social networking sites allow users to share ideas, activities, events, and interests within their individual networks.
Many users of social networking services express their opinions about products through the services, including complaints about products, positive experiences with products, and problems encountered with products. Depending on the status of an expressing user, their particular opinion may carry more or less significance to other users of the social networking service.
Many social networking services are not part of the publicly indexable web. As a result, users seeking information contained within the social networking service need to utilize the search capabilities of the social networking service. For example, social networking services generally provide an interface which allows users to specify their search criteria pertaining to certain product information (e.g., keywords) and, after performing a search, the social networking service provides an interface for displaying the search results, which may contain other users' opinions and experiences related to the product information.
Since social networking services are optimized for general search queries, it can be difficult for users to extract organized product information. The social networking service does not organize the information, rate opinions, evaluate sentiment, discern experts, or relate any of the information to structured data.
As a result, in order to obtain structured product information enhanced with social networking information from users and experts in order to make an informed purchasing decision, currently a user would have to perform multiple searches on existing search engines, and then perform multiple searches on social networking services, and then somehow combine the results of those multiple searches in some meaningful way. What is needed is an effective way to extract and combine structured data from websites with relevant data from social networking services along with an interface so that a user can perform a single query to obtain highly relevant information. Some conventional search engines work as voting machines that gather links and calculate the relative popularity of the links and return answers to user queries based on the popularity of the links. The user queries are answered with pages of links which the user can spend a lot of time to sort through manually. Some other conventional search engines extract information and build aggregated data stores that are not complete and contain many errors.
An advantage of the present invention is the improved quality of search results. The structured and social data aggregator returns pre-organized and relevant information that is sorted by specification attributes and that contains both quantitative data and qualitative data. The conventional search technologies, in contrast, typically return a list of web addresses that may or may not contain relevant search information. The results are often inaccurate, incomplete, or biased by paid inclusion.
Another advantage of the present invention is the automatic creation of the data store encompassing a plurality of web sites and social networking services. The present invention provides a novel method of providing aggregated data by extracting structured data from web pages by crawling, finding, extracting, normalizing and classifying content from web pages, rating social networking information from social networking services by crawling, finding, extracting, rating and classifying content from social networking services, and merging both sets of data in a data store. The disclosed structured and social data aggregator provides a more efficient extraction and rating process, and provides a more comprehensive and accurate aggregated data store.
Another advantage of the present invention is that user generated templates are automatically converted to extraction templates which can be used to extract data records from product pages.
Another advantage of the present invention is the automatic identification of popular products, deals, and social sentiment about products. The system crawls a social site or uses a data feed to find messages containing products, brands, and stores on a social site. The system can then identify links in those messages, follow the links, and identify product pages that information can be extracted from using the templates described above.
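As a minimal sketch of the message-scanning step described above, the following code keeps only the messages that mention a known brand and collects the links they contain. The brand list, the sample messages, and the function name `product_links` are hypothetical; the simple regular expressions stand in for whatever analyzers an actual embodiment would use.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")
WORD_PATTERN = re.compile(r"[a-z0-9]+")

# Hypothetical set of known brand names, lower-cased.
BRANDS = {"acme"}

def product_links(messages, brands):
    """Return links found in messages that mention at least one known brand."""
    links = []
    for text in messages:
        words = set(WORD_PATTERN.findall(text.lower()))
        if words & brands:
            for url in URL_PATTERN.findall(text):
                # Drop trailing punctuation that is not part of the link.
                links.append(url.rstrip(".,;)"))
    return links

MESSAGES = [
    "Love my new Acme blender http://shop.example/p/1.",
    "Nice weather today http://example.org/x",
]
found = product_links(MESSAGES, BRANDS)
```

The links collected this way are the candidates that the system would follow to locate product pages for template-based extraction.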
Crawling other social networks can be done in several ways. The system can perform a conventional crawl and start at the root of the site. The crawler can use a list of popular users to seed a crawl that extracts the list of followers and then repeats the process of finding the next set of followers. Then each user's social messages are downloaded and analyzed to find content which can be classified. The analyzers identify social messages that contain names of brands, products, stores, model numbers of products, and other brand and store identification information.
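The follower-seeded crawl described above can be sketched as a repeated expansion from a list of popular users. The follower graph, the user names, and the `rounds` parameter are hypothetical; a real system would obtain follower lists from the social site or a data feed.

```python
# Hypothetical follower graph: user -> list of that user's followers.
FOLLOWERS = {
    "popular_user": ["ann", "bob"],
    "ann": ["cat"],
    "bob": ["ann"],
    "cat": [],
}

def expand_followers(seeds, followers, rounds):
    """Start from seed users and repeatedly add the next set of followers."""
    known = set(seeds)
    frontier = list(seeds)
    for _ in range(rounds):
        next_frontier = []
        for user in frontier:
            for follower in followers.get(user, []):
                if follower not in known:
                    known.add(follower)
                    next_frontier.append(follower)
        frontier = next_frontier
    return known

users = expand_followers(["popular_user"], FOLLOWERS, rounds=2)
```

Each user in the resulting set is one whose social messages would then be downloaded and analyzed.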
Social messages may be re-tweeted on Twitter, re-pinned on Pinterest, and shared on Facebook. Many users may have the same social message about the same brand, product, or store on their newsfeed, wall, and/or board. Messages can also be cross-posted to other social networks. Identifying the group of people who have the same social message about the same brand, store, or product reveals a common interest, opinion, or thought about that brand, product, or store; we will call such a group a single social message interest cluster. When two or more users appear in more than one interest cluster, the users share the same or similar interests, opinions, or thoughts about the brand(s), product(s), or store(s); we will call this a multiple social message interest cluster. The walls or newsfeeds belonging to the users in the cluster contain the same social message, which is identified as a positive or negative opinion, interest, or thought about some social, consumer, or rich attribute of the brand, product, or store, and not merely as a general positive or negative comment of the kind that most social message analysis sites assign to social messages. The third cluster type is the union of two or more social message interest clusters that share some of the same people and the same brand, store, or product but which also contain different people. The fourth type of cluster is the union of two or more social message interest clusters that share the same opinions about the same brand, store, or product but which contain different people. The fifth type of cluster is the union of two or more social message interest clusters that share the same opinions but concern different brands, stores, or products and contain different people. The sixth type of cluster is formed using product categories, where social messages about brands, products, or stores that belong to the same product category are clustered.
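The first two cluster types can be illustrated with a minimal sketch. It assumes that each observation has already been reduced to a hypothetical (user, entity, message) triple, where the entity is a brand, product, or store; the sample data and function names are illustrative only.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical observations: (user, brand/product/store, social message).
POSTS = [
    ("ann", "AcmeCam", "great battery life"),
    ("bob", "AcmeCam", "great battery life"),
    ("ann", "ShoeCo", "runs small"),
    ("bob", "ShoeCo", "runs small"),
    ("cat", "ShoeCo", "runs small"),
]

def single_clusters(posts):
    """Group users who carry the same message about the same entity."""
    clusters = defaultdict(set)
    for user, entity, message in posts:
        clusters[(entity, message)].add(user)
    return dict(clusters)

def multiple_clusters(clusters):
    """Find user pairs that appear together in more than one single cluster."""
    shared = defaultdict(list)
    for key in sorted(clusters):
        for pair in combinations(sorted(clusters[key]), 2):
            shared[pair].append(key)
    return {pair: keys for pair, keys in shared.items() if len(keys) > 1}

singles = single_clusters(POSTS)
multiples = multiple_clusters(singles)
```

In this sample, `ann` and `bob` appear together in both single clusters and therefore form a multiple social message interest cluster, while `cat` appears in only one.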
Different category clusters are joined by user interests to form related clusters. User opinions, interests, and thoughts are therefore used to join clusters. Users who do not appear in all joined clusters can be inferred to share similar interests with the users that are in all clusters. If user A is in clusters 1 and 2, user B is in cluster 1, and user C is in cluster 2, then it can be inferred that users B and C have similar interests even though they do not appear in the same cluster.
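A minimal sketch of this inference follows. Two clusters are treated as joined when they have at least one user in common, and any two users drawn from the joined clusters who never appear together in a single cluster are reported as having inferred similar interests. The cluster numbering and the function name `inferred_pairs` are hypothetical.

```python
# Hypothetical clusters: cluster id -> set of member users.
CLUSTERS = {1: {"A", "B"}, 2: {"A", "C"}}

def inferred_pairs(clusters):
    """Return user pairs bridged by a shared member but never co-clustered."""
    # Pairs that already appear together in some cluster.
    direct = set()
    for members in clusters.values():
        direct |= {(u, v) for u in members for v in members if u < v}
    inferred = set()
    ids = sorted(clusters)
    for i in ids:
        for j in ids:
            # Clusters are joined when they have at least one common user.
            if i < j and clusters[i] & clusters[j]:
                for u in clusters[i]:
                    for v in clusters[j]:
                        pair = tuple(sorted((u, v)))
                        if u != v and pair not in direct:
                            inferred.add(pair)
    return inferred

pairs = inferred_pairs(CLUSTERS)
```

With the sample clusters, the only pair joined through the shared member and not already present in a common cluster is the one returned in `pairs`.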
Inference of relationships and similar interests between users with the same fine grained social opinions, thoughts, and interests can be weighted by the distance between the users and the number of shared social opinions, thoughts, and interests. Two or more users can express the same social opinion, thought, and/or interest using synonyms. Social opinions, thoughts, and/or interests about brands, products, and stores can be interpreted at a general level (i.e. overall positive or negative) or can be interpreted at a fine grained level with respect to some particular aspect about the brand, store, and/or product.
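The weighting is not given as a formula in the text, so the following is only one possible sketch under stated assumptions: opinions are normalized through a hypothetical synonym table, and the weight is taken to be the number of shared opinions divided by the distance between the users (for example, the number of hops in the social graph, assumed to be at least 1).

```python
# Hypothetical synonym table mapping words to a canonical form.
SYNONYMS = {"awesome": "great", "terrific": "great"}

def normalize(opinion):
    """Lower-case an opinion and replace synonyms with their canonical word."""
    return " ".join(SYNONYMS.get(word, word) for word in opinion.lower().split())

def inference_weight(opinions_a, opinions_b, distance):
    """Assumed weighting: more shared opinions and less distance weigh more."""
    shared = {normalize(o) for o in opinions_a} & {normalize(o) for o in opinions_b}
    return len(shared) / distance

weight = inference_weight(
    ["Awesome battery", "runs small"],
    ["great battery"],
    distance=2,
)
```

Any monotonic combination of the two factors could be substituted; the division used here is chosen only for simplicity.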
The social discovery of brands, stores, and products identifies the ones that are the most popular, useful, or interesting for consumers. One embodiment of the present invention finds the brands, products, and stores that consumers like the most and then extracts the product information from the pages on which the products are found.
Newer social sites such as Polyvore, Wanelo, and Pinterest are image driven. The social messages on these sites may not contain any meta-information about the brand, product, store, and/or related rich attributes. The social message also may not contain the link back to the original source. If there is a product link, one embodiment of the present invention can extract the product information. If there is an image, then one embodiment of the present invention can attempt to match the image to an image associated with a product in the data store using well-known image matching techniques. Brand, store, advertiser, publisher, and social sites can modify images from their original form. Some of the image modifications include cropping, scaling, conversion from color to greyscale, conversion from one image format to another image format (e.g. jpg to png conversion), and adding watermarks for copyright protection and other reasons. This is not a comprehensive list of the modifications that can be made to images. Images without meta-information are less valuable to advertisers, brand managers, and other product related professionals and services. Images without meta-information but which contain social comments about brands, products, or stores are more valuable when the images are matched to a brand, product, and/or store data record in the data store. The social messages in the product record can be used to rate the product. Messages may contain opinions, thoughts and/or interest levels. The messages can be used to compare the brand, product, and/or store in the image to social messages about other brands, products, and/or stores. The image can be used to normalize the information about the brand, product, and/or store with other brand, product, and/or store information. Identifying the data record in a third party data store that matches the brand, product, and/or store increases the value of the social information associated with the image.
Images can be identified as brand, product, or store images by following the link from the image to the original source. If the image was sourced from a third party, such as Google, then the original source can be found by following a second link back to the original source of the image.
Advertisers can use the meta-information associated with an image to target ads for the user. If there is no meta-information associated with the image on the social site, then the addition of the meta-information, through the methods described above, enables advertisers to match relevant ads to images on the social site that originally had no meta-information. The social sentiment analysis of the user comments enables the advertiser to further refine the ad that is served to the user when viewing the social page. If the user comments are positive about the brand, product, or store, then an ad that is related to the brand, product, or store can be shown. Otherwise, if the comments are negative about the brand, product, or store, then an ad from the same category about a brand, product, or store that has positive opinions, interests, or thoughts about it can be shown. The selection of the brand, product, or store in the ad can be based on a broad set of opinions from a general set of users, or can be based on the opinions of users on the social network who are found to have the same opinions, interests, or thoughts via the clustering mechanism described above or some other social opinion, interest, or thought matching algorithm, in order to find the ad content most suitable for the user. Further refinements to the ad selection algorithm can be made using the location of the users in the cluster. If the users in a cluster are found to be in the same locality, state, or country, or to have the same sex, language, or other characteristics, then this information in combination with the fine grained social opinions can be used to serve the ads. In order to match ads with clusters, the ads themselves need meta-information about the brand, product, and/or store as well as information about the type of message that the ad is aimed at conveying to the user.
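The sentiment-based ad selection rule described above can be sketched as follows. The ad inventory, its field names, the numeric sentiment scale, and the decision to return no ad for neutral comments are all assumptions made for illustration.

```python
# Hypothetical ad inventory with meta-information and an aggregate sentiment.
ADS = [
    {"brand": "AcmeCam", "category": "cameras", "sentiment": -0.4},
    {"brand": "PhotoPro", "category": "cameras", "sentiment": 0.7},
    {"brand": "ShoeCo", "category": "shoes", "sentiment": 0.5},
]

def select_ad(brand, category, comment_sentiment, ads):
    """Pick an ad based on the sentiment of the comments about a brand."""
    if comment_sentiment > 0:
        # Positive comments: show an ad related to the same brand.
        for ad in ads:
            if ad["brand"] == brand:
                return ad
        return None
    if comment_sentiment < 0:
        # Negative comments: show a well-regarded alternative in the category.
        candidates = [
            ad for ad in ads
            if ad["category"] == category
            and ad["brand"] != brand
            and ad["sentiment"] > 0
        ]
        return max(candidates, key=lambda ad: ad["sentiment"], default=None)
    return None  # Neutral comments are not covered by the rule (assumption).

positive_case = select_ad("AcmeCam", "cameras", 0.8, ADS)
negative_case = select_ad("AcmeCam", "cameras", -0.6, ADS)
```

The `sentiment` value of each ad could come from a general set of users or from the users in the viewer's interest cluster, as the text describes.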
Automatic identification of products on image based social networking sites using product images is another advantage of an embodiment of the present invention. Users of social bookmarking sites like Pinterest add images, the URL for the image page, and the title of the page on which the image is located to their collections. The rich meta-information contained in the page that the social image points to often includes the product record (i.e. brand name, store name, price, product name, category, specifications, store and brand logos, product image, and URL of the product page), known from a source such as a data feed, a crawl, or user extraction via a widget. The product image which is extracted from the brand site has a unique numerical signature which can be computed using a well-known hashing algorithm. Product records are extracted from product pages and stored in a data store via a web crawl and automatic extraction process as described in a previous patent, a data feed from a publisher (a brand, merchant, or other data aggregator source, e.g. a product search engine such as Price Grabber), a user based extraction method based on a widget as described in this and previous patents, or other data collection methods. The images from the product record or the social bookmarking service can be stored in a file system using the hash of the name to construct a directory path and file name where the image is stored. A map can be constructed using the hash of the name as the key and the corresponding data record as the value.
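The hashing, storage path, and map described above can be sketched as follows. SHA-256 is used here only as an example of a well-known hashing algorithm; the directory layout (two levels taken from the first hexadecimal digits), the `.img` suffix, and the sample record are hypothetical.

```python
import hashlib
from pathlib import PurePosixPath

def digest(data: bytes) -> str:
    """Hex digest from a well-known hashing algorithm (SHA-256 assumed here)."""
    return hashlib.sha256(data).hexdigest()

def storage_path(hex_digest: str, root: str = "images") -> PurePosixPath:
    """Build a directory path and file name from a hash (layout is assumed)."""
    return PurePosixPath(root, hex_digest[:2], hex_digest[2:4], hex_digest + ".img")

# Hypothetical product record and image name.
record = {"brand": "Acme", "product_name": "X100", "price": "499.00"}
image_name = "http://brand.example/images/x100.jpg"

name_hash = digest(image_name.encode("utf-8"))
path = storage_path(name_hash)

# Map keyed by the hash of the name, with the data record as the value.
record_map = {name_hash: record}
```

Spreading files across subdirectories derived from the hash keeps any single directory from growing very large; the same `digest` function applied to the image bytes would yield a content signature for the image itself.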
Each product page of interest at a brand or store contains a product record. The same product image can be found on the Internet at more than one store or brand product page. Each data record contains the URL where the data record was found, and these URLs differ. Data records in the data store that come from pages at different URLs (i.e. the store sites and/or brand site) but that have the same product record can be related using the image hash that uniquely identifies the product record. Product records with the same image hash are clustered together. The product records in each product cluster are added to the cluster map. The cluster map key is the image hash and the value is the list of product records that contain the image hash. Clusters with different image hashes but some of the same meta-information from the page titles are compared to see if the clusters should be joined.
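A minimal sketch of the cluster map and the title-based comparison follows. The sample records, the placeholder hash values, and the `min_shared` threshold on common title words are hypothetical; the text does not specify how much shared meta-information is required before clusters are compared.

```python
from collections import defaultdict

# Hypothetical product records; image_hash values are placeholders.
RECORDS = [
    {"url": "http://store-a.example/p/1", "image_hash": "h1",
     "title": "Acme X100 Digital Camera"},
    {"url": "http://brand.example/x100", "image_hash": "h1",
     "title": "Acme X100"},
    {"url": "http://store-b.example/item/9", "image_hash": "h2",
     "title": "Acme X100 Camera Bundle"},
    {"url": "http://store-b.example/item/3", "image_hash": "h3",
     "title": "ShoeCo Runner"},
]

def build_cluster_map(records):
    """Key: image hash. Value: list of product records with that hash."""
    cluster_map = defaultdict(list)
    for rec in records:
        cluster_map[rec["image_hash"]].append(rec)
    return dict(cluster_map)

def title_tokens(cluster):
    """Collect the lower-cased title words of every record in a cluster."""
    tokens = set()
    for rec in cluster:
        tokens |= set(rec["title"].lower().split())
    return tokens

def join_candidates(cluster_map, min_shared=2):
    """Pairs of clusters whose titles share at least min_shared words."""
    keys = sorted(cluster_map)
    pairs = []
    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            common = title_tokens(cluster_map[a]) & title_tokens(cluster_map[b])
            if len(common) >= min_shared:
                pairs.append((a, b))
    return pairs

cluster_map = build_cluster_map(RECORDS)
candidates = join_candidates(cluster_map)
```

The candidate pairs are only flagged for comparison; whether they are actually joined would depend on a further check such as image analysis.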
Meta-information in the title and body of pages found at social bookmarking sites can be used to compare the information in two or more pages that may not contain exactly the same product images. The images may be derived from the same original image but differ due to cropping, the adding of watermarks, transformations, and other image alteration techniques. Detection of the object in images from the same original source can be done using a convolution filter or some other outline detection mechanism in conjunction with a pixel value range comparison after the images are aligned. If the images are from different sources, advanced image processing comparison techniques may be needed to compare the images because of different camera angles, lighting conditions, and camera properties.
The information extracted from social bookmarking site pages and from product records found in a data store at the local site is used to cluster different images of the same product. The textual information is used to find potentially similar product records. The images in the similar product records are then analyzed by the image processing service to join existing clusters and/or add products to clusters and/or create new clusters. Comparison of image signatures can thus be used in conjunction with limited, partial, and/or complete product record information to identify products in visual social bookmarking or catalog sites.
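As an illustrative sketch of outline detection followed by a pixel value range comparison, the code below applies a 3x3 Laplacian filter to two already aligned grayscale images and counts how many filtered values fall within a tolerance of each other. The kernel choice, the `tolerance` and `min_fraction` thresholds, and the tiny sample images are assumptions; alignment is taken as already done.

```python
# 3x3 Laplacian kernel, a common outline (edge) detection filter.
# It is symmetric, so correlation and convolution give the same result.
LAPLACIAN = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]

def convolve(image, kernel):
    """Filter the interior of a grayscale image given as a list of rows."""
    height, width = len(image), len(image[0])
    out = []
    for y in range(1, height - 1):
        row = []
        for x in range(1, width - 1):
            total = 0
            for dy in range(3):
                for dx in range(3):
                    total += kernel[dy][dx] * image[y + dy - 1][x + dx - 1]
            row.append(total)
        out.append(row)
    return out

def same_outline(a, b, tolerance=40, min_fraction=0.9):
    """Compare edge maps of two aligned images within a value range."""
    edges_a, edges_b = convolve(a, LAPLACIAN), convolve(b, LAPLACIAN)
    pairs = [(p, q) for ra, rb in zip(edges_a, edges_b) for p, q in zip(ra, rb)]
    close = sum(1 for p, q in pairs if abs(p - q) <= tolerance)
    return close / len(pairs) >= min_fraction

# Hypothetical 5x5 images: a bright square, a lightly altered copy, a blank.
original = [[0] * 5] + [[0, 200, 200, 200, 0] for _ in range(3)] + [[0] * 5]
altered = [row[:] for row in original]
altered[2][2] = 210  # small local change, such as a faint watermark
blank = [[0] * 5 for _ in range(5)]

matches_altered = same_outline(original, altered)
matches_blank = same_outline(original, blank)
```

A production system would more likely rely on an image processing library; the pure-Python loops here are kept only so that the sketch is self-contained.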
Matching images in a visual social catalog to a product record facilitates the serving of ads on the social catalog site, brand analytics on the social catalog site, and conversion of links on the social catalog site to affiliate marketing links for commission based programs. With an affiliate link, when the user clicks on the link to the page at the original site which contains the image, a cookie is set on the user's computer. If the user buys something at the site, the store pays a commission to the referring site. Additional advantages include adding meta-information about the product to the visible text on the page to give the viewer additional information about the product. Another advantage of the system is setting keywords in meta tags and descriptions for search engines to index. Other SEO and SEM advantages of adding keywords to pages are not described here but are well understood in the Internet community.
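The conversion of an ordinary product link into an affiliate link can be sketched as adding a referral identifier to the URL's query string. The parameter name `aff_id` and the identifier `site42` are hypothetical; actual affiliate programs each define their own link format.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def to_affiliate_link(url, affiliate_id, param="aff_id"):
    """Return the URL with a referral parameter added (or replaced)."""
    parts = urlsplit(url)
    # Keep existing query parameters except a previous referral value.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    query.append((param, affiliate_id))
    return urlunsplit(parts._replace(query=urlencode(query)))

link = to_affiliate_link("http://store.example/p/1?color=red", "site42")
```

When the user follows such a link, the store can recognize the referring site from the parameter and set the tracking cookie described above.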
Furthermore, the merging of structured data and social networking information greatly increases the accuracy of search results where qualitative results are desired. The probability of finding useful information in response to search keywords is significantly greater. Moreover, because the data store contains more complete information, such as numeric attribute information which describes the data store elements (e.g., the size of an object) and qualitative information (e.g., an expert's opinion of the durability of an object), searches can be conducted using general descriptions of the objects (e.g., search for a digital SLR which is within a certain dimension range and longevity) or searches can be conducted using the category, brand, store, and social rating of the objects. Conventional search engines, by contrast, return results that require the user to manually validate, sort, and filter the search results. In the case of conventional search engines that return links based on popularity, the user must search through the list of links to find relevant web pages and manually search social networking services to find corresponding qualitative data.
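A search over merged structured and social data can be sketched as a filter on numeric attributes combined with a social rating. The product records, the field names (`width_mm`, `social_rating`), and the rating scale are hypothetical.

```python
# Hypothetical merged records: structured attributes plus a social rating.
PRODUCTS = [
    {"name": "Acme X100", "category": "digital SLR",
     "width_mm": 130, "social_rating": 4.5},
    {"name": "PhotoPro Z", "category": "digital SLR",
     "width_mm": 155, "social_rating": 4.8},
    {"name": "Acme Mini", "category": "compact",
     "width_mm": 95, "social_rating": 4.1},
]

def search(products, category, max_width_mm, min_rating):
    """Filter by category, a dimension limit, and a minimum social rating."""
    hits = [
        p for p in products
        if p["category"] == category
        and p["width_mm"] <= max_width_mm
        and p["social_rating"] >= min_rating
    ]
    # Present the best socially rated products first.
    return sorted(hits, key=lambda p: p["social_rating"], reverse=True)

results = search(PRODUCTS, "digital SLR", max_width_mm=140, min_rating=4.0)
```

A single query of this kind replaces the separate keyword searches and manual filtering that the conventional approach requires.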
Other goals and advantages of the invention will be further appreciated and understood when considered in conjunction with the following description and accompanying drawings. While the following descriptions may contain specific details describing particular embodiments of the invention, these should not be construed as limitations to the scope of the invention but rather as exemplifications of preferable embodiments. For each aspect of the invention, many variations are possible as suggested herein that are known to those of ordinary skill in the art. A variety of changes and modifications can be made within the scope of the invention without departing from the spirit thereof.