Current socially curated networks contain unstructured information that often lacks the metadata associated with the images users have uploaded or captured from other websites using a widget or extension. This unstructured data makes it difficult to index, search, and compare items on the social network. Search results on product search engines typically include duplicate products from different retailers, and they typically omit manufacturer records, which normally contain the most complete set of product attributes, including specifications. It is therefore difficult to compare products even when they can be found on an aggregated website, since the detailed product information is missing. Shopping engines typically present relatively little information about the products in their search results. A formal definition of information retrieval is finding documents, typically unstructured text, that match a query from within a large body of indexed documents. In short, the current search process for products at shopping engines, retailers, manufacturers, and socially curated product sites is less efficient than it could be.
Users save data from product web pages using widgets, buttons, or browser extensions from socially curated sites such as Wanelo, Pinterest, and Clipix. These sites allow a user to save a page title, select a picture, and select a price on the page to add to a list. However, socially curated sites do not create a template for the data record, nor do they extract, transmit, or store the entire data record from the remote web page. Because the socially curated website does not receive the entire data record, no cleaning, classification, or normalization is performed. Currently, socially curated sites do not perform semantic analysis of the text extracted from the remote website to create the data records displayed in a user's collection. The one data value they may extract automatically is the price nearest the product image. They do not extract complete information from web pages, associate semantically analyzed text with data field names, and store the information in data records. An example of text with semantic meaning is a token or tokens of alphabetic characters representing a manufacturer name. Consequently, there is a need for semantic analysis after the text associated with a data field name is extracted from the page.
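The kind of semantic analysis described above can be sketched as matching extracted tokens against dictionaries of known field values. The field names, regular expression, and manufacturer list below are illustrative assumptions, not part of any particular site's schema.

```python
import re

# Hypothetical lookup table of known manufacturer names (assumption for illustration).
KNOWN_MANUFACTURERS = {"acme", "globex", "initech"}

def tag_tokens(text):
    """Associate extracted text tokens with data field names via simple semantic rules."""
    record = {}
    # Tokenize into alphabetic runs and (possibly dollar-prefixed) numbers.
    for token in re.findall(r"[A-Za-z]+|\$?\d+(?:\.\d{2})?", text):
        if token.lower() in KNOWN_MANUFACTURERS:
            record["manufacturer"] = token   # alphabetic token with semantic meaning
        elif token.startswith("$"):
            record["price"] = token          # price-like token
    return record

print(tag_tokens("Acme Widget Deluxe $19.99"))
# {'manufacturer': 'Acme', 'price': '$19.99'}
```

A production system would use richer rules (proximity to the product image, page structure, larger dictionaries), but the principle of mapping tokens to data field names is the same.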
Socially curated networks contain unstructured data that was captured on remote sites and saved to user collections. “Unstructured data,” in the case of product records, means data that is not organized into name/value pairs such as “price” and “$10”. Sites such as Pinterest, Wanelo, and Shopcade extract the title of the page, search for an image near the top of the page or let the user select one, and search for a price near the selected image. They send the extracted record to their popup, the user selects a collection, and the record is then added to that collection. These socially curated sites neither have a pre-defined template nor make a template for the product sites. As a consequence, a robot or user cannot revisit a site, extract the full product record using a previously created template, and build a product database on the social network.

Structured data is typically stored in relational databases or some other form of table structure that may be hierarchical and may have relationships between tables. Structured data in web pages has a layout that is repetitive from document to document, and that layout can be represented with a template. Structured databases are used to generate product pages at manufacturer and retailer websites; the product pages contain most or all of the same information as the product record in the database. The product web page is generated with a template, and the product record is embedded in a markup structure (HTML) in each page. The structure holding the product record may vary slightly from page to page due to differences such as the presence of a sale price on one page and not another, a variable number of specifications, or advertisements. Capturing the product record on any web page at the same site is then a matter of knowing the layout of the structure that contains the product record.
Templates containing XPaths and semantic information (the data field names) have been used in existing solutions to capture and save web-based information for purposes such as analyzing the information and using it in reports.
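Such a template can be represented as a mapping from data field names to XPaths that is evaluated against each product page at the site where it was created. The field names, XPaths, and sample page below are hypothetical; the sketch uses Python's stdlib ElementTree, which supports a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical template: data field name -> XPath, recorded when the template was created.
TEMPLATE = {
    "title": ".//h1",
    "price": ".//span[@class='price']",
    "manufacturer": ".//span[@class='brand']",
}

# Illustrative well-formed product page generated from a site template.
PAGE = """<html><body>
  <h1>Deluxe Widget</h1>
  <span class='brand'>Acme</span>
  <span class='price'>$19.99</span>
</body></html>"""

def extract_record(page, template):
    """Apply a stored template to a product page, yielding name/value pairs."""
    root = ET.fromstring(page)
    record = {}
    for field_name, xpath in template.items():
        node = root.find(xpath)
        if node is not None and node.text:
            record[field_name] = node.text.strip()
    return record

print(extract_record(PAGE, TEMPLATE))
# {'title': 'Deluxe Widget', 'price': '$19.99', 'manufacturer': 'Acme'}
```

Because the template pairs each XPath with a semantic field name, the output is a structured name/value record rather than raw text.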
Other social networks that place buttons on remote websites to capture information from a web page normally send links or small amounts of data from the remote page, for example via Facebook Like or Twitter Tweet buttons (shortened URLs), to their respective destinations, Facebook or Twitter. It would be beneficial to send complete data records from pages at remote sites, using a third-party predefined set of data field names and the corresponding data field values, for the purpose of creating user-curated data. There is also a need for a system that transmits the data records, cleans them, classifies them, normalizes them, stores them in a database, and displays them on a socially curated site.
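A complete data record of the kind described, built from a predefined set of data field names, might be serialized for transmission from the remote page to the social network. The field names and values below are an illustrative assumption, sketched here with JSON serialization.

```python
import json

# Illustrative complete data record; a real system would define its own field-name schema.
record = {
    "title": "Deluxe Widget",
    "manufacturer": "Acme",
    "price": "19.99",
    "currency": "USD",
    "specifications": {"weight": "2 lb", "color": "red"},
    "source_url": "http://www.example.com/products/deluxe-widget",
}

payload = json.dumps(record)    # serialize the full record for transmission
restored = json.loads(payload)  # deserialize on the receiving side
assert restored == record       # the entire record survives, not just a shortened URL
```

Transmitting the whole record, rather than a link, is what makes server-side cleaning, classification, and normalization possible.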
Users on Twitter and Facebook tweet and post messages about brands and products. The messages can be classified into types such as customer service or product durability. Two or more product records can be compared by the user in the social shopping network, and the comparison can be saved to the user's list of product comparisons. The comparison process may require the user to normalize the data field names or the specification attribute names.
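Normalizing data field names or specification attribute names before comparison can be sketched as mapping site-specific names onto canonical ones. The synonym table below is a hypothetical example.

```python
# Hypothetical synonym table mapping site-specific attribute names to canonical names.
CANONICAL = {
    "mfr": "manufacturer",
    "brand": "manufacturer",
    "wt": "weight",
    "weight (lbs)": "weight",
}

def normalize(record):
    """Rename attribute names so two records can be compared field by field."""
    return {CANONICAL.get(k.lower(), k.lower()): v for k, v in record.items()}

a = normalize({"Brand": "Acme", "Wt": "2 lb"})
b = normalize({"mfr": "Acme", "weight (lbs)": "2 lb"})
print(a == b)  # True: the two records now share attribute names
```

Once the attribute names agree, a field-by-field comparison of two product records is straightforward.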
The Document Object Model (DOM) is a cross-platform and language-independent convention for representing and interacting with objects in HTML, XHTML and XML documents. Objects in the DOM tree may be addressed and manipulated by using methods on the objects. The public interface of a DOM is specified in its application programming interface (API). The HTML DOM defines a standard way for accessing and manipulating HTML documents. The HTML structure is represented as a tree.
When a page is loaded into a browser, the browser's Document Object Model (DOM) is constructed. The DOM is a tree-like representation of the HTML hierarchy, attributes, visible text, and other information in the HTML page. FIG. 1 shows an example HTML tree. At the top is the HTML document 101; beneath it are the root element 102, the head element 103, the title element 104, the text associated with the title 105, the body element 106, and the href attribute 107. The <a> element 108 contains text associated with the link 110, and the <h1> element 109 contains text associated with the header.
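A tree of the same shape as FIG. 1 can be reconstructed with a standard HTML parser. The sketch below uses Python's stdlib html.parser as an illustrative stand-in for a browser DOM, collecting each element together with its depth in the hierarchy; the page content is a hypothetical example.

```python
from html.parser import HTMLParser

# Hypothetical page with the same element hierarchy as FIG. 1.
PAGE = ('<html><head><title>My title</title></head>'
        '<body><a href="http://example.com">My link</a>'
        '<h1>My header</h1></body></html>')

class TreeCollector(HTMLParser):
    """Record each element with its depth, reflecting the DOM tree structure."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.nodes = []
    def handle_starttag(self, tag, attrs):
        self.nodes.append((self.depth, tag))
        self.depth += 1
    def handle_endtag(self, tag):
        self.depth -= 1

p = TreeCollector()
p.feed(PAGE)
print(p.nodes)
# [(0, 'html'), (1, 'head'), (2, 'title'), (1, 'body'), (2, 'a'), (2, 'h1')]
```

The depth values make the tree explicit: head and body are children of html, while title, a, and h1 sit one level deeper, matching the hierarchy shown in FIG. 1.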
The website templates used for generating web pages that contain product records are created by individual designers and are typically not downloaded from a central source. Content management systems that are sold or downloaded contain templates that are customized by the web designer responsible for the creation of the website. Different sites may use the same content management system; however, the resulting HTML on two sites using the same content management system and templates is not necessarily the same. Moreover, it is generally not possible to know that two websites have used the same content management system and templates. Online shopping site generators offer stores different templates with which to generate their storefronts. Again, it is not possible to know which template was used to generate a storefront, and storefronts can be customized, which leads to differences between two storefronts generated from the same template. It would therefore be beneficial to have a system that uses crowd-sourced web page data record template creation to build a database of web page templates, which could then be used by others to extract information from the web pages at the site where the template(s) were created and to save the information to a social network. Moreover, there is a need for a crowd-based web page data record template creation and storage system that could be used to create templates for batch extraction of information from remote websites. Furthermore, there is a need for a system that uses the data record information extracted from a web page to find the same or similar products at other websites in a central product record database created with the previously mentioned batch extraction system.
Search engines index words and phrases. Search engines have attempted to extract structured data from web pages using special markup such as micro formats. The web designer inserts the micro formats to identify the data records in the web pages. The search engine crawls the site and examines the pages for the presence of micro formats, which identify the data field values using a set of data field names. The micro formatted data is extracted into a data structure, which is then inserted into a database or data table. The database or data table can then be further indexed to provide better search results for end users, making it possible to identify product pages with fine-grained searches that contain detailed information. However, webmasters have not embraced micro formats, and only a small percentage of websites currently use micro formats or any of the other industry-standard structured data formats designed to assist conventional search engines in extracting structured data. In short, the structured data formats are not being inserted into the pages.
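Markup of this kind names each data field value directly in the page, for example with an itemprop attribute in HTML microdata, so a crawler can collect attribute/text pairs without a site-specific template. The product fragment below, using schema.org-style property names, is an illustrative assumption; the sketch parses it with Python's stdlib ElementTree.

```python
import xml.etree.ElementTree as ET

# Illustrative well-formed page fragment marked up with microdata-style attributes.
PAGE = """<div itemscope="" itemtype="http://schema.org/Product">
  <span itemprop="name">Deluxe Widget</span>
  <span itemprop="brand">Acme</span>
  <span itemprop="price">$19.99</span>
</div>"""

def extract_microdata(page):
    """Collect itemprop names and their text into a structured data record."""
    root = ET.fromstring(page)
    return {el.get("itemprop"): (el.text or "").strip()
            for el in root.iter() if el.get("itemprop")}

print(extract_microdata(PAGE))
# {'name': 'Deluxe Widget', 'brand': 'Acme', 'price': '$19.99'}
```

The extraction works only because the web designer embedded the field names in the markup; on the large majority of pages without such markup, this approach yields nothing.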
Extracted information may be combined by inserting it into a spreadsheet and manually normalizing the data to produce a report. XPath, the XML Path Language, is a query language for selecting nodes from an XML document. In addition, XPath may be used to compute values (e.g., strings, numbers, or boolean values) from the content of an XML document. XPath was defined by the World Wide Web Consortium (W3C). Tag pairs in an HTML product page contain text, and that text can be product record data field names and values. The XPath, data field name, and data field value are derived from a template and a data record.
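Node selection of the kind XPath provides can be sketched with Python's stdlib ElementTree, which implements a subset of XPath; the document below is an illustrative assumption. A full XPath engine can compute the numeric value directly in the expression (e.g., sum(/products/product/price)), which the sketch does on the Python side instead.

```python
import xml.etree.ElementTree as ET

# Illustrative XML document holding two product records.
DOC = """<products>
  <product><name>Widget</name><price>10.00</price></product>
  <product><name>Gadget</name><price>15.50</price></product>
</products>"""

root = ET.fromstring(DOC)

# Select nodes with an XPath-subset expression.
names = [n.text for n in root.findall("./product/name")]

# Compute a numeric value from the content of the selected nodes.
total = sum(float(p.text) for p in root.findall(".//price"))

print(names)   # ['Widget', 'Gadget']
print(total)   # 25.5
```

The same selection mechanism underlies the templates discussed above: each data field name is tied to an XPath that locates its value in the page.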
Kapow has web data extraction capabilities for a single website using wrapper technology. It also has data normalization and data transformation capabilities, including for text and code strings, numbers, dates and times, and HTML/XML.
Fetch.com compares pairs of pages using algorithmic “experts” (i.e., computer algorithms) to find similarities between the pages, forms clusters from matching pairs, extracts the data from the clusters, and stores the data in a database (Publication number EP1910918 A2).