Some embodiments described herein relate to a method for compiling a unique sample code for specific web content. Some embodiments described herein also relate to a method for providing specific web content with such a unique sample code. Some embodiments described herein further relate to a method for gaining access to specific web content provided with such a unique sample code. Some embodiments described herein moreover relate to a method for indexing web content in a search engine. Some embodiments described herein additionally relate to a method of processing an Internet search query using a search engine having indexed web content according to the above method. Some embodiments described herein further relate to an index repository for use in the above method. Some embodiments described herein also relate to a method for gaining access to specific web content provided with a unique sample code by using a search engine having indexed web content according to the above method. Some embodiments described herein also relate to a computer-readable medium with computer-executable instructions which, when loaded onto a computer system, provide the computer system with the functionality of any of the aforementioned methods. Some embodiments described herein additionally relate to a sample code as compiled by the above method. Some embodiments described herein further relate to a system for compiling a unique sample code using the above method. Some embodiments described herein also relate to a system for handling a user's request for gaining access to specific web content provided with a sample code according to some embodiments.
Internet users wishing to retrieve information from the World Wide Web (WWW) will often submit a query containing search words to an Internet search engine. Such a search engine will provide a user with a result list of websites, or items contained in websites, in response to a query from the user. A result list will contain references to websites, or parts of websites, which the search engine considers to match the search terms. The match can be an exact match, or provision can be made for the search engine to provide near matches, near matches being determined by truncations, letter transpositions or letter replacements within the search terms. The result list is sorted based on how well web pages match the query and on the respective ranks associated with matching pages.
In order to obtain the information needed to be able to provide a user with a result list in response to a query, most search engines use computer programs called web crawlers or spiders to search the Internet, downloading web pages from servers. It is not possible, due to constraints in communication bandwidth and computing resources, for a web crawler to download every web page on the World Wide Web. Necessarily, search engines only search a subset of web pages. A number of different search prioritisation methods, such as breadth-first searching, may be used to ensure that the most valuable pages are downloaded as efficiently as possible. Typically the downloaded pages are stored temporarily in a memory device, such as, for example, a server's random access memory, to be processed by the search engine for indexing.
In order to produce an index for use by a search engine in responding to a user's query, the information from downloaded web pages is compressed, sorted and stored. The downloaded pages are first parsed and processed. Processing the information includes extracting words contained within the pages as well as the number of occurrences of the words, their location in the pages, font size and the like. Processing the information also includes extracting hypertext links included in the web pages. The processed information from a web page is stored such that it can be addressed according to the words contained within the page. The stored information is also used to rank the page, that is, to quantify how useful the page will be to a user based on the search terms of a query.
In order to rank the matches to a query, ranking algorithms are used which are usually based on simple link analysis techniques. These algorithms include HITS (Hyperlink-Induced Topic Search) and Google's PageRank. The aim of these algorithms is to rank a page based on a measure of the page's authority using a mechanism based on the number of links to the page from other pages. The underlying assumption with such a ranking is that many Internet users will choose to incorporate in their web pages links to relevant or authoritative web pages.
A problem associated with link-based ranking algorithms such as PageRank™ is that it is possible for a website designer to employ techniques which capitalise on their knowledge of search engine link analysis algorithms in order to improve the rank of their website artificially. Such techniques are often referred to as “spamming” and the web pages which are the target of spamming techniques are known as “spam” web pages. For example, spamming techniques include creating numerous web pages for the sole purpose of linking to a target (spam) web page and thereby raising the ranking of that web page. This spam technique is commonly referred to as link farming. Another problem with the known search engines is that they do not, and cannot, distinguish between authentic web content published by the originating party and counterfeit web content published by malicious parties, which can easily lead to deception of the public.
An embodiment includes a method by means of which at least one of the problems above is solved.
To this end, embodiments provide a method for compiling a unique sample code for specific web content, comprising: A) defining at least one sample code template comprising multiple sample code segments to be used for building a sample code for specific web content, said sample code segments at least comprising: a sample owner identifying code segment, and a sample identifying code segment; B) specifying the content of the sample code segments to be used for building said sample code, wherein the sample owner identifying code segment is specified by an Internet address, in particular an IP address and/or a domain name, of an owner of the specific web content, C) stringing the specified sample code segments to form the sample code, D) defining a digital path to a digital location via which access can be gained to the specific web content, and E) creating a cross-reference between the sample code generated during step C) and the digital path defined during step D) in case the sample code and the digital path are mutually distinctive.
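Steps A) to E) can be illustrated with a minimal Python sketch. The class name `SampleCodeTemplate`, the helper `strip_prefix`, the function `compile_sample_code`, the plain dictionary used as a cross-reference store, and the `owner.example` addresses are all hypothetical names chosen for illustration; the rule used here to decide whether the sample code and the digital path are "mutually distinctive" (comparison after removing a scheme prefix and a trailing slash) is likewise an assumption, not a definition taken from the method.

```python
from dataclasses import dataclass


@dataclass
class SampleCodeTemplate:
    """Step A: ordered segment names and the separator used in step C."""
    segment_names: list
    separator: str = "/"


def strip_prefix(path):
    """Drop a scheme prefix such as 'http://' and any trailing slash."""
    return path.split("://", 1)[-1].rstrip("/")


def compile_sample_code(template, segments, digital_path, cross_references):
    # Step B: every segment named by the template must have content.
    missing = [n for n in template.segment_names if not segments.get(n)]
    if missing:
        raise ValueError(f"unspecified segments: {missing}")
    # Step C: string the segments together in the template's order.
    sample_code = template.separator.join(
        segments[n] for n in template.segment_names
    )
    # Steps D and E: record a cross-reference only when the digital path
    # differs from the sample code (assumed comparison rule, see above).
    if strip_prefix(digital_path) != sample_code:
        cross_references[sample_code] = digital_path
    return sample_code


template = SampleCodeTemplate(["owner", "sample_id"])
cross_references = {}
code = compile_sample_code(
    template,
    {"owner": "owner.example", "sample_id": "annual_report"},
    "http://files.owner.example/docs/17.html",
    cross_references,
)
print(code)
```

In this sketch the owner identifying segment is simply the first entry of the template; a template with an intermediary segment would add a third name to `segment_names`.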
In some embodiments, the digital path may represent a Uniform Resource Locator, or may refer to a digital location, in particular, a web location, where the specific web content is stored. In other embodiments, at least a part of the digital path and the sample code are identical or may be substantially identical. In some embodiments, at least one punctuation mark may be defined in step A) for separating adjacent code segments during step C). In other embodiments, an order of defined code segments to be strung together may be defined in step A). Other embodiments may provide that step A) be processed repeatedly to generate multiple sample code templates, wherein the method further comprises step I) comprising choosing a code template to be applied prior to executing step B).
By labelling each world-wide unique specific web content with a world-wide unique sample code acting as world-wide unique identifier, comparable with a DNA profile or fingerprint of the sample, one specific web content can be traced and distinguished easily and unambiguously from another specific item of web content, and thus each specific web content can be identified throughout the world regardless of its context. This world-wide unique identification can be facilitated by the recognizable (identifiable) incorporation of the IP address and/or the domain name of a (present or prior) owner of the specific web content. Moreover, since the sample code is associated with a digital path to a digital location where the specific web content, and any further information (metadata) relating to said specific web content, is stored and can be traced/found, it can be verified relatively easily whether the specific web content has been manipulated or is authentic. This may facilitate assessment of the authenticity of the specific web content by determining the identity of the publisher of the specific web content. If the specific web content is published by the owner of the specific web content, the web content is deemed to be authentic. However, if the specific web content is not published by the owner, the specific web content is not considered to be authentic. The specific web content will typically not be moved once stored at the digital location. If the specific web content is moved to another digital or physical location, the cross-reference between the sample code and the digital path may be correspondingly updated, so the sample code is up to date and gives permanent access to the specific web content. Hence, dead links due to changes of the digital paths to digital locations where specific web contents are stored can be eliminated in this manner.
Specific web content, also considered as a single individual digital entity, is defined to have a unique identity and to be distinguishable (individualizable), and hence trackable and traceable, from other specific web content in the scope of its specification criteria. The term “specific web content” is understood as a web item or piece of web content represented by textual, visual or aural content that is encountered as part of the user experience on websites, which, by way of non-limiting example, may include text, images, sounds, videos and animations.
The term “owner” may include the originator, publisher, distributor, author, and creator, provided that an actual or previous ownership of the specific web content can be deduced from the IP address and/or the domain name of the owner as used and visualized in the sample code itself. The term “digital location” refers to a web location which can be a location on a computer of the owner as the code issuing party which is connected to the Internet, though it can also be a remote location in a private or public cloud computing infrastructure employing Internet-based computing, whereby shared resources, software and information are provided to computers and other devices on-demand, similar to a public utility. The sample codes may be stored in a computing cloud, while the specific web contents are stored in a location separate from the computing cloud, which would reduce the traffic load within the cloud and may also be beneficial for security reasons.
Each unique piece of web content is marked with a world-wide unique sample code. This sample code not only facilitates differentiation between authentic web content and non-authentic web content, but also incorporates metadata relating to the content of the specific web content. This metadata may be stored as a web index file in an index repository for use by a search engine in responding to a user's query. Since the size of the metadata incorporated in a sample code is commonly considerably smaller than the size of usual (meta)data of web content to be indexed, indexing of web content can take place more concisely and more efficiently, which can speed up the handling of a search query and hence reduce the Internet traffic load. This can lead to energy savings, which is favourable both from an economic and environmental point of view.
The sample code segments are selectively ordered to build an identifying path referring either directly or indirectly to a digital location, in particular a web location, where the specific web content can be found. The digital path may commonly represent a Uniform Resource Locator (URL) which may (automatically) be provided with a prefix, such as http, https, ftp, ftps, mailto, file, by a web browser. In an embodiment, at least a part of the digital path is identical to the sample code, meaning that the sample code is incorporated in the digital path. In case the sample code and the digital path are substantially identical, creating a cross-reference in accordance with step E) may be omitted. In this respect, the term “substantially identical” is being used to show that there may be minor differences between the sample code and the digital path which do not have any effect in practice. For example, although the digital path will commonly have a prefix, such as “http://”, such a prefix may not be present in the visualized sample code itself. However, since most web browsers will normally add a prefix in front of a web address not already having such a prefix, the sample code as such may easily be used as web address (digital path) leading to a web location (digital location) where the requested specific web content is stored.
In an embodiment, the method includes step F) comprising storing the sample code, the digital path, and the cross-reference between the sample code and the digital path in a database. Storing the cross-reference as a link between the sample code and the digital path can facilitate translating the sample code into a digital path where the specific web content can be found. Moreover, storage of this data may facilitate updating the cross-references in case of a change of the digital path in order to prevent unlinking (dead linking) of the sample code with respect to the actual location where the specific web content is stored and can be traced and found.
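Step F) and the updating of a cross-reference after the content has moved can be sketched as follows. This is a minimal illustration using an in-memory SQLite database from the Python standard library; the table name `cross_reference`, the function names, and the example paths are assumptions made for the sketch and are not prescribed by the method.

```python
import sqlite3


def open_store():
    """Create an in-memory table holding sample code -> digital path."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cross_reference ("
        "sample_code TEXT PRIMARY KEY, digital_path TEXT NOT NULL)"
    )
    return conn


def store(conn, sample_code, digital_path):
    """Insert a cross-reference, or overwrite it if the content has moved."""
    conn.execute(
        "INSERT OR REPLACE INTO cross_reference VALUES (?, ?)",
        (sample_code, digital_path),
    )
    conn.commit()


def resolve(conn, sample_code):
    """Translate a sample code into its current digital path, if known."""
    row = conn.execute(
        "SELECT digital_path FROM cross_reference WHERE sample_code = ?",
        (sample_code,),
    ).fetchone()
    return row[0] if row else None


conn = open_store()
store(conn, "owner.example/annual_report", "http://files.owner.example/a.html")
# The content is moved: only the cross-reference changes, not the sample code.
store(conn, "owner.example/annual_report", "http://cdn.owner.example/b.html")
print(resolve(conn, "owner.example/annual_report"))
```

Because the sample code is the primary key, a moved item keeps its code while the stored path is replaced, which is how the sketch avoids a dead link.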
The method optionally comprises step G) comprising converting the sample code generated during step C) into a machine-readable format. If the sample code is printed or displayed on a screen, the sample code may be read, for example, by using an optical scanner. By applying optical character recognition, the scanned sample code can be converted into a set of characters identical to the sample string of the sample code, which can subsequently be entered either automatically or manually into a web browser. The machine-readable sample code may also be represented in a digital or physical encrypted iconographic format, such as a 2D/3D barcode and/or an RFID tag. It should be noted that while these iconographic representations look similar to conventional iconographic representations, the content, meaning, and use of the iconographic representation of the sample code are completely different from the conventional iconographic representation of known sample series and/or category codes.
Alternatively, the method comprises step H) comprising translating at least the sample identifying code segment of the sample code into another language and matching characters. Since the sample identifying code segment may comprise metadata relating to the specific web content associated with the sample code, the metadata providing relevant recognizable information about the specific web content, it may be user-friendly to offer and display these metadata in the language of the location/country where the specific web content code is issued. An example of possible metadata incorporated and named in the at least one sample identifying code segment is information relating to the author, title, subject, keywords, size, version, date of creation, remarks, and/or status of the specific web content. The IP address and/or the domain name of an owner as incorporated in the owner identifying code segment is commonly not translated and commonly remains unchanged during step H).
It is further imaginable that the sample code string comprises at least one intermediary identifying code segment relating to the identity of an intermediary e.g., used to manufacture, supply, support, distribute, sell, and/or promote the product sample. The intermediary identifying code segment, optionally based on the domain name or IP address of the intermediary, may comprise the identity of the intermediary but may also comprise other metadata relating to the intermediary, such as a platform or service offered to the public via which specific web content can be accessed. One example is related to the distribution of news via a news publishing service, such as Google News (news.google.com), via which news items originating from different sources are displayed. A sample code associated with a specific web content may be represented as follows: “cnn.com/google.com/2010_05_18/federal_government_extends_area_of_fishing_ban_in_Gulf_of_Mexico_due_to_enormous_oil_spill_in_the_coastal_waters”, wherein “cnn.com” represents the owner identifying code segment, “google.com” represents the intermediary identifying segment, “2010_05_18” represents the publication date (May 18, 2010), and “federal_government_extends_area_of_fishing_ban_in_Gulf_of_Mexico_due_to_enormous_oil_spill_in_the_coastal_waters” represents metadata relating to content of the specific web content, in which metadata serves as (a set of) keywords taken into account during processing of a user's search query. The sample code forms a web link which either directly or indirectly leads to the web content associated with said sample code. Hence the above sample code itself may refer to the digital location where the web content is published (direct routing), though it may also be conceivable that the above sample code may be automatically translated (by means of a script) into a cross-referenced digital path to the digital location where the web content is stored (indirect routing).
An example of such a cross-referenced digital path is the URL: “http://edition.cnn.com/2010/US/05/18/gulf.oil.spill.main/index.html?hpt=T2”.
It may be beneficial during step A) to define at least one punctuation mark for separating adjacent code segments during step C). A variety of punctuation marks can be used, though since the sample code often functions as URL, a slash (‘/’) sign may be used to separate adjacent code segments. In a correct URL syntax commonly a slash sign is also positioned behind the last code segment. In addition to these separation characters, other typographic signs, such as a tilde (‘~’), a dot (‘.’), an underscore (‘_’), and a minus sign (‘−’), may also be used within the code segments themselves and/or between the code segments. Such a punctuation mark may be recognized by the search index, as a result of which the sample code can be decomposed by the search engine into multiple code segments forming the indexed metadata (to be) stored in the index repository.
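The decomposition of a sample code at its punctuation marks might look like the following sketch. The function names `decompose` and `keywords`, the segment names, and the example code string are hypothetical; the sketch assumes a slash between segments and treats underscore, hyphen, dot and tilde as word separators inside a segment.

```python
import re


def decompose(sample_code, segment_names, separator="/"):
    """Split a sample code into named segments at the separator."""
    # A trailing separator, as in a well-formed URL, is ignored.
    parts = sample_code.strip(separator).split(separator)
    if len(parts) != len(segment_names):
        raise ValueError("sample code does not fit the template")
    return dict(zip(segment_names, parts))


def keywords(segment):
    """Split one segment into keywords at the in-segment punctuation."""
    return [word for word in re.split(r"[_\-.~]+", segment) if word]


segments = decompose(
    "owner.example/intermediary.example/oil_spill_report/",
    ["owner", "intermediary", "sample_id"],
)
print(segments["owner"], keywords(segments["sample_id"]))
```

Only the sample identifying segment is split into keywords here; the owner and intermediary segments are kept whole so that their domain names remain intact.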
In an embodiment, the sample code string comprises at least one checking code segment representing the result of a predetermined mathematical processing of at least one other sample code segment. The algorithm used to calculate the value of the checking code segment may be defined when defining the sample code structure during compilation of the sample code. This algorithm may for example use or have similarities with the ISBN (International Standard Book Number) category coding system. The algorithm for generating an ISBN check character works as follows. To generate an ISBN check character, each ISBN digit is multiplied by a predetermined associated weighting factor and the resulting products are added together. The weighting factors for the first nine digits begin with 10 and form the descending series 10, 9, 8 . . . 2. Thus for the nine digits 0 9 4 0 0 1 6 3 3, the products summed are 0+81+32+0+0+5+24+9+6=157. This sum is divided by the number 11. (157/11=14 with 3 remainder). The remainder, if any, is subtracted from 11 to get the check digit. (11−3=8). If the check digit is 10, it is represented by the Roman numeral X. The final ISBN in this example is accordingly 0-940016-33-8. By generating the check digit and comparing it with the received check digit, the validity of the ISBN may be verified. As mentioned above, a similar or comparable check may be incorporated in the sample code.
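The ISBN check character calculation described above can be written as a short Python sketch. The function names are chosen for illustration only; how a comparable check would be applied to the segments of a sample code is left open by the description and is not shown here.

```python
def isbn10_check_character(first_nine):
    """Compute the ISBN-10 check character for the first nine digits."""
    digits = [int(ch) for ch in first_nine if ch.isdigit()]
    if len(digits) != 9:
        raise ValueError("expected exactly nine digits")
    # Weights form the descending series 10, 9, ..., 2.
    total = sum(w * d for w, d in zip(range(10, 1, -1), digits))
    # Subtract the remainder from 11; a remainder of 0 gives check digit 0.
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)


def isbn10_is_valid(isbn):
    """Regenerate the check character and compare it with the received one."""
    chars = [ch for ch in isbn.upper() if ch.isdigit() or ch == "X"]
    if len(chars) != 10:
        return False
    return isbn10_check_character("".join(chars[:9])) == chars[9]


print(isbn10_check_character("094001633"))  # the worked example: 157 -> 8
print(isbn10_is_valid("0-940016-33-8"))
```

The final `% 11` covers the case the text only implies with "if any": when the weighted sum is already divisible by 11, the check digit is 0 rather than 11.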
In another embodiment the sample code segments defined during step A) further comprise a sample code security identifying code segment. Application of this code segment may counteract abuse of the sample code by parties with malicious intent, since this security identifying code segment may be used as a check to determine the authenticity of the sample code. For example, after entering the sample code into a web browser, a validity check of the sample code security identifying code segment may be performed. This security related code segment may be time-dependent (“dynamic”), meaning that the code segment may only be valid for a limited period of time. In case the security check shows that the sample code is no longer valid or in force, access to the specific web content will not be granted. The security identifying code segment hence acts as an interactive key to gain access to the specific web content file.
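The description does not say how a time-dependent security segment is generated or checked. One possible realization, shown purely as an assumption, derives the segment from a keyed hash (HMAC) over the rest of the sample code and the current time window, so that the segment stops validating once the window has passed. The secret key, the one-hour window, the 16-character truncation and all function names are illustrative choices.

```python
import hashlib
import hmac


def security_segment(code_body, secret, now, validity_seconds=3600):
    """Derive a segment that is only valid within the current time window."""
    window = int(now // validity_seconds)
    message = f"{code_body}|{window}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()[:16]


def is_valid(code_body, segment, secret, now, validity_seconds=3600):
    """Recompute the segment for the current window and compare it."""
    expected = security_segment(code_body, secret, now, validity_seconds)
    return hmac.compare_digest(expected, segment)


secret = b"hypothetical-issuer-secret"
body = "owner.example/annual_report"
issued_at = 7200.0  # seconds since some epoch; fixed here for repeatability
segment = security_segment(body, secret, issued_at)
print(is_valid(body, segment, secret, issued_at + 100))   # same window
print(is_valid(body, segment, secret, issued_at + 3600))  # next window
```

With fixed windows a code issued late in a window expires sooner than one issued early; a real deployment might embed an explicit expiry time instead, which is a design choice outside the scope of this sketch.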
During step A) not only the number and kind of the code segments used to build a code may be defined, but also the order in which the defined code segments are to be strung together. This allows for creation of a complete sample code template (code format), wherein code segments are ordered in a predetermined order. Determining the order of code segments during step A) can enhance the handling of sample codes and co-related storage locations of the specific web contents.
In an embodiment, step A) may be repeatedly performed to generate multiple sample code templates, wherein the method further comprises step I) comprising choosing a code template to be applied prior to executing step B). Generating multiple templates may allow for additional differentiation in sample codes provided to users. For example, a party may offer specific web content directly to customers and also indirectly to customers by making use of an intermediary. In doing so, different sample code templates may be used, where the direct customers may receive a code such as “www.owner.com/sample_id_1234” which does not use an intermediary, while indirect customers may receive a code such as “www.owner.com/intermediary.com/sample_id_5678” which utilizes an intermediary.
The aforementioned method may be performed using a software module having a user interface to allow the user to generate a world-wide unique sample code.
An embodiment also relates to a method for providing specific web content with a unique sample code, comprising: J) creating specific web content, K) compiling a unique sample code for the specific web content according to the method described above, L) marking the specific web content with at least one compiled sample code, M) storing the specific web content at a digital location, N) storing the sample code, and O) creating a cross-reference between a digital path referring to said digital location and the sample code in case the sample code and the digital path are mutually distinctive. Marking the specific web content with the sample code according to step L) may facilitate indexing of the specific web content by a search engine. Moreover, the manner of labelling the specific web content by using the sample code can allow for assessment of the authenticity and legitimacy of the specific web content. Specific web content may optionally be labelled with multiple unique sample codes. The multiple unique sample codes may be embedded as metadata in the specific web content or may also be incorporated in a body text of the specific web content. For example, embedding multiple sample codes into one specific web content could be advantageous if the specific web content is distributed via multiple intermediaries, with each intermediary using its own unique sample code.
In an embodiment, the method may include step P) comprising providing the sample code to a user, for example the creator of the specific web content. This may be performed by sending the user an e-mail which includes the sample code. The sample code may be displayed in the body of the e-mail as plain text containing a hyperlink. Alternatively, the sample code may be sent as a separate attachment to the e-mail. As the sample code is commonly represented by a string of a limited number of alphanumeric signs and punctuation marks, the sample code is commonly no larger than 1 kilobyte. Since only the sample code and not the specific web content is distributed, Internet traffic and the storage load may be significantly reduced. By storing sample codes instead of the sample files in a computing cloud, users can be offered a secure exchange of information in a cloud computing environment.
As already indicated the sample code may be embedded as metadata into the specific web content forming a tag, mark, or label of the specific web content. In an alternative embodiment, the sample code is incorporated in the content, in particular the body text, of the web content.
Some embodiments further relate to a method for indexing specific web content provided with a sample code according to the above method, comprising: i) allowing a search engine to crawl specific web content and acquiring at least one sample code coupled to said specific web content, ii) verifying the authenticity of the sample code by comparing the Internet address incorporated in the owner identifying code segment with a detected Internet address of the web content; iii) labelling sample codes found authentic during step ii); and iv) storing the sample codes acquired in an index repository. Since each sample code comprises an owner identifying code segment including an Internet address of the owner, the source, and hence the authenticity and legitimacy of the published web content can be assessed by comparing the Internet address incorporated in the sample code and the detected Internet address of the publisher of the web content. In case there is a match between both Internet addresses, it is assumed that the web content originates from the owner and is hence authentic. Labelling the sample codes according to step iii) after verification of the sample codes according to step ii) allows a distinction to be made between authentic web content and non-authentic web content, which can be used for ranking search results in response to a search query wherein authentic web content will commonly be listed higher than non-authentic web content. Hence, labelling the sample codes after verification may allow for the prioritization of search results. For this purpose, it is conceivable that either the authentic web content or the non-authentic web content is labelled, though it would also be possible that both the authentic web content and the non-authentic web content are labelled provided that a clear distinction can be made between the authentic web content and the non-authentic web content.
In one example, only the authentic web content is labelled, wherein during step iii) the sample codes found authentic are labelled by the detected IP address of the web content. Since the actual IP address of an owner commonly does not change over time, using the IP address as a marking label may help facilitate future verification of co-related web content.
In an embodiment, the method comprises step v) comprising detecting the IP address of the web content prior to step ii), wherein during step ii) a domain name is derived from the owner identifying part of each sample code, wherein an IP address related to said domain name is looked up by using a domain name server, wherein the looked up IP address is compared with the detected IP address according to step v). An IP address comparison may be one method that can be used to verify the origin and hence the authenticity of the specific web content. Thus, during step iii) the sample codes found authentic may be labelled by a detected IP address of the web content.
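The comparison of steps v) and ii) can be sketched in Python as follows. The function names and the example addresses are hypothetical. The default resolver is `socket.gethostbyname` from the standard library, which returns a single IPv4 address; a domain served from several addresses would need a broader comparison, which this sketch does not attempt.

```python
import socket


def owner_domain(sample_code, separator="/"):
    """Derive the domain name from the owner identifying code segment."""
    return sample_code.split(separator, 1)[0]


def is_authentic(sample_code, detected_ip, resolve=socket.gethostbyname):
    """Compare the looked-up IP of the owner domain with the detected IP."""
    try:
        looked_up_ip = resolve(owner_domain(sample_code))
    except OSError:
        # The domain could not be resolved, so authenticity is not confirmed.
        return False
    return looked_up_ip == detected_ip


# A stand-in resolver so the example runs without network access.
def fake_resolver(domain):
    table = {"owner.example": "192.0.2.10"}
    if domain not in table:
        raise OSError("unknown host")
    return table[domain]


print(is_authentic("owner.example/annual_report", "192.0.2.10", fake_resolver))
print(is_authentic("owner.example/annual_report", "198.51.100.7", fake_resolver))
```

Passing the resolver as a parameter keeps the comparison logic testable offline; in use, the default argument performs a real DNS lookup.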
In an embodiment, during step iv) the sample codes stored in the index repository are provided with a time stamp. Providing each indexed sample code with a time stamp facilitates chronological ranking of search results.
In another embodiment each sample code is decomposed in separate code segments, wherein during step iv) the sample codes are stored in decomposed format. Decomposition of the sample code into separated code segments may improve the efficiency during processing of a search query, since the code segments can be searched selectively, wherein other code segments can be disregarded during processing of a search query, leading to savings of time and energy. For example, if the sample code comprises a code security identifying code segment and/or a checking code segment as defined above, these code segments can be disregarded by the search engine.
Some embodiments described herein moreover relate to a method of processing an Internet search query using a search engine having indexed web content according to the above method, comprising: vi) receiving a search query comprising at least one keyword, vii) searching the sample codes stored in the index repository for the at least one keyword, and viii) in case the at least one keyword matches at least a part of at least one sample code stored in the index repository, providing the at least one matching sample code as a search result. In case the sample codes are stored in decomposed format in the index repository, it is imaginable that the code segments are searched selectively during step vii). The search results are based upon the extent of matching of the keywords (search criteria) entered by a user and the metadata incorporated in the sample codes as stored in the index repository. During step viii) the search results are provided in a ranked order if multiple matching sample codes were found during step vii). This ranking can be based on multiple criteria. For example, during step viii) the labelled authentic sample codes may be ranked higher than non-labelled non-authentic sample codes. Herein, co-related sample codes may be displayed together, for example as a cluster. The sample codes may also be ranked chronologically. During step viii) ranking of the sample codes may be based upon the extent of overlapping of at least one keyword entered and the sample codes stored in the index repository. Additionally, during step vi) a search query may be received comprising multiple keywords, wherein during step viii) ranking of the sample codes is based upon the extent of overlapping of the order of keywords entered and the sample codes stored in the index repository. The order of the ranking criteria set out above can be customized. However, ranking the search results based upon authenticity may be the primary ranking criterion.
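Steps vi) to viii) can be illustrated with a small in-memory index. The `IndexedCode` record, its field names and the example entries are assumptions made for the sketch. The ranking order used here (authenticity first, then keyword overlap, then most recent time stamp) is one possible ordering of the criteria named above, which the description says can be customized.

```python
from dataclasses import dataclass


@dataclass
class IndexedCode:
    sample_code: str
    keywords: list      # decomposed sample identifying segment
    authentic: bool     # label assigned after verification
    timestamp: float    # time stamp assigned when indexed


def search(index, query):
    """Return matching sample codes, ranked by the criteria described."""
    terms = [term.lower() for term in query.split()]
    hits = []
    for entry in index:
        words = {word.lower() for word in entry.keywords}
        overlap = sum(1 for term in terms if term in words)
        if overlap:
            hits.append((entry, overlap))
    # Authentic first, then larger overlap, then newer entries.
    hits.sort(key=lambda h: (not h[0].authentic, -h[1], -h[0].timestamp))
    return [entry.sample_code for entry, _ in hits]


index = [
    IndexedCode("copy.example/oil_spill_report", ["oil", "spill", "report"], False, 200.0),
    IndexedCode("owner.example/oil_prices", ["oil", "prices"], True, 300.0),
    IndexedCode("owner.example/oil_spill_report", ["oil", "spill", "report"], True, 100.0),
    IndexedCode("owner.example/weather", ["weather"], True, 400.0),
]
results = search(index, "oil spill")
print(results)
```

Only the keyword segment is searched; security and checking segments, if present, would simply not be stored in the `keywords` field.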
Some embodiments further relate to an index repository for use in the above method, said index repository comprising at least one sample code compiled by using the above method.
Some embodiments additionally relate to a method for gaining access to specific web content provided with a unique sample code by using a search engine having indexed web content according to the above method, comprising: ix) allowing the search engine to provide search results comprising at least one sample code by using the above method, x) selecting at least one sample code listed in the search results by a user, and xi) redirecting the user to the digital location where the web content related to the selected sample code is stored. The sample code itself may directly refer to the digital location where the specific web content associated with the sample code is stored. It is also imaginable that during step xi) the sample code selected during step x) is translated into a cross-referenced digital path, in particular a URL, relating to the digital location where the specific web content is stored. Selecting the sample code by the user according to step x) can be performed manually by copying the sample code into an address bar of a web browser. However, the sample code can also be displayed as a hyperlink to the user, wherein during step x) the user can select the sample code by simply clicking said hyperlink after which the user will be redirected to the digital location where the specific web content is stored.
During the presentation of the search results in accordance with step ix) to recognizable visual distinction may be created between authentic sample codes and non-authentic sample codes so that a user will be able see which results relate to authentic sample codes and which results relate to non-authentic sample codes.
An embodiment moreover relates to a computer-readable medium with computer-executable instructions which, when loaded onto a computer system, provide the computer system with the functionality of the method for compiling a sample code, and/or the method of providing a sample code to specific web content as described above. Examples of computer-readable media are USB sticks, internal and external hard drives, diskettes, CD-ROMs, DVD-ROMs, and others.
An embodiment additionally relates to a sample code as compiled by the above method. Advantages of the use of a world-wide unique sample code acting as a “fingerprint” have already been described herein.
An embodiment also relates to a database comprising at least one cross-reference between a sample code according to an embodiment and a digital path to a digital location where specific web content associated with said sample code is stored. The use of such a cross-reference table allows the sample code to be converted into a digital path to a digital location where the specific web content can be found.
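By way of non-limiting illustration, such a cross-reference table may be sketched as a simple mapping from sample code to digital path. The class name CrossReferenceTable, its methods, and the example sample code and URL are hypothetical; an actual embodiment may use any suitable database.

```python
from typing import Dict, Optional


class CrossReferenceTable:
    """Hypothetical in-memory stand-in for the cross-reference database."""

    def __init__(self) -> None:
        self._paths: Dict[str, str] = {}

    def register(self, sample_code: str, digital_path: str) -> None:
        """Store one cross-reference between a sample code and a path."""
        self._paths[sample_code] = digital_path

    def resolve(self, sample_code: str) -> Optional[str]:
        """Return the digital path for a sample code, or None if unknown."""
        return self._paths.get(sample_code)


table = CrossReferenceTable()
table.register(
    "owner@example.com/report-2019",           # illustrative sample code
    "https://storage.example.com/content/41f7",  # illustrative digital path
)
target = table.resolve("owner@example.com/report-2019")
```

Here target holds the digital path to which a user selecting the sample code could be redirected. A sample code without a stored cross-reference resolves to None in this sketch, leaving it to the caller to decide how such a request is handled.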
An embodiment further relates to a system for compiling a world-wide unique sample code using the above method, comprising: at least one sample code template generator for defining at least one sample code template comprising multiple sample code segments to be used for building a sample code for specific web content, said sample code segments at least comprising a sample owner identifying code segment, and a sample identifying code segment, at least one sample code segment specification module connected to said template generator for specifying the content of the sample code segments defined by means of the code template generator, wherein the sample owner identifying code segment is specified by an address of an owner of the specific web content, at least one code generator connected to said template generator and said specification module for stringing the specified sample code segments to form the world-wide unique sample code, and at least one database for storing at least one cross-reference between a generated sample code and a digital path to a digital location via which access can be gained to the specific web content in case the sample code and the digital path are mutually distinctive. For example, some embodiments of the sample code have already been described herein.
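By way of non-limiting illustration, the cooperation of the template generator, the specification module and the code generator may be sketched as follows. The sketch assumes a template of two segments joined by a "/" separator and an e-mail style owner address; the function names, the separator and the example values are hypothetical. The sketch does not itself enforce uniqueness, which in this illustration would depend on the owner address being unique and on the owner not reusing a sample identifier.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class SampleCodeTemplate:
    """Ordered segment names plus the separator used when stringing them."""
    segments: Tuple[str, ...]
    separator: str = "/"   # assumed separator, not prescribed by the text


def define_template() -> SampleCodeTemplate:
    """Template generator: owner segment first, then sample identifier."""
    return SampleCodeTemplate(segments=("owner", "sample"))


def specify_segments(owner_address: str, sample_id: str) -> Dict[str, str]:
    """Specification module: fill in the content of each segment."""
    if not owner_address or not sample_id:
        raise ValueError("both segments must be non-empty")
    return {"owner": owner_address, "sample": sample_id}


def generate_code(template: SampleCodeTemplate, values: Dict[str, str]) -> str:
    """Code generator: string the specified segments in template order."""
    return template.separator.join(values[name] for name in template.segments)


template = define_template()
values = specify_segments("owner@example.com", "report-2019")
sample_code = generate_code(template, values)
```

With the illustrative values above, the segments are strung together in template order, the owner identifying segment preceding the sample identifying segment.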
In some embodiments, the system may be a (cloud) computer-implemented system which may be fully automated after proper set-up and initialisation. An embodiment of the system may further include at least one service module for administering the system for issuing a sample code. A digital user/administrator interface for controlling and maintaining the template generator, the specification module, and the code generator is included in the system according to an embodiment. The system may additionally include a sample storage device for storage of specific web content at a digital location of which the digital path is stored in the database. An example of a suitable sample storage device is a web server, optionally in the cloud.
In an embodiment, the system further includes a distribution/communication module for distributing/communicating the generated sample code to one or more users.
Embodiments additionally may relate to a system for handling a request for gaining access to specific web content provided with a sample code according to the above method, comprising: a web client including a search engine for allowing a user to enter a search query, a processing module connected to said web client for processing the search query resulting in search results comprising at least one sample code, and a handling module connected to said processing module for redirecting the user to a digital location of the web content based upon a sample code selected by the user. The functioning of this system has already been described above in a comprehensive manner.