This invention relates to network based classified information systems, to methods of automatically building searchable databases of classified information derived from web pages posted on a network, and to web pages for use in such systems and methods.
The information systems and databases of most relevance to this invention are those which include classified product and service catalogues similar to the Yellow Pages telephone books, contact indexes similar to the White Pages telephone books, and/or subject indexes similar to Library catalogues. Such information systems and databases typically include sets of associated classification, contact and/or geographic items of information. For convenience, classification, contact and/or geographic information will be hereinafter called CCG-data.
The networks with which this invention is concerned are the worldwide public computer/communications network commonly known as the Internet and private networks (sometimes called intranets) which allow common access to markup documents on computers connected to the network. Markup documents are text files prepared using various markup languages such as HyperText Markup Language (HTML) and Extensible Markup Language (XML) which are implementations (or dialects) of the Standard Generalised Markup Language (SGML). The system of accessible files on the Internet is called the World Wide Web (WWW) and the markup documents themselves are commonly called 'web pages'. A web page is said to be 'posted' on a network when it is stored on computer-readable media of a host network computer as a file which is generally accessible to network users. A web page is transported from the host computer to a requesting computer through intermediate network computers as a computer-readable signal embodied in a carrier wave. Though this invention is not limited to Internet based information systems, these terms are used for convenience.
It has been estimated that there are about 100 million web pages on the Internet and that the number is doubling every two years. Many of these pages include information concerning commercially offered goods and services and often include contact details. But the difficulty of locating such information is increasing faster than the growth in the number of web pages.
To help network users locate web pages of interest, certain network service providers create indexes (or databases) of the contents of web pages posted (stored on computer readable media so as to be generally accessible) on the network and provide 'search engines' to use the indexes. These indexes are often created automatically by the use of 'web crawlers' which (i) interrogate computer after computer on the network to locate successive web pages and (ii) index the words in each web page encountered against the network address (eg Internet Protocol Address or IPA) and filing system path or universal resource locator (URL) at which the web page is accessible. Hereinafter the terms URL and URI (Uniform Resource Identifier) are taken to be identical in meaning and to signify network addresses and filing system paths. Usually, the indexes consist of a list of unique words with each word having an associated list of URLs of the web pages wherein the word was found to occur during interrogation. The URL serves as a 'hyperlink' which, if selected by a user/searcher, results in the associated web page being automatically transmitted from the computer where it is posted on the network to the user/searcher's computer where it may be displayed or otherwise processed. The sending and receiving of files in this way is greatly assisted by user interface programs called 'web browsers' (or more simply, 'browsers') such as Netscape and Microsoft Internet Explorer.
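The word index described above can be illustrated with a minimal sketch. It assumes the pages have already been retrieved; the function name build_index, the example URLs and the page texts are hypothetical and are not taken from any particular search engine.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each unique lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        # Split the page text into simple alphanumeric words.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


# Hypothetical pages standing in for documents a crawler has already fetched.
pages = {
    "http://example.com/a": "Plumbing services in Sydney",
    "http://example.com/b": "Sydney library catalogue",
}
index = build_index(pages)
sydney_urls = sorted(index["sydney"])
```

Looking up a word returns every page that contains it, whatever the role of the word in the page, which is the limitation discussed in the paragraphs that follow.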
The search for web pages of interest using search engines leaves much to be desired:
simple searches (those using a few keywords in simple combinations) often yield far too many web page references (URLs) to permit them to be interrogated one-by-one,
complex searches (those using many keywords and/or complex Boolean expressions) require considerable expertise to undertake,
even using optimum search criteria, many irrelevant web pages are referenced because of inconsistent use of terminology by those who author the original web pages,
even using optimum search criteria, many relevant pages are missed, again because of inconsistent use of terminology by web page authors, and
items of information included in the body of web pages cannot be 'understood' or associated in useful ways by web crawlers; that is, they cannot be recognised as, say, a surname, a street name, a geographic locality, or a type of goods or services, nor can, say, a surname be strongly associated with a street name, a geographic locality, or a type of goods or services.
The result is that information provided by search engines from databases which are automatically compiled using web crawlers is a very poor equivalent of the common Yellow Pages and White Pages directories which serve the telephone industry (though these directories are not, of course, automatically compiled from web pages).
In an attempt to improve the usefulness of automatically compiled network databases, some search engine providers make use of information contained in URLs, such as the country code and top level domain name codes such as 'com', 'edu', 'net' and 'org' which are sometimes used to signify the subject matter of web pages. It has been proposed to add more content classifying codes to URLs (eg, "chem" to signify chemical subject matter) to allow specialised databases (national, commercial, chemical, etc) to be generated. However, this proposal has serious drawbacks:
URLs are Internet addresses and it is in principle undesirable to confuse the address function of a URL with that of representing a list of web page classifications or contact details.
A URL is an inappropriate container of multiple web page classification codes and contact details because the length of the URL would cause it to become unwieldy as an Internet address.
Including in a URL classification codes drawn from a list of thousands of codes would compromise the mnemonic quality of Internet addresses such as "www.yellowpages.com".
There is substantial overlap in the subject matter contained in web pages having the various top level domain name codes.
There is no consensus on, or standard for, content classification codes in URLs.
Another proposal to add content classification data to web pages has arisen from the wish to identify pages containing material that may be offensive to some viewers, or should not be accessed by minors. The Platform for Internet Content Selection (PICS) (see www.w3.org/pub/WWW/PICS and other documents at www.w3.org) is a web page ratings standard similar in principle to the ratings systems for motion pictures. This system allows page authors to "internally" self classify their pages through use of the "<meta . . .>" HTML element. Alternatively, "external" PICS ratings of web pages may be obtained from ratings service providers accessed each time a URL is selected. In practice, the ratings service providers have adopted a very limited range of web page classifications. For example, Ararat Software's Commercial Rating System (see www.ararat.com.ratings/ararat10.html) provides just 5 categories of web page content: commercial content, technical/customer support, ordering information, downloading information and contact information. In other examples, CyberPatrol (www.microsys.com/pics/pics_msi.htm) provides 16 categories, the Recreational Software Advisory Council (www.rsac.org/faq.html) provides 4 categories, SafeSurf (www.safesurf.com/ssplan.htm) provides 11 categories and Vancouver Webpages Rating Service (vancouver-webpages.com/VWP1.0/) provides 11 categories. None of the categories provide classification of web pages by industry, service, product or subject with sufficient specificity to be useful when searching for web pages. Rather, the categories are intended to prevent web browsers from displaying web pages unsuitable for particular types of web browser users. Such rating systems are not intended to be used for the automated creation of Yellow or White pages like databases from web pages and are unsuitable for that purpose because they cannot represent contact details.
Further, the ratings data may only be encoded in the <meta . . .> element in the <head> of an HTML document, drastically limiting the type and usefulness of the data that can be encoded.
Another proposal for classifying the content of web pages, the "Meta Content Framework" (MCF, see mcf.research.apple.com/mcf.html), requires the content of web pages to be classified and the classification data to be held in a separate non-HTML data file with a MIME type of text/mcf. Storing data in non-HTML encoded documents which describes the content of HTML encoded documents is a technical and economic barrier to the adoption by search engine providers of the proposal. The MCF proposal is thus entirely unsuited to the automated creation of Yellow or White pages like databases from HTML encoded web pages (MIME type text/html) because data stored according to the MCF proposal is not stored in HTML encoded web pages.
The "Electronic Business Card", vCard, (see "vCard The Electronic Business Card" Version 2.1, versit Consortium Specification, Sep. 18, 1996 or ftp://ds.internic.net/internet-drafts/draft-ietf-asid-mine-vcard-01.txt) uses a non-HTML data file (MIME Content Types of "text/plain" or the non-standard "text/X-vCard") containing contact information equivalent to an extended White Pages entry which can be exchanged on a network using Simple Mail Transfer Protocol (SMTP) or using HTTP. It can be associated with a web page by use of a URL in the web page which refers to the vCard information (eg <a href="www.thing.com/vCard.vcf">My vCard</a>). Version 2.1 vCard standard data file format (published Sep. 18, 1996) provides for the inclusion of many items of contact information. The vCard specification recommends that, where possible, there should be consistent mapping of vCard property names to HTML "<input>" element attribute names (eg vCard property name "TITLE" maps to HTML "<input name='title'>"). The intention is to facilitate the transfer of vCard data into web page input forms by pasting from a clipboard or by dragging from other computer applications. The vCard proposal is unsuited to the automated creation of Yellow or White pages like databases from HTML encoded web pages because data stored according to the vCard proposal is not stored in HTML encoded web pages.
The inclusion of classified information in separate documents (such as Meta Content files or vCards) has the disadvantage that there is necessarily much duplication of data and coordination of modifications between the separate documents and the web pages. This must be done to allow a person who has accessed a web page using an HTML compliant browser to determine whether it is worth calling up the associated file or vice versa. Also, to allow portions of web pages to be classified, web page contextual information would have to be duplicated in the separate document. vCards in particular do not provide this functionality. Another disadvantage is that non-HTML documents such as vCards contain no details as to how the data they contain is to be displayed. In the display of HTML documents the position, font, size, colour of the text and other elements of the document are of great importance. The restriction of address data in a vCard to untagged ordinally organised fields is inflexible. For example, multiple instances of extended parts of the address are not possible. Also components of names, addresses and telephone numbers and so forth are insufficiently identified.
The Online Computer Library Center Inc (OCLC, Dublin, Ohio, USA) proposal, known as the "Dublin Core", proposes classifying scholarly web pages by subject (topic of the work, or keywords that describe the content of the work), title, author, publisher, other agent, date, object type (genre of the object such as home page, novel, poem etc), form, identifier, source, language, relationship and coverage (spatial and temporal) (see www.oclc.org:5046/~weibel/html-meta.html and other documents at www.oclc.org). This proposal does not include industry, service, product or subject classifications. It also does not include contact details. Names such as that of the author are not specified in sufficient detail to avoid ambiguities such as which are the author's first and last names. The proposal specifies that the details are encoded using the <meta . . .> element in the <head> of web pages. The proposal is unsuited to the automated creation of Yellow or White pages like databases from web pages because the proposal does not provide for classification of web pages and does not provide adequate contact details. Further, the use of keywords for describing the content of the work adds very little to the effectiveness of indexing of web pages since the web pages are usually indexed on every word of their content and most often the key words would simply be a duplication of words already contained in the document.
It has also been proposed to use the Dewey Decimal System (see orc.rsch.oclc.org:6109/eval_dc.html and orc.rsch.oclc.org:6109/bintro.html) to rank electronic documents against a Dewey Decimal subject classification. The proposal suggests automatically assigning Dewey Decimal subject classification codes to documents during automated indexing and cataloguing but does not specify the exact nature of the assignment, although it is implied that the codes are stored separately from the documents. The proposal admits that such automated classification is less satisfactory than human classification. The proposal is unsuited to the automated creation of Yellow or White pages like databases from web pages because the accuracy of classification is inadequate and because it does not provide for inclusion of industry, service or product classifications or of contact details. In addition, deriving a subject classification code from an analysis of every word and phrase in a web page is computationally expensive.
The HTML 3.0 standard (see page 23 of the www.w3.org document "draft-ietf-html-specv3-00.txt") provides "class" as an attribute of almost all HTML "<body>" elements. The "class" attribute is intended to be used with style sheets. Style sheets provide a means by which the display of HTML documents may be altered to suit the needs of different classes of browser users. For example, <div class="appendix"> could be used to define a division that acts as an appendix, and <h2 class="section"> could be used to define a level 2 header that acts as a section header, although, of course, any string of characters could be defined for those purposes. The "class" attribute has never been suggested for holding goods and services classifications and, in any case, is not suited to such a use because it is undesirable to confuse the style sheet function of the "class" attribute with any other function.
The HTML 3.0 and earlier standards provided the HTML elements "<person>" and "<address>" but do not specify the form of the content or method of validating the content of those elements. A person's name may be written as first name followed by last name or last name followed by first name. Similarly, different conventions exist for writing addresses. Such ambiguities arise in the ill defined format of the HTML elements "<person>" and "<address>". As such they are of little use in the automatic compilation of searchable databases.
The XML language (see: textuality.com/sgml-erb/WD-xml.html) was developed to extend HTML so that software vendors can add new elements and new element attributes to HTML which are not specifically defined in any HTML standard. The intention is to ensure that all new elements and attributes could be parsed by all XML parsers even if the new elements held no significance for any particular XML parser. However, like HTML, XML does not provide a standard for the representation of industry, service, product or subject classification, contact or geographic location details within a web page.
Of course, many useful databases of the Yellow Pages or White Pages type are made available by service providers on networks, but they are not compiled automatically by using web crawlers to scan HTML web pages posted on a network. For example, www.yellowpages.com.au and www.mcp.com provide classified advertisements of the Yellow Pages type with links to the web pages of paying advertisers or subscribers. There are also directories of email addresses which approximate the White Pages directories, listing the names of individuals and organisations and contact details, (eg www.bigbook.com and query1.whowhere.com). However, these email directories require listers to manually add their directory entries and enquirers to be aware of and to find the directory enquiry web page. They cannot be automatically generated by scanning web pages using web crawlers since there is no adequate mechanism to relate email addresses to the names of people and organisations and their other contact details which may also exist in the same web page.
The general object of the invention is to provide improved methods for automatically building searchable databases of classification, contact, and/or geographical information by using web crawlers to interrogate web pages posted on a network. [For convenience, this information is collectively referred to as CCG-data].
Other non-essential objectives are to provide methods for including and/or displaying CCG-data within web pages accessed by browsers, for automatically extracting CCG-data from web pages posted on a network and for using the same, and/or to provide methods for searching automatically compiled databases using such data.
Another subsidiary objective of the invention is to provide a new form of web page which is better suited to the automatic compilation (using web crawlers) of databases constructed by the automatic scanning of many such pages posted on a network.
The invention is based upon the realisation that highly useful databases can be automatically built by successively interrogating web pages posted on a network if one or more HTML encoded CCG phrases are included in the web pages. A CCG phrase is one containing CCG-data in a form which is directly accessible and identifiable. CCG phrases may also include one or more items which provide the web page author with control over how the CCG-data is applied to the database.
Data duplication can be reduced if some of the CCG-data in the coded CCG phrases can be displayed by browsers as well as being used to update databases. Errors due to inexactly duplicated data are also eliminated. Accordingly, it is envisaged that CCG phrases may include one or more items which provide the web page author with control over how the CCG-data is displayed by a browser.
HTML (including version 2 and version 3) and XML are evolving applications (sub-sets or dialects) of ISO Standard 8879 1986 known as Standard Generalised Markup Language (SGML). HTML, in large part, is a language used to describe how text (unstructured data) and graphics are to be formatted for display. The HTML language consists of a finite number of "elements" (for example, "<BR>" where "BR" is the element name, also called the tag name) which may contain "attributes" (for example, "<DL COMPACT>" where "COMPACT" is an attribute named "COMPACT") and may contain values associated with attributes (for example, "<FONT SIZE=+1>" where +1 is the attribute value of the attribute named "SIZE"). XML is a language used to describe structured data. The XML language is similarly composed of elements, attributes and values with a similar syntax to HTML but unlike HTML the element names which may be used are not restricted and the meaning of the XML data may be interpreted in any convenient manner. While the XML language is mute about how data described by XML is to be formatted for display, the data may be used by computer programs for any purpose including description of how XML coded data is displayed. However, due to its historic importance in connection with web pages, the term "HTML" is herein used to refer to all markup languages which are subsets or complete sets of the SGML language. In particular, the term "HTML encoded CCG phrase" and the synonymous term "CCG phrase" are herein used to refer to CCG-data encoded in a subset or complete set of the SGML language.
Herein, a "web page" is a document adapted to be or actually accessible through a network and encoded in a subset or complete set of the SGML language.
For convenience, CCG items in HTML encoded CCG phrases, whether they are syntactically represented as elements or as attributes, will be referred to hereinafter as CCG attributes.
A CCG phrase includes at least one of the following identifiable types of CCG-data attributes:
industry, product, service, and/or subject classifications,
contact categories, contact person(s) and/or organisation(s) names, titles or associations, contact details including physical and postal addresses, telephone and fax numbers, email and Internet or network addresses or locations, public keys, and
geographic location details.
A CCG phrase may also include any of the following identifiable types of CCG control attributes:
database control attributes to indicate which parts of the data are to be used to update databases, and
display control attributes to indicate how browsers are to display the data.
By virtue of occurring in the same CCG phrase, a plurality of CCG-data attributes are associated with each other.
By virtue of their occurrence in the same CCG phrase, CCG-data attributes are identified as a set of associated attributes. However, the degree of association between attributes can be controlled by the inclusion in the phrase of database control attributes.
The start and end of CCG phrases should be identifiable to clearly distinguish these phrases from other data. To identify the beginning and end of a CCG phrase, at least one HTML element should have a CCG specific HTML element name or CCG specific attribute name or CCG specific value. Each CCG attribute may consist, with or without other incidental characters, of a CCG attribute name and/or a CCG value or values. Preferably, each CCG phrase is contained in the "<body>" of the web page.
Examples of a CCG specific HTML element are: "<CCG . . .>" or "<CCG . . . />" or "<CCG> . . . </CCG>". (Where a CCG phrase is coded in XML, the elements "<XML>" and "</XML>" may also be needed at the start and end of the CCG phrase.) A less satisfactory example is: "<!-- CCG . . . -->" where the characters "CCG" after the HTML comment element name "!--" are used to signify that the comment contains CCG-data. An example of the use of a CCG specific attribute name is: "<START CCG>" . . . "<END CCG>". An example of the use of a CCG specific value is: "<START TYPE='CCG'>" . . . "<END TYPE='CCG'>". Obviously, other character strings could be substituted for the element name, element attribute name or element attribute value "CCG" string of the examples.
The codes "<CCG . . .>" and "<CCG . . . />" are compatible with most HTML specifications but, because they are non-standard HTML, most web browsers do not display any text or attributes (eg PQ="AQD") within the angle brackets "<" and ">". These codes are preferred where display of the CCG data is not required and compatibility with older browsers is required (eg CCG phrases containing only classification values).
From one aspect, therefore, the invention comprises a web page for posting on a network, the web page being characterised by the inclusion of at least one CCG phrase in the "<body>" of the page, the CCG phrase being such that the CCG attributes contained therein are accessible and identifiable by (i) HTML compliant editors and/or (ii) HTML compliant web crawlers for the automatic construction of databases of classified information, and/or (iii) HTML compliant browsers for display on the computer screens of network users.
From another aspect, the invention comprises a method of constructing web pages of the above described type. The web pages may be constructed on digital computers using simple text editors such as Microsoft Windows Notepad, or preferably, purpose built human controlled editors or automated composing programs which embody knowledge of HTML and CCG syntax and grammar. Whichever process is used, CCG attributes are selected and inserted, modified, deleted and/or organised to form valid CCG phrases in HTML encoded documents and the documents are posted on computer readable storage devices of computers connected to a computer network so that the documents are generally available to computers on the network.
From another aspect, the invention comprises a method of populating a database with CCG-data extracted from web pages. Web pages posted on a network are successively retrieved by a digital computer program (eg a web crawler) and CCG phrases contained therein are identified and at least some of the CCG attributes found within the CCG phrases are extracted. The CCG attribute names are used to determine the type of data in the associated values. Generally the CCG attributes of interest are those relating to classification, contact and geographic data and database update controls while the attributes of little or no interest in relation to database updating are those relating to display controls. Of course, the CCG-data extracted need only be that relevant to the particular database being updated. For example, one database may have been designed to index only web page classifications and URLs while another database may have been designed to index only contact details. Databases also differ in their internal representation of data and means of associating data. For example, some use "flat file" tables, others use pointers to data to create network associations while others use hashing and buckets.
The conventional nomenclature differs considerably between different types of database. Depending on the particular database nomenclature, data of the same type is said to be stored in table columns, fields, attributes and properties. The terms column and field are somewhat related to the physical representation of the data in files while attribute and property are more related to the logical representation of data. To avoid confusion with the terms "HTML attribute", "CCG attribute" or just "attribute", hereinafter a database property means both a type of data stored in the database and a place in the database where data of the same type is stored. Database properties are referred to by a name ("property name") or similar reference and contain values. For example, a database property with the name "City name" and which contains values which are all the names of cities may be defined as a "City name" type database property.
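The extraction step can be sketched as follows, using the HTML parser in the Python standard library. This is an illustrative sketch only: the attribute names industry, city and display, and the rule that display is a display control attribute, are hypothetical, since the text above does not fix a CCG attribute vocabulary.

```python
from html.parser import HTMLParser

# Hypothetical display control attribute names, ignored when updating a database.
DISPLAY_CONTROL_NAMES = {"display"}


class CCGExtractor(HTMLParser):
    """Collect the attributes of every <CCG ...> element found in a page."""

    def __init__(self):
        super().__init__()
        self.phrases = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases element and attribute names before calling this.
        if tag == "ccg":
            phrase = {name: value for name, value in attrs
                      if name not in DISPLAY_CONTROL_NAMES}
            self.phrases.append(phrase)


def extract_ccg(html_text):
    """Return one dict of CCG attribute names and values per CCG phrase."""
    parser = CCGExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.phrases


page = ('<html><body><p>Welcome</p>'
        '<CCG industry="plumbing" city="Sydney" display="none">'
        '</body></html>')
phrases = extract_ccg(page)
```

Each dictionary keeps the attributes of one CCG phrase together, so the association expressed by the phrase is available to whatever database update process follows.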
Whichever style of database is used, it is preferred that the database update program relate the CCG attributes to corresponding database properties used by the database update process so that the database property values are updated with CCG values in a manner which preserves the distinctness, content and meaning of the CCG values and, preferably, preserves the CCG value associations expressed in the CCG phrase as sets of associated database property values of different types.
In some cases, it is desired to know the address of the web page from which the CCG values were extracted. For example, the purpose of building a database might be to allow searching of the database by web page classification to provide a list of URLs of web pages or URLs of portions of web pages which contain matching CCG classifications. The URLs could then be inserted in an HTML document and transmitted to a web browser as a list of references to web pages matching a search expression. In that example, associating the URL of a web page or the URL of a portion of a web page with the CCG values extracted from the same web page or web page portion is important and the URL or means of reconstructing it must be available and supplied to the database update process. In one style of database, the values of the same type are held in separate rows in a column (property) of a database table, and pointers held in another column (property) are associated with the values by sharing the same table row. The table row constitutes a set of associated property values. Each pointer points to a bucket (block of data) containing a list of URLs or pointers to URLs held in a separate bucket or table. In another style of database, values of different types are held in different tables together with a set number, pointer or similar code which is used to indicate which values are associated as members of the same set. In one variation, the values of set members are prefixed with a code indicating the type of value and all values are held in the same column of a table. If the purpose of the database is to hold contact data, recording the web page URL in the database might not be required although, if the URL is not present in the database, updating changes in the CCG contact details contained within a web page is more difficult.
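One possible layout, loosely following the set-number style described above, is sketched below with an in-memory SQLite database. The table name ccg, its column names and the example URLs are assumptions made for illustration and are not prescribed by the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per CCG value; set_id marks values that came from the same CCG phrase.
conn.execute(
    "CREATE TABLE ccg (set_id INTEGER, url TEXT, property TEXT, value TEXT)"
)


def store_phrase(conn, set_id, url, phrase):
    """Store every attribute of one CCG phrase under a shared set number and URL."""
    rows = [(set_id, url, name, value) for name, value in phrase.items()]
    conn.executemany("INSERT INTO ccg VALUES (?, ?, ?, ?)", rows)


store_phrase(conn, 1, "http://example.com/a",
             {"industry": "plumbing", "city": "Sydney"})
store_phrase(conn, 2, "http://example.com/b",
             {"industry": "library", "city": "Sydney"})

# Find the URLs of pages whose CCG phrase carries a given property value.
urls = [row[0] for row in conn.execute(
    "SELECT DISTINCT url FROM ccg WHERE property = ? AND value = ? ORDER BY url",
    ("city", "Sydney"),
)]
```

Because the URL is stored with every value, a later change to the CCG phrase in a page can be applied by replacing the rows recorded for that URL.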
Of course, one database may be used to record all types of CCG values contained in web pages and associate with each other any and all values extracted from the same web page or even from other web pages.
From another aspect, the invention comprises a method of searching the databases constructed as outlined above. These databases may be used for a variety of searching purposes. For example, to find web page URLs by using the association of web page URLs with industry, service, product or subject classification or a person""s or organisation""s name or address or geographic location values or any combination thereof. In another example, the databases may be used to find the contact details for people or organisations by name or location of industry, service, product or web page subject type and so forth by using the association between items of the contact details in the database without having to retrieve web pages associated with the contact details.
More particularly, the searching method involves finding URL references, or finding sets of associated database property values, from databases containing CCG-data. The method includes the steps of parsing a query phrase received from a computer network to extract query relational expressions and, from each expression, deriving a query field name, query relational operator and query value; determining the type of the query field by reference to its name; relating the query field to a corresponding database property according to type; and locating CCG-data database property values in the database property which return a true value when tested against the query value using the query relational operator. Finally, the URL references or the sets of property values associated with the so located CCG-data database property values are extracted.
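The steps of the searching method might be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the field names, type names, property names, the small operator set, the case-insensitive comparison and the representation of a database property as a list of (value, URL) rows are all hypothetical.

```python
import operator
import re

# Hypothetical lookup tables: query field name -> field type -> database property.
FIELD_TYPES = {
    "industry": "classification",
    "product": "classification",
    "city": "geographic",
    "name": "contact",
}
TYPE_TO_PROPERTY = {
    "classification": "class_value",
    "geographic": "geo_value",
    "contact": "contact_value",
}
OPERATORS = {"=": operator.eq, "!=": operator.ne, "<": operator.lt, ">": operator.gt}

# One relational expression: field name, relational operator, value.
EXPRESSION = re.compile(r"\s*(\w+)\s*(!=|=|<|>)\s*(.+?)\s*$")


def parse_query(phrase):
    """Extract (field, operator, value) triples from a query phrase."""
    expressions = []
    for part in re.split(r"\s+AND\s+", phrase.strip()):
        match = EXPRESSION.match(part)
        if match is None:
            raise ValueError(f"cannot parse expression: {part!r}")
        expressions.append(match.groups())
    return expressions


def locate(database, expression):
    """Return sorted URLs whose property value tests true for one expression."""
    field, op, value = expression
    field_type = FIELD_TYPES[field.lower()]    # type determined from the field name
    prop = TYPE_TO_PROPERTY[field_type]        # property related to the field by type
    test = OPERATORS[op]
    return sorted(
        url
        for stored, url in database.get(prop, [])
        if test(stored.lower(), value.lower())  # assumed case-insensitive comparison
    )
```

For instance, `parse_query("industry = plumber AND city=Sydney")` yields two triples, each of which can be passed to `locate` to obtain a list of URL references for that expression.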
Database queries are usually expressed in a query language in the form of a phrase or sentence. In query by example style enquiry systems, the user types values into input fields on a form and a program extracts the input values and uses them to automatically compose a query phrase or sentence. There are many existing examples of query languages used in connection with databases. Generally, they consist of relational expressions (eg Field=Value), logical expressions and grouping of relational and logical expressions by means such as parentheses. They may also contain sorting and output formatting expressions. Abbreviated notation is often used in the expressions, such as omitting field names or relational operators, which are then inferred from the value in the expression or implied by default. In an enquiry the nature and format of the output may also be implied, such as a list of URLs of web pages or a list of contact details. Whatever the mechanism of any particular database, the query expression needs to be parsed and the fields in the query expression, whether explicit, default, implied or inferred, need to be related to database properties of similar type. In some styles of database enquiry the query expression is evaluated against each row of a table or record of a file to find rows or records (ie a set of associated property values) which match the query expression. In other styles, sub-sets of the values of the properties are selected according to the interpretation of relational expressions in the query expression and the sub-sets are combined according to logical and grouping expressions in the query to find the sets of associated property values which match the query expression. Often, to make the logical operations which combine the selected sub-sets more efficient, it is not the values which are selected but pointers to the values (eg table name and table row) or unique keys (eg URLs or pointers to URLs) associated with the values.
For example, the AND logical operator is often used to combine two lists so that only values or pointers or keys common to both lists are found in the combined list. Usually, the query produces a result list which is then provided to other processes. For example, a list of URLs of web pages is processed to produce an attractively formatted HTML encoded document containing the URLs, which is sent to a web browser to allow an enquirer to retrieve interesting web pages. In another example, the contact details associated in the database with each value or pointer in the result list are retrieved from the database and presented as a report in the form of an HTML encoded document, which is sent to a web browser for viewing.
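The combining of two lists by the AND operator, and the formatting of the result list as an HTML encoded document, might be sketched as follows. The function names and the page layout are hypothetical; the URLs serve as the unique keys.

```python
from html import escape


def combine_and(list_a, list_b):
    """Keep only the keys common to both lists, in the order of list_a."""
    in_b = set(list_b)
    seen = set()
    combined = []
    for key in list_a:
        if key in in_b and key not in seen:
            seen.add(key)
            combined.append(key)
    return combined


def render_result_list(urls, title="Search results"):
    """Format a result list of URLs as a simple HTML document of links."""
    items = "\n".join(
        f'  <li><a href="{escape(url, quote=True)}">{escape(url)}</a></li>'
        for url in urls
    )
    return (
        f"<html><head><title>{escape(title)}</title></head><body>\n"
        f"<ul>\n{items}\n</ul>\n"
        "</body></html>"
    )
```

Converting one list to a set makes each membership test a constant-time lookup on average, which is one reason keys or pointers, rather than the values themselves, are convenient to combine.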
From another aspect, the invention comprises a method of displaying CCG-data contained in CCG phrases within web pages which are displayed by a web browser executing on a digital computer. While a web page is loading, or after it has loaded, in a web browser, the web browser parses the web page and displays the text (or data) of the web page on a display device connected to the computer. When the web browser parser encounters CCG phrases, the web browser may display the CCG-data (element and/or attribute names (or translations of element and/or attribute names) and/or values) in a number of browser specific ways. For example, the web browser may by default not display any CCG-data, display all CCG-data, not display any CCG-data until a CCG display control attribute explicitly states that subsequent data should be displayed, or display all CCG-data until a CCG display control attribute explicitly states that subsequent data should not be displayed. The web browser may also use CCG display controls specifying the size, font, position and so forth to alter the display of the CCG-data.
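The four default display behaviours might be sketched as follows. The mode names, the representation of parsed CCG phrases as (kind, payload) pairs and the "show"/"hide" control values are hypothetical, as the specification does not name them here. The sketch also assumes that a later control may reverse an earlier one.

```python
MODES = {"never", "always", "hidden_until_shown", "shown_until_hidden"}


def visible_ccg_data(items, mode):
    """Return the CCG-data payloads a browser would display under a given mode.

    items: sequence of ("data", text) or ("control", "show" | "hide") pairs,
           in the order the parser encounters them.
    """
    if mode not in MODES:
        raise ValueError(f"unknown display mode: {mode!r}")
    if mode == "never":
        return []
    if mode == "always":
        return [payload for kind, payload in items if kind == "data"]

    # The remaining modes start hidden or shown and follow control attributes.
    showing = mode == "shown_until_hidden"
    displayed = []
    for kind, payload in items:
        if kind == "control":
            showing = payload == "show"
        elif showing:
            displayed.append(payload)
    return displayed
```

Controls for size, font or position could be handled in the same loop by carrying the current style alongside the `showing` flag.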