1. Field of the Invention
The present invention relates to a document processing device and document processing method for searching web data.
2. Related Background Art
Since the mid-1990s, the number of WWW documents published on the Internet has increased explosively, and their value to the information industry has grown accordingly. A WWW document is located at a logical information storage position on the Internet, called a URL (Uniform Resource Locator), and a structured data base is constructed as documents mutually refer to these URLs. A search service that efficiently searches this structured data base and provides the required information to a user is critical, and a search engine is the system that executes this service.
A search engine is described in Data of the Technology Trend Group Planning and Research Division, Patent Administration Dept., Japan Patent Office: “Theme title: Creation of standard technologies on search engines”, an overview of technical trends of WWW search engines, [online], [searched on Jan. 29, 2008], Internet <URL: http://www.jpo.go.jp/shiryou/s_sonota/hyoujun_gijutsu/search_engine/douko.htm>, which states specifically that a “search engine is handling information space which is enormous and constantly changing, so it must have the following functions which are different from conventional search technology, and research and development are progressing to implement and advance these functions: function to efficiently collect information dispersed on the WWW; function to extract keywords from information described freely in an undefined format in HTML, and search this information at high-speed; interface function for each search; function to rank enormous search results efficiently.”
In Data of the Technology Trend Group Planning and Research Division, Patent Administration Dept., Japan Patent Office: “Theme title: Creation of standard technologies on search engines”, an overview of technical trends of WWW search engines, [online], [searched on Jan. 29, 2008], Internet <URL: http://www.jpo.go.jp/shiryou/s_sonota/hyoujun_gijutsu/search_engine/douko.htm>, the following description is included. This search engine is comprised of such components as a “WWW robot, collected text group, indexer, search index file, search server and browser.” The WWW robot has a function to “(1) collect information” from the world of the Internet web. The collected WWW pages are stored in the collected text group, and “(2) data analysis (pre-processing)” is performed before transferring the data to the indexer. Index files for a full text search or category search are generated in the components of the indexer and search index file, and a basic data base for “(3) search processing” is operated. Information on input and output is exchanged among the search server, client and browser, where many “(4) input/output interfaces” intervene and function.
FIG. 1 is a diagram depicting a system configuration of a general search system mentioned above. As FIG. 1 shows, a web robot 501 automatically collects web pages containing HTML text from the Internet web 500. The collected web pages are stored in an index file 503 via a server 502. The operator may store each web page in the index file 503 by operating a PC 504.
The user sends a search request to a search server 505 via a web server 506 using a web browser of a terminal 507. The search server 505 performs search processing, referring to the index file 503, and outputs the result to the terminal 507, whereby the terminal can acquire the search result.
By this processing, the user receives an enormous number of search results, and therefore needs a way to grasp the search results efficiently. Here, prior art on “a function to efficiently rank the enormous amount of search results” will be described. This function is normally implemented by combining conformity and significance. Conformity is a scale that measures the degree to which a WWW document matches the intention of the search, such as whether the word searched by the user appears frequently in the WWW document, or whether the WWW document matches the search history of the user. Significance is a scale that measures the degree to which a WWW document contains beneficial information that is generally read by many individuals.
For example, U.S. Pat. No. 6,112,202 Description and “Technical trends of WWW search engines” by Masanori Harada, Technical Report of IEICE, SSE2000-228, pp. 17-22, 2001 describe HITS, which is one ranking search method that implements both conformity and significance. HITS searches for web pages including a keyword representing a topic, and detects authorities and hubs from the web graph in the neighborhood of those searched web pages that have high conformity. Authority is a scale indicating a web page which is referred to by many hubs in the web graph, and which receives high evaluations. Hub is a scale indicating a web page which serves as a collection of links, referring to many authorities in the web graph. In HITS, the authority score and hub score of each web page in the web graph are calculated by iterated calculation, and web pages are output in descending order of the authority score. Thereby significant web pages can be found in the web page group related to the provided topic. FIG. 2 is a diagram depicting the concept of the HITS algorithm. As FIG. 2 shows, the web page 601, which is referred to by many web pages, has a high authority score. The web page 602, which refers to many web pages, has a high hub score.
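The iterated calculation of authority and hub scores described above can be sketched as follows. This is a minimal illustration only, not the implementation of the cited references: the function names `hits` and `_normalize`, the dictionary representation of the web graph, the fixed iteration count and the Euclidean normalization are all assumptions introduced for this example.

```python
import math


def _normalize(scores):
    """Scale scores to unit Euclidean length (assumed normalization)."""
    length = math.sqrt(sum(value * value for value in scores.values()))
    if length == 0.0:
        return dict(scores)
    return {page: value / length for page, value in scores.items()}


def hits(links, iterations=50):
    """Return (authority, hub) score dicts for a directed web graph.

    links maps each page to the list of pages it refers to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    authority = {page: 1.0 for page in pages}
    hub = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the referring pages.
        new_authority = {page: 0.0 for page in pages}
        for source, targets in links.items():
            for target in targets:
                new_authority[target] += hub[source]
        # Hub score: sum of the authority scores of the pages referred to.
        new_hub = {
            page: sum(new_authority[target] for target in links.get(page, ()))
            for page in pages
        }
        authority = _normalize(new_authority)
        hub = _normalize(new_hub)
    return authority, hub


# Hypothetical graph: three pages refer to "a", and one of them also to "b".
authority, hub = hits({"h1": ["a", "b"], "h2": ["a"], "h3": ["a"]})
ranking = sorted(authority, key=authority.get, reverse=True)
```

In this hypothetical graph, page "a" corresponds to the web page 601 of FIG. 2 (referred to by many pages), and page "h1" corresponds to the web page 602 (referring to many pages).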
The above scores are calculated during a search. As a static method for calculating the significance of WWW documents, on the other hand, the page ranking method used by Google Inc. in the USA is well known. For example, as U.S. Pat. No. 6,285,999 Description shows, this page ranking method uses the huge link structure of WWW documents.
For example, if WWW document A refers to WWW document B, it is regarded that WWW document A supports the significance of WWW document B. At this time, the support is weighted by the significance of WWW document A itself. The significance of a WWW document is thus represented by the sum total of the support from the other WWW documents which refer to it, each weighted by the significance of the referring document. In this way, if a large scale calculation is performed recursively, tracking the references of all WWW documents, the significance of each WWW document is determined.
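The recursive calculation described above can be sketched as follows, under stated assumptions. The damping factor, the division of a document's support by its number of outgoing references, and the uniform handling of documents without outgoing references follow the commonly published form of page ranking and are not stated in this description; the function name `page_significance` is hypothetical.

```python
def page_significance(links, damping=0.85, iterations=100):
    """Return a significance score for each document in a link graph.

    links maps each document to the list of documents it refers to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    score = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Documents without outgoing references share their score uniformly
        # (an assumption made so that the total score is preserved).
        dangling = sum(score[page] for page in pages if not links.get(page))
        new_score = {
            page: (1.0 - damping) / n + damping * dangling / n for page in pages
        }
        for source in pages:
            targets = links.get(source, [])
            for target in targets:
                # The support given by `source` is weighted by its own score.
                new_score[target] += damping * score[source] / len(targets)
        score = new_score
    return score


# Hypothetical example: documents A and C refer to B, and B refers to A.
score = page_significance({"A": ["B"], "C": ["B"], "B": ["A"]})
```

In this example, document B is supported by two documents and receives the highest score, while document C, which no document refers to, receives the lowest.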
Recently, owing to improved software and browser functions for reading WWW documents, the WWW documents that users access are measured in conjunction with search engines, and this measured popularity is added to the parameters used to determine significance.
According to “Beyond PageRank: Machine Learning for Static Ranking” by Matthew Richardson, Amit Prakash, Eric Brill, Proc. WWW 2006, [online], [searched on Jan. 29, 2008], Internet, <URL: http://www2006.org/programme/files/xhtml/3101/p3101-Richardson.html>, the frequency and time of user access (that is, popularity) are added to the page ranking to determine the significance of a WWW document. According to US Patent Application Laid-Open No. 2007/0143345 Description, data on how often a WWW document in the search result was clicked on during a predetermined period is used as a history for calculating the ranking.
Prior art on determining the significance of WWW documents was described above, but a problem remains that there are too many choices when the search result is presented according to conformity. To solve this problem of too many choices, a method of estimating user interest based on the browsing history of the user, and rearranging the ranking of the listed pages based on the weight of the characteristics of the search history, has been proposed. The following is disclosed in “E output interface E-2-(1), output with ranking” reported in Data of the Technology Trend Group Planning and Research Division, Patent Administration Dept., Japan Patent Office: “Theme title: Creation of standard technologies on search engines”, an overview of technical trends of WWW search engines, [online], [searched on Jan. 29, 2008], Internet <URL: http://www.jpo.go.jp/shiryou/s_sonota/hyoujun_gijutsu/search_engine/douko.htm>.
In other words, in order to solve the problem of too many choices, a method of estimating user interest based on the browsing history of the user, and rearranging the sequence of the listed pages based on the weight of the characteristics of the search history, is proposed. In more concrete terms, it is assumed that a user browsed pages 1, 2, . . . , n following links. Based on the assumption that the interest of the user is higher for content which was read more recently, a larger weight is given to more recently read web pages. A weight of a word (weight of index) is determined by adding up the “weight of history” of the pages including the target word. This will be described with reference to FIG. 3. FIG. 3 is a diagram depicting the transition of web pages read by a user, and shows that the user sequentially accesses page 1 to page 4. Here in FIG. 3, Nw(k) indicates the weight of history of page k, and can be expressed by Nw(k) = r^(n−k), for example. The user has browsed page 1, page 2, page 3 and page 4, and since the word “e” is included in page 1, page 3 and page 4, the “weight of index e” is determined by adding the weights of history Nw(k) of these pages.
After the above browsing, the user inputs a keyword to the search engine, and collects necessary information. An index included in each of the collected pages is detected, and the weights of these indexes are added up, whereby the weight of each page that is a selection candidate is calculated. The user can then access the pages sequentially, starting from the page having the heaviest weight. The same method is also disclosed in Japanese Patent Application Laid-Open Nos. 10-207901 and 2002-32401.
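The history-based weighting described with reference to FIG. 3 can be sketched as follows. This is an illustration under assumptions: the weight of history is taken to be Nw(k) = r^(n−k) with 0 < r < 1 so that recently read pages weigh more, the value r = 0.5 is arbitrary, and all function names and the words other than “e” are hypothetical.

```python
def index_weights(browsed_pages, r=0.5):
    """Return the weight of each index (word) from a browsing history.

    browsed_pages lists the words of pages 1..n in the order browsed.
    """
    n = len(browsed_pages)
    weights = {}
    for k, words in enumerate(browsed_pages, start=1):
        history_weight = r ** (n - k)  # Nw(k); page n has weight 1
        for word in set(words):
            weights[word] = weights.get(word, 0.0) + history_weight
    return weights


def page_weight(page_words, weights):
    """Weight of a candidate page: sum of the weights of its indexes."""
    return sum(weights.get(word, 0.0) for word in set(page_words))


def rank_candidates(candidates, weights):
    """Return candidate page names, heaviest page first."""
    return sorted(
        candidates,
        key=lambda name: page_weight(candidates[name], weights),
        reverse=True,
    )


# As in FIG. 3, the word "e" appears in pages 1, 3 and 4 of four pages.
weights = index_weights([{"e", "a"}, {"b"}, {"e"}, {"e", "c"}], r=0.5)
# weights["e"] = 0.125 + 0.5 + 1.0 = 1.625
order = rank_candidates({"X": {"e"}, "Y": {"b"}, "Z": {"a", "c"}}, weights)
```

With these hypothetical inputs, candidate page X, which contains the heavily weighted word “e”, is listed first.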
In a document search, a search technology using the tf·idf characteristic is under consideration. In this technology, the weight of keyword ti (i=1, . . . , M), which appears in a document set {Dj|j=1, . . . , N} is calculated for each document, and the keyword weight vector wj is expressed by the following Expression (1).
[Expression 1]
$$w_j = (w_j^1, w_j^2, \dots, w_j^M)^T \qquad (1)$$
where T denotes transposition.
Here N denotes a number of search target documents, M denotes a number of keywords in a natural language (e.g. Tokyo, portable phone, baseball, station, economy, stocks, . . . ), and is a very large number.
Each weight can be calculated by the following Expression (2).
[Expression 2]
$$w_j^i = \mathrm{tf}_j^i \times \mathrm{idf}_i \qquad (2)$$
In other words, the weight is given by the product of the term frequency (tf) and the inverse document frequency (idf). Term is a synonym for keyword.
A weight wji of a keyword ti, which appears in a document Dj, should be high if the keyword ti appears frequently in the document Dj and appears infrequently in other documents. If the keyword ti appears frequently in the document Dj and also appears frequently in other documents, the weight wji may be low. The tf·idf characteristic is a representation of this heuristic knowledge, and can be defined as shown in the following Expression (3) and Expression (4).
[Expression 3]
$$\mathrm{tf}_j^i = \mathrm{freq}(i, j) \qquad (3)$$
where freq(i, j) denotes the frequency of appearance of the term ti in the document Dj.
[Expression 4]
$$\mathrm{idf}_i = \log\frac{N}{\mathrm{Dfreq}(i)} + 1 \qquad (4)$$
where Dfreq(i) denotes the number of documents in which the term ti appears (document frequency), and idfi denotes Dfreq(i) normalized by the total number of documents N. The tf·idf characteristic has many improved versions, but the above mentioned general definition is used here.
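Expressions (2) to (4) can be illustrated by the following sketch. The base of the logarithm is not specified above, so the natural logarithm is assumed here; assigning a weight of zero to a keyword that appears in no document is also an assumption, made to avoid division by zero. The function name `tfidf_vectors` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents, keywords):
    """Return the keyword weight vector w_j of each document D_j.

    documents is a list of token lists; keywords lists t_1..t_M.
    """
    n_docs = len(documents)
    counts = [Counter(doc) for doc in documents]
    # Dfreq(i): number of documents in which keyword t_i appears.
    dfreq = [sum(1 for c in counts if c[t] > 0) for t in keywords]
    # Expression (4): idf_i = log(N / Dfreq(i)) + 1 (natural log assumed).
    idf = [math.log(n_docs / df) + 1.0 if df else 0.0 for df in dfreq]
    # Expressions (2) and (3): weight = freq(i, j) * idf_i.
    return [[c[t] * idf_i for t, idf_i in zip(keywords, idf)] for c in counts]


keywords = ["tokyo", "station", "baseball"]
documents = [
    ["tokyo", "station", "tokyo"],
    ["station", "baseball"],
    ["station"],
]
vectors = tfidf_vectors(documents, keywords)
```

In this example the keyword "station" appears in every document, so its idf is log(3/3) + 1 = 1, whereas "tokyo" appears in only one document and receives the larger idf of log(3) + 1.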
Now the search input is expressed as a search vector q. This is also M-dimensional, and is given by the following Expression (5).
[Expression 5]
$$q = (q_1, q_2, \dots, q_M)^T \qquad (5)$$
In Expression (5), qi is 1 if the keyword ti is included, and is 0 if not included.
In search processing, the document DX whose similarity to the search vector q is the maximum is searched for in the document set. For searching, the cosine distance determined by normalizing the inner product is normally used, as shown in Expression (6) and Expression (7), so that the result is normalized with respect to the number of words in a document.
[Expression 6]
$$x = \underset{j}{\operatorname{argmax}}\ \mathrm{sim}(q, w_j), \quad 1 \le j \le N \qquad (6)$$
where
[Expression 7]
$$\mathrm{sim}(q, w_j) = \frac{q^T w_j}{\|q\|\,\|w_j\|} = \frac{q_1 w_j^1 + \dots + q_M w_j^M}{\sqrt{q_1^2 + \dots + q_M^2} \times \sqrt{(w_j^1)^2 + \dots + (w_j^M)^2}} \qquad (7)$$
Expression (7) itself, however, expresses a degree of similarity, and the cosine distance used as a scale to satisfy the system of axioms of distance is 1−sim(q, wj).
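The search of Expressions (5) to (7) can be sketched as follows. The function names are hypothetical, returning a similarity of zero for an all-zero vector is an assumption made to avoid division by zero, and the returned document index is 0-based, unlike the 1-based index j used in the expressions.

```python
import math


def search_vector(search_words, keywords):
    """Expression (5): q_i is 1 if keyword t_i is in the search input."""
    return [1 if t in search_words else 0 for t in keywords]


def sim(q, w):
    """Expression (7): inner product normalized by the vector lengths."""
    dot = sum(qi * wi for qi, wi in zip(q, w))
    norm = math.sqrt(sum(qi * qi for qi in q)) * math.sqrt(sum(wi * wi for wi in w))
    return dot / norm if norm else 0.0


def search(q, doc_vectors):
    """Expression (6): index of the document with maximum similarity."""
    return max(range(len(doc_vectors)), key=lambda j: sim(q, doc_vectors[j]))


keywords = ["tokyo", "station", "baseball"]
q = search_vector(["tokyo", "baseball"], keywords)
doc_vectors = [[2.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 3.0, 0.0]]
best = search(q, doc_vectors)
distance = 1.0 - sim(q, doc_vectors[best])
```

The last line computes the cosine distance 1−sim(q, wj) mentioned above for the best matching document.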
Conventional example 1 is a search system based on the keyword weight vector, shown in FIG. 4, which embodies the prior art. FIG. 4 is a diagram depicting the system configuration of a general search system, which is comprised of a terminal 20, a web server 21 and a search server 22. In this example, a searching word which is input from the terminal 20 is sent to the web server 21, where it is converted into a search vector q and sent to the search server 22. The search server 22 searches according to the search vector q, and as a search result, a WWW document DX is sent to the web server 21 and the terminal 20.
Conventional example 1 simply outputs a search result. As an improvement of conventional example 1, a conventional example 2 is under consideration, which is a search system that uses the evaluation values given by the following Expression (8) and Expression (9) to evaluate similarity in consideration of a user profile. Display of the searched WWW documents is processed based on the evaluation values calculated by Expression (8) and Expression (9). In other words, the searched WWW documents are displayed in the sequence according to the evaluation values.
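[Expression 8]
$$\mathrm{A\_score}(q, w_j; p_k) = \lambda\,\mathrm{sim}(q, w_j) + (1 - \lambda)\,\mathrm{sim}(p_k, w_j), \quad 0 \le \lambda \le 1 \qquad (8)$$
where pk denotes a user profile of a user k.
[Expression 9]
$$p_k = (p_k^1, p_k^2, \dots, p_k^M)^T \qquad (9)$$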
As shown above, the user profile of a user k is represented by the keyword weight vector. In this way, the WWW documents, searching word and user profile of a user k can also be represented by similar vectors.
To construct the user profile, the sum of Nw(j) in the WWW documents Dj accessed in the past is determined, as shown in FIG. 3 (Data of the Technology Trend Group Planning and Research Division, Patent Administration Dept., Japan Patent Office: “Theme title: Creation of standard technologies on search engines”, an overview of technical trends of WWW search engines, [online], [searched on Jan. 29, 2008], Internet <URL: http://www.jpo.go.jp/shiryou/s_sonota/hyoujun_gijutsu/search_engine/douko.htm>). By replacing Nw(j) in FIG. 3 with wj, the following Expression (10) can be created.
[Expression 10]
$$p_k = \sum_{j \in \{\text{documents visited by user } k\}} w_j^T \qquad (10)$$
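The construction of the user profile in Expression (10) and the evaluation value of Expression (8) can be sketched as follows. The function names, the default value of λ and the sample vectors are assumptions introduced for illustration; the profile is computed here as the element-wise sum of the keyword weight vectors of the visited documents.

```python
import math


def cosine(u, v):
    """Similarity of Expression (7); zero vectors give 0.0 (assumption)."""
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm = math.sqrt(sum(ui * ui for ui in u)) * math.sqrt(sum(vi * vi for vi in v))
    return dot / norm if norm else 0.0


def build_profile(visited_vectors):
    """Expression (10): sum of the weight vectors of visited documents."""
    return [sum(column) for column in zip(*visited_vectors)]


def a_score(q, w_j, p_k, lam=0.5):
    """Expression (8): blend of query similarity and profile similarity."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must satisfy 0 <= lam <= 1")
    return lam * cosine(q, w_j) + (1.0 - lam) * cosine(p_k, w_j)


# Hypothetical user who visited two documents weighted toward keyword 3.
p_k = build_profile([[1.0, 0.0, 2.0], [0.0, 0.0, 1.0]])  # [1.0, 0.0, 3.0]
q = [0, 1, 0]
w_a = [0.0, 1.0, 0.0]  # matches only the query
w_b = [0.0, 1.0, 1.0]  # matches the query and the profile
scores = {"a": a_score(q, w_a, p_k), "b": a_score(q, w_b, p_k)}
```

With λ = 1 the profile is ignored and document a is preferred, whereas with λ = 0.5 document b, which also matches the interests recorded in the profile, receives the higher evaluation value.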
As a scheme that also adds significance as an evaluation point, a conventional example 3, which is a search system using the evaluation value given by the following Expression (11), is under consideration.
[Expression 11]
$$\mathrm{B\_score}(q, w_j; p_k, s_j) = \lambda\,\mathrm{A\_score}(q, w_j; p_k) + (1 - \lambda)\,s_j, \quad 0 \le \lambda \le 1 \qquad (11)$$
where sj (0≦sj≦1) denotes the significance of the WWW document Dj. The value λ may differ from that in Expression (8).
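Expression (11) and the resulting display order can be sketched as follows. The function names, the dictionary inputs and the sample values are hypothetical; the A_score values are assumed to have been calculated beforehand by Expression (8).

```python
def b_score(a_value, s_j, lam=0.5):
    """Expression (11): blend of A_score and document significance s_j."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must satisfy 0 <= lam <= 1")
    if not 0.0 <= s_j <= 1.0:
        raise ValueError("s_j must satisfy 0 <= s_j <= 1")
    return lam * a_value + (1.0 - lam) * s_j


def rank_documents(a_values, significance, lam=0.5):
    """Return document IDs in descending order of B_score."""
    return sorted(
        a_values,
        key=lambda doc_id: b_score(a_values[doc_id], significance[doc_id], lam),
        reverse=True,
    )


# Hypothetical values: D1 is more similar, D2 is more significant.
a_values = {"D1": 0.9, "D2": 0.6}
significance = {"D1": 0.1, "D2": 0.8}
order = rank_documents(a_values, significance, lam=0.5)
```

With λ = 0.5 the more significant document D2 is displayed first, while with λ = 1 the significance is ignored and the order follows A_score alone.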
FIG. 5 shows an operation of a standard search system according to the above conventional examples 2 and 3. As FIG. 5 shows, a user inputs a searching word from a terminal 20 (S101), and a search vector q is generated in a web server 21 (S102). The search vector q generated here is sent to the search server 22, and the search server 22 outputs document IDs in descending order of similarity (S103). In the web server 21, content for displaying the WWW documents having higher similarity is generated (S104), and the content is displayed on the terminal 20 (S105).
Furthermore “Shohei Tsujimoto, Noriyuki Matsuda, So Harijima, Junichi Toyota, “Browsing support using context information—mounting on web and experimental evaluation thereof”, Annual Conference of JSAI (11th) Post Proceedings (Jun. 24, 1997), The Japanese Society for Artificial Intelligence, pp. 466-467” is known.
The above conventional search methods are based on the following assumptions: (1) the basic concept of page ranking, namely that a WWW document linked with a good quality WWW document has good quality, and (2) that the keyword weight vector w of a WWW document and the personal profile p of a user are generated from sufficient information.
However, the above assumptions are not always applicable to a set of WWW documents viewed by a mobile terminal (hereafter called “mobile content”), and an appropriate search result cannot always be acquired by the prior art. FIG. 6 is a diagram depicting the structure of mobile content in site A and site B. An independent server that provides a service is here called a “site”. WWW documents viewed from personal computers are often mutually referred to (linked), but mobile content, which has a tree-structured directory within the server that provides the respective service, is in many cases independent, and is normally not linked between sites. For example, as FIG. 6 shows, site A and site B are independent from each other, and their respective content is not linked at all.
Since sites are not linked to each other, the assumption that a WWW document linked with a good quality WWW document has good quality does not always hold. Also, the WWW documents of mobile content are short and do not contain many keywords, which is a different characteristic from WWW documents viewed on a PC. Another characteristic is that the number of dynamically generated WWW documents, such as news and transfer guides, is high. For example, in the case of site A in FIG. 6, newspapers and news are stored in a dynamic WWW document A, and transfer guide information is stored in a dynamic WWW document B. This information is updated or generated based on a user request. Therefore the content of the document existing at a given URL often differs from one access to another.
Because of this situation, it is difficult to determine the significance of content of several hundred words without links while considering the personal access history, using an evaluation value such as the one shown in Expression (8) or Expression (11). It is also difficult to represent a personal profile with a keyword weight vector. As a consequence, it is difficult to present WWW documents that satisfy a user in a search.