US 7,321,880 B2
Web services access to classification engines
Desiree D. G. Gosby, Allston, Mass. (US)
Assigned to International Business Machines Corporation, Armonk, N.Y. (US)
Filed on Jul. 02, 2003, as Appl. No. 10/613,560.
Prior Publication US 2005/0005232 A1, Jan. 06, 2005
Int. Cl. G06F 7/00 (2006.01); G06F 17/30 (2006.01)
U.S. Cl. 706—20 [707/3; 707/6]
20 Claims
1. A method for document analysis and retrieval, comprising the steps of:
accessing a document taxonomy that comprises M categories such that M is at least 2, wherein the document taxonomy is based
on a subject matter classification in conjunction with a collection of stored documents, wherein each category of the M categories
has at least one associated category key, wherein the category keys of all M categories collectively consist of N unique
category keys sequentially ordered and denoted as CATKEY_1, CATKEY_2, . . . , CATKEY_N;
transmitting, by a remote host in a first computing system to a web service host in a second computing system, a first portion
of a document; and
sequentially transmitting, by the remote host to the web service host, at least one additional portion of the document, wherein
the first portion and the at least one additional portion collectively comprise the entire document, wherein the entire document
is adapted to be reconstructed and subsequently processed via processing said entire document by the web service host, said
processing comprising:
extracting text from said entire document to configure said text in a text format, if said entire document received by said
web service host comprises said text in a non-text format;
generating a plurality of document keys associated with said text from analysis of said text in said text format, if said
entire document received by said web service host comprises said text in said text format, or if said web service host has
previously performed said extracting such that said text in said text format is available to said web service host;
generating a document key vector VDOC of order N, wherein said generating VDOC comprises for n=1, 2, . . . , N: setting VDOC_n equal to 1 if the plurality of document keys comprises a document key equal to CATKEY_n, otherwise setting VDOC_n equal to 0;
after said generating VDOC, generating a document weight vector WDOC of order N, wherein said generating WDOC comprises for n=1, 2, . . . , N: setting WDOC_n equal to a first frequency count raised to a power P1 greater than 1, wherein the first frequency count consists of a number of appearances, in the document, of the document key
associated with VDOC_n if VDOC_n is equal to 1 or consists of 0 if VDOC_n is equal to 0;
for each category m (m=1, 2, . . . , M): generating a category vector VCAT(m) of order N, wherein said generating VCAT(m) comprises for n=1, 2, . . . , N: setting VCAT(m)_n equal to 1 if category m has a category key equal to CATKEY_n, otherwise setting VCAT(m)_n equal to 0;
after said generating VCAT(m), for each category m (m=1, 2, . . . , M): generating a category weight vector WCAT(m) of order N, wherein said generating WCAT(m) comprises for n=1, 2, . . . , N: setting WCAT(m)_n equal to a second frequency count raised to a power P2 greater than 1, wherein the second frequency count consists of a number of appearances, in the collection of stored documents,
of the category key associated with VCAT(m)_n if VCAT(m)_n is equal to 1 or consists of 0 if VCAT(m)_n is equal to 0;
computing distances, wherein said computing distances is selected from the group consisting of computing first distances,
computing second distances, computing third distances, and computing fourth distances, wherein said computing first distances
comprises computing a dot product of VDOC and VCAT(m) for m=1, 2, . . . , M, wherein said computing second distances
comprises computing a dot product of VDOC and WCAT(m) for m=1, 2, . . . , M, wherein said computing third distances
comprises computing a dot product of WDOC and VCAT(m) for m=1, 2, . . . , M, and wherein said computing fourth distances
comprises computing a dot product of WDOC and WCAT(m) for m=1, 2, . . . , M;
determining, from said computed distances, a set of closest categories to the document, if said entire document received by
said web service host comprises said document keys, or if said web service host has previously performed said generating the
plurality of document keys such that said document keys are available to said web service host.
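
The portion-by-portion transmission recited in claim 1 can be illustrated with a minimal Python sketch. It is not drawn from the patent: the names split_into_portions and PortionReceiver, the fixed portion size, and the in-memory hand-off that stands in for a web service call are all assumptions made for illustration. The sketch only shows that a first portion plus additional portions, sent in sequence, let the receiving side reconstruct the entire document.

```python
from typing import Iterator, List


def split_into_portions(document: bytes, portion_size: int) -> Iterator[bytes]:
    """Yield consecutive portions that together cover the whole document."""
    if portion_size <= 0:
        raise ValueError("portion_size must be positive")
    for start in range(0, len(document), portion_size):
        yield document[start:start + portion_size]


class PortionReceiver:
    """Hypothetical host-side buffer; assumes portions arrive in order."""

    def __init__(self) -> None:
        self._portions: List[bytes] = []

    def receive(self, portion: bytes) -> None:
        # Store each portion as it arrives.
        self._portions.append(portion)

    def reconstruct(self) -> bytes:
        # Concatenate the stored portions back into the entire document.
        return b"".join(self._portions)


document = b"Quarterly revenue grew while revenue forecasts and costs held steady."
receiver = PortionReceiver()
for portion in split_into_portions(document, portion_size=16):
    # In a real deployment this would be one remote call per portion.
    receiver.receive(portion)
reconstructed = receiver.reconstruct()
```

A real remote host would also need ordering and completion signals so the web service host knows when the last portion has arrived; the sketch leaves those out.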
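
The vector and distance computations of claim 1 can be followed with a small worked sketch in Python. Everything concrete here is an illustrative assumption rather than part of the patent: the four category keys, the two categories, the frequency counts, the choice P1 = P2 = 2, and the helper names. The claim does not say how the closest categories are chosen from the computed distances; this sketch assumes a larger dot product means a closer category.

```python
from collections import Counter
from typing import Dict, List, Mapping, Sequence


def indicator_vector(keys: Sequence[str], catkeys: Sequence[str]) -> List[int]:
    """Element n is 1 if CATKEY_n appears among keys, otherwise 0."""
    present = set(keys)
    return [1 if key in present else 0 for key in catkeys]


def weight_vector(indicator: Sequence[int], counts: Mapping[str, int],
                  catkeys: Sequence[str], power: float) -> List[float]:
    """Element n is the frequency count raised to power, or 0 where the indicator is 0."""
    return [counts.get(key, 0) ** power if flag else 0
            for flag, key in zip(indicator, catkeys)]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


# Hypothetical taxonomy: N = 4 ordered category keys, M = 2 categories.
catkeys = ["budget", "revenue", "goalkeeper", "match"]
categories = {"finance": ["budget", "revenue"], "sports": ["goalkeeper", "match"]}
# Hypothetical appearance counts of each category key in the stored collection.
collection_counts = {"budget": 3, "revenue": 2, "goalkeeper": 4, "match": 1}
# Hypothetical document keys generated from the document text (with repeats).
document_keys = ["revenue", "revenue", "budget", "match"]
P1 = P2 = 2  # both powers must be greater than 1

vdoc = indicator_vector(document_keys, catkeys)
wdoc = weight_vector(vdoc, Counter(document_keys), catkeys, P1)
vcat = {m: indicator_vector(keys, catkeys) for m, keys in categories.items()}
wcat = {m: weight_vector(vcat[m], collection_counts, catkeys, P2) for m in categories}

# The four alternative distance computations named in the claim.
distances: Dict[str, Dict[str, float]] = {
    "first": {m: dot(vdoc, vcat[m]) for m in categories},
    "second": {m: dot(vdoc, wcat[m]) for m in categories},
    "third": {m: dot(wdoc, vcat[m]) for m in categories},
    "fourth": {m: dot(wdoc, wcat[m]) for m in categories},
}


def closest_categories(scores: Mapping[str, float], top: int) -> List[str]:
    # Assumption: a larger dot product means a closer category.
    return sorted(scores, key=lambda m: scores[m], reverse=True)[:top]


closest = closest_categories(distances["fourth"], top=1)
```

With these assumed inputs, VDOC is [1, 1, 0, 1] and WDOC is [1, 4, 0, 1]. Raising the counts to a power greater than 1 widens the gap between categories: "finance" scores 2 against 1 for "sports" under the first distances, but 25 against 1 under the fourth.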