A SYSTEM AND METHOD FOR MULTI-CLASS MULTI-LABEL HIERARCHICAL CATEGORIZATION
FIELD OF THE INVENTION
The present invention relates to a method and system for categorizing objects. More specifically, the embodiments of the present invention relate to determining the set of topics (or labels) most appropriate for describing a given object within a hierarchical taxonomy of topics, assuming trained classifiers (or filters) at each node of the hierarchy and possibly one or more example objects in each node.
BACKGROUND OF THE INVENTION
The task of categorizing objects into one of several classes has known algorithmic solutions. However, the bulk of work to date deals with two-class (binary) classification. This relatively simple problem has been vigorously studied since the 1940s within the pattern recognition, statistics and machine learning communities. In contrast, the task of multi-class classification (i.e., classification into more than two classes) is still an active research area with several recent developments (e.g., [T.G. Dietterich and G. Bakiri, Solving multiclass learning problems via error-correcting output codes, Journal of Artificial Intelligence Research, 2: 263-286, 1995] and [R. Schapire and Y. Singer, Improved boosting algorithms using confidence-rated predictions, Machine Learning 37(3): 297-336, 1999 - herein SS99]). Moreover, only recently have serious attempts at achieving hierarchical multi-class classification been initiated, e.g., [D. Koller and M. Sahami, Hierarchically classifying documents using very few words - herein KS97] and [S. Chakrabarti, B. Dom, R. Agrawal and P. Raghavan, Using taxonomy and signatures for navigating in text databases, Proceedings of the 23rd VLDB Conference, Athens, Greece, 1997 - herein CDA97]. The task of multi-label classification, in which each object may be assigned simultaneously to several classes, is considered especially hard. The main motivation for the recent interest in multi-label classification is the drastic growth of information that is available online. In a hierarchical taxonomy, the main focus of this invention, objects are inherently multi-labeled, falling naturally into several topics simultaneously.
In constructing multi-label classifiers, one encounters serious difficulties due to the combinatorial nature of the problem, which is further exacerbated for large hierarchies. Specifically, one has to choose a subset of labels from the entire set of possible labels. When this set is larger than, say, 50, the number of choices is astronomical. In typical cases, we envisage hierarchical structures with several hundreds or even thousands of possible topics, rendering exhaustive search impossible.
A straightforward approach to the multi-label classification of objects into a hierarchical structure of possible labels is to ignore the hierarchy and classify objects using any (flat) multi-label classification method. However, this simplistic approach can yield extremely poor classification results as it does not exploit the information exhibited in the hierarchy. Instead of reducing the problem to many small classification problems, this approach must deal with one, very large problem, with possibly hundreds or thousands labels.
We are aware of several pieces of work attempting to partially address the problem of efficient Hierarchical Multi-Class Multi-Label (HMCML) categorization of objects. A desired solution to the HMCML problem must deal with the multi-label classification problem and at the same time exploit the hierarchical nature of the problem. We are not aware of any previous work that attacks the full HMCML problem simultaneously along these two dimensions. We now turn to a short summary of some of the previous work. The work of Schapire and Singer [SS99; R. Schapire and Y. Singer, BoosTexter: A boosting-based system for text categorization, Machine Learning 39(2/3): 135-168, 2000 - herein SS00] provides an extension of a particular algorithm called AdaBoost for dealing with multi-label (and multi-class) problems. Several variants of AdaBoost are suggested, based on classifiers which are able to provide, for each object, a list of labels rather than a single label as in standard classifiers. We note two major problems with the work of Schapire and Singer [SS99, SS00]. First, they do not relate at all to a hierarchical structure of labels. Second, no procedure is suggested for automatically determining an appropriate number of labels.
Chakrabarti et al. [CDA97] introduce the TAPER system (taxonomy and path enhanced retrieval) for the organization of text documents in a taxonomy. The approach is based on the construction of simple probabilistic classifiers at each node and the extraction of appropriate features. Several experiments are provided by the authors to demonstrate the effectiveness of the hierarchy as opposed to non-hierarchical classification. In particular, both the quality of classification and the classification speed are observed to improve due to the hierarchy. The main drawbacks we observe in this system are as follows. First, no attempt is made to produce multi-labeled classification. Second, the proposed categorization algorithm does not perform any correction for specificity (as proposed in the present invention) and is thus likely to misclassify objects whose categories are located at deep levels of the hierarchy. Furthermore, the classifiers used at each node are rather simplistic, and the suggested categorization procedure depends explicitly on their form.
McCallum [A. McCallum, Multi-label text classification with a mixture model trained by EM, AAAI workshop on text learning, 1999 - herein M99] offers a simple probabilistic method for computing the subset of labels that has the maximum posterior probability given a document. The method uses a simple generative probabilistic model to compute the posterior probability of a subset of topics. This work does not address the full HMCML problem, as it does not address the case where topics are hierarchically structured. In addition, due to computational intractability, the proposed method only approximates the true posterior probabilities of topic subsets; it is therefore likely to be trapped in local maxima of the posterior distribution and to miss the true set of topics.
Koller and Sahami [KS97] address the problem of learning and classification in a hierarchical structure, focusing on the case of text classification. The task is divided into a set of smaller sub-problems, each with a corresponding small number of features, adapted to the hierarchy structure. This work addresses mainly the problem of learning the local classifiers in each of the hierarchy nodes, and indeed demonstrates improved performance relative to non-hierarchical classifiers. However, this work does not address the issues of computationally efficient and precise categorization, which is the subject of this invention. Moreover, no treatment of the multi-label classification problem is provided.
BRIEF DESCRIPTION OF THE INVENTION
Our approach provides the first solution to the full HMCML problem. Our algorithmic solution takes advantage of both the hierarchical structure of the domain of interest together with a flexible and principled choice of object categories at each level of the hierarchy. The proposed solution allows for the selection of high-quality HMCML assignments for objects in a computationally efficient manner. In particular, the invention proposes a principled method to determine the set of labels which best fit any input object. The main advantages of the procedure are the following: (i) Applicability to (large) hierarchical taxonomies, which are a very natural and efficient representation for organizing complex topic domains, (ii) Independence of any specific type of classifier (or filter) as well as effective utilization of additional example objects, (iii) Provision of the most appropriate subset of class labels, based on solid statistical and information theoretic principles.
Finally, we comment that in this manuscript we use the term hierarchical structure, taxonomy, and tree interchangeably and we use the term classifier and filter interchangeably.
DESCRIPTION OF THE FIGURES
FIG. 1 is a generic taxonomy tree (with two levels) in accordance with a preferred embodiment of the present invention.
FIG. 2 is a block diagram of the system.
FIG. 3 is a flowchart of the candidate reduction procedure.
FIG. 4 is a flowchart of the random sampling algorithm.
FIG. 5 is a flowchart of a method for determining the number of labels.
FIG. 6 is a flowchart of the greedy tree traversal algorithm for subset selection.
FIG. 7 is a flowchart of the optimal subset selection algorithm.
FIG. 8 is a flowchart of a method for determining the optimal subset of labels from the candidate set.
FIG. 9 is a flowchart of a second method for determining the optimal subset of labels from the candidate set.
FIG. 10 is pseudocode for the Random Sampling algorithm.
FIG. 11 is pseudocode for the Greedy Tree Traversal algorithm.
DETAILED DESCRIPTION OF THE INVENTION
We consider the problem of classifying a given object into the most pertinent nodes in a hierarchical taxonomy. It is assumed that the topic hierarchy is given, and we do not concern ourselves with its construction, which may have been formed manually, automatically or semi-automatically.
Thus, we are given a taxonomy in the form of a rooted tree T = (V, E), where V = {c1, c2, ..., cs} represents the set of possible topics (which are labels or names of categories or topics). Consider the schematic description of a taxonomy tree in FIG. 1. We distinguish between internal nodes (nodes A, B, D, E in FIG. 1) and leaves (the gray nodes C, F, G, H, I, J and K in FIG. 1). Each internal node has one or more descendants (children) which are connected to it by an edge (e.g., nodes F and G are children of node B). Leaf nodes do not have children. If w is a child of v (e.g., node F is a child of node B) then there is an edge (v, w) ∈ E and the interpretation is that topic w is a sub-topic of v. The square nodes in FIG. 1 (which, graph-theoretically, are leaves) represent nodes that will be referred to as 'miscellaneous' (MISC) nodes; their nature is described below. Finally, each node in the tree has a classifier associated with it, which, for each input object, computes the confidence it has in the classification of the object as belonging to this particular node. In the embodiment described in this work we assume that the confidence is a real number between 0 and 1, although any real number may be used. Such a taxonomy is assumed to be populated with content objects (or pointers to such content objects) and, as mentioned, this invention proposes a system and method capable of efficiently and effectively performing this task. Note that content objects can be of any type. Typical examples of types of content objects are text documents, images, music and/or combinations of them (i.e., multimedia objects). Content objects may also be representations of abstract objects (e.g., biological sequences such as DNA and proteins, people or animals, companies, countries, etc.). Note that it is not necessary that the same type of objects populate a taxonomy, and we envisage categorizing taxonomies with different types of objects (such as music, text, pictures, movies, etc.).
Each internal node represents a topic having some number of explicit sub-topics as well as an implicit child representing what we call a miscellaneous (MISC) sub-topic (see the square nodes attached to the internal nodes in the tree of FIG. 1). The MISC sub-topic may contain either "overview" (or mixture) objects or "other" objects that cannot be considered to concern any of the existing sub-topics. Such "other" objects naturally appear in every hierarchical partition of a topic where each topic contains isolated objects of which there are too few similar objects to warrant a new sub-topic. Another case where such objects appear is whenever the tree structure does not accurately reflect the topic (e.g., bad or low-resolution taxonomy construction, or a change of the topic domain). Let D be the set of all objects and let d ∈ D be a specific object. We assume that each of the topic nodes in the tree is associated with a classifier (or filter) or a set of classifiers whose functionality is as follows. For each topic c, the associated classifier assigns, for the given object d, a certain real number serving as a "confidence rate" (also known as "confidence level") for the classification of d into c. Without loss of generality we assume that these confidence rates are real numbers in the interval [0,1]. This assumed functionality captures any type of classifier and, in particular, soft (two-class) classifiers (which assign probabilities to each class) and hard (two-class) classifiers (which assign either 0 or 1). We note again that any classifier that returns an arbitrary real number (bounded or unbounded) can be mapped to a classifier returning values between 0 and 1. (For example, if the interval is unbounded one can use the logistic function.) In this invention we are not concerned with either the type of the given classifiers (which can be arbitrary) or with the method used to create these classifiers. What matters is that these classifiers generate the (hard or soft) confidence rates for each given document in each topic node. For example, each of the first-level nodes of the tree in FIG. 1 has an associated confidence level (computed for a hypothetical given object d) in the interval [0,1]; note that nodes D and E are associated with hard (0/1) classifications, whereas the other nodes (B and C) yield soft decisions. It is assumed that confidence rates (soft or hard) represent degrees of confidence of the classifier with respect to the current input object.
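The mapping from an unbounded score into a [0,1] confidence can be sketched as follows; the function name `to_confidence` is our illustrative choice, and the logistic function is the example the text itself suggests:

```python
import math

def to_confidence(score: float) -> float:
    """Map an arbitrary real-valued classifier score into [0, 1] using the
    logistic function, as suggested in the text for unbounded outputs."""
    return 1.0 / (1.0 + math.exp(-score))

# Hard classifiers already emit 0 or 1 and soft classifiers emit probabilities,
# so both trivially satisfy the [0, 1] convention; only unbounded scores need
# this squashing step.
```

Any other monotone map into [0,1] would serve equally well for the purposes of the method.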
Clearly, the proposed invention will produce better results if the associated classifiers can generate (soft or hard) decisions which better represent the nature of objects. Nevertheless, the proposed invention will work with any type of classifier and will exploit even low quality classifiers in the best possible way.
In this invention we also consider the case where some or all of the leaves and/or internal nodes each include one or more associated example objects. In this case our method can exploit these additional example objects to achieve better performance. Here again, we are not concerned with the method used to obtain these example objects. We only note that the assumption of a small number of example objects is significantly less restrictive than our previous assumption that there is a classifier associated with each node; thus, in most applications such examples will be available. However, since in one embodiment of the present invention we do employ such examples, we later provide a method for generating such examples (with statistical properties similar to those of the original examples used, explicitly or implicitly, to generate the classifiers) whenever they are missing in some or all of the tree nodes.
Another input parameter that we may have is the maximum number of topics that can be assigned to an object. We denote this number by Nmax. This number can reflect an intrinsic property of the collection (e.g., 99.9% of all the objects belong to at most Nmax topics) or can reflect a system requirement (e.g., that more than Nmax labels are confusing). If such a number is given, we can exploit it and improve the results. Nevertheless, our invention works well even if this bound is not known (after we describe the procedure we note how to handle this case).
A note on the representation of objects: In order to construct a classifier at each node of the hierarchy one must fix a representation for each object. This representation may assume one of a variety of forms. The simplest and most common representation is given by a vector of features, so that each object is a vector in a Euclidean space. More general settings include graphs of various sorts. Our invention can work with any type of representation and in particular with the vector of features representation.
To summarize this section, this invention is concerned with the following task. Given a taxonomy tree with associated classifiers (or filters) in each topic node, for a given input object d ∈ D, we want to efficiently compute a set of topics which in some natural sense best describe the subject matter of the object. The proposed method for this computation can rely on additional example objects in some or all of the topic nodes, and can rely on a number Nmax giving a bound on the maximum number of topics that can be assigned to one object. Nevertheless, the proposed method can handle cases in which neither additional example objects are given nor Nmax is known.
The proposed method is divided into two conceptual procedures: candidate set reduction and multi-label subset selection.
In the candidate set reduction procedure we compute, for a given object d ∈ D, a set S containing a number of topics; i.e., S is a subset of the nodes in the tree. In case Nmax is known, the size of S is at most Nmax. The goal of this procedure is to exploit the information given by the taxonomy tree and use it to generate a rather small set of topics S which contains the "true" topics that best describe d. In the multi-label subset selection procedure we compute a subset S0 ⊆ S as the final set of topics for the given object d. The goal of this procedure is to exploit the statistical information embedded in the object d and the candidate set S, and to generate a subset S0 of S containing the optimal subset of topics contained in S. We note that in the second step (multi-label subset selection) we produce optimal solutions when the candidate set S is relatively small (e.g., fewer than 14 topics). Two embodiments of the present invention treat the case where the candidate set S is large, and in this case the results approximate the optimal solution. In what follows we describe in detail each of these steps. FIG. 2 displays the general system overview. At step 12 a digital representation of an object is input to a general computer system, depicted at 16, composed of a hard disk, a local RAM memory and a processor. The computer system 16 interfaces at 14 with the hierarchical taxonomy of FIG. 1, and serves to store and process the information from 14. At step 18 the various algorithms discussed in this invention (FIGS. 3-9) are utilized in order to generate an appropriate set of labels for describing any given object.
STEP 1: CANDIDATE SET REDUCTION
The main purpose of the candidate set selection step is to exploit the information embedded in the hierarchical taxonomy so as to reduce the possible topic choices (initially containing all possible topics) into a small promising set of topics that should include the most informative set of topics that can be attributed to a given object. For now we assume that Nmax is known (later we describe how to operate whenever Nmax is unknown). The general procedure is described in FIG. 3.
FIG. 3 depicts the candidate reduction system. At 32 the following inputs are provided: (i) a hierarchy of labels including a classifier at each node and a set of corrective weights (whose nature is explained below), one for each label, as described in equations (0) and (1) below; (ii) an object to be categorized into the hierarchy; (iii) a maximal number, Nmax, of candidate labels to be computed. At 34 the system computes label statistics. This process has two embodiments. One is described in FIG. 4, starting at 50, and uses a random sampling procedure. The second embodiment for 34 computes these statistics using the mathematical expectation operator E[·] described in equation (2). At 38 the system combines the label statistics and the corrective weights (see equations (0) and (1)) by multiplying them. After this computation each label is assigned a non-negative numerical value that we term its "corrected statistic". The corrected statistic values are then sorted in descending order at 40 using any sorting algorithm (e.g., quicksort). At 42 the system scans the sorted list of corrected statistics in descending order and, for each label scanned, removes all its descendant and ancestor labels from the list, so that in the final list no two labels lie on the same path to the root of the hierarchy. From this final candidate label list at the end of 42, the system outputs the first Nmax labels with the highest corrected statistic values.
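As a concrete sketch of the FIG. 3 pipeline, the corrected-statistics ranking and same-path pruning might look like the following; the function and argument names are our own illustrative choices, not from the source, and the tree is given by a child-to-parent map:

```python
def select_candidates(stats, weights, parent, n_max):
    """Sketch of FIG. 3: multiply each label's statistic by its corrective
    weight, sort in descending order, drop any label lying on the same root
    path as a higher-ranked label, and return the first n_max survivors."""
    corrected = {v: stats[v] * weights[v] for v in stats}
    ranked = sorted(corrected, key=corrected.get, reverse=True)

    def root_path(v):
        # Set of nodes on the path from v up to the root, including v itself.
        path = {v}
        while parent.get(v) is not None:
            v = parent[v]
            path.add(v)
        return path

    chosen = []
    for v in ranked:
        p = root_path(v)
        # Keep v only if no already-chosen label is its ancestor or descendant.
        if all(u not in p and v not in root_path(u) for u in chosen):
            chosen.append(v)
    return chosen[:n_max]
```

For example, on the tree A → {B, C}, B → {F}, a high-ranked B suppresses both its descendant F and its ancestor A, while the sibling branch C survives.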
We propose the following two methods for the candidate set selection: random sampling and greedy tree traversal (although others can be used), which can be used either separately or in conjunction. These methods are described in detail in FIG. 4 and FIG. 6, respectively.
Candidate Selection - Method 1: Random Sampling
We describe the general idea first, providing detailed pseudo-code for the algorithm in FIG. 10 and a detailed description in FIG. 4. The underlying idea is that random sampling defined over the taxonomy (in the form of a directed random walk from the root to the leaves), based on the classifiers' confidence rates at each node, samples more probable topic nodes at a higher rate than other topic nodes.
FIG. 4 depicts the random sampling algorithm, which generates the label statistics for step 34 in FIG. 3. At 50 the following inputs are provided:
(i) A hierarchy of labels including a classifier at each node; (ii) an object to be categorized into the hierarchy; (iii) a maximal number of sampling steps, Ns, which is set by the user in accordance with the desired confidence level. At 52 the putative candidate set size T is initialized to 1 and the count for each node is set to zero. At step 54, if T is larger than Ns, the current set of candidate nodes is output at step 56. Otherwise, the computer variable R is initialized to the root of the hierarchy at step 58. At 60 the system records in the computer memory that the node represented by R has been visited. During step 62 the system examines the condition of whether the node represented by R is an internal label. If the answer is negative, the value of T is incremented at 64, and at step 66 the system jumps back to 58. If the answer at 62 is positive, the system proceeds to ask at 68 whether confidence rates have been computed for all the children of R. If the answer at 68 is negative, the confidence rates of all children of R are computed at 72 and stored in the computer memory in step 74. If the answer at 68 is affirmative, the confidence rates of the children of R are read from the computer memory at 70. The low-confidence children of R are pruned at step 76, which means that they will not be visited again during the current traversal, and the remaining confidences are normalized at 78 so that the sum of confidences is unity. One of the children of R is selected at random at 80 according to the probabilities (normalized confidences) computed at 78. Then R is set to the randomly chosen child at 82, the count of the node is incremented by 1, and from step 84 the system returns to step 62.
One advantage of the proposed method is that it can choose internal (MISC) nodes to be in the candidate set even if there is no classifier associated with them. Given an object d we first compute the confidence rates assigned to d by the various classifiers of the first level (associated with children of the root); then we normalize the confidence rates (conditional probabilities) assigned to d by the node classifiers so that they sum to one. Then we choose one random child of the root according to these probabilities. Suppose that the subtopic u was chosen. We apply this procedure recursively to the sub-tree rooted at u. We stop when we reach a leaf. We repeat this random walk from the root to a leaf a number of times (this number will be specified later). Intuitively, if there is one prominent subtopic u that best suits the object d, then u will be sampled most of the time. Thus, in each random walk we start at the root and randomly branch according to the probabilities at each node, until we end in a leaf. We denote the number of times node v is visited by nv.
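The random walk just described can be sketched as follows; the `children` and `confidence` inputs, the pruning threshold, and the fixed seed are our assumptions for illustration:

```python
import random

def sample_walks(root, children, confidence, n_walks, prune=0.05, rng=None):
    """Sketch of the random-walk sampler: from the root, repeatedly branch to
    a child with probability proportional to its (pruned, normalized)
    confidence until a leaf is reached; count visits n_v at every node."""
    rng = rng or random.Random(0)
    counts = {}
    for _ in range(n_walks):
        v = root
        while True:
            counts[v] = counts.get(v, 0) + 1
            # Prune low-confidence children, as in step 76 of FIG. 4.
            kids = [u for u in children.get(v, []) if confidence[u] >= prune]
            if not kids:          # leaf (or all children pruned)
                break
            total = sum(confidence[u] for u in kids)
            probs = [confidence[u] / total for u in kids]  # normalize to 1
            v = rng.choices(kids, weights=probs)[0]
    return counts
```

On a toy tree with one prominent child, the visit counts concentrate on that child, which is exactly the intuition stated above.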
A straightforward approach at this stage would be to count the number of visits to all nodes (internal and leaves) and sort the nodes by their counts. The candidate set is then set to be the Nmax nodes with the largest counts. However, this simplistic approach is not sufficient, and one must perform the following correction:
Preference for specificity. This correction aims to reflect the desired priority we wish to give to more detailed topics. The correction is done by assigning a weight to each node in the tree. The weight will reflect preference for more specific (deeper) nodes and at the same time will enable detection of mixture MISC documents. As before, let v be a node and let u1, u2, ..., uk be its children. Let a0, a1, ..., an be the path from the root to v, where a0 is the root and an = v. Define the corrective weight as in equation (0). Here β (β > 1) and ε are user-specified parameters to be determined (e.g., our experiments with textual objects indicate that the constants β = 3/2 and ε = 1/2 are effective). Then we set, for each 1 ≤ i ≤ k, the weight as in equation (1).
The particular choice of these weights can vary with the application, but in any case we require that the corrective weight function be monotone in the degree of the node; that is, deg'(v) = f(deg(v)), where f(·) is a monotone function.
Note: in the extreme (and unlikely) case where deg(v) = 1 we should prefer the son of v to v due to its increased specificity; therefore the weight β should be greater than 1. The correction is performed by multiplying the score, count or confidence of a node by the corrective weight described in equation (0).
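Because the bodies of equations (0) and (1) did not survive reproduction here, the following is only a hypothetical illustration of a specificity weight satisfying the stated requirements (monotone preference for deeper nodes, β > 1); it is not the patented formula:

```python
def corrective_weight(depth: int, beta: float = 1.5) -> float:
    """Hypothetical stand-in for equation (0): the weight grows monotonically
    with depth, so with beta > 1 a child always outweighs its parent, matching
    the note that even a single child should be preferred for specificity.
    beta = 3/2 echoes the constant the text reports as effective."""
    return beta ** depth
```

The corrected statistic of a node would then be its count (or confidence) multiplied by this weight.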
Another correction that can be made, and is specified in one embodiment of the present invention, is the following correction for MISC nodes. This correction is especially important if there is no classifier pre-associated with the MISC nodes. Correction for MISC nodes. Let v be an internal node and let u1, ..., uk be its children (i.e., k = deg(v)). Let ci = P(ui | d), i = 1, ..., k, be the confidence rates assigned to d by each of the children. Whenever maxi{ci} < 1/α we trim the traversal at v, where the bias α is any real number larger than 2. This correction aims at detecting situations where the document d is supposed to be contained within a child of v but this child does not exist (and therefore d is considered a MISC object).
For sufficiently small trees, it is possible to compute the outcome of the classifiers at each node, and in this case there is no need to execute random walks. Specifically, the expected number of visits E[nv] at node v can be explicitly calculated from the classifiers' outcomes in the following way. Each node in the tree is represented by a conditional probability distribution P[t | pa(t)], where pa(t) is the parent of node t. When performing a series of random walks, it follows that the probability of reaching a certain node (and thus E[nt]) in the tree is given by the product of the conditional probabilities along the path.
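This analytic alternative can be sketched as follows (names are illustrative): the probability that a single root-to-leaf walk reaches a node is the product of the conditional probabilities on its root path, and E[n_v] over N walks is N times that probability.

```python
def expected_visit_probability(node, parent, cond_prob):
    """Probability that a single root-to-leaf walk passes through `node`:
    the product of conditional probabilities P[t | pa(t)] along the path
    from the root, as described in the text."""
    p = 1.0
    while parent.get(node) is not None:
        p *= cond_prob[node]   # P[node | parent(node)], assumed normalized
        node = parent[node]
    return p
```

For example, if P[B | A] = 0.6 and P[F | B] = 0.5, a walk reaches F with probability 0.3, so over 1000 walks E[n_F] = 300.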
Candidate Selection - Method 2: Greedy Tree Traversal
This method can be much more efficient than Method 1 and uses thresholded classifiers to perform a greedy tree traversal to select the candidate set S. The idea is to use an L-step look-ahead greedy choice at each decision node. The method is greedy in the sense that once a node c has been marked we require that at least one node in the sub-tree rooted at c will become part of the set S. The idea is to ignore future dependence while selecting labels (up to an L-step look-ahead) and to choose the most promising candidates using a top-down approach. The pseudocode in FIG. 11 describes candidate set selection Method 2: Greedy Tree Traversal.
FIG. 6 depicts the greedy subset reduction system. At 100 the following inputs are provided: (i) a hierarchy of labels including a classifier at each node, one corresponding to each label; (ii) an object to be categorized into the hierarchy; (iii) a maximal size of the candidate set, Nmax. At 102 the set S is initialized to be the empty set and M is initialized to be the root node of the tree. At 104 the following condition is examined by the system: is M not empty and is the size of S less than or equal to Nmax? If this condition at 104 is false, the system outputs the current subset S at 106 and halts; otherwise, the system proceeds to step 108 and computes the confidence levels for all members of M. At 110, the label c with the highest confidence rate in M is selected, and it is removed from M at 112. The confidence is computed for each child of c at 114. At 116 the system examines the Boolean condition of whether any of the children of c possess sufficiently high confidence. If this condition at 116 is found to be false, c is added to the set S at 118 and the sub-tree is terminated at c; otherwise, the condition at 116 holds and the high-confidence children of c are added to M at step 120. At step 122 the system returns to step 104 and repeats until the Boolean condition at 104 is answered in the negative. Note: in an alternative embodiment of the proposed invention we combine the above greedy method for candidate selection with the 'correction for specificity' method described in equations (0) and (1). Thus, in this alternative embodiment, at 110 the label c with the highest confidence is chosen after correcting the confidence of that label by multiplying it by the corrective weight described in equation (0). The rest of the method is identical from 112 onward.
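A minimal one-step (L = 1) look-ahead sketch of the FIG. 6 traversal follows; the fixed confidence threshold for "sufficiently high" children and the helper names are our assumptions:

```python
def greedy_candidates(root, children, confidence, n_max, threshold=0.5):
    """Sketch of FIG. 6 / FIG. 11: keep a frontier M of promising nodes,
    repeatedly expand the most confident one, descend into its
    high-confidence children, and add it to S when no child qualifies."""
    S, M = [], [root]
    while M and len(S) <= n_max:
        c = max(M, key=lambda v: confidence[v])  # most confident frontier node
        M.remove(c)
        good = [u for u in children.get(c, []) if confidence[u] >= threshold]
        if good:
            M.extend(good)   # descend: the sub-tree stays in play
        else:
            S.append(c)      # terminate the sub-tree at c
    return S[:n_max]
```

In the alternative embodiment described above, the `confidence[v]` used at the `max` step would first be multiplied by the corrective weight of equation (0).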
Note: whenever Nmax is not given we perform a doubling guessing approach, as described in FIG. 5. Specifically, we repeat the entire HMCML routine (including both the above candidate reduction procedure and the multi-label subset selection procedure, to be described) starting with N = 1, and repeatedly double its value until the resulting HMCML routine selects a subset of topics that is strictly smaller than N.
FIG. 5 depicts a method for applying the candidate subset reduction without explicit knowledge of the parameter Nmax. At step 90 the system sets the computer variable N to 1. Then at 92, the system applies a method for candidate reduction (e.g., the one in FIG. 6). At 94 the system examines the Boolean condition asking whether the candidate reduction system outputs a number of labels strictly smaller than the value of the variable N (that is, fewer than N labels were returned even though N labels were permitted). If this condition holds, the system outputs the labels as computed by the candidate reduction method at step 92. Otherwise, the new value of N is set to be twice the value of the old N and the system continues the computation at step 92. Note that a method for candidate reduction might use stricter standards when Nmax is small, so that the doubling procedure does not necessarily return the same number of labels as it would if Nmax were set to the total number of labels. Additionally, the candidate reduction method might be much more efficient for small values of Nmax, and since we expect most documents to belong to a small number of topics, the doubling procedure is more efficient than setting Nmax to be the total (very large) number of possible topics.
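The doubling loop of FIG. 5 can be sketched as follows; here `candidate_reduction` stands for any routine mapping an allowance N to a label list, and the `total_labels` safety cap is our addition, not part of the source:

```python
def doubling_select(candidate_reduction, total_labels):
    """Sketch of FIG. 5: start with N = 1 and double N until the reduction
    routine returns strictly fewer labels than it was permitted (or N reaches
    the total number of labels, an assumed safety cap)."""
    n = 1
    while True:
        labels = candidate_reduction(n)
        if len(labels) < n or n >= total_labels:
            return labels
        n *= 2
```

For a document that truly belongs to three topics, the loop tries N = 1, 2, 4 and stops, touching far fewer candidates than running once with N equal to the full label count.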
STEP 2: MULTI-LABEL SUBSET SELECTION
Having chosen the candidate set, we now propose a method for selecting a subset S0 ⊆ S of topic labels that will approximate the set of most informative categories. The labels in S0 will be assigned to the object d; that is, these are the topic nodes into which the object d will be placed. The idea is to choose the subset of topic identifiers which best describes the object statistically.
If each of the category nodes had an associated "hard classifier", a naive method that could be considered is to place the object in all classes in the candidate set whose hard binary classification accepts the document. However, this simple method is prone to errors stemming from noisy (or arbitrary) calibration of the hard classifier. Our proposed methods are more sophisticated and overcome this deficiency.
The method we describe here assumes that each of the topic nodes has one or more example objects associated with it (in addition to the associated
classifier). Later we note how to operate whenever such examples are not supplied.
Multi-Label Subset Selection - Most Common Source
The idea of the 'common source' method is to view each object as arising from a possible mixture of different statistical sources. The method is formulated in terms of statistical mutual sources. Given k sources P_1, ..., P_k with prior probabilities pi_1, ..., pi_k, their mutual source Z(1, ..., k) is given by

Z(1, ..., k) = sum_{i=1}^{k} pi_i P_i          (2)

that is, their average weighted according to the priors. It can be shown that

Z(1, ..., k) = argmin_Q sum_{i=1}^{k} pi_i D_KL(P_i || Q)

where D_KL(. || .) denotes the Kullback-Leibler divergence [see T. Cover and J. Thomas, Elements of Information Theory, John Wiley & Sons, New York, 1991 - herein CT91]. When no other prior information is available we simply choose pi_i = 1/k.
Thus, the mutual source is the distribution which is closest to all other distributions in terms of the KL-divergence.
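Equation (2) can be illustrated with a short sketch; the distributions and priors below are invented for the example and are not taken from the specification.

```python
# Sketch of Eq. (2): the mutual source is the prior-weighted average of the
# given distributions; with no prior information the priors default to 1/k.

def mutual_source(dists, priors=None):
    k = len(dists)
    if priors is None:
        priors = [1.0 / k] * k   # pi_i = 1/k when no priors are supplied
    n = len(dists[0])
    return [sum(priors[i] * dists[i][j] for i in range(k)) for j in range(n)]

p1 = [0.8, 0.2]
p2 = [0.4, 0.6]
print(mutual_source([p1, p2]))              # close to [0.6, 0.4]
print(mutual_source([p1, p2], [1.0, 0.0]))  # degenerate priors recover p1
```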
We are given a candidate set of labels c_1, ..., c_k. Each label c_i has an associated distribution P_i whose support is the (entire) feature set; the distribution is computed according to the feature occurrence frequency in documents assigned to the label. Thus, we can consider the mutual source Z(1, ..., k) of the P_i. Each object d can be viewed as an empirical distribution P_d over the (entire) set of features. In particular,

P_d(f_h) = N_d(f_h) / sum_j N_d(f_j)
where N_d(f) is the number of occurrences of feature f in object d. We can now measure the dissimilarity between Z(1, ..., k) and P_d using the Jensen-Shannon divergence. The Jensen-Shannon (JS) divergence JS_lambda(P, Q) between two distributions P and Q is

JS_lambda(P, Q) = lambda D_KL(P || Z) + (1 - lambda) D_KL(Q || Z)          (3)

where Z = lambda P + (1 - lambda) Q is the mutual source of P and Q. Whenever we cannot determine the relative importance of P and Q we take equal priors; that is, lambda = 1/2. Note that any other form of divergence between distributions may be used; we use the JS divergence due to its optimality properties. In particular, one may use any of the L_p norms, the standard cosine function for Euclidean vectors, or the Hamming distance, all of which approximate the JS divergence to a certain extent.
Given an object d and a candidate subset S, we can use JS(Z(S), P_d) to measure the plausibility of the subset S. Intuitively, when we add irrelevant labels to S we increase JS(Z(S), P_d), because we 'dilute' the mutual source, which becomes more distant from the empirical distribution of the object. On the other hand, when we add relevant labels to S we decrease JS(Z(S), P_d), because the mutual source becomes more similar to the empirical distribution P_d of the document. Thus we can use JS(Z(S), P_d) to drive a greedy or non-greedy algorithm for choosing the (near) optimal subset of labels from the candidate set S.
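The dilution effect described above can be demonstrated numerically. This is a minimal sketch of Eqs. (2) and (3) with uniform priors and lambda = 1/2; the three-feature distributions are illustrative only.

```python
# Sketch: mixing an irrelevant label into the subset moves the mutual source
# away from the object's empirical distribution, raising the JS divergence.
import math

def kl(p, q):
    """Kullback-Leibler divergence D_KL(p || q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q, lam=0.5):
    """Jensen-Shannon divergence of Eq. (3)."""
    z = [lam * pi + (1 - lam) * qi for pi, qi in zip(p, q)]
    return lam * kl(p, z) + (1 - lam) * kl(q, z)

def mixture(dists):
    """Mutual source of Eq. (2) with uniform priors."""
    k = len(dists)
    return [sum(d[j] for d in dists) / k for j in range(len(dists[0]))]

p_d      = [0.5, 0.4, 0.1]    # empirical distribution of the object d
relevant = [0.5, 0.45, 0.05]  # a label close to d
diluter  = [0.1, 0.1, 0.8]    # an irrelevant label

# Adding the irrelevant label dilutes the mutual source.
print(js(mixture([relevant]), p_d) < js(mixture([relevant, diluter]), p_d))
```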
FIG. 7 depicts greedy method number 1 for the selection of a subset from the candidate set generated either by the method of FIG. 4 or by the method of FIG. 6. At 140, the following inputs are provided: (i) A set of labels
(the candidate set); (ii) an object d to be categorized into the hierarchy. At 142, the system initializes the optimal set to the empty set and initializes D to the maximal distance between labels (that is, between the distributions associated with these labels) in the tree; distances are computed according to equation (3). At 144, the optimal subset is stored in the computer memory, and the label from the candidate set closest to the object is selected and added to the optimal set. The mutual source of the optimal subset is computed at 146, using Eq. (2). The new value of D is computed at 148 as the distance between the mutual source and the object d. At 150 a decision is made according to the Boolean condition of whether the new value of D is larger than the previous value. If the condition at 150 is false, the value of D is reassigned at 152 to the newly computed value and the system returns from 156 to step 144; otherwise, the previous optimal subset is output and the algorithm halts. Observe that if the number of labels in the candidate set S is small, an exhaustive search through all subsets of S is possible and is recommended for achieving an optimal subset of labels.
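The greedy loop of FIG. 7 can be sketched as follows, assuming uniform priors in Eq. (2) and lambda = 1/2 in Eq. (3). The candidate labels and their distributions are illustrative only.

```python
# Sketch of the FIG. 7 greedy selection: repeatedly add the candidate label
# whose inclusion brings the mutual source closest (in JS divergence) to the
# object's empirical distribution; stop when the divergence starts to rise.
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    z = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, z) + 0.5 * kl(q, z)

def mixture(dists):
    k = len(dists)
    return [sum(d[j] for d in dists) / k for j in range(len(dists[0]))]

def greedy_subset(candidates, p_d):
    """candidates: label -> distribution. Returns the chosen label subset."""
    chosen = {}
    best = float("inf")
    remaining = dict(candidates)
    while remaining:
        # label whose addition brings the mutual source closest to d
        label = min(remaining, key=lambda c: js(
            mixture(list(chosen.values()) + [remaining[c]]), p_d))
        d_new = js(mixture(list(chosen.values()) + [remaining[label]]), p_d)
        if d_new > best:   # adding more labels only dilutes the source
            break
        best = d_new
        chosen[label] = remaining.pop(label)
    return set(chosen)

candidates = {
    "soccer":  [0.5, 0.4, 0.1],
    "sports":  [0.45, 0.45, 0.1],
    "cooking": [0.1, 0.1, 0.8],
}
p_d = [0.5, 0.4, 0.1]          # empirical distribution of the object d
print(greedy_subset(candidates, p_d))  # {'soccer'}
```

Here the object matches the "soccer" distribution exactly, so any further label would raise the divergence and the loop halts after one addition.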
Note: when no example objects are given at some or all of the nodes, we operate as follows. As mentioned earlier, the assumption that classifiers are associated with tree topics is more restrictive than the assumption of example objects, and in particular we envisage three modes for the construction of a classifier at each node: (i) training using a labeled set of examples; (ii) training using unlabeled data; and (iii) a hand-coded classifier, i.e., one which was constructed by hand and not learned from data. In the first case we may directly use the best labeled data as examples, where by 'best' we refer to those examples achieving the highest rank by the classifier. In case (ii) we may simply pass the data through the classifiers and select, at each node, the inputs receiving the highest rank. Finally, in case (iii) (which often corresponds to a Boolean-query type classifier, e.g., a decision tree) we use the given classifiers to generate random example objects in each of the nodes. If the associated classifier is hard (taking the value 0 or 1), we simply generate an artificial document which obeys its conditions. If the (binary)
classifier is soft, assume that it can be represented by a real-valued function g(x), where large values of g(x) correspond to high confidence in the class c and small values correspond to low confidence.
We then introduce a Gibbs distribution on the space of possible objects through the definition P(x) = exp(g(x)/T)/Z, where T is a temperature parameter and Z is a normalization constant. At this point standard approaches based on simulated annealing and Markov-chain Monte Carlo methods (e.g., [Neal, R., Bayesian Learning for Neural Networks, Lecture Notes in Statistics No. 118, Springer-Verlag, 1996 - herein Nea96]) may be used to generate representative samples.
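Sampling from such a Gibbs distribution can be sketched with a simple Metropolis chain. This is an illustration only: the finite object space and the confidence function g below are invented stand-ins, whereas in practice x would range over artificial documents.

```python
# Sketch: Metropolis sampling from P(x) = exp(g(x)/T)/Z over a small finite
# object space, so high-confidence objects dominate the generated examples.
import math
import random

def metropolis(space, g, temp=1.0, steps=5000, seed=0):
    rng = random.Random(seed)
    x = rng.choice(space)
    samples = []
    for _ in range(steps):
        y = rng.choice(space)  # uniform proposal
        # accept with probability min(1, exp((g(y) - g(x)) / temp))
        if rng.random() < math.exp(min(0.0, (g(y) - g(x)) / temp)):
            x = y
        samples.append(x)
    return samples

space = list(range(5))
g = lambda x: float(x)   # higher value = higher classifier confidence
samples = metropolis(space, g, temp=0.5)
# High-confidence objects should dominate the generated examples.
print(samples.count(4) > samples.count(0))  # True
```

Lowering the temperature T sharpens the distribution around the highest-confidence objects, which is the annealing knob mentioned above.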
FIG. 8 depicts another greedy method for the selection of a subset from the candidate set generated either by the method of FIG. 4 or by the method of FIG. 6. At 160, the following inputs are provided: (i) a set of labels (the candidate set); (ii) an object d to be categorized into the hierarchy. At 162, the optimal set is initialized to the empty set, D is set to the maximal distance between members of the candidate set, and M is set to zero. A single label is chosen from the candidate set at 164, for which the distance from the optimal set (combined with this single label) to the object d is minimal. The distance between M labels and an object is computed using the Jensen-Shannon divergence (formula (3)) between the mutual source of the M labels and the object, where the mutual source of the M labels is computed using formula (2). At 166 the distance D is set to the minimum distance computed at step 164. Then M is incremented. From step 168, the method proceeds as in step 150 of FIG. 7.

FIG. 9 depicts the exhaustive optimal subset selection algorithm. At 180
, the following inputs are provided: (i) a set of labels (the candidate set); (ii) an object d to be categorized into the hierarchy. At 182 a decision is made depending on whether the labels are associated with example documents. If example objects do not exist, the classifiers are used at 184 to generate them; otherwise, the empirical distributions of the members of the candidate set are computed at 186. The distance between the object d and every subset of the candidate set is computed at 192, and the subset leading to the smallest distance is designated the optimal subset. The optimal subset is output at 194.
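The exhaustive search of FIG. 9 can be sketched as follows, again assuming uniform priors in Eq. (2) and lambda = 1/2 in Eq. (3); it is feasible only when the candidate set is small. Labels and distributions are illustrative only.

```python
# Sketch of the FIG. 9 exhaustive selection: score every non-empty subset of
# the candidate set by the JS divergence between its mutual source and the
# object's empirical distribution, and keep the minimizer.
import itertools
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    z = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, z) + 0.5 * kl(q, z)

def mixture(dists):
    k = len(dists)
    return [sum(d[j] for d in dists) / k for j in range(len(dists[0]))]

def exhaustive_subset(candidates, p_d):
    """Return the JS-minimizing non-empty subset of candidate labels."""
    labels = list(candidates)
    best_subset, best_d = None, float("inf")
    for r in range(1, len(labels) + 1):
        for subset in itertools.combinations(labels, r):
            d = js(mixture([candidates[l] for l in subset]), p_d)
            if d < best_d:
                best_subset, best_d = set(subset), d
    return best_subset

candidates = {
    "soccer":  [0.5, 0.4, 0.1],
    "sports":  [0.45, 0.45, 0.1],
    "cooking": [0.1, 0.1, 0.8],
}
p_d = [0.5, 0.4, 0.1]
print(exhaustive_subset(candidates, p_d))  # {'soccer'}
```

Unlike the greedy methods of FIGS. 7 and 8, this search examines all 2^k - 1 subsets, so it is guaranteed optimal but practical only for small k.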