A SYSTEM AND METHOD FOR MULTI-CLASS MULTI-LABEL HIERARCHICAL CATEGORIZATION
FIELD OF THE INVENTION
The present invention relates to a method and system for categorizing objects. More specifically, the embodiments of the present invention relate to determining the set of topics (or labels) most appropriate for describing a given object within a hierarchical taxonomy of topics, assuming trained classifiers (or filters) at each node of the hierarchy and possibly one or more example objects in each node.
BACKGROUND OF THE INVENTION
The task of categorizing objects into one of several classes has known algorithmic solutions. However, the bulk of work to date deals with two-class (binary) classification. This relatively simple problem has been vigorously studied since the 1940s within the pattern recognition, statistics and machine learning communities. In contrast, the task of multi-class classification (i.e., classification into more than two classes) is still an active research area with several recent developments (e.g., [T.G. Dietterich and G. Bakiri, Solving multiclass learning problems via error-correcting output codes, Journal of Artificial Intelligence Research, 2: 263-286, 1995] and [R. Schapire and Y. Singer, Improved boosting algorithms using confidence-rated predictions, Machine Learning 37(3): 297-336, 1999 - herein SS99]). Moreover, only recently have serious attempts at achieving hierarchical multi-class classification been initiated, e.g., [D. Koller and M. Sahami, Hierarchically classifying documents using very few words - herein KS97] and [S. Chakrabarti, B. Dom, R. Agrawal and P. Raghavan, Using taxonomy and signatures for navigating in text databases, Proceedings of the 23rd VLDB Conference, Athens, Greece, 1997 - herein CDA97]. The task of multi-label classification, in which each object may be assigned simultaneously to several classes, is considered especially hard. The main motivation for the recent interest in multi-label classification is the drastic growth of information that is available online. In a hierarchical taxonomy, the main focus of this invention, objects are inherently multi-labeled, falling naturally into several topics simultaneously.
In constructing multi-label classifiers, one encounters serious difficulties due to the combinatorial nature of the problem, which is further exacerbated for large hierarchies. Specifically, one has to choose a subset of labels from the entire set of possible labels. When this set is larger than, say, 50, the number of choices is astronomical. In typical cases, we envisage hierarchical structures with several hundreds or even thousands of possible topics, rendering exhaustive search impossible.
A straightforward approach to the multi-label classification of objects into a hierarchical structure of possible labels is to ignore the hierarchy and classify objects using any (flat) multi-label classification method. However, this simplistic approach can yield extremely poor classification results as it does not exploit the information exhibited in the hierarchy. Instead of reducing the problem to many small classification problems, this approach must deal with one, very large problem, with possibly hundreds or thousands labels.
We are aware of several pieces of work attempting to partially address the problem of efficient Hierarchical Multi-Class Multi-Label (HMCML) categorization of objects. A desired solution to the HMCML problem must deal with the multi-label classification problem and at the same time exploit the hierarchical nature of the problem. We are not aware of any previous work that attacks the full HMCML problem simultaneously along these two dimensions. We now turn to a short summary of some of the previous work. The work of Schapire and Singer [SS99; R. Schapire and Y. Singer, BoosTexter: A boosting-based system for text categorization, Machine Learning 39(2/3): 135-168, 2000 - herein SS00] provides an extension of a particular algorithm called AdaBoost for dealing with multi-label (and multi-class) problems. Several variants of AdaBoost are suggested, based on classifiers which are able to provide, for each object, a list of labels rather than a single label as in standard classifiers. We note two major problems with the work of Schapire and Singer [SS99, SS00]. First, they do not relate at all to a hierarchical structure of labels. Second, no procedure is suggested for automatically determining an appropriate number of labels.
Chakrabarti et al. [CDA97] introduce the TAPER system (taxonomy and path enhanced retrieval) for the organization of text documents in a taxonomy. The approach is based on the construction of simple probabilistic classifiers at each node and the extraction of appropriate features. Several experiments are provided by the authors to demonstrate the effectiveness of the hierarchy as opposed to non-hierarchical classification. In particular, both the quality of classification and the classification speed are observed to improve due to the hierarchy. The main drawbacks we observe in this system are as follows. First, no attempt is made to produce multi-labeled classification. Second, the proposed categorization algorithm does not perform any correction for specificity (as proposed in the present invention) and is thus likely to misclassify objects whose categories are located at deep levels of the hierarchy. Furthermore, the classifiers used at each node are rather simplistic, and the suggested categorization procedure depends explicitly on their form.
McCallum [A. McCallum, Multi-label text classification with a mixture model trained by EM, AAAI workshop on text learning, 1999 - herein M99] offers a simple probabilistic method for computing the subset of labels that has the maximum posterior probability given a document. The method uses a simple generative probabilistic model to compute the posterior probability of a subset of topics. This work does not address the full HMCML problem, as it does not address the case where topics are hierarchically structured. In addition, due to computational intractability, the proposed method only approximates the true posterior probabilities of topic subsets; it is therefore likely to be trapped in local maxima of the posterior distribution and to miss the true set of topics.
Koller and Sahami [KS97] address the problem of learning and classification in a hierarchical structure, focusing on the case of text classification. The task is divided into a set of smaller sub-problems, each with a corresponding small number of features, adapted to the hierarchy structure. This work addresses mainly the problem of learning the local classifiers in each of the hierarchy nodes, and indeed demonstrates improved performance relative to non-hierarchical classifiers. However, this work does not address the issues of computationally efficient and precise categorization, which is the subject of this invention. Moreover, no treatment of the multi-label classification problem is provided.
BRIEF DESCRIPTION OF THE INVENTION
Our approach provides the first solution to the full HMCML problem. Our algorithmic solution takes advantage of both the hierarchical structure of the domain of interest together with a flexible and principled choice of object categories at each level of the hierarchy. The proposed solution allows for the selection of high-quality HMCML assignments for objects in a computationally efficient manner. In particular, the invention proposes a principled method to determine the set of labels which best fit any input object. The main advantages of the procedure are the following: (i) Applicability to (large) hierarchical taxonomies, which are a very natural and efficient representation for organizing complex topic domains, (ii) Independence of any specific type of classifier (or filter) as well as effective utilization of additional example objects, (iii) Provision of the most appropriate subset of class labels, based on solid statistical and information theoretic principles.
Finally, we comment that in this manuscript we use the term hierarchical structure, taxonomy, and tree interchangeably and we use the term classifier and filter interchangeably.
DESCRIPTION OF THE FIGURES
FIG. 1 is a generic taxonomy tree (with two levels) in accordance with a preferred embodiment of the present invention.
FIG. 2 is a block diagram of the system.
FIG. 3 is a flowchart of the candidate reduction procedure.
FIG. 4 is a flowchart of the random sampling algorithm.
FIG. 5 is a flowchart of a method for determining the number of labels.
FIG. 6 is a flowchart of the greedy tree traversal algorithm for subset selection.
FIG. 7 is a flowchart of the optimal subset selection algorithm.
FIG. 8 is a flowchart of a method for determining the optimal subset of labels from the candidate set.
FIG. 9 is a flowchart of a second method for determining the optimal subset of labels from the candidate set.
FIG. 10 is pseudocode for the Random Sampling algorithm.
FIG. 11 is pseudocode for the Greedy Tree Traversal algorithm.
DETAILED DESCRIPTION OF THE INVENTION
We consider the problem of classifying a given object into the most pertinent nodes in a hierarchical taxonomy. It is assumed that the topic hierarchy is given, and we do not concern ourselves with its construction, which may have been formed manually, automatically or semi-automatically.
Thus, we are given a taxonomy in the form of a rooted tree T = (V, E), where V = {c1, c2, ..., cs} represents the set of possible topics (which are labels or names of categories or topics). Consider the schematic description of a taxonomy tree in FIG. 1. We distinguish between internal nodes (nodes A, B, D, E in FIG. 1) and leaves (the gray nodes C, F, G, H, I, J and K in FIG. 1). Each internal node has one or more descendants (children) which are connected to it by an edge (e.g., nodes F and G are children of node B). Leaf nodes do not have children. If w is a child of v (e.g., node F is a child of node B) then there is an edge (v, w) ∈ E and the interpretation is that topic w is a sub-topic of v. The square nodes in FIG. 1 (which, graph-theoretically, are leaves) represent nodes that will be referred to as 'miscellaneous' (MISC) nodes; their nature is described below. Finally, each node in the tree has a classifier associated with it, which, for each input object, computes the confidence it has in the classification of the object as belonging to this particular node. In the embodiment described in this work we assume that the confidence is a real number between 0 and 1, although any real number may be used. Such a taxonomy is assumed to be populated with content objects (or pointers to such content objects) and, as mentioned, this invention proposes a system and method capable of efficiently and effectively performing this task. Note that content objects can be of any type. Typical examples of types of content objects are text documents, images, music and/or combinations of them (i.e., multimedia objects). Content objects may also be representations of abstract objects (e.g., biological sequences such as DNA and proteins, people or animals, companies, countries, etc.). Note that it is not necessary that the same type of objects populate a taxonomy, and we envisage categorizing taxonomies with different types of objects (such as music, text, pictures, movies, etc.).
Each internal node represents a topic having some number of explicit sub-topics as well as an implicit child representing what we call a miscellaneous (MISC) sub-topic (see the square nodes attached to the internal nodes in the tree of FIG. 1). The MISC sub-topic may contain either "overview" (or mixture) objects or "other" objects that cannot be considered to concern any of the existing sub-topics. Such "other" objects naturally appear in every hierarchical partition of a topic where each topic contains isolated objects of which there are too few similar objects to warrant a new sub-topic. Another case where such objects appear is whenever the tree structure does not accurately reflect the topic (e.g., bad or low-resolution taxonomy construction, or a change of the topic domain). Let D be the set of all objects and let d ∈ D be a specific object. We assume that each of the topic nodes in the tree is associated with a classifier (or filter) or a set of classifiers whose functionality is as follows. For each topic c, the associated classifier assigns, for the given object d, a certain real number serving as a "confidence rate" (also known as "confidence level") for the classification of d into c. Without loss of generality we assume that these confidence rates are real numbers in the interval [0,1]. This assumed functionality captures any type of classifier and, in particular, soft (two-class) classifiers (which assign probabilities to each class) and hard (two-class) classifiers (which assign either 0 or 1). We note again that any classifier that returns an arbitrary real number (bounded or unbounded) can be mapped to a classifier returning values between 0 and 1. (For example, if the interval is unbounded one can use the logistic function.) In this invention we are not concerned with either the type of the given classifiers (which can be arbitrary) or with the method used to create these classifiers. What matters is that these classifiers generate the (hard or soft) confidence rates for each given document in each topic node. For example, each of the first-level nodes of the tree in FIG. 1 has an associated confidence level (computed for a hypothetical given object d) in the interval [0,1]; note that nodes D and E are associated with hard (0/1) classifications, whereas the other nodes (B and C) yield soft decisions. It is assumed that confidence rates (soft or hard) represent degrees of confidence of the classifier with respect to the current input object.
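The mapping from an unbounded score into a [0,1] confidence can be sketched as follows; the function name `to_confidence` is our illustrative choice, and the logistic function is the example the text itself suggests:

```python
import math

def to_confidence(score: float) -> float:
    """Map an arbitrary real-valued classifier score into [0, 1] using the
    logistic function, as suggested in the text for unbounded outputs."""
    return 1.0 / (1.0 + math.exp(-score))

# Hard classifiers already emit 0 or 1 and soft classifiers emit probabilities,
# so both trivially satisfy the [0, 1] convention; only unbounded scores need
# this squashing step.
```

Any other monotone map into [0,1] would serve equally well for the purposes of the method.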
Clearly, the proposed invention will produce better results if the associated classifiers can generate (soft or hard) decisions which better represent the nature of objects. Nevertheless, the proposed invention will work with any type of classifier and will exploit even low quality classifiers in the best possible way.
In this invention we also consider the case where some or all of the leaves and/or internal nodes each include one or more associated example objects. In this case our method can exploit these additional example objects to achieve better performance. Here again, we are not concerned with the method used to obtain these example objects. We only note that the assumption of a small number of example objects is significantly less restrictive than our previous assumption that there is a classifier associated with each node; thus, in most applications such examples will be available. However, since in one embodiment of the present invention we do employ such examples, we later provide a method for generating such examples (with statistical properties similar to those of the original examples used, explicitly or implicitly, to generate the classifiers) whenever they are missing in some or all of the tree nodes.
Another input parameter that we may have is the maximum number of topics that can be assigned to an object. We denote this number by Nmax. This number can reflect an intrinsic property of the collection (e.g., 99.9% of all the objects belong to at most Nmax topics) or can reflect a system requirement (e.g., that more than Nmax labels are confusing). If such a number is given, we can exploit it and improve the results. Nevertheless, our invention works well even if this bound is not known (after we describe the procedure we note how to handle this case).
A note on the representation of objects: In order to construct a classifier at each node of the hierarchy one must fix a representation for each object. This representation may assume one of a variety of forms. The simplest and most common representation is given by a vector of features, so that each object is a vector in a Euclidean space. More general settings include graphs of various sorts. Our invention can work with any type of representation and in particular with the vector of features representation.
To summarize this section, this invention is concerned with the following task. Given a taxonomy tree with associated classifiers (or filters) in each topic node, for a given input object d ∈ D, we want to efficiently compute a set of topics which in some natural sense best describe the subject matter of the object. The proposed method for this computation can rely on additional example objects in some or all of the topic nodes, and can rely on a number Nmax giving a bound on the maximum number of topics that can be assigned to one object. Nevertheless, the proposed method can handle cases in which neither additional example objects are given nor Nmax is known.
The proposed method is divided into two conceptual procedures: candidate set reduction and multi-label subset selection.
In the candidate set reduction procedure we compute, for a given object d ∈ D, a set S containing a number of topics; i.e., S is a subset of the nodes in the tree. In case Nmax is known, the size of S is at most Nmax. The goal of this procedure is to exploit the information given by the taxonomy tree and use it to generate a rather small set of topics S which contains the "true" topics that best describe d. In the multi-label subset selection procedure we compute a subset S0 ⊆ S as the final set of topics for the given object d. The goal of this procedure is to exploit the statistical information embedded in the object d and the candidate set S, and to generate a subset S0 of S containing the optimal subset of topics contained in S. We note that in the second step (multi-label subset selection) we produce optimal solutions when the candidate set S is relatively small (e.g., fewer than 14 topics). Two embodiments of the present invention treat the case where the candidate set S is large, and in this case the results approximate the optimal solution. In what follows we describe in detail each of these steps. FIG. 2 displays the general system overview. At step 12 a digital representation of an object is input to a general computer system, depicted at 16, composed of a hard disk, a local RAM memory and a processor. The computer system 16 interfaces at 14 with the hierarchical taxonomy of FIG. 1, and serves to store and process the information from 14. At step 18 the various algorithms discussed in this invention (FIGS. 3-9) are utilized in order to generate an appropriate set of labels for describing any given object.
STEP 1: CANDIDATE SET REDUCTION
The main purpose of the candidate set selection step is to exploit the information embedded in the hierarchical taxonomy so as to reduce the possible topic choices (initially containing all possible topics) into a small promising set of topics that should include the most informative set of topics that can be attributed to a given object. For now we assume that Nmax is known (later we describe how to operate whenever Nmax is unknown). The general procedure is described in FIG. 3.
FIG. 3 depicts the candidate reduction system. At 32 the following inputs are provided: (i) a hierarchy of labels including a classifier at each node and a set of corrective weights (whose nature is explained below), one for each label, as described in equations (0) and (1) below; (ii) an object to be categorized into the hierarchy; (iii) a maximal number, Nmax, of candidate labels to be computed. At 34 the system computes label statistics. This process has two embodiments. One is described in FIG. 4, starting at 50, and uses a random sampling procedure. The second embodiment for 34 computes these statistics using the mathematical expectation operator E[·] described in equation (2). At 38 the system combines the label statistics and the corrective weights (see equations (0) and (1)) by multiplying them. After this computation each label is assigned a non-negative numerical value that we term its "corrected statistic". The corrected statistic values are then sorted in descending order at 40 using any sorting algorithm (e.g., quicksort). At 42 the system scans the sorted list of corrected statistics in descending order and, for each label scanned, removes all its descendant and ancestor labels from the list, so that in the final list no two labels lie on the same path to the root of the hierarchy. From this final candidate label list at the end of 42, the system outputs the first Nmax labels with the highest corrected statistic values.
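As a concrete sketch of the FIG. 3 pipeline, the corrected-statistics ranking and same-path pruning might look like the following; the function and argument names are our own illustrative choices, not from the source, and the tree is given by a child-to-parent map:

```python
def select_candidates(stats, weights, parent, n_max):
    """Sketch of FIG. 3: multiply each label's statistic by its corrective
    weight, sort in descending order, drop any label lying on the same root
    path as a higher-ranked label, and return the first n_max survivors."""
    corrected = {v: stats[v] * weights[v] for v in stats}
    ranked = sorted(corrected, key=corrected.get, reverse=True)

    def root_path(v):
        # Set of nodes on the path from v up to the root, including v itself.
        path = {v}
        while parent.get(v) is not None:
            v = parent[v]
            path.add(v)
        return path

    chosen = []
    for v in ranked:
        p = root_path(v)
        # Keep v only if no already-chosen label is its ancestor or descendant.
        if all(u not in p and v not in root_path(u) for u in chosen):
            chosen.append(v)
    return chosen[:n_max]
```

For example, on the tree A → {B, C}, B → {F}, a high-ranked B suppresses both its descendant F and its ancestor A, while the sibling branch C survives.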
We propose the following two methods for the candidate set selection: random sampling and greedy tree traversal (although others can be used), which can be used either separately or in conjunction. These methods are described in detail in FIG. 4 and FIG. 6, respectively.
Candidate Selection - Method 1: Random Sampling
We describe the general idea first, providing detailed pseudo-code for the algorithm in FIG. 10 and a detailed description in FIG. 4. The underlying idea is that random sampling defined over the taxonomy (in the form of a directed random walk from the root to the leaves), based on the classifiers' confidence rates at each node, samples more probable topic nodes at a higher rate than other topic nodes.
FIG. 4 depicts the random sampling algorithm, which generates the label statistics for step 34 in FIG. 3. At 50 the following inputs are provided:
(i) A hierarchy of labels including a classifier at each node; (ii) an object to be categorized into the hierarchy; (iii) a maximal number of sampling steps, Ns, which is set by the user in accordance with the desired confidence level. At 52 the putative candidate set size T is initialized to 1 and the count for each node is set to zero. At step 54, if T is larger than Ns, the current set of candidate nodes is output at step 56. Otherwise, the computer variable R is initialized to the root of the hierarchy at step 58. At 60 the system records in the computer memory that the node represented by R has been visited. During step 62 the system examines the condition of whether the node represented by R is an internal label. If the answer is negative, the value of T is incremented at 64, and at step 66 the system jumps back to 58. If the answer at 62 is positive, the system proceeds to ask at 68 whether confidence rates have been computed for all the children of R. If the answer at 68 is negative, the confidence rates of all children of R are computed at 72 and stored in the computer memory in step 74. If the answer at 68 is affirmative, the confidence rates of the children of R are read from the computer memory at 70. The low-confidence children of R are pruned at step 76, which means that they will not be visited again during the current traversal, and the remaining confidences are normalized at 78 so that the sum of confidences is unity. One of the children of R is selected at random at 80 according to the probabilities (normalized confidences) computed at 78. Then R is set to the randomly chosen child at 82, the count of the node is incremented by 1, and from step 84 the system returns to step 62.
One advantage of the proposed method is that it can choose internal (MISC) nodes to be in the candidate set even if there is no classifier associated with them. Given an object d we first compute the confidence rates assigned to d by the various classifiers of the first level (associated with children of the root); then we normalize the confidence rates (conditional probabilities) assigned to d by the node classifiers so that they sum to one. Then we choose one random child of the root according to these probabilities. Suppose that the subtopic u was chosen. We apply this procedure recursively to the sub-tree rooted at u. We stop when we reach a leaf. We repeat this random walk from the root to a leaf a number of times (this number will be specified later). Intuitively, if there is one prominent subtopic u that best suits the object d, then u will be sampled most of the time. Thus, in each random walk we start at the root and randomly branch according to the probabilities at each node, until we end in a leaf. We denote the number of times node v is visited by nv.
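The random walk just described can be sketched as follows; the `children` and `confidence` inputs, the pruning threshold, and the fixed seed are our assumptions for illustration:

```python
import random

def sample_walks(root, children, confidence, n_walks, prune=0.05, rng=None):
    """Sketch of the random-walk sampler: from the root, repeatedly branch to
    a child with probability proportional to its (pruned, normalized)
    confidence until a leaf is reached; count visits n_v at every node."""
    rng = rng or random.Random(0)
    counts = {}
    for _ in range(n_walks):
        v = root
        while True:
            counts[v] = counts.get(v, 0) + 1
            # Prune low-confidence children, as in step 76 of FIG. 4.
            kids = [u for u in children.get(v, []) if confidence[u] >= prune]
            if not kids:          # leaf (or all children pruned)
                break
            total = sum(confidence[u] for u in kids)
            probs = [confidence[u] / total for u in kids]  # normalize to 1
            v = rng.choices(kids, weights=probs)[0]
    return counts
```

On a toy tree with one prominent child, the visit counts concentrate on that child, which is exactly the intuition stated above.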
A straightforward approach at this stage would be to count the number of visits to all nodes (internal and leaves) and sort the nodes by their counts. The candidate set is then set to be the Nmax nodes with the largest counts. However, this simplistic approach is not sufficient, and one must perform the following correction:
Preference for specificity. This correction aims to reflect the desired priority we wish to give to more detailed topics. The correction is done by assigning a weight to each node in the tree. The weight will reflect preference for more specific (deeper) nodes and at the same time will enable detection of mixture MISC documents. As before, let v be a node and let u1, u2, ..., uk be its children. Let a0, a1, ..., an be the path from the root to v, where a0 is the root and an = v. Define the corrective weight as in equation (0). Here β (β > 1) and ε are user-specified parameters to be determined (e.g., our experiments with textual objects indicate that the constants β = 3/2 and ε = 1/2 are effective). Then we set, for each 1 ≤ i ≤ k, the weight as in equation (1).
The particular choice of these weights can vary with the application, but in any case we require that the corrective weight function be monotone in the degree of the node; that is, deg'(v) = f(deg(v)), where f(·) is a monotone function.
Note: in the extreme (and unlikely) case where deg(v) = 1 we should prefer the son of v to v due to its increased specificity; therefore the weight β should be greater than 1. The correction is performed by multiplying the score, count or confidence of a node by the corrective weight described in equation (0).
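Because the bodies of equations (0) and (1) did not survive reproduction here, the following is only a hypothetical illustration of a specificity weight satisfying the stated requirements (monotone preference for deeper nodes, β > 1); it is not the patented formula:

```python
def corrective_weight(depth: int, beta: float = 1.5) -> float:
    """Hypothetical stand-in for equation (0): the weight grows monotonically
    with depth, so with beta > 1 a child always outweighs its parent, matching
    the note that even a single child should be preferred for specificity.
    beta = 3/2 echoes the constant the text reports as effective."""
    return beta ** depth
```

The corrected statistic of a node would then be its count (or confidence) multiplied by this weight.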
Another correction that can be made, and is specified in one embodiment of the present invention, is the following correction for MISC nodes. This correction is especially important if there is no classifier pre-associated with the MISC nodes. Correction for MISC nodes. Let v be an internal node and let u1, ..., uk be its children (i.e., k = deg(v)). Let ci = P(ui | d), i = 1, ..., k, be the confidence rates assigned to d by each of the children. Whenever maxi{ci} < 1/α we trim the traversal at v, where the bias α is any real number larger than 2. This correction aims at detecting situations where the document d is supposed to be contained within a child of v but this child does not exist (and therefore d is considered a MISC object).
For sufficiently small trees, it is possible to compute the outcome of the classifiers at each node, and in this case there is no need to execute random walks. Specifically, the expected number of visits E[nv] at node v can be explicitly calculated from the classifiers' outcomes in the following way. Each node in the tree is represented by a conditional probability distribution P[t | pa(t)], where pa(t) is the parent of node t. When performing a series of random walks, it follows that the probability of reaching a certain node (and thus E[nt]) in the tree is given by the product of the conditional probabilities along the path.
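This analytic alternative can be sketched as follows (names are illustrative): the probability that a single root-to-leaf walk reaches a node is the product of the conditional probabilities on its root path, and E[n_v] over N walks is N times that probability.

```python
def expected_visit_probability(node, parent, cond_prob):
    """Probability that a single root-to-leaf walk passes through `node`:
    the product of conditional probabilities P[t | pa(t)] along the path
    from the root, as described in the text."""
    p = 1.0
    while parent.get(node) is not None:
        p *= cond_prob[node]   # P[node | parent(node)], assumed normalized
        node = parent[node]
    return p
```

For example, if P[B | A] = 0.6 and P[F | B] = 0.5, a walk reaches F with probability 0.3, so over 1000 walks E[n_F] = 300.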
Candidate Selection - Method 2: Greedy Tree Traversal
This method can be much more efficient than Method 1 and uses thresholded classifiers to perform a greedy tree traversal to select the candidate set S. The idea is to use an L-step look-ahead greedy choice at each decision node. The method is greedy in the sense that once a node c has been marked we require that at least one node in the sub-tree rooted at c will become part of the set S. The idea is to ignore future dependence while selecting labels (up to an L-step look-ahead) and to choose the most promising candidates using a top-down approach. The pseudocode in FIG. 11 describes candidate set selection Method 2: Greedy Tree Traversal.
FIG. 6 depicts the greedy subset reduction system. At 100 the following inputs are provided: (i) a hierarchy of labels including a classifier at each node, one corresponding to each label; (ii) an object to be categorized into the hierarchy; (iii) a maximal size of the candidate set, Nmax. At 102 the set S is initialized to be the empty set and M is initialized to be the root node of the tree. At 104 the following condition is examined by the system: is M not empty and is the size of S less than or equal to Nmax? If this condition at 104 is false, the system outputs the current subset S at 106 and halts; otherwise, the system proceeds to step 108 and computes the confidence levels for all members of M. At 110, the label c with the highest confidence rate in M is selected, and it is removed from M at 112. The confidence is computed for each child of c at 114. At 116 the system examines the Boolean condition of whether any of the children of c possess sufficiently high confidence. If this condition at 116 is found to be false, c is added to the set S at 118 and the sub-tree is terminated at c; otherwise, the condition at 116 holds and the high-confidence children of c are added to M at step 120. At step 122 the system returns to step 104 and repeats until the Boolean condition at 104 is answered in the negative. Note: in an alternative embodiment of the proposed invention we combine the above greedy method for candidate selection with the 'correction for specificity' method described in equations (0) and (1). Thus, in this alternative embodiment, at 110 the label c with the highest confidence is chosen after correcting the confidence of that label by multiplying it by the corrective weight described in equation (0). The rest of the method is identical from 112 onward.
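A minimal one-step (L = 1) look-ahead sketch of the FIG. 6 traversal follows; the fixed confidence threshold for "sufficiently high" children and the helper names are our assumptions:

```python
def greedy_candidates(root, children, confidence, n_max, threshold=0.5):
    """Sketch of FIG. 6 / FIG. 11: keep a frontier M of promising nodes,
    repeatedly expand the most confident one, descend into its
    high-confidence children, and add it to S when no child qualifies."""
    S, M = [], [root]
    while M and len(S) <= n_max:
        c = max(M, key=lambda v: confidence[v])  # most confident frontier node
        M.remove(c)
        good = [u for u in children.get(c, []) if confidence[u] >= threshold]
        if good:
            M.extend(good)   # descend: the sub-tree stays in play
        else:
            S.append(c)      # terminate the sub-tree at c
    return S[:n_max]
```

In the alternative embodiment described above, the `confidence[v]` used at the `max` step would first be multiplied by the corrective weight of equation (0).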
Note: whenever Nmax is not given we perform a doubling guessing approach, as described in FIG. 5. Specifically, we repeat the entire HMCML routine (including both the above candidate reduction procedure and the multi-label subset selection procedure, to be described) starting with N = 1, and repeatedly double its value until the resulting HMCML routine selects a subset of topics that is strictly smaller than N.
FIG. 5 depicts a method for applying the candidate subset reduction without explicit knowledge of the parameter Nmax. At step 90 the system sets the computer variable N to 1. Then at 92, the system applies a method for candidate reduction (e.g., the one in FIG. 6). At 94 the system examines the Boolean condition asking whether the candidate reduction system outputs a number of labels strictly smaller than the value of the variable N (that is, fewer than N labels were returned even though N labels were permitted). If this condition holds, the system outputs the labels as computed by the candidate reduction method at step 92. Otherwise, the new value of N is set to be twice the value of the old N and the system continues the computation at step 92. Note that a method for candidate reduction might use stricter standards when Nmax is small, so that the doubling procedure does not necessarily return the same number of labels as it would if Nmax were set to the total number of labels. Additionally, the candidate reduction method might be much more efficient for small values of Nmax, and since we expect most documents to belong to a small number of topics, the doubling procedure is more efficient than setting Nmax to be the total (very large) number of possible topics.
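The doubling loop of FIG. 5 can be sketched as follows; here `candidate_reduction` stands for any routine mapping an allowance N to a label list, and the `total_labels` safety cap is our addition, not part of the source:

```python
def doubling_select(candidate_reduction, total_labels):
    """Sketch of FIG. 5: start with N = 1 and double N until the reduction
    routine returns strictly fewer labels than it was permitted (or N reaches
    the total number of labels, an assumed safety cap)."""
    n = 1
    while True:
        labels = candidate_reduction(n)
        if len(labels) < n or n >= total_labels:
            return labels
        n *= 2
```

For a document that truly belongs to three topics, the loop tries N = 1, 2, 4 and stops, touching far fewer candidates than running once with N equal to the full label count.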
STEP 2: MULTI-LABEL SUBSET SELECTION
Having chosen the candidate set, we now propose a method for selecting a subset S0 ⊆ S of topic labels that will approximate the set of most informative categories. The labels in S0 will be assigned to the object d; that is, these are the topic nodes into which the object d will be placed. The idea is to choose the subset of topic identifiers which best describes the object statistically.
If each of the category nodes had an associated "hard classifier", a naive method that could be considered is to place the object in all classes in the candidate set whose hard binary classification accepts the document. However, this simple method is prone to errors stemming from noisy (or arbitrary) calibration of the hard classifier. Our proposed methods are more sophisticated and overcome this deficiency.
The method we describe here assumes that each of the topic nodes has one or more example objects associated with it (in addition to the associated
classifier). Later we note how to operate whenever such examples are not supplied.
Multi-Label Subset Selection - Most Common Source
The idea of the 'common source' method is to view each object as arising from a possible mixture of different statistical sources. The method is formulated in terms of statistical mutual sources. Given k sources P_1, ..., P_k with prior probabilities pi_1, ..., pi_k, their mutual source Z(1, ..., k) is given by

Z(1, ..., k) = sum_{i=1}^{k} pi_i P_i          (2)

that is, their average weighted according to the priors. It can be shown that

Z(1, ..., k) = argmin_Q sum_{i=1}^{k} pi_i D_KL(P_i || Q)

where D_KL(. || .) denotes the Kullback-Leibler divergence [see T. Cover and J. Thomas, Elements of Information Theory, John Wiley & Sons, New York, 1991 - herein CT91]. When no other prior information is available we simply choose pi_i = 1/k.
Thus, the mutual source is the distribution which is closest to all other distributions in terms of the KL-divergence.
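Equation (2) can be illustrated with a short sketch; the distributions and priors below are invented for the example and are not taken from the specification.

```python
# Sketch of Eq. (2): the mutual source is the prior-weighted average of the
# given distributions; with no prior information the priors default to 1/k.

def mutual_source(dists, priors=None):
    k = len(dists)
    if priors is None:
        priors = [1.0 / k] * k   # pi_i = 1/k when no priors are supplied
    n = len(dists[0])
    return [sum(priors[i] * dists[i][j] for i in range(k)) for j in range(n)]

p1 = [0.8, 0.2]
p2 = [0.4, 0.6]
print(mutual_source([p1, p2]))              # close to [0.6, 0.4]
print(mutual_source([p1, p2], [1.0, 0.0]))  # degenerate priors recover p1
```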
We are given a candidate set of labels c_1, ..., c_k. Each label c_i has an associated distribution P_i whose support is the (entire) feature set; the distribution is computed according to the feature occurrence frequency in documents assigned to the label. Thus, we can consider the mutual source Z(1, ..., k) of the P_i. Each object d can be viewed as an empirical distribution P_d over the (entire) set of features. In particular,

P_d(f_h) = N_d(f_h) / sum_j N_d(f_j)
where N_d(f) is the number of occurrences of feature f in object d. We can now measure the dissimilarity between Z(1, ..., k) and P_d using the Jensen-Shannon divergence. The Jensen-Shannon (JS) divergence JS_lambda(P, Q) between two distributions P and Q is

JS_lambda(P, Q) = lambda D_KL(P || Z) + (1 - lambda) D_KL(Q || Z)          (3)

where Z = lambda P + (1 - lambda) Q is the mutual source of P and Q. Whenever we cannot determine the relative importance of P and Q we take equal priors; that is, lambda = 1/2. Note that any other form of divergence between distributions may be used; we use the JS divergence due to its optimality properties. In particular, one may use any of the L_p norms, the standard cosine function for Euclidean vectors, or the Hamming distance, all of which approximate the JS divergence to a certain extent.
Given an object d and a candidate subset S, we can use JS(Z(S), P_d) to measure the plausibility of the subset S. Intuitively, when we add irrelevant labels to S we increase JS(Z(S), P_d), because we 'dilute' the mutual source, which becomes more distant from the empirical distribution of the object. On the other hand, when we add relevant labels to S we decrease JS(Z(S), P_d), because the mutual source becomes more similar to the empirical distribution P_d of the document. Thus we can use JS(Z(S), P_d) to drive a greedy or non-greedy algorithm for choosing the (near) optimal subset of labels from the candidate set S.
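The dilution effect described above can be demonstrated numerically. This is a minimal sketch of Eqs. (2) and (3) with uniform priors and lambda = 1/2; the three-feature distributions are illustrative only.

```python
# Sketch: mixing an irrelevant label into the subset moves the mutual source
# away from the object's empirical distribution, raising the JS divergence.
import math

def kl(p, q):
    """Kullback-Leibler divergence D_KL(p || q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q, lam=0.5):
    """Jensen-Shannon divergence of Eq. (3)."""
    z = [lam * pi + (1 - lam) * qi for pi, qi in zip(p, q)]
    return lam * kl(p, z) + (1 - lam) * kl(q, z)

def mixture(dists):
    """Mutual source of Eq. (2) with uniform priors."""
    k = len(dists)
    return [sum(d[j] for d in dists) / k for j in range(len(dists[0]))]

p_d      = [0.5, 0.4, 0.1]    # empirical distribution of the object d
relevant = [0.5, 0.45, 0.05]  # a label close to d
diluter  = [0.1, 0.1, 0.8]    # an irrelevant label

# Adding the irrelevant label dilutes the mutual source.
print(js(mixture([relevant]), p_d) < js(mixture([relevant, diluter]), p_d))
```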
FIG. 7 depicts greedy method number 1 for the selection of a subset from the candidate set generated either by the method of FIG. 4 or by the method of FIG. 6. At 140, the following inputs are provided: (i) A set of labels
(the candidate set); (ii) an object d to be categorized into the hierarchy. At 142, the system initializes the optimal set to the empty set and initializes D to the maximal distance between labels (that is, between the distributions associated with these labels) in the tree; distances are computed according to equation (3). At 144, the optimal subset is stored in the computer memory, and the label from the candidate set closest to the object is selected and added to the optimal set. The mutual source of the optimal subset is computed at 146, using Eq. (2). The new value of D is computed at 148 as the distance between the mutual source and the object d. At 150 a decision is made according to the Boolean condition of whether the new value of D is larger than the previous value. If the condition at 150 is false, the value of D is reassigned at 152 to the newly computed value and the system returns from 156 to step 144; otherwise, the previous optimal subset is output and the algorithm halts. Observe that if the number of labels in the candidate set S is small, an exhaustive search through all subsets of S is possible and is recommended for achieving an optimal subset of labels.
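The greedy loop of FIG. 7 can be sketched as follows, assuming uniform priors in Eq. (2) and lambda = 1/2 in Eq. (3). The candidate labels and their distributions are illustrative only.

```python
# Sketch of the FIG. 7 greedy selection: repeatedly add the candidate label
# whose inclusion brings the mutual source closest (in JS divergence) to the
# object's empirical distribution; stop when the divergence starts to rise.
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    z = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, z) + 0.5 * kl(q, z)

def mixture(dists):
    k = len(dists)
    return [sum(d[j] for d in dists) / k for j in range(len(dists[0]))]

def greedy_subset(candidates, p_d):
    """candidates: label -> distribution. Returns the chosen label subset."""
    chosen = {}
    best = float("inf")
    remaining = dict(candidates)
    while remaining:
        # label whose addition brings the mutual source closest to d
        label = min(remaining, key=lambda c: js(
            mixture(list(chosen.values()) + [remaining[c]]), p_d))
        d_new = js(mixture(list(chosen.values()) + [remaining[label]]), p_d)
        if d_new > best:   # adding more labels only dilutes the source
            break
        best = d_new
        chosen[label] = remaining.pop(label)
    return set(chosen)

candidates = {
    "soccer":  [0.5, 0.4, 0.1],
    "sports":  [0.45, 0.45, 0.1],
    "cooking": [0.1, 0.1, 0.8],
}
p_d = [0.5, 0.4, 0.1]          # empirical distribution of the object d
print(greedy_subset(candidates, p_d))  # {'soccer'}
```

Here the object matches the "soccer" distribution exactly, so any further label would raise the divergence and the loop halts after one addition.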
Note: when no example objects are given at some or all of the nodes, we operate as follows. As mentioned earlier, the assumption that classifiers are associated with tree topics is more restrictive than the assumption of example objects, and in particular we envisage three modes for the construction of a classifier at each node: (i) training using a labeled set of examples; (ii) training using unlabeled data; and (iii) a hand-coded classifier, i.e., one which was constructed by hand and not learned from data. In the first case we may directly use the best labeled data as examples, where by 'best' we refer to those examples achieving the highest rank by the classifier. In case (ii) we may simply pass the data through the classifiers and select, at each node, the inputs receiving the highest rank. Finally, in case (iii) (which often corresponds to a Boolean-query type classifier, e.g., a decision tree) we use the given classifiers to generate random example objects in each of the nodes. If the associated classifier is hard (taking the value 0 or 1), we simply generate an artificial document which obeys its conditions. If the (binary)
classifier is soft, assume that it can be represented by a real-valued function g(x), where large values of g(x) correspond to high confidence in the class c and small values correspond to low confidence.
We then introduce a Gibbs distribution on the space of possible objects through the definition P(x) = exp(g(x)/T)/Z, where T is a temperature parameter and Z is a normalization constant. At this point standard approaches based on simulated annealing and Markov-chain Monte Carlo methods (e.g., [Neal, R., Bayesian Learning for Neural Networks, Lecture Notes in Statistics No. 118, Springer-Verlag, 1996 - herein Nea96]) may be used to generate representative samples.
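Sampling from such a Gibbs distribution can be sketched with a simple Metropolis chain. This is an illustration only: the finite object space and the confidence function g below are invented stand-ins, whereas in practice x would range over artificial documents.

```python
# Sketch: Metropolis sampling from P(x) = exp(g(x)/T)/Z over a small finite
# object space, so high-confidence objects dominate the generated examples.
import math
import random

def metropolis(space, g, temp=1.0, steps=5000, seed=0):
    rng = random.Random(seed)
    x = rng.choice(space)
    samples = []
    for _ in range(steps):
        y = rng.choice(space)  # uniform proposal
        # accept with probability min(1, exp((g(y) - g(x)) / temp))
        if rng.random() < math.exp(min(0.0, (g(y) - g(x)) / temp)):
            x = y
        samples.append(x)
    return samples

space = list(range(5))
g = lambda x: float(x)   # higher value = higher classifier confidence
samples = metropolis(space, g, temp=0.5)
# High-confidence objects should dominate the generated examples.
print(samples.count(4) > samples.count(0))  # True
```

Lowering the temperature T sharpens the distribution around the highest-confidence objects, which is the annealing knob mentioned above.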
FIG. 8 depicts another greedy method for the selection of a subset from the candidate set generated either by the method of FIG. 4 or by the method of FIG. 6. At 160, the following inputs are provided: (i) a set of labels (the candidate set); (ii) an object d to be categorized into the hierarchy. At 162, the optimal set is initialized to the empty set, D is set to the maximal distance between members of the candidate set, and M is set to zero. A single label is chosen from the candidate set at 164, for which the distance from the optimal set (combined with this single label) to the object d is minimal. The distance between M labels and an object is computed using the Jensen-Shannon divergence (formula (3)) between the mutual source of the M labels and the object, where the mutual source of the M labels is computed using formula (2). At 166 the distance D is set to the minimum distance computed at step 164. Then M is incremented. From step 168, the method proceeds as in step 150 of FIG. 7.

FIG. 9 depicts the exhaustive optimal subset selection algorithm. At 180
, the following inputs are provided: (i) a set of labels (the candidate set); (ii) an object d to be categorized into the hierarchy. At 182 a decision is made depending on whether the labels are associated with example documents. If example objects do not exist, the classifiers are used at 184 to generate them; otherwise, the empirical distributions of the members of the candidate set are computed at 186. The distance between the object d and every subset of the candidate set is computed at 192, and the subset leading to the smallest distance is designated the optimal subset. The optimal subset is output at 194.
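The exhaustive search of FIG. 9 can be sketched as follows, again assuming uniform priors in Eq. (2) and lambda = 1/2 in Eq. (3); it is feasible only when the candidate set is small. Labels and distributions are illustrative only.

```python
# Sketch of the FIG. 9 exhaustive selection: score every non-empty subset of
# the candidate set by the JS divergence between its mutual source and the
# object's empirical distribution, and keep the minimizer.
import itertools
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    z = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, z) + 0.5 * kl(q, z)

def mixture(dists):
    k = len(dists)
    return [sum(d[j] for d in dists) / k for j in range(len(dists[0]))]

def exhaustive_subset(candidates, p_d):
    """Return the JS-minimizing non-empty subset of candidate labels."""
    labels = list(candidates)
    best_subset, best_d = None, float("inf")
    for r in range(1, len(labels) + 1):
        for subset in itertools.combinations(labels, r):
            d = js(mixture([candidates[l] for l in subset]), p_d)
            if d < best_d:
                best_subset, best_d = set(subset), d
    return best_subset

candidates = {
    "soccer":  [0.5, 0.4, 0.1],
    "sports":  [0.45, 0.45, 0.1],
    "cooking": [0.1, 0.1, 0.8],
}
p_d = [0.5, 0.4, 0.1]
print(exhaustive_subset(candidates, p_d))  # {'soccer'}
```

Unlike the greedy methods of FIGS. 7 and 8, this search examines all 2^k - 1 subsets, so it is guaranteed optimal but practical only for small k.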