WO2005072405A2

WO2005072405A2 - Enabling recommendations and community by massively-distributed nearest-neighbor searching

Info

Publication number: WO2005072405A2
Application number: PCT/US2005/002731
Authority: WO
Inventors: Gary Robinson
Original assignee: Transpose, Llc
Priority date: 2004-01-27
Filing date: 2005-01-27
Publication date: 2005-08-11
Also published as: US8190082B2; US7783249B2; WO2005072405A3; US20080261516A1; US20060020662A1; US20110143650A1; US20130040556A1

Abstract

The computer associated with each of a potentially large number of end users (fig. 1, 20a-20c) is harnessed to provide a massively-distributed mechanism for finding the nearest neighbors of each user, according to tastes and/or interests. Once these nearest neighbors are determined, their taste and/or interest profiles (fig. 1, item 5) are leveraged for highly accurate recommendations, and their online addresses (fig. 2) are leveraged for community purposes.

Description

ENABLING RECOMMENDATIONS AND COMMUNITY BY MASSIVELY-DISTRIBUTED NEAREST-NEIGHBOR SEARCHING

RESERVATION OF COPYRIGHT Copyright © 2003, 2004, 2005 Transpose, LLC. This application includes material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent disclosure as it appears in patent office files or records, but otherwise reserves all copyright rights whatsoever.

CROSS-REFERENCE TO RELATED APPLICATIONS Applicants hereby claim priority from and benefit of the following U.S. Provisional Patent Applications: 60/540,041 filed 27 Jan 2004, for Enabling Recommendations and Community by Massively-Distributed Nearest-Neighbor Searching; 60/611,222 filed 18 Sep 2004 for Community and Recommendation System; and 60/635,197 filed 9 Dec 2004 for Community and Recommendation System. Applicants hereby incorporate by reference herein to the fullest extent allowed by law the entire disclosure of each of the aforesaid applications, including all text, drawings, and code whether on paper or machine-readable media.

TECHNICAL FIELD The present invention is in the fields of collaborative filtering and online community, typically as implemented on networks of communicating computers.

BACKGROUND ART Collaborative filtering systems are well known, as are online community systems. Examples of the former include Amazon.com's recommendation technology and other similar systems such as eMusic.com's. Examples of the latter include Google Groups. However, none of the existing solutions effectively leverages the fact that users of online recommendation systems and online community systems typically own their own computers, and have the opportunity to make the central processing units of those computers available for making such systems more useful and enjoyable. In particular, the task of matching people with extremely similar tastes and interests becomes very computationally difficult as the number of people increases and as the complexity of the similarity measure increases. With hundreds of thousands or even millions of people such as are typically enrolled in major online services, limitations of server hardware resources constrain the system's ability to find the best matches between people based on taste and interest. To the degree that such matches are made with real accuracy, "neighborhoods" of individuals with extremely similar interests may be formed that can be used for purposes of recommendation and community. What is needed, then, is an effective way of leveraging the computers owned by end-users of a community and recommendation system for the purpose of massively-distributed similarity searching.

SUMMARY OF THE INVENTION The present invention puts the computer used by a particular end-user (the 'client computer' or 'client machine') to work in finding his or her best matches, thus offloading that computational load from the server. (In some variants, some users' computers may do that work for a manageable number of other users; for purposes of example this summary will not discuss those details.) To enable the computations to occur in the client machines, the necessary data needs to be transported there. This data consists, at least in part, of 'profiles' of various users. Various embodiments do this in different ways, the common denominator being that profiles that are relatively likely to be matches to the user for whom neighbors are being sought arrive first. Then the client computer conducts a substantially (or completely) exhaustive search of that available data for the very best matches.

Typically at least part of the profile data performs a dual purpose. First, it is used for similarity calculations. Second, it is used for display purposes, so that a user can view taste information pertaining to his neighbors. For instance, in a typical music application, this will include song title and artist information for songs in the neighbors' collections.

This disclosure will make use of a detailed listing of key aspects, followed by a glossary containing definitions for terms used therein.

ASPECT 1: A networked computer system for supplying recommendations and taste-based community to a target user, comprising: networked means for providing representations of nearest neighbor candidate taste profiles and associated user identifiers in an order such that said nearest neighbor candidate taste profiles tend to be at least as similar to a taste profile of the target user according to a predetermined similarity metric as are subsequently retrieved ones of said nearest neighbor candidate taste profiles, means to receive said representations of nearest neighbor candidate taste profiles and associated user identifiers on at least one neighbor-finding user node, said neighbor-finding user nodes each having at least one similarity metric calculator calculating said predetermined similarity metric, at least one selector residing on at least one of said neighbor-finding user nodes using the output of said at least one similarity metric calculator for building a list representing the nearest-neighbor users, said list representing said nearest-neighbor users providing access to associated ones of said candidate profiles, a nearest-neighbor based recommender which uses said associated ones of said candidate profiles to recommend items, a display for viewing identifiers of recommended items, a display for viewing identifiers of a plurality of nearest neighbor users, means to select at least one of said nearest neighbor users from said display of identifiers of a plurality of nearest neighbor users, a display of information relating to at least one of the items in said nearest neighbor user's collection, whereby massively distributed processing is harnessed in a bandwidth-conserving way for finding the best neighbors out of the entire population of users, and the same neighborhood is leveraged to provide recommendations as well as highly focused taste-based community for sharing the enjoyment of items including recommended items.

ASPECT 2: The networked computer system of ASPECT 1, further including means to facilitate communication with at least said nearest neighbor users where the type of communication comprises at least one selected from the group consisting of online chat, email, online discussion boards, voice, and video.

ASPECT 3: A networked computer system for supplying recommendations and taste-based community to a target user, comprising an ordered plurality of nearest neighbor candidate taste profiles and associated user identifiers such that said nearest neighbor candidate taste profiles tend to be at least as similar to a taste profile of the target user according to a predetermined similarity metric as are subsequently positioned ones of said nearest neighbor candidate taste profiles, networked means to receive said nearest neighbor candidate taste profiles and associated user identifiers on at least one neighbor-finding user node, said neighbor-finding user nodes each having at least one similarity metric calculator calculating said predetermined similarity metric, at least one selector residing on at least one of said neighbor-finding user nodes using the output of said at least one similarity metric calculator for building a list representing the nearest-neighbor users, said list representing said nearest-neighbor users providing access to associated ones of said candidate profiles, a nearest-neighbor based recommender which uses said associated ones of said candidate profiles to recommend items, a display for viewing identifiers of recommended items, a display for viewing identifiers of a plurality of nearest neighbor users, means to select at least one of said nearest neighbor users from said display of identifiers of a plurality of nearest neighbor users, a display of information relating to at least one of the items in said nearest neighbor user's collection, whereby massively distributed processing is harnessed in a bandwidth-conserving way for finding the best neighbors out of the entire population of users, and the same neighborhood is leveraged to provide recommendations as well as highly focused taste-based community for sharing the enjoyment of items including recommended items.

ASPECT 4: The networked computer system of ASPECT 1, further including a single downloadable file that contains software that executes all necessary non-server computer instructions.

GLOSSARY

REPRESENTATION: In the above discussion of "aspects," representations may be the user profiles themselves (including the taste profiles), or just the taste profiles (which should include an identifier of the user) — or they may be user ID's of the users, or URL's enabling the data to be located on the network, or any other data that allows taste profiles and associated user ID's to be accessed. These are all functionally equivalent from the standpoint of the invention.

TASTE PROFILE: This term refers to data representing an individual's tastes or interests. It can take many forms. It may be the XML file generated by Apple's iTunes application which contains a list of music files in the user's collection as well as how many times he has played each one, and other related information. This is a fairly complete profile, having the disadvantage that it tends to consume a fairly large number of bytes that thus take significant bandwidth to download.

Other profile types include simple lists of song identifiers or album or artist identifiers, or various combinations thereof. In non-music domains, other examples include book ISBN's, or author names, or combinations thereof; or weblog URL's, or weblog posting identifiers, or combinations thereof; or any of a multitude of other representations of a user's tastes and/or interests. Just as different profile types may contain various different types of data, there are many formats that can be used for representing such data to be processed by a computer. XML is one, but such specifications as CORBA and many others provide ways that data objects can be represented and transported across a network, and in general such formats as vectors or other binary or text-based formats can be used.

A taste profile is data that represents a user's tastes and/or interests. The format and contents are particular to particular embodiments, and it must not be construed that the present invention is limited in scope to particular contents or formats as long as the data comprises a user's tastes and/or interests or some useful summary thereof.

Further, it should be noted that a user may have a plurality of taste profiles. For instance, a user may have one type of music he likes to listen to while studying, and another type he likes to listen to while dancing. Preferred embodiments of the invention allow the user to choose different taste profiles (and correspondingly different nearest neighbors and recommendations) according to mood.

Still further, note that taste profiles may be either manually or passively generated. For instance the iTunes application captures user activity in the course of playing music, and stores it to its associated XML file. The user does not have to make any separate effort to cause a taste profile to be generated based upon that data. On the other hand, taste profiles can be manually generated by manually supplying ratings to items such as songs, movies, or artists. A playlist — a list of songs a user likes to play together, and which has usually been generated manually — can be considered in some embodiments to be a taste profile. Some embodiments use taste profiles that incorporate a combination of passively and actively collected data. For instance, a profile may include manually-generated ratings of songs, as well as the number of times each song has been played.

Finally, note that taste profiles do not necessarily include data directly entered by the user; they can instead be a computer-derived representation. For instance, in embodiments which associate information such as genre or tempo with songs, software developers of ordinary skill will be able to see how to summarize data for songs the user has in his collection to create a profile showing which genres or tempos the user likes most; that information may then comprise the user's taste profile. Or, in certain embodiments with numeric values for attributes, the log of the values may be used.

TARGET USER: The aspect discussion describes the invention in a way that focuses on serving a particular user, whom we call the "target user." There are a plurality of users who could be considered to be target users, but for descriptive purposes we focus on one such user.

USER PROFILE: A user profile contains information related to the individual such as his name, contact information, and biographical text. It also contains his taste profile. An embodiment may make all, some, or none of this information publicly available.

SIMILARITY METRIC: Degrees of similarity are computed according to a similarity metric, which is not necessarily a "metric" in the formal sense of a "metric space" as that term is used in mathematical literature (for instance http://en.wikipedia.org/wiki/Metric_space). A very great variety of similarity metrics are available. There is necessarily a correspondence between the nature of the similarity metric and the taste profile, because similarity metrics often require particular types of data.

For instance, if ratings data is present where numerical values are given such as on a scale from 1 to 7 where 1 is poor and 7 is excellent, such simple methods can be used as computing average difference between the ratings of the items which have ratings in both taste profiles. Other techniques include computing a Euclidean distance, Mahalanobis distance, cosine similarity, or Pearson's r correlation using that data [13, 15]. Another approach is given in [16], beginning column 20, line 59. Any other computation that results in a metric that tends to be indicative of similarities of taste between the two users can be used.
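As an illustration only, the simpler ratings-based measures named above might be sketched as follows. The dictionary representation of a taste profile (item identifier mapped to a rating from 1 to 7) and all function names are assumptions made for this sketch; they are not taken from the source code in the appendices.

```python
import math


def co_rated(a, b):
    """Return paired ratings for the items that have ratings in both profiles."""
    shared = sorted(set(a) & set(b))
    return [a[i] for i in shared], [b[i] for i in shared]


def mean_abs_difference(a, b):
    """Average difference between ratings of co-rated items (lower = more similar)."""
    xs, ys = co_rated(a, b)
    if not xs:
        return None  # nothing in common, so no basis for comparison
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


def euclidean_distance(a, b):
    """Euclidean distance over co-rated items (lower = more similar)."""
    xs, ys = co_rated(a, b)
    if not xs:
        return None
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)))


def cosine_similarity(a, b):
    """Cosine of the angle between the co-rated rating vectors."""
    xs, ys = co_rated(a, b)
    norm = math.sqrt(sum(x * x for x in xs)) * math.sqrt(sum(y * y for y in ys))
    if norm == 0:
        return None
    return sum(x * y for x, y in zip(xs, ys)) / norm


def pearson_r(a, b):
    """Pearson's r correlation over co-rated items."""
    xs, ys = co_rated(a, b)
    n = len(xs)
    if n < 2:
        return None
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # correlation is undefined when one side has no variance
    return sxy / math.sqrt(sxx * syy)


# Hypothetical profiles: three songs are rated by both users.
alice = {"song_a": 7, "song_b": 5, "song_c": 1, "song_d": 4}
bob = {"song_a": 6, "song_b": 5, "song_c": 2, "song_e": 7}
print(mean_abs_difference(alice, bob), pearson_r(alice, bob))
```

Note that the first two functions are distances (smaller means more similar) while the last two are similarities (larger means more similar), so an embodiment would need to orient them consistently before ranking candidates.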

In many embodiments data is massaged to make it more appropriate for use with certain popular similarity metrics. For instance, in a music application when song play counts are included in the taste profile, the songs may be ranked in order of frequency of play; songs in the top seventh have an "implied rating" of 7, songs in the next seventh have an implied rating of 6, etc. This data can then be used with similarity metrics such as those mentioned above.
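The implied-rating conversion just described can be sketched as below. This is a minimal sketch under stated assumptions: play counts are held in a dictionary keyed by a song identifier, and ties in play count are broken by identifier, a detail the text does not specify.

```python
def implied_ratings(play_counts, levels=7):
    """Convert play counts to implied ratings by rank band.

    Songs are ranked from most to least played. The top 1/levels of the
    ranking receives the rating `levels`, the next band `levels - 1`, and
    so on down to 1. Tie-breaking by song identifier is an assumption.
    """
    ranked = sorted(play_counts, key=lambda song: (-play_counts[song], song))
    n = len(ranked)
    ratings = {}
    for position, song in enumerate(ranked):
        band = position * levels // n  # 0 for the top band, levels - 1 for the bottom
        ratings[song] = levels - band
    return ratings


# Hypothetical collection of 14 songs, so each seventh holds two songs.
counts = {"s%02d" % i: 100 - i for i in range(14)}
print(implied_ratings(counts))
```

With fewer songs than rating levels, some rating values are simply never assigned; an embodiment might prefer a different banding rule in that case.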

Note that some similarity metrics, such as Pearson's r, enable the computation of levels of probabilistic certainty, or p-values, with respect to a null hypothesis. In many cases, such as r, it is possible to state a null hypothesis that roughly corresponds to the concept "the two users have no particular tendency to agree." This enables the system to take into account the fact that some pairs of users have more data to base the metric on than others, and thus more reason to have confidence. This is a significant advantage over many of the simpler techniques. However, this approach nevertheless has a drawback. As an example consider two users with a very large number of items in common which they have each rated, where a p-value derived from r is used as the metric. Suppose further that on average, there is a slight tendency to agree rather than disagree. Then, simply due to the large number of items with ratings in common, the p-value may be extremely indicative of rejection of the null hypothesis, even though on average, there isn't a very unusual amount of agreement between ratings. In practical use with a large number of users, where not too many nearest neighbors need to be found, this effect is normally not a major problem, because there will also be users who do have a lot of agreement and who also have a high number of rated items in common, and such pairings will result in even greater extremities of p-values. In such cases, there can be a lot of confidence that the similarity metric is finding users who are actually very similar in taste, even though there may be other pairings, with even more similarity, that are left behind due to not having as much data for comparison.
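The drawback described above can be illustrated numerically. The sketch below derives an approximate one-sided p-value from r using the Fisher z-transformation with a normal approximation; that choice of approximation is an assumption of this sketch and is not stated in the specification, which does not say how the p-value is to be computed.

```python
import math


def one_sided_p_value(r, n):
    """Approximate p-value for observing a correlation of at least r over n
    co-rated items, under the null hypothesis of no tendency to agree.

    Uses the Fisher z-transformation: atanh(r) is treated as normal with
    standard deviation 1 / sqrt(n - 3). This approximation is an assumption
    of the sketch.
    """
    if n < 4 or not -1.0 < r < 1.0:
        raise ValueError("need n >= 4 and -1 < r < 1")
    z = math.atanh(r) * math.sqrt(n - 3)
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper tail of the standard normal


# Slight agreement over very many shared items versus strong agreement over few.
p_weak_but_large = one_sided_p_value(0.1, 5000)
p_strong_but_small = one_sided_p_value(0.8, 10)
print(p_weak_but_large, p_strong_but_small)
```

In this hypothetical comparison the weakly agreeing pair receives the more extreme p-value purely because of the amount of shared data, which is the effect the paragraph above warns about.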

The immediately preceding paragraphs focus on situations where degrees of agreement can be discerned for each item. Another type of profile involves presence/absence data, where all that is known about each item is whether a user is associated with it or not (for instance whether a user has a particular song in his collection or not). In such cases, such calculations as the well-known Jaccard's Index, Sorensen's Quotient of Similarity, or Mountford's Index of Similarity can be useful.
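For presence/absence data, two of the indices named above have simple closed forms and can be sketched as follows. Collections are assumed here to be sets of item identifiers; Mountford's index, which is normally obtained iteratively, is omitted from this sketch.

```python
def jaccard_index(a, b):
    """Jaccard's Index: shared items divided by all distinct items."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def sorensen_quotient(a, b):
    """Sorensen's Quotient of Similarity: 2 * shared / (size of a + size of b)."""
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


# Hypothetical collections sharing two songs.
mine = {"s1", "s2", "s3", "s4"}
theirs = {"s3", "s4", "s5"}
print(jaccard_index(mine, theirs), sorensen_quotient(mine, theirs))
```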

Some embodiments combine different similarity metrics. For instance r can be used to compute a degree of similarity in ratings of items that are in common between two users, and Jaccard's Index to compute the degree of similarity implied by the numbers of items that are and are not in common between the users. An average or geometric mean (weighted or not) may be used to combine the metrics into one that incorporates both kinds of information; other techniques such as p-value combining with respect to a null hypothesis ([16]) can be used as well, by converting the metrics into p-values.
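A combination by weighted average or weighted geometric mean might look like the sketch below. It assumes each component score has already been placed on a 0-to-1 scale where larger means more similar; the linear rescaling of r from [-1, 1] to [0, 1] shown in the example is one hypothetical way to do that and is not prescribed by the text.

```python
import math


def weighted_mean(scores, weights=None):
    """Weighted arithmetic mean of similarity scores."""
    weights = weights or [1.0] * len(scores)
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


def weighted_geometric_mean(scores, weights=None):
    """Weighted geometric mean of similarity scores in [0, 1]."""
    weights = weights or [1.0] * len(scores)
    if any(s <= 0 for s in scores):
        return 0.0  # any zero component forces the product to zero
    log_sum = sum(w * math.log(s) for s, w in zip(scores, weights))
    return math.exp(log_sum / sum(weights))


# Hypothetical component values for one pair of users.
r = 0.6            # Pearson's r over co-rated items
jaccard = 0.25     # Jaccard's Index over the two collections
r_scaled = (r + 1.0) / 2.0  # assumed rescaling of r onto [0, 1]
print(weighted_mean([r_scaled, jaccard], [2.0, 1.0]))
print(weighted_geometric_mean([r_scaled, jaccard]))
```

The geometric mean penalizes a pair that scores poorly on either component more heavily than the arithmetic mean does, which may or may not be desirable for a given embodiment.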

Source code included in the file tasteprofileclass.py in Appendix 4 accompanying this specification takes a different approach for computing similarity based on iTunes' XML file. Consider a "shared song" to be a song that is in the collection of both users. This method calculates an approximate probability that the next shared song to come into existence will be the next song played. That is, if user A takes a recommendation from B's collection, it will be a song that A doesn't have yet. When he has it, it will be another shared song. What is the probability that it will be the next song played, once it is in A's collection? This is a particularly appropriate similarity measure, because it measures similarity of tastes in a way that directly relates to a key purpose of finding nearest neighbors: making recommendations that the user will want to play frequently. Details of the algorithm appear in the source code. That algorithm is the currently preferred similarity metric.

The only requirement of the similarity metric is that, for a significant portion of pairs of users which includes those who tend to be the most similar in taste, the following applies: if the calculated similarity of two taste profiles A and B is greater than the calculated similarity of two taste profiles A and C, then it is likelier than not that users A and B are actually more similar in relevant tastes than are users A and C. This likelihood will be greater for similarity metrics that will be associated with the highest-performing embodiments of the invention. For instance, simply using the average distance between ratings may be acceptable for some applications, but using Euclidean distance is better than a simple average.

There are many ways to calculate similarity. Other than the requirement above, the invention has no dependence on the particular similarity metric that may be chosen by a particular embodiment. The invention must not be construed to be limited to a particular similarity metric or type of similarity metric; the ones listed here are for reasons of example only. Similarity metrics are interchangeable for purposes of the invention.

MEANS FOR FACILITATING RETRIEVAL OF REPRESENTATIONS: There are a variety of ways to provide the functionality needed. It must be stressed that all provide identical or equivalent functionality for the purposes of the invention. While there are several basic structures available, there are many variants for each that are only insubstantially different and should not be construed as different in a way that would make them fall outside the scope of the invention.

What is needed is a means for facilitating retrieval of representations of nearest neighbor candidate taste profiles and associated user identifiers in an order such that said nearest neighbor candidate taste profiles tend to be at least as similar to a taste profile of the target user according to a predetermined similarity metric as are subsequently retrieved ones of said nearest neighbor candidate taste profiles.

The representations mentioned in the previous paragraph may be the user profiles themselves (including the taste profiles), or just the taste profiles (which should include an identifier of the user); or they may be user ID's of the users, or URL's enabling the data to be located on the network, or any other data that allows taste profiles and associated user ID's to be accessed. These are all functionally equivalent from the standpoint of the invention.

It is important to note that the means for facilitating this retrieval does not need to make use of the predetermined similarity metric or a calculator that can calculate it. In particular, it isn't required that the retrieval of representations is exactly in the same order that would be given by the similarity metric.

One implication of this is that even if the similarity metric is not a metric in the sense of a metric space, a metric space-based metric can be used in the means for facilitating this retrieval. This makes available a large number of algorithms in the literature for facilitating the retrieval. In preferred embodiments the data used in facilitating this retrieval is a subset of the data used in the similarity metric, or a summary derived from that data, or a combination of the two, in order to lower computational costs.

1) Pre-existing data structures

Data structures may be created that provide the foundation for retrieval in the necessary order or sequence. For instance, clustering may be done using a variety of methods. See, for example, [1] and [2] which apply to "metric spaces," that is, a structure involving a distance function where the function used to compute the distance between any two objects satisfies the positivity, symmetry, and triangle inequality postulates. Such a distance function can be a similarity metric; examples include Euclidean distance.

See also [3] which works on large binary data sets where data points have high dimensionality and most of their coordinates are zero. For instance this can be used to cluster based upon attributes consisting of indicators of whether or not a user has a particular song in his collection. See also [4].

Appendix 4 contains an algorithm which uses genre data (genrerankhandler.py), but a practitioner of ordinary skill in the art will see how to modify it for use with other kinds of data which are of limited dimensionality.

For a given clustering scheme, practitioners of ordinary skill in the art will know how to compare a particular taste profile to a particular cluster of taste profiles, and thus determine an affinity between each cluster and the taste profile.

Then, the cluster with the most computed affinity to the given taste profile is first in the retrieval order, the cluster with the next most computed affinity is returned next, etc. Of course, there can be some degree of difference from this strict order without violating the spirit of the invention or moving outside its scope. When we discuss retrieving a cluster, we mean either a set of representations of nearest neighbor candidate user profiles, or a representation of such representations. For instance such a representation can be the name or Internet address of a file containing the representations of candidates.
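One way to picture the affinity-ordered retrieval of clusters is the sketch below. It assumes, purely for illustration, that each cluster is summarized by a centroid vector and that affinity is the cosine similarity between the target profile's vector and that centroid; the specification leaves the affinity computation to the chosen clustering scheme. The cluster file names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return sum(x * y for x, y in zip(u, v)) / norm if norm else 0.0


def cluster_retrieval_order(target_vector, centroids):
    """Return cluster names ordered from most to least affinity to the target.

    `centroids` maps a cluster representation (here a hypothetical file name)
    to that cluster's centroid vector.
    """
    return sorted(
        centroids,
        key=lambda name: cosine(target_vector, centroids[name]),
        reverse=True,
    )


# Hypothetical three-dimensional summary vectors.
target = [1.0, 0.0, 0.2]
centroids = {
    "cluster_a.dat": [0.9, 0.1, 0.1],
    "cluster_b.dat": [0.0, 1.0, 0.0],
    "cluster_c.dat": [0.5, 0.5, 0.5],
}
print(cluster_retrieval_order(target, centroids))
```

The client would then fetch the named files in that order and run the predetermined similarity metric on the profiles each one contains.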

Another approach which uses clustering is given in [5]. Clusters are not the only kind of structure that can be used. See, for example, [6] and [4]. Practitioners of ordinary skill will see how to use such structures for retrieving in an order consistent with the needs of the invention. Many such structures exist with different details of implementation, but these details are not substantial differences for the purposes of the invention. It is not possible to list all possible combinations of such details, and it must not be construed that one can move outside of the scope of the invention merely by finding such variations on the structures listed here, which it cannot be stressed enough are listed for reasons of example only.

The source code in Appendix A provides the exemplary key aspects of one particular method for causing the representations to be retrieved in an order consistent with the needs of the invention. See the explanatory text in the section for clusterfitterclass.py.

Of course preferred embodiments update or replace these structures over time as taste profiles associated with users change, and users are added to or removed from the database associated with the embodiment.

Note further that the data structure may be built and stored on a central server, on machines owned by end-users of the invention which communicate their results directly to a server and/or to other end-user machines via peer-to-peer means, or on a combination. It must not be construed that a system falls outside of the scope of the invention merely because the necessary computational and storage resources for the foundation for retrieval are provided at one location or set of locations rather than another, or one type of network node rather than another.

As one example of a combined approach, consider [7]. That paper provides an algorithm to do clustering based on nearest neighbors. It can be leveraged to produce a combined approach as follows.

Use a peer-to-peer system such as the Gnutella protocol or any other protocol that enables one to search for a file. Each end-user machine is a node in such a network, also known as a "cloud."

Each end-user machine then conducts a search for each of the files (or a substantial subset of the files) that are already in that machine's collection, using the words in the name of each file (or a substantial subset of them). A "hit" occurs when the protocol returns an identifier of a node that has a file with matching words in its name.

Some searches will get more "hits" than others. For purposes of the algorithm in [7], "nearest neighbors" will have a different definition than the one involving the predetermined similarity metric of the present invention. It involves a couple of components.

The first component is "hit-nearness." Suppose a query returns only 1 hit. That means that the node identified by that hit is considered to be in the first tier of hit-nearness. If it returns 2 hits, each of the nodes is considered to be in the second tier of hit-nearness. And so on. The tiers are ranked, and the ranks are divided by the number of tiers. If T is the number of tiers, the best hit-nearness is 1/T, the next best is 2/T, and the worst is T/T (that is, 1).

The next component is "quantity-nearness". We count the number of times a particular node's identifier is retrieved in the process of searching for files. We create tiers based on those numbers using the same tiered approach as for hit-nearness, again resulting in a number between 0 and 1 where the worst node (the node with the smallest number of hits) has a quantity-nearness, Q, of 1.

Then the distance of a node to the node doing the search is the square root of T * Q. So the ordering of each node's neighbors for the algorithm in [7] is laid out that way.
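The two nearness components and the resulting distance can be sketched as follows. Several points are assumptions of this sketch, because the text leaves them open: the factor written as T in the distance formula is read as the node's hit-nearness value (not the tier count); a node returned by several queries is given the best hit-nearness among them; and repeated identifiers within one query result are counted once.

```python
import math


def tier_scores(values, best_is_low):
    """Map each distinct value to rank / T, where T is the number of tiers
    and the best tier receives 1 / T."""
    tiers = sorted(set(values), reverse=not best_is_low)
    return {value: (rank + 1) / len(tiers) for rank, value in enumerate(tiers)}


def node_distances(query_hits):
    """Compute a distance from the searching node to every node that was hit.

    `query_hits` is a list with one entry per file-name query; each entry is
    the list of node identifiers returned for that query.
    """
    results = [set(hits) for hits in query_hits if hits]
    # Hit-nearness: queries returning fewer hits form better tiers.
    hit_tier = tier_scores([len(hits) for hits in results], best_is_low=True)
    hit_nearness, counts = {}, {}
    for hits in results:
        score = hit_tier[len(hits)]
        for node in hits:
            # Assumption: keep the best (smallest) hit-nearness seen for a node.
            hit_nearness[node] = min(score, hit_nearness.get(node, 1.0))
            counts[node] = counts.get(node, 0) + 1
    # Quantity-nearness: nodes retrieved more often form better tiers.
    quantity_tier = tier_scores(counts.values(), best_is_low=False)
    return {
        node: math.sqrt(hit_nearness[node] * quantity_tier[counts[node]])
        for node in counts
    }


# Hypothetical results of four file-name searches.
hits = [["n1"], ["n1", "n2"], ["n1", "n2", "n3"], ["n2", "n3"]]
distances = node_distances(hits)
print(sorted(distances, key=distances.get))
```

Sorting nodes by this distance gives the neighbor ordering that would be uploaded to the server for use by the algorithm in [7].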

The work of finding neighbors for [7] is thus carried out on the end-user machines. Then, that nearest neighbor information is uploaded to the server from each node, and the algorithm in [7] is carried out there.

For instance, the algorithm could include Gnutella protocol code, and use the procedure described above to cluster similar taste profiles together, where similarity is determined by having more neighbors in common (rather than by our predetermined similarity metric).

Then to determine the order in which clusters should be downloaded to a particular user's node, the one that contains the greatest number of his neighbors should be downloaded first, then the one that has the next greatest number of his neighbors, etc.
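The download ordering in the preceding paragraph amounts to sorting clusters by how many of the user's neighbors each contains. A minimal sketch, assuming clusters are held as a mapping from a hypothetical cluster name to the user identifiers it contains:

```python
def cluster_download_order(neighbors, clusters):
    """Order cluster names by how many of the user's neighbors each contains,
    greatest first. Ties keep the order in which the clusters were supplied."""
    neighbors = set(neighbors)
    return sorted(
        clusters,
        key=lambda name: len(neighbors & set(clusters[name])),
        reverse=True,
    )


# Hypothetical neighbor list and cluster membership.
my_neighbors = ["u2", "u5", "u7"]
clusters = {"c1": ["u1", "u2"], "c2": ["u5", "u7", "u9"], "c3": ["u3"]}
print(cluster_download_order(my_neighbors, clusters))
```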

2) Dynamic searches for neighbor candidates

Instead of, or in combination with, pre-existing data structures such as described above, many embodiments use dynamic searches. Probably the simplest example of this is a server-based system with a table of attributes culled from the taste profiles, one row per user. In one embodiment these attributes are bits representing the presence or absence of particular genres. So, if there are 100 defined genres, each row has 100 bits. Then to determine the order in which taste profiles should be downloaded, the server simply checks each row and counts the proportion of matching genres to total genres in the other user's taste profile. The representations of taste profiles with the highest proportions are retrieved first. The table could be a RAM-based bitmap, a database such as one based upon SQL, or any other convenient configuration. Of course the data used wouldn't have to be genres. It could be a selection of artists or songs or albums, or in non-music domains, book titles, web logs, paintings, news articles, school subjects, course numbers, etc.
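The genre-bit table can be sketched with Python integers serving as bit rows. This is an illustrative sketch only: the in-memory dictionary standing in for the table, the user identifiers, and the treatment of a row with no genres set (scored as zero) are all assumptions.

```python
def popcount(bits):
    """Number of bits set in a non-negative integer."""
    return bin(bits).count("1")


def genre_retrieval_order(target_bits, table):
    """Order user identifiers by the proportion of matching genres to total
    genres in each other user's row, highest proportion first.

    `table` maps a user identifier to an integer whose set bits mark the
    genres present in that user's taste profile.
    """
    def proportion(user_id):
        other = table[user_id]
        total = popcount(other)
        return popcount(target_bits & other) / total if total else 0.0

    return sorted(table, key=proportion, reverse=True)


# Hypothetical four-genre example (a real table might use 100 bits per row).
target = 0b1011
table = {"u1": 0b0011, "u2": 0b1100, "u3": 0b1111, "u4": 0b0000}
print(genre_retrieval_order(target, table))
```

A full scan of this kind is linear in the number of users per request, which is consistent with the text's description of it as the simplest dynamic approach.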

In another set of embodiments, there is virtually no server-based processing at all; the only server processing is to supply network addresses for a set of seed nodes that may be online at the time, which may in fact be included with the download of the software that executes the computer steps involved in the invention.

In these embodiments, a peer-to-peer protocol such as Gnutella's is used to conduct searches for files, as described above in this text. Note that if a pre-existing, popular protocol such as Gnutella's is used it should be modified so that a node can respond to a request for a complete taste profile; if that does not include a list of all (or a substantial subset of) items on the node's machine, then nodes should also be able to respond to a request for such items.

As described elsewhere in this specification, a node (we will refer to it as the "target node") initiates searches for files it has in its collection. Nodes that are the subject of hits are candidate nearest neighbors. Nodes that have more files matching the target node's files than others are statistically more likely to be hit before nodes with a smaller number of matching files. The representation that comes along with the hit is then used to retrieve the taste profile and, if necessary, the list of files. So, that satisfies the requirement of the means for facilitating retrieval in the desired order. No other server activity is required.

Note that to increase the performance over protocols such as Gnutella that are popular at the current time, currently preferred embodiments use the peer-to-peer method described in [12]. Also, at the time that user machines connect for a new session in the peer-to-peer network, they should connect to randomly chosen seed nodes in order to increase the randomness of results obtained from searches.

It must not be construed that the scope of the present invention is limited to the particular techniques listed here. The requirement of the invention is simply that

3) Note on retrieval techniques

Whether the means for facilitating retrieval is based upon a pre-existing data structure or whether dynamic computations are done, there is still the question of actually delivering the representations of nearest neighbor candidate profiles, and if separate, the profiles themselves.

In some embodiments these come directly from the server. In others such as peer-to-peer techniques like those described above, they may be the result of direct communication with the machine owned by the user whose profile is required.

In some embodiments caching solutions such as BitTorrent [8], FreeNet [9], FreeCache [10] and Coral [11] are used to distribute the representations and/or the profiles. It is preferred to use BitTorrent to distribute cluster files, where the clusters contain the profiles.

4) Further note on scope. It must not be construed that the scope of the invention is limited to the specific examples which are listed here for explanatory purposes. The requirement is that profile representations are retrieved in an order such that the nearest neighbor candidate taste profiles tend to be at least as similar to a taste profile of the target user, according to a predetermined similarity metric, as are subsequently retrieved ones of said nearest neighbor candidate taste profiles. The intent is not to carry out the impossible task of listing every possible way to achieve that. The intent is to teach a number of ways to achieve that end; other techniques that achieve that end are equivalent for our purposes. That is, such techniques are interchangeable in the sense that they will result in an embodiment of the invention that falls within the scope.

NEAREST-NEIGHBOR: A target user profile's nearest neighbors are the other user profiles whose taste profiles are closest to the target user profile's according to the predetermined similarity metric. However, in preferred embodiments there are exceptions: users can cause entries to be added to the nearest neighbor list that may not be the ones that have the most computed similarity, they may delete entries from the list, and they may cause an entry to become permanent (though manually deletable). They can do these actions manually or through automatic means such as a program that runs through one's email address book and makes the user profiles associated with the email addresses found there permanent. Such features may detract from recommendation accuracy while adding to the user's pleasure in the nearest neighbor community.

NEAREST-NEIGHBOR BASED RECOMMENDER: Nearest-neighbor-based recommendation algorithms are well-known in the literature. See, for example, [13] and [14]. The source code file recommenderclass.py included in Appendix 4 also includes a technique. The scope of the present invention should not be construed as limited to any particular nearest-neighbor-based recommendation algorithm. They are fundamentally interchangeable for the intents and purposes of the invention, although some will have better accuracy than others. The currently preferred technique is given in recommenderclass.py.

SERVER: The term "server" as used in this specification means one or more networked computers, incorporating a central processing unit and temporary storage such as RAM and also persistent storage such as hard disks. They perform central functions such as storing a central list of users. While there may be more than one server, they usually do not have to be separately accessed by user-associated computers; rather they present a unified interface. One such example of multiple servers working together is the case of a server computer running software that interacts with client software running on user-associated computers, which uses other computers for database storage and to provide database redundancy.

USER NODE: The computer (also referred to as the "machine") associated with a human user of the computer, providing one or more input devices such as a keyboard and one or more output devices such as an LCD screen. It is networked, preferably through the Internet, to other user nodes. A common protocol such as TCP/IP is used for communication with other user nodes.

NEIGHBOR-FINDING USER NODE: In currently preferred embodiments all nodes are essentially the same, and play the role of "neighbor-finding user nodes"; but in some embodiments, certain tasks are relegated to certain of the user nodes. For instance, it may be that certain users are willing to make their computational and bandwidth resources available to others, and that others are less willing; for instance, those who are willing may get a price break.

In such embodiments, neighbor-finding user nodes take it upon themselves to do work for multiple users. For purposes of neighbor-finding, they work either independently of the user nodes they are helping or in concert with them. For instance, they may receive the candidate nearest neighbors for other users, and use their taste profiles to compute the similarity according to the similarity metric, and then pass on only the most similar nearest neighbors to the user nodes across the network.
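As a minimal sketch of a neighbor-finding user node doing this filtering on behalf of another user, the following assumes that taste profiles are sets of item identifiers and that the similarity metric is the Jaccard index. Both are illustrative assumptions; the specification leaves the metric open, and the names `jaccard` and `most_similar_for` are hypothetical.

```python
import heapq
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Assumed similarity metric: shared items divided by all items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar_for(client_profile: Set[str],
                     candidates: Dict[str, Set[str]],
                     keep: int) -> List[Tuple[float, str]]:
    """Score every candidate against the helped user's profile and return
    only the `keep` most similar as (similarity, user_id) pairs."""
    scored = [(jaccard(client_profile, profile), user_id)
              for user_id, profile in candidates.items()]
    return heapq.nlargest(keep, scored)


# Hypothetical data: a helper node filters candidates for one client.
client = {"a", "b", "c"}
candidates = {"u1": {"a", "b", "c"}, "u2": {"a", "x"}, "u3": {"z"}}
shortlist = most_similar_for(client, candidates, keep=2)
```

Only the short list would then be sent across the network to the helped user's node, which saves that node both bandwidth and computation.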

IDENTIFIERS FOR DISPLAY: Identifiers of items and nearest neighbors are displayed in such visual constructs on a visual computer display as tables in a window or menus such as pop-up menus. Some embodiments may use audio means as a kind of display when visual display is not possible. The identifiers may be identifiers used internally to keep track of the items and users, or they may be special public identifiers supplied by the users or item producers, or any other identifiers thought to be convenient for the users.

NOTES

While this specification focuses on the example of music recommendation and communities, that is for purposes of example and ease of explanation only. It applies just as completely to other domains, such as books, web logs, web sites, movies, news, educational items, discussion groups, and others. Embodiments in all of these domains, and in other domains which could benefit from taste-based recommendations and communities, are within the scope of the invention. Occasionally in this specification the word "item" is used inclusively to represent the various types of objects of taste or interest.

The word "taste" as used in this specification should not be construed to imply that the invention's scope is limited to artistic works. It applies equally well to information such as news sources. The word "interest" should be considered a synonym for "taste" for purposes of this specification.

Other information besides the taste profiles may be used in finding nearest neighbors. As one example, some embodiments allow the list of nearest neighbors to be restricted to individuals who live in particular physical localities.

The specification sometimes uses the word "machine" as an equivalent for "computer."

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is an overall flowchart illustrating an embodiment in which each client node is responsible for determining its own user's nearest neighbors.

FIG. 2 is a chart showing how the nearest neighbor list 110 is put to use.

MODES FOR CARRYING OUT THE INVENTION

FIG. 1 illustrates an embodiment in which each client node is responsible for determining its own user's nearest neighbors. Representations of user profiles and associated user identifiers 5 are provided in order of likely similarity to the user. See, for example, the introductory text above the source code to clusterfitter.py in Appendix 4, which describes a way a client node can determine the order in which to download each one of a set of clusters. In the preferred embodiment, these clusters are downloaded with the help of other client nodes using BitTorrent. In the preferred embodiment there are a limited number of clusters, retrieved by each client node in its own appropriate order. Not every cluster is retrieved by every client, because only a certain amount of time is available to do the downloads. But on the whole, each cluster can generally, in time, be found on a number of client nodes. This enables a BitTorrent tracker running on the server, together with BitTorrent client software running on the clients, to work together to share the community bandwidth to download a cluster to a client that requests it. A programmer of ordinary skill in the art will readily see how to use BitTorrent client software, publicly available in open-source form (http://bittorrent.com/), to accomplish these tasks.

This disclosure contains several additional sections, each designated as an Appendix, and together with the rest of the text and computer code presented herein, forming a unified disclosure of the present invention. As one alternative way of achieving the desired ordering of profiles see the distributed profile climbing technique described in Appendix 3.

The profiles are received at the user nodes 20a-c. The similarity of each one to the local user is calculated 30a-c. The ones that are similar enough 40a-c to the current user (for instance, by being more similar than the least-similar current member of the nearest neighbor list) are put into the appropriate position 50a-c in the nearest neighbor list. In preferred embodiments that position is consistent with an ordering by similarity.
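Steps 40a-c and 50a-c can be sketched as follows. This is a minimal illustration, assuming a fixed-capacity list of (similarity, user identifier) pairs kept sorted from most to least similar; the capacity of 3, the function name `consider_candidate`, and the sample scores are hypothetical.

```python
import bisect
from typing import List, Tuple

MAX_NEIGHBORS = 3  # hypothetical list capacity


def consider_candidate(neighbors: List[Tuple[float, str]],
                       similarity: float, user_id: str) -> bool:
    """Insert (similarity, user_id) if it is similar enough, keeping the
    list sorted from most to least similar. Returns True if inserted."""
    if len(neighbors) >= MAX_NEIGHBORS and similarity <= neighbors[-1][0]:
        return False  # not more similar than the least-similar member
    # bisect works on ascending keys, so search on negated similarity.
    keys = [-s for s, _ in neighbors]
    position = bisect.bisect_right(keys, -similarity)
    neighbors.insert(position, (similarity, user_id))
    del neighbors[MAX_NEIGHBORS:]  # drop the displaced least-similar entry
    return True


# Hypothetical stream of candidates arriving at a user node.
neighbors: List[Tuple[float, str]] = []
arrivals = [(0.2, "u1"), (0.9, "u2"), (0.5, "u3"), (0.1, "u4"), (0.7, "u5")]
results = [consider_candidate(neighbors, s, u) for s, u in arrivals]
```

In this run "u4" is rejected because the list is already full and it is less similar than the least-similar member, while "u5" displaces "u1".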

In FIG. 2 the nearest neighbor list 110 is put to use. Combined with the local user profile 120, recommendations are generated 130 for the user (see, for example, recommenderclass.py in Appendix 4 for an example of how to accomplish that).
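For orientation only, a generic similarity-weighted vote over the nearest neighbor list might look like the sketch below. It is not the technique of recommenderclass.py; it simply assumes that each neighbor contributes its similarity as a vote for every item the local user does not already have. The function name `recommend` and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def recommend(local_items: Set[str],
              neighbors: List[Tuple[float, Set[str]]],
              count: int) -> List[str]:
    """Rank items the local user lacks by the summed similarity of the
    neighbors who have them; return the top `count` item identifiers."""
    scores: Dict[str, float] = defaultdict(float)
    for similarity, items in neighbors:
        for item in items - local_items:
            scores[item] += similarity
    # Highest score first; ties are broken alphabetically by item id.
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked[:count]


# Hypothetical local profile 120 and nearest neighbor list 110.
local_profile = {"a"}
neighbor_list = [(0.9, {"a", "b", "c"}), (0.5, {"b", "d"})]
top_two = recommend(local_profile, neighbor_list, count=2)
```

Item "b" ranks first here because both neighbors have it, so it accumulates the votes of both.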

Interactive communications are also enabled 140. For instance, preferred embodiments display the user identifiers of nearest neighbors in a list on a computer display. An interaction means such as clicking on a particular icon enables an email to be automatically generated addressed to the neighbor and indicating that the sender is the current user; the user then fills in the message text and sends it.

BIBLIOGRAPHY

References listed below in this section are hereby incorporated by reference in their entireties to the fullest extent allowed by law.

[1] V. Ganti, R. Ramakrishnan, J. Gehrke, A. Powell, and J. French. Clustering large datasets in arbitrary metric spaces. Technical report, University of Wisconsin-Madison, 1998. http://citeseer.ist.psu.edu/ganti99clustering.html

[2] M. Ester, H.-P. Kriegel, J. Sander, M. Wimmer, and X. Xu. Incremental clustering for mining in a data warehousing environment. Proc. 24th Intl. Conf. on Very Large Data Bases (VLDB), 1998.

[3] C. Ordonez, E. Omiecinski, and Norberto Ezquerra. A fast algorithm to cluster high dimensional basket data. In IEEE ICDM Conference, 2001. http://citeseer.ist.psu.edu/ordonez01fast.html

[4] Peter Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of the Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 311-321, Austin, Texas, United States, 1993.

[5] C. Li, E. Chang, H. Garcia-Molina, and G. Wiederhold. Clustering for approximate similarity search in high-dimensional spaces. IEEE Transactions on Knowledge and Data Engineering, 14(4):792-808, July-August 2002.

[6] P. Ciaccia, M. Patella, F. Rabitti, and P. Zezula. Indexing metric spaces with M-tree. In Quinto Convegno Nazionale Sistemi Evoluti per Basi di Dati, pages 67-86, Verona, Italy, 25-27 June 1997.

[7] R.A. Jarvis and E.A. Patrick. Clustering using a similarity measure based on shared near neighbors. IEEE Transactions on Computers, C-22(11), pages 1025-1034, November 1973.

[8] http://bittorrent.com/

[9] http://freenet.sourceforge.net/

[10] http://www.archive.org/web/freecache.php

[11] http://www.scs.cs.nyu.edu/coral/

[12] N. Sarshar, P. Boykin, V. Roychowdhury. Percolation Search in Power Law Networks: Making Unstructured Peer-to-Peer Networks Scalable. Fourth International Conference on Peer-to-Peer Computing, pages 2-9, August 2004.

[13] U. Shardanand, Social Information Filtering for Music Recommendation. MIT Master's Degree Thesis, 1994.

[14] B. Sarwar, G. Karypis, J. Konstan, J. Riedl. Recommender Systems for Large-scale E-Commerce: Scalable Neighborhood Formation Using Clustering. Proceedings of the Fifth International Conference on Computer and Information Technology (ICCIT 2002), 2002.

[15] U. Shardanand and P. Maes. Social Information Filtering: Algorithms for Automating "Word of Mouth", in Proceedings of CHI'95 (Denver CO, May 1995), ACM Press, 210-217.

[16] U.S. Patent 5,884,282

APPENDIX 1

This appendix describes a number of variations which we consider to be part of the invention.

• Some embodiments of the invention use "playlist sites" or "mp3 blogs" or "music blogs" to supply profile information, rather than, or in addition to, profile information stored on a local disk such as the XML database generated by Apple's iTunes product. In typical embodiments this information is collected by a "screen scraping" procedure, either by a process or processes running on the server system, or on user nodes. In some cases, such sites publish song information using OPML or other XML formats such as RSS, which reduces or eliminates the need for screen scraping. For embodiments making use of this capability, profile information will be provided to users of the system that may represent the tastes of other individuals who are not users of the system. To a large degree, the data associated with these individuals is treated identically to the data associated with users. In some aspects it will normally not be possible to treat them identically because less data will be available for them. The adjustments that need to be made in such cases will be readily apparent to the software developers. Note that since this specification focuses primarily on users of the system, there will be cases where the term "user" should be considered to also include "ghost users" derived from external data representing non-users. Another source of ghost user data is services such as audioscrobbler that make identifiers of songs currently being played by a given user available on the Web. One of ordinary skill in the art will immediately see how to monitor such a service to build up a profile, over time, of users whose currently played song is displayed.

• Some embodiments provide a facility whereby simply loading a web page (and optionally giving permission for security reasons) will cause software to be automatically loaded into the user's machine that provides the necessary functionality; this avoids the separate step of downloading and installing application software. This can be accomplished, for instance, by means of Java-language code called by a Web browser.

• Preferred embodiments have a "permanent neighbors" feature, as well as a "machine-generated neighbors list" feature. The machine-generated neighbors list displays identifiers for those users that have been determined to be very close matches in taste or interest to the current user. The permanent neighbors list displays identifiers for users that have been selected by the current user.

• In preferred embodiments, user-interface techniques are provided for turning machine-generated neighbors into permanent neighbors. Typically this is done by a drag feature where a member of the displayed list of machine-generated neighbors is dragged to the displayed list of permanent neighbors. Other techniques include allowing the user to select a member of the displayed list of machine-generated neighbors and call a menu option to cause it to be listed as a permanent neighbor; this can be a pop-up menu, a contextual menu, or a standard menu. Permanent neighbors may be manually removed from the permanent neighbors list by the user; for instance, by means of a menu choice or drag operation. Another option is a checkbox where multiple permanent neighbors can be marked for removal, accompanied by a separate button to cause the removals to happen.

• In preferred embodiments, UI elements are provided to enter an email or IM address for an individual, and cause him to be emailed such that the said email includes a link (or other technique) for enabling easy download of client software implementing the invention.
In further embodiments, the other user is automatically added to the permanent neighbors list when the other individual becomes a registered user of the system. This may be accomplished in many ways, readily discernible to one skilled in the art; the scope of the invention should not be construed as being limited to the examples listed in this paragraph; they are listed for reasons of example only. For instance, as the user profiles arrive on the client machines for determining which are nearest neighbors, they can be checked to determine whether an emailed individual is among them. (The addresses of emailed users would be stored on the local user's machine for this purpose.) Alternatively, the client can periodically query a database table residing on the server, to check whether the emailed user has become a registered user.

• In preferred embodiments, permanent neighbors can include ghost users, where the ghost users are identified by the local user by appropriate network identifiers. For instance in the case of online playlists, a URL that identifies the playlist of the particular individual would be one appropriate type of identifier. In further embodiments, the data for such neighbors is retrieved directly (across the network) by the client node without interaction with the server that implements the server portion of the invention.

• In preferred embodiments, users may click on the identifier for a permanent neighbor and cause information to be displayed that represents the user's musical tastes, such as a list of artists and/or songs in the user's collection, possibly including such elements as the number of times each song has been played, the date added to the collection, and others; this list is for example only and not intended to be all-inclusive. Further embodiments display this data for permanent neighbors in the same onscreen list area that is also used for displaying the analogous data associated with machine-generated neighbors.

• In preferred embodiments where neighbors are used as the basis for generating recommendations, it is recognized that permanent neighbors may or may not be the ideal individuals to generate recommendations from. For instance, an individual may be made into a permanent neighbor because he is a friend, rather than because his tastes are remarkably similar to those of the local user. Accordingly, in such preferred embodiments the option is provided to leave permanent neighbors out of the recommendation process. In some such embodiments, this is done as a single binary choice for all permanent neighbors, for instance, using a checkbox that appears in a Preferences dialog. In others, it is done on a one-by-one basis, for instance, with checkboxes accompanying each listed, displayed identifier for permanent neighbors in the user interface. In some embodiments, it is possible to make a single binary choice to indicate that only permanent neighbors are used for recommendations; in others there is a screen widget such as a collection of 3 radio buttons or a standard menu with "sticky" indicators of a previously made selection, where the user can choose between not using permanent neighbors in the recommendations processing, only using permanent neighbors, or using both.

• Preferred embodiments display the most recent date and/or time that each permanent or machine-generated neighbor last used the system, to the extent that the client may be easily aware of that information. For instance, it may be included in profile information that arrives at the local user's node for processing of candidate neighbors, in which case it may not be the most recent data available to the system as a whole. Alternatively it is retrieved directly from the server when it is to be displayed, and is thus up to date.

• Preferred embodiments contain on-screen lists of neighbors (which may include permanent neighbors or where permanent neighbors may be in separate, similar lists); in further preferred embodiments these lists contain screen elements indicating the presence or absence of email addresses for the users (needed because, in preferred embodiments, it is optional to supply an email address and/or to allow other people, preferably including other users, to be made aware of them). In further embodiments, clicking on such an element causes an email application to be opened and an automatically-addressed email to be generated, to be populated with content by the user. Similarly, elements indicating an IM address, or other communications handles, may be displayed, and UI functionality provided to facilitate such communications. In some such embodiments one element is provided for each neighbor to indicate one or more than one modes of communication as available, and clicking it causes a menu to appear that lists them; choosing one facilitates communication by the chosen mode. In other embodiments, the user selects the list row containing the user identifier, and brings up a standard menu to choose a mode to communicate with the selected user; when communication handles are not provided for a particular mode, that one is greyed-out.
A software developer of ordinary skill will readily see other variations of how to facilitate user interaction regarding what modes are available and how to facilitate engaging each one. Such variants which contain some on-screen indicator of the availability of communications with a given user are within the scope of the invention. Software developers of ordinary skill in the art will immediately see how to implement this.

• In preferred embodiments certain individuals are registered as being artists. When an item such as a song by such an individual is displayed on screen, and if the artist has indicated that he wishes communications with him to be enabled, an indicator of that is provided, and UI techniques for facilitating such communications are provided; these techniques will generally be similar to those already discussed for user-to-user communication. Software developers of ordinary skill in the art will immediately see how to implement this.

• When artists communicate with users, preferred embodiments monitor the uniqueness of the communications, in an attempt to determine whether artists are really communicating one-to-one with users. One way to determine this would be to randomly sample a number of pairs of communications from artists, and use "diff" text comparison techniques to compare them. Artists with a low average number of differences are considered by the system to not be truly engaging in one-to-one communications. Other techniques that enable some measure of general uniqueness to be determined also fall within the scope; the invention is not dependent on any particular technique for that functionality.
In various further embodiments, there are ramifications of being considered to not engage in true one-to-one communications; for instance, in some embodiments, such artists are banned from being presented to users as potential targets of communication; in others there is a displayed list of artists who appear to tend to use "canned" responses; in others that individual is not enabled to initiate communications with non-artist users. In preferred versions of such embodiments an artist can denote a particular communication as being an announcement, and it would then be excluded from the described uniqueness checking.

• Some embodiments provide UI functionality that allows the user to specify a genre or artist or other criteria for determining a subset of items, and then causes item recommendations to be selected from that subset.

• Some embodiments enable recommendations to have their order at least partly determined by the similarity of the item to the items associated with some specified artist(s), item(s), or other grouping of items (such as an album of songs).

• Some embodiments provide professional-interest-matching or dating services by examining files on the user's local computer, for instance words in documents, and possibly words in linked URL's where the links themselves are stored on the user's computer, to build interest profiles; neighbors and, in preferred such embodiments, item recommendations, are based on this data.

• Some embodiments use a bar-code reader or other automatic means for identifying physical objects in order to generate, or as a contribution to, the data in the user's taste profile. For instance, music CD cases typically have bar codes that can be used for that purpose. (Note that a software product for the Mac OS X operating system, called Delicious Library, has the ability to take data supplied by a bar code reader to build a digital library of physical CDs and other items; however it has none of the other features described in this invention.)

• Some embodiments add a gift suggestion feature. The individual for whom gift recommendations are to be made available makes his relevant data available to the machine associated with the user who wants to give a gift. For instance, in one such embodiment, an iTunes user might email his iTunes Music Library.xml file to the user who wants to give him a gift. Other techniques for getting the relevant information to the local user are equivalent from the standpoint of the invention. Then local processing occurs for that other user's data that is basically the same as for the local user's own data. For instance, in embodiments involving recommendations by means of neighbors, a collection of machine-generated neighbors is found relative to the gift recipient's data, and recommendations are generated from that and displayed on the screen. The value of this is that the local user already has the code necessary for such functionality, for his own recommendations; and in this case much of that same code is re-used for purposes of gift suggestions.

• Some embodiments interact with an online music store in such a way that highly recommended music is automatically purchased at regular intervals of time. For example, on a monthly basis, an embodiment that works with the iTunes music store could cause the most recommended n songs, where n is 1 or some greater number, to be automatically purchased and downloaded to the user's machine.
In preferred embodiments, the user is alerted before this occurs and given the choice to modify the list of songs to be purchased; for instance, the application software might display an alert dialog, the day before the purchase is to be made, which indicates that the top 10 songs will be purchased; input means such as checkboxes next to the listed songs may be used to indicate that certain songs should be excluded from the automatic purchasing. Preferred embodiments allow the user to choose the periodicity and number of songs to be automatically purchased. In some embodiments, this process is used to cause the creation of a physical CD by a store, containing recommended music (or, in other embodiments, videos, books, etc.), which is subsequently shipped to the user.

• Preferred embodiments give the user control over which artists are considered to be part of the user's effective taste profile. For instance, in one embodiment the local user can view a list of the artists in his music collection; there is a checkbox next to each one, defaulting to checked; if it is unchecked, that artist is effectively ignored in other processing based on the taste profile. In the embodiment in question, this is accomplished by means of a tuned taste profile and an untuned taste profile; the only real use of the untuned one is to present that list to the user for tuning by unchecking checkboxes. So, in embodiments providing control over the artists that are considered to be effectively part of the taste profile, where the user's local taste profile is used for finding nearest neighbors, only the desired artists are used; and where taste profiles are part of the data broadcast by the system to be viewed by other users and/or for other users to choose neighbors, only the desired artists are used. In some embodiments different sets of artists may be chosen for finding neighbors of the local user and for broadcasting, but preferred embodiments combine those features.
A software developer of ordinary skill in the art will immediately see other ways of handling the user interface and technical issues for achieving the same purposes; these are equivalent from the standpoint of the invention.

• One embodiment involves online chat. An interest profile is built based upon a) the words the local user types into his chat client and/or b) the words that appear in messages typed by other people into the same chat room. In the case of (b), a subset may be used consisting only of messages that are at least somewhat likely to be responses to messages from the local user, for instance as judged by distance in time from a message sent by the current user to the chat room where the potential response appeared. By collecting these words over time (and in some embodiments, giving words posted by other individuals less weight), a profile of chat interests can be built for each user. Then, when the system builds neighborhoods of similar users, those neighborhoods can be viewed as potential chat partners. In preferred embodiments a user clicks on a user identifier to start a chat session with that user. In some embodiments chat rooms are automatically initiated for groups of similar users. In chat embodiments, no other recommendations are necessary. Note that variants of this set of embodiments use different techniques to match people together according to the words they type. The simplest way is to treat the words used by a user as a document; then techniques for document similarity which take word frequency into account can be used. (A search on Google for "document similarity" will bring up numerous techniques.) But any technique that calculates useful similarities based on the word content is equivalent for purposes of the invention.

• Some embodiments provide means to restrict candidate neighbors by certain criteria such as physical locality.
One way to do this is to simply assign the lowest possible similarity to people who don't meet the restriction requirements; another is to exclude them at the outset from the neighbor-searching process. Techniques to do this will be immediately apparent to a software developer of ordinary skill.

• One advantage of having software running in user nodes is that certain parameters for recommendation quality can be tuned on the user node, for the given user, by computationally expensive techniques such as genetic algorithms. Some embodiments take advantage of this fact by using iterative testing, genetic algorithms, simulated annealing, or other optimization techniques to tune parameters such as the following: the number of neighbors to use in recommendation calculations (assuming only the most similar neighbors are chosen), the optimal adventurousness (see elsewhere in the specification for discussion of adventurousness), a cutoff release date for recommended items (for instance, the user may not be interested in old music), and others. One such other is a number representing the lowest weight to be associated with any user's information; the least similar of the nearest neighbors is assigned this weight and interpolation, with a maximum of 1 for the single nearest neighbor, is used to assign weights to the other neighbors according to their rank or another measure. The optimization may be based on tuning the parameters to get the best match between recommended music and the music actually already in the user's collection. (Obviously under normal processing, preferred embodiments do not recommend music that the user already has, and this screening is disabled for optimization purposes.) Preferred embodiments in the music domain try to optimize the match between ranks based on song plays per day and order of recommendation. For instance, Spearman's Rank Correlation can be used to do this.
Some tuning operations may change the number of recommended songs; to find the optimal setting it may be useful to compute the p-value associated with each pair of rankings; the more statistically significant the p-value, the better. When rank correlation is used, preferred embodiments only consider the ranks of the top recommendations, because we are less interested in the exact rank of songs that are not particularly recommended. At an extreme of this general approach, some embodiments use Koza's Genetic Programming technique to generate at least part of the algorithm used in the recommendation process, using similar fitness criteria to the optimization measures mentioned already in this paragraph.

In embodiments which carry out evolutionary computation like genetic programming, the invention has useful ramifications for multiprocessing. For instance, each user node evolves chromosomes (such as hierarchical programs in a genetic programming environment) which best suit the needs of the local user. It is likely that those same chromosomes will be relatively high-performing for other users who have the local user among their neighbors. So in preferred evolutionary computation embodiments, one or more of the highest-performing genomes that have resulted from the evolutionary process on a user node become part of the profile, which also includes the taste profile. Then other user nodes that select a particular user as a neighbor will also have his highest-performing genome(s) available. These can be used directly; combined with those supplied by other neighbors by (for example) averaging the recommendation strength for each song across all genomes; or seeded into the evolving population of genomes on that node. This is a form of multiprocessing evolutionary computation. It should not be construed that the invention is limited to the example of multiprocessing evolutionary processing described here; it is an example only.
For instance literature on genetic programming is rich with research on ways to do genetic processing in a multiprocessor environment. Those skilled in the art of genetic programming will see numerous ways to leverage the fact that each user has a neighborhood of users who will tend to be well-served by many of the same genomes, and have user nodes that are available for multiprocessing to better serve the needs of all such users. For instance, without restricting genomes that are fed from other users into a local user's genome population to just the set of nearest neighbors, some embodiments give more or less probability to a foreign genome being added commensurate with the other user's similarity to the local user. Many variants of taking advantage of the similarity information and the overall structure of the invention will occur to those of ordinary skill in the art of genetic programming, and it must not be construed that such variants are not within the scope of the invention; that is, variants are within the scope if they result in better performance due to the following attributes of the invention: a) the fact that the mechanism that transports taste profiles from user node to user node (which may involve using the server as an intermediate step) can also be used to transport genomes, either as a separate data package or as part of the same data package, and b) that mechanism is set up so that profiles with a higher similarity to the local user have a higher probability of arriving sooner (and, in some embodiments, at all), and those genomes are more likely than randomly-chosen ones to have higher fitness for the local user. • In preferred embodiments, direct peer-to-peer communication of individual taste profile information occurs between neighbors. This can enable faster updating of neighbor taste profile data than would occur through the usual mechanism described in this specification. 
Further embodiments provide an output mechanism for showing identifiers of the digital item currently being experienced by other neighbors; in some such embodiments that information is also used to update the neighbors' taste profiles stored on the local node while waiting for full updated taste profiles to arrive through the usual mechanisms. In preferred embodiments which make use of peer-to-peer techniques as described in this paragraph, the fact that some nodes may be behind firewalls that prohibit incoming connections is handled by sending the necessary data through other nodes that do have the necessary ports open. Any software developer of ordinary skill in the art of peer-to-peer network programming will immediately see how to create the necessary peer-to-peer mechanisms for the functionality described in this paragraph; it should not be construed that only particular implementation mechanisms are within the scope of the invention.

In preferred embodiments, users can create different taste profiles for themselves which fit different moods or interests. Most or all of the overall mechanism described in this specification then applies to each separate taste profile. Neighbors are found and recommendations are generated for each one. For instance, playlists generated in Apple's iTunes program can comprise music taste profiles.

In some embodiments at least some users run a special version of software that implements the invention, in which not all the usual user interface features are necessarily present. In these embodiments, certain musical tracks are indicated as being free of charge, for instance in the names of the files or in a database. The user is recommended a collection of free songs. Identifiers for the songs are then uploaded to a server system (not necessarily the same one as for other functions). Then the free songs are copied into a portable music player from a computer that is networked to that server.
Then the portable music player is packed and shipped to the user. Facilities are provided in the form of a web site where the user orders and pays for the player, and is informed about how to get the software that will make the recommendations. In preferred embodiments a list of recommended free songs is presented to the user and the user can choose which ones he wants; identifiers of the chosen songs are sent to the server.

Networking software and online store developers of ordinary skill will immediately see numerous ways to implement the required functionality; various implementations are equivalent from the standpoint of the invention; the scope of the invention is therefore not limited to certain implementation techniques. Note that this functionality may be separated from other aspects of the present invention; recommendations may be made wholly on the server based on data that is input, via the Web, to the server, using any recommendation methodology; the recommended songs are then loaded onto a portable player and shipped as described.

• Some embodiments which involve artists having special accounts enable chat rooms for each artist, and provide indicators in the UI associated with artist names (such as next to the artist names in a list of artists) that show whether they are in the chat room or not, and means are provided for the user to click or otherwise interact with an onscreen control to "enter" the artist's chat room and chat with the artist. Practitioners of ordinary skill in the relevant programming techniques will immediately see numerous ways to implement this and these are equivalent from the standpoint of the invention.

GLOSSARY FOR APPENDIX 1

• User Nodes - machines that are on the network that also directly interact with users; typically these are machines owned by users or associated with them at their work locations.
• Screen scraping - a software process that reads an HTML (or other) page on the World Wide Web (or other network system) that is intended for human use, and extracts useful data from it for machine use.
• Ghost user - data representing an individual that is derived from an external source such as a music blog. In many ways, ghost users may be treated identically to regular users of the system.
• Current user, Target user, Local user - these terms represent the user who is running software which implements a client portion of the invention; typically he or she is the one with whom recommendations and one or more lists of neighbors are associated in the course of examples in this specification.
• IM - instant message, typically associated with chat software.
• Neighbor - may be used to indicate machine-generated neighbors and/or permanent neighbors. Note that different embodiments may use different terminology for these.
• Nearest neighbors - the set of neighbors who are most similar in taste to the local user; normally the same as machine-generated neighbors, though it is not impossible that a user will manually find a neighbor that is actually more similar in taste than the machine-generated ones, and add him to permanent neighbors.
• Artist - a creator of items of interest to the subject domain or one of the subject domains of an embodiment of the invention. We use the term for shorthand, and, for example, in some domains such as academic papers, it could refer to an academic who wrote or co-wrote such a paper.
• UI - user interface. In most cases, the user interface will involve a computer with a CRT or flat-panel screen and a keyboard, displaying a windowing system such as Microsoft Windows or Mac OS X. Such systems normally provide standard means to create menus, lists (or tables), checkboxes, etc. In other cases the UI may be audio with input by means of telephone touch-tones. The requirement is that it provides functionality that facilitates human-computer interaction.
• Item - an item is the basis
• Interest profile or taste profile - data which is indicative of the interests or tastes of a user. Often used interchangeably in this specification. For instance, a digital music user will normally have identifiers of the songs he likes (or that are in his collection) in his taste profile.
• Server - a server is a central computer, or networked group of central computers, that handles certain tasks for the benefit of the client nodes, such as storing a database containing login ID's, passwords, and profiles.

APPENDIX 2

This appendix provides another description of key functionality of the invention, including but not limited to facilitating retrieval of representations of nearest neighbor candidate taste profiles and associated user identifiers in an order such that said nearest neighbor candidate taste profiles tend to be at least as similar to a taste profile of the target user according to a predetermined similarity metric as are subsequently retrieved ones of said nearest neighbor candidate taste profiles.

This description is from U.S. provisional patent application 60/540,041, filed January 27, 2004.

The specification describes a product named Goombah. However, the focus on Goombah is for clarity and descriptive purposes only and it must not be construed that the scope of the invention is limited to that particular embodiment or to the field that Goombah operates on (music).

Goombah's first purpose is to build a list of "nearest neighbors" for each user. They then form a community of like-minded people for communication purposes, and they also form a source for recommendations of items: if you have extremely similar tastes to me and you have an album I don't have and you play it all the time, I should probably give it a try. So that's the basis of the recommendations.

To find nearest neighbors exactly correctly is an O(N^2) problem if simple technology is used, and we hope to have hundreds of thousands or millions of users whose profiles are constantly being updated, so we wanted to do better than O(N^2).

There are probabilistic nearest neighbor algorithms that reduce this complexity hugely, but at a loss in reliability in finding the true nearest neighbors. We wanted to do better. The key idea behind Goombah, whose purpose is to solve the above problem, is that the computations for finding the local user's nearest neighbors are carried out on that user's machine. So, if we have a million users, we have a million CPU's doing the work of finding nearest neighbors. There are three reasons why such an approach is now within the realm of feasibility, where it wasn't a few years ago:

1) Most people who are heavy users of digital music have high-speed Internet connections; otherwise it would be unpleasant to do downloads from the likes of Apple's iTunes Music Store.

2) New technologies such as BitTorrent have emerged recently which offload bandwidth concerns for sending large files from a central server to the user nodes. In particular, the following is true for BitTorrent: The central server has a copy of the file that people need, but once one user has it on his machine, he is automatically set up as a server as well, and so on for every other user. Transfers are carried out from other users invisibly. (This is different from something like Napster, where you have to choose another user and request a download. Instead, the central server knows where all the copies of the file are, and tells a node that needs a copy the addresses of several machines from which to simultaneously get different chunks of the file until the whole file is built. If a sending node drops out, other nodes automatically take its place, and so the file is eventually downloaded from multiple changing sources in a completely automated way.) This means it is possible for a company like Transpose to make very large files available to very large numbers of users without having hugely expensive server and bandwidth needs. Furthermore, it happens that BitTorrent is open source with a very friendly license and written in the same language (Python) that Goombah is written in.

3) Any serious digital music user already has a hard drive with gigabytes of space devoted to music, so spending 100 megs or more on the data associated with an application like Goombah is no big deal. In the future, videos will commonly be stored on user hard drives, so that is another application for Goombah as it evolves.

So, essentially the idea, when a local user wants to find his nearest neighbors, is to download the profiles of all other users who could reasonably be considered to be candidates to be nearest neighbors of that local user. Then, the local user's Goombah application does a search of all those profiles to find the best matches.

Instead of downloading individual profiles, Goombah will download a single very large file (10's or even 100's of megs) that contains the candidate profiles. This will happen by means of BitTorrent. These large files will be formed by a clustering algorithm.

We will find clusters of similar users which are large enough to contain most reasonable nearest-neighbor candidates for each general type of musical taste. They will be large enough to fill that need, and small enough to download in a reasonable time on a high-speed connection and not take a problematic amount of space on the user's hard drive.

So, the local user will download a large BitTorrent file containing all nearest neighbor candidates and do an exhaustive search on his machine for nearest neighbors.

Then he can communicate with his taste-mates and get automated music recommendations from them.

The large file will be updated on a regular basis with further BitTorrent downloads.

The clustering algorithm can be any clustering algorithm that is capable of clustering a large number of users according to their degree of interest in a large number of subject items. (The degree of interest may be indicated by a real-valued, binary, or integer value, or any other value that can represent a degree of interest.)

As one example, the commonly-used C4.5 algorithm can do this. For example, the open-source Java software WEKA has a module, weka.classifiers.trees.J48, which implements C4.5. In the context of using this module in a music setting, each user is an "Instance" and the song identifiers, such as strings containing the artist name, album name (if any), and song title, are used as the values of a "nominal attribute" representing the songs.

MISCELLANEOUS NOTES FOR APPENDIX 2:

The step of using the local CPU to find nearest neighbors can be conducted in various ways. Any sub-algorithm which accomplishes the function "find nearest neighbors out of the downloaded large file" is considered equivalent for the purposes of the present invention. Possible ways to do it include an exhaustive search for the other users that are most similar to the local user according to some similarity metric. (The attached Python scripts, recommenderclass.py and tasteprofileclass.py, contain code for generating a similarity metric. However it must be stressed that there are innumerable ways of generating a similarity metric for nearest-neighbor purposes, and they are all functionally equivalent from the standpoint of the present invention and all fall within the scope of the present invention. We can use any metric that results in reasonable likelihood that two users that are considered more "similar" than another pair of users actually have more shared interests in the targeted interest-domain [such as music] than another pair of users with lesser similarity. Note further that we aren't using the word "metric" in its most rigorous sense, but in its general sense as a quantity used for measurement and comparisons.) Another way to find the nearest neighbors from the downloaded large file is to use the vp-tree technique introduced by Peter N. Yianilos in his paper "Data Structures and Algorithms for Nearest Neighbor Search in General Metric Spaces". The large file to be downloaded would be formatted as a vp-tree and thus very fast nearest-neighbor searches would be facilitated on the local machine. Again, any technique used to find the nearest neighbors is functionally equivalent from the standpoint of the invention and falls within the scope of the invention.
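As an illustration of the exhaustive-search option, the following minimal Python sketch scans every candidate profile in a downloaded cluster and keeps the most similar ones. It is not the code of the attached scripts; the Jaccard overlap of song-identifier sets is merely one assumed similarity measure, and the names `jaccard` and `nearest_neighbors` are hypothetical.

```python
import heapq


def jaccard(songs_a, songs_b):
    """Similarity of two sets of song identifiers (assumed measure)."""
    union = songs_a | songs_b
    return len(songs_a & songs_b) / len(union) if union else 0.0


def nearest_neighbors(local_songs, candidates, k):
    """Exhaustively score every candidate and return the k best user IDs.

    candidates maps a user ID to that user's set of song identifiers,
    as it might be read from the downloaded cluster file.
    """
    scored = (
        (jaccard(local_songs, songs), user_id)
        for user_id, songs in candidates.items()
    )
    return [user_id for _, user_id in heapq.nlargest(k, scored)]


if __name__ == "__main__":
    cluster = {
        "alice": {"song1", "song2"},
        "bob": {"song1"},
        "carol": {"song9"},
    }
    print(nearest_neighbors({"song1", "song2", "song3"}, cluster, k=2))
```

The work is linear in the number of candidates per user, which is why it is practical once each user's own machine performs it.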

The step of using peer-to-peer techniques for downloading the large files can also occur in various ways which are functionally equivalent from the point of view of the current invention. In fact, the invention does not depend on any particular technique for getting files from peers and all such techniques should therefore be considered functionally equivalent from the point of view of the invention. For instance, while BitTorrent provides a particularly compelling model for how this may be accomplished, Gnutella provides an alternative model.

A difference between the BitTorrent and Gnutella approaches is that with BitTorrent, each file has a distinct URL which is understandable by a server machine which runs BitTorrent "tracker" software. By means of this URL, client software is told by the tracker which peers store the file (or parts of the file) so that the client can cause downloads to be started from a subset (or all) of those peers. With the Gnutella approach, there is no central server, and the local computer sends queries into the "cloud" of known peers and machines known to those peers, looking for files with particular filenames. Then, normally, one of those peers is chosen to be the source of the download.

The commonality between all these various techniques is that the large files each represent a group of similar profiles (or, alternatively, all available profiles), there is a fixed set of such files at any point in time, and the user causes one (or more) to be downloaded that is (are) particularly likely to contain worthy nearest neighbor candidates; these files are usually downloaded from one or more peers rather than from a central server. All techniques which satisfy these requirements are functionally equivalent from the perspective of the present invention and thus fall within the scope.

One key step is determining which large file a particular client should download in order to meet the needs of its user. Of course, in embodiments where all the profiles are in one large cluster, there is no issue. When they are divided into clusters, and each cluster is represented by a particular large file, however, this step needs to be carried out.

One way to accomplish this step is as follows:

When a system is first set up to embody this invention, it will usually only have a relatively small number of users on Day 1. Thus, there is no need to divide the population into separate clusters for downloading. As the user population grows in size, a single file is used for download purposes.

Finally a point may arrive at which it is deemed, due to the relative expense of bandwidth and disk space, that the user population should be divided into two clusters. At that time, a clustering algorithm is run and the user population is divided into two clusters. Each of the two clusters is given a name: for instance, "U0" and "U1".

Now, as time goes on, we do not regenerate those clusters from scratch. Rather, as new users are added to the system, they are added to the most appropriate cluster. This may be done in any number of ways. A centroid for the cluster may be calculated, and the new user added to the cluster whose centroid it is most similar to. Or the average similarity between the user and each cluster member may be calculated for each candidate cluster, and the most appropriate cluster chosen on that basis. Or, the change in entropy that would arise in the system as a whole due to each possible choice of cluster can be calculated, and the choice taken that minimizes the change in entropy. Any of these techniques, and all other techniques that cause the user to be placed in one of the existing clusters, are functionally equivalent from the point of view of this patent as long as they have put the user in a cluster that is highly likely to result in a good degree of similarity between the new user and other members of the cluster.
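The centroid option for placing a new user can be sketched as follows in Python. The sketch assumes presence/absence profiles (sets of song identifiers), a centroid defined as the fraction of cluster members holding each song, and cosine similarity; all three are illustrative assumptions, as are the function names.

```python
import math
from collections import Counter


def centroid(profiles):
    """Fraction of cluster members that have each song."""
    counts = Counter()
    for songs in profiles:
        counts.update(songs)
    return {song: n / len(profiles) for song, n in counts.items()}


def cosine(user_songs, center):
    """Cosine similarity between a binary profile and a centroid."""
    dot = sum(center.get(song, 0.0) for song in user_songs)
    norm_user = math.sqrt(len(user_songs))
    norm_center = math.sqrt(sum(v * v for v in center.values()))
    if norm_user == 0 or norm_center == 0:
        return 0.0
    return dot / (norm_user * norm_center)


def assign_cluster(user_songs, clusters):
    """Pick the existing cluster whose centroid is most similar.

    clusters maps a cluster name (e.g. "U0") to a list of member profiles.
    """
    return max(
        clusters,
        key=lambda name: cosine(user_songs, centroid(clusters[name])),
    )


if __name__ == "__main__":
    clusters = {
        "U0": [{"jazz1", "jazz2"}, {"jazz1"}],
        "U1": [{"rock1"}, {"rock1", "rock2"}],
    }
    print(assign_cluster({"jazz1", "jazz3"}, clusters))
```

In practice a server would presumably cache each centroid rather than recompute it for every new user; the recomputation here only keeps the sketch short.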

In this way, clusters have consistent meaning over time, and the user can stay in the same cluster, until a further split is deemed necessary. In preferred embodiments, this is handled by the expected large file simply not existing at a particular point in time, and this is detected by the client, which thus assumes it needs a new cluster assignment. It then queries the server system for a new assignment. For a pre-existing user this is easily determined because the new assignment was made during the split process, so the server returns another cluster identifier consistent with that split. For example, if a user was in cluster U0, he may now be in cluster U01 (where the leading 0 represents the lineage). (Of course any cluster naming convention can be used, but preferred ones encode the lineage in the name).
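A lineage-encoding naming convention and the client-side check for a vanished cluster file might look like the following Python sketch. It assumes binary splits that append "0" or "1" to the parent name; the callback `ask_server` stands in for whatever query mechanism an embodiment uses and is hypothetical.

```python
def child_clusters(name):
    """Names given to the two halves when a cluster is split."""
    return [name + "0", name + "1"]


def ancestors(name, prefix="U"):
    """Lineage of a cluster, from its parent up to the first-level cluster."""
    digits = name[len(prefix):]
    return [prefix + digits[:i] for i in range(len(digits) - 1, 0, -1)]


def current_cluster(cached_name, available_files, ask_server):
    """Return the cluster the client should download.

    If the expected file no longer exists, the client assumes a split has
    happened and asks the server for its new assignment.
    """
    if cached_name in available_files:
        return cached_name
    return ask_server(cached_name)


if __name__ == "__main__":
    print(child_clusters("U0"))   # names after splitting U0
    print(ancestors("U011"))      # lineage of U011
    # U0 has been split, so its file is gone and the server is consulted.
    print(current_cluster("U0", {"U00", "U01", "U1"}, lambda old: "U01"))
```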

Other embodiments which use a fast enough clustering approach regenerate the clusters from scratch on a regular basis. In such embodiments the client either requests a new identifier for the cluster file, or one is sent automatically by the server when the client and server are in communication. (Note that this communication can actually take a number of forms. Rather than sending text strings, numeric or other identifiers can be sent which are in turn used by the client to build the necessary handle to access the file. Two examples: In a Gnutella-style system, this handle would probably be a search term. In a BitTorrent-style system, the handle might be the URL for the torrent.)

Still other embodiments have relatively stable clusters but continuously work to refine them by moving users from one cluster to another if such a movement provides superior clustering. For instance, periodically each user may be considered again as if it were a new user, and a decision made about what cluster it should go into. If it changes then that will be reflected in future communications between the client and the server (although the change does not need to be reflected immediately).

In some embodiments, the client has no persistent "knowledge" about what cluster the user is in, and when it's time to get a new cluster, queries the server for the information required to start a download of the appropriate one.

In some embodiments, users may be assigned to more than one cluster. As one example of how that might be done, a number of standard clustering approaches, such as C4.5, assign probabilities for cluster assignments; thus a user might with a higher probability reside in one cluster than another. It would be possible to take the two clusters with the highest probability for a given user, and say that he resides in both of them. The invention is not limited to any particular approach to putting users in more than one cluster. The functionality is simply that the user would go in the clusters that provide a high match to his interests, and any technique that accomplishes that is functionally equivalent from the perspective of the present invention and is therefore within the scope.
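Taking the two most probable clusters can be sketched in a few lines of Python. The probabilities are assumed to come from whatever clustering approach is in use; the function name `top_clusters` and the optional `min_probability` cutoff are hypothetical additions for illustration.

```python
def top_clusters(probabilities, how_many=2, min_probability=0.0):
    """Return the names of the most probable clusters for one user.

    probabilities maps cluster name -> probability that the user
    belongs to that cluster.
    """
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, p in ranked[:how_many] if p >= min_probability]


if __name__ == "__main__":
    print(top_clusters({"U00": 0.1, "U01": 0.6, "U1": 0.3}))
```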

In some embodiments, different clustering arrangements exist for different genres. For example a user who has both classical music and jazz in his collection might benefit from different nearest-neighbor communities generating different recommendations in each area. So, the entire clustering and downloading structure and steps, in some embodiments, are carried out more than once. In other (preferred) embodiments, each user still is in only one cluster (or a small group of clusters), but his client software finds different nearest neighbor sets, depending on genre, from within those clusters. Of course, in non-music applications, this concept is extended by means of the analogous principle to "genre" that exists in that other subject area. For instance, if the items are weblogs, then an individual might be interested in weblogs about Perl scripting and also weblogs about Republican politics. These different subject areas are handled analogously to genres in the music world.

In order for the system to respond to the needs of users who are continually buying new music (viewing new weblogs, etc.), in preferred embodiments it is possible for neighborhoods to be updated accordingly. This means that the large files representing clusters need to be either re-downloaded or updated periodically. We will discuss below some of the ways this is accomplished in various embodiments. The scope should not be construed to be limited to these particular techniques. Rather, any technique that "enables the potential neighbor files to be updated or replaced often enough to increase the accuracy and pleasure in using the system" equally fulfils the required function and is thus considered to be in the scope.

In some embodiments, download file identifiers (which may be URL's, terms, etc.) are constructed based on two pieces of data: the cluster identifier plus the date. For instance a user might be in cluster U011. If the date is January 27, 2004, the download file identifier might be U01120040127. The client can then get an update by, for instance, downloading the file containing that string in its name or by constructing a BitTorrent URL based on that string.
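The identifier construction in this example can be reproduced with a short Python sketch. The function name `download_file_id` is hypothetical; the YYYYMMDD date layout follows the U01120040127 example above.

```python
from datetime import date


def download_file_id(cluster_id, day):
    """Concatenate the cluster identifier and the date as YYYYMMDD."""
    return f"{cluster_id}{day:%Y%m%d}"


if __name__ == "__main__":
    print(download_file_id("U011", date(2004, 1, 27)))
```

A client could then use the resulting string as a filename search term or embed it in a torrent URL, depending on the peer-to-peer model in use.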

The client machine can then download the file upon whatever schedule is most consistent with the user's needs and desires. Bandwidth will be a constraint, so there is reason not to download the files too frequently. In preferred embodiments, there is a choice in the "preferences" section of the program whereby the user can specify how often he wants to update the file. He will probably do so less frequently if he has a dialup modem connection than if he has a cable modem. Some embodiments use information available in the computer (for instance, provided by the operating system) to determine the connection speed, and automatically choose a download schedule accordingly. Some ask the user to specify the download speed and automatically choose a download schedule accordingly. Other ways of determining a download schedule, including the user's manually starting each download, are all functionally equivalent and within the scope.

Some embodiments automatically cause files of different sizes to be downloaded according to connection speed (or at the choice of the user). One way this is done is for the server to store a tree of cluster arrangements. For instance, suppose clusters are arrived at by splitting bigger clusters in half, and the lineage of the cluster is represented in the file name. Then, for example, U0 might be the parent of U01, and U01 might be the parent of U011. Then a client with less bandwidth available to it might retrieve cluster U011 and one with a great amount of bandwidth but with a user with a very similar taste profile to the first client, might retrieve cluster U0. The difference is that the larger the downloaded cluster, the more likely it is that the true most similar neighbors, out of the whole universe of neighbors, will be found by the client. In some (preferred) embodiments it is possible to either download a cluster as a whole, or download updates. For instance, using the naming convention we have used above, U01120040127-20040126 might be the identifier of the file that contains the difference data between an up-to-date representation of cluster U011 as it appeared on January 26, 2004 and the version that was current on January 27, 2004. Then a preferred embodiment will automatically choose whatever method will result in getting current more quickly. For instance, if no update has occurred in a number of days, it may be more efficient to download the complete file. But if the last update was recent, it may be more efficient to download a series of daily updates. In a preferred embodiment making use of BitTorrent, the server stores, for each cluster, files representing the current complete cluster, individual updates for the last 6 days, and the last 4 weekly update files (files that update for a whole week). 
BitTorrent requests for any of these files cause them to be loaded to client machines, where they are henceforth made available in a peer-to-peer manner. Any such manner of scheduling updates is functionally equivalent.
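The choice between a complete download and a series of daily updates can be sketched as below. This Python sketch assumes a simple rule (use daily updates only while the client is no more than six days behind) and ignores the weekly update files for brevity; the function name `update_plan` and the rule itself are illustrative assumptions. Update identifiers follow the U01120040127-20040126 convention used above.

```python
from datetime import date, timedelta


def update_plan(cluster_id, last_sync, today, max_daily=6):
    """List the file identifiers a client should fetch to become current.

    Returns [] if already current, a chain of daily update identifiers
    (new date first, then old date) if only a few days behind, and the
    identifier of the complete current file otherwise.
    """
    days_behind = (today - last_sync).days
    if days_behind <= 0:
        return []
    if days_behind > max_daily:
        return [f"{cluster_id}{today:%Y%m%d}"]
    plan = []
    for offset in range(days_behind):
        old = last_sync + timedelta(days=offset)
        new = old + timedelta(days=1)
        plan.append(f"{cluster_id}{new:%Y%m%d}-{old:%Y%m%d}")
    return plan


if __name__ == "__main__":
    print(update_plan("U011", date(2004, 1, 26), date(2004, 1, 27)))
    print(update_plan("U011", date(2004, 1, 1), date(2004, 1, 27)))
```

An embodiment that knows the actual file sizes could instead compare the total bytes of each option and pick the smaller.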

Those skilled in the art will know how to create such update files. There are general "patching" software technologies, but more particularly it is easy to create custom approaches. For instance, if the cluster file contains a list of user ID's with each user ID followed by a list of the songs found on his or her computer, an update file may consist of a list of user ID's of users who downloaded new songs in the corresponding time interval, with each user ID followed by a list of the new songs and a list of songs that used to be on the user's disk and no longer are. All such representations are functionally equivalent and fall within the scope of the invention.
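The custom update format just described (per user, a list of new songs and a list of removed songs) can be sketched in Python as follows. The in-memory representation, with a dictionary mapping each user ID to a set of songs, is an assumption of this sketch, as are the names `make_update` and `apply_update`; an actual file encoding is not shown.

```python
def make_update(old_cluster, new_cluster):
    """Build an update: user ID -> (songs added, songs removed).

    Only users whose song lists changed (or who are new) are included.
    """
    update = {}
    for user_id, songs in new_cluster.items():
        before = old_cluster.get(user_id, set())
        added, removed = songs - before, before - songs
        if added or removed:
            update[user_id] = (added, removed)
    return update


def apply_update(cluster, update):
    """Return a patched copy of the cluster; the input is left unchanged."""
    patched = {user_id: set(songs) for user_id, songs in cluster.items()}
    for user_id, (added, removed) in update.items():
        songs = patched.setdefault(user_id, set())
        songs |= set(added)
        songs -= set(removed)
    return patched


if __name__ == "__main__":
    monday = {"u1": {"songA", "songB"}, "u2": {"songC"}}
    tuesday = {"u1": {"songA", "songD"}, "u2": {"songC"}}
    delta = make_update(monday, tuesday)
    print(delta)
    print(apply_update(monday, delta) == tuesday)
```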

Another aspect is the fact that changes on the user's machine need to be uploaded to the server. In some embodiments this is done on a regular schedule when there are changes to upload. Preferred embodiments only send changes since the last upload rather than uploading the entire interest profile. Preferred embodiments don't send changes until sufficient changes have accrued that it is "worthwhile" to do an update. For instance, in embodiments where taste profiles include information about the number of times a song has been played, it makes a big difference when that count goes from 0 to 10, but very little difference when it goes from 1000 to 1001. A simple way to determine significance is to have a cutoff for the percentages involved. For instance, if play counts are used, then if overall they have changed by 1%, that might be considered significant. If simple presence/absence data is used, then a 1% difference in that data might be considered significant. Alternatively, the entropy of the data may be used. For instance, entropy can be calculated based on the exercise of choosing a "play" at random, and computing the probability that such a randomly chosen play instance would arrive at a particular song. So there is one probability for each song. Based on those probabilities the song entropy may be calculated. Then significance may be determined by a particular amount of change in entropy occurring, either on a percentage basis or based on a fixed minimum change in value. Any technique that determines that a desirable amount of change has occurred is considered functionally equivalent from the standpoint of the invention and thus falls within the scope.
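The entropy-based significance test can be sketched in Python as follows. The sketch assumes Shannon entropy in bits over the per-song play probabilities and a relative-change threshold of 1%; the function names and the default threshold are illustrative assumptions.

```python
import math


def play_entropy(play_counts):
    """Shannon entropy (bits) of picking a play at random.

    play_counts maps song identifier -> number of plays; each song's
    probability is its share of all plays.
    """
    total = sum(play_counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in play_counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


def is_significant(old_counts, new_counts, min_relative_change=0.01):
    """True if entropy changed by at least the given fraction."""
    old_h = play_entropy(old_counts)
    new_h = play_entropy(new_counts)
    if old_h == 0.0:
        return new_h > 0.0
    return abs(new_h - old_h) / old_h >= min_relative_change


if __name__ == "__main__":
    # One extra play among thousands barely moves the entropy.
    print(is_significant({"a": 1000, "b": 1000}, {"a": 1001, "b": 1000}))
    # A newly played song changes it a great deal.
    print(is_significant({"a": 10}, {"a": 10, "b": 10}))
```

Entropy alone does not register a change that scales all counts equally, so an embodiment might combine it with the percentage cutoff described above.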

In some embodiments the user can determine how much significance is required before an update occurs; in others it is automatically determined based on bandwidth; in others it is determined on a global basis by the server; in others some combination is used such as a maximum upload frequency being determined by the server with the user having the ability to set the frequency or significance required as long as it is below the global value; any number of other techniques are possible and considered functionally equivalent within the scope of the invention.

Note: Music is discussed in this specification for reasons of example only. The invention applies to other areas just as well, including text documents, videos, weblogs, and indeed any type of item where user interest can be determined by means of his association, and/or degree of association, with a number of items of potential interest. Software developers will readily see how to create these alternative embodiments. It must not be construed that the invention is limited to the specific examples described in this specification.

The overall invention, in broadest form, consists of a server (or networked group of servers) that stores the cluster files containing interest profiles and distributes them to client machines, and client machines that then distribute those files to other client machines; the nearest neighbors are then chosen on client machines and used for purposes of recommendation and community.

Clusters should be large enough to include most users whose profiles are reasonably likely to be global "nearest neighbors" for any given local user.

It would be worth while to discuss one further sample application of the technology. That is one where users are purchasers of DVD's for viewing videos. The interest profile would consist of the list of DVD's owned by the user (perhaps with additional entries that are liked or particularly disliked by the user), optionally associated with the ratings. Numerous technologies are available for finding nearest neighbors based on such data, such as those used by Firefly or the movie recommendation patents of John Hey, or the present inventor's U.S. patent number 05884282. (All such algorithms are functionally equivalent from the standpoint of the present invention.) This profile data is usually manually entered by the user.

In addition to forming communities and recommendations as already described, this embodiment adds functionality for making it visible to other users that one has DVD's one is willing to lend out, and for keeping track of DVD's that have been lent. Additionally, preferred embodiments have functionality for rating lenders of DVD's according to their reliability (much as is done on eBay or various auction sites with respect to sellers). Skilled practitioners of the art of Web programming will immediately see how to create appropriate user interfaces.

In some embodiments this lending data is stored on the server for easy access by various clients and in others it is made available by peer-to-peer means.

The idea is that when the system finds people who have similar tastes, they will be able to help each other by lending DVD's to each other. Because they have similar tastes, they will be able to lend multiple DVD's. They may also email each other or chat with each other about DVD's of interest through addresses made available through the interface or through automatic means. These factors lead to a relationship of trust, which minimizes the risk in sharing DVD's. So such a service has the potential to do what Netflix does but, since there is no central repository of DVD's, at much lower cost.

Of course other physical objects of interest than DVD's are the subject of other embodiments; CD's is one applicable subject area.

APPENDIX 3

Introduction (Appendix 3)

This appendix describes another way of implementing key functionality of the invention, including but not limited to facilitating retrieval of representations of nearest neighbor candidate taste profiles and associated user identifiers in an order such that said nearest neighbor candidate taste profiles tend to be at least as similar to a taste profile of the target user according to a predetermined similarity metric as are subsequently retrieved ones of said nearest neighbor candidate taste profiles.

The representations mentioned in the previous paragraph may be the user profiles themselves (including the taste profiles), or just the taste profiles (which should include an identifier of the user); or they may be user ID's of the users, or URL's enabling the data to be located on the network, or any other data that allows taste profiles and associated user ID's to be accessed. These are all functionally equivalent from the standpoint of the invention.

So that it may be taken separately, this Appendix describes the invention anew. The present invention is a new approach to dynamically creating online groups of similarly-minded people for both community-building and generating recommendations of items of interest to the communities. The invention is a form of distributed computing for searching which we will refer to as "distributed profile climbing" or "DPC". In preferred embodiments it is a kind of middle ground between a server-based Internet service and a peer-to-peer one.

The invention consists of a networked computer system running special software. The network is typically the Internet (but can be any network which interconnects computers) and the computer can be a broad range of computer hardware that a user might own, a typical personal computer running with 256 megabytes of RAM and a Pentium processor being one example. The connection to the network may be a direct connection, or may be wireless, based on radio, light, Ethernet cabling, etc.

Distributed Profile Climbing

Peer-to-peer networks are a popular way to handle such challenges as sharing files between many users. The main problem is that not everyone who wants to participate in such a network can do so fully. This is for a number of reasons — computers may not be on all the time, or they may be portable, or they may have firewall and/or network address translation issues.

Pseudo-peer-to-peer networks handle that problem by creating proxies for the machines of each user who wants to participate. These proxies exist on server systems, but typically the technical requirements for those servers are light because the proxies merely store and transmit data related to the machine they are proxying.

An example of this is Radio UserLand's "upstreaming". Radio UserLand is a software package that runs on end-user computers and lets users create weblog entries. Those entries may then be sent ("upstreamed") to UserLand's servers. Web users who wish to view a Radio UserLand customer's weblog can then look at the proxy data on UserLand's servers. Note that, in a world where everyone had computers always able to allow access to other users, there would be no need for this upstreaming to take place. Each weblog writer's machine could serve their weblogs to the rest of the world. But we are not in such a world, so the practical solution is to send the weblog data somewhere where it can always be available to other people, in the form of a data object which is located at a particular URL on a reliable server. This data object is the proxy for the user's machine.

DPC networks share a common foundation with pseudo-peer-to-peer networks like UserLand Radio in the sense that each user's data is represented by a proxy data object located on a remote server. However, in DPC networks, this data contains a profile of the user in order to compare similarity of interests. In preferred embodiments, the proxy object for a user further contains key information for other users who have already been found to be similar in interests to that user. This key information is sufficient to enable the proxies of those other users to be accessed (typically, this would be by means of constructing a URL that accesses the proxies).

One very important aspect of searching for similar profiles is intelligently handling users that have already been compared at least once. In some cases, it may be desired to never compare them again; in others it may be desired to compare them again after a certain amount of time or a certain number of updates have occurred. Most approaches for taking care of this involve storing representations of which pairs of profiles have already been compared.

For instance, some solutions store a table with a concatenated key containing the logon ID's of the two users that have been compared. But this is a problem. If we assume that over time every user will be compared to every other (ignoring the expense of those comparisons for now) and there are 10,000,000 users in the database, the result is a table with 100,000,000,000,000 records. That is not within the realm of reasonable possibility for affordable server installations.

However, now assume there are 10,000,000 users each with their own machine, and each machine stores the logon ID's of the approximately 10,000,000 users it may have been compared to over time. This is entirely within reason given that most computers being sold today are equipped with 10's of gigabytes of storage. This is the way DPC handles the problem, in embodiments which involve such lists. Preferred such embodiments contain the calculated similarity metric for each comparison as well as the date and time of the comparison, and other pertinent information may be included as well.

Moreover for embodiments that handle previously-checked lists, there is no need for the kind of very sophisticated, highly scalable database software that would be required to store that data on a central server.

Furthermore, in most DPC systems, the similarity metrics are computed on the users' machines rather than on the server. This is not a requirement, but it does help to distribute the workload and simplify the scalability issues for the server. As a matter of practical implementation, preferred embodiments where there are large numbers of users divide the proxies for various users among separate servers residing in one or more physical hosting sites. Usually the proxies are divided up in such a way that a hash function based on the user's ID can be used to determine which server (or subgroup of servers) hosts that user's proxy. The benefit of dividing the server side up this way is one of simplicity and cost: there is no need for a high-performance central database system. Instead the servers can operate in relative isolation from each other, even storing all data in local RAM for speed, and communicating with other server hardware only for control and backup purposes.
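The hash-based placement of proxies can be sketched as follows. The specification does not name a hash function; SHA-256 and the function name `server_for_user` are assumptions chosen only so that the example is deterministic across machines.

```python
import hashlib


def server_for_user(user_id, server_count):
    """Map a user ID to a server index in the range 0..server_count-1.

    SHA-256 is an assumed choice; any stable hash shared by all clients
    would serve the same purpose.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % server_count
```

Because every client computes the same index for a given user ID, no central lookup table is needed to locate a proxy. A plain modulus does reshuffle most users when `server_count` changes, so an embodiment that expects to add servers often might prefer a different placement scheme.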

An algorithm for one embodiment of the invention is shown below. Steps are carried out in the order shown. Deeper indentation is used in the representation of repeated groups of operations, or operations that are dependent on the result of an "if" test. An "else" relates to the previous "if" at the same indentation level. A "break" causes the process to immediately terminate the currently innermost loop, while allowing outer loops to continue undisturbed. The operations depicted are carried out by the software operating on end-user machines, except that the server is invoked to provide data on occasion.

First we will introduce some terms. THISUSER is the user whose machine the algorithm is running on. Each user has an associated NEIGHBORBAG which is his current list of ID's of similar users. In this example embodiment, the NEIGHBORBAG has a fixed maximum size. PREVIOUSLYCHECKEDBAG is a collection of users that have already been checked as potential neighbors (members of NEIGHBORBAG).

In the example which will follow, all similarities are between 0 and 1, and higher similarities are better. When similarities between THISUSER and another are considered, it is implied that one of the following happens: a) the user's machine requests that the server send the other user's taste profile, such as an encoded version of the relevant data from his iTunes Music Library database, and the taste profiles of the two users are compared on THISUSER's machine, or b) the server compares the two users using that same data and returns the result to THISUSER's machine. The former has the overhead that the profiles need to be sent to the user's machine, which consumes network bandwidth. The latter adds more work that must be done on the server side, increasing the complexity of the server. Different embodiments need to trade off these factors.

• repeat as long as THISUSER is online:
    o ask the server for the ID of a random, already-existing user; set N to be this returned ID
    o set PREVCLIMBER to null; set PREVSIMILARITY to 0
    o repeat:
        ■ if N is a member of THISUSER's PREVIOUSLYCHECKEDBAG, and was added to it < 6 months ago:
            • break
        ■ ask the server for N's NEIGHBORBAG; save it in CLIMBERBAG
        ■ set C to be the member of CLIMBERBAG that is most similar to THISUSER
        ■ add all members of CLIMBERBAG that are not already there to THISUSER's PREVIOUSLYCHECKEDBAG
        ■ set CSIMILARITY to C's similarity to THISUSER
        ■ if CSIMILARITY > PREVSIMILARITY:
            • set PREVSIMILARITY to CSIMILARITY
            • set PREVCLIMBER to C
            • set N to C
        ■ else:
            • if there are any members of THISUSER's NEIGHBORBAG that have a similarity to THISUSER that is < PREVSIMILARITY:
                o if the maximum size for NEIGHBORBAG has been reached:
                    ■ remove the member of THISUSER's NEIGHBORBAG which has the least similarity to THISUSER
                o add PREVCLIMBER to THISUSER's NEIGHBORBAG
            • break
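One pass of the inner "repeat" loop of the pseudocode above might be sketched in Python as follows. This is an illustration under stated assumptions, not a definitive implementation: the function names, the dictionary representation of NEIGHBORBAG (user ID to similarity) and of PREVIOUSLYCHECKEDBAG (user ID to timestamp), and the 182-day stand-in for "6 months" are all hypothetical. The sketch also departs from a literal reading in three labeled ways: it records N itself as checked when N's NEIGHBORBAG is fetched (recording every member of CLIMBERBAG, including C, would end the climb as soon as N is set to C); it admits a candidate whenever the bag still has room; and it stops if a fetched NEIGHBORBAG is empty.

```python
SIX_MONTHS = 182 * 24 * 3600  # seconds; a rough stand-in for "6 months"


def offer_neighbor(neighbor_bag, candidate, candidate_similarity, max_size):
    """Try to place candidate in neighbor_bag (dict of user ID -> similarity)."""
    if candidate is None or candidate in neighbor_bag:
        return False
    has_room = len(neighbor_bag) < max_size  # assumption: fill free slots first
    beats_someone = any(s < candidate_similarity for s in neighbor_bag.values())
    if not (has_room or beats_someone):
        return False
    if not has_room:
        # Evict the least similar current neighbor to make space.
        del neighbor_bag[min(neighbor_bag, key=neighbor_bag.get)]
    neighbor_bag[candidate] = candidate_similarity
    return True


def climb_once(this_user, start_id, get_neighbor_bag, similarity,
               neighbor_bag, checked, max_size, now):
    """Climb from start_id toward users more similar to this_user.

    get_neighbor_bag(user_id) and similarity(a, b) are caller-supplied
    stand-ins for the server requests described in the text.
    Returns the user added to neighbor_bag, or None if nobody was added.
    """
    n = start_id
    prev_climber, prev_similarity = None, 0.0
    while True:
        last_checked = checked.get(n)
        if last_checked is not None and now - last_checked < SIX_MONTHS:
            return None  # N was examined recently: the "break" case
        checked[n] = now  # assumption: record fetched users, see lead-in
        climber_bag = [u for u in get_neighbor_bag(n) if u != this_user]
        if not climber_bag:
            return None  # assumption: nothing to climb to
        c = max(climber_bag, key=lambda u: similarity(this_user, u))
        c_similarity = similarity(this_user, c)
        if c_similarity > prev_similarity:
            prev_climber, prev_similarity = c, c_similarity
            n = c  # keep climbing from the better match
        else:
            added = offer_neighbor(neighbor_bag, prev_climber,
                                   prev_similarity, max_size)
            return prev_climber if added else None
```

The outer loop of the pseudocode would call `climb_once` repeatedly, each time with a fresh random `start_id` obtained from the server, for as long as the user is online.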

Note that this invention must not be construed as being limited to the algorithm above, which is presented merely as one of the more simple ways of implementing the invention.

However, all approaches that fall within the scope of the invention have in common that profiles arrive at the client node in an order that tends to deliver the profiles most similar to the current user first.

Accordingly, processing is included above whereby a profile isn't retrieved again until a sufficient time period has passed for the profile to have appreciably changed. In the short term, the most similar matches will exhaust themselves and less similar matches will follow.

At the beginning the retrieved profiles are essentially random, but the process quickly "climbs" to strong matches. The process therefore will not retrieve profiles in exactly the ideal order; indeed, the techniques used do not generally retrieve the profiles in exactly the ideal order. However, this method will retrieve profiles in a good enough order that, once climbing has reached a high level of similarity and profiles are not being retrieved because they already have been, we have the required general decreasing similarity.

The climbing is accomplished by means of calculating the similarity metric with respect to the nearest neighbors of a user for which the similarity has previously been calculated, where the latter was found to be at a level high enough that it is worth the expense of going on to retrieve the interest profiles for that user's neighbors to determine whether one or more of them will have an even greater similarity to the target user.

Some peer-to-peer networks, such as the Morpheus file-sharing network, have an architecture which causes data which would traditionally be stored on a server to instead be stored on a subset of user computers. We will refer to such servers, in the context of this invention (not necessarily in the Morpheus context) as user-associated servers. In the conduct of the illegal file trading of copyrighted files, the main "advantage" of this technique is arguably that there is no company which controls the master index and which can therefore be prosecuted or sued.

However, from the point of view of the present invention, there is another reason, and that is to completely (or almost completely) eliminate the expense associated with a central server. If there is a central server (or server network separate from user-associated servers), then some entity has to pay for maintaining it, providing the bandwidth, etc. Without one, that necessity disappears. Eliminating that necessity enables this invention to be embodied, in a sense, in "pure software" such as an open-source software project, instead of needing to embody it in a project run as a business in order to pay for the servers. Based on the experience of the file-sharing networks, there are enough users who do not have severe firewall or connectivity issues and who are willing to help others by making their resources available that this is a feasible solution. Moreover, unlike file sharing networks, there is little real problem if a user-associated server becomes temporarily or permanently unavailable, because the searching is normally done in the background rather than in real-time.

Note that this specification has already described how a hash of the user's ID can be used to determine which server to access for his data. In order to extend that to using user-associated servers, more is required (and the already-described hash may or may not be part of that).

In one set of embodiments there is still a central server but rather than serving the taste profiles, it contains a list of identifiers which can be used to construct the URL's where the taste profile for each user may be found. So the actual amount of data that needs to be stored on, and sent from, the server is far less than in the earlier description. For many implementations, the load will be light enough that a single desktop computer with cable modem or DSL (or similar) connection to the Internet will be enough. The Gnutella network, for example, provides a "cloud" of user-associated servers, many or all of which store the URL's (or data that can be used to construct the URL's) of many or all of the other user-associated servers. When a user obtains Gnutella-compliant software (whether by download or by other means) it normally is distributed with a list of user-associated servers that are frequently available. The software then contacts those servers, and can get lists from them of other such servers. The local node is then updated with this information, and it is available to other nodes that might eventually contact this node. Thus, no single central server is required. This specification will not describe the construction of such networks in detail; rather the technical descriptions for Gnutella and other such networks, readily found online using such search tools as Google, should be used. Use such existing networks as a model for constructing a "cloud" of nodes which point to each other and obviate the need for a central server.

Preferred embodiments of the invention where the profile data is stored on user-associated servers generally use the same computers for storing that data as are used by their associated users as their day-to-day computers, with the exception that they must be accessible to inbound connections (i.e., few if any firewall or NAT issues should apply and they should be connected to the Internet, and turned on, a substantial amount of the time).

Each user-associated server stores the profiles and neighbor lists of a number of other users.

For preferred such embodiments, the step of retrieving a random user ID is modified so that instead of asking a central server, first a random user-associated server in the cloud (or semi-random, influenced by the fact that only a subset of the cloud may be known to the node at the time) is chosen, and then that server is asked to provide a random user ID of those whose profiles and neighbor lists are stored on that computer. Then the algorithm proceeds as before, with the exception that instead of retrieving just the ID of other users, enough data is retrieved to construct a URL where that user's information is available. Then it is accessed at that location. Further, if an access fails because the URL doesn't respond or the data that is supposed to be there isn't, a "break" is executed and the innermost loop explicitly spelled out in the pseudocode is exited. Further embodiments lower the percentage of times non-response or not-found errors occur by providing multiple URL's where the same data can be found on different user-associated servers. Then if one fails, one or more fallback machines can be tried.
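The fallback behavior described above, in which several URL's may hold the same data, can be sketched as follows. The function name `fetch_with_fallback` and the caller-supplied `fetch` callable are hypothetical; the sketch assumes that `fetch` raises `OSError` when a server does not respond and returns `None` when the expected data is not there.

```python
def fetch_with_fallback(urls, fetch):
    """Return data from the first URL that yields it, else None.

    urls:  candidate locations for the same proxy, tried in order.
    fetch: assumed callable that takes a URL, raises OSError on
           non-response and returns None when the data is missing.
    """
    for url in urls:
        try:
            data = fetch(url)
        except OSError:
            continue  # server unreachable: try the next location
        if data is not None:
            return data
    return None  # every location failed; the caller executes its "break"
```

A `None` result corresponds to the case in which the access fails and the innermost loop of the pseudocode is exited.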

In preferred embodiments, user-associated servers take responsibility for serving the nearest neighbors of that particular user to the broader community. This causes data for similar users to gravitate toward being stored on the same machines. One advantage of this technique is that if user-associated server A is being accessed and provides a NEIGHBORBAG for similarity testing, it is likely that when the accessing node wants to get the taste profiles for the users in the bag, seconds or minutes later, that machine will still be available on the network.

A further improvement is that, instead of sending the taste profiles to the accessing user for the similarities to be calculated there, the similarities can be calculated on the user-associated server in cases where it is judged, when data transmission expenses are taken into account, that it would be more efficient to send the data there. In such a case, the querying node would upload its taste profile to the user-associated server so that multiple comparisons can be carried out there without further need for network data transmission.

In further embodiments, such user-associated servers not only store the neighbors of their associated users, but also other neighbors with relatively high similarity to other users that are stored on that user-associated server. For instance in some embodiments a centroid may be calculated that represents an average of the taste profiles of the users stored on that server. One type of taste profile contains identifiers for every song a user has played on a particular target platform (such as Apple's iTunes), together with the date it was first added to the user's collection and the number of times he has played it. A centroid averaging a number of such user profiles might contain the identifiers for all the songs played by any of the associated users, together with, for each song, the average of the dates it was added to the system and the average number of plays of that song per user.
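A centroid of the kind just described might be computed as in the sketch below. The profile layout (song identifier mapped to a pair of date added and play count) and the function name `centroid` are assumptions. The text leaves open how the averages are taken; here the date is averaged over the users who have the song, and the play count is averaged over all users stored on the server, counting zero for users who lack the song.

```python
from datetime import date


def centroid(profiles):
    """Average a list of taste profiles into one summary profile.

    Each profile maps song ID -> (date_added, play_count).
    Returns song ID -> (average date added, average plays per user).
    """
    sums = {}
    for profile in profiles:
        for song_id, (added, plays) in profile.items():
            ordinal_sum, play_sum, owners = sums.get(song_id, (0, 0, 0))
            sums[song_id] = (ordinal_sum + added.toordinal(),
                             play_sum + plays,
                             owners + 1)
    user_count = len(profiles)
    return {
        song_id: (date.fromordinal(round(ordinal_sum / owners)),
                  play_sum / user_count)
        for song_id, (ordinal_sum, play_sum, owners) in sums.items()
    }
```

Because the result has the same shape as an individual profile, the same similarity metric can be applied to it.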

The algorithm described above to find the most similar neighbors for a user may be carried out but with respect to this centroid rather than with respect to the user. The ID's of the users most similar to this centroid are stored in a neighbor list for the centroid, and their profiles and neighbor lists (together, their proxies) are the ones that that particular user-associated server takes responsibility for serving to the community. But it should not be construed that the invention is limited in scope to the concept of "centroid" or "averaging." Any summary of multiple users' profile information that is comparable via a similarity metric to an individual user's profile is equivalent for the purposes of the invention.

For example, in some embodiments that involve users' interests with respect to text documents, a user's interests may be captured in a list of the most unusual keywords that regularly turn up in text they read. For instance a paleontologist might read text containing the word "archaeopteryx" fairly frequently. The exact frequency isn't as important as the fact that the population at large very rarely reads text with that word whereas the paleontologist frequently does. So, the paleontologist's interest profile can be realistically represented by a list of such words that meet certain predetermined thresholds for "unusualness" with respect to the general population, and "frequency" with respect to the user himself. Extending that concept to a group of users rather than a single user, it is clear that the interests of a group of similarly-minded individuals can be represented by a list that contains all the words that are in any of the individuals' personal word-lists (or that are in some predetermined proportion of such lists). This is a completely different approach from using averaging to create a centroid, but it falls equally within the scope of the invention, as do all other approaches which serve the purpose of representing an individual's interest where individuals are concerned, and summarizing such interests for a group where groups are concerned, as long as it is possible to compare the interest profiles of individuals to each other or individual interest profiles to summary interest profiles or summary interest profiles to summary interest profiles and calculate appropriate similarity metrics. (With respect to the word list, a simple similarity metric is to calculate the percentage of words, out of the total pool of words formed when the lists are combined, that are held in common. A more sophisticated approach is to consider every word in the combined list to be a "trial", with success being that the word is held in common; the similarity metric is then the posterior mean based on a binomial distribution and a beta prior.)

Note that this process may frequently result in more than one user-associated server hosting the proxy of a given user. That is good, because that allows for redundancy in the system for times when a user-associated server is not available. Moreover, there is more redundancy for users who are similar to a lot of users than for users who are similar to only a few others. This allows for providing the most reliable and efficient service to the most people.
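The two word-list similarity metrics mentioned above can be sketched as follows. The function names are hypothetical, and the default Beta(1, 1) prior is an assumption; the text specifies a beta prior without giving its parameters. With `s` words held in common out of `n` words in the combined list, the posterior mean under a Beta(alpha, beta) prior is (alpha + s) / (alpha + beta + n).

```python
def overlap_similarity(words_a, words_b):
    """Fraction of the combined word pool that both lists hold in common."""
    combined = set(words_a) | set(words_b)
    if not combined:
        return 0.0  # assumed value when both lists are empty
    return len(set(words_a) & set(words_b)) / len(combined)


def posterior_mean_similarity(words_a, words_b, alpha=1.0, beta=1.0):
    """Posterior mean of the 'held in common' rate under a beta prior.

    Each word in the combined list is one binomial trial; a success is a
    word present in both lists. alpha and beta are assumed prior values.
    """
    combined = set(words_a) | set(words_b)
    successes = len(set(words_a) & set(words_b))
    return (alpha + successes) / (alpha + beta + len(combined))
```

The posterior mean pulls the score toward the prior when the lists are short, so two users who share both of their only two words score 0.75 here rather than 1.0, reflecting how little evidence two words provide.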

As a further example, in some embodiments the summary is simply the taste profile of the user associated with the user-associated server that is directing the search. By finding nearest neighbors to that user, such a server is also finding neighbors who are relatively similar in taste to the other users whose profiles are stored on that user-associated server, as long as the question of whose profile shall be stored is also resolved by virtue of having a high similarity metric with respect to the user associated with the user-associated server.

In further embodiments, each user-associated server carries out searches using an algorithm almost identical to one of those described above, with the exception that the search is done with respect to similarity to the collection of users whose proxies (whether the proxy contains the taste profile or the user's neighbor list or both and/or contains other items) are already being served from that particular user-associated server. (This is as opposed to doing such searches with respect to each individual user whose proxy is stored on the server or facilitating, by serving data, such searches carried out by the individual user-associated nodes.) This may be done, as described above, by comparing other users to a centroid of the collection or it may be done by other summary means (all of which fall within the scope of the invention). The standard literature on the subject of data clustering will reveal a number of methods that are equivalent for the purposes of this specification. In preferred such embodiments, the user who is associated with the user-associated server is always among the users whose proxy would be added to that collection if the user wasn't already there. For instance, in the method which involves a centroid produced by averaging the profiles of the users, the algorithm would never remove the user associated with the user-associated server from the list of users whose profiles are averaged to produce the centroid.

NOTES FOR APPENDIX 3

A central server may be not only a single server computer, but a set of such computers, the distinguishing characteristic not being the number of computers in the central server, but rather the fact that they are not associated with a particular user but rather made available on the network to serve data to a substantial number of user-associated computers.

When this specification uses the term "associated with" for the relationship between a user and a computer, the computer is the computer that the user normally accesses to get the benefits of the system, for instance, viewing a list of the users that are more similar to him than any others that have been examined.

The term "target user" is used occasionally in this specification to refer to a particular user who is using the invention and for whom the invention has found, and/or is finding, other users with similar interests and/or tastes.

Preferred embodiments make a display of the individual users who have been found to be most similar to the target user available through a computer user interface. In some embodiments this takes the form of a list; in others there are other displays such as images representing the users in 2D or N-Dimensional space. In some embodiments the positions such images take with respect to each other in the visual plane represent how similar they are to each other.

Preferred embodiments make recommendations to the target user of specific items based on a list of nearest neighbors, that is, a list of neighbors who are relatively similar to the target user in taste when compared with other users of the system. They do this by processing the preferences of the nearest neighbors in ways that are similar to how this is done in other nearest-neighbor-based collaborative filtering systems such as, for example, in the GroupLens Usenet filtering system, http://www.si.umich.edu/~presnick/papers/cscw94/GroupLens.htm, incorporated herein by reference, or the system described in Upendra Shardanand's 1995 thesis, Social Information Filtering: Algorithms for Automating "Word of Mouth," http://citeseer.nj.nec.com/rd/61053528%2C323706%2C1%2C0.25%2CDownload/http://citeseer.nj.nec.com/cache/papers/cs/15862/http:zSzzSzmas.cs.umass.eduzSz%7EaseltinezSz791SzSzshardanand.social_information_filtering.pdf/shardanand95social.pdf, incorporated herein by reference. Note that those two papers, and others, describe how recommendations may be made once a list of nearest neighbors has been determined, and those and other approaches exemplified by those may be used once such a list has been determined, regardless of the particular calculation originally done to determine the degree of similarity another user has and thus how the decision was made about how to add him to the list of nearest neighbors.

However, it is important to note that while the papers mentioned above make recommendations based on ratings manually entered by the users, the present invention may be used in situations where no such ratings are available. Instead other information may be available, such as the fact that the user has purchased particular items, or has chosen to experience them a certain number of times (for instance, has played a musical track a certain number of times). When only purchase data is available, a purchase can be considered to be equivalent to a rating of "good" and no purchase can be considered equivalent to a rating of "poor". When the number of times a user has chosen to experience an item is available, an easy way to approximate the effect of having ratings is to rank the items by the number of experiences. Then divide the rank by the number of items. This results in a number between 0 and 1 that can be used as a rating-equivalent, normalized to that interval so that the "ratings" of all users are on the same scale. So the techniques mentioned in the afore-mentioned papers, and others, are still usable even where there are no explicit ratings. However, for purposes of example, a particular technique of making recommendations for situations where nearest neighbors have been found and "number of experiences" data is available for each item will be presented here.
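The rank-based rating-equivalent described above can be sketched as follows. The function name is hypothetical. Two details are assumptions, since the text does not settle them: rank 1 is given to the least-experienced item so that the most-experienced item receives 1.0, and ties are broken by item identifier purely to make the result deterministic.

```python
def rating_equivalents(play_counts):
    """Convert play counts into rating-like values in the interval (0, 1].

    play_counts maps item ID -> number of times the user experienced it.
    """
    ordered = sorted(play_counts, key=lambda item: (play_counts[item], item))
    item_count = len(ordered)
    # Rank (1 = fewest plays) divided by the number of items.
    return {item: (index + 1) / item_count
            for index, item in enumerate(ordered)}
```

For a user with play counts of 5, 50, 1 and 20 for items a, b, c and d, this yields 0.5, 1.0, 0.25 and 0.75 respectively, regardless of how heavily that user plays music overall.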

This technique is to simply add up the number of experiences for each item for all nearest neighbors. For example, assume that out of a universe of 1,000,000 music fans, the system has found 100 nearest neighbors for the target user. For each item associated with each fan, there is a count of how many times each song has been played. If the system simply adds up these counts for each item, the item with the highest total count may be considered to be the most popular item in that community, and should be recommended to the target user if he hasn't already experienced it. Equivalently, one can compute the arithmetic mean of the number of plays, where the number of plays is 0 for users that haven't experienced the item at all.

A variant of the approach described in the previous paragraph that is arguably more reliable is to compute log(1 + K) for each neighbor/item combination, where K is the number of times the user has experienced the item in question, and then calculate the sum of these values over the population of nearest neighbors. The higher that sum is, the more highly the item should be recommended. The advantage of using the log is that for an item to be recommended highly, it is more important for the item to be experienced often by a large number of nearest neighbors than it is for a few nearest neighbors to have experienced the item a huge number of times.
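A short sketch of both scoring rules (the plain sum of counts and the sum of log(1 + K)) may make the difference concrete. The function names, the list-of-dictionaries input and the `top_n` parameter are assumptions for illustration only.

```python
from collections import defaultdict
from math import log


def score_items(neighbor_play_counts, use_log=True):
    """Sum K (or log(1 + K)) per item over all nearest neighbors.

    neighbor_play_counts: list of dicts, one per neighbor, item -> play count.
    """
    totals = defaultdict(float)
    for counts in neighbor_play_counts:
        for item, k in counts.items():
            totals[item] += log(1 + k) if use_log else k
    return dict(totals)


def recommend(neighbor_play_counts, target_items, use_log=True, top_n=10):
    """Return the highest-scoring items the target user has not yet experienced."""
    totals = score_items(neighbor_play_counts, use_log)
    candidates = [(score, item) for item, score in totals.items()
                  if item not in target_items]
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item for _, item in candidates[:top_n]]


# One neighbor played "x" 100 times; three neighbors each played "y" 5 times.
neighbors = [{"x": 100}, {"y": 5}, {"y": 5}, {"y": 5}]
by_sum = recommend(neighbors, target_items=set(), use_log=False)
by_log = recommend(neighbors, target_items=set(), use_log=True)
```

With the plain sum, the single heavy listener pushes "x" to the top; with the log, "y" wins because more neighbors play it.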

The same two papers mentioned above that discuss collaborative filtering, and others such as the specification of my own patent 5,884,282, herein incorporated by reference, describe different ways of creating metrics to capture degrees of similarity between two users. All such metrics fall within the scope of the invention. The invention isn't limited to particular metrics; rather the focus of the invention is on the structure of the search and where the relevant data is stored.

A similarity metric that is used in preferred embodiments where explicit user-entered ratings are not available is the following. Assume user A is the target user, and we want to know how similar user B is to user A. We calculate an approximation, subject to certain assumptions which are useful to us but may not be true in the real world, of a certain probability. This can be loosely summarized as the probability that, if a randomly chosen item X that is in B's collection but not in A's collection is put into A's collection, and we then pick a random time in the future when A is experiencing an item from his collection, that item will be X. An implementation of this concept that teaches the technique is included in the tasteprofile.py module in Appendix 4.
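As a reading aid, the computation performed by calculateSimilarity() in Appendix 4 can be condensed into the following sketch. It follows the appendix's constants and arithmetic as printed there, but the dictionary-based profile format and the function names are assumptions for illustration, and it is written for modern Python rather than the appendix's dialect.

```python
INNER_PESSIMISTIC_WEIGHT = 50  # values as given in Appendix 4
OUTER_PESSIMISTIC_WEIGHT = 1


def expected_plays_per_day(play_count, days_in_collection):
    """Bayesian expectation with a prior mean of 0.

    Play counts of None, 0 and 1 are all floored to 1, as in the appendix.
    """
    floored = max(1, play_count or 0)
    return floored / (INNER_PESSIMISTIC_WEIGHT + days_in_collection)


def similarity(profile_a, profile_b):
    """How good B looks as a source of recommendations for A.

    Each profile is a dict: song_id -> (play_count, days_in_collection).
    This layout is hypothetical; the appendix uses TasteProfile objects.
    """
    rate_a = {s: expected_plays_per_day(*v) for s, v in profile_a.items()}
    rate_b = {s: expected_plays_per_day(*v) for s, v in profile_b.items()}
    if not rate_a or not rate_b:
        return 0.0
    total_a = sum(rate_a.values())
    total_b = sum(rate_b.values())
    # For each shared song: A's next-play probability times B's expected plays.
    shared = sum((rate_a[s] / total_a) * rate_b[s] for s in rate_b if s in rate_a)
    if shared == 0.0:
        return 0.0  # no songs in common
    # A is assumed to have accepted one new song, diluting existing probabilities.
    adjusted = shared * len(rate_a) / (len(rate_a) + 1)
    return adjusted / (OUTER_PESSIMISTIC_WEIGHT + total_b)


a = {"s1": (10, 50), "s2": (1, 50)}
b = {"s1": (10, 50), "s3": (5, 50)}
b_heavier = {"s1": (40, 50), "s3": (5, 50)}
c = {"s4": (3, 10)}
```

In this sketch a neighbor who plays the shared song more often scores higher, and a neighbor with no shared songs scores exactly zero.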

Embodiments of this invention serve the useful purpose of determining which other participating users are most similar to a user who is a participant in the system, and storing that information in the computer for purposes of displaying that community and/or making recommendations of desirable items. Further embodiments not only store that information, but display the community members and/or recommendations through the system's user interface.

Some embodiments store each user's profile on their associated computers. Due to issues mentioned above, many user-associated computers may not be accessible to other users from the Internet. So a technique must be provided by which users can serve their profiles when they are stored on user machines. Gnutella-style networks provide an example for this. Nodes which are accessible from the Internet allow incoming connections to be made from nodes which are not necessarily connected. Then, data on those not-otherwise-accessible nodes is made available to other nodes on the network, through the network-accessible nodes which the not-otherwise-accessible nodes are connected to. In the case of Gnutella, this data includes lists of available files and the files themselves. (See http://www9.limewire.com/developer/gnutella_protocol_0.4.pdf, hereby incorporated by reference, for more information on the details of the Gnutella approach.) In the present invention, the network-accessible servers usually store lists of the user IDs associated with the nodes they are connected to, and when a request arrives for data associated with one of those IDs, the request is routed to the appropriate connected node, the data is retrieved by the network-accessible node, and then sent by the network-accessible node to the requesting node. Most embodiments that use the search algorithm described earlier in this specification modify it when it is used in the configuration described in this paragraph so that if the data for an ID is not available, a "continue" is called in the innermost loop so that control goes to the top of the loop, and processing continues as if that information had not been requested. Note that to facilitate "hits" occurring as frequently as possible, nodes normally try to connect to network-accessible computers that are on their nearest-neighbors list. This makes it likely that network-addressable nodes will be connected to some of their associated users' nearest neighbors, so that when the interest profiles of neighbors are needed by the algorithm, they can more often be retrieved. In general, the presented algorithm is modified so that where, originally, IDs of similar users are requested, information is provided that can be used to construct one or more URLs where the information can be found. If the information is not found on a directly network-accessible computer, the URL of a network-accessible one (such as the one providing the URL!) can be given, which includes parameters such as the ID of the user whose information is desired, to tell that node which possibly-connected node to get the information from. An individual of ordinary skill in the art of peer-to-peer software development will understand how to create the necessary software in accordance with this description. It should be stressed that this paragraph is for example only, and that there are many equivalent variants that involve, for instance, caching data on intermediate user-associated nodes, transporting profiles to other nodes for comparison, etc. This invention's scope must not be construed as being dependent on specific techniques for making the data and computations available in a peer-to-peer setting.
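The relay arrangement and the "continue" modification can be sketched in a few lines. Everything here is hypothetical scaffolding for illustration: the `RelayNode` class, its method names and the callable used to reach a connected node are not part of the specification, and a real embodiment would use network connections instead of in-process callables.

```python
class RelayNode:
    """A network-accessible node that relays profile requests (illustrative)."""

    def __init__(self):
        # user_id -> callable that fetches the profile from the connected node
        self._connected = {}

    def register(self, user_id, fetch_profile):
        """Record that the node holding user_id's profile is connected to us."""
        self._connected[user_id] = fetch_profile

    def handle_request(self, user_id):
        """Route a request to the connected node, or return None if unavailable."""
        fetch = self._connected.get(user_id)
        if fetch is None:
            return None
        return fetch()


def screen_candidates(candidate_ids, relay, score):
    """Score every candidate whose profile can actually be retrieved."""
    results = []
    for user_id in candidate_ids:
        profile = relay.handle_request(user_id)
        if profile is None:
            # Data for this ID is not available: behave as if never requested.
            continue
        results.append((score(profile), user_id))
    results.sort(reverse=True)
    return results


relay = RelayNode()
relay.register("u1", lambda: {"song-a": 3, "song-b": 1})
found = screen_candidates(["u1", "u2"], relay, score=len)
```

Here "u2" is not connected to the relay, so the loop skips it and the search proceeds with the profiles that could be retrieved.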

In some embodiments two forms of interest profiles are created and stored. One is a very small (in terms of the amount of data) representation. For example, if the main interest profile contains the song names and artist names for songs in the user's collection and the number of times he has played each one, which could have thousands of entries, this miniature profile may contain only the user's 10 most frequently played songs, identified by a hash such as that generated by Python's built-in hash() function. Preliminary screening, including climbing, happens as described elsewhere in this specification using the miniature rather than the full profile. Then as a last step, before adding another user to the target user's nearest neighbor list, the full profiles are checked to be sure the similarity metric is really high enough that the user should be a nearest neighbor (for instance, that it's higher than the metric associated with the least similar neighbor). If it doesn't meet this final test, it doesn't go on the list.
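The two-stage check can be illustrated as follows. This is a sketch under stated assumptions: the overlap measure used for the miniature profiles (Jaccard overlap) and the threshold parameter are stand-ins chosen for the example, not the specification's metric, and `hashlib.md5` is substituted for the built-in hash() because string hashes from hash() are randomized per process in current Python versions and so cannot be compared between machines.

```python
import hashlib


def song_key(name):
    """Stable 16-byte identifier for a song name (stand-in for hash())."""
    return hashlib.md5(name.encode("utf-8")).digest()


def miniature_profile(play_counts, size=10):
    """Hashes of the user's most frequently played songs (at most `size`)."""
    top = sorted(play_counts, key=lambda s: (-play_counts[s], s))[:size]
    return frozenset(song_key(s) for s in top)


def mini_similarity(mini_a, mini_b):
    """Jaccard overlap of two miniature profiles (an assumed screening metric)."""
    union = mini_a | mini_b
    return len(mini_a & mini_b) / len(union) if union else 0.0


def admit_neighbor(target_full, candidate_full, full_similarity,
                   mini_threshold, least_similar_score):
    """Cheap screen on miniature profiles, then the final full-profile test."""
    screen = mini_similarity(miniature_profile(target_full),
                             miniature_profile(candidate_full))
    if screen < mini_threshold:
        return False
    # Final test: must beat the least similar neighbor currently on the list.
    return full_similarity(target_full, candidate_full) > least_similar_score


def shared_count(a, b):
    """Toy full-profile metric used only to exercise the sketch."""
    return len(set(a) & set(b))


target = {"a": 5, "b": 3}
good = {"a": 2, "b": 9, "c": 1}
bad = {"z": 4}
```

A candidate must pass both stages; passing the miniature screen alone is not enough to be placed on the list.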

When a miniature profile is used, any technique that serves to produce a relatively small (from the perspective of number-of-bytes), not necessarily complete, representation of the data in the interest profile may be used. The scope of the invention is not limited to particular miniaturizing technologies. For instance, in addition to the simple approach described above, applicable approaches include using all of the item hashes without any counts, using a random selection of items, and including the song name itself rather than a hash, optionally further using standard compression algorithms such as those in the standard Python zlib library. "Neighbors," "users," and similar terms are often used in this specification to represent their interest profiles, IDs, etc.; the meaning is clear in the context.

APPENDIX 4: SOURCE CODE

Introduction (Appendix 4):

In the interest of simplicity and space conservation, we are not presenting every source file involved with the current embodiment of the invention. Some of them will be described in this Introduction, and others will not be described explicitly. In each such case the semantics will be obvious to a programmer of ordinary skill based on usage combined with any additional explanation provided in this introduction.

applicationmutex: This provides an application-wide, inter-process mutex facility.

openexclusive: This provides a "file-like object" in Python terminology called ExclusiveFile. It is associated with a real disk file. If the same process opens a file for both reading and writing at the same time, the writes do not affect the open ExclusiveFile still being read from. After the ExclusiveFile being written is closed, an ExclusiveFile still open for reading remains unaffected. Only one process can access a file through ExclusiveFile at a time; the second one blocks.

plisthandlerclass: This provides facilities to read an XML file in Apple's iTunes Music Store format. A programmer of ordinary skill will readily see how to use various available libraries and custom code to get the same effect.

========== tasteprofileclass.py ==========

The pair of classes appearing in this module, CalcData and TasteProfile, are tightly connected. Each TasteProfile object may have a number of associated CalcData objects. The CalcData objects represent one song in the collection of the user whose TasteProfile it is.

Methods are provided for loading the object from various sources; a programmer of ordinary skill will readily infer the formats from the input code. It is worth noting that for convenience and to save memory, songs are frequently identified by an MD5 hash based on combining and normalizing their artist, album, and song names.
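The normalization and hashing step can be sketched as below. It mirrors the commented-out makeSongHash routine that appears later in this appendix (join artist, album and song with NUL separators, replace punctuation with spaces, collapse runs of spaces, strip, lower-case, then take the MD5 digest), but it is rewritten for modern Python with `hashlib`; the UTF-8 encoding step and the three-argument signature are assumptions, and the actual utilities.makeSongHash is not reproduced in this specification.

```python
import hashlib


def make_song_hash(artist, album, song):
    """Return a 16-byte MD5 digest of the normalized artist/album/song names."""
    raw = "\0".join((artist, album, song))
    out = []
    prev = None
    for ch in raw:
        # Keep letters, digits, spaces, the NUL separator and backslashes;
        # everything else (punctuation) becomes a space.
        if not (ch.isalnum() or ch in (" ", "\0", "\\")):
            ch = " "
        # Collapse consecutive spaces into one.
        if not (ch == " " and prev == " "):
            out.append(ch)
        prev = ch
    normalized = "".join(out).strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).digest()


h1 = make_song_hash("Suzanne Vega", "Solitude Standing", "Tom's Diner")
h2 = make_song_hash("suzanne  vega", "Solitude, Standing", "TOM'S DINER")
h3 = make_song_hash("Suzanne Vega", "Solitude Standing", "Luka")
```

Differences in case, punctuation and repeated spaces do not change the hash, so the same song entered slightly differently by two users still matches.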

The most important method is probably TasteProfile.calculateSimilarity(), which compares the TasteProfile object on which it is called with another one passed to it as a parameter. Usually this is used for the local user to sequentially compare his profile to those of other users, in order to find the best ones, that is, the nearest neighbors.

In such usage, a nearest neighbor list of a predetermined length is maintained, and when a profile of greater similarity to the local user comes along, compared to the least similar of the current nearest neighbors, the least similar one is removed from the list and the new one added.
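One way to maintain such a fixed-length list is a min-heap keyed on similarity, so that the least similar neighbor is always at hand for comparison. The class below is an illustrative sketch only; its name and methods are hypothetical and do not appear in the source files of this appendix.

```python
import heapq


class NearestNeighborList:
    """Keeps the `max_size` most similar users seen so far (illustrative)."""

    def __init__(self, max_size):
        self._max_size = max_size
        self._heap = []  # min-heap of (similarity, user_id)

    def offer(self, user_id, similarity):
        """Add the user if there is room or if it beats the least similar one."""
        entry = (similarity, user_id)
        if len(self._heap) < self._max_size:
            heapq.heappush(self._heap, entry)
            return True
        if similarity > self._heap[0][0]:
            # Remove the least similar neighbor and insert the new one.
            heapq.heapreplace(self._heap, entry)
            return True
        return False

    def neighbors(self):
        """User IDs ordered from most to least similar."""
        return [uid for _, uid in sorted(self._heap, reverse=True)]


nn = NearestNeighborList(max_size=2)
nn.offer("a", 0.1)
nn.offer("b", 0.5)
accepted_c = nn.offer("c", 0.3)   # displaces "a"
accepted_d = nn.offer("d", 0.2)   # not better than the current worst ("c")
```

Each comparison against the current worst neighbor is constant time, and a replacement costs time logarithmic in the list length.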

import binascii
from math import log
import sets
import struct
import sys
import time
import types
import zlib

import applicationmutex
import errorloggerclass
import transposeexceptions
import openexclusive
import plisthandlerclass
import timeutilities
import userpathsclass
import utilities

# Note: the following constants are defined at the bottom
# because they access TasteProfile, which has not
# yet been defined here at the top:
#   LOAD_FROM_STRING
#   LOAD_FROM_UNCOMPRESSED_STRING  # RWS 5/7/04 Added when building cluster files
#   LOAD_FROM_FILE
#   LOAD_FROM_DATABASE

PLIST_SNAPSHOT_FILE = 'snapshot.xml'
# don't include the path here since when testing we need
# to check if userpathsclass.UserPath.getInstance().setTesting()
# has been called, but it is never called before global initializers

FAST_SAVE_VERSION = 1

SONG = 'Name'
SONG_COL = 0
ARTIST = 'Artist'
ARTIST_COL = 1
ALBUM = 'Album'
ALBUM_COL = 2
TOTAL_TIME = 'Total Time'
TOTAL_TIME_COL = 3
PLAY_COUNT = 'Play Count'
PLAY_COUNT_COL = 4
PLAY_DATE = 'Play Date UTC'
PLAY_DATE_COL = 5
RATING = 'Rating'
RATING_COL = 6
DATE_ADDED = 'Date Added'
DATE_ADDED_COL = 7
GENRE = 'Genre'
GENRE_COL = 8

DCT_COL = {SONG: SONG_COL,
           ARTIST: ARTIST_COL,
           ALBUM: ALBUM_COL,
           TOTAL_TIME: TOTAL_TIME_COL,
           PLAY_COUNT: PLAY_COUNT_COL,
           PLAY_DATE: PLAY_DATE_COL,
           RATING: RATING_COL,
           DATE_ADDED: DATE_ADDED_COL,
           GENRE: GENRE_COL}

NUM_COLS = len(DCT_COL)

INNER_PESSIMISTIC_WEIGHT = 50
OUTER_PESSIMISTIC_WEIGHT = 1

MINIMAL_LISTENS = (0, 1, None)  # These count as the same number of listens for reasons explained below

class TasteProfileException(Exception):
    pass


class CalcData(object):
    ckbTrace = False

    def _debugStr(str):
        if CalcData.ckbTrace:
            sys.stderr.write('[CalcData] ' + str + '\n')
            sys.stderr.flush()
    _debugStr = staticmethod(_debugStr)

    def __init__(self, tasteProfile, md5SongHash, lstRow):
        assert type(tasteProfile._getSaveSecsSinceEpoch()) == types.FloatType, type(tasteProfile._getSaveSecsSinceEpoch())
        assert type(lstRow[DATE_ADDED_COL]) == types.FloatType, type(lstRow[DATE_ADDED_COL])
        self._lstRow = lstRow

        # self._tasteProfile = weakref.proxy(tasteProfile)
        self._tasteProfile = tasteProfile
        self._md5SongHash = md5SongHash
        self._iPlayCount = lstRow[PLAY_COUNT_COL]  # Could eliminate this attribute and just do the lookup where needed
        self._iRank = None
        self._fBayesExpectedPlaysPerDay = None

        fSecsDiff = max(0, tasteProfile._getSaveSecsSinceEpoch() - lstRow[DATE_ADDED_COL])
        iDaysInCollection = 1 + int(fSecsDiff // (60 * 60 * 24))  # int and // both round to lower integer
        if md5SongHash == '\xb6l\xcf\xd0\xf3{\xadd\xb1\x18\xb2\xc1fj\x8d\x1b':
            self._debugStr('Brother Goodness from date added, plays, secs, days: '
                           + repr(lstRow[DATE_ADDED_COL])
                           + ', ' + repr(lstRow[PLAY_COUNT_COL]) + ', ' + repr(fSecsDiff)
                           + ', ' + repr(iDaysInCollection))
        iSongCount = len(tasteProfile.getMatrix())
        # Earlier versions of this code had:
        #     if self._iPlayCount in MINIMAL_LISTENS:
        #         fDumbAvgPlaysPerDay = 0.0
        #     else:
        #         fDumbAvgPlaysPerDay = float(self._iPlayCount - 1) / iDaysInCollection
        # below.
        # This comment will explain the rationale for that, and why fDumbAvgPlaysPerDay is now
        # set to 1.0 instead. Providing both rationales allows for a more
        # complete understanding of the issues involved.
        #
        # EXPLANATION FOR ASSIGNMENT OF 0 to fDumbAvgPlaysPerDay
        # Only count plays after the first one. A song is almost always played when it
        # is first added so the first play only says "I am in the collection" and doesn't
        # impart useful data. All the other plays also say "I am in the collection."
        # Also there is always a distortion because songs are played more at
        # first than later, which has nothing to do with taste. So ignoring the
        # first play lessens that distortion.
        # If the play count is None or 0, we count it as if it was 1, because
        # it's likely the user plays on his iPod or only recently upgraded to
        # an iTunes version that handled play counts.
        #
        # EXPLANATION FOR ASSIGNMENT OF 1 to _iPlayCount
        # First of all the calculation (self._iPlayCount - 1) was an error because
        # MINIMAL_LISTENS can be None or 0. But much more importantly:
        # Many users may use iTunes only to get music into their iPods. For instance
        # iTunes uses considerable CPU cycles when it's playing a song. There
        # is little reason not to use one's iPod even when one is sitting at
        # one's computer in order to save those cycles. So many people
        # may not play most songs more than once (or at all) directly through
        # iTunes. If fDumbAvgPlaysPerDay is 0, those songs are ignored for
        # purposes of matching on tastes, which doesn't make sense for such
        # users. In an extreme case, such a user may have never played
        # any songs more than once through iTunes, or maybe only 1 or 2,
        # which would mean those songs would be the sole determinant of
        # his tastes for matching purposes. That would lead to very poor
        # matching performance for those users.
        # So, keeping in mind the factors that led us to originally
        # assign 0, we assign 1 instead, as a reasonable compromise between
        # the various competing issues. Note that another approach to this
        # would be to use a non-zero prior in _fBayesExpectedPlaysPerDay,
        # but the current solution was thought to be better.
        #
        # The following line based on the use of _iFlooredPlayCount
        # is equivalent to:
        #     if self._iPlayCount in MINIMAL_LISTENS:
        #         fAdjustedPlayCount = 1.0
        #     else:
        #         fAdjustedPlayCount = float(self._iPlayCount)
        #     fDumbAvgPlaysPerDay = fAdjustedPlayCount / iDaysInCollection
        fDumbAvgPlaysPerDay = float(self._iFlooredPlayCount) / iDaysInCollection
        if self._lstRow[SONG_COL] == "Tom's Diner (A Cappella)":
            CalcData._debugStr("Tom's Diner's self._iFlooredPlayCount: %i, iDaysInCollection: %i"
                               % (self._iFlooredPlayCount, iDaysInCollection))
        if self._lstRow[SONG_COL] == "XXX":
            CalcData._debugStr("XXX's self._iFlooredPlayCount: %i, iDaysInCollection: %i"
                               % (self._iFlooredPlayCount, iDaysInCollection))
        if self._lstRow[SONG_COL] == 'MP3.com Interview':
            CalcData._debugStr("MP3.com Interview's self._iFlooredPlayCount: %i, iDaysInCollection: %i"
                               % (self._iFlooredPlayCount, iDaysInCollection))

        # Below we do a Bayesian expectation with 0 as the prior expectation
        # and a weight of INNER_PESSIMISTIC_WEIGHT. The main reason we have a pessimistic
        # prior is that when a song is new, if it's good it tends to get played
        # a lot anyway, so the pessimistic prior somewhat compensates for that. Note
        # that INNER_PESSIMISTIC_WEIGHT should be small so that real data overwhelms
        # it pretty quickly (as iDaysInCollection increases).
        self._fBayesExpectedPlaysPerDay = (fDumbAvgPlaysPerDay * iDaysInCollection) \
            / (INNER_PESSIMISTIC_WEIGHT + iDaysInCollection)
        if self._lstRow[SONG_COL] == "Tom's Diner (A Cappella)":
            CalcData._debugStr("Tom's Diner's self._fBayesExpectedPlaysPerDay: %f, fDumbAvgPlaysPerDay: %f"
                               % (self._fBayesExpectedPlaysPerDay, fDumbAvgPlaysPerDay))
        if self._lstRow[SONG_COL] == "XXX":
            CalcData._debugStr("XXX's self._fBayesExpectedPlaysPerDay: %f, fDumbAvgPlaysPerDay: %f"
                               % (self._fBayesExpectedPlaysPerDay, fDumbAvgPlaysPerDay))
        if self._lstRow[SONG_COL] == 'MP3.com Interview':
            CalcData._debugStr("'MP3.com Interview's self._fBayesExpectedPlaysPerDay: %f, fDumbAvgPlaysPerDay: %f"
                               % (self._fBayesExpectedPlaysPerDay, fDumbAvgPlaysPerDay))

        # The existence of a new CalcData object invalidates the existing summary
        # data, so we set the indicators to false in the constructor.
        self._tasteProfile._bReadyToSupplyRawRecommendationData = 0
        self._tasteProfile._bCalcDataObjectsExist = 0

    def getRow(self):
        return self._lstRow

    def getNextPlayProbability(self):
        assert type(self._fBayesExpectedPlaysPerDay) == types.FloatType
        assert self._tasteProfile._fSumBayesExpectedPlaysPerDay != 0
        # One might expect the mean of the prior to be the mean plays
        # per day across the population. But, there is a skewing with new songs because
        # a user will almost definitely play a new song once on the day he adds it
        # to his collection; after that he may never play it again. So, statistics
        # for songs that have only been owned for a brief period are misleading in the direction
        # of making it appear that they are more liked than they may really be. So
        # we start with a mean of 0 for the prior distribution; as the song is in
        # the user's collection for more and more days, the prior will have less
        # and less weight (asymptotically approaching 0).
        fResult = self._fBayesExpectedPlaysPerDay / self._tasteProfile._fSumBayesExpectedPlaysPerDay
        return fResult

    def getBayesExpectedPlaysPerDay(self):
        assert not self._fBayesExpectedPlaysPerDay is None
        return self._fBayesExpectedPlaysPerDay

    def getRawRecommendationData(self):
        '''
        Only populates the relevant data if this is called, since it is
        expensive to populate. Bigger values for fMeanLogNormalizedRank are better.

        Returns (fMeanNormalizedRank, iFlooredPlayCount). iFlooredPlayCount is like
        self._iPlayCount but the minimum value is 1. Note that all songs with
        the same expected plays per day get the same rank.
        '''
        # CalcData._debugStr('In getRawRecommendationData')
        def meanLogRange(fLow, fHigh):
            # Calculate the definite integral of ln(x) where fLow < x < fHigh.
            # Note that this integral is -1 when fLow == 0 and fHigh == 1
            # (see http://functions.wolfram.com/ElementaryFunctions/Log/21/02/01/)
            if fLow > 0.0:
                return ((fHigh * log(fHigh) - fLow * log(fLow)) / (fHigh - fLow)) - 1.0
            else:
                return ((fHigh * log(fHigh)) / (fHigh)) - 1.0

        # We use "levels" rather than absolute frequencies. For one thing, different people may play
        # music much less or much more often than others. For another, someone manipulating the system
        # may play one song FAR more often than others. Using fixed-interval levels eliminates that
        # concern.
        # 1 is the "best" rank (the one with the highest expectation).
        if not self._tasteProfile._bReadyToSupplyRawRecommendationData:
            # Expensive, so only do if necessary.
            lstSongHash = self._tasteProfile.getListSongHashSorted()
            dctUniqueExpectationLevel = {}
            for md5SongHash in lstSongHash:
                calcData = self._tasteProfile.getCalcDataUsingHash(md5SongHash)
                if calcData._iPlayCount in MINIMAL_LISTENS:
                    fLevel = 0.0
                    # We want just one level for all cases where the user isn't playing a song,
                    # to signify that there is no useful data in association with the
                    # song.
                else:
                    fLevel = calcData.getBayesExpectedPlaysPerDay()
                dctUniqueExpectationLevel[fLevel] = None
            lstUniqueExpectationLevel = dctUniqueExpectationLevel.keys()
            lstUniqueExpectationLevel.sort()
            i = iLevelCount = len(lstUniqueExpectationLevel)
            for fUniqueExpectationLevel in lstUniqueExpectationLevel:
                dctUniqueExpectationLevel[fUniqueExpectationLevel] = i
                i -= 1
            assert i == 0
            assert not None in dctUniqueExpectationLevel.items()
            for md5SongHash in lstSongHash:
                calcData = self._tasteProfile.getCalcDataUsingHash(md5SongHash)
                if calcData._iPlayCount in MINIMAL_LISTENS:
                    fLevel = 0.0
                    # We want just one level for all cases where the user isn't playing a song,
                    # to signify that there is no useful data in association with the
                    # song.
                else:
                    fLevel = calcData.getBayesExpectedPlaysPerDay()
                self._tasteProfile.getCalcDataUsingHash(md5SongHash)._iRank = dctUniqueExpectationLevel[fLevel]
            self._tasteProfile._iLevelCount = iLevelCount
            self._tasteProfile._bReadyToSupplyRawRecommendationData = 1
            assert type(self._tasteProfile.getCalcDataUsingHash(md5SongHash)._iRank) == types.IntType
        else:
            iLevelCount = self._tasteProfile._iLevelCount
        if self._iRank == iLevelCount:
            fBucketTop = 1.0  # Don't want to worry about FP division artifacts at limits
        else:
            fBucketTop = float(self._iRank) / iLevelCount
        if self._iRank == 1:
            # Don't want to worry about FP division artifacts at limits
            fBucketBottom = 0.0
        else:
            fBucketBottom = (float(self._iRank - 1.0)) / iLevelCount
        fMeanLogNormalizedRank = -meanLogRange(fBucketBottom, fBucketTop)
        if self._lstRow[SONG_COL] == "Tom's Diner (A Cappella)":
            CalcData._debugStr("%s's fMeanLogNormalizedRank: %f, iLevelCount: %s, _iPlayCount: %s" % \
                (self._lstRow[SONG_COL], fMeanLogNormalizedRank, str(iLevelCount), str(self._iPlayCount)))
        if self._lstRow[SONG_COL] == "XXX":
            CalcData._debugStr("%s's fMeanLogNormalizedRank: %f, iLevelCount: %s, _iPlayCount: %s" % \
                (self._lstRow[SONG_COL], fMeanLogNormalizedRank, str(iLevelCount), str(self._iPlayCount)))
        if self._lstRow[SONG_COL] == 'MP3.com Interview':
            CalcData._debugStr("%s's fMeanLogNormalizedRank: %f, iLevelCount: %s, _iPlayCount: %s" % \
                (self._lstRow[SONG_COL], fMeanLogNormalizedRank, str(iLevelCount), str(self._iPlayCount)))
        return (fMeanLogNormalizedRank, self._iFlooredPlayCount)

    def _getFlooredPlayCount(self):
        if self._iPlayCount in MINIMAL_LISTENS:
            iFlooredPlayCount = 1
        else:
            iFlooredPlayCount = self._iPlayCount
        return iFlooredPlayCount

    _iFlooredPlayCount = property(_getFlooredPlayCount)

    __slots__ = '_lstRow', '_tasteProfile', '_md5SongHash', '_iPlayCount', '_iRank', '_fBayesExpectedPlaysPerDay'


class TasteProfile(object):
    class FileRegistryMutex(applicationmutex.ApplicationMutex):
        pass

    # Constants used with struct to pack/unpack the data
    ckBigEndian = '!'  # defined by struct module
    ckstrHeaderFormat = ckBigEndian + 'hf'  # iVersion, fSaveSecsSinceEpoch
    ckstrItemHeaderFormat = ckBigEndian + '9h'  # iSongNameLen, iArtistLen, iAlbumLen,
                                                # iTotalTimeLen, iPlayCountLen, iPlayDateLen,
                                                # iRatingLen, iDateAddedLen, iGenreLen
    # ckstrItemFormat must be created on the fly since the individual values are only
    # added if they exist. Use these type constants to ensure they are read and written
    # the same way.
    ckStructType_SongName = 's'
    ckStructType_Artist = 's'
    ckStructType_Album = 's'
    ckStructType_TotalTime = 'L'
    ckStructType_PlayCount = 'L'
    ckStructType_PlayDate = 'f'
    ckStructType_Rating = 'B'
    ckStructType_DateAdded = 'f'
    ckStructType_Genre = 's'
    ckbShowTiming = False
    ckbTrace = False
    ckbShowFileLocking = False

    def _debugStr(str):
        if TasteProfile.ckbTrace:
            sys.stderr.write('[TasteProfile] ' + str + '\n')
            sys.stderr.flush()
    _debugStr = staticmethod(_debugStr)

    def __init__(self, loadMethod, xSource, processProgressWriter=None):
        '''
        The load method can be any of the constants listed in the assert just below this
        docstring. strSource is either the string containing the data, or a string
        containing a file name.

        Examples:
            from tasteprofileclass import TasteProfile, LOAD_FROM_STRING, LOAD_FROM_FILE, LOAD_FROM_DATABASE
            prof = TasteProfile(LOAD_FROM_STRING, strStringWithData)
            prof = TasteProfile(LOAD_FROM_DATABASE, strFileName)

        THE LOAD METHODS MUST NOT BE CALLED OUTSIDE OF __INIT__(),
        OTHERWISE CONSISTENCY OF DATA IS NOT GUARANTEED.
        '''
        assert loadMethod in (LOAD_FROM_STRING, LOAD_FROM_UNCOMPRESSED_STRING, LOAD_FROM_FILE,
                              LOAD_FROM_DATABASE, LOAD_FROM_MATRIX_FOR_TESTING)
        self._fSaveSecsSinceEpoch = None  # Populated when data loaded, with the date the data was saved
        self._lstMatrix = []
        self._lstPrevKey = None
        self._processProgressWriter = processProgressWriter
        loadMethod(*(self, xSource))
        self._bCalcDataObjectsExist = 0
        self._bReadyToSupplyRawRecommendationData = 0  # Only made ready if needed because expensive.
        self._fileSaveFast = None
        self._strFast = None
        self._dctCalcData = None
        self._lstSongHashSorted = None  # Populated if/when getListSongHashSorted is called
        # Stuff used by CalcData
        self._fSumBayesExpectedPlaysPerDay = None  # populated by createCalcDataObjects
        self._iLevelCount = None

    def approximateEQ(self, other):
        '''This is not necessarily a comprehensive __eq__, but it is good enough for simple testing.'''
        return self._lstMatrix == other._lstMatrix

    def _getSaveSecsSinceEpoch(self):
        # For use by CalcData only
        return self._fSaveSecsSinceEpoch

    def getCalcDataUsingHash(self, md5SongHash):
        '''
        Returns the CalcData instance if it's there; otherwise returns None.
        '''
        if not self._bCalcDataObjectsExist:
            self._createCalcDataObjects()
            TasteProfile._debugStr('in getCalcDataUsingHash: created _createCalcDataObjects')
        return self._dctCalcData.get(md5SongHash)

    def getListSongHashSorted(self):
        '''
        Returns all the hashes in integer order. If the user has no songs, returns [].
        '''
        # We could get the hashes from CalcData instances, but this may be
        # used in situations where it has not yet been necessary to
        # generate CalcData instances. In general we want to try
        # to conserve memory because background CPU time is less
        # important.
        if self._lstSongHashSorted == None:
            self._lstSongHashSorted = [utilities.makeSongHash(lstRow[ARTIST_COL], \
                lstRow[ALBUM_COL], lstRow[SONG_COL]) for lstRow in self._lstMatrix]
            # Sorted by name
            self._lstSongHashSorted.sort()
        return self._lstSongHashSorted

    # def makeSongHash(lstRow):
    #     # Returns a 16-byte string containing an MD5 hash of the song title,
    #     # artist name, and album name.
    #     #
    #     # This is used in Recommender as well as TasteProfile. It should
    #     # perhaps be moved to utilities.py and accept the 3 strings
    #     # rather than the row for greater reliability.
    #     #
    #     # Search groups.google.com for "md5 with small strings"
    #
    #     strRaw = '%s\0%s\0%s' % (lstRow[ARTIST_COL], lstRow[ALBUM_COL], lstRow[SONG_COL])
    #
    #     # Do some normalization -- lower case, remove punctuation and extra whitespace
    #     strOnePrev = None
    #     strOut = ''
    #     for i in range(len(strRaw)):
    #         strOne = strRaw[i]
    #         # we leave \0 as a separator (see the assignment to strRaw)
    #         # and \\ as part of the Unicode encoding we're using.
    #         if not (strOne.isalnum() or strOne in (' ', '\0', '\\')):
    #             strOne = ' '
    #         if not (strOne == strOnePrev and strOnePrev in (' ', None)):
    #             strOut += strOne
    #         strOnePrev = strOne
    #     strNormalized = strOut.strip().lower()
    #
    #     # MD5 is guaranteed to be consistent across platforms and
    #     # across time. I (GR) have explicit confirmation on that
    #     # from both Tim Peters and Frederick Lundh of the Python inner
    #     # developer's circle. There was a bug in 1996 on 64-bit
    #     # systems that has been fixed.
    #     m = md5.new()
    #     m.update(strNormalized)
    #     md5SongHash = m.digest()
    #     assert types.StringType == type(md5SongHash)
    #     assert len(md5SongHash) == 16
    #     return md5SongHash
    # makeSongHash = staticmethod(makeSongHash)

    def releaseCalcData(self):
        '''
        There are situations, such as right after calculating similarity
        (calculateSimilarity()), when the CalcData isn't needed and it makes sense
        to release it to free memory. So we do so.

        Also, there is a reference cycle after a TasteProfile object with CalcData
        objects is deleted if this isn't called. The gc would eventually collect it
        anyway, but it is a good idea to call this method to speed the process whenever
        the CalcData has been created and is no longer necessary.
        '''
        if self._dctCalcData:
            for md5Key in self._dctCalcData:
                self._dctCalcData[md5Key]._tasteProfile = None
        self._dctCalcData = None
        self._bCalcDataObjectsExist = 0
        self._bReadyToSupplyRawRecommendationData = 0

    def _createCalcDataObjects(self):
        '''
        Creates and populates the CalcData objects except for the "raw" data, which
        is expensive to create and so is only created when needed.
        '''
        self._dctCalcData = {}
        self._fSumBayesExpectedPlaysPerDay = 0.0
        self._bCalcDataObjectsExist = 0
        for lstRow in self._lstMatrix:
            # Songs with incomplete data or that haven't been played are ignored.
            if not lstRow[SONG_COL] or not lstRow[DATE_ADDED_COL]:
                errorloggerclass.ErrorLogger.writeInfo(
                    "empty song name or date added in TasteProfile._createCalcDataObjects. " + \
                    "Row: " + str(lstRow))
                raise transposeexceptions.LogicError
            md5SongHash = utilities.makeSongHash(lstRow[ARTIST_COL], lstRow[ALBUM_COL], \
                lstRow[SONG_COL])
            if not self._dctCalcData.has_key(md5SongHash):  # song/artist/cd dups thrown away
                calcData = CalcData(self, md5SongHash, lstRow)
                self._dctCalcData[md5SongHash] = calcData
                self._fSumBayesExpectedPlaysPerDay += calcData.getBayesExpectedPlaysPerDay()
        self._bCalcDataObjectsExist = 1

def calculateSimilarity (self , nonLocalTasteProfile) :

IT IS IMPORTANT, DURING COMPARISONS TO GROUPS OF OTHER PROFILES, THAT THE NONLOCAL ONE BE IN NONLOCALTASTEPROFILE (instead of calling the method on the nonlocal one and passing it the local one) BECAUSE MEMORY USED BY THE NONLOCAL ONE FOR MATHEMATICAL CALCULATIONS IS FORCIBLY FREED AFTER

PROCESSING IS DONE, BUT THE EQUIVALENT MEMORY FOR THE LOCAL ONE REMAINS. Consider a "shared song" to be a song that is in the collection of both users. This method calculates the probability that the NEXT shared song to come into existence will be the next song played. That is, if user A takes a recommendtion from B's collection, it will be a song that A doesn't have yet. When he has it, it will be another shared song. What is the probability that it will be the next song played, once it is in A's collection? Since we want to find B's who are good at recommending things that A will play a lot, this is exactly the right calculation.

Note that we multiply by the numerator by the number of shared songs . Otherwise, users with a lot of songs in common would be at a disadvantage because the probability that any given one would be played next would be smaller than if there were fewer songs.

        At the end, we call releaseCalcData(), under the assumption we only
        created the calc data in order to do this calc.
        if not self._bCalcDataObjectsExist:
            self._createCalcDataObjects()
        lstOtherSongHash = nonLocalTasteProfile.getListSongHashSorted()

        # Now compute the probability that both users will play the same song next, by
        # computing that probability for each individual song that is in both
        # collections, and summing
        fSumProbability = 0.0
        for md5SongHash in lstOtherSongHash:
            if self._dctCalcData.has_key(md5SongHash):
                # Prob is 0 if this user doesn't have song, so no need to add
                otherUserSong = nonLocalTasteProfile.getCalcDataUsingHash(md5SongHash)
                thisUserSong = self._dctCalcData[md5SongHash]

                # The idea here is that for every (expected) play (in a day) of the other song, we compute the probability
                # that this user will happen to play the same song. So it can be viewed as a sum of probabilities,
                # one for each song the other user is expected to play.
                fSumProbability += thisUserSong.getNextPlayProbability() * otherUserSong.getBayesExpectedPlaysPerDay()
        if fSumProbability == 0.0:
            fResult = 0.0  # The users have no songs in common
        else:
            # Since we're assuming this user has accepted one recommendation from the other user, he's added
            # one more song to his collection, so the probability of any of the existing ones being played
            # is a little less. Obviously the amount less really depends on the exact probability of
            # playing the new one, which we don't know. We just assume it's typical. The calculation
            # would probably work just about as well without this calc, but we do it because, at least for
            # now, we want to focus on at least a decent amount of correctness. The idea is that we don't
            # want to give any credit in the sense of increasing probabilities due to a song that
            # we are recommending and hence know nothing about -- better to be pessimistic
            fSumProbabilityAdusted = fSumProbability * len(self._dctCalcData) / (len(self._dctCalcData) + 1)

            # Below we pretend we have a Bayesian trial for every EXPECTED play. <smile>
            # The OUTER_PESSIMISTIC_WEIGHT is actually the representative of a beta prior with mean
            # 0. (Actually it wouldn't be 0 in reality, just extremely low; we're using 0
            # as a convenient approximation of this number. Since it would be multiplied by 0, it's not represented
            # in the numerator.)
            fResult = fSumProbabilityAdusted / (OUTER_PESSIMISTIC_WEIGHT + nonLocalTasteProfile._fSumBayesExpectedPlaysPerDay)
        nonLocalTasteProfile.releaseCalcData()
        return fResult
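The similarity computation above can be sketched in isolation. The following is a minimal illustration, not the listing's own code: it assumes each profile has already been reduced to plain dictionaries keyed by song identifier, it assumes the denominator's sum of expected plays runs over all of the other user's songs, and the names `shared_play_similarity` and `outer_pessimistic_weight` (with its default of 1.0) are hypothetical.

```python
def shared_play_similarity(this_next_play_prob, other_expected_plays,
                           outer_pessimistic_weight=1.0):
    """Sketch of the shared-play similarity.

    this_next_play_prob:  dict song_id -> probability the local user plays it next
    other_expected_plays: dict song_id -> other user's expected plays per day
    outer_pessimistic_weight: hypothetical stand-in for OUTER_PESSIMISTIC_WEIGHT
    """
    # Sum, over songs in both collections, of
    # P(local user plays song next) * (other user's expected plays per day).
    total = 0.0
    for song_id, plays in other_expected_plays.items():
        if song_id in this_next_play_prob:
            total += this_next_play_prob[song_id] * plays
    if total == 0.0:
        return 0.0  # no songs in common

    # Pessimistic adjustment: pretend one more (unknown) song was added
    # to the local collection, shrinking every existing probability.
    n = len(this_next_play_prob)
    adjusted = total * n / (n + 1)

    # Treat the weight as a prior with mean (approximately) zero.
    return adjusted / (outer_pessimistic_weight + sum(other_expected_plays.values()))


local = {'a': 0.5, 'b': 0.5}
other = {'a': 2.0, 'c': 2.0}
print(shared_play_similarity(local, other))
```

Because the weight appears only in the denominator, a neighbor with very few expected plays is pulled toward zero similarity, which matches the pessimistic intent stated in the comments.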

    def _loadFromITunesDatabase(self, strFileName):
        '''
        Calls plisthandlerclass.PlistHandler(self.processItem) to populate self._lstMatrix
        '''
        # Below we will parse the database.
        # We'll throw away the dictionary output (dctDatabase),
        # which is a tree representing the entire
        # database. It's not needed for this application; the relevant data
        # will be collected by means of self.processItem().
        # Afterwards we will call self._screenCurrentLastRow(). This is done inside
        # plisthandlerclass.PlistHandler(self.processItem) for every row
        # except the last; plisthandlerclass.PlistHandler can't handle
        # the last row, since it doesn't know which row is last. So it's handled here.
        self._debugStr('_loadFromITunesDatabase: About to make a PlistHandler')
        plistHandler = plisthandlerclass.PlistHandler(self.processItem, self._processProgressWriter)
        self._debugStr('_loadFromITunesDatabase: About to make a call processPlistFile')
        dctDatabase = plistHandler.processPlistFile(strFileName, \
            userpathsclass.UserPaths.getInstance().getApplicationSupportPath() + PLIST_SNAPSHOT_FILE)
        self._screenCurrentLastRow()
        del dctDatabase
        self._debugStr('_loadFromITunesDatabase: matrix size before dup elimination: ' + str(len(self._lstMatrix)))
        dctDupFinder = {}
        lstRowNumsToDelete = []
        for i in range(len(self._lstMatrix)):
            lstRow = self._lstMatrix[i]
            md5SongHash = utilities.makeSongHash(lstRow[ARTIST_COL], lstRow[ALBUM_COL], lstRow[SONG_COL])
            if not dctDupFinder.has_key(md5SongHash):
                dctDupFinder[md5SongHash] = lstRow
            else:
                # merge the dups in as useful a way as possible
                lstDupRow = dctDupFinder[md5SongHash]
                self._debugStr('_loadFromITunesDatabase: merging rows: ' + lstDupRow[SONG_COL])
                if (lstDupRow[TOTAL_TIME_COL] is None) \
                   and not (lstRow[TOTAL_TIME_COL] is None):
                    lstDupRow[TOTAL_TIME_COL] = lstRow[TOTAL_TIME_COL]
                if not (lstRow[PLAY_COUNT_COL] is None):
                    if lstDupRow[PLAY_COUNT_COL] is None:
                        lstDupRow[PLAY_COUNT_COL] = lstRow[PLAY_COUNT_COL]
                    else:
                        lstDupRow[PLAY_COUNT_COL] += lstRow[PLAY_COUNT_COL]
                if lstDupRow[PLAY_DATE_COL] is None \
                   or lstDupRow[PLAY_DATE_COL] < lstRow[PLAY_DATE_COL]:
                    if not (lstRow[PLAY_DATE_COL] is None):
                        assert types.FloatType == type(lstRow[PLAY_DATE_COL])
                        lstDupRow[PLAY_DATE_COL] = lstRow[PLAY_DATE_COL]
                if lstDupRow[DATE_ADDED_COL] is None \
                   or lstDupRow[DATE_ADDED_COL] > lstRow[DATE_ADDED_COL]:
                    if not (lstRow[DATE_ADDED_COL] is None):
                        assert types.FloatType == type(lstRow[DATE_ADDED_COL])
                        lstDupRow[DATE_ADDED_COL] = lstRow[DATE_ADDED_COL]
                if lstDupRow[RATING_COL] is None \
                   and not (lstRow[RATING_COL] is None):
                    lstDupRow[RATING_COL] = lstRow[RATING_COL]
                if lstDupRow[GENRE_COL] is None \
                   and not (lstRow[GENRE_COL] is None):
                    lstDupRow[GENRE_COL] = lstRow[GENRE_COL]
                lstRowNumsToDelete.append(i)
        lstRowNumsToDelete.reverse()
        for i in lstRowNumsToDelete:
            del self._lstMatrix[i]
        self._debugStr('_loadFromITunesDatabase: matrix size after dup elimination: ' + str(len(self._lstMatrix)))

        def namePrep(str):
            if str == None:
                str0 = ''
            else:
                str0 = str
            str1 = str0.upper().strip()
            if str1[:4] == 'THE ':
                str2 = str1[4:]
            else:
                str2 = str1
            return str2

        def comparer(lstRow1, lstRow2):
            strArtist1 = namePrep(lstRow1[ARTIST_COL])
            strArtist2 = namePrep(lstRow2[ARTIST_COL])
            if strArtist1 < strArtist2:
                return -1
            elif strArtist1 > strArtist2:
                return 1
            else:  # same artist
                strAlbum1 = namePrep(lstRow1[ALBUM_COL])
                strAlbum2 = namePrep(lstRow2[ALBUM_COL])
                if strAlbum1 < strAlbum2:
                    return -1
                elif strAlbum1 > strAlbum2:
                    return 1
                else:  # same artist and album
                    strSong1 = namePrep(lstRow1[SONG_COL])
                    strSong2 = namePrep(lstRow2[SONG_COL])
                    if strSong1 < strSong2:
                        return -1
                    elif strSong1 > strSong2:
                        return 1
                    else:  # all 3 are equal
                        return 0
        self._lstMatrix.sort(comparer)

        self._bCalcDataObjectsExist = 0
        self._bReadyToSupplyRawRecommendationData = 0
        self._fSaveSecsSinceEpoch = time.time()

    def suppressArtists(self, stArtists):
        ''' remove the data for the artists specified. '''
        if stArtists == None or len(stArtists) == 0:
            return  # nothing to do
        self.releaseCalcData()  # ensure that the cached data is recomputed
        def notSuppressed(row):
            return row[ARTIST_COL] not in stArtists
        self._lstMatrix = filter(notSuppressed, self._lstMatrix)

    def saveFastToString(self):
        ''' The fast string is safe for network transmission. '''
        return uncompressedToFast(self.saveToString())

    def saveFastToFile(self, strFileName):
        ''' RevealedExceptions: transposeexceptions.FileNotThereException '''
        self._strFast = self.saveFastToString()
        if TasteProfile.ckbShowFileLocking:
            sys.stderr.write("TasteProfile.saveFastToFile opening " + strFileName + " exclusively\n")
            sys.stderr.flush()
        self._fileSaveFast = openexclusive.openExclusive(strFileName, 'wb')
        try:
            self._fileSaveFast.write(self._strFast)
            self._fileSaveFast.close()
            if TasteProfile.ckbShowFileLocking:
                sys.stderr.write("TasteProfile.saveFastToFile closed " + strFileName + "\n")
                sys.stderr.flush()
            self._fileSaveFast = None
        finally:
            self.cleanup()

    def saveToString(self):
        ''' The actual code to save to a string. Does not compress the results. '''
        strHeaderStruct = struct.pack(TasteProfile.ckstrHeaderFormat, \
            FAST_SAVE_VERSION, self._fSaveSecsSinceEpoch)
        lstResult = [strHeaderStruct]
        kiTotalTimeLen = 4  # we always save four bytes for the total time
        bFoundOne = False
        for lstRow in self._lstMatrix:
            try:
                strArtist = utilities.noNone(lstRow[ARTIST_COL], '')
                iArtistLen = len(strArtist)
                strAlbum = utilities.noNone(lstRow[ALBUM_COL], '')
                iAlbumLen = len(strAlbum)
                strSongName = utilities.noNone(lstRow[SONG_COL], '')
                iSongNameLen = len(strSongName)
                strGenre = utilities.noNone(lstRow[GENRE_COL], '')
                iGenreLen = len(strGenre)
                iPlayCount = lstRow[PLAY_COUNT_COL]
                iPlayCountLen = self._intLength(iPlayCount)
                fPlayDate = lstRow[PLAY_DATE_COL]
                iPlayDateLen = self._intLength(fPlayDate)
                byteRating = lstRow[RATING_COL]
                iRatingLen = self._byteLength(byteRating)
                fDateAdded = lstRow[DATE_ADDED_COL]
                iDateAddedLen = self._intLength(fDateAdded)
                iTotalTime = utilities.noNone(lstRow[TOTAL_TIME_COL], 0)
                # kiTotalTimeLen defined above since it does not change between loop iterations
                structItemHeader = struct.pack(TasteProfile.ckstrItemHeaderFormat, \
                    iSongNameLen, \
                    iArtistLen, \
                    iAlbumLen, \
                    kiTotalTimeLen, \
                    iPlayCountLen, \
                    iPlayDateLen, \
                    iRatingLen, \
                    iDateAddedLen, \
                    iGenreLen)

                lstItemFormats = [TasteProfile.ckBigEndian + \
                    str(iSongNameLen) + TasteProfile.ckStructType_SongName]
                lstItems = [strSongName]
                if iArtistLen != 0:
                    lstItemFormats.append(str(iArtistLen) + TasteProfile.ckStructType_Artist)
                    lstItems.append(strArtist)
                if iAlbumLen != 0:
                    lstItemFormats.append(str(iAlbumLen) + TasteProfile.ckStructType_Album)
                    lstItems.append(strAlbum)

                lstItemFormats.append(TasteProfile.ckStructType_TotalTime)
                lstItems.append(iTotalTime)
                if iPlayCountLen != 0:
                    lstItemFormats.append(TasteProfile.ckStructType_PlayCount)
                    lstItems.append(iPlayCount)
                if iPlayDateLen != 0:
                    lstItemFormats.append(TasteProfile.ckStructType_PlayDate)
                    lstItems.append(fPlayDate)
                if iRatingLen != 0:
                    lstItemFormats.append(TasteProfile.ckStructType_Rating)
                    lstItems.append(byteRating)
                if iDateAddedLen != 0:
                    lstItemFormats.append(TasteProfile.ckStructType_DateAdded)
                    lstItems.append(fDateAdded)
                if iGenreLen != 0:
                    lstItemFormats.append(str(iGenreLen) + TasteProfile.ckStructType_Genre)
                    lstItems.append(strGenre)
                structItem = struct.pack(''.join(lstItemFormats), *lstItems)

                lstResult.append(structItemHeader)
                lstResult.append(structItem)
                bFoundOne = True
            except SystemExit:
                raise  # catch and raise SystemExit so it is not caught and ignored by the following except:
            except Exception, e:
                errorloggerclass.ErrorLogger.writeException(e)
                errorloggerclass.ErrorLogger.writeInfo("Exception in TasteProfile.saveToString ignored - going on with rest of file")
                pass  # If a record is bad for some reason, ignore it.

                # We don't need every single record
        if not bFoundOne:
            raise TasteProfileException, 'No complete records in iTunes Music Library.xml'
        return ''.join(lstResult)

    def cleanup(self):
        '''
        This MUST be called when an interrupt is received in case the file is being written to disk.
        '''
        if self._fileSaveFast != None:
            TasteProfile._debugStr('writing savefast in cleanup!')
            try:  # ensure we close even if the write fails so that the lock is cleared
                self._fileSaveFast.write(self._strFast)
            finally:
                if TasteProfile.ckbShowFileLocking:
                    sys.stderr.write("TasteProfile.cleanup closing file\n")
                    sys.stderr.flush()
                self._fileSaveFast.close()
                self._fileSaveFast = None

    def _intLength(self, i):
        ''' return the storage length of the int in bytes. Returns 4 if int exists, or zero for None. '''
        if i == None:
            return 0
        else:
            return 4

    def _byteLength(self, byte):
        ''' return the storage length of the byte in bytes. Returns 1 if byte exists, or zero for None. '''
        if byte == None:
            return 0
        else:
            return 1

    def _loadFastFromString(self, strFast):
        ''' Replacement for TasteProfile._loadFastFromString using structs. These load faster, but are byte compatible with the old format. '''
        strStruct = zlib.decompress(binascii.a2b_base64(strFast))
        self._loadFromString(strStruct)

    def _loadFromString(self, strStruct):
        ''' The actual code to load from an uncompressed string. '''
        self._lstMatrix = []
        import gc
        gc.collect()
        self._debugStr('the number of objects is %i; unreach %i' % (len(gc.get_objects()), gc.collect()))
        iHeaderSize = struct.calcsize(TasteProfile.ckstrHeaderFormat)
        iItemHeaderSize = struct.calcsize(TasteProfile.ckstrItemHeaderFormat)
        iVersion, self._fSaveSecsSinceEpoch = struct.unpack(TasteProfile.ckstrHeaderFormat, \
            strStruct[:iHeaderSize])
        assert iVersion == FAST_SAVE_VERSION
        assert self._fSaveSecsSinceEpoch >= 0
        i = iHeaderSize  # tracks the position in strStruct
        while True:
            strHeader = strStruct[i:i+iItemHeaderSize]
            if len(strHeader) == 0:
                break  # we're done
            iSongNameLen, iArtistLen, iAlbumLen, iTotalTimeLen, iPlayCountLen, \
                iPlayDateLen, iRatingLen, iDateAddedLen, iGenreLen = \
                struct.unpack(TasteProfile.ckstrItemHeaderFormat, strHeader)
            i += iItemHeaderSize

            lstItemFormats = [TasteProfile.ckBigEndian, \
                str(iSongNameLen), TasteProfile.ckStructType_SongName]
            if iArtistLen != 0:
                lstItemFormats.append(str(iArtistLen))
                lstItemFormats.append(TasteProfile.ckStructType_Artist)
            if iAlbumLen != 0:
                lstItemFormats.append(str(iAlbumLen))
                lstItemFormats.append(TasteProfile.ckStructType_Album)
            if iTotalTimeLen != 0:
                lstItemFormats.append(TasteProfile.ckStructType_TotalTime)
            if iPlayCountLen != 0:
                lstItemFormats.append(TasteProfile.ckStructType_PlayCount)
            if iPlayDateLen != 0:
                lstItemFormats.append(TasteProfile.ckStructType_PlayDate)
            if iRatingLen != 0:
                lstItemFormats.append(TasteProfile.ckStructType_Rating)
            if iDateAddedLen != 0:
                lstItemFormats.append(TasteProfile.ckStructType_DateAdded)
            if iGenreLen != 0:
                lstItemFormats.append(str(iGenreLen))
                lstItemFormats.append(TasteProfile.ckStructType_Genre)
            iLen = iSongNameLen + iArtistLen + iAlbumLen + iTotalTimeLen + iPlayCountLen + \
                iPlayDateLen + iRatingLen + iDateAddedLen + iGenreLen
            tup = struct.unpack(''.join(lstItemFormats), strStruct[i:i+iLen])
            i += iLen

            # we want None rather than an empty string
            iTupIndex = 0  # next item in the tuple

            lstRow = [tup[iTupIndex]]
            if lstRow[-1] == None or lstRow[-1] == '':
                continue  # unrolled _screenCurrentLastRow for speed
            iTupIndex += 1

            if iArtistLen == 0:
                continue  # unrolled _screenCurrentLastRow for speed
                lstRow.append(None)
            else:
                lstRow.append(tup[iTupIndex])
                if lstRow[-1] == '':
                    continue  # unrolled _screenCurrentLastRow for speed
                iTupIndex += 1

            if iAlbumLen == 0:
                lstRow.append(None)
            else:
                lstRow.append(tup[iTupIndex])
                iTupIndex += 1

            if iTotalTimeLen == 0:
                continue  # unrolled _screenCurrentLastRow for speed
                lstRow.append(None)
            else:
                lstRow.append(tup[iTupIndex])
                if lstRow[-1] == '' or lstRow[-1] < 1000:
                    continue  # unrolled _screenCurrentLastRow for speed
                iTupIndex += 1

            if iPlayCountLen == 0:
                lstRow.append(None)
            else:
                lstRow.append(tup[iTupIndex])
                iTupIndex += 1
            if iPlayDateLen == 0:
                lstRow.append(None)
            else:
                lstRow.append(tup[iTupIndex])
                iTupIndex += 1
            if iRatingLen == 0:
                lstRow.append(None)
            else:
                lstRow.append(tup[iTupIndex])
                iTupIndex += 1
            if iDateAddedLen == 0:
                continue  # unrolled _screenCurrentLastRow for speed
                lstRow.append(None)
            else:
                lstRow.append(tup[iTupIndex])
                if lstRow[-1] == '':
                    continue  # unrolled _screenCurrentLastRow for speed
                iTupIndex += 1
            if iGenreLen == 0:
                lstRow.append(None)
            else:
                lstRow.append(tup[iTupIndex])
                # no need since this is the end of the tuple: iTupIndex += 1
            self._lstMatrix.append(lstRow)
        self._bCalcDataObjectsExist = False
        self._bReadyToSupplyRawRecommendationData = False

    def _loadFastFromFile(self, strFileName):
        ''' RevealedExceptions: transposeexceptions.FileNotThereException '''
        if TasteProfile.ckbShowFileLocking:
            sys.stderr.write("TasteProfile._loadFastFromFile opening " + strFileName + " exclusively\n")
            sys.stderr.flush()
        file = openexclusive.openExclusive(strFileName, 'rb', 1000)
        try:  # ensure file is closed so that lock is cleared
            strFast = file.read()
            self._loadFastFromString(strFast)
            self._bCalcDataObjectsExist = 0
            self._bReadyToSupplyRawRecommendationData = 0
        finally:
            if TasteProfile.ckbShowFileLocking:
                sys.stderr.write("TasteProfile._loadFastFromFile closing " + strFileName + "\n")
                sys.stderr.flush()
            file.close()

    def _screenCurrentLastRow(self):
        if len(self._lstMatrix) == 0:
            return  # no data, then nothing to do

        lst = self._lstMatrix[-1]
        if None in (lst[SONG_COL], lst[ARTIST_COL], lst[DATE_ADDED_COL], lst[TOTAL_TIME_COL]) or \
           '' in (lst[SONG_COL], lst[ARTIST_COL], lst[DATE_ADDED_COL], lst[TOTAL_TIME_COL]) or \
           lst[TOTAL_TIME_COL] < 1000:
            # too short to really be music, may attempt to make personal info available on iPod
            del self._lstMatrix[-1]

    def processItem(self, lstKey, xValue):
        '''
        Populates self._lstMatrix
        '''
        if lstKey[0] == 'Tracks':
            if self._lstPrevKey == None:
                # get things started
                self._lstPrevKey = lstKey[:2]
                self._lstMatrix.append([None] * NUM_COLS)
            elif lstKey[:2] != self._lstPrevKey:
                # Now that we have a complete record, let's make sure we like it.
                self._screenCurrentLastRow()
                self._lstPrevKey = lstKey[:2]
                self._lstMatrix.append([None] * NUM_COLS)

            if DCT_COL.has_key(lstKey[2]):
                i = DCT_COL[lstKey[2]]
                if type(xValue) == types.InstanceType and isinstance(xValue, plisthandlerclass.ConvenientValueInterface):
                    self._lstMatrix[-1][i] = xValue.getConvenientValue()
                else:
                    self._lstMatrix[-1][i] = xValue

    def getMatrix(self):
        assert self._lstMatrix, 'Attempt to access matrix without loading data'
        return self._lstMatrix

    def getArtistList(self):
        return self._getUniqueValue(ARTIST_COL)

    def getAlbumList(self, strArtist=None):
        if strArtist:
            def selectDesiredArtist(lstRow, iCol, strArtist):
                return lstRow[iCol] == strArtist
            iCol = ARTIST_COL
            tupInfo = (iCol, strArtist)
            return self._getUniqueValue(ALBUM_COL, selectDesiredArtist, tupInfo)
        else:
            return self._getUniqueValue(ALBUM_COL)

    def getAlbumAndArtistSet(self, strArtist=None):
        ''' get a list of unique album names, each with the set of corresponding artists. '''
        if strArtist:
            def selectDesiredArtist(lstRow, iCol, strArtist):
                return lstRow[iCol] == strArtist
            return self._getUniqueValueAndSet(ALBUM_COL, ARTIST_COL, \
                selectDesiredArtist, (ARTIST_COL, strArtist))
        else:
            return self._getUniqueValueAndSet(ALBUM_COL, ARTIST_COL)

    def getSongList(self, strArtist=None, strAlbum=None):
        def selectDesiredArtistAndAlbum(lstRow):
            if strArtist != None and lstRow[ARTIST_COL] != strArtist:
                return 0
            # Although we use None for strAlbum to mean that we don't care what albums the
            # songs belong to, we also allow songs that don't have albums. The UI uses
            # an empty string to signify this, so when we get an empty string for the
            # album, allow rows with None for the album to match. Roundup Issue 166
            if strAlbum == '' and lstRow[ALBUM_COL] != '' and lstRow[ALBUM_COL] != None:
                return 0
            if strAlbum != None and strAlbum != '' and lstRow[ALBUM_COL] != strAlbum:
                return 0
            return 1
        return filter(selectDesiredArtistAndAlbum, self._lstMatrix)

    def _getUniqueValue(self, iColumn, screenFunc=None, tupScreenFuncInfo=None):
        dct = {}
        for lstRow in self._lstMatrix:
            if screenFunc:
                if not apply(screenFunc, (lstRow,) + tupScreenFuncInfo):
                    continue
            dct[lstRow[iColumn]] = None
        lst = dct.keys()
        lst.sort()
        return lst

    def _getUniqueValueAndSet(self, iColumn, iSetColumn, screenFunc=None, tupScreenFuncInfo=None):
        ''' get a list of all of the unique values of iColumn and the set of values in iSetColumn for each, filtered by screenFunc using tupScreenFuncInfo. '''
        dct = {}
        for lstRow in self._lstMatrix:
            if screenFunc:
                if not apply(screenFunc, (lstRow,) + tupScreenFuncInfo):
                    continue
            if not dct.has_key(lstRow[iColumn]):
                dct[lstRow[iColumn]] = sets.Set()
            dct[lstRow[iColumn]].add(lstRow[iSetColumn])
        lst = dct.items()
        lst.sort()
        return lst

    def strSongDisplay(self):
        import time
        def x(strRaw, iLen):
            if len(strRaw) < iLen:
                strSpaces = ' ' * (iLen - len(strRaw))
            else:
                strSpaces = ''
            return strRaw[:iLen] + strSpaces
        strOut = 'Song name Plays Date Probability\n'
        for seqRow in self._lstMatrix:
            md5Hash = utilities.makeSongHash(seqRow[ARTIST_COL], \
                seqRow[ALBUM_COL], seqRow[SONG_COL])
            fNextPlayProb = self.getCalcDataUsingHash(md5Hash).getNextPlayProbability()
            strOut += x(seqRow[SONG_COL], 40) \
                + x(str(seqRow[PLAY_COUNT_COL]), 6) \
                + x(str(int((time.time() - seqRow[DATE_ADDED_COL])/60/60/24)), 6) \
                + x(str(fNextPlayProb), 10) + '\n'
        return strOut

    def _loadFromMatrixForTesting(self, lstMatrix):
        # Expected to only be used for testing. Build a song matrix,
        # and use it in the taste profile.
        # Does not do the screening done by other load methods, so
        # should not be used outside of testing.
        self._lstMatrix = lstMatrix
        self._bCalcDataObjectsExist = 0
        self._bReadyToSupplyRawRecommendationData = 0
        self._fSaveSecsSinceEpoch = time.time()

LOAD_FROM_STRING = TasteProfile._loadFastFromString
LOAD_FROM_UNCOMPRESSED_STRING = TasteProfile._loadFromString
LOAD_FROM_FILE = TasteProfile._loadFastFromFile
LOAD_FROM_DATABASE = TasteProfile._loadFromITunesDatabase
LOAD_FROM_MATRIX_FOR_TESTING = TasteProfile._loadFromMatrixForTesting

def uncompressedToFast(strUncompressed):
    ''' convert an uncompressed string of the TasteProfile to a fast string. '''
    return binascii.b2a_base64(zlib.compress(strUncompressed))  # because of repeated artist and album name data
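As a rough illustration of the "fast string" encoding, the sketch below compresses a byte string with zlib and then base64-encodes it so the result is plain ASCII, and reverses the process on load. It is written for Python 3 (the listing itself is Python 2), and the function names `to_fast` and `from_fast` are hypothetical stand-ins for `uncompressedToFast` and the decoding step in `_loadFastFromString`.

```python
import binascii
import zlib


def to_fast(raw):
    """Compress raw bytes, then base64-encode them into ASCII-safe bytes."""
    return binascii.b2a_base64(zlib.compress(raw))


def from_fast(fast):
    """Reverse to_fast: base64-decode, then decompress."""
    return zlib.decompress(binascii.a2b_base64(fast))


# Repeated artist and album names compress well.
sample = b"Artist A\tAlbum X\tSong 1\n" * 50
encoded = to_fast(sample)
assert from_fast(encoded) == sample
print(len(sample), len(encoded))
```

The base64 step costs roughly a third in size over the compressed bytes, which is the price of making the payload safe to carry over text-oriented transports.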

def fastToFile(strFast, strFileName):
    ''' create a TasteProfile file from a fast string. '''
    if TasteProfile.ckbShowFileLocking:
        sys.stderr.write("TasteProfile.fastToFile opening " + strFileName + " exclusively\n")
        sys.stderr.flush()
    f = openexclusive.openExclusive(strFileName, 'wb')
    try:
        f.write(strFast)
    finally:
        f.close()
        if TasteProfile.ckbShowFileLocking:
            sys.stderr.write("TasteProfile.saveFastToFile closed " + strFileName + "\n")
            sys.stderr.flush()
        f = None

def uncompressedToFile(strUncompressed, strFileName):
    ''' create a TasteProfile file from an uncompressed string of the TasteProfile. '''
    fastToFile(uncompressedToFast(strUncompressed), strFileName)

def totalMSecForAllSongs(lstMatrix):
    ''' get the total number of msec for all songs in the supplied matrix. '''
    iMsec = 0
    for row in lstMatrix:
        iMsec += row[TOTAL_TIME_COL]
    return iMsec

= recommenderclass.py =

This module handles the task of using the list of nearest neighbors and their associated profiles for recommendation purposes.

It makes recommendations, subject to an "adventurousness control." When the control is at one extreme, it looks for consensus among neighbors; as it moves toward the other extreme, it is more and more sensitive to opinions of individual users. (In the current embodiment, these opinions are expressed passively simply by recording how many times each song is played.)

import bisect
import errno
import math
import os
import preffileclass
import sets
import struct
import sys
import types
import operator

import activeneighbors
import applicationmutex
import errorloggerclass
from openexclusive import openExclusive
import slaveswitcher
from tasteprofileclass import TasteProfile, LOAD_FROM_STRING, LOAD_FROM_FILE, LOAD_FROM_DATABASE
import tasteprofileclass
import transposeexceptions
import userpathsclass
import transposefileregistrants
import fileregistry
import preffileclass
import sortedneighborlistclass
import tasteprofilepaths
import utilities

# PROTOCOL FOR USING RECOMMENDER OBJECTS:
#
# Recommender.update() should be called whenever
#
# a) a neighbor is added
# b) a neighbor is removed
# c) a neighbor profile is changed
# d) the local user's profile is changed
#
# In addition Recommender.getModTime() should be called
# once per second (or once per minute, whatever is
# convenient) from the main Goombah app process.
#
# If the mod time changes from the last call, then the GUI code
# should call Recommender.getSongList() and update
# the displayed recommendations.

# HOW IS DATA INTEGRITY MAINTAINED?
#
# Without protection there would be opportunity for
# integrity problems with Recommender-related data.
# For instance, a taste profile could be registered
# but not actually used, or vice versa, because
# the program stopped between those operations.
# We address these concerns in two ways. First,
# data.recs is self-consistent because openExclusive
# enables an atomic action to occur upon closing the
# file which makes all changes available.
# Second, file registrations are brought in line
# with the contents of data.recs by _synchronize()
# at the beginning of every update. There is
# still presently some vulnerability if a registered
# file does not actually exist but the likelihood
# of that occurring is relatively low; future work
# could address that concern. A related concern
# is that an unregistered file will still exist, adding
# to a collection of orphaned files over time. We
# can look at addressing that in the future too.
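The polling side of the protocol described in the comments above can be sketched as follows. This is only an illustration under stated assumptions: the listing does not show how `Recommender.getModTime()` is implemented, so the sketch simply assumes the modification time of a file on disk is what gets compared, and the class name `ModTimePoller` is hypothetical.

```python
import os


class ModTimePoller(object):
    """Hypothetical helper: reports whether a file's mod time changed since the last poll."""

    def __init__(self, path):
        self._path = path
        self._last = None  # mod time seen on the previous poll, or None

    def changed(self):
        try:
            mtime = os.path.getmtime(self._path)
        except OSError:
            mtime = None  # a missing file is treated as "no recommendations yet"
        if mtime != self._last:
            self._last = mtime
            return True
        return False


# A GUI timer would call poller.changed() once per second or per minute and,
# when it returns True, re-read and redisplay the recommendation list.
```

Comparing against the previously seen value, rather than against the current clock, keeps the check cheap enough to run on every timer tick.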

SONG_COL = 0
ARTIST_COL = 1
ALBUM_COL = 2
GOODNESS_COL = 3
PLAY_COUNT_COL = 4
USER_COUNT = 5
MD5_HASH_COL = 6

RESET = 1

# Code cleanup idea. Instead of all the tups, could I have methods that are
# returning lots of output return instances, instead, that are of classes
# defined within the caller, which have one attribute for each desired
# output field? That would mean that only the caller would have to know
# anything about the class, except that recipients would be able
# to ask for the fields they need by name.
# See RecFileWrapper below for an example of something like
# what I'm talking about. If it is used, may want
# to make a base class representing the wrapper, and change
# it to automatically close the file at the end of the
# iterator.

def padLength(iLength, roundTo=4):
    '''Return number of pad bytes so that iLength is a multiple of roundTo'''
    return (roundTo - (iLength % roundTo)) % roundTo
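A brief sketch of how a pad-length helper of this kind might be combined with `struct` pad bytes (`x`) to keep variable-length strings aligned. It is written for Python 3, and `pad_length` and `pack_padded` are hypothetical names; only the padding arithmetic is taken from `padLength` above.

```python
import struct


def pad_length(length, round_to=4):
    """Number of pad bytes needed so that length becomes a multiple of round_to."""
    return (round_to - (length % round_to)) % round_to


def pack_padded(name_bytes):
    """Pack a byte string followed by enough zero pad bytes to reach a 4-byte boundary."""
    n = len(name_bytes)
    fmt = '>%ds%dx' % (n, pad_length(n))
    return struct.pack(fmt, name_bytes)


print(pad_length(5), pad_length(8), len(pack_padded(b'hello')))
```

The outer modulo is what makes an already aligned length yield zero pad bytes instead of a full extra block.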

class Recommender(object):
    # *NYI* THIS SHOULD BE DIVIDED INTO RecommenderReader and RecommenderWriter, because
    # the functions of reading the rec file and writing it don't use anything in common,
    # and it is more convenient to not pass neighborbag etc into the object
    # when it is used for reading, but then the object isn't complete. So, it's
    # incorrect as it is.
    # IN FACT MAYBE THIS CAN'T BE USED AS A SOURCE FOR RECS AND AS AN UPDATER AT ONCE
    # BECAUSE LSTMATRIX IS USED IN THE UPDATING PIECE TO ACCESS NAME DATA
    _ckstrStop = 17*chr(255)  # 17 bytes of chr(255) is > largest possible md5 (16*chr(255))

    _ckstrRecDataFormat = '>16sfll'
    # %ix is for pad bytes
    _ckstrRecFormat = '>%is%is%is%ixfII16s'
    _ckbDebug = False

    def _debugStr(str):
        if Recommender._ckbDebug:
            sys.stderr.write('[Recommender] ' + str + '\n')
            sys.stderr.flush()
    _debugStr = staticmethod(_debugStr)

    def __init__(self, bReset=0):
        Recommender._debugStr('In __init__')
        if bReset:
            try:
                os.remove(Recommender._getRecDataFilePath())
            except EnvironmentError, excep:
                if excep.errno != errno.ENOENT:
                    raise
                # if no file there, that's fine, just go on
        self._lstMatrix = []

    def _sendHeartBeat(self):
        ''' ensure a heartbeat at intervals of not more than 4 min '''
        slaveswitcher.sendHeartBeat()

    def update(self, neighborBag, thisUserTasteProfile, fAdventurousness, iRecsWanted):
        # Recommender.update() should be called whenever
        #
        # a) a neighbor is added
        # b) a neighbor is removed
        # c) a neighbor profile is changed
        # d) the local user's profile is changed
        # This version can be called fairly frequently because it returns
        # without actually doing an update if there are no changes to
        # deal with. It does spend some time figuring out whether there
        # are changes, so don't call it more frequently than, say, every
        # 10 seconds.

        # An important aspect of this method is that other things can change in another thread,
        # for instance users may be added or removed from the neighborbag, while this
        # method is executing and that's OK because the files will be written
        # in a consistent state, ignoring such changes after the initial
        # populating of lstBagNeighborID. Same for adventurousness.

        Recommender._debugStr('IN Recommender.update')
        lstBagNeighborID = activeneighbors.getActiveNeighborIDs()
        if lstBagNeighborID == None:
            return

        Recommender._debugStr('Recommender.update lstBagNeighborID: %s' % (repr(lstBagNeighborID),))
        if self.getAdventurousness() != fAdventurousness:
            Recommender._debugStr('self.getAdventurousness() %s != fAdventurousness %s' % (repr(self.getAdventurousness()), repr(fAdventurousness)))
        else:
            Recommender._debugStr('self.getAdventurousness() == fAdventurousness')
        if (not os.access(os.path.expanduser(Recommender._getRecDataFilePath()), os.F_OK)) \
           or (self.getAdventurousness() != fAdventurousness):
            bInitialization = True
        else:
            bInitialization = False

        if bInitialization:
            recDataFile = openExclusive(Recommender._getRecDataFilePath(), 'wb')
            recDataFile.write('\n')  # Initialize to self-consistent "empty" state
            recDataFile.close()

        self._synchronize()

        # To save memory and CPU, we use md5 hashes rather than name strings
        # as identifiers for most of the processing. We plan to get name
        # strings back at the very end of the recommendations process. The following
        # list contains a list of good candidate source profiles for finding any we
        # need. We will look in these first.
        lstRememberingAsPotentialSourcesForNameStrings = []
        setstrTasteProfilePaths = self._getTasteProfilePathsSet()
        ststrCurrentTasteProfilePaths = sets.Set()
        for iNeighborID in lstBagNeighborID:
            strTPPName = tasteprofilepaths.latest(iNeighborID)
            if strTPPName is None:
                raise transposeexceptions.FileNotThereException, 'Apparently missing taste profile for neighbor ID:' + str(iNeighborID)
            ststrCurrentTasteProfilePaths.add(strTPPName)

        # For brevity TPP will stand for taste profile paths
        lststrRemovedNeighborTPP = [strTPP for strTPP in setstrTasteProfilePaths \
            if not strTPP in ststrCurrentTasteProfilePaths]
        lststrAddedNeighborTPP = [strTPP for strTPP in ststrCurrentTasteProfilePaths \
            if not strTPP in setstrTasteProfilePaths]
        Recommender._debugStr('ststrCurrentTasteProfilePaths:' + repr(ststrCurrentTasteProfilePaths))
        Recommender._debugStr('setstrTasteProfilePaths:' + repr(setstrTasteProfilePaths))
        Recommender._debugStr('lststrAddedNeighborTPP:' + repr(lststrAddedNeighborTPP))
        self.iAddedCountForUnitTests = 0
        for strAddedNeighborTPP in lststrAddedNeighborTPP:
            Recommender._debugStr('ADDING: ' + strAddedNeighborTPP)
            iAddedNeighborID = tasteprofilepaths.fileNameToUserID(strAddedNeighborTPP.split('/')[-1])

            # Use Mutex and try block to handle case of unexpected deletion of the
            # file before it's registered
            applicationmutex.grab(tasteprofileclass.TasteProfile.FileRegistryMutex)
            try:
                if not utilities.fileExists(strAddedNeighborTPP):
                    continue
                # try:
                #     fileregistry.register(transposefileregistrants.RecommenderFileRegistrant, strAddedNeighborTPP)
                # except fileregistry.AlreadyRegistered, e:
                #     pass  # we may have registered this in a previous session which the user
                #           # then quit before this process finished and so we would be passed
                #           # this user again
                fileregistry.register(transposefileregistrants.RecommenderFileRegistrant, strAddedNeighborTPP)
            finally:
                applicationmutex.release(tasteprofileclass.TasteProfile.FileRegistryMutex)
            setstrTasteProfilePaths.add(strAddedNeighborTPP)
            mergeProfile = TasteProfile(LOAD_FROM_FILE, strAddedNeighborTPP)
            self._mergeNeighbor(mergeProfile, setstrTasteProfilePaths, fAdventurousness)
            mergeProfile.releaseCalcData()  # Free reference cycle in tasteProfile
            lstRememberingAsPotentialSourcesForNameStrings.append(iAddedNeighborID)
            # Could be useful since these strings are not in existing rec file
            self._sendHeartBeat()
            self.iAddedCountForUnitTests += 1
        self.iRemovedCountForUnitTests = 0
        Recommender._debugStr('lststrRemovedNeighborTPP: ' + repr(lststrRemovedNeighborTPP))
        for strRemovedNeighborTPP in lststrRemovedNeighborTPP:
            iRemovedNeighborID = tasteprofilepaths.fileNameToUserID(strRemovedNeighborTPP.split('/')[-1])
            Recommender._debugStr('REMOVING: ' + str(iRemovedNeighborID))
            setstrTasteProfilePaths.remove(strRemovedNeighborTPP)
            mergeProfile = TasteProfile(LOAD_FROM_FILE, strRemovedNeighborTPP)
            self._removeMergedNeighbor(mergeProfile, setstrTasteProfilePaths, fAdventurousness)
            mergeProfile.releaseCalcData()  # Free reference cycle in tasteProfile

            applicationmutex.grab(tasteprofileclass.TasteProfile.FileRegistryMutex)
            try:
                # try:
                #     fileregistry.release(transposefileregistrants.RecommenderFileRegistrant, strRemovedNeighborTPP)
                # except fileregistry.NotRegistered, e:
                #     pass  # we may have released this in a previous session which the user
                #           # then quit before this process finished and so we would be passed
                #           # this user again
                fileregistry.release(transposefileregistrants.RecommenderFileRegistrant, strRemovedNeighborTPP)
            finally:
                applicationmutex.release(tasteprofileclass.TasteProfile.FileRegistryMutex)
            self._sendHeartBeat()
            self.iRemovedCountForUnitTests += 1

        # The following two lines make lstNewestFirst a unique list of neighbor ID's with the newly added ones first.
        # This ordering is used to more efficiently retrieve the name strings for the new song
        # when they are needed.
        lstNewestFirst = []
        [lstNewestFirst.append(i) for i in (lstRememberingAsPotentialSourcesForNameStrings + lstBagNeighborID) if not lstNewestFirst.count(i)]
        self._saveRecommendationFile(thisUserTasteProfile, fAdventurousness, lstNewestFirst, iRecsWanted)
        thisUserTasteProfile.releaseCalcData()  # Defensive programming -- hopefully, none to release.

    def _synchronize(self):
        # This is to handle inconsistencies that could come up after a crash, or even
        # simply as a result of the user quitting while recommendations were in progress.
        # The contents of self._getTasteProfilePathsSet() is taken to be the Bible, because
        # that file is guaranteed self-consistent.
        # Note that we can't here fix cases where a file we're supposed to have has
        # actually been physically deleted.
        applicationmutex.grab(tasteprofileclass.TasteProfile.FileRegistryMutex)
        try:
            setstrTasteProfilePaths = self._getTasteProfilePathsSet()
            for strTasteProfilePath in setstrTasteProfilePaths:
                if not fileregistry.isRegistered(transposefileregistrants.RecommenderFileRegistrant, strTasteProfilePath):
                    fileregistry.register(transposefileregistrants.RecommenderFileRegistrant, strTasteProfilePath)
            lstRegisteredPaths = fileregistry.getFilePathsForRegistrant(transposefileregistrants.RecommenderFileRegistrant)
            lstUnnecessaryRegisteredPaths = [strPath for strPath in lstRegisteredPaths if strPath not in setstrTasteProfilePaths]
            for strPath in lstUnnecessaryRegisteredPaths:
                fileregistry.release(transposefileregistrants.RecommenderFileRegistrant, strPath)
        finally:
            applicationmutex.release(tasteprofileclass.TasteProfile.FileRegistryMutex)

    def _getTasteProfilePathsSet():
        recDataFile = openExclusive(Recommender._getRecDataFilePath(), 'rb')
        try:
            setstrTasteProfilePaths = sets.Set([s for s in recDataFile.readline().strip().split('\t') if len(s) > 0])
        finally:
            try:
                recDataFile.close()
            except SystemExit:
                raise
            except:
                pass
        return setstrTasteProfilePaths
    _getTasteProfilePathsSet = staticmethod(_getTasteProfilePathsSet)

    def _getRecFilesDirPath():
        return userpathsclass.UserPaths.getInstance().getApplicationSupportPath()
    _getRecFilesDirPath = staticmethod(_getRecFilesDirPath)

    def _getRecDataFilePath():
        return Recommender._getRecFilesDirPath() + "data.rec"
    _getRecDataFilePath = staticmethod(_getRecDataFilePath)

    def _getRecFilePath():
        return Recommender._getRecFilesDirPath() + "recs.rec"
    _getRecFilePath = staticmethod(_getRecFilePath)
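The reconciliation performed by `_synchronize` amounts to two set differences against the path list stored on the first line of the data file. The following Python 3 sketch illustrates that idea under simplifying assumptions: the registry is modeled as a plain collection of paths, and `parse_path_header` and `reconcile` are hypothetical helper names that do not appear in the original code.

```python
def parse_path_header(first_line):
    """Split the tab-separated first line of the data file into a set of non-empty paths."""
    return {s for s in first_line.strip().split("\t") if s}


def reconcile(authoritative_paths, registered_paths):
    """Return (to_register, to_release) so that the registry matches the authoritative set."""
    authoritative = set(authoritative_paths)
    registered = set(registered_paths)
    to_register = sorted(authoritative - registered)
    to_release = sorted(registered - authoritative)
    return to_register, to_release


paths = parse_path_header("a.tp\tb.tp\n")
print(reconcile(paths, ["b.tp", "c.tp"]))
```

Treating the data file as the single source of truth means the registry can always be repaired after an interrupted run without inspecting any other state.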

    def _mainValueIteratorFactory():
        Recommender._debugStr('In _mainValueIteratorFactory')
        recDataFile = openExclusive(Recommender._getRecDataFilePath(), 'rb')
        recDataFile.readline()  # skip first line
        iReadLen = struct.calcsize(Recommender._ckstrRecDataFormat)
        while 1:
            strData = recDataFile.read(iReadLen)
            if '' == strData:  # EOF
                break
            (md5Hash, fGoodnessSum, iPlayCount, iUserCount) \
                = struct.unpack(Recommender._ckstrRecDataFormat, strData)
            yield (md5Hash, fGoodnessSum, iPlayCount, iUserCount)
        recDataFile.close()
        yield (Recommender._ckstrStop, None, None, None)
        assert 0, "mainValueIterator should never fall through"
    _mainValueIteratorFactory = staticmethod(_mainValueIteratorFactory)

    def _mergeValueIteratorFactory(mergeProfile, fAdventurousness=None):
        Recommender._debugStr('In _mergeValueIteratorFactory: Number of songs: ' + str(len(mergeProfile.getListSongHashSorted())))
        for md5SongHash in mergeProfile.getListSongHashSorted():
            # Bigger is better for fGoodness
            (fRawGoodness, iPlayCount) = mergeProfile.getCalcDataUsingHash(md5SongHash).getRawRecommendationData()
            # if md5SongHash == '\xb6l\xcf\xd0\xf3{\xadd\xb1\x18\xb2\xc1fj\x8d\x1b':
            #     Recommender._debugStr('Brother Goodness from Merge Value iterator access to calcdata: ' + repr(fRawGoodness))
            if fAdventurousness in (None, 1.0):
                fGoodness = fRawGoodness
            else:
                # if fAdventurousness == 0.0:
                #     fLowVal = 10.0
                # else:
                #     fLowVal = -math.log(fAdventurousness**3)
                # fGoodness = max(fLowVal, fRawGoodness)
                # 3 is used because for an adventurousness==.5,
                # the most extreme ranges are roughly a factor of 2 away
                # fAdventurousnessCubed = fAdventurousness**3
                # fGoodness = fRawGoodness**fAdventurousnessCubed
                fGoodness = fRawGoodness ** (-(math.log(1.0 - fAdventurousness) / (math.log(10) - math.log(.01))))
                # with x as the exponent above, (1-fAdventurousness)*(10**x) == (.01**x), meaningful if
                # 10 and .01 are the extremes of fRawGoodness
            # if md5SongHash == '\xb6I\xcf\xd0\xf3{\xadd\xb1\x18\xb2\xc1fj\x8d\x1b':
            #     Recommender._debugStr('Brother Goodness from Merge Value iterator Factory: ' + repr(fGoodness))
            yield (md5SongHash, fGoodness, iPlayCount)
        Recommender._debugStr('In _mergeValueIteratorFactory: about to do final yield')
        yield (Recommender._ckstrStop, None, None)
        assert 0, "mergeValueIterator should never fall through"
    _mergeValueIteratorFactory = staticmethod(_mergeValueIteratorFactory)

    def _writeOneSong(outFile, md5Hash, fGoodnessSum, iPlayCount, iUserCount):
        outFile.write(struct.pack(*(Recommender._ckstrRecDataFormat, md5Hash, fGoodnessSum, iPlayCount, iUserCount)))
    _writeOneSong = staticmethod(_writeOneSong)

    def _mergeNeighbor(self, mergeProfile, setstrTasteProfilePaths, fAdventurousness):
        Recommender._debugStr('In _mergeNeighbor')
        utilities.ensureDirectory(Recommender._getRecFilesDirPath())
        outFile = openExclusive(os.path.expanduser(Recommender._getRecDataFilePath()), 'wb')
        outFile.truncate()
        outFile.write('\t'.join([strPath for strPath in list(setstrTasteProfilePaths)]) + '\n')
        main = self._mainValueIteratorFactory()
        merge = self._mergeValueIteratorFactory(mergeProfile, fAdventurousness)
        # The following is a classic 2-file merge algorithm that is
        # taught to COBOL programmers in business-oriented programming classes.
        # Recommender._ckstrStop is higher than any possible legit value and
        # is used to very simply handle the situation
        # of getting to the end of one file before the other.
        # It may take a *little* bit of effort to wrap one's brain around the
        # algorithm as a whole, but the good news is: it Just Works.
        (md5MainHash, fMainGoodnessSum, iMainPlayCount, iMainUserCount) = main.next()
        (md5MergeHash, fMergeGoodness, iMergePlayCount) = merge.next()
        while not (md5MainHash == Recommender._ckstrStop and md5MergeHash == Recommender._ckstrStop):
            if md5MainHash < md5MergeHash:
                self._writeOneSong(outFile, md5MainHash, fMainGoodnessSum, iMainPlayCount, iMainUserCount)
                (md5MainHash, fMainGoodnessSum, iMainPlayCount, iMainUserCount) = main.next()
            elif md5MainHash > md5MergeHash:
                self._writeOneSong(outFile, md5MergeHash, fMergeGoodness, iMergePlayCount, 1)
                (md5MergeHash, fMergeGoodness, iMergePlayCount) = merge.next()
            else:  # they're equal
                self._writeOneSong(outFile, md5MainHash, fMainGoodnessSum + fMergeGoodness, \
                                   iMainPlayCount + iMergePlayCount, iMainUserCount + 1)
                (md5MainHash, fMainGoodnessSum, iMainPlayCount, iMainUserCount) = main.next()
                (md5MergeHash, fMergeGoodness, iMergePlayCount) = merge.next()
        # Due to openExclusive, the following close() atomically causes
        # the low-level disk file to be populated with the written data
        outFile.close()

    def _removeMergedNeighbor(self, mergeProfile, setstrTasteProfilePaths, fAdventurousness):
        utilities.ensureDirectory(Recommender._getRecFilesDirPath())
        outFile = openExclusive(os.path.expanduser(Recommender._getRecDataFilePath()), 'wb')
        outFile.write('\t'.join([strPath for strPath in list(setstrTasteProfilePaths)]) + '\n')
        main = self._mainValueIteratorFactory()
        merge = self._mergeValueIteratorFactory(mergeProfile, fAdventurousness)
        # See _mergeNeighbor for notes on the algorithm below
        (md5MainHash, fMainGoodnessSum, iMainPlayCount, iMainUserCount) = main.next()
        (md5MergeHash, fMergeGoodness, iMergePlayCount) = merge.next()
        while not (md5MainHash == Recommender._ckstrStop and md5MergeHash == Recommender._ckstrStop):
            if md5MainHash < md5MergeHash:
                self._writeOneSong(outFile, md5MainHash, fMainGoodnessSum, iMainPlayCount, iMainUserCount)
                (md5MainHash, fMainGoodnessSum, iMainPlayCount, iMainUserCount) = main.next()
            elif md5MainHash > md5MergeHash:
                assert 0, "Logic error; attempt to remove merge file already removed"
            else:  # they're equal
                if iMainUserCount > 1:
                    self._writeOneSong(outFile, md5MainHash, fMainGoodnessSum - fMergeGoodness, \
                                       iMainPlayCount - iMergePlayCount, iMainUserCount - 1)
                (md5MainHash, fMainGoodnessSum, iMainPlayCount, iMainUserCount) = main.next()
                (md5MergeHash, fMergeGoodness, iMergePlayCount) = merge.next()
        # Due to openExclusive, the following close() atomically causes
        # the low-level disk file to be populated with the written data
        outFile.close()
        Recommender._debugStr('In _removeMergedNeighbor, closing output file')
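The two-file merge with a stop sentinel used by `_mergeNeighbor` and `_removeMergedNeighbor` can be sketched on in-memory lists. This is an illustrative Python 3 sketch rather than the original code: it replaces the files with lists of tuples, and the names `STOP`, `with_stop`, `merge_neighbor`, and `remove_neighbor` are hypothetical. The sentinel value is an assumption chosen to sort after any 16-byte hash.

```python
STOP = b"\xff" * 17  # assumed sentinel: compares greater than any 16-byte hash


def with_stop(records, width):
    """Yield each record, then a sentinel record padded to `width` fields."""
    for record in records:
        yield record
    yield (STOP,) + (None,) * (width - 1)


def merge_neighbor(main_records, neighbor_records):
    """Merge sorted (hash, goodness, plays) into sorted (hash, goodness_sum, plays, users)."""
    main, merge = with_stop(main_records, 4), with_stop(neighbor_records, 3)
    out = []
    m_hash, m_good, m_plays, m_users = next(main)
    n_hash, n_good, n_plays = next(merge)
    while not (m_hash == STOP and n_hash == STOP):
        if m_hash < n_hash:      # only the main file has this song
            out.append((m_hash, m_good, m_plays, m_users))
            m_hash, m_good, m_plays, m_users = next(main)
        elif m_hash > n_hash:    # only the neighbor has this song
            out.append((n_hash, n_good, n_plays, 1))
            n_hash, n_good, n_plays = next(merge)
        else:                    # both have it: accumulate
            out.append((m_hash, m_good + n_good, m_plays + n_plays, m_users + 1))
            m_hash, m_good, m_plays, m_users = next(main)
            n_hash, n_good, n_plays = next(merge)
    return out


def remove_neighbor(main_records, neighbor_records):
    """Inverse of merge_neighbor: subtract a previously merged neighbor."""
    main, merge = with_stop(main_records, 4), with_stop(neighbor_records, 3)
    out = []
    m_hash, m_good, m_plays, m_users = next(main)
    n_hash, n_good, n_plays = next(merge)
    while not (m_hash == STOP and n_hash == STOP):
        if m_hash < n_hash:
            out.append((m_hash, m_good, m_plays, m_users))
            m_hash, m_good, m_plays, m_users = next(main)
        elif m_hash > n_hash:
            raise ValueError("neighbor song is missing from the main records")
        else:
            if m_users > 1:      # drop the song when this neighbor was its only source
                out.append((m_hash, m_good - n_good, m_plays - n_plays, m_users - 1))
            m_hash, m_good, m_plays, m_users = next(main)
            n_hash, n_good, n_plays = next(merge)
    return out
```

Because the sentinel compares greater than every real hash, the loop needs no special case for one input running out before the other; the exhausted side simply loses every comparison until both sides reach the sentinel.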

    class RecFileWrapper(object):
        _ckbDebug = False

        def _debugStr(str):
            if Recommender.RecFileWrapper._ckbDebug:
                sys.stderr.write('[RecFileWrapper] ' + str + '\n')
                sys.stderr.flush()
        _debugStr = staticmethod(_debugStr)

        class RecFileRecord(object):
            def __init__(self, strSong, strArtist, strAlbum, fBayesGoodness, iPlayCount, iUserCount, md5Hash):
                self.strSong = strSong
                self.strArtist = strArtist
                self.strAlbum = strAlbum
                self.fBayesGoodness = fBayesGoodness
                self.iPlayCount = iPlayCount
                self.iUserCount = iUserCount
                self.md5Hash = md5Hash

def getRow(self) : assert self .strSong, ' RecFileWrapper. getRow() : There must be a song name' return (self . strSong, utilities. useNone( self .strArtist,

''), utilities .useNone (self. strAlbum, ''), self . fBayesGoodness , self .iPlayCount, self .iUserCount, self .md5Hash) def init (self, strFileMode, fAdventurousness=None) :

For writing, fAdventurousness must be supplied.

There are no read methods -- use the iterator. assert (not 'w' in strFileMode) or fAdventurousness != None, 'need fAdventurousness if rec file opening for writing' try: utilities . ensureDirectory (Recommender ._getRecFilesDirPath ( ) ) self ._file = openExclusive (Recommender._getRecFilePath() , strFileMode) self._isOpen = True if 'w' in strFileMode: # Recommender.RecFileWrapper._debugStr( 'Writing fAdventurousness:' + repr (fAdventurousness) ) self ._file.write (struct.pack ( ">f ' , fAdventurousness) ) if 'r' in strFileMode:

# Recommender .RecFileWrapper ._debugStr ( ' Skipping fAdventurousness:' + repr (fAdventurousness) ) self ._file. read (struct .calcsize (' >f ) ) # Read past fAdventurousness except transposeexceptions.FileNotThereException: self ._isOpen = False # Can only happen if openning for reading except : raise def write(self, recFileRecord) : tupRowComplete = (utilities.noNone (recFileRecord. strSong, "), \

utilities.noNone (recFileRecord. strArtist, '') , \

utilities. noNone (recFileRecord. strAlbum, ' ' ) , recFileRecord. fBayesGoodness , recFileRecord. iPlayCount, recFileRecord. iUserCount, recFileRecord.md5Hash) tupStringLengths = ( len( tupRowComplete [ 0] ) , len( tupRowComplete [1] ) , len (tupRowComplete [2 ] ) ) strl = struct. pack(*( ( ">3H' , ) + tupStringLengths)) # Must add pad bytes in middle and to end of structure to match native Darwin format tupStringLengthsAndPad = (tupStringLengths + (padLength(reduce (operator. add, tupStringLengths, 0)), )) str2 = struct .pack(* ( (Recommender._ckstrRecFormat % tupStringLengthsAndPad, ) + tupRowComplete) ) self ._file.write (strl + str2) def close (self) : if self ._isOpen: self ._file. close ( ) self._isOpen = False def iter (self) :

If the file isn't there, the iterator will simply be empty " " " assert struct. calcsize (' >lθsfII ' ) == struct . calcsize ( ' 16sfII ' ) if self ._isOpen: while 1 : strStringLengths = self ._file. read (struct .calcsize ( '>3H' ) ) if len (strStringLengths) == 0: break tupStringLengths = struct .unpack (' >3H' , strStringLengths) # Must add pad bytes to match native Darwin format tupStringLengthsAndPad = (tupStringLengths + (padLength (reduce (operator. add, tupStringLengths, 0)), )) strFormat = Recommender._ckstrRecFormat % tuple (tupStringLengthsAndPad) strRowComplete = self ._file . read (struct . calcsize ( strFormat) ) assert len (strRowComplete) == struct. calcsize ( strFormat) , \ "Expected length %d, but got %d" % (struct. calcsize (strFormat) , len (strRowComplete) ) recFileRecord = Recommender . RecFileWrapper . RecFileRecord ( *struct .unpack ( strFormat, strRowComplete) ) yield recFileRecord def _saveRecommendationFile(self, thisUserTasteProfile, fAdventurousness, IstBagNeighborlDNewestFirst, iRecsWanted) : Saves the actual recommendations consistent with fAdventurousness . fAdventurousness is 0 to 1, with 1 being the most adventurous recommendations .

Intended only to be called from self .update () .

# Note that in the following, we rebuild the top song list from scratch every time. # 1) One reason is that it means that if RECDATAFILE and

RECFILE get out of sync # because a signal or error occurs after the first is done and before the second # is done, it will not cause any problems. 2) Another reason is that a change # to the algorithm, or just to fAdventurousness, may result in criteria that # are incompatible with the earlier criteria in such a way that songs that are # worse may have better "goodness expectations" and may therefore stay # in the list; this can't happen if the list is rebuilt from scratch.

# There are two reasons we do this file matching algorithm to locate songs # the user already owns (and that we therefore don't want to present # in the recommendation list) rather than just looking the # hashes up in a dictionary. 1) the dictionary in TasteProfile involves # instantiating the CalcData objects which takes a lot of memory. We # are trying to minimize memory use by just relying on the sorted list # of hashes. 2) This algorithm is very CPU efficient since it just # traverses both lists in their sort orders; it doesn't need to repetively search # or look up. Recommender._debugStr (' In _saveRecommendationFile ' ) def alreadyPresentHashlteratorFactoryO : for md5SongHash in thisUserTasteProfile . getListSongHashSorted ( ) : yield md5SongHash thisUserTasteProfile.releaseCalcData( ) # Defensive programming, in case needed in future. Does no harm now. yield Recommender._ckstrStop assert 0, 'alreadyPresentHashlteratorFactoryO should not fall through to end¹

# NYI instead of making a tup with a different ordering, I could make

# a tup subclass that sorts in the desired sequence using the

# original ordering. Then I could just pass the tup around instead # of naming the items, which would make the code easier-to-read

# and more reliable. The main thing that held me back from doing # it is that I have been trying to confirm tha bisect will # continue to use It or cmp in the future. I did find # from the python mailing list that I can use It rather # than cmp and that that would be better. # 1 did test bisect with the following tup subclass and it # did work as expected. # class GoodnessSortedTuple (types .TupleType) : # # Assumes fGoodness is at the position indexed by 1 # def cmp (self, other) : # return cmp (self [1] , other [1] ) lstRecommendations = [] # list of tups, each with sequence (fBayesGoodness, md5MainHash, iMainPlayCount, iMainUserCount) main = self ._mainValueIteratorFactory ( ) merge = alreadyPresentHashlteratorFactory ( ) (md5MainHash, fMainGoodnessSum, iMainPlayCount, iMainUserCount) = main.next ( ) md5MergeHash = merge . next ( ) while not (md5MainHash == Recommender._ckstrStop and md5MergeHash == Recommender._ckstrStop) : if md5MainHash < md5MergeHash: #Possible recommendation -- this user doesn ' t have # The following line is the Bayesian part of our math. By way of explanation, let: # b = the goodness we assume from background information; # w = the weight we give to background information, compared to 1 person ' s data # g = the mean goodness conferred by our population for the item # n = the number of people in our population # # Then the formula were using is (b*w + g*n) / (w + n) . Explaining this formula # as a general principle is beyond the scope of this note (ask Gary) . But # with regard to the specifics of this application: # b = 0. We are looking at the very nearest neighbors to the one we ' re # recommending to. But if you just took a song randomly picked # from the entire universe of songs, it probably won ' t be in # any of their collections. (Or if it did, it would only be # due to random chance, by definition, and the odds of a taste # propensity for the song would be low. 
) # The lowest-ranking goodnesses have ranks asymptotically # approaching 0. So for convenience we just use 0. That means that # the b*w term disappears. # w = ((the number of neighbors) ** (1- fAdventurousness) ) -1. # First let ' s take the case where # fAdventurousness is 0. Then w = (the number of neighbors) -1. This is saying # that the weight we give our assumption that the goodness is # 0 is almost equal to the combined weight of the user's neighbors, if every # population member had the song. That seems like a VERY conservative # position to take if the neighbor bag is at all large. Now say fAdventurousness # is 1. Then w = 0, and if only 1 neighbor has a song, and it is the song # with the most goodness on his system, then that song floats toward the # very top of the list. So, that's very adventurous: the user will # see songs that few people have, maybe just one, but who love them. Finally # say fAdventurousness is .5, and for exemplary purposes say the population # has 1000 members. Then w=31.62 which seems reasonable, because 31 # is substantial weight, but not overwhelming to a reasonable number of # actual people having it. I (GR) think we may find a better formula # in the future, but this is a user-settable option so getting the formula # perfect isn't that important (and may not even be a meaningful) # thing to try to do.

# fBayesGoodness = float (fMainGoodnessSum) \

# / ( (len(lstBagNeighborΙDNewestFirst) **(1.0 - fAdventurousness) - 1.0) + iMainUserCount)

# IF WE ARE USING THE FORMULA BELOW, THEN THE ABOVE NOTES ARE WRONG FOR FBAYESGOODNESS ! XXX # AND IT ISN'T EVEN BAYESIAN

# OK, so now we have a very nicely-calculated real- number ranking, with incredible accuracy. # ut how much USEFUL INFORMATION is there? Not very much. First of all tastes aren't all that # specific, firefly found that with a 7 point rating scale, people were frequently 1 rating off # if they rerated something the next week. 2nd of all, we're not even relying on ratings, # we're inferring how much something is liked by looking at listening patterns. So we shouldn't # assume we even have any reliability if we mapped to a 7-point scale.

# Moreover, the calculation we're using, though sensible from an abstract perspective, # results in the smallest ratings being hundreds of times smaller than the biggest ratings # -- which really makes no sense given the vagueries of the data.

# So, we set a minimum cutoff for the score. We keep the fine gradations at the high level, # because the songs a user really plays more than any others are highly likely to be really # liked so there's more data there. In contrast, low play levels may just mean the user # is tired of the song cuz he's heard it 10,000 times in other contexts, and that he loves it. # Our approach is to have a low-end-cutoff for the number, on a log scale. For "most adventurous" # the cutoff is 0. For least, it's as if every vote is in a top 1% percentile.

                fWeightedAverageGoodnessSum = (1.0 - fAdventurousness) \
                    * iMainUserCount + fAdventurousness * fMainGoodnessSum
                fBayesGoodness = float(fWeightedAverageGoodnessSum) / \
                    ((1.0 - fAdventurousness) + iMainUserCount * fAdventurousness)

# if (IstRecommendations == []) or fBayesGoodness > IstRecommendations [0] [0] :

# Recommender._debugSt (' line 446: fBayesGoodness, %f, fMainGoodnessSum, %f, fAdventurousness: %f, iMainUserCount: %i, md5 :

%s' \ # % (fBayesGoodness, fMainGoodnessSum, fAdventurousness, iMainUserCount, repr (md5MainHash) ) ) if ( len (IstRecommendations) < iRecsWanted) or fBayesGoodness > IstRecommendations [ 0] [0] : bisect. insort (IstRecommendations, (fBayesGoodness, md5MainHash, iMainPlayCount, iMainUserCount)) if le (IstRecommendations) > iRecsWanted: del IstRecommendations [0]

# Recommender._debugStr( 'Recommender: in main rec loop: fBayesGoodness: %f, fAdventurousness: %f, fMainGoodnessSum: %f, iMainUserCount : %i ' % \

# (fBayesGoodness, fAdventurousness, fMainGoodnessSum, iMainUserCount)) (md5MainHash, fMainGoodnessSum, iMainPlayCount, iMainUserCount) = main.nex O elif md5MainHash > md5MergeHash: #This user has it, but no neighbor does . Not to be recommende ! md5MergeHash = merge.next ( ) else: # .This user already has the potential recommendation -- don't recommend (md5MainHash, fMainGoodnessSum, iMainPlayCount, iMainUserCount) = main.nextO md5MergeHash = merge . next ( ) self._sendHeartBeat ( )

# Below we do the work for finding name strings for the recommendations . We get them # wherever we can find them, in order of efficiency. IT IS

RECOMMENDED THAT # THIS CODE EVENTUALLY BE REPLACED WITH SOME KIND OF COMPRESSED STRING # DATABASE. self ._loadRecommendationFile() # loads self ._lstMatrix from current (old) rec file # Could do this with the 2-file merge algorithm rather than a dictionary, but # the recs are limited in # number and therefore less obnoxious memory-wise than other cases # where we choose the merge. And we had to populate IstMatrix anyway because # of the pre-existing code base. We may want to do things differently in # the future. dctHash2Row = {} # By the end of this for loop, self ._lstMatrix is empty. for i in xrange (len(self ._lstMatrix) - 1, -1, -1): IstRow = self ._lstMatrix[i] md5Hash = utilities .makeSongHash(lstRow[ARTIST_COL] , lstRow[ALBUM_COL] , IstRow[SONG_COL] )

# if 'Brother Bones' in IstRow[ARTIST_COL] : # self ._debugStr( 'Brother Bones MD5 : ' + repr (md5Hash) ) IstNewRow = 3 * [None] lstNewRow[SONG_COL] = IstRow[SONG_COL] IstNewRow[ARTIST_COL] = IstRow[ARTIST_COL] IstNewRow[ALBUM_COL] = IstRow[ALBUM_COL] dctHash2Row[md5Hash] = IstNewRow del self ._lstMatrix[i] # Not needed any more assert self ._lstMatrix == [] self ._sendHeartBeat ( )

# utilities . ensureDirectory (Recommender ._getRecFilesDirPath ( ) )

# outFile = openExclusive (Recommender ._getRecFilePath() , 'wb' )

# outFile.write (struct .pack ( ' f ' , fAdventurousness) ) # Do the loop so that the best is written first. For each recommendation, # The approach is to first try to get name strings from the list # that came from the recommendations # list we're replacing. That should provide almost all of them. # But if a particular song isn't there, we try to find it # in one of the taste profiles in the neighbor bag. It is most # common that any new song whose name strings # aren't found in the old recommendations file will be # found in one of the newest songs to be added to the bag. These # are at the front of IstBagNeighborlDNewestFirst; so these # are searched first. # Recommender._debugStr (' len (IstRecommendations) is ' + str (len ( IstRecommendations).) )

# Try to get name strings from dctHash2Row recFileWrapper = Recommender. RecFileWrapper ( 'wb' , fAdventurousness ) bAHDone = True for iRec in range (len ( IstRecommendations ) -1, -1, -1): (fBayesGoodness, md5Hash, iPlayCount, iUserCount) = IstRecommendations [iRec] if dctHash2Row.has_key(md5Hash) : IstRow = dctHash2Row[md5Hash] (strSong, strArtist, strAlbum) = (lstRow[SONG_COL] , lstRow[ARTIST_COL] , IstRow[ALBUM_COL] ) assert strSong, 'From dctHash2Row: There must be a song name! '

# Recommender ._debugStr( 'Found song name in dctHash2Row' ) else: (strSong, strArtist, strAlbum) = (None,) * 3 bAllDone = False assert not None in (fBayesGoodness, iPlayCount, iUserCount, md5Hash) , 'one of (fBayesGoodness, iPlayCount, iUserCount, md5Hash) not populated ' recFileRecord = Recommender .RecFileWrapper . RecFileRecord (strSong, strArtist, strAlbum, fBayesGoodness, iPlayCount, iUserCount, md5Hash) recFileWrapper.write (recFileRecord) # Recommender._debugStr( ' %f , %i, %i, %s ' % (fBayesGoodness, iUserCount, iPlayCount, strSong)) # Recommender._debugStr( 'Recommender: writing ordered recs : fBayesGoodness : %f, iUserCount: %i, iPlayCount: %i, strSong: %s ' % ( fBayesGoodness , iUserCount, iPlayCount, strSong)) recFileWrapper . close ( ) self ._sendHeartBeat ( ) # Try to get name strings from each neighbor until we have them all if not bAllDone: for iNeighborlD in IstBagNeighborlDNewestFirst: strTasteProfileLoadFilePath = tasteprofilepaths . lates ( iNeighborlD) searchProfile = TasteProfile (L0AD_FR0M_FILE, StrTasteProfileLoadFilePath)

# Recommender._debugStr( 'Recommender: processing ID for name strings:' + repr (iNeighborlD) ) outRecFileWrapper = Recommender. RecFileWrapper ( 'wb' , fAdventurousness) inRecFileWrapper = Recommender .RecFileWrapper (' rb' ) bAllDone = True

# Recommender ._debugStr ( ' DOING IstBagNeighborlDNewestFirst LOOP, len is %i ' %

(len (IstBagNeighborlDNewestFirst) , ) ) for recFileRecord in inRecFileWrapper: if recFileRecord. strSong in (None, ''): # Need to try to find the strings calcData = searchProfile. getCalcDataUsingHash (recFileRecord.md5Hash) if calcData != None: IstRow = calcData.getRow( ) recFileRecord. strArtist = IstRow[tasteprofileclass .ARTIST_COL] recFileRecord. strAlbum = IstRow[tasteprofileclass.ALBUM_COL] recFileRecord. strSong = IstRow[tasteprofileclass . SONG_COL] assert recFileRecord. strSong, 'Matching to IstBagNeighborlDNewestFirst: There must be a song name!' else: bAllDone = False outRecFileWrapper. rite (recFileRecord) inRecFileWrapper . close ( ) outRecFileWrapper . close ( ) searchProfile.releaseCalcDataO # Free reference cycle in tasteProfile if bAllDone:

# Recommender ._debugStr ( ' In

IstBagNeighborlDNewestFirst loop, bAllDone is True') break self ._sendHeartBeat() assert bAllDone, 'bAllDone theoretically must be true here'
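The scoring and top-N selection inside `_saveRecommendationFile` can be sketched separately from the file handling. The following Python 3 sketch assumes the score is the blend of user count and goodness sum computed just before the `bisect.insort` call in the listing; the function names `blended_goodness` and `top_n` and the sample data are hypothetical.

```python
import bisect


def blended_goodness(goodness_sum, user_count, adventurousness):
    """Blend 'one vote per neighbor' with the summed goodness.

    At adventurousness 0 the score is the user count; at 1 it is the mean goodness.
    """
    weighted_sum = (1.0 - adventurousness) * user_count + adventurousness * goodness_sum
    return weighted_sum / ((1.0 - adventurousness) + user_count * adventurousness)


def top_n(candidates, n, adventurousness):
    """Keep the n best (score, song_hash) pairs and return them best first.

    candidates is an iterable of (song_hash, goodness_sum, user_count).
    """
    if n <= 0:
        return []
    best = []  # kept in ascending order, so best[0] is the weakest retained entry
    for song_hash, goodness_sum, user_count in candidates:
        score = blended_goodness(goodness_sum, user_count, adventurousness)
        if len(best) < n or score > best[0][0]:
            bisect.insort(best, (score, song_hash))
            if len(best) > n:
                del best[0]
    return best[::-1]


songs = [("a", 6.0, 3), ("b", 9.0, 1), ("c", 2.0, 2)]
print(top_n(songs, 2, 1.0))
print(top_n(songs, 2, 0.0))
```

Keeping the list sorted in ascending order makes the weakest retained entry available at index 0, so most candidates are rejected with a single comparison once the list is full.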

    def _loadRecommendationFile(self):
        Recommender._debugStr('In _loadRecommendationFile')
        recFileWrapper = Recommender.RecFileWrapper('rb')
        try:
            self._lstMatrix = []
            for recFileRecord in recFileWrapper:  # iterator is empty if file not there
                # While _saveRecommendationFile is running, the rec file is processed
                # repeatedly to fill in the strings. So, there will be intermediate
                # stages with empty strings. We don't put those in the matrix.
                # When _saveRecommendationFile completes, however, all strings
                # are filled in.
                if recFileRecord.strSong:
                    self._lstMatrix.append(recFileRecord.getRow())
        finally:
            recFileWrapper.close()
        self._sendHeartBeat()

    def getAdventurousness():
        # Returns None if the recommendation file isn't there
        # Recommender._debugStr('In getAdventurousness')
        try:
            inFile = openExclusive(Recommender._getRecFilePath(), 'rb')
            fAdventurousness = struct.unpack('>f', inFile.read(struct.calcsize('>f')))[0]
            inFile.close()
        except transposeexceptions.FileNotThereException:
            fAdventurousness = None
        return fAdventurousness
    getAdventurousness = staticmethod(getAdventurousness)

    def getModTime():
        # This is expected to be called once per second or whenever convenient, but
        # not infrequently. After it is found
        # that the time has changed, the GUI should call getSongList()
        try:
            fFileUpdateTime = utilities.getFileModTime(Recommender._getRecFilePath())
        except transposeexceptions.FileNotThereException:
            # If the file hasn't been created yet, this should be an OSErr
            fFileUpdateTime = None
        return fFileUpdateTime
    getModTime = staticmethod(getModTime)

    def getSongList(self):
        # Should be called whenever getModTime() returns a different
        # time from the previous one.
        Recommender._debugStr('In Recommender.getSongList')
        # Returns None if the recommendation file isn't there.
        # Performance note: It is expected that this is called rarely,
        # that is, it's called when the user opens a
        # recommendation window, and when there is a new recommendation file
        self._loadRecommendationFile()
        # Recommender._debugStr('The number of recs is ' + str(len(self._lstMatrix)))
        return self._lstMatrix

    def before_v0_505():
        '''Return whether the recommendations file is old and needs upgrading (or removal).'''
        recDataFile = None
        try:
            recDataFile = openExclusive(Recommender._getRecDataFilePath(), 'rb')
        except transposeexceptions.FileNotThereException:
            return False  # if not there then it is not old
        try:
            s = recDataFile.readline()
        finally:
            try:
                recDataFile.close()
            except SystemExit:
                raise
            except:
                pass
        return not ('\t' in s)
    before_v0_505 = staticmethod(before_v0_505)
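The recommendation file begins with the adventurousness setting stored as a big-endian 32-bit float, which is what `getAdventurousness` reads back. The following Python 3 sketch shows that header round trip on an in-memory stream; `write_header` and `read_header` are hypothetical names, and returning `None` for a short read is an assumption made here to mirror the "file isn't there" case.

```python
import io
import struct

HEADER_FORMAT = ">f"  # big-endian 32-bit float, the format string used in the listing


def write_header(stream, adventurousness):
    """Write the adventurousness value as the first bytes of the stream."""
    stream.write(struct.pack(HEADER_FORMAT, adventurousness))


def read_header(stream):
    """Return the stored adventurousness, or None if the header is missing or short."""
    size = struct.calcsize(HEADER_FORMAT)
    data = stream.read(size)
    if len(data) < size:
        return None
    return struct.unpack(HEADER_FORMAT, data)[0]


buf = io.BytesIO()
write_header(buf, 0.5)
buf.seek(0)
print(read_header(buf))
```

Specifying the byte order explicitly keeps the header readable regardless of the native byte order of the machine that wrote it.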

# def _getNeighborIDsAndPaths ( ) : # ' ' ' returns a list of tuples, one for each neighbor, containing:

# (iNeighborlD, strMostRecentNeighborTasteProfilePath)

# IstNeighbors = sortedneighborlistclass . SortedNeighborListReader ( ) . getNeighbors ( )

# if IstNeighbors == None:

# return [ ]

# del IstNeighbors [preffileclass. getMaxNeighbors () : ] # only the ones for our visible neighbors # return [ (neighbor. getUserlD () , \

# tasteprofilepaths .latest (neighbor. getUserlDO ) ) \

# for neighbor in IstNeighbors ]

# _getNeighborIDsAndPaths = staticmethod(_getNeighborIDsAndPaths) #

def needToUpgrade():
    '''If we have a version 0.505 or earlier data file then it must be deleted and rebuilt.'''
    return Recommender.before_v0_505()

def upgrade():
    '''Remove the data.rec file so it can be rebuilt.'''
    utilities.removeFile(Recommender._getRecDataFilePath())

genrerankhandlerclass.py
========
The code in this module represents one way of clustering data containing songs, where the songs (or most of them) have associated genre information. Of course, it can be used analogously for other subject areas; for instance, in the area of academic research, it could make use of the papers in the users' collections (rather than songs), and their associated keywords (rather than genres). This algorithm has the advantage that it is much faster than most general clustering algorithms, because it makes use of the effort that originally went into creating the genre information. Furthermore, programmers of ordinary skill in the art will readily see various ways of improving the speed of the code further (at the cost of more code complexity).

# Copyright (c) 2001-2004 Transpose, LLC All rights reserved.

Notes for understanding the algorithm.

CatchAllNodes represent clusters. There is one for each cluster. Could rename them "Clusters".

The basic idea is that the algorithm is deterministic, and so more predictable than general-purpose clustering techniques. We can expect fewer gotchas in practice; also, with further refinements that replace traversals with dictionary lookups, etc., it can be made O(n). However, the initial version is written almost without regard to execution speed; correctness is most of what counted.

iCount can always be generated by means of len(stUsers) and yet we maintain it separately since in an early stage we didn't have stUsers, and when we added it we could use iCount to check the correctness of the stUsers code, and we kept it because it may be useful for future optimizations. See testCountSameAsLenStUsers for a check that they are always consistent.
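The redundancy between iCount and len(stUsers) can be illustrated with a small stand-in class. This Python 3 sketch is hypothetical: `ClusterCounter`, `add_user`, and `is_consistent` are illustrative names, and only the attribute names `iCount` and `stUsers` come from the notes above.

```python
class ClusterCounter:
    """Hypothetical stand-in for a node that tracks its users two ways."""

    def __init__(self):
        self.stUsers = set()  # the user IDs themselves
        self.iCount = 0       # separately maintained count

    def add_user(self, user_id):
        """Add a user once; repeated additions leave both fields unchanged."""
        if user_id not in self.stUsers:
            self.stUsers.add(user_id)
            self.iCount += 1

    def is_consistent(self):
        """The separately maintained count must always equal the set size."""
        return self.iCount == len(self.stUsers)


node = ClusterCounter()
for uid in (1, 2, 2):
    node.add_user(uid)
print(node.iCount, node.is_consistent())
```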

All leafs are CatchAllNodes.

At the time of this writing the following are basically synonyms:
tokens : names
CatchAllNodes : leafs : clusters

We need to standardize on the meaning; the discrepancies grew out of the history of how the code developed.

After prepare, each user ID is in the stUsers attribute of a Stub in exactly one CatchAllNode, and is nowhere else in the tree.
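The invariant stated above, that each user ID ends up in exactly one place, can be checked mechanically. The following Python 3 sketch assumes a simplified representation in which each cluster is a list of stubs and each stub is a set of user IDs; the function name and the data layout are hypothetical, not the module's actual structures.

```python
from collections import Counter


def users_in_exactly_one_cluster(clusters):
    """Return True if no user ID appears more than once across all stubs.

    clusters maps a cluster name to a list of stubs; each stub is a set of user IDs.
    """
    counts = Counter(
        user_id
        for stubs in clusters.values()
        for stub in stubs
        for user_id in stub
    )
    return all(count == 1 for count in counts.values())


print(users_in_exactly_one_cluster({"A": [{1, 2}, {3}], "B": [{4}]}))
```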

From a nomenclature point of view, think of the data structure as an upside down tree with leafs at the bottom. Names such as tupHigherUps make sense in that context.

Every user, assuming he has at least one genre, is represented in the clustering, if only by his single top genre. If that genre has few or no matches with anyone, then it is moved to a Stub in a CatchAllNode and has no Nodes dedicated to it.

The upside of this strategy is that we end up with a very compact data structure to represent the clustering, one that strives to efficiently handle the needs of as many people as possible as well as possible. However, those on the fringe, whose tastes are unlike anyone else's, may be represented only by the genre they like the most. This is not a problem for them, because that one genre is enough to determine the only cluster that represents their interests; when they receive the whole cluster via torrent, all the details of all its users will be there. Also, the clustering data will be compared on their computer against the descriptions of all other clusters, and those which cover other genres that the user has will then be requested.

"Stubs" are data structures that contain users associated with a token.

Tokens with lower rankings than the ones in Stub.strName are ignored. So Stub.strName is the lowest-ranking token for at least one user such that the token appears in the clustering data. It may be possible to think of a better name than "stubs".

When the cluster fitter finds that two clusters are equally appropriate, judging by the user's tokens that match tokens in clusters, it picks the one with fewer levels of rank data. Less information means that there was less missed opportunity to match, which means that there is a higher chance of the cluster actually being appropriate.
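The tie-breaking rule can be sketched as a two-part sort key. In the Python 3 example below, the fit score is a placeholder number; how the cluster fitter actually scores a cluster is not shown in this part of the listing, and the function name `pick_cluster` is hypothetical.

```python
def pick_cluster(candidates):
    """Pick the best cluster from (cluster_number, fit_score, levels_of_rank_data).

    Highest fit wins; among equal fits, fewer levels of rank data wins.
    """
    return min(candidates, key=lambda c: (-c[1], c[2]))[0]


candidates = [(10, 0.8, 4), (11, 0.8, 2), (12, 0.5, 1)]
best = pick_cluster(candidates)
```

Clusters 10 and 11 fit equally well, so cluster 11 is chosen because it carries fewer levels of rank data.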

RankHandler._redistributeNonCatchAllUsers() moves users to CatchAlls if they aren't already represented in CatchAllNodes. Note that a side effect of this is that some tokens end up in stubs under CatchAllNodes and also in the strName attribute of the tupHigherUps for the same nodes. (In the unit test data, this is only 1 user, but should be many as the user base grows.) It happens when a user has a small enough number of genres that none get assimilated into CatchAllNodes.

Users with no token data at all get put in the vaguest cluster, in stubs with Stub.strName == None. For all other users Stub.strName is a string. The vaguest cluster is the one with the least specificity of tokens. There may be a tie, in which case the user will go to a random one.

Useful info on genres: http://www.xiph.org/archives/vorbis/200112/0037.html (says cddb and id3 genres are different)
id3 genres: http://www.id3.org/id3v2-00.txt at bottom.
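Choosing the vaguest cluster might be sketched as below, in Python 3. Specificity is taken to be an integer per cluster, as with the iSpecificity property in the listing; the dictionary input and the name `pick_vaguest` are illustrative assumptions, and the explicit random choice among ties follows the prose above rather than any particular line of the listing.

```python
import random


def pick_vaguest(specificity_by_cluster, rng=random):
    """Return a cluster number with the lowest specificity, random among ties."""
    lowest = min(specificity_by_cluster.values())
    tied = sorted(c for c, s in specificity_by_cluster.items() if s == lowest)
    return rng.choice(tied)


specificity = {101: 3, 102: 1, 103: 1, 104: 2}
chosen = pick_vaguest(specificity, random.Random(0))
```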

import binascii
import bisect
import cPickle
import extremes
import os
import types
import sets
import unique

import clusterfitterclass
import transposeexceptions
import utilities

DEFAULT_MINIMUM_CLUSTER_SIZE = 500
APPROXIMATE_MAXIMUM_CLUSTER_SIZE_FACTOR = 2
CATCH_ALL = 'Catch-All'

class SortedChildrenList(types.ListType):
    """
    Reliability and efficiency could both be increased by making this into
    an object that takes responsibility for maintaining its own order rather
    than depending on external sorts.
    """

    def insort(self, node):
        """
        Insert node in proper sort position.
        """
        assert issubclass(node.__class__, Node)
        assert not (True in [bool(n is node) for n in self])
        bisect.insort(self, node)
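The improvement suggested in the SortedChildrenList docstring, a container that takes responsibility for its own order, might look like the Python 3 sketch below. `SelfSortingList` is a hypothetical replacement and not part of the listing; it keeps a parallel list of keys so that the standard-library bisect module can find each insertion point.

```python
import bisect


class SelfSortingList:
    """Keeps items ordered by key on every insertion (hypothetical replacement)."""

    def __init__(self, key):
        self._key = key
        self._keys = []   # sort keys, kept parallel to self._items
        self._items = []

    def add(self, item):
        k = self._key(item)
        i = bisect.bisect_right(self._keys, k)
        self._keys.insert(i, k)
        self._items.insert(i, item)

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)


children = SelfSortingList(key=lambda pair: pair[1])
for pair in [("ROCK", 40), ("JAZZ", 5), ("FOLK", 12)]:
    children.add(pair)
```

This sketch only orders items at insertion time. In the listing a node's count changes after insertion, so a real replacement would also need a way to reposition an item whose key has changed.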

class Stub(object):
    """
    Nodes have a (possibly empty) list of stubs. Stubs are names that didn't
    have enough data to justify their own nodes, so that data was folded up
    into stubs. During the folding only the top level is retained. Note that
    since the rank data may have random ordering due to duplicate counts,
    some data with equal counts may be lost in this process. And that not
    sorting alphabetically is good so that this data loss isn't skewed toward
    particular names.

    strName can be None in one case: users with no token data at all. They
    get put in stubs in the most boring cluster (leaf) with None as the name.

    __getitem__ is present for historical reasons; in early versions of the
    code this object was a tuple. It was made a class for reasons of
    self-documentation, but it is expected that some of the tuple syntax will
    remain, if only in the unit tests.
    """

    def __init__(self, strName, xUsers=None):
        # xUsers is normally a Set containing integer user ID's. However
        # for testing purposes, xUsers is allowed to be an integer,
        # in which case the given number of dummy users is automatically
        # generated.
        # The users, whether generated automatically or supplied
        # as a set, go into self.stUsers.
        self.strName = strName
        if xUsers is None:
            self.stUsers = sets.Set()
        elif type(xUsers) is types.IntType:
            self.stUsers = sets.Set()
            for i in range(xUsers):
                self.stUsers.add(i)
        elif xUsers.__class__ is sets.Set:
            self.stUsers = xUsers
        else:
            assert 0, 'bad type for xUsers: ' + repr(xUsers) + repr(strName)
        assert not self.iCount == 0, repr(xUsers) + repr(strName)

    def __getitem__(self, i):
        if i == 0:
            return self.strName
        else:
            assert i == 1
            return self.iCount

    def __str__(self):
        return 'Stub name: ' + self.strName + ' ' + str(self.iCount) + ' ' + \
            repr(self.stUsers)

    def __cmp__(self, other):
        """
        To make it act more like a tuple so tests don't break.
        """
        return cmp((self.strName, self.iCount), (other.strName, other.iCount))

    iCount = property(lambda self: len(self.stUsers))

class Node(object):
    """
    Note that "==" and "is" are different for Nodes, and that this means
    that "in" can't normally be used because "in" uses "==". The difference
    is because of "__cmp__", used for sort ordering.
    """

    _ckiSortOrderer = 1

    def __init__(self, strName='root', parent=None):
        assert not (strName == 'root' and parent is not None)
        assert not (strName != 'root' and parent is None)
        self.lstSortedChildren = SortedChildrenList()
        self._iCount = 0  # Manually updated by increment()
        if not hasattr(self, 'parent'):  # May be assigned in subclass
            self.parent = parent
        if self.parent is not None:
            self._unLeafParent()
        if not hasattr(self, 'strName'):  # May be assigned in subclass
            self.strName = strName
        self.lstStubs = []
        self._stUsers = sets.Set()
        self.iNodeNumber = None
        self.bIsLeaf = True

    def __cmp__(self, other):
        return cmp((self._ckiSortOrderer, self.iCount),
                   (other._ckiSortOrderer, other.iCount))

    def _unLeafParent(self):
        self.parent.bIsLeaf = False

    def increment(self, recurseUntilBeforeNode=None, iIncrementAmount=1):
        assert recurseUntilBeforeNode is None \
            or issubclass(recurseUntilBeforeNode.__class__, Node)
        if recurseUntilBeforeNode is None \
                or not (self is recurseUntilBeforeNode
                        or self in recurseUntilBeforeNode.tupHigherUps):
            # so the caller doesn't have to check
            self._iCount += iIncrementAmount
            if self.parent and not self.parent is recurseUntilBeforeNode:
                self.parent.increment(recurseUntilBeforeNode, iIncrementAmount)
                # NYI the following line is a source of inefficiency,
                # because usually the increment will not change the
                # sort order. The suggested solution is to override
                # sort() in lstSortedChildren so that it checks
                # to see if the order is already OK before doing the sort.
                self.parent.lstSortedChildren.sort()

    def findChildWithName(self, strName):
        assert type(strName) == types.StringType
        # Can use dict to make this faster. For now, KISS
        lstNode = [node for node in self.lstSortedChildren
                   if node.strName == strName]
        assert len(lstNode) in (0, 1)
        if len(lstNode) == 1:
            return lstNode[0]
        else:
            return None

    def traverseNestedChildrenIterator(self, tupHigherUps=None):
        """
        returns a tuple: (node, tupHigherUps)
        where
            node is the node being traversed
            tupHigherUps is parent, grandparent, etc. all the way to
            the top (not including the RankHandler instance)
        """
        if tupHigherUps == None:
            tupHigherUps = ()
        for nodeInner in self.lstSortedChildren:
            yield nodeInner, tupHigherUps
            for (nodeDeeper, tupHigherUpsDeeper) in \
                    nodeInner.traverseNestedChildrenIterator(tupHigherUps + (nodeInner,)):
                yield nodeDeeper, tupHigherUpsDeeper

    def __str__(self):
        def strDisplayName(node):
            if node.__class__ == CatchAllNode:
                return CATCH_ALL
            else:
                return node.strName
        if hasattr(self, 'iNodeNumber') and type(self.iNodeNumber) is types.IntType:
            lstStr = ['Node %s %i %i:' % (strDisplayName(self), self.iCount,
                                          self.iNodeNumber)]
        else:
            lstStr = ['Node %s %i:' % (strDisplayName(self), self.iCount)]
        for (node, tupHigherUps) in self.traverseNestedChildrenIterator():
            iLevel = len(tupHigherUps)
            lstStr.append((iLevel * 4 * ' ') + strDisplayName(node) + ' ' + str(node.iCount) \
                          + ' ' + str(node.iNodeNumber))
        if len(lstStr) == 1:
            lstStr.append('(no children)')
        strResult = '\n'.join(lstStr)
        return strResult

    def _getSetUsersRecursive(self, stCumUsers=None):
        if stCumUsers is None:
            stCumUsers = sets.Set(self.stUsers)
        else:
            stCumUsers |= self.stUsers
        if not self.bIsLeaf:
            for subNode in self.lstSortedChildren:
                stCumUsers |= subNode.stUsersRecursive
        return stCumUsers

    def _recursivelyMergeNodes(self):
        """
        The general idea here is that nodes that are too small get
        agglomerated into newly created CatchAllNode instances where they
        become stubs in lstStubs. So a node that doesn't have a child node
        that is big enough to survive will have nothing but CatchAllNode
        instances as children.

        Then there is some cleanup work because a particular set of siblings
        may end up starting with a CatchAllNode that is too small to justify
        a separate existence, and some CatchAllNode instances may be so
        large that they should be split up.
        """
        if len(self.lstSortedChildren) == 1:
            self.lstSortedChildren[0]._recursivelyMergeNodes()
        elif len(self.lstSortedChildren) == 0:
            pass
        else:
            i = 0
            while i < len(self.lstSortedChildren):
                node = self.lstSortedChildren[i]
                if node.iCount < self.rankHandler.kiMinimumClusterSize:
                    if i == 0:
                        assert not len(self.lstSortedChildren[0].stUsersRecursive) == 0, \
                            str(node.__class__)
                        catchAllNode = CatchAllNode(
                            self,
                            [Stub(self.lstSortedChildren[0].strName,
                                  self.lstSortedChildren[0].stUsersRecursive)])
                        self.lstSortedChildren[0] = catchAllNode
                        i = 1
                    else:
                        assert len(self.lstSortedChildren[1].stUsersRecursive) > 0
                        self.lstSortedChildren[0].lstStubs.append(
                            Stub(self.lstSortedChildren[1].strName,
                                 self.lstSortedChildren[1]._getSetUsersRecursive()))
                        del self.lstSortedChildren[1]
                else:
                    node._recursivelyMergeNodes()
                    i += 1
            self.lstSortedChildren.sort()

    def _redistributeOneIsolatedSmallCatchAll(cls, catchAllNode):
        assert issubclass(catchAllNode.__class__, CatchAllNode)
        for stub in catchAllNode.lstStubs:
            cls._redistributeOneStub(stub, catchAllNode, catchAllNode.parent)
    _redistributeOneIsolatedSmallCatchAll = classmethod(_redistributeOneIsolatedSmallCatchAll)

    def _redistributeOneStub(stub, nodeToIgnore, considerSubordinatesOfNode):
        node = considerSubordinatesOfNode._findSubordinateNodeWithMostUsersForToken(
            stub.strName, nodeToIgnore,
            lambda node: not issubclass(node.__class__, CatchAllNode))
        if node is None:
            # No nodes have the token
            destinationLeaf = considerSubordinatesOfNode._findSubordinateLeafWithFewestUsers(
                bIgnoreIsolatedSmallCatchAllNodes=True)
        else:
            destinationLeaf = node._findSubordinateLeafWithFewestUsers(
                bIgnoreIsolatedSmallCatchAllNodes=True)
        if destinationLeaf._getCountForToken(stub.strName) == 0:
            assert issubclass(destinationLeaf.__class__, CatchAllNode)
            assert [] == [stubChecked for stubChecked in destinationLeaf.lstStubs
                          if stubChecked.strName == stub.strName]
            destinationLeaf.lstStubs.append(stub)
        else:
            [targetStub] = [stubChecked for stubChecked in destinationLeaf.lstStubs
                            if stubChecked.strName == stub.strName]
            iPrevCount = len(targetStub.stUsers)  # NYI remove after testing
            targetStub.stUsers |= stub.stUsers
            assert len(targetStub.stUsers) == iPrevCount + len(stub.stUsers), \
                "did not add right"  # NYI remove after testing
        destinationLeaf.parent.increment(
            recurseUntilBeforeNode=considerSubordinatesOfNode,
            iIncrementAmount=stub.iCount)
    _redistributeOneStub = staticmethod(_redistributeOneStub)

    def _findSubordinateIsolatedSmallCatchAlls(self):
        lstIsolatedSmallCatchAlls = []
        for (node, tupHigherUps) in self.traverseNestedChildrenIterator():
            if node._bIsIsolatedSmallCatchAll:
                lstIsolatedSmallCatchAlls.append(node)
        return lstIsolatedSmallCatchAlls

    def _getCountForToken(self, strToken):
        assert type(strToken) is types.StringType
        if self.strName == strToken:
            return self.iCount
        else:
            return 0

    def _findSubordinateLeafWithFewestUsers(self, bIgnoreIsolatedSmallCatchAllNodes=False):
        iSmallestCount = extremes.UniversalMaximum
        bestNode = None
        for (node, tupHigherUps) in self.traverseNestedChildrenIterator():
            if node.bIsLeaf and (not (bIgnoreIsolatedSmallCatchAllNodes
                                      and node._bIsIsolatedSmallCatchAll)):
                assert issubclass(node.__class__, CatchAllNode)
                if node.iCount < iSmallestCount:
                    bestNode = node
                    iSmallestCount = node.iCount
        assert bestNode is not self
        assert bestNode.bIsLeaf
        return bestNode

    def _findSubordinateNodeWithMostUsersForToken(self, strName, nodeToIgnore=None,
                                                  qualifyingFunction=lambda node: True):
        assert type(qualifyingFunction) is types.FunctionType
        assert nodeToIgnore is None or issubclass(nodeToIgnore.__class__, Node)
        assert type(strName) is types.StringType
        bestNode = None
        iBestCount = 0
        for (node, tupHigherUps) in self.traverseNestedChildrenIterator():
            if nodeToIgnore is None or (nodeToIgnore is not None and node != nodeToIgnore):
                if qualifyingFunction(node):
                    iCount = node._getCountForToken(strName)
                    if iCount > iBestCount:
                        bestNode = node
                        iBestCount = iCount
        assert bestNode is not self
        return bestNode

    def _getHigherUps(self):
        node = self
        lstResult = []
        if node.parent is not None:
            while not issubclass(node.parent.__class__, RankHandler):
                lstResult.insert(0, node.parent)
                node = node.parent
        return tuple(lstResult)

    def _setUsers(self, stUsers):
        self._stUsers = stUsers

    def _getUsers(self):  # NYI this could be a lambda on the property
        return self._stUsers

    iCount = property(lambda self: self._iCount + sum([stub[1] for stub in self.lstStubs]))
    rankHandler = property(lambda self: self.parent.rankHandler)
    tupHigherUps = property(_getHigherUps)
    stUsersRecursive = property(_getSetUsersRecursive)
    stUsers = property(_getUsers, _setUsers)

    _bIsIsolatedSmallCatchAll = property(lambda self: False)

class CatchAllNode(Node):
    _ckiSortOrderer = 0

    def __init__(self, parent=None, lstStubs=None):
        super(CatchAllNode, self).__init__(None, parent)
        if lstStubs == None:
            self.lstStubs = []
        else:
            self.lstStubs = lstStubs

    def _getCountForToken(self, strToken):
        lstMatchingStub = [stubInner for stubInner in self.lstStubs
                           if stubInner.strName == strToken]
        assert len(lstMatchingStub) in (0, 1)
        if len(lstMatchingStub) == 1:
            stub = lstMatchingStub[0]
            iCount = stub.iCount
        else:
            iCount = 0
        return iCount

    def _getUsers(self):
        stCumUsers = sets.Set()
        for stub in self.lstStubs:
            stCumUsers |= stub.stUsers
        return stCumUsers

    iCount = property(lambda self: sum([stub.iCount for stub in self.lstStubs]))

    _bIsIsolatedSmallCatchAll = property(lambda self:
        self.iCount < self.rankHandler.kiMinimumClusterSize \
        and len(self.parent.lstSortedChildren) > 1 \
        and (not issubclass(self.parent.lstSortedChildren[1].__class__, CatchAllNode)))
    stUsers = property(_getUsers)

    # NYI: iSpecificity could conceivably be enhanced to involve the number of stubs, counting
    # stubs with strName == None differently somehow.
    iSpecificity = property(lambda self: len(self.tupHigherUps))
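The Node docstring's warning that "in" uses "==" can be demonstrated with a small Python 3 sketch. `Counted` is a hypothetical stand-in for a node whose comparison looks only at its count, and `contains_identity` is an assumed helper name; the identity test mirrors the `n is node` check in SortedChildrenList.insort.

```python
class Counted:
    """Compares by count only, like a node ordered for sorting (hypothetical)."""

    def __init__(self, name, count):
        self.name = name
        self.count = count

    def __eq__(self, other):
        return self.count == other.count

    def __lt__(self, other):
        return self.count < other.count


def contains_identity(seq, obj):
    """True only if this exact object is in seq, ignoring equality."""
    return any(item is obj for item in seq)


a = Counted("ROCK", 5)
b = Counted("JAZZ", 5)
nodes = [a]
b_found_by_equality = b in nodes
b_found_by_identity = contains_identity(nodes, b)
```

Here `b in nodes` is true even though b was never added, because the two objects compare equal on count alone; the identity check gives the answer the tree code needs.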

class RankHandler(Node):

    def __init__(self, strFilePath, bClear=False,
                 iOverrideMinimumClusterSizeForTesting=DEFAULT_MINIMUM_CLUSTER_SIZE):
        self.parent = None
        self.strName = 'RankHandler'
        self._assignNodeNumbersToAllUnassignedNodes_iLastAssignedNodeNumber = -1
        if bClear:
            if os.access(strFilePath, os.F_OK):
                os.remove(strFilePath)
        else:
            assert not os.access(strFilePath, os.F_OK), \
                'File is already there -- use bClear option or load'
        self.strFilePath = strFilePath
        self._bDidPrepare = False
        self._addNewData_iPreviousAutomaticUser = 0  # "static" variable
        self.kiMinimumClusterSize = iOverrideMinimumClusterSizeForTesting
        self.kiApproximateMaximumClusterSize = \
            APPROXIMATE_MAXIMUM_CLUSTER_SIZE_FACTOR * self.kiMinimumClusterSize
        self.clusterFitter = None
        self._stRegisteredUsers = None
        super(RankHandler, self).__init__()

    def load(cls, strFilePath):
        assert type(cls) == types.TypeType, \
            'Cannot instance.load(), use RankHandler.load()'
        assert os.access(strFilePath, os.F_OK), 'File to load not found'
        loadFile = open(strFilePath, 'rb')
        rankHandler = cPickle.load(loadFile)
        loadFile.close()
        return rankHandler
    load = classmethod(load)

    def clusterNumberForUser(self, iUser):
        assert type(iUser) is types.IntType
        iClusterNumber = None
        for (node, tupHigherUps) in self.traverseNestedChildrenIterator():
            if issubclass(node.__class__, CatchAllNode) and iUser in node.stUsers:
                iClusterNumber = node.iNodeNumber
        if iClusterNumber is None:
            raise transposeexceptions.LogicError, \
                'Requested cluster number for unknown user: ' + str(iUser)
        return iClusterNumber

    def addNewData(self, seqName, iUser=None):
        """
        iUser is optional to enable testing to run successfully
        that isn't interested in the user identity.

        seqName is a sequence of names such as genre names which is ordered
        such that the most frequently occurring name is first, the next most
        frequently occurring name is second etc.

        An implication of that is that names are unique within seqName.

        NOTE: For ties, should have random sorting so that some (high in the
        alphabet) genres don't tend to get lost in the shuffle.

        NYI: do aforesaid randomization here. Note that for Goombah, it is
        not needed because it is done in clusterfitter code called by
        addGenreRawData.
        """
        assert len(seqName) == len(unique.unique(seqName))
        if self._bDidPrepare:
            raise transposeexceptions.LogicError
        if iUser == None:
            self._addNewData_iPreviousAutomaticUser += 1  # simulates static variable
            iUser = self._addNewData_iPreviousAutomaticUser
        else:
            if self._stRegisteredUsers is not None:
                if iUser not in self._stRegisteredUsers:
                    raise transposeexceptions.LogicError, 'Unregistered user added'
        node = self
        for strName in seqName:
            childNode = node.findChildWithName(strName)
            if childNode is None:
                childNode = Node(strName, node)
                node.lstSortedChildren.insort(childNode)
            if strName == seqName[-1]:
                childNode.increment()
                childNode.stUsers.add(iUser)
            else:
                node = childNode

    def prepare(self, bDisableClusterFitterForTesting=False):
        self._recursivelyMergeNodes()
        self._ensureEveryBranchHasCatchAll()
        self._redistributeIsolatedSmallCatchAlls()
        self._redistributeNonCatchAllUsers()
        self._addUsersWithoutDataToLeastDefinedCatchAll()
        self._splitVeryLargeCatchAlls()
        self._assignNodeNumbersToAllUnassignedNodes()
        if not bDisableClusterFitterForTesting:
            self.clusterFitter = clusterfitterclass.ClusterFitter(
                self.getClusterFitterString())
        self._bDidPrepare = True

    def registerUsers(self, stUsers):
        # The following assert makes sure all the users are integers
        assert list(sets.Set([type(iUser) for iUser in stUsers])) == [types.IntType]
        self._stRegisteredUsers = stUsers

    def _addUsersWithoutDataToLeastDefinedCatchAll(self):
        if self._stRegisteredUsers is not None:
            # FOR TESTING ONLY, it could be None. In real use it wouldn't be
            # when this is called.
            stUsersWithoutData = self._stRegisteredUsers - self.stUsersRecursive
            newStub = Stub(None, stUsersWithoutData)
            self.vaguestCatchAllNode.lstStubs.append(newStub)
            self.vaguestCatchAllNode.parent.increment(
                iIncrementAmount=len(stUsersWithoutData))

    def _getAllUsers(self):
        stCumUsers = sets.Set()
        for (node, tupHigherUps) in self.traverseNestedChildrenIterator():
            stCumUsers |= node.stUsers
        return stCumUsers

    def _redistributeIsolatedSmallCatchAlls(self):
        """
        This method addresses an artifact of our processing method, whereby
        there may be a single very small CatchAllNode preceding one or more
        Nodes. (This may occur at multiple locations in the tree.) We need to
        move the users represented by those CatchAllNodes to better places so
        that we don't end up with unnecessarily tiny clusters.
        """
        lstIsolatedSmallCatchAlls = self._findSubordinateIsolatedSmallCatchAlls()
        for isolatedSmallCatchAll in lstIsolatedSmallCatchAlls:
            parent = isolatedSmallCatchAll.parent
            self._redistributeOneIsolatedSmallCatchAll(isolatedSmallCatchAll)
            del parent.lstSortedChildren[0]

    def _redistributeNonCatchAllUsers(self):
        """
        This should be done after other merge steps; it's a cleanup.
        Without this step with the original beta test data, there was
        exactly one user who needed this done, in the top level ROCK node.
        All other users had been redistributed to CatchAllNodes in earlier
        steps.
        """
        lstNodesWithNonCatchAllUsers = self.findNodes(
            lambda node: (not issubclass(node.__class__, CatchAllNode))
                         and len(node.stUsers) > 0)
        for node in lstNodesWithNonCatchAllUsers:
            assert type(node.strName) is types.StringType, \
                'number of collected nodes: %i' % (len(lstNodesWithNonCatchAllUsers),)
            self._redistributeOneStub(Stub(node.strName, node.stUsers), None, node)
            node.stUsers = sets.Set()

    def _splitVeryLargeCatchAlls(self):
        # We allow lstNewSortedChildren to be possibly temporarily out of strict sort sequence,
        # and use sort() to fix it after being copied over lstSortedChildren.
        # The logic here depends on node.lstSortedChildren being in correct sort order
        # with CatchAllNodes coming first at the beginning of this method.
        stParentsOfCatchAllNodes = sets.Set([node.parent for node in self.CatchAllNodes])
        for node in stParentsOfCatchAllNodes:
            lstSortedChildren = node.lstSortedChildren
            node.lstSortedChildren = None
            lstNewSortedChildren = SortedChildrenList()
            for childNode in lstSortedChildren:
                if childNode.iCount > self.kiApproximateMaximumClusterSize \
                        and issubclass(childNode.__class__, CatchAllNode):
                    lstStubsWithCount = [(s.iCount, s) for s in childNode.lstStubs]
                    lstStubsWithCount.sort()
                    lstNewStubs = []
                    for i in range(len(childNode.lstStubs)):
                        # Deal with too-big single stubs by breaking them apart.
                        if childNode.lstStubs[i].iCount > self.kiApproximateMaximumClusterSize:
                            lstUsers = list(childNode.lstStubs[i].stUsers)
                            iNumberOfDerivedStubs = len(lstUsers) // self.kiMinimumClusterSize
                            for iMod in range(iNumberOfDerivedStubs):
                                lstPartialUsers = [lstUsers[iInner]
                                                   for iInner in range(len(lstUsers))
                                                   if iInner % iNumberOfDerivedStubs == iMod]
                                lstNewStubs.append(Stub(childNode.lstStubs[i].strName,
                                                        sets.Set(lstPartialUsers)))
                        else:
                            lstNewStubs.append(childNode.lstStubs[i])
                    currentCatchAllNode = CatchAllNode(node, lstNewStubs[:1])
                    del lstNewStubs[0]
                    lstNewSortedChildren.append(currentCatchAllNode)
                    while lstNewStubs:
                        stub = lstNewStubs[0]
                        del lstNewStubs[0]
                        if currentCatchAllNode.iCount >= self.kiMinimumClusterSize:
                            currentCatchAllNode = CatchAllNode(node, [stub])
                            lstNewSortedChildren.append(currentCatchAllNode)
                        else:
                            currentCatchAllNode.lstStubs.append(stub)
                    if currentCatchAllNode.iCount < self.kiMinimumClusterSize:
                        destinationCatchAll = utilities.sortedList(
                            lstNewSortedChildren[:-1], lambda n: n.iCount)[0]
                        destinationCatchAll.lstStubs.extend(currentCatchAllNode.lstStubs)
                        del lstNewSortedChildren[-1]
                else:
                    lstNewSortedChildren.append(childNode)
            node.lstSortedChildren = lstNewSortedChildren
            node.lstSortedChildren.sort()

    def save(self):
        saveFile = open(self.strFilePath, 'wb')
        cPickle.dump(self, saveFile, cPickle.HIGHEST_PROTOCOL)
        saveFile.close()

    def getClusterFitterString(self):
        lstClusterDescription = []
        iCount = 0  # for debugging only, remove when done NYI
        for (node, tupHigherUps) in self.traverseNestedChildrenIterator():
            iCount += 1
            if node.bIsLeaf:
                assert node.strName is None
                tupOrderedTokens = [n.strName for n in tupHigherUps]
                assert '' not in tupOrderedTokens
                lstTokens = [stub.strName for stub in node.lstStubs]
                # GenreRankHandler._redistributeNonCatchAllUsers() causes users to be in nodes and also
                # in stubs (unordered tokens). For the purposes of matching users to clusters, we
                # only want the node (ordered) mention.
                stUnorderedTokens = sets.ImmutableSet(
                    [strToken for strToken in lstTokens if strToken not in tupOrderedTokens])
                stUnorderedTokens = sets.ImmutableSet(lstTokens)
                CD = clusterfitterclass.ClusterDescription
                cd = CD(None, node.iNodeNumber, tupOrderedTokens, stUnorderedTokens)
                lstClusterDescription.append(str(cd))
                assert str(cd)
        strClusterDescription = ';'.join(lstClusterDescription)
        return binascii.b2a_base64(strClusterDescription)

    def findNodes(self, xKeyObject):
        """
        xKeyObject can be an iNodeNumber or a sequence containing the
        strName's of the higher-up nodes and ending with the name of the node
        to be found. It can also be a function that accepts a node as its
        parameter and returns True if the node is a desired one or False
        otherwise.

        The RankHandler itself is not represented in the xKeyObject tuple.

        All matches are returned in a list. If no matches are obtained the
        list is empty. When searching on node number only one can be found,
        still it is returned in a list for consistency.
        """

        assert type(xKeyObject) in (types.ListType, types.TupleType,
                                    types.IntType, types.FunctionType)
        lstResult = []
        for (node, tupHigherUps) in self.traverseNestedChildrenIterator():
            assert type(tupHigherUps) == types.TupleType
            if (type(xKeyObject) is types.IntType and node.iNodeNumber == xKeyObject) \
               or (type(xKeyObject) in (types.ListType, types.TupleType) and \
                   (list(xKeyObject) == [nodeHigher.strName for nodeHigher in tupHigherUps]
                                        + [node.strName])) \
               or (type(xKeyObject) is types.FunctionType and xKeyObject(node)):
                lstResult.append(node)
        return lstResult

    def _ensureEveryBranchHasCatchAll(self):
        lstNodesNeedingCatchAlls = []
        for (node, tupHigherUps) in self.traverseNestedChildrenIterator():
            # make sure no weird false value
            assert node.lstSortedChildren == [] or node.lstSortedChildren
            if node.lstSortedChildren == [] and node.__class__ != CatchAllNode:
                lstNodesNeedingCatchAlls.append(node)
        for node in lstNodesNeedingCatchAlls:
            node.lstSortedChildren.append(CatchAllNode(node))

    def _assignNodeNumbersToAllUnassignedNodes(self):
        for (node, tupHigherUps) in self.traverseNestedChildrenIterator():
            if node.iNodeNumber is None:
                self._assignNodeNumbersToAllUnassignedNodes_iLastAssignedNodeNumber += 1
                node.iNodeNumber = \
                    self._assignNodeNumbersToAllUnassignedNodes_iLastAssignedNodeNumber

    def _determineMaximumDepth(self):
        """
        The nodes just under the RankHandler are considered to be at depth 0;
        this returns the maximum seen depth in the data.
        """
        return max([len(tupHigherUps) for (node, tupHigherUps)
                    in self.traverseNestedChildrenIterator()])

    rankHandler = property(lambda self: self)

    CatchAllNodes = property(lambda node: node.findNodes(
        lambda nodeInner: issubclass(nodeInner.__class__, CatchAllNode)))
    vaguestCatchAllNode = property(lambda self: utilities.sortedList(
        self.findNodes(lambda n: issubclass(n.__class__, CatchAllNode)),
        lambda n: n.iSpecificity)[0])
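The way _splitVeryLargeCatchAlls in the listing above breaks apart a single oversized stub, dealing users out by index modulo the number of derived stubs, can be illustrated in isolation. The Python 3 sketch below uses the hypothetical name `split_users_round_robin` and plain sets; it covers only the splitting arithmetic, not the regrouping of stubs into new CatchAllNodes.

```python
def split_users_round_robin(users, min_cluster_size):
    """Deal users into len(users) // min_cluster_size groups by index modulo.

    Intended for stubs that are already known to be oversized, so that the
    number of groups is at least 1.
    """
    lst_users = list(users)
    n_parts = len(lst_users) // min_cluster_size
    return [
        {lst_users[i] for i in range(len(lst_users)) if i % n_parts == part}
        for part in range(n_parts)
    ]


parts = split_users_round_robin(range(11), min_cluster_size=3)
```

Because the group count is the floor of the user count divided by the minimum size, every derived group ends up with at least the minimum number of users.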

class GenreRankHandler(RankHandler):
    """
    To create an empty GenreRankHandler, pass a backing file name to
    __init__ (to clear an existing one, set bClear to True).

    Then, you can save it to disk with save(), and restore it with
        clusterer = GenreRankHandler.load()

    Then use addGenreRawData to add a sequence of
    clusterfitterclass.GenreRawData objects to the clusterer. Each sequence
    represents one user.

    Use that method once for each user.

    When they are all added, call prepare().

    Then pass a sequence of clusterfitterclass.GenreRawData objects to
    getClusterNumbersOrderedByFit() to get an ordered list of cluster numbers
    for that user, the first one being the best match. That's the one the
    user should actually go in.

    NOTE! There is no guarantee that the same cluster will come out first if
    you run getClusterNumbersOrderedByFit twice in a row for the same user,
    since there are cases when two clusters are equally fit, and randomness
    may be used to choose between them. So when getting cluster assignments,
    you should call the method once and store the result in the database.
    Then, the next week, repopulate the GenreRankHandler object from scratch
    to request a new cluster number that will reflect changes to the data
    over that week.
    """

    def __init__(self, strFilePath, bClear=False,
                 iOverrideMinimumClusterSizeForTesting=DEFAULT_MINIMUM_CLUSTER_SIZE):
        """
        strFilePath is the location to pickle the data when save is called.
        bClear tells whether any data previously saved there should be cleared.
        """
        super(GenreRankHandler, self).__init__(strFilePath, bClear,
                                               iOverrideMinimumClusterSizeForTesting)

    def addGenreRawData(self, seqGenreRawData, iUser=None):
        """
        Accepts a sequence of clusterfitterclass.GenreRawData objects.
        Called once for each user.

        NYI This should really take a set rather than a sequence as input
        because the order does not matter, and there is a risk of giving the
        impression that it does because it does in RankHandler.addNewData.

        NYI in this method and getClusterNumbersOrderedByFit, it may make
        sense to refactor so that we don't have to call the static method in
        ClusterFitter, but rather that both classes inherit from another
        class which contains that functionality and is never instantiated.
        """
        # iUser is optional to enable testing to run successfully
        # that isn't interested in the user identity. NYI: In a future
        # version, could update all such test code to include a dummy user.
        tupSortedTokens = clusterfitterclass.ClusterFitter.getSortedTokens(seqGenreRawData)
        self.addNewData(tupSortedTokens, iUser)

    def prepare(self, bDisableClusterFitterForTesting=False):
        """
        Run this once after all users have been added with addGenreRawData().
        Once it has been run, no new users can be added. To process new
        users, you need to create a new GenreRankHandler from scratch -- and
        note that cluster numbers will not necessarily be consistent when a
        new one is created.
        """
        super(GenreRankHandler, self).prepare(bDisableClusterFitterForTesting)

    def save(self):
        super(GenreRankHandler, self).save()

    def load(self, strFilePath):
        return super(GenreRankHandler, self).load(strFilePath)

clusterfitterclass.py

On a server, this is a helper class for genrerankhandlerclass.py. However, it has another use as well. On the client, it serves to tell the client which identifier is associated with the cluster the client should download first. That is, it outputs a sorted list of clusters, with the ones most likely to yield high similarity to the local user coming first.

It does that by means of summary data (the xInitData parameter of the __init__ method) that is sent to the client from the server and that summarizes the differences between the clusters.

In the current embodiment (from which this code is derived), this enables clients to request the clusters that are most likely to have good similarity matches first; this downloading is accomplished via BitTorrent. We do not include the BitTorrent-related code here because techniques for accomplishing a BitTorrent download are readily apparent to a programmer of ordinary skill.
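The summary data can be pictured as a round trip between server and client. Judging from getClusterFitterString and _instantiateFromString in the listings, each cluster appears to be written as a cluster number, its ordered tokens, and its unordered tokens, with the clusters joined together and base64-encoded. The Python 3 sketch below assumes the separators ";", ":" and "," (the extracted text is damaged around them) and uses the hypothetical names `encode_clusters` and `decode_clusters`.

```python
import base64


def encode_clusters(clusters):
    """clusters: list of (cluster_number, ordered_tokens, unordered_tokens)."""
    parts = []
    for number, ordered, unordered in clusters:
        parts.append("%d: %s: %s" % (number, ", ".join(ordered),
                                     ", ".join(sorted(unordered))))
    return base64.b64encode(";".join(parts).encode("ascii")).decode("ascii")


def decode_clusters(blob):
    """Inverse of encode_clusters; tokens must not contain ';', ':' or ','."""
    text = base64.b64decode(blob).decode("ascii")
    clusters = []
    for part in text.split(";"):
        number, ordered, unordered = part.split(":")
        clusters.append((
            int(number),
            [t.strip() for t in ordered.split(",") if t.strip()],
            frozenset(t.strip() for t in unordered.split(",") if t.strip()),
        ))
    return clusters


original = [(1, ["ROCK", "JAZZ"], {"FOLK"}), (2, [], {"POLKA", "SKA"})]
decoded = decode_clusters(encode_clusters(original))
```

A compact text summary of this kind lets the client rank clusters locally before deciding which full cluster to download.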

import types
import sets
import binascii
import random
import utilities

# The multi-word genres from id3 tags with winamp extensions,
# plus one or more extras of my own at the end,
# with symbols like & and - replaced with spaces,
# and made all-caps, are:
MULTI_WORD_GENRES = \
    ['CLASSIC ROCK',
     'HIP HOP',
     'NEW AGE',
     'R B',
     'DEATH METAL',
     'EURO TECHNO',
     'TRIP HOP',
     'JAZZ FUNK',
     'INSTRUMENTAL POP',
     'INSTRUMENTAL ROCK',
     'TECHNO INDUSTRIAL',
     'ROCK   ROLL',  # 3 SPACES IN BETWEEN 'ROCK & ROLL' in id3
     'SOUTHERN ROCK',
     'TOP 40',
     'POP FOLK',
     'NATIVE AMERICAN',
     'CHRISTIAN RAP',
     'HARD ROCK',
     'FOLK ROCK',
     'NATIONAL FOLK',
     'PROGRESSIVE ROCK',
     'PSYCHEDELIC ROCK',
     'FAST FUSION',
     'SYMPHONIC ROCK',
     'SLOW ROCK',
     'BIG BAND',
     'PORN GROOVE',
     'SLOW JAM',
     'POWER BALLAD',
     'RHYTHMIC SOUL',
     'PUNK ROCK',
     'DRUM SOLO',
     'A CAPELLA',
     'EURO HOUSE',
     'DANCE HALL',
     'ROCK ROLL']  # second version of 'Rock&Roll'


class GenreRawData(object):
    def __init__(self, strGenreRaw, iCount):
        """
        strGenreRaw is the genre data for a song as it comes from the iTunes
        database -- it doesn't need to be processed in any way; that's done
        in ClusterFitter.

        iCount is the number of plays.
        """
        self.strGenreRaw = strGenreRaw
        self.iCount = iCount

    def __cmp__(self, otherTokenData):
        """
        We want the biggest iCount's first.

        Note that if iCounts are equal, the ordering is undefined. That is
        because we don't want to skew results according to alphabetical
        suborderings; better to leave suborderings undefined so that some
        tokens are not more important than others.
        """
        return -cmp(self.iCount, otherTokenData.iCount)
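One common way to get "biggest count first, ties in no particular order" is to shuffle before a stable sort. The Python 3 sketch below is an assumption about how such randomization could be done, not a reproduction of the listing's getSortedTokens; `sort_by_plays` is a hypothetical name and the (genre, play count) pairs stand in for GenreRawData objects.

```python
import random


def sort_by_plays(genre_counts, rng=random):
    """Order (genre, play_count) pairs by descending count, ties randomized."""
    shuffled = list(genre_counts)
    rng.shuffle(shuffled)  # randomize first so ties are not alphabetical
    # sorted() is stable, so pairs with equal counts keep their shuffled order.
    return sorted(shuffled, key=lambda pair: pair[1], reverse=True)


data = [("ROCK", 10), ("JAZZ", 3), ("FOLK", 3), ("POLKA", 7)]
ranked = sort_by_plays(data, random.Random(1))
```

Passing a seeded random.Random instance makes a run repeatable for testing, while the default uses the module-level generator.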

class ClusterDescription(object):
    def __init__(self, ClusterFitter, iClusterNumber, seqOrderedTokens, stUnorderedTokens):
        self.ClusterFitter = ClusterFitter
        self.iClusterNumber = iClusterNumber
        self.lstOrderedTokens = list(seqOrderedTokens)
        self.stUnorderedTokens = stUnorderedTokens
        self._dctTokenUnitRanks = None  # Needs to be computed later cuz of circular dependency

    def _getUnitRanks(self):
        if self._dctTokenUnitRanks is None:
            iNumberOfRanks = len(self.lstOrderedTokens) + len(self.stUnorderedTokens)
            dctTokenUnitRanks = {}
            for iOrderedIndex in range(len(self.lstOrderedTokens)):
                dctTokenUnitRanks[self.lstOrderedTokens[iOrderedIndex]] \
                    = 1.0 - (float(iOrderedIndex) / self.ClusterFitter.iMaxDepth)
                # = 1.0 - (float(iOrderedIndex) / iNumberOfRanks)
            # for unordereds we compute the average unit rank for all the unordereds. But to compute that
            # we only need to consider the highest and lowest.
            fHighestUnordered = 1.0 - (float(len(self.lstOrderedTokens)) / self.ClusterFitter.iMaxDepth)
            fLowestUnordered = fHighestUnordered - ((len(self.stUnorderedTokens) - 1.0) / self.ClusterFitter.iMaxDepth)
            fAveragePeerUnitRank = (fHighestUnordered + fLowestUnordered) / 2.0
            # print 'CALC:', self.iClusterNumber, len(self.lstOrderedTokens), len(self.stUnorderedTokens), \
            #     fHighestUnordered, fLowestUnordered, fAveragePeerUnitRank, self.ClusterFitter.iMaxDepth
            for strToken in self.stUnorderedTokens:
                dctTokenUnitRanks[strToken] = fAveragePeerUnitRank
            self._dctTokenUnitRanks = dctTokenUnitRanks
        return self._dctTokenUnitRanks

    def __str__(self):
        assert type(self.lstOrderedTokens) is types.ListType, \
            'If this is a tuple there is a trailing comma when' \
            + 'only one item which breaks the logic: ' + repr(type(self.lstOrderedTokens))
        # Strip the quote characters that str() puts around each list element.
        return '%i: %s: %s' % (self.iClusterNumber,
                               str(self.lstOrderedTokens)[1:-1].replace("'", ''),
                               str(list(self.stUnorderedTokens))[1:-1].replace("'", ''))

    def __cmp__(self, other):
        """
        Sort by cluster number for convenience
        """
        return cmp(self.iClusterNumber, other.iClusterNumber)

    dctTokenUnitRanks = property(_getUnitRanks)

    # The idea here is that a cluster that has a lot of information describing
    # what it is is less likely to be appropriate for a random user who does NOT match lower
    # levels of detail than is another cluster that does not specify lower levels
    fAmountOfDetail = property(lambda self: float(len(self.dctTokenUnitRanks)))
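The unit-rank rule used by `_getUnitRanks` can be illustrated with a small standalone sketch. It is written for Python 3 rather than the Python 2 of the listing, and the function name `unit_ranks`, its parameters, and the sample tokens are assumptions made for illustration only. Ordered tokens receive `1 - index / max_depth`; all unordered tokens share the average of the ranks they would have received had they followed the ordered tokens.

```python
def unit_ranks(ordered_tokens, unordered_tokens, max_depth):
    """Map each token to a unit rank; a higher rank means a more important token."""
    ranks = {}
    # Ordered tokens step down from 1.0 by 1/max_depth per position.
    for index, token in enumerate(ordered_tokens):
        ranks[token] = 1.0 - index / max_depth
    # Unordered tokens are peers: each gets the mean of the highest and lowest
    # rank the group would occupy if it were appended after the ordered tokens.
    highest = 1.0 - len(ordered_tokens) / max_depth
    lowest = highest - (len(unordered_tokens) - 1.0) / max_depth
    average = (highest + lowest) / 2.0
    for token in unordered_tokens:
        ranks[token] = average
    return ranks


# Hypothetical cluster: two ordered tokens, two unordered peers, depth 4.
example = unit_ranks(["ROCK", "METAL"], {"PUNK", "INDIE"}, max_depth=4)
```

With these sample values the ordered tokens rank 1.0 and 0.75, and both unordered tokens share the midpoint of 0.5 and 0.25.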

class ClusterFitter(object):
    ckdctMultiWordGenres = {}
    for strMultiWordGenre in MULTI_WORD_GENRES:
        ckdctMultiWordGenres[strMultiWordGenre] = strMultiWordGenre.replace(' ', '')

    def __init__(self, xInitData):
        self._lstClusterDescriptions = []
        if type(xInitData) == types.StringType:
            self._instantiateFromString(xInitData)

    def _instantiateFromString(self, strBinAsciiData):  # rename _populateClusterDescriptions NYI
        strData = binascii.a2b_base64(strBinAsciiData)
        lstClusterDescriptionStrings = strData.split(';')
        for strClusterDescription in lstClusterDescriptionStrings:
            (strClusterNumber, strOrdered, strUnordered) = strClusterDescription.split(':')
            iClusterNumber = int(strClusterNumber)
            lstOrderedTokensRaw = [strToken.strip() for strToken in strOrdered.split(',')]
            lstOrderedTokens = (lstOrderedTokensRaw, [])[lstOrderedTokensRaw == ['']]  # python idiom for a ? b : c
            stUnorderedTokens = sets.ImmutableSet(
                [strToken.strip() for strToken in strUnordered.split(',') if strToken.strip() != ''])
            assert not ('' in stUnorderedTokens or '' in lstOrderedTokens)
            clusterDescription = ClusterDescription(self, iClusterNumber, lstOrderedTokens, stUnorderedTokens)
            self._lstClusterDescriptions.append(clusterDescription)
        self.iMaxDepth = max([len(cd.lstOrderedTokens) + len(cd.stUnorderedTokens)
                              for cd in self._lstClusterDescriptions])

    def getClusterNumbersOrderedByFit(self, seqGenreRawData, bRandomize=True):
        """
        This method is not necessarily deterministic! Call once and store the result.

        Returns a list of integer cluster numbers.
        """
        tupSortedTokens = self.getSortedTokens(seqGenreRawData)
        return self.getClusterNumbersOrderedByFitForTokenList(tupSortedTokens, bRandomize)

    def getClusterNumbersOrderedByFitForTokenList(self, seqOrderedTokens, bRandomize=True,
                                                  bForTestingDoNotAppendBoring=False):
        class BestHelper(object):  # So that only sorts on goodness
            def __init__(self, fGoodness, clusterDescription):
                self.fGoodness = fGoodness
                self.clusterDescription = clusterDescription

            def __cmp__(self, other):
                """
                We handle the case where fGoodness values are equal, but one
                cluster is more specified than the other (and so less likely
                to be appropriate if there wasn't a match represented by
                goodness).
                """
                assert other.__class__ == BestHelper
                if round(self.fGoodness, 3) == round(other.fGoodness, 3):
                    iResult = -cmp(self.clusterDescription.fAmountOfDetail,
                                   other.clusterDescription.fAmountOfDetail)
                else:
                    iResult = cmp(self.fGoodness, other.fGoodness)
                return iResult

            iClusterNumber = property(lambda self: self.clusterDescription.iClusterNumber)

        dctActiveTokenUnitRank = {}
        for i in range(len(seqOrderedTokens)):
            strName = seqOrderedTokens[i]
            fUnitRank = 1.0 - (float(i) / self.iMaxDepth)
            dctActiveTokenUnitRank[strName] = fUnitRank
        lstBestHelper = []
        for clusterDescription in self._lstClusterDescriptions:
            fGoodness = self._computeMatchGoodness(dctActiveTokenUnitRank, clusterDescription)
            lstBestHelper.append(BestHelper(fGoodness, clusterDescription))
        if bRandomize:
            # The shuffle is to deal with ties. We don't want sort artifacts like
            # the first tied item always coming out first, or else, for instance, the "best"
            # might always be the first such item rather than one of the tied items.
            random.shuffle(lstBestHelper)
        lstBestHelper.sort()
        lstBestHelper.reverse()
        lstClusterNumbers = [bh.iClusterNumber for bh in lstBestHelper if bh.fGoodness != 0.0]
        lstResult = lstClusterNumbers
        if not bForTestingDoNotAppendBoring:
            lstNonDuplicateBoringClusters = [icn for icn in self._getClusterNumbersSortedByBoringness()
                                             if icn not in lstClusterNumbers]
            lstResult.extend(lstNonDuplicateBoringClusters)
        return lstClusterNumbers

    def _computeMatchGoodness(dctTokenUnitRank, clusterDescription):
        dctTokenUnitRanks = clusterDescription.dctTokenUnitRanks
        fSum = 0.0
        for strToken in dctTokenUnitRank:
            fUnitRank = dctTokenUnitRank[strToken]
            fProduct = fUnitRank * dctTokenUnitRanks.get(strToken, 0.0)
            fSum += fProduct
        return fSum
    _computeMatchGoodness = staticmethod(_computeMatchGoodness)

    def _extractTokens(cls, strBig):
        """
        Return blank-delimited tokens in a tuple as upper case. Ignore
        single letters/numbers. Accept only alphanumeric characters.
        """
        strUpperBig = strBig.upper()
        lstSingleChar = []
        for strSingleChar in strUpperBig:
            if strSingleChar.isalnum():
                strSingleCharGood = strSingleChar
            else:
                strSingleCharGood = ' '
            lstSingleChar.append(strSingleCharGood)
        strNonAlphaNumericReplacedWithSpace = ''.join(lstSingleChar)
        if ' ' in strNonAlphaNumericReplacedWithSpace.strip():
            strMultiWordGenresRemoved = strNonAlphaNumericReplacedWithSpace
            for strMultiWordGenre in MULTI_WORD_GENRES:
                strMultiWordGenresRemoved = \
                    strMultiWordGenresRemoved.replace(strMultiWordGenre,
                                                      cls.ckdctMultiWordGenres[strMultiWordGenre])
        else:
            strMultiWordGenresRemoved = strNonAlphaNumericReplacedWithSpace
        return tuple([strToken for strToken in strMultiWordGenresRemoved.split() if len(strToken) > 1])
    _extractTokens = classmethod(_extractTokens)

    def getSortedTokens(seqGenreRawData):
        """
        In a typical application, seqGenreRawData contains the counts
        for raw genre strings such as "Rock & Roll / 80's". That would
        become a list 'ROCKROLL', '80' (2 tokens). Then the most common
        tokens in seqGenreRawData are returned first, one token per
        output list element.

        We shuffle the input sequence to eliminate artifacts such as an
        alphabetical sorting. Not all tokens with few hits get used in
        the clustering, and we don't want a prejudice toward any
        particular one based on alphabetical order or other artifacts.
        """
        dctTokenCount = {}
        lstGenreRawData = list(seqGenreRawData)
        random.shuffle(lstGenreRawData)
        for genreRawData in lstGenreRawData:
            lstTokens = ClusterFitter._extractTokens(genreRawData.strGenreRaw)
            for strToken in lstTokens:
                if dctTokenCount.has_key(strToken):
                    dctTokenCount[strToken] += genreRawData.iCount
                else:
                    dctTokenCount[strToken] = genreRawData.iCount
        lstCountedGenre = utilities.sortedList(dctTokenCount.items(), 1, bReverse=True)
        return [strToken for (strToken, iCount) in lstCountedGenre]
    getSortedTokens = staticmethod(getSortedTokens)

    def _tokenBoringnesses(self):
        dctTokenRankSum = {}
        for cd in self._lstClusterDescriptions:
            dctTokenUnitRanks = cd.dctTokenUnitRanks
            for strToken in dctTokenUnitRanks:
                if not dctTokenRankSum.has_key(strToken):
                    dctTokenRankSum[strToken] = 0.0
                dctTokenRankSum[strToken] += dctTokenUnitRanks[strToken]
        return dctTokenRankSum

    def _getClusterNumbersSortedByBoringness(self):
        dctTokenBoringness = self._tokenBoringnesses()
        dctClusterNumberBoringness = {}
        for cd in self._lstClusterDescriptions:
            fSum = 0.0
            for strToken in cd.dctTokenUnitRanks:
                fSum += cd.dctTokenUnitRanks[strToken] \
                    * dctTokenBoringness[strToken]
            fBoringness = fSum / len(cd.dctTokenUnitRanks)
            dctClusterNumberBoringness[cd.iClusterNumber] = fBoringness
        lstBoringnessCluster = [(tup[1], tup[0]) for tup in dctClusterNumberBoringness.items()]
        lstBoringnessCluster.sort()
        lstBoringnessCluster.reverse()
        return [tup[1] for tup in lstBoringnessCluster]

    iBoringestClusterNumber = property(lambda self: self._getClusterNumbersSortedByBoringness()[0])
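The flow of the listing above, from raw genre strings to clusters ordered by fit, can be condensed into a short Python 3 sketch. This is an illustrative approximation and not the listing itself: the names `extract_tokens`, `sorted_tokens`, and `match_goodness`, the shortened `MULTI_WORD` list, the sample play counts, and the two sample clusters are all assumptions. The sketch omits the random shuffling, the tie-breaking on amount of detail, and the appended "boring" clusters.

```python
from collections import Counter

# Hypothetical, shortened stand-in for MULTI_WORD_GENRES.
MULTI_WORD = ["ROCK   ROLL", "ROCK ROLL", "HIP HOP"]


def extract_tokens(raw_genre):
    """Upper-case the string, blank out non-alphanumerics, fuse known
    multi-word genres into single tokens, and drop one-character tokens."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in raw_genre.upper())
    for genre in MULTI_WORD:
        cleaned = cleaned.replace(genre, genre.replace(" ", ""))
    return tuple(tok for tok in cleaned.split() if len(tok) > 1)


def sorted_tokens(play_counts):
    """play_counts is an iterable of (raw genre string, play count) pairs.
    Tokens with the largest summed play count come first."""
    totals = Counter()
    for raw_genre, count in play_counts:
        for token in extract_tokens(raw_genre):
            totals[token] += count
    return [token for token, _ in totals.most_common()]


def match_goodness(user_ranks, cluster_ranks):
    """Sum of products of unit ranks over the tokens the user and cluster share."""
    return sum(rank * cluster_ranks.get(token, 0.0)
               for token, rank in user_ranks.items())


tokens = sorted_tokens([("Rock & Roll / 80's", 12), ("Hip-Hop", 5)])
max_depth = 4  # assumed deepest cluster description
user_ranks = {tok: 1.0 - i / max_depth for i, tok in enumerate(tokens)}
clusters = {1: {"ROCKROLL": 1.0, "80": 0.75}, 2: {"HIPHOP": 1.0}}
fit = sorted(clusters,
             key=lambda number: match_goodness(user_ranks, clusters[number]),
             reverse=True)
```

Here `fit` lists cluster numbers from best to worst match for the sample user. Tokens with equal play counts have no guaranteed relative order, which mirrors the note in the listing that suborderings of ties are left undefined.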

INDUSTRIAL APPLICABILITY

The present invention is desirably implemented via a public network or internet. It may, for example, be coupled to a private network or intranet through a firewall server or router. As used herein, the term "internet" generally refers to any collection of distinct networks working together to appear as a single network to a user. The term "Internet", on the other hand, refers to a specific implementation of internet, the so-called world wide "network of networks" that are connected to each other using the Internet protocol (IP) and other similar protocols. The Internet provides file transfer, remote log in, electronic mail, news and other services. The system and techniques described herein can be used on any internet including the so-called Internet.

One of the unique aspects of the Internet system is that messages and data are transmitted through the use of data packets referred to as "datagrams." In a datagram-based network, messages are sent from a source to a destination in a manner similar to a government mail system. For example, a source computer may send a datagram packet to a destination computer regardless of whether or not the destination computer is currently powered on and coupled to the network. The Internet protocol (IP) is completely sessionless, such that IP datagram packets are not associated with one another.

The firewall server or router is a computer or item of equipment which couples the computers of a private network to the Internet. It may thus act as a gatekeeper for messages and datagrams going to and from the Internet 1.

An Internet service provider (ISP) is also coupled to the Internet. A service provider is an entity that provides connections to a part of the Internet for a plurality of users. Also coupled to the Internet are a plurality of web sites or nodes. When a user wishes to conduct a transaction at one of the nodes, the user accesses the node through the Internet.

Each node is configured to understand which firewall and node to send data packets to given a destination IP address. This may be implemented by providing the firewalls and nodes with a map of all valid IP addresses disposed on its particular private network or another location on the Internet. The map may be in the form of prefix matches up to and including the full IP address.
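The prefix-match map described above can be sketched in Python 3 with the standard `ipaddress` module. The route table, the hop names, and the function `next_hop` are hypothetical, and choosing the longest matching prefix is an assumption about how such a map would typically be consulted; the text states only that the map holds prefix matches up to and including the full IP address.

```python
import ipaddress

# Hypothetical map from network prefix to the firewall or node that
# should receive packets for that prefix. A /32 entry is a full address.
ROUTES = {
    "0.0.0.0/0": "isp-gateway",
    "10.0.0.0/8": "firewall-a",
    "10.1.0.0/16": "node-b",
    "10.1.2.3/32": "node-c",
}


def next_hop(destination, routes=ROUTES):
    """Return the hop for the longest prefix containing the destination,
    or None when no prefix matches."""
    address = ipaddress.ip_address(destination)
    matching = []
    for prefix, hop in routes.items():
        network = ipaddress.ip_network(prefix)
        if address in network:
            matching.append((network.prefixlen, hop))
    if not matching:
        return None
    # The tuple with the largest prefix length is the most specific match.
    return max(matching)[1]
```

For example, `next_hop("10.1.9.9")` selects the `/16` entry over the `/8` entry because it is the more specific of the two matching prefixes.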

Also coupled to the Internet is a server, containing an information database with representations of user profiles and associated user identifiers 5. The information may be stored, for example, as a record or as a file. The information associated with each particular user is stored in a particular data structure in a database. One exemplary database structure is as follows. The database may be stored, for example, as an object-oriented database management system (ODBMS), a relational database management system (e.g. DB2, SQL, etc.), a hierarchical database, a network database, a distributed database (i.e. a collection of multiple, logically interrelated databases distributed over a computer network) or any other type of database package. Thus, the database and the system can be implemented using object-oriented technology or via text files.

A computer system on which the system of the present invention may be implemented may be, for example, a personal computer running Microsoft Windows, Linux, Apple Macintosh or an equivalent operating system. Such a computer system typically includes a central processing unit (CPU), e.g., a conventional microprocessor, a random access memory (RAM) for temporary storage of information, and a read only memory (ROM) for permanent storage of information. Each of the aforementioned components is coupled to a bus. The operating system controls allocation of system resources and performs tasks such as processing, scheduling, memory management, networking, and I/O services. Also coupled to the bus is typically a non-volatile mass storage device which may be provided as a fixed disk drive which is coupled to the bus by a disk controller.
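As one illustration of the server-side information database of user profiles and associated user identifiers described above, the following Python 3 sketch stores each profile as a record in a relational table using the standard `sqlite3` module. The table name, column names, JSON encoding of the profile, and the sample identifier are all assumptions; the disclosure does not prescribe a schema or a particular database package.

```python
import json
import sqlite3


def open_profile_store(path=":memory:"):
    """Open (or create) a hypothetical profile database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS profiles ("
        "user_id TEXT PRIMARY KEY, profile TEXT NOT NULL)"
    )
    return conn


def save_profile(conn, user_id, profile):
    """Store one user's taste profile, replacing any earlier record."""
    conn.execute(
        "INSERT OR REPLACE INTO profiles (user_id, profile) VALUES (?, ?)",
        (user_id, json.dumps(profile)),
    )
    conn.commit()


def load_profile(conn, user_id):
    """Return the stored profile for user_id, or None if there is none."""
    row = conn.execute(
        "SELECT profile FROM profiles WHERE user_id = ?", (user_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


conn = open_profile_store()
save_profile(conn, "user-20a", {"ROCKROLL": 12, "HIPHOP": 5})
```

Keying the table on the user identifier keeps one record per user, so a profile and its identifier can be retrieved together when candidate neighbors are served.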

Data and software may be provided to and extracted from the computer system via removable storage media such as hard disk, diskette, and CD ROM. For example, data values generated using techniques described herein may be stored on storage media. The data values may then be retrieved from the media by the CPU and utilized to recommend one of a plurality of items in response to a user's query.

Alternatively, computer software useful for performing computations related to enabling recommendations and community by massively-distributed nearest-neighbor searching may be stored on storage media. Such computer software may be retrieved from the media for immediate execution by the CPU or by processors included in one or more peripherals. The CPU may retrieve the computer software and subsequently store the software in RAM or ROM for later execution.

User input to the computer system may be provided by a number of devices. For example, a keyboard and a mouse are typically coupled to the bus by a controller. The computer system typically also includes a communications adapter which allows the system to be interconnected to a local area network (LAN) or a wide area network (WAN). Thus, data and computer program software can be transferred to and from the computer system via the adapter, bus and network.

Claims

CLAIMS

1. A networked computer system for supplying recommendations and taste-based community to a target user, comprising: networked means for providing representations of nearest neighbor candidate taste profiles and associated user identifiers in an order such that said nearest neighbor candidate taste profiles tend to be at least as similar to a taste profile of the target user according to a predetermined similarity metric as are subsequently retrieved ones of said nearest neighbor candidate taste profiles,

means to receive said representations of nearest neighbor candidate taste profiles and associated user identifiers on at least one neighbor-finding user node,

said neighbor-finding user nodes each having at least one similarity metric calculator calculating said predetermined similarity metric based upon said representations of nearest neighbor candidate taste,

at least one selector residing on at least one of said neighbor-finding user nodes using the output of said at least one similarity metric calculator for building a list representing the nearest-neighbor users,

said list representing said nearest-neighbor users providing access to associated ones of said candidate profiles,

a nearest-neighbor based recommender which uses said associated ones of said candidate profiles to recommend items,

a display for viewing identifiers of recommended items,

a display for viewing identifiers of a plurality of nearest neighbor users,

means to select at least one of said nearest neighbor users from said display of identifiers of a plurality of nearest neighbor users,

a display of information relating to at least one of the items in said nearest neighbor user's collection,

whereby massively distributed processing is harnessed in a bandwidth-conserving way for finding the best neighbors out of the entire population of users, and the same neighborhood is leveraged to provide recommendations as well as highly focused taste-based community for sharing the enjoyment of items including recommended items.

2. The networked computer system of claim 1, further including means to facilitate communication with at least said nearest neighbor users where the type of communication comprises at least one selected from the group consisting of online chat, email, online discussion boards, voice, and video.

3. A networked computer system for supplying recommendations and taste-based community to a target user, comprising:

an ordered plurality of nearest neighbor candidate taste profiles and associated user identifiers such that said nearest neighbor candidate taste profiles tend to be at least as similar to a taste profile of the target user according to a predetermined similarity metric as are subsequently positioned ones of said nearest neighbor candidate taste profiles,

networked means to receive said nearest neighbor candidate taste profiles and associated user identifiers on at least one neighbor-finding user node,

said neighbor-finding user nodes each having at least one similarity metric calculator calculating said predetermined similarity metric,

a nearest-neighbor based recommender which uses said associated ones of said candidate profiles to recommend items,

a display for viewing identifiers of recommended items,

a display for viewing identifiers of a plurality of nearest neighbor users,

means to select at least one of said nearest neighbor users from said display of identifiers of a plurality of nearest neighbor users,

4. The networked computer system of claim 1, further including a single downloadable file that contains software that executes all necessary non-server computer instructions.