Recommending (from) Cassandra
By Sean Owen
26 Jul 2011
Category: Technical Articles
If you're reading this, then you are probably either using Apache Cassandra (http://cassandra.apache.org) already, or are wondering why you should care. Maybe you, or the hip startup in the office next door, have deployed a Cassandra cluster to store data for a new large-scale application. You marvel at the simplicity of its key-value store compared to a relational database; you thrill at how easily it scales. Looking at this shiny new tool, you may be asking yourself, "what else can *I* do with it?"
Cassandra and other so-called "NoSQL" data stores have ushered in an era of cheap access to big data. These platforms practically dare you to find more data to collect and to throw at them. Storing data alone is boring, so in this article, I am going to suggest something useful to do with that data: learn, with Apache Mahout.
What's Apache Mahout?
Apache Mahout (http://mahout.apache.org) is a scalable machine learning library, also developed under the Apache Software Foundation's open source banner. It includes quite an assortment of machine learning algorithms, including collaborative filtering (recommenders), clustering, classification, frequent pattern set mining, and more.
Most algorithms are implemented on top of Apache Hadoop (http://hadoop.apache.org), a MapReduce implementation and batch-oriented distributed data processing framework. Your data in Cassandra can already serve as input to Hadoop-based algorithms like those in Mahout. Combining these three tools is both powerful and complex, and could form the basis of an entire book.
Here, in contrast, I'd like to present a much simpler way to experiment with Mahout machine learning on top of your Cassandra-based data, by showing off a simple recommender engine.
Think Recommender Engine
By now, most people have met a recommender engine. Amazon famously used this sort of machine learning to guess what other books and music you might buy based on past purchases and ratings. These techniques are not at all specific to customers or books or CDs; they merely infer new associations from past, existing associations. Recommender engines could be used to recommend people to books (that is: which customer is most likely to buy this book?), or people to people (think of dating sites), or anything else you can imagine.
Chances are that your business or application already records some associations between entities. Do you record which ads are most clicked for each type of news story on your news web site? Do you track help page views, and which page seems to be viewed next for each page? These are all examples of associations between things, associations that a recommender can work with. These are also examples of valuable associations that you may want to learn proactively.
It's easier here to use the terms "item" and "user" to talk about recommenders, since they're usually the things being recommended and recommended to, with the understanding that they can be any type of thing, really.
Getting Quick and Dirty with Mahout and Cassandra
Not all of Mahout is based on Hadoop. Parts of Mahout are plain Java, intended for simple real-time applications rather than for computing machine learning results offline in long-running batch processes. In fact, much of the support for collaborative filtering and recommender engines is of just this kind.
These implementations are simple and speedy; the catch is that they don't scale to very large data sets. Most recommender engine algorithms are quite data intensive, and need frequent and random access to the associations ("preferences" in Mahout). As such they need data in memory and so hit a wall when available memory is exceeded. However, these algorithms are ideal for smaller data sets (millions of data points) -- and for prototyping and experimentation.
A recommender engine is available in just 10 lines of Java code. If you've downloaded the latest Mahout 0.6 snapshot from its Subversion repository (https://cwiki.apache.org/confluence/display/MAHOUT/Version+Control), compiled it, and added its "integration", "core" and "math" JAR files to your project, then this simple Java program will create and invoke the recommender engine:
import org.apache.mahout.cf.taste.impl.model.cassandra.CassandraDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderDemo {
  public static void main(String[] args) throws Exception {
    // Connect to Cassandra on localhost:9160 and read from the "recommender" keyspace
    CassandraDataModel dataModel = new CassandraDataModel("localhost", 9160, "recommender");
    try {
      // Measure how alike two users are by the correlation of their preferences
      UserSimilarity similarity =
          new PearsonCorrelationSimilarity(dataModel);
      // Consider only the 10 most similar users when recommending
      NearestNUserNeighborhood neighborhood =
          new NearestNUserNeighborhood(10, similarity, dataModel);
      Recommender recommender =
          new GenericUserBasedRecommender(dataModel, neighborhood, similarity);
      // Print up to 10 recommendations for each of three users
      System.out.println(recommender.recommend(101, 10));
      System.out.println(recommender.recommend(102, 10));
      System.out.println(recommender.recommend(103, 10));
    } finally {
      // Release the connection to the cluster
      dataModel.close();
    }
  }
}
This is an example of a "user-based" recommender algorithm, one of many supported by Mahout. The details aren't important for our purposes here; more detail is available on the project wiki at https://cwiki.apache.org/confluence/display/MAHOUT/Recommender+Documentation. This code simply computes a list of up to 10 recommended items for each of user IDs 101, 102 and 103.
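To give a rough feel for what a user-based recommender does, here is a minimal plain-Java sketch that does not use Mahout at all. It is not Mahout's implementation: the class UserBasedSketch, its methods and the sample ratings are all hypothetical, the similarity-weighted average is just one common textbook variant, and for brevity it uses every positively correlated user rather than only the nearest N.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not how Mahout implements its recommenders.
class UserBasedSketch {

    // Pearson correlation over the items that both users have rated.
    // Returns NaN when fewer than two items overlap or one side has no variance.
    static double pearson(Map<Long, Double> a, Map<Long, Double> b) {
        Set<Long> common = new HashSet<>(a.keySet());
        common.retainAll(b.keySet());
        int n = common.size();
        if (n < 2) {
            return Double.NaN;
        }
        double sumA = 0, sumB = 0;
        for (Long item : common) {
            sumA += a.get(item);
            sumB += b.get(item);
        }
        double meanA = sumA / n, meanB = sumB / n;
        double cov = 0, varA = 0, varB = 0;
        for (Long item : common) {
            double da = a.get(item) - meanA;
            double db = b.get(item) - meanB;
            cov += da * db;
            varA += da * da;
            varB += db * db;
        }
        if (varA == 0 || varB == 0) {
            return Double.NaN;
        }
        return cov / Math.sqrt(varA * varB);
    }

    // Estimates a user's rating for an item as the similarity-weighted average
    // of the ratings given by positively correlated users who rated that item.
    static double estimate(Map<Long, Map<Long, Double>> ratings, long user, long item) {
        Map<Long, Double> own = ratings.get(user);
        if (own == null) {
            return Double.NaN;
        }
        double weighted = 0, totalWeight = 0;
        for (Map.Entry<Long, Map<Long, Double>> other : ratings.entrySet()) {
            if (other.getKey() == user) {
                continue; // skip the user we are recommending for
            }
            Double rating = other.getValue().get(item);
            if (rating == null) {
                continue; // this user never rated the item
            }
            double sim = pearson(own, other.getValue());
            if (Double.isNaN(sim) || sim <= 0) {
                continue; // ignore dissimilar or incomparable users
            }
            weighted += sim * rating;
            totalWeight += sim;
        }
        return totalWeight == 0 ? Double.NaN : weighted / totalWeight;
    }

    static void rate(Map<Long, Map<Long, Double>> ratings, long user, long item, double value) {
        ratings.computeIfAbsent(user, k -> new HashMap<>()).put(item, value);
    }

    // Made-up data: user 2 agrees with user 1, user 3 mostly disagrees.
    static Map<Long, Map<Long, Double>> sampleRatings() {
        Map<Long, Map<Long, Double>> ratings = new HashMap<>();
        rate(ratings, 1, 10, 5); rate(ratings, 1, 11, 3); rate(ratings, 1, 12, 4);
        rate(ratings, 2, 10, 5); rate(ratings, 2, 11, 3); rate(ratings, 2, 12, 4);
        rate(ratings, 2, 13, 5);
        rate(ratings, 3, 10, 1); rate(ratings, 3, 11, 5); rate(ratings, 3, 12, 2);
        rate(ratings, 3, 13, 1);
        return ratings;
    }

    public static void main(String[] args) {
        // Estimated rating of user 1 for item 13, which user 1 has not rated
        System.out.println(estimate(sampleRatings(), 1, 13));
    }
}
```

In this made-up data, only user 2 correlates positively with user 1, so the estimate for item 13 follows user 2's rating and user 3's opinion is ignored.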
Recommender Data in Cassandra
Who is user 101? That depends on what "user" number 101 represents in your data. And of course, this only works if your data is already present in Cassandra. The example connects to a cluster on the local machine, port 9160, which stores recommender-related data in a keyspace called "recommender".
The keyspace contains column families "users" and "items" whose rows are keyed by user IDs and item IDs, respectively. These are "wide rows": a user's row contains a column for every item associated with that user, and an item's row contains a column for every user associated with that item. The keyspace also contains "userIDs" and "itemIDs" column families listing all user and item IDs.
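One way to picture this layout is with ordinary Java collections. The sketch below is only a mental model and does not talk to Cassandra: the class KeyspaceLayoutSketch is hypothetical, each column family is approximated as a sorted map from row key to columns, and storing the preference value in both "users" and "items" is an assumption made for simplicity.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory model of the four column families; no Cassandra involved.
class KeyspaceLayoutSketch {
    // "users": row key = user ID, one column per item associated with that user
    final Map<Long, Map<Long, Float>> users = new TreeMap<>();
    // "items": row key = item ID, one column per user associated with that item
    // (assumption: the value is stored here too)
    final Map<Long, Map<Long, Float>> items = new TreeMap<>();
    // "userIDs" and "itemIDs": listings of every known ID
    final Set<Long> userIDs = new TreeSet<>();
    final Set<Long> itemIDs = new TreeSet<>();

    // Recording one association touches all four structures.
    void setPreference(long userID, long itemID, float value) {
        users.computeIfAbsent(userID, k -> new TreeMap<>()).put(itemID, value);
        items.computeIfAbsent(itemID, k -> new TreeMap<>()).put(userID, value);
        userIDs.add(userID);
        itemIDs.add(itemID);
    }

    public static void main(String[] args) {
        KeyspaceLayoutSketch keyspace = new KeyspaceLayoutSketch();
        keyspace.setPreference(101L, 1131L, 5.0f);
        keyspace.setPreference(101L, 858L, 4.0f);
        keyspace.setPreference(102L, 858L, 3.0f);
        System.out.println("users:   " + keyspace.users);
        System.out.println("items:   " + keyspace.items);
        System.out.println("userIDs: " + keyspace.userIDs);
        System.out.println("itemIDs: " + keyspace.itemIDs);
    }
}
```

Keeping both a per-user and a per-item view duplicates data, but it lets a recommender look up "all items for this user" and "all users for this item" with a single row read each.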
Mahout code can be used to populate the Cassandra keyspace appropriately from a Java program; here is a code snippet which would load "user,item,rating" data points from a CSV file into the cluster, for example:
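// Requires java.io.File and org.apache.mahout.common.iterator.FileLineIterable;
// dataModel is the CassandraDataModel created in the program above
for (String line : new FileLineIterable(new File("data.csv"))) {
  // Each line has the form: user,item,rating
  String[] tokens = line.split(",");
  long userID = Long.parseLong(tokens[0]);
  long itemID = Long.parseLong(tokens[1]);
  float value = Float.parseFloat(tokens[2]);
  // Write the association into the Cassandra keyspace
  dataModel.setPreference(userID, itemID, value);
}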
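If you would like to obtain some sample data to play with, try the GroupLens 10 million rating data set from http://grouplens.org/node/73#attachments ; see its ratings.dat file. Its fields are separated by "::", so it has to be converted to comma-separated format first. The Unix command "tr -s ':' ',' < ratings.dat > ratings.csv" accomplishes this; remember to point the loading snippet at ratings.csv rather than data.csv. Each converted line also carries a trailing timestamp field, which the snippet simply ignores.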
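If you would rather do the conversion in Java, the following sketch does the same job for this file. The class name RatingsToCsv and the sample line in main are illustrative assumptions; the only thing taken from the text above is the idea of collapsing each run of ':' into a single ','.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper that converts "::"-delimited lines to comma-separated lines.
class RatingsToCsv {

    // Replaces every run of one or more ':' characters with a single ','.
    static String toCsvLine(String line) {
        return line.replaceAll(":+", ",");
    }

    // Streams the input file line by line so the whole file is never held in memory.
    static void convert(Path in, Path out) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(toCsvLine(line));
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 2) {
            // e.g. java RatingsToCsv.java ratings.dat ratings.csv
            convert(Paths.get(args[0]), Paths.get(args[1]));
        } else {
            // Illustrative line in user::item::rating::timestamp form
            System.out.println(toCsvLine("1::122::5::838985046")); // prints 1,122,5,838985046
        }
    }
}
```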
Interpreting Recommendations
With data loaded, running the first code snippet above ought to result in output like this:
[RecommendedItem[item:1131, value:5.0], RecommendedItem[item:858, value:5.0], RecommendedItem[item:1784, value:5.0], RecommendedItem[item:1234, value:4.8333335], ...]
...
This is actual output from the code above, when used with the GroupLens 10 million rating data set. In this case, users are actual people, items are movies, and the rating values found in the data set are movie ratings on a five-star scale. The first line of output above shows the result of recommending for user 101 in this data set. It lists the IDs of the recommended movies from best to worst; the accompanying score is an estimate of the rating user 101 would give each movie, on the same scale.
Of course, your output will be different with different data. It will mean something different if the input represents something else, like ads and web pages. But it will have a similar interpretation: the output shows new, strong associations that have not yet been observed.
Pushing Cassandra Performance to the Limit
You will find that this process takes tens of seconds or more to produce the very first recommendation. As mentioned, these algorithms are very data intensive and usually expect to hold all data in memory rather than on disk. Fetching the data from Cassandra on every access is just too slow, even for Cassandra (or just about any representation that isn't in memory). CassandraDataModel therefore caches much of what it reads in memory, and speeds up quickly as the cache fills. You should find that subsequent recommendations complete in a fraction of a second.
Even so, this implementation will challenge a Cassandra cluster, issuing thousands of requests per second at the outset and fewer as the cache relieves the load. You will notice that some requests still take several seconds. These spikes in recommendation time may be problematic if your application depends on a relatively quick answer.
Some of these "hiccups" originate in Cassandra itself, which can occasionally slow down as it runs internal processes. Here, for example, is diagnostic logging output from the Hector library, which is used internally to access Cassandra:
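The general idea of caching reads in memory can be sketched in a few lines of plain Java. This is not how CassandraDataModel is implemented; ReadThroughCache is a hypothetical class, the loader function merely stands in for a slow Cassandra read, and the least-recently-used eviction policy is an assumption chosen to keep the example short.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical read-through cache with least-recently-used eviction.
class ReadThroughCache<K, V> {
    private final Function<K, V> loader;      // stands in for a slow read from the data store
    private final LinkedHashMap<K, V> entries;
    int loads = 0;                            // number of times the loader was actually called

    ReadThroughCache(int maxEntries, Function<K, V> loader) {
        this.loader = loader;
        // accessOrder = true keeps the least recently used entry first
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Returns the cached value, loading and caching it on a miss.
    synchronized V get(K key) {
        V value = entries.get(key);
        if (value == null) {
            value = loader.apply(key);
            loads++;
            entries.put(key, value);
        }
        return value;
    }

    public static void main(String[] args) {
        ReadThroughCache<Long, String> cache =
            new ReadThroughCache<>(1000, userID -> "preferences of user " + userID);
        cache.get(101L); // miss: goes to the loader
        cache.get(101L); // hit: served from memory
        System.out.println("loader calls: " + cache.loads);
    }
}
```

The first request for each key pays the full cost of the underlying read, which is why early recommendations are slow and later ones are fast.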
hector.TimingLogger: Tag            Avg(ms)  Min   Max     Std Dev  95th  Count
hector.TimingLogger: READ.success_  0.16     0.10  240.40  1.71     0.24  324079
Cassandra performs exceptionally well here, answering 95% of queries in under a quarter of a millisecond. However, the slowest query took about 240 milliseconds, roughly 1,000 times longer. A brief slowdown that causes this kind of latency spike can result in recommendations taking seconds, not milliseconds. This highlights the importance of tuning Cassandra if you are interested in using this recommender in a real application.
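To see how a healthy 95th percentile can coexist with a painful maximum, consider the small sketch below. The class LatencySummary and its sample latencies are made up for illustration, and the nearest-rank percentile definition is an assumption; it is not necessarily how Hector computes its statistics.

```java
import java.util.Arrays;

// Hypothetical latency summary; sample numbers are invented for illustration.
class LatencySummary {

    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    static double percentile(double[] samples, double p) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    static double max(double[] samples) {
        return Arrays.stream(samples).max().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        // 19 fast reads and a single slow one, in milliseconds
        double[] latencies = new double[20];
        Arrays.fill(latencies, 0.2);
        latencies[7] = 240.4;

        System.out.println("95th percentile (ms): " + percentile(latencies, 95));
        System.out.println("max (ms):             " + max(latencies));

        // Ratio of the maximum to the 95th percentile reported in the log above
        System.out.println("log ratio max/95th:   " + Math.round(240.40 / 0.24));
    }
}
```

A percentile says nothing about the requests above it, so a recommendation that happens to hit one of those rare slow reads still pays the full penalty.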
Next Steps
This has been an exceptionally brief introduction to one aspect of Apache Mahout. Even so, in this short article we've seen how to create and run a completely functional recommender engine using Mahout and Cassandra.
Interested readers can get to know more about Mahout by checking out the web site at http://mahout.apache.org and joining the user@mahout.apache.org mailing list. The project offers much more than recommenders, and much more even within recommenders. Mahout also integrates heavily with Hadoop to provide solutions at much larger scales; these processes can also use your Cassandra data, though in a different way.