Latest Tweet

Here at Signpost, we believe in track everything. We keep every request, every event, every log, every exception. We use this data for user tracking, analytics and business intelligence, and code health / bug triage.

Original System

We started out with MySQL. Requests were logged into a request table, except we handled the 1-Many relationship of request parameters in a single serialized column. Events were tracked in two tables: event_tracking and event_tracking_property, where there are many properties per event. Logs and exceptions were sent to log files, where they were rotated and hardly looked at for eternity. We ran with this setup in production for the first few months of Signpost’s lifetime and it worked out just fine.

Here’s the basic idea:

create table request (
  request_tracking_id int not null auto_increment,
  visitor_key varchar(64) not null,
  session_key varchar(36) not null,
  ip varchar(64) null,
  user_agent varchar(128) not null,
  uri varchar(255) not null,
  referer varchar(255) null,
  params text,
  email_user_id int null,
  user_id int null,
  response_code int not null,
  response_time int not null,
  timestamp timestamp not null,
  primary key (request_tracking_id)
);

create table event_tracking (
  event_tracking_id int not null auto_increment,
  visitor_key varchar(64) null,
  session_key varchar(36) null,
  event_name varchar(32) not null,
  timestamp timestamp NOT NULL DEFAULT '0000-00-00 00:00:00'
  primary key (event_tracking_id)
);

create table event_tracking_property (
  event_tracking_property_id int not null auto_increment,
  event_tracking_id int not null,
  event_key varchar(32) not null,
  event_value text,
  primary key (event_tracking_property_id),
);

Problems

Capturing and tracking everything is great. With some trouble we can find which pages are working and which are not, where users are coming from (we use Google Analytics as well for this), what they are doing on the site, when they run into problems, and when our system fails to operate as expected. The only problem with our approach, was the crazy amount of work needed to access and analyze this data for simple things. For example:

How many clicks did deal 212001 get from a session starting with an email click?

select count(distinct et.event_tracking_id)
  from event_tracking et, event_tracking_property etp1, event_tracking_property etp2
 where et.event_tracking_id = etp1.event_tracking_id
   and etp1.event_tracking_id = etp2.event_tracking_id
   and et.event_name = 'deal-click'
   and etp1.event_key = 'deal_at_location_id'
   and etp1.event_value = '212001'
   and etp2.event_key = 'session_ca'
   and etp2.event_value = 'eml';

In our new system, this would end up as:

db.event.count({
  event_name: 'deal-click',
  'event_properties.deal_at_location_id': 212001,
  'event_properties.session_ca': 'eml'
});

This is a simple case, but we can already see the thought that needs to go into the SQL vs the MongoDB query. The speed of both queries are comparable with the correct indexes. The request tracking design also made analysis on request parameters difficult. The parameters of a request were serialized into a query string format:

raw=false&feedId=dealActivity&user-opts=username,avatar&count=100&no-cache=1306867013431&feedType=comment&userId=&targetUserId=&deal-opts=id,location-name,title,category&activity-opts=id,user,type,rawtime,commentcontent&types=commenter&dealId=243844

This made it painfully slow to find requests with a certain parameter. Often we would just write some code in Java to parse the parameters and do the analysis. Not very efficient.

Desires

We wanted to ask more interesting questions in the future. Here are some of our desires:

  • Development wanted a way to see the session and requests that resulted in an exception
  • Community Management wanted a better way to help users when they had issues on the site (seeing their requests and pages they visited)
  • Business wanted both real-time and cumulative statistics, like the top 5 deals for today, or the deal with the most clicks in the past month
  • Business wanted a way to see if users were gaming the referral bonuses

This is just a short list, but it is indicative of the desires of most sites I can think of.

Considerations

I wanted the system to be fast, but also flexible. I knew that we would have new questions all the time and I didn’t want to re-compile some Java code or write long SQL queries every time. I also wanted a fast way to see aggregated data that I could graph over long periods of time. I had worked with MongoDB before and had built some Java tooling to help create applications quickly (GuiceyMongo). I knew that MongoDB was fast, and could handle a ton of writes per second (we’re averaging 4,000+ per second on a single EC2 large instance also running a Tomcat server). I also knew that I could execute random Javascript as part of my queries, which gives me that flexibility I wanted. Sounds like a plan!

New System

The system that I came up with contains three collections: request, event, event.rollup. The simplest is request.

{
  "_id" : ObjectId("4de53548b09a6541874eb641"),
  "method" : "GET",
  "response_code" : 200,
  "parameters" : {
    "raw" : "false",
    "feedId" : "dealActivity",
    "user-opts" : "username,avatar",
    "count" : "100",
    "no-cache" : "1306867013431",
    "feedType" : "comment",
    "userId" : "",
    "targetUserId" : "",
    "deal-opts" : "id,location-name,title,category",
    "activity-opts" : "id,user,type,rawtime,commentcontent",
    "types" : "commenter",
    "dealId" : "243844"
  },
  "user_agent" : "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.2.17) Gecko/20110420 Firefox/3.6.17 (.NET CLR 3.5.30729)",
  "referer" : "http://www.signpost.com/deals/Boston-MA/Cafe-Gigu/-5-for-10-Worth-of-Food-at-Cafe-Gigu-/243844",
  "visitor_key" : "6f6d7e229f2a06e6a0dcdac42b09b26d17dbfac2318be53bea72a996204e3731",
  "uri" : "/api/recent-activity",
  "user_id" : null,
  "session_key" : "008F75D5B7515B202DC1A610F9D86539",
  "ip" : "123.28.156.46",
  "response_time" : 4,
  "timestamp" : ISODate("2011-05-31T18:36:56.177Z")
}

This is practically the same as the MySQL table, but with an embedded object for the query parameters. I have indexes on fields that we search often, like visitor_id, session_id, and uri. An example of a common query is to find all the visitor_id and session_ids for a user, along with the time range those ids were active. This allows me to see when and how long a user is on the site, and then dig deeper into their requests.

Visitor ID breakdown

Session ID breakdown

Request breakdown

The queries for this data are pretty simple. For example, the first image comes from a map-reduce:

db.request.mapReduce(function() {
  emit(this.visitor_key, {start: this.timestamp, end: this.timestamp});
}, function(key, values) {
  return {start: values[0].start, end: values[values.length - 1].end};
}, {
  query: {user_id: 12345},
  sort: {timestamp: 1}
});

Now let’s get into the events. I have two collections here, the event and event.rollup collections. The event collection is once again, very similar to the MySQL table that I started with, except with embedded objects and time window information for grouping the events.

{
  "_id" : ObjectId("4de53c85c7d8327f56284f3a"),
  "visitor_key" : "50f46d60e00efe0a17c42c7fffdc68bda530572d1f83972a83d6c61d684ba3ce",
  "event_name" : "deal-click",
  "event_properties" : {
    "ca" : "dl",
    "deal_at_location_id" : 98303,
    "session_ca" : "dl",
    "ct" : "?",
    "cr" : "?",
    "node" : "web2",
    "bot" : "false",
    "user_id" : 163181,
    "guest" : "true",
    "session_cr" : "?",
    "session_ct" : "?"
  },
  "session_key" : "1C9FA142A59465B745ECE2C1ACB15E62",
  "timestamp" : ISODate("2011-05-31T19:07:48.533Z"),
  "window" : {
    "minute" : ISODate("2011-05-31T19:07:00Z"),
    "hour" : ISODate("2011-05-31T19:00:00Z"),
    "day" : ISODate("2011-05-31T00:00:00Z"),
    "week" : ISODate("2011-05-30T00:00:00Z"),
    "month" : ISODate("2011-05-01T00:00:00Z"),
    "year" : ISODate("2011-01-01T00:00:00Z")
  }
}

This format is significantly easier to query, and we have pulled a whole lot of ad-hoc data just from querying these events like the query from the beginning:

db.event.count({
  event_name: 'deal-click',
  'event_properties.deal_at_location_id': 212001,
  'event_properties.session_ca': 'eml'
});

I’ve written email tracking, exception tracking, email and session clickthrough tracking, and a bunch of other little close-to-realtime reports from this collection that we host on an internal Business Intelligence server. Iterating has been very quick, and I can usually answer most questions in a few minutes, if not much less. A quick sidenote, we also have a report table that contains the Javascript code to run all of our reports. To add a new report, write some Javascript and store it in that collection. Our internal BI website knows how to read that collection, present the relevant options to the user, and run the reports, resulting in pretty graphs, like this:

Sample BI graph

So what is this and how is it done? This is where the event.rollup collection comes into play. It gives us the ability to do long-term analysis across a large aggregate of events, very very quickly. The graph above is actually the click traffic of the top 5 deals for May, summed by hour. The query took less than 2 seconds, and the graph took another 3 to render. That’s pretty fast considering the top line is actually over 10,000 clicks (each an individual event). Here’s how the rollup events look:

{
  "_id" : ObjectId("4de54041c7d8327f56285563"),
  "c" : 1,
  "e" : "deal-click",
  "k" : [{
    "k" : "deal_at_location_id",
    "v" : 245931
  }],
  "s" : "month",
  "w" : ISODate("2011-05-01T00:00:00Z")
}

This is a month rollup started May 1, 2011 for the event named “deal-click” with a key/value pair of deal_at_location_id = 245931 and a count of 1. There are over 12 million of these objects in the database currently, and they get updated every minute without raising the cpu usage of mongod past 5%. Pretty neat! We can filter, sort, and group these by any number of key/value pairs, since MongoDB can index on objects within arrays. This allows powerful queries like the graph above. I’ll show you the basic code behind that query:

function sum_top_n_by_range(from, to, event, property, count) {
  /* create collection that looks like
  {
    _id: property_value,
    value: count
  }
  */
  var r = db.event.rollup.mapReduce(function () {
    emit(this.k[0].v, this.c);
  }, function (k, v) {
    var s = 0;
    for (var i in v) {
      s += v[i];
    }
    return s;
  }, {
    query: {
      e: event,
      s: 'day',
      w: {$gte: from, $lte: to},
      k: {$size: 1},
      'k.k': property
    }
  });

  /* take the top count property_values */
  var keys = r.find().sort({value: -1}).limit(count).map(function (o) {return o._id;});
  r.drop();

  var cursor = db.event.rollup.find({
    e: event,
    s: 'hour',
    w: {$gte:from, $lte:to},
    k: {$size: 1},
    'k.k': property,
    'k.v': {$in: keys}
  }).sort({w: 1});

  var result = {};
  cursor.forEach(function(dataPoint) {
    /* organize data points in the result object as needed by your output */
  });
  
  return result;
}

It’s a lot of code, but not all that crazy, right?

Thoughts

I’m very pleased with our decision to move this system to MongoDB. The nature of event and request tracking does not require pre-conceived columns and hard data constraints, which saves a lot of development and planning time. The ability to embed objects inside documents has made querying complex data and conceptualizing the records more straightforward. A large win was the ability to compute rollups in near real-time because of the upsert speed of MongoDB. I can create far more rollups than events if we choose to track more complex data and not worry about bogging down the database with writes. Currently we’re clocking between 4,000 and 7,000 updates per second on a full rebuild, and around 500 per second during peak website usage. If this becomes a pain point, sharding out the rollups would be an easy solution to increase our write throughput.

MongoNYC 2011

I’ll be speaking about this and showing more of the way Signpost uses MongoDB at the MongoNYC 2011 conference next week. If you don’t have a ticket yet, why not get one?

Slides & Video

Slides: http://bit.ly/jvwev4

blog comments powered by Disqus