Monday, January 01, 2007

Spending more time with emacs

For a variety of reasons, it's been a while since I wrote much code. But last Friday, a few ideas I've had rattling around came together, along with some free time, and in a very pleasant heads-down frenzy I cranked out an interesting little tool.

At work, we don't really have a good way of monitoring or recording HTTP traffic. Web server logs aren't turned on and are inconvenient to collect in our current configuration, but even beyond that, some of the information we would need to do anything useful is sometimes stuck in the POST bodies. For other operational reasons, it's even more inconvenient to put any recording tools in the HTTP stack right now. I would expect our load balancer to be useful for this sort of thing, since it's already got its hands pretty deep in the HTTP conversations, but it isn't.

From time to time, therefore, I've looked around for an open-source tool which would let me reassemble and analyze HTTP traffic as seen on a span port or something. Unfortunately, everything I've found that does this kind of TCP / HTTP reassembly and analysis is designed for troubleshooting or intrusion detection, like Ethereal or Snort.

I've been meaning to get a better handle on Ruby, too. I think it's my current favorite programming language, but Python is still my trusty standby tool for this kind of thing, the one I can reach for and use without having to think too hard or look everything up in the language reference. So after a quick search revealed Ruby bindings for libpcap, I decided to sit down and write my own TCP and HTTP reassembly tool.

Once I started, it came together surprisingly quickly. The layering was pretty obvious: packets come in from libpcap and get fed to the appropriate TCP stream; the TCP stream layer handles reassembly (and punts, for the moment, on out-of-order packets); data is fed to the HTTP layer, where the whole HTTP stream is buffered in memory and lightly parsed to separate the first line, the headers, and the body. Once both halves of the connection are closed, both streams are handed off to an analysis function.
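To make that layering concrete, here is a minimal Python sketch of the two middle layers. It is not the Ruby tool itself: the names HalfStream, feed, and split_http_message are invented for illustration, and it assumes a single HTTP message per direction (no keep-alive or pipelining). Like the first-pass implementation described above, it simply refuses out-of-order segments.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class HalfStream:
    """One direction of a TCP connection (hypothetical name); in-order data only."""
    next_seq: Optional[int] = None              # sequence number expected next
    data: bytearray = field(default_factory=bytearray)
    closed: bool = False

    def feed(self, seq: int, payload: bytes = b"",
             syn: bool = False, fin: bool = False) -> bool:
        """Append a segment's payload; return False if the segment was skipped."""
        if syn:
            # A SYN consumes one sequence number; data starts right after it.
            self.next_seq = (seq + 1) % 2**32
            return True
        if self.next_seq is None or seq != self.next_seq:
            return False                        # out of order or mid-stream: punt
        self.data += payload
        # A FIN also consumes one sequence number.
        self.next_seq = (seq + len(payload) + (1 if fin else 0)) % 2**32
        if fin:
            self.closed = True
        return True


def split_http_message(raw: bytes) -> Tuple[str, Dict[str, str], bytes]:
    """Lightly parse one buffered message into first line, headers, and body."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return lines[0], headers, body
```

Once both HalfStream objects for a connection report closed, their data buffers can be passed through split_http_message and handed to whatever analysis function comes next.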

I haven't gotten my hands this dirty with TCP before, and I discovered a few interesting things. First of all, I didn't understand the push flag correctly before; I thought it was set whenever there was data, but apparently it tells the receiver to flush its buffers to the application. Interesting. Second, I didn't have to bother with acknowledgements at all. Third, I haven't had to set up any TCP timers yet. Fourth, Microsoft stuff has a weird habit of sending one byte of data (always just one) along with an ACK. And lastly, my first-pass caveman implementation gets pretty far without even attempting to deal with things like reassembling out-of-order packets, data arriving after the FIN, etc.
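For anyone who wants to look at the flags themselves, here is a small Python sketch that pulls them out of a raw TCP header. The function name and the returned dictionary are made up for illustration; the header layout and flag bit values are the standard ones. It shows that PSH is a separate bit from "this segment carries data": a segment can have a payload without PSH set.

```python
import struct

# Standard TCP flag bits, found in the 14th byte of the header.
FLAG_BITS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04,
             "PSH": 0x08, "ACK": 0x10, "URG": 0x20}


def parse_tcp_segment(segment: bytes) -> dict:
    """Decode the fixed 20-byte TCP header and slice off the payload."""
    (src, dst, seq, ack, offset_byte, flag_byte,
     _window, _checksum, _urgent) = struct.unpack("!HHIIBBHHH", segment[:20])
    header_len = (offset_byte >> 4) * 4         # data offset is in 32-bit words
    flags = {name for name, bit in FLAG_BITS.items() if flag_byte & bit}
    return {"src_port": src, "dst_port": dst, "seq": seq, "ack": ack,
            "flags": flags, "payload": segment[header_len:]}
```

A reassembler that ignores acknowledgements, as described above, only needs seq, the SYN and FIN flags, and the payload from this result.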

I still have to refine the sorts of things you'd expect: cleaning out old streams, perfecting my capture setup so I'm not dropping any packets, and tackling reassembly. But the main thing to sort out now is actually recording the data so it can be analyzed in different ways. Patrick Logan wrote a very interesting blog post a long time ago, arguing that the only patterns you really need for persistent data are tuple spaces, versioned document trees, and star schema storage, i.e. (my interpolation) data warehouses. This idea has stuck with me ever since, particularly with respect to tuple spaces, since I've seen the deficiencies of regular SQL databases and of message queues for storing working data. I've kept an occasional reading interest in data warehouses, too. But there is a serious shortage of open-source implementations of those two concepts; JavaSpaces is the only tuple space technology I've heard of that sounds production-ready (and I am following Patrick's growing stream of posts on it with a keen interest), but as far as I can tell the star schema universe is owned by commercial databases with five-figure price tags. You can use a regular old database, of course, but star schemas are not what they're optimized for in any sense, and they certainly don't rate as simple or low-maintenance.

Nonetheless, I think that is exactly what I am going to do: record all my web traffic in SQL Server (wouldn't I like to have Oracle back again...). I'll try to do a decent star-schema design; it'll make a good exercise, if nothing else. We have some external feeds that will want web server log files, but I think I'll just write a little tool to synthesize those from the database. That way we won't lose any information, and I can record all the application-specific information I want for each request.
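As a rough idea of what such a design could look like, here is a toy star schema in Python, with SQLite standing in for SQL Server. Every table, column, and function name here is hypothetical, and a real design would have more dimensions (time, user agent, and so on). One fact table holds a row per request, two small dimension tables hold the repeated values, and a join turns the facts back into log-style lines. The HTTP version in the output is hardcoded purely for the example.

```python
import sqlite3

SCHEMA = """
CREATE TABLE dim_client  (client_id INTEGER PRIMARY KEY, ip TEXT UNIQUE);
CREATE TABLE dim_request (request_id INTEGER PRIMARY KEY,
                          method TEXT, path TEXT, UNIQUE (method, path));
CREATE TABLE fact_hit    (client_id  INTEGER REFERENCES dim_client,
                          request_id INTEGER REFERENCES dim_request,
                          ts TEXT, status INTEGER, bytes_sent INTEGER);
"""


def dim_key(db, table, key_col, **attrs):
    """Return the surrogate key for a dimension row, inserting it if new."""
    cols = ", ".join(attrs)
    marks = ", ".join("?" for _ in attrs)
    values = tuple(attrs.values())
    db.execute(f"INSERT OR IGNORE INTO {table} ({cols}) VALUES ({marks})", values)
    where = " AND ".join(f"{col} = ?" for col in attrs)
    return db.execute(f"SELECT {key_col} FROM {table} WHERE {where}",
                      values).fetchone()[0]


def record_hit(db, ip, method, path, ts, status, bytes_sent):
    """Store one request as a fact row pointing at its dimension rows."""
    client_id = dim_key(db, "dim_client", "client_id", ip=ip)
    request_id = dim_key(db, "dim_request", "request_id", method=method, path=path)
    db.execute("INSERT INTO fact_hit VALUES (?, ?, ?, ?, ?)",
               (client_id, request_id, ts, status, bytes_sent))


def log_lines(db):
    """Synthesize log-style lines by joining the facts to their dimensions."""
    rows = db.execute("""
        SELECT c.ip, f.ts, r.method, r.path, f.status, f.bytes_sent
        FROM fact_hit f
        JOIN dim_client c USING (client_id)
        JOIN dim_request r USING (request_id)
        ORDER BY f.ts""")
    return [f'{ip} - - [{ts}] "{method} {path} HTTP/1.1" {status} {size}'
            for ip, ts, method, path, status, size in rows]
```

Application-specific details pulled out of POST bodies would become extra columns on the fact table or extra dimensions, which is the part a plain web server log can't hold.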

Plus, the TCP and HTTP reassembly layers of this tool could easily and usefully be open-sourced. I know I would have liked to find such a thing, and HTTP certainly lends itself to out-of-line monitoring, so I have to think at least one other person somewhere would find it useful. I guess I have to name it first.

And regarding the title, I'm not happy at all with the XEmacs ruby-mode, or at least its default bindings. C-M-f and C-M-b really need to be forward-sexp and backward-sexp, not whatever-mark-defun. I'll start by trying to customize it the way I like it, but Dave keeps telling me I should take TextMate for a spin, and maybe I will.