Thursday, July 24, 2008

Success Story: Hadoop + NYT

Our speaker from the New York Times, Derek, is telling us to read the paper on Google's MapReduce; Apache Hadoop is the open source implementation of MapReduce and of the Google File System (GFS -- the storage layer Google uses for services like this blog).

MapReduce is a design pattern for getting large numbers of computers to paw through vast data stores: a "map" function emits intermediate {key : value} pairs, the framework sorts and groups them by intermediate key, and a "reduce" function collapses each group into a single output {key : value}, where the value might itself be a list -- of all pages containing a particular search term, for example.
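That pattern can be sketched in a few lines of plain Python -- a toy inverted index, not the actual Hadoop API, with made-up page names and contents:

```python
# Minimal sketch of the MapReduce pattern: map emits intermediate
# (term, page_id) pairs, the framework sorts by intermediate key,
# reduce collapses each group into {term: [pages...]}.
from itertools import groupby
from operator import itemgetter

def map_phase(pages):
    """Emit an intermediate (term, page_id) pair for every word."""
    for page_id, text in pages:
        for term in text.lower().split():
            yield (term, page_id)

def reduce_phase(pairs):
    """Group the sorted pairs by term; output term -> list of pages."""
    index = {}
    for term, group in groupby(sorted(pairs), key=itemgetter(0)):
        index[term] = sorted({page for _, page in group})
    return index

pages = [("p1", "fuller dome geometry"),
         ("p2", "dome patents"),
         ("p3", "geometry of the dome")]

index = reduce_phase(map_phase(pages))
print(index["dome"])  # every page containing "dome"
```

In real Hadoop the map and reduce calls run on different machines and the sort/group step (the "shuffle") happens in between, but the data flow is the same.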

The NYT started with a pay-as-you-go model for making its historical archives available, then decided to make at least the 1851 - 1922 portion free. Some libraries give free access to larger chunks, I'm supposing.

A given article is all over the map in storage, mirroring how newspapers scatter content across pages, mixed in with ads and so on, which makes this an ideal candidate for a MapReduce solution.

[image caption: excerpt re Bucky, NYT, April 5, 1963, pg. 33]
The project: copy the source data to Amazon's S3 service, then boot custom Linux Xen images with Hadoop using Amazon's EC2 service (one master, multiple slaves), and kick off the job request.

About 100 computers running simultaneously processed 4.3 TB of TIFF images in about 24 hours, for about $240 in computing costs (100 instances x 24 hours works out to roughly $0.10 per instance-hour) plus $650 for storage. A small mistake meant running the job twice -- not a big deal, given the low costs and fast turnaround.

A side benefit and spin-off of the project was the TimesMachine, able to serve full scanned pages of the NY Times, complete with the original ads and so on.

Hadoop uses commodity hardware and software to provide a framework for storing petabytes of data. It's implemented in Java and currently has about 15 committers, 3 of them for HBase.

HDFS and HBase store the data, with MapReduce used for processing. Facebook is a contributor and user, as is Yahoo, which runs Hadoop internally on multiple 2,000-node clusters.

There's lots of academic interest in this framework, with universities getting into it.

A key characteristic of Hadoop is separating metadata from data: the former (the namenode) scales vertically, the latter (the storage and I/O nodes) horizontally.
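A toy illustration of that split, with hypothetical file, block, and host names (not the real HDFS classes): the namenode holds only the file-to-block mapping and block locations in memory, while the datanodes hold the actual bytes.

```python
# Toy sketch of the metadata/data split. The namenode only needs RAM
# for "which blocks make up this file, and where does each one live?"
# The bytes themselves stay out on the storage nodes.

namenode = {  # file path -> ordered list of block ids
    "/archive/1851/page-001.tif": ["blk_1", "blk_2"],
    "/archive/1851/page-002.tif": ["blk_3"],
}
block_locations = {  # block id -> hosts holding a replica
    "blk_1": ["node-a", "node-b", "node-c"],
    "blk_2": ["node-b", "node-c", "node-d"],
    "blk_3": ["node-a", "node-d", "node-e"],
}

def locate(path):
    """The namenode's core lookup: which hosts serve which blocks?"""
    return [(blk, block_locations[blk]) for blk in namenode[path]]

print(locate("/archive/1851/page-001.tif"))
```

A client then talks to the listed hosts directly to fetch the block contents, so the namenode never sits in the data path.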

We're talking simple sequential files mostly, with an append operation but no random-access writes. This is not a relational database. The design keeps file blocks physically close to the client for reading and writing, and automatically takes care of replicating blocks to redundant locations.
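A minimal sketch of that append-only model, with a made-up 8-byte block size standing in for HDFS's much larger blocks (64 MB in this era): appends only ever extend the last block or start a new one; there is no seek-and-overwrite.

```python
BLOCK_SIZE = 8  # toy value; real HDFS blocks are 64 MB

class AppendOnlyFile:
    """A sequential file split into fixed-size blocks, append-only."""
    def __init__(self):
        self.blocks = [b""]

    def append(self, data: bytes):
        """Extend the last block; spill into a new block when full."""
        for byte in data:
            if len(self.blocks[-1]) == BLOCK_SIZE:
                self.blocks.append(b"")  # start a fresh block
            self.blocks[-1] += bytes([byte])

f = AppendOnlyFile()
f.append(b"hello world!")  # 12 bytes -> one full block plus a partial
print([len(b) for b in f.blocks])  # [8, 4]
```

Because writes only ever touch the tail, each finished block is immutable and can be replicated to other hosts without coordination.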

Storage units report on their blocks to the namenode, which runs completely in memory and takes appropriate action if blocks are over- or under-replicated. If a server fails, a common enough occurrence in large data centers, the surviving blocks get replicated back up to quota (like RAID, but across multiple hosts).
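The repair step can be sketched in a few lines, with hypothetical block and host names and a replication quota of 3: drop the failed host from every block's replica set, then copy any under-replicated block to a surviving host that doesn't already have it.

```python
QUOTA = 3  # target number of replicas per block

replicas = {  # block id -> set of hosts holding a copy
    "blk_1": {"node-a", "node-b", "node-c"},
    "blk_2": {"node-b", "node-c", "node-d"},
}
live_hosts = {"node-a", "node-b", "node-c", "node-d"}

def handle_failure(failed):
    """Forget the dead host, then re-replicate each block to quota."""
    live_hosts.discard(failed)
    for blk, hosts in replicas.items():
        hosts.discard(failed)
        while len(hosts) < QUOTA:
            candidates = live_hosts - hosts
            if not candidates:
                break  # nowhere left to copy a new replica
            hosts.add(min(candidates))  # pick any surviving host

handle_failure("node-c")
print(sorted(replicas["blk_1"]))  # ['node-a', 'node-b', 'node-d']
```

The real namenode also handles over-replication (deleting surplus copies) and spreads replicas across racks, but the quota-repair loop is the core idea.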

RPC is used for interprocess communication. Lots of improvements are on the drawing board or being implemented, many of them having to do with scaling.