MapReduce Sucks!
No, it doesn't. But The Database Column would have you believe it does.
A more fitting title for this might have been "What Are These Guys Smoking?"
Read the article here: MapReduce: A major step backwards.
Then take a deep breath and let out the big "Huh?"
My first reaction to this is simple. Your typical database is built on perfectly structured data. Where the data doesn't fit neatly into the structure (the schema), you transform it, oftentimes using an ETL tool of some sort (that's Extract, Transform, Load).
Search data is anything but perfectly structured. Google indexes a whole lot of different document formats: HTML, PDF, Word docs, Excel files, and plenty more. This stuff doesn't exactly map to a neatly defined database schema, does it?
MapReduce is a time-tested, proven, and infinitely (well...) scalable method for building a resultset from a very large dataset. The neatest part, perhaps, is that the data being mined can be in any number of different formats, stored on any media available. What does this translate into?
- Cheap grunt servers
- Cheap storage
- Manageability
- Scaling to the moon
Let's talk about that for a second. A low-cost Linux server can be put into commission and last an enormous amount of time without ever being upgraded. The data gets spread across many different systems, providing an awesome set of redundant, low-cost storage devices. Managing the servers is easy (in relative terms). Seriously, they are grunts designed to do one task and do it very well. Upgrades should be minimal, since the servers only have to change if the physical layout of the data changes, and I bet it doesn't have to very often. Scalability? No problem: plug in another thousand servers that are just begging for datasets to reduce, and off they go.
Allow me to elaborate on the manageability front here. Say you have a Linux server designed to dig through a crap-load of this data that you've been collecting through various means (spidering, general data loads, whatever). We're talking about a finely tuned $600 server (retail) that will kick ass at this job. More likely, we're talking about several thousand of these wicked machines. But the only time you ever have to alter the innards of these beasts is if they fail or the shape of the data changes. With MapReduce you don't have to change the shape of the existing data: just add a new generation of finely tuned $600 servers (retail) that can see and process the new shape and the new data. The old servers are the aging rock stars, still jamming to the same tune; eventually they'll retire and the data will evolve. Ultimately the user's question gets answered by everyone who is listening, and the answers are reduced to the most relevant resultset.
Try that on any modern vendor-provided database system and you'll probably find that it just can't be done that way. You have to have all of your data neatly ordered and ready to go. Show me a product that can evolve with your data without having to migrate or transform that data's physical shape. I'll bet you can't. The reason you can't is that you have to funnel everything through their engine, and it can't possibly know how all of the underlying data in a system like Google's is shaped. Google can add any data, in any shape it wants, to its ginormous clusters, and the very nature of MapReduce screams "I don't care!" because the workers are doing the work.
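To make the "I don't care!" part a little more concrete, here is a rough, single-process sketch in Python of the idea. It is not Google's implementation, and everything in it is made up for illustration: the sample documents, the three formats, and the names map_html, map_csv, map_text, shuffle, reduce_counts and run are all assumptions. Each format gets its own mapper (think of each one as a different generation of grunt servers), and they all emit the same kind of (key, value) pairs, so the reduce step never needs to know what shape the data started in.

```python
import re
from collections import defaultdict

# Made-up sample documents in three different "shapes" (illustrative only).
DOCS = [
    {"format": "html", "body": "<p>cheap servers scale</p>"},
    {"format": "csv", "body": "cheap,storage\nservers,scale"},
    {"format": "text", "body": "Servers scale to the moon"},
]


def map_html(body):
    """Mapper for HTML: strip tags, then emit (word, 1) pairs."""
    text = re.sub(r"<[^>]+>", " ", body)
    for word in text.lower().split():
        yield word, 1


def map_csv(body):
    """Mapper for CSV: treat every non-empty cell as a word."""
    for line in body.splitlines():
        for cell in line.split(","):
            cell = cell.strip().lower()
            if cell:
                yield cell, 1


def map_text(body):
    """Mapper for plain text: split on whitespace."""
    for word in body.lower().split():
        yield word, 1


# One mapper per data shape. Supporting a new shape means adding an
# entry here; the existing mappers and the reducer stay untouched.
MAPPERS = {"html": map_html, "csv": map_csv, "text": map_text}


def shuffle(pairs):
    """Group the emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reducer: collapse all values for one key into a single total."""
    return key, sum(values)


def run(docs):
    """Map every document with the mapper for its format, then reduce."""
    pairs = []
    for doc in docs:
        mapper = MAPPERS[doc["format"]]
        pairs.extend(mapper(doc["body"]))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


counts = run(DOCS)
print(counts["servers"], counts["cheap"], counts["moon"])  # prints: 3 2 1
```

In a real cluster the map calls would run on thousands of machines and the shuffle would move data over the network, but the contract is the same: a new data shape means a new mapper, not a migration of everything you already have.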
So what were these guys who wrote this article writing about? Beats me. I think they were just trying to tell IT managers that Google is wrong and they know better. I think they are full of it if that's the case. Or perhaps they really just don't get it. I dunno.
Comments
Jon on 26 Jan 11:48
Followup. (Though I’m sure you can read reddit just as well as I can ;)
http://www.databasecolumn.com/2008/01/mapreduce-continued.html
(btw, Stonebraker is the Berkeley Ingres / Postgres dude. http://en.wikipedia.org/wiki/Michael_Stonebraker )
-j
Scott on 26 Jan 15:02
Thanks Jon,
It reads like some backpedaling to me. Using an RDBMS or related technology for unstructured data, or even data that is specifically computational, just seems wrong.
One thing these guys did in their previous article was generate a good amount of link bait. Congrats to them for that!
I love PostgreSQL, perhaps more than any other. But this arrogant stance against a bloody primitive like MapReduce is silly. It just seems they missed the whole hunter/gatherer problem that MapReduce solves.
Oh well.