Up until recently, I had worked exclusively with Relational Database Management Systems (RDBMSes) such as MySQL and PostgreSQL in my web development projects. They store data logically and link different subsets of data together in very obvious ways - which makes them fairly easy for newcomers to learn. I had learned the nuances of both and consider myself an expert in many ways when it comes to writing queries in such environments.
Everything in my web development world was in balance - then, from out of nowhere, Travis Hegner hit me in the face with a proverbial pimp slap of epic proportions. This thunderous blow came from a different type of database system - what I will loosely call a “Document Database” - in the form of HBase, the Hadoop Database (strictly speaking, HBase is a column-oriented store rather than a true document database). HBase is meant to be used “when you need random, realtime read/write access to your Big Data. This project’s goal is the hosting of very large tables - billions of rows X millions of columns - atop clusters of commodity hardware.”
Now you may be thinking to yourself “WTF is he talking about!?” I will try to clarify by using an example…
A couple of years ago, we worked on creating a web spider that crawls the web for resumes. The project worked great and we got close to 150,000 resumes into an RDBMS (PostgreSQL) in a standardized format using the Sovren Resume Parser, which adheres to the HR-XML resume formatting standards. We built a search engine that would give our recruiters access to the resumes so they could (ideally) find matching candidates for open positions (I work for a staffing company, Trillium Staffing). Unfortunately, with that much data, the queries were taking upwards of a minute to complete - even with a normalized and optimized schema. We dealt with it for a long time, until we came across HBase. Since then, we have implemented the system and cut query times using the Document Database - which basically searches a giant text-based dump of each resume stored in a single field. The searches take at most about 5 seconds now. Pretty big improvement eh? I’d say so, too!
OK, so this worked out great, but there isn’t a big community yet for this technology. The reason - in my opinion - is that the technology it’s built on (Java) is a PITA and eats a lot of resources. So I looked for alternatives, but to no avail.
Another downside I found was that the data being processed was mainly static, which is not great in web development because we (love to) work with dynamic data that can be updated instantly.
I kind of put this out of my mind until, randomly, on my favorite CSS gallery, Best Web Gallery, there was a posting for MongoDB - a sort of RDBMS/Document Database hybrid. SWEET!
After poking around the site for a bit, I was pleasantly surprised at the ease of querying (JavaScript-based queries) and the support. I was even MORE pleased to find they had a sandbox area of sorts to learn and test queries in. Then, to make me even MORE IMPRESSED, there was already an Ubuntu package for it! Score! Since finding this out, I have been working on a fairly large project and have really been loving it. This project happens to be in PHP, but there are also drivers for Ruby (for Rails folks), Python (for Django folks), etc. I love the fact that I can apply the philosophies of both relational and document databases, rapidly search GIANT amounts of data, and run batch aggregation tasks (think cron-job-type work over your data) called “Map/Reduce” jobs right out of the box. It’s just that simple.
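To give a rough feel for those two ideas (query-by-example on documents, and a Map/Reduce job), here is a tiny sketch in plain Python. It does not talk to MongoDB at all and is not how MongoDB implements anything; the sample resumes, the field names, and the helpers `find`, `map_reduce`, `emit_skills`, and `count` are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample documents; the field names are invented for this sketch.
resumes = [
    {"name": "Ann", "state": "MI", "skills": ["php", "mysql"]},
    {"name": "Bob", "state": "OH", "skills": ["ruby", "mysql"]},
    {"name": "Cy", "state": "MI", "skills": ["php"]},
]


def find(docs, query):
    """Return the documents whose fields equal every key/value pair in query."""
    return [d for d in docs if all(d.get(k) == v for k, v in query.items())]


def map_reduce(docs, map_fn, reduce_fn):
    """Group the (key, value) pairs emitted by map_fn, then reduce each group."""
    groups = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def emit_skills(doc):
    # Map step: emit one (skill, 1) pair per skill listed on the resume.
    for skill in doc["skills"]:
        yield skill, 1


def count(key, values):
    # Reduce step: add up the emitted counts for a single skill.
    return sum(values)


# Query-by-example: all resumes from Michigan.
michigan = find(resumes, {"state": "MI"})

# Map/Reduce: how many resumes mention each skill.
skill_counts = map_reduce(resumes, emit_skills, count)
```

The point of the sketch is the shape of the work: a query is just a small document describing what you want, and a Map/Reduce job is a pair of small functions that the database runs over every document for you.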
It is my firm opinion that MongoDB not only bridges the gap between key-value stores (which are fast and highly scalable) and traditional RDBMSes (which provide rich queries and deep functionality), but also bridges the gap between modern web development and document-based databases - and does so with elite precision.