As the world becomes more information-driven than ever before, a major challenge is how to deal with the explosion of data. Traditional frameworks of data management buckle under the gargantuan volume of today’s datasets. Fortunately, a rapidly changing landscape of new technologies is redefining how we work with data at massive scale. These technologies demand a new breed of DBAs and infrastructure engineers to manage far more sophisticated systems.
Here is an overview of important technologies to know about for context around big data infrastructure.
- NoSQL Database Systems
- Hadoop, MapReduce, and massively parallel computing
What is NoSQL?
NoSQL (commonly read as “Not Only SQL”) represents a fundamentally different family of databases that allows for high-performance, agile processing of information at massive scale. In other words, it is a database infrastructure that has been well adapted to the heavy demands of big data.
NoSQL achieves its efficiency by departing from the highly structured relational model: NoSQL databases impose little fixed structure on data, trading off stringent consistency requirements for speed and agility. NoSQL centers on the concept of distributed databases, where data is stored across multiple processing nodes, often spanning multiple servers. This distributed architecture makes NoSQL databases horizontally scalable; as data volumes grow, you add more commodity hardware to keep up, with little degradation in performance. NoSQL distributed database infrastructure underpins some of the biggest data warehouses on the planet, run by the likes of Google, Amazon, and the CIA.
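The core of horizontal scalability is partitioning (“sharding”): a hash of each key decides which node stores it, so adding nodes spreads the data further. A minimal sketch of that idea, with in-memory dicts standing in for servers and hypothetical node names:

```python
# Hash-based horizontal partitioning: the mechanism that lets a
# distributed key-value store spread data across nodes.
# Node names and records are illustrative, not from any real system.

class ShardedStore:
    def __init__(self, node_names):
        # Each "node" is just a local dict standing in for a server.
        self.nodes = {name: {} for name in node_names}
        self.names = list(node_names)

    def _node_for(self, key):
        # A stable hash of the key picks the owning node.
        return self.names[hash(key) % len(self.names)]

    def put(self, key, value):
        self.nodes[self._node_for(key)][key] = value

    def get(self, key):
        return self.nodes[self._node_for(key)].get(key)

store = ShardedStore(["node-a", "node-b", "node-c"])
store.put("user:42", {"name": "Ada"})
print(store.get("user:42"))
```

Real systems use more elaborate schemes (e.g. consistent hashing, so that adding a node does not reshuffle every key), but the routing principle is the same.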
What is Hadoop?
Hadoop is not a type of database, but rather a software ecosystem that allows for massively parallel computing. It is an enabler of certain types of NoSQL distributed databases (such as HBase), which can allow data to be spread across thousands of servers with little reduction in performance.
A staple of the Hadoop ecosystem is MapReduce, a computational model that takes intensive data processing jobs and spreads the computation across a potentially vast number of servers (generally referred to as a Hadoop cluster). It has been a game-changer in supporting the enormous processing needs of big data: a large data procedure that might take 20 hours on a centralized relational database system may take only minutes when distributed across a large Hadoop cluster of commodity servers, all processing in parallel.
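The MapReduce model itself is simple enough to sketch locally: a map phase emits key-value pairs, a shuffle groups the pairs by key, and a reduce phase aggregates each group. The classic word-count example below runs every phase in a single process purely to illustrate the data flow; on a real cluster, each phase is executed in parallel across many machines.

```python
# Toy word-count in the MapReduce style. Each phase runs locally here;
# on a Hadoop cluster, maps and reduces run in parallel on many servers.
from collections import defaultdict

def map_phase(document):
    # Emit (word, 1) for every word in the document.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Group emitted values by key, as the framework's shuffle step does.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big ideas", "data moves fast"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
print(reduce_phase(shuffle(pairs)))
# {'big': 2, 'data': 2, 'ideas': 1, 'moves': 1, 'fast': 1}
```

Because the map and reduce functions are side-effect-free, the framework can rerun any failed task on another server, which is what makes clusters of cheap commodity hardware reliable.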