Continuing our journey on Big Data from my previous post, in this article we will have a closer look at the what are the options one has when it comes to storing data for Big Data storage/processing. Well the preference is obviously for NoSQL databases. However as for all other things, before we discuss NoSQL, it is important we talk a little bit about the history of why we need NoSQL database and what problems they help overcome as compared to relational databases.
History
For about last 20 years, relational databases have been the solution to all sorts of persistence mechanisms. They had a very firm grounding in Set Theory and were very good at what they were meant to do, storing data in relational models in lots and lots of tables. This does offer a very big advantage of slicing and dicing of data, efficient querying(milliseconds), ACID properties etc.. After Edgar F. Codd published the paper on normalization, the relational databases became the de facto standard of storing application data.
So what was the problem?
Well the model scaled very well until recently when there was a data explosion. The rate at which data was getting generated increased multifold with the ubiquitous presence of internet/3G and always on phenomenon.To give you some perspective:-
un-structured ness of data. Now after a certain point in time the relational type of databases fails to handle both scale and varied structure of data, well not at least at a reasonable cost.
Some might disagree but the truth is that relational databases are good at scaling up, but that comes at an extra cost and there is a limit to how much you can scale vertically, instead why not use commodity hardware to scale out.
The reason why relational databases are not good at horizontal scaling is because of the way data stored is distributed in multiple tables and the aggregation might require various joins, so it is never possible to partition data accurately to store in different partitions across different servers.
Solution
Enter NoSQL databases, here data that is supposed to be consumed together is stored together. It is no longer stored in normalized form, data duplicacy is no longer considered as evil, and this approach has its own advantages. The data can be stored on multiple commodity hardware at a much lower cost. This is what the likes of Google Big Table and Facebook - Cassandra have pioneered.
NoSQL Trivia -- The origin of word NoSQL comes from a twitter hash tag by the same name. Some Non relational databases enthusiasts decided to meet in U.S. to discuss the possibilities of non relational storage. This hash tag was contrived by them to plan their meeting. Little had they dreamt that this hash tag will become a movement in its own right.
So far we have focused on Why of NoSQL databases, now lets look at What and How?
What exactly are these NoSQL databases
As mentioned above NoSQL databases are way of persisting data in non-relational way. Here the data is no longer stored in rigid schemas of tables and columns distributed across various tables. Instead related data is stored together in a fluid schema-less fashion. NoSQL databases tend to be schema-less (key-value stores) or have structured contents but without a formal schema (document stores).
Let us look at different types of NoSQL databases viz, key value pair, document oriented, columnar and Graph based. Examples of each of these would be:-
1. Key Value Pair -- Apache Cassandra, Google Big Table, HBase
2. Document Oriented -- Couchbase, Mongo DB
3. Columnar - Vertica, MonetDB, Amazon RedShift
4. Graph Database - Neo4j
In this article we will focus in detail the key-value pair and document oriented database, as these are the most commonly used ones.
Cassandra -- used by NetFlix, eBay, Twitter, Reddit and many others, is one of today’s most popular NoSQL-databases in use. According to the website, the largest known Cassandra setup involves over 300 TB of data on over 400 machines. Cassandra provides a scalable, high-availability datastore with no single point of failure. Interestingly, Cassandra forgoes the widely used Master-Slave setup, in favor of a peer-to-peer cluster. This contributes to Cassandra having no single-point-of-failure, as there is no master-server which, when faced with lots of requests or when breaking, would render all of its slaves useless. Any number of commodity servers can be grouped into a Cassandra cluster.There are only two ways to query, by key or by key-range.
Data Modeling in Cassandra
Data storage in Cassandra is row-oriented, meaning that all contents of a row are serialized together on disk. Every row of columns has its unique key. Each row can hold up to 2 billion columns [²]. Furthermore, each row must fit onto a single server, because data is partitioned solely by row-key.
History
For about last 20 years, relational databases have been the solution to all sorts of persistence mechanisms. They had a very firm grounding in Set Theory and were very good at what they were meant to do, storing data in relational models in lots and lots of tables. This does offer a very big advantage of slicing and dicing of data, efficient querying(milliseconds), ACID properties etc.. After Edgar F. Codd published the paper on normalization, the relational databases became the de facto standard of storing application data.
So what was the problem?
Well the model scaled very well until recently when there was a data explosion. The rate at which data was getting generated increased multifold with the ubiquitous presence of internet/3G and always on phenomenon.To give you some perspective:-
- Facebook stores, accesses, and analyzes 30+ Petabytes of user generated data.
- YouTube users upload 48 hours of new video every minute of the day
- In 2008, Google was processing 20,000 terabytes of data (20 petabytes) a day
un-structured ness of data. Now after a certain point in time the relational type of databases fails to handle both scale and varied structure of data, well not at least at a reasonable cost.
Some might disagree but the truth is that relational databases are good at scaling up, but that comes at an extra cost and there is a limit to how much you can scale vertically, instead why not use commodity hardware to scale out.
![]() |
| Horizontal scaling |
![]() |
| Vertical Scaling |
The reason why relational databases are not good at horizontal scaling is because of the way data stored is distributed in multiple tables and the aggregation might require various joins, so it is never possible to partition data accurately to store in different partitions across different servers.
Solution
Enter NoSQL databases, here data that is supposed to be consumed together is stored together. It is no longer stored in normalized form, data duplicacy is no longer considered as evil, and this approach has its own advantages. The data can be stored on multiple commodity hardware at a much lower cost. This is what the likes of Google Big Table and Facebook - Cassandra have pioneered.
NoSQL Trivia -- The origin of word NoSQL comes from a twitter hash tag by the same name. Some Non relational databases enthusiasts decided to meet in U.S. to discuss the possibilities of non relational storage. This hash tag was contrived by them to plan their meeting. Little had they dreamt that this hash tag will become a movement in its own right.
So far we have focused on Why of NoSQL databases, now lets look at What and How?
What exactly are these NoSQL databases
As mentioned above NoSQL databases are way of persisting data in non-relational way. Here the data is no longer stored in rigid schemas of tables and columns distributed across various tables. Instead related data is stored together in a fluid schema-less fashion. NoSQL databases tend to be schema-less (key-value stores) or have structured contents but without a formal schema (document stores).
Let us look at different types of NoSQL databases viz, key value pair, document oriented, columnar and Graph based. Examples of each of these would be:-
1. Key Value Pair -- Apache Cassandra, Google Big Table, HBase
2. Document Oriented -- Couchbase, Mongo DB
3. Columnar - Vertica, MonetDB, Amazon RedShift
4. Graph Database - Neo4j
In this article we will focus in detail the key-value pair and document oriented database, as these are the most commonly used ones.
Cassandra -- used by NetFlix, eBay, Twitter, Reddit and many others, is one of today’s most popular NoSQL-databases in use. According to the website, the largest known Cassandra setup involves over 300 TB of data on over 400 machines. Cassandra provides a scalable, high-availability datastore with no single point of failure. Interestingly, Cassandra forgoes the widely used Master-Slave setup, in favor of a peer-to-peer cluster. This contributes to Cassandra having no single-point-of-failure, as there is no master-server which, when faced with lots of requests or when breaking, would render all of its slaves useless. Any number of commodity servers can be grouped into a Cassandra cluster.There are only two ways to query, by key or by key-range.
Data Modeling in Cassandra
Data storage in Cassandra is row-oriented, meaning that all contents of a row are serialized together on disk. Every row of columns has its unique key. Each row can hold up to 2 billion columns [²]. Furthermore, each row must fit onto a single server, because data is partitioned solely by row-key.
- The following layout represents a row in a Column Family with composite columns. Parts of a composite column are separated by ‘|’. Note that this is just a representation convention; Cassandra’s built-in composite type encodes differently, not using ‘|’. (BTW, this post doesn’t require you to have detailed knowledge of super columns and composite columns.)

Use cases - Now if we quickly discuss the use cases where you would use Key Value kind of database is probably where you would only query based on the key. The database does not care what is stored as value. The indexes are only on the key and you always retrieve and insert values as one big chunk of black box.
MongoDB -
This is a NoSQL database which supports the notion of documents. Documents are JSON structures, to be precise in case of MongoDB it is BSON(Binary equivalent of JSON).
Below is the terminology used in Mongo DB and its analogy with respect to normal RDBS:-
TABLE --> Collection
ROW --> Document
Primary Key --> _id
A sample document looks like below, which is nothing but key value pairs, but unlike key-value database, here you can index and query individual key within the document.
MongoDB -
This is a NoSQL database which supports the notion of documents. Documents are JSON structures, to be precise in case of MongoDB it is BSON(Binary equivalent of JSON).
Below is the terminology used in Mongo DB and its analogy with respect to normal RDBS:-
TABLE --> Collection
ROW --> Document
Primary Key --> _id
A sample document looks like below, which is nothing but key value pairs, but unlike key-value database, here you can index and query individual key within the document.
For document stores, the structure and contents of each "document" are independent of other documents in the same "collection". Adding a field is usually a code change rather than a database change: new documents get an entry for the new field, while older documents are considered to have a null value for the non-existent field. Similarly, "removing" a field could mean that you simply stop referring to it in your code rather than going to the trouble of deleting it from each document (unless space is at a premium, and then you have the option of removing only those with the largest contents). Contrast this to how an entire table must be changed to add or remove a column in a traditional row/column database.
Documents can also hold lists as well as other nested documents. Here's a sample document from MongoDB (a post from a blog or other forum), represented as JSON:
{
_id : ObjectId("4e77bb3b8a3e000000004f7a"),
when : Date("2011-09-19T02:10:11.3Z"),
author : "alex",
title : "No Free Lunch",
text : "This is the text of the post. It could be very long.",
tags : [ "business", "ramblings" ],
votes : 5,
voters : [ "jane", "joe", "spencer", "phyllis", "li" ],
comments : [
{ who : "jane", when : Date("2011-09-19T04:00:10.112Z"),
comment : "I agree." },
{ who : "meghan", when : Date("2011-09-20T14:36:06.958Z"),
comment : "You must be joking. etc etc ..." }
]
}
Note how "comments" is a list of nested documents with their own independent structure. Queries can "reach into" these documents from the outer document, for example to find posts that have comments by Jane, or posts with comments from a certain date range.
Some of the notable advanced features of MongoDB include, automatic master slave replication, auto sharding of data, very rich query language, supports 2nd level of indexes on documents ensuring efficient retrievals, in-built support for Map-Reduce. It also offers very fine grained control over the reliability and durability for someone who does not like the auto pilot mode.
Most common Myth - No support for ACID
One of the most common myth about NoSQL databases is that they do not support Atomicity, Consistency, Integrity, Durability. However by the very nature of how data is stored NoSQL databases should look at ACID in a different light, since there is no need of lots of joins and related data is stored as a single document. It is okay as long as the transaction boundaries are relaxed to a per document level. MongoDB does support transactions at document level, the write and reads are consistent and durable(configurable) at individual document level.
A common (mis)usecase of NoSQL database -- Continuous Delivery/Agile
Most often it happens that you develop a product envisaging a particular use-case but often times people use it in a way that you wouldn't have imagine. Many people have now started advocating NoSQL databases for agile projects just because you do not have to deal with problems of enforcing a common/rigid schema. Each of the distributed team members can work on their own schema and both can go to production because NoSQL allows each row(read document) to have an individual structure different from other documents.
I would say, the reasons for using a NoSQL database should be rooted in reasoning and fitment of the business need rather than just to ease out on development effort.
Future -- Polyglot Persistence
So will the existing RDBMs stop existing in near future. I would say NO and introduce a term I heard from Martin Fowler - Polyglot Persistence -- which means that use different types of persistence according to what makes most business/ domain sense. There is no 1 size fits all.
Some of the notable advanced features of MongoDB include, automatic master slave replication, auto sharding of data, very rich query language, supports 2nd level of indexes on documents ensuring efficient retrievals, in-built support for Map-Reduce. It also offers very fine grained control over the reliability and durability for someone who does not like the auto pilot mode.
Most common Myth - No support for ACID
One of the most common myth about NoSQL databases is that they do not support Atomicity, Consistency, Integrity, Durability. However by the very nature of how data is stored NoSQL databases should look at ACID in a different light, since there is no need of lots of joins and related data is stored as a single document. It is okay as long as the transaction boundaries are relaxed to a per document level. MongoDB does support transactions at document level, the write and reads are consistent and durable(configurable) at individual document level.
A common (mis)usecase of NoSQL database -- Continuous Delivery/Agile
Most often it happens that you develop a product envisaging a particular use-case but often times people use it in a way that you wouldn't have imagine. Many people have now started advocating NoSQL databases for agile projects just because you do not have to deal with problems of enforcing a common/rigid schema. Each of the distributed team members can work on their own schema and both can go to production because NoSQL allows each row(read document) to have an individual structure different from other documents.
I would say, the reasons for using a NoSQL database should be rooted in reasoning and fitment of the business need rather than just to ease out on development effort.
Future -- Polyglot Persistence
So will the existing RDBMs stop existing in near future. I would say NO and introduce a term I heard from Martin Fowler - Polyglot Persistence -- which means that use different types of persistence according to what makes most business/ domain sense. There is no 1 size fits all.




