Me!: August 2013

Continuing our journey on Big Data from my previous post, in this article we will have a closer look at the what are the options one has when it comes to storing data for Big Data storage/processing. Well the preference is obviously for NoSQL databases. However as for all other things, before we discuss NoSQL, it is important we talk a little bit about the history of why we need NoSQL database and what problems they help overcome as compared to relational databases.

History
For about last 20 years, relational databases have been the solution to all sorts of persistence mechanisms. They had a very firm grounding in Set Theory and were very good at what they were meant to do, storing data in relational models in lots and lots of tables. This does offer a very big advantage of slicing and dicing of data, efficient querying(milliseconds), ACID properties etc.. After Edgar F. Codd published the paper on normalization, the relational databases became the de facto standard of storing application data.

So what was the problem?
Well the model scaled very well until recently when there was a data explosion. The rate at which data was getting generated increased multifold with the ubiquitous presence of internet/3G and always on phenomenon.To give you some perspective:-

Facebook stores, accesses, and analyzes 30+ Petabytes of user generated data.
YouTube users upload 48 hours of new video every minute of the day
In 2008, Google was processing 20,000 terabytes of data (20 petabytes) a day

If you look carefully there are at least two common characteristics in the above examples viz. scale and
un-structured ness of data. Now after a certain point in time the relational type of databases fails to handle both scale and varied structure of data, well not at least at a reasonable cost.
Some might disagree but the truth is that relational databases are good at scaling up, but that comes at an extra cost and there is a limit to how much you can scale vertically, instead why not use commodity hardware to scale out.

Horizontal scaling

Vertical Scaling

The reason why relational databases are not good at horizontal scaling is because of the way data stored is distributed in multiple tables and the aggregation might require various joins, so it is never possible to partition data accurately to store in different partitions across different servers.

Solution
Enter NoSQL databases, here data that is supposed to be consumed together is stored together. It is no longer stored in normalized form, data duplicacy is no longer considered as evil, and this approach has its own advantages. The data can be stored on multiple commodity hardware at a much lower cost. This is what the likes of Google Big Table and Facebook - Cassandra have pioneered.

NoSQL Trivia -- The origin of word NoSQL comes from a twitter hash tag by the same name. Some Non relational databases enthusiasts decided to meet in U.S. to discuss the possibilities of non relational storage. This hash tag was contrived by them to plan their meeting. Little had they dreamt that this hash tag will become a movement in its own right.

So far we have focused on Why of NoSQL databases, now lets look at What and How?

What exactly are these NoSQL databases
As mentioned above NoSQL databases are way of persisting data in non-relational way. Here the data is no longer stored in rigid schemas of tables and columns distributed across various tables. Instead related data is stored together in a fluid schema-less fashion. NoSQL databases tend to be schema-less (key-value stores) or have structured contents but without a formal schema (document stores).

Let us look at different types of NoSQL databases viz, key value pair, document oriented, columnar and Graph based. Examples of each of these would be:-
1. Key Value Pair -- Apache Cassandra, Google Big Table, HBase
2. Document Oriented -- Couchbase, Mongo DB
3. Columnar - Vertica, MonetDB, Amazon RedShift
4. Graph Database - Neo4j

In this article we will focus in detail the key-value pair and document oriented database, as these are the most commonly used ones.

Cassandra -- used by NetFlix, eBay, Twitter, Reddit and many others, is one of today’s most popular NoSQL-databases in use. According to the website, the largest known Cassandra setup involves over 300 TB of data on over 400 machines. Cassandra provides a scalable, high-availability datastore with no single point of failure. Interestingly, Cassandra forgoes the widely used Master-Slave setup, in favor of a peer-to-peer cluster. This contributes to Cassandra having no single-point-of-failure, as there is no master-server which, when faced with lots of requests or when breaking, would render all of its slaves useless. Any number of commodity servers can be grouped into a Cassandra cluster.There are only two ways to query, by key or by key-range.
Data Modeling in Cassandra
Data storage in Cassandra is row-oriented, meaning that all contents of a row are serialized together on disk. Every row of columns has its unique key. Each row can hold up to 2 billion columns [²]. Furthermore, each row must fit onto a single server, because data is partitioned solely by row-key.

The following layout represents a row in a Column Family (CF):

The following layout represents a row in a Super Column Family (SCF):

The following layout represents a row in a Column Family with composite columns. Parts of a composite column are separated by ‘|’. Note that this is just a representation convention; Cassandra’s built-in composite type encodes differently, not using ‘|’. (BTW, this post doesn’t require you to have detailed knowledge of super columns and composite columns.)

Use cases - Now if we quickly discuss the use cases where you would use Key Value kind of database is probably where you would only query based on the key. The database does not care what is stored as value. The indexes are only on the key and you always retrieve and insert values as one big chunk of black box.

MongoDB -

This is a NoSQL database which supports the notion of documents. Documents are JSON structures, to be precise in case of MongoDB it is BSON(Binary equivalent of JSON).

Below is the terminology used in Mongo DB and its analogy with respect to normal RDBS:-

TABLE --> Collection
ROW --> Document
Primary Key --> _id

A sample document looks like below, which is nothing but key value pairs, but unlike key-value database, here you can index and query individual key within the document.

{ "item": "pencil", "qty": 500, "type": "no.2" }

For document stores, the structure and contents of each "document" are independent of other documents in the same "collection". Adding a field is usually a code change rather than a database change: new documents get an entry for the new field, while older documents are considered to have a null value for the non-existent field. Similarly, "removing" a field could mean that you simply stop referring to it in your code rather than going to the trouble of deleting it from each document (unless space is at a premium, and then you have the option of removing only those with the largest contents). Contrast this to how an entire table must be changed to add or remove a column in a traditional row/column database.

Documents can also hold lists as well as other nested documents. Here's a sample document from MongoDB (a post from a blog or other forum), represented as JSON:

{
  _id : ObjectId("4e77bb3b8a3e000000004f7a"),
  when : Date("2011-09-19T02:10:11.3Z"),
  author : "alex",
  title : "No Free Lunch",
  text : "This is the text of the post.  It could be very long.",
  tags : [ "business", "ramblings" ],
  votes : 5,
  voters : [ "jane", "joe", "spencer", "phyllis", "li" ],
  comments : [
    { who : "jane", when : Date("2011-09-19T04:00:10.112Z"),
      comment : "I agree." },
    { who : "meghan", when : Date("2011-09-20T14:36:06.958Z"),
      comment : "You must be joking.  etc etc ..." }
  ]
}

Note how "comments" is a list of nested documents with their own independent structure. Queries can "reach into" these documents from the outer document, for example to find posts that have comments by Jane, or posts with comments from a certain date range.

Some of the notable advanced features of MongoDB include, automatic master slave replication, auto sharding of data, very rich query language, supports 2nd level of indexes on documents ensuring efficient retrievals, in-built support for Map-Reduce. It also offers very fine grained control over the reliability and durability for someone who does not like the auto pilot mode.

Most common Myth - No support for ACID
One of the most common myth about NoSQL databases is that they do not support Atomicity, Consistency, Integrity, Durability. However by the very nature of how data is stored NoSQL databases should look at ACID in a different light, since there is no need of lots of joins and related data is stored as a single document. It is okay as long as the transaction boundaries are relaxed to a per document level. MongoDB does support transactions at document level, the write and reads are consistent and durable(configurable) at individual document level.

A common (mis)usecase of NoSQL database -- Continuous Delivery/Agile
Most often it happens that you develop a product envisaging a particular use-case but often times people use it in a way that you wouldn't have imagine. Many people have now started advocating NoSQL databases for agile projects just because you do not have to deal with problems of enforcing a common/rigid schema. Each of the distributed team members can work on their own schema and both can go to production because NoSQL allows each row(read document) to have an individual structure different from other documents.

I would say, the reasons for using a NoSQL database should be rooted in reasoning and fitment of the business need rather than just to ease out on development effort.

Future -- Polyglot Persistence
So will the existing RDBMs stop existing in near future. I would say NO and introduce a term I heard from Martin Fowler - Polyglot Persistence -- which means that use different types of persistence according to what makes most business/ domain sense. There is no 1 size fits all.

Big Data has taken the software industry by storm. Some of the interesting statistics hint that:-

Big data is a top business priority and drives enormous opportunity for business improvement. Wikibon’s own study projects that big data will be a $50 billion business by 2017.
Market research firm IDC has released a new forecast that shows the big data market is expected to grow from $3.2 billion in 2010 to $16.9 billion in 2015
94% of Hadoop users perform analytics on large volumes of data not possible before; 88% analyze data in greater detail; while 82% can now retain more of their data.

So in this post we try to demystify and see everything that Big Data might mean to you.

Big data is a dynamic that seemed to appear from almost nowhere. But in reality, Big Data is not new – and it is moving into mainstream and getting a lot more attention. The growth of Big Data is being enabled by inexpensive storage, a proliferation of sensor and data capture technology, increasing connections to information via the cloud and virtualised storage infrastructure, as well as innovative software and analysis tools. It is no surprise then that business analytics as a technology area is rising on the radars of CIOs
and line-of-business (LOB) executives.

To me no other technology in the recent past has directly affected the lives of so many people, right from the developer to product owner to marketing dept, to CIO and finally the customer himself.

And rightly so, if something is touching so many lives, is it okay to mean different things to different people?

Most people define Big Data as the 3 V's -- velocity , variety and volume of data.

Much has been written on how the amount of data in the world is exploding in volume. According to a recent study, the amount of information created and replicated will surpass 1.9 zettabytes (1.8 trillion gigabytes) in 2011 – growing by a factor of 9 in just five years.

Big Data is not so much about the content that is created, nor is it even about just consumption. It

is more about the analysis of the data and how that needs to be done. Although the varied variety of content (unstructured, semi-structured) does play a huge role, it is not really a ‘thing’, but

instead a dynamic/activity that crosses many IT borders.

With the focus on Big Data going mainstream, a range of new technologies have hit the market. The table

below gives an overview of these technologies, with associated context (note that the list is not exhaustive).

Technology	Context
Big Table	Proprietary distributed database system built on the Google File System. Inspiration for HBase.
Data Warehouse & BI	Consists of an integrated set of servers, storage, operating system(s),database, business intelligence, data mining and other software specifically pre-installed and pre-optimised for data warehousing.
Hadoop	Multiple computers, communicating through a network, used to solve a common computational problem. The problem is divided into multiple tasks, each of which is solved by one or more computers working in parallel. Improved price:performance ratio, higher reliability and more scalability.
NoSQL / Key value store	A non-relational database is one that does not store data in tables (rows and columns) – in contrast to a relational database. Key Value Stores allow for the management of schema-less (noSQL) entities. E.g. Hbase, Cassandra, Couchbase, MongoDB etc.
Machine Learning	Machine learning is a field that is closely related to data mining and often uses techniques from statistics, probability theory, pattern recognition, and a host of other areas. It's used to build systems like those at Netflix and Amazon that recommend products to users based on past purchases, or systems that find all of the similar news articles on a given day. It can also be used to categorize Web pages automatically according to genre (sports, economy, war, and so on) or to mark e-mail messages as spam.

Below is the architecture diagram of how the whole things comes together. We will cover the diagram and its components in more detail in my upcoming posts.

Conclusion

Apart from the 3 V's mentioned above, Big Data is also equally about the 4th V -- 'Value'.

It is about creating value out of data using about some or all of the above technologies. It entails data analytics by using technologies like NoSQL to store vast amounts of historical data and process it on commodity hardware using technologies like Hadoop. The real power of data comes out of co-relating different sources of related data, hence it is imperative to use it in conjunction with data stored in existing data ware houses.

In future articles we will deal with each of the above viz. No SQL, Hadoop and Machine Learning individually.

Me!

Wednesday, August 21, 2013

Why What and How of NoSQL Databases?

Sunday, August 18, 2013

The Big Data landscape ? What does it mean to you?

Followers

Blog Archive

About Me