Me!

Wednesday, August 21, 2013

The Why, What and How of NoSQL Databases

Continuing our journey on Big Data from my previous post, in this article we will take a closer look at the options one has for storing data for Big Data storage and processing. The usual preference is NoSQL databases. However, as with all other things, before we discuss NoSQL it is important to talk a little about why we need NoSQL databases and what problems they help overcome compared to relational databases.

History
For roughly the last 20 years, relational databases have been the default solution for all sorts of persistence. They have a very firm grounding in set theory and are very good at what they were meant to do: storing data in relational models across lots and lots of tables. This offers big advantages such as slicing and dicing of data, efficient querying (milliseconds), ACID properties, etc. After Edgar F. Codd published his 1970 paper on the relational model, which also introduced normalization, relational databases became the de facto standard for storing application data.

So what was the problem?
Well, the model scaled very well until recently, when there was a data explosion. The rate at which data is generated has increased multifold with the ubiquitous presence of the internet, 3G and the always-on phenomenon. To give you some perspective:-

Facebook stores, accesses, and analyzes 30+ Petabytes of user generated data.
YouTube users upload 48 hours of new video every minute of the day
In 2008, Google was processing 20,000 terabytes of data (20 petabytes) a day

If you look carefully, there are at least two common characteristics in the above examples, viz. the scale and the unstructured nature of the data. After a certain point, relational databases fail to handle both the scale and the varied structure of the data, at least not at a reasonable cost.
Some might disagree, but the truth is that relational databases are good at scaling up. That comes at an extra cost, though, and there is a limit to how far you can scale vertically. So why not use commodity hardware to scale out instead?

[Figure: horizontal scaling (scale out)]

[Figure: vertical scaling (scale up)]

Relational databases are not good at horizontal scaling because of the way the data is distributed across multiple tables. Assembling an aggregate may require several joins, so it is very hard to partition the data cleanly across different servers without some queries having to join across partitions.

Solution
Enter NoSQL databases. Here, data that is supposed to be consumed together is stored together. It is no longer stored in normalized form, and data duplication is no longer considered evil. This approach has its own advantages: the data can be spread across many commodity machines at a much lower cost. This is what the likes of Google (Bigtable) and Facebook (Cassandra) have pioneered.
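To make "stored together" a little more concrete, here is a minimal Python sketch. It assumes a hypothetical cluster of three servers and a made-up order record, and it uses simple hash-modulo partitioning, which is only one of several possible schemes and not the method of any particular product.

```python
import hashlib

# Hypothetical cluster of three commodity servers (names are illustrative only).
SERVERS = ["server-a", "server-b", "server-c"]

# Each server is modelled as a plain dict from key to stored aggregate.
cluster = {name: {} for name in SERVERS}


def server_for(key: str) -> str:
    """Pick a server from a stable hash of the key (simple modulo partitioning)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]


def put(key: str, aggregate: dict) -> None:
    """Store the whole aggregate on the single server that owns the key."""
    cluster[server_for(key)][key] = aggregate


def get(key: str) -> dict:
    """Read the aggregate back: one server is touched and no join is needed."""
    return cluster[server_for(key)][key]


# One aggregate: everything needed to display the order lives together,
# including a duplicated copy of the customer's name.
order = {
    "order_id": "order-1001",
    "customer": {"id": "cust-42", "name": "Asha"},
    "lines": [
        {"sku": "pencil", "qty": 500},
        {"sku": "eraser", "qty": 20},
    ],
}

put(order["order_id"], order)
print(server_for("order-1001"), get("order-1001")["customer"]["name"])
```

Because the key alone decides where the data lives, adding servers changes only the partitioning function, not the shape of the stored data.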

NoSQL Trivia -- The word NoSQL, as it is used today, comes from a Twitter hashtag of the same name. Some non-relational database enthusiasts decided to meet in the U.S. to discuss the possibilities of non-relational storage, and they came up with this hashtag to plan their meeting. Little did they dream that the hashtag would become a movement in its own right.

So far we have focused on the Why of NoSQL databases; now let's look at the What and How.

What exactly are these NoSQL databases
As mentioned above, NoSQL databases are a way of persisting data in a non-relational way. Here the data is no longer stored in rigid schemas of tables and columns, spread across various tables. Instead, related data is stored together in a fluid, schema-less fashion. NoSQL databases tend to be schema-less (key-value stores) or have structured contents but without a formal schema (document stores).

Let us look at the different types of NoSQL databases, viz. key-value pair, document oriented, columnar and graph based. Examples of each of these would be:-
1. Key Value Pair (including wide-column stores, which this article treats as key-value stores) -- Apache Cassandra, Google Bigtable, HBase
2. Document Oriented -- Couchbase, MongoDB
3. Columnar (column-oriented analytic databases) -- Vertica, MonetDB, Amazon Redshift
4. Graph Database -- Neo4j

In this article we will focus in detail on the key-value pair and document-oriented databases, as these are the most commonly used ones.

Cassandra -- used by Netflix, eBay, Twitter, Reddit and many others, is one of today’s most popular NoSQL databases. According to the website, the largest known Cassandra setup involves over 300 TB of data on over 400 machines. Cassandra provides a scalable, high-availability datastore with no single point of failure. Interestingly, Cassandra forgoes the widely used master-slave setup in favor of a peer-to-peer cluster. This contributes to Cassandra having no single point of failure, as there is no master server which, when faced with lots of requests or when breaking, would render all of its slaves useless. Any number of commodity servers can be grouped into a Cassandra cluster. There are only two ways to query: by key or by key range.
Data Modeling in Cassandra
Data storage in Cassandra is row-oriented, meaning that all contents of a row are serialized together on disk. Every row of columns has its own unique key. Each row can hold up to 2 billion columns. Furthermore, each row must fit onto a single server, because data is partitioned solely by row key.

[Figure: layout of a row in a Column Family (CF)]

[Figure: layout of a row in a Super Column Family (SCF)]

[Figure: layout of a row in a Column Family with composite columns] In this representation, parts of a composite column are separated by ‘|’. Note that this is just a representation convention; Cassandra’s built-in composite type encodes differently, not using ‘|’. (BTW, this post doesn’t require you to have detailed knowledge of super columns and composite columns.)
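The three row layouts described above can be approximated with plain Python dictionaries. This is only a mental model under my own assumptions: the row keys, column names and values are invented for illustration, and nothing here reflects how Cassandra actually encodes rows on disk.

```python
# Column Family (CF): row key -> {column name: value}
users_cf = {
    "user-1": {"name": "alex", "email": "alex@example.com"},
}

# Super Column Family (SCF): row key -> {super column name: {column name: value}}
users_scf = {
    "user-1": {
        "home": {"city": "Pune", "zip": "411001"},
        "work": {"city": "Mumbai", "zip": "400001"},
    },
}

# CF with composite columns: each column name is made of several parts,
# modelled here as a tuple (hypothetical parts: date, field).
posts_cf = {
    "user-1": {
        ("2013-08-21", "title"): "NoSQL",
        ("2013-08-21", "votes"): 5,
        ("2013-08-18", "title"): "Big Data",
    },
}


def show_row(row: dict) -> list:
    """Render a row's columns sorted by column name, joining composite parts with '|'."""
    rendered = []
    for name in sorted(row):
        label = "|".join(name) if isinstance(name, tuple) else name
        rendered.append(f"{label} = {row[name]}")
    return rendered


for line in show_row(posts_cf["user-1"]):
    print(line)
```

In all three cases the row key is the only thing needed to locate the row, which is why partitioning by row key works.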

Use cases - A key-value database fits best where you only ever query by key. The database does not care what is stored as the value. The indexes are only on the key, and you always retrieve and insert the value as one big, opaque chunk.
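The "query by key or by key range, value is a black box" idea can be sketched as a toy in-memory store. The class name and its methods are hypothetical and are not the API of Cassandra or any other product.

```python
import bisect


class TinyKeyValueStore:
    """Toy in-memory store: values are opaque bytes and only keys are indexed."""

    def __init__(self):
        self._keys = []    # sorted list of keys, the only "index"
        self._values = {}  # key -> opaque bytes

    def put(self, key: str, value: bytes) -> None:
        """Insert or overwrite the whole value for a key."""
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key: str) -> bytes:
        """Look up by exact key; the store never inspects the value."""
        return self._values[key]

    def get_range(self, start: str, end: str) -> list:
        """Return (key, value) pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


store = TinyKeyValueStore()
store.put("user:001", b'{"name": "alex"}')
store.put("user:002", b'{"name": "jane"}')
store.put("video:001", b"\x00\x01binary")

print(store.get("user:002"))
print([key for key, _ in store.get_range("user:", "user:~")])
```

Note that there is no way to ask "which users are named jane" without reading every value in the application; that kind of query is where document stores come in.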

MongoDB -

This is a NoSQL database which supports the notion of documents. Documents are JSON structures; to be precise, in the case of MongoDB they are BSON (a binary equivalent of JSON).

Below is the terminology used in MongoDB and its analogy with a normal RDBMS:-

Table --> Collection
Row --> Document
Primary Key --> _id

A sample document looks like the one below. It is nothing but key-value pairs, but unlike in a key-value database, here you can index and query individual keys within the document.

{ "item": "pencil", "qty": 500, "type": "no.2" }

For document stores, the structure and contents of each "document" are independent of other documents in the same "collection". Adding a field is usually a code change rather than a database change: new documents get an entry for the new field, while older documents are considered to have a null value for the non-existent field. Similarly, "removing" a field could mean that you simply stop referring to it in your code rather than going to the trouble of deleting it from each document (unless space is at a premium, and then you have the option of removing only those with the largest contents). Contrast this to how an entire table must be changed to add or remove a column in a traditional row/column database.
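A small Python sketch of this behaviour, using a list of dicts as a stand-in for a collection. The "supplier" field and the sample items are assumptions made up for the example; no real database is involved.

```python
# A "collection" modelled as a list of dicts; each document has its own shape.
inventory = [
    {"item": "pencil", "qty": 500, "type": "no.2"},  # older document
    {"item": "eraser", "qty": 40},                    # older document, fewer fields
]

# New code starts writing a "supplier" field; older documents are left untouched.
inventory.append({"item": "pen", "qty": 120, "supplier": "Acme"})


def supplier_of(doc: dict):
    """Read the new field, treating documents that lack it as None (null)."""
    return doc.get("supplier")


suppliers = [supplier_of(doc) for doc in inventory]
print(suppliers)

# "Removing" a field: the code simply stops reading it. Deleting it from the
# documents that still carry it is optional.
for doc in inventory:
    doc.pop("type", None)
```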

Documents can also hold lists as well as other nested documents. Here's a sample document from MongoDB (a post from a blog or other forum), represented as JSON:

{
  _id : ObjectId("4e77bb3b8a3e000000004f7a"),
  when : Date("2011-09-19T02:10:11.3Z"),
  author : "alex",
  title : "No Free Lunch",
  text : "This is the text of the post.  It could be very long.",
  tags : [ "business", "ramblings" ],
  votes : 5,
  voters : [ "jane", "joe", "spencer", "phyllis", "li" ],
  comments : [
    { who : "jane", when : Date("2011-09-19T04:00:10.112Z"),
      comment : "I agree." },
    { who : "meghan", when : Date("2011-09-20T14:36:06.958Z"),
      comment : "You must be joking.  etc etc ..." }
  ]
}

Note how "comments" is a list of nested documents with their own independent structure. Queries can "reach into" these documents from the outer document, for example to find posts that have comments by Jane, or posts with comments from a certain date range.
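What "reaching into" nested documents means can be shown in plain Python over dicts. The sample posts and the two helper functions are hypothetical; in MongoDB itself such a filter is usually written with dot notation, for example { "comments.who": "jane" }, rather than with application code like this.

```python
from datetime import datetime

# Sample posts shaped like the document above (values are illustrative).
posts = [
    {"title": "No Free Lunch", "author": "alex", "comments": [
        {"who": "jane", "when": datetime(2011, 9, 19, 4, 0, 10), "comment": "I agree."},
        {"who": "meghan", "when": datetime(2011, 9, 20, 14, 36, 6), "comment": "You must be joking."},
    ]},
    {"title": "Another Post", "author": "alex", "comments": [
        {"who": "joe", "when": datetime(2011, 10, 1, 9, 0, 0), "comment": "Nice."},
    ]},
    {"title": "No Comments Yet", "author": "li"},  # no "comments" field at all
]


def posts_commented_by(docs, who):
    """Posts where at least one nested comment was written by `who`."""
    return [d for d in docs
            if any(c["who"] == who for c in d.get("comments", []))]


def posts_with_comments_between(docs, start, end):
    """Posts where at least one nested comment falls in [start, end)."""
    return [d for d in docs
            if any(start <= c["when"] < end for c in d.get("comments", []))]


print([p["title"] for p in posts_commented_by(posts, "jane")])
print([p["title"] for p in posts_with_comments_between(
    posts, datetime(2011, 10, 1), datetime(2011, 11, 1))])
```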

Some of the notable advanced features of MongoDB include automatic master-slave replication, auto-sharding of data, a very rich query language, secondary indexes on documents for efficient retrieval, and built-in support for Map-Reduce. It also offers very fine-grained control over reliability and durability for those who do not like the autopilot mode.

Most common Myth - No support for ACID
One of the most common myths about NoSQL databases is that they do not support Atomicity, Consistency, Isolation and Durability. However, because of the way the data is stored, NoSQL databases look at ACID in a different light: there is no need for lots of joins, and related data is stored as a single document, so it is okay for transaction boundaries to be relaxed to the per-document level. MongoDB does support atomic writes at the document level, and reads and writes are consistent and durable (configurable) at the individual document level.

A common (mis)use case of NoSQL databases -- Continuous Delivery/Agile
It often happens that you develop a product envisaging a particular use case, but people then use it in ways you wouldn't have imagined. Many people have now started advocating NoSQL databases for agile projects just because you do not have to deal with the problems of enforcing a common, rigid schema. Each of the distributed team members can work on their own schema and all of them can go to production, because NoSQL allows each row (read: document) to have a structure different from other documents.

I would say the reasons for using a NoSQL database should be rooted in reasoning and fitment to the business need, rather than just in easing the development effort.

Future -- Polyglot Persistence
So will the existing RDBMSs stop existing in the near future? I would say NO, and introduce a term I heard from Martin Fowler -- Polyglot Persistence -- which means using different types of persistence according to what makes the most business/domain sense. There is no one size fits all.

Sunday, August 18, 2013

The Big Data landscape - What does it mean to you?

Big Data has taken the software industry by storm. Some of the interesting statistics hint that:-

Big data is a top business priority and drives enormous opportunity for business improvement. Wikibon’s own study projects that big data will be a $50 billion business by 2017.
Market research firm IDC has released a new forecast that shows the big data market is expected to grow from $3.2 billion in 2010 to $16.9 billion in 2015
94% of Hadoop users perform analytics on large volumes of data not possible before; 88% analyze data in greater detail; while 82% can now retain more of their data.

So in this post we try to demystify and see everything that Big Data might mean to you.

Big data is a dynamic that seemed to appear from almost nowhere. But in reality, Big Data is not new – and it is moving into mainstream and getting a lot more attention. The growth of Big Data is being enabled by inexpensive storage, a proliferation of sensor and data capture technology, increasing connections to information via the cloud and virtualised storage infrastructure, as well as innovative software and analysis tools. It is no surprise then that business analytics as a technology area is rising on the radars of CIOs and line-of-business (LOB) executives.

To me no other technology in the recent past has directly affected the lives of so many people, right from the developer to product owner to marketing dept, to CIO and finally the customer himself.

And rightly so, if something is touching so many lives, is it okay to mean different things to different people?

Most people define Big Data as the 3 V's -- velocity, variety and volume of data.

Much has been written on how the amount of data in the world is exploding in volume. According to a recent study, the amount of information created and replicated will surpass 1.8 zettabytes (1.8 trillion gigabytes) in 2011 – growing by a factor of 9 in just five years.

Big Data is not so much about the content that is created, nor is it even about just consumption. It is more about the analysis of the data and how that needs to be done. Although the variety of content (unstructured, semi-structured) does play a huge role, it is not really a ‘thing’, but instead a dynamic/activity that crosses many IT borders.

With the focus on Big Data going mainstream, a range of new technologies have hit the market. The table below gives an overview of these technologies, with associated context (note that the list is not exhaustive).

Technology	Context
Big Table	Proprietary distributed database system built on the Google File System. Inspiration for HBase.
Data Warehouse & BI	Consists of an integrated set of servers, storage, operating system(s),database, business intelligence, data mining and other software specifically pre-installed and pre-optimised for data warehousing.
Hadoop	Multiple computers, communicating through a network, used to solve a common computational problem. The problem is divided into multiple tasks, each of which is solved by one or more computers working in parallel. Improved price:performance ratio, higher reliability and more scalability.
NoSQL / Key value store	A non-relational database is one that does not store data in tables (rows and columns), in contrast to a relational database. Key-value stores allow for the management of schema-less (NoSQL) entities. E.g. HBase, Cassandra, Couchbase, MongoDB etc.
Machine Learning	Machine learning is a field that is closely related to data mining and often uses techniques from statistics, probability theory, pattern recognition, and a host of other areas. It's used to build systems like those at Netflix and Amazon that recommend products to users based on past purchases, or systems that find all of the similar news articles on a given day. It can also be used to categorize Web pages automatically according to genre (sports, economy, war, and so on) or to mark e-mail messages as spam.

Below is the architecture diagram of how the whole thing comes together. We will cover the diagram and its components in more detail in my upcoming posts.

Conclusion

Apart from the 3 V's mentioned above, Big Data is also equally about the 4th V -- 'Value'.

It is about creating value out of data using some or all of the above technologies. It entails data analytics, using technologies like NoSQL to store vast amounts of historical data and technologies like Hadoop to process it on commodity hardware. The real power of data comes from correlating different sources of related data, hence it is imperative to use it in conjunction with data stored in existing data warehouses.

In future articles we will deal with each of the above, viz. NoSQL, Hadoop and Machine Learning, individually.

Wednesday, April 20, 2011

NoSQL Database

Last week I was involved in some design discussions around implementing/re-designing Dining and Entertainment pages for a star group of hotels.

CASE IN POINT:-

The hotel wanted to revamp their Dining and Entertainment (D&E) pages by adding new content, new types of content, a new page layout, etc.

The required layout was pretty simple, with some normal stuff like a title, an introduction, a detailed description, a highlights section, etc.

As would be expected, we did our due diligence in designing the RELATIONAL database tables and came up with a NORMALIZED one-to-many model, the kind that we are so used to doing, probably even with our eyes closed.

No points for guessing what the tables might look like. Yeah, you guessed it right: we have a D&E table and a couple of tables to hold the associations, like Menu and Highlights.

DOING IT DIFFERENTLY: -

The question to ask is: do we really need a relational database to structure something like this? Some of the characteristics that might lead us to an RDBMS are not required here, viz:- there are no cross-references, hence no normalization is required, and the entire page always has to be loaded, so no lazy loading of a subset is required.

The design was NOT extensible at all. Today we have tables for each of the sub-components, like Menus and Highlights. Tomorrow, if they decide to add another section to the page, let's say "comments", we will all be scrambling to add another table to the database and link it with the existing D&E tables.

We are probably trying to store un-structured components of a document as relational entities, so why not use a Document Oriented Database? Some characteristics of Document Oriented Database are: -

Documents (objects) map nicely to programming language data types

Embedded documents and arrays reduce need for joins

Dynamically-typed (schema less) for easy schema evolution

No joins and no (multi-object) transactions for high performance and easy scalability

Enter MongoDB (http://www.mongodb.org/display/DOCS/Schema+Design). It is a NoSQL database and is riding high on the waves of popular sentiment after the success of the likes of Cassandra, Hadoop etc.

In MongoDB we could have stored the contents of the entire page as a document structured as JSON, where each sub-document is free to extend itself. Adding another section to a page would be a breeze: it would just involve adding a sub-document to the already existing document tree with D&E as its root. No joins are required to read the document, leading to a simpler and faster data access layer.
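As a rough illustration, the D&E page could look like the Python dict below. All field names and values are hypothetical, and add_section is a helper invented for this sketch, not a MongoDB call; the point is only that a new "comments" section is one more key on the same document rather than a new table.

```python
import json

# Hypothetical D&E page stored as a single document with D&E as its root.
page = {
    "_id": "dining-and-entertainment/rooftop-grill",
    "title": "Rooftop Grill",
    "introduction": "Open-air dining with a city view.",
    "description": "A longer description of the restaurant goes here.",
    "highlights": ["Live music on weekends", "Chef's tasting menu"],
    "menus": [
        {"name": "Dinner", "items": ["Grilled fish", "Paneer tikka"]},
    ],
}


def add_section(doc: dict, name: str, content) -> dict:
    """Attach a new section to the page document in place and return it."""
    doc[name] = content
    return doc


# Tomorrow's requirement: a "comments" section. No new table, no new join.
add_section(page, "comments", [{"who": "guest", "text": "Great view."}])

# The whole page is read and written as one unit.
print(json.dumps(page, indent=2))
```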

And to top it all, MongoDB supports concurrent reads and writes, replication, failover, sharding and more.

For all those Spring lovers, there is also a Spring project which makes reading from and writing to MongoDB fairly simple. It is called Spring Data and can be found here (http://www.springsource.org/spring-data)

Cheers!

Tuesday, December 21, 2010

websites for mobiles or websites for PC

A lot of people who are trying to build a mobile version of their website have a basic question: should they have a separate version of the website for mobile, or just separate style sheets for rendering on mobiles?

I guess if we were asked to design something similar, we would say: make the server return XML, which could then be rendered in different formats using XSLT.

However, people are exploring other approaches which are more in line with our current way of developing websites for the PC.

Below are few links which talk about two such approaches, one uses Spring Mobile which has close integration with Spring MVC and other uses media queries:-

1. http://www.springsource.org/spring-mobile -- This one intercepts the incoming HTTP request, identifies whether the originating device is a PC or a mobile, and forwards the request to the appropriate tile definition.
2. http://www.webmonkey.com/2010/09/make-a-big-splash-on-small-screens-with-media-queries/
3. http://www.smashingmagazine.com/2010/07/19/how-to-use-css3-media-queries-to-create-a-mobile-version-of-your-website/
-- This approach helps you create style sheets that adapt to the target device during rendering, adjusting for small screens and the like.

Friday, May 29, 2009

J2me/Mobile logging

One can use MicroLog as a logging framework for J2ME mobile applications. More information can be found here -- http://microlog.sourceforge.net/snapshot/

This also has Android support.

J2ME Junit versus Sony Ericsson Mobile Junit

JUNIT for J2ME

The primary reason we need a separate JUnit framework for J2ME is that in conventional JUnit for J2SE, test methods are identified either using reflection (the test method name should start with 'test') or, starting with JUnit 4.0, using annotations. Unfortunately, neither of these is supported in CLDC.

There are two flavours in which one can write JUnit tests for J2ME applications:

1. J2MEUnit from SourceForge -- http://j2meunit.sourceforge.net/doc.html

It is very similar to the JUnit framework. Test methods are created in classes which extend j2meunit.framework.TestCase instead of junit.framework.TestCase. Because reflection is not available, developers always have to explicitly build the test suites that shall be run (as is possible in JUnit too). This is done by implementing the method suite() (which is an instance method in J2MEUnit and not static as in JUnit!) and returning a new instance of the class TestSuite that contains all test methods from the test case that a developer wants to include in the test suite.

The best way to add test methods to a suite is with instances of j2meunit.framework.TestMethod. TestMethod is an interface with a single method, run(TestCase). The argument to run() at test execution time is the test case the TestMethod instance is associated with. So all that needs to be done in the implementation of the run method is to call the corresponding test method in the test case instance after casting it to the correct type. Here is an example from the method suite() in the class j2meunit.examples.TestOne that can be found in the source code distribution of J2MEUnit:

    aSuite.addTest(new TestOne("testOne",
        new TestMethod()
        {
            public void run(TestCase tc)
            {
                ((TestOne) tc).testOne();
            }
        }));
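The difference between reflection-based discovery and the explicit registration that J2MEUnit requires can also be sketched outside Java. The Python below is only an analogy, chosen for brevity: the class and function names are hypothetical and none of this is J2MEUnit's or JUnit's API.

```python
class TestOne:
    """Hypothetical test case with two test methods and one helper."""

    def __init__(self):
        self.log = []

    def test_one(self):
        self.log.append("test_one")

    def test_two(self):
        self.log.append("test_two")

    def helper(self):
        self.log.append("helper")


def discover_by_reflection(case):
    """Reflection style: find bound methods whose name starts with 'test'."""
    return [getattr(case, name) for name in sorted(dir(case))
            if name.startswith("test") and callable(getattr(case, name))]


def explicit_suite():
    """Explicit style: the developer lists every test by hand, each wrapped
    in a small callable that receives the test case instance."""
    return [
        ("test_one", lambda tc: tc.test_one()),
        ("test_two", lambda tc: tc.test_two()),
    ]


reflected = TestOne()
for method in discover_by_reflection(reflected):
    method()

explicit = TestOne()
for name, run in explicit_suite():
    run(explicit)

print(reflected.log, explicit.log)
```

Both runs execute the same tests; the explicit version simply moves the bookkeeping from the runtime to the developer, which is the trade-off a platform without reflection forces.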

Tests can be run through a class called TestRunner, as in JUnit. J2MEUnit comes with two implementations of TestRunner. The first one is j2meunit.textui.TestRunner, which is similar to the command-line TestRunner from JUnit. The second, j2meunit.midletui.TestRunner, is an enhanced TestRunner implemented as a MIDlet that can be run in the WTK emulator or on a real J2ME device like a mobile phone. Therefore it is recommended to use the implementation j2meunit.midletui.TestRunner to run tests.

Creating an instance of the test runner MIDlet is simple and straightforward; the only thing that needs to be done is to tell the test runner which tests to execute. This can be done by subclassing j2meunit.midletui.TestRunner, implementing the startApp() method (of the J2ME MIDlet base class), and calling the start() method of TestRunner with the test cases and/or suites as parameters, as in the ExampleTestMidlet class that is part of the J2MEUnit source distribution:

    protected void startApp()
    {
        start(new String[] { "j2meunit.examples.TestAll" });
    }

2. Sony Ericsson Mobile JUnit

Mobile JUnit can be downloaded from the Java Docs & Tools section of the Sony Ericsson Developer World site. Mobile JUnit depends on the Sun Java Wireless Toolkit for CLDC (WTK) and can be used with any development tool that incorporates or extends the WTK, such as the Sony Ericsson SDK for the Java ME platform.

Here the test classes extend com.sonyericsson.junit.framework.TestCase.

To run the tests, open a console window and set the current directory to the Mobile JUnit folder. The run-mobile-junit batch file is used to compile and run any test cases it finds in the test folder of the specified project. Use the following command to run the tests:


    set projects=c:\SonyEricsson\JavaME_SDK_CLDC\PC_Emulation\WTK2\apps
    run-mobile-junit --project-dir:%projects%\mobile-ju-sampleproject --device:SonyEricsson_W800_Emu --compile-midlet:yes

This generates and compiles a MIDlet-based test runner, starts the SonyEricsson W800 emulator, and runs the tests. Because CLDC does not support reflection, Mobile JUnit instead uses generated helper classes to list and invoke individual tests. A master list of test cases is also generated.

To run the tests, a test MIDlet is generated. The main class of the MIDlet is a Mobile JUnit test runner that uses the list of test cases to run the tests. The test runner, the test cases, the helper classes, and the Mobile JUnit framework itself are packaged together into a single JAR file. A JAD file is also generated. The device emulator loads the generated MIDlet suite and starts the MIDlet to run the tests.

Mobile JUnit defines setUp and tearDown methods in its implementation of the TestCase class. These methods can be used as expected to initialize and deinitialize test fixtures.

Tests that require a MIDlet main class can use the overloaded version of setUp to access the instance of javax.microedition.midlet.MIDlet that is running the test case.

Thursday, May 21, 2009

WSDL styles interoperability between Dot Net and Java web services.

Coding for web services? Most people go with the defaults that frameworks like Axis or Metro provide, but have we ever wondered about interoperability, one of the key selling points of web services?

What if the web service that we have written cannot be consumed by a Dot Net client? Doesn't that refute the very notion of interoperability?

Well, Dot Net web services use the document/literal wrapped style of WSDL by default, and Dot Net clients interoperate most reliably with it. This style is not documented anywhere, nor is it a formal standard, but it came from Microsoft and has grown into a standard of its own, with huge acceptance across the fraternities.

The following link from IBM developerWorks explains the nitty-gritty of which WSDL style to use and the differences between all of the available styles, so that you can make an informed decision.

http://www.ibm.com/developerworks/webservices/library/ws-whichwsdl/
