Do a Google search on Big Data and you are most likely to find people talking about two things:

  • How open source solutions like Hadoop have pioneered this space
  • How some companies have used these solutions to build large-scale analytics and business intelligence solutions

Read more and you will find mention of Map Reduce and how many of the NoSQL data stores support the useful “Data Locality” pattern – taking compute to where the data is.

Hadoop users and the creators themselves acknowledge that the technology is good for “streaming reads” and supports high throughput at the cost of latency. This constraint, together with the fact that Map Reduce tasks are very I/O bound, makes it seemingly unsuitable for use cases where users wait on a response, such as OLTP applications.

While all of the above is relevant and mostly true, is it also leading to a certain stereotyping – that of equating Big Data with Big Analytics?

It might be useful to describe Big Data first. Gartner categorizes data build-up in an enterprise along three dimensions: Volume, Variety and Velocity. Rapid growth in any of these categories, or combinations thereof, results in Big Data. It is worth noting that there is no classification for transaction processing or analytics, implying that Big Data is not just Big Analytics.

Big Data solutions need not be limited to Big Analytics and may extend to low latency data access workloads as well. A few random thoughts on patterns and solutions:

  • Data Sharding – useful for scaling low-latency data stores like an RDBMS to store Big Data. Sharding may be built into application code, handled by an intermediary between the application and the data store, or supported natively by the data store via auto-sharding (see the sketch after this list).
  • Data Stores by purpose – Big Data invariably means distribution and may result in data duplication, within a single store or across multiple stores. For example, data extracts from a DFS like Hadoop’s HDFS may also be stored in a high-speed NoSQL store or a sharded RDBMS and accessed via secondary indices. This can lead to the trade-offs outlined by the CAP theorem (http://en.wikipedia.org/wiki/CAP_theorem).
  • Data Stores that effectively leverage CPU, RAM and disk space – Moore’s Law has held over the last few years, and data stores like Google’s BigTable (or HBase) successfully leverage the resulting abundance of commodity compute, memory and storage.
  • Optimized Compute Patterns – efforts like Peregrine (http://peregrine_mapreduce.bitbucket.org/) that support pipelined Map Reduce jobs.
  • Data-aware Grid topologies – a compute grid where a worker’s participation in a compute task is influenced by the data available locally to that worker, usually in-memory. Note that this is different from the data locality pattern implemented in most Map Reduce frameworks.
  • And more…..
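
To make the first pattern concrete, here is a minimal sketch of hash-based sharding in application code. The ShardRouter class and the modulo scheme are purely illustrative; a production system would likely use consistent hashing instead, to ease re-balancing when shards are added.

——————————————————–

import java.util.List;
import javax.sql.DataSource;

// Illustrative application-level sharding: pick one of N data sources
// by hashing the shard key.
public class ShardRouter {

    private final List<DataSource> shards;

    public ShardRouter(List<DataSource> shards) {
        this.shards = shards;
    }

    // Maps a key to a shard; masking with Integer.MAX_VALUE avoids a
    // negative index when hashCode() is negative.
    public DataSource shardFor(String key) {
        int index = (key.hashCode() & Integer.MAX_VALUE) % shards.size();
        return shards.get(index);
    }
}

——————————————————–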

It may suffice to say that Big Analytics has been the most visible and most commonly deployed use case for Big Data. New-age companies, especially internet-based ones, have been using Big Data technologies to deliver content sharing, email, instant messaging and social platform services in near real time. Enterprises are slowly but surely warming up to this trend.

I warmed up to the idea of OSGi after reading an introduction like: “Use OSGi to service-enable your JVM”. For background, we at MindTree have created an SOA-based J2EE delivery platform built on open source technologies – primarily Spring. In this framework, we expose POJOs as services using Spring remoting and access them via the Mule ESB. The idea of being able to expose POJOs as services without having to remote-enable them made OSGi an interesting proposition.

I chose Spring Dynamic Modules (Spring DM) for three reasons:

  • I hoped to leverage all of Spring’s capabilities while writing my service
  • Ability to manage my service POJOs as Spring beans
  • Easy migration path for our framework

My test was to write a POJO, declare it as a Spring bean and an OSGi service (using Spring DM), and eventually access the service from a standalone application. This covers a typical usage scenario. I used the “simple-service” sample code that comes with the Spring DM distribution (version 1.1.0); the service side is sketched below.
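
For reference, the service side is just a plain interface and implementation. The names below follow the simple-service sample’s package but are otherwise illustrative; the actual publishing is done by Spring DM through an <osgi:service> declaration in the bundle’s META-INF/spring context file, not by any OSGi API calls in the POJO itself.

——————————————————–

package org.springframework.osgi.samples.simpleservice;

// The service contract. This package is exported by the fragment bundle
// (see the manifests below) so a non-OSGi client can cast against it.
public interface MyService {
    String stringValue();
}

// In a separate source file, the implementation is a plain POJO, declared
// as a Spring bean in the service bundle (illustrative):
package org.springframework.osgi.samples.simpleservice.impl;

public class MyServiceImpl
        implements org.springframework.osgi.samples.simpleservice.MyService {
    public String stringValue() {
        return "hello from an OSGi service";
    }
}

——————————————————–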

I faced issues either because I was not able to find the right documentation or because it was simply not there. I encountered and eventually overcame the following issues:

  • Ability to launch an OSGi container in standalone mode and without the prompt – most samples on the internet showed me how to bring up Equinox with the “osgi>” prompt. This was easily solved though.
  • Ability to load all the required OSGi bundles *before* I could load my bundle
  • Ability to look up a published service from the bundle I deployed and invoke it from my standalone code (a launch sketch follows this list)
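
The first two issues come down to starting Equinox programmatically and installing bundles in a known order. A hedged sketch using Equinox’s EclipseStarter follows; the bundle paths are illustrative, and the exact set of infrastructure bundles (Spring, Spring DM and their dependencies) depends on your distribution.

——————————————————–

import org.eclipse.core.runtime.adaptor.EclipseStarter;
import org.osgi.framework.Bundle;
import org.osgi.framework.BundleContext;

// Sketch: launch Equinox headless (omit "-console" so no "osgi>" prompt
// appears) and install bundles in a controlled order.
public class HeadlessLauncher {

    public static BundleContext launch() throws Exception {
        BundleContext context =
                EclipseStarter.startup(new String[] { "-clean" }, null);

        // Infrastructure bundles must be installed and started *before*
        // the application bundle; the paths below are purely illustrative.
        String[] bundlePaths = {
            "file:lib/spring-osgi-core.jar",
            "file:bundles/simple-service-boot.jar", // the fragment
            "file:bundles/simple-service.jar"
        };
        for (String path : bundlePaths) {
            Bundle bundle = context.installBundle(path);
            // Fragments cannot be started; they attach to their host.
            if (bundle.getHeaders().get("Fragment-Host") == null) {
                bundle.start();
            }
        }
        return context;
    }
}

——————————————————–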

A few lessons learned from this exercise are:

  • OSGi is very strict when it comes to class visibility and loading. Parent class loader delegation does not apply – something we are used to in J2SE and J2EE applications
  • It is easy to share services and packages between bundles. It is not all that easy if you want to share between a bundle and a non-OSGi client application. For example, you cannot simply type-cast a service you have looked up to an interface that is packaged in the bundle and available on your local classpath.
  • Boot path delegation (creating fragment bundles, i.e. extensions to the system bundle) can help address the above need, but must be used selectively and carefully in order to preserve the benefits of OSGi.
  • All bundles are loaded asynchronously. You need to account for this before you look up any service (see the lookup sketch below).
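
Putting the last two lessons together, the client-side lookup has to wait for the asynchronously published service and then cast it to an interface that both the bundle and the client class loader can see (which is what the fragment enables). A minimal sketch, assuming the MyService interface shown earlier:

——————————————————–

import org.osgi.framework.BundleContext;
import org.osgi.util.tracker.ServiceTracker;
import org.springframework.osgi.samples.simpleservice.MyService;

// Sketch: block until Spring DM has published the service, then invoke it.
public class ServiceLookup {

    public static String callService(BundleContext context)
            throws InterruptedException {
        ServiceTracker tracker =
                new ServiceTracker(context, MyService.class.getName(), null);
        tracker.open();
        // Bundles start asynchronously; wait (up to 30s) for publication.
        MyService service = (MyService) tracker.waitForService(30000);
        if (service == null) {
            throw new IllegalStateException("Service was not published in time");
        }
        // This cast only works because the fragment exports the interface
        // package via the system bundle (see the manifests below).
        return service.stringValue();
    }
}

——————————————————–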

I have replicated below the manifest contents from my fragment bundle (the one that exports the common interfaces and classes) and the service bundle (the one that imports the common interfaces and implements the service):

Fragment manifest:

——————————————————–

Manifest-Version: 1.0
Archiver-Version: Plexus Archiver
Created-By: Apache Maven
Built-By: m1000350
Build-Jdk: 1.5.0
Bundle-Version: 1.0
Bundle-SymbolicName: org.springframework.osgi.samples.simpleservice.boot
Bundle-Name: Simple-Service-Sample-boot
Bundle-Vendor: Spring Framework
Fragment-Host: system.bundle <— note this extension
Export-Package: org.springframework.osgi.samples.simpleservice <— note this export
Bundle-ManifestVersion: 2

———————————————————

The manifest for my service bundle looks like:

——————————————————–

Manifest-Version: 1.0
Archiver-Version: Plexus Archiver
Created-By: Apache Maven
Built-By: m1000350
Build-Jdk: 1.5.0
Bundle-Version: 1.0
Spring-Context: *;create-asynchronously=false <— Spring DM integration
Bundle-SymbolicName: org.springframework.osgi.samples.simpleservice
Bundle-Name: Simple-Service-Sample
Bundle-Vendor: Spring Framework
Bundle-ManifestVersion: 2
Import-Package: org.springframework.osgi.samples.simpleservice <— note this import

——————————————————–

I have attached the code for my sample application. Unfortunately, the complete Spring DM integration (the use of OsgiBundleXmlApplicationContext) does not work, and I have posted a question on the Spring support forum: Spring DM issue.

I was however able to invoke the OSGi service from my non-OSGi client.

Disclaimer: The code is just a sample and is not validated in its approach or manner of using the OSGi platform.

OSGiTest – Rename to .java and open in your favorite editor

Of late I have been quite intrigued by some analysts’ reports on IT becoming a commodity service. By being a commodity, it no longer appears intellectual or elite.
The driving forces: cost, expected higher productivity, competition, etc.

We view and judge programming languages and platforms by the amount of flexibility they provide. This explains the umpteen configuration files that our applications have these days. After all, we have been taught to “externalize” as much as possible out of the application code – the reason: maintainability and flexibility.

But haven’t we taken it a bit too far? How often do table and column names change, for example? Can we instead agree on conventions for a few of these? Why conventions? Because it opens many exciting possibilities around creating or using frameworks that do a lot of work for you (a toy example follows). A great example is the Ruby on Rails (RoR) platform. I am not endorsing RoR here; I do, however, like its idea of being able to do so much behind the scenes because the application artifacts – tables, classes – follow convention. Come to think of it – we do enforce conventions, don’t we?
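
To make that concrete, here is a toy, entirely hypothetical sketch of what a convention-based framework can do: derive a table name from an entity class name, so that no mapping configuration is needed at all.

——————————————————–

// Toy sketch: derive a table name from a class name by convention,
// e.g. OrderItem -> "order_items", so no mapping file is required.
public class ConventionMapper {

    public static String tableNameFor(Class<?> entity) {
        String name = entity.getSimpleName();
        StringBuilder table = new StringBuilder();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (Character.isUpperCase(c) && i > 0) {
                table.append('_');
            }
            table.append(Character.toLowerCase(c));
        }
        return table.append('s').toString(); // naive pluralization
    }
}

// Usage: ConventionMapper.tableNameFor(OrderItem.class) yields "order_items"

——————————————————–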

So why not create the conventions in such a way that they benefit us, the developers, and not just some standards watchdog? Eventually everybody benefits – the developer writes less code, the project is done cheaper, developers are seen as being more productive, costs come down, etc.

Get what I am driving at? I seriously feel that the comments in the early part of this post will become a reality. We just need to be able to redefine the way we do things – one such change is adopting “Convention over Configuration” and building intelligent frameworks on top. I see RoR doing that.

Got this cool idea on ClustrMaps from another blog. Going to update mine to include one as well. It truly amazes me to see what ideas people come up with the world over.
Makes me wonder about the revenue model that sustains these companies…