Do a Google search on Big Data and you are most likely to find people talking about two things:

  • How Open Source solutions like Hadoop have pioneered this space
  • How some companies have used these solutions to build large scale analytics solutions and business intelligence modules.

Read more and one will find mention of Map Reduce and how many of the NoSQL data stores support this useful “Data Locality” pattern – taking compute to where the data is.

Hadoop users and the creators themselves acknowledge that the technology is good for “streaming reads” and supports high throughput at the cost of latency. This constraint, and the fact that Map Reduce tasks are very I/O bound, make it seemingly unsuitable for use cases where users wait for a response, such as OLTP applications.

While all of the above is relevant and mostly true, is it also leading to a certain stereotyping – that of equating Big Data with Big Analytics?

It might be useful to describe Big Data first. Gartner categorizes data build-up in an enterprise under three heads: Volume, Variety and Velocity. Rapid growth in any of these categories, or combinations thereof, results in Big Data. It is worth noting that there is no classification by workload – transaction processing or analytics – thereby implying that Big Data is not just Big Analytics.

Big Data solutions need not be limited to Big Analytics and may extend to low latency data access workloads as well. A few random thoughts on patterns and solutions:

  • Data Sharding – useful for scaling low-latency data stores like an RDBMS to hold Big Data. Sharding may be built into application code, handled by an intermediary between the application and the data store, or supported inherently by the data store through auto-sharding. A minimal sketch of application-level sharding appears after this list.
  • Data Stores by purpose – Big Data invariably means distribution and may result in data duplication, within a single store or across multiple stores. For example, data extracts from a DFS like Hadoop may also be stored in a high-speed NoSQL store or a sharded RDBMS and accessed via secondary indices. This can lead to the trade-offs outlined by the CAP theorem (http://en.wikipedia.org/wiki/CAP_theorem).
  • Data Stores that effectively leverage CPU, RAM and disk – Moore’s Law has held true over the last few years, and data stores like Google Bigtable (or HBase) successfully leverage the resulting abundance of commodity compute, memory and storage.
  • Optimized Compute Patterns – efforts like Peregrine (http://peregrine_mapreduce.bitbucket.org/) that support pipelined Map Reduce jobs.
  • Data aware Grid topologies – A compute grid where worker participation in a compute task is influenced by data available locally to the worker, usually in-memory. Note that this is different from the data locality pattern implemented in most Map Reduce frameworks.
  • And more…..
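
Picking up the sharding point above: a minimal sketch, in Java, of what application-level sharding might look like – the shard for a record is chosen by hashing its key over a fixed list of data sources. The class and method names (ShardResolver, shardFor) are illustrative, not any particular product’s API:

    import javax.sql.DataSource;
    import java.util.List;

    // Illustrative application-level sharding: the shard for a record is chosen
    // by hashing its key against a fixed list of data sources.
    public class ShardResolver {

        private final List<DataSource> shards;

        public ShardResolver(List<DataSource> shards) {
            this.shards = shards;
        }

        // Maps a sharding key (e.g. a customer id) to one of the configured shards.
        public DataSource shardFor(String shardKey) {
            int bucket = Math.abs(shardKey.hashCode() % shards.size());
            return shards.get(bucket);
        }
    }

Plain modulo hashing makes re-sharding painful when shards are added; consistent hashing is the usual refinement, and data stores with auto-sharding handle this for you.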

It may suffice to say that Big Analytics has been the most visible and most commonly deployed use case of Big Data. New-age companies, especially the internet-based ones, have been using Big Data technologies to deliver content sharing, email, instant messaging and social platform services in near real time. Enterprises are slowly but surely warming up to this trend.

Mention SOA or Services and most of your audience would immediately relate it to web-services – yes, the often unintended misuse of XML over HTTP that gives the technology and anything related to it a bad name in the world of high-performance J2EE applications.

Two of the biggest culprits in loss of performance are I/O and transformation overheads. Web services suffer from both: increased data transfer (i.e. higher I/O) owing to the markup overhead of well-formed XML, and the CPU overhead of converting XML to Java and back, a.k.a. marshalling.

Web-services and their implementation of XML over HTTP are good when genuinely needed – for example, exposing services for consumption by partner organizations where the consuming technologies are not known, or for integration between disparate systems. However, this need for integration often leads people to stereotype the services in an SOA as web-services.

The question then is: can we reap the benefits of SOA and not suffer the drawbacks of the overheads inherent in web-services? I believe we can.

Quite a while back, I read the excellent IBM Redbook Implementing SOA using ESB, where the author recommends deploying a B2B Gateway external to the ESB. I must admit it did not make much sense to me then; I have come to appreciate it much better these days. A B2B Gateway enables consumption of services by a “third party”. This “third party” may be a client on a different technology platform or from an altogether different organization.

A separate B2B gateway introduces the possibility of:

  • Making the web-service channel independent of the service implementation, and therefore a matter of choice whether to use (and suffer) the XML over HTTP interface
  • Introducing the much-required security standards (and implementations) for securing services and the data managed by the services
  • Using third party implementations that specialize in implementing WS-* policies
  • Using hardware to augment the processing capability provided by software frameworks – e.g. XML appliances

The SOA runtime therefore must enable services to be written independently of XML and the WS-* specifications/constraints. The web-service interface is then an optional channel, via a B2B Gateway, to invoke the services.
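
To make this concrete, here is a minimal, hypothetical sketch of a service written independently of XML and WS-*: a plain Java contract with simple data holders. In-process or JMS consumers can invoke it directly, while a B2B Gateway may optionally expose the same contract as XML over HTTP for external consumers. The names below are illustrative and not taken from any specific framework:

    // Illustrative only: a service expressed as a plain Java contract with no
    // XML or WS-* types anywhere in its signature.
    public interface QuoteService {
        QuoteResponse getQuote(QuoteRequest request);
    }

    // Simple data holders; these could be generated from an XSD, but the
    // service code itself never touches XML.
    class QuoteRequest {
        String productCode;
        int quantity;
    }

    class QuoteResponse {
        String productCode;
        double unitPrice;
    }

A gateway or channel adapter deployed separately can marshal such requests to and from XML, keeping that cost out of the paths that do not need it.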

We, at MindTree, have taken this design further in our implementation of Momentum – an SOA-based delivery platform. Interfaces like JMS and web-services are optional channels provided by the framework to invoke any deployed service. A schematic that explains this approach is shown below:

 

[Schematic: Request flow to a Service from different channels]

The web-service interface is therefore an optional means to invoke your service when you separate the service container and the (optional) ESB from the B2B Gateway and deploy the latter as separate infrastructure. You can then benefit from the good of web-services without compromising your service’s QoS.

MCA or SCEA

September 9, 2008

Let me set one thing straight – I have never been a big fan of certifications. It stems from the ones on programming languages that require you to remember stuff that is easily found in the API docs. And, of course, the aversion to written examinations that all of us with Indian schooling know so well and consequently loathe 🙂 I had therefore never taken up any professional certifications.

So, it was with some misgiving that I responded to a colleague’s query about whether I would like to take up the Microsoft Certified Architect evaluation. Having used Java predominantly for the last few years, my first and obvious question was – how is the MCA relevant to a Java person? I listened a bit to the process and decided to give it a try for two reasons:

  • The evaluation process was around a solution that one has architected i.e. it was case study based
  • My colleague’s persuasion that it was a worthwhile experience

I’ll leave the details of the evaluation process out and just say this much: I created a few documents on competency and a case study, and a presentation that was to be made to the board. The initial screening round was telephonic and was conducted by an existing MCA.

The review board was a four-member team that I had to meet face to face to “defend the case study”, as the process put it. I was truly impressed by the end of it all – not because I got the certification but because of the manner in which it was conducted. The review by the board was the highlight of the process, and I said as much when asked about my motivation for taking this up.

A few impressive things about the process:
  • Emphasis on one’s own experience in architecting a solution, with almost negligible weight on theoretical knowledge. One can in no way prepare for this evaluation overnight.
  • Thorough professionalism of the people involved. You work and interface with existing MCAs throughout the process; I interacted with a total of 7. This is significant if you consider that there are a little over 100 MCAs worldwide. That is the amount of focus and attention every candidate receives.
  • Clear separation of the roles being certified – Solution Architect and Infrastructure Architect. There is Enterprise Architect as well. I applied for Solution Architect.
  • The feedback one receives at the end of the certification. This feedback is very personalized – for example, specific books that one is advised to read based on the board review.
  • The collective experience of the architects on the board. It is OK to admit you don’t know the answers to questions (and you will, during the review :)) that come at you in quick succession. I learnt it is difficult to match the collective knowledge of all those gentlemen on the board. I am guessing the average age of an MCA on the board is well over 40 – seasoned architects, surely!
  • Inclusion into the fraternity of MCAs. This is a small group of like-minded people. I was invited to be a panelist speaker at the “Microsoft Architecture Days” session in Bangalore. I was allowed to talk about Java when discussing the topic “Software + Services”! This might have been unheard of before.
  • Lastly, the nicely mounted certificate signed by Bill Gates 🙂

Where does the “Sun Certified Enterprise Architect” stand in comparison? Again, I was unaware of its existence until recently, when I saw it on a candidate’s resume. I was impressed by the name and the source and therefore did some googling on what it took to become an SCEA. This was before I spoke to the candidate, who was an SCEA. I was disappointed by what I heard and what I read about the certification:

  • Firstly, the evaluation process tests one’s skills at application design and, at best, solution architecture. I feel it is a complete misnomer to have “enterprise architect” in its name.
  • The process is mostly offline. I wonder how “cheating” is prevented, i.e. one person defining the solution on behalf of another.
  • Competencies essential to an architect such as communication, leadership and process are not evaluated here at all.
  • All the benefits or highlights I experienced in the MCA are missing here.

It would really do good to make the SCEA more interactive, personal, effective and valued, especially since it comes from such a credible source as Sun Microsystems. Until then I will continue to vouch for the MCA programme – true value from Microsoft.

Today it would not raise many eyebrows if I said that an application server need not necessarily be a physical standalone entity or process and can actually be “assembled” from a set of available frameworks. Thanks to frameworks like Spring and standards like OSGi (and the containers supporting it), this can be realized and not remain just on paper.

How about an ESB? The picture that most of us have seen (and reproduced in this post: Myth of the ESB) appears to imply a big central infrastructure that acts as a conduit for all integration and message-based interactions within an enterprise.

How easy or difficult is it to realize such a deployment in an organization? I have worked on SOA for a few years and more recently on deploying an ESB as an infrastructure component in an SOA. Not once in these cases have we done such a deployment. Why? The answer is simple – most organizations take an incremental approach to SOA adoption, and not all existing applications are amenable to migrating their business logic into services that are thereafter accessed via an ESB.

Does this mean an ESB is only a promise? Not if you care to see it a little differently. I’ll go back to my earlier example of an application server. In the world of light-weight containers, components and dynamically loaded modules, the application server can be seen as a set of capabilities realized through an integrated set of components, frameworks, libraries and a runtime. The advantage here is that one can use as much of the application server as is needed.

If I were to apply the same analogy to an ESB, the first question that comes to my mind is – what are the capabilities of an ESB, or what can an ESB provide to an application? IBM puts it nicely (and using a nice term) by defining a “Capability Model” in their Redbook on “Implementing an SOA using an ESB”. This capability model, for me, also serves as a good yardstick to measure an ESB offering – a sort of self-compliance check.

A gist of the capability model is listed below:

  • Communication: Routing, Addressing, Response/request, Publish/subscribe, Sync/Async, etc.
  • Service Interaction: Service interface definition (WSDL), Substitution of service implementation, Service directory and discovery
  • Integration: Database, Legacy and application adapters, Service mapping, Protocol transformation
  • Quality of Service: Transactions, Assured delivery
  • Security: Authentication, Authorization, Security standards
  • Service Level: Performance, Throughput, Availability
  • Message Processing: Content-based logic, Message and data transformations, Validation
  • Management and Autonomic: Administration capability, Logging, Metering, Monitoring
  • Modeling: Object modeling, Common business object models
  • Infrastructure Intelligence: Business rules, Policy-driven behavior

Now that’s a huge list of capabilities, and you may not need all of them – again, as with an application server, you may not need JNDI, JTA, JMS and an EJB container all at once. However, you can integrate these capabilities when needed. Can you do the same for an ESB? Yes, you can. How one does that is a different post altogether. One might ask: but why take the trouble?

The answers come from the cost of deployment, the set of features truly available in the fancy ESBs, and the flexibility of actually using an ESB’s features within your own runtime process. There are quite a few lightweight ESBs that help you do this. If you look carefully at the features in these ESBs, you’ll notice that they fall short of the Capability Model in more ways than one. This is true for the commercial ESBs as well. The lack of standards adds to the disconnect. In the meantime, considering an ESB as a logical set of capabilities and using it that way puts many things into perspective – including the value of an ESB in an SOA.
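
As an illustration of using an ESB capability within your own runtime process, here is a minimal sketch of one capability from the model – content-based routing – embedded directly in the application rather than in a central broker. It is plain Java with illustrative names, not any ESB product’s actual API:

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.function.Consumer;
    import java.util.function.Predicate;

    // A single ESB capability (content-based routing) hosted in the application's
    // own process. Routes are evaluated in registration order; the first match wins.
    public class ContentBasedRouter<T> {

        private final Map<Predicate<T>, Consumer<T>> routes = new LinkedHashMap<>();

        public ContentBasedRouter<T> route(Predicate<T> condition, Consumer<T> handler) {
            routes.put(condition, handler);
            return this;
        }

        public void dispatch(T message) {
            for (Map.Entry<Predicate<T>, Consumer<T>> entry : routes.entrySet()) {
                if (entry.getKey().test(message)) {
                    entry.getValue().accept(message);
                    return;
                }
            }
            throw new IllegalArgumentException("No route matched message: " + message);
        }
    }

One could register a route for, say, order messages and another as a catch-all, and add further capabilities (transformation, logging, metering) as equally small components only when the need arises.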

Bring up the topic of stateful and stateless applications and you are guaranteed a house divided, almost equally, on both sides. There are die-hard proponents of statelessness (and frameworks to support it; Spring, for one) and those that support stateful behavior (and frameworks to support it; JBoss Seam, for one).

This leaves the average developer confused – is holding state good or bad? The answer is more nuanced: it is inevitable in certain circumstances and avoidable in others.

A regular web application normally contains state, while service calls in the integration or infrastructure layer may not. Often the latter is designed that way, i.e. stateless, for the sake of scalability and fail-over.

An often-used approach to maintaining state is a user session. The most common choice is the HttpSession. Cluster-aware application servers handle replication of these sessions. Inefficiencies in session replication are often cited as reasons to move to a stateless design or to look for alternative means of replication. Let’s take a look at common approaches to managing user sessions before we decide on the merit of this move. Session replication choices:

  • No replication. No fail-over. Sticky behavior is the only choice in a redundant server deployment.
  • In memory replication. Default behavior in J2EE application servers.
  • DB based replication. Optional behavior in .Net and Ruby On Rails platforms.

Take any approach and you will find people giving you many reasons not to use it. Some of the reasons: lopsided load in the case of stickiness; inefficient in-memory replication and cluster license cost (3 times more) in the case of in-memory replication; and increased DB I/O in the case of DB-based sessions.

We might do better by addressing the problem before trying to find more efficient solutions. Control over what is considered valid state, and over the size of the state object graph, matters more. I follow these principles/practices when handling state in my applications. Some of them are not new and are in fact best-practice recommendations for the performance, robustness and overall hygiene of the system:

  • Store only key information in the session, i.e. only the minimal data with which you can reconstruct the entire session graph.
  • Store only domain model or equivalent data objects. Avoid objects that hold behavior. An easy way to implement this is to wrap session access with a layer that entertains, say, only XSD-derived objects, which effectively cuts out behavioral class instances. A sketch of such a wrapper follows this list.
  • Set a limit on the size of the session, i.e. avoid large session graphs. The session wrapper can ensure this.
  • Persist the session only if it is dirty. This applies where there is container support for it and in custom session persistence implementations.
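
A minimal sketch of such a session wrapper, assuming an HttpSession underneath: it admits only serializable data objects from a whitelisted package (standing in for XSD-derived types), caps the number of entries, and tracks a dirty flag so the session is persisted or replicated only when it has actually changed. The names and limits are illustrative:

    import javax.servlet.http.HttpSession;
    import java.io.Serializable;
    import java.util.Enumeration;

    // Guards what goes into the session: whitelisted data types only, a cap on
    // the number of entries, and a dirty flag for persist-only-if-changed logic.
    public class GuardedSession {

        private static final int MAX_ENTRIES = 20;                           // size limit
        private static final String ALLOWED_PACKAGE = "com.example.domain."; // e.g. XSD-derived types

        private final HttpSession session;
        private boolean dirty;

        public GuardedSession(HttpSession session) {
            this.session = session;
        }

        public void put(String key, Serializable value) {
            if (!value.getClass().getName().startsWith(ALLOWED_PACKAGE)) {
                throw new IllegalArgumentException("Only domain data objects may be stored in the session");
            }
            if (session.getAttribute(key) == null && countEntries() >= MAX_ENTRIES) {
                throw new IllegalStateException("Session size limit exceeded");
            }
            session.setAttribute(key, value);
            dirty = true;
        }

        public Object get(String key) {
            return session.getAttribute(key);
        }

        public boolean isDirty() {
            return dirty;
        }

        private int countEntries() {
            int count = 0;
            Enumeration<String> names = session.getAttributeNames();
            while (names.hasMoreElements()) {
                names.nextElement();
                count++;
            }
            return count;
        }
    }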

An application that follows all of the above would rarely need to debate on the cost of maintaining state via sessions – in memory or in the DB.

DB-based persistence is considered expensive and a misfit for storing transient data such as user session information. Interestingly, however, frameworks like .Net and Ruby on Rails (RoR), which matured later than J2EE, provide this as an option. In fact, it is the default in RoR, if I am not wrong.

Recently I had the opportunity to architect an SOA-based platform to build applications on top of. We wanted the core services to be stateless so that they could easily scale out when required. Naturally, we preferred the application servers to NOT be clustered and to be load balanced instead. The applications built on top had to contain minimal state, however. We also decided to mask session management from the consuming applications and implemented session persistence, and therefore recovery, using the DB. While there were initial apprehensions about DB I/O bottlenecks, adopting the principles described above helped us tide over the issue. The end applications have been in production for a year now. The logic we used in favor of DB-based sessions was this: the nature of DB access for, say, 100 concurrent users would mostly be READ, with the odd case of a WRITE (i.e. when a session gets dirty). 100 reads of small records using an index on a table are extremely fast, as there are no concurrency or transaction isolation issues – each read is for a specific record independent of the others. Anyway, we have the option to switch back (courtesy of the wrapper over session management) to Http sessions and clustering if performance sucked, which hasn’t happened till date.
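
A hedged sketch of the persist-only-if-dirty idea over plain JDBC: the common case is an indexed read of one small row per session, and a write happens only when the session has actually changed. The table and column names here are hypothetical:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    // DB-backed session persistence: frequent indexed reads, writes only when dirty.
    public class DbSessionStore {

        private final Connection connection;

        public DbSessionStore(Connection connection) {
            this.connection = connection;
        }

        // The common case: read one small row by its indexed key.
        public byte[] load(String sessionId) throws SQLException {
            try (PreparedStatement ps = connection.prepareStatement(
                    "SELECT session_data FROM user_session WHERE session_id = ?")) {
                ps.setString(1, sessionId);
                try (ResultSet rs = ps.executeQuery()) {
                    return rs.next() ? rs.getBytes(1) : null;
                }
            }
        }

        // The odd case: write the session back, but only if it changed.
        public void saveIfDirty(String sessionId, byte[] sessionData, boolean dirty) throws SQLException {
            if (!dirty) {
                return;
            }
            try (PreparedStatement update = connection.prepareStatement(
                    "UPDATE user_session SET session_data = ? WHERE session_id = ?")) {
                update.setBytes(1, sessionData);
                update.setString(2, sessionId);
                if (update.executeUpdate() == 0) {
                    try (PreparedStatement insert = connection.prepareStatement(
                            "INSERT INTO user_session (session_id, session_data) VALUES (?, ?)")) {
                        insert.setString(1, sessionId);
                        insert.setBytes(2, sessionData);
                        insert.executeUpdate();
                    }
                }
            }
        }
    }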

To sum it up: the debate between stateful and stateless applications, and consequently over the most efficient session persistence/replication mechanism, is really a matter of choice if session state is handled with some discipline in the application.

I have heard this common question from attendees at the recently concluded MindTree (http://www.mindtree.com) Osmosis tech fest and at last year’s Open Group EA conference, where Kamran (MindTree CTO) presented case studies on SOA implementation:

I know the “why” of SOA but not sure of the “how”. Is there a methodology?

The often misleading answer to this question is to tie an SOA adoption to a complete Enterprise Architecture (EA) definition exercise. This leads to the following impressions/myths:

  • SOA is for large organizations or programmes
  • SOA adoption must be a big-bang approach

The truth is : SOA can be adopted just as easily for mid-sized opportunities as it can for large engagements. The difference is in the methodology used.

At MindTree, we have created an approach to SOA adoption. Inputs for this came from a couple of true blue-blooded SOA implementations in the travel industry – one for a large content aggregator that owns a couple of Global Distribution Systems (GDS) for the airline industry, like the Sabre and Amadeus of the world; the second for the premier trade organization of the airline travel industry.

Both these implementations were multi-million-dollar initiatives – don’t be carried away by their size and led into one of the myths described above!

In both initiatives, SOA adoption was incremental – this is the beauty of the model adopted.

Let’s look at an outline of SOA adoption. It would comprise:

  • Establish drivers for SOA – defines the case for using SOA
  • Perform portfolio analysis – establishes choice of technology for SOA implementation, influences build, wrap or retire decisions on business processes.
  • Define the architectural goals and the scope of SOA deployment.
  • Define technology architecture and choice of tools & frameworks.
  • Define roadmap for the SOA components
  • Plan for implementation
  • Define implementation and reuse governance

Those who have looked at EA frameworks would catch a sense of resemblance to some of the steps defined above. It is a valid observation and in fact leads us to one of the two SOA adoption models, i.e. the Combined EA and SOA model.

On the other hand, some of the steps might appear too expensive for a mid-sized organization. This leads to a variant of the model, termed the Basic SOA model.

An outline of the basic model would look like this:

  • Check for the existence of suitably articulated drivers for SOA adoption. This does not include the task of identifying the drivers.
  • Define the architectural goals and the scope of SOA deployment – set the expectation that scope is limited to the discovery, build and deployment of services only
  • Technology and choice of tools & frameworks
  • Plan for implementation – of the services and applications on top

How does an organization decide to go with one of the two models? This can be partly answered by the scope of SOA adoption. The scope can be quantified by the maturity of the SOA deployment and the services therein. Attention to progressive SOA deployment can ensure that an enterprise starts with one model and moves on to a higher and better one over a period of time.

The levels of SOA deployment are listed below, in increasing order of sophistication:

  • Level 0 – Identify data and behavior to be deployed as services.
  • Level 1 – Design, build and expose services.
  • Level 2 – Support multiple channels and clients in service invocation.
  • Level 3 – Publish, discover and compose services. Also called orchestration.
  • Level 4 – Secure services and perform metering to determine usage.
  • Level 5 – Operate and manage the services and the associated infrastructure.
  • Level 6 – Ensure reuse and capture metrics on the benefits accrued.

The levels are achieved in an iterative manner in most real-world deployments, as one rarely identifies all the services before attempting to design, build and deploy the first set of services for use by client applications.

More on the methodology and approach along with consulting can be provided on request.