воскресенье, 18 января 2015 г.

Simple architecture harder than complex

I want to write about the architecture of my latest commercial software project I developed (http://iseepro.ru/) and I want to do this in storytelling manner. The reason I chose a narrative style is that it is not the architecture itself that is valuable (there is nothing special about it as you going to see later), but the way we got to it, the decisions and considerations that led us to where we currently are. We were restricted on time and on budget (we had none of both), but nevertheless managed to come up with the best product available within the market (this is very subjective, until proven by others, forgive me). This is a story of simplifying everything to the very extreme.

Before I start, please check out statistics. The infamous Chaos Report published by Standish Group in 1995 revealed horrible numbers - 84% of IT projects considered to be a failure. The experiment involved managers of all levels from several thousand companies throughout all industries involved in IT activities. Well, 1995 is almost Middle Ages, we can only feel sorry for IT dinosaurs. Technologies made a tremendous progress with time both in complexity of tasks solved and in easiness of adoption. We’d expect to have drastically different numbers nowadays, right? IBM have published results for a similar survey conducted in 2008. The numbers were not that promising either - 60% of projects failed in that or another way. Imagine, with .NET 3.5, Java 6, RoR 2.x, with all that goodness we failed almost every second project. Seems like the problem is not the technology alone.  The main reason why projects fail is the lack of focus (McKinsey, link). It is the lack of focus on business problem that leads to over-complicated solutions in a desperate rush for superfluous qualities. This is why we break the schedules and get out of budget.

Having this in our minds we started our Cash Control system development.
Cash Control systems are used to mitigate money loss risk at points of sale. They help fight internal and external fraud in retail.
The appliance is made of a CCTV system integrated with a set of analytical tools operating on devices’ event streams.
The system is relatively simple and functional requirements can be captured in these three simple bullet points. The system should be able to:
  • mix and correlate video stream with device event stream, both in real-time and afterwards;
  • match risky behaviour patterns in device event flow in real-time with customizable templates;
  • provide post-analysis tools and generate reports by cashdesks, cashiers, goods and alerts found.

Non-functional requirements can be captured with a relatively short list as well. System should be:
  • open for integration with devices of different types:
    • cashdesks
    • petrol stations, etc.
  • open for integration with different CCTV systems. Even though our default package includes our partner’s system, our client may have already invested in another CCTV  appliance. In this case we provide them with analytics tool package, integrating it with CCTV system of their choice.
  • cross-platform;
  • scalable, as our clients may be as small as one shop with several cashdesks and as huge as a network of hypermarkets.

We set up our first goal to launch the system in a big hypermarket in Moscow in half a year. Please bear in mind that none of us left their primary occupation.

Can you imagine, how happy my inner architecture astronaut became, as he met this kind of challenge? Immediately, without a single moment of thinking, I got a message from stars with the overall system design. Nice, isn’t it? So here is the ideal world (or the world with lack of oxygen, it depends on how you see it) picture.

This solution is so fucking universal that it could install and administer itself, if only my inner architecture astronaut stayed without oxygen a little longer. It can serve any client, no matter how big it is. Could it be another way if you have SPA, nginx, Thin, Cramp, node.js, NoSql, Mapreduce, amqp, RabbitMQ?!

A couple of years ago I had a joy of attending QCon London. I spent two days in a row at my favourite real-life projects track. Should I mention, that by the end of the second day I had already had enough of MySQL to Postgres and vice versa epic changes? The last talk was performed by ex-googlers Triposo founders. And believe it or not, but I was rewarded for my patience by these guys. Their attitude seemed so different from others’. They didn’t bother building super reliable infrastructure, but used existing available infrastructure instead: they kept their knowledge base backups in Dropbox; had static content stored and served by Amazon S3; put dynamic settings to change applications’ behaviour into a Google docs spreadsheet. Em.. Whoat?! Google docs as a part of a system?! Believe it or not, but ex-googlers didn’t create a clusterized parallelized distributed highly available redundant infrastructure. They chose not to do it at all. Their capital expenses were huge enough to allow themselves two workstations under a kitchen table. That’s where all magic happens - these two stations generate and publish travel guides.
So what’s so special about these guys? They managed to build a successful business, because they had focused on their competitive advantage - extensible knowledge base, automatic guide builder and UX.

Our competitive advantage to be focused on is the most convenient analytics toolset which allows to solve our clients’ real life issues. Our major clients are retail market networks. Our first playground is a hypermarket in Moscow. Facts are:
  1. Real-time monitoring performed by guards is not the main use case by any means. The main scenario is alert post-analysis performed by a security analyst.
  2. Load is not evenly distributed throughout the day and has a certain profile. It grows slowly from morning till evening and reaches its peak at rush hours. It is almost flat around zero during nighttime. We may get a load increased by a factor of two or three at days before holidays and weekends.
  3. Retail infrastructure is mostly built on Windows;
  4. Our software delivered on-premise.

There is a NoSql database drawn by my inner architecture-astronaut. I salute this choice, in general, as long as it at least shows that you have thought about your data and the best way to store that. But this is less and less true nowadays.
But why NoSQL? First of all, it is data volume. We’ve examined a cashdesk communication protocol and estimated our hypermarket to produce around 600К events daily or 200М annually.  With average message size balancing around 1 KB, it totals to ~600 MB daily, or ~200 GB annually for just one hypermarket. However, we should remember that several hypermarkets  united in a single network may generate an order of magnitude more events. NoSQL works well with huge amounts of non-relational data, as long as this data is easily distributable due to its nature. The second reason to opt for NoSQL is write performance. It is obvious, that write operations will prevail read operations in an aggregating system. NoSQL has a good write performance as long as these operations are as easy to distribute as it is easy to partition the data itself. And the last, but probably the most important reason to chose NoSQL, is the lack of schema. Initially we would prefer not to restrict a message schema as long as different protocols have different fields in messages. There are few messages with identical set of fields even within one protocol.
Considering the fact that we were going to need elaborate search instruments our best bet here would have been a document-oriented database, like MongoDB. To accommodate for a huge dataset we would have several instances holding several shards in a cluster, of course.
However, as I’ve already noticed, we were going to deploy our product on-premise and we had to take into account unskilled personnel. They can handle routine tasks like a database backup without problem, yet database cluster support is far beyond their capabilities. That would have been a huge risk for the overall system reliability. Among other things, our client could invest money in a storage solution of their own. Chances are very high that it would be that or another RDBMS. You surely know that switching RDBMS is far easier than changing NoSQL db to relational, don’t you? But I could have dropped all these clever excuses in favor of one and simple reason - neither of us had a production experience with MongoDB. The end.

So, as you might have guessed by now, we chose to store our data in RDBMS - MySQL. As for the data volume - we decided to postpone this problem till we face it. After all, we could have not succeeded with this project at all.
Ok, so we now use relational storage, but what happened  to data? Is it relational now? We introduced canonical events model and started to translate any message arrived within any protocol supported into this model. Canonical events model effectively defines the data schema, we know the meaning of information we store and get rid of what we don’t care about. We threw every event message in one big table in denormalized form. We filled cashiers, cashdesks and goods dictionaries in parallel, duplicating data. Event itself can already include every goods or cashier attribute. Events table had no foreign keys to dictionaries, after all the fact that we used relational database didn’t bind us to normalize ad nauseam.

After we ran our product in trial for several months, we noticed that our queries had gotten slower. Oh gosh, we forgot that our dataset increases indefinitely.
After we analysed real processes, we found out that events older than three months are, basically, rubbish. Nobody ever goes that far to the past. We had to get rid of this garbage, but we had to guarantee constant system performance. I mentioned earlier, that we have a certain load profile with almost absolute zero during the night. We might have used that timespan to run cleanup tasks deleting hundreds of thousands of records. Sounds like a good idea, doesn’t it? There is one problem with this approach. Deletion is the most expensive operation in relational database. You certainly can run the query with hundreds of thousands of rows to be deleted, but it doubtfully would complete in a reasonable timeframe. There is an easier way to clean up a table - to truncate it. But could we truncate parts of tables? Sure we can, by partitioning the table! Where is my cap badge?
We partitioned our events table by months and set up a periodic task to truncate a four months old partition every once in a while. We killed two birds with one stone with this trick. We got almost constant query performance, as queries always operated on 1-3 partitions with the majority of them reading data from the latest one, and we solved the data amount problem.

Short note about reports.
After we got rid of NoSQL we lost a possibility to run Map-reduce tasks on our data easily. However, we were still able to aggregate data for reports of course. There is a standard approach to aggregate data in relational databases areal: OLAP. There are several solutions available out there to start analysing your relational data in many dimensions, free and paid. Yet after we analysed the structure of information we had, we came to conclusion that our database had almost a classical star ROLAP scheme. By adding facts tables that referenced dictionaries, that were dimensions by coincidence, we easily solved that not so easy task. We forked activewarehouse gem and adjusted it for our needs. We started with baked in ETL dsl to extract and transform events to facts and dimensions initially. We ran these scripts on periodic basis, yet it proved to be very error prone. Again, considering the fact that we already had ROLAP scheme at place, it suddenly dawned on me, that we could append facts at the very same moment we inserted events in our database. We introduced another subscriber responsible for just that task, and that was it. As a side benefit we got immediately actual reports. Easy-peasy.

I won’t elaborate on why did we change async web server and async application framework in favor of Rails and Mongrel. I’ll just mention, that Rails is a really convenient and powerful tool for rapid application development, provided the application is as simple as ours.

Comet is a browser to server communication pattern incorporating a permanent connection used to push data changes from server to browser to achieve near real-time behaviour in Web.
I once led a project where we had to communicate with millions of connected devices and propagate changes to web layer in near real-time fashion. We had found a powerful and scalable tool to solve this problem. The system was built with Erlang XMPP server software MongooseIM. This marvelous piece of software allowed us to handle extensive XMPP traffic along with BOSH traffic. It also holds a transient state for web layer. Prodigy.
Should I mention, that when I hear words ‘comet’, ‘device’, ‘real-time’ together in one sentence I first think about Mongoose? This case was no exception. However, considering the fact that no hypermarket possessed a thousandth part of devices this server was capable to serve (either alone or in cluster), I decided to hold this secret weapon and opt for equally cool WebSocket and node.js technology.
WebSocket is a binary protocol, part of html5. Its main advantage is that it is a standard, rather than a proprietary technology built as a browser plugin, it is supported by most of modern web browsers. node.js is a very fast webserver with evented processing model built on top of libev. Unfortunately, we immediately found out that a Websocket is supported by IE starting with version 10, whereas our clients had IE9 installed in the majority of cases. All in all, Websocket was abandoned pretty fast, even before the first prototype was written.
With devices’ communication protocol at hand I managed to calculate the load that a huge hypermarket (huge hypermarkets have up to 50 cashdesks here in Moscow) would have generated. I got an exaggerated number of 30 events per second. It means that our comet server would have to distribute 30*Nusers events per second.
If you remember the facts I mentioned in the very beginning, it was said that there was only one case when monitoring has been involved - security supervision. Even the biggest markets hardly have more than one security desk. That means Nusers for monitoring equals 1. Moreover, if you consider the fact that you hardly ever see all the cashdesks working at the same time (the latests Ruble crisis is not a good example here), and the fact that one can hardly perceive more than 7 objects simultaneously, you’ll get a workload similar to something like this: 30*⅔*⅙ - this would be the real load on our comet server to get monitoring events.
Keeping in mind that the webserver was the least loaded part of the system, it would easily serve these requests, even with the primitive polling.

After a short prototyping iteration we ended up with solution incorporating EventMachine thread inside the web application. EventMachine hosted an amqp subscriber, consuming certain cashdesks’ events from transient queues, pushing them to memory buffers, that got popped out by a regular Rails controller serving Ajax requests issued by a client. There is one small thing to note here, we optimised the traffic a bit by doing long polling. This is how we got rid of WebSockets and node.js.

This is controversial - throughout the story I was getting rid of system components as much as possible, and here I tell you that we brought a whole enterprise framework with us. What for?
Well, first of all, we didn’t carry it with the solution. Every modern Windows Server OS has .Net framework preinstalled as a part of a system. We used .Net to develop our Watchdog service. The fact that we deployed on-premise means, that we had absolutely no chance to remotely administer our solutions. The system had to be rock solid, which was easier to achieve with native tools. This is why I wrote it with C#. All in all, Watchdog was not required to be portable across various platforms as other services were. It was implementation that was target OS dependent, and if we used a custom written code to automatically maintain our processes on Windows it didn’t mean we were supposed to take the same approach with other platforms. Something like crontab would solve the problem on *nix systems.

We were “lucky” to have ActiveX on board as all our clients used Windows workstations, and all video systems we had had integrated so far, provided ActiveX API to access video programmatically. So, we didn’t write a universal facade to video systems providing streaming and event publishing features. Instead, we sacrificed cross-platform features for the speed of development. This decision, together with many other small decisions we made, allowed us to get things done on time and on budget. After all, there were no guarantees that we would succeed, so we wanted to move fast with short sprints.

I bet you won’t guess the hardware we needed to host the solution.
It were one or two average size servers, depending on the availability requirements our clients had. Yet there is not much you can do about availability, provided that you can’t ignore regular power outages on sites.

We managed to deploy our first release in production after 6 month of development with 2 developers and one designer working part time. We supplied 90% of features required to compete on market.

The reason we managed to get things done is that we focused on business features rather on technology avoiding valueless experiments.

The end.