Imagine a large book store with an extremely eclectic and niche collection of books, catering to a very wide range of tastes. Classifying and storing these books so that customers can browse and find them easily is a humongous task. For example, a book on a popular American dance trend that began as a subculture in Japanese martial-arts schools is very difficult to classify. An even more difficult task for the store manager is finding out which categories of these niche products are in demand, since the classification itself is a complex task. This is why most brick-and-mortar stores prefer to stock popular titles and best-sellers: they are easier to manage. Thankfully, virtual stores are able to store, classify and sell a far wider variety of books, because classification, and hence search, is easier.

But if you look around, there are vast amounts of data stored and available today due to extensive computerization in all areas. Until recently, making sense of this data was cumbersome, because it wasn't easy to categorize unstructured data such as video and audio. This is where an open-source framework like Hadoop has been of immense help. The Hadoop platform was designed to solve problems where you have a lot of data, both complex and structured, that isn't easy to index, and where you want to run analytics that are deep and computationally extensive.

What is Hadoop? It is a way of storing and processing big datasets across distributed clusters of computers using simple programming models. Hadoop is designed to run on a large number of machines that don't share any memory or disks, which makes it robust: even if one server fails, big data applications continue to run, ensuring business continuity in case of unforeseen disasters. Hadoop breaks the data into small blocks, spreads them across the various servers, and keeps an index of where each block resides. This storage layer is a distributed file system, and the default one is the Hadoop Distributed File System (HDFS). Because multiple copies of each block are kept, data stored on a server that goes offline or dies can be automatically replicated from the remaining copies. HDFS is like the storage room of the Hadoop system: you dump the data there until you decide to run an application on it.

This takes us to the next important part of a Hadoop cluster, the data processing framework, which is the tool used to work with the data itself. In a relational database, Structured Query Language (SQL) is used to pull out data using queries, but with unstructured data, queries aren't of much help. Instead, a series of Java applications, written against the MapReduce framework, runs across the cluster and pulls out the information required. Though this adds to the complexity of the task, it also adds flexibility and control over how the data is processed. On a Hadoop cluster, each node generally runs both HDFS and MapReduce, so processing happens close to where the data is stored and retrieval is faster.
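To make the MapReduce idea a little more concrete, here is a minimal sketch of the classic word-count job written against the standard Hadoop Java API (org.apache.hadoop.mapreduce). The class name WordCount and the input and output paths are illustrative assumptions, not something from this post; treat it as a sketch of the pattern rather than a production job.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: runs in parallel on each HDFS block, emitting (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: Hadoop groups the pairs by word and this sums the counts.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Input and output live on HDFS; paths are passed on the command line.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Assuming the class is packaged into a jar and the input already sits in HDFS, it would typically be launched with something like `hadoop jar wordcount.jar WordCount /input /output` (paths hypothetical). The point of the pattern is that the map work is shipped to whichever nodes hold the data blocks, rather than the data being shipped to a central query engine.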

When we talk of Big Data, the common misconception is that data Volume is the only reason to use Hadoop. On the contrary, big data also refers to the Velocity (speed), Veracity (accuracy), Variety (forms of data) and Value of the data. For example, a hospital can use not only structured patient-admission records but also scans, X-rays, CCTV footage and ECG readings, and arrive at decisions ranging from medical treatments to how to regulate thermostats in common areas based on footfall.

Chris Anderson, in his book 'The Longer Long Tail', says "the secrets of twenty-first century economics lie in the servers of all the companies all around us", referring to how data and data analysis are going to drive decision making.
