Thursday, June 13, 2013

Hadoop: Big Data Processing made easier

As corporate data stores continue to grow by almost 50% annually, this growth demands good management and, indirectly, a change in current technology. Storage and management technology has evolved considerably, but today's enterprises still face needs that can strain it. Big data analytics in particular demands capabilities beyond what typical storage provides: handling terabytes and petabytes of unstructured information is a major challenge. Enterprises therefore need something more, a new platform for dealing with large volumes of data.


Hadoop is an open-source software framework that offers a platform for dealing with big data. It was derived from Google's MapReduce and Google File System papers, is written in the Java programming language, and is an Apache top-level project.

The Hadoop platform is designed to solve problems caused by large amounts of complex, unstructured data that cannot be placed directly into tables. Hadoop addresses the most common big data problem: efficiently storing and accessing large amounts of data.

The intrinsic design of Hadoop allows it to run as a platform across a large number of machines that share neither memory nor disks, which reduces the management overhead associated with large data sets. The framework provides both reliability and data motion to applications. When data is loaded into Hadoop, the software breaks it into fragments, which are then spread across different servers. Because the data is distributed, there is no longer a single traditional data-center server holding all the data that every request must go through. Furthermore, Hadoop keeps track of where each piece of data resides, and it maintains multiple copies of the information for protection.

Unlike a centralized database system, which may consist of a large disk drive connected to a server with multiple processors, Hadoop lets every server in the cluster participate in processing the data: the framework spreads the work and the data across the cluster, each server operates on its own small piece of the data, and the results are then unified.

This approach is referred to as MapReduce: code and processing are mapped out to all the servers, and the results are reduced back into a single result set. It is this process that lets Hadoop deal with massive amounts of data.
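The map-shuffle-reduce flow can be sketched in a few lines. This is a single-process simulation of the idea only; real Hadoop jobs implement Mapper and Reducer classes in Java and the framework runs them across the cluster. The word-count task and fragment contents below are illustrative assumptions.

```python
# Minimal single-process sketch of the MapReduce flow: map runs over each
# input fragment, shuffle groups intermediate pairs by key, and reduce
# collapses each group into one result. Not the Hadoop API itself.
from collections import defaultdict

def map_phase(fragment):
    """Emit a (word, 1) pair for every word in one input fragment."""
    return [(word, 1) for word in fragment.split()]

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}

# Stand-ins for fragments stored on different servers in the cluster.
fragments = ["big data is big", "data is everywhere"]
intermediate = [pair for frag in fragments for pair in map_phase(frag)]
counts = reduce_phase(shuffle(intermediate))
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

In a real cluster, `map_phase` would run in parallel on the server that already holds each fragment, which is exactly the "move the computation to the data" idea the article describes.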

However, certain prerequisites, hardware requirements, and configuration must be met to ensure success. Big data analytics requires organizations to choose the data to analyze and apply aggregation methods before the extract, transform, and load (ETL) process. The data can be structured, unstructured, or drawn from multiple sources such as social networks, data logs, and websites; handling it depends on processes and capabilities such as moving computing power closer to the data and performing parallel or batch processing of large data sets.

But Hadoop cannot accomplish everything on its own. Organizations will need to consider what additional components are required to build a Hadoop project. Effective management and implementation of Hadoop requires expertise and experience; if these are not available in-house, an organization should seek help from a service provider that can offer full support for the Hadoop project.

Although Hadoop has been around for some time, more and more organizations are now adopting its capabilities and reducing the challenge of big data processing, including major companies associated with Hadoop such as Google, Yahoo, and JP Morgan Chase.