Tuesday, July 22, 2014

Introduction to Hadoop

Hadoop is a framework: a set of open source tools that typically runs on Linux. It is not a single program that you can download onto your computer and say, "Hey, I have downloaded Hadoop." It is freely distributed under the Apache License, which means that no single company controls Hadoop; it is maintained by the Apache Software Foundation. The concept behind Hadoop is big data.


Hadoop exists to store and process big data. When we hear the words "big data", what comes to mind is data that is big. There is no single definition of big data, but in practice it means a large and growing number of files created every day, measured in terabytes (10^12 bytes) and petabytes (10^15 bytes). Yes, we are talking about a very large amount of data.

A key attribute of big data is that it is largely unstructured. It is not organized in a relational database, in neatly arranged tables where every column is known in advance along with the type of data that goes into it. Big data comes from users like you and me, from applications like Facebook and Twitter, and from systems such as ticket booking systems and sensors in factories. It comes with three challenges:

  • Velocity 
  • Volume 
  • Variety 

Velocity is the speed at which data comes in. Volume is the size of the data: a large and growing amount keeps arriving. Variety is the type of data, which can be audio, video, images, documents, messages, email, text, public records, log files and so on. For example, 400 million tweets are posted every day, and Walmart handles 1 million transactions every hour. Handling this kind of data takes a powerful technology, and that is Hadoop.
The Hadoop framework is divided into two main components: MapReduce and HDFS (Hadoop Distributed File System). HDFS breaks big data into small pieces and stores them across a distributed system. The machines in this system are not big, powerful computers; they are numerous low-cost computers. Like the data, the computation is also divided into small pieces. When there is a request for the data, the computation is performed individually on each piece, and then the results are combined and sent back to the application.
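The idea of splitting the data, computing on each piece and combining the results can be sketched in a few lines of Python. This is only an illustration that runs on a single machine, not the Hadoop API: the function names (split_into_blocks, map_block, combine, word_count) and the tiny block size are made up for this example, and word counting is just a convenient sample computation.

```python
from collections import Counter
from functools import reduce


def split_into_blocks(lines, block_size):
    """Break the input into fixed-size pieces, loosely like a distributed file system would."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def map_block(block):
    """Count the words in one piece only; each piece could live on a different machine."""
    counts = Counter()
    for line in block:
        counts.update(line.lower().split())
    return counts


def combine(partials):
    """Merge the per-piece results into one final answer."""
    return reduce(lambda a, b: a + b, partials, Counter())


def word_count(lines, block_size=2):
    blocks = split_into_blocks(lines, block_size)
    # A real cluster would process the pieces in parallel; here they run one by one.
    partials = [map_block(block) for block in blocks]
    return combine(partials)


if __name__ == "__main__":
    sample = ["big data", "big files", "data data"]
    print(word_count(sample))
```

Each call to map_block sees only its own piece, which is why the work can be spread over many low-cost computers; only the small per-piece results have to be brought together at the end.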
Hadoop is widely used in the following areas:

  •  Social media
  •  Retail
  •  Search tools
  •  Government services 
  •  Financial services 
  •  Intelligence 
HDFS closely resembles GFS (Google File System), and Hadoop's MapReduce follows Google's MapReduce. That is because the technology was first invented and used by Google. In the 1990s people were using search engines like Excite, AltaVista, Lycos, Infoseek and many more. Yes, these were search engines. Then Google came into the picture, quickly became the most popular, and to this day it is the number one search engine. How did Google achieve this? Google itself broke the suspense. In 2003 it released a paper on GFS and told the world how it stores its data, but that was only part of the story, because the world did not yet know how it performs computation. In 2004 Google released one more paper and explained how it uses MapReduce to perform computation on big data.
At that time Doug Cutting and Michael Cafarella were working on Nutch, an open source web search engine. They got very interested in these papers and started building a file system and a MapReduce implementation based on them, and the result was Hadoop. Hadoop is indeed a strange name: it was the name of a toy elephant that Doug's son used to play with. It was Doug's son who invented the name, later borrowed by Doug. In 2006 Doug Cutting joined Yahoo, and Hadoop was split out of Nutch into its own Apache project, with Yahoo becoming a major contributor.
MapReduce and HDFS are the two main components, but a few more projects belong to the Apache Hadoop ecosystem. Those projects are:
Hive:
     Hive is a data warehouse infrastructure built on top of Hadoop that provides data summarization, query and analysis.
HBase:
     HBase is an open source, non-relational database written in Java that runs on top of HDFS.

Mahout:
     Mahout is a library of distributed, scalable machine learning algorithms. One common use is producing recommendations for users.
Pig:
     The actual computation is done by MapReduce, but to make programming easier a higher-level platform called Pig was created, with its own language, Pig Latin.
Oozie:
     Oozie is a Java-based application that is responsible for scheduling jobs in a Hadoop system.
Flume:
     Flume is used to collect, aggregate and move large amounts of log data.
Sqoop:
     Sqoop is a tool used to transfer bulk data between Hadoop and relational databases.

Some of the companies that use Hadoop heavily are Yahoo, Facebook, Amazon, eBay, American Airlines, Walmart, The New York Times, the Federal Reserve Board, and IBM.