Hadoop is an open source framework from Apache and is used to store process and analyze data which are huge in volume. Hadoop is written in Java and is not OLAP (online analytical processing). It is used for batch/offline processing.It is being used by Facebook, Yahoo, Google, Twitter, LinkedIn and many more. Moreover it can be scaled up just by adding nodes in the cluster.

Hadoop Modules:

  • HDFS: Hadoop Distributed File System. Google published its paper GFS and on the basis of that HDFS was developed. It states that the files will be broken into blocks and stored in nodes over the distributed architecture.

  • Yarn: Yet another Resource Negotiator is used for job scheduling and manages the cluster.

  • Map Reduce: This is a framework which helps Java programs to do the parallel computation on data using the key-value pair. The Map task takes input data and converts it into a data set which can be computed in the Key-value pair. The output of Map task is consumed by reduce task and then the out of reducer gives the desired result.

  • Hadoop Common: These Java libraries are used to start Hadoop and are used by other Hadoop modules.

Why is Hadoop important?

  • Ability to store and process huge amounts of any kind of data, quickly. With data volumes and varieties constantly increasing, especially from social media and the Internet of Things (IoT), that's a key consideration.

  • Computing power. Hadoop's distributed computing model processes big data fast. The more computing nodes you use, the more processing power you have.

  • Fault tolerance. Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. Multiple copies of all data are stored automatically.

  • Flexibility. Unlike traditional relational databases, you don't have to preprocess data before storing it. You can store as much data as you want and decide how to use it later. That includes unstructured data like text, images and videos.

  • Low cost. The open-source framework is free and uses commodity hardware to store large quantities of data.

  • Scalability. You can easily grow your system to handle more data simply by adding nodes. Little administration is required.