Apache Hadoop. This article describes how to use Apache Hadoop.
1. Apache Hadoop
1.1. Overview
Apache Hadoop is a software solution for the distributed processing of large datasets. Hadoop provides a distributed file system (HDFS) and a MapReduce implementation.
A dedicated computer acts as the "name node". This computer stores the information about the nodes available in the cluster and about the files. The other computers in the Hadoop cluster are called nodes. The "name node" is currently a single point of failure; the Hadoop project is working on solutions for this.
1.2. Typical tasks
Apache Hadoop can be used to filter and aggregate data. A typical use case is the analysis of web server log files to find the most visited pages. MapReduce has also been used to traverse graphs and to perform other tasks.
1.3. Writing the map and reduce functions
Hadoop allows the map and reduce functions to be written in Java. Hadoop also provides adapters (for example Hadoop Streaming) so that map and reduce functions can be written in other languages, e.g. C++, Python, Perl, etc. A sketch of such functions in Java is shown below.
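The following sketch shows what map and reduce functions written in Java can look like, using the org.apache.hadoop.mapreduce API. It picks up the log-file use case from section 1.2 and counts the hits per page; the class names and the assumed log format (space-separated fields with the request path as the 7th field) are illustrative choices, not part of this article.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: for every log line, emit (requested page, 1).
// Assumes a simple space-separated log format in which the request
// path is the 7th field (as in a common Apache access log layout).
public class PageHitMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text page = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(" ");
        if (fields.length > 6) {
            page.set(fields[6]);
            context.write(page, ONE);
        }
    }
}

// Reduce phase: sum up the hit counts emitted for each page.
class PageHitReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

The map function only emits intermediate (key, value) pairs; the framework groups all values with the same key and hands them to the reduce function, which produces the aggregated result.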
2. Hadoop file system
The Hadoop file system (HDFS) is a distributed file system. It uses an existing file system of the operating system but extends it with redundancy and distribution. HDFS hides the complexity of distributed storage and redundancy from the programmer.
In the default configuration HDFS stores every file three times on different nodes. The "name node" (server) keeps the information about where the files are stored.
Hard disks are very efficient at reading large files sequentially but much slower for random access. HDFS is therefore optimized for large files.
To improve performance Hadoop also tries to move the computation to the nodes which store the data, rather than moving the data to the computation. Especially for very large datasets this improves performance, because it prevents the network from becoming the bottleneck.
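As a minimal sketch of how a program interacts with HDFS, the following Java snippet writes a file via the org.apache.hadoop.fs.FileSystem API and reads back its replication factor. The name node address and the file path are hypothetical placeholders and have to be adjusted to your cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical name node address; adjust to your cluster.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/example.txt");

        // Create the file; HDFS transparently distributes and replicates its blocks.
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeUTF("Hello HDFS");
        }

        // Read back the replication factor (3 in the default configuration).
        short replication = fs.getFileStatus(path).getReplication();
        System.out.println("Replication factor: " + replication);
    }
}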
3. MapReduce
Apache Hadoop jobs work according to the MapReduce principle. See MapReduce for details.
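A Hadoop job is configured and submitted through the org.apache.hadoop.mapreduce.Job class. The following driver is a sketch that assumes the illustrative PageHitMapper and PageHitReducer classes from section 1.3 and takes the HDFS input and output paths as command-line arguments.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageHitDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "page hit count");

        job.setJarByClass(PageHitDriver.class);
        job.setMapperClass(PageHitMapper.class);
        job.setReducerClass(PageHitReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output locations in HDFS, passed on the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job and wait until it has finished.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}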
4. Installation
Apache Hadoop can be downloaded from the Hadoop homepage. To get started with Hadoop you require the following sub-projects:
- Hadoop Common
- MapReduce
- HDFS