Apache Hadoop is a collection of open-source software utilities whose goal is to make it easy to use a network of many computers to solve problems involving large amounts of data and computation. It provides a software framework for distributed storage and distributed processing of big data based on the MapReduce programming model.
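To make the MapReduce model concrete, the following is a minimal sketch of the classic word-count job, closely following the example distributed with Hadoop's Java MapReduce API (org.apache.hadoop.mapreduce); the input and output HDFS paths are assumed to be passed as command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split
  // assigned to this mapper.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each distinct word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // args[0] and args[1] are assumed to be HDFS input and output paths.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The map tasks run in parallel on the nodes holding the input blocks, and the framework groups all values with the same key before handing them to the reducers.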
It was originally designed for clusters built from commodity hardware, which is still a common way of deploying it. All of its modules are built on the fundamental assumption that hardware failures are commonplace and that the framework should handle them automatically.
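One well-known way this plays out in HDFS is block replication: each block is stored on several DataNodes, so the loss of a node does not lose data. The sketch below, which assumes a configured HDFS client and uses the hypothetical path /data/events.log, shows how the replication factor can be set both as a client-side default and per file.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Default number of copies for files created by this client
    // (the cluster-wide default is normally 3).
    conf.setInt("dfs.replication", 3);

    try (FileSystem fs = FileSystem.get(conf)) {
      // Keep 3 replicas of this particular (hypothetical) file; the NameNode
      // re-replicates its blocks automatically if a DataNode holding one fails.
      fs.setReplication(new Path("/data/events.log"), (short) 3);
    }
  }
}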
The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part based on the MapReduce programming model. Hadoop splits files into large blocks and distributes them across the nodes of the cluster. It then ships packaged code to the individual nodes so that the data can be processed in parallel.
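As a small illustration of the block layout, the following sketch uses the HDFS Java client to list the blocks a file was split into and the nodes that hold each block. It assumes a client configured against a running cluster, and the path /data/large-input.txt is a hypothetical example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReport {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml / hdfs-site.xml from the classpath.
    Configuration conf = new Configuration();
    try (FileSystem fs = FileSystem.get(conf)) {
      Path file = new Path("/data/large-input.txt");
      FileStatus status = fs.getFileStatus(file);
      BlockLocation[] blocks =
          fs.getFileBlockLocations(status, 0, status.getLen());
      for (BlockLocation block : blocks) {
        // Each block reports its byte range and the DataNodes storing a replica.
        System.out.printf("offset=%d length=%d hosts=%s%n",
            block.getOffset(), block.getLength(),
            String.join(",", block.getHosts()));
      }
    }
  }
}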
Because each node works mainly on the data stored locally on it, the data set can be processed faster and more efficiently than on a conventional supercomputer architecture built on a parallel file system, where computation and data are distributed over a fast network.
The base of Apache Hadoop is composed of the following modules:
Hadoop Common – the libraries and utilities needed by the other Hadoop modules;
Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines;
Hadoop YARN – a platform responsible for managing cluster resources and scheduling applications;
Hadoop MapReduce – an implementation of the MapReduce programming model for large-scale data processing.
The term Hadoop is used not only for these base modules but also for the ecosystem, i.e. the collection of additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Phoenix, Apache Spark, Apache ZooKeeper, Cloudera Impala, Apache Flume, Apache Sqoop, Apache Oozie and Apache Storm.
The Hadoop framework itself is mostly written in Java, with some parts of the code written in C and its command-line utilities written as shell scripts.