Apache Hadoop is a collection of open-source software utilities whose goal is to make it easy to use a network of many computers to solve problems involving large amounts of data and computation. It provides a software framework for distributed storage and distributed processing of big data based on the MapReduce programming model.
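To make the MapReduce model concrete, the following is a minimal sketch of the classic word-count job, closely following the example distributed with Hadoop's Java MapReduce API (org.apache.hadoop.mapreduce); the input and output HDFS paths are assumed to be passed as command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split
  // assigned to this mapper.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each distinct word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // args[0] and args[1] are assumed to be HDFS input and output paths.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The map tasks run in parallel on the nodes holding the input blocks, and the framework groups all values with the same key before handing them to the reducers.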
It was originally designed for clusters built from commodity hardware, which is still a common way of deploying it. All of its modules are built on the fundamental assumption that hardware failures are commonplace and that the framework should handle them automatically.
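One well-known way this plays out in HDFS is block replication: each block is stored on several DataNodes, so the loss of a node does not lose data. The sketch below, which assumes a configured HDFS client and uses the hypothetical path /data/events.log, shows how the replication factor can be set both as a client-side default and per file.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Default number of copies for files created by this client
    // (the cluster-wide default is normally 3).
    conf.setInt("dfs.replication", 3);

    try (FileSystem fs = FileSystem.get(conf)) {
      // Keep 3 replicas of this particular (hypothetical) file; the NameNode
      // re-replicates its blocks automatically if a DataNode holding one fails.
      fs.setReplication(new Path("/data/events.log"), (short) 3);
    }
  }
}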
The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part based on the MapReduce programming model. Hadoop splits files into large blocks and distributes them across the nodes of the cluster. It then ships packaged code to the individual nodes so that the data can be processed in parallel.
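As a small illustration of the block layout, the following sketch uses the HDFS Java client to list the blocks a file was split into and the nodes that hold each block. It assumes a client configured against a running cluster, and the path /data/large-input.txt is a hypothetical example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReport {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml / hdfs-site.xml from the classpath.
    Configuration conf = new Configuration();
    try (FileSystem fs = FileSystem.get(conf)) {
      Path file = new Path("/data/large-input.txt");
      FileStatus status = fs.getFileStatus(file);
      BlockLocation[] blocks =
          fs.getFileBlockLocations(status, 0, status.getLen());
      for (BlockLocation block : blocks) {
        // Each block reports its byte range and the DataNodes storing a replica.
        System.out.printf("offset=%d length=%d hosts=%s%n",
            block.getOffset(), block.getLength(),
            String.join(",", block.getHosts()));
      }
    }
  }
}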
Because each node works mainly on the data stored locally on it, the data set can be processed faster and more efficiently than on a conventional supercomputer architecture built on a parallel file system, where computation and data are distributed over a fast network.
The base of Apache Hadoop is composed of the following modules:
Hadoop Common – the libraries and utilities needed by the other Hadoop modules;
Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines;
Hadoop YARN – a platform responsible for managing cluster resources and scheduling applications;
Hadoop MapReduce – an implementation of the MapReduce programming model for large-scale data processing.
The term Hadoop is used not only for these base modules but also for the ecosystem, i.e. the collection of additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Phoenix, Apache Spark, Apache ZooKeeper, Cloudera Impala, Apache Flume, Apache Sqoop, Apache Oozie and Apache Storm.
The Hadoop framework itself is mostly written in Java, with some parts of the code written in C and its command-line utilities written as shell scripts.