MapReduce

MapReduce is a programming model and a process implementation and generating of big-data sets with parallel distributed cluster algorithm linked to this programming model.

MapReduce program consists of mapping methods that perform filtering and sorting and reduce methods performing summary operations (such as the sum of lines, the frequency of different repeating characters). MapReduce system, often called the infrastructure or the architecture, coordinates the processing by controlling distributed servers that run different tasks in parallel, managing all communications and data transmissions between different parts of the system and ensuring consistency in the event of a server failure.

Model is a subset of split-apply-combine strategy for data analysis. It is inspired by the map and reduce functions that are commonly used in functional programming. Their purpose in MapReduce Framework, however, isn’t the same as in their original form. The key benefit of MapReduce Framework aren’t map and reduce functions themselves, but the scalability and error tolerance the Framework can provide for different applications by optimizing the startup.

The single-thread implementation of MapReduce won’t be normally faster than the traditional implementation, but will only be significantly improved in the case of multi-thread implementations using multi-processor hardware. The use of this model is only profitable when the distributed shuffle operation is involved. This operation reduces the cost of network communication and the properties of error resistance. For a good MapReduce algorithm, it must optimize communication costs.

MapReduce libraries are written in different programming languages with different levels of optimization. Apache Hadop is a popular open-source implementation supporting distributed shuffle operations.

The principle of MapReduce

MapReduce works with database or other server cluster. One of the servers will then accept the request from the client. This node sends a Map function or a string of functions to other nodes belonging to the cluster. Other nodes perform the code of this function and return results to the original node. Results can often be duplicate which is a desired property to prevent data loss in a case of failure. The original node can also process the function itself.

When it has enough results from other nodes and his own results or in a case of exceeding the specified time limit for response from other nodes, the original node calls the reduce function over the obtained data set. This phase removes duplications and performs operations that can only be performed if the set is a complete result from all nodes. At the end of this phase it is subsequently possible to return the result to the client.

Possible application of MapReduce

MapReduce can be used, for example, when mapping the frequency of accesses to the website. The mapping function collects a log of requests for pages where the output is a pair of URL and number one. The reduction function then counts values for each key and subsequently, the output is a combination of URL and the total number.

The advantages of using MapReduce

Searching on live data. MapReduce makes it possible to search on live data.
Interconnection efficiency. MapReduce can be very effective in data sets connection based on a common key (the connection can be pushed through to the key storage itself).
Scalability. Using multi-thread implementations, MapReduce can help the most efficient processing.
Error toleration. If one of the server fails, data are not lost and no inconsistency occurs.

The disadvantages of using MapReduce

Speed. MapReduce can be slower than other map-reduce formula applications.
Analytics. Key-based storages aren’t usually very well optimized to the use of analytical methods.