The subject matter discussed in the background section should not be assumed to be prior art merely as a result of its mention in the background section. Similarly, a problem mentioned in the background section or associated with the subject matter of the background section should not be assumed to have been previously recognized in the prior art. The subject matter in the background section merely represents different approaches, which in and of themselves can also correspond to implementations of the technology disclosed.
Category theory is a well-developed branch of mathematics and has been applied in many domains. In particular, it has been applied in situations where logical structure needs to be understood and modified. It has also been applied as a foundation for a programming paradigm known as functional programming.
At its foundation, category theory is used to formalize mathematical structure and its concepts as a collection of objects (also called nodes) and arrows (also called morphisms).
Category theory can be used to formalize concepts of other high-level abstractions such as set theory, ring theory, and group theory. Several terms used in category theory, including the term “morphism”, differ from their use elsewhere in mathematics. In category theory, a “morphism” obeys a set of conditions specific to category theory itself. Thus, care must be taken to understand the context in which statements are made.
In category theory, a category has two basic properties: the ability to compose the arrows associatively and the existence of an identity arrow for each object. One example of a category is the category of sets, where the objects are sets and the arrows are functions from one set to another. However, the objects of a category need not be sets, and the arrows need not be functions. Accordingly, any way of formalizing a mathematical concept such that it meets the basic conditions on the behavior of objects and arrows is a valid category, and all the results of category theory will apply to it.
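As an illustration of the two conditions above, the following minimal Python sketch treats ordinary functions as arrows and Python types as objects. The names `identity`, `compose`, `length`, `double`, and `describe` are hypothetical and chosen only for this example; checking a few sample inputs demonstrates the laws but does not prove them.

```python
from typing import Callable, TypeVar

A = TypeVar("A")
B = TypeVar("B")
C = TypeVar("C")


def identity(x: A) -> A:
    """Identity arrow: returns its argument unchanged."""
    return x


def compose(g: Callable[[B], C], f: Callable[[A], B]) -> Callable[[A], C]:
    """Compose two arrows: apply f first, then g."""
    return lambda x: g(f(x))


# Three hypothetical arrows: str -> int -> int -> str.
def length(s: str) -> int:
    return len(s)


def double(n: int) -> int:
    return n * 2


def describe(n: int) -> str:
    return f"value={n}"


# Associativity: both groupings of the same three arrows should agree.
left = compose(describe, compose(double, length))
right = compose(compose(describe, double), length)

samples = ["", "set", "category"]
associative_on_samples = all(left(s) == right(s) for s in samples)

# Identity: composing with the identity arrow on either side changes nothing.
identity_on_samples = all(
    compose(length, identity)(s) == length(s) == compose(identity, length)(s)
    for s in samples
)
```

Here the objects are types rather than sets, which is consistent with the observation that objects need not be sets so long as the arrows compose associatively and each object has an identity arrow.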
The “arrows” of category theory are often said to represent a process connecting two objects, or in many cases a “structure-preserving” transformation connecting two objects. The most important property of the arrows is that they can be “composed”, in other words, arranged in a sequence to form a new arrow.
Categories now appear in most branches of mathematics, in some areas of theoretical computer science, where they can correspond to types, and in mathematical physics, where they can be used to describe vector spaces. Categories were first introduced by Samuel Eilenberg and Saunders Mac Lane around 1942-45 in connection with algebraic topology. Categories were specifically designed to bridge what may appear to be two quite different fields: topology and algebra.
MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. Conceptually similar approaches have been well known since 1995, with the Message Passing Interface standard providing reduce and scatter operations. These all trace back to functional programming, where map and various reduce operators (such as fold) are ubiquitous.
A MapReduce program is composed of a Map( ) procedure that performs filtering, transformation, and key generation; a Shuffle( ) procedure that sorts on the keys (such as sorting students by first name into queues, one queue for each name); and a Reduce( ) procedure that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). The MapReduce system orchestrates the processing by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.
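The student-name example above can be sketched in a few lines of Python. This is a single-process illustration only: it shows the three phases but none of the distribution, parallelism, or fault tolerance that a MapReduce system provides. The function names `map_phase`, `shuffle_phase`, and `reduce_phase` and the sample student list are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(students: Iterable[str]) -> List[Tuple[str, int]]:
    """Map: filter out blank records and emit a (first_name, 1) pair for each student."""
    pairs = []
    for full_name in students:
        parts = full_name.split()
        if parts:  # filtering: skip empty records
            pairs.append((parts[0], 1))  # key generation: the first name is the key
    return pairs


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group values into one queue per key, with keys in sorted order."""
    queues: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        queues[key].append(value)
    return dict(sorted(queues.items()))


def reduce_phase(queues: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: summarize each queue by counting its entries."""
    return {name: sum(values) for name, values in queues.items()}


# Hypothetical input records; the empty string stands in for a bad record.
students = ["Ada Lovelace", "Alan Turing", "Ada Yonath", "", "Grace Hopper", "Alan Kay"]

name_frequencies = reduce_phase(shuffle_phase(map_phase(students)))
```

In a distributed setting, each phase would typically run on many machines at once, with the shuffle moving all pairs that share a key to the same reducer.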
MapReduce libraries have been written in many programming languages, with different levels of optimization. A popular open-source implementation that has support for distributed shuffles is part of Hadoop. The name MapReduce originally referred to a proprietary Google technology, but has since been genericized.
Functional programming treats computation as the evaluation of mathematical functions over values and avoids changing state and mutable data. Every fragment of code evaluates to a particular value and always evaluates to the same value. Scala is a programming language that integrates features of object-oriented and functional programming languages. In Scala, every value is an object, every object is a value, and every function returns a value. While Scala does not prevent the use of mutable data, it does not require it.
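The difference between evaluating to the same value every time and changing state can be shown with a short sketch. Python is used here purely for illustration; the same contrast applies in Scala. The names `scaled`, `scale_in_place`, and the sample data are hypothetical.

```python
from typing import List, Tuple


def scaled(values: Tuple[int, ...], factor: int) -> Tuple[int, ...]:
    """Pure function: the result depends only on the arguments, and the input is left untouched."""
    return tuple(v * factor for v in values)


def scale_in_place(values: List[int], factor: int) -> None:
    """Impure counterpart: mutates its argument, so repeated calls compound."""
    for i in range(len(values)):
        values[i] *= factor


# Immutable data: the same expression evaluates to the same value each time.
readings = (1, 2, 3)
first = scaled(readings, 10)
second = scaled(readings, 10)

# Mutable data: the second call sees the state left behind by the first.
mutable_readings = [1, 2, 3]
scale_in_place(mutable_readings, 10)
scale_in_place(mutable_readings, 10)
```

After this runs, `first` and `second` are equal and `readings` is unchanged, whereas `mutable_readings` has been scaled twice.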
Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. Spark is a related cluster computing engine that can run on Hadoop clusters and serves as an accelerator for the next generation of MapReduce-style processing. Scala can be used to take advantage of the parallel processing inherent in a Hadoop framework.