How does MapReduce work in Hadoop?

The whole process goes through four phases of execution: splitting, mapping, shuffling, and reducing. The input to a MapReduce job is divided into fixed-size pieces called input splits. An input split is the chunk of the input that is consumed by a single map task.

Mapping is the very first phase in the execution of a MapReduce program: the data in each split is passed to a mapping function to produce output values. The shuffling phase then consumes the output of the mapping phase; its task is to consolidate the relevant records from the mapping phase output.

In our word-count example, identical words are clubbed together along with their respective frequencies. In the reducing phase, output values from the shuffling phase are aggregated: this phase combines the values from the shuffling phase and returns a single output value. In short, it summarizes the complete dataset.

In our example, this phase aggregates the values from the shuffling phase, i.e., it calculates the total occurrences of each word. Hadoop divides the job into tasks. There are two types of tasks: map tasks (splitting and mapping) and reduce tasks (shuffling and reducing). The complete execution of map and reduce tasks is controlled by two types of entities: a JobTracker, which acts as the master, and multiple TaskTrackers, which act as slaves. For every job submitted for execution in the system, there is one JobTracker that resides on the NameNode and there are multiple TaskTrackers which reside on the DataNodes.
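To make the phase walkthrough above concrete, here is a minimal word-count sketch using the classic org.apache.hadoop.mapred API that the rest of this article assumes. The class and field names are illustrative, not taken from the article.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

  // Mapping phase: emit <word, 1> for every word in the split handed to this mapper.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, one);   // shuffled, sorted and grouped by word before the reduce
      }
    }
  }

  // Reducing phase: sum the grouped counts for each word.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
}
```

The framework shuffles the <word, 1> pairs emitted by the mappers so that each call to reduce sees all of the counts for a single word.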

What is MapReduce in Hadoop? A brief description of its components can improve our understanding of how MapReduce works. A client submits a job to the MapReduce master, which then sub-divides the job into equal sub-parts.

The job-parts will be used for the two main tasks in MapReduce: mapping and reducing. The developer will write logic that satisfies the requirements of the organization or company. The input data will be split and mapped. The intermediate data will then be sorted and merged. The resulting output is processed by the reducer, which generates a final output that is stored in HDFS.
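As a hedged illustration of this workflow, the sketch below shows a driver that describes a job via JobConf and submits it with JobClient. WordCount.Map and WordCount.Reduce refer to the word-count sketch shown earlier, and the input and output paths are placeholders.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCountDriver.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);            // key type of the final output
    conf.setOutputValueClass(IntWritable.class);   // value type of the final output

    conf.setMapperClass(WordCount.Map.class);      // mapping phase
    conf.setCombinerClass(WordCount.Reduce.class); // optional combiner phase
    conf.setReducerClass(WordCount.Reduce.class);  // reducing phase

    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // input in HDFS
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // output written to HDFS

    JobClient.runJob(conf);   // submit the job and wait for completion
  }
}
```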

Every job consists of two key components: a mapping task and a reducing task. The map task plays the role of splitting jobs into job-parts and mapping intermediate data. The reduce task plays the role of shuffling and reducing intermediate data into smaller units. The job tracker acts as a master. It ensures that all jobs are executed. The job tracker schedules jobs that have been submitted by clients.

It will assign jobs to task trackers. Each task tracker handles a map task and a reduce task. Task trackers report the status of each assigned job to the job tracker. The MapReduce program is executed in three main phases: mapping, shuffling, and reducing. There is also an optional phase known as the combiner phase. Mapping is the first phase of the program.

There are two steps in this phase: splitting and mapping. A dataset is split into equal units called chunks (input splits) in the splitting step.

Hadoop consists of a RecordReader that uses TextInputFormat to transform input splits into key-value pairs. The key-value pairs are then used as inputs in the mapping step.
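For example (an illustrative input, not one from the article): if a split contains the two lines "Deer Bear River" and "Car Car River", the RecordReader presents them to the mapper as the pairs <0, "Deer Bear River"> and <16, "Car Car River">, where each key is the byte offset at which the line starts in the file and each value is the line's contents.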

Key-value pairs are the only data format that a mapper can read or understand. The mapping step contains the coding logic that is applied to these data blocks.

In this step, the mapper processes the key-value pairs and produces an output of the same form (key-value pairs). Shuffling is the second phase; it takes place after the completion of the mapping phase and consists of two main steps: sorting and merging. In the sorting step, the key-value pairs are sorted using the keys. Merging ensures that key-value pairs with the same key are combined.

The shuffling phase facilitates the removal of duplicate values and the grouping of values. Different values with the same key are grouped. The output of this phase will be keys and values, just like in the mapping phase.

In the reducer phase, the output of the shuffling phase is used as the input. The reducer processes this input further to reduce the intermediate values into smaller values.

The remainder of this article covers task execution and job configuration details from the classic (MRv1) MapReduce framework. A number of configuration options affect how often the map outputs fetched during the shuffle are merged to disk prior to the reduce and how much memory is allocated to map output during the reduce. The task tracker's local directory setting can define multiple local directories spanning multiple disks, and each filename is then assigned to a semi-random local directory. When the job starts, the task tracker creates a localized job directory relative to the local directory specified in the configuration and lays out the rest of its working files beneath it.

Jobs can enable task JVMs to be reused by specifying the job configuration property mapred.job.reuse.jvm.num.tasks. If the value is 1 (the default), then JVMs are not reused, i.e., each JVM runs a single task. If it is -1, there is no limit to the number of tasks of the same job a JVM can run. One can also specify some value greater than 1 using the JobConf API (see the sketch below). Several properties, such as the job ID, the task ID and the input file of the map, are localized in the job configuration for each task's execution.
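A minimal sketch of enabling JVM reuse with the classic JobConf API; the placeholder class name and the choice of -1 are illustrative assumptions.

```java
import org.apache.hadoop.mapred.JobConf;

public class JvmReuseExample {
  // Sketch: allow task JVMs of this job to be reused without limit.
  // Equivalent to setting mapred.job.reuse.jvm.num.tasks in the job configuration.
  static void enableJvmReuse(JobConf conf) {
    conf.setNumTasksToExecutePerJvm(-1);   // -1: no limit; values > 1 cap tasks per JVM
  }
}
```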

Note: During the execution of a streaming job, the names of the "mapred" parameters are transformed: the dots (.) become underscores (_). For example, mapred.job.id becomes mapred_job_id. The child JVM always has its current working directory added to java.library.path and LD_LIBRARY_PATH, and hence cached native libraries can be loaded via System.loadLibrary or System.load. JobClient is the primary interface by which a user job interacts with the JobTracker. JobClient provides facilities to submit jobs, track their progress, access component tasks' reports and logs, get the MapReduce cluster's status information, and so on.

Job history files are also logged to a user-specified directory given by hadoop.job.history.user.location, which defaults to the job output directory; hence, by default, they will be under mapred.output.dir/_logs/history. The user can stop this logging by giving the value none for hadoop.job.history.user.location, and can use OutputLogFilter to filter log files from the output directory listing. Normally the user creates the application, describes various facets of the job via JobConf, and then uses the JobClient to submit the job and monitor its progress.
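A hedged sketch of the submit-and-monitor pattern using JobClient: the job is submitted without blocking and its progress is polled through the returned RunningJob handle. The class and method names introduced here are illustrative.

```java
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class SubmitAndMonitor {
  // Sketch: submit a job asynchronously, then poll its map/reduce progress.
  static void submitAndWait(JobConf conf) throws Exception {
    JobClient client = new JobClient(conf);
    RunningJob job = client.submitJob(conf);          // returns immediately after submission
    while (!job.isComplete()) {
      System.out.printf("map %.0f%%  reduce %.0f%%%n",
          job.mapProgress() * 100, job.reduceProgress() * 100);
      Thread.sleep(5000);                             // poll every few seconds
    }
    System.out.println(job.isSuccessful() ? "Job succeeded" : "Job failed");
  }
}
```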

Job-level authorization and queue-level authorization are enabled on the cluster if the configuration property mapred.acls.enabled is set to true. When enabled, access control checks are done (a) by the JobTracker before allowing users to submit jobs to queues and administer these jobs, and (b) by the JobTracker and the TaskTracker before allowing users to view job details or to modify a job using MapReduce APIs, the CLI or the web user interfaces. A job submitter can specify access control lists for viewing or modifying a job via the configuration properties mapreduce.job.acl-view-job and mapreduce.job.acl-modify-job.

By default, nobody is given access in these properties. However, irrespective of the job ACLs configured, a job's owner, the superuser and the cluster administrators (configured via mapreduce.cluster.administrators) always have access to view and modify a job.

A job view ACL authorizes users against the configured mapreduce.job.acl-view-job before returning possibly sensitive information about a job, such as its counters, diagnostics and task logs. Other information about a job, like its status and its profile, is accessible to all users without requiring authorization. A job modification ACL authorizes users against the configured mapreduce.job.acl-modify-job before allowing operations that modify the job, such as killing the job, killing or failing one of its tasks, or changing its priority.
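A hedged sketch of setting these ACLs on a job; the usernames and group are placeholders, and the "users groups" format follows the usual Hadoop ACL convention.

```java
import org.apache.hadoop.mapred.JobConf;

public class JobAclExample {
  // Sketch: grant view access to two users plus a group, and modify access to one user.
  // ACL format is "user1,user2 group1,group2"; all names below are placeholders.
  static void setJobAcls(JobConf conf) {
    conf.set("mapreduce.job.acl-view-job", "alice,bob ops");
    conf.set("mapreduce.job.acl-modify-job", "alice");
  }
}
```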

The modification operations above are also permitted by the queue-level ACL "mapred.queue.queue-name.acl-administer-jobs", configured per queue. Users may need to chain MapReduce jobs to accomplish complex tasks which cannot be done via a single MapReduce job. This is fairly easy since the output of a job typically goes to the distributed file system, and that output, in turn, can be used as the input for the next job. For such chains, the framework offers several job-control options, ranging from blocking submission with JobClient.runJob (which returns only after the job completes) to non-blocking submission with JobClient.submitJob followed by polling of the returned RunningJob handle. In a secure cluster, the user is authenticated via Kerberos' kinit command.

Because of scalability concerns, we don't push the client's Kerberos tickets into MapReduce jobs. Instead, we acquire delegation tokens from each HDFS NameNode that the job will use and store them in the job as part of job submission. Applications whose tasks need to talk to additional NameNodes during execution must list them in the configuration "mapreduce.job.hdfs-servers" so that tokens are obtained for them as well. These tokens are passed to the JobTracker as part of the job submission as Credentials. MapReduce delegation tokens are also provided so that tasks can spawn jobs if they wish to.

The tasks authenticate to the JobTracker via the MapReduce delegation tokens. The obtained token must then be pushed onto the credentials that are in the JobConf used for job submission.

The Credentials API (for example, Credentials.addToken) can be used for this. The credentials are sent to the JobTracker as part of the job submission process.

The TaskTracker localizes the token file as part of job localization. In order to launch jobs from tasks or to perform any HDFS operation, tasks must set the configuration "mapreduce.job.credentials.binary" to point to this token file.

The delegation tokens obtained for a job are cancelled by the JobTracker when the job completes; this is the default behavior unless mapreduce.job.complete.cancel.delegation.tokens is set to false in the JobConf. For jobs whose tasks in turn spawn jobs, this should be set to false. Applications sharing JobConf objects between multiple jobs on the JobClient side should also look at setting this property to false, because the Credentials object within the JobConf will then be shared: all jobs will end up sharing the same tokens, and hence the tokens should not be canceled when the jobs in the sequence finish.
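A hedged sketch of keeping delegation tokens alive when a shared JobConf drives a sequence of jobs; the property name is the one described above, and the helper class is illustrative.

```java
import org.apache.hadoop.mapred.JobConf;

public class TokenRetentionExample {
  // Sketch: stop the JobTracker from cancelling HDFS delegation tokens when this job
  // completes, so later jobs sharing the same JobConf (and tokens) can still use them.
  static void keepTokensAlive(JobConf conf) {
    conf.setBoolean("mapreduce.job.complete.cancel.delegation.tokens", false);
  }
}
```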

Apart from the HDFS delegation tokens, arbitrary secrets can also be passed during job submission for tasks to access other third-party services. A reference to the credentials in the JobConf used for submission (for example, via JobConf.getCredentials) can be used to add such secrets, and tasks can access them using the APIs in Credentials. InputFormat describes the input specification for a MapReduce job. The MapReduce framework relies on the InputFormat of the job to validate the input specification, split the input files into logical InputSplit instances (each of which is then assigned to an individual Mapper), and provide the RecordReader implementation used to glean input records from the logical InputSplit for processing by the Mapper.

The default behavior of file-based InputFormat implementations, typically sub-classes of FileInputFormat, is to split the input into logical InputSplit instances based on the total size, in bytes, of the input files. However, the FileSystem blocksize of the input files is treated as an upper bound for input splits.

A lower bound on the split size can be set via mapred.min.split.size. Clearly, logical splits based on input size are insufficient for many applications, since record boundaries must be respected.

In such cases, the application should implement a RecordReader, which is responsible for respecting record boundaries and presents a record-oriented view of the logical InputSplit to the individual task. TextInputFormat is the default InputFormat, and it automatically handles files with common compression extensions such as .gz. However, it must be noted that compressed files with such extensions cannot be split, and each compressed file is processed in its entirety by a single mapper.
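A hedged sketch of selecting the default TextInputFormat explicitly and raising the minimum split size; the 64 MB figure is an arbitrary illustration, and the HDFS block size still caps the split size from above.

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class InputSplitTuning {
  // Sketch: keep the default TextInputFormat but ask for splits of at least 64 MB.
  static void configureSplits(JobConf conf) {
    conf.setInputFormat(TextInputFormat.class);
    conf.setLong("mapred.min.split.size", 64L * 1024 * 1024);
  }
}
```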

InputSplit represents the data to be processed by an individual Mapper. Typically InputSplit presents a byte-oriented view of the input, and it is the responsibility of RecordReader to process it and present a record-oriented view. FileSplit is the default InputSplit; it sets map.input.file, the path of the input file for the logical split. Typically the RecordReader converts the byte-oriented view of the input provided by the InputSplit and presents a record-oriented view to the Mapper implementations for processing. RecordReader thus assumes the responsibility of processing record boundaries and presents the tasks with keys and values.

OutputFormat describes the output specification for a MapReduce job. The MapReduce framework relies on the OutputFormat of the job to validate the output specification (for example, to check that the output directory doesn't already exist) and to provide the RecordWriter implementation used to write the output files. TextOutputFormat is the default OutputFormat. OutputCommitter describes the commit of task output for a MapReduce job. The MapReduce framework relies on the OutputCommitter of the job to set up and clean up the job, set up each task's temporary output, check whether a task needs a commit, and commit or discard the task output.

FileOutputCommitter is the default OutputCommitter. During execution, a task writes its output to a temporary, task-attempt-specific sub-directory of the job's output directory; on success those files are promoted, and, of course, the framework discards the sub-directory of unsuccessful task attempts. This process is completely transparent to the application.

So, just create any side-files in the path returned by FileOutputFormat.getWorkOutputPath, and the framework will promote them for successful task attempts. RecordWriter implementations write the job outputs to the FileSystem. Users submit jobs to queues. Queues, as collections of jobs, allow the system to provide specific functionality. For example, queues use ACLs to control which users can submit jobs to them. Queues are expected to be primarily used by Hadoop schedulers.

Hadoop comes configured with a single mandatory queue, called 'default'. Queue names are defined in the mapred.queue.names property of the Hadoop site configuration. Some job schedulers, such as the Capacity Scheduler, support multiple queues. A job defines the queue it needs to be submitted to through the mapred.job.queue.name property, or through the JobConf.setQueueName API. Setting the queue name is optional; if a job is submitted without an associated queue name, it is submitted to the 'default' queue. Counters represent global counters, defined either by the MapReduce framework or by applications.

Each Counter can be of any Enum type. Counters of a particular Enum are bunched into groups of type Counters.Group. Applications can define arbitrary Counters of type Enum and update them via Reporter.incrCounter in the map and/or reduce methods (a sketch follows below). These counters are then globally aggregated by the framework. DistributedCache distributes application-specific, large, read-only files efficiently.
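Before moving on to the DistributedCache, here is a hedged sketch illustrating the counter mechanism just described: a mapper defines an application Enum and bumps its counters through Reporter. The enum, class and counter names are illustrative assumptions.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CountingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  // Counters of this enum are bunched into one group and aggregated across all tasks.
  public enum RecordCounters { EMPTY_LINES, GOOD_LINES }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    if (value.toString().trim().isEmpty()) {
      reporter.incrCounter(RecordCounters.EMPTY_LINES, 1);   // aggregated by the framework
    } else {
      reporter.incrCounter(RecordCounters.GOOD_LINES, 1);
      output.collect(value, new LongWritable(1));
    }
  }
}
```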

DistributedCache is a facility provided by the MapReduce framework to cache files (text, archives, jars and so on) needed by applications. The framework will copy the necessary files to a slave node before any tasks for the job are executed on that node. Its efficiency stems from the fact that the files are only copied once per job, and from the ability to cache archives which are un-archived on the slaves. DistributedCache tracks the modification timestamps of the cached files. Clearly, the cache files should not be modified by the application or externally while the job is executing.

Archives (zip, tar, tgz and tar.gz files) are un-archived at the slave nodes, and files have execution permissions set. Optionally, users can also direct the DistributedCache to symlink the cached file(s) into the current working directory of the task via the DistributedCache.createSymlink API.

The same effect can be achieved by setting the configuration property mapred.create.symlink to yes. The DistributedCache can also be used to distribute both jars and native libraries: the DistributedCache.addArchiveToClassPath or DistributedCache.addFileToClassPath APIs cache files/jars and additionally add them to the classpath of the child JVM. The same can be done by setting the configuration properties mapred.job.classpath.files and mapred.job.classpath.archives.

Similarly, the cached files that are symlinked into the working directory of the task can be used to distribute native libraries and load them. DistributedCache files can be private or public, which determines how they can be shared on the slave nodes. The Tool interface supports the handling of generic Hadoop command-line options. Tool is the standard for any MapReduce tool or application. The application should delegate the handling of standard command-line options to GenericOptionsParser via ToolRunner.
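A hedged sketch of a driver that follows this convention; the class name and argument handling are placeholders. ToolRunner lets GenericOptionsParser handle generic options such as -D, -files, -libjars and -archives before run() is invoked.

```java
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountTool extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // getConf() already contains any -D overrides parsed by ToolRunner.
    JobConf conf = new JobConf(getConf(), WordCountTool.class);
    conf.setJobName("wordcount");
    // Mapper, reducer and output types would be set here, as in the earlier word-count sketch.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new WordCountTool(), args));
  }
}
```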

IsolationRunner is a utility to help debug MapReduce programs. To use the IsolationRunner, first set keep.failed.task.files to true (also see keep.task.files.pattern). IsolationRunner will then run the failed task in a single JVM, which can be run in a debugger, over precisely the same input.

Profiling is a utility to get a representative (2 or 3) sample of built-in Java profiler output for a sample of maps and reduces. The user can specify whether the system should collect profiler information for some of the tasks in the job by setting the configuration property mapred.task.profile; the value can also be set using the JobConf.setProfileEnabled API. If the value is set to true, task profiling is enabled. The profiler information is stored in the user log directory.
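A hedged sketch of enabling profiling via the classic JobConf API; the hprof parameter string shown is assumed from the classic documentation's default format, and the class name is illustrative.

```java
import org.apache.hadoop.mapred.JobConf;

public class ProfilingExample {
  // Sketch: profile the first three map task attempts with the built-in hprof profiler.
  static void enableProfiling(JobConf conf) {
    conf.setProfileEnabled(true);               // same effect as mapred.task.profile=true
    conf.setProfileTaskRange(true, "0-2");      // true = map tasks; attempts 0, 1 and 2
    conf.setProfileParams(
        "-agentlib:hprof=cpu=samples,heap=sites,force=n,thread=y,verbose=n,file=%s");
  }
}
```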

By default, profiling is not enabled for the job. Once enabled, the range of task attempts to profile can be adjusted; by default, the specified range is 0-2. The user can also supply profiler configuration arguments by setting the configuration property mapred.task.profile.params, whose value can be specified using the JobConf.setProfileParams API. These parameters are passed to the task child JVM on the command line. The MapReduce framework also provides a facility to run user-provided scripts for debugging.

When a MapReduce task fails, a user can run a debug script, for example to process the task logs. The script is given access to the task's stdout and stderr outputs, syslog and jobconf. The output from the debug script's stdout and stderr is displayed in the console diagnostics and also as part of the job UI. Below, we discuss how to submit a debug script with a job. The script file needs to be distributed and submitted to the framework.

The user needs to use DistributedCache to distribute and symlink the script file. A quick way to submit the debug script is to set values for the properties mapred.map.task.debug.script and mapred.reduce.task.debug.script, for debugging map and reduce tasks respectively. In streaming mode, a debug script can be submitted with the command-line options -mapdebug and -reducedebug.

The arguments to the script are the task's stdout, stderr, syslog and jobconf files. For pipes, a default script is run to process core dumps under gdb; it prints the stack trace and gives info about running threads. JobControl is a utility which encapsulates a set of MapReduce jobs and their dependencies. Hadoop MapReduce provides facilities for the application writer to specify compression for both intermediate map outputs and the job outputs, i.e., the output of the reduces.

It also comes bundled with a CompressionCodec implementation for the zlib compression algorithm. The gzip file format is also supported.

Hadoop also provides native implementations of the above compression codecs, for reasons of both performance (zlib) and the non-availability of Java libraries. More details on their usage and availability are given in the Hadoop native libraries documentation. Applications can control compression of intermediate map outputs via the JobConf.setCompressMapOutput and JobConf.setMapOutputCompressorClass APIs. Applications can control compression of job outputs via the FileOutputFormat.setCompressOutput and FileOutputFormat.setOutputCompressorClass APIs. If the job outputs are to be stored in SequenceFileOutputFormat, the required SequenceFile.CompressionType (i.e., RECORD or BLOCK, defaulting to RECORD) can be specified via the SequenceFileOutputFormat.setOutputCompressionType API.
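A hedged sketch of these compression settings with the classic API; the choice of GzipCodec and BLOCK compression is illustrative, not mandated by the text.

```java
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class CompressionExample {
  // Sketch: compress intermediate map outputs and final job outputs; codec is illustrative.
  static void enableCompression(JobConf conf) {
    conf.setCompressMapOutput(true);                        // intermediate map outputs
    conf.setMapOutputCompressorClass(GzipCodec.class);

    FileOutputFormat.setCompressOutput(conf, true);         // final job outputs
    FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);

    // For SequenceFile outputs, the compression type (RECORD or BLOCK) can also be chosen.
    SequenceFileOutputFormat.setOutputCompressionType(conf, SequenceFile.CompressionType.BLOCK);
  }
}
```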


