Python Map Reduce Assignment Help | Why mrjob?
1.1.1 Overview
mrjob is the easiest route to writing Python programs that run on Hadoop. If you use mrjob, you’ll be able to test your code locally without installing Hadoop or run it on a cluster of your choice.
Additionally, mrjob has extensive integration with Amazon Elastic MapReduce. Once you’re set up, it’s as easy to run your job in the cloud as it is to run it on your laptop.
Here are a number of features of mrjob that make writing MapReduce jobs easier:
Keep all MapReduce code for one job in a single class
Easily upload and install code and data dependencies at runtime
Switch input and output formats with a single line of code
Automatically download and parse error logs for Python tracebacks
Put command line filters before or after your Python code
If you don’t want to be a Hadoop expert but need the computing power of MapReduce, mrjob might be just the thing for you.
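For example, the "switch input and output formats with a single line of code" point above usually means overriding a protocol class attribute. The sketch below uses mrjob's standard JSONValueProtocol; the MRJSONLines job itself is an illustrative example, not something from this article:
from mrjob.job import MRJob
from mrjob.protocol import JSONValueProtocol

class MRJSONLines(MRJob):

    # the single line that changes the output format:
    # write bare JSON values instead of the default key<TAB>value pairs
    OUTPUT_PROTOCOL = JSONValueProtocol

    def mapper(self, _, line):
        yield None, {'length': len(line), 'words': len(line.split())}

if __name__ == '__main__':
    MRJSONLines.run()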
1.1.2 Why use mrjob instead of X?
Where X is any other library that helps Hadoop and Python interface with each other.
1. mrjob has more documentation than any other framework or library we are aware of. If you’re reading this, it’s probably your first contact with the library, which means you are in a great position to provide valuable feedback about our documentation. Let us know if anything is unclear or hard to understand.
2. mrjob lets you run your code without Hadoop at all. Other frameworks require a Hadoop instance to function at all. If you use mrjob, you’ll be able to write proper tests for your MapReduce code.
3. mrjob provides a consistent interface across every environment it supports. No matter whether you’re running locally, in the cloud, or on your own cluster, your Python code doesn’t change at all.
4. mrjob handles much of the machinery of getting code and data to and from the cluster your job runs on. You don’t need a series of scripts to install dependencies or upload files.
5. mrjob makes debugging much easier. Locally, it can run a simple MapReduce implementation in-process, so you get a traceback in your console instead of in an obscure log file. On a cluster or on Elastic MapReduce, it parses error logs for Python tracebacks and other likely causes of failure.
6. mrjob automatically serializes and deserializes data going into and coming out of each task so you don’t need to constantly json.loads() and json.dumps().
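To make points 2, 5, and 6 concrete, here is a rough sketch of running a job entirely in-process for testing. It assumes the MRWordFrequencyCount class from mr_word_count.py, written later in this post; sandbox(), make_runner(), cat_output(), and parse_output() are part of mrjob's runner API, though exact names can differ between versions, so treat this as an outline rather than copy-paste code:
from io import BytesIO
from mr_word_count import MRWordFrequencyCount

def test_word_count():
    # run with the default inline runner: no Hadoop, no subprocesses
    job = MRWordFrequencyCount(['-r', 'inline'])
    job.sandbox(stdin=BytesIO(b'one line\nand another line\n'))
    with job.make_runner() as runner:
        runner.run()
        # keys and values come back as Python objects, already deserialized
        results = dict(job.parse_output(runner.cat_output()))
    assert results['lines'] == 2
    assert results['words'] == 5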
1.1.3 Why use X instead of mrjob?
The flip side to mrjob’s ease of use is that it doesn’t give you the same level of access to Hadoop APIs that Dumbo and Pydoop do. It’s simplified a great deal. But that hasn’t stopped several companies, including Yelp, from using it for day-to-day heavy lifting. For common (and many uncommon) cases, the abstractions help rather than hinder.
Other libraries can be faster if you use typedbytes. There have been several attempts at integrating it with mrjob, and it may land eventually, but it doesn’t exist yet.
1.2 Fundamentals
1.2.1 Installation
Install with pip:
pip install mrjob
or from a git clone of the source code:
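The command itself is missing from the post; installing from a source checkout typically looks something like this (assuming the upstream Yelp/mrjob repository):
$ git clone https://github.com/Yelp/mrjob.git
$ cd mrjob
$ pip install .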
1.2.2 Writing your first job
Open a file called mr_word_count.py and type this into it:
from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    MRWordFrequencyCount.run()
Now go back to the command line, find your favorite body of text (such as mrjob’s README.rst, or even your new file mr_word_count.py), and try this:
$ python mr_word_count.py my_file.txt
You should see something like this:
"chars" 3654
"lines" 123
"words" 417
Congratulations! You’ve just written and run your first program with mrjob.
What’s happening
A job is defined by a class that inherits from MRJob. This class contains methods that define the steps of your job.
A “step” consists of a mapper, a combiner, and a reducer. All of those are optional, though you must have at least one. So you could have a step that’s just a mapper, or just a combiner and a reducer.
When you only have one step, all you have to do is write methods called mapper(), combiner(), and reducer().
The mapper() method takes a key and a value as args (in this case, the key is ignored and a single line of text input is the value) and yields as many key-value pairs as it likes. The reducer() method takes a key and an iterator of values and also yields as many key-value pairs as it likes. (In this case, it sums the values for each key, which represent the numbers of characters, words, and lines in the input.)
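As a sketch of a step that uses all three methods, here is a word-frequency job with a combiner; the combiner pre-sums counts on each mapper before data is shuffled to the reducers. (This class is illustrative and not part of the original example.)
import re
from mrjob.job import MRJob

WORD_RE = re.compile(r"[\w']+")

class MRWordFreq(MRJob):

    def mapper(self, _, line):
        # emit (word, 1) for every word on the line
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def combiner(self, word, counts):
        # pre-sum counts on each mapper before the shuffle
        yield word, sum(counts)

    def reducer(self, word, counts):
        # final total for each word
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordFreq.run()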
Warning: Forgetting the following information will result in confusion.
The final required component of a job file is these two lines at the end of the file, every time:
if __name__ == '__main__':
    MRWordCounter.run()  # where MRWordCounter is your job class
These lines pass control over the command line arguments and execution to mrjob. Without them, your job will not work. For more information, see Hadoop Streaming and mrjob and Why can’t I put the job class and run code in the same file?.
1.2.3 Running your job different ways
The most basic way to run your job is on the command line:
$ python my_job.py input.txt
By default, output will be written to stdout.
You can pass input via stdin, but be aware that mrjob will just dump it to a file first:
$ python my_job.py < input.txt
You can pass multiple input files, mixed with stdin (using the - character):
$ python my_job.py input1.txt input2.txt - < input3.txt
By default, mrjob will run your job in a single Python process. This provides the friendliest debugging experience, but it’s not exactly distributed computing!
You change the way the job is run with the -r/--runner option. You can use -r inline (the default), -r local, -r hadoop, or -r emr.
To run your job in multiple subprocesses with a few Hadoop features simulated, use -r local. To run it on your Hadoop cluster, use -r hadoop.
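For instance, running the earlier word-count job with the local runner and saving its output to a file might look like this (file names here are just placeholders):
$ python mr_word_count.py -r local my_file.txt > counts.txt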
If you have Dataproc configured (see Dataproc Quickstart), you can run it there with -r dataproc. Your input files can come from HDFS if you’re using Hadoop, or GCS if you’re using Dataproc:
$ python my_job.py -r dataproc gcs://my-inputs/input.txt
$ python my_job.py -r hadoop hdfs://my_home/input.txt
If you have Elastic MapReduce configured (see Elastic MapReduce Quickstart), you can run it there with -r emr. Your input files can come from HDFS if you’re using Hadoop, or S3 if you’re using EMR:
$ python my_job.py -r emr s3://my-inputs/input.txt
$ python my_job.py -r hadoop hdfs://my_home/input.txt
If your code spans multiple files, see Uploading your source tree.

