Monday, September 1, 2014

Lesson 5 - MapReduce

Big Data and MapReduce
Catalog and index all books in the world.

Scenarios for MapReduce
In which of the situations below do you think MapReduce may have been used?
1. Discover new oil resources.
2. Power an e-commerce website.
3. Identify malware and cyber-attack patterns for online security.
4. Help doctors answer questions about patients' health.

Basics of MapReduce

Counting words serially

import logging
import sys
import string

from util import logfile

logging.basicConfig(filename=logfile, format='%(message)s',
                   level=logging.INFO, filemode='w')


def word_count():
    # For this exercise, write a program that serially counts the number of occurrences
    # of each word in the book Alice in Wonderland.
    #
    # The text of Alice in Wonderland will be fed into your program line-by-line.
    # Your program needs to take each line and do the following:
    # 1) Tokenize the line into string tokens by whitespace
    #    Example: "Hello, World!" should be converted into "Hello," and "World!"
    #    (This part has been done for you.)
    #
    # 2) Remove all punctuation
    #    Example: "Hello," and "World!" should be converted into "Hello" and "World"
    #
    # 3) Make all letters lowercase
    #    Example: "Hello" and "World" should be converted to "hello" and "world"
    #
    # Store the number of times that a word appears in Alice in Wonderland
    # in the word_counts dictionary, and then *print* (don't return) that dictionary
    #
    # In this exercise, print statements will be considered your final output. Because
    # of this, printing a debug statement will cause the grader to break. Instead,
    # you can use the logging module which we've configured for you.
    #
    # For example:
    # logging.info("My debugging message")
    #
    # The logging module can be used to give you more control over your
    # debugging or other messages than you can get by printing them. Messages
    # logged via the logger we configured will be saved to a
    # file. If you click "Test Run", then you will see the contents of that file
    # once your program has finished running.
    #
    # The logging module also has other capabilities; see
    # https://docs.python.org/2/library/logging.html
    # for more information.

    word_counts = {}

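    for line in sys.stdin:
        data = line.strip().split(" ")

        # Your code here
        for i in data:
            # Python 2: translate() with an identity table and a deletechars
            # argument strips the punctuation; lower() normalizes the case.
            key = i.translate(string.maketrans("", ""), string.punctuation).lower()
            if key in word_counts:
                word_counts[key] += 1
            else:
                word_counts[key] = 1

    # Print once, after every line has been counted
    print word_counts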

word_count()
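
The exercise code above targets Python 2 and reads from sys.stdin. For readers on Python 3, the same three steps (tokenize, strip punctuation, lowercase) might be sketched roughly as follows. The function name count_words and the sample lines are made up for illustration, and unlike the exercise it returns the dictionary and skips tokens that are empty after cleaning.

```python
import string

# Translation table that deletes every ASCII punctuation character (Python 3 API).
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def count_words(lines):
    """Serially count normalized words in an iterable of text lines."""
    word_counts = {}
    for line in lines:
        for token in line.strip().split():
            key = token.translate(PUNCT_TABLE).lower()
            if not key:
                continue  # token consisted only of punctuation
            word_counts[key] = word_counts.get(key, 0) + 1
    return word_counts


if __name__ == "__main__":
    sample = ["Hello, World!", "hello again, world."]  # hypothetical input
    print(count_words(sample))
```

Taking a list of lines instead of sys.stdin keeps the sketch easy to try in an interpreter.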

Counting words in MapReduce

Mapper stage

Reducer stage
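
As a rough illustration of the two stages, here is a minimal in-process sketch in Python 3. It is not Hadoop code: the names wc_mapper, wc_reducer and run_word_count are hypothetical, and sorted() merely stands in for the shuffle/sort step that a MapReduce framework performs between the mapper and the reducer.

```python
import string
from itertools import groupby

PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def wc_mapper(lines):
    """Map stage: emit one (word, 1) pair per normalized token."""
    for line in lines:
        for token in line.strip().split():
            word = token.translate(PUNCT_TABLE).lower()
            if word:
                yield (word, 1)


def wc_reducer(sorted_pairs):
    """Reduce stage: sum the values of each run of identical keys."""
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield (word, sum(count for _, count in group))


def run_word_count(lines):
    # sorted() plays the role of the shuffle/sort step between the stages,
    # so that all pairs with the same key reach the reducer together.
    return list(wc_reducer(sorted(wc_mapper(lines))))


if __name__ == "__main__":
    for word, total in run_word_count(["the cat", "The dog, the end."]):
        print("{0}\t{1}".format(word, total))
```

The mapper never needs to see more than one line at a time, which is what lets the map stage be spread across many machines.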

Using MapReduce with Aadhaar Data

import sys
import string
import logging

from util import mapper_logfile
logging.basicConfig(filename=mapper_logfile, format='%(message)s',
                    level=logging.INFO, filemode='w')

def mapper():

    #Also make sure to fill out the reducer code before clicking "Test Run" or "Submit".

    #Each line will be a comma-separated list of values. The
    #header row WILL be included. Tokenize each row using the
    #commas, and emit (i.e. print) a key-value pair containing the
    #district (not state) and Aadhaar generated, separated by a tab.
    #Skip rows without the correct number of tokens and also skip
    #the header row.

    #You can see a copy of the input Aadhaar data
    #in the link below:
    #https://www.dropbox.com/s/vn8t4uulbsfmalo/aadhaar_data.csv

    #Since you are printing the output of your program, printing a debug
    #statement will interfere with the operation of the grader. Instead,
    #use the logging module, which we've configured to log to a file printed
    #when you click "Test Run". For example:
    #logging.info("My debugging message")

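    for line in sys.stdin:
        # Rows are comma-separated, so split on commas
        data = line.strip().split(",")

        #your code here
        # Skip malformed rows and the header row
        if len(data) != 12 or data[0] == 'Registrar':
            continue
        # Emit district (column 3) and Aadhaar generated (column 8)
        print "{0}\t{1}".format(data[3], data[8])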

mapper()



import sys
import logging

from util import reducer_logfile
logging.basicConfig(filename=reducer_logfile, format='%(message)s',
                    level=logging.INFO, filemode='w')

def reducer():
    
    #Also make sure to fill out the mapper code before clicking "Test Run" or "Submit".

    #Each line will be a key-value pair separated by a tab character.
    #Print out each key once, along with the total number of Aadhaar 
    #generated, separated by a tab. Make sure each key-value pair is 
    #formatted correctly! Here's a sample final key-value pair: 'Gujarat\t5.0'

    #Since you are printing the output of your program, printing a debug 
    #statement will interfere with the operation of the grader. Instead, 
    #use the logging module, which we've configured to log to a file printed 
    #when you click "Test Run". For example:
    #logging.info("My debugging message")
        
    aadhaar_generated = 0
    old_key = None

    for line in sys.stdin:
        data = line.strip().split("\t")

        # your code here
        if len(data) != 2:
            continue
        this_key, count = data

        # Input arrives sorted by key, so a new key means the previous
        # key's total is complete and can be emitted.
        if old_key and old_key != this_key:
            print "{0}\t{1}".format(old_key, aadhaar_generated)
            aadhaar_generated = 0
        old_key = this_key
        aadhaar_generated += float(count)

    # Emit the total for the last key
    if old_key is not None:
        print "{0}\t{1}".format(old_key, aadhaar_generated)


reducer()
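
The reducer above only works because its input arrives sorted by key. To see the whole flow in one place, here is a small Python 3 sketch that chains the mapper logic, a sort, and the reducer logic in memory. The function names and the sample rows are hypothetical: the rows are filler with only the district (column 3) and Aadhaar generated (column 8) filled in, not real Aadhaar data.

```python
def map_rows(lines):
    """Mapper logic: emit 'district<TAB>count' for each well-formed data row."""
    for line in lines:
        data = line.strip().split(",")
        if len(data) != 12 or data[0] == "Registrar":
            continue  # skip malformed rows and the header row
        yield "{0}\t{1}".format(data[3], data[8])


def reduce_pairs(sorted_lines):
    """Reducer logic: sum counts per key; input must be sorted by key."""
    old_key, total = None, 0.0
    for line in sorted_lines:
        data = line.strip().split("\t")
        if len(data) != 2:
            continue
        this_key, count = data
        if old_key is not None and old_key != this_key:
            yield "{0}\t{1}".format(old_key, total)
            total = 0.0
        old_key = this_key
        total += float(count)
    if old_key is not None:
        yield "{0}\t{1}".format(old_key, total)


def make_row(district, generated):
    # Hypothetical 12-column row: only columns 3 and 8 are meaningful here.
    row = ["x"] * 12
    row[3], row[8] = district, str(generated)
    return ",".join(row)


if __name__ == "__main__":
    header = ",".join(["Registrar"] + ["col"] * 11)  # made-up header row
    rows = [header, make_row("A", 2), make_row("B", 1), make_row("A", 3), "bad,row"]
    # sorted() stands in for the framework's shuffle/sort between the stages.
    for out in reduce_pairs(sorted(map_rows(rows))):
        print(out)
```

Dropping the sorted() call would let a district that appears in non-adjacent rows be emitted more than once, each time with a partial total.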


More Complex MapReduce
How do we do more complex things with MapReduce?
- counting words
- aggregating Aadhaar generated

MapReduce EcoSystem
MapReduce Programming Model
Hadoop:
Hive - Facebook
Pig - Yahoo


Using MapReduce with Subway Data







