Better late than never - yamanetoshi's diary

I'm going to try out Hadoop. The entry I'm following is this one:

A Pythonic way to do MapReduce using hadoop

Hmm.

I only just noticed that this article writes the whole thing in Python. Anyway, I'll work through the following items in order.

Preparing the environment
Installing hadoop 0.20.2 on Ubuntu 10.10
Collecting all CSV files which contains students’ marks
Writing a mapper in Python for MapReduce
Writing a reducer in Python for MapReduce
Test the output using shell and pipe
Run MapReduce with hadoop
Fetch the hadoop output
by the way no war/jar business!

Preparing the environment

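Conveniently, the article is written with Ubuntu 10.10 in mind. It lists the following requirements:

Java 6 is installed
the JAVA_HOME environment variable is set
the ssh and rsync packages are installed

The Java on my Ubuntu box is sun-java6-jre, so the setting would be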

JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.22; export JAVA_HOME

I suppose adding the line above to ~/.bashrc is enough. The packages are already installed, too.
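The three requirements above can also be checked in one go. The following is only a rough sketch in Python 3; the function name check_prerequisites is made up for illustration and is not part of the article or of Hadoop.

```python
import os
import shutil


def check_prerequisites(env=None, which=shutil.which):
    """Return a list of problems; an empty list means everything looks fine.

    `env` and `which` are parameters only so the check can be exercised
    without touching the real environment (hypothetical helper).
    """
    env = os.environ if env is None else env
    problems = []

    # JAVA_HOME must be set and should point to an existing directory.
    java_home = env.get("JAVA_HOME")
    if not java_home:
        problems.append("JAVA_HOME is not set")
    elif not os.path.isdir(java_home):
        problems.append("JAVA_HOME is not a directory: %s" % java_home)

    # ssh and rsync must be reachable through PATH.
    for tool in ("ssh", "rsync"):
        if which(tool) is None:
            problems.append("%s not found on PATH" % tool)
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print(problem)
```

Running it in a fresh shell after editing ~/.bashrc should print nothing if the environment is ready.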

$ dpkg --get-selections|grep ssh
libganymed-ssh2-java                            install
openssh-client                                  install
openssh-server                                  install
ssh-askpass-gnome                               install
$ dpkg --get-selections|grep rsync
rsync                                           install
$

Installing hadoop 0.20.2 on Ubuntu 10.10

Download the 0.20.2 release. Now, where should I unpack it?

$ pwd
/home/rms/Documents/Hadoop/hadoop-0.20.2

So I put it there and ran the following.

$ mkdir input
$ cp conf/*.xml input
$ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
$ cat output/*
1       dfsadmin
$

I don't quite understand what just happened orz
I suppose it ran a grep over the files in input and wrote the result to output...
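For what it's worth, the grep example counts how often each string matching the regular expression occurs in the input files and writes "count, match" pairs, most frequent first. A single-process approximation in Python 3 might look like the sketch below. The function name grep_count is made up, and breaking ties alphabetically is my own assumption, not something Hadoop promises.

```python
import collections
import glob
import re


def grep_count(paths, pattern):
    """Count every regex match across the given files.

    Returns (match, count) pairs sorted by descending count.
    Ties are broken alphabetically (an assumption of this sketch).
    """
    regex = re.compile(pattern)
    counts = collections.Counter()
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as f:
            for line in f:
                # finditer + group(0) always yields the whole match,
                # even if the pattern contains groups.
                counts.update(m.group(0) for m in regex.finditer(line))
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    for match, count in grep_count(glob.glob("input/*.xml"), r"dfs[a-z.]+"):
        print("%d\t%s" % (count, match))
```

Seen this way, the single line in output/ just means that dfsadmin was the only string in the copied conf/*.xml files that matched, and that it matched once.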

Collecting all CSV files which contains students’ marks

Next, download mapreduce.tar.gz from here and unpack it.

$ find
.
./mapper.py
./data
./data/math.txt
./data/chemistry.txt
./data/generator.py
./data/surnames
./data/physics.txt
./.idea
./.idea/mapreduce.iml
./.idea/misc.xml
./.idea/modules.xml
./.idea/vcs.xml
./.idea/encodings.xml
./.idea/.name
./.idea/workspace.xml
./reducer.py
$

Writing a mapper in Python for MapReduce

mapper.py looks like this.

$ cat mapper.py 
#!/usr/bin/env python

__author__ = 'Nurul Ferdous <nurul@ferdo.us>'
__version__ = '0.2.20'

import sys

for line in sys.stdin:
    ln = line.strip()
    data = ln.split(' ')
    print '%s\t%s' % (data[0], data[1])

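The data appears to hold one record per line, a name and a score separated by a space. Apparently all the mapper has to do is read from standard input and write to standard output.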
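To make the per-line transformation easier to poke at, here is the same idea as a small function, written as a Python 3 sketch rather than the article's code. Two things differ from mapper.py and are my own choices: it splits on any whitespace instead of a single space, and it skips lines with fewer than two fields instead of failing on them. The name map_line is made up.

```python
import sys


def map_line(line):
    """Turn a 'name score' line into a tab-separated 'name<TAB>score' string.

    Returns None for lines with fewer than two fields
    (skipping them is an assumption of this sketch).
    """
    fields = line.split()
    if len(fields) < 2:
        return None
    return "%s\t%s" % (fields[0], fields[1])


if __name__ == "__main__":
    for raw in sys.stdin:
        out = map_line(raw)
        if out is not None:
            print(out)
```

The tab matters: it is what separates the key (the name) from the value (the score) on the way to the reducer.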

Writing a reducer in Python for MapReduce

In that sense, I guess this is the one that does the real work.

$ cat reducer.py 
#!/usr/bin/env python
##reducer.py

__author__ = 'Nurul Ferdous <nurul@ferdo.us>'
__version__ = '0.2.20'

from operator import itemgetter
import sys

agregatedmarks = {}

for line in sys.stdin:
    name, marks = line.rstrip().split('\t')
    try:
        if agregatedmarks.get(name):
            agregatedmarks.get(name).append(marks)
        else:
            agregatedmarks[name] = [marks]
    except ValueError:
        pass

sortedmarksheet = sorted(agregatedmarks.items(), key=itemgetter(0))
for name, marks in sortedmarksheet:
    print '%s\t%s'% (name, marks)

$

Hmm, this one is a bit less intuitive.
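What reducer.py does is collect every score into a dict keyed by name and then print the names in sorted order. Since the lines reaching the reducer are already sorted by key, the same grouping can be written with itertools.groupby, which I find a little easier to follow. This is only a Python 3 sketch of an alternative, not the article's method; reduce_sorted is a made-up name, and it assumes every non-empty line has the form name, tab, score.

```python
import itertools
import sys


def reduce_sorted(lines):
    """Group tab-separated 'name<TAB>marks' lines that are already sorted by name.

    Yields (name, [marks, ...]) with the marks kept as strings,
    just like reducer.py does.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for name, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        yield name, [marks for _, marks in group]


if __name__ == "__main__":
    for name, marks in reduce_sorted(sys.stdin):
        print("%s\t%s" % (name, marks))
```

Unlike the dict version, this only works when the input really is sorted, because groupby merges adjacent equal keys only.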

Test the output using shell and pipe

The article says to run the scripts in the shell first.

$ time cat data/math.txt | ./mapper.py | sort | ./reducer.py

real    0m0.120s
user    0m0.080s
sys     0m0.020s
$

Output omitted.

Run MapReduce with hadoop

The article says to run it as follows. I'll just keep following along.

$ cd ../hadoop-0.20.2
$ bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar \
 -file /media/raid/sandbox/mapreduce/mapper.py -mapper \
 /media/raid/sandbox/mapreduce/mapper.py -file \
 /media/raid/sandbox/mapreduce/reducer.py -reducer \
 /media/raid/sandbox/mapreduce/reducer.py -input \
 /media/raid/sandbox/mapreduce/data/*.txt -output marksheet
$ cat marksheet/*

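Something failed. I created the marksheet directory by hand and retried.
This time it complained that the directory "already exists" orz
Changing the command as follows made it succeed. I wonder why.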

$ bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar \
 -file /media/raid/sandbox/mapreduce/mapper.py -mapper \
 /media/raid/sandbox/mapreduce/mapper.py -file \
 /media/raid/sandbox/mapreduce/reducer.py -reducer \
 /media/raid/sandbox/mapreduce/reducer.py -input \
 /media/raid/sandbox/mapreduce/data/*.txt -output ../mapreduce/marksheet
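About the "already exists" complaint: Hadoop refuses to start a job whose output directory already exists, so creating marksheet by hand guaranteed the second failure. The cause of the first failure is not clear from the above. When retrying in standalone mode, where the output lands on the local file system (an assumption based on the cat commands in this entry), a leftover directory can be cleared first. A minimal Python 3 sketch, with a made-up function name:

```python
import os
import shutil


def clear_output_dir(path):
    """Remove a leftover local output directory before re-running a job.

    Returns True if a directory was removed, False if there was nothing to do.
    Only meaningful when the job writes to the local file system.
    """
    if os.path.isdir(path):
        shutil.rmtree(path)
        return True
    return False


if __name__ == "__main__":
    # 'marksheet' is the output name used in the first attempt above.
    print(clear_output_dir("marksheet"))
```

Be careful with the path, since rmtree deletes everything underneath it.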

And then,

$ cat ../mapreduce/marksheet/part-0000
(omitted)
zamora  ['78', '59', '82']
zapata  ['56', '35', '87']
zavala  ['87', '30', '61']
zhang   ['60', '32', '58']
ziegler ['91', '20', '77']
zimmerman       ['55', '42', '69']
zuniga  ['99', '71', '99']
$
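Each output line is a name, a tab, and the repr of a Python list of strings, so it can be read back with ast.literal_eval. The sketch below (Python 3) sums the scores per student. It assumes the separator is a tab, as printed by reducer.py, and the function names are made up for illustration.

```python
import ast


def parse_marksheet_line(line):
    """Parse a line such as "zamora<TAB>['78', '59', '82']" into (name, [78, 59, 82])."""
    name, marks_repr = line.strip().split("\t", 1)
    # literal_eval safely evaluates the list literal without running code.
    marks = [int(m) for m in ast.literal_eval(marks_repr)]
    return name, marks


def totals(lines):
    """Map each name to the sum of its marks."""
    return {name: sum(marks) for name, marks in map(parse_marksheet_line, lines)}


if __name__ == "__main__":
    sample = ["zamora\t['78', '59', '82']\n", "zuniga\t['99', '71', '99']\n"]
    for name, total in sorted(totals(sample).items()):
        print("%s\t%d" % (name, total))
```

Emitting plain numbers from the reducer instead of a list repr would make this step unnecessary, but that would mean changing the article's reducer.py.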

Hmm, the items

Fetch the hadoop output
by the way no war/jar business!

don't seem to be covered anywhere. For now, I'll build on the above and experiment a bit more.