Today's morning practice - yamanetoshi's diary

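This morning I started here. I skimmed through the material and opened wordcount.py.
I realized later that this kind of exercise goes faster when I treat it as a way of escaping from real work. During morning practice I was floundering, but when I wrote the implementation in the gaps between work tasks it moved along so quickly that I had to laugh.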

The problem

The following code is provided up front.

import sys

# This basic command line argument parsing code is provided and
# calls the print_words() and print_top() functions which you must define.
def main():
  if len(sys.argv) != 3:
    print 'usage: ./wordcount.py {--count | --topcount} file'
    sys.exit(1)

  option = sys.argv[1]
  filename = sys.argv[2]
  if option == '--count':
    print_words(filename)
  elif option == '--topcount':
    print_top(filename)
  else:
    print 'unknown option: ' + option
    sys.exit(1)

if __name__ == '__main__':
  main()

So the task is to write two procedures, print_words and print_top.

First, print_words

According to the comments in the file, the spec is as follows.

1. For the --count flag, implement a print_words(filename) function that counts
how often each word appears in the text and prints:
word1 count1
word2 count2
...

Print the above list in order sorted by word (python will sort punctuation to
come before letters -- that's fine). Store all the words as lowercase,
so 'The' and 'the' count as the same word.

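I'm not used to reading input from files, so I floundered. For example, I didn't know what the equivalent of chomp was. Anyway, here is what I cobbled together.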

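def print_words(filename):
  words = {}  # maps lowercased word -> number of occurrences
  f = open(filename, 'rU')
  for line in f:
    line = line.rstrip()  # drop the trailing newline
    l = line.split()      # split on runs of whitespace
    for word in l:
      word = word.lower()
      if word in words:
        words[word] += 1
      else:
        words[word] = 1
  f.close()

  # print one "word count" pair per line, sorted by word
  for w in sorted(words.keys()):
    print w + ' ' + str(words[w])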

The sorted(words.keys()) that the for loop iterates over was in the course text. Other things I picked up:

rstrip() is the equivalent of chomp
lower() is the lowercase conversion
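
As a small illustration of those two methods, here is a sketch assuming Python 3 (the code in this entry is Python 2) and a made-up sample line. It also shows that split() with no arguments already discards the trailing newline, so the rstrip() call in print_words is harmless but not strictly needed.

```python
# A line as it comes out of a file, trailing newline included (sample data).
line = "The quick  brown Fox\n"

# rstrip() removes all trailing whitespace, including the newline.
# Perl's chomp is narrower: it removes only the trailing record separator.
stripped = line.rstrip()

# split() with no arguments splits on any run of whitespace and ignores
# leading/trailing whitespace, so both forms give the same list of words.
words_from_raw = line.split()
words_from_stripped = stripped.split()

# lower() returns a lowercased copy; the original string is left unchanged.
lowered = [w.lower() for w in words_from_raw]

print(repr(stripped))
print(words_from_raw == words_from_stripped)
print(lowered)
```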

I wonder whether print_words as written above is the standard way to read a text file and tally the results into a dictionary.
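
One commonly seen alternative, sketched here assuming Python 3 (collections.Counter is also available from Python 2.7), is to let a with block close the file and let Counter do the tallying. The names count_words and count_words_in_file are hypothetical helpers made up for this sketch, not part of the exercise.

```python
from collections import Counter


def count_words(lines):
    """Count lowercased, whitespace-separated words from an iterable of lines."""
    counts = Counter()
    for line in lines:
        # Counter.update() adds one to the tally of each word it is given.
        counts.update(word.lower() for word in line.split())
    return counts


def count_words_in_file(filename):
    # The with block closes the file even if an exception is raised while reading.
    with open(filename) as f:
        return count_words(f)


# Usage with in-memory lines instead of a real file.
sample_counts = count_words(["The cat\n", "the mat\n"])
for word in sorted(sample_counts):
    print(word, sample_counts[word])
```

Splitting the counting from the file handling is a design choice of this sketch: count_words accepts any iterable of lines, so it can be tried without creating a file.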

print_top

According to the comments in the file, the spec is as follows.

2. For the --topcount flag, implement a print_top(filename) which is similar
to print_words() but which prints just the top 20 most common words sorted
so the most common word is first, then the next most common, and so on.

There was also a comment saying that similar procedures should be factored into a shared helper, so here is the result.

def read_words(filename):
  # returns a dict mapping each lowercased word to its number of occurrences
  words = {}
  f = open(filename, 'rU')
  for line in f:
    line = line.rstrip()
    l = line.split()
    for word in l:
      word = word.lower()
      if word in words:
        words[word] += 1
      else:
        words[word] = 1

  f.close()

  return words

def print_top(filename):
  words = read_words(filename)

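  ctr = 1
  # key extracts the count from each (word, count) pair; cmp then compares
  # those counts in reverse, giving descending order (cmp is Python 2 only)
  for k, v in sorted(words.items(), key=lambda x:x[1], cmp=lambda x, y:cmp(int(y), int(x))):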
    print k, v
    ctr += 1
    if (ctr > 20):
      break

def print_words(filename):
  words = read_words(filename)

  for w in sorted(words.keys()):
    print w + ' ' + str(words[w])

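The sorted call in print_top is not covered in the course text. The cmp argument appears to be what specifies descending order here. The key argument is interesting too, and so is the words.items() expression.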
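
For comparison, here is a sketch of the same descending sort without cmp, assuming Python 3, where sorted no longer accepts a cmp argument. It uses reverse=True for the ordering and a slice instead of the counter-and-break loop. The function name top_words and the sample dictionary are hypothetical, made up for this sketch.

```python
def top_words(counts, n=20):
    """Return up to n (word, count) pairs, most common first."""
    # key picks the count out of each (word, count) pair;
    # reverse=True sorts in descending order of that count.
    ranked = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
    # Slicing keeps at most n pairs and is safe when there are fewer than n.
    return ranked[:n]


# Usage with made-up counts; every count is distinct, so the order is unambiguous.
sample = {"the": 5, "cat": 2, "sat": 3, "mat": 1}
for word, count in top_words(sample, n=3):
    print(word, count)
```

Words with equal counts keep the order in which they come out of counts.items(), since sorted is stable.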

About this problem

It was not set up in unit-test style, so I got a little frustrated.