ThinkStats (18) - yamanetoshi's diary

3 月末から放置している。ぐずぐず状態だなぁ、と思いつつ不調な朝に手を付けてみるなど。
とりあえず、http://thinkstats.com/hypothesis.py を取得して実行してみました。

$ ls *.py
Cdf.py  
cumulative.py  
descriptive.py  
first.py  
hypothesis.py  
myplot.py  
Pmf.py  
survey.py  
thinkstats.py
$ python hypothesis.py 
(Mean, Var) of pooled data (38.56055968517709, 7.301863788195572)
(Mean, Var) of resampled deltas (-0.00169117178270686, 0.0030938100255098588)
Tails (left, right, total): 0.084 0.079 0.163
n: 4413
m: 4735
mu1 38.6009517335
mu2 38.5229144667
delta 0.0780372667775
p-value 0.163
Writing length_deltas_cdf.pdf
Writing length_deltas_cdf.png
(Mean, Var) of resampled deltas (0.08077549196467845, 0.003143562170784472)
Tails (left, right, total): 0.001 0.515 0.516
n: 4413
m: 4735
mu1 38.6009517335
mu2 38.5229144667
delta 0.0780372667775
p-value 0.516
Posterior 0.759941089838

ソースを掘削しつつリハビリな方向。

先頭から確認

hypothesis.py な main の先頭あたりが以下。

def main():
    random.seed(1)

    # get the data
    pool, firsts, others = cumulative.MakeTables()
    mean_var = thinkstats.MeanVar(pool.lengths)
    print '(Mean, Var) of pooled data', mean_var

ええと、cumulative.MakeTables が以下。

def MakeTables(data_dir='.'):
    """Reads survey data and returns a tuple of Tables"""
    table, firsts, others = first.MakeTables(data_dir)
    pool = descriptive.PoolRecords(firsts, others)

    Process(pool, 'live births')
    Process(firsts, 'first babies')
    Process(others, 'others')
        
    return pool, firsts, others

第一子とそれ以外なデータに分けて descriptive.PoolRecords で何してるのか。

def PoolRecords(*tables):
    """Construct a table with records from all tables.
    
    Args:
        constructor: init method used to make the new table
    
        tables: any number of tables

    Returns:
        new table object
    """
    pool = survey.Pregnancies()
    for table in tables:
        pool.ExtendRecords(table.records)
    return pool

これ、第一子とそれ以外の値のリストを結合している模様。ExtendRecords の定義が以下。

    def ExtendRecords(self, records):
        """Adds records to this table.

        Args:
            records: a sequence of record object
        """
        self.records.extend(records)

何故に table をそのまんま使わないのか、というあたりが微妙。
ここはスルーで pool の平均と分散を、って pool.length というソレが微妙。多分妊娠期間のはずなんだけどソースが追えてない。

あった

cumulative.MakeTables から cumulative.Process を呼び出してて、length を設定してました。

def Process(table, name):
    """Runs various analyses on this table.

    Creates instance variables:
        weights: sequence of int total weights in ounces
        weight_pmf: Pmf object
        weight_cdf: Cdf object
        oz_pmf: Pmf of just the ounce field
    """
    descriptive.Process(table, name)

    table.weights = [p.totalwgt_oz for p in table.records
                     if p.totalwgt_oz != 'NA']
    table.weight_pmf = Pmf.MakePmfFromList(table.weights, table.name)
    table.weight_cdf = Cdf.MakeCdfFromList(table.weights, table.name)

ここから呼び出している descriptive.Process からも属性が設定されてますね。なんつーか色々な場所で属性追加してる、というのも凄いな。
descriptive.Process の定義が以下。

def Process(table, name):
    """Runs various analyses on this table."""
    first.Process(table)
    table.name = name

    table.var = thinkstats.Var(table.lengths, table.mu)
    table.trim = thinkstats.TrimmedMean(table.lengths)

    table.hist = Pmf.MakeHistFromList(table.lengths, name=name)
    table.pmf = Pmf.MakePmfFromHist(table.hist)

あと、first.Process が以下でここで length が追加されてます。

def Process(table):
    """Runs analysis on the given table.
    
    Args:
        table: table object
    """
    table.lengths = [p.prglength for p in table.records]
    table.n = len(table.lengths)
    table.mu = Mean(table.lengths)

これよくよく見てみるに属性に平均と分散持ってますね。で、main に戻って RunTest を呼び出しています。

    # run the test
    RunTest('length', 
            pool.lengths,
            firsts.lengths, 
            others.lengths, 
            iters=1000,
            trim=False,
            partition=False)

pool (全部)、firsts (第一子)、others (第一子以外) の平均妊娠期間なリストを渡してる、という事か。

RunTest

端折って以下。

    # P(E|H0)
    peh0 = Test(root + '_deltas_cdf',
                actual1, actual2, pool, pool,
                iters, plot=True)

actual1 が first で actual2 が others です。Test は p 値を戻す手続き。核心部分が以下。

    n = len(actual1)
    m = len(actual2)
    
    mu1, mu2, delta = DifferenceInMean(actual1, actual2)
    delta = abs(delta)

    cdf, pvalue = PValue(model1, model2, n, m, delta, iters)

ちょっと時間切れ。続きは別途で。