The danger of judging a hypothesis test by the p-value alone - Notes from an engineer who has completely understood programming

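What is hypothesis testing?

Hypothesis testing is a procedure that sets up a null hypothesis ("there is no difference between the populations") and its opposite, the alternative hypothesis ("there is a difference between the populations"), and then uses the data to decide between them.

Null hypothesis (H0): there is no difference between the things being compared (A = B)
Alternative hypothesis (H1): there is a difference between the things being compared (A ≠ B)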

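What is a p-value?

The p-value is the yardstick for deciding whether a test result could be a fluke: it is the probability, assuming the null hypothesis is true, of getting a result at least as extreme as the one observed. It is compared against the significance level (the probability assigned to the rejection region), which is a threshold fixed before the test. The two are related but are not the same thing.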

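Testing

p-value (computed from the test statistic) < significance level => reject the null hypothesis (H0) = adopt the alternative hypothesis (H1) => there is a significant difference!

With a significance level of 0.05, "p < 0.05" means that if there were really no difference, a result at least this extreme would turn up by chance 5% of the time or less. It does not mean "there is a 95% probability that the result is not a fluke."

Commonly used significance levels are 0.01, 0.05, and 0.1.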
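To make the role of the significance level concrete, here is a minimal simulation sketch in R. It assumes two groups drawn from the same normal distribution, so the null hypothesis is true by construction. The seed, the group size of 30, the 2000 repetitions, and the variable names are all arbitrary choices for illustration.

```r
# Simulate many experiments in which H0 is true
# (both groups come from the same distribution).
set.seed(1)                      # arbitrary seed, for reproducibility only
alpha <- 0.05                    # significance level

p_values <- replicate(2000, {
  a <- rnorm(30)                 # group A
  b <- rnorm(30)                 # group B, same distribution as A
  t.test(a, b)$p.value           # two-sample t-test (Welch by default)
})

# Fraction of experiments that reject H0 even though it is true
false_positive_rate <- mean(p_values < alpha)
print(false_positive_rate)
```

The printed fraction should land close to alpha. That is what the significance level controls: how often a true null hypothesis gets rejected by chance, not how likely the hypothesis is to be true.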

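If the amount of data grows, won't the p-value approach 0 and make any data significant?

That made me wonder. If the p-value approaches 0 as the sample size grows, then p < 0.05 will eventually hold, and any test will reject the null hypothesis and conclude that there is a significant difference. Let me check this with a familiar example.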

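(Example) Testing the difference in population proportions for an ad A/B test

Null hypothesis: there is no difference between A and B
Alternative hypothesis: there is a difference between A and B

Test at the 5% significance level. (The choice of 5% is itself debated, but I will set that aside for now.)

	Impressions	Clicks	Click-through rate
Ad A	99	49	49.5%
Ad B	102	50	49.0%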

> prop.test(c( 49 , 50 ),c( 99 , 102 ))

  2-sample test for equality of proportions with continuity correction

data:  c(49, 50) out of c(99, 102)
X-squared = 1.1651e-30, df = 1, p-value = 1
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.1382439  0.1477508
sample estimates:
   prop 1    prop 2
0.4949495 0.4901961

p > 0.05, so the null hypothesis cannot be rejected: no significant difference is detected.
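As a side check on why roughly 100 impressions per ad cannot detect a gap this small, here is a rough power calculation using `power.prop.test` from R's stats package. The rates 0.495 and 0.490 are rounded from the table above, and the 80% power target is a conventional but arbitrary assumption, so treat the result as an order-of-magnitude sketch.

```r
# Impressions needed per ad to detect 49.5% vs 49.0%
# at the 5% significance level with 80% power (assumed target).
needed <- power.prop.test(p1 = 0.495, p2 = 0.490,
                          sig.level = 0.05, power = 0.8)

n_per_group <- ceiling(needed$n)   # round up to whole impressions
print(n_per_group)
```

The required count per ad is far above 100, so a non-significant result at this sample size says very little about whether a difference of that size exists.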

Now multiply the data by 1000.

	Impressions	Clicks	Click-through rate
Ad A	99000	49000	49.5%
Ad B	102000	50000	49.0%

> n = 1000; prop.test(c( n*49 , n*50 ),c( n*99 , n*102 ))

  2-sample test for equality of proportions with continuity correction

data:  c(n * 49, n * 50) out of c(n * 99, n * 102)
X-squared = 4.5226, df = 1, p-value = 0.03345
alternative hypothesis: two.sided
95 percent confidence interval:
 0.0003718071 0.0091350259
sample estimates:
   prop 1    prop 2
0.4949495 0.4901961

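p < 0.05, so the null hypothesis is rejected and a difference is judged to exist.

As the sample size grows, the p-value shrinks toward 0 and the result becomes significant (p < 0.05), so the hypothesis we adopt changes (= the conclusion flips) even though the click-through rates are exactly the same.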
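The two runs above can be extended into a small sweep. This sketch reuses the same `prop.test` call and only varies the multiplier; the particular multipliers and the variable names are my own choices for illustration.

```r
# Same click-through rates, scaled by a growing multiplier
multipliers <- c(1, 10, 100, 1000, 10000)

p_values <- sapply(multipliers, function(n) {
  prop.test(c(49, 50) * n, c(99, 102) * n)$p.value
})

print(data.frame(multiplier = multipliers, p_value = p_values))
```

The rates never change, yet the p-value falls steadily as the multiplier grows and crosses 0.05 by the 1000x row.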

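Summary

Because the p-value can be pushed as low as you like by adding data, you cannot decide based on the p-value alone
With a very large sample, relying on a hypothesis test by itself is a mistake: even a negligible difference comes out significant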
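One way to look beyond the p-value is to report the size of the difference next to it. The sketch below does that for the 1000x data. Cohen's h is an effect size I am bringing in for illustration; it is not part of the original analysis, and values below about 0.2 are conventionally read as small.

```r
# Scaled-up A/B data from the example
clicks <- c(49, 50) * 1000
impressions <- c(99, 102) * 1000
rates <- clicks / impressions

result <- prop.test(clicks, impressions)

# Difference in click-through rate, in percentage points
diff_points <- (rates[1] - rates[2]) * 100

# Cohen's h: an effect size for two proportions (arcsine transformation)
cohens_h <- 2 * asin(sqrt(rates[1])) - 2 * asin(sqrt(rates[2]))

cat(sprintf("p-value: %.4f\n", result$p.value))
cat(sprintf("difference: %.2f points\n", diff_points))
cat(sprintf("95%% CI: [%.4f, %.4f]\n", result$conf.int[1], result$conf.int[2]))
cat(sprintf("Cohen's h: %.4f\n", cohens_h))
```

The test is significant, but the difference is under half a percentage point and the effect size is tiny. Whether that matters for the ad is a business question the p-value cannot answer.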

The American Statistical Association has also issued a statement: AMERICAN STATISTICAL ASSOCIATION RELEASES STATEMENT ON STATISTICAL SIGNIFICANCE AND P-VALUES

P-values can indicate how incompatible the data are with a specified statistical model.

P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.

Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.

Proper inference requires full reporting and transparency.

A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.

By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

Sites I found interesting

tjo.hatenablog.com

abrahamcow.hatenablog.com

www.gixo.jp

techlife.cookpad.com