Pearson product-moment correlation coefficient (r) and t-test on it

An index of the relationship between x and y is the correlation coefficient, or Pearson product-moment correlation coefficient, given by the formula below. The correlation coefficient ranges from -1 to 1.

\displaystyle r = \frac{\sum_{i=1}^n(x_i-\bar x)(y_i - \bar y)}{\sqrt{\sum_{i=1}^n(x_i - \bar x)^2}\sqrt{\sum_{i=1}^n(y_i - \bar y)^2}}

\bar x and \bar y are the means of x and y, respectively; i is the sample index and n is the sample size.

The t statistic computed from the correlation coefficient r of two variables sampled at random from a population follows a t distribution. It is calculated by the formula below and has n - 2 degrees of freedom, where n is the sample size. Let \rho be the population correlation coefficient; the null hypothesis is that "\rho = 0". If the t statistic calculated from the sample size n and the correlation coefficient r exceeds the critical t value for the significance level \alpha, the null hypothesis is rejected.

\displaystyle t = r\sqrt{\frac{n - 2}{1 - r^2}}
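As a concrete check of the two formulas above, the following minimal sketch computes r and its t statistic in plain Python; the data values are made up purely for illustration.

```python
import math

# Hypothetical sample data (for illustration only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

n = len(x)
mx = sum(x) / n
my = sum(y) / n

# Pearson r from the definition: covariance term over product of deviations
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)
r = sxy / (math.sqrt(sxx) * math.sqrt(syy))

# t statistic with n - 2 degrees of freedom
t = r * math.sqrt((n - 2) / (1 - r ** 2))

print(round(r, 3), round(t, 3))  # → 0.8 2.309
```

For these numbers r = 0.8 and t ≈ 2.309, which would then be compared with the critical t value for n - 2 = 3 degrees of freedom at the chosen significance level.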

The test of significance for this important null hypothesis H (ρ = 0) is equivalent to that for the null hypothesis H (β1 = 0) or H (β2 = 0). It now follows that if x and y have a joint bivariate normal distribution, then the test for the null hypothesis H (ρ = 0) is obtained by using the fact that if the null hypothesis under test is true, then

\displaystyle F = \frac{(n-2)Z^2}{XY-Z^2} = \frac{(n-2)r^2}{1-r^2}\vspace{0.1in}\\ X = \sum(x - \bar{x})^2\vspace{0.1in}\\ Y = \sum(y - \bar{y})^2\vspace{0.1in}\\ Z = \sum(x - \bar{x})(y - \bar{y})\vspace{0.1in}\\ r^2 = \frac{Z^2}{XY}

has the F distribution with 1, n – 2 d.f. An equivalent test of significance for the null hypothesis is obtained by using the fact that if the null hypothesis is true, then

\displaystyle t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}

has “Student’s” t distribution with n – 2 d.f.
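The equivalence of the F test and the t test can be verified numerically: an F-distributed statistic with (1, m) degrees of freedom is the square of a t-distributed statistic with m degrees of freedom, so F = t². A small sketch with assumed values r = 0.8, n = 5:

```python
import math

# Assumed values for illustration
r, n = 0.8, 5

# F statistic from the identity F = (n - 2) r^2 / (1 - r^2)
F = (n - 2) * r ** 2 / (1 - r ** 2)

# t statistic; F should equal t^2 since F(1, m) = t(m)^2
t = r * math.sqrt((n - 2) / (1 - r ** 2))

print(round(F, 4), round(t ** 2, 4))  # both ≈ 5.3333
```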

For any non-zero null hypothesis about ρ there is no parallelism between the correlation coefficient ρ and the regression coefficients β1 and β2. In fact, no exact test of significance is available for readily testing non-zero null hypotheses about ρ. Fisher has given an approximate method for such null hypotheses, but we do not consider this here.

The Pearson product-moment correlation coefficient r and the t-test on it

The Pearson product-moment correlation coefficient r is an index of the degree of correlation between x and y, expressed by the formula below. The correlation coefficient r ranges from -1 to 1.

\displaystyle r = \frac{\sum_{i=1}^n(x_i-\bar x)(y_i - \bar y)}{\sqrt{\sum_{i=1}^n(x_i - \bar x)^2}\sqrt{\sum_{i=1}^n(y_i - \bar y)^2}}

\bar x and \bar y are the means of x and y, respectively; i is the sample index and n is the sample size.

Because the t statistic computed from the correlation coefficient r between two variables in a sample drawn at random from a population follows a t distribution, a significance test can be performed.

The t statistic for the correlation coefficient r is obtained from the formula below and, with sample size n, follows a t distribution with n - 2 degrees of freedom. Let \rho be the population correlation coefficient and take the null hypothesis to be "the correlation coefficient \rho = 0". If the t statistic computed from the sample size n and the correlation coefficient r exceeds the t value corresponding to the significance level \alpha, the null hypothesis is rejected.

\displaystyle t = r\sqrt{\frac{n - 2}{1 - r^2}}

I previously received a question about the significance test for Pearson's product-moment correlation coefficient r; having obtained a reference, I add this note. I had actually hoped for a proof, but was disappointed. The reference states that no exact method is available for readily testing non-zero null hypotheses about \rho, and that Fisher provides an approximate method for such null hypotheses, but it says nothing further. My translation of the passage follows.

The test of the null hypothesis H (ρ = 0) is equivalent to that of the null hypothesis H (β1 = 0) or H (β2 = 0). If x and y have a joint bivariate normal distribution, then the test for the null hypothesis H (ρ = 0) is obtained by using the fact that, if the null hypothesis under test is true, the following holds.

\displaystyle F = \frac{(n-2)Z^2}{XY-Z^2} = \frac{(n-2)r^2}{1-r^2}\vspace{0.1in}\\ X = \sum(x - \bar{x})^2\vspace{0.1in}\\ Y = \sum(y - \bar{y})^2\vspace{0.1in}\\ Z = \sum(x - \bar{x})(y - \bar{y})\vspace{0.1in}\\ r^2 = \frac{Z^2}{XY}

The F statistic above follows the F distribution with (1, n - 2) degrees of freedom. As an equivalent test of significance, if the null hypothesis is true, the following statistic follows Student's t distribution with n - 2 degrees of freedom.

\displaystyle t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}

For any non-zero null hypothesis about ρ, there is no parallelism between the correlation coefficient ρ and the regression coefficients β1 and β2. In fact, no exact method is available for readily testing non-zero null hypotheses about ρ. Fisher provides an approximate method for such null hypotheses, but it is not treated here.

How to perform multiple comparisons

Student’s t-test compares the means of two groups. What should you do when you want to compare the means of three or more groups? The test requires a two-step process.

  1. Analysis of variance (ANOVA)
  2. Comparison between each pair of groups

1. Analysis of variance

In analysis of variance, the null hypothesis is that all groups belong to one population. Therefore, if the null hypothesis is rejected, it follows that not all groups belong to the same population.

If all groups belong to one population, the mean over all groups, called the grand mean (MG), should be close to the population mean. Furthermore, if all groups belong to one population, the mean of each group (for example, M1, M2, M3) should be close to the grand mean. Conversely, if some group belongs to a different population, its mean will be far from MG. We therefore need an indicator of how far each group mean lies from the grand mean, weighted by each group's sample size n. It is called the mean square among groups (MSA).

\displaystyle MSA = \frac{\sum_{i=1}^k n_i (M_i - MG)^2}{k-1}

MSA; mean square among groups. n_i; sample size of group i. i; group index. k; number of groups. M_i; mean of group i. MG; grand mean.

Next, calculate the variance of each group and pool the variances, weighting each by its degrees of freedom, to obtain an index of the within-group variability. It is the mean square of error (MSE), the average within-group variance.

\displaystyle MSE = \frac{\sum_{i=1}^k (n_i - 1)V_i}{\sum_{i=1}^{k}(n_i - 1)}

MSE; mean square of error. n_i; sample size of group i. i; group index. k; number of groups. V_i; variance of group i.

\displaystyle V = \frac{\sum(x - \bar x)^2}{n-1}

x; each value in a group. n; sample size of the group.

The F statistic, calculated as the ratio of MSA to MSE, follows an F distribution with k - 1 and \sum_{i=1}^{k}(n_i - 1) degrees of freedom. When the F statistic exceeds the critical value for the chosen significance level, the null hypothesis is rejected and you may compare the means between each pair of groups.

\displaystyle F=\frac{MSA}{MSE}
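The computation of MSA, MSE, and F described above can be sketched in plain Python; the three groups below are hypothetical numbers chosen only for illustration.

```python
# Three hypothetical groups (illustrative numbers only)
groups = [
    [5.0, 6.0, 7.0],
    [6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0],
]

k = len(groups)
ns = [len(g) for g in groups]
means = [sum(g) / len(g) for g in groups]
grand_mean = sum(sum(g) for g in groups) / sum(ns)

# MSA: spread of group means around the grand mean, weighted by group size
msa = sum(n_i * (m - grand_mean) ** 2
          for n_i, m in zip(ns, means)) / (k - 1)

# MSE: within-group variances pooled by their degrees of freedom
variances = [sum((x - m) ** 2 for x in g) / (n_i - 1)
             for g, m, n_i in zip(groups, means, ns)]
mse = (sum((n_i - 1) * v for n_i, v in zip(ns, variances))
       / sum(n_i - 1 for n_i in ns))

F = msa / mse
print(round(F, 3))  # → 13.0
```

Here MSA = 13.0 and MSE = 1.0, so F = 13.0, which would be compared with the critical value of the F distribution with (k - 1, Σ(n_i - 1)) = (2, 6) degrees of freedom.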

2. Compare between each 2 groups

If the null hypothesis is rejected by ANOVA, you can compare between each pair of groups with one of the following methods.

  • Bonferroni method
  • Tukey’s HSD
  • Dunnett’s procedure
  • Hsu’s MCB tests
  • Scheffé’s procedure

The Bonferroni method may be the easiest to understand and use. Dividing the significance level \alpha by k, the number of pairwise comparisons, gives the Bonferroni-corrected significance level \alpha/k. See the following chart.

[Figure: Bonferroni corrected significance level]
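A minimal sketch of the Bonferroni correction, assuming three hypothetical groups A, B, C and an overall significance level of 0.05:

```python
from itertools import combinations

# Hypothetical setup: 3 groups, overall significance level 0.05
group_names = ["A", "B", "C"]
alpha = 0.05

# All pairwise comparisons: C(3, 2) = 3 pairs
pairs = list(combinations(group_names, 2))
k = len(pairs)

# Bonferroni-corrected per-comparison significance level
alpha_corrected = alpha / k
print(pairs, round(alpha_corrected, 4))  # 3 pairs, ≈ 0.0167
```

Each of the three pairwise tests would then be judged against \alpha/k ≈ 0.0167 instead of 0.05, keeping the family-wise error rate at or below 0.05.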

To perform multiple comparisons, first run an analysis of variance, then compare between each pair of groups

We used Student's t-test to test whether the means of two groups differ. This time I describe how to test whether the means of three groups differ. The test is performed in two steps.

  1. Analysis of variance (ANOVA)
  2. Comparison between each pair of groups

1. Analysis of variance

In analysis of variance, the null hypothesis is that all groups belong to the same population. If this can be rejected, it follows that not all groups belong to the same population. The method is described below.

If all groups belong to the same population, the mean over all samples in the three groups (the grand mean, MG) should be close to the population mean. Furthermore, if all groups belong to the same population, each group mean (M1, M2, M3) should be close to MG. Conversely, if the three groups belong to different populations, M1, M2, and M3 will lie far from MG. We therefore express, with the formula below, an indicator of how far each group mean lies from the grand mean, weighted by each group's sample size n. This is the mean square among groups (MSA).

\displaystyle MSA = \frac{\sum_{i=1}^k n_i (M_i - MG)^2}{k-1}

MSA; mean square among groups. n_i; sample size of group i. i; group index. k; number of groups. M_i; mean of group i. MG; grand mean.

Next, find the variance of each group and weight it by each group's sample size to obtain an index of per-sample variability. This is the average of the within-group variances (MSE).

\displaystyle MSE = \frac{\sum_{i=1}^k (n_i - 1)V_i}{\sum_{i=1}^{k}(n_i - 1)}

MSE; mean square of error. n_i; sample size of group i. i; group index. k; number of groups. V_i; variance of group i.

\displaystyle V = \frac{\sum(x - \bar x)^2}{n-1}

x; each value in a group. n; sample size of the group.

Taking the ratio of MSA to MSE as in the formula below, MSA/MSE follows an F distribution. When the value of F exceeds the critical value, the null hypothesis is rejected, it follows that not all groups belong to the same population, and comparison between each pair of groups becomes possible.

\displaystyle F=\frac{MSA}{MSE}

2. Comparison between each pair of groups

After ANOVA has shown that not all groups belong to the same population, there are several methods for comparing between each pair of groups.

  • Bonferroni method
  • Tukey’s HSD
  • Dunnett’s procedure
  • Hsu’s MCB tests
  • Scheffé’s procedure

The Bonferroni method is easy to understand, so I describe it. It divides the significance level \alpha by the number of pairwise comparisons k and uses \alpha/k as the significance level. The chart below shows how the significance level changes with and without the Bonferroni correction.

[Figure: Bonferroni corrected significance level]

How to estimate the 95 % confidence interval of the population mean from the sample mean and sample standard deviation

If you know the mean and standard deviation of a sample of size N, you can estimate a range that contains the population mean with 95 % probability. In the standard normal distribution, the area under the curve above 1.96 plus the area below -1.96 sums to 0.05. The population mean lies between the sample mean minus 1.96 times the standard error and the sample mean plus 1.96 times the standard error; this range is called the 95 % confidence interval (95 % CI).

\displaystyle 95 \% C.I.= \mu \pm 1.96 SE = \mu \pm 1.96 \frac{SD}{\sqrt N}

\mu; average. SE; standard error. SD; standard deviation. N; sample size.
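A minimal sketch of the interval computation, using assumed summary values (mean 100, SD 15, N = 36) chosen only for illustration:

```python
import math

# Hypothetical sample summary (assumed values)
mean, sd, n = 100.0, 15.0, 36

se = sd / math.sqrt(n)      # standard error = 15 / 6 = 2.5
lower = mean - 1.96 * se    # ≈ 95.1
upper = mean + 1.96 * se    # ≈ 104.9
print(round(lower, 1), round(upper, 1))  # → 95.1 104.9
```

So the 95 % CI is roughly (95.1, 104.9): if sampling were repeated many times, about 95 % of intervals constructed this way would contain the population mean.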

Estimating the 95 % confidence interval of the population mean from the sample mean and sample standard deviation

For a continuous variable, the distribution of the mean of a sample of size N gives the range that contains the population mean with 95 % probability. In the standard normal distribution, the area under the curve above 1.96 plus the area below -1.96 sums to 0.05. That is, the true mean lies between the mean minus 1.96 times the standard error SE and the mean plus 1.96 times SE. This range is called the 95 % confidence interval.

\displaystyle 95 \% C.I.= \mu \pm 1.96 SE = \mu \pm 1.96 \frac{SD}{\sqrt N}

\mu; average. SE; standard error. SD; standard deviation. N; sample size.