When should you execute Fisher exact test, not chi-square test?

You should not execute chi-square test but Fisher exact probability test when gland total of cross tabulation was smaller than 20 or one or greater than one cells had smaller than 5 expected value. In this article, I would like to describe how to solve expected value. Expected value is calculated with marginal total.

We have cross tabulation below;

  TRUE FALSE Marginal total
POSITIVE a b a + b
NEGATIVE c d c + d
Marginal total a + c b + d N 

Expected value of each cells is below;

  TRUE FALSE Marginal total
POSITIVE (a + b)*(a + c)/N (a + b)*(b + d)/N a + b
NEGATIVE (c + d)*(a + c)/N (c + d)*(b + d)/N c + d
Marginal total a + c b + d N 

How to calculate Fisher’s exact test with logarithm?

χ二乗検定ではなく Fisher の直接確率検定を行うべき状況とは

 クロス表(四分表)を見て,各セルの期待値が 5 未満の割合が 20 % 以上存在する場合や,総数 N が 20 未満の場合にはχ二乗検定ではなく Fisher の直接確率検定を行うべきであるとされています.この辺り,本来なら Fisher の直接確率検定を行いたいが計算コストが高すぎてχ二乗検定で代用せざるを得なかった経緯があるのではないかと考えます.

 ここでは各セルの期待値を求める方法を述べます.一見して明らかですが,期待値は周辺度数 (marginal total) のみから算出されます.

 下表のようなクロス表があるとします.

  TRUE FALSE Marginal total
POSITIVE a b a + b
NEGATIVE c d c + d
Marginal total a + c b + d N 

 各セルの期待値は下記の通りです.

  TRUE FALSE Marginal total
POSITIVE (a + b)*(a + c)/N (a + b)*(b + d)/N a + b
NEGATIVE (c + d)*(a + c)/N (c + d)*(b + d)/N c + d
Marginal total a + c b + d N 

参照記事
対数を用いてFisherの直接確率検定を計算するには

Execute Fisher’s exact test with T-SQL

Fisher’s probability is function depend on cut-off value. In this article, you could draw scatter plot ROC curve and get Fisher’s probability with query below. In ROC curve, you would plot false positive rate on horizontal axis and sensitivity on vertical axis. You could plot cut-off value on horizontal axis and Fisher’s probability on vertical axis. You could get best cut-off value with minimum probability in line graph if minimum probability would be smaller than 0.05.

CREATE TABLE [dbo].[T_DATA]
    (    ID       nchar(8)         NOT NULL, PRIMARY KEY
    ,    Test     decimal(4, 2)    NOT NULL
    ,    Outcome  nchar(1)         NOT NULL
    );
GO
ALTER TABLE T_DATA ADD CONSTRAINT CK_Outcome CHECK (Outcome = '0' OR Outcome = '1');
GO

Run the code below to create function, converting from factorial to sum of logarithm.

CREATE FUNCTION [dbo].[LOG_FACT](
    @SrcNumber FLOAT
)
RETURNS FLOAT
BEGIN
    DECLARE @DesNumber   FLOAT
    SET @DesNumber = LOG(1)
    WHILE @SrcNumber > 0
    BEGIN
        SET @DesNumber = @DesNumber + LOG(@SrcNumber)
        SET @SrcNumber = @SrcNumber - 1
    END
    RETURN @DesNumber
END
GO

Run the code below to create stored procedure that creates cross table from T_DATA.

CREATE PROCEDURE [dbo].[sp_Cut_by_Test]
    @CutOff  decimal(4, 2)
AS
BEGIN
WITH Cross_Table AS
(
  SELECT COUNT(*) AS 'N'
    ,    SUM(CASE WHEN T_DATA.Test <= @CutOff AND T_DATA.Outcome = '1' THEN 1 ELSE 0 END) AS 'a'
    ,    SUM(CASE WHEN T_DATA.Test <= @CutOff AND T_DATA.Outcome = '0' THEN 1 ELSE 0 END) AS 'b'
    ,    SUM(CASE WHEN T_DATA.Test >  @CutOff AND T_DATA.Outcome = '1' THEN 1 ELSE 0 END) AS 'c'
    ,    SUM(CASE WHEN T_DATA.Test >  @CutOff AND T_DATA.Outcome = '0' THEN 1 ELSE 0 END) AS 'd'
    ,    SUM(CASE WHEN                            T_DATA.Outcome = '1' THEN 1 ELSE 0 END) AS 'a+c'
    ,    SUM(CASE WHEN                            T_DATA.Outcome = '0' THEN 1 ELSE 0 END) AS 'b+d'
    ,    SUM(CASE WHEN T_DATA.Test <= @CutOff                          THEN 1 ELSE 0 END) AS 'a+b'
    ,    SUM(CASE WHEN T_DATA.Test >  @CutOff                          THEN 1 ELSE 0 END) AS 'c+d'
    FROM T_DATA
)
  SELECT @CutOff
    ,    Cross_Table.[N]
    ,    Cross_Table.[a]
    ,    Cross_Table.[b]
    ,    Cross_Table.
    ,    Cross_Table.[d]
    ,    Cross_Table.[a+c]
    ,    Cross_Table.[b+d]
    ,    Cross_Table.[a+b]
    ,    Cross_Table.
    ,    Cross_Table.[a]/Cross_Table.[a+c] AS 'Sensitivity'
    ,    Cross_Table.[d]/Cross_Table.[b+d] AS 'Specificity'
    ,    1 - Cross_Table.[d]/Cross_Table.[b+d] AS 'FalsePositive'
    FROM Cross_Table;
END
GO

Run the code below to create stored procedure that calculates Fisher’s probability. @Start means minimum value of test, @End means maximum value of test and @Step means step value of test from @Start to @End. For example, @Start 2.0, @End 4.0 and @Step 0.1, respectively.

CREATE PROCEDURE [dbo].[FisherExactTest]
    (    @Start  decimal(4, 2)
    ,    @End    decimal(4, 2)
    ,    @Step   decimal(4, 2)
    )
AS
BEGIN
    CREATE TABLE #Result
    (    [CutOff]    decimal(4, 2)    NOT NULL
    ,    N    int    NOT NULL
    ,    a    int    NOT NULL
    ,    b    int    NOT NULL
    ,    c    int    NOT NULL
    ,    d    int    NOT NULL
    ,    [a+c]    int    NOT NULL
    ,    [b+d]    int    NOT NULL
    ,    [a+b]    int    NOT NULL
    ,        int    NOT NULL
    ,    Sensitivity    FLOAT    NOT NULL
    ,    Specificity    FLOAT    NOT NULL
    ,    FalsePositive    FLOAT    NOT NULL
    )
    DECLARE @CutOff  decimal(4, 2)
    SET @CutOff = @Start
    WHILE @CutOff <= @End
    BEGIN
       INSERT INTO #Result EXEC sp_Cut_by_Test @CutOff
       SET @CutOff = @CutOff + @Step
    END
  SELECT CutOff
    ,    N
    ,    a
    ,    b
    ,    c
    ,    d
    ,    [a+c]
    ,    [b+d]
    ,    [a+b]
    ,    [c+d]
    ,    Sensitivity
    ,    Specificity
    ,    FalsePositive
    ,    EXP(dbo.LOG_FACT([a+b])
         +   dbo.LOG_FACT([c+d])
         +   dbo.LOG_FACT([a+c])
         +   dbo.LOG_FACT([b+d])
         -   dbo.LOG_FACT(N)
         -   dbo.LOG_FACT(CASE WHEN a = 0 THEN 1 ELSE a END)
         -   dbo.LOG_FACT(CASE WHEN b = 0 THEN 1 ELSE b END)
         -   dbo.LOG_FACT(CASE WHEN c = 0 THEN 1 ELSE c END)
         -   dbo.LOG_FACT(CASE WHEN d = 0 THEN 1 ELSE d END)) AS [FisherExact_Test]
    FROM #Result
   ORDER BY CutOff
END

References;
How to calculate Fisher’s exact test with logarithm?
How to calculate four numbers from marginal total in cross tabulation?
Stored procedure to calculate factorial with natural logarithm

T-SQLでFisherの直接確率検定を行う

 Fisher の確率はカットオフ値の関数であると以前書きました.今回は下記のクエリで ROC 曲線を描き,Fisherの直接確率検定を行います.ROC 曲線では横軸に偽陽性率を取り,縦軸に感度を取ります.カットオフ値を横軸に取り,Fisher の確率を縦軸にグラフを描くと,グラフの最小値に該当するカットオフ値が求めるものとなります.ただし,p < 0.05 であることが条件です.

CREATE TABLE [dbo].[T_DATA]
    (    ID       nchar(8)         NOT NULL, PRIMARY KEY
    ,    Test     decimal(4, 2)    NOT NULL
    ,    Outcome  nchar(1)         NOT NULL
    );
GO
ALTER TABLE T_DATA ADD CONSTRAINT CK_Outcome CHECK (Outcome = '0' OR Outcome = '1');
GO

 下記の関数は階乗を対数の和に変換します.

CREATE FUNCTION [dbo].[LOG_FACT](
    @SrcNumber FLOAT
)
RETURNS FLOAT
BEGIN
    DECLARE @DesNumber   FLOAT
    SET @DesNumber = LOG(1)
    WHILE @SrcNumber > 0
    BEGIN
        SET @DesNumber = @DesNumber + LOG(@SrcNumber)
        SET @SrcNumber = @SrcNumber - 1
    END
    RETURN @DesNumber
END
GO

 下記のストアドプロシージャはカットオフ値を変数化して T_DATA からクロス表を作成します.

CREATE PROCEDURE [dbo].[sp_Cut_by_Test]
    @CutOff  decimal(4, 2)
AS
BEGIN
WITH Cross_Table AS
(
  SELECT COUNT(*) AS 'N'
    ,    SUM(CASE WHEN T_DATA.Test <= @CutOff AND T_DATA.Outcome = '1' THEN 1 ELSE 0 END) AS 'a'
    ,    SUM(CASE WHEN T_DATA.Test <= @CutOff AND T_DATA.Outcome = '0' THEN 1 ELSE 0 END) AS 'b'
    ,    SUM(CASE WHEN T_DATA.Test >  @CutOff AND T_DATA.Outcome = '1' THEN 1 ELSE 0 END) AS 'c'
    ,    SUM(CASE WHEN T_DATA.Test >  @CutOff AND T_DATA.Outcome = '0' THEN 1 ELSE 0 END) AS 'd'
    ,    SUM(CASE WHEN                            T_DATA.Outcome = '1' THEN 1 ELSE 0 END) AS 'a+c'
    ,    SUM(CASE WHEN                            T_DATA.Outcome = '0' THEN 1 ELSE 0 END) AS 'b+d'
    ,    SUM(CASE WHEN T_DATA.Test <= @CutOff                          THEN 1 ELSE 0 END) AS 'a+b'
    ,    SUM(CASE WHEN T_DATA.Test >  @CutOff                          THEN 1 ELSE 0 END) AS 'c+d'
    FROM T_DATA
)
  SELECT @CutOff
    ,    Cross_Table.[N]
    ,    Cross_Table.[a]
    ,    Cross_Table.[b]
    ,    Cross_Table.
    ,    Cross_Table.[d]
    ,    Cross_Table.[a+c]
    ,    Cross_Table.[b+d]
    ,    Cross_Table.[a+b]
    ,    Cross_Table.
    ,    Cross_Table.[a]/Cross_Table.[a+c] AS 'Sensitivity'
    ,    Cross_Table.[d]/Cross_Table.[b+d] AS 'Specificity'
    ,    1 - Cross_Table.[d]/Cross_Table.[b+d] AS 'FalsePositive'
    FROM Cross_Table;
END
GO

 下記のストアドプロシージャはクロス表から Fisher の確率を求めます.@Start とは test の最小値であり, @End は test の最大値, @Step は @Start から @End までの増分のことです.例えばそれぞれ @Start を 2.0 とし,@End を 4.0 とし,@Step を 0.1 とするなどです.

CREATE PROCEDURE [dbo].[FisherExactTest]
    (    @Start  decimal(4, 2)
    ,    @End    decimal(4, 2)
    ,    @Step   decimal(4, 2)
    )
AS
BEGIN
    CREATE TABLE #Result
    (    [CutOff]    decimal(4, 2)    NOT NULL
    ,    N    int    NOT NULL
    ,    a    int    NOT NULL
    ,    b    int    NOT NULL
    ,    c    int    NOT NULL
    ,    d    int    NOT NULL
    ,    [a+c]    int    NOT NULL
    ,    [b+d]    int    NOT NULL
    ,    [a+b]    int    NOT NULL
    ,        int    NOT NULL
    ,    Sensitivity    FLOAT    NOT NULL
    ,    Specificity    FLOAT    NOT NULL
    ,    FalsePositive    FLOAT    NOT NULL
    )
    DECLARE @CutOff  decimal(4, 2)
    SET @CutOff = @Start
    WHILE @CutOff <= @End
    BEGIN
       INSERT INTO #Result EXEC sp_Cut_by_Test @CutOff
       SET @CutOff = @CutOff + @Step
    END
  SELECT CutOff
    ,    N
    ,    a
    ,    b
    ,    c
    ,    d
    ,    [a+c]
    ,    [b+d]
    ,    [a+b]
    ,    [c+d]
    ,    Sensitivity
    ,    Specificity
    ,    FalsePositive
    ,    EXP(dbo.LOG_FACT([a+b])
         +   dbo.LOG_FACT([c+d])
         +   dbo.LOG_FACT([a+c])
         +   dbo.LOG_FACT([b+d])
         -   dbo.LOG_FACT(N)
         -   dbo.LOG_FACT(CASE WHEN a = 0 THEN 1 ELSE a END)
         -   dbo.LOG_FACT(CASE WHEN b = 0 THEN 1 ELSE b END)
         -   dbo.LOG_FACT(CASE WHEN c = 0 THEN 1 ELSE c END)
         -   dbo.LOG_FACT(CASE WHEN d = 0 THEN 1 ELSE d END)) AS [FisherExact_Test]
    FROM #Result
   ORDER BY CutOff
END

参照記事
対数を用いてFisherの直接確率検定を計算するには
周辺度数からクロス表を作成するには
対数により階乗を計算するストアドプロシージャを作成する

Stored procedure to calculate factorial with natural logarithm

This site has shown stored procedure and function to calculate factorial with SQL. Although the logic is so simple as extracting one from previous argument and multiplying to it, it requires high cost of computing power. Therefore, the maximum of argument is restricted to so relatively small integer. In this article, you can calculate factorial with converting argument to natural logarithm, adding them and re-converting it to power of e, the base of natural logarithm.

CREATE PROCEDURE [dbo].[LOG_FACT]
    @SrcNumber    FLOAT
AS
BEGIN
    DECLARE	@DesNumber    FLOAT
    SET @DesNumber = LOG(1)
    WHILE @SrcNumber > 0
    BEGIN
        SET @DesNumber = @DesNumber + LOG(@SrcNumber)
        SET @SrcNumber = @SrcNumber - 1
    END
    SELECT	EXP(@DesNumber)
END

Maximum of argument is 170.

EXEC LOG_FACT 170;
7.25741561530971E+306

対数により階乗を計算するストアドプロシージャを作成する

 SQL で階乗を計算するストアドプロシージャにはSQL-Serverの関数・プロシージャのVB.NETでの実行などのサイトがありますが,入力できる引数の上限が比較的小さくあまり実用的ではありません.引数から1ずつ減算して前の値に掛けていくという計算のロジックは比較的シンプルですが,計算コストは逆に高くなります.ここでは引数を一旦自然対数に変換して対数の和を求め,最後に自然対数の底のべき乗を求めることで階乗を求めます.

CREATE PROCEDURE [dbo].[LOG_FACT]
    @SrcNumber    FLOAT
AS
BEGIN
    DECLARE	@DesNumber    FLOAT
    SET @DesNumber = LOG(1)
    WHILE @SrcNumber > 0
    BEGIN
        SET @DesNumber = @DesNumber + LOG(@SrcNumber)
        SET @SrcNumber = @SrcNumber - 1
    END
    SELECT	EXP(@DesNumber)
END

 引数に取りうる最大値は170です.

EXEC LOG_FACT 170;
(列名なし)
7.25741561530971E+306

How to calculate four numbers from marginal total in cross tabulation?

When you execute Fisher’s exact test with cross tabulation, the marginal total is constant. Therefore, you could calculate the remaining three numbers if you could get ‘a’, the number of true positive. Because all four numbers are 0 or greater than 0, the range of ‘a’ is between 0 and the smaller one of either the number of ‘TRUE’ or the number of ‘positive’.

Fisher’s probability is function dependent on ‘a’. Fisher’s probability follows super geometric distribution. If ‘positive’ and ‘negative’ are separated by cut-off value in continuous variable, changing cut-off value makes a change in numbers of positive, negative and true-positive. The numbers of true and false never change regardless of change in cut-off value. Therefore, Fisher’s probability is function dependent on cut-off value.

  TRUE FALSE Marginal total
POSITIVE a P – a P
NEGATIVE T – a a + N – P – T N – P
Marginal total T N – T N 

At first you know only about numbers of ‘N’ meaning of grand total, ‘T’ meaning of true and ‘P’ meaning of positive.

  TRUE FALSE Marginal total
POSITIVE     P
NEGATIVE      
Marginal total T   N 

Next, you can calculate numbers ‘N – P’ meaning of negative and ‘N – T’ meaning of false. Then you have got marginal total.

  TRUE FALSE Marginal total
POSITIVE     P
NEGATIVE     N – P
Marginal total T N – T N 

If you could get ‘a’, you would get false negative ‘T – a’ and false positive ‘P – a’.

  TRUE FALSE Marginal total
POSITIVE a P – a P
NEGATIVE T – a   N – P
Marginal total T N – T N 

At last, you could get true negative ‘a + N – P – T’.

  TRUE FALSE Marginal total
POSITIVE a P – a P
NEGATIVE T – a a + N – P – T N – P
Marginal total T N – T N 

周辺度数からクロス表を作成するには

 クロス表から Fisher の直接確率検定を行う場合,周辺度数 (marginal total) が一定ですので,真陽性の度数 a さえ決まれば残りは自動的に決まります.下記の四分表はそれを示しています.すべての度数は 0 以上の整数ですから,a の取りうる範囲は 0 から T または P のいずれか小さい方までです.

 ここで重要なことは,Fisher の直接確率検定による確率 p は a の関数になっていることです.これを超幾何分布と言います.仮に POSITIVE と NEGATIVE とが何らかの連続変数の閾値によって分けられている場合,閾値を変化させることで周辺度数である陽性の度数 P,陰性の度数 N – P  および真陽性の度数 a も変化します.真の度数 T および偽の度数 N – T は閾値によって変化することはありません.つまり Fisher の直接確率検定による確率 p は閾値の関数になっています.

 通常ですと論文には総数の N, 真の度数 T, 陽性の度数 P, 更に感度と特異度が記載されており,ここから四分表を再現できるようになっています.

  TRUE FALSE Marginal total
POSITIVE a P – a P
NEGATIVE T – a a + N – P – T N – P
Marginal total T N – T N 

 最初に分かっているのは下記のように総数 N,陽性の度数 P,真の度数 T のみです.

  TRUE FALSE Marginal total
POSITIVE     P
NEGATIVE      
Marginal total T   N 

 次に陰性の度数 N-P,偽の度数 N-T を計算で求めます.これで周辺度数 (marginal total) が得られます.

  TRUE FALSE Marginal total
POSITIVE     P
NEGATIVE     N – P
Marginal total T N – T N 

  a を与えると,偽陰性と偽陽性が求められます.a は TRUE に感度をかけて求めます.

  TRUE FALSE Marginal total
POSITIVE a P – a P
NEGATIVE T – a   N – P
Marginal total T N – T N 

  最後に真陰性が得られます.

  TRUE FALSE Marginal total
POSITIVE a P – a P
NEGATIVE T – a a + N – P – T N – P
Marginal total T N – T N