How to calculate Harrell’s c-index to evaluate multivariate model with EXCEL VBA?

You may use Akaike information criterion (AIC) to evaluate fitting of multivariate model. You could use c-index that Harrell have proposed. Although it seems to evaluate fitting of present data set, it seems not to consider about future data set, it might result in overfitting to present data set.

  1. Make pair from data set with proportional hazard analysis
  2. Calculate risk score
  3. Compare risk score and survival time between pairs
  4. Calculate c-index

1. Make pair from data set with proportional hazard analysis

It’s assumed that sample size is N, the number of pairs could be calculated following formula.

\displaystyle _{N}C_{2} = \frac{N!}{(N-2)!2!}

It’s assumed that worksheet’s structure follows the list below.

  • The 1st line is title.
  • The 1st column is survival time, the 2nd is outcome, the 3rd is a risk score of model 1, the 4th is a risk score of model 2 and the 5th is a risk score of model 3, respectively.
  • All data type is numerical.
  • In outcome, 0 is death and 1 is censored, respectively.
Option Explicit

Sub C_Statistics()
    Dim i   As Long
    Dim j   As Long
    Dim k   As Long
    Dim Rng As Range
    Dim Ar  As Variant
    k = 0
    Set Rng = ActiveSheet.UsedRange
    Set Rng = Rng.Resize(Rng.Rows.Count - 1).Offset(1)
    Ar = Rng
    For i = LBound(Ar) To UBound(Ar) - 1
        For j = i + 1 To UBound(Ar)
            k = k + 1
        Next j
    Next i
    Debug.Print "k= " & k
End Sub

2. Calculate risk score

Risk score (R) is calculated as following formula.

\displaystyle R = \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n

\displaystyle S(t, X) = S_0(t)^{\exp(R-R_0)}

A point estimated of effect size in COX proportional hazard analysis is hazard ratio (Exp(Β)) and regression coefficient of covariate is logarithm of hazard ratio (Β). It’s assumed that risk score has been calculated.

3. Compare risk score and survival time between both of pair

It’s important that “If both of pair was censored or one of pair was censored and survival time of censored is short, they were classified as unknown”. In other words,

  • Accept the pair both of it is death
  • If one of pair is death and the survival time of death is shorter than the censored, accept it.

It’s as following VBA code. It’s assumed that it doesn’t includes equal sign if both survival time of pair were equal.

            Select Case Ar(i, 2) + Ar(j, 2)
            Case 0
                k = k + 1
            Case 1
                If (Ar(i, 1) - Ar(j, 1)) * (Ar(i, 2) - Ar(j, 2)) > 0 Then
                    k = k + 1
                End If
            End Select

Furthermore, you would compare risk score and survival time between both of pair and evaluate the sign of product of the differentiation of risk score and the differentiation of survival time, respectively. It means that whether the magnitude of risk score and the length of survival time are consistent or not. It’s assumed that lower risk score means longer survival time.

Option Explicit

Sub C_Statistics()
    Dim i   As Long
    Dim j   As Long
    Dim k   As Long
    Dim n1  As Long
    Dim n2  As Long
    Dim n3  As Long
    Dim Rng As Range
    Dim Ar  As Variant
    
    k = 0
    n1 = 0
    n2 = 0
    n3 = 0
    Set Rng = ActiveSheet.UsedRange
    Set Rng = Rng.Resize(Rng.Rows.Count - 1).Offset(1)
    Ar = Rng
    For i = LBound(Ar) To UBound(Ar) - 1
        For j = i + 1 To UBound(Ar)
            Select Case Ar(i, 2) + Ar(j, 2)
            Case 0
                If (Ar(i, 1) - Ar(j, 1)) * (Ar(i, 3) - Ar(j, 3)) < 0 Then
                    n1 = n1 + 1
                End If
                If (Ar(i, 1) - Ar(j, 1)) * (Ar(i, 4) - Ar(j, 4)) < 0 Then
                    n2 = n2 + 1
                End If
                If (Ar(i, 1) - Ar(j, 1)) * (Ar(i, 5) - Ar(j, 5)) < 0 Then
                    n3 = n3 + 1
                End If
                k = k + 1
            Case 1
                If (Ar(i, 1) - Ar(j, 1)) * (Ar(i, 2) - Ar(j, 2)) > 0 Then
                    If (Ar(i, 1) - Ar(j, 1)) * (Ar(i, 3) - Ar(j, 3)) < 0 Then
                        n1 = n1 + 1
                    End If
                    If (Ar(i, 1) - Ar(j, 1)) * (Ar(i, 4) - Ar(j, 4)) < 0 Then
                        n2 = n2 + 1
                    End If
                    If (Ar(i, 1) - Ar(j, 1)) * (Ar(i, 5) - Ar(j, 5)) < 0 Then
                        n3 = n3 + 1
                    End If
                    k = k + 1
                End If
            End Select
        Next j
    Next i
    Debug.Print "n1= " & n1, "n2= " & n2, "n3= " & n3, "k= " & k
    Debug.Print "C1= " & n1 / k, "C2= " & n2 / k, "C3= " & n3 / k
End Sub

The sign of 35th line is larger than 0, it's assumed that censor is 1 and death is 0, would be reversed if censor was 0 and death was 1. The signs of 24th, 27th, 30th, 36th, 39th and 42nd would be reversed if it was assumed that higher risk score means longer survival time.

4. Calculate c-index

n1/k, n2/k and n3/k are c-index of model 1, model 2 and model 3, respectively. c-index ranges between 0 and 1. If c-index is 0.5, it means that the model doesn't fit at all. If it's closer to 0 or 1, it means that the model fits better.

Draw a pair of patients and determine which patient lived longer from his baseline evaluation. Survival times can be validly compared either when both patients have died, or when one has died and the other's followup time has exceeded the survival time of the first. If both patients are still alive, which will live longer is not known, and that pair of patients is not used in the analysis. Otherwise, it can be determined whether the patient with the higher prognostic score (ie, the weighted combination of baseline and test variables used to predict survival) also had the longer survival time. The process is repeated until all possible pairs of patients have been examined. Of the pairs of patients for which the ordering of survival time s could be inferred, the fraction of pairs such that the patient with the higher score had the longer survival time will be denoted by c.

The index c estimates the probability that, of two randomly chosen patients, the patient with the higher prognostic score will outlive the patient with the lower prognostic score. Values of c near .5 indicate that the prognostic score is no better than a coin-flip in determining which patient will live longer. Values of c near 0 or 1 indicate the baseline data virtually always determine which patient has a better prognosis. The c index measures a probability; many clinicians are more used to dealing with a correlation index that ranges from -1 to +1. A Kendall or Goodman-Kruskal type of correlation index can easily be constructed by calculating γ = 2(c - .5), where γ is the estimated probability that the prognostic score correctly orders prognosis for a pair of patients minus the probability that it incorrectly orders prognosis. When the prognostic score is unrelated to survival time, gamma is zero. When gamma = .5, the relationship between the prognostic score and survival time is halfway between a random relationship and a perfect relationship, and the corresponding c value is .75.

References:
Frank E. Harrell Jr, et al: Evaluating the Yield of Medical Tests. JAMA. 1982; 247 (18): 2543 - 2546
Morizane Toshio: Multivariate model, International Medical Information Center 2008; 29 (3): 8 - 12

多変量モデル評価法のc-indexをEXCEL VBAで計算する

 多変量モデルの適合度の評価方法には通常赤池情報量基準 (AIC) を用いますが,Harrell らの提唱する c-index という指標もあります.c 統計値とも言い,リスクスコアの小さい(又は大きい)症例の方が生存期間が長いことが実際のデータでどれくらいの確率で正しいかを示す値です.方法は後述しますが,AIC と比較すると現在のデータに対する適合度のみを評価しており,未来のデータの予測精度への考慮がないように思えます.その意味で overfitting の可能性がある評価法と言えなくもありません.

  1. 比例ハザード解析対象となった症例から,全てのペアを作る
  2. それらのリスクスコアを調べる
  3. リスクスコアの大小および生存期間の長短を比較する
  4. c-index を計算する

1. 比例ハザード解析対象となった症例から,全てのペア(2症例ずつの組み合わせ)を作る

 サンプルサイズを N とすると,全てのペア数は下式で表現されます.

\displaystyle _{N}C_{2} = \frac{N!}{(N-2)!2!}

 ワークシート上にデータがあるとして,1行が1症例とすると,全ての行から任意の2行を取り出すコードは下記のようになります.ワークシートの構造が以下のようであると仮定します.

  • 1 行目はタイトル行である.
  • A 列は生存期間, B 列は転帰,C 列はモデル 1 のリスクスコア,D 列はモデル 2 のリスクスコア,E 列はモデル 3 のリスクスコアをそれぞれ表現する.
  • 全てのデータは数値型である.
  • B 列で死亡は 0, 打切は 1 と表現する.
Option Explicit

Sub C_Statistics()
    Dim i   As Long
    Dim j   As Long
    Dim k   As Long
    Dim Rng As Range
    Dim Ar  As Variant
    k = 0
    Set Rng = ActiveSheet.UsedRange
    Set Rng = Rng.Resize(Rng.Rows.Count - 1).Offset(1)
    Ar = Rng
    For i = LBound(Ar) To UBound(Ar) - 1
        For j = i + 1 To UBound(Ar)
            k = k + 1
        Next j
    Next i
    Debug.Print "k= " & k
End Sub

2. それらのリスクスコアを調べる

 リスクスコア (R) は下式で表現されます.予後を規定するという意味で予後スコア prognostic score とも言います.β は回帰係数,X は共変量です.R0 は全症例のリスクスコアの平均値です.S0(t) はベースラインの生存率であり,全ての説明変数が基準値である場合の各時点 t での生存率です.

\displaystyle R = \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n

\displaystyle S(t, X) = S_0(t)^{\exp(R-R_0)}

 COX 比例ハザード分析では効果量の点推定値はハザード比 (Exp(Β)) として表現され,共変量の回帰係数はハザード比の対数 (LN(Exp(Β)) = Β) として表現します.それぞれの共変量にそれぞれの回帰係数をかけた積の和がリスクスコアです.ここでは既にリスクスコアの計算は終わっているものとします.

3. リスクスコアの大小および生存期間の長短を比較する

 ここで重要な点は「2 症例とも打切例,あるいは片方が打切で打切までの期間がより短い場合は不明に分類される」との記述を条件式に表現する方法です.この条件は次のように言い換えることができます.

  • 両者とも死亡のペアを受け入れる
  • 一方が死亡の場合,死亡例の生存期間が打切例の生存期間より短いなら受け入れる

 これを VBA で表現すると以下のようになります.2 行目と 4 行目の Case 式はそれぞれ上述した条件式に該当します.5 行目は上述の条件の後者を表現したものであり,生存期間の差と転帰の差との積を取り,符号が負の場合は拒否します.参照書籍の記述によると『打切例の打切までの生存期間が同じ値かあるいは短い場合にはどちらの生存が長いかは判断することができない』とのことですので,等号は外すこととします.

            Select Case Ar(i, 2) + Ar(j, 2)
            Case 0
                k = k + 1
            Case 1
                If (Ar(i, 1) - Ar(j, 1)) * (Ar(i, 2) - Ar(j, 2)) > 0 Then
                    k = k + 1
                End If
            End Select

 さらにリスクスコアと生存期間とを比較します.同様にリスクスコアの差と生存期間の差との積の符号を評価します.リスクスコアの大小と生存期間の長短とが一致しているか否かを,差の積の符号に置き換えている訳です.最初に『リスクスコアの小さい(又は大きい)症例の方が生存期間が長いこと』と述べましたが,説明変数の設定によって各変数の係数の正負を逆転させ,リスクスコアの大小を逆転させることも可能です.ここではリスクスコアが小さいほど生存期間が長いという前提で話を進めます.

Option Explicit

Sub C_Statistics()
    Dim i   As Long
    Dim j   As Long
    Dim k   As Long
    Dim n1  As Long
    Dim n2  As Long
    Dim n3  As Long
    Dim Rng As Range
    Dim Ar  As Variant
    
    k = 0
    n1 = 0
    n2 = 0
    n3 = 0
    Set Rng = ActiveSheet.UsedRange
    Set Rng = Rng.Resize(Rng.Rows.Count - 1).Offset(1)
    Ar = Rng
    For i = LBound(Ar) To UBound(Ar) - 1
        For j = i + 1 To UBound(Ar)
            Select Case Ar(i, 2) + Ar(j, 2)
            Case 0
                If (Ar(i, 1) - Ar(j, 1)) * (Ar(i, 3) - Ar(j, 3)) < 0 Then
                    n1 = n1 + 1
                End If
                If (Ar(i, 1) - Ar(j, 1)) * (Ar(i, 4) - Ar(j, 4)) < 0 Then
                    n2 = n2 + 1
                End If
                If (Ar(i, 1) - Ar(j, 1)) * (Ar(i, 5) - Ar(j, 5)) < 0 Then
                    n3 = n3 + 1
                End If
                k = k + 1
            Case 1
                If (Ar(i, 1) - Ar(j, 1)) * (Ar(i, 2) - Ar(j, 2)) > 0 Then
                    If (Ar(i, 1) - Ar(j, 1)) * (Ar(i, 3) - Ar(j, 3)) < 0 Then
                        n1 = n1 + 1
                    End If
                    If (Ar(i, 1) - Ar(j, 1)) * (Ar(i, 4) - Ar(j, 4)) < 0 Then
                        n2 = n2 + 1
                    End If
                    If (Ar(i, 1) - Ar(j, 1)) * (Ar(i, 5) - Ar(j, 5)) < 0 Then
                        n3 = n3 + 1
                    End If
                    k = k + 1
                End If
            End Select
        Next j
    Next i
    Debug.Print "n1= " & n1, "n2= " & n2, "n3= " & n3, "k= " & k
    Debug.Print "C1= " & n1 / k, "C2= " & n2 / k, "C3= " & n3 / k
End Sub

 35 行目の条件式の符号は打切が 1, 死亡が 0 の際のものです.打切が 0, 死亡が 1 なら符号は逆転します.同様にリスクスコアが大きいほど生存期間が長いなら 24, 27, 30, 36, 39, 42 行目の符号は逆転します.

4. c-index を計算する

 リスクスコアと生存の関係が (1) 一致しているか,(2) 一致していないか,(3) 不明かで結果を場合分けしそれぞれの個数をカウントします.(1)/((1)+(2)) の比率が c-index です.上記では n1/k, n2/k, n3/k がそれぞれのモデルの c-index となります.c-index は 0 から 1 までの値を取りますが,0.5 の場合は全く適合していないと評価します.0 または 1 に近いほど適合が良いと評価します.

Draw a pair of patients and determine which patient lived longer from his baseline evaluation. Survival times can be validly compared either when both patients have died, or when one has died and the other's followup time has exceeded the survival time of the first. If both patients are still alive, which will live longer is not known, and that pair of patients is not used in the analysis. Otherwise, it can be determined whether the patient with the higher prognostic score (ie, the weighted combination of baseline and test variables used to predict survival) also had the longer survival time. The process is repeated until all possible pairs of patients have been examined. Of the pairs of patients for which the ordering of survival time s could be inferred, the fraction of pairs such that the patient with the higher score had the longer survival time will be denoted by c.

The index c estimates the probability that, of two randomly chosen patients, the patient with the higher prognostic score will outlive the patient with the lower prognostic score. Values of c near .5 indicate that the prognostic score is no better than a coin-flip in determining which patient will live longer. Values of c near 0 or 1 indicate the baseline data virtually always determine which patient has a better prognosis. The c index measures a probability; many clinicians are more used to dealing with a correlation index that ranges from -1 to +1. A Kendall or Goodman-Kruskal type of correlation index can easily be constructed by calculating γ = 2(c - .5), where γ is the estimated probability that the prognostic score correctly orders prognosis for a pair of patients minus the probability that it incorrectly orders prognosis. When the prognostic score is unrelated to survival time, gamma is zero. When gamma = .5, the relationship between the prognostic score and survival time is halfway between a random relationship and a perfect relationship, and the corresponding c value is .75.

参照:
Frank E. Harrell Jr, et al: Evaluating the Yield of Medical Tests. JAMA. 1982; 247 (18): 2543 - 2546
森實敏夫:多変量モデル,あいみっく,2008; 29 (3): 8 - 12(国際医学情報センター)