BOBOBK

The c-index and Its Application in Survival Analysis

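The concordance index (c-index) is a metric for evaluating how well a predictive model ranks outcomes. A pair of subjects is comparable when their actual outcomes can be ordered, and concordant when the model predicts that same order; the c-index is the proportion of concordant pairs among all comparable pairs. A value of 1.0 means every pair is ranked correctly, 0.5 is what random predictions would give on average, and 0.0 means every pair is reversed. The metric is widely used in biomedical settings such as cancer prognosis, where it helps assess how well predicted survival times track the actual ones.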
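
To make the definition concrete, here is a minimal from-scratch sketch of the pair counting. It is not how lifelines implements the metric: the function name `pairwise_c_index` is hypothetical, the code assumes every event was observed (no censoring), it skips pairs with equal actual times, and it gives tied predictions half credit.

```python
from itertools import combinations

def pairwise_c_index(actual, predicted):
    """Fraction of comparable pairs whose predicted order matches the actual order.

    Assumption: every event was observed (no censoring). Pairs with equal
    actual times are skipped as not comparable; tied predictions count as half.
    """
    concordant = 0.0
    comparable = 0
    for (a_i, p_i), (a_j, p_j) in combinations(zip(actual, predicted), 2):
        if a_i == a_j:
            continue  # cannot say which subject survived longer
        comparable += 1
        if p_i == p_j:
            concordant += 0.5  # tied prediction: half credit
        elif (a_i < a_j) == (p_i < p_j):
            concordant += 1.0  # same ordering in truth and prediction
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Three patients; the model swaps the last two, so 2 of 3 pairs are concordant.
print(pairwise_c_index([1, 6, 12], [1, 12, 6]))
```

The loop is quadratic in the number of subjects, which is fine for illustration but slow for large datasets.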

In Python, you can compute it using the concordance_index function from the lifelines package.

Let’s look at a concrete example to understand its meaning. Suppose we have six patients with actual survival times of 1 month, 6 months, 12 months, 2 years, 3 years, and 5 years. If the predictions exactly match the actual values, the c-index is 1.0, indicating perfect prediction.

# Import necessary packages
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

# Install lifelines if not already installed.
# The leading "!" is Jupyter/IPython syntax; in a terminal, run: pip install lifelines
!pip install lifelines

from lifelines.utils import concordance_index

# Define the data
df = pd.DataFrame({
    "name": ["Zhang San", "Li Si", "Wang Wu", "Zhao Er", "Ma Zi", "someone"],
    "survive": [1, 6, 12, 24, 36, 60],
    "predicted": [1, 6, 12, 24, 36, 60]
})
c_index = concordance_index(df.survive, df.predicted)

print(df)
print(c_index)

# Output:
#         name  survive  predicted
# 0  Zhang San        1          1
# 1      Li Si        6          6
# 2    Wang Wu       12         12
# 3    Zhao Er       24         24
# 4      Ma Zi       36         36
# 5    someone       60         60
# 1.0

In fact, the c-index depends only on the ordering of the predictions, not on their actual values. In this respect it resembles rank-based, non-parametric measures such as Spearman’s correlation. If we change the predicted values while preserving their order, the c-index remains 1.

Example 1:

df = pd.DataFrame({
    "name": ["Zhang San", "Li Si", "Wang Wu", "Zhao Er", "Ma Zi", "someone"],
    "survive": [1, 6, 12, 24, 36, 60],
    "predicted": [1, 1.1, 1.2, 2.4, 3.6, 6]
})
c_index = concordance_index(df.survive, df.predicted)

print(df)
print(c_index)

# Output: 1.0

Example 2:

df = pd.DataFrame({
    "name": ["Zhang San", "Li Si", "Wang Wu", "Zhao Er", "Ma Zi", "someone"],
    "survive": [1, 6, 12, 24, 36, 60],
    "predicted": [1, 60, 120, 240, 360, 600]
})
c_index = concordance_index(df.survive, df.predicted)

print(df)
print(c_index)

# Output: 1.0
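
The same invariance holds for imperfect predictions: any strictly increasing transformation of the predicted values leaves every pairwise ordering, and therefore the c-index, unchanged. The sketch below checks this with a simplified helper (`c_index` is a hypothetical stand-in that assumes no censoring and no ties, not the lifelines function) and made-up predictions.

```python
import math
from itertools import combinations

def c_index(actual, predicted):
    # Simplified stand-in: assumes no censoring and no tied values.
    pairs = list(combinations(zip(actual, predicted), 2))
    agree = sum((a_i < a_j) == (p_i < p_j) for (a_i, p_i), (a_j, p_j) in pairs)
    return agree / len(pairs)

survive = [1, 6, 12, 24, 36, 60]
predicted = [2, 1, 15, 10, 40, 50]  # illustrative values with two misordered pairs

baseline = c_index(survive, predicted)

# Strictly increasing transformations keep every pairwise ordering intact.
for transform in (math.log, math.sqrt, lambda x: 10 * x + 3):
    transformed = [transform(p) for p in predicted]
    assert c_index(survive, transformed) == baseline

print(baseline)
```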

However, if some predictions are out of order, the c-index drops below 1.

Example 3:

df = pd.DataFrame({
    "name": ["Zhang San", "Li Si", "Wang Wu", "Zhao Er", "Ma Zi", "someone"],
    "survive": [1, 6, 12, 24, 36, 60],
    "predicted": [1, 12, 6, 36, 24, 60]
})
c_index = concordance_index(df.survive, df.predicted)

print(df)
print(c_index)

# Output: 0.8666666666666667
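
Where does 0.8667 come from? Six patients form 15 pairs, and the predictions in Example 3 swap two adjacent pairs, so 13 of 15 pairs remain concordant. The sketch below enumerates the pairs by hand; it assumes no censoring and no ties, which holds for this data.

```python
from itertools import combinations

survive = [1, 6, 12, 24, 36, 60]
predicted = [1, 12, 6, 36, 24, 60]

# Every unordered pair of patients, identified by index.
pairs = list(combinations(range(len(survive)), 2))

# A pair is discordant when the predicted order disagrees with the actual order.
discordant = [
    (survive[i], survive[j])
    for i, j in pairs
    if (survive[i] < survive[j]) != (predicted[i] < predicted[j])
]

print(len(pairs))    # 15
print(discordant)    # [(6, 12), (24, 36)]
print((len(pairs) - len(discordant)) / len(pairs))  # 13 / 15
```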

Example 4 (reverse order):

df = pd.DataFrame({
    "name": ["Zhang San", "Li Si", "Wang Wu", "Zhao Er", "Ma Zi", "someone"],
    "survive": [1, 6, 12, 24, 36, 60],
    "predicted": [60, 36, 24, 12, 6, 1]
})
c_index = concordance_index(df.survive, df.predicted)

print(df)
print(c_index)

# Output: 0.0
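
A c-index near 0 does not mean the model is uninformative; it means the ranking is systematically reversed. This typically happens when a model outputs a risk score, where a higher value means shorter survival, while the metric expects a higher value to mean longer survival. Negating the scores flips the ordering. The sketch below illustrates this with a simplified helper (`c_index` is a hypothetical stand-in assuming no censoring and no ties) and made-up risk scores.

```python
from itertools import combinations

def c_index(actual, predicted):
    # Simplified stand-in: assumes no censoring and no tied values.
    pairs = list(combinations(zip(actual, predicted), 2))
    agree = sum((a_i < a_j) == (p_i < p_j) for (a_i, p_i), (a_j, p_j) in pairs)
    return agree / len(pairs)

survive = [1, 6, 12, 24, 36, 60]
risk = [0.9, 0.7, 0.6, 0.4, 0.2, 0.1]  # hypothetical scores: higher = shorter survival

as_is = c_index(survive, risk)                  # every pair reversed
negated = c_index(survive, [-r for r in risk])  # every pair in the right order

print(as_is)    # 0.0
print(negated)  # 1.0
```

Without ties, negating the scores always turns a c-index of c into 1 - c, so it is worth checking which direction a model's output runs before interpreting the number.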

Summary

The concordance index (c-index) is a useful metric in survival analysis for evaluating the performance of predictive models. It is sensitive to the ranking order of predictions, but insensitive to the specific numerical values. This makes it especially suitable for assessing models where rank accuracy is more important than exact value prediction.
