Binary Classification
Stu Field, SomaLogic Operating Co., Inc.
Source:vignettes/articles/stat-binary-classification.Rmd
stat-binary-classification.Rmd
Classification via Logistic Regression
Although targeted statistical analyses are beyond the scope of the
SomaDataIO
package, below is an example analysis that
typical users/customers would perform on ‘SomaScan’ data.
It is not intended to be a definitive guide in statistical analysis
and existing packages do exist in the R
ecosystem that
perform parts or extensions of these techniques. Many variations of the
workflow below exist, however the framework highlights how one could
perform standard preliminary analyses on ‘SomaScan’ data.
Data Preparation
# the `example_data` package data
dim(example_data)
#> [1] 192 5318
table(example_data$SampleType)
#>
#> Buffer Calibrator QC Sample
#> 6 10 6 170
# center/scale
cs <- function(.x) { # .x = numeric vector
out <- .x - mean(.x) # center
out / sd(out) # scale
}
# prepare data set for analysis
cleanData <- example_data |>
filter(SampleType == "Sample") |> # rm control samples
drop_na(Sex) |> # rm NAs if present
log10() |> # log10-transform (Math Generic)
mutate(Group = as.numeric(factor(Sex)) - 1) |> # map Sex -> 0/1
modify_at(getAnalytes(example_data), cs) # center/scale analytes
table(cleanData$Sex)
#>
#> F M
#> 85 85
table(cleanData$Group) # F = 0; M = 1
#>
#> 0 1
#> 85 85
Set up Train/Test Data
# idx = hold-out
# seed resulting in 50/50 class balance
idx <- withr::with_seed(3, sample(1:nrow(cleanData), size = nrow(cleanData) - 50))
train <- cleanData[idx, ]
test <- cleanData[-idx, ]
# assert no overlap
isTRUE(
all.equal(intersect(rownames(train), rownames(test)), character(0))
)
#> [1] TRUE
Logistic Regression
We use the cleanData
, train
, and
test
data objects from above.
Predict Sex
LR_tbl <- getAnalyteInfo(train) |>
select(AptName, SeqId, Target = TargetFullName, EntrezGeneSymbol, UniProt) |>
mutate(
formula = map(AptName, ~ as.formula(paste("Group ~", .x))), # create formula
model = map(formula, ~ stats::glm(.x, data = train, family = "binomial", model = FALSE)), # fit glm()
beta_hat = map(model, coef) |> map_dbl(2L), # pull out coef Beta
p.value = map2_dbl(model, AptName, ~ {
summary(.x)$coefficients[.y, "Pr(>|z|)"] }), # pull out p-values
fdr = p.adjust(p.value, method = "BH") # FDR correction multiple testing
) |>
arrange(p.value) |> # re-order by `p-value`
mutate(rank = row_number()) # add numeric ranks
LR_tbl
#> # A tibble: 5,284 × 11
#> AptName SeqId Target EntrezGeneSymbol UniProt formula model beta_hat
#> <chr> <chr> <chr> <chr> <chr> <list> <lis> <dbl>
#> 1 seq.658… 6580… Pregn… PZP P20742 <formula> <glm> -3.07
#> 2 seq.576… 5763… Beta-… DEFB104A Q8WTQ1 <formula> <glm> 3.13
#> 3 seq.303… 3032… Folli… CGA FSHB P01215… <formula> <glm> -1.64
#> 4 seq.792… 7926… Kunit… SPINT3 P49223 <formula> <glm> 2.90
#> 5 seq.295… 2953… Lutei… CGA LHB P01215… <formula> <glm> -1.58
#> 6 seq.168… 1689… Ecton… ENPP2 Q13822 <formula> <glm> -1.89
#> 7 seq.491… 4914… Human… CGA CGB P01215… <formula> <glm> -1.56
#> 8 seq.928… 9282… Cyste… CRISP2 P16562 <formula> <glm> 1.91
#> 9 seq.247… 2474… Serum… APCS P02743 <formula> <glm> 1.79
#> 10 seq.713… 7139… SLIT … SLITRK4 Q8IW52 <formula> <glm> 1.21
#> # ℹ 5,274 more rows
#> # ℹ 3 more variables: p.value <dbl>, fdr <dbl>, rank <int>
Fit Model | Calculate Performance
Next, select features for the model fit. We have a good idea of
reasonable Sex
markers from prior knowledge
(CGA*
), and fortunately many of these are highly ranked in
LR_tbl
. Below we fit a 4-marker logistic regression model
from cherry-picked gender-related features:
# AptName is index key between `LR_tbl` and `train`
feats <- LR_tbl$AptName[c(1L, 3L, 5L, 7L)]
form <- as.formula(paste("Group ~", paste(feats, collapse = "+")))
fit <- glm(form, data = train, family = "binomial", model = FALSE)
pred <- tibble(
true_class = test$Sex, # orig class label
pred = predict(fit, newdata = test, type = "response"), # prob. 'Male'
pred_class = ifelse(pred < 0.5, "F", "M"), # class label
)
conf <- table(pred$true_class, pred$pred_class, dnn = list("Actual", "Predicted"))
tp <- conf[2L, 2L]
tn <- conf[1L, 1L]
fp <- conf[1L, 2L]
fn <- conf[2L, 1L]
# Confusion matrix
conf
#> Predicted
#> Actual F M
#> F 24 1
#> M 5 20
# Classification metrics
tibble(Sensitivity = tp / (tp + fn),
Specificity = tn / (tn + fp),
Accuracy = (tp + tn) / sum(conf),
PPV = tp / (tp + fp),
NPV = tn / (tn + fn)
)
#> # A tibble: 1 × 5
#> Sensitivity Specificity Accuracy PPV NPV
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.8 0.96 0.88 0.952 0.828