Skip to content

Classification via Logistic Regression

Although targeted statistical analyses are beyond the scope of the SomaDataIO package, below is an example analysis that typical users/customers would perform on ‘SomaScan’ data.

It is not intended to be a definitive guide in statistical analysis and existing packages do exist in the R ecosystem that perform parts or extensions of these techniques. Many variations of the workflow below exist, however the framework highlights how one could perform standard preliminary analyses on ‘SomaScan’ data.

Data Preparation

# the `example_data` package data
dim(example_data)
#> [1]  192 5318

table(example_data$SampleType)
#> 
#>     Buffer Calibrator         QC     Sample 
#>          6         10          6        170

# center/scale
cs <- function(.x) {    # .x = numeric vector
  out <- .x - mean(.x)  # center
  out / sd(out)         # scale
}

# prepare data set for analysis
cleanData <- example_data |>
  filter(SampleType == "Sample") |>               # rm control samples
  drop_na(Sex) |>                                 # rm NAs if present
  log10() |>                                      # log10-transform (Math Generic)
  mutate(Group = as.numeric(factor(Sex)) - 1) |>  # map Sex -> 0/1
  modify_at(getAnalytes(example_data), cs)        # center/scale analytes

table(cleanData$Sex)
#> 
#>  F  M 
#> 85 85

table(cleanData$Group)    # F = 0; M = 1
#> 
#>  0  1 
#> 85 85

Set up Train/Test Data

# idx = hold-out 
# seed resulting in 50/50 class balance
idx   <- withr::with_seed(3, sample(1:nrow(cleanData), size = nrow(cleanData) - 50))
train <- cleanData[idx, ]
test  <- cleanData[-idx, ]

# assert no overlap
isTRUE(
  all.equal(intersect(rownames(train), rownames(test)), character(0))
)
#> [1] TRUE

Logistic Regression

We use the cleanData, train, and test data objects from above.

Predict Sex

LR_tbl <- getAnalyteInfo(train) |>
  select(AptName, SeqId, Target = TargetFullName, EntrezGeneSymbol, UniProt) |>
  mutate(
    formula  = map(AptName, ~ as.formula(paste("Group ~", .x))),  # create formula
    model    = map(formula, ~ stats::glm(.x, data = train, family = "binomial", model = FALSE)),  # fit glm()
    beta_hat = map(model, coef) |> map_dbl(2L),     # pull out coef Beta
    p.value  = map2_dbl(model, AptName, ~ {
      summary(.x)$coefficients[.y, "Pr(>|z|)"] }),  # pull out p-values
    fdr      = p.adjust(p.value, method = "BH")     # FDR correction multiple testing
  ) |>
  arrange(p.value) |>            # re-order by `p-value`
  mutate(rank = row_number())    # add numeric ranks

LR_tbl
#> # A tibble: 5,284 × 11
#>    AptName  SeqId Target EntrezGeneSymbol UniProt formula   model beta_hat
#>    <chr>    <chr> <chr>  <chr>            <chr>   <list>    <lis>    <dbl>
#>  1 seq.658… 6580… Pregn… PZP              P20742  <formula> <glm>    -3.07
#>  2 seq.576… 5763… Beta-… DEFB104A         Q8WTQ1  <formula> <glm>     3.13
#>  3 seq.303… 3032… Folli… CGA FSHB         P01215… <formula> <glm>    -1.64
#>  4 seq.792… 7926… Kunit… SPINT3           P49223  <formula> <glm>     2.90
#>  5 seq.295… 2953… Lutei… CGA LHB          P01215… <formula> <glm>    -1.58
#>  6 seq.168… 1689… Ecton… ENPP2            Q13822  <formula> <glm>    -1.89
#>  7 seq.491… 4914… Human… CGA CGB          P01215… <formula> <glm>    -1.56
#>  8 seq.928… 9282… Cyste… CRISP2           P16562  <formula> <glm>     1.91
#>  9 seq.247… 2474… Serum… APCS             P02743  <formula> <glm>     1.79
#> 10 seq.713… 7139… SLIT … SLITRK4          Q8IW52  <formula> <glm>     1.21
#> # ℹ 5,274 more rows
#> # ℹ 3 more variables: p.value <dbl>, fdr <dbl>, rank <int>

Fit Model | Calculate Performance

Next, select features for the model fit. We have a good idea of reasonable Sex markers from prior knowledge (CGA*), and fortunately many of these are highly ranked in LR_tbl. Below we fit a 4-marker logistic regression model from cherry-picked gender-related features:

# AptName is index key between `LR_tbl` and `train`
feats <- LR_tbl$AptName[c(1L, 3L, 5L, 7L)]
form  <- as.formula(paste("Group ~", paste(feats, collapse = "+")))
fit   <- glm(form, data = train, family = "binomial", model = FALSE)
pred  <- tibble(
  true_class = test$Sex,                                         # orig class label
  pred       = predict(fit, newdata = test, type = "response"),  # prob. 'Male'
  pred_class = ifelse(pred < 0.5, "F", "M"),                     # class label
)
conf <- table(pred$true_class, pred$pred_class, dnn = list("Actual", "Predicted"))
tp   <- conf[2L, 2L]
tn   <- conf[1L, 1L]
fp   <- conf[1L, 2L]
fn   <- conf[2L, 1L]

# Confusion matrix
conf
#>       Predicted
#> Actual  F  M
#>      F 24  1
#>      M  5 20

# Classification metrics
tibble(Sensitivity = tp / (tp + fn),
       Specificity = tn / (tn + fp),
       Accuracy    = (tp + tn) / sum(conf),
       PPV         = tp / (tp + fp),
       NPV         = tn / (tn + fn)
)
#> # A tibble: 1 × 5
#>   Sensitivity Specificity Accuracy   PPV   NPV
#>         <dbl>       <dbl>    <dbl> <dbl> <dbl>
#> 1         0.8        0.96     0.88 0.952 0.828