Pre-Processing SomaScan

Overview

SomaDataIO contains functionality to prepare a SomaScan .adat data file containing microarray-based relative fluorescent units (RFUs) for analysis. The typical data analysis path includes several steps of pre-processing, including:

Filtering features
Filtering samples
Data QC
Transformations

This article will walk through why these pre-processing steps are recommended prior to executing an analysis on SomaScan data, along with how these steps can be performed using the SomaDataIO package.

Filtering Features

The goal of this pre-processing step is to remove features (SeqIds) typically not useful for analysis from a SomaScan dataset, while also retaining relevant features that will enable broad discovery during downstream analysis.

The filtering logic typically used for protein features (i.e. SOMAmer Reagents) is:

Type == "Protein" & Organism == "Human"

What information does this logic convey? It indicates that only human protein features from the raw SomaScan data set should be retained. This can be accomplished by filtering on the ADAT Type attribute, which represents the SOMAmer target type. This will be used in conjunction with the Organism attribute, which represents the organism from which the protein originated.

These two attributes (Type and Organism) can be accessed via SomaDataIO::getAnalyteInfo(). This function retrieves the analyte annotation data (i.e. the column metadata, or COL_DATA) that appears above the protein measurements in the ADAT file, and returns it in a data.frame format.

Note: The example_data object used in this vignette is a V4.0 plasma ADAT. Other matrices or versions of ADATs may contain expanded feature sets. For more information about the format and content of an ADAT, please reference the SomaLogic-Data GitHub repository.

To retrieve the column metadata annotations for the data set:

annots <- getAnalyteInfo(example_data)

We can now check the contents of the Type and Organism attributes, which can be accessed as columns of a data frame once extracted with getAnalyteInfo():

annots |> count(Type) |> arrange(desc(n))
#> # A tibble: 7 × 2
#>   Type                              n
#>   <chr>                         <int>
#> 1 Protein                        5207
#> 2 Non-Human                        24
#> 3 Spuriomer                        20
#> 4 Hybridization Control Elution    12
#> 5 Non-Biotin                       10
#> 6 Deprecated                        7
#> 7 Non-cleavable                     4

annots |> count(Organism) |> arrange(desc(n))
#> # A tibble: 14 × 2
#>    Organism                                                           n
#>    <chr>                                                          <int>
#>  1 Human                                                           5032
#>  2 Mouse                                                            228
#>  3 Aequorea victoria (Jellyfish)                                      3
#>  4 African clawed frog                                                3
#>  5 Heloderma suspectum (Gila monster)                                 3
#>  6 Hornet                                                             3
#>  7 Thermus thermophilus                                               3
#>  8 Photinus pyralis (North American firefly)                          2
#>  9 Sambucus nigra (European elder)                                    2
#> 10 Cyanidium caldarium                                                1
#> 11 Geobacillus stearothermophilus (Bacillus stearothermophilus)       1
#> 12 Rhizobium meliloti (Ensifer meliloti) (Sinorhizobium meliloti)     1
#> 13 isolate BEN                                                        1
#> 14 isolate LW123                                                      1

As described in the filtering strategy above, we only want to retain the features for which Type == "Protein", indicating that the feature is related to a human protein. In the same filtering step, we can also specify that we only want to retain proteins found in humans (Organism == "Human"). In some versions of the assay there are SeqIds labeled “Internal Use Only” which are not intended for downstream analysis and can also be removed. This can be performed using dplyr::filter():

human_prots <- getAnalyteInfo(example_data) |>
  filter(Type == "Protein" & Organism == "Human") |>
  filter(!grepl("^Internal Use Only", TargetFullName))
length(human_prots$SeqId)
#> [1] 4979

The original dataset (example_adat) contained 5284 analytes; after filtering for human proteins and removing 305 features as described above, it contains 4979 analytes.

Remember that the human_prots object is not an ADAT, but rather a data.frame representing the column metadata that corresponds to the example_data ADAT object. We must now filter the example_data object to match the filtering performed on the human_prots object.

# Identify SeqIds that differ between the data sets
discard <- setdiff(grep("^seq\\.", colnames(example_data), value = TRUE), 
                   human_prots$AptName)
# Discard non-human, non-protein features
filt_data <- example_data |>
  dplyr::select(-all_of(discard))

This data set now contains only human proteins.

Flagged Features

Our human protein analyte annotations object human_prots contains a single column, called ColCheck, that can be used for filtering flagged features from ADAT datasets. This column contains the QC acceptance criteria for all plates/sets. The standard QC acceptance criteria used to determine if a feature is flagged is if the QC ratio by plate (aggregate across all plates) is within the acceptance range of 0.8 to 1.2:

Again, please see SomaLogic-Data for more details.

human_prots |> count(ColCheck)
#> # A tibble: 2 × 2
#>   ColCheck     n
#>   <chr>    <int>
#> 1 FLAG       214
#> 2 PASS      4765

214 human protein features in filt_data were flagged in this column check, and are marked with a FLAG value. The presence of a FLAG value indicates the SeqId for one of the QC control samples was outside the range shown above on at least one plate. The total number of SeqIds outside this range can be used to assess potential issues with runs or studies, however a FLAG value does not necessarily indicate a problem with the signal for a specific analyte in the study samples. Therefore, these features should not be filtered out of the dataset as part of standard pre-processing.

Filtering Samples

The next pre-processing step to consider is filtering samples. A typical data analysis is focused on only the study samples within the ADAT file. The SampleType column in the filt_data object can be used to filter down to the study samples only.

filt_data |> count(SampleType)
#> # A tibble: 4 × 2
#>   SampleType     n
#>   <chr>      <int>
#> 1 Buffer         6
#> 2 Calibrator    10
#> 3 QC             6
#> 4 Sample       170

An ADAT file will also include buffer, calibrator and QC samples. It is recommended to remove these samples prior to an analysis.

filt_data <- filt_data |> 
  filter(SampleType == "Sample")

The filt_data object now has 170 study samples, after removing the 10 calibrator samples (replicate controls for combining data across runs), 6 QC samples (replicate controls used to assess run quality), and 6 buffer samples (no protein controls).

Flagged Samples

An ADAT file contains a single column, called RowCheck, that can be used for filtering out flagged samples that do not pass a pre-defined normalization acceptance criteria.

The standard normalization acceptance criteria used to determine if a sample is flagged by the RowCheck column is if any of the rowwise normalization scale factors are outside of the acceptance range of 0.4 to 2.5:

filt_data |>
  count(RowCheck) |> 
  mutate(percent = n / sum(n) * 100)
#> # A tibble: 2 × 3
#>   RowCheck     n percent
#>   <chr>    <int>   <dbl>
#> 1 FLAG         2    1.18
#> 2 PASS       168   98.8

Note 2 samples in filt_data are flagged with this row check, and are marked with a FLAG value.

# pull normalization scale factor variable names from ADAT
norm_vars <- grep("^[Nn]orm[Ss]cale|^Med\\.Scale\\.", 
                  names(filt_data), value = TRUE)

filt_data |>
  filter(!RowCheck == "PASS") |> 
  dplyr::select(SampleId, all_of(norm_vars)) |> 
  as.data.frame()
#>                SampleId NormScale_20 NormScale_0_005 NormScale_0_5
#> 258495800110_3      126    0.3700948        1.064435     0.9836240
#> 258495800109_1      147    0.2891420        1.111629     0.9634776

We can see that these samples each have a normalization scale factor from at least one dilution bin that fall below the standard accepted range. Filtering samples based on the default RowCheck column will drop these samples from filt_data.

filt_data <- filt_data |> 
  filter(RowCheck == "PASS")

Note that normalization to a reference may result in scaling bias in certain scenarios. It is recommended to check the flag rate using the default acceptance criteria from RowCheck, and if > 10% of samples are flagged, further evaluation may be needed.

Depending on your data and experiment, it may be reasonable to keep samples with FLAG values, particularly those samples close to the acceptance criteria boundary. The FLAG value indicates non-conformance to a reference (either study-specific or external such as ANML), and further evaluation is encouraged to assess if those samples should be considered true outliers for a given study and removed.

Sample Level Outliers by RFU

It is also important to evaluate if there are any sample-level outliers by RFU measurements. For standard plasma and serum studies Standard BioTools’ recommended outlier definition is any sample where >= 5% of RFU measurements exceed 6 median absolute deviations (MADs) and 5x fold-change from median signal.

This filter is typically appropriate for studies on plasma, serum, and other biological matrices generally exhibiting homeostatic characteristics. For studies on matrices such as tissue homogenate, cell culture, or study designs containing client-provided background lysis buffer controls (or similar), this filter will likely not be appropriate.

# filt_data does not have any outliers by default
# create a "fake" outlier sample as an example
apts <- getAnalytes(filt_data)
filt_data[12, apts[1:600]] <- filt_data[12, apts[1:600]] * 100  # 600 apts ~ 12%

We can identify plasma and serum outlier samples using this definition with the calcOutlierMap() function.

om <- calcOutlierMap(filt_data)
plot(om)

The one sample in filt_data that we modified in the previous code chunk was identified as an outlier by RFU measurements with the *.

The getOutlierIds() function is useful for extracting the sample IDs that were flagged as outliers in the plot. These sample IDs should be investigated further and possibly considered for removal prior to analysis.

rfu_outliers <- getOutlierIds(om)
rfu_outliers
#>   idx
#> 1  12

# drop one outlier sample
filt_data <- filt_data |> 
  filter(!dplyr::row_number() %in% rfu_outliers$idx)

Data QC

The goal of this step is to assess if the normalization applied to the input ADAT is appropriate for the study. To accomplish this, we can check the association of normalization scale factors with any endpoint of interest.

As an example, we will create a status_class two-group endpoint variable and a status_cont continuous endpoint variable in the filt_data ADAT.

# create "status" endpoints as examples
filt_data <- withr::with_seed(101,
  filt_data |> 
    mutate(status_cont = runif(nrow(filt_data))) |> 
    mutate(status_class = ifelse(status_cont > 0.5, "treatment", "control"))
)

Plot the normalization scale factors by the status_class endpoint.

filt_data |> 
  select(SampleId, status_class, all_of(norm_vars)) |> 
  tidyr::pivot_longer(!c(SampleId, status_class),
                      names_to = "Normalization Scale Factor",
                      values_to = "Value") |> 
  ggplot(aes(x = `Normalization Scale Factor`, y = Value, fill = status_class)) +
    geom_boxplot() +
    ylim(0, 2.6) +
    geom_hline(yintercept  = 0.4, linetype = "dashed", color = "red") +
    geom_hline(yintercept  = 2.5, linetype = "dashed", color = "red") +
    theme_bw() +
    scale_fill_manual(values = c("#4067E2", "#59CFDB"))

There does not appear to be any visual normalization bias by status_class. This conclusion is confirmed by the results from t-tests.

norm_vars |>
  as_tibble() |> 
  mutate(
    formula = map(norm_vars, ~ as.formula(paste(.x, "~ status_class"))), # create formula
    t_test  = map(formula, ~ stats::t.test(.x, data = filt_data)),  # fit t-tests
    t_stat  = map_dbl(t_test, "statistic"),            # pull out t-statistic
    p.value = map_dbl(t_test, "p.value"),              # pull out p-values
    fdr     = p.adjust(p.value, method = "BH")         # FDR for multiple testing
  ) 
#> # A tibble: 3 × 6
#>   value           formula   t_test  t_stat p.value   fdr
#>   <chr>           <list>    <list>   <dbl>   <dbl> <dbl>
#> 1 NormScale_20    <formula> <htest>  0.604   0.547 0.585
#> 2 NormScale_0_005 <formula> <htest> -1.58    0.117 0.351
#> 3 NormScale_0_5   <formula> <htest> -0.548   0.585 0.585

Now, check for normalization bias by the continuous endpoint status_cont.

filt_data |> 
  select(SampleId, status_cont, all_of(norm_vars)) |> 
  tidyr::pivot_longer(!c(SampleId, status_cont),
                      names_to = "Normalization Scale Factor",
                      values_to = "Value") |> 
  ggplot(aes(x = status_cont, y = Value, color = `Normalization Scale Factor`)) +
    geom_point() +
    ylim(0, 2.6) +
    geom_hline(yintercept  = 0.4, linetype = "dashed", color = "red") +
    geom_hline(yintercept  = 2.5, linetype = "dashed", color = "red") +
    theme_bw() +
    scale_color_manual(values = c("#4067E2", "#59CFDB", "#DB40EF")) +
    facet_wrap(~`Normalization Scale Factor`) +
    theme(legend.position = "none")

There does not appear to be any visual normalization bias by the status_cont continuous endpoint. This conclusion is confirmed by the results from correlation tests.

norm_vars |>
  as_tibble() |> 
  mutate(
    formula = map(norm_vars, ~ as.formula(paste("~ status_cont +", .x))), # create formula
    cor     = map(formula, ~ stats::cor.test(.x, data = filt_data)),  # calculate correlations
    pearson = map_dbl(cor, "statistic"),            # pull out test statistic
    p.value = map_dbl(cor, "p.value"),              # pull out p-values
    fdr     = p.adjust(p.value, method = "BH")      # FDR for multiple testing
  ) 
#> # A tibble: 3 × 6
#>   value           formula   cor     pearson p.value   fdr
#>   <chr>           <list>    <list>    <dbl>   <dbl> <dbl>
#> 1 NormScale_20    <formula> <htest>   0.132   0.895 0.895
#> 2 NormScale_0_005 <formula> <htest>   0.674   0.501 0.853
#> 3 NormScale_0_5   <formula> <htest>   0.571   0.569 0.853

If there is a considerable normalization bias observed in your ADAT, an alternate normalization should be used for univariate analyses, and any planned analyses may need to be modified to account for this bias.

Transformations

SomaScan data is typically delivered to end users as normalized, untransformed RFUs. Many analytic operations are best performed using these untransformed values, including fold-change calculations, bridging RFUs across assay versions, calculating coefficients of variation (CV) for replicate samples, and comparing study-specific RFUs to population reference values provided in the SomaScan Menu Tool.

While many operations are conducted on untransformed RFU values, it is important to note that for each SeqId, the distribution of biological sample RFUs generally follows a right-tailed, log-normal distribution. Let’s take a look at the distribution of one example analyte feature, seq.10023.32, before any transformations are applied.

filt_data |> 
  ggplot(aes(x = seq.10023.32)) +
  geom_density(fill = "#59CFDB", color = "#59CFDB", alpha = 0.8) +
  theme_bw()

There is a log-normal distribution, with a long right-tail in the distribution of raw RFU values, and this is the assumed distribution of untransformed analytes provided in an ADAT file.

Log-10

Performing a log10 transformation is recommended prior to using methods that assume normally distributed measurements. Please note that real-world data rarely conforms exactly to ideal distributional assumptions and analysts should feel free to explore alternative transformations that may better suit their specific data and statistical test assumptions. However, we have found that the log10 transformation is broadly appropriate for standard parametric analyses across the SomaScan menu.

The SomaDataIO package provides a log10() math generic function for the soma_adat class, which will log-10 transform all SOMAmer analyte features within an ADAT file.

filt_data <- filt_data |> 
  log10()

Centering and Scaling (Z-Score)

A further transformation to consider applying to the analyte features is a Z-score transformation if your experiment includes any multivariate analysis. To compare relative differential signal across analyte features, centering and scaling all of the RFU values should be applied through a Z-score transformation:

$Z = \frac{x - \mu}{\sigma}$

where $x$ is the observed value, $\mu$ , is the mean of the sample, and $\sigma$ is the standard deviation of the sample. The z-score transformation is typically performed on log10-transformed RFUs.

This can be done by creating a simple function, center_scale(), and applying it to all analytes in filt_data.

# center/scale
center_scale <- function(.x) {    # .x = numeric vector
  out <- .x - mean(.x)  # center
  out / sd(out)         # scale
}

filt_data <- filt_data |> 
  mutate(across(getAnalytes(filt_data), center_scale))

Now, re-examine the distribution of the same analyte feature seq.10023.32:

filt_data |> 
  ggplot(aes(x = seq.10023.32)) +
  geom_density(fill = "#59CFDB", color = "#59CFDB", alpha = 0.8) +
  theme_bw()

After applying the transformations, the distribution of all of the pre-processed analyte features should look similar to the above. If true modal behavior of any analytes are observed, it may require secondary assessment and further evaluation.

While centering and scaling standardizes the RFU distributions across all SeqIds for multivariate analysis, it is important to understand that this does not enable meaningful comparison of expression values between different SeqIds. Take, for instance, hypothetical SeqId A with a z-score of 2 and hypothetical SeqId B with a z-score of -1. One cannot infer that the protein target of SeqId A was present at a higher concentration than the target of SeqId B in the original sample prep. All comparisons should be made between sample groups within a given SeqId. The RFU value, as well as log10 RFU and z-score, is dependent on variety of analyte-specific factors such as the SOMAmer Reagent-target protein binding kinetics, the dilution bin of the SOMAmer reagent, NHS-biotin conjugation efficiency of the SOMAmer, and other factors intrinsic to the SOMAmer Reagent within the SomaScan assay.

`preProcessAdat()` function

The preProcessAdat() function is available to perform the steps outlined in this vignette. By default, it will filter features and samples using the standard QC and normalization acceptance criteria described earlier, but will not drop sample-level RFU outliers. It also has the option to perform log-10 and center & scale transformations to the untransformed RFU values. If data QC plots by endpoints or clinical variables are desired,the names of the variables should be explicitly passed to the data.qc argument. Please see the preProcessAdat() function documentation for more details.

# first recreate outlier and endpoints, and add to original example_data object
apts <- getAnalytes(example_data)
example_data[12, apts[1:600]] <- example_data[12, apts[1:600]] * 100  # 600 apts ~ 12%

example_data <- withr::with_seed(101,
  example_data |> 
    mutate(status_cont  = runif(nrow(example_data))) |> 
    mutate(status_class = ifelse(status_cont > 0.5, "treatment", "control"))
)

processed_data <- preProcessAdat(adat            = example_data,
                                 filter.features = TRUE,
                                 filter.controls = TRUE,
                                 filter.qc       = TRUE,
                                 filter.outliers = FALSE,
                                 data.qc         = c("status_class",
                                                     "status_cont"),
                                 log.10          = FALSE,
                                 center.scale    = FALSE)
#> ✔ 305 non-human protein features were removed.
#> → 214 human proteins did not pass standard QC
#> acceptance criteria and were flagged in `ColCheck`.  These features
#> were not removed, as they still may yield useful information in an
#> analysis, but further evaluation may be needed.
#> ✔ 6 buffer samples were removed.
#> ✔ 10 calibrator samples were removed.
#> ✔ 6 QC samples were removed.
#> ✔ 2 samples flagged in `RowCheck` did not
#> pass standard normalization acceptance criteria (0.4 <= x <= 2.5)
#> and were removed.
#> → Data QC plots were generated:
#> $status_class

#> 
#> $status_cont


processed_data
#> ══ SomaScan Data ══════════════════════════════════════════════════════
#>      SomaScan version     V4 (5k)
#>      Signal Space         5k
#>      Attributes intact    ✓
#>      Rows                 168
#>      Columns              5015
#>      Clinical Data        36
#>      Features             4979
#> ── Column Meta ────────────────────────────────────────────────────────
#> ℹ SeqId, SeqIdVersion, SomaId, TargetFullName, Target,
#> ℹ UniProt, EntrezGeneID, EntrezGeneSymbol, Organism, Units,
#> ℹ Type, Dilution, PlateScale_Reference, CalReference,
#> ℹ Cal_Example_Adat_Set001, ColCheck,
#> ℹ CalQcRatio_Example_Adat_Set001_170255, QcReference_170255,
#> ℹ Cal_Example_Adat_Set002,
#> ℹ CalQcRatio_Example_Adat_Set002_170255, Dilution2
#> ── Tibble ─────────────────────────────────────────────────────────────
#> # A tibble: 168 × 5,016
#>    row_names      PlateId  PlateRunDate ScannerID PlatePosition SlideId
#>    <chr>          <chr>    <chr>        <chr>     <chr>           <dbl>
#>  1 258495800012_3 Example… 2020-06-18   SG152144… H9            2.58e11
#>  2 258495800004_7 Example… 2020-06-18   SG152144… H8            2.58e11
#>  3 258495800010_8 Example… 2020-06-18   SG152144… H7            2.58e11
#>  4 258495800003_4 Example… 2020-06-18   SG152144… H6            2.58e11
#>  5 258495800009_4 Example… 2020-06-18   SG152144… H5            2.58e11
#>  6 258495800012_8 Example… 2020-06-18   SG152144… H4            2.58e11
#>  7 258495800001_3 Example… 2020-06-18   SG152144… H3            2.58e11
#>  8 258495800004_8 Example… 2020-06-18   SG152144… H2            2.58e11
#>  9 258495800001_8 Example… 2020-06-18   SG152144… H12           2.58e11
#> 10 258495800009_8 Example… 2020-06-18   SG152144… H10           2.58e11
#> # ℹ 158 more rows
#> # ℹ 5,010 more variables: Subarray <dbl>, SampleId <chr>,
#> #   SampleType <chr>, PercentDilution <int>, SampleMatrix <chr>,
#> #   Barcode <lgl>, Barcode2d <chr>, SampleName <lgl>,
#> #   SampleNotes <lgl>, AliquotingNotes <lgl>, …
#> ═══════════════════════════════════════════════════════════════════════

Questions

As always, if you have any pre-processing questions, we are here to help. Please reach out to us via:

GitHub SUPPORT
Global Scientific Engagement Team: techsupport@somalogic.com
General SomaScan inquiries: support@somalogic.com

SomaScan Bioinformatics Team, Standard BioTools, Inc.