The SeqId
is the cornerstone used to uniquely identify
SomaLogic analytes.
SeqIds
follow the format <Pool>-<Clone>_<Version>
, for example
"1234-56_7"
can be represented as:
Pool | Clone | Version |
1234 | 56 | 7 |
See Details below for the definition of each sub-unit.
The <Pool>-<Clone>
combination is sufficient to uniquely identify a
specific analyte and therefore versions are no longer provided (though
they may be present in legacy ADATs).
The tools below enable users to extract, test, identify, compare,
and manipulate SeqIds
across assay runs and/or versions.
Usage
getSeqId(x, trim.version = FALSE)
regexSeqId()
locateSeqId(x, trailing = TRUE)
seqid2apt(x)
apt2seqid(x)
is.apt(x)
is.SeqId(x)
matchSeqIds(x, y, order.by.x = TRUE)
getSeqIdMatches(x, y, show = FALSE)
Arguments
- x
Character. A vector of strings, usually analyte/feature column names,
AptNames
, orSeqIds
. Forseqid2apt()
, a vector ofSeqIds
. Forapt2seqid()
, a character vector containingSeqIds
. FormatchSeqIds()
, a vector of pattern matches containingSeqIds
. Can beAptNames
withGeneIDs
, theseq.XXXX
format, or even "naked"SeqIds
.- trim.version
Logical. Whether to remove the version number, i.e. "1234-56_7" -> "1234-56". Primarily for legacy ADATs.
- trailing
Logical. Should the regular expression explicitly specify trailing
SeqId
pattern match, i.e."regex$"
? This is the most common case and the default.- y
Character. A second vector of
AptNames
containingSeqIds
to match against those in contained inx
. FormatchSeqIds()
these values are returned if there are matching elements.- order.by.x
Logical. Order the returned character string by the
x
(first) argument?- show
Logical. Return the data frame visibly?
Value
getSeqId()
: a character vector of SeqIds
captured from a string.
regexSeqId()
: a regular expression (regex
) string
pre-defined to match SomaLogic the SeqId
pattern.
locateSeqId()
: a data frame containing the start
and stop
integer positions for SeqId
matches at each value of x
.
seqid2apt()
: a character vector with the seq.*
prefix, i.e.
the inverse of getSeqId()
.
apt2seqid()
: a character vector of SeqIds
. is.SeqId()
will
return TRUE
for all elements.
is.apt()
, is.SeqId()
: Logical. TRUE
or FALSE
.
matchSeqIds()
: a character string corresponding to values
in y
of the intersect of x
and y
. If no matches are
found, character(0)
.
getSeqIdMatches()
: a \(n x 2\) data frame, where n
is the
length of the intersect of the matching SeqIds
.
The data frame is named by the passed arguments, x
and y
.
Details
Pool: | ties back to the original well during SELEX |
Clone: | ties to the specific sequence within a pool |
Version: | refers to custom modifications (optional/defunct) |
AptName
a
SeqId
combined with a string, usually aGeneId
- orseq.
-prefix, for convenient, human-readable manipulation from withinR
.
Functions
getSeqId()
: extracts/captures the theSeqId
match from an analyte column identifier, i.e. column name of an ADAT loaded withread_adat()
. Assumes theSeqId
pattern occurs at the end of the string, which for the vast majority of cases will be true. For edge cases, see thetrailing
argument tolocateSeqId()
.regexSeqId()
: generates a pre-formatted regular expression for matching ofSeqIds
. Note the trailing match, which is most commonly required, butlocateSeqId()
offers an alternative to mach anywhere in a string. Used internally in many utility functionslocateSeqId()
: generates a data frame of the positionalSeqId
matches. Specifically designed to facilitateSeqId
extraction viasubstr()
. Similar tostringr::str_locate()
.seqid2apt()
: converts aSeqId
into anonymous-AptName format, i.e.1234-56
->seq.1234.56
. Version numbers (1234-56_ver
) are always trimmed when present.apt2seqid()
: converts an anonymous-AptName intoSeqId
format, i.e.seq.1234.56
->1234-56
. Version numbers (seq.1234.56.ver
) are always trimmed when present.is.apt()
: regular expression match to determine if a string contains aSeqId
, and thus is probably anAptName
format string. Both legacyEntrezGeneSymbol-SeqId
combinations or newer so-called"anonymous-AptNames"
formats (seq.1234.45
) are matched.is.SeqId()
: tests forSeqId
format, i.e. values returned fromgetSeqId()
will always returnTRUE
.matchSeqIds()
: matches two character vectors on the basis of their intersectingSeqIds
. Note that elements iny
not containing aSeqId
regular expression are silently dropped.getSeqIdMatches()
: matches two character vectors on the basis of their intersecting SeqIds only (irrespective of theGeneID
-prefix). This produces a two-column data frame which then can be used as to map between the two sets.The final order of the matches/rows is by the input corresponding to the first argument (
x
).By default the data frame is invisibly returned to avoid dumping excess output to the console (see the
show =
argument.)
Examples
x <- c("ABDC.3948.48.2", "3948.88",
"3948.48.2", "3948-48_2", "3948.48.2",
"3948-48_2", "3948-88",
"My.Favorite.Apt.3948.88.9")
tibble::tibble(orig = x,
SeqId = getSeqId(x),
SeqId_trim = getSeqId(x, TRUE),
AptName = seqid2apt(SeqId))
#> # A tibble: 8 × 4
#> orig SeqId SeqId_trim AptName
#> <chr> <chr> <chr> <chr>
#> 1 ABDC.3948.48.2 3948-48_2 3948-48 seq.3948.48
#> 2 3948.88 3948-88 3948-88 seq.3948.88
#> 3 3948.48.2 3948-48_2 3948-48 seq.3948.48
#> 4 3948-48_2 3948-48_2 3948-48 seq.3948.48
#> 5 3948.48.2 3948-48_2 3948-48 seq.3948.48
#> 6 3948-48_2 3948-48_2 3948-48 seq.3948.48
#> 7 3948-88 3948-88 3948-88 seq.3948.88
#> 8 My.Favorite.Apt.3948.88.9 3948-88_9 3948-88 seq.3948.88
# Logical Matching
is.apt("AGR2.4959.2") # TRUE
#> [1] TRUE
is.apt("seq.4959.2") # TRUE
#> [1] TRUE
is.apt("4959-2") # TRUE
#> [1] TRUE
is.apt("AGR2") # FALSE
#> [1] FALSE
# SeqId Matching
x <- c("seq.4554.56", "seq.3714.49", "PlateId")
y <- c("Group", "3714-49", "Assay", "4554-56")
matchSeqIds(x, y)
#> [1] "4554-56" "3714-49"
matchSeqIds(x, y, order.by.x = FALSE)
#> [1] "3714-49" "4554-56"
# vector of features
feats <- getAnalytes(example_data)
match_df <- getSeqIdMatches(feats[1:100], feats[90:500]) # 11 overlapping
match_df
#> feats[1:100] feats[90:500]
#> 1 seq.10461.57 seq.10461.57
#> 2 seq.10462.14 seq.10462.14
#> 3 seq.10464.6 seq.10464.6
#> 4 seq.10467.58 seq.10467.58
#> 5 seq.10471.25 seq.10471.25
#> 6 seq.10476.23 seq.10476.23
#> 7 seq.10477.162 seq.10477.162
#> 8 seq.10485.56 seq.10485.56
#> 9 seq.10489.19 seq.10489.19
#> 10 seq.10490.3 seq.10490.3
#> 11 seq.10491.21 seq.10491.21
a <- utils::head(feats, 15)
b <- withr::with_seed(99, sample(getSeqId(a))) # => SeqId & shuffle
(getSeqIdMatches(a, b)) # sorted by first vector "a"
#> a b
#> 1 seq.10000.28 10000-28
#> 2 seq.10001.7 10001-7
#> 3 seq.10003.15 10003-15
#> 4 seq.10006.25 10006-25
#> 5 seq.10008.43 10008-43
#> 6 seq.10011.65 10011-65
#> 7 seq.10012.5 10012-5
#> 8 seq.10013.34 10013-34
#> 9 seq.10014.31 10014-31
#> 10 seq.10015.119 10015-119
#> 11 seq.10021.1 10021-1
#> 12 seq.10022.207 10022-207
#> 13 seq.10023.32 10023-32
#> 14 seq.10024.44 10024-44
#> 15 seq.10030.8 10030-8