Input data and preprocessing
Input Data & Preprocessing
📥 Input Format
adabmDCA 2.0
takes as input a multiple sequence alignment (MSA) in FASTA format, typically of aligned protein or RNA/DNA sequences (see Fig. 1).
The tool supports three built-in alphabets and also allows custom alphabets, as long as they match the MSA content.
Type | Alphabet Symbols |
---|---|
protein | -, A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y |
RNA | -, A, C, G, U |
DNA | -, A, C, G, T |
💡 Line breaks within sequences are supported.
Figure 1: Example of a fasta file containg the MSA.
🔧 Preprocessing Steps
The following steps are applied to every input MSA:
- Remove sequences with invalid symbols.
- Remove duplicate sequences.
- Reweight sequences to correct for phylogenetic and sampling bias (optional).
- Compute empirical statistics using a pseudocount.
⚖️ Sequence Reweighting
To downweight overrepresented or phylogenetically related sequences, adabmDCA
uses a clustering threshold (default: 80% identity). The weight for sequence \(\mathbf{a}^{(m)}\) is:
where \(N^{(m)}\) is the number of sequences that have sequence identity with \(\mathbf{a}^{(m)}\) above the clustering threshold.
- Set the threshold with
--clustering_seqid <value>
- Disable with
--no_reweighting
🧮 Pseudocount Regularization
A small pseudocount \(\alpha\) is added to frequency estimates to prevent issues with rare or unobserved symbols:
-
One-site frequency:
\(f_i(a) = (1 - \alpha) f^{\mathrm{data}}_i(a) + \frac{\alpha}{q}\) -
Two-site frequency:
\(f_{ij}(a, b) = (1 - \alpha) f^{\mathrm{data}}_{ij}(a, b) + \frac{\alpha}{q^2}\)
If not set via --pseudocount
, the default is:
$$
\alpha = \frac{1}{M_{\text{eff}}}, \quad \text{with} \quad M_{\text{eff}} = \sum_{m=1}^M w^{(m)}
$$
being the effective number of sequences.