Training DCA models
All versions of adabmDCA — Python, Julia, and C++ — expose the same command-line interface through the adabmDCA command.
To see the complete list of training options:
$ adabmDCA train -h
The standard command to start training a DCA model is:
$ adabmDCA train -m <model> -d <fasta_file> -o <output_folder> -l <label>
Arguments
- <model> ∈ {bmDCA, eaDCA, edDCA, edgeDCA} – Selects the training routine. By default, the fully connected bmDCA algorithm is used. eaDCA and edgeDCA build sparse models by activating couplings during training, while edDCA sparsifies a model by decimation. edDCA can either decimate a pre-trained bmDCA model, or first train a bmDCA model and then decimate it.
- <fasta_file> – Path to the FASTA file containing the training MSA.
- <output_folder> – Folder where results will be stored (created if missing).
- <label> – Optional tag for output files.
Training Behavior
Training stops when the Pearson correlation between model and empirical connected correlations reaches the target value (default: 0.95).
- Early training is fast (e.g., Pearson ≈ 0.9 after ~100 iterations).
- Pushing the Pearson correlation closer to 1 takes significantly longer (convergence slows down following a power-law decay).
For a quick coarse model, set:
--target 0.9
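The stopping criterion can be illustrated with a small NumPy sketch. It assumes one-hot encoded sequences, unweighted averages, and a comparison restricted to pairs of distinct sites; the function names are illustrative and are not part of adabmDCA.

```python
import numpy as np

def connected_correlations(onehot):
    """Connected two-site correlations C_ij(a,b) = f_ij(a,b) - f_i(a) f_j(b).

    onehot: array of shape (M, L, q) holding one-hot encoded sequences.
    Returns an array of shape (L, q, L, q).
    """
    M, L, q = onehot.shape
    flat = onehot.reshape(M, L * q).astype(float)
    fi = flat.mean(axis=0)            # single-site frequencies
    fij = flat.T @ flat / M           # two-site frequencies
    return (fij - np.outer(fi, fi)).reshape(L, q, L, q)

def pearson_cij(data_onehot, model_onehot):
    """Pearson correlation between data and model connected correlations,
    computed over site pairs i < j only."""
    L = data_onehot.shape[1]
    c_data = connected_correlations(data_onehot)
    c_model = connected_correlations(model_onehot)
    i, j = np.triu_indices(L, k=1)
    x = c_data[i, :, j, :].ravel()
    y = c_model[i, :, j, :].ravel()
    return float(np.corrcoef(x, y)[0, 1])

# Toy data: random "MSA" and random "chains" (hypothetical, for illustration).
rng = np.random.default_rng(0)
q, L = 4, 6
data = np.eye(q)[rng.integers(0, q, size=(500, L))]
model = np.eye(q)[rng.integers(0, q, size=(500, L))]

target = 0.95
pearson = pearson_cij(data, model)
converged = pearson >= target
```

In this picture, training would continue until `converged` becomes true; lowering `target` to 0.9 simply relaxes the test.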
Output Files
During training, adabmDCA maintains three output files:
- <label>_params.dat – Non-zero model parameters
  - Lines starting with J → couplings
  - Lines starting with h → biases
- <label>_chains.fasta – State of the Markov chains
- <label>_adabmDCA.log – Log file updated throughout training
Update intervals:
- bmDCA: every 50 updates
- eaDCA, edDCA, edgeDCA: every 10 updates
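As a rough illustration of how the parameter file could be post-processed, the sketch below separates J and h lines. The column layout (`J <i> <j> <a> <b> <value>` and `h <i> <a> <value>`) is an assumption that the text above does not spell out, so check it against a real file before relying on it; `read_params` is a hypothetical helper, not an adabmDCA function.

```python
import io

def read_params(handle):
    """Split a parameter file into couplings and biases.

    ASSUMED layout (not confirmed by the documentation above):
      J <i> <j> <a> <b> <value>
      h <i> <a> <value>
    """
    couplings, biases = {}, {}
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if fields[0] == "J":
            i, j, a, b, value = fields[1:6]
            couplings[(int(i), int(j), a, b)] = float(value)
        elif fields[0] == "h":
            i, a, value = fields[1:4]
            biases[(int(i), a)] = float(value)
    return couplings, biases

# In-memory stand-in for a <label>_params.dat file (made-up values).
example = io.StringIO("J 0 1 A C 0.25\nJ 0 2 A A -0.10\nh 0 A 1.50\n")
couplings, biases = read_params(example)
```

With a real file, the same function can be called on an open file handle instead of the `io.StringIO` stand-in.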
Restoring Interrupted Training
Resume training using:
$ adabmDCA train [...] -p <file_params> -c <file_chains>
Importance Weights
Provide custom weights with:
--weights <path>
Otherwise, weights are computed automatically and stored as:
<label>_weights.dat
Options:
- --clustering_seqid <value> – default: 0.8
- --no_reweighting – use uniform weights
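For intuition, here is a minimal sketch of the reweighting scheme commonly used in DCA: each sequence is weighted by the inverse of the number of sequences whose identity to it is at least the threshold. Whether adabmDCA uses exactly this rule (for example, `>=` rather than `>`) is an assumption, and `sequence_weights` is an illustrative name.

```python
import numpy as np

def sequence_weights(msa, seqid=0.8):
    """msa: integer-encoded alignment of shape (M, L). Returns M weights."""
    msa = np.asarray(msa)
    # Pairwise fraction of identical positions, shape (M, M).
    identity = (msa[:, None, :] == msa[None, :, :]).mean(axis=2)
    # Every sequence counts itself, so each neighbourhood has size >= 1.
    n_similar = (identity >= seqid).sum(axis=1)
    return 1.0 / n_similar

msa = np.array([
    [0, 1, 2, 3, 0, 1, 2, 3, 0, 1],
    [0, 1, 2, 3, 0, 1, 2, 3, 0, 2],   # 90% identical to the first sequence
    [3, 2, 1, 0, 3, 2, 1, 0, 3, 0],   # shares no position with the others
])
weights = sequence_weights(msa, seqid=0.8)
m_eff = weights.sum()   # effective number of sequences under this scheme
```

The two near-duplicate sequences share their weight, while the unrelated one keeps a weight of 1. The pairwise comparison uses O(M² L) memory, so this form is only suitable for small alignments.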
Choosing the Alphabet
Default alphabet: protein.
Specify alternatives:
- RNA → --alphabet rna
- DNA → --alphabet dna
- Custom → --alphabet ABCD-
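A custom alphabet is just an ordered string of symbols. The sketch below shows one plausible way such a string maps sequences to integer states; the ordering adabmDCA uses internally is an assumption here, and `encode` is a hypothetical helper.

```python
import numpy as np

def encode(sequences, alphabet="ABCD-"):
    """Map each symbol to its position in the alphabet string."""
    index = {symbol: k for k, symbol in enumerate(alphabet)}
    return np.array([[index[s] for s in seq] for seq in sequences])

encoded = encode(["ABCD-", "DDA-B"])
q = len("ABCD-")   # number of states per site
```

A symbol that is missing from the alphabet raises a `KeyError` in this sketch, which is a convenient way to catch an alignment that does not match the chosen alphabet.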
eaDCA
Enable with:
--model eaDCA
Key hyperparameters:
- --factivate – fraction of inactive couplings activated (default: 0.001)
- --gsteps – parameter updates per graph update (default: 10)
Recommended: reduce the number of sweeps to 5 (--nsweeps 5).
edDCA (Decimated DCA)
Run decimation:
$ adabmDCA train -m edDCA -d <fasta_file> -p <params> -c <chains>
Two workflows:
- Use a pre-trained bmDCA model (params + chains)
- Train bmDCA automatically, then decimate
Key hyperparameters:
- --gsteps – default: 10
- --drate – pruning fraction (default: 0.01)
- --density – target graph density (default: 0.02)
- --target – Pearson threshold (default: 0.95)
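To get a feel for how --drate and --density interact, the sketch below counts pruning steps under the assumption that each step removes a fraction `drate` of the couplings that are still active, starting from a fully connected graph. That reading of "pruning fraction" is an assumption, and `decimation_schedule` is an illustrative name.

```python
def decimation_schedule(drate=0.01, target_density=0.02, start_density=1.0):
    """Graph density after each pruning step, assuming every step removes
    a fraction `drate` of the couplings that are still active."""
    densities = [start_density]
    while densities[-1] > target_density:
        densities.append(densities[-1] * (1.0 - drate))
    return densities

schedule = decimation_schedule()
n_steps = len(schedule) - 1   # number of pruning steps to reach the target
```

Under this assumption, the defaults need about 390 pruning steps to go from density 1 to 0.02, which is why raising --drate shortens decimation considerably.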
edgeDCA (Edge Activation DCA)
Enable with:
--model edgeDCA
edgeDCA trains a sparse model by starting from an empty coupling graph and progressively activating whole residue-residue edges. At each graph update, the algorithm compares empirical and model two-site statistics, selects the inactive edge with the largest KL discrepancy, activates all couplings for that pair of sites, and initializes them from the empirical/model frequency ratio. The Markov chains are then resampled and the process repeats until the target Pearson correlation is reached or the maximum number of epochs is exceeded.
This differs from eaDCA, which activates individual coupling entries. edgeDCA activates complete site pairs, so it is useful when you want a sparse interaction graph at the residue-pair level.
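The edge-selection step described above can be sketched as follows. This covers only the choice of the next edge (largest KL divergence among inactive site pairs), not the coupling initialization or the resampling; array layouts and the function name are assumptions made for illustration.

```python
import numpy as np

def pick_edge(f_data, f_model, active, eps=1e-12):
    """Return the inactive site pair (i, j), i < j, with the largest KL
    divergence between empirical and model two-site frequencies.

    f_data, f_model: arrays of shape (L, L, q, q); each [i, j] block sums to 1.
    active: boolean array of shape (L, L) marking already active edges.
    """
    # KL(f_data || f_model) for every site pair; eps guards against log(0).
    kl = np.sum(f_data * np.log((f_data + eps) / (f_model + eps)), axis=(2, 3))
    candidates = np.triu(~active, k=1)   # inactive pairs with i < j
    if not candidates.any():
        return None                      # every edge is already active
    kl = np.where(candidates, kl, -np.inf)
    i, j = np.unravel_index(np.argmax(kl), kl.shape)
    return int(i), int(j)

# Toy example: only sites 0 and 2 are correlated in the data.
L, q = 3, 2
f_model = np.full((L, L, q, q), 1.0 / q**2)
f_data = f_model.copy()
f_data[0, 2] = f_data[2, 0] = np.array([[0.5, 0.0], [0.0, 0.5]])
active = np.zeros((L, L), dtype=bool)

edge = pick_edge(f_data, f_model, active)
```

After an edge is chosen, a full loop would mark it in `active`, set its couplings, resample the chains, and recompute `f_model` before the next selection.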
Important defaults:
- --pseudocount: if not set, defaults to 0.1 for edgeDCA
- --lr: ignored by edgeDCA
- --target: Pearson threshold on two-site statistics, default 0.95
- --nsweeps: Monte Carlo sweeps between edge activations, default 10
For edgeDCA, the pseudocount acts as an effective learning rate for edge activation: the closer the pseudocount is to 1, the smaller the effective learning rate is. Some datasets can require much stronger regularization, with --pseudocount values up to 0.95. The explicit learning-rate parameter --lr is not used by edgeDCA.
Example:
$ adabmDCA train -m edgeDCA -d <fasta_file> -o <output_folder>
Choosing Hyperparameters
Defaults work well for clean and moderately diverse MSAs. For more difficult datasets, consider tuning:
Learning Rate
- Default: 0.01
- If chains mix poorly, try: --lr 0.005
- For edgeDCA, --lr is ignored; tune --pseudocount instead.
Number of Markov Chains
- Default: 10,000
- Using fewer chains reduces the memory required for training, but may also slow convergence.
- Change with:
--nchains <value>
Number of Monte Carlo Steps
- Controlled by --nsweeps
- Default: 10
- Recommended range: 10–50. Higher values drastically increase training time and, empirically, do little to improve convergence.
Regularization (Pseudocount)
Controlled by --pseudocount.
Default for bmDCA, eaDCA, and edDCA:
α = 1 / M_eff
Default for edgeDCA:
α = 0.1
For bmDCA, eaDCA, and edDCA, increasing α (e.g. α = 0.001 or 0.01) may help when training struggles to converge or when the mixing time of the model is very high, but it also makes the model less expressive. For edgeDCA, α controls the effective learning rate of edge activation: values closer to 1 make updates smaller, and values up to 0.95 can be useful on some datasets. edgeDCA still targets the original statistics, computed without pseudocount, so the pseudocount does not reduce the expressivity of the model.
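As a concrete picture of what a pseudocount does, the sketch below uses the common DCA convention of mixing the empirical frequencies with a uniform distribution, f_i(a) → (1 − α) f_i(a) + α / q. That adabmDCA applies exactly this form is an assumption; the function name and the value of `m_eff` are hypothetical.

```python
import numpy as np

def apply_pseudocount(fi, alpha):
    """Mix single-site frequencies fi (shape (L, q)) with the uniform
    distribution over the q states."""
    q = fi.shape[1]
    return (1.0 - alpha) * fi + alpha / q

fi = np.array([[1.0, 0.0, 0.0, 0.0],
               [0.5, 0.5, 0.0, 0.0]])

m_eff = 2000.0   # hypothetical effective number of sequences
fi_default = apply_pseudocount(fi, alpha=1.0 / m_eff)   # α = 1 / M_eff
fi_strong = apply_pseudocount(fi, alpha=0.1)            # α = 0.1
```

Even the small default removes exact zeros from the statistics, while a larger α pulls every frequency noticeably towards 1/q, which is the loss of expressivity mentioned above.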