Training DCA models
All versions of adabmDCA (Python, Julia, and C++) expose the same command-line interface through the adabmDCA command.
To see the complete list of training options:
$ adabmDCA train -h
The standard command to start training a DCA model is:
$ adabmDCA train -m <model> -d <fasta_file> -o <output_folder> -l <label>
Arguments
- <model>: one of bmDCA, eaDCA, or edDCA. Selects the training routine. By default, the fully connected bmDCA algorithm is used. edDCA can follow two different routines: either it decimates a pre-trained bmDCA model, or it first trains a bmDCA model and then decimates it.
- <fasta_file>: path to the FASTA file containing the training MSA.
- <output_folder>: folder where results will be stored (created if missing).
- <label>: optional tag for output files.
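For example, a minimal run on a protein MSA could look like this (the FASTA file, output folder, and label are placeholders, not files shipped with adabmDCA):
$ adabmDCA train -m bmDCA -d my_family.fasta -o results -l my_family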
Training Behavior
Training stops when the Pearson correlation between model and empirical connected correlations reaches the target value (default: 0.95).
- Early training is fast (e.g., Pearson ≈ 0.9 after ~100 iterations).
- Approaching higher values takes significantly longer (power-law decay).
For a quick coarse model, set:
--target 0.9
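As an illustration, the flag is simply appended to the standard training command (placeholder file names as above):
$ adabmDCA train -m bmDCA -d my_family.fasta -o results -l my_family --target 0.9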
Output Files
During training, adabmDCA maintains three output files:
- <label>_params.dat: non-zero model parameters. Lines starting with J are couplings; lines starting with h are biases.
- <label>_chains.fasta: current state of the Markov chains.
- <label>_adabmDCA.log: log file updated throughout training.
Update intervals:
- bmDCA: every 50 updates
- eaDCA, edDCA: every 10 updates
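Because couplings and biases are written on lines starting with J and h, a quick sanity check on the number of stored parameters can be done with standard shell tools (assuming the parameter file sits in the output folder and carries the placeholder label used above):
$ grep -c '^J' results/my_family_params.dat   # number of non-zero couplings
$ grep -c '^h' results/my_family_params.dat   # number of bias terms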
Restoring Interrupted Training
Resume training using:
$ adabmDCA train [...] -p <file_params> -c <file_chains>
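For instance, to resume the placeholder run from above using the parameter and chain files it last wrote:
$ adabmDCA train -m bmDCA -d my_family.fasta -o results -l my_family -p results/my_family_params.dat -c results/my_family_chains.fasta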
Importance Weights
Provide custom weights with:
--weights <path>
Otherwise, weights are computed automatically and stored as:
<label>_weights.dat
Options:
- --clustering_seqid <value>: sequence identity threshold used when computing the weights (default: 0.8)
- --no_reweighting: use uniform weights
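As an illustration (placeholder paths), either supply precomputed weights or adjust the identity threshold used for the automatic reweighting:
$ adabmDCA train -m bmDCA -d my_family.fasta -o results -l my_family --weights my_weights.dat
$ adabmDCA train -m bmDCA -d my_family.fasta -o results -l my_family --clustering_seqid 0.9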
Choosing the Alphabet
Default alphabet: protein.
Specify alternatives:
- RNA: --alphabet rna
- DNA: --alphabet dna
- Custom: an arbitrary token string, e.g. --alphabet ABCD-
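For example, to train on an RNA alignment (placeholder file names):
$ adabmDCA train -m bmDCA -d my_rna_family.fasta -o results -l my_rna_family --alphabet rna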
eaDCA
Enable with:
--model eaDCA
Key hyperparameters:
- --factivate: fraction of inactive couplings activated (default: 0.001)
- --gsteps: parameter updates per graph update (default: 10)
Recommended: reduce the number of sweeps to 5 (--nsweeps 5), as in the sketch below.
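A sketch of an eaDCA run combining these options, using the default hyperparameters except for the reduced number of sweeps (placeholder file names):
$ adabmDCA train -m eaDCA -d my_family.fasta -o results -l my_family_ea --nsweeps 5 --factivate 0.001 --gsteps 10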
edDCA (Decimated DCA)
Run decimation:
$ adabmDCA train -m edDCA -d <fasta_file> -p <params> -c <chains>
Two workflows:
- Use a pre-trained bmDCA model (provide its params and chains files)
- Train a bmDCA model automatically, then decimate it
Key hyperparameters:
- --gsteps: parameter updates per graph update (default: 10)
- --drate: pruning fraction (default: 0.01)
- --density: target graph density (default: 0.02)
- --target: Pearson threshold (default: 0.95)
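For example, decimating a pre-trained bmDCA model down to the default target density (all paths are placeholders):
$ adabmDCA train -m edDCA -d my_family.fasta -o results -l my_family_ed -p results/my_family_params.dat -c results/my_family_chains.fasta --density 0.02 --drate 0.01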
Choosing Hyperparameters
Defaults work well for clean and moderately diverse MSAs. For more difficult datasets, consider tuning:
Learning Rate
- Default: 0.01
- If chains mix poorly, try:
--lr 0.005
Number of Markov Chains
- Default: 10,000
- Using fewer chains reduces the memory required for training, but it may also slow convergence.
- Change with:
--nchains <value>
Number of Monte Carlo Steps
- Controlled by --nsweeps
- Default: 10
- Recommended range: 10-50. Higher values drastically increase training time and, empirically, do little to improve model convergence.
Regularization (Pseudocount)
Controlled by --pseudocount.
Default: α = 1 / M_eff, where M_eff is the effective number of sequences in the MSA.
Increasing α (e.g. α = 0.001 or 0.01) may help when training struggles to converge or when the model's mixing time is very long, but it also makes the model less expressive.
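Putting these knobs together, a hedged example of a more conservative run on a difficult MSA (the file names and values are illustrative, not prescriptions):
$ adabmDCA train -m bmDCA -d hard_family.fasta -o results -l hard_family --lr 0.005 --nchains 5000 --nsweeps 20 --pseudocount 0.001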