⚙️ Training

All adabmDCA implementations (Python, Julia, C++) provide a unified command-line interface. Training is performed using the command:

adabmDCA train -m <model> -d <fasta_file> -o <output_folder> -l <label>

Where:

  • -m <model>: Type of model to train. Choose from:
    • bmDCA: Fully-connected Boltzmann Machine (default)
    • eaDCA: Sparse model with progressively activated couplings
    • edDCA: Sparse model obtained by decimating an existing bmDCA model
  • -d <fasta_file>: Path to the input MSA in FASTA format
  • -o <output_folder>: Output directory for storing results
  • -l <label>: Optional label prefix for output files

Info

All options are listed in Script arguments. The same information can be shown from the command line using:

adabmDCA train -h

Training proceeds until the Pearson correlation coefficient between the model and empirical two-point statistics reaches a target threshold (default: --target 0.95).
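
As a reference for what this criterion measures, here is a minimal NumPy sketch (illustrative only, not the library's internal code) that computes the Pearson correlation between the connected two-point statistics of the data and of the model's samples, assuming integer-encoded alignments of shape (M, L) over q states:

import numpy as np

def connected_correlations(msa, q):
    # Connected two-point statistics C_ij(a, b) = f_ij(a, b) - f_i(a) f_j(b)
    # for an integer-encoded alignment of shape (M, L) with entries in [0, q).
    # For simplicity the diagonal i = j blocks are kept here.
    M, L = msa.shape
    one_hot = np.eye(q)[msa].reshape(M, L * q)   # (M, L*q) one-hot encoding
    fij = one_hot.T @ one_hot / M                # pairwise frequencies
    fi = one_hot.mean(axis=0)                    # single-site frequencies
    return fij - np.outer(fi, fi)

def two_point_pearson(data, samples, q):
    # Pearson correlation between data and model connected statistics;
    # training stops once this exceeds the --target threshold.
    c_data = connected_correlations(data, q).ravel()
    c_model = connected_correlations(samples, q).ravel()
    return float(np.corrcoef(c_data, c_model)[0, 1])

In adabmDCA, the empirical statistics are additionally reweighted and regularized with a pseudocount (see the Sequence Weights and Pseudocount sections below).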


📦 Output Files

By default, the following files are generated during training (overwritten every 50 epochs):

  • <label>_params.dat: Model parameters (a parsing sketch follows this list).

    • Lines starting with J encode pairwise couplings: position1, position2, symbol1, symbol2, value
    • Lines starting with h encode single-site biases: position, symbol, value
    • Inactive couplings (i.e. those with value 0) are not saved
  • <label>_chains.fasta: Final state of Markov chains used during sampling

  • <label>_adabmDCA.log: Log file recording training progress
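
Based on the format described above, a small parsing sketch (illustrative only, not part of the adabmDCA API); it assumes whitespace-separated fields J i j a b value and h i a value:

def load_params(path):
    # Parse a <label>_params.dat file into two dictionaries.
    # Assumed line formats (whitespace-separated):
    #     J <pos1> <pos2> <sym1> <sym2> <value>
    #     h <pos> <sym> <value>
    # Couplings absent from the file are implicitly zero.
    couplings, biases = {}, {}
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue
            if fields[0] == "J":
                i, j, a, b, value = fields[1:6]
                couplings[(int(i), int(j), a, b)] = float(value)
            elif fields[0] == "h":
                i, a, value = fields[1:4]
                biases[(int(i), a)] = float(value)
    return couplings, biases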


♻️ Restart from a Checkpoint

To resume an interrupted training, provide saved parameters and chains:

adabmDCA train [...] -p <file_params> -c <file_chains>

This initializes the model and chains from previously saved states.


⚖️ Sequence Weights

You can provide precomputed importance weights via:

-w <weights_file>

Otherwise, weights are computed automatically using sequence identity clustering and saved to:

<output_folder>/<label>_weights.dat

Use --clustering_seqid to control the sequence identity threshold, or --no_reweighting to disable reweighting.
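
The automatic scheme follows the usual DCA recipe: each sequence is weighted by the inverse of the number of sequences within the identity threshold. A minimal sketch of this idea (the library's exact implementation may differ; the 0.8 threshold used here is a common choice in the DCA literature, not necessarily the tool's default):

import numpy as np

def compute_weights(msa, seqid=0.8):
    # Weight each sequence by 1 / (number of sequences, itself included,
    # whose fractional identity with it is at least `seqid`).
    # O(M^2 L) memory: fine for moderate alignments, illustrative only.
    identity = (msa[:, None, :] == msa[None, :, :]).mean(axis=2)  # (M, M)
    return 1.0 / (identity >= seqid).sum(axis=1)

The effective number of sequences \(M_{\mathrm{eff}}\), used for instance for the default pseudocount, is the sum of these weights.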


🔠 Alphabet Specification

By default, the alphabet is set to protein. To specify others:

--alphabet rna
--alphabet dna

You can also define a custom alphabet:

--alphabet ABCD-

Just make sure the symbols match those in your MSA exactly. Custom alphabets allow for specialized tokens or reordering of states; the symbol order fixes the integer index of each state, as in the sketch below.
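
For illustration, a sketch (not library code) of how a custom alphabet maps sequences to integer tokens:

def encode(sequences, alphabet="ABCD-"):
    # Map each sequence to a list of integer tokens, where the position
    # of a symbol in `alphabet` defines its index. Raises KeyError if a
    # sequence contains a symbol missing from the alphabet.
    index = {symbol: k for k, symbol in enumerate(alphabet)}
    return [[index[s] for s in seq] for seq in sequences]

# Example: encode(["AB-D"], alphabet="ABCD-") -> [[0, 1, 4, 3]]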


🧠 Training Algorithms

🧩 bmDCA: Fully Connected Model

This is the default training mode, where all possible pairwise couplings are learned.
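
Concretely, bmDCA fits a Potts model: each aligned sequence \((x_1, \dots, x_L)\) is assigned the probability

\[
P(x_1, \dots, x_L) = \frac{1}{Z} \exp\Big( \sum_{i=1}^{L} h_i(x_i) + \sum_{i<j} J_{ij}(x_i, x_j) \Big),
\]

where \(h_i\) are the single-site biases, \(J_{ij}\) the pairwise couplings stored in <label>_params.dat, and \(Z\) the normalization constant.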


🌱 eaDCA: Progressive Coupling Activation

This routine learns a sparse DCA model by gradually activating couplings.

Use:

adabmDCA train -m eaDCA [...]

Key hyperparameters (a schematic activation loop is sketched after this list):

  • --factivate <float>: Fraction of inactive couplings activated per iteration (default: 0.001)
  • --gsteps <int>: Number of parameter updates per fixed coupling graph (default: 10)
  • --nsweeps <int>: MC sweeps per parameter update. Since only part of the model changes at each step, a smaller value (e.g. 5) is usually sufficient.
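
The sketch below illustrates only the graph-growth schedule controlled by --factivate; the function name and the random selection are stand-ins, since eaDCA chooses couplings according to a statistical criterion and re-fits the parameters for --gsteps updates between growth steps:

import numpy as np

def activation_schedule(n_couplings, factivate=0.001, n_iters=5, rng=None):
    # At each iteration, a fraction `factivate` of the still-inactive
    # couplings becomes active (chosen randomly here for illustration).
    rng = rng or np.random.default_rng(0)
    active = np.zeros(n_couplings, dtype=bool)
    for t in range(n_iters):
        inactive = np.flatnonzero(~active)
        k = max(1, int(factivate * inactive.size))
        active[rng.choice(inactive, size=k, replace=False)] = True
        print(f"iteration {t}: {active.sum()} active couplings")
    return active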

✂️ edDCA: Decimation for Sparsity

This routine builds a sparse model by pruning the least informative couplings from a bmDCA model.

Two modes are available:

1. From Pretrained Model

adabmDCA train -m edDCA -d <fasta_file> -p <file_params> -c <file_chains>

  • Requires parameters and chains from a previously trained bmDCA model
  • Applies decimation until the target sparsity is reached (up to 10,000 iterations)

2. Train + Decimate (No Input Model)

If -p and -c are omitted, the model is initialized from scratch: a bmDCA model is trained first and then decimated automatically.

Key hyperparameters (a schematic decimation loop is sketched after this list):

  • --gsteps <int>: Parameter updates per decimation step (default: 10)
  • --drate <float>: Fraction of couplings to remove per step (default: 0.01)
  • --density <float>: Target graph density (default: 0.02)
  • --target <float>: Target Pearson correlation (default: 0.95)
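
In outline, decimation repeatedly prunes a fraction --drate of the active couplings until the graph density drops below --density. The sketch below is illustrative only: the magnitude-based scoring is a stand-in for adabmDCA's implementation-specific criterion, and the real routine re-fits the surviving parameters for --gsteps updates between pruning steps:

import numpy as np

def decimation_schedule(couplings, drate=0.01, density=0.02):
    # Repeatedly zero out the `drate` fraction of active couplings with
    # the smallest magnitude, until the fraction of nonzero entries
    # falls below `density`.
    couplings = couplings.copy()
    while np.count_nonzero(couplings) / couplings.size > density:
        active = np.flatnonzero(couplings)
        k = max(1, int(drate * active.size))
        weakest = active[np.argsort(np.abs(couplings.flat[active]))[:k]]
        couplings.flat[weakest] = 0.0
    return couplings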

🔧 Hyperparameter Tuning

Default values offer a good trade-off between performance and training time for clean MSAs. In other cases, consider adjusting the following:

🚀 Learning Rate (--lr)

  • Default: 0.05
  • Reduce to 0.01 or 0.005 if the training is unstable, does not converge, or yields poor generative models

🔗 Number of Markov Chains (--nchains)

  • Default: 10000
  • Lowering this reduces memory use, but may slow convergence (see Fig. 2)

Figure 2: Evolution of the Pearson correlation coefficient for trainings with different numbers of Markov chains. The target Pearson correlation is set to 0.95 (black dashed line).

🔄 Monte Carlo Sweeps (--nsweeps)

  • Default: 10
  • Controls the number of full sequence updates between parameter updates (a single sweep is sketched after this list)
  • Higher values (up to 50) improve chain decorrelation but increase runtime
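
To make the terminology concrete, one sweep updates every position of every chain once. A minimal Gibbs-style sweep for a Potts model (illustrative; the actual sampler may differ), assuming dense parameter arrays h of shape (L, q) and J of shape (L, q, L, q):

import numpy as np

def gibbs_sweep(chains, h, J, rng):
    # One sweep: update every site of every chain once, in place.
    # chains: (n_chains, L) integer states; h: (L, q) biases;
    # J: (L, q, L, q) couplings with J[i, a, j, b] = J_ij(a, b).
    n_chains, L = chains.shape
    q = h.shape[1]
    for i in range(L):
        # Conditional log-probabilities of the q states at site i,
        # given the current states of all other sites.
        logits = np.tile(h[i], (n_chains, 1))             # (n_chains, q)
        for j in range(L):
            if j != i:
                logits += J[i, :, j, chains[:, j]].T      # (n_chains, q)
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        # Inverse-CDF sampling of the new state at site i for each chain
        u = rng.random((n_chains, 1))
        chains[:, i] = (p.cumsum(axis=1) > u).argmax(axis=1)

With --nsweeps 10, ten such sweeps would run between consecutive parameter updates (e.g. with rng = np.random.default_rng(0)).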

🧮 Pseudocount Regularization (--pseudocount)

  • Default: \(\alpha = 1 / M_{\mathrm{eff}}\), where \(M_{\mathrm{eff}}\) is the effective number of sequences (a worked sketch follows this list)
  • Adjust to control the smoothness of empirical frequency estimates
  • \(\alpha = 0\): No regularization
  • \(\alpha = 1\): Flat prior (uniform frequencies)
  • Set manually via --pseudocount <value>
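
The pseudocount enters the empirical statistics through the standard DCA smoothing formula; a minimal sketch (illustrative, not library code) for single-site frequencies:

import numpy as np

def regularized_frequencies(msa, weights, q, alpha=None):
    # Weighted single-site frequencies smoothed with a pseudocount:
    #     f_reg = (1 - alpha) * f_data + alpha / q
    # alpha=None reproduces the default alpha = 1 / M_eff, with
    # M_eff the sum of the sequence weights.
    meff = weights.sum()
    if alpha is None:
        alpha = 1.0 / meff
    one_hot = np.eye(q)[msa]                                        # (M, L, q)
    f_data = (weights[:, None, None] * one_hot).sum(axis=0) / meff  # (L, q)
    return (1.0 - alpha) * f_data + alpha / q

With \(\alpha = 0\) this returns the raw weighted frequencies; with \(\alpha = 1\) it returns the uniform distribution \(1/q\).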

Tip

Choosing optimal hyperparameters depends on your dataset. MSAs with many gaps, high diversity, or strong subfamily structure may require more conservative settings to ensure convergence and good sampling.