Entropy from Thermodynamic Integration
The entropy command estimates the entropy of a trained DCA model without enumerating all possible sequences. It uses a thermodynamic integration path that progressively biases the model toward one target sequence.
adabmDCA entropy \
-p <params.dat> \
-d <training_msa.fasta> \
-t <target_sequence.fasta> \
-o <output_folder>
Biased Model
Starting from the trained model with energy \(E_0(x)\), the script defines a one-parameter family of models
\[
E_\theta(x) = E_0(x) - \theta \, I(x, x^\star), \qquad p_\theta(x) = \frac{e^{-E_\theta(x)}}{Z_\theta},
\]
where \(x^\star\) is the target sequence, typically chosen from the training set, and \(I(x, x^\star)\) is the number of positions at which \(x\) and \(x^\star\) carry the same residue, ranging from \(0\) to the sequence length \(L\). In the implementation this is done by adding theta * targetseq to the local bias parameters.
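The equivalence between shifting the local biases and subtracting \(\theta \, I(x, x^\star)\) from the energy can be checked on a toy model. The sketch below is not the adabmDCA code: it assumes the usual Potts convention \(E(x) = -\sum_i h_i(x_i) - \sum_{i<j} J_{ij}(x_i, x_j)\), and the sizes, parameter values, and helper names (`energy`, `identity`, `biased_fields`) are hypothetical.

```python
import itertools

L, q = 3, 3            # toy sequence length and alphabet size (hypothetical)
target = (0, 2, 1)     # toy target sequence x*, encoded as integers

# Arbitrary toy parameters: fields h[i][a] and couplings J[(i, j)][a][b].
h = [[0.1 * (i + 1) * (a - 1) for a in range(q)] for i in range(L)]
J = {
    (i, j): [[0.05 * ((a + 2 * b + i + j) % 3 - 1) for b in range(q)] for a in range(q)]
    for i in range(L)
    for j in range(i + 1, L)
}

def energy(x, fields):
    """Potts energy E(x) = -sum_i h_i(x_i) - sum_{i<j} J_ij(x_i, x_j)."""
    e = -sum(fields[i][x[i]] for i in range(L))
    e -= sum(J[(i, j)][x[i]][x[j]] for (i, j) in J)
    return e

def identity(x, y):
    """Number of positions where x and y carry the same symbol."""
    return sum(a == b for a, b in zip(x, y))

def biased_fields(theta):
    """Add theta to the field entry of the target symbol at every position."""
    return [[h[i][a] + (theta if a == target[i] else 0.0) for a in range(q)]
            for i in range(L)]

theta = 1.7
# Largest discrepancy between the two definitions over all q**L sequences.
max_gap = max(
    abs(energy(x, biased_fields(theta)) - (energy(x, h) - theta * identity(x, target)))
    for x in itertools.product(range(q), repeat=L)
)
print(max_gap)
```

The gap is zero up to floating-point rounding, so under this convention a bias on the fields is all that is needed to tilt the model toward the target.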
Increasing \(\theta\) makes the target sequence more likely. The script starts from the user-provided --theta_max and raises it until the target sequence is sampled with measurable probability; in the implementation, that means roughly 10% of the chains.
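The search for a usable \(\theta_{\max}\) can be illustrated on a model small enough to enumerate. This is a sketch under stated assumptions: the toy energy `energy0` is arbitrary, the target probability is computed exactly rather than from chains as the script does, and the multiplicative growth rule is a placeholder, since the rule the script actually uses is not described here.

```python
import itertools
import math

L, q = 4, 3
target = (0, 1, 2, 0)   # toy target sequence (hypothetical)

def energy0(x):
    """Arbitrary toy energy standing in for the trained model's E_0."""
    return 0.3 * sum(x) - 0.5 * sum(x[i] == x[i + 1] for i in range(L - 1))

def identity(x):
    """Number of positions where x matches the target."""
    return sum(a == b for a, b in zip(x, target))

def target_probability(theta):
    """Exact p_theta(x*) by enumeration; the script estimates it from chains."""
    weights = {
        x: math.exp(-energy0(x) + theta * identity(x))
        for x in itertools.product(range(q), repeat=L)
    }
    return weights[target] / sum(weights.values())

theta_max = 0.5                      # plays the role of the user-provided --theta_max
while target_probability(theta_max) < 0.10:
    theta_max *= 1.5                 # hypothetical growth rule, not taken from the script

print(theta_max, target_probability(theta_max))
```

The loop always terminates because \(p_\theta(x^\star)\) increases monotonically with \(\theta\) and tends to 1.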
Free Energy Estimate
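The script works with the free energy
\[
F(\theta) = -\log Z_\theta, \qquad Z_\theta = \sum_x e^{-E_0(x) + \theta I(x, x^\star)}.
\]
At large \(\theta_{\max}\), the target sequence has measurable probability \(p_\star\). Since \(I(x^\star, x^\star) = L\),
\[
p_\star = p_{\theta_{\max}}(x^\star) = \frac{e^{-E_0(x^\star) + \theta_{\max} L}}{Z_{\theta_{\max}}},
\]
so the script estimates
\[
F(\theta_{\max}) = \log p_\star + E_0(x^\star) - \theta_{\max} L.
\]
Then it integrates back to \(\theta = 0\). Because
\[
\frac{dF}{d\theta} = -\langle I(x, x^\star) \rangle_\theta,
\]
we have
\[
F(0) = F(\theta_{\max}) + \int_0^{\theta_{\max}} \langle I(x, x^\star) \rangle_\theta \, d\theta.
\]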
The integral is approximated with the trapezoidal rule over --nsteps values of \(\theta\). At each value, chains are sampled for --nsweeps sweeps and the mean identity to the target sequence is measured.
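The integration step can be sanity-checked on a toy model where \(-\log Z_0\) is known exactly. The sketch below assumes the relations \(F(\theta_{\max}) = \log p_\star + E_0(x^\star) - \theta_{\max} L\) and \(dF/d\theta = -\langle I(x, x^\star) \rangle_\theta\); it replaces MCMC averages with exact enumeration, and the toy energy, grid size, and function names are hypothetical.

```python
import itertools
import math

L, q = 4, 3
target = (0, 1, 2, 0)
states = list(itertools.product(range(q), repeat=L))

def energy0(x):
    """Arbitrary toy energy standing in for the trained model's E_0."""
    return 0.3 * sum(x) - 0.5 * sum(x[i] == x[i + 1] for i in range(L - 1))

def identity(x):
    return sum(a == b for a, b in zip(x, target))

def log_z(theta):
    """Exact log partition function of the biased model."""
    return math.log(sum(math.exp(-energy0(x) + theta * identity(x)) for x in states))

def mean_identity(theta):
    """Exact <I(x, x*)>_theta; the script measures this on sampled chains."""
    w = [math.exp(-energy0(x) + theta * identity(x)) for x in states]
    return sum(wi * identity(x) for wi, x in zip(w, states)) / sum(w)

theta_max, nsteps = 4.0, 200

# Anchor at theta_max. Here log p* is exact; the script would use the
# fraction of chains sitting on the target sequence instead.
log_p_star = -energy0(target) + theta_max * L - log_z(theta_max)
f_max = log_p_star + energy0(target) - theta_max * L

# Trapezoidal rule for the integral of <I>_theta over [0, theta_max].
thetas = [theta_max * k / (nsteps - 1) for k in range(nsteps)]
means = [mean_identity(t) for t in thetas]
integral = sum(
    0.5 * (means[k] + means[k + 1]) * (thetas[k + 1] - thetas[k])
    for k in range(nsteps - 1)
)

f_zero = f_max + integral
exact = -log_z(0.0)
print(f_zero, exact)
```

With exact averages the only remaining error is the trapezoidal discretization; in the real command, sampling noise in \(\langle I \rangle_\theta\) and in \(p_\star\) adds to it.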
Entropy
After estimating \(F(0) = -\log Z_0\), the entropy is computed as
\[
S = \langle E_0 \rangle_0 - F(0) = \langle E_0 \rangle_0 + \log Z_0.
\]
The average energy \(\langle E_0 \rangle_0\) is estimated by thermalizing chains at \(\theta = 0\) for --nsweeps_zero sweeps.
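That the entropy follows from the mean energy and the free energy, \(S = \langle E_0 \rangle_0 - F(0)\), can be verified directly against the Shannon entropy on an enumerable toy model. This is a minimal sketch with an arbitrary, hypothetical energy function; the averages are exact rather than sampled.

```python
import itertools
import math

L, q = 4, 3
states = list(itertools.product(range(q), repeat=L))

def energy0(x):
    """Arbitrary toy energy standing in for the trained model's E_0."""
    return 0.3 * sum(x) - 0.5 * sum(x[i] == x[i + 1] for i in range(L - 1))

energies = [energy0(x) for x in states]
log_z0 = math.log(sum(math.exp(-e) for e in energies))
probs = [math.exp(-e - log_z0) for e in energies]

free_energy = -log_z0                                       # F(0)
mean_energy = sum(p * e for p, e in zip(probs, energies))   # <E_0>_0
entropy = mean_energy - free_energy                         # S = <E_0>_0 - F(0)

# Direct Shannon entropy, feasible only because the state space is tiny.
shannon = -sum(p * math.log(p) for p in probs)
print(entropy, shannon)
```

The two values agree to rounding error, and both are bounded above by \(L \log q\), the entropy of the uniform distribution over sequences.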
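Equivalently, the final entropy estimator is
\[
S = \langle E_0 \rangle_0 - E_0(x^\star) - \log p_\star + \theta_{\max} L - \int_0^{\theta_{\max}} \langle I(x, x^\star) \rangle_\theta \, d\theta.
\]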
Output
The command writes a log file named <label>.log in the output folder. By default, label is entropy, so the file is:
<output_folder>/entropy.log
Each row reports:
Epoch: integration step index
Theta: current field strength
Free Energy: running estimate of \(F(\theta_{\max}) + \int_0^\theta \langle I(x, x^\star) \rangle_{\theta'} d\theta'\); the final row gives \(F(0)\)
Entropy: current entropy estimate
Time: elapsed time in seconds
Main Parameters
--theta_max: initial maximum field strength. It is automatically increased until the target sequence appears in at least 10% of chains.
--nsteps: number of integration points.
--nsweeps: MCMC sweeps per integration point.
--nsweeps_zero: sweeps used to estimate \(\langle E_0 \rangle_0\).
--nsweeps_theta: sweeps used to equilibrate at the initial \(\theta_{\max}\).
--nchains: number of chains used in the estimates.
More chains, more integration steps, and more sweeps reduce Monte Carlo noise, at the cost of longer runtime.