Applications
Generate sequences
Once we have a trained model, we can use it to generate new sequences. This can be done using the command:
$ ./adabmDCA.sh sample -p <path_params> -d <fasta_file> -o <output_folder> --ngen <num_gen>
where output_folder
is the directory where to save the data and num_gen
is the number of sequences to be generated. The routine will first compute the mixing time (\(t^{\mathrm{mix}}\)) of the model by running a simulation starting from the sequences of the input MSA. After that, it will randomly initialize num_gen
Markov chains and run for nmix
\(\cdot t^{\mathrm{mix}}\) sweeps to ensure that the model equilibrates. It will save in the output directory a FASTA file containing the sequences sampled with the model and a text file containing the records used to determine the convergence of the algorithm. Figure mixing time-Right shows the comparison between the entries of the covariance matrices obtained from the data and from the generated sequences. The Pearson correlation coefficient is the same used as a target for the training and the slope is close to 1, meaning that the model is able to correctly recover the two-sites statistics of the data MSA.
Convergence criterium
To determine the convergence of the Monte Carlo simulation, the following strategy is used. We extract \texttt{nmeasure}\(=N\) sequences from the data MSA according to their statistical weight and we make a copy of them. The first set represents the chains simulated up to time \(t\), and we denote them as \(\pmb{A}(t) = \{\pmb{a}_1(t), \dots, \pmb{a}_{N}(t)\}\), while the sequences of the second set are the chains simulated until time \(t/2\), and we call them \(\pmb{A}(t/2) = \{\pmb{a}_1(t/2), \dots, \pmb{a}_{N}(t/2)\}\). With some abuse of notation, we define the intrachain correlation and autocorrelation as
where \(\sigma(i)\) is a random permutation of the index \(i\) and
is the normalized sequence identity (or overlap) between the sequences \(\pmb{a}\) and \(\pmb{b}\). By construction, at the initialization we have \(\mathrm{SeqID}(t, t/2) = 1\) and \(\mathrm{SeqID}(t)\) somewhat close to the average sequence identity of the MSA. The convergence is obtained when chains are mixed, meaning that the system has completely forgotten the initial configuration. This requirement is satisfied when the statistics of a set of independent chains is the same as the one between the initialization and the evolved chains, meaning \(\mathrm{SeqID}(t) \cong \mathrm{SeqID}(t, t/2)\). The point at which the two curves merge is called mixing time, and we denote it as \(t^{\mathrm{mix}}\). After reaching the mixing time of the model, the algorithm will initialize ngen
chains at random. It will run a sampling for other nmix
\(\cdot t^{\mathrm{mix}}\) steps to guarantee complete thermalization, with nmix=2
by default. Together with the generated sequences, the script will output a text file containing the records of \(\mathrm{SeqID}(t)\) and \(\mathrm{SeqID}(t, t/2)\) and their standard deviations (figure mixing time-Left).
Contact prediction
One of the principal applications of the DCA models has been that of predicting a tertiary structure of a protein or RNA domain. In particular, with each pair of sites \(i\) and \(j\) in the MSA, adabmDCA 2.0
computes a contact score that quantifies how likely the two associated positions in the chains are in contact in the three-dimensional structure.
Formally, it corresponds to the average-product corrected (APC) Frobenius norms of the coupling matrices [Ekeberg et al., 2013], i.e.
Note that the coupling parameters are usually transformed in a zero-sum gauge before computing the scores, and the gap symbol should be neglected while computing the sum in Eq. (1) [Feinauer et al., 2014].
The scores for all site pairs are provided in the output folder in a separate file called <label>_frobenius.txt
The first two columns indicate the site indices and the third one contains the associated APC Frobenius norm. The command for computing the matrix of Frobenius norms is
$ ./adabmDCA.sh contacts -p <file_params> -o <output_folder>
Scoring a sequence set
At convergence, users can score a set of input sequences according to a trained DCA model by using the command line
$ ./adabmDCA.sh energies -d <fasta_file> -p <file_params> -o <output_folder>
adabmDCA 2.0
will produce a new FASTA file in the output folder <output_folder>
where the input sequences in <fasta_file>
have an additional field in the name that account for the DCA energy function computed according to the model in <file_params>
. Note that low energies correspond to good sequences.
Single mutant library
Another possible application exploits the sequence-fitness score computable according to the energy function \(E\) in Eq. (1). This routine provides a single-mutant library for a given wild type to possibly guide Deep Mutational Scanning (DMS) experiments. In particular, adabmDCA 2.0
allows one to predict the fitness reduction (increase) in terms of \(\Delta E = E\left(\rm mutant \right) - E\left(\rm wildtype \right)\) for positive (negative) value of \(\Delta E\), respectively [Hopf et al., 2017].
To produce a FASTA file containing all weighted single-mutants one has to run
$ ./adabmDCA.sh DMS -d <WT> -p <file_params> -o <output_folder>
where <WT>
is the name of the FASTA file containing the wild type sequence, <file_params>
is a model file in a format compatible with adabmDCA 2.0
output, and <output_folder>
corresponds to the folder that will contain the library file. The sequences in the output FASTA file are named after the introduced mutation and the corresponding \(\Delta E\); for instance, >G27A | DCAscore: -0.6
denotes that position 27
has been changed from G
to A
and \(\Delta E = -0.6\).