adabmDCApy APIs
This section describes all the functions available in the Python implementation of adabmDCA.
Submodules
adabmDCA.dataset module
- class adabmDCA.dataset.DatasetDCA(path_data: str | Path, path_weights: str | Path | None = None, alphabet: str = 'protein', clustering_th: float = 0.8, no_reweighting: bool = False, device: device = device(type='cpu'), dtype: dtype = torch.float32, message: bool = True)
Bases:
Dataset
Dataset class for handling multiple sequence alignment (MSA) data.
- get_effective_size() int
Returns the effective size (Meff) of the dataset.
- Returns:
Effective size of the dataset.
- Return type:
int
- get_num_residues() int
Returns the number of residues (L) in the multiple sequence alignment.
- Returns:
Length of the MSA.
- Return type:
int
- get_num_states() int
Returns the number of states (q) in the alphabet.
- Returns:
Number of states.
- Return type:
int
- shuffle() None
Shuffles the dataset.
adabmDCA.fasta module
- adabmDCA.fasta.compute_weights(data: ndarray | Tensor, th: float = 0.8, device: device = device(type='cpu'), dtype: dtype = torch.float32) Tensor
Computes the weight to be assigned to each sequence ‘s’ in ‘data’ as 1 / n_clust, where ‘n_clust’ is the number of sequences that have a sequence identity with ‘s’ >= th.
- Parameters:
data (np.ndarray | torch.Tensor) – Encoded input dataset.
th (float, optional) – Sequence identity threshold for the clustering. Defaults to 0.8.
device (torch.device, optional) – Device. Defaults to “cpu”.
dtype (torch.dtype, optional) – Data type. Defaults to torch.float32.
- Returns:
Array with the weights of the sequences.
- Return type:
torch.Tensor
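The reweighting rule described above can be illustrated with a small dependency-free sketch. This is not the library's implementation, which operates on encoded arrays or tensors. The helper name `sketch_weights` and the toy sequences are hypothetical, and sequence identity is assumed here to be the fraction of matching aligned positions.

```python
def sketch_weights(seqs, th=0.8):
    """Weight each sequence by 1 / (number of sequences with identity >= th)."""
    weights = []
    for s in seqs:
        n_clust = 0
        for t in seqs:
            # Fraction of aligned positions at which the two sequences agree.
            identity = sum(a == b for a, b in zip(s, t)) / len(s)
            if identity >= th:
                n_clust += 1  # includes s itself, so n_clust >= 1
        weights.append(1.0 / n_clust)
    return weights


msa = ["ACDEF", "ACDEG", "WYWYW"]  # toy alignment, for illustration only
weights = sketch_weights(msa, th=0.8)
# The first two sequences share 4 of 5 positions, so they down-weight each
# other; the third sequence has no neighbours and keeps weight 1.
m_eff = sum(weights)
```

Summing the weights, as in the last line, is the usual way an effective number of sequences is obtained from such weights; whether `get_effective_size` does exactly this is an assumption.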
- adabmDCA.fasta.decode_sequence(sequence: ndarray, tokens: str) str | ndarray
Takes a numeric sequence or a list of sequences as input and returns the corresponding string encoding.
- Parameters:
sequence (np.ndarray) – Input sequences. Can be either a 1D or a 2D array.
tokens (str) – Alphabet to be used for the encoding.
- Returns:
Decoded input.
- Return type:
str | np.ndarray
- adabmDCA.fasta.encode_sequence(sequence: str | ndarray, tokens: str) ndarray
Encodes a sequence or a list of sequences into a numeric format.
- Parameters:
sequence (str | np.ndarray) – Input sequence.
tokens (str) – Alphabet to be used for the encoding.
- Returns:
Encoded sequence or sequences.
- Return type:
np.ndarray
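As a rough illustration of what encoding and decoding mean here, the sketch below maps characters to their positions in the token string and back. The helper names and the alphabet `"-ACGU"` are hypothetical, and the sketch handles a single sequence only, whereas the library functions also accept arrays of sequences.

```python
def sketch_encode(sequence, tokens):
    """Map each character to its index in `tokens`."""
    index = {tok: i for i, tok in enumerate(tokens)}
    # A character that is not in the alphabet raises KeyError here.
    return [index[c] for c in sequence]


def sketch_decode(encoded, tokens):
    """Map each index back to the corresponding character."""
    return "".join(tokens[i] for i in encoded)


tokens = "-ACGU"  # hypothetical alphabet with the gap symbol first
encoded = sketch_encode("AC-GU", tokens)
decoded = sketch_decode(encoded, tokens)
```

Decoding an encoded sequence with the same token string recovers the original text, which is a convenient sanity check when using a custom alphabet.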
- adabmDCA.fasta.get_tokens(alphabet: str) str
Converts the alphabet into the corresponding tokens.
- Parameters:
alphabet (str) – Alphabet to be used for the encoding. It can be either “protein”, “rna”, “dna” or a custom string of tokens.
- Returns:
Tokens of the alphabet.
- Return type:
str
- adabmDCA.fasta.import_from_fasta(fasta_name: str | Path, tokens: str | None = None, filter_sequences: bool = False) Tuple[ndarray, ndarray]
Import data from a fasta file.
- Parameters:
fasta_name (str | Path) – Path to the fasta file.
tokens (str | None, optional) – Alphabet to be used for the encoding. If provided, encodes the sequences in numeric format.
filter_sequences (bool, optional) – If True, removes the sequences whose tokens are not present in the alphabet. Defaults to False.
- Raises:
RuntimeError – The file is not in fasta format.
- Returns:
headers, sequences.
- Return type:
Tuple[np.ndarray, np.ndarray]
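For readers unfamiliar with the format, a minimal FASTA reader might look like the sketch below. It is a simplified stand-in, not the code behind `import_from_fasta`: the helper name is hypothetical, no encoding or filtering is performed, and plain lists are returned instead of NumPy arrays.

```python
import io


def sketch_read_fasta(handle):
    """Collect headers and sequences; a sequence may span several lines."""
    headers, sequences = [], []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            headers.append(line[1:])
            sequences.append("")
        elif not headers:
            # Sequence data before any '>' header: not a FASTA file.
            raise RuntimeError("The file is not in fasta format.")
        else:
            sequences[-1] += line
    return headers, sequences


fasta_text = ">seq1\nAC-G\nU\n>seq2\nACGGU\n"  # toy input
headers, sequences = sketch_read_fasta(io.StringIO(fasta_text))
```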
- adabmDCA.fasta.validate_alphabet(sequences: ndarray, tokens: str)
Check if the chosen alphabet is compatible with the input sequences.
- Parameters:
sequences (np.ndarray) – Input sequences.
tokens (str) – Alphabet to be used for the encoding.
- Raises:
KeyError – The chosen alphabet is incompatible with the Multi-Sequence Alignment.
- adabmDCA.fasta.write_fasta(fname: str, headers: ndarray, sequences: ndarray, numeric_input: bool = False, remove_gaps: bool = False, tokens: str = 'protein')
Generate a fasta file with the input sequences.
- Parameters:
fname (str) – Name of the output fasta file.
headers (np.ndarray) – Array of sequences’ headers.
sequences (np.ndarray) – Array of sequences.
numeric_input (bool, optional) – Whether the sequences are in numeric (encoded) format or not. Defaults to False.
remove_gaps (bool, optional) – If True, removes the gaps from the alignment. Defaults to False.
tokens (str, optional) – Alphabet to be used for the encoding. Defaults to “protein”.
adabmDCA.functional module
- adabmDCA.functional.one_hot(x: Tensor, num_classes: int = -1, dtype: dtype = torch.float32)
A one-hot encoding function that is faster than the PyTorch one. It works with torch.int32 inputs and returns a float tensor. Works only for 2D tensors.
- Parameters:
x (torch.Tensor) – Input tensor to be one-hot encoded.
num_classes (int, optional) – Number of classes. If -1, the number of classes is inferred from the input tensor. Defaults to -1.
dtype (torch.dtype, optional) – Data type of the output tensor. Defaults to torch.float32.
- Returns:
One-hot encoded tensor.
- Return type:
torch.Tensor
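The shape convention can be seen in a plain-Python sketch: an input of shape (N, L) becomes an output of shape (N, L, num_classes). The helper name is hypothetical, nested lists stand in for tensors, and inferring the number of classes as the largest index plus one is an assumption borrowed from the PyTorch function.

```python
def sketch_one_hot(x, num_classes=-1):
    """One-hot encode a 2D list of integers into shape (N, L, num_classes)."""
    if num_classes == -1:
        # Infer the number of classes from the largest index present.
        num_classes = max(max(row) for row in x) + 1
    return [
        [[1.0 if a == state else 0.0 for a in range(num_classes)] for state in row]
        for row in x
    ]


x = [[0, 2], [1, 1]]  # two sequences of length two
encoded = sketch_one_hot(x)
```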
adabmDCA.io module
- adabmDCA.io.load_chains(fname: str, tokens: str, load_weights: bool = False) ndarray | Tuple[ndarray, ndarray]
Loads the sequences from a fasta file and returns the numeric-encoded version. If the sequences are weighted, the log-weights are also returned. If the sequences are not weighted, the log-weights are set to 0.
- Parameters:
fname (str) – Path to the file containing the sequences.
tokens (str) – “protein”, “dna”, “rna” or another string with the alphabet to be used.
load_weights (bool, optional) – If True, the log-weights are loaded and returned. Defaults to False.
- Returns:
Numeric-encoded sequences and log-weights if load_weights is True.
- Return type:
np.ndarray | Tuple[np.ndarray, np.ndarray]
- adabmDCA.io.load_params(fname: str, tokens: str, device: device, dtype: dtype = torch.float32) Dict[str, Tensor]
Import the parameters of the model from a file.
- Parameters:
fname (str) – Path of the file that stores the parameters.
tokens (str) – “protein”, “dna”, “rna” or another string with the alphabet to be used.
device (torch.device) – Device where to store the parameters.
dtype (torch.dtype) – Data type of the parameters. Defaults to torch.float32.
- Returns:
Parameters of the model.
- Return type:
Dict[str, torch.Tensor]
- adabmDCA.io.load_params_oldformat(fname: str, device: device, dtype: dtype = torch.float32) Dict[str, Tensor]
Import the parameters of the model from a file. Assumes the old DCA format.
- Parameters:
fname (str) – Path of the file that stores the parameters.
device (torch.device) – Device where to store the parameters.
dtype (torch.dtype) – Data type of the parameters. Defaults to torch.float32.
- Returns:
Parameters of the model.
- Return type:
Dict[str, torch.Tensor]
- adabmDCA.io.save_chains(fname: str, chains: Tensor, tokens: str, log_weights: Tensor = None) None
Saves the chains in a fasta file.
- Parameters:
fname (str) – Path to the file where to save the chains.
chains (torch.Tensor) – Chains.
tokens (str) – “protein”, “dna”, “rna” or another string with the alphabet to be used.
log_weights (torch.Tensor, optional) – Log-weights of the chains. Defaults to None.
- adabmDCA.io.save_params(fname: str, params: Dict[str, Tensor], mask: Tensor, tokens: str) None
Saves the parameters of the model in a file.
- Parameters:
fname (str) – Path to the file where to save the parameters.
params (Dict[str, torch.Tensor]) – Parameters of the model.
mask (torch.Tensor) – Mask of the coupling matrix that determines which are the non-zero entries.
tokens (str) – “protein”, “dna”, “rna” or another string with the alphabet to be used.
- adabmDCA.io.save_params_oldformat(fname: str, params: Dict[str, Tensor], mask: Tensor) None
Saves the parameters of the model in a file. Assumes the old DCA format.
- Parameters:
fname (str) – Path to the file where to save the parameters.
params (Dict[str, torch.Tensor]) – Parameters of the model.
mask (torch.Tensor) – Mask of the coupling matrix that determines which are the non-zero entries.
adabmDCA.plot module
- adabmDCA.plot.plot_PCA(fig: figure, data1: ndarray, pc1: int = 0, pc2: int = 1, data2: ndarray | None = None, labels: List[str] | str = 'Data', colors: List[str] | str = 'black', title: str | None = None) figure
Makes the scatter plot of the components (pc1, pc2) of the input data and shows the histograms of the components.
- Parameters:
fig (plt.figure) – Figure to plot the data.
data1 (np.ndarray) – Data to plot.
pc1 (int, optional) – First principal direction. Defaults to 0.
pc2 (int, optional) – Second principal direction. Defaults to 1.
data2 (np.ndarray | None, optional) – Data to be superimposed to data1. Defaults to None.
labels (List[str] | str, optional) – Labels to put in the legend. Defaults to “Data”.
colors (List[str] | str, optional) – Colors to be used. Defaults to “black”.
title (str | None, optional) – Title of the plot. Defaults to None.
- Returns:
Updated figure.
- Return type:
plt.figure
- adabmDCA.plot.plot_autocorrelation(ax: Axes, checkpoints: ndarray, autocorr: ndarray, gen_seqid: ndarray, data_seqid: ndarray) Axes
Plots the time-autocorrelation curve of the sequence identity and the generated and data sequence identities.
- Parameters:
ax (plt.Axes) – Axes to plot the data.
checkpoints (np.ndarray) – Checkpoints of the sampling.
autocorr (np.ndarray) – Time-autocorrelation of the sequence identity.
gen_seqid (np.ndarray) – Sequence identity of the generated data.
data_seqid (np.ndarray) – Sequence identity of the data.
- Returns:
Updated axes.
- Return type:
plt.Axes
- adabmDCA.plot.plot_pearson_sampling(ax: Axes, checkpoints: ndarray, pearsons: ndarray, pearson_training: ndarray | None = None)
- adabmDCA.plot.plot_scatter_correlations(ax: Axes, Cij_data: ndarray, Cij_gen: ndarray, Cijk_data: ndarray, Cijk_gen: ndarray, pearson_Cij: float, pearson_Cijk: float) Axes
Plots the scatter plot of the data and generated Cij and Cijk values.
- Parameters:
ax (plt.Axes) – Axes to plot the data. Must have 2 subplots.
Cij_data (np.ndarray) – Data Cij values.
Cij_gen (np.ndarray) – Generated Cij values.
Cijk_data (np.ndarray) – Data Cijk values.
Cijk_gen (np.ndarray) – Generated Cijk values.
pearson_Cij (float) – Pearson correlation coefficient of Cij.
pearson_Cijk (float) – Pearson correlation coefficient of Cijk.
- Returns:
Updated axes.
- Return type:
plt.Axes
adabmDCA.sampling module
- adabmDCA.sampling.get_sampler(sampling_method: str) Callable
Returns the sampling function corresponding to the chosen method.
- Parameters:
sampling_method (str) – String indicating the sampling method. Choose between ‘metropolis’ and ‘gibbs’.
- Raises:
KeyError – Unknown sampling method.
- Returns:
Sampling function.
- Return type:
Callable
- adabmDCA.sampling.gibbs_sampling(chains: Tensor, params: Dict[str, Tensor], nsweeps: int, beta: float = 1.0) Tensor
Gibbs sampling.
- Parameters:
chains (torch.Tensor) – Initial chains.
params (Dict[str, torch.Tensor]) – Parameters of the model.
nsweeps (int) – Number of sweeps.
beta (float, optional) – Inverse temperature. Defaults to 1.0.
- Returns:
Updated chains.
- Return type:
torch.Tensor
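To make the idea of a Gibbs sweep concrete, here is a single-chain sketch in plain Python. It is an illustration under stated assumptions, not the library's vectorised routine: the chain is index-encoded rather than one-hot, the hypothetical containers `h[i][a]` and `J[i][a][j][b]` stand in for the fields and couplings, and the energy is assumed to be minus the sum of fields and pairwise couplings.

```python
import math
import random


def sketch_gibbs_sweep(chain, h, J, beta=1.0, rng=random):
    """One Gibbs sweep over a single index-encoded chain (modified in place)."""
    L, q = len(h), len(h[0])
    for i in rng.sample(range(L), L):  # visit every site once, in random order
        # Unnormalised log-probability of each state at site i, conditioned
        # on the current states of all the other sites.
        logits = [
            beta * (h[i][a] + sum(J[i][a][j][chain[j]] for j in range(L) if j != i))
            for a in range(q)
        ]
        shift = max(logits)  # subtract the maximum for numerical stability
        probs = [math.exp(x - shift) for x in logits]
        chain[i] = rng.choices(range(q), weights=probs)[0]
    return chain


L, q = 4, 3
h = [[50.0, 0.0, 0.0] for _ in range(L)]  # strong bias towards state 0
J = [[[[0.0] * q for _ in range(L)] for _ in range(q)] for _ in range(L)]
chain = sketch_gibbs_sweep([1, 2, 1, 2], h, J, rng=random.Random(0))
```

With such a strong field and no couplings, each site is resampled from a distribution concentrated on state 0, which makes the example easy to check by eye.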
- adabmDCA.sampling.metropolis(chains: Tensor, params: Dict[str, Tensor], nsweeps: int, beta: float = 1.0) Tensor
Metropolis sampling.
- Parameters:
chains (torch.Tensor) – One-hot encoded sequences.
params (Dict[str, torch.Tensor]) – Parameters of the model.
nsweeps (int) – Number of sweeps to be performed.
beta (float, optional) – Inverse temperature. Defaults to 1.0.
- Returns:
Updated chains.
- Return type:
torch.Tensor
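A Metropolis sweep can be sketched in the same spirit. The version below is a hedged illustration rather than the library's code: it works on one index-encoded chain, uses hypothetical containers `h[i][a]` and `J[i][a][j][b]`, assumes the energy is minus the sum of fields and pairwise couplings, and proposes a uniformly random state at a uniformly random site.

```python
import math
import random


def sketch_metropolis_sweep(chain, h, J, beta=1.0, rng=random):
    """One Metropolis sweep (L single-site proposals) on an index-encoded chain."""
    L, q = len(h), len(h[0])
    for _ in range(L):
        i = rng.randrange(L)
        old, new = chain[i], rng.randrange(q)
        # Energy change of the proposed move.
        delta_e = -(h[i][new] - h[i][old]) - sum(
            J[i][new][j][chain[j]] - J[i][old][j][chain[j]]
            for j in range(L) if j != i
        )
        # Downhill moves are always accepted; uphill moves are accepted
        # with probability exp(-beta * delta_e).
        if delta_e <= 0 or rng.random() < math.exp(-beta * delta_e):
            chain[i] = new
    return chain


L, q = 4, 3
h = [[50.0, 0.0, 0.0] for _ in range(L)]  # strong bias towards state 0
J = [[[[0.0] * q for _ in range(L)] for _ in range(q)] for _ in range(L)]
chain = sketch_metropolis_sweep([0, 0, 0, 0], h, J, rng=random.Random(0))
```

Starting from the favoured configuration, every proposal away from state 0 costs 50 energy units and is rejected in practice, so the chain stays where it is.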
adabmDCA.statmech module
- adabmDCA.statmech.compute_energy(X: Tensor, params: Dict[str, Tensor]) Tensor
Compute the DCA energy of the sequences in X.
- Parameters:
X (torch.Tensor) – Sequences in one-hot encoding format.
params (Dict[str, torch.Tensor]) – Parameters of the model.
- Returns:
DCA Energy of the sequences.
- Return type:
torch.Tensor
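For orientation, the energy of a pairwise (Potts-like) model can be written out explicitly for one sequence. The sketch assumes the convention in which the energy is minus the sum of the fields minus the sum of the couplings over unordered pairs; the helper name, the index-encoded input, and the containers `h` and `J` are hypothetical and do not mirror the layout of `params`.

```python
def sketch_energy(seq, h, J):
    """Energy of one index-encoded sequence: minus fields minus pair couplings."""
    L = len(seq)
    fields = sum(h[i][seq[i]] for i in range(L))
    # Each unordered pair (i, j) is counted once.
    couplings = sum(
        J[i][seq[i]][j][seq[j]] for i in range(L) for j in range(i + 1, L)
    )
    return -fields - couplings


L, q = 2, 2
h = [[0.5, -0.5], [0.0, 1.0]]
J = [[[[0.0] * q for _ in range(L)] for _ in range(q)] for _ in range(L)]
for a in range(q):
    # Reward equal states on the two sites, symmetrically in (i, j).
    J[0][a][1][a] = J[1][a][0][a] = 2.0

energy_equal = sketch_energy([0, 0], h, J)
energy_mixed = sketch_energy([0, 1], h, J)
```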
- adabmDCA.statmech.compute_entropy(chains: Tensor, params: Dict[str, Tensor], logZ: float) float
Compute the entropy of the DCA model.
- Parameters:
chains (torch.Tensor) – Chains that are supposed to be an equilibrium realization of the model.
params (Dict[str, torch.Tensor]) – Parameters of the model.
logZ (float) – Log-partition function of the model.
- Returns:
Entropy of the model.
- Return type:
float
- adabmDCA.statmech.compute_logZ_exact(all_states: Tensor, params: Dict[str, Tensor]) float
Compute the log-partition function of the model.
- Parameters:
all_states (torch.Tensor) – All possible states of the system.
params (Dict[str, torch.Tensor]) – Parameters of the model.
- Returns:
Log-partition function of the model.
- Return type:
float
- adabmDCA.statmech.compute_log_likelihood(fi: Tensor, fij: Tensor, params: Dict[str, Tensor], logZ: float) float
Compute the log-likelihood of the model.
- Parameters:
fi (torch.Tensor) – Single-site frequencies of the data.
fij (torch.Tensor) – Two-site frequencies of the data.
params (Dict[str, torch.Tensor]) – Parameters of the model.
logZ (float) – Log-partition function of the model.
- Returns:
Log-likelihood of the model.
- Return type:
float
- adabmDCA.statmech.enumerate_states(L: int, q: int, device: device = device(type='cpu')) Tensor
Enumerate all possible states of a system of L sites and q states.
- Parameters:
L (int) – Number of sites.
q (int) – Number of states.
device (torch.device, optional) – Device to store the states. Defaults to “cpu”.
- Returns:
All possible states.
- Return type:
torch.Tensor
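Exhaustive enumeration and the exact log-partition function are only feasible for very small systems, since the number of states grows as q**L. The sketch below shows both steps with the standard library; the helper names and the `energy_fn` callback are hypothetical, and tuples of state indices are used where the library returns a tensor.

```python
import itertools
import math


def sketch_enumerate_states(L, q):
    """All q**L configurations of L sites, as tuples of state indices."""
    return list(itertools.product(range(q), repeat=L))


def sketch_logZ_exact(states, energy_fn):
    """log Z = log sum_s exp(-E(s)), computed with a stable log-sum-exp."""
    neg_energies = [-energy_fn(s) for s in states]
    shift = max(neg_energies)
    return shift + math.log(sum(math.exp(x - shift) for x in neg_energies))


states = sketch_enumerate_states(L=3, q=2)
# With zero energy every state has weight 1, so Z equals the number of states.
logZ = sketch_logZ_exact(states, energy_fn=lambda s: 0.0)
```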
- adabmDCA.statmech.iterate_tap(mag: Tensor, params: Dict[str, Tensor], max_iter: int = 500, epsilon: float = 0.0001)
Iterates the TAP equations until convergence.
- Parameters:
mag (torch.Tensor) – Initial magnetizations.
params (Dict[str, torch.Tensor]) – Parameters of the model.
max_iter (int, optional) – Maximum number of iterations. Defaults to 500.
epsilon (float, optional) – Convergence threshold. Defaults to 1e-4.
- Returns:
Fixed point magnetizations of the TAP equations.
- Return type:
torch.Tensor
adabmDCA.stats module
- adabmDCA.stats.extract_Cij_from_freq(fij: Tensor, pij: Tensor, fi: Tensor, pi: Tensor, mask: Tensor | None = None) Tuple[float, float]
Extracts the lower triangular part of the covariance matrices of the data and chains starting from the frequencies.
- Parameters:
fij (torch.Tensor) – Two-point frequencies of the data.
pij (torch.Tensor) – Two-point frequencies of the chains.
fi (torch.Tensor) – Single-point frequencies of the data.
pi (torch.Tensor) – Single-point frequencies of the chains.
mask (torch.Tensor | None, optional) – Mask for comparing just a subset of the couplings. Defaults to None.
- Returns:
Extracted lower-triangular covariance entries of the data and chains.
- Return type:
Tuple[float, float]
- adabmDCA.stats.extract_Cij_from_seqs(data: Tensor, chains: Tensor, weights: Tensor | None = None, pseudo_count: float = 0.0, mask: Tensor | None = None) Tuple[float, float]
Extracts the lower triangular part of the covariance matrices of the data and chains starting from the sequences.
- Parameters:
data (torch.Tensor) – Data sequences.
chains (torch.Tensor) – Chain sequences.
weights (torch.Tensor | None, optional) – Weights of the sequences. Defaults to None.
pseudo_count (float, optional) – Pseudo count for the single and two points statistics. Acts as a regularization. Defaults to 0.0.
mask (torch.Tensor | None, optional) – Mask for comparing just a subset of the couplings. Defaults to None.
- Returns:
Lower-triangular covariance entries of the data and chains.
- Return type:
Tuple[float, float]
- adabmDCA.stats.generate_unique_triplets(L: int, ntriplets: int, device: device = device(type='cpu')) Tensor
Generates a set of unique triplets of positions. Used to compute the 3-points statistics.
- Parameters:
L (int) – Length of the sequences.
ntriplets (int) – Number of triplets to be generated.
device (torch.device, optional) – Device to perform computations on. Defaults to “cpu”.
- Returns:
Tensor of shape (ntriplets, 3) containing the indices of the triplets.
- Return type:
torch.Tensor
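One simple way to draw distinct triplets of positions is rejection sampling into a set, as sketched below. This is an assumption about a reasonable approach, not a description of how the library does it; the helper name is hypothetical and the triplets are returned as sorted tuples with i < j < k.

```python
import random


def sketch_unique_triplets(L, ntriplets, rng=random):
    """Draw `ntriplets` distinct position triplets (i < j < k) by rejection."""
    max_triplets = L * (L - 1) * (L - 2) // 6
    if ntriplets > max_triplets:
        raise ValueError(f"Only {max_triplets} distinct triplets exist for L={L}.")
    triplets = set()
    while len(triplets) < ntriplets:
        # Sorting gives a canonical order, so permutations of the same
        # three positions are treated as one triplet.
        triplets.add(tuple(sorted(rng.sample(range(L), 3))))
    return sorted(triplets)


triplets = sketch_unique_triplets(L=6, ntriplets=5, rng=random.Random(0))
```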
- adabmDCA.stats.get_correlation_two_points(fij: Tensor, pij: Tensor, fi: Tensor, pi: Tensor, mask: Tensor | None = None) Tuple[float, float]
Computes the Pearson coefficient and the slope between the two-point frequencies of data and chains.
- Parameters:
fij (torch.Tensor) – Two-point frequencies of the data.
pij (torch.Tensor) – Two-point frequencies of the chains.
fi (torch.Tensor) – Single-point frequencies of the data.
pi (torch.Tensor) – Single-point frequencies of the chains.
mask (torch.Tensor | None, optional) – Mask to select the couplings to use for the correlation coefficient. Defaults to None.
- Returns:
Pearson correlation coefficient of the two-sites statistics and slope of the interpolating line.
- Return type:
Tuple[float, float]
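The two returned numbers can be understood from a short sketch that compares two flattened lists of covariance entries. The helper name and the toy values are hypothetical, and which of the two inputs plays the role of the x-axis in the library's slope is not stated in the source, so the choice below is an assumption.

```python
import math


def sketch_pearson_and_slope(x, y):
    """Pearson coefficient of (x, y) and least-squares slope of y against x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    pearson = cov_xy / math.sqrt(var_x * var_y)
    slope = cov_xy / var_x
    return pearson, slope


cij_data = [0.10, -0.05, 0.20, 0.00]
cij_gen = [0.05, -0.025, 0.10, 0.00]  # toy values: half of the data values
pearson, slope = sketch_pearson_and_slope(cij_data, cij_gen)
```

A Pearson coefficient close to 1 with a slope well below 1, as in this toy case, would indicate that the generated correlations have the right pattern but a smaller amplitude than the data.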
- adabmDCA.stats.get_covariance_matrix(data: Tensor, weights: Tensor | None = None, pseudo_count: float = 0.0) Tensor
Computes the weighted covariance matrix of the input multiple sequence alignment (MSA).
- Parameters:
data (torch.Tensor) – Input MSA in one-hot variables.
weights (torch.Tensor | None, optional) – Importance weights of the sequences. Defaults to None.
pseudo_count (float, optional) – Pseudo count. Defaults to 0.0.
- Returns:
Covariance matrix.
- Return type:
torch.Tensor
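As a reminder of what the covariance of an alignment contains, the sketch below computes C_ij(a, b) = f_ij(a, b) - f_i(a) f_j(b) directly from index-encoded sequences. It is illustrative only: the helper name is hypothetical, no pseudo count is applied, and the result is a nested list indexed as `C[i][a][j][b]`, which need not match the shape returned by the library.

```python
def sketch_covariance(data, q, weights=None):
    """C[i][a][j][b] = f_ij(a, b) - f_i(a) * f_j(b) for index-encoded data."""
    n_seqs, L = len(data), len(data[0])
    if weights is None:
        weights = [1.0] * n_seqs
    total = sum(weights)
    fi = [[0.0] * q for _ in range(L)]
    fij = [[[[0.0] * q for _ in range(L)] for _ in range(q)] for _ in range(L)]
    for seq, w in zip(data, weights):
        p = w / total  # normalised weight of this sequence
        for i, a in enumerate(seq):
            fi[i][a] += p
            for j, b in enumerate(seq):
                fij[i][a][j][b] += p
    return [
        [[[fij[i][a][j][b] - fi[i][a] * fi[j][b] for b in range(q)]
          for j in range(L)] for a in range(q)]
        for i in range(L)
    ]


data = [[0, 0], [1, 1]]  # two perfectly correlated sites
C = sketch_covariance(data, q=2)
```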
- adabmDCA.stats.get_freq_single_point(data: Tensor, weights: Tensor | None = None, pseudo_count: float = 0.0) Tensor
Computes the single point frequencies of the input MSA.
- Parameters:
data (torch.Tensor) – One-hot encoded data array.
weights (torch.Tensor | None, optional) – Weights of the sequences.
pseudo_count (float, optional) – Pseudo count to be added to the frequencies. Defaults to 0.0.
- Raises:
ValueError – If the input data is not a 3D tensor.
- Returns:
Single point frequencies.
- Return type:
torch.Tensor
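A weighted single-site frequency with a pseudo count can be sketched as follows. The formula used, mixing the empirical frequency with the uniform value 1/q, is a common convention and is assumed here rather than taken from the source; the helper name is hypothetical and the input is index-encoded instead of one-hot.

```python
def sketch_freq_single_point(data, q, weights=None, pseudo_count=0.0):
    """Weighted single-site frequencies f_i(a) with a uniform pseudo count."""
    n_seqs, L = len(data), len(data[0])
    if weights is None:
        weights = [1.0] * n_seqs
    total = sum(weights)
    fi = [[0.0] * q for _ in range(L)]
    for seq, w in zip(data, weights):
        for i, a in enumerate(seq):
            fi[i][a] += w / total
    # Mix the empirical frequencies with the uniform distribution 1/q.
    return [
        [(1.0 - pseudo_count) * f + pseudo_count / q for f in row] for row in fi
    ]


data = [[0, 1], [0, 0], [1, 1]]  # three sequences of length two
weights = [0.5, 0.5, 1.0]
fi = sketch_freq_single_point(data, q=2, weights=weights)
fi_reg = sketch_freq_single_point(data, q=2, weights=weights, pseudo_count=0.5)
```

In both cases the frequencies at each site sum to one, so the pseudo count acts as a regularisation without breaking normalisation.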
- adabmDCA.stats.get_freq_three_points(data: Tensor, ntriplets: int, weights: Tensor | None = None, device: device = device(type='cpu')) Tensor
Computes the 3-body statistics of the input MSA.
- Parameters:
data (torch.Tensor) – Input MSA in one-hot encoding.
ntriplets (int) – Number of triplets to test.
weights (torch.Tensor | None, optional) – Importance weights for the sequences. Defaults to None.
device (torch.device, optional) – Device to perform computations on. Defaults to “cpu”.
- Returns:
3-points connected correlation for ntriplets randomly extracted triplets.
- Return type:
torch.Tensor
- adabmDCA.stats.get_freq_two_points(data: Tensor, weights: Tensor | None = None, pseudo_count: float = 0.0) Tensor
Computes the 2-points statistics of the input MSA.
- Parameters:
data (torch.Tensor) – One-hot encoded data array.
weights (torch.Tensor | None, optional) – Array of weights to assign to the sequences. Defaults to None.
pseudo_count (float, optional) – Pseudo count for the single and two points statistics. Acts as a regularization. Defaults to 0.0.
- Raises:
ValueError – If the input data is not a 3D tensor.
- Returns:
Matrix of two-point frequencies of shape (L, q, L, q).
- Return type:
torch.Tensor
adabmDCA.utils module
- adabmDCA.utils.get_device(device: str, message: bool = True) device
Returns the device where to store the tensors.
- Parameters:
device (str) – Device to be used.
message (bool, optional) – Print the device. Defaults to True.
- Returns:
Device.
- Return type:
torch.device
- adabmDCA.utils.get_dtype(dtype: str) dtype
Returns the data type of the tensors.
- Parameters:
dtype (str) – Data type.
- Returns:
Data type.
- Return type:
torch.dtype
- adabmDCA.utils.get_mask_save(L: int, q: int, device: device) Tensor
Returns the mask to save the upper-triangular part of the coupling matrix.
- Parameters:
L (int) – Length of the MSA.
q (int) – Number of values that each residue can assume.
device (torch.device) – Device where to store the mask.
- Returns:
Mask.
- Return type:
torch.Tensor
- adabmDCA.utils.init_chains(num_chains: int, L: int, q: int, device: device, dtype: dtype = torch.float32, fi: Tensor | None = None) Tensor
Initialize the chains of the DCA model. If ‘fi’ is provided, the chains are sampled from the profile model, otherwise they are sampled uniformly at random.
- Parameters:
num_chains (int) – Number of parallel chains.
L (int) – Length of the MSA.
q (int) – Number of values that each residue can assume.
device (torch.device) – Device where to store the chains.
dtype (torch.dtype, optional) – Data type of the chains. Defaults to torch.float32.
fi (torch.Tensor | None, optional) – Single-point frequencies. Defaults to None.
- Returns:
Initialized parallel chains in one-hot encoding format.
- Return type:
torch.Tensor
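The two initialisation modes described above can be sketched with the standard library. This is a simplified illustration with a hypothetical helper name: chains are returned as index-encoded lists rather than one-hot tensors, and the profile model is taken to mean that each site is drawn independently from its own frequencies `fi[i]`.

```python
import random


def sketch_init_chains(num_chains, L, q, fi=None, rng=random):
    """Index-encoded chains drawn uniformly, or site by site from `fi`."""
    chains = []
    for _ in range(num_chains):
        if fi is None:
            chain = [rng.randrange(q) for _ in range(L)]
        else:
            # Independent-site (profile) model: site i follows fi[i].
            chain = [rng.choices(range(q), weights=fi[i])[0] for i in range(L)]
        chains.append(chain)
    return chains


rng = random.Random(0)
uniform_chains = sketch_init_chains(num_chains=3, L=5, q=4, rng=rng)
# A degenerate profile: site 0 is always in state 0, site 1 always in state 1.
profile = [[1.0, 0.0], [0.0, 1.0]]
profile_chains = sketch_init_chains(num_chains=3, L=2, q=2, fi=profile, rng=rng)
```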
- adabmDCA.utils.init_parameters(fi: Tensor) Dict[str, Tensor]
Initialize the parameters of the DCA model.
- Parameters:
fi (torch.Tensor) – Single-point frequencies of the data.
- Returns:
Parameters of the model.
- Return type:
Dict[str, torch.Tensor]
- adabmDCA.utils.resample_sequences(data: Tensor, weights: Tensor, nextract: int) Tensor
Extracts nextract sequences from data with replacement according to the weights.
- Parameters:
data (torch.Tensor) – Data array.
weights (torch.Tensor) – Weights of the sequences.
nextract (int) – Number of sequences to be extracted.
- Returns:
Extracted sequences.
- Return type:
torch.Tensor
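Weighted resampling with replacement has a direct standard-library counterpart, shown in the sketch below. The helper name and the toy data are hypothetical, and strings stand in for the tensor rows used by the library.

```python
import random


def sketch_resample(data, weights, nextract, rng=random):
    """Draw `nextract` items with replacement, proportionally to `weights`."""
    return rng.choices(data, weights=weights, k=nextract)


data = ["ACDEF", "ACDEG", "WYWYW"]
weights = [0.5, 0.5, 1.0]  # the third sequence is twice as likely to be drawn
resampled = sketch_resample(data, weights, nextract=10, rng=random.Random(0))
```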
- adabmDCA.utils.set_zerosum_gauge(params: Dict[str, Tensor]) Dict[str, Tensor]
Sets the zero-sum gauge on the coupling matrix.
- Parameters:
params (Dict[str, torch.Tensor]) – Parameters of the model.
- Returns:
Parameters with fixed gauge.
- Return type:
Dict[str, torch.Tensor]
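In the zero-sum gauge, every q x q coupling block sums to zero along both of its state indices. The sketch below applies the standard double-centering to a single block; the helper name is hypothetical, and the sketch touches only the coupling block, leaving aside how the library treats the full `params` dictionary.

```python
def sketch_zerosum_block(J_block):
    """Shift a q x q coupling block so every row and column sums to zero."""
    q = len(J_block)
    row_mean = [sum(row) / q for row in J_block]
    col_mean = [sum(J_block[a][b] for a in range(q)) / q for b in range(q)]
    total_mean = sum(row_mean) / q
    return [
        [J_block[a][b] - row_mean[a] - col_mean[b] + total_mean for b in range(q)]
        for a in range(q)
    ]


block = [[1.0, 2.0], [3.0, 6.0]]  # toy coupling block between two sites
gauged = sketch_zerosum_block(block)
```

Applying the function a second time leaves the block unchanged, since all the means it subtracts are already zero.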