adabmDCApy APIs

This section describes all the functions available in the Python implementation of adabmDCA.

Submodules

adabmDCA.dataset module

class adabmDCA.dataset.DatasetDCA(path_data: str | Path, path_weights: str | Path | None = None, alphabet: str = 'protein', clustering_th: float = 0.8, no_reweighting: bool = False, device: device = device(type='cpu'), dtype: dtype = torch.float32, message: bool = True)

Bases: Dataset

Dataset class for handling multiple sequence alignment (MSA) data.

get_effective_size() → int

Returns the effective size (Meff) of the dataset.

Returns: Effective size of the dataset.
Return type: int

get_num_residues() → int

Returns the number of residues (L) in the multi-sequence alignment.

Returns: Length of the MSA.
Return type: int

get_num_states() → int

Returns the number of states (q) in the alphabet.

Returns: Number of states.
Return type: int

shuffle() → None

Shuffles the dataset.

adabmDCA.fasta module

adabmDCA.fasta.compute_weights(data: ndarray | Tensor, th: float = 0.8, device: device = device(type='cpu'), dtype: dtype = torch.float32) → Tensor

Computes the weight to be assigned to each sequence ‘s’ in ‘data’ as 1 / n_clust, where ‘n_clust’ is the number of sequences that have a sequence identity with ‘s’ >= th.

Parameters:

data (np.ndarray | torch.Tensor) – Encoded input dataset.
th (float, optional) – Sequence identity threshold for the clustering. Defaults to 0.8.
device (torch.device, optional) – Device. Defaults to “cpu”.
dtype (torch.dtype, optional) – Data type. Defaults to torch.float32.

Returns:

Array with the weights of the sequences.

Return type:

torch.Tensor
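
The reweighting rule described above can be sketched in plain Python. This is only an illustration of the 1 / n_clust formula on a toy alignment, not the library's implementation; the helper names `sequence_identity` and `compute_weights_sketch` are hypothetical, and counting a sequence as its own neighbour is an assumption.

```python
def sequence_identity(a, b):
    """Fraction of aligned positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def compute_weights_sketch(data, th=0.8):
    """Weight of each sequence = 1 / (number of sequences with identity >= th).

    Assumption: the count includes the sequence itself, so weights lie in (0, 1].
    """
    weights = []
    for s in data:
        n_clust = sum(sequence_identity(s, t) >= th for t in data)
        weights.append(1.0 / n_clust)
    return weights


# Toy encoded alignment: the first two rows agree at four positions out of five.
msa = [
    [0, 1, 2, 3, 4],
    [0, 1, 2, 3, 5],
    [5, 4, 3, 2, 1],
]
weights = compute_weights_sketch(msa, th=0.8)
effective_size = sum(weights)
```

In the DCA literature the sum of these weights is usually called the effective number of sequences, which is presumably the quantity reported as Meff by the dataset class.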

adabmDCA.fasta.decode_sequence(sequence: list | ndarray | Tensor, tokens: str) → str | ndarray

Takes a numeric sequence or a list of sequences as input and returns the corresponding string encoding.

Parameters:

sequence (list | np.ndarray | torch.Tensor) – Input sequences. Can be either a 1D or a 2D iterable.
tokens (str) – Alphabet to be used for the encoding.

Returns:

String or array of strings with the decoded input.

Return type:

str | np.ndarray

adabmDCA.fasta.encode_sequence(sequence: str | ndarray | list, tokens: str) → ndarray

Encodes a sequence or a list of sequences into a numeric format.

Parameters:

sequence (str | np.ndarray | list) – Input sequence.
tokens (str) – Alphabet to be used for the encoding.

Returns:

Encoded sequence or sequences.

Return type:

np.ndarray
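
As a rough illustration of what encoding and decoding with a token string mean, the sketch below maps each character to its index in `tokens` and back. It does not call the library; `encode_sketch`, `decode_sketch`, and the `PROTEIN_TOKENS` string are assumptions made for this example.

```python
# Assumed token string for the example; the library's "protein" alphabet may differ.
PROTEIN_TOKENS = "-ACDEFGHIKLMNPQRSTVWY"


def encode_sketch(sequence, tokens):
    """Map each character to its position in the token string."""
    index = {tok: i for i, tok in enumerate(tokens)}
    # A character missing from the alphabet raises KeyError here.
    return [index[c] for c in sequence]


def decode_sketch(encoded, tokens):
    """Inverse mapping: turn a list of indices back into a string."""
    return "".join(tokens[i] for i in encoded)


encoded = encode_sketch("AC-Y", PROTEIN_TOKENS)
decoded = decode_sketch(encoded, PROTEIN_TOKENS)
```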

adabmDCA.fasta.get_tokens(alphabet: str) → str

Converts the alphabet into the corresponding tokens.

Parameters: alphabet (str) – Alphabet to be used for the encoding. It can be either “protein”, “rna”, “dna” or a custom string of tokens.
Returns: Tokens of the alphabet.
Return type: str

adabmDCA.fasta.import_from_fasta(fasta_name: str | Path, tokens: str | None = None, filter_sequences: bool = False, remove_duplicates: bool = True) → Tuple[ndarray, ndarray]

Import sequences from a fasta file. The following operations are performed:

- If ‘tokens’ is provided, encodes the sequences in numeric format.
- If ‘filter_sequences’ is True, removes the sequences whose tokens are not present in the alphabet.
- If ‘remove_duplicates’ is True, removes the duplicated sequences.

Parameters:

fasta_name (str | Path) – Path to the fasta file.
tokens (str | None, optional) – Alphabet to be used for the encoding. If provided, encodes the sequences in numeric format.
filter_sequences (bool, optional) – If True, removes the sequences whose tokens are not present in the alphabet. Defaults to False.
remove_duplicates (bool, optional) – If True, removes the duplicated sequences. Defaults to True.

Raises:

RuntimeError – The file is not in fasta format.

Returns:

headers, sequences.

Return type:

Tuple[np.ndarray, np.ndarray]
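
The reading and de-duplication steps can be sketched with the standard library alone. The function `read_fasta_sketch` below is a hypothetical stand-in, not the library routine; in particular, keeping the first occurrence of a duplicated sequence is an assumption.

```python
import io


def read_fasta_sketch(handle, remove_duplicates=True):
    """Parse FASTA text from a file-like object into (headers, sequences)."""
    headers, sequences, current = [], [], []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if headers:
                # Close the previous record before starting a new one.
                sequences.append("".join(current))
            headers.append(line[1:])
            current = []
        else:
            if not headers:
                raise RuntimeError("The file is not in fasta format.")
            current.append(line)
    if headers:
        sequences.append("".join(current))
    if remove_duplicates:
        seen, kept_headers, kept_sequences = set(), [], []
        for h, s in zip(headers, sequences):
            if s not in seen:  # assumption: keep the first occurrence
                seen.add(s)
                kept_headers.append(h)
                kept_sequences.append(s)
        headers, sequences = kept_headers, kept_sequences
    return headers, sequences


# seq1 is split over two lines and is identical to seq2 once joined.
fasta_text = ">seq1\nAC-D\nEF\n>seq2\nAC-DEF\n>seq3\nGGGGGG\n"
headers, sequences = read_fasta_sketch(io.StringIO(fasta_text))
```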

adabmDCA.fasta.validate_alphabet(sequences: ndarray, tokens: str)

Check if the chosen alphabet is compatible with the input sequences.

Parameters:

sequences (np.ndarray) – Input sequences.
tokens (str) – Alphabet to be used for the encoding.

Raises:

KeyError – The chosen alphabet is incompatible with the Multi-Sequence Alignment.

adabmDCA.fasta.write_fasta(fname: str, headers: ndarray, sequences: ndarray, numeric_input: bool = False, remove_gaps: bool = False, tokens: str = 'protein')

Generate a fasta file with the input sequences.

Parameters:

fname (str) – Name of the output fasta file.
headers (np.ndarray) – Array of sequences’ headers.
sequences (np.ndarray) – Array of sequences.
numeric_input (bool, optional) – Whether the sequences are in numeric (encoded) format or not. Defaults to False.
remove_gaps (bool, optional) – If True, removes the gap from the alignment. Defaults to False.
tokens (str, optional) – Alphabet to be used for the encoding. Defaults to “protein”.

adabmDCA.functional module

adabmDCA.functional.one_hot(x: Tensor, num_classes: int = -1, dtype: dtype = torch.float32)

A one-hot encoding function that is faster than the PyTorch one. It accepts torch.int32 input and returns a float tensor. Works only for 2D tensors.

Parameters:

x (torch.Tensor) – Input tensor to be one-hot encoded.
num_classes (int, optional) – Number of classes. If -1, the number of classes is inferred from the input tensor. Defaults to -1.
dtype (torch.dtype, optional) – Data type of the output tensor. Defaults to torch.float32.

Returns:

One-hot encoded tensor.

Return type:

torch.Tensor
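
To make the shapes concrete, here is a minimal pure-Python sketch of one-hot encoding a 2D array of integer states, turning shape (N, L) into (N, L, q). It uses nested lists instead of tensors, and inferring the number of classes as the largest value plus one is an assumption about how `num_classes = -1` is handled.

```python
def one_hot_sketch(x, num_classes=-1):
    """One-hot encode a 2D nested list of non-negative integers."""
    if num_classes == -1:
        # Assumption: infer the number of classes from the largest entry.
        num_classes = max(max(row) for row in x) + 1
    return [
        [[1.0 if a == c else 0.0 for c in range(num_classes)] for a in row]
        for row in x
    ]


encoded = [[0, 2], [1, 1]]
onehot = one_hot_sketch(encoded)
```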

adabmDCA.io module

adabmDCA.io.load_chains(fname: str, tokens: str, load_weights: bool = False) → ndarray | Tuple[ndarray, ndarray]

Loads the sequences from a fasta file and returns the numeric-encoded version. If the sequences are weighted, the log-weights are also returned. If the sequences are not weighted, the log-weights are set to 0.

Parameters:

fname (str) – Path to the file containing the sequences.
tokens (str) – “protein”, “dna”, “rna” or another string with the alphabet to be used.
load_weights (bool, optional) – If True, the log-weights are loaded and returned. Defaults to False.

Returns:

Numeric-encoded sequences and log-weights if load_weights is True.

Return type:

np.ndarray | Tuple[np.ndarray, np.ndarray]

adabmDCA.io.load_params(fname: str, tokens: str, device: device, dtype: dtype = torch.float32) → Dict[str, Tensor]

Import the parameters of the model from a file.

Parameters:

fname (str) – Path of the file that stores the parameters.
tokens (str) – “protein”, “dna”, “rna” or another string with the alphabet to be used.
device (torch.device) – Device where to store the parameters.
dtype (torch.dtype) – Data type of the parameters. Defaults to torch.float32.

Returns:

Parameters of the model.

Return type:

Dict[str, torch.Tensor]

adabmDCA.io.load_params_oldformat(fname: str, device: device, dtype: dtype = torch.float32) → Dict[str, Tensor]

Import the parameters of the model from a file. Assumes the old DCA format.

Parameters:

fname (str) – Path of the file that stores the parameters.
device (torch.device) – Device where to store the parameters.
dtype (torch.dtype) – Data type of the parameters. Defaults to torch.float32.

Returns:

Parameters of the model.

Return type:

Dict[str, torch.Tensor]

adabmDCA.io.save_chains(fname: str, chains: Tensor, tokens: str, log_weights: Tensor = None) → None

Saves the chains in a fasta file.

Parameters:

fname (str) – Path to the file where to save the chains.
chains (torch.Tensor) – Chains.
tokens (str) – “protein”, “dna”, “rna” or another string with the alphabet to be used.
log_weights (torch.Tensor, optional) – Log-weights of the chains. Defaults to None.

adabmDCA.io.save_params(fname: str, params: Dict[str, Tensor], tokens: str, mask: Tensor | None = None) → None

Saves the parameters of the model in a file.

Parameters:

fname (str) – Path to the file where to save the parameters.
params (Dict[str, torch.Tensor]) – Parameters of the model.
tokens (str) – “protein”, “dna”, “rna” or another string with the alphabet to be used.
mask (torch.Tensor | None) – Mask of the coupling matrix that determines which are the non-zero entries. If None, the lower-triangular part of the coupling matrix is masked. Defaults to None.

adabmDCA.io.save_params_oldformat(fname: str, params: Dict[str, Tensor], mask: Tensor | None = None) → None

Saves the parameters of the model in a file. Assumes the old DCA format.

Parameters:

fname (str) – Path to the file where to save the parameters.
params (Dict[str, torch.Tensor]) – Parameters of the model.
mask (torch.Tensor) – Mask of the coupling matrix that determines which are the non-zero entries. If None, the lower-triangular part of the coupling matrix is masked. Defaults to None.

adabmDCA.pca module

class adabmDCA.pca.Pca

Bases: object

fit(M: Tensor, num_directions: int = 2) → None

Fit the PCA model to the data.

Parameters:

M (torch.Tensor) – Data matrix (num_samples, num_variables).
num_directions (int) – Number of principal components to compute.

fit_transform(M: Tensor, num_directions: int = 2) → Tensor

Fit the PCA model to the data and project the data onto the principal components.

Parameters:

M (torch.Tensor) – Data matrix (num_samples, num_variables).
num_directions (int) – Number of principal components to compute.

Returns:

Projected data matrix (num_samples, num_directions).

Return type:

torch.Tensor

transform(M: Tensor) → Tensor

Projects the data onto the principal components.

Parameters: M (torch.Tensor) – Data matrix (num_samples, num_variables).
Returns: Projected data matrix (num_samples, num_directions).
Return type: torch.Tensor

adabmDCA.plot module

adabmDCA.plot.plot_PCA(fig: figure, data1: ndarray, pc1: int = 0, pc2: int = 1, data2: ndarray | None = None, labels: List[str] | str = 'Data', colors: List[str] | str = 'black', title: str | None = None) → figure

Makes the scatter plot of the components (pc1, pc2) of the input data and shows the histograms of the components.

Parameters:

fig (plt.figure) – Figure to plot the data.
data1 (np.ndarray) – Data to plot.
pc1 (int, optional) – First principal direction. Defaults to 0.
pc2 (int, optional) – Second principal direction. Defaults to 1.
data2 (np.ndarray | None, optional) – Data to be superimposed to data1. Defaults to None.
labels (List[str] | str, optional) – Labels to put in the legend. Defaults to “Data”.
colors (List[str] | str, optional) – Colors to be used. Defaults to “black”.
title (str | None, optional) – Title of the plot. Defaults to None.

Returns:

Updated figure.

Return type:

plt.figure

adabmDCA.plot.plot_autocorrelation(ax: Axes, checkpoints: ndarray, autocorr: ndarray, gen_seqid: ndarray, data_seqid: ndarray) → Axes

Plots the time-autocorrelation curve of the sequence identity and the generated and data sequence identities.

Parameters:

ax (plt.Axes) – Axes to plot the data.
checkpoints (np.ndarray) – Checkpoints of the sampling.
autocorr (np.ndarray) – Time-autocorrelation of the sequence identity.
gen_seqid (np.ndarray) – Sequence identity of the generated data.
data_seqid (np.ndarray) – Sequence identity of the data.

Returns:

Updated axes.

Return type:

plt.Axes

adabmDCA.plot.plot_pearson_sampling(ax: Axes, checkpoints: ndarray, pearsons: ndarray, pearson_training: ndarray | None = None)

adabmDCA.plot.plot_scatter_correlations(ax: Axes, Cij_data: ndarray, Cij_gen: ndarray, Cijk_data: ndarray, Cijk_gen: ndarray, pearson_Cij: float, pearson_Cijk: float) → Axes

Plots the scatter plot of the data and generated Cij and Cijk values.

Parameters:

ax (plt.Axes) – Axes to plot the data. Must have 2 subplots.
Cij_data (np.ndarray) – Data Cij values.
Cij_gen (np.ndarray) – Generated Cij values.
Cijk_data (np.ndarray) – Data Cijk values.
Cijk_gen (np.ndarray) – Generated Cijk values.
pearson_Cij (float) – Pearson correlation coefficient of Cij.
pearson_Cijk (float) – Pearson correlation coefficient of Cijk.

Returns:

Updated axes.

Return type:

plt.Axes

adabmDCA.sampling module

adabmDCA.sampling.get_sampler(sampling_method: str) → Callable

Returns the sampling function corresponding to the chosen method.

Parameters: sampling_method (str) – String indicating the sampling method. Choose between ‘metropolis’ and ‘gibbs’.
Raises: KeyError – Unknown sampling method.
Returns: Sampling function.
Return type: Callable

adabmDCA.sampling.gibbs_sampling(chains: Tensor, params: Dict[str, Tensor], nsweeps: int, beta: float = 1.0) → Tensor

Gibbs sampling.

Parameters:

chains (torch.Tensor) – Initial chains.
params (Dict[str, torch.Tensor]) – Parameters of the model.
nsweeps (int) – Number of sweeps.
beta (float, optional) – Inverse temperature. Defaults to 1.0.

Returns:

Updated chains.

Return type:

torch.Tensor
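
For readers unfamiliar with Gibbs sampling on a Potts-like model, the following sketch performs one sweep over a single integer-encoded chain. It is a simplified illustration, not the library's vectorised code: the function name, the nested-list layout `h[i][a]` and `J[i][a][j][b]`, and the use of integer states rather than one-hot tensors are all assumptions.

```python
import math
import random


def gibbs_sweep_sketch(chain, h, J, beta=1.0, rng=random):
    """One Gibbs sweep: resample every site from its conditional distribution."""
    L, q = len(h), len(h[0])
    for i in range(L):
        # Log-probability (up to a constant) of each candidate state at site i.
        logits = [
            beta * (h[i][a] + sum(J[i][a][j][chain[j]] for j in range(L) if j != i))
            for a in range(q)
        ]
        m = max(logits)  # subtract the maximum for numerical stability
        probs = [math.exp(v - m) for v in logits]
        chain[i] = rng.choices(range(q), weights=probs, k=1)[0]
    return chain


# Toy model: no couplings and a very strong bias towards state 1 at every site.
L, q = 3, 2
h = [[0.0, 50.0] for _ in range(L)]
J = [[[[0.0] * q for _ in range(L)] for _ in range(q)] for _ in range(L)]
chain = gibbs_sweep_sketch([0] * L, h, J, beta=1.0, rng=random.Random(0))
```

Calling the sweep function repeatedly would correspond to the `nsweeps` argument of the sampler.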

adabmDCA.sampling.metropolis(chains: Tensor, params: Dict[str, Tensor], nsweeps: int, beta: float = 1.0) → Tensor

Metropolis sampling.

Parameters:

chains (torch.Tensor) – One-hot encoded sequences.
params (Dict[str, torch.Tensor]) – Parameters of the model.
nsweeps (int) – Number of sweeps to be performed.
beta (float, optional) – Inverse temperature. Defaults to 1.0.

Returns:

Updated chains.

Return type:

torch.Tensor

adabmDCA.statmech module

adabmDCA.statmech.compute_energy(X: Tensor, params: Dict[str, Tensor]) → Tensor

Compute the DCA energy of the sequences in X.

Parameters:

X (torch.Tensor) – Sequences in one-hot encoding format.
params (Dict[str, torch.Tensor]) – Parameters of the model.

Returns:

DCA Energy of the sequences.

Return type:

torch.Tensor

adabmDCA.statmech.compute_entropy(chains: Tensor, params: Dict[str, Tensor], logZ: float) → float

Compute the entropy of the DCA model.

Parameters:

chains (torch.Tensor) – Chains that are supposed to be an equilibrium realization of the model.
params (Dict[str, torch.Tensor]) – Parameters of the model.
logZ (float) – Log-partition function of the model.

Returns:

Entropy of the model.

Return type:

float

adabmDCA.statmech.compute_logZ_exact(all_states: Tensor, params: Dict[str, Tensor]) → float

Compute the log-partition function of the model.

Parameters:

all_states (torch.Tensor) – All possible states of the system.
params (Dict[str, torch.Tensor]) – Parameters of the model.

Returns:

Log-partition function of the model.

Return type:

float

adabmDCA.statmech.compute_log_likelihood(fi: Tensor, fij: Tensor, params: Dict[str, Tensor], logZ: float) → float

Compute the log-likelihood of the model.

Parameters:

fi (torch.Tensor) – Single-site frequencies of the data.
fij (torch.Tensor) – Two-site frequencies of the data.
params (Dict[str, torch.Tensor]) – Parameters of the model.
logZ (float) – Log-partition function of the model.

Returns:

Log-likelihood of the model.

Return type:

float

adabmDCA.statmech.enumerate_states(L: int, q: int, device: device = device(type='cpu')) → Tensor

Enumerate all possible states of a system of L sites and q states.

Parameters:

L (int) – Number of sites.
q (int) – Number of states.
device (torch.device, optional) – Device to store the states. Defaults to “cpu”.

Returns:

All possible states.

Return type:

torch.Tensor
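
For very small systems, the energy, state enumeration, and exact log-partition function can be tied together in a few lines. The sketch below is a pure-Python illustration under the common convention E(x) = -sum_i h_i(x_i) - sum_{i<j} J_ij(x_i, x_j); that sign and index convention, the function names, and the nested-list parameter layout are assumptions rather than the library's definitions.

```python
import itertools
import math


def enumerate_states_sketch(L, q):
    """All q**L configurations of L sites with q states each."""
    return list(itertools.product(range(q), repeat=L))


def energy_sketch(state, h, J):
    """Energy of one integer-encoded configuration (assumed sign convention)."""
    L = len(state)
    e = -sum(h[i][state[i]] for i in range(L))
    e -= sum(J[i][state[i]][j][state[j]] for i in range(L) for j in range(i + 1, L))
    return e


def log_partition_sketch(h, J):
    """log Z = log sum_x exp(-E(x)), computed with the log-sum-exp trick."""
    L, q = len(h), len(h[0])
    neg_energies = [-energy_sketch(s, h, J) for s in enumerate_states_sketch(L, q)]
    m = max(neg_energies)
    return m + math.log(sum(math.exp(x - m) for x in neg_energies))


# With all parameters set to zero every state has energy 0, so Z = q**L.
L, q = 2, 3
h = [[0.0] * q for _ in range(L)]
J = [[[[0.0] * q for _ in range(L)] for _ in range(q)] for _ in range(L)]
states = enumerate_states_sketch(L, q)
logZ = log_partition_sketch(h, J)
```

Because the number of states grows as q**L, this exhaustive approach is only practical for toy models.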

adabmDCA.statmech.iterate_tap(mag: Tensor, params: Dict[str, Tensor], max_iter: int = 500, epsilon: float = 0.0001)

Iterates the TAP equations until convergence.

Parameters:

mag (torch.Tensor) – Initial magnetizations.
params (Dict[str, torch.Tensor]) – Parameters of the model.
max_iter (int, optional) – Maximum number of iterations. Defaults to 500.
epsilon (float, optional) – Convergence threshold. Defaults to 1e-4.

Returns:

Fixed point magnetizations of the TAP equations.

Return type:

torch.Tensor

adabmDCA.stats module

adabmDCA.stats.extract_Cij_from_freq(fij: Tensor, pij: Tensor, fi: Tensor, pi: Tensor, mask: Tensor | None = None) → Tuple[float, float]

Extracts the lower triangular part of the covariance matrices of the data and chains starting from the frequencies.

Parameters:

fij (torch.Tensor) – Two-point frequencies of the data.
pij (torch.Tensor) – Two-point frequencies of the chains.
fi (torch.Tensor) – Single-point frequencies of the data.
pi (torch.Tensor) – Single-point frequencies of the chains.
mask (torch.Tensor | None, optional) – Mask for comparing just a subset of the couplings. Defaults to None.

Returns:

Extracted two-point frequencies of the data and chains.

Return type:

Tuple[float, float]

adabmDCA.stats.extract_Cij_from_seqs(data: Tensor, chains: Tensor, weights: Tensor | None = None, pseudo_count: float = 0.0, mask: Tensor | None = None) → Tuple[float, float]

Extracts the lower triangular part of the covariance matrices of the data and chains starting from the sequences.

Parameters:

data (torch.Tensor) – Data sequences.
chains (torch.Tensor) – Chain sequences.
weights (torch.Tensor | None, optional) – Weights of the sequences. Defaults to None.
pseudo_count (float, optional) – Pseudo count for the single and two points statistics. Acts as a regularization. Defaults to 0.0.
mask (torch.Tensor | None, optional) – Mask for comparing just a subset of the couplings. Defaults to None.

Returns:

Two-point frequencies of the data and chains.

Return type:

Tuple[float, float]

adabmDCA.stats.generate_unique_triplets(L: int, ntriplets: int, device: device = device(type='cpu')) → Tensor

Generates a set of unique triplets of positions. Used to compute the 3-points statistics.

Parameters:

L (int) – Length of the sequences.
ntriplets (int) – Number of triplets to be generated.
device (torch.device, optional) – Device to perform computations on. Defaults to “cpu”.

Returns:

Tensor of shape (ntriplets, 3) containing the indices of the triplets.

Return type:

torch.Tensor
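
One simple way to draw unique triplets of positions is rejection sampling into a set, sketched below with the standard library. Treating a triplet as an unordered set of three distinct positions (stored as i < j < k) is an assumption, as is the function name; the library may use a different strategy.

```python
import random


def unique_triplets_sketch(L, ntriplets, rng=random):
    """Draw ntriplets distinct triplets (i < j < k) of positions in range(L)."""
    max_triplets = L * (L - 1) * (L - 2) // 6
    if ntriplets > max_triplets:
        raise ValueError("Requested more triplets than exist for this L.")
    triplets = set()
    while len(triplets) < ntriplets:
        # Sorting makes (i, j, k) canonical so duplicates are detected by the set.
        triplets.add(tuple(sorted(rng.sample(range(L), 3))))
    return sorted(triplets)


# For L = 5 there are exactly 10 possible triplets.
triplets = unique_triplets_sketch(5, 10, rng=random.Random(0))
```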

adabmDCA.stats.get_correlation_two_points(fij: Tensor, pij: Tensor, fi: Tensor, pi: Tensor, mask: Tensor | None = None) → Tuple[float, float]

Computes the Pearson coefficient and the slope between the two-point frequencies of data and chains.

Parameters:

fij (torch.Tensor) – Two-point frequencies of the data.
pij (torch.Tensor) – Two-point frequencies of the chains.
fi (torch.Tensor) – Single-point frequencies of the data.
pi (torch.Tensor) – Single-point frequencies of the chains.
mask (torch.Tensor | None, optional) – Mask to select the couplings to use for the correlation coefficient. Defaults to None.

Returns:

Pearson correlation coefficient of the two-sites statistics and slope of the interpolating line.

Return type:

Tuple[float, float]
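
The two returned numbers are standard summary statistics of a scatter plot. As a minimal sketch, assuming the two-site statistics of data and chains have already been flattened into two equal-length lists, the Pearson coefficient and least-squares slope can be computed as follows. The function name and the choice of which list plays the role of x are assumptions.

```python
import math


def pearson_and_slope_sketch(x, y):
    """Pearson correlation of (x, y) and slope of the least-squares line y ~ x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    pearson = sxy / math.sqrt(sxx * syy)
    slope = sxy / sxx
    return pearson, slope


# Toy values: y is exactly twice x, so the points lie on a line of slope 2.
x_values = [0.0, 0.1, 0.2, 0.3]
y_values = [0.0, 0.2, 0.4, 0.6]
pearson, slope = pearson_and_slope_sketch(x_values, y_values)
```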

adabmDCA.stats.get_covariance_matrix(data: Tensor, weights: Tensor | None = None, pseudo_count: float = 0.0) → Tensor

Computes the weighted covariance matrix of the input multiple sequence alignment.

Parameters:

data (torch.Tensor) – Input MSA in one-hot variables.
weights (torch.Tensor | None, optional) – Importance weights of the sequences. Defaults to None.
pseudo_count (float, optional) – Pseudo count. Defaults to 0.0.

Returns:

Covariance matrix.

Return type:

torch.Tensor

adabmDCA.stats.get_freq_single_point(data: Tensor, weights: Tensor | None = None, pseudo_count: float = 0.0) → Tensor

Computes the single point frequencies of the input MSA.

Parameters:

data (torch.Tensor) – One-hot encoded data array.
weights (torch.Tensor | None, optional) – Weights of the sequences.
pseudo_count (float, optional) – Pseudo count to be added to the frequencies. Defaults to 0.0.

Raises:

ValueError – If the input data is not a 3D tensor.

Returns:

Single point frequencies.

Return type:

torch.Tensor
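
As an illustration of weighted single-point frequencies with a pseudo count, the sketch below works on nested lists of shape (M, L, q). The mixing rule (1 - pseudo_count) * f + pseudo_count / q is a common convention and is assumed here; it is not confirmed to be the exact formula used by the library, and the function name is hypothetical.

```python
def freq_single_point_sketch(data, weights=None, pseudo_count=0.0):
    """Weighted frequency of each state at each site, for one-hot nested lists."""
    M, L, q = len(data), len(data[0]), len(data[0][0])
    if weights is None:
        weights = [1.0] * M  # uniform weights when none are given
    total = sum(weights)
    fi = [
        [sum(w * seq[i][a] for w, seq in zip(weights, data)) / total for a in range(q)]
        for i in range(L)
    ]
    # Assumed regularisation: mix the empirical frequencies with the uniform ones.
    return [[(1.0 - pseudo_count) * f + pseudo_count / q for f in row] for row in fi]


# Two sequences of length 1 over 2 states; the first carries three times the weight.
data = [[[1.0, 0.0]], [[0.0, 1.0]]]
fi_plain = freq_single_point_sketch(data, weights=[3.0, 1.0])
fi_regularised = freq_single_point_sketch(data, weights=[3.0, 1.0], pseudo_count=0.5)
```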

adabmDCA.stats.get_freq_three_points(data: Tensor, ntriplets: int, weights: Tensor | None = None, device: device = device(type='cpu')) → Tensor

Computes the 3-body statistics of the input MSA.

Parameters:

data (torch.Tensor) – Input MSA in one-hot encoding.
ntriplets (int) – Number of triplets to test.
weights (torch.Tensor | None, optional) – Importance weights for the sequences. Defaults to None.
device (torch.device, optional) – Device to perform computations on. Defaults to “cpu”.

Returns:

3-points connected correlation for ntriplets randomly extracted triplets.

Return type:

torch.Tensor

adabmDCA.stats.get_freq_two_points(data: Tensor, weights: Tensor | None = None, pseudo_count: float = 0.0) → Tensor

Computes the 2-points statistics of the input MSA.

Parameters:

data (torch.Tensor) – One-hot encoded data array.
weights (torch.Tensor | None, optional) – Array of weights to assign to the sequences of shape.
pseudo_count (float, optional) – Pseudo count for the single and two points statistics. Acts as a regularization. Defaults to 0.0.

Raises:

ValueError – If the input data is not a 3D tensor.

Returns:

Matrix of two-point frequencies of shape (L, q, L, q).

Return type:

torch.Tensor

adabmDCA.utils module

adabmDCA.utils.get_device(device: str, message: bool = True) → device

Returns the device where to store the tensors.

Parameters:

device (str) – Device to be used.
message (bool, optional) – Print the device. Defaults to True.

Returns:

Device.

Return type:

torch.device

adabmDCA.utils.get_dtype(dtype: str) → dtype

Returns the data type of the tensors.

Parameters: dtype (str) – Data type.
Returns: Data type.
Return type: torch.dtype

adabmDCA.utils.get_mask_save(L: int, q: int, device: device) → Tensor

Returns the mask to save the upper-triangular part of the coupling matrix.

Parameters:

L (int) – Length of the MSA.
q (int) – Number of values that each residue can assume.
device (torch.device) – Device where to store the mask.

Returns:

Mask.

Return type:

torch.Tensor

adabmDCA.utils.init_chains(num_chains: int, L: int, q: int, device: device, dtype: dtype = torch.float32, fi: Tensor | None = None) → Tensor

Initialize the chains of the DCA model. If ‘fi’ is provided, the chains are sampled from the profile model, otherwise they are sampled uniformly at random.

Parameters:

num_chains (int) – Number of parallel chains.
L (int) – Length of the MSA.
q (int) – Number of values that each residue can assume.
device (torch.device) – Device where to store the chains.
dtype (torch.dtype, optional) – Data type of the chains. Defaults to torch.float32.
fi (torch.Tensor | None, optional) – Single-point frequencies. Defaults to None.

Returns:

Initialized parallel chains in one-hot encoding format.

Return type:

torch.Tensor
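
Sampling from the profile model means drawing each site independently from its single-point frequencies. A minimal pure-Python sketch of that idea follows; the function name and nested-list output are assumptions, and the library itself returns tensors.

```python
import random


def init_chains_sketch(num_chains, L, q, fi=None, rng=random):
    """Draw one-hot chains site by site, from fi if given, else uniformly."""
    chains = []
    for _ in range(num_chains):
        chain = []
        for i in range(L):
            site_weights = fi[i] if fi is not None else [1.0] * q
            a = rng.choices(range(q), weights=site_weights, k=1)[0]
            chain.append([1.0 if b == a else 0.0 for b in range(q)])
        chains.append(chain)
    return chains


# A fully peaked profile: site 0 is always state 0 and site 1 is always state 1.
profile = [[1.0, 0.0], [0.0, 1.0]]
chains = init_chains_sketch(4, L=2, q=2, fi=profile, rng=random.Random(0))
```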

adabmDCA.utils.init_parameters(fi: Tensor) → Dict[str, Tensor]

Initialize the parameters of the DCA model.

Parameters: fi (torch.Tensor) – Single-point frequencies of the data.
Returns: Parameters of the model.
Return type: Dict[str, torch.Tensor]

adabmDCA.utils.resample_sequences(data: Tensor, weights: Tensor, nextract: int) → Tensor

Extracts nextract sequences from data with replacement according to the weights.

Parameters:

data (torch.Tensor) – Data array.
weights (torch.Tensor) – Weights of the sequences.
nextract (int) – Number of sequences to be extracted.

Returns:

Extracted sequences.

Return type:

torch.Tensor
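
Weighted extraction with replacement can be sketched with `random.choices` from the standard library. This is only an illustration of the sampling rule; the helper name `resample_sketch` is hypothetical and the library operates on tensors.

```python
import random


def resample_sketch(data, weights, nextract, rng=random):
    """Draw nextract items from data, with replacement, proportionally to weights."""
    return rng.choices(data, weights=weights, k=nextract)


# Only the last sequence has non-zero weight, so it is the only one ever drawn.
sequences = [[0, 0], [1, 1], [2, 2]]
extracted = resample_sketch(sequences, weights=[0.0, 0.0, 1.0], nextract=5,
                            rng=random.Random(0))
```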

adabmDCA.utils.set_zerosum_gauge(params: Dict[str, Tensor]) → Dict[str, Tensor]

Sets the zero-sum gauge on the coupling matrix.

Parameters: params (Dict[str, torch.Tensor]) – Parameters of the model.
Returns: Parameters with fixed gauge.
Return type: Dict[str, torch.Tensor]
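
In the zero-sum gauge, every q x q coupling block has rows and columns that sum to zero. A minimal sketch of the usual transformation on a single block is given below: subtract the row and column means and add back the overall mean. The function name is hypothetical, the block is a plain nested list, and the biases are left untouched in this sketch.

```python
def zerosum_block_sketch(block):
    """Apply the zero-sum gauge to one q x q coupling block."""
    q = len(block)
    row_mean = [sum(block[a]) / q for a in range(q)]
    col_mean = [sum(block[a][b] for a in range(q)) / q for b in range(q)]
    total_mean = sum(row_mean) / q
    return [
        [block[a][b] - row_mean[a] - col_mean[b] + total_mean for b in range(q)]
        for a in range(q)
    ]


block = [[1.0, 2.0], [3.0, 5.0]]
gauged = zerosum_block_sketch(block)
row_sums = [sum(row) for row in gauged]
col_sums = [sum(gauged[a][b] for a in range(2)) for b in range(2)]
```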
