adabmDCApy APIs

This section describes all the functions available in the Python implementation of adabmDCA.

Submodules

adabmDCA.dataset module

class adabmDCA.dataset.DatasetDCA(path_data: str | Path, path_weights: str | Path | None = None, alphabet: str = 'protein', clustering_th: float = 0.8, no_reweighting: bool = False, device: device = device(type='cpu'), dtype: dtype = torch.float32, message: bool = True)

Bases: Dataset

Dataset class for handling multiple sequence alignment (MSA) data.

get_effective_size() int

Returns the effective size (Meff) of the dataset.

Returns:

Effective size of the dataset.

Return type:

int

get_num_residues() int

Returns the number of residues (L) in the multiple sequence alignment.

Returns:

Length of the MSA.

Return type:

int

get_num_states() int

Returns the number of states (q) in the alphabet.

Returns:

Number of states.

Return type:

int

shuffle() None

Shuffles the dataset.

adabmDCA.fasta module

adabmDCA.fasta.compute_weights(data: ndarray | Tensor, th: float = 0.8, device: device = device(type='cpu'), dtype: dtype = torch.float32) Tensor

Computes the weight to be assigned to each sequence ‘s’ in ‘data’ as 1 / n_clust, where ‘n_clust’ is the number of sequences that have a sequence identity with ‘s’ >= th.

Parameters:
  • data (np.ndarray | torch.Tensor) – Encoded input dataset.

  • th (float, optional) – Sequence identity threshold for the clustering. Defaults to 0.8.

  • device (torch.device, optional) – Device. Defaults to “cpu”.

  • dtype (torch.dtype, optional) – Data type. Defaults to torch.float32.

Returns:

Array with the weights of the sequences.

Return type:

torch.Tensor
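
The reweighting rule described above can be illustrated with a small dependency-free sketch. It is not the library's implementation (which works on encoded tensors); the helper names `sequence_identity` and `reweight` are hypothetical, and sequence identity is assumed to be the fraction of aligned positions that match.

```python
def sequence_identity(a, b):
    """Fraction of aligned positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def reweight(msa, th=0.8):
    """Weight of each sequence: 1 / (number of sequences with identity >= th)."""
    weights = []
    for s in msa:
        # The count includes s itself (identity 1.0), so n_clust >= 1.
        n_clust = sum(sequence_identity(s, t) >= th for t in msa)
        weights.append(1.0 / n_clust)
    return weights


msa = ["ACDEF", "ACDEY", "WWWWW"]
weights = reweight(msa, th=0.8)
print(weights)       # the first two sequences share a cluster
print(sum(weights))  # sum of the weights
```

In DCA the sum of these weights is commonly used as the effective number of sequences; whether `get_effective_size()` is computed exactly this way is an assumption here.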

adabmDCA.fasta.decode_sequence(sequence: ndarray, tokens: str) str | ndarray

Takes a numeric sequence or a list of numeric sequences as input and returns the corresponding string encoding.

Parameters:
  • sequence (np.ndarray) – Input sequences. Can be either a 1D or a 2D array.

  • tokens (str) – Alphabet to be used for the encoding.

Returns:

Decoded input.

Return type:

str | np.ndarray

adabmDCA.fasta.encode_sequence(sequence: str | ndarray, tokens: str) ndarray

Encodes a sequence or a list of sequences into a numeric format.

Parameters:
  • sequence (str | np.ndarray) – Input sequence or array of sequences.

  • tokens (str) – Alphabet to be used for the encoding.

Returns:

Encoded sequence or sequences.

Return type:

np.ndarray
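
As a rough illustration of what encoding and decoding mean here, the sketch below maps each character to its position in the token string and back. It is a plain-Python stand-in, not the library code: the helper names `encode` and `decode` are hypothetical, and the token string shown for proteins (gap first, then the 20 amino acids) is an assumption.

```python
def encode(sequence, tokens):
    """Map each character to its index in the token string."""
    index = {tok: i for i, tok in enumerate(tokens)}
    return [index[c] for c in sequence]


def decode(encoded, tokens):
    """Map each index back to the corresponding token."""
    return "".join(tokens[i] for i in encoded)


# Assumed protein alphabet: gap symbol followed by the 20 amino acids.
tokens = "-ACDEFGHIKLMNPQRSTVWY"

encoded = encode("AC-Y", tokens)
decoded = decode(encoded, tokens)
print(encoded)
print(decoded)
```

A character that is not in `tokens` raises a `KeyError` in this sketch, which mirrors the kind of incompatibility that `validate_alphabet` is documented to report.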

adabmDCA.fasta.get_tokens(alphabet: str) str

Converts the alphabet into the corresponding tokens.

Parameters:

alphabet (str) – Alphabet to be used for the encoding. It can be either “protein”, “rna”, “dna” or a custom string of tokens.

Returns:

Tokens of the alphabet.

Return type:

str

adabmDCA.fasta.import_from_fasta(fasta_name: str | Path, tokens: str | None = None, filter_sequences: bool = False) Tuple[ndarray, ndarray]

Import data from a fasta file.

Parameters:
  • fasta_name (str | Path) – Path to the fasta file.

  • tokens (str | None, optional) – Alphabet to be used for the encoding. If provided, encodes the sequences in numeric format.

  • filter_sequences (bool, optional) – If True, removes the sequences whose tokens are not present in the alphabet. Defaults to False.

Raises:

RuntimeError – The file is not in fasta format.

Returns:

headers, sequences.

Return type:

Tuple[np.ndarray, np.ndarray]

adabmDCA.fasta.validate_alphabet(sequences: ndarray, tokens: str)

Check if the chosen alphabet is compatible with the input sequences.

Parameters:
  • sequences (np.ndarray) – Input sequences.

  • tokens (str) – Alphabet to be used for the encoding.

Raises:

KeyError – The chosen alphabet is incompatible with the multiple sequence alignment.

adabmDCA.fasta.write_fasta(fname: str, headers: ndarray, sequences: ndarray, numeric_input: bool = False, remove_gaps: bool = False, tokens: str = 'protein')

Generate a fasta file with the input sequences.

Parameters:
  • fname (str) – Name of the output fasta file.

  • headers (np.ndarray) – Array of sequences’ headers.

  • sequences (np.ndarray) – Array of sequences.

  • numeric_input (bool, optional) – Whether the sequences are in numeric (encoded) format or not. Defaults to False.

  • remove_gaps (bool, optional) – If True, removes the gaps from the alignment. Defaults to False.

  • tokens (str, optional) – Alphabet to be used for the encoding. Defaults to “protein”.

adabmDCA.functional module

adabmDCA.functional.one_hot(x: Tensor, num_classes: int = -1, dtype: dtype = torch.float32)

A one-hot encoding function that is faster than the PyTorch one. It accepts torch.int32 input and returns a float tensor. Works only for 2D tensors.

Parameters:
  • x (torch.Tensor) – Input tensor to be one-hot encoded.

  • num_classes (int, optional) – Number of classes. If -1, the number of classes is inferred from the input tensor. Defaults to -1.

  • dtype (torch.dtype, optional) – Data type of the output tensor. Defaults to torch.float32.

Returns:

One-hot encoded tensor.

Return type:

torch.Tensor
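
To make the expected shapes concrete, here is a minimal plain-Python sketch of one-hot encoding a 2D array of integer labels. It only illustrates the transformation (M, L) → (M, L, q); the function name `one_hot_2d` is hypothetical and nested lists stand in for tensors.

```python
def one_hot_2d(x, num_classes=-1):
    """One-hot encode a 2D list of integer labels into a 3D nested list."""
    if num_classes < 0:
        # Infer the number of classes from the largest label.
        num_classes = max(max(row) for row in x) + 1
    return [
        [[1.0 if a == c else 0.0 for c in range(num_classes)] for a in row]
        for row in x
    ]


encoded = [[0, 2], [1, 1]]  # 2 sequences of length 2
X = one_hot_2d(encoded)     # shape (2, 2, 3): classes inferred as 3
print(X[0][1])
```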

adabmDCA.io module

adabmDCA.io.load_chains(fname: str, tokens: str, load_weights: bool = False) ndarray | Tuple[ndarray, ndarray]

Loads the sequences from a fasta file and returns the numeric-encoded version. If the sequences are weighted, the log-weights are also returned. If the sequences are not weighted, the log-weights are set to 0.

Parameters:
  • fname (str) – Path to the file containing the sequences.

  • tokens (str) – “protein”, “dna”, “rna” or another string with the alphabet to be used.

  • load_weights (bool, optional) – If True, the log-weights are loaded and returned. Defaults to False.

Returns:

Numeric-encoded sequences and log-weights if load_weights is True.

Return type:

np.ndarray | Tuple[np.ndarray, np.ndarray]

adabmDCA.io.load_params(fname: str, tokens: str, device: device, dtype: dtype = torch.float32) Dict[str, Tensor]

Import the parameters of the model from a file.

Parameters:
  • fname (str) – Path of the file that stores the parameters.

  • tokens (str) – “protein”, “dna”, “rna” or another string with the alphabet to be used.

  • device (torch.device) – Device where to store the parameters.

  • dtype (torch.dtype) – Data type of the parameters. Defaults to torch.float32.

Returns:

Parameters of the model.

Return type:

Dict[str, torch.Tensor]

adabmDCA.io.load_params_oldformat(fname: str, device: device, dtype: dtype = torch.float32) Dict[str, Tensor]

Import the parameters of the model from a file. Assumes the old DCA format.

Parameters:
  • fname (str) – Path of the file that stores the parameters.

  • device (torch.device) – Device where to store the parameters.

  • dtype (torch.dtype) – Data type of the parameters. Defaults to torch.float32.

Returns:

Parameters of the model.

Return type:

Dict[str, torch.Tensor]

adabmDCA.io.save_chains(fname: str, chains: Tensor, tokens: str, log_weights: Tensor = None) None

Saves the chains in a fasta file.

Parameters:
  • fname (str) – Path to the file where to save the chains.

  • chains (torch.Tensor) – Chains.

  • tokens (str) – “protein”, “dna”, “rna” or another string with the alphabet to be used.

  • log_weights (torch.Tensor, optional) – Log-weights of the chains. Defaults to None.

adabmDCA.io.save_params(fname: str, params: Dict[str, Tensor], mask: Tensor, tokens: str) None

Saves the parameters of the model in a file.

Parameters:
  • fname (str) – Path to the file where to save the parameters.

  • params (Dict[str, torch.Tensor]) – Parameters of the model.

  • mask (torch.Tensor) – Mask of the coupling matrix that determines which are the non-zero entries.

  • tokens (str) – “protein”, “dna”, “rna” or another string with the alphabet to be used.

adabmDCA.io.save_params_oldformat(fname: str, params: Dict[str, Tensor], mask: Tensor) None

Saves the parameters of the model in a file. Assumes the old DCA format.

Parameters:
  • fname (str) – Path to the file where to save the parameters.

  • params (Dict[str, torch.Tensor]) – Parameters of the model.

  • mask (torch.Tensor) – Mask of the coupling matrix that determines which are the non-zero entries.

adabmDCA.plot module

adabmDCA.plot.plot_PCA(fig: figure, data1: ndarray, pc1: int = 0, pc2: int = 1, data2: ndarray | None = None, labels: List[str] | str = 'Data', colors: List[str] | str = 'black', title: str | None = None) figure

Makes the scatter plot of the components (pc1, pc2) of the input data and shows the histograms of the components.

Parameters:
  • fig (plt.figure) – Figure to plot the data.

  • data1 (np.ndarray) – Data to plot.

  • pc1 (int, optional) – First principal direction. Defaults to 0.

  • pc2 (int, optional) – Second principal direction. Defaults to 1.

  • data2 (np.ndarray | None, optional) – Data to be superimposed to data1. Defaults to None.

  • labels (List[str] | str, optional) – Labels to put in the legend. Defaults to “Data”.

  • colors (List[str] | str, optional) – Colors to be used. Defaults to “black”.

  • title (str | None, optional) – Title of the plot. Defaults to None.

Returns:

Updated figure.

Return type:

plt.figure

adabmDCA.plot.plot_autocorrelation(ax: Axes, checkpoints: ndarray, autocorr: ndarray, gen_seqid: ndarray, data_seqid: ndarray) Axes

Plots the time-autocorrelation curve of the sequence identity and the generated and data sequence identities.

Parameters:
  • ax (plt.Axes) – Axes to plot the data.

  • checkpoints (np.ndarray) – Checkpoints of the sampling.

  • autocorr (np.ndarray) – Time-autocorrelation of the sequence identity.

  • gen_seqid (np.ndarray) – Sequence identity of the generated data.

  • data_seqid (np.ndarray) – Sequence identity of the data.

Returns:

Updated axes.

Return type:

plt.Axes

adabmDCA.plot.plot_pearson_sampling(ax: Axes, checkpoints: ndarray, pearsons: ndarray, pearson_training: ndarray | None = None)

adabmDCA.plot.plot_scatter_correlations(ax: Axes, Cij_data: ndarray, Cij_gen: ndarray, Cijk_data: ndarray, Cijk_gen: ndarray, pearson_Cij: float, pearson_Cijk: float) Axes

Plots the scatter plot of the data and generated Cij and Cijk values.

Parameters:
  • ax (plt.Axes) – Axes to plot the data. Must have 2 subplots.

  • Cij_data (np.ndarray) – Data Cij values.

  • Cij_gen (np.ndarray) – Generated Cij values.

  • Cijk_data (np.ndarray) – Data Cijk values.

  • Cijk_gen (np.ndarray) – Generated Cijk values.

  • pearson_Cij (float) – Pearson correlation coefficient of Cij.

  • pearson_Cijk (float) – Pearson correlation coefficient of Cijk.

Returns:

Updated axes.

Return type:

plt.Axes

adabmDCA.sampling module

adabmDCA.sampling.get_sampler(sampling_method: str) Callable

Returns the sampling function corresponding to the chosen method.

Parameters:

sampling_method (str) – String indicating the sampling method. Choose between ‘metropolis’ and ‘gibbs’.

Raises:

KeyError – Unknown sampling method.

Returns:

Sampling function.

Return type:

Callable

adabmDCA.sampling.gibbs_sampling(chains: Tensor, params: Dict[str, Tensor], nsweeps: int, beta: float = 1.0) Tensor

Gibbs sampling.

Parameters:
  • chains (torch.Tensor) – Initial chains.

  • params (Dict[str, torch.Tensor]) – Parameters of the model.

  • nsweeps (int) – Number of sweeps.

  • beta (float, optional) – Inverse temperature. Defaults to 1.0.

Returns:

Updated chains.

Return type:

torch.Tensor
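
For readers unfamiliar with Gibbs sampling on a Potts-like model, the following is a minimal sketch of the idea, not the library's implementation. It assumes integer-encoded chains (the library works with tensors), fields `h[i][a]`, symmetric couplings `J[i][a][j][b]`, and an energy of the form E = -Σ h - Σ J; the function name `gibbs_sweeps` is hypothetical.

```python
import math
import random


def gibbs_sweeps(chains, h, J, nsweeps, beta=1.0, rng=random):
    """Resample every site of every chain from its conditional distribution."""
    L, q = len(h), len(h[0])
    for _ in range(nsweeps):
        for seq in chains:
            for i in range(L):
                # Local log-probabilities of each state at site i,
                # given the current states of all the other sites.
                logits = [
                    beta * (h[i][a] + sum(J[i][a][j][seq[j]] for j in range(L) if j != i))
                    for a in range(q)
                ]
                m = max(logits)  # subtract the max for numerical stability
                probs = [math.exp(x - m) for x in logits]
                seq[i] = rng.choices(range(q), weights=probs)[0]
    return chains


L, q = 5, 2
h = [[0.0, 5.0] for _ in range(L)]  # fields strongly favour state 1
J = [[[[0.0] * q for _ in range(L)] for _ in range(q)] for _ in range(L)]
chains = [[0] * L for _ in range(4)]
chains = gibbs_sweeps(chains, h, J, nsweeps=10, rng=random.Random(0))
```

With these fields, almost all sites should end up in state 1 after a few sweeps.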

adabmDCA.sampling.metropolis(chains: Tensor, params: Dict[str, Tensor], nsweeps: int, beta: float = 1.0) Tensor

Metropolis sampling.

Parameters:
  • chains (torch.Tensor) – One-hot encoded sequences.

  • params (Dict[str, torch.Tensor]) – Parameters of the model.

  • nsweeps (int) – Number of sweeps to be performed.

  • beta (float, optional) – Inverse temperature. Defaults to 1.0.

Returns:

Updated chains.

Return type:

torch.Tensor
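
The Metropolis update can be sketched in the same spirit. This is an illustrative single-site Metropolis scheme under stated assumptions, not the library's code: chains are integer-encoded lists rather than one-hot tensors, couplings `J[i][a][j][b]` are symmetric, the energy is E = -Σ h - Σ J, and one sweep means one proposal per site. The names `delta_energy` and `metropolis_sweeps` are hypothetical.

```python
import math
import random


def delta_energy(seq, i, new, h, J):
    """Energy change when site i switches from seq[i] to `new`."""
    old = seq[i]
    dE = -(h[i][new] - h[i][old])
    for j in range(len(seq)):
        if j != i:
            dE -= J[i][new][j][seq[j]] - J[i][old][j][seq[j]]
    return dE


def metropolis_sweeps(chains, h, J, nsweeps, beta=1.0, rng=random):
    """Propose a uniform random state at each site; accept with min(1, exp(-beta*dE))."""
    L, q = len(h), len(h[0])
    for _ in range(nsweeps):
        for seq in chains:
            for i in range(L):
                new = rng.randrange(q)
                dE = delta_energy(seq, i, new, h, J)
                if dE <= 0 or rng.random() < math.exp(-beta * dE):
                    seq[i] = new
    return chains


L, q = 5, 2
h = [[0.0, 10.0] for _ in range(L)]  # fields strongly favour state 1
J = [[[[0.0] * q for _ in range(L)] for _ in range(q)] for _ in range(L)]
chains = [[0] * L for _ in range(4)]
chains = metropolis_sweeps(chains, h, J, nsweeps=20, rng=random.Random(0))
```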

adabmDCA.statmech module

adabmDCA.statmech.compute_energy(X: Tensor, params: Dict[str, Tensor]) Tensor

Compute the DCA energy of the sequences in X.

Parameters:
  • X (torch.Tensor) – Sequences in one-hot encoding format.

  • params (Dict[str, torch.Tensor]) – Parameters of the model.

Returns:

DCA Energy of the sequences.

Return type:

torch.Tensor
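
As a reference for what a DCA (Potts) energy looks like, here is a small sketch assuming the usual form E(a) = -Σ_i h_i(a_i) - Σ_{i<j} J_ij(a_i, a_j). It uses integer-encoded sequences and nested lists instead of one-hot tensors; the function name `potts_energy` and the layout `J[i][a][j][b]` are assumptions made for illustration.

```python
def potts_energy(seq, h, J):
    """Energy of one integer-encoded sequence under fields h and couplings J."""
    L = len(seq)
    energy = -sum(h[i][seq[i]] for i in range(L))
    for i in range(L):
        for j in range(i + 1, L):  # each pair counted once
            energy -= J[i][seq[i]][j][seq[j]]
    return energy


L, q = 2, 2
h = [[0.0, 1.0], [0.5, 0.0]]
J = [[[[0.0] * q for _ in range(L)] for _ in range(q)] for _ in range(L)]
# Single non-zero coupling between (site 0, state 1) and (site 1, state 0),
# stored symmetrically.
J[0][1][1][0] = J[1][0][0][1] = 2.0

e_low = potts_energy([1, 0], h, J)   # uses both favourable fields and the coupling
e_high = potts_energy([0, 1], h, J)  # uses none of them
print(e_low, e_high)
```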

adabmDCA.statmech.compute_entropy(chains: Tensor, params: Dict[str, Tensor], logZ: float) float

Compute the entropy of the DCA model.

Parameters:
  • chains (torch.Tensor) – Chains that are supposed to be an equilibrium realization of the model.

  • params (Dict[str, torch.Tensor]) – Parameters of the model.

  • logZ (float) – Log-partition function of the model.

Returns:

Entropy of the model.

Return type:

float
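
For a Boltzmann distribution p(a) = exp(-E(a)) / Z, the entropy satisfies S = ⟨E⟩ + log Z, so it can be estimated from the mean energy of equilibrium samples. The sketch below shows that identity only; whether the library computes the entropy exactly this way is an assumption, and the function name `entropy_estimate` is hypothetical.

```python
import math


def entropy_estimate(energies, logZ):
    """Entropy estimate S = <E> + log Z from the energies of equilibrium samples."""
    return sum(energies) / len(energies) + logZ


# Sanity check on a model with no fields and no couplings:
# every state has E = 0 and Z = q**L, so S = L * log(q).
L, q = 4, 3
energies = [0.0] * 10
logZ = L * math.log(q)
S = entropy_estimate(energies, logZ)
```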

adabmDCA.statmech.compute_logZ_exact(all_states: Tensor, params: Dict[str, Tensor]) float

Compute the log-partition function of the model.

Parameters:
  • all_states (torch.Tensor) – All possible states of the system.

  • params (Dict[str, torch.Tensor]) – Parameters of the model.

Returns:

Log-partition function of the model.

Return type:

float
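
Exact computation of the log-partition function requires summing over all q**L states, so it is only feasible for very small systems. The following sketch enumerates the states with `itertools.product` and applies a log-sum-exp. It is not the library's implementation: states are integer tuples rather than one-hot tensors, and the names `energy` and `log_partition_exact` are hypothetical.

```python
import itertools
import math


def energy(seq, h, J):
    """Potts energy E = -sum_i h_i(a_i) - sum_{i<j} J_ij(a_i, a_j)."""
    L = len(seq)
    e = -sum(h[i][seq[i]] for i in range(L))
    for i in range(L):
        for j in range(i + 1, L):
            e -= J[i][seq[i]][j][seq[j]]
    return e


def log_partition_exact(h, J):
    """log Z = log sum_a exp(-E(a)), computed over all q**L states."""
    L, q = len(h), len(h[0])
    neg_energies = [-energy(s, h, J) for s in itertools.product(range(q), repeat=L)]
    m = max(neg_energies)  # log-sum-exp trick for numerical stability
    return m + math.log(sum(math.exp(x - m) for x in neg_energies))


L, q = 3, 2
h = [[0.0] * q for _ in range(L)]
J = [[[[0.0] * q for _ in range(L)] for _ in range(q)] for _ in range(L)]
logZ = log_partition_exact(h, J)  # equals L * log(q) for a model with E = 0
```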

adabmDCA.statmech.compute_log_likelihood(fi: Tensor, fij: Tensor, params: Dict[str, Tensor], logZ: float) float

Compute the log-likelihood of the model.

Parameters:
  • fi (torch.Tensor) – Single-site frequencies of the data.

  • fij (torch.Tensor) – Two-site frequencies of the data.

  • params (Dict[str, torch.Tensor]) – Parameters of the model.

  • logZ (float) – Log-partition function of the model.

Returns:

Log-likelihood of the model.

Return type:

float

adabmDCA.statmech.enumerate_states(L: int, q: int, device: device = device(type='cpu')) Tensor

Enumerate all possible states of a system of L sites and q states.

Parameters:
  • L (int) – Number of sites.

  • q (int) – Number of states.

  • device (torch.device, optional) – Device to store the states. Defaults to “cpu”.

Returns:

All possible states.

Return type:

torch.Tensor

adabmDCA.statmech.iterate_tap(mag: Tensor, params: Dict[str, Tensor], max_iter: int = 500, epsilon: float = 0.0001)

Iterates the TAP equations until convergence.

Parameters:
  • mag (torch.Tensor) – Initial magnetizations.

  • params (Dict[str, torch.Tensor]) – Parameters of the model.

  • max_iter (int, optional) – Maximum number of iterations. Defaults to 500.

  • epsilon (float, optional) – Convergence threshold. Defaults to 1e-4.

Returns:

Fixed point magnetizations of the TAP equations.

Return type:

torch.Tensor

adabmDCA.stats module

adabmDCA.stats.extract_Cij_from_freq(fij: Tensor, pij: Tensor, fi: Tensor, pi: Tensor, mask: Tensor | None = None) Tuple[float, float]

Extracts the lower triangular part of the covariance matrices of the data and chains starting from the frequencies.

Parameters:
  • fij (torch.Tensor) – Two-point frequencies of the data.

  • pij (torch.Tensor) – Two-point frequencies of the chains.

  • fi (torch.Tensor) – Single-point frequencies of the data.

  • pi (torch.Tensor) – Single-point frequencies of the chains.

  • mask (torch.Tensor | None, optional) – Mask for comparing just a subset of the couplings. Defaults to None.

Returns:

Extracted lower-triangular covariance entries of the data and of the chains.

Return type:

Tuple[float, float]

adabmDCA.stats.extract_Cij_from_seqs(data: Tensor, chains: Tensor, weights: Tensor | None = None, pseudo_count: float = 0.0, mask: Tensor | None = None) Tuple[float, float]

Extracts the lower triangular part of the covariance matrices of the data and chains starting from the sequences.

Parameters:
  • data (torch.Tensor) – Data sequences.

  • chains (torch.Tensor) – Chain sequences.

  • weights (torch.Tensor | None, optional) – Weights of the sequences. Defaults to None.

  • pseudo_count (float, optional) – Pseudo count for the single and two points statistics. Acts as a regularization. Defaults to 0.0.

  • mask (torch.Tensor | None, optional) – Mask for comparing just a subset of the couplings. Defaults to None.

Returns:

Extracted lower-triangular covariance entries of the data and of the chains.

Return type:

Tuple[float, float]

adabmDCA.stats.generate_unique_triplets(L: int, ntriplets: int, device: device = device(type='cpu')) Tensor

Generates a set of unique triplets of positions. Used to compute the 3-points statistics.

Parameters:
  • L (int) – Length of the sequences.

  • ntriplets (int) – Number of triplets to be generated.

  • device (torch.device, optional) – Device to perform computations on. Defaults to “cpu”.

Returns:

Tensor of shape (ntriplets, 3) containing the indices of the triplets.

Return type:

torch.Tensor
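
One simple way to obtain unique triplets of positions is rejection sampling into a set, as sketched below. This is an illustrative approach, not necessarily the one used by the library; the function name `unique_triplets` is hypothetical, and storing each triplet in increasing order (i < j < k) is an assumption.

```python
import random


def unique_triplets(L, ntriplets, rng=random):
    """Draw `ntriplets` distinct position triplets (i < j < k) from range(L)."""
    total = L * (L - 1) * (L - 2) // 6  # number of possible triplets
    if ntriplets > total:
        raise ValueError("requested more triplets than exist")
    triplets = set()
    while len(triplets) < ntriplets:
        # Sorting makes (i, j, k) canonical so duplicates are detected by the set.
        triplets.add(tuple(sorted(rng.sample(range(L), 3))))
    return sorted(triplets)


triplets = unique_triplets(L=6, ntriplets=10, rng=random.Random(0))
```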

adabmDCA.stats.get_correlation_two_points(fij: Tensor, pij: Tensor, fi: Tensor, pi: Tensor, mask: Tensor | None = None) Tuple[float, float]

Computes the Pearson coefficient and the slope between the two-point frequencies of data and chains.

Parameters:
  • fij (torch.Tensor) – Two-point frequencies of the data.

  • pij (torch.Tensor) – Two-point frequencies of the chains.

  • fi (torch.Tensor) – Single-point frequencies of the data.

  • pi (torch.Tensor) – Single-point frequencies of the chains.

  • mask (torch.Tensor | None, optional) – Mask to select the couplings to use for the correlation coefficient. Defaults to None.

Returns:

Pearson correlation coefficient of the two-site statistics and slope of the interpolating line.

Return type:

Tuple[float, float]
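
The two returned quantities can be illustrated on plain lists. The sketch below computes the Pearson coefficient and the least-squares slope between two flattened sets of values (for example, data and chain statistics). Which set plays the role of x and which of y in the library is not stated above, so that choice is an assumption; the function name `pearson_and_slope` is hypothetical.

```python
import math


def pearson_and_slope(x, y):
    """Pearson correlation and least-squares slope of y against x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy), sxy / sxx


x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]  # y = 2x + 1, perfectly correlated
pearson, slope = pearson_and_slope(x, y)
```

A Pearson coefficient close to 1 with a slope close to 1 indicates that the generated statistics match the data both in ranking and in scale.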

adabmDCA.stats.get_covariance_matrix(data: Tensor, weights: Tensor | None = None, pseudo_count: float = 0.0) Tensor

Computes the weighted covariance matrix of the input multiple sequence alignment.

Parameters:
  • data (torch.Tensor) – Input MSA in one-hot variables.

  • weights (torch.Tensor | None, optional) – Importance weights of the sequences. Defaults to None.

  • pseudo_count (float, optional) – Pseudo count. Defaults to 0.0.

Returns:

Covariance matrix.

Return type:

torch.Tensor

adabmDCA.stats.get_freq_single_point(data: Tensor, weights: Tensor | None = None, pseudo_count: float = 0.0) Tensor

Computes the single point frequencies of the input MSA.

Parameters:
  • data (torch.Tensor) – One-hot encoded data array.

  • weights (torch.Tensor | None, optional) – Weights of the sequences.

  • pseudo_count (float, optional) – Pseudo count to be added to the frequencies. Defaults to 0.0.

Raises:

ValueError – If the input data is not a 3D tensor.

Returns:

Single point frequencies.

Return type:

torch.Tensor
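
A minimal sketch of weighted single-point frequencies with a pseudo count follows. It assumes the common DCA regularization f_i(a) ← (1 - α) f_i(a) + α / q, which may differ from what the library does, and it works on integer-encoded sequences instead of one-hot tensors. The function name `freq_single_point` is hypothetical.

```python
def freq_single_point(msa, q, weights=None, pseudo_count=0.0):
    """Weighted frequency of each state at each position, shape (L, q)."""
    M, L = len(msa), len(msa[0])
    if weights is None:
        weights = [1.0] * M  # uniform weights
    norm = sum(weights)
    fi = [[0.0] * q for _ in range(L)]
    for seq, w in zip(msa, weights):
        for i, a in enumerate(seq):
            fi[i][a] += w / norm
    # Assumed regularization: mix with the uniform distribution over q states.
    return [[(1.0 - pseudo_count) * f + pseudo_count / q for f in row] for row in fi]


msa = [[0, 1], [0, 0]]  # 2 sequences, length 2, q = 2 states
fi = freq_single_point(msa, q=2)
fi_reg = freq_single_point(msa, q=2, pseudo_count=0.1)
```

With a non-zero pseudo count, no frequency is exactly zero, and each row still sums to 1.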

adabmDCA.stats.get_freq_three_points(data: Tensor, ntriplets: int, weights: Tensor | None = None, device: device = device(type='cpu')) Tensor

Computes the 3-body statistics of the input MSA.

Parameters:
  • data (torch.Tensor) – Input MSA in one-hot encoding.

  • ntriplets (int) – Number of triplets to test.

  • weights (torch.Tensor | None, optional) – Importance weights for the sequences. Defaults to None.

  • device (torch.device, optional) – Device to perform computations on. Defaults to “cpu”.

Returns:

3-points connected correlation for ntriplets randomly extracted triplets.

Return type:

torch.Tensor

adabmDCA.stats.get_freq_two_points(data: Tensor, weights: Tensor | None = None, pseudo_count: float = 0.0) Tensor

Computes the 2-points statistics of the input MSA.

Parameters:
  • data (torch.Tensor) – One-hot encoded data array.

  • weights (torch.Tensor | None, optional) – Array of weights to assign to the sequences. Defaults to None.

  • pseudo_count (float, optional) – Pseudo count for the single and two points statistics. Acts as a regularization. Defaults to 0.0.

Raises:

ValueError – If the input data is not a 3D tensor.

Returns:

Matrix of two-point frequencies of shape (L, q, L, q).

Return type:

torch.Tensor

adabmDCA.utils module

adabmDCA.utils.get_device(device: str, message: bool = True) device

Returns the device where to store the tensors.

Parameters:
  • device (str) – Device to be used.

  • message (bool, optional) – Print the device. Defaults to True.

Returns:

Device.

Return type:

torch.device

adabmDCA.utils.get_dtype(dtype: str) dtype

Returns the data type of the tensors.

Parameters:

dtype (str) – Data type.

Returns:

Data type.

Return type:

torch.dtype

adabmDCA.utils.get_mask_save(L: int, q: int, device: device) Tensor

Returns the mask to save the upper-triangular part of the coupling matrix.

Parameters:
  • L (int) – Length of the MSA.

  • q (int) – Number of values that each residue can assume.

  • device (torch.device) – Device where to store the mask.

Returns:

Mask.

Return type:

torch.Tensor

adabmDCA.utils.init_chains(num_chains: int, L: int, q: int, device: device, dtype: dtype = torch.float32, fi: Tensor | None = None) Tensor

Initialize the chains of the DCA model. If ‘fi’ is provided, the chains are sampled from the profile model, otherwise they are sampled uniformly at random.

Parameters:
  • num_chains (int) – Number of parallel chains.

  • L (int) – Length of the MSA.

  • q (int) – Number of values that each residue can assume.

  • device (torch.device) – Device where to store the chains.

  • dtype (torch.dtype, optional) – Data type of the chains. Defaults to torch.float32.

  • fi (torch.Tensor | None, optional) – Single-point frequencies. Defaults to None.

Returns:

Initialized parallel chains in one-hot encoding format.

Return type:

torch.Tensor
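
The two initialization modes described above (profile model versus uniform) can be sketched as follows. This is an illustration with integer-encoded lists, whereas the library returns one-hot tensors; the function name `init_chains_sketch` is hypothetical.

```python
import random


def init_chains_sketch(num_chains, L, q, fi=None, rng=random):
    """Sample chains site by site: from the profile fi if given, else uniformly."""
    chains = []
    for _ in range(num_chains):
        if fi is None:
            seq = [rng.randrange(q) for _ in range(L)]
        else:
            # Profile model: each site is drawn independently from fi[i].
            seq = [rng.choices(range(q), weights=fi[i])[0] for i in range(L)]
        chains.append(seq)
    return chains


rng = random.Random(0)
uniform_chains = init_chains_sketch(3, L=4, q=5, rng=rng)

# A fully peaked profile makes the sampling deterministic.
fi = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
profile_chains = init_chains_sketch(3, L=2, q=3, fi=fi, rng=rng)
```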

adabmDCA.utils.init_parameters(fi: Tensor) Dict[str, Tensor]

Initialize the parameters of the DCA model.

Parameters:

fi (torch.Tensor) – Single-point frequencies of the data.

Returns:

Parameters of the model.

Return type:

Dict[str, torch.Tensor]

adabmDCA.utils.resample_sequences(data: Tensor, weights: Tensor, nextract: int) Tensor

Extracts nextract sequences from data with replacement according to the weights.

Parameters:
  • data (torch.Tensor) – Data array.

  • weights (torch.Tensor) – Weights of the sequences.

  • nextract (int) – Number of sequences to be extracted.

Returns:

Extracted sequences.

Return type:

torch.Tensor
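
Weighted resampling with replacement can be illustrated with the standard library alone. The sketch below is not the library's tensor-based code; the function name `resample` is hypothetical.

```python
import random


def resample(data, weights, nextract, rng=random):
    """Draw `nextract` sequences with replacement, proportionally to `weights`."""
    return rng.choices(data, weights=weights, k=nextract)


data = [[0, 1, 2], [2, 2, 2], [1, 0, 1]]
# All the weight on the last sequence makes the outcome deterministic.
extracted = resample(data, weights=[0.0, 0.0, 1.0], nextract=4, rng=random.Random(0))
```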

adabmDCA.utils.set_zerosum_gauge(params: Dict[str, Tensor]) Dict[str, Tensor]

Sets the zero-sum gauge on the coupling matrix.

Parameters:

params (Dict[str, torch.Tensor]) – Parameters of the model.

Returns:

Parameters with fixed gauge.

Return type:

Dict[str, torch.Tensor]
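
In the zero-sum gauge, each q × q coupling block sums to zero along both of its state indices. A minimal sketch on a single block, assuming the standard transformation J(a, b) ← J(a, b) - mean_a' J(a', b) - mean_b' J(a, b') + mean_a'b' J(a', b'), is shown below. The function name `zerosum_block` is hypothetical, and the sketch does not touch the fields.

```python
def zerosum_block(block):
    """Apply the zero-sum gauge to one q x q coupling block."""
    q = len(block)
    row_mean = [sum(row) / q for row in block]
    col_mean = [sum(block[a][b] for a in range(q)) / q for b in range(q)]
    total_mean = sum(row_mean) / q
    return [
        [block[a][b] - row_mean[a] - col_mean[b] + total_mean for b in range(q)]
        for a in range(q)
    ]


block = [[1.0, 2.0], [3.0, 5.0]]
gauged = zerosum_block(block)
print(gauged)
```

After the transformation every row and every column of the block sums to zero.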
