module adabmDCA.dataset


class DatasetDCA

Dataset class for handling multiple sequence alignment (MSA) data.

Args:

  • path_data (str): Path to the multiple sequence alignment in FASTA format.
  • path_weights (Optional[str], optional): Path to the file containing the importance weights of the sequences. If None, the weights are computed automatically.
  • alphabet (str, optional): Selects the type of encoding of the sequences. Default choices are ("protein", "rna", "dna"). Defaults to "protein".
  • clustering_th (float, optional): Sequence identity threshold for clustering. Defaults to 0.8.
  • no_reweighting (bool, optional): If True, the weights are not computed. Defaults to False.
  • remove_duplicates (bool, optional): If True, removes duplicate sequences from the dataset. Defaults to False.
  • filter_sequences (bool, optional): If True, removes sequences containing tokens not in the alphabet. Defaults to False.
  • message (bool, optional): Print the import message. Defaults to True.
  • device (torch.device, optional): Device to be used. Defaults to "cpu".
  • dtype (torch.dtype, optional): Data type of the dataset. Defaults to torch.float32.
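
The effect of the filter_sequences and remove_duplicates options can be illustrated with a small standalone sketch. This is not the library's implementation. The helper name filter_and_deduplicate and the token string PROTEIN_TOKENS are hypothetical, since this reference does not list the tokens behind the "protein" alphabet.

```python
# Hypothetical protein alphabet (gap + 20 amino acids); the actual token set
# used by alphabet="protein" is not stated in this reference.
PROTEIN_TOKENS = "-ACDEFGHIKLMNPQRSTVWY"


def filter_and_deduplicate(sequences, tokens=PROTEIN_TOKENS,
                           filter_sequences=True, remove_duplicates=True):
    """Return the sequences that survive the two optional cleaning steps."""
    allowed = set(tokens)
    seen = set()
    kept = []
    for seq in sequences:
        # Drop sequences containing any token outside the alphabet.
        if filter_sequences and not set(seq) <= allowed:
            continue
        # Keep only the first occurrence of each distinct sequence.
        if remove_duplicates and seq in seen:
            continue
        seen.add(seq)
        kept.append(seq)
    return kept


msa = ["AC-D", "AC-D", "AXCD", "ACDE"]
cleaned = filter_and_deduplicate(msa)
print(cleaned)  # ['AC-D', 'ACDE']
```

Here the second "AC-D" is dropped as a duplicate and "AXCD" is dropped because "X" is not in the assumed alphabet.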

method __init__

__init__(
    path_data: str,
    path_weights: Optional[str] = None,
    alphabet: str = 'protein',
    clustering_th: float = 0.8,
    no_reweighting: bool = False,
    remove_duplicates: bool = False,
    filter_sequences: bool = False,
    message: bool = True,
    device: torch.device = torch.device('cpu'),
    dtype: torch.dtype = torch.float32
)

method get_effective_size

get_effective_size() → int

Returns the effective size (Meff) of the dataset.

Returns:

  • int: Effective size of the dataset.
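
As a rough illustration of how an effective size relates to sequence weights, the sketch below uses a common DCA reweighting convention: each sequence is weighted by the inverse of the number of sequences whose identity to it is at least clustering_th, and Meff is the sum of the weights. Whether adabmDCA uses exactly this convention is an assumption, and sequence_weights is a hypothetical helper operating on an integer-encoded NumPy array.

```python
import numpy as np


def sequence_weights(msa, clustering_th=0.8):
    """Weight each sequence by 1 / (number of sequences with identity >= threshold)."""
    # Pairwise fraction of identical positions, shape (M, M).
    identity = (msa[:, None, :] == msa[None, :, :]).mean(axis=2)
    # Each sequence is identical to itself, so every count is at least 1.
    cluster_sizes = (identity >= clustering_th).sum(axis=1)
    return 1.0 / cluster_sizes


# Toy integer-encoded alignment: M = 4 sequences, L = 5 positions.
msa = np.array([
    [0, 1, 2, 3, 4],
    [0, 1, 2, 3, 4],  # duplicate of the first row
    [0, 1, 2, 0, 0],  # 60% identical to the first two rows
    [4, 3, 2, 1, 0],  # distant sequence
])
weights = sequence_weights(msa, clustering_th=0.8)
m_eff = weights.sum()
print(weights, m_eff)
```

The two identical rows share one unit of weight (0.5 each), while the other two rows count fully, so this toy alignment has an effective size of 3 under the assumed convention.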

method get_frequencies

get_frequencies(
    pseudocount: float = 0.0,
    batch_size: int = 10000
) → Tuple[Tensor, Tensor]

Computes the single-site and two-site frequencies of the dataset. With many sequences, computing the frequencies directly from the full one-hot encoding can be memory-intensive, so the frequencies are accumulated over batches of batch_size sequences.

Args:

  • pseudocount (float, optional): Pseudocount to be added to the frequencies. Defaults to 0.0.
  • batch_size (int, optional): Batch size to use when computing the frequencies. Defaults to 10000.

Returns:

  • Tuple[torch.Tensor, torch.Tensor]: Single-site frequencies fi of shape (L, q) and two-site frequencies fij of shape (L, q, L, q).
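
The batching idea can be sketched in NumPy as follows. This is a minimal illustration, not the library's code: batched_frequencies is a hypothetical function, the sequence weights are passed in explicitly, and the pseudocount convention (mixing with the uniform distribution) is an assumption.

```python
import numpy as np


def batched_frequencies(msa, weights, q, pseudocount=0.0, batch_size=10000):
    """Weighted single-site (L, q) and two-site (L, q, L, q) frequencies."""
    M, L = msa.shape
    norm_weights = weights / weights.sum()
    fi = np.zeros((L, q))
    fij = np.zeros((L, q, L, q))
    eye = np.eye(q)
    for start in range(0, M, batch_size):
        # One-hot encode only the current batch: shape (B, L, q).
        x = eye[msa[start:start + batch_size]]
        w = norm_weights[start:start + batch_size]
        fi += np.einsum("m,mia->ia", w, x)
        fij += np.einsum("m,mia,mjb->iajb", w, x, x)
    # Assumed pseudocount convention: mix with the uniform distribution.
    fi = (1.0 - pseudocount) * fi + pseudocount / q
    fij = (1.0 - pseudocount) * fij + pseudocount / q**2
    # Keep the i == j blocks consistent with fi (diagonal in the states).
    idx = np.arange(L)
    fij[idx, :, idx, :] = fi[:, :, None] * eye[None, :, :]
    return fi, fij


rng = np.random.default_rng(0)
msa = rng.integers(0, 3, size=(50, 4))  # M = 50, L = 4, q = 3
weights = np.ones(50)
fi, fij = batched_frequencies(msa, weights, q=3, batch_size=16)
fi_full, fij_full = batched_frequencies(msa, weights, q=3, batch_size=50)
print(fi.shape, fij.shape)  # (4, 3) (4, 3, 4, 3)
```

Only one batch of one-hot vectors is held in memory at a time, and the result does not depend on batch_size up to floating-point rounding.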

method get_num_residues

get_num_residues() → int

Returns the number of residues (L) in the multiple sequence alignment.

Returns:

  • int: Length of the MSA.

method get_num_states

get_num_states() → int

Returns the number of states (q) in the alphabet.

Returns:

  • int: Number of states.

method shuffle

shuffle() → None

Shuffles the dataset.


method to_one_hot

to_one_hot() → Tensor

Converts the dataset to one-hot encoding.

Returns:

  • torch.Tensor: One-hot encoded dataset of shape (M, L, q), where M is the number of sequences.
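
For readers unfamiliar with the (M, L, q) layout, the following NumPy sketch shows the shape convention on a toy integer-encoded alignment. It is only an illustration: the one_hot helper is hypothetical, and the class itself returns a torch.Tensor rather than a NumPy array.

```python
import numpy as np


def one_hot(msa, q):
    """Map an integer-encoded (M, L) alignment to a one-hot (M, L, q) array."""
    # Row k of the identity matrix is the one-hot vector for state k.
    return np.eye(q, dtype=np.float32)[msa]


msa = np.array([[0, 2, 1],
                [1, 1, 0]])  # M = 2 sequences, L = 3 positions
x = one_hot(msa, q=3)
print(x.shape)  # (2, 3, 3)
```

Each position carries exactly one nonzero entry, so taking the argmax over the last axis recovers the original integer encoding.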

This file was automatically generated via lazydocs.