
module dataset


class DatasetDCA

Dataset class for handling multiple sequence alignment (MSA) data.

method __init__

__init__(
    path_data: str | Path,
    path_weights: str | Path | None = None,
    alphabet: str = 'protein',
    clustering_th: float = 0.8,
    no_reweighting: bool = False,
    device: torch.device = torch.device('cpu'),
    dtype: torch.dtype = torch.float32,
    message: bool = True
)

Initialize the dataset.

Args:

  • path_data (str | Path): Path to the multiple sequence alignment (MSA) in FASTA format.
  • path_weights (str | Path | None, optional): Path to the file containing the importance weights of the sequences. If None, the weights are computed automatically.
  • alphabet (str, optional): Selects the encoding of the sequences. Predefined choices are "protein", "rna", and "dna". Defaults to "protein".
  • clustering_th (float, optional): Sequence identity threshold for clustering. Defaults to 0.8.
  • no_reweighting (bool, optional): If True, the weights are not computed. Defaults to False.
  • device (torch.device, optional): Device to be used. Defaults to torch.device("cpu").
  • dtype (torch.dtype, optional): Data type of the dataset. Defaults to torch.float32.
  • message (bool, optional): Print the import message. Defaults to True.
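To illustrate what path_data and alphabet refer to, here is a minimal, library-independent sketch of reading FASTA text and encoding each sequence as integer indices. The helper names parse_fasta and encode are hypothetical, and the token ordering in PROTEIN_TOKENS is an assumption made for illustration; the alphabets actually used by DatasetDCA may differ.

```python
# Assumed token ordering (gap first, then the 20 amino acids).
# This is an illustrative choice, not necessarily the library's alphabet.
PROTEIN_TOKENS = "-ACDEFGHIKLMNPQRSTVWY"


def parse_fasta(text):
    """Return (headers, sequences) parsed from FASTA-formatted text."""
    headers, sequences, current = [], [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: flush the previous one, if any.
            if headers:
                sequences.append("".join(current))
            headers.append(line[1:])
            current = []
        else:
            # Sequences may span several lines; collect the pieces.
            current.append(line)
    if headers:
        sequences.append("".join(current))
    return headers, sequences


def encode(sequences, tokens=PROTEIN_TOKENS):
    """Map every symbol to its position in the token string."""
    index = {tok: i for i, tok in enumerate(tokens)}
    return [[index[c] for c in seq] for seq in sequences]


example = """>seq1
AC-D
>seq2
ACWD
"""

headers, seqs = parse_fasta(example)
encoded = encode(seqs)

num_residues = len(encoded[0])    # L: number of columns of the alignment
num_states = len(PROTEIN_TOKENS)  # q: number of symbols in the alphabet
```

In this toy alignment L is 4 and q is 21, which are the quantities that get_num_residues and get_num_states are documented to return for a real dataset.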

method get_effective_size

get_effective_size() → int

Returns the effective size (Meff) of the dataset.

Returns:

  • int: Effective size of the dataset.
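As a rough illustration of how an effective size can arise from clustering_th, the sketch below uses the reweighting scheme common in the DCA literature: each sequence gets weight 1 / (number of sequences whose identity with it is at least the threshold), and Meff is the sum of the weights. This is an assumption about the scheme, not a statement of what DatasetDCA does internally; the function names and the choice of ">=" at the boundary are hypothetical. The sketch yields a float, whereas the method above is documented to return an int.

```python
def sequence_identity(a, b):
    """Fraction of aligned positions at which two sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def compute_weights(seqs, clustering_th=0.8):
    """Weight each sequence by the inverse size of its neighbourhood.

    A sequence's neighbourhood contains every sequence (itself included)
    whose identity with it is at least clustering_th.
    """
    weights = []
    for s in seqs:
        n_similar = sum(
            1 for t in seqs if sequence_identity(s, t) >= clustering_th
        )
        weights.append(1.0 / n_similar)
    return weights


# Toy alignment: three near-identical sequences and one outlier.
msa = ["ACDEFGHIKL", "ACDEFGHIKL", "ACDEFGHIKV", "WWWWWYYYYY"]

weights = compute_weights(msa, clustering_th=0.8)
meff = sum(weights)  # effective number of sequences under this scheme
```

Here the three similar sequences share one unit of weight between them and the outlier keeps a full unit, so four sequences count as roughly two effective ones. The pairwise loop is quadratic in the number of sequences and is meant only for illustration.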

method get_num_residues

get_num_residues() → int

Returns the number of residues (L) in the multiple sequence alignment.

Returns:

  • int: Number of residues (L), i.e. the number of columns of the MSA.

method get_num_states

get_num_states() → int

Returns the number of states (q) in the alphabet.

Returns:

  • int: Number of states.

method shuffle

shuffle() → None

Shuffles the dataset.
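When a dataset stores per-sequence weights and names alongside the sequences, a shuffle has to apply the same permutation to all of them so that each sequence keeps its own weight. The sketch below shows that idea in plain Python; shuffle_together is a hypothetical helper, and whether DatasetDCA.shuffle permutes its internal fields this way is an assumption.

```python
import random


def shuffle_together(sequences, weights, names, seed=None):
    """Return copies of the three lists reordered by one shared permutation."""
    rng = random.Random(seed)
    perm = list(range(len(sequences)))
    rng.shuffle(perm)  # in-place shuffle of the index list
    return (
        [sequences[i] for i in perm],
        [weights[i] for i in perm],
        [names[i] for i in perm],
    )


sequences = ["ACDE", "ACDF", "WYWY"]
weights = [0.5, 0.5, 1.0]
names = ["seq1", "seq2", "seq3"]

shuf_seqs, shuf_weights, shuf_names = shuffle_together(
    sequences, weights, names, seed=0
)
```

Shuffling an index list rather than each list separately is what keeps the (sequence, weight, name) triples intact.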


This file was automatically generated via lazydocs.