module adabmDCA.dataset
class DatasetDCA
Dataset class for handling multiple sequence alignment (MSA) data.
method __init__
__init__(
path_data: str,
path_weights: Optional[str] = None,
alphabet: str = 'protein',
clustering_th: float = 0.8,
no_reweighting: bool = False,
remove_duplicates: bool = False,
filter_sequences: bool = False,
message: bool = True,
device: torch.device = torch.device('cpu'),
dtype: torch.dtype = torch.float32
)
Initialize the dataset.
Args:
path_data (str): Path to multi sequence alignment in fasta format.
path_weights (Optional[str], optional): Path to the file containing the importance weights of the sequences. If None, the weights are computed automatically.
alphabet (str, optional): Selects the type of encoding of the sequences. Default choices are ("protein", "rna", "dna"). Defaults to "protein".
clustering_th (float, optional): Sequence identity threshold for clustering. Defaults to 0.8.
no_reweighting (bool, optional): If True, the weights are not computed. Defaults to False.
remove_duplicates (bool, optional): If True, removes duplicate sequences from the dataset. Defaults to False.
filter_sequences (bool, optional): If True, removes sequences containing tokens not in the alphabet. Defaults to False.
message (bool, optional): Print the import message. Defaults to True.
device (torch.device, optional): Device to be used. Defaults to "cpu".
dtype (torch.dtype, optional): Data type of the dataset. Defaults to torch.float32.
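To make the effect of the alphabet, remove_duplicates and filter_sequences arguments concrete, here is a minimal pure-Python sketch of the kind of preprocessing they describe. It is not the library's implementation: the function name preprocess, the token strings in ALPHABETS, and the choice to map unknown tokens to the gap symbol are all assumptions made for illustration.

```python
# Token strings are an assumption for this sketch; check the library for the real ones.
ALPHABETS = {
    "protein": "-ACDEFGHIKLMNPQRSTVWY",
    "rna": "-ACGU",
    "dna": "-ACGT",
}


def preprocess(sequences, alphabet="protein", remove_duplicates=False, filter_sequences=False):
    """Return the sequences encoded as lists of integer token indices."""
    tokens = ALPHABETS[alphabet]
    index = {tok: i for i, tok in enumerate(tokens)}

    if filter_sequences:
        # Drop every sequence that contains a token outside the alphabet.
        sequences = [s for s in sequences if all(c in index for c in s)]
    if remove_duplicates:
        # dict.fromkeys keeps the first occurrence and preserves order.
        sequences = list(dict.fromkeys(sequences))

    # In this sketch, unknown tokens (only possible when filtering is off)
    # are mapped to the gap symbol at index 0.
    return [[index.get(c, 0) for c in s] for s in sequences]


msa = ["AC-D", "AC-D", "AXCD", "WY-A"]
encoded = preprocess(msa, remove_duplicates=True, filter_sequences=True)
```

In this toy input, "AXCD" is filtered out because "X" is not in the assumed protein alphabet, and the repeated "AC-D" is kept only once.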
method get_effective_size
get_effective_size() → int
Returns the effective size (Meff) of the dataset.
Returns:
int: Effective size of the dataset.
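The effective size is tied to the sequence weights and to clustering_th. The sketch below shows the reweighting scheme commonly used in DCA, where each sequence is weighted by the inverse of the number of sequences within the identity threshold, and Meff is the sum of the weights. Whether adabmDCA uses exactly this convention (for example ">=" versus ">", or how the result is converted to the documented int) is an assumption; the function names are hypothetical.

```python
def sequence_identity(a, b):
    """Fraction of aligned positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def compute_weights(sequences, clustering_th=0.8):
    """Weight each sequence by 1 / (number of sequences in its neighbourhood)."""
    weights = []
    for a in sequences:
        # The count includes the sequence itself, so it is always at least 1.
        n_similar = sum(sequence_identity(a, b) >= clustering_th for b in sequences)
        weights.append(1.0 / n_similar)
    return weights


msa = ["ACDEF", "ACDEF", "ACDEY", "WWWWW"]
weights = compute_weights(msa, clustering_th=0.8)
m_eff = sum(weights)  # effective number of sequences in this sketch
```

Here the first three sequences are mutual neighbours and together contribute about as much as one independent sequence, while the unrelated fourth sequence keeps a weight of 1.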
method get_num_residues
get_num_residues() → int
Returns the number of residues (L) in the multiple sequence alignment (MSA).
Returns:
int: Length of the MSA.
method get_num_states
get_num_states() → int
Returns the number of states (q) in the alphabet.
Returns:
int: Number of states.
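The three getters above are often read together when logging or sizing a model. The following sketch collects them into a dictionary. It relies only on the method names documented here; summarize is a hypothetical helper, and FakeDataset is a stand-in with made-up values so the example runs without a FASTA file.

```python
def summarize(dataset):
    """Collect Meff, L and q from any object exposing the three getters."""
    return {
        "Meff": dataset.get_effective_size(),
        "L": dataset.get_num_residues(),
        "q": dataset.get_num_states(),
    }


class FakeDataset:
    """Hypothetical stand-in with made-up values, used only to run the sketch."""

    def get_effective_size(self):
        return 1200

    def get_num_residues(self):
        return 96

    def get_num_states(self):
        return 21


summary = summarize(FakeDataset())
```

With the real class, the same helper would presumably accept a DatasetDCA instance constructed from path_data.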
method shuffle
shuffle() → None
Shuffles the dataset.
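When sequences carry importance weights, a shuffle has to apply the same permutation to both so that each sequence keeps its weight. The sketch below illustrates that idea in pure Python. It is an assumption about what shuffling involves, not the library's code: shuffle_together is a hypothetical name, and unlike the documented method (which returns None) it returns new lists.

```python
import random


def shuffle_together(sequences, weights, seed=None):
    """Apply one random permutation to both lists, keeping pairs aligned."""
    order = list(range(len(sequences)))
    random.Random(seed).shuffle(order)
    return [sequences[i] for i in order], [weights[i] for i in order]


seqs = ["AAAA", "CCCC", "DDDD"]
w = [0.5, 0.5, 1.0]
shuffled_seqs, shuffled_w = shuffle_together(seqs, w, seed=0)
```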