module adabmDCA.fasta

Global Variables

  • TOKENS_PROTEIN
  • TOKENS_RNA
  • TOKENS_DNA

function get_tokens

get_tokens(alphabet: str) → str

Converts a known alphabet into the corresponding tokens, otherwise returns the custom alphabet.

Args:

  • alphabet (str): Alphabet to be used for the encoding. It can be either "protein", "rna", "dna" or a custom string of tokens.

Returns:

  • str: Tokens of the alphabet.
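The lookup described above can be sketched as follows. This is an illustration only, not the library's code: `get_tokens_sketch` is a hypothetical name, and the three token strings are assumptions (a gap symbol followed by the residue letters); the actual values are stored in TOKENS_PROTEIN, TOKENS_RNA and TOKENS_DNA.

```python
# Assumed token strings for illustration; the real values are the
# module-level constants TOKENS_PROTEIN, TOKENS_RNA and TOKENS_DNA.
ASSUMED_TOKENS = {
    "protein": "-ACDEFGHIKLMNPQRSTVWY",
    "rna": "-ACGU",
    "dna": "-ACGT",
}


def get_tokens_sketch(alphabet: str) -> str:
    """Return the token string of a known alphabet name, else the input itself."""
    # Any string that is not a known name is treated as a custom alphabet.
    return ASSUMED_TOKENS.get(alphabet, alphabet)


print(get_tokens_sketch("rna"))   # known name: looked up
print(get_tokens_sketch("-XYZ"))  # custom alphabet: returned unchanged
```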

function encode_sequence

encode_sequence(sequence: Union[str, Iterable[str]], tokens: str) → ndarray

Encodes a sequence or a list of sequences into a numeric format.

Args:

  • sequence (Union[str, Iterable[str]]): Input sequence or iterable of sequences of size (batch_size,).
  • tokens (str): Alphabet to be used for the encoding.

Returns:

  • np.ndarray: Array of shape (L,) or (batch_size, L) with the encoded sequence or sequences.
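A minimal sketch of what such an encoding could look like, assuming each character is mapped to its position in `tokens`. The name `encode_sketch` is hypothetical, and the sketch returns nested Python lists rather than the np.ndarray the library returns. How the library treats characters outside the alphabet is not stated here; this sketch simply raises a KeyError.

```python
from typing import Iterable, List, Union


def encode_sketch(
    sequence: Union[str, Iterable[str]], tokens: str
) -> Union[List[int], List[List[int]]]:
    """Map every character to its index in `tokens` (assumed encoding)."""
    index = {letter: i for i, letter in enumerate(tokens)}
    if isinstance(sequence, str):
        # Single sequence: result has shape (L,)
        return [index[char] for char in sequence]
    # Iterable of sequences: result has shape (batch_size, L)
    return [[index[char] for char in seq] for seq in sequence]


print(encode_sketch("A-C", "-ACGU"))
print(encode_sketch(["AC", "GU"], "-ACGU"))
```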

function decode_sequence

decode_sequence(
    sequence: Union[ndarray, Tensor, list],
    tokens: str
) → Union[str, ndarray]

Takes a numeric sequence or a list of sequences as input and returns the corresponding string representation.

Args:

  • sequence (Union[np.ndarray, torch.Tensor, list]): Input sequences. Can be of shape:
      • (L,): single sequence in encoded format
      • (batch_size, L): multiple sequences in encoded format
      • (batch_size, L, q): multiple one-hot encoded sequences
  • tokens (str): Alphabet to be used for the encoding.

Returns:

  • Union[str, np.ndarray]: string or array of strings with the decoded input.
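The three accepted input shapes can be illustrated with a small pure-Python sketch. It assumes that index `i` decodes to `tokens[i]` and that one-hot input is reduced with an argmax over the last axis; `decode_sketch` is a hypothetical name, and it works on non-empty nested lists instead of arrays or tensors.

```python
def decode_sketch(sequence, tokens):
    """Decode nested lists of indices (or one-hot rows) into strings."""

    def depth(obj):
        # Nesting depth: 1 for (L,), 2 for (batch_size, L), 3 for one-hot.
        level = 0
        while isinstance(obj, (list, tuple)):
            obj = obj[0]
            level += 1
        return level

    def argmax(row):
        return max(range(len(row)), key=row.__getitem__)

    level = depth(sequence)
    if level == 3:
        # (batch_size, L, q) one-hot input: reduce to (batch_size, L) indices.
        sequence = [[argmax(site) for site in seq] for seq in sequence]
        level = 2
    if level == 2:
        return ["".join(tokens[i] for i in seq) for seq in sequence]
    return "".join(tokens[i] for i in sequence)


print(decode_sketch([1, 0, 2], "-ACGU"))
print(decode_sketch([[1, 2], [3, 4]], "-ACGU"))
print(decode_sketch([[[0, 1, 0], [1, 0, 0]]], "-AC"))
```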

function import_from_fasta

import_from_fasta(
    fasta_name: str,
    tokens: Optional[str] = None,
    filter_sequences: bool = False,
    remove_duplicates: bool = False,
    return_mask: bool = False
)

Import sequences from a fasta or compressed fasta (.fas.gz) file. The following operations are performed:

  • If 'tokens' is provided, encodes the sequences in numeric format.
  • If 'filter_sequences' is True, removes the sequences whose tokens are not present in the alphabet.
  • If 'remove_duplicates' is True, removes the duplicated sequences.
  • If 'return_mask' is True, returns also the mask selecting the retained sequences from the original ones.

Args:

  • fasta_name (str | Path): Path to the fasta or compressed fasta (.fas.gz) file.
  • tokens (str | None, optional): Alphabet to be used for the encoding. If provided, encodes the sequences in numeric format.
  • filter_sequences (bool, optional): If True, removes the sequences whose tokens are not present in the alphabet. Defaults to False.
  • remove_duplicates (bool, optional): If True, removes the duplicated sequences. Defaults to False.
  • return_mask (bool, optional): If True, returns also the mask selecting the retained sequences from the original ones. Defaults to False.

Raises:

  • RuntimeError: The file is not in fasta format.

Returns:

  • Union[Tuple[np.ndarray, np.ndarray], Tuple[np.ndarray, np.ndarray, np.ndarray]]:
      • If 'return_mask' is False: tuple of (headers, sequences).
      • If 'return_mask' is True: tuple of (headers, sequences, mask).
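The filtering, de-duplication and mask logic can be sketched in plain Python. This is an assumption-laden illustration, not the library's implementation: `read_fasta_sketch` is a hypothetical name, the sequences are kept as strings (no numeric encoding), the mask is assumed to be a boolean list over the original sequences, and the first occurrence of a duplicated sequence is assumed to be the one that is kept.

```python
import gzip
import os
import tempfile


def read_fasta_sketch(path, tokens=None, filter_sequences=False, remove_duplicates=False):
    """Parse a (possibly gzipped) fasta file and return (headers, sequences, mask)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    headers, sequences = [], []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                headers.append(line[1:])
                sequences.append("")
            elif not headers:
                # Sequence data before any header line.
                raise RuntimeError(f"{path} is not in fasta format.")
            else:
                sequences[-1] += line  # sequences may span several lines

    mask = [True] * len(sequences)
    if filter_sequences and tokens is not None:
        allowed = set(tokens)
        mask = [set(seq) <= allowed for seq in sequences]
    if remove_duplicates:
        seen = set()
        for i, seq in enumerate(sequences):
            if mask[i] and seq in seen:
                mask[i] = False
            seen.add(seq)

    kept_headers = [h for h, keep in zip(headers, mask) if keep]
    kept_sequences = [s for s, keep in zip(sequences, mask) if keep]
    return kept_headers, kept_sequences, mask


# Demo on a temporary file: s2 contains 'X', s3 duplicates s1.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.fasta")
    with open(demo_path, "w") as handle:
        handle.write(">s1\nAC\nGU\n>s2\nAXGU\n>s3\nACGU\n")
    demo = read_fasta_sketch(
        demo_path, tokens="-ACGU", filter_sequences=True, remove_duplicates=True
    )

print(demo)
```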


function write_fasta

write_fasta(
    fname: str,
    headers: Union[Iterable[str], ndarray, Tensor],
    sequences: Union[Iterable[str], ndarray, Tensor],
    remove_gaps: bool = False,
    tokens: str = 'protein'
) → None

Generate a fasta file with the input sequences.

Args:

  • fname (str): Name of the output fasta file.
  • headers (Union[Iterable[str], np.ndarray, torch.Tensor]): Iterable with sequences' headers.
  • sequences (Union[Iterable[str], np.ndarray, torch.Tensor]): Iterable with sequences in string, categorical or one-hot encoded format.
  • remove_gaps (bool, optional): If True, removes the gaps from the sequences before writing them. Defaults to False.
  • tokens (str): Alphabet to be used for the encoding. Defaults to 'protein'.
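As a rough illustration of the output format, the sketch below writes one header line and one sequence line per record. It assumes string sequences only (encoded or one-hot input would first have to be decoded with the chosen tokens) and assumes '-' is the gap symbol; `write_fasta_sketch` and its `gap` parameter are hypothetical names.

```python
import os
import tempfile


def write_fasta_sketch(fname, headers, sequences, remove_gaps=False, gap="-"):
    """Write string sequences to `fname` in fasta format."""
    with open(fname, "w") as handle:
        for header, seq in zip(headers, sequences):
            if remove_gaps:
                # Drop every gap symbol, which also drops the alignment.
                seq = seq.replace(gap, "")
            handle.write(f">{header}\n{seq}\n")


# Demo: write two aligned sequences without their gaps and read them back.
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "out.fasta")
    write_fasta_sketch(out_path, ["s1", "s2"], ["A-CG", "--GU"], remove_gaps=True)
    with open(out_path) as handle:
        written = handle.read()

print(written)
```

Note that with gaps removed the sequences generally no longer share a common length, so the file is no longer an alignment.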

function compute_weights

compute_weights(
    data: Union[ndarray, Tensor],
    th: float = 0.8,
    device: device = device(type='cpu'),
    dtype: dtype = torch.float32
) → Tensor

Computes the weight to be assigned to each sequence 's' in 'data' as 1 / n_clust, where 'n_clust' is the number of sequences that have a sequence identity with 's' >= th.

Args:

  • data (Union[np.ndarray, torch.Tensor]): Input dataset. Must be either a (batch_size, L) or a (batch_size, L, q) (one-hot encoded) array.
  • th (float, optional): Sequence identity threshold for the clustering. Defaults to 0.8.
  • device (torch.device, optional): Device. Defaults to "cpu".
  • dtype (torch.dtype, optional): Data type. Defaults to torch.float32.

Returns:

  • torch.Tensor: Array with the weights of the sequences.
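The weighting rule above can be made concrete with a short pure-Python sketch. It assumes that sequence identity is the fraction of positions at which two equal-length encoded sequences agree, and that each sequence counts itself in 'n_clust' (so the weight is never larger than 1). `compute_weights_sketch` is a hypothetical name; it uses nested lists and a quadratic loop rather than tensors.

```python
def compute_weights_sketch(data, th=0.8):
    """Return 1 / n_clust for every sequence in a (batch_size, L) nested list."""
    length = len(data[0])
    weights = []
    for s in data:
        # Count the sequences (including s itself) whose identity with s is >= th.
        n_clust = sum(
            sum(a == b for a, b in zip(s, t)) / length >= th
            for t in data
        )
        weights.append(1.0 / n_clust)
    return weights


data = [
    [1, 2, 3, 4, 5],
    [1, 2, 3, 4, 0],  # identity 4/5 with the first sequence
    [0, 0, 0, 0, 0],  # far from both
]
print(compute_weights_sketch(data, th=0.8))
```

In this toy dataset the first two sequences fall in each other's neighbourhood and share their weight, while the third keeps a weight of 1.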

function validate_alphabet

validate_alphabet(sequences: ndarray, tokens: str)

Check if the chosen alphabet is compatible with the input sequences.

Args:

  • sequences (np.ndarray): Input sequences.
  • tokens (str): Alphabet to be used for the encoding.

Raises:

  • ValueError: The chosen alphabet is incompatible with the multiple sequence alignment (MSA).
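A compatibility check of this kind could look like the sketch below. It assumes the sequences are given as strings and that "compatible" means every character used in the sequences appears in `tokens`; `validate_alphabet_sketch` is a hypothetical name and the error message is illustrative.

```python
def validate_alphabet_sketch(sequences, tokens):
    """Raise ValueError if the sequences use characters missing from `tokens`."""
    used = set("".join(sequences))
    missing = used - set(tokens)
    if missing:
        raise ValueError(
            f"Alphabet {tokens!r} does not cover the symbols {sorted(missing)}."
        )


validate_alphabet_sketch(["AC-G", "UUGA"], "-ACGU")  # passes silently

try:
    validate_alphabet_sketch(["ACXG"], "-ACGU")
except ValueError as err:
    print(err)
```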

This file was automatically generated via lazydocs.