module fasta
Global Variables
- TOKENS_PROTEIN
- TOKENS_RNA
- TOKENS_DNA
function get_tokens
get_tokens(alphabet: str) → str
Converts the alphabet into the corresponding tokens.
Args:
alphabet
(str): Alphabet to be used for the encoding. It can be either "protein", "rna", "dna" or a custom string of tokens.
Returns:
str
: Tokens of the alphabet.
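The lookup described above can be sketched as follows. This is a minimal illustration, not the module's implementation: the token strings in `ASSUMED_TOKENS` are assumptions (the values of `TOKENS_PROTEIN`, `TOKENS_RNA` and `TOKENS_DNA` are not shown in this document), and `get_tokens_sketch` is a hypothetical name.

```python
# Assumed token strings; the real module constants may differ.
ASSUMED_TOKENS = {
    "protein": "-ACDEFGHIKLMNPQRSTVWY",
    "rna": "-ACGU",
    "dna": "-ACGT",
}


def get_tokens_sketch(alphabet: str) -> str:
    """Return the tokens for a named alphabet, or the input itself if it is custom."""
    # Any string that is not a known alphabet name is treated as a custom token string.
    return ASSUMED_TOKENS.get(alphabet, alphabet)


dna_tokens = get_tokens_sketch("dna")
custom_tokens = get_tokens_sketch("XYZ")
```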
function encode_sequence
encode_sequence(sequence: str | ndarray | list, tokens: str) → ndarray
Encodes a sequence or a list of sequences into a numeric format.
Args:
sequence
(str | np.ndarray | list): Input sequence.
tokens
(str): Alphabet to be used for the encoding.
Returns:
np.ndarray
: Encoded sequence or sequences.
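One plausible reading of the numeric format is that each token is replaced by its index in the `tokens` string. The sketch below assumes that convention; `encode_sequence_sketch` is a hypothetical stand-in, and the alphabet `"-ACGT"` is used only for illustration.

```python
import numpy as np


def encode_sequence_sketch(sequence, tokens: str) -> np.ndarray:
    """Encode one sequence (str) or several equal-length sequences as index arrays."""
    # Assumption: a token's code is its position in the alphabet string.
    lookup = {tok: i for i, tok in enumerate(tokens)}
    if isinstance(sequence, str):
        return np.array([lookup[c] for c in sequence], dtype=int)
    # A list or array of strings becomes a 2D array, one row per sequence.
    return np.array([[lookup[c] for c in seq] for seq in sequence], dtype=int)


single = encode_sequence_sketch("ACG-", "-ACGT")
batch = encode_sequence_sketch(["AC", "GT"], "-ACGT")
```

In this sketch a token that is missing from the alphabet raises a `KeyError` during the lookup.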
function decode_sequence
decode_sequence(sequence: list | ndarray | Tensor, tokens: str) → str | ndarray
Takes a numeric sequence or a list of sequences as input and returns the corresponding string encoding.
Args:
sequence
(np.ndarray): Input sequences. Can be either a 1D or a 2D iterable.
tokens
(str): Alphabet to be used for the encoding.
Returns:
str | np.ndarray
: string or array of strings with the decoded input.
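Decoding is the inverse lookup: each integer selects a character of `tokens`. The following is a sketch under the same index-based assumption, with the hypothetical name `decode_sequence_sketch`. It returns a string for 1D input and an array of strings for 2D input, matching the return type described above.

```python
import numpy as np


def decode_sequence_sketch(sequence, tokens: str):
    """Map integer codes back to characters of the alphabet string."""
    arr = np.asarray(sequence)
    if arr.ndim == 1:
        # A single encoded sequence becomes one string.
        return "".join(tokens[int(i)] for i in arr)
    # A 2D input becomes an array with one decoded string per row.
    return np.array(["".join(tokens[int(i)] for i in row) for row in arr])


one = decode_sequence_sketch([1, 2, 3, 0], "-ACGT")
many = decode_sequence_sketch([[1, 2], [3, 4]], "-ACGT")
```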
function import_from_fasta
import_from_fasta(
fasta_name: str | Path,
tokens: str | None = None,
filter_sequences: bool = False,
remove_duplicates: bool = True
) → Tuple[ndarray, ndarray]
Import sequences from a fasta file. The following operations are performed:
- If 'tokens' is provided, encodes the sequences in numeric format.
- If 'filter_sequences' is True, removes the sequences whose tokens are not present in the alphabet.
- If 'remove_duplicates' is True, removes the duplicated sequences.
Args:
fasta_name
(str | Path): Path to the fasta file.
tokens
(str | None, optional): Alphabet to be used for the encoding. If provided, encodes the sequences in numeric format.
filter_sequences
(bool, optional): If True, removes the sequences whose tokens are not present in the alphabet. Defaults to False.
remove_duplicates
(bool, optional): If True, removes the duplicated sequences. Defaults to True.
Raises:
RuntimeError
: The file is not in fasta format.
Returns:
Tuple[np.ndarray, np.ndarray]
: headers, sequences.
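The parsing, filtering and de-duplication steps can be sketched in plain Python. This is an illustration only: `parse_fasta_sketch` is a hypothetical helper that works on an iterable of lines instead of a file path, skips the numeric encoding step, and returns lists instead of arrays. Keeping the first occurrence of each duplicated sequence is an assumption about the de-duplication rule.

```python
def parse_fasta_sketch(lines, tokens=None, filter_sequences=False, remove_duplicates=True):
    """Parse FASTA-formatted lines into parallel lists of headers and sequences."""
    headers, sequences = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            headers.append(line[1:])
            sequences.append("")
        elif not headers:
            # Sequence data before any ">" header: not a valid FASTA file.
            raise RuntimeError("The file is not in fasta format.")
        else:
            sequences[-1] += line  # multi-line records are concatenated

    records = list(zip(headers, sequences))
    if filter_sequences and tokens is not None:
        # Drop sequences containing tokens outside the alphabet.
        allowed = set(tokens)
        records = [(h, s) for h, s in records if set(s) <= allowed]
    if remove_duplicates:
        seen, unique = set(), []
        for h, s in records:
            if s not in seen:  # assumption: keep the first occurrence
                seen.add(s)
                unique.append((h, s))
        records = unique
    return [h for h, _ in records], [s for _, s in records]


example = [">s1", "AC", "GT", ">s2", "ACGT", ">s3", "AXGT"]
headers, sequences = parse_fasta_sketch(example, tokens="-ACGT", filter_sequences=True)
```

To apply the sketch to a file, the lines can be read with `open(path)` and passed in directly, since file objects iterate over lines.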
function write_fasta
write_fasta(
fname: str,
headers: ndarray,
sequences: ndarray,
numeric_input: bool = False,
remove_gaps: bool = False,
tokens: str = 'protein'
)
Generate a fasta file with the input sequences.
Args:
fname
(str): Name of the output fasta file.
headers
(np.ndarray): Array of the sequences' headers.
sequences
(np.ndarray): Array of sequences.
numeric_input
(bool, optional): Whether the sequences are in numeric (encoded) format or not. Defaults to False.
remove_gaps
(bool, optional): If True, removes the gaps from the alignment. Defaults to False.
tokens
(str, optional): Alphabet to be used for the encoding. Defaults to 'protein'.
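The output layout can be illustrated with a small formatter. This is a sketch under stated assumptions: `format_fasta_sketch` is a hypothetical helper that builds the file contents as a string, takes already-decoded (non-numeric) sequences, and treats "-" as the gap symbol. None of these details are confirmed by this document.

```python
def format_fasta_sketch(headers, sequences, remove_gaps=False) -> str:
    """Build FASTA text: a '>' header line followed by one sequence line per record."""
    chunks = []
    for header, seq in zip(headers, sequences):
        seq = str(seq)
        if remove_gaps:
            seq = seq.replace("-", "")  # assumption: "-" marks a gap
        chunks.append(f">{header}\n{seq}\n")
    return "".join(chunks)


text = format_fasta_sketch(["s1", "s2"], ["A-CG", "AC-T"], remove_gaps=True)
```

Writing the result to disk is then a matter of `open(fname, "w").write(text)` or the equivalent `with` block.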
function compute_weights
compute_weights(
data: ndarray | Tensor,
th: float = 0.8,
device: device = device(type='cpu'),
dtype: dtype = torch.float32
) → Tensor
Computes the weight to be assigned to each sequence 's' in 'data' as 1 / n_clust, where 'n_clust' is the number of sequences that have a sequence identity with 's' >= th.
Args:
data
(np.ndarray | torch.Tensor): Encoded input dataset.
th
(float, optional): Sequence identity threshold for the clustering. Defaults to 0.8.
device
(torch.device, optional): Device. Defaults to "cpu".
dtype
(torch.dtype, optional): Data type. Defaults to torch.float32.
Returns:
torch.Tensor
: Array with the weights of the sequences.
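The weighting rule can be written out with NumPy broadcasting. The sketch below is not the module's implementation (which takes a `device` and a `dtype` and returns a `torch.Tensor`). `compute_weights_sketch` is a hypothetical name, sequence identity is taken to be the fraction of matching positions, and each sequence is assumed to count itself in `n_clust`.

```python
import numpy as np


def compute_weights_sketch(data, th: float = 0.8) -> np.ndarray:
    """Weight each sequence by 1 / (number of sequences with identity >= th)."""
    data = np.asarray(data)
    # Pairwise identity matrix of shape (M, M): fraction of equal positions.
    identity = (data[:, None, :] == data[None, :, :]).mean(axis=2)
    # Each sequence has identity 1 with itself, so n_clust >= 1 when th <= 1.
    n_clust = (identity >= th).sum(axis=1)
    return 1.0 / n_clust


encoded = np.array([[1, 2, 3, 4],
                    [1, 2, 3, 0],
                    [0, 0, 0, 0]])
weights = compute_weights_sketch(encoded, th=0.7)
```

Here the first two sequences share 3 of 4 positions (identity 0.75), so they fall in each other's neighbourhood and receive weight 0.5, while the third is isolated and keeps weight 1. The intermediate comparison array has shape (M, M, L), so for large alignments a batched loop would be needed to limit memory use.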
function validate_alphabet
validate_alphabet(sequences: ndarray, tokens: str)
Check if the chosen alphabet is compatible with the input sequences.
Args:
sequences
(np.ndarray): Input sequences.
tokens
(str): Alphabet to be used for the encoding.
Raises:
KeyError
: The chosen alphabet is incompatible with the Multi-Sequence Alignment.
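A compatibility check of this kind amounts to a set comparison between the characters used in the sequences and the alphabet. The sketch below uses the hypothetical name `validate_alphabet_sketch` and an illustrative error message. The message text of the real function is not shown in this document.

```python
def validate_alphabet_sketch(sequences, tokens: str) -> None:
    """Raise KeyError if any character in the sequences is absent from the alphabet."""
    used = set("".join(sequences))
    missing = used - set(tokens)
    if missing:
        raise KeyError(f"Tokens not found in the alphabet: {sorted(missing)}")


# Passes silently: every character is in the assumed DNA alphabet.
validate_alphabet_sketch(["ACGT", "A-GT"], "-ACGT")

try:
    validate_alphabet_sketch(["ACGU"], "-ACGT")
    incompatible = False
except KeyError:
    incompatible = True
```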
This file was automatically generated via lazydocs.