schrodinger.application.msv.seqio module

class schrodinger.application.msv.seqio.FetchIDs(pdb, entrez, uniprot)

Bases: tuple

__contains__(key, /)

Return key in self.

__len__()

Return len(self).

count(value, /)

Return number of occurrences of value.

entrez

Alias for field number 1

index(value, start=0, stop=9223372036854775807, /)

Return first index of value.

Raises ValueError if the value is not present.

pdb

Alias for field number 0

uniprot

Alias for field number 2
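
Because FetchIDs is a named tuple, its fields can be read by name or by position. The following is a minimal sketch using a stand-in built with collections.namedtuple; it is not the real seqio class, and the assumption that each field holds a list of ID strings is for illustration only.

```python
from collections import namedtuple

# Stand-in with the documented field order (pdb, entrez, uniprot).
# The real class is seqio.FetchIDs; the list-of-strings payload is an assumption.
FetchIDs = namedtuple("FetchIDs", ["pdb", "entrez", "uniprot"])

ids = FetchIDs(pdb=["4hhb:A"], entrez=["NP_123456"], uniprot=[])

# Fields can be unpacked positionally in the documented order.
pdb_ids, entrez_ids, uniprot_ids = ids
```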

exception schrodinger.application.msv.seqio.SequenceWarning[source]

Bases: UserWarning

Custom warning for problems loading sequences

__init__(*args, **kwargs)
args
with_traceback()

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

class schrodinger.application.msv.seqio.catch_sequence_warnings(*args, **kwargs)[source]

Bases: contextlib.ExitStack

Filter SequenceWarnings and store them on the instance

__init__(*args, **kwargs)[source]
callback(callback, /, *args, **kwds)

Registers an arbitrary callback and arguments.

Cannot suppress exceptions.

close()

Immediately unwind the context stack.

enter_context(cm)

Enters the supplied context manager.

If successful, also pushes its __exit__ method as a callback and returns the result of the __enter__ method.

pop_all()

Preserve the context stack by transferring it to a new instance.

push(exit)

Registers a callback with the standard __exit__ method signature.

Can suppress exceptions the same way __exit__ method can. Also accepts any object with an __exit__ method (registering a call to the method instead of the object itself).
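
The general technique of filtering one warning category inside an ExitStack and keeping the results on the instance can be sketched with the standard library alone. This is not the seqio implementation: the class name, the _caught attribute, and the sequence_warnings method are hypothetical names chosen for illustration.

```python
import contextlib
import warnings


class SequenceWarning(UserWarning):
    """Stand-in for seqio.SequenceWarning."""


class CatchSequenceWarningsSketch(contextlib.ExitStack):
    """Hypothetical sketch: record warnings and expose the SequenceWarnings."""

    def __init__(self):
        super().__init__()
        self._caught = []

    def __enter__(self):
        super().__enter__()
        # catch_warnings(record=True) returns the list that collects warnings.
        self._caught = self.enter_context(warnings.catch_warnings(record=True))
        # Make sure repeated SequenceWarnings are not silently deduplicated.
        warnings.simplefilter("always", SequenceWarning)
        return self

    def sequence_warnings(self):
        # Keep only the category of interest.
        return [
            w.message for w in self._caught
            if issubclass(w.category, SequenceWarning)
        ]


with CatchSequenceWarningsSketch() as catcher:
    warnings.warn("unknown residue", SequenceWarning)

messages = [str(m) for m in catcher.sequence_warnings()]
```

Subclassing ExitStack means the warning filter is restored automatically when the stack unwinds, even if the body raises.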

exception schrodinger.application.msv.seqio.GetSequencesException[source]

Bases: OSError

Custom Exception for problems retrieving sequences.

__init__(*args, **kwargs)
args
characters_written
errno

POSIX exception code

filename

exception filename

filename2

second exception filename

strerror

exception strerror

with_traceback()

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

class schrodinger.application.msv.seqio.PdbParts(pdbcode, pdbchain)

Bases: tuple

__contains__(key, /)

Return key in self.

__len__()

Return len(self).

count(value, /)

Return number of occurrences of value.

index(value, start=0, stop=9223372036854775807, /)

Return first index of value.

Raises ValueError if the value is not present.

pdbchain

Alias for field number 1

pdbcode

Alias for field number 0

class schrodinger.application.msv.seqio.FastaParts(name, long_name, chain, anno_type)

Bases: tuple

__contains__(key, /)

Return key in self.

__len__()

Return len(self).

anno_type

Alias for field number 3

chain

Alias for field number 2

count(value, /)

Return number of occurrences of value.

index(value, start=0, stop=9223372036854775807, /)

Return first index of value.

Raises ValueError if the value is not present.

long_name

Alias for field number 1

name

Alias for field number 0

schrodinger.application.msv.seqio.make_maestro_pdb_id(pdb_id)[source]

Convert a PDB ID to “:”-separated PDB code and PDB chain (e.g. 4hhb if chain is blank or 4hhb:A)

Parameters

pdb_id (str) – PDB ID with optional chain, e.g. 4hhb, 4hhbA, 4hhb:A, 4hhb_A

Returns

PDB ID with “:” between PDB code and PDB chain

Return type

str

schrodinger.application.msv.seqio.parse_pdb_id(pdb_id, permissive=False)[source]

Parse a PDB ID into a (pdb code, pdb chain) Named tuple.

Parameters
  • pdb_id (str) – PDB ID with optional chain, e.g. 4hhb, 4hhbA, 4hhb:A, 4hhb_A

  • permissive (bool) – Whether to use permissive parsing. In strict mode, PDB ID must be 4 characters starting with a digit and single-letter chain is optional. In permissive mode, PDB ID can contain any non-whitespace characters but chain separator and single-letter chain are required.

Returns

Named tuple of (pdbcode, pdbchain)

Return type

PdbParts

Raises

GetSequencesException – if pdb_id can’t be parsed
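
The strict and permissive rules described above can be sketched with two regular expressions. This is only an approximation under stated assumptions: the patterns, the accepted separators (":" and "_"), and the use of an empty string for a blank chain are guesses based on the documented examples, and PdbParts and GetSequencesException are local stand-ins rather than the seqio objects.

```python
import re
from collections import namedtuple

PdbParts = namedtuple("PdbParts", ["pdbcode", "pdbchain"])  # stand-in


class GetSequencesException(OSError):
    """Stand-in for seqio.GetSequencesException."""


# Strict: 4 characters starting with a digit, optional separator and chain.
_STRICT = re.compile(r"^(\d[A-Za-z0-9]{3})(?:[:_]?([A-Za-z]))?$")
# Permissive: any non-whitespace code, but separator and chain are required.
_PERMISSIVE = re.compile(r"^(\S+)[:_]([A-Za-z])$")


def parse_pdb_id_sketch(pdb_id, permissive=False):
    """Hypothetical re-implementation of the documented parsing rules."""
    pattern = _PERMISSIVE if permissive else _STRICT
    match = pattern.match(pdb_id.strip())
    if match is None:
        raise GetSequencesException(f"Cannot parse PDB ID: {pdb_id!r}")
    code, chain = match.groups()
    # Assumption: a missing chain is reported as an empty string.
    return PdbParts(code, chain or "")
```

For example, parse_pdb_id_sketch("4hhb_A") and parse_pdb_id_sketch("4hhbA") give the same parts, while a bare "4hhb" is only accepted in strict mode.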

schrodinger.application.msv.seqio.get_valid_pdb_id_map_for_seqs(seqs, structureless_only=True)[source]

For a list of sequences return a map of valid PDB IDs to sequences.

Parameters
  • seqs (list(sequence.Sequence)) – List of sequences to get the map for

  • structureless_only (bool) – Whether to only return structureless seqs

Returns

Map of valid PDB IDs to their source sequence

Return type

dict(str: sequence.Sequence)

schrodinger.application.msv.seqio.valid_pdb_id(pdb_id: str) bool[source]
Returns

Whether the ID appears to be a valid PDB ID

schrodinger.application.msv.seqio.valid_entrez_id(entrez_id: str) bool[source]

Entrez ID may be:

1) NCBI Accession number: 9 or 12 characters starting with any letter, followed by "P_", ending with 6 or 9 digits and an optional number following a period (e.g. NP_123456, XP_123456789.1)

2) NCBI GenInfo identifier: A single 9-digit number (e.g. 123456789).

Returns

Whether the ID appears to be a valid Entrez ID
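
The two accepted forms can be expressed as regular expressions. The sketch below is derived only from the description above and is not the library's actual check; the function name looks_like_entrez_id is hypothetical.

```python
import re

# Letter + "P_" + 6 or 9 digits, with an optional ".<number>" suffix.
_ACCESSION = re.compile(r"[A-Za-z]P_(?:\d{6}|\d{9})(?:\.\d+)?")
# GenInfo identifier: exactly 9 digits.
_GENINFO = re.compile(r"\d{9}")


def looks_like_entrez_id(entrez_id):
    """Hypothetical check mirroring the documented Entrez ID rules."""
    return bool(
        _ACCESSION.fullmatch(entrez_id) or _GENINFO.fullmatch(entrez_id)
    )
```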

schrodinger.application.msv.seqio.valid_uniprot_id(uniprot_id: str) bool[source]

UniProt ID must be 6 characters or 10 characters starting with a letter

Returns

Whether the ID appears to be a valid UniProt ID
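
A rough sketch of this length-and-first-character rule follows. It assumes that the leading-letter requirement applies to both lengths and that the remaining characters are alphanumeric; neither detail is stated above, and the function name is hypothetical.

```python
import re

# Assumption: a letter followed by 5 or 9 alphanumeric characters.
_UNIPROT = re.compile(r"[A-Za-z](?:[A-Za-z0-9]{5}|[A-Za-z0-9]{9})")


def looks_like_uniprot_id(uniprot_id):
    """Hypothetical check: 6 or 10 characters, starting with a letter."""
    return bool(_UNIPROT.fullmatch(uniprot_id))
```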

schrodinger.application.msv.seqio.valid_swiss_prot_name(swiss_prot_name: str) bool[source]

Swiss-Prot entry name must be of the form X_Y, where X and Y are at most 5 alphanumeric characters and the underscore serves as a separator.

We also require Y to be a minimum of 2 characters to avoid confusion with a PDB ID.

Returns

Whether the name appears to be a valid Swiss-Prot entry name
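
The X_Y rule, including the two-character minimum on Y, can be sketched as a single pattern. The one-character minimum on X is an assumption, and the function name is hypothetical; this is not the library's code.

```python
import re

# X: 1 to 5 alphanumerics (minimum of 1 is assumed); Y: 2 to 5 alphanumerics.
_SWISS_PROT = re.compile(r"[A-Za-z0-9]{1,5}_[A-Za-z0-9]{2,5}")


def looks_like_swiss_prot_name(name):
    """Hypothetical check mirroring the documented entry-name rules."""
    return bool(_SWISS_PROT.fullmatch(name))
```

Under this pattern a chain-qualified PDB ID such as 4hhb_A is rejected because the part after the underscore has only one character.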

schrodinger.application.msv.seqio.process_fetch_ids(ids, *, dialog_parent, allow_pdb=True)[source]

Convenience method to parse a list or comma-separated string of IDs into valid sequence and/or structure identifiers. If any IDs can’t be identified, prompt the user to continue.

Parameters
  • ids (str or list) – Database ID or IDs (comma-separated str or list)

  • dialog_parent (QtWidgets.QWidget) – Parent to show dialog box

  • allow_pdb (bool) – Whether to allow structure identifiers. If False, they will be treated as unidentified.

Returns

Namedtuple of IDs identified as PDB, entrez, uniprot; or None if there are unidentified IDs and the user cancels.

Return type

FetchIDs or NoneType

schrodinger.application.msv.seqio.maestro_get_pdb(maestro_pdb_id, pdb_dir=None, remote_ok=False)[source]

Download a PDB file. If specified, the chain will be split out into a separate file.

Parameters
  • maestro_pdb_id (str) – 4-letter PDB code or code:chain (e.g. 4hhb or 4hhb:A)

  • pdb_dir (str) – directory to check for existing files and destination to download new files

  • remote_ok (bool) – whether it’s okay to make a remote query.

Returns

downloaded PDB path

Return type

str

Raises

GetSequencesException – if pdb file can’t be downloaded

class schrodinger.application.msv.seqio.SeqDownloader[source]

Bases: object

ENTREZ_FORMAT_STR = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&rettype=fasta&id={ID}'
UNIPROT_FORMAT_STR = 'https://www.uniprot.org/uniprot/{ID}.{EXT}'
classmethod downloadPDB(pdb_id, pdb_dir=None, remote_ok=False)[source]

Parse PDB ID string and download PDB file.

Parameters
  • pdb_id (str) – PDB ID with optional chain (e.g. 4hhb, 4hhbA, 4hhb:A)

  • pdb_dir (str) – directory to check for existing files and destination to download new files

  • remote_ok (bool) – whether it’s okay to make a remote query.

Returns

Full path to the downloaded PDB file

Return type

str

Raises

GetSequencesException – if pdb file can’t be downloaded

classmethod downloadEntrezSeq(sequence_id, remote_ok)[source]

Download a sequence from Entrez database.

Parameters
  • sequence_id (str) – Sequence ID in Entrez format.

  • remote_ok (bool) – whether it’s okay to make a remote query.

Returns

Full path to downloaded fasta file

Return type

str

classmethod downloadUniprotSeq(sequence_id, remote_ok, *, use_xml=False)[source]

Download a sequence from Uniprot database.

Parameters
  • sequence_id (str) – Sequence ID in Uniprot format.

  • remote_ok (bool) – whether it’s okay to make a remote query.

  • use_xml (bool) – whether to get the xml file with the full UniProt annotation information (e.g. domains). Setting this to True will download the xml file instead of the FASTA file.

Returns

Full path to downloaded fasta or xml file

Return type

str
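
The ENTREZ_FORMAT_STR and UNIPROT_FORMAT_STR attributes listed for SeqDownloader are ordinary str.format templates. The sketch below copies the two templates from this page and fills them in without making any network request. The EXT values "fasta" and "xml" are assumptions inferred from the use_xml parameter, and the example IDs are placeholders.

```python
# Templates copied from the SeqDownloader class attributes.
ENTREZ_FORMAT_STR = (
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
    "?db=protein&rettype=fasta&id={ID}"
)
UNIPROT_FORMAT_STR = "https://www.uniprot.org/uniprot/{ID}.{EXT}"

entrez_url = ENTREZ_FORMAT_STR.format(ID="NP_123456")

# Assumption: EXT is "fasta" by default and "xml" when use_xml=True.
uniprot_fasta_url = UNIPROT_FORMAT_STR.format(ID="P69905", EXT="fasta")
uniprot_xml_url = UNIPROT_FORMAT_STR.format(ID="P69905", EXT="xml")
```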

schrodinger.application.msv.seqio.read_sequences(filename)[source]

Read sequences from the given file. The format is detected from the file extension.

Note that this function is only used for non-structure filetypes. For structure filetypes, see the StructureConverter class.

Parameters

filename (str) – Path to sequence file

Return type

list

Returns

A list of sequences in the file

schrodinger.application.msv.seqio.from_biopython(biopy_seq)[source]

Convert a Biopython sequence to a ProteinSequence

Parameters

biopy_seq (Bio.SeqRecord.SeqRecord) – A Biopython sequence to convert to a ProteinSequence

Returns

The converted sequence

Return type

schrodinger.protein.sequence.ProteinSequence

class schrodinger.application.msv.seqio.StructureConverter(ct, eid=None)[source]

Bases: object

Reads a structure and converts it to a list of sequences.

Note that this class produces sequences that are ordered based on residue number and insertion code, not connectivity. If that ever changes, structure_model.MaestroStructureModel._extractChains must also be updated.
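
To illustrate what ordering by residue number and insertion code means, here is a small standalone sketch. The (resnum, inscode, name) tuples are a hypothetical representation, and using a single space for a blank insertion code is an assumption; the real converter works on structure residues.

```python
# Hypothetical residue records: (residue number, insertion code, name).
residues = [
    (101, " ", "GLY"),
    (100, "B", "SER"),
    (100, " ", "ALA"),
    (100, "A", "THR"),
]

# Sort by residue number first, then by insertion code.
# A blank (space) insertion code sorts before any letter.
ordered = sorted(residues, key=lambda res: (res[0], res[1]))
ordered_names = [name for _, _, name in ordered]
```

Note that this ordering ignores which residues are actually bonded to each other, which is the distinction from connectivity-based ordering made above.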

__init__(ct, eid=None)[source]
Parameters
  • ct (schrodinger.structure.Structure) – A structure to convert to sequences.

  • eid (str) – The entry id to assign to the created sequences. If not given, the entry id from the structure will be used.

classmethod convert(ct, eid=None)[source]

Convert the provided structure into a list of sequences.

Parameters
  • ct (schrodinger.structure.Structure) – A structure to convert to sequences.

  • eid (str) – The entry id to assign to the created sequences. If not given, the entry id from the structure will be used.

Returns

A list of sequences, one per chain.

Return type

list[sequence.Sequence]

makeSequences()[source]

Note that disulfide bonds may span chains, so they need to be calculated at the ct level.

Returns

A list of sequences, one per chain.

Return type

list[sequence.Sequence]

classmethod convertStructResidue(struct_res, make_res)[source]

Convert a structure._Residue into a residue.Residue.

Parameters
  • struct_res (structure._Residue or residue.Residue) – A structure residue to convert. If this is a residue.Residue object, it will be returned unchanged.

  • make_res (callable) – A method to convert a string into a residue.Residue

Returns

A newly created residue

Return type

residue.Residue

class schrodinger.application.msv.seqio.MMSequenceConverter[source]

Bases: object

Converts sequence between mmseq and MSV sequence formats.

Note

This class is intended to be used as a context manager in a ‘with’ statement.

classmethod readSequences(file_name, file_format=0)[source]

Reads all sequences from file specified by file_name.

Parameters
  • file_name (str) – Name of input file.

  • file_format (int) – Format of the input file. By default, the format is MMSEQIO_ANY meaning file type is automatically recognized.

Return type

List of schrodinger.protein.sequence.Sequence.

Returns

List of sequences read from the file.

Raises

GetSequencesException – If the file could not be read.

classmethod writeSequences(sequences, file_name, file_format=1)[source]

Writes sequences to a file specified by file_name.

Raises

mmcheck.MmException – If the file could not be opened for writing.

Parameters
  • sequences – List of sequences to be written to file.

  • file_name (str) – Name of output file.

  • file_format (int) – Format of the output file. By default, the format is MMSEQIO_NATIVE.

class schrodinger.application.msv.seqio.BaseProteinAlignmentReader[source]

Bases: object

Base class for reading protein sequence alignments from files.

classmethod read(file_name, AlnCls=<class 'schrodinger.protein.alignment.ProteinAlignment'>)[source]

Returns alignment read from file

Note

The alignment can be empty if no sequence was present in the input file.

Parameters
  • file_name (str) – Source file name

  • AlnCls (type) – The type of the Alignment to return

Returns

An alignment of the specified type

Raises

IOError – If file cannot be read

class schrodinger.application.msv.seqio.ClustalAlignmentReader[source]

Bases: schrodinger.application.msv.seqio.BaseProteinAlignmentReader

Class for reading Clustal *.aln files.

classmethod read(file_name, AlnCls=<class 'schrodinger.protein.alignment.ProteinAlignment'>)[source]
Parameters
  • file_name (str) – Source file name

  • AlnCls (type) – The type of the Alignment to return

Returns

An alignment of the specified type

class schrodinger.application.msv.seqio.SeqDReader[source]

Bases: object

classmethod read(file_name)[source]
class schrodinger.application.msv.seqio.FastaAlignmentReader[source]

Bases: object

classmethod parseSSA(seq)[source]

Parse a SSA sequence into a list of SSA values that can be assigned to residues’ secondary_structure property

Parameters

seq (str) – the “sequence” from the FASTA file which encodes the SSA values

Returns

a list of the SSA values. The SSA values come from schrodinger.structure. Returns None if any of the elements was invalid

Return type

list(int) or NoneType

classmethod read(file_name, AlnClass=<class 'schrodinger.protein.alignment.ProteinAlignment'>)[source]

Loads a sequence file in FASTA format, creates sequences and appends them to alignment. Splits sequence name from the FASTA header.

Parameters
  • file_name (str) – name of input FASTA file

  • AlnClass (type) – The class of the alignment object to return

Returns

The alignment read from the file.

Return type

AlnClass

classmethod readFromText(lines, AlnClass=<class 'schrodinger.protein.alignment.ProteinAlignment'>)[source]

Read sequences from FASTA-formatted text, creates sequences and appends them to alignment. Splits sequence name from the FASTA header.

Parameters
  • lines (list of str) – list of strings representing FASTA file

  • AlnClass (type) – The class of the alignment object to return

Returns

The alignment

Return type

AlnClass

classmethod readFromStringList(strings, AlnClass=<class 'schrodinger.protein.alignment.ProteinAlignment'>)[source]

Return an alignment object created from an iterable of sequence strings

Parameters
  • strings (Iterable of strings) – Sequences as iterable of strings (1D codes)

  • AlnClass (type) – The class of the alignment object to return

Returns

The alignment

Return type

AlnClass

schrodinger.application.msv.seqio.to_biopython(seq)[source]

Converts a sequence to a Biopython sequence

Parameters

seq (schrodinger.protein.sequence.ProteinSequence) – A sequence to convert to a Biopython sequence

Returns

The sequence converted to a Biopython SeqRecord

Return type

Bio.SeqRecord.SeqRecord

class schrodinger.application.msv.seqio.BaseProteinAlignmentWriter[source]

Bases: object

Class for writing protein alignments to files.

classmethod write(aln, file_name, **kwargs)[source]

Writes aln to a file.

Parameters
  • aln (BaseAlignment) – Alignment to be written to a file.

  • file_name (str) – Destination file name.

Note

Subclasses may take additional **kwargs as write options

class schrodinger.application.msv.seqio.FastaAlignmentWriter[source]

Bases: schrodinger.application.msv.seqio.BaseProteinAlignmentWriter

Class for writing FASTA .fasta files.

The format is described here: https://en.wikipedia.org/wiki/FASTA_format

HEADER_START = '>'
HEADER_END = ''
classmethod toString(aln, use_unique_names=True, maxl=50)[source]
classmethod toStringAndNames(aln, use_unique_names=True, maxl=50, export_annotations=False, sim_ref_seq=None)[source]

Converts aln to FASTA string

Parameters
  • aln (ProteinAlignment) – Structured sequences

  • use_unique_names (bool) – If True, write unique name for each sequence.

  • maxl (int) – Maximum length of a line

  • export_annotations (bool) – Whether annotations should be exported along with sequence information. If True, annotations listed in EXPORT_ANNOTATIONS will be exported.

  • sim_ref_seq (sequence.Sequence or None) – Reference sequence to calculate similarities for the sequences to be exported. If None, similarity will not be exported.

Returns

FASTA string

Return type

string
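
The effect of HEADER_START and maxl on the output can be sketched with plain strings. This is an illustration only: it assumes maxl limits the length of sequence lines (not header lines), takes (name, sequence) pairs instead of a ProteinAlignment, and ends the text with a newline. The function name is hypothetical.

```python
HEADER_START = ">"  # same value as FastaAlignmentWriter.HEADER_START


def to_fasta_sketch(named_seqs, maxl=50):
    """Hypothetical FASTA formatter for (name, sequence string) pairs."""
    lines = []
    for name, seq in named_seqs:
        lines.append(f"{HEADER_START}{name}")
        # Wrap the sequence so that no sequence line exceeds maxl characters.
        lines.extend(seq[i:i + maxl] for i in range(0, len(seq), maxl))
    return "\n".join(lines) + "\n"


fasta_text = to_fasta_sketch([("seq1", "A" * 120), ("seq2", "MKV")])
```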

classmethod toStringList(aln)[source]

Convert ProteinAlignment object to list of sequence strings

Parameters

aln (ProteinAlignment) – Alignment data

Return type

list of str

Returns

A list of sequence strings representing the alignment

classmethod write(aln, file_name, use_unique_names=True, maxl=50, export_annotations=False, sim_ref_seq=None, **kwargs)[source]

Write aln to FASTA file

Raises

IOError – If output file cannot be written.

Parameters
  • aln (ProteinAlignment) – Structured sequences

  • use_unique_names (bool) – If True, write unique name for each sequence.

  • maxl (int) – Maximum length of a line

  • file_name (str) – Destination file name.

  • export_annotations (bool) – Whether annotations should be exported along with sequence information. If True, annotations listed in EXPORT_ANNOTATIONS will be exported.

  • sim_ref_seq (sequence.Sequence or None) – Reference sequence to calculate similarities for the sequences to be exported. If None, similarity will not be exported.

Returns

output names of each sequence

Return type

list of str

class schrodinger.application.msv.seqio.ClustalAlignmentWriter[source]

Bases: schrodinger.application.msv.seqio.BaseProteinAlignmentWriter

Class for writing Clustal *.aln files.

The format is described here: http://meme-suite.org/doc/clustalw-format.html

classmethod write(aln, file_name, use_unique_names=True, **kwargs)[source]

Writes aln to a Clustal alignment file.

Note: **kwargs are ignored, to preserve signature of BaseProteinAlignmentWriter

Raises

IOError – If output file cannot be written.

Parameters
  • aln (BaseAlignment) – Alignment to be written to a file.

  • file_name (str) – Destination file name.

  • use_unique_names (bool) – If True, write unique name for each sequence.

Return type

dict

Returns

A mapping of names written to the clustal file and sequences

class schrodinger.application.msv.seqio.CSVAlignmentWriter[source]

Bases: schrodinger.application.msv.seqio.BaseProteinAlignmentWriter

classmethod write(aln, file_name, export_descriptors=False, **kwargs)[source]

Writes aln to a file.

Parameters
  • aln (BaseAlignment) – Alignment to be written to a file.

  • file_name (str) – Destination file name.

Note

Subclasses may take additional **kwargs as write options

class schrodinger.application.msv.seqio.SeqDAlignmentWriter[source]

Bases: schrodinger.application.msv.seqio.BaseProteinAlignmentWriter

Class to write sequence and descriptors to seqd file. Each sequence is exported to a seqd file with name “<seq_name>_<chain_name>.seqd”

classmethod write(aln, file_name, export_descriptors=True, export_annotations=False, **kwargs)[source]

Writes aln to a file.

Parameters
  • aln (BaseAlignment) – Alignment to be written to a file.

  • file_name (str) – Destination file name.

Note

Subclasses may take additional **kwargs as write options

schrodinger.application.msv.seqio.is_inhouse_header(fasta_header)[source]

Test whether the given fasta header is in the in-house format. The in-house format is given by ">NAME:<long_name>|CHAIN:<chain>" with an optional "|<anno_type>" flag on the end.

Example::

    >NAME:ABC|CHAIN:X|SSA
    >NAME:A|B|C|CHAIN:X x

Parameters

fasta_header (str) – The fasta header to check

Returns

Whether the header is in the in-house format

Return type

bool

schrodinger.application.msv.seqio.parse_in_house_header(fasta_header)[source]

Parse a fasta header in the in-house format. The in-house format is given by ">NAME:<long_name>|CHAIN:<chain>" with an optional "|<anno_type>" flag on the end.

Example::

    >NAME:ABC LONG|CHAIN:X|SSA –> ABC LONG, X, secondary_structure
    >NAME:A|B|C|CHAIN:X x –> A|B|C, X, None

Parameters

fasta_header (str) – The fasta header to parse

Returns

the long_name, chain and annotation type corresponding to the header

Return type

tuple(str, str, PSAnno.ANNOTATION_TYPES or NoneType)
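
The in-house header layout can be matched with one regular expression. The sketch below is an approximation, not the library's parser: it returns the raw annotation flag (e.g. "SSA") rather than a PSAnno.ANNOTATION_TYPES member, it returns None for headers that do not match (an assumption), and both function names are hypothetical.

```python
import re

# ">NAME:<long_name>|CHAIN:<chain>" with an optional "|<anno_type>" suffix.
# The long name itself may contain "|" characters.
_IN_HOUSE = re.compile(
    r"^>NAME:(?P<long_name>.+)\|CHAIN:(?P<chain>[^|]*)(?:\|(?P<anno>[^|]+))?$"
)


def is_in_house_sketch(fasta_header):
    """Hypothetical check for the in-house header layout."""
    return _IN_HOUSE.match(fasta_header) is not None


def parse_in_house_sketch(fasta_header):
    """Return (long_name, chain, raw annotation flag or None), or None."""
    match = _IN_HOUSE.match(fasta_header)
    if match is None:
        return None
    return match.group("long_name"), match.group("chain"), match.group("anno")
```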

schrodinger.application.msv.seqio.parse_fasta_header(header, permissive=True)[source]

Parse a FASTA header into a (name, long_name, chain, anno_type) Named tuple.

Parameters
  • header (str) – The header for a single entry in a FASTA file (including leading comment character)

  • permissive (bool) – Whether to use permissive parsing. See parse_pdb_id for documentation.

Returns

Named tuple of (name, long_name, chain, anno_type)

Return type

FastaParts

schrodinger.application.msv.seqio.parse_long_name(long_name, permissive=True)[source]

Attempt to parse a long_name into a short name and a chain.

Examples:

    1FSK:A –> 1FSK, A
    2BJM.H VH CDR_LENGTH: 5 17 11 –> 2BJM, H
    sp|accession|entry name –> accession, “”

Parameters
  • long_name (str) – The long name to attempt to parse

  • permissive (bool) – Whether to use permissive parsing. See parse_pdb_id for documentation.

Returns

A short name and a chain id

Return type

PdbParts
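
A rough sketch that reproduces the three documented examples is shown below. It ignores the permissive flag, and both the fallback for unrecognized names (whole name, blank chain) and the accepted separators are assumptions. PdbParts is a local stand-in and the function name is hypothetical.

```python
import re
from collections import namedtuple

PdbParts = namedtuple("PdbParts", ["pdbcode", "pdbchain"])  # stand-in

# 4-character code starting with a digit, a separator, then a one-letter chain.
_CODE_CHAIN = re.compile(r"^(\d[A-Za-z0-9]{3})[:._]([A-Za-z])\b")


def parse_long_name_sketch(long_name):
    """Hypothetical parser covering only the documented example shapes."""
    if long_name.startswith("sp|"):
        # "sp|accession|entry name": use the accession, with a blank chain.
        return PdbParts(long_name.split("|")[1], "")
    match = _CODE_CHAIN.match(long_name)
    if match:
        return PdbParts(*match.groups())
    # Assumption: fall back to the whole name with a blank chain.
    return PdbParts(long_name, "")
```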

schrodinger.application.msv.seqio.reorder_fasta_alignment(aln, orig_names)[source]

Reorder a FASTA alignment to match the order of names written to FASTA.

Intended for use after alignment methods that reorder the output.

Example usage:

    orig_names = seqio.FastaAlignmentWriter.write(orig_aln, input_filename)
    # run alignment method
    aln = seqio.FastaAlignmentReader.read(out_filename)
    seqio.reorder_fasta_alignment(aln, orig_names)

Parameters
  • aln (alignment.BaseAlignment) – Alignment to reorder. Will be modified in place.

  • orig_names (list[str]) – Original order of sequence names as written to FASTA.

Raises

ValueError – If the alignments have different lengths or mismatched names
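
The reordering contract (in-place modification, ValueError on a length or name mismatch) can be illustrated on a plain list. This sketch uses a list of (name, sequence string) pairs in place of an alignment object; the function name and data layout are hypothetical and say nothing about how the real function is implemented.

```python
def reorder_by_names_sketch(seqs, orig_names):
    """Reorder a list of (name, sequence) pairs in place to match orig_names."""
    if len(seqs) != len(orig_names):
        raise ValueError("Alignment and name list have different lengths")
    by_name = dict(seqs)
    # Duplicate or unknown names make the mapping ambiguous.
    if len(by_name) != len(seqs) or set(by_name) != set(orig_names):
        raise ValueError("Mismatched sequence names")
    # Slice assignment keeps the same list object, so callers see the change.
    seqs[:] = [(name, by_name[name]) for name in orig_names]


aligned = [("seqB", "MK-V"), ("seqC", "MKLV"), ("seqA", "M-LV")]
reorder_by_names_sketch(aligned, ["seqA", "seqB", "seqC"])
```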