schrodinger.livedesign.molhash module

Generate a unique hash code for a molecule based on chemistry. If two molecules are chemically “the same”, they should have the same hash.

Used by Schrödinger’s LiveDesign to determine if two molecules are the same. LiveDesign makes changes to the molecule before molhash, somewhat equivalent to the steps available in rdkit.Chem.MolStandardize.

Using molhash adds value beyond using SMILES because it:

  • Ignores SMILES features that are not chemically meaningful (e.g. atom map numbers)

  • Canonicalizes enhanced stereochemistry groups. For example, C[C@H](O)CC |&1:1| and C[C@@H](O)CC |&1:1| have the same molhash

  • Canonicalizes S group data (for example, polymer data)

There are three hash schemes: the default, which uses all layers; one in which tautomers are considered equivalent; and one in which stereoisomers are considered equivalent.

Copyright (C) 2022 Schrödinger, LLC

class schrodinger.livedesign.molhash.HashLayer(value)[source]

Bases: enum.Enum

Variables
  • CANONICAL_SMILES – RDKit canonical SMILES (excluding enhanced stereo)

  • ESCAPE – arbitrary other information to be incorporated

  • FORMULA – a simple molecular formula for the molecule

  • NO_STEREO_SMILES – RDKit canonical SMILES with all stereo removed

  • NO_STEREO_TAUTOMER_HASH – the tautomer hash (see TAUTOMER_HASH) with all stereo removed

  • SGROUP_DATA – canonicalization of all SGroups data present

  • TAUTOMER_HASH – SMILES-like representation for a generic tautomer form

See SS-30145 for more documentation and an example Jupyter notebook.

CANONICAL_SMILES = 1
ESCAPE = 2
FORMULA = 3
NO_STEREO_SMILES = 4
NO_STEREO_TAUTOMER_HASH = 5
SGROUP_DATA = 6
TAUTOMER_HASH = 7
class schrodinger.livedesign.molhash.HashScheme(value)[source]

Bases: enum.Enum

Which hash layers to use when deduplicating molecules

Typically the “ALL_LAYERS” scheme is used, but some users may want the “TAUTOMER_INSENSITIVE_LAYERS” scheme.

Variables
  • ALL_LAYERS – most strict hash scheme utilizing all layers

  • STEREO_INSENSITIVE_LAYERS – excludes stereo sensitive layers

  • TAUTOMER_INSENSITIVE_LAYERS – excludes tautomer sensitive layers

ALL_LAYERS = (<HashLayer.CANONICAL_SMILES: 1>, <HashLayer.ESCAPE: 2>, <HashLayer.FORMULA: 3>, <HashLayer.NO_STEREO_SMILES: 4>, <HashLayer.NO_STEREO_TAUTOMER_HASH: 5>, <HashLayer.SGROUP_DATA: 6>, <HashLayer.TAUTOMER_HASH: 7>)
STEREO_INSENSITIVE_LAYERS = (<HashLayer.ESCAPE: 2>, <HashLayer.FORMULA: 3>, <HashLayer.NO_STEREO_SMILES: 4>, <HashLayer.NO_STEREO_TAUTOMER_HASH: 5>, <HashLayer.SGROUP_DATA: 6>)
TAUTOMER_INSENSITIVE_LAYERS = (<HashLayer.ESCAPE: 2>, <HashLayer.FORMULA: 3>, <HashLayer.NO_STEREO_TAUTOMER_HASH: 5>, <HashLayer.SGROUP_DATA: 6>, <HashLayer.TAUTOMER_HASH: 7>)
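The difference between the schemes is easiest to see as the set of layers each one leaves out. The following is a minimal standalone sketch that mirrors the enum values listed above using only the standard library; it does not import schrodinger.livedesign.molhash, and the helper excluded_layers is a hypothetical name introduced here for illustration.

```python
from enum import Enum


class HashLayer(Enum):
    # Values copied from the listing above.
    CANONICAL_SMILES = 1
    ESCAPE = 2
    FORMULA = 3
    NO_STEREO_SMILES = 4
    NO_STEREO_TAUTOMER_HASH = 5
    SGROUP_DATA = 6
    TAUTOMER_HASH = 7


class HashScheme(Enum):
    # Each scheme is a tuple of the layers it uses, as listed above.
    ALL_LAYERS = tuple(HashLayer)
    STEREO_INSENSITIVE_LAYERS = (
        HashLayer.ESCAPE,
        HashLayer.FORMULA,
        HashLayer.NO_STEREO_SMILES,
        HashLayer.NO_STEREO_TAUTOMER_HASH,
        HashLayer.SGROUP_DATA,
    )
    TAUTOMER_INSENSITIVE_LAYERS = (
        HashLayer.ESCAPE,
        HashLayer.FORMULA,
        HashLayer.NO_STEREO_TAUTOMER_HASH,
        HashLayer.SGROUP_DATA,
        HashLayer.TAUTOMER_HASH,
    )


def excluded_layers(scheme):
    """Hypothetical helper: names of the layers a scheme does not use."""
    used = set(scheme.value)
    return sorted(layer.name for layer in HashLayer if layer not in used)


print(excluded_layers(HashScheme.STEREO_INSENSITIVE_LAYERS))
print(excluded_layers(HashScheme.TAUTOMER_INSENSITIVE_LAYERS))
```

The stereo-insensitive scheme drops the two layers that encode stereo (CANONICAL_SMILES and TAUTOMER_HASH), while the tautomer-insensitive scheme drops the two layers that depend on a specific tautomer (CANONICAL_SMILES and NO_STEREO_SMILES).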
schrodinger.livedesign.molhash.get_molhash(all_layers, hash_scheme: schrodinger.livedesign.molhash.HashScheme = HashScheme.ALL_LAYERS) → str[source]

Generate a molecular hash using a specified set of layers.

Parameters
  • all_layers – the layers of data for the molecule to generate the hash for, such as the dictionary returned by get_mol_layers

  • hash_scheme – enum encoding information layers for the hash

Returns

hash for the given scheme constructed from the input layers
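One way to picture how a single hash can be built from a chosen set of layers is sketched below. This is an illustration under stated assumptions, not the module's implementation: the use of SHA-256, the separator byte, and the function name sketch_molhash are all choices made for this sketch, and the layer values are placeholders rather than real output of get_mol_layers.

```python
import hashlib
from enum import Enum

# Stand-in for the HashLayer enum; the functional API numbers members from 1.
HashLayer = Enum(
    "HashLayer",
    "CANONICAL_SMILES ESCAPE FORMULA NO_STEREO_SMILES "
    "NO_STEREO_TAUTOMER_HASH SGROUP_DATA TAUTOMER_HASH",
)

# Layer tuple copied from HashScheme.TAUTOMER_INSENSITIVE_LAYERS.
TAUTOMER_INSENSITIVE = (
    HashLayer.ESCAPE,
    HashLayer.FORMULA,
    HashLayer.NO_STEREO_TAUTOMER_HASH,
    HashLayer.SGROUP_DATA,
    HashLayer.TAUTOMER_HASH,
)


def sketch_molhash(all_layers, scheme_layers=tuple(HashLayer)):
    """Hypothetical sketch: digest the selected layers in a fixed order."""
    digest = hashlib.sha256()  # assumption: the real digest may differ
    for layer in scheme_layers:
        digest.update(all_layers[layer].encode("utf-8"))
        digest.update(b"\x00")  # separator so adjacent layers cannot merge
    return digest.hexdigest()


def toy_layers(smiles):
    """Placeholder layer values for two tautomers of the same compound."""
    return {
        HashLayer.CANONICAL_SMILES: smiles,
        HashLayer.ESCAPE: "",
        HashLayer.FORMULA: "C3H6O",
        HashLayer.NO_STEREO_SMILES: smiles,
        HashLayer.NO_STEREO_TAUTOMER_HASH: "placeholder-tautomer-hash",
        HashLayer.SGROUP_DATA: "",
        HashLayer.TAUTOMER_HASH: "placeholder-tautomer-hash",
    }


keto = toy_layers("CC(C)=O")
enol = toy_layers("C=C(C)O")

# All layers: the tautomers differ. Tautomer-insensitive layers: they match.
print(sketch_molhash(keto) == sketch_molhash(enol))
print(sketch_molhash(keto, TAUTOMER_INSENSITIVE)
      == sketch_molhash(enol, TAUTOMER_INSENSITIVE))
```

Because the scheme only selects which layers feed the digest, the layers need to be computed once per molecule and can then be hashed under several schemes.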

schrodinger.livedesign.molhash.get_mol_layers()[source]

Generate layers of data that could be used to identify a molecule

Parameters
  • original_molecule – molecule to obtain canonicalization layers from

  • data_field_names – optional sequence of names of SGroup DAT fields which will be included in the hash.

  • escape – optional field which can contain arbitrary information

Returns

dictionary of HashLayer enum to calculated hash

schrodinger.livedesign.molhash.strip_atom_map_labels(mol)[source]
schrodinger.livedesign.molhash.get_stereo_tautomer_hash(molecule)[source]
schrodinger.livedesign.molhash.get_canonical_smiles(cxsmiles)[source]
schrodinger.livedesign.molhash.get_no_stereo_layers(mol)[source]
schrodinger.livedesign.molhash.get_canonical_atom_ranks_and_bonds(mol, useSmilesOrdering=True)[source]

Returns a 2-tuple with:

  1. the canonical ranks of a molecule’s atoms

  2. the bonds expressed as (canonical_atom_rank_1,canonical_atom_rank_2) where canonical_atom_rank_1 < canonical_atom_rank_2

If useSmilesOrdering is True then the atom indices here correspond to the order of the atoms in the canonical SMILES, otherwise just the canonical atom order is used. useSmilesOrdering=True is a bit slower, but it allows the output to be linked to the canonical SMILES, which can be useful.
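The second element of the tuple can be pictured with a small standalone sketch. It assumes the canonical ranks are already known and that bonds are given as pairs of atom indices; the function name bonds_as_rank_pairs is hypothetical, and sorting the resulting list is an extra choice made here so that two atom orderings can be compared directly.

```python
def bonds_as_rank_pairs(atom_ranks, bonds):
    """Express each bond as (lower_rank, higher_rank).

    atom_ranks: canonical rank of each atom, indexed by atom index.
    bonds: iterable of (begin_atom_index, end_atom_index) pairs.
    """
    pairs = []
    for begin_idx, end_idx in bonds:
        rank_a = atom_ranks[begin_idx]
        rank_b = atom_ranks[end_idx]
        # Put the smaller rank first so the pair does not depend on
        # which end of the bond was listed first in the input.
        pairs.append((min(rank_a, rank_b), max(rank_a, rank_b)))
    return sorted(pairs)  # sorting is a choice of this sketch


# The same three-atom chain written with two different atom orderings.
# In both cases the central atom has canonical rank 0.
first = bonds_as_rank_pairs([2, 0, 1], [(0, 1), (1, 2)])
second = bonds_as_rank_pairs([0, 1, 2], [(0, 1), (0, 2)])
print(first, second)
```

Both orderings yield the same list of rank pairs, which is what makes the representation usable for comparing molecules regardless of input atom order.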

schrodinger.livedesign.molhash.canonicalize_data_sgroup(sg, atRanks, bndOrder, fieldNames=None, sortAtomOrder=True)[source]

NOTES: if sortAtomOrder is true then the atom list will be sorted. This assumes that the order of the atoms in that list is not important
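The effect of sortAtomOrder can be illustrated with a toy function, assuming an SGroup's atoms are given as atom indices and are first translated to canonical ranks. The name canonical_atom_list is hypothetical and the sketch is not the module's code.

```python
def canonical_atom_list(atom_indices, atom_ranks, sort_atom_order=True):
    """Translate atom indices to canonical ranks, optionally sorting them."""
    ranks = [atom_ranks[idx] for idx in atom_indices]
    # Sorting discards the input order, which is only safe when that
    # order carries no meaning for the SGroup.
    return sorted(ranks) if sort_atom_order else ranks


atom_ranks = [5, 3, 1]
print(canonical_atom_list([0, 2], atom_ranks))
print(canonical_atom_list([2, 0], atom_ranks))
print(canonical_atom_list([0, 2], atom_ranks, sort_atom_order=False))
```

With sorting enabled, the two input orders produce the same list; with sorting disabled, the input order is preserved in the output.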

schrodinger.livedesign.molhash.getCanonicalBondRep(bond, atomRanks)[source]
schrodinger.livedesign.molhash.canonicalize_sru_sgroup(mol, sg, atRanks, bndOrder, sortAtomAndBondOrder)[source]

NOTES: if sortAtomAndBondOrder is true then the atom and bond lists will be sorted. This assumes that the ordering of those lists is not important

schrodinger.livedesign.molhash.canonicalize_cop_sgroup(sg, atRanks, sortAtomAndBondOrder)[source]

NOTES: if sortAtomAndBondOrder is true then the atom and bond lists will be sorted. This assumes that the ordering of those lists is not important

schrodinger.livedesign.molhash.canonicalize_sgroups(mol, dataFieldNames=None, sortAtomAndBondOrder=True)[source]

NOTES: if sortAtomAndBondOrder is true then the atom and bond lists will be sorted. This assumes that the ordering of those lists is not important

class schrodinger.livedesign.molhash.EnhancedStereoUpdateMode(value)[source]

Bases: enum.Enum

An enumeration.

ADD_WEIGHTS = 1
REMOVE_WEIGHTS = 2
schrodinger.livedesign.molhash.update_enhanced_stereo_group_weights(mol, mode)[source]
schrodinger.livedesign.molhash.canonicalize_stereo_groups(mol)[source]

Returns the canonical CXSMILES and the corresponding molecule with the stereo groups canonicalized.

The RDKit canonicalization code does not currently take stereo groups into account. We work around that by using EnumerateStereoisomers() to generate all possible instances of the molecule’s stereogroups and then lexically compare the CXSMILES of those.
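The idea of enumerating every instance and keeping the lexically smallest one can be shown with a toy model that does not use RDKit. Here a molecule is reduced to one R/S label per stereocenter, and each stereo group is a tuple of center indices that invert together; this representation and the name canonical_labels are assumptions made for the sketch, whereas the real function compares CXSMILES strings.

```python
from itertools import product

FLIP = {"R": "S", "S": "R"}


def canonical_labels(labels, groups):
    """Pick the lexically smallest labeling over all stereo group instances.

    labels: one 'R' or 'S' per stereocenter.
    groups: tuples of center indices; the centers in a group are
            inverted together, as in a relative stereo group.
    """
    candidates = []
    # Each group is either kept as drawn or inverted as a whole.
    for flips in product((False, True), repeat=len(groups)):
        variant = list(labels)
        for flip, group in zip(flips, groups):
            if flip:
                for idx in group:
                    variant[idx] = FLIP[variant[idx]]
        candidates.append("".join(variant))
    return min(candidates)


# Two drawings of the same mixture: one center in a single group.
print(canonical_labels(("R",), [(0,)]), canonical_labels(("S",), [(0,)]))
# Two centers in one group: RS and SR describe the same pair of isomers.
print(canonical_labels(("R", "S"), [(0, 1)]), canonical_labels(("S", "R"), [(0, 1)]))
# Without a stereo group the label is absolute and is left unchanged.
print(canonical_labels(("S",), []))
```

Since every drawing of the same mixture enumerates the same set of instances, taking the minimum gives a representative that does not depend on which instance was drawn. The cost grows as 2 to the power of the number of groups in this toy model.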