Source code for schrodinger.application.combinatorial_diversity.diversity_splitter

"""
Overview and Motivation
-----------------------
This module contains the DiversitySplitter class, which splits a potentially
large data set into roughly equal-sized chunks that occupy distinct regions
of chemical space. The procedure may be used to make the process of choosing
diverse structures via DiversitySelector more scalable. For example, use of
DiversitySelector to choose 10,000 diverse structures from a pool of 100,000
may take several hours due to the quadratic nature of the algorithm. However,
if the pool of 100,000 were divided into 8 chunks of 12,500, DiversitySelector
could be used to select 1,250 structures from each chunk of 12,500, with each
calculation taking only about 1/64 the time of selecting 10,000 in one shot.
While the combined diverse subsets of 1,250 will not be as diverse as if the
algorithm were applied to select 10,000, the diversity will be higher than
random selection. Furthermore, selections made from different chunks can be
distributed over different processors for additional speedup. And finally,
the time required to select larger numbers of diverse structures from larger
pools scales linearly in the number of diverse structures, so long as the
fraction being selected, such as 10%, remains fixed.
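
The 1/64 estimate is simply the quadratic cost ratio::

    (12,500 / 100,000)**2 = (1/8)**2 = 1/64

so the 8 chunked selections together take roughly 8/64 = 1/8 of the
single-shot time, even before they are distributed over processors.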

Splitting Chemical Space
------------------------
Chemical space is divided by first choosing a small number of diverse probe
structures from the pool using DiversitySelector. These N probes are used to
define an N-dimensional similarity space by computing the similarity of each
structure in the pool to each probe. For a pool of 100,000 structures, this
yields a data matrix with 100,000 rows and N columns. Principal components
analysis can be performed on this data matrix to obtain N orthogonal factors
that span directions of maximum variance. If, say, the first 2 factors are
retained, this divides the data space into 4 quadrants identified as '++',
'-+', '--' and '+-' to indicate the algebraic sign of the score on each of
the 2 factors. Below is an illustration of how 100,000 actual structures
divide over 4 quadrants derived from a PCA using 10 diverse probes, followed
by a brief sketch of the sign-based labeling. PC1 is the factor with the
largest eigenvalue and PC2 is the factor with the 2nd largest eigenvalue::

                   PC2
                    |
            20,036  |  32,154
                    |
                 -+ | ++
           ---------|---------PC1
                 -- | +-
                    |
            25,998  |  21,812
                    |
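
A minimal sketch of that labeling, using plain numpy rather than the
splitter_utils helpers (sims is a hypothetical array holding the similarity
of every structure to each probe)::

    import numpy as np

    # sims: (num_structures, num_probes) similarities to the diverse probes
    cormat = np.corrcoef(sims, rowvar=False)      # probe-probe correlations
    evalues, evectors = np.linalg.eigh(cormat)    # ascending eigenvalues
    top2 = evectors[:, ::-1][:, :2]               # PC1 and PC2
    zscores = (sims - sims.mean(axis=0)) / sims.std(axis=0)
    scores = zscores @ top2                       # factor scores
    labels = [''.join('+' if s >= 0 else '-' for s in row) for row in scores]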

The quadrants do not have equal populations, and as additional factors are
retained and we move to octants and then orthants, the disparities in the
populations become even greater. For example, in the case of 4 factors and
16 orthants, the populations range from 2,322 (--++) to 9,979 (----). To
achieve nearly equal populations, orthants are sorted by population and the
most populous orthant is combined with the least populous, the 2nd most
populous with the 2nd least populous, and so on (a short sketch of this
pairing scheme follows the populations below). In the above case of
quadrants, this yields the following orthant pairs and combined populations::

    (++, -+) --> 52,190
    (+-, --) --> 47,810
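
A minimal sketch of that pairing scheme in plain Python (orthant_rows is a
hypothetical dict mapping each sign string to its list of row numbers; the
actual logic lives in the splitter_utils machinery)::

    by_pop = sorted(orthant_rows, key=lambda o: len(orthant_rows[o]))
    pairs = {}
    while len(by_pop) > 1:
        small, big = by_pop.pop(0), by_pop.pop()  # least and most populous
        pairs[f'{big}|{small}'] = orthant_rows[big] + orthant_rows[small]
    if by_pop:  # an odd number of orthants leaves one unpaired
        leftover = by_pop.pop()
        pairs[leftover] = orthant_rows[leftover]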

If the goal is to obtain chunks of roughly 12,500 compounds, then we need
to go beyond 2 factors so that the combined populations are smaller. Using
4 factors, the following combined populations are obtained::

    (--++, ----) --> 12,301
    (+++-, -+-+) --> 12,100
    (++++, +-++) --> 13,528
    (++-+, ++--) --> 12,990
    (-+++, -++-) --> 12,356
    (-+--, --+-) --> 12,353
    (+-+-, ---+) --> 12,523
    (+--+, +---) --> 11,849

Increasing to 5 factors results in combined populations that are about half
as large as above, so we would use 4 factors to achieve the desired splitting.
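
One way to express that stopping rule (a sketch only; pair_orthants is a
hypothetical helper, and the actual criterion lives in the splitter_utils
machinery) is to keep adding factors while every combined population stays
at or above the minimum desired population::

    def choose_num_factors(scores, min_pop, max_factors):
        best = 1
        for nf in range(2, max_factors + 1):
            # pair_orthants: hypothetical helper that builds combined orthant
            # pairs from the signs of the first nf factor score columns
            pairs = pair_orthants(scores[:, :nf])
            if min(len(rows) for rows in pairs.values()) < min_pop:
                break
            best = nf
        return best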

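Example Usage
-------------
A minimal usage sketch; the fingerprint file name pool.fp and the output
file base name chunk are illustrative::

    splitter = DiversitySplitter('pool.fp')
    for pair, pop in zip(splitter.getOrthantPairs(),
                         splitter.getOrthantPopulations()):
        print(pair, pop)
    fp_files = splitter.splitFingerprints('chunk')
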
Copyright Schrodinger LLC, All Rights Reserved.
"""

import os

from schrodinger.application.combinatorial_diversity import diversity_selector
from schrodinger.application.combinatorial_diversity import splitter_utils
from schrodinger.infra import canvas

MIN_POP = 10000
NUM_PROBES = 10
RAND_SEED = diversity_selector.RAND_SEED


class DiversitySplitter:
    """
    Given a file of 32-bit fingerprints, this class selects a diverse set
    of probe structures and uses the similarities to the probes to divide
    the structures in the fingerprint file into distinct regions of
    chemical space of roughly equal populations.
    """
    def __init__(self, fp_file, min_pop=MIN_POP, num_probes=NUM_PROBES,
                 rand_seed=RAND_SEED):
        """
        Constructor taking the name of a 32-bit fingerprint file, the
        minimum desired population of each region of space, the number
        of diverse probes from which to create the space, and a random
        seed for the diversity algorithm used to select the probes.

        :param fp_file: Input file of 32-bit fingerprints and SMILES.
        :type fp_file: str

        :param min_pop: Minimum number of structures in each region.
        :type min_pop: int

        :param num_probes: Number of diverse probe structures.
        :type num_probes: int

        :param rand_seed: Random seed for underlying diversity algorithm.
        :type rand_seed: int
        """
        if not os.path.isfile(fp_file):
            raise FileNotFoundError(f'Fingerprint file "{fp_file}" not found')
        if num_probes < 2:
            raise ValueError('Number of diverse probes must be 2 or greater')
        self._fp_file = fp_file
        # Choose diverse probes and describe each structure in the pool by
        # its similarities to those probes.
        probes = splitter_utils.select_probes(fp_file, num_probes, rand_seed)
        sim_cols = splitter_utils.compute_sim_to_probes(fp_file, probes)
        # PCA on the probe similarity columns: diagonalize the correlation
        # matrix and project the similarities onto the eigenvectors.
        cormat = splitter_utils.create_sim_cormat(sim_cols)
        evalues, evectors = splitter_utils.diagonalize_symmat(cormat)
        scores = splitter_utils.compute_factor_scores(sim_cols, evectors)
        # Assign rows to paired orthants of roughly equal population.
        self._orthant_rows = splitter_utils.partition_scores(scores, min_pop)
    def getOrthantPairs(self):
        """
        Returns a list of strings that represent the orthant pairs over
        which the structures are divided. In the case of 4 factors, the
        returned list might look something like::

            ['++++|+++-', '++-+|+---', . . . '---+|----']

        :return: Orthant pair strings.
        :rtype: list(str)
        """
        return sorted(self._orthant_rows)
    def getOrthantPopulations(self):
        """
        Returns the number of structures associated with each orthant pair.
        These follow the same order as getOrthantPairs.

        :return: Orthant pair populations.
        :rtype: list(int)
        """
        return [len(self._orthant_rows[s]) for s in self.getOrthantPairs()]
    def getOrthantRows(self):
        """
        Returns the 0-based fingerprint row numbers associated with each
        orthant pair. These follow the same order as getOrthantPairs.

        :return: Orthant pair rows.
        :rtype: list(list(int))
        """
        return [self._orthant_rows[s] for s in self.getOrthantPairs()]
    def splitFingerprints(self, file_base, create_files=True):
        """
        Splits the input fingerprint file into a set of fingerprint files
        containing the rows associated with each orthant pair. File names
        will be <file_base>_1.fp, <file_base>_2.fp, etc., and the rows in
        those files will correspond to the rows returned by getOrthantRows.
        This function returns the fingerprint file names.

        Note that use of this function is not recommended for fingerprint
        files containing more than 5 million rows due to a buildup of
        memory within the ChmCustomOut32 objects that create the output
        fingerprint files. The situation is uniquely problematic because
        those objects all remain in scope, building up memory, throughout
        the entire course of this I/O operation.

        :param file_base: Base name of the fingerprint files to create.
        :type file_base: str

        :param create_files: If False, function will return fingerprint
            file names without actually creating the files.
        :type create_files: bool

        :return: Fingerprint file names.
        :rtype: list(str)
        """
        num_orthants = len(self._orthant_rows)
        fp_files = [f'{file_base}_{i + 1}.fp' for i in range(num_orthants)]
        if not create_files:
            return fp_files
        orthant_to_file = dict(zip(self.getOrthantPairs(), fp_files))
        fpin = canvas.ChmFPIn32(self._fp_file)
        typeinfo = fpin.getTypeInfo()
        # Open one output fingerprint file per orthant pair and map each
        # input row number to the output it belongs in.
        row_to_fpout = {}
        for orthant, rows in self._orthant_rows.items():
            fpout = canvas.ChmCustomOut32(typeinfo, True)
            fpout.open(orthant_to_file[orthant])
            for row in rows:
                row_to_fpout[row] = fpout
        # Stream the input fingerprints, writing each row to the output
        # file for its orthant pair.
        column_names = fpin.getExtraColumnNames()
        row = 0
        while fpin.hasNext():
            fp, title, props = fpin.nextExtra()
            extra_data = dict(zip(column_names, props))
            row_to_fpout[row].write(fp, title, extra_data)
            row += 1
        return fp_files