schrodinger.application.phase.packages.oned_task_utils module¶

Performs task-based work for the 1D similarity driver.

class schrodinger.application.phase.packages.oned_task_utils.BasicStatsAccumulator(property_name)¶

Bases: object

Accumulates basic statistics for a set of property values.

__init__(property_name)¶: Constructor taking the name of the property.

addValue(value)¶: Adds a numeric value to the series, updating the statistics.

property std_dev¶: Returns the standard deviation.
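How the statistics are accumulated is not documented here. As a rough illustration only, a streaming accumulator can be sketched with Welford's algorithm. The class name `RunningStats`, the method `add_value`, the property name `r_m_example` and the choice of a sample (n - 1) standard deviation are all assumptions made for this sketch, not the behavior of BasicStatsAccumulator.

```python
import math


class RunningStats:
    """Hypothetical stand-in for an accumulator of basic statistics."""

    def __init__(self, property_name):
        self.property_name = property_name
        self.count = 0
        self.min_value = math.inf
        self.max_value = -math.inf
        self._mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def add_value(self, value):
        """Add one numeric value and update the statistics in place."""
        self.count += 1
        self.min_value = min(self.min_value, value)
        self.max_value = max(self.max_value, value)
        delta = value - self._mean
        self._mean += delta / self.count
        self._m2 += delta * (value - self._mean)

    @property
    def mean(self):
        return self._mean

    @property
    def std_dev(self):
        """Sample standard deviation; 0.0 for fewer than two values (assumption)."""
        if self.count < 2:
            return 0.0
        return math.sqrt(self._m2 / (self.count - 1))


stats = RunningStats("r_m_example")  # hypothetical property name
for v in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0):
    stats.add_value(v)
```

Updating the mean and the sum of squared deviations per value avoids storing the whole series, which matters when a 1D data file holds many rows.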

schrodinger.application.phase.packages.oned_task_utils.base64_decode_fd(s)¶

Decodes feature definitions from a Base64 string. The returned list is empty if the string is empty.

Parameters: s (str) – Base64-encoded feature definition string
Returns: Feature definitions
Return type: list(phase.PhpFeatureDefinition)

schrodinger.application.phase.packages.oned_task_utils.base64_encode_fd(fd)¶

Encodes feature definitions as a Base64 string. The string is empty if fd is empty or None.

Parameters: fd (list(phase.PhpFeatureDefinition)) – Feature definitions
Returns: Base64-encoded feature definitions string
Return type: str
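The empty-input conventions of the two Base64 helpers above can be illustrated with the standard library alone. This sketch uses plain text lines as a stand-in for feature definitions. How phase.PhpFeatureDefinition objects are actually serialized is not documented here, so `encode_lines` and `decode_lines` are hypothetical names and the text format is an assumption.

```python
import base64


def encode_lines(lines):
    """Join text lines and Base64-encode them; empty or None input gives ''."""
    if not lines:
        return ""
    text = "\n".join(lines)
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def decode_lines(s):
    """Invert encode_lines; an empty string gives an empty list."""
    if not s:
        return []
    return base64.b64decode(s).decode("utf-8").split("\n")


definitions = ["feature A", "feature B"]  # placeholder text, not real Phase syntax
encoded = encode_lines(definitions)
decoded = decode_lines(encoded)
```

Base64 keeps the encoded value free of commas, quotes and line breaks, which makes it safe to pass as a single token.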

schrodinger.application.phase.packages.oned_task_utils.combine_oned_hits(hits_files_in, hits_file_out, query_row=None, sort=True, max_hits=1000, max_rows=1000000)¶

Combines a set of 1D hits files with or without sorting and capping.

Parameters

hits_files_in (list(str)) – List of compressed CSV hits files to combine
hits_file_out (str) – Output compressed CSV hits file
query_row (list(str)) – If supplied, this row is written before any hits
sort (bool) – Whether to write a sorted hits file
max_hits (int) – Cap on the number of sorted hits to output
max_rows (int) – Maximum number of sorted rows to hold in memory

Returns

Number of hits written

Return type

int
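The interplay of query_row, sort and max_hits can be sketched on in-memory rows. This is not the library's implementation: `combine_hits` is a hypothetical helper, the similarity is assumed to be the last column of each row, the cap is assumed to apply only when sorting, and the max_rows memory limit is not modeled.

```python
import heapq
from itertools import chain


def combine_hits(hit_lists, query_row=None, sort=True, max_hits=1000):
    """Combine lists of hit rows; the similarity is assumed to be the last column."""
    all_hits = chain.from_iterable(hit_lists)
    if sort:
        # nlargest keeps at most max_hits rows, ordered by decreasing similarity.
        hits = heapq.nlargest(max_hits, all_hits, key=lambda row: float(row[-1]))
    else:
        hits = list(all_hits)  # input order preserved, no cap applied (assumption)
    rows = ([query_row] if query_row is not None else []) + hits
    return rows, len(hits)


hits_a = [["CCO", "mol1", "0.91"], ["CCN", "mol2", "0.40"]]
hits_b = [["CCC", "mol3", "0.75"]]
rows, n_written = combine_hits(
    [hits_a, hits_b], query_row=["CO", "query", "1.0"], max_hits=2
)
# rows holds the query row, then mol1 and mol3; n_written is 2
```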

schrodinger.application.phase.packages.oned_task_utils.create_oned_data_file(structure_file, oned_data_file, treatment=0, fd=None, props=None, logger=None, progress_interval=10000)¶

Creates a 1D data file from the structures in a SMILES, SMILES-CSV, Maestro or SD file.

Parameters

structure_file (str) – Input file of structures
oned_data_file (str) – Destination 1D data file (.1dbin)
treatment (phase.OneDTreatment) – Structure treatment for 1D representations
fd (list(phase.PhpFeatureDefinition) or NoneType) – Overrides default feature definitions. Relevant only when treatment is phase.ONED_TREATMENT_PHARM.
props (list(str) or NoneType) – m2io-style properties to include in the 1D data file, other than SMILES and title. Not used when a SMILES file is supplied.
logger (logging.Logger or NoneType) – Logger for info level progress messages
progress_interval (int) – Interval between progress messages

Returns

Number of rows written to the 1D data file

Return type

int

schrodinger.application.phase.packages.oned_task_utils.describe_oned_data_file(oned_data_file, stats=False)¶

Returns a string containing a description of the supplied 1D data file.

Parameters

oned_data_file (str) – Name of the 1D data file (.1dbin)
stats (bool) – Whether to report basic statistics for any numeric properties in the 1D data file

Returns

String containing the description

Return type

str

schrodinger.application.phase.packages.oned_task_utils.export_oned_data_file(oned_data_file, output_file, subset=None)¶

Exports rows from a 1D data file to a compressed CSV file. A subset of rows may be specified as a string of comma-separated row ranges (e.g., ‘1:100,200:300’) or via a text file with a property name on the first line (e.g., ‘s_m_title’ or ‘s_sd_Vendor_ID’) and the values of that property on subsequent lines.

Parameters

oned_data_file (str) – Name of the 1D data file (.1dbin)
output_file (str) – Output compressed CSV file
subset (str) – A comma-separated list of row ranges or text file name

Returns

Number of rows exported

Return type

int

schrodinger.application.phase.packages.oned_task_utils.get_hits_file_names(jobname, nqueries)¶

Returns the names of the hits files the job will produce based on the number of query structures.

Parameters

jobname (str) – Job name
nqueries (int) – Number of query structures

Returns

Hits file names

Return type

list(str)

schrodinger.application.phase.packages.oned_task_utils.get_jobname(args)¶

Assigns the job name from SCHRODINGER_JOBNAME or from the base name of the appropriate input file.

Parameters: args (argparse.Namespace) – argparse.Namespace with command line arguments
Returns: Job name
Return type: str

schrodinger.application.phase.packages.oned_task_utils.get_master_oned_properties(filenames)¶

Returns the union of all properties in the provided 1D data files or compressed CSV hits files. The first three properties are always SMILES, NAME and ONED_REP_PROPERTY. If CSV hits files are supplied, the last property will always be ONED_SIM_PROPERTY. Any additional properties will appear after the first three properties. If multiple files are supplied, those additional properties will be sorted alphabetically.

Parameters: filenames (list(str)) – List of files to consider
Returns: Union of additional properties
Return type: list(str)
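The ordering rules described above can be sketched as follows. The column names in `FIXED` and `SIM` are placeholders for the real property constants, `master_properties` is a hypothetical helper, and keeping the file's own order when only one file is supplied is an assumption.

```python
FIXED = ["SMILES", "NAME", "ONED_REP"]  # placeholders for the three leading properties
SIM = "ONED_SIM"  # placeholder for the similarity property of hits files


def master_properties(per_file_properties, hits_files=False):
    """Union of properties: fixed columns first, extras next, similarity last."""
    reserved = set(FIXED) | {SIM}
    if len(per_file_properties) == 1:
        # Single file: keep the file's own order (assumption).
        extras = [p for p in per_file_properties[0] if p not in reserved]
    else:
        # Multiple files: additional properties are sorted alphabetically.
        extras = sorted(
            {p for props in per_file_properties for p in props if p not in reserved}
        )
    master = FIXED + extras
    if hits_files:
        master.append(SIM)
    return master


file_a = ["SMILES", "NAME", "ONED_REP", "r_m_weight"]
file_b = ["SMILES", "NAME", "ONED_REP", "i_m_count", "r_m_weight"]
merged = master_properties([file_a, file_b])
# merged == ["SMILES", "NAME", "ONED_REP", "i_m_count", "r_m_weight"]
```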

schrodinger.application.phase.packages.oned_task_utils.get_oned_data_file_attributes(oned_data_file)¶

Returns the version, structure treatment and feature definitions of the supplied 1D data file.

Parameters: oned_data_file (str) – Name of the 1D data file (.1dbin)
Returns: tuple of version, structure treatment and feature definitions
Return type: str, phase.OneDTreatment, list(phase.PhpFeatureDefinition)

schrodinger.application.phase.packages.oned_task_utils.get_oned_data_file_names(source, prefer_cwd=False)¶

Returns the name of the 1D data file(s) specified in source, which may be the name of a 1D data file or a list file containing the names of one or more 1D data files.

Parameters

source (str) – 1D data file source specification
prefer_cwd (bool) – If source is a list file, setting this to True forces the use of 1D data files in the CWD if they exist, even if they also exist at the locations specified in the list file. This addresses the situation where the list file contains absolute paths that exist on the job host, but the corresponding files have been copied to the job directory. In that case, only the files in the job directory should be accessed.

Returns

1D data file names

Return type

list(str)

schrodinger.application.phase.packages.oned_task_utils.get_oned_data_file_properties(oned_data_file)¶

Returns a list of the names of any additional properties stored in the supplied 1D data file. These are properties other than SMILES, NAME and the ONED_REP_PROPERTY. The list will be empty if no additional properties are stored.

Parameters: oned_data_file (str) – Name of the 1D data file (.1dbin)
Returns: list of additional property names
Return type: list(str)

schrodinger.application.phase.packages.oned_task_utils.get_oned_data_file_distribution(oned_data_files, nsub)¶

Given a list of 1D data files to screen and the number of subjobs over which the screen is to be distributed, this function determines how to divide the 1D data files over the subjobs. A list with nsub elements is returned, where a given element holds one or more 1D data file names and the (start, stop) row limits to screen in that file. For example, if 1D data files file1.1dbin and file2.1dbin are supplied, with 1200 and 1800 rows, respectively, and nsub is 3, this function would return the following:

    [[['file1.1dbin', (0, 1000)]],                                # subjob 1
     [['file1.1dbin', (1000, 1200)], ['file2.1dbin', (0, 800)]],  # subjob 2
     [['file2.1dbin', (800, 1800)]]]                              # subjob 3

Parameters

oned_data_files (list(str)) – List of 1D data files
nsub (int) – Number of subjobs

Returns

List of lists of file name, (start, stop) limits

Return type

list(list(str, (int, int)))
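The worked example above (1200 and 1800 rows over 3 subjobs) can be reproduced with a small sketch that takes the row counts directly instead of reading 1D data files. `distribute_rows` is a hypothetical helper, and giving the first subjobs one extra row when the total does not divide evenly is an assumption. The documentation does not say how a remainder is handled.

```python
def distribute_rows(row_counts, nsub):
    """row_counts: list of (file_name, n_rows) pairs.

    Returns one list per subjob of [file_name, (start, stop)] chunks.
    """
    total = sum(n for _, n in row_counts)
    base, extra = divmod(total, nsub)
    # Assumption: the first `extra` subjobs take one additional row each.
    quotas = [base + (1 if i < extra else 0) for i in range(nsub)]

    distribution = []
    files = iter(row_counts)
    name, nrows = next(files, (None, 0))
    start = 0
    for quota in quotas:
        chunks = []
        while quota > 0 and name is not None:
            take = min(quota, nrows - start)
            if take > 0:
                chunks.append([name, (start, start + take)])
                start += take
                quota -= take
            if start == nrows:
                # Current file is exhausted; move on to the next one.
                name, nrows = next(files, (None, 0))
                start = 0
        distribution.append(chunks)
    return distribution


result = distribute_rows([("file1.1dbin", 1200), ("file2.1dbin", 1800)], 3)
```

Each subjob receives a contiguous run of rows, so a subjob touches at most the tail of one file, any whole files in between, and the head of another.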

schrodinger.application.phase.packages.oned_task_utils.get_oned_data_file_splits(oned_data_file, prefix_out, nfiles)¶

Returns a list of output file names and (start, stop) limits for physically splitting a 1D data file into a number of smaller, equal-sized files. A given element of the returned list will be of the form:

<prefix_out>_<n>.1dbin, (start, stop)

where <prefix_out>_<n>.1dbin is the nth output file to create and start, stop are the corresponding row limits in oned_data_file, with stop being non-inclusive.

Parameters

oned_data_file (str) – The name of the 1D data file to be split
prefix_out (str) – Prefix for all output 1D data files
nfiles (int) – The number of output 1D data files to create

Returns

List of file names and (start, stop) limits

Return type

list(str, (int, int))
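A minimal sketch of how such limits might be computed from a known row count follows. `split_limits` is a hypothetical helper that takes the row count as an argument rather than reading the file, and assigning any remainder to the first files is an assumption.

```python
def split_limits(total_rows, prefix_out, nfiles):
    """Return [(file_name, (start, stop)), ...] with non-inclusive stop values."""
    base, extra = divmod(total_rows, nfiles)
    splits = []
    start = 0
    for n in range(1, nfiles + 1):
        # Assumption: the first `extra` files receive one additional row.
        stop = start + base + (1 if n <= extra else 0)
        splits.append((f"{prefix_out}_{n}.1dbin", (start, stop)))
        start = stop
    return splits


splits = split_limits(10, "part", 3)
# splits == [("part_1.1dbin", (0, 4)), ("part_2.1dbin", (4, 7)), ("part_3.1dbin", (7, 10))]
```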

schrodinger.application.phase.packages.oned_task_utils.get_oned_data_file_row_count(oned_data_file)¶

Returns the number of rows in the supplied 1D data file.

Parameters: oned_data_file (str) – Name of the 1D data file (.1dbin)
Returns: Number of rows in the file
Return type: int

schrodinger.application.phase.packages.oned_task_utils.get_oned_data_file_rows(oned_data_file, start=0, stop=None)¶

Generator that yields rows from the supplied 1D data file. Each row is a list of strings, where the first 3 elements are SMILES, name and 1D encoding, and any subsequent elements hold the values of additional properties stored in the 1D data file.

Parameters

oned_data_file (str) – Name of the 1D data file (.1dbin)
start (int) – 0-based starting row position
stop (int) – Upper limit on the rows to read. For example, if start=5 and stop=10, rows 5, 6, 7, 8 and 9 will be read. Reading proceeds until the end of the file by default.

Yields

The next row in the file

Yield type

list(str)
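The start and stop semantics match Python slicing: start is 0-based and stop is non-inclusive. The sketch below shows the same behavior on an arbitrary iterable of rows using itertools.islice. `iter_rows` and the generated row contents are hypothetical and do not read a .1dbin file.

```python
from itertools import islice


def iter_rows(rows, start=0, stop=None):
    """Lazily yield rows[start:stop] from any iterable of rows."""
    yield from islice(rows, start, stop)


# Twenty made-up rows of the form [SMILES, name, 1D encoding].
all_rows = ([f"C{i}", f"mol{i}", f"rep{i}"] for i in range(20))
selected = [row[1] for row in iter_rows(all_rows, start=5, stop=10)]
# selected == ["mol5", "mol6", "mol7", "mol8", "mol9"]
```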

schrodinger.application.phase.packages.oned_task_utils.get_oned_query(st_query, oned_data_file)¶

Returns a 1D representation for the provided query structure that’s created according to the attributes in the supplied 1D data file.

Parameters

st_query (structure.Structure) – The query structure
oned_data_file (str) – Name of the input 1D data file (.1dbin)

Returns

1D representation of the query structure

Return type

phase.OneDRep

schrodinger.application.phase.packages.oned_task_utils.get_oned_query_row(st_query, oned_query_base64, master_properties)¶

Returns a row for the query structure that can be written to the top of a hits file.

Parameters

st_query (structure.Structure) – The query structure
oned_query_base64 (str) – Base64-encoded 1D representation of the query structure
master_properties (list(str)) – The full list of properties being written to the hits file

Returns

Query structure row

Return type

list(str)

schrodinger.application.phase.packages.oned_task_utils.get_oned_query_rows(st_queries, oned_data_files)¶

Given a list of query structures and the 1D data files that were screened, this function returns a row for each query that can be supplied to combine_oned_hits to ensure that the query appears at the top of its associated hits file.

Parameters

st_queries (list(structure.Structure)) – Query structures
oned_data_files (list(str)) – Names of 1D data files that were screened

Returns

A row for each query

Return type

list(list(str))

schrodinger.application.phase.packages.oned_task_utils.get_oned_query_structures(query_file)¶

Reads query structures from the provided SMILES, SMILES-CSV, Maestro or SD file. Coordinates are not set in the case of SMILES or SMILES-CSV.

Parameters: query_file (str) – Structure file containing queries
Returns: Query structures
Return type: list(structure.Structure)

schrodinger.application.phase.packages.oned_task_utils.get_property_positions(master_properties, file_properties)¶

Returns the position of each master property in a potentially smaller list of properties from a particular file. If a master property is not found in file_properties, the position of that property will be len(file_properties). Thus if file_pos contains the positions returned by this function, and file_row contains the property values for some row in that file, the following code can be used to construct a master row of property values that contains an empty string ('') for each missing value:

    file_row.append('')
    master_row = [file_row[pos] for pos in file_pos]

Parameters

master_properties (list(str)) – Master list of properties
file_properties (list(str)) – The list of properties from the file

Returns

Positions of master_properties within file_properties

Return type

list of int
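The position mapping described above is simple enough to sketch in full. `property_positions` is a hypothetical reimplementation written from the description, not the library code, and the property names and values are made up.

```python
def property_positions(master_properties, file_properties):
    """Position of each master property in file_properties.

    Properties absent from the file map to len(file_properties).
    """
    index = {name: i for i, name in enumerate(file_properties)}
    missing = len(file_properties)
    return [index.get(name, missing) for name in master_properties]


master = ["SMILES", "NAME", "REP", "i_m_count", "r_m_weight"]
file_props = ["SMILES", "NAME", "REP", "r_m_weight"]
file_pos = property_positions(master, file_props)

file_row = ["CCO", "ethanol", "abc", "46.07"]
padded = file_row + [""]  # same effect as appending '', without mutating file_row
master_row = [padded[pos] for pos in file_pos]
# master_row == ["CCO", "ethanol", "abc", "", "46.07"]
```

Pointing every missing property at one extra trailing slot means a single list comprehension can build the master row without any per-property branching.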

schrodinger.application.phase.packages.oned_task_utils.get_rows_to_export(row_ranges)¶

Constructs a canvas.ChmBitset from a comma-separated list of row ranges (e.g., ‘1:100,200:300’). Input row numbers are 1-based and upper limits are inclusive. The returned bitset will have a logical size equal to the highest row number supplied, and the on positions will be 0-based. Note that the maximum logical size for a ChmBitset is 4,294,967,263, so this function assumes that users will not attempt to create individual 1D data files that would exceed the ChmBitset limit.

Parameters

row_ranges (str) – Comma-separated list of 1-based row ranges

Returns

Bitset with 0-based rows as the on positions

Return type

canvas.ChmBitset

Raise

ValueError if an illegal string of row ranges is supplied
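The conversion from 1-based inclusive ranges to 0-based positions can be sketched with a plain Python set standing in for canvas.ChmBitset. `parse_row_ranges` is a hypothetical helper. Accepting a bare row number such as '5' and the exact validity checks are assumptions, since the documentation only shows the 'first:last' form.

```python
def parse_row_ranges(row_ranges):
    """Return (size, positions) for a string such as '1:100,200:300'.

    size is the highest 1-based row number; positions are 0-based rows.
    """
    positions = set()
    size = 0
    for token in row_ranges.split(","):
        first, sep, last = token.partition(":")
        try:
            lo = int(first)
            hi = int(last) if sep else lo  # bare number = one row (assumption)
        except ValueError:
            raise ValueError(f"Illegal row range: '{token}'") from None
        if lo < 1 or hi < lo:
            raise ValueError(f"Illegal row range: '{token}'")
        # 1-based inclusive [lo, hi] becomes 0-based half-open [lo - 1, hi).
        positions.update(range(lo - 1, hi))
        size = max(size, hi)
    return size, positions


size, positions = parse_row_ranges("1:3,5:6")
# size == 6 and positions == {0, 1, 2, 4, 5}
```

A set is convenient for illustration but stores every selected row individually. A bitset is far more compact for large row counts.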

schrodinger.application.phase.packages.oned_task_utils.get_split_file_names(prefix, n)¶

Returns the names of the 1D data files that will be created in the ‘split’ task.

Parameters

prefix (str) – Prefix of output 1D data files
n (int) – Number of files to create

Returns

1D data file names

Return type

list(str)

schrodinger.application.phase.packages.oned_task_utils.get_structure_file_reader(structure_file)¶

Returns the appropriate reader for the supplied structure file, which is expected to be SMILES, SMILES-CSV, Maestro or SD.

Parameters: structure_file (str) – Input file of structures
Returns: Structure file reader
Return type: structure.SmilesReader, structure.SmilesCsvReader or structure.StructureReader
Raise: ValueError if the file format is illegal

schrodinger.application.phase.packages.oned_task_utils.get_values_to_match(oned_data_file, filename)¶

Given a 1D data file and a text file containing a property name followed by property values to match, this function returns the 0-based position of the specified property in the 1D data file and a set of the values to match.

Parameters

oned_data_file (str) – The name of the 1D data file (.1dbin)
filename (str) – The name of the text file with the property name and the values to match

Returns

0-based property position, followed by values to match

Return type

int, set(str)

Raise

ValueError if the property is not found in oned_data_file
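The text-file convention (property name on the first line, values on the following lines) can be sketched on a list of lines. `read_values_to_match` is a hypothetical helper that receives the file's property list as an argument instead of reading a .1dbin file. Skipping blank lines and the vendor IDs shown are assumptions.

```python
def read_values_to_match(lines, file_properties):
    """Return (0-based property position, set of values to match)."""
    stripped = [line.strip() for line in lines]
    if not stripped or not stripped[0]:
        raise ValueError("No property name found on the first line")
    property_name, values = stripped[0], stripped[1:]
    if property_name not in file_properties:
        raise ValueError(f"Property '{property_name}' not found")
    position = file_properties.index(property_name)
    # Blank lines are ignored (assumption).
    return position, {v for v in values if v}


text = "s_sd_Vendor_ID\nV001\nV042\n\n"
properties = ["SMILES", "NAME", "REP", "s_sd_Vendor_ID"]  # placeholder column names
position, values = read_values_to_match(text.splitlines(), properties)
# position == 3 and values == {"V001", "V042"}
```

Holding the values in a set gives constant-time membership tests while scanning rows.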

schrodinger.application.phase.packages.oned_task_utils.is_oned_data_file(filename)¶

Returns True if the supplied file name corresponds to a 1D data file.

Parameters: filename (str) – The name of the file
Returns: Whether the name corresponds to a 1D data file
Return type: bool

schrodinger.application.phase.packages.oned_task_utils.merge_oned_data_files(oned_data_files_in, oned_data_file_out, remove=False)¶

Merges a list of 1D data files, creating an output file with a master set of properties.

Parameters

oned_data_files_in (list(str)) – List of 1D data files to merge
oned_data_file_out (str) – Output 1D data file
remove (bool) – If True, the input 1D data files are removed after the merge

Returns

Total number of rows merged

Return type

int

schrodinger.application.phase.packages.oned_task_utils.merge_oned_hits_files(hits_files_in, hits_file_out, query_row=None)¶

Merges a list of compressed CSV hits files, creating an output file with a master set of properties.

Parameters

hits_files_in (list(str)) – List of hits files to merge
hits_file_out (str) – Output hits file
query_row (list(str)) – If supplied, this row is written before any hits. It should be obtained by calling get_oned_query_rows.

Returns

Number of merged hits written

Return type

int

schrodinger.application.phase.packages.oned_task_utils.run_oned_screen(st_query, oned_data_file, hits_file, start=0, stop=None, write_query_row=False, sort=True, max_hits=1000, max_rows=1000000, min_sim=0.0, logger=None, progress_interval=100000)¶

Performs a 1D similarity screen with a single structure query and writes hits to a compressed CSV file.

Parameters

st_query (structure.Structure) – The query structure
oned_data_file (str) – Name of the input 1D data file (.1dbin)
hits_file (str) – Name of the output compressed CSV file (.csv.gz)
start (int) – 0-based starting row in oned_data_file
stop (int) – Upper limit on the rows to screen. For example, if start=5 and stop=10, rows 5, 6, 7, 8 and 9 will be screened. Screening proceeds until the end of the file by default.
write_query_row (bool) – Whether to write the 1D query as the first row
sort (bool) – Whether to write a sorted hits file
max_hits (int) – Cap on the number of sorted hits to write
max_rows (int) – Maximum number of sorted rows to hold in memory
min_sim (float) – Write only hits whose similarity to the query is greater than or equal to this value
logger (logging.Logger or NoneType) – Logger for info level progress messages
progress_interval (int) – Interval between progress messages

Returns

Number of hits written

Return type

int

schrodinger.application.phase.packages.oned_task_utils.split_structure_file(structure_file, prefix_out, nfiles)¶

Splits a structure file into a number of smaller, equal-sized files named <prefix_out>_1.<ext>, <prefix_out>_2.<ext>, etc., where <ext> will be ‘smi.gz’, ‘csv.gz’, ‘maegz’ or ‘sdfgz’, depending on the type of file supplied.

Parameters

structure_file (str) – The name of the structure file to be split
prefix_out (str) – Prefix for all output structure files
nfiles (int) – The number of output structure files to create

Returns

The names of the files created

Return type

list(str)

Raise

ValueError if the file format is illegal

schrodinger.application.phase.packages.oned_task_utils.split_oned_data_file(oned_data_file, prefix_out, nfiles)¶

Splits a 1D data file into a number of smaller, equal-sized files named <prefix_out>_1.1dbin, <prefix_out>_2.1dbin, etc.

Parameters

oned_data_file (str) – The name of the 1D data file to be split
prefix_out (str) – Prefix for all output 1D data files
nfiles (int) – The number of output 1D data files to create

Returns

The names of the files created

Return type

list(str)

schrodinger.application.phase.packages.oned_task_utils.write_oned_data_file_row_count(oned_data_file, row_count)¶

Appends the total number of rows to a 1D data file.

Parameters

oned_data_file (str) – Name of the 1D data file (.1dbin)
row_count (int) – Total number of rows in file