I submitted a Glide docking job to our cluster a couple of weeks ago, and we just had a system failure that unmounted our NFS disks and shut down the cluster's queuing system. Is there a way to restart the job?

Distributed Glide jobs can be restarted at a coarse level: that is, incomplete subjobs have to be started from the beginning, but any completed subjobs do not have to be rerun. For a given Glide job with multiple subjobs, first check to see what state Job Control thinks the job is in:

$SCHRODINGER/jobcontrol -list -c JobId

where JobId is the Schrodinger Job ID of the Glide job, visible in the Monitor panel or near the top of the jobname.log file. You may find that some subjobs are in 'stranded' status, which happens when Job Control on the launch machine loses track of the superintending Job Control processes for the backends on the compute nodes. If any subjobs are listed as 'running' when you know they are not, try

$SCHRODINGER/jobcontrol -ping -c JobId

to have Job Control refresh their statuses. Next, try

$SCHRODINGER/jobcontrol -recover -c JobId

to see if Job Control can recover any files from the compute nodes. It could be that the Glide backends on the compute nodes continued running and were able to produce pv or lib files.
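
The three Job Control steps above can be chained in a small wrapper. This is only a sketch: the function name cleanup_glide_job and the JOBCONTROL override variable are made up for illustration, and it assumes that jobcontrol signals failure with a nonzero exit status. Only the -list, -ping, and -recover options shown above are used.

```shell
#!/bin/sh
# Sketch: run the Job Control cleanup steps for one Glide job.
# cleanup_glide_job and JOBCONTROL are hypothetical names, not part of
# the Schrodinger distribution.

cleanup_glide_job() {
    job_id="$1"
    # Default to the real utility; JOBCONTROL lets you point at another path.
    jc="${JOBCONTROL:-$SCHRODINGER/jobcontrol}"

    if [ -z "$job_id" ]; then
        echo "usage: cleanup_glide_job JobId" >&2
        return 2
    fi

    # 1. Show what state Job Control thinks the job and its subjobs are in.
    # 2. Ask Job Control to refresh the statuses of 'running' subjobs.
    # 3. Try to recover any output files left on the compute nodes.
    for step in -list -ping -recover; do
        # Assumption: a nonzero exit status means the step failed.
        "$jc" "$step" -c "$job_id" ||
            echo "warning: jobcontrol $step did not succeed" >&2
    done

    # 4. List again so you can confirm the final statuses by eye.
    "$jc" -list -c "$job_id" ||
        echo "warning: final jobcontrol -list did not succeed" >&2
}
```

After sourcing the file, call it as cleanup_glide_job JobId, and check the final listing before attempting a restart.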

Once the job has been cleaned up from Job Control's perspective and the main Glide driver job is in 'died' status, you can try to restart the job.


Glide will rerun any 'died' or 'killed' subjobs, plus any subjobs that didn't get run in the original job, and then combine all the old and new results together.

