
Re: [bgl-discuss] Something going on today? Jobs aren't terminating.



>>>>> "Susan" == Susan Coghlan <smc@xxxxxxxxxxx> writes:

  Susan> Hi Chad,

  Susan> Something has gone wrong with the scheduler.  I'm not quite
  Susan> certain how to fix the problem.

Actually, the queue manager had stopped receiving process completion
requests, which is why jobs appeared to get stuck. We should go through
some diagnostics tomorrow. I am looking into why this happened, and I
can also add code to minimize the impact.

  Susan> Everyone,

  Susan> Please use mpirun with -partition in the old way for your
  Susan> jobs until we can fix the scheduler problem.  Use the
  Susan> partitions in the top midplane.  Pick one that is not in use.
  Susan> Use 'bgl-listblocks --all | grep R001_J' to get the name of
  Susan> the partitions available and pick one with a status of 'F'.

Everything is back up and running now. I have run a test job through
the system, and everything worked fine. You can resume using the
queuing system.
 -nld

--------------------------------------------------------------------
To add or remove yourself from this mailing list, use the 'notifyme'
command on any BGL machine. To remove: notifyme -n, to add: notifyme -y.