
Re: [bgl-discuss] big-o-jobs



The scheduler went into a hung state last night.  My guess is that it had
to do with Mark's 32-node job, which died after two nodes ran out of memory.

Narayan is looking into the problem but suspects that it has to do with a
check we put in yesterday for partitions left in a bad state.  That check
absolutely needs to be there, otherwise we end up in a situation like
yesterday where the scheduler kept trying to schedule jobs on a partition
left in a bad state, causing all of those jobs to fail.

Susan.


On Tue, 15 Mar 2005, Ray Loy wrote:

>
> mark,
>
> i also had jobs remain in the queue without running.
> something must be wrong.
>
> ray
>
> - --------------------------------------------------------------------
> To add or remove yourself from this mailing list, use the 'notifyme'
> command on any BGL machine. To remove: notifyme -n, to add: notifyme -y.
>
>
