RESCOMP Archives

February 2006

RESCOMP@LISTSERV.MIAMIOH.EDU

Subject:
From: David Woods <[log in to unmask]>
Reply To: Research Computing Support <[log in to unmask]>, David Woods <[log in to unmask]>
Date: Mon, 13 Feb 2006 07:58:37 -0500
Content-Type: text/plain
Parts/Attachments: text/plain (67 lines)

I think that cancelling the training sessions should only be done if
absolutely necessary.  There are several people who have registered for
these sessions, and cancelling them might send a bad message.  If needed, we
can limit the exercises that are done in the session.  Right now, I plan
to cover the following:
- login
- load and unload modules
- run a batch job that uses shell commands like sleep and date
- compile a simple hello world C or FORTRAN program
- compile a hello world MPI program and run it as a batch job


Dave

-----Original Message-----
From: Research Computing Support [mailto:[log in to unmask]] On
Behalf Of Jaime E. Combariza
Sent: Saturday, February 11, 2006 10:37 PM
To: [log in to unmask]
Subject: FYI, cluster status

FYI:

We are facing some problems with the cluster which need to be fixed as
soon as possible. This means that the machine will be unavailable for some
periods of time (4 - 12 hours is possible).

1 - The head node has conflicts with the OpenManage software (Dell
software that runs on the head node and monitors hardware). We can
'uninstall' this package but we have not gotten an 'okay' from Dell. Dell
has been notified, is consulting with their support team, and is
supposed to get back to us as soon as possible. If this is not done by
Monday noon, I will contact Dell one more time.

2 - InfiniBand drivers need to be upgraded (this is to be done after the
OS on the head node has been fixed). Cisco will do this remotely.

3 - Ibrix software needs to be upgraded. This may take some time, possibly
up to one day. Tests are being conducted on one of the compute nodes.
Ibrix will do this remotely.

This means that we may not be able to offer the first two tutorial
sessions. I hope to have more information Monday morning, before
cancelling the sessions on the 16th and 17th.

The main problem is that it is very easy to crash the head node, and if
this happens the tutorial session will be incomplete.

Possible solutions (not tested with several users on the system):

1 - Do not run any OpenMP jobs.
2 - Do NOT kill any running jobs on the head node. This may not be easy
to enforce. I have killed several serial jobs and the head node did not
crash, but I do not trust this node.
3 - Compile and run tests interactively on the compute nodes. However,
batch jobs still need to be submitted from the head node.


-- 
Jaime E. Combariza
Assistant Director Research Computing
Academic Technology Services
[log in to unmask]
(513) 529-5080
Miami University
Oxford, Ohio 45056
