Date: Mon, 18 Dec 2006 08:52:48 -0500
Henry:
Two things:
1 - At the end of your script you have a "~". Please delete it. The
shell expands it to your home directory and tries to run that as a
command, which is what produces the "/home/wanx: is a directory" message.
2 - A signal 7 (Bus error) is usually a memory error (bad memory
access). It was raised by process 70, which was possibly in rack 2.
Since this error is not consistent, it is difficult to say whether the
program is trying to access more memory than is available or whether
there is a memory problem on some nodes. I am cc'ing the rescomp group
so they can run some tests.
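The message in item 1 can be reproduced outside PBS. The following is only a minimal sketch: it assumes bash is installed and that the job script is interpreted by bash (the wording of the error suggests this, but it is an assumption). The helper name `run_lone_tilde` is hypothetical, and HOME is pointed at a scratch directory so nothing in a real home directory is touched.

```python
import os
import shutil
import subprocess
import tempfile


def run_lone_tilde():
    """Run a one-word script consisting of "~"; return (exit_code, stderr)."""
    bash = shutil.which("bash")
    if bash is None:
        return None  # bash is not available, so there is nothing to show
    with tempfile.TemporaryDirectory() as fake_home:
        # Point HOME at a scratch directory so "~" expands to it.
        env = dict(os.environ, HOME=fake_home, LC_ALL="C")
        proc = subprocess.run(
            [bash, "-c", "~"], env=env, capture_output=True, text=True
        )
        return proc.returncode, proc.stderr.strip()


if __name__ == "__main__":
    # bash expands "~" to HOME and then tries to execute that directory.
    print(run_lone_tilde())
```

The exit code should be non-zero and the error text should name the expanded directory, much like the "line 14" message in the job output.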
Frank, could you check whether we are having memory problems on some
nodes? I am guessing rack 2. Henry was running a parallel job with 128
processes. Most of the nodes on rack one were empty, so it probably
used nodes c-1-1 to c2-? (c1-2 and possibly c1-1 were being used).
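To narrow down which task to look at when this happens again, the mpiexec warnings can be filtered for the task that did not simply die with signal 9. This is a rough sketch, not a site tool: it assumes the warning lines always have the form shown in Henry's output, and that the signal 9 entries are the remaining tasks being killed after one task fails. The names `expand_ids` and `first_failures` are made up for illustration.

```python
import re

# Matches lines such as:
#   mpiexec: Warning: task 70 died with signal 7 (Bus error).
WARNING_RE = re.compile(
    r"tasks? (?P<ids>[\d,\-]+) died with signal (?P<sig>\d+) \((?P<name>[^)]*)\)"
)


def expand_ids(spec):
    """Expand a spec like "0-69,71-127" into a list of task numbers."""
    ids = []
    for part in spec.split(","):
        lo, _, hi = part.partition("-")
        ids.extend(range(int(lo), int(hi or lo) + 1))
    return ids


def first_failures(log_text, kill_signal=9):
    """Return (task, signal, name) for tasks that did not die from kill_signal."""
    found = []
    for m in WARNING_RE.finditer(log_text):
        sig = int(m.group("sig"))
        if sig != kill_signal:
            name = m.group("name")
            found.extend((t, sig, name) for t in expand_ids(m.group("ids")))
    return found


if __name__ == "__main__":
    sample = (
        "mpiexec: Warning: tasks 0-69,71-127 died with signal 9 (Killed).\n"
        "mpiexec: Warning: task 70 died with signal 7 (Bus error).\n"
    )
    print(first_failures(sample))
```

Mapping the reported task number to a node still requires the job's node list, which is not part of this output.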
Thanks
At 11:33 AM 12/16/2006, you wrote:
>Dear Jaime:
>
>Most of the time my jobs go through, but sometimes I get the
>following errors:
>
>=============================================================
>mpiexec: Warning: tasks 0-69,71-127 died with signal 9 (Killed).
>mpiexec: Warning: task 70 died with signal 7 (Bus error).
>
>real 0m2.310s
>user 0m0.007s
>sys 0m0.031s
>/var/spool/PBS/mom_priv/jobs/20969.mulnx37.SC: line 14: /home/wanx: is a directory
>
>=============================================================
>
>mpiexec: Warning: tasks 0-67,69-127 died with signal 9 (Killed).
>mpiexec: Warning: task 68 died with signal 7 (Bus error).
>
>real 0m5.974s
>user 0m0.009s
>sys 0m0.050s
>/var/spool/PBS/mom_priv/jobs/20971.mulnx37.SC: line 14: /home/wanx: is a directory
>=============================================================
>
>I am not sure whether this is common or not.
>
>Best,
>
>-Henry
_______
Jaime E. Combariza, Ph.D.
Assistant Director Research Computing
http://www.muohio.edu/researchcomputing
Miami University
(513) 529-5080