Date: Fri, 17 Feb 2006 16:59:23 -0500
Hi Robin:
From the host list, I found that both 4-2 and 4-19 are down.
Also, when I try to launch the simple cpi job in /home/xudong/hpl/bin/Linux_ATHLON_VSIPL like this:
/usr/local/topspin/mpi/mpich/bin/mpirun_ssh -np 81 -paramfile paramfile -hostfile list1 ./cpi
I see an error message like this:
6] Abort: Got an asynchronous event: VAPI_PORT_ERROR (VAPI_EV_SYNDROME_NONE) at line 362 in file mpid/vapi/viainit.c
This usually indicates a bad InfiniBand card or a problem inside the InfiniBand switch.
I have a list of bad nodes from my 80-node test, and it has kept growing since then. This is an incomplete list:
3-1
3-2
3-7
3-8
3-9
3-15
3-19
3-21
3-22
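
As a stopgap while the hardware is being checked, the nodes above could be dropped from the hostfile before launching mpirun_ssh. The following is only a rough sketch, not something that has been run on the cluster. It assumes the hostfile (list1) has one hostname per line and that each name ends in the rack-node id, for example "compute-3-1" or plain "3-1". The function names and the output file name are made up for illustration.

```python
# Node ids copied from the incomplete bad-node list above.
# 4-2 and 4-19 (reported down) could be added here as well.
BAD_IDS = ["3-1", "3-2", "3-7", "3-8", "3-9", "3-15", "3-19", "3-21", "3-22"]


def filter_hosts(hosts, bad_ids):
    """Return the hostnames whose trailing rack-node id is not in bad_ids."""
    bad = set(bad_ids)
    good = []
    for host in hosts:
        name = host.strip()
        if not name:
            continue  # skip blank lines
        # Assumption: drop any domain suffix, then take the last two
        # dash-separated fields as the id ("compute-3-1" -> "3-1").
        short = name.split(".")[0]
        node_id = "-".join(short.split("-")[-2:])
        if node_id not in bad:
            good.append(name)
    return good


def write_clean_hostfile(src, dst, bad_ids):
    """Read hostfile src, write the surviving hosts to dst, return the count."""
    with open(src) as f:
        good = filter_hosts(f.readlines(), bad_ids)
    with open(dst, "w") as f:
        f.write("\n".join(good) + "\n")
    return len(good)
```

Calling write_clean_hostfile("list1", "list1.clean", BAD_IDS) would write the reduced list, and the returned count could be used for -np. This only works around the failures and does not tell us whether the cards or the switch are at fault.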
The bad host list is really too big, so I think something may be wrong in the switch. We need to make sure
all the InfiniBand cards operate correctly before we go on to any Linpack testing.
thanks
xudong