Hi Robin:
If you mean:
Port State of Port #1 on HCA #0 ........ DOWN
This is normal; port 1 on the HCA is usually unused.
However, this message:
Port State of Port #0 on HCA #0 ........ UP 4X
is different from what I see. I have:
Port State of Port #0 on HCA #0 ........ UP
There is no 4X there. I am not sure whether that is normal.
Also, I want you to know that I have created a host list at /home/xudong/hpl/bin/Linux_ATHLON_VSIPL.
You can run linpack like this:
/usr/local/topspin/mpi/mpich/bin/mpirun_ssh -np 200 -paramfile paramfile -hostfile list1 ./xhpl
Of course, before you run it, you need to modify the HPL.dat file so that Ps * Qs = 200.
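
One way to pick the process grid is to list the factor pairs of the process count and start from the most nearly square one with P no larger than Q, which is the shape the HPL tuning notes generally favor. The sketch below is illustrative only: the function name `hpl_grids` is hypothetical, and the best grid for this cluster still has to be found by testing.

```python
import math


def hpl_grids(nprocs):
    """Return all (P, Q) pairs with P * Q == nprocs and P <= Q,
    ordered from most square to least square."""
    pairs = [(p, nprocs // p)
             for p in range(1, math.isqrt(nprocs) + 1)
             if nprocs % p == 0]
    # A smaller Q - P difference means a more nearly square grid.
    return sorted(pairs, key=lambda pq: pq[1] - pq[0])


if __name__ == "__main__":
    # 200 is the process count used in the mpirun_ssh command above.
    for p, q in hpl_grids(200):
        print(f"Ps = {p:3d}  Qs = {q:3d}")
```

For 200 processes the first candidate listed is the 10 x 20 grid.
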
Also, for quick testing, I have set Ns to 500; this number is too small for a real performance measurement.
We need to increase it to something like 50000 for real testing.
I have run the 36-CPU linpack job; it came out at 2.1 Gflops with Ns = 500. It will go up if we set Ns to
50000.
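
To see why Ns = 500 is too small, it helps to estimate the operation count and the matrix size. The sketch below assumes the operation count that HPL reports against, 2/3 N^3 + 3/2 N^2, and 8 bytes per double-precision matrix entry. The 2.1 Gflops rate is the 36-CPU result quoted above; the helper names are hypothetical.

```python
def hpl_flops(n):
    """Floating-point operations HPL credits for an n x n solve."""
    return (2.0 / 3.0) * n ** 3 + 1.5 * n ** 2


def matrix_bytes(n):
    """Storage for an n x n matrix of 8-byte doubles."""
    return 8 * n * n


if __name__ == "__main__":
    rate = 2.1e9  # flops per second, from the 36-CPU run with Ns = 500
    for n in (500, 50000):
        flops = hpl_flops(n)
        print(f"N = {n:6d}: {flops:.3e} flops, "
              f"{matrix_bytes(n) / 1e9:.3f} GB matrix, "
              f"{flops / rate:.2f} s at 2.1 Gflops")
```

At N = 500 the whole solve is about 8.4e7 operations, which at 2.1 Gflops takes roughly 0.04 seconds, so startup and communication overhead likely dominate the measurement. At N = 50000 the matrix is about 20 GB in total, or about 100 MB per process when spread over 200 processes.
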
thanks
xudong
-----Original Message-----
From: Robin
Sent: Friday, February 17, 2006 5:21 PM
To: Research Computing Support; Xudong Yu
Cc: Madhusudan; Patrick Sutton
Subject: Re: update 1-41830023 Infiniband linpack testing on Miami
University
All,
It might be the module that Steve Cruz was talking about.
Steve Cruz mentioned that one of the ports in the switch is bad. It
does seem to be the entire module that is bad.
How can we get this done as soon as possible?
I'll re-flash those nodes' HCAs and see what happens.
UP 4X (whatever that means).
---- Performing InfiniBand HCA Self Test ----
Number of HCAs Detected ................ 1
PCI Device Check ....................... PASS
Kernel Arch ............................ ia32e
Host Driver Version .................... rhel3-2.4.21-32.EL-3.2.0-67
Host Driver RPM Check .................. PASS
HCA Type of HCA #0 ..................... CougarCub
HCA Firmware on HCA #0 ................. v3.3.5 build 3.2.0.67 HCA.CougarCub.A1
HCA Firmware Check on HCA #0 ........... PASS
Host Driver Initialization ............. PASS
Number of HCA Ports Active ............. 1
Port State of Port #0 on HCA #0 ........ UP 4X
Port State of Port #1 on HCA #0 ........ DOWN
Error Counter Check .................... PASS
Kernel Syslog Check .................... PASS
Node GUID .............................. 00:05:ad:00:00:04:8c:b0
------------------ DONE ---------------------
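
When many nodes have to be checked, the port-state lines of a self-test report like the one above can be pulled out by script. The sketch below assumes the report text has already been captured into a string and that its lines have the same layout as the output shown here; the function name `port_states` is hypothetical, and how the report is collected from each node is left out.

```python
import re

# Matches lines such as:
#   Port State of Port #0 on HCA #0 ........ UP 4X
PORT_LINE = re.compile(
    r"Port State of Port #(\d+) on HCA #(\d+)\s*\.+\s*(.+)")


def port_states(report):
    """Map (hca, port) -> state string for every port line in the report."""
    states = {}
    for line in report.splitlines():
        match = PORT_LINE.search(line)
        if match:
            port = int(match.group(1))
            hca = int(match.group(2))
            states[(hca, port)] = match.group(3).strip()
    return states


# Sample lines copied from the self-test output above.
SAMPLE = """\
Number of HCA Ports Active ............. 1
Port State of Port #0 on HCA #0 ........ UP 4X
Port State of Port #1 on HCA #0 ........ DOWN
"""

if __name__ == "__main__":
    for (hca, port), state in sorted(port_states(SAMPLE).items()):
        print(f"HCA {hca} port {port}: {state}")
```

A node whose port 0 is not reported as UP would be a candidate for the bad-node list.
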
Thanks,
Robin
On Feb 17, 2006, at 4:59 PM, Xudong Yu wrote:
> Hi Robin:
> From the host list, I found that both 4-2 and 4-19 are down.
> Also, when I try to launch the simple cpi job at
> /home/xudong/hpl/bin/Linux_ATHLON_VSIPL like:
>
> /usr/local/topspin/mpi/mpich/bin/mpirun_ssh -np 81 -paramfile paramfile -hostfile list1 ./cpi
>
> I have seen an error message like:
>
> [6] Abort: Got an asynchronous event: VAPI_PORT_ERROR
> (VAPI_EV_SYNDROME_NONE) at line 362 in file mpid/vapi/viainit.c
>
> This usually indicates a bad InfiniBand card or something
> wrong inside the InfiniBand switch.
>
> I have a list of bad nodes from my 80-node testing, and the
> list has kept growing since then. This is an incomplete list:
>
> 3-1
> 3-2
> 3-7
> 3-8
> 3-9
> 3-15
> 3-19
> 3-21
> 3-22
> The bad host list is really too big, so I think something may be
> wrong in the switch. We need to make sure
> all the InfiniBand cards operate well before we go on to any linpack
> testing.
>
> thanks
>
> xudong