Date: | Mon, 20 Feb 2006 14:34:28 +0530 |
Content-Type: | TEXT/PLAIN |
Hi Robin,
Can you kindly let me know where this stands?
In case you don't know already, the "4%" is actually "4X", i.e. 4 times "X",
where "X" is the 2.5 Gbit/s lane rate from the IB specs, as against some of
the older cards, which were 1X.
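
To make the arithmetic concrete, here is a minimal sketch of the link-width calculation. The function names are made up for illustration, and the 8b/10b encoding overhead is a detail not mentioned above; it applies to the original single-data-rate links, which is what a 2.5 Gbit/s lane implies.

```python
# Per-lane signaling rate for the 2.5 Gbit/s ("X") InfiniBand lane, in Gbit/s.
LANE_GBPS = 2.5


def link_rate_gbps(width: int, lane_gbps: float = LANE_GBPS) -> float:
    """Raw signaling rate of a link that aggregates `width` lanes."""
    return width * lane_gbps


def data_rate_gbps(width: int, lane_gbps: float = LANE_GBPS) -> float:
    """Usable data rate, assuming 8b/10b encoding (8 data bits per 10 line bits)."""
    return link_rate_gbps(width, lane_gbps) * 8 / 10


if __name__ == "__main__":
    for width in (1, 4):
        print(f"{width}X: {link_rate_gbps(width)} Gbit/s signaling, "
              f"{data_rate_gbps(width)} Gbit/s data")
```

So a port reported as "UP 4X" is running at four lanes, 10 Gbit/s signaling, versus 2.5 Gbit/s on a 1X card.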
Standing by.
Thanks & best regards,
--
;Madhu
+91 9945180001 / +91 80 41369300
--
On Fri, 17 Feb 2006, Robin wrote:
> All,
>
> It might be the module that Steve Cruz was talking about.
> Steve Cruz mentioned that one of the ports in the switch is bad. It does seem
> to be the entire module that is bad.
>
> How can we get this done as soon as possible?
>
> I'll re-flash those nodes' HCAs and see what happens.
> Up 4% (whatever that means).
>
> ---- Performing InfiniBand HCA Self Test ----
> Number of HCAs Detected ................ 1
> PCI Device Check ....................... PASS
> Kernel Arch ............................ ia32e
> Host Driver Version .................... rhel3-2.4.21-32.EL-3.2.0-67
> Host Driver RPM Check .................. PASS
> HCA Type of HCA #0 ..................... CougarCub
> HCA Firmware on HCA #0 ................. v3.3.5 build 3.2.0.67
> HCA.CougarCub.A1
> HCA Firmware Check on HCA #0 ........... PASS
> Host Driver Initialization ............. PASS
> Number of HCA Ports Active ............. 1
> Port State of Port #0 on HCA #0 ........ UP 4X
> Port State of Port #1 on HCA #0 ........ DOWN
> Error Counter Check .................... PASS
> Kernel Syslog Check .................... PASS
> Node GUID .............................. 00:05:ad:00:00:04:8c:b0
> ------------------ DONE ---------------------
>
>
> Thanks,
> Robin
>
>
> On Feb 17, 2006, at 4:59 PM, Xudong Yu wrote:
>
>> Hi Robin:
>> From the host list, I found that both 4-2 and 4-19 are down.
>> Also, when I try to launch the simple cpi job at
>> /home/xudong/hpl/bin/Linux_ATHLON_VSIPL like:
>>
>> /usr/local/topspin/mpi/mpich/bin/mpirun_ssh -np 81 -paramfile paramfile
>> -hostfile list1 ./cpi
>>
>> I see an error message like:
>>
>> 6] Abort: Got an asynchronous event: VAPI_PORT_ERROR
>> (VAPI_EV_SYNDROME_NONE) at line 362 in file mpid/vapi/viainit.c
>>
>> This usually indicates a bad InfiniBand card or something wrong inside
>> the InfiniBand switch.
>>
>> I have a list of bad nodes from my 80-node testing, and the list has been
>> growing since then. This is an incomplete list:
>>
>> 3-1
>> 3-2
>> 3-7
>> 3-8
>> 3-9
>> 3-15
>> 3-19
>> 3-21
>> 3-22
>> The bad host list is really too big, so I think something may be wrong
>> in the switch. We need to make sure that all the InfiniBand cards
>> operate well before we go on to any Linpack testing.
>>
>> thanks
>>
>> xudong
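
Since the bad-node list keeps growing, the self-test report quoted above could be checked by script rather than by eye. The following is only a sketch: the `summarize` function and the regular expressions are hypothetical, written against the one report shown in this thread, and the `FAIL` keyword is an assumption (the report here shows only `PASS`).

```python
import re

# Matches lines such as "Port State of Port #0 on HCA #0 ........ UP 4X".
PORT_LINE = re.compile(r"Port State of Port #(\d+) on HCA #(\d+)\s*\.+\s*(\S.*)$")
# Matches "Some Check ........ PASS" lines; "FAIL" is an assumed keyword.
CHECK_LINE = re.compile(r"^(.*?)\s*\.{3,}\s*(PASS|FAIL)\s*$")

SAMPLE = """\
> PCI Device Check ....................... PASS
> Port State of Port #0 on HCA #0 ........ UP 4X
> Port State of Port #1 on HCA #0 ........ DOWN
> Error Counter Check .................... PASS
"""


def summarize(report: str):
    """Return ({(hca, port): state}, [names of failed checks]) for one report."""
    ports = {}
    failed = []
    for raw in report.splitlines():
        # Drop any leading e-mail quote markers and surrounding whitespace.
        line = raw.lstrip("> ").strip()
        port = PORT_LINE.search(line)
        if port:
            ports[(int(port.group(2)), int(port.group(1)))] = port.group(3).strip()
            continue
        check = CHECK_LINE.match(line)
        if check and check.group(2) == "FAIL":
            failed.append(check.group(1).strip())
    return ports, failed


if __name__ == "__main__":
    ports, failed = summarize(SAMPLE)
    for (hca, port), state in sorted(ports.items()):
        print(f"HCA {hca} port {port}: {state}")
    print("failed checks:", failed)
```

Running this over the report from each node would give one port-state table for the whole cluster, which should make it easier to see whether the bad nodes share a switch module.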