RESCOMP Archives

February 2006

RESCOMP@LISTSERV.MIAMIOH.EDU

Subject:
From: Madhusudan <[log in to unmask]>
Reply-To: Research Computing Support <[log in to unmask]>, Madhusudan <[log in to unmask]>
Date: Mon, 20 Feb 2006 14:34:28 +0530
Content-Type: TEXT/PLAIN

Hi Robin,

Can you kindly let me know where this stands?

In case you don't know already, the "4%" is actually "4X", i.e. 4 times "X",
where "X" is 2.5 Gbit/s (per the IB spec), as against some of the older cards,
which were 1X.
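
As a rough check of that arithmetic, here is a minimal Python sketch. The 2.5 Gbit/s per-lane figure is the one quoted above; the 8b/10b encoding overhead is an added assumption about these first-generation (SDR) links, and the function name is only illustrative.

```python
# Per-lane signalling rate quoted above for these IB links, in Gbit/s.
LANE_RATE_GBPS = 2.5
# Assumption: SDR links use 8b/10b encoding (8 data bits per 10 wire bits).
ENCODING_EFFICIENCY = 8 / 10


def link_rates(width: int) -> tuple[float, float]:
    """Return (signalling rate, usable data rate) in Gbit/s for `width` lanes."""
    raw = width * LANE_RATE_GBPS
    return raw, raw * ENCODING_EFFICIENCY


# Compare an older 1X card with the 4X port reported by the self test.
for width in (1, 4):
    raw, data = link_rates(width)
    print(f"{width}X: {raw:g} Gbit/s signalling, {data:g} Gbit/s data")
```

So a port reported as "UP 4X" has trained at the full four-lane width rather than falling back to a single lane.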

Standing by.

Thanks & best regards,
-- 
;Madhu
+91 9945180001 / +91 80 41369300
--

On Fri, 17 Feb 2006, Robin wrote:

> All,
>
> It might be the module that Steve Cruz was talking about.
> Steve Cruz mentioned that one of the ports in the switch is bad. It does seem
> to be the entire module that is bad.
>
> How can we get this done as soon as possible?
>
> I'll re-flash those nodes' HCAs and see what happens.
> Up 4% (what ever that means).
>
> ---- Performing InfiniBand HCA Self Test ----
> Number of HCAs Detected ................ 1
> PCI Device Check ....................... PASS
> Kernel Arch ............................ ia32e
> Host Driver Version .................... rhel3-2.4.21-32.EL-3.2.0-67
> Host Driver RPM Check .................. PASS
> HCA Type of HCA #0 ..................... CougarCub
> HCA Firmware on HCA #0 ................. v3.3.5 build 3.2.0.67 HCA.CougarCub.A1
> HCA Firmware Check on HCA #0 ........... PASS
> Host Driver Initialization ............. PASS
> Number of HCA Ports Active ............. 1
> Port State of Port #0 on HCA #0 ........ UP 4X
> Port State of Port #1 on HCA #0 ........ DOWN
> Error Counter Check .................... PASS
> Kernel Syslog Check .................... PASS
> Node GUID .............................. 00:05:ad:00:00:04:8c:b0
> ------------------ DONE ---------------------
>
>
> Thanks,
> Robin
>
>
> On Feb 17, 2006, at 4:59 PM, Xudong Yu wrote:
>
>> Hi Robin:
>> 	From the host list, I found that both 4-2 and 4-19 are down.
>> 	Also, when I try to launch the simple cpi job in
>> /home/xudong/hpl/bin/Linux_ATHLON_VSIPL like this:
>>
>> /usr/local/topspin/mpi/mpich/bin/mpirun_ssh -np 81 -paramfile paramfile -hostfile list1 ./cpi
>>
>> 	I see an error message like this:
>>
>> 6] Abort: Got an asynchronous event: VAPI_PORT_ERROR (VAPI_EV_SYNDROME_NONE) at line 362 in file mpid/vapi/viainit.c
>>
>> 	This usually indicates a bad InfiniBand card or a problem inside the
>> InfiniBand switch.
>>
>> 	I have a list of the bad nodes from my 80-node testing, and the list
>> has kept growing since then; this is an incomplete list:
>> 
>> 3-1
>> 3-2
>> 3-7
>> 3-8
>> 3-9
>> 3-15
>> 3-19
>> 3-21
>> 3-22
>> 	The bad host list is really too big, so I think something may be
>> wrong in the switch. We need to make sure all the InfiniBand cards
>> operate well before we go on to any Linpack testing.
>> 
>> thanks
>> 
>> xudong
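
With this many nodes to check, the port lines of the self-test report quoted above could be screened by a script instead of by eye. The following is a minimal Python sketch, assuming the report keeps the "Port State of Port #N on HCA #M ... STATE" layout shown in the thread; the function name and sample text are illustrative, not part of any vendor tool.

```python
import re

# Matches lines such as "Port State of Port #0 on HCA #0 ........ UP 4X".
PORT_LINE = re.compile(
    r"Port State of Port #(\d+) on HCA #(\d+)\s*\.+\s*(\S.*)$"
)


def port_states(report: str) -> dict[tuple[int, int], str]:
    """Map (hca, port) to the reported state string."""
    states = {}
    for line in report.splitlines():
        # Drop any leading "> " email quoting and trailing whitespace.
        line = line.lstrip("> \t").rstrip()
        match = PORT_LINE.search(line)
        if match:
            port, hca, state = match.groups()
            states[(int(hca), int(port))] = state.strip()
    return states


# Sample lines copied from the report quoted in this thread.
SAMPLE = """\
> Number of HCA Ports Active ............. 1
> Port State of Port #0 on HCA #0 ........ UP 4X
> Port State of Port #1 on HCA #0 ........ DOWN
"""

for (hca, port), state in sorted(port_states(SAMPLE).items()):
    flag = "ok" if state == "UP 4X" else "check"
    print(f"HCA {hca} port {port}: {state} ({flag})")
```

Whether a DOWN second port is a real fault depends on how the node is cabled, which the thread does not say, so the "check" flag is only a prompt to look at that port.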
