RESCOMP Archives

February 2006

RESCOMP@LISTSERV.MIAMIOH.EDU

Subject:
From:
Reply To:
Research Computing Support <[log in to unmask]>, Robin <[log in to unmask]>
Date:
Sat, 18 Feb 2006 10:20:28 -0500
Content-Type:
text/plain
Parts/Attachments:
text/plain (132 lines)
Thanks, Xudong.
That has been very helpful.

Robin

Xudong Yu wrote:

>Hi Robin:
>	If you mean:
>
>Port State of Port #1 on HCA #0 ........ DOWN
>
>	This is normal; port 1 on the HCA is usually unused.
>	However, this message:
>
>Port State of Port #0 on HCA #0 ........ UP 4X
>
>	is different from what I have. I have
>
>Port State of Port #0 on HCA #0 ........ UP
>
>	I don't have 4X there. I am not sure if it's normal.
>
>	Also, I want you to know that I have created a host list in /home/xudong/hpl/bin/Linux_ATHLON_VSIPL.
>You can run linpack like this:
>
>	/usr/local/topspin/mpi/mpich/bin/mpirun_ssh -np 200 -paramfile paramfile -hostfile list1 ./xhpl
>
>	Of course, before you run it, you need to modify the HPL.dat file so that Ps * Qs = 200.
>
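A minimal sketch of how the Ps and Qs values could be picked for a given -np count. It lists every P x Q process grid whose product equals the process count. The function name `grid_shapes` is hypothetical, and preferring the most nearly square grid with P <= Q is a common HPL rule of thumb, not something stated in this thread.

```python
def grid_shapes(nprocs):
    """Return all (P, Q) pairs with P * Q == nprocs and P <= Q."""
    shapes = []
    p = 1
    while p * p <= nprocs:
        if nprocs % p == 0:
            shapes.append((p, nprocs // p))
        p += 1
    return shapes


# 200 matches the -np value in the mpirun_ssh command above.
shapes = grid_shapes(200)

# The last pair has the largest P, so it is the most nearly square grid.
closest_to_square = shapes[-1]

for p, q in shapes:
    print(f"Ps = {p:3d}  Qs = {q:3d}")
```

For 200 processes the candidates are 1 x 200, 2 x 100, 4 x 50, 5 x 40, 8 x 25 and 10 x 20, so 10 x 20 would be the usual starting point.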
>	Also, for quick testing, I have set Ns to 500. This number is too small for a real performance measurement;
>we need to increase it to something like 50000 for real testing.
>
>	I have run a 36-CPU linpack job; it came out at 2.1 Gflops with Ns set to 500. It will go up if we set Ns to
>50000.
>
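A rough sketch of why Ns = 500 is too small. It estimates the matrix size and the amount of arithmetic for the two Ns values mentioned above, assuming a dense double-precision matrix (8 bytes per element) and the usual LU operation count of about 2/3 N^3 + 3/2 N^2. The function names are hypothetical, and the 2.1 Gflops figure is simply the rate reported for the 36-CPU run.

```python
def matrix_bytes(n):
    """Memory for an n x n matrix of 8-byte doubles."""
    return 8 * n * n


def lu_flops(n):
    """Approximate floating-point operations for solving an n x n system."""
    return (2.0 / 3.0) * n**3 + 1.5 * n**2


REPORTED_GFLOPS = 2.1  # rate reported above for Ns = 500 on 36 CPUs

for n in (500, 50000):
    mem_mb = matrix_bytes(n) / 1e6
    flops = lu_flops(n)
    # Time the arithmetic alone would take at the reported rate.
    seconds = flops / (REPORTED_GFLOPS * 1e9)
    print(f"Ns = {n:6d}: matrix = {mem_mb:10.1f} MB, "
          f"flops = {flops:.3e}, time at 2.1 Gflops = {seconds:.2f} s")
```

At Ns = 500 the whole matrix is only about 2 MB and the arithmetic takes a small fraction of a second, so startup and communication dominate the measurement. At Ns = 50000 the matrix is about 20 GB in total, which has to fit in the combined memory of the nodes in the run.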
>thanks
>
>xudong
>
>
>
>-----Original Message-----
>From: Robin [mailto:[log in to unmask]]
>Sent: Friday, February 17, 2006 5:21 PM
>To: Research Computing Support; Xudong Yu
>Cc: [log in to unmask]; Madhusudan; [log in to unmask]; Patrick Sutton
>Subject: Re: update 1-41830023 Infiniband linpack testing on Miami University
>
>
>All,
>
>It might be the module that Steve Cruz was talking about.
>Steve Cruz mentioned that one of the ports in the switch is bad. It
>does seem to be the entire module that is bad.
>
>How can we get this done as soon as possible?
>
>I'll re-flash those nodes' HCAs and see what happens.
>The port shows UP 4X (whatever that means).
>
>---- Performing InfiniBand HCA Self Test ----
>Number of HCAs Detected ................ 1
>PCI Device Check ....................... PASS
>Kernel Arch ............................ ia32e
>Host Driver Version .................... rhel3-2.4.21-32.EL-3.2.0-67
>Host Driver RPM Check .................. PASS
>HCA Type of HCA #0 ..................... CougarCub
>HCA Firmware on HCA #0 ................. v3.3.5 build 3.2.0.67 HCA.CougarCub.A1
>HCA Firmware Check on HCA #0 ........... PASS
>Host Driver Initialization ............. PASS
>Number of HCA Ports Active ............. 1
>Port State of Port #0 on HCA #0 ........ UP 4X
>Port State of Port #1 on HCA #0 ........ DOWN
>Error Counter Check .................... PASS
>Kernel Syslog Check .................... PASS
>Node GUID .............................. 00:05:ad:00:00:04:8c:b0
>------------------ DONE ---------------------
>
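Checking this report by eye on every node gets tedious, so here is a small sketch that pulls the port states out of a self-test report like the one above. The line format is taken from that output; the function name `port_states` is hypothetical, and how the report text is collected from each node is left out.

```python
import re

# Matches lines such as:
#   Port State of Port #0 on HCA #0 ........ UP 4X
PORT_LINE = re.compile(r"Port State of Port #(\d+) on HCA #(\d+) \.+ (\S.*)")


def port_states(report):
    """Map (hca, port) to the state string found in a self-test report."""
    states = {}
    for line in report.splitlines():
        match = PORT_LINE.search(line)
        if match:
            port, hca, state = match.groups()
            states[(int(hca), int(port))] = state.strip()
    return states


SAMPLE = """\
Number of HCA Ports Active ............. 1
Port State of Port #0 on HCA #0 ........ UP 4X
Port State of Port #1 on HCA #0 ........ DOWN
Error Counter Check .................... PASS
"""

states = port_states(SAMPLE)

# Port 0 is the one in use here, so it is the one worth flagging.
port0_up = states.get((0, 0), "").startswith("UP")
print(states, port0_up)
```

A node whose port 0 does not report UP would be a candidate for the bad-node list.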
>
>Thanks,
>Robin
>
>
>On Feb 17, 2006, at 4:59 PM, Xudong Yu wrote:
>
>  
>
>>Hi Robin:
>>	From the host list, I found that both 4-2 and 4-19 are down.
>>	Also, when I try to launch the simple cpi job in /home/xudong/hpl/bin/Linux_ATHLON_VSIPL like this:
>>
>>/usr/local/topspin/mpi/mpich/bin/mpirun_ssh -np 81 -paramfile paramfile -hostfile list1 ./cpi
>>
>>	I see an error message like this:
>>
>>6] Abort: Got an asynchronous event: VAPI_PORT_ERROR (VAPI_EV_SYNDROME_NONE) at line 362 in file mpid/vapi/viainit.c
>>
>>	This usually indicates a bad InfiniBand card or something wrong inside the InfiniBand switch.
>>
>>	I have a list of the bad nodes from my 80-node testing, and the list has been growing since then. This is an incomplete list:
>>
>>3-1
>>3-2
>>3-7
>>3-8
>>3-9
>>3-15
>>3-19
>>3-21
>>3-22
>>	The bad host list is really too big, so I think something may be wrong in the switch. We need to make sure
>>all the InfiniBand cards operate well before we go on to any linpack testing.
>>
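Until the switch is sorted out, the known-bad nodes could be dropped from the host list before each run. The sketch below assumes the host file holds one node name per line, in the same form as the names listed above; that layout and the function name `usable_hosts` are assumptions, not details given in this thread.

```python
# Nodes reported bad or down in the message above (an incomplete list).
BAD_NODES = {
    "3-1", "3-2", "3-7", "3-8", "3-9", "3-15", "3-19", "3-21", "3-22",
    "4-2", "4-19",
}


def usable_hosts(hostfile_text, bad_nodes):
    """Return host names in their original order, skipping bad nodes and blank lines."""
    hosts = [line.strip() for line in hostfile_text.splitlines() if line.strip()]
    return [host for host in hosts if host not in bad_nodes]


# Hypothetical host file contents, for illustration only.
SAMPLE_HOSTFILE = """\
3-1
3-3
3-7
4-1
4-2
"""

good = usable_hosts(SAMPLE_HOSTFILE, BAD_NODES)
print("\n".join(good))
```

The number of hosts that remain limits the -np value, and Ps * Qs in HPL.dat has to be changed to match it.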
>>thanks
>>
>>xudong
>>
>
>  
>
