RESCOMP Archives

March 2006

RESCOMP@LISTSERV.MIAMIOH.EDU

Subject:
From:
Reply To:
Research Computing Support <[log in to unmask]>, Robin <[log in to unmask]>
Date:
Sun, 5 Mar 2006 20:40:06 -0500
Stephen E. Wright wrote:

> This time I linked with -limg and -lm in that order.  Also, given 
> Robin's comments on environment, I qsub'd the job with -V.  The error 
> is now:

Steve, the original error message went away because I made changes so that
the default modules (including Intel 9's LD_LIBRARY_PATH) are also loaded
during a non-interactive SSH session. Coincidentally, shortly after my
change, you submitted the job with qsub '-V'.

This is really an issue of mpirun_ssh not passing env variables on. For now
it is circumvented by loading the default libs (per Modules) whenever
mpirun_ssh starts a remote shell. We are still discussing how to make
mpirun_ssh pass env variables in a way that is consistent w/ the Modules
setup.
We have Modules, and in general there's a need for the flexibility to use a
different lib (not the default); not just Intel's. As such,
mpirun_ssh needs a wrapper to pass env variables as needed.
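
To make the wrapper idea concrete, here is a minimal sketch, not the actual mpirun_ssh.wrap.sh. It assumes bash on every node and a launcher file that all ranks can read (true for a single-node job, or with a shared filesystem). The helper name make_env_launcher, the file name with_env.sh, and the choice of forwarding only PATH and LD_LIBRARY_PATH are all hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: write a small launcher script that re-exports selected
# environment variables and then exec's whatever command it is given.
make_env_launcher() {
    local launcher=$1
    local name
    {
        echo '#!/bin/bash'
        for name in PATH LD_LIBRARY_PATH; do
            # %q quotes the current value so bash can safely re-read it later
            printf 'export %s=%q\n' "$name" "${!name}"
        done
        # Replace the launcher process with the real program and its arguments
        echo 'exec "$@"'
    } > "$launcher"
    chmod +x "$launcher"
}

# Intended use inside a job script (assumption; not executed here):
#   make_env_launcher ./with_env.sh
#   mpirun_ssh -np 2 -hostfile $PBS_NODEFILE ./with_env.sh ./mnm3mpi
```

The launcher captures the environment at submission time, so a non-default module loaded before the call would be carried along without changing the remote shell's startup files.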

You can remove the ~/.Xauthority file. In any case, it shouldn't keep the
job from running.
The real error here that prevents the job from running is that it segfaulted.
Not sure if it's worth a try: did you leave out '#!/bin/bash -l' (typo?) on
the first line? Note the '-l' option; it's recommended almost all the
time. It sets up your 'Module' environment.
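
A small demonstration of what '-l' changes, as a sketch only: a login shell reads the profile files, a plain shell does not. The throwaway HOME directory and the variable DEMO_MODULE_VAR are made up for illustration; on the cluster the profile would be whatever initializes Modules.

```shell
#!/bin/bash
# Build a throwaway HOME whose .bash_profile sets a marker variable,
# standing in for the profile code that sets up the Modules environment.
demo_home=$(mktemp -d)
echo 'export DEMO_MODULE_VAR=loaded' > "$demo_home/.bash_profile"

# Plain non-interactive shell: profile files are not read.
plain=$(HOME=$demo_home bash -c 'echo "${DEMO_MODULE_VAR:-unset}"')

# Login shell (-l): /etc/profile and ~/.bash_profile are read first.
# tail keeps only our line in case the system profile prints something.
login=$(HOME=$demo_home bash -l -c 'echo "${DEMO_MODULE_VAR:-unset}"' 2>/dev/null | tail -n 1)

echo "without -l: $plain"
echo "with -l:    $login"
rm -rf "$demo_home"
```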

Thanks,
Robin


>> ++ mpirun_ssh -np 2 -hostfile /var/spool/PBS/aux/749.mulnx31 ./mnm3mpi
>> /usr/bin/X11/xauth:  error in locking authority file /home/wrightse/.Xauthority
>> accept: Resource temporarily unavailable
>> Terminating processes..
>> read: Success
>> /var/spool/PBS/mom_priv/jobs/749.mulnx31.SC: line 14: 18537 Segmentation fault      mpirun_ssh -np 2 -hostfile $PBS_NODEFILE ./mnm3mpi
>
>
> The executable doesn't involve graphics or displays, so why is there 
> an Xauth problem?
>
> -- Steve
>
> At 04:43 PM 3/5/2006, Robin wrote:
>
>> > I tried some parallel code over the weekend, but ran into some problems.
>> >
>> > 1. Suggestion:  'mpicc' isn't understood by my makefile; 'which' identifies
>> > it as an alias to 'mpicc.i', which does work in a makefile.  Maybe modules
>> > could make this transparent.
>>
>> I will probably discuss this one w/ everyone later.
>> I believe that we ought to create a symlink mpicc -> mpicc.i,
>> rather than using an alias.
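
Why a symlink helps where an alias does not, as a rough sketch: an alias lives only in the shell that defined it, while a symlink is found by any child process (such as the /bin/sh that make runs) through PATH. The stub file, the scratch directory, and the alias name mpicc_alias below are stand-ins, not the real Topspin install.

```shell
#!/bin/bash
# Scratch directory with a stub 'mpicc.i' and a symlink 'mpicc' next to it.
demo_dir=$(mktemp -d)
printf '#!/bin/sh\necho "compiler stub"\n' > "$demo_dir/mpicc.i"
chmod +x "$demo_dir/mpicc.i"
ln -s mpicc.i "$demo_dir/mpicc"

# An alias defined here is not inherited by child shells.
alias mpicc_alias="$demo_dir/mpicc.i"
via_alias=$(sh -c 'command -v mpicc_alias' || echo "not found")

# The symlink is resolved by an ordinary PATH lookup in the child shell.
via_symlink=$(PATH=$demo_dir:$PATH sh -c 'command -v mpicc')

echo "alias in child shell:   $via_alias"
echo "symlink in child shell: $via_symlink"
rm -rf "$demo_dir"
```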
>>
>> > 2. Suggestion:  Loading mpi-topspin via 'modules' should add the mpich
>> > include-path to the INCLUDE environment variable, so the user doesn't have
>> > to hunt down its location.
>>
>>
>>
>> > 3. Problem:  I tried to run the following PBS job:
>>
>> mpirun_ssh uses SSH behind the scenes.
>>
>> SSH does not pass environment variables.
>>
>> That is, if you do 'module add mrbayes', your shell's PATH will include
>> /software/mrbayes/mrbayes-3.1.2. But if you do ssh c1-1 'echo $PATH', the
>> compute node (like all other compute nodes, naturally) does not have
>> '/software/mrbayes/mrbayes-3.1.2' on its PATH by default (w/o you
>> explicitly passing it in).
>>
>> We have 2 choices:
>> 1) Ask users to pass in all the appropriate env variables
>> 2) Use a wrapper script to pass in all the needed details.
>>    Yes.. other labs have wrapper scripts (or they could ask the
>>    users to pass it on themselves), etc.
>>
>>    OSC uses mpiexec (as a wrapper script). I've written
>>    /usr/local/topspin/mpi/mpich/bin/mpirun_ssh.wrap.sh, just to see
>>    if this is workable. If that works and others , we
>>    can deploy the wrapper script as mpirun_ssh.
>>
>>    On a default Linux setup, PATH almost never includes the current directory.
>>    For your own compiled apps (not managed via module, and thus
>>    PATH), you either need to specify
>>
>>
>>
>> >>./mnm3mpi: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory
>>
>> I just modified (over the weekend) the startup script so that the bash
>> shell will also load all the *default* modules.
>>
>>
>> >
>> >>#PBS -l nodes=1:ppn=2
>> >>#PBS -l walltime=06:00:00
>> >>#PBS -j oe
>> >>set -x
>> >>date
>> >>cd $TMPDIR
>> >>
>> >># copy the executable
>> >>cp ${HOME}/mnm/mnm3mpi .
>> >>cp ${HOME}/mnm/interval .
>> >>
>> >>ls
>> >>
>> >># run the executable
>> >>mpirun_ssh -np 2 -hostfile $PBS_NODEFILE mnm3mpi > trial-0000.out
>> >>
>> >>date
>> >>
>> >># copy the output
>> >>cp ./trial-0000.out ${HOME}/mnm
>> >
>> > Here's the output file showing an error:
>> >
>> >>++ date
>> >>Sun Mar  5 14:18:54 EST 2006
>> >>++ cd /tmp/pbs.738.mulnx31
>> >>++ cp /home/wrightse/mnm/mnm3mpi .
>> >>++ cp /home/wrightse/mnm/interval .
>> >>++ ls
>> >>interval
>> >>mnm3mpi
>> >>++ mpirun_ssh -np 2 -hostfile /var/spool/PBS/aux/738.mulnx31 mnm3mpi
>> >>/usr/bin/env: mnm3mpi/usr/bin/env: mnm3mpi: No such file or directory
>> >>: No such file or directory
>> >>++ date
>> >>Sun Mar  5 14:18:54 EST 2006
>> >>++ cp ./trial-0000.out /home/wrightse/mnm
>> >
>> > Suspecting that the path didn't include the '.' directory, I changed the
>> > mpirun command to use './mnm3mpi' rather than 'mnm3mpi', which gave me the
>> > following error instead:
>> >
>> >>++ date
>> >>Sun Mar  5 14:22:40 EST 2006
>> >>++ cd /tmp/pbs.739.mulnx31
>> >>++ cp /home/wrightse/mnm/mnm3mpi .
>> >>++ cp /home/wrightse/mnm/interval .
>> >>++ ls
>> >>interval
>> >>mnm3mpi
>> >>++ mpirun_ssh -np 2 -hostfile /var/spool/PBS/aux/739.mulnx31 ./mnm3mpi
>> >>./mnm3mpi: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory
>> >>./mnm3mpi: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory
>> >>++ date
>> >>Sun Mar  5 14:22:40 EST 2006
>> >>++ cp ./trial-0000.out /home/wrightse/mnm
>> >
>> > Any ideas what's going wrong here?
>> >
>> > Steve
>> >
>
>
> _________________________________________
> Stephen E. Wright, Associate Professor
> Department of Mathematics & Statistics
> Miami University, Oxford, OH  45056
> ph: (513) 529-1837 , fax: (513) 529-1493
