Chapter 13. Email Import


Email Import

The objective of this document is to assist system administrators with installing dCache at Tier 2 sites and configuring it to work with the LHC Computing Grid.

Hi, just looking for the opinion of others to see if we should report this. In the case of

srmcp gsiftp://gridftpp.gridka.de:2811//etc/group \
      srm://dcache.gridpp.rl.ac.uk:8443//pnfs/gridpp.rl.ac.uk/data/dteam/exabyteasdfa

our SRM does this:

Wed May 04 14:45:21 BST 2005:  srm returned requestId = -2147135830
Wed May 04 14:45:21 BST 2005: sleeping 1 seconds ...
Wed May 04 14:45:22 BST 2005: sleeping 4 seconds ...
Wed May 04 14:45:27 BST 2005: sleeping 4 seconds ...
Wed May 04 14:45:31 BST 2005: sleeping 4 seconds ...

The reason it is retrying is that the transfer fails because I am not authorised to use gridftpp.gridka.de at all.

Does that not count as a permanent error? Should the SRM just give up and stop trying? Steve

The FTP server and the SRM use the same grid map file. As dCache integrates a Java Grid FTP server, there is no other authorisation going on.

srmcp by default does pull when doing an SRM copy.

From the error message in the logs, the destination SRM (in this case one of the dCache pools) is trying to access the source Grid FTP server and failing because Steve (actually it's me in this particular case) isn't in the grid-map file of the source Grid FTP server:

05/04 12:55:48 Cell(dcache.gridpp.rl.ac.uk_1@dcachexDomain) : org.globus.ftp.exception.ServerException: Server refused performing the request. Custom message: User authorisation failed. (error code 1) [Nested exception message:  Custom message: Unexpected reply: 530 No local mapping for Globus ID].  Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException:  Custom message: Unexpected reply: 530 No local mapping for Globus ID

When the SRM gets this error it doesn't consider it fatal:

05/04 12:55:48 Cell(RemoteGsiftpTransferManager@srmDomain) : [id=30148 store src=gsiftp://gridftpp.gridka.de/etc/group dest=///pnfs/gridpp.rl.ac.uk/data/dteam/dr35-fzk-20050504-01]: sending error reply, reply code=8 errorObject=tranfer failed :org.globus.ftp.exception.ServerException: Server refused performing the request. Custom message: User authorization failed. (error code 1) [Nested exception message:  Custom message: Unexpected reply: 530 No local mapping for Globus ID] [Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException:  Custom message: Unexpected reply: 530 No local mapping for Globus ID] for id=30148 store src=gsiftp://gridftpp.gridka.de/etc/group dest=///pnfs/gridpp.rl.ac.uk/data/dteam/dr35-fzk-20050504-01

And periodically retries the transfer.

Hi Matt, are the disks SCSI? You could recompile the kernel and add in the options:

#
# Some SCSI devices (e.g. CD jukebox) support multiple LUNs
#
# CONFIG_SCSI_MULTI_LUN=y
CONFIG_SCSI_CONSTANTS=y
# CONFIG_SCSI_LOGGING is not set

Using the SL Linux config file it should be easy to add in this option while keeping the rest of the kernel configuration identical.

The reason we wish to have 2 pools on each node is that each node has to serve 2 separate raid arrays. As there's no way to merge these 2 raids into one volume, we require 2 pool processes. Also, as we would like to connect our SE to both the production grid and UK Light, it would be useful if we could make intelligent use of the two separate connections on each pool node (such as assigning a Grid FTP door to each).

As for our RAID problem: apparently Red Hat (and therefore, it seems, SL) doesn't support LUNs, which our RAID controllers use to mark the RAID partitions. Any ideas how to get round this? The tech support from our supplier suggested we might have to recompile the kernel, and as you can imagine we'd like to avoid having to do this.

(steve-sorry if I'm repeating what Brian said but thought it useful to post to the entire list).

cheers, matt

I've lost the mail about you suggesting that SL change their default kernel, but I doubt they will do this, nor should they in my opinion. It is important that people get as close as possible to Red Hat in this area: a bug-for-bug match. Steve

Hello, I have several dCache installation questions that I hope you guys can answer. Forgive me for sounding a bit rambly, but I'm just looking for intelligent input.

i) What do I change to have 2 different dCache pool processes on one node? I'm using the YAIM method of installation. I can see that I can identify the second pool process by pointing it to a different pnfs file system (for example pool-node:/pnfs1/ pool-node:/pnfs2/ in the site-info.def), but how do you start these 2 different processes using YAIM to set things up? Or do I have to get my hands dirty and use something other than the YAIM install method?

One dCache pool process manages all the dCache pools on a node. Check the /opt/d-cache/config/<hostname>.poollist file; it should have one line for each pool (each entry is a single long line):

csfnfs39_1  /exportstage/cms-data24//pool  sticky=allowed recover-space recover-control recover-anyway lfs=precious tag.hostname=csfnfs39.rl.ac.uk
csfnfs39_2  /exportstage/cms-data25//pool  sticky=allowed recover-space recover-control recover-anyway lfs=precious tag.hostname=csfnfs39.rl.ac.uk

> iii) finally, not really a Dcache question but very related to
> storage, we here at Lancaster are having some troubles. We have
> set up one of the nodes that will be part of our final SE, got it
> running SL, but our 2 raid arrays attached to it were partitioned
> into 3 1.8 TB partitions each and we can only see 2 of the 6
> partitions, we're missing 7.2 TB! One of the sister nodes with the
> original OS (SUSE) still on it clearly shows 6 1.8 TB partitions,
> but we can't mount the corresponding partitions on our SL box.
> We've tried many things, our service people have been contacted
> but it doesn't hurt to ask as many people as possible :-)

With so little info, I can only guess that the SL kernel does not support multiple LUNs; neither do the RH9 or 7.x varieties.

You can crack this nut in at least 2 ways.

1. Check that all LUNs are being exported from the SCSI controller card. The Adaptec HBAs have a switch to enable this. Once that is enabled then the Adaptec BIOS scan shows all the LUNs rather than the first. Other HBAs may not do this.

2. In my opinion, the best(?) way is to rebuild the kernel with multi-LUN support enabled. You can also include read streaming as well as pumping up the queue tags. i.e. There are at least 3 parameters to increase to get performant arrays in our case.

3. The nice and best way(?) to get multi-LUN support working is to add the following to /etc/modules.conf:

options scsi_mod max_scsi_luns=255

and then remake the initrd with mkinitrd and the usual incantation, if that is how they have their system configured.

However, the big gotcha with this is that setting one parameter in /etc/modules.conf has worked for me, but not 2 or 3. Hence, I can get multi-LUN support but not read streaming or increased queue tags.

I'd suggest that they try number 3 first, to confirm that multi-LUN support is what is needed in order to see all the LUNs; then, when they find it is not performant, they will end up having to build a kernel anyway.

Of course, all the above is written with the hindsight of working with Adaptec HBAs; if they have LSI or anyone else's, most of the above may not apply. Cheers, - Nick

You don't need to recompile the kernel. Just add to your modules.conf

options scsi_mod max_scsi_luns=255

recreate your initrd

/sbin/new-kernel-pkg --mkinitrd --depmod --install 2.4.21-32.ELsmp

and you are set.

RHEL update 2 (or 3?) disabled the querying of LUNs in a SCSI device because some devices have problems with this.
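Putting those two steps together, a minimal sketch (assuming a stock SL/RHEL 2.4 kernel; the kernel version string is the one from the example above and will differ on your machine):

# allow the SCSI layer to scan all LUNs from the next boot onwards
echo "options scsi_mod max_scsi_luns=255" >> /etc/modules.conf

# rebuild the initrd so the module option is picked up at boot time
/sbin/new-kernel-pkg --mkinitrd --depmod --install 2.4.21-32.ELsmp

# after a reboot, the extra LUNs/partitions should be listed here
cat /proc/scsi/scsi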

> Hello,
>
> The reason we wish to have 2 pools on each node is that each node
> has to serve 2 separate raid arrays. As there's no way to merge
> these 2 raids into one volume we require 2 pool processes. Also as
> we would like to connect our SE to both the production grid and UK
> Light it would be useful if we could make intelligent use of the
> two separate connections on each pool node (like assign a Grid FTP
> door to each).

There is currently no way to bind a Grid FTP door to an interface. The Grid FTP door uses the system's default IP address when returning the connection details for the data channel to the client, regardless of the interface the request was received on.


> Which leads me to a question: how easy is it to stick pool nodes
> into an existing Dcache setup? From what I understand you have to
> enter the pool into the poollist file, but after you do that how
> do you get the admin node to rescan its pool list? Or do you have
> to restart the whole process?

To add a pool to a node by hand:

0. Ensure the d-cache-core rpm is installed

1. Create the pool
     mkdir -p /path/to/pool/control
     mkdir -p /path/to/pool/data
2. Create the pool config file. The easiest way is to copy the setup file from another pool and fix the diskspace; otherwise put the following in /path/to/pool/setup (fixing the diskspace):
set max diskspace <Diskspace>
set heartbeat 30
set sticky allowed
set report remove off
set breakeven 250.0
set gap 4294967296
set duplicate request none
set p2p separated
#
# Flushing Thread setup
#
flush set max active 1000
flush set interval 60
flush set retry delay 60
#
# HsmStorageHandler2(diskCacheV111.pools.HsmStorageHandler2)
#
rh set max active 2
st set max active 2
rh set timeout 14400
st set timeout 14400
#
# Nothing from the diskCacheV111.pools.SpaceSweeper0
#
mover set max active 100
p2p set max active 2
#
#  Pool to Pool (P2P)
#
pp set port 0
pp set max active 0
jtm set timeout -queue=io -lastAccess=0 -total=0
jtm set timeout -queue=p2p -lastAccess=0 -total=0
csm set checksumtype adler32
csm set policy -frequently=off
csm set policy -onread=off -onwrite=off -onrestore=off
-ontransfer=off -enforcecrc=off -getcrcfromhsm=off

3. Add the pool to the /opt/d-cache/config/<hostname>.poollist, easiest way is again to adapt another entry.

Start (or restart) the pool service. There should be a script in /opt/d-cache/bin/ called dcache-pool. If you symlink that into /etc/init.d, the usual chkconfig and service mechanisms should work.

There may be some way to do all this using YAIM of course.
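For reference, that last step might look like the following (a minimal sketch; the script name and location are as described above, and chkconfig/service are the standard Red Hat tools):

# make the pool init script available to the usual service machinery
ln -s /opt/d-cache/bin/dcache-pool /etc/init.d/dcache-pool
chkconfig --add dcache-pool
chkconfig dcache-pool on

# start (or restart) the pool service so the new pool is picked up
service dcache-pool restart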

Hi all,

Having looked at the operation of the Dcache-written Grid FTP service on Dcache a little closer we've noticed the following:

- All people in the dteam VO seem to be mapped to dteam001.
- No log seems to be kept as to which certificate made which transaction.

So there is seemingly no record kept of who created or deleted which files... Sites may want/need to be made aware of this feature?

(Although as the Grid FTP door allows files to be created only in the appropriate /pnfs/<domain>/data/<VO> directory this doesn't seem to directly create a security hole)

Also, this begs the question: why on earth does the YAIM installation create all the pool accounts and the gridmapdir when there is absolutely no use made of them?

And, of course, these pool accounts have /home directories, by default. Although the Grid FTP service doesn't appear to be able to upload files to, say /home/dteam001/.ssh, this still probably isn't a great idea. And finally by default lcg-expiregridmapdir is installed (to recycle pool accounts that are never used?)

cheers, Owen.

ps. I do also hope the 'feature' of the Dcache Grid FTP service noted last Friday is being reported back up the chain?

I had a look at /opt/d-cache/billing and it doesn't keep *any* information about direct Grid FTP file transfers. Is there an option to increase logging or something similar?

I wonder if any of the other protocols (which ones are available?) can log any information. Are all of them GSI enabled? If not it might not be possible to get a DN at all for them. For example if your pool nodes are in a machine with user access (WN/CE/whatever) it's trivial to use a userland nfs program (nfsshell for example) to access any file in the system without logging or permission checking. People that are planning to use free space from their WNs should be aware of this.

Kostas

Hi,

You don't need pnfs installed on the pool node. There should only be one pnfs running in your entire dcache instance. What you need to do is nfs mount the pnfs service from your admin node on the pool node.

On the admin node, go to /pnfs/fs/admin/etc/exports and copy the file 127.0.0.1 to a file that has the ip address of your pool node as the name, then cd into the trusted directory and do the same.

Then on the pool node add something like

gw04.hep.ph.ic.ac.uk:/fs  /pnfs/fs  nfs  hard,intr,rw,noac,auto  0 0

to your /etc/fstab, create the /pnfs/fs directory, mount it and create the symlink
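A minimal sketch of those last steps on the pool node (the admin node name is the one from the example above; the symlink follows the /pnfs layout described later in this chapter, i.e. /pnfs/<domain> pointing at fs/usr):

mkdir -p /pnfs/fs
mount /pnfs/fs

# create the conventional symlink so paths like /pnfs/<domain>/data/<VO> resolve
cd /pnfs
ln -s fs/usr `hostname -d`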

Then, to run a gridftp door on your pool node: Set the /opt/d-cache/etc/door_config file on the pool node to

ADMIN_NODE     gw04.hep.ph.ic.ac.uk
door          active
--------------------
GSIDCAP         no
GRIDFTP         yes
SRM             no

and then run /opt/d-cache/install/install_doors.sh and then start dcache-opt.

dCache uses the names of things to work out which one to use, so if you have two gridftp doors named GFTP then the one that started last gets used. What the install_doors script does is change the name to something more unique (GFTP-<hostname -s>, basically) so you don't get name collisions. You can change the names yourself in the /opt/d-cache/config/*.batch files.

Derek

There are some records in the cd /opt/d-cache/billing/ area.

There is also a config option to dump the records straight into a postgres database for post processing. We turned it on and it works.

My vague memory is that it does not contain the DN, which is the main bit of info you are looking for. We do have a request in with the developers for this to be added, but this was only informal, so it should be looked at/verified/chased up.

Steve



Hi Derek,

Thanks for the advice.

We have tried a few permutations on this as follows:

Ross, D (Derek) wrote:
> You don't need pnfs installed on the pool node. There should only
> be one pnfs running in your entire dcache instance. What you need
> to do is nfs mount the pnfs service from your admin node on the
> pool node.
>
Ok.  We stopped pnfs on the pool node.

> On the admin node, go to /pnfs/fs/admin/etc/exports and copy the
> file 127.0.0.1 to a file that has the ip address of your pool node
> as the name, then cd into the trusted directory and do the same.

Already done!  We don't like pnfs being visible to the whole world...

> Then on the pool node add something like
> gw04.hep.ph.ic.ac.uk:/fs /pnfs/fs nfs hard,intr,rw,noac,auto 0 0
> to your /etc/fstab, create the /pnfs/fs directory, mount it and
> create the symlink.

Done.

> Then, to run a gridftp door on your pool node: Set the
> /opt/d-cache/etc/door_config file on the pool node to
>
> ADMIN_NODE     gw04.hep.ph.ic.ac.uk
>
> door          active
> --------------------
> GSIDCAP         no
> GRIDFTP         yes
> SRM             no
>
> and then run /opt/d-cache/install/install_doors.sh and then start
> dcache-opt.

Done.

> dCache uses the name of things to work out which one to use, so if
> you have two gridftp doors named GFTP then the one that started
> last gets used. What the install_doors script does is change the
> name to something more unique (GFTP-<hostname -s> basically) so
> you don't get name collisions. You can change the names yourself
> in the /opt/d-cache/config/*.batch files.

OK - the door gridftpdoor-gw03Domain started on the pool node.

However, after this, transfers still seemed to go through gw04
(admin node).

So, we stopped dcache-opt on the admin node and edited
/opt/d-cache/etc/door_config on the admin node to something like:

ADMIN_NODE     gw04.hep.ph.ic.ac.uk

door          active
--------------------
GSIDCAP         yes
GRIDFTP         no
SRM             yes

ran install_doors.sh and started dcache-opt.

Now all transfers reliably go direct to gw03 (pool node).

However, they still fail!

Here is the error:
> copying CopyJob, source = file:////root/testfile destination = gsiftp://gw03.hep.ph.ic.ac.uk:2811//pnfs/hep.ph.ic.ac.uk/data/dteam/testfile1526
> GridftpClient: connecting to gw03.hep.ph.ic.ac.uk on port 2811
> GridftpClient: gridFTPClient tcp buffer size is set to 1048576
> GridftpClient: gridFTPWrite started, source file is java.io.RandomAccessFile@1ce669e destination path is /pnfs/hep.ph.ic.ac.uk/data/dteam/testfile1526
> GridftpClient: parallelism: 10
> GridftpClient: adler 32 for file java.io.RandomAccessFile@1ce669e is 77e90756
> GridftpClient: waiting for completion of transfer
> GridftpClient: gridFtpWrite: starting the transfer in emode to /pnfs/hep.ph.ic.ac.uk/data/dteam/testfile1526
> org.globus.ftp.exception.ServerException: Server refused performing the request.  Custom message:  (error code 1) [Nested exception message:  Custom message: Unexpected reply: 553 /pnfs/hep.ph.ic.ac.uk/data/dteam/testfile1526: Cannot create file: CacheException(rc=666;msg=Path do not exist)].  Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException:  Custom message: Unexpected reply: 553 /pnfs/hep.ph.ic.ac.uk/data/dteam/testfile1526: Cannot create file: CacheException(rc=666;msg=Path do not exist)
>         at org.globus.ftp.vanilla.TransferMonitor.run(TransferMonitor.java:167)
> GridftpClient:  transfer exception
> org.globus.ftp.exception.ServerException: Server refused performing the request.  Custom message:  (error code 1) [Nested exception message:  Custom message: Unexpected reply: 553 /pnfs/hep.ph.ic.ac.uk/data/dteam/testfile1526: Cannot create file: CacheException(rc=666;msg=Path do not exist)].  Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException:  Custom message: Unexpected reply: 553 /pnfs/hep.ph.ic.ac.uk/data/dteam/testfile1526: Cannot create file: CacheException(rc=666;msg=Path do not exist)
>         at org.globus.ftp.vanilla.TransferMonitor.run(TransferMonitor.java:167)
> GridftpClient: closing client : org.dcache.srm.util.GridftpClient$FnalGridFTPClient@14d5bc9
> GridftpClient: closed client
> copy failed with the error
> org.globus.ftp.exception.ServerException: Server refused performing the request.  Custom message:  (error code 1) [Nested exception message:  Custom message: Unexpected reply: 553 /pnfs/hep.ph.ic.ac.uk/data/dteam/testfile1526: Cannot create file: CacheException(rc=666;msg=Path do not exist)].  Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException:  Custom message: Unexpected reply: 553 /pnfs/hep.ph.ic.ac.uk/data/dteam/testfile1526: Cannot create file: CacheException(rc=666;msg=Path do not exist)
>         at org.globus.ftp.vanilla.TransferMonitor.run(TransferMonitor.java:167)
>  try again

which keeps repeating as the srmcp gets retried.

BTW: we also reset the gridftp root path back to "/" in case this
was causing the 'Path do not exist' error, but this had no effect.

BTW2: After running install_doors.sh on the admin node, the nice
cellInfo web page on the admin node is showing Offline for the dcap,
gftp and srm cells, presumably as these are now dcap-gw04 and
srm-gw04....

Any suggestions?

Mona and Owen.

On Tue, May 17, 2005 at 06:16:19PM +0100 or thereabouts, Alessandra
Forti wrote:
> Hi,
>
> I'm trying to make dccp work but it keeps on giving me this error
> message:

Try these few.

1. dccp dcap://bohr0013.tier2.hep.man.ac.uk:22125//pnfs/tier2.hep.man.ac.uk/data/dteam/aNonExistantFile.srm /home/aforti/yeaa.dcap

2. dccp gsidcap://bohr0013.tier2.hep.man.ac.uk:22128//pnfs/tier2.hep.man.ac.uk/data/dteam/aNonExistantFile.srm /home/aforti/yeaa.dcap

3. srmcp srm://bohr0013.tier2.hep.man.ac.uk:8443/pnfs/tier2.hep.man.ac.uk/data/dteam/aNonExistantFile.srm file:////tmp/junk

You should be able to add -protocol=dcap to the second one to have
the SRM use a dcap rather than a Grid FTP TURL, but the transfer
part failed for me earlier today with a missing file on the client
side.

(1.) will fail unless you have write access with your client-side
uid/gid.

 Steve

Hi,

> The port number for GSI DCAP is 22128, you'll need to specify it
> on the command-line: i.e. gsidcap://gw04.hep.ph.ic.ac.uk:22128/

OK, we now can read and write through a GSI DCAP door on the admin
node.

Still no joy understanding what is going wrong with the GSI FTP door
on the pool node though.

cheers,
Owen.

On Thu, 19 May 2005 13:03:04 +0100
Matt Doidge <matt.doidge@GMAIL.COM> wrote:

> Hello,
> Thanks for all your replys. JAVA_LOCATION is pointing towards
the
> correct directory, but I have no JAVA_HOME variable set
anywhere (but
> it sounds like they're the same)..

To set an environment variable:

export  JAVA_LOCATION=/usr/java/j2sdk1.4.2_04
export  JAVA_HOME=/usr/java/j2sdk1.4.2_04/jre/

then to test this type

env | grep $JAVA_HOME

If you skip the export keyword this check will find nothing, but

echo $JAVA_HOME

will still work, because export is what tells sh-derived shells that
the variable should be passed on to child processes.
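A tiny demonstration of that difference (a sketch; any sh-derived shell, and the path is just the example value from above):

# set but do not export: only the current shell sees the variable
JAVA_HOME=/usr/java/j2sdk1.4.2_04/jre
echo $JAVA_HOME            # prints the path
sh -c 'echo $JAVA_HOME'    # a child shell prints an empty line

# now export it: child processes (env, sh -c, the dCache scripts) inherit it
export JAVA_HOME
sh -c 'echo $JAVA_HOME'    # prints the path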

> The dcache port range is set as;
> DCACHE_PORT_RANGE="20000 25000"
> this looks correct to me (although the syntax for the commented
> out DPM_PORT_RANGE is "20000,25000")
> interestingly this is set the same as GLOBUS_TCP_PORT_RANGE, there
> isn't some weird clash or port reservation going on is there?

Yes, that's why Globus Grid FTP does not run on D-Cache setups.


> checking the logs brings up, in dCache.log (and in utility.log),
> an interesting java message:
>
> Exception in thread "main" java.lang.NoClassDefFoundError: 25000,20000

That may be interesting.

> maybe I've got the inverted port range thing going on after
> all....will try changing it around in the site-info.def and see if
> that works (although i could have sworn i had the latest version
> of YAIM).

I should not worry too much about that.

Regards

Owen

This is an automated notification sent by LCG Savannah.
It relates to:
                bugs #8777, project LCG Operations

==============================================================================
  OVERVIEW of bugs #8777:
==============================================================================

URL:
  
<http://savannah.cern.ch/bugs/?func=detailitem&item_id=8777>

                  Summary: the information system on the SE dcache node doesn't work
                  Project: LCG Operations
             Submitted by: aforti
             Submitted on: 2005-May-27 20:27
                 Category: Information Service
                 Severity: 5 - Average
                 Priority: 5 - Normal
               Item Group: Malfunctioning
                   Status: None
                  Privacy: Public
              Assigned to: lfield
              Open/Closed: Open
                  Release: None
          Reproducibility: Every Time
                   Effort: 0.00

     _______________________________________________________


Hi,

the information system part is not working.

YAIM
=====

1) /opt/lcg/yaim/functions/config_gip

line 366

dn: GlueSARoot=$VO:${storage#${CE_CLOSE_SE1_ACCESS_POINT}/},GlueSEUniqueID=${SE_HOST},Mds-Vo-name=local,o=grid

doesn't produce the desired output. I've replaced it with this:

dn: GlueSARoot=$VO:${storage},GlueSEUniqueID=${SE_HOST},Mds-Vo-name=local,o=grid

2) Some glue schema fields are hardcoded in the script. They should go into a configuration file/template that can be safely edited; editing scripts is not good practice.

example of fields:

GlueSAPolicyFileLifeTime: permanent
GlueSAPolicyMaxFileSize: 10000
GlueSAPolicyMinFileSize: 1
GlueSAPolicyMaxData: 100
GlueSAPolicyMaxNumFiles: 10
GlueSAPolicyMaxPinDuration: 10
GlueSAPolicyQuota: 00
GlueSAStateAvailableSpace: 1

3) the dynamic part of the ldif contained in /opt/lcg/var/gip/tmp/lcg-info-dynamic-se.ldif.XXXX doesn't get created (the file is 0 size)

IS
===

lcg-info-dynamic-dcache seems to have the wrong srm command to get
the info about the available space. In any case the available and
used space are not published correctly.

SRM
====

srm commands complain if $HOME/.srmconfig/config.xml doesn't exist,
even when given the -conf=config_file option.

This is really annoying.

Anyway, to go back to the IS: when this command is run it fails
because of this and because it doesn't seem to like the host
certificates.

  [root@bohr0013 root]# SRM_PATH=/opt/d-cache/srm /opt/d-cache/srm/bin/srm-storage-element-info -conf=/opt/d-cache/srm/conf/config.xml https://bohr0013.tier2.hep.man.ac.uk:8443/srm/infoProvider1_0.wsdl
AxisFault
  faultCode: {http://schemas.xmlsoap.org/soap/envelope/}Server
  faultSubcode:
  faultString: org.dcache.srm.SRMAuthorizationException:  can not determine username from GlobusId=/C=UK/O=eScience/OU=Manchester/L=HEP/CN=bohr0013.tier2.hep.man.ac.uk/E=alessandra.forti@manchester.ac.uk
  faultActor:

.................................

plus another ton of java errors.

I've attached the config_gip I modified. This worked for my dcache
installation (apart from the used and available space).

thanks

cheers
Alessandra





     _______________________________________________________

File Attachments:


-------------------------------------------------------
Date: 2005-May-27 20:27  Name: config_gip  Size: 13.8KB  By: aforti
modified config_gip for dcache to publish more correct info
<http://savannah.cern.ch/bugs/download.php?item_id=8777&item_file_id=883>

==============================================================================


_______________________________________________
   Message sent via/by LCG Savannah
   http://savannah.cern.ch/

Hi,

1)  Are there any dcache/srm tools to delete files, or should I use
    LD_PRELOAD and go with unix commands?
1a) Do I have to delete in /pnfs...?
1b) What happens if I delete files on the pools?
1c) Is there a way to recover if the pnfs db goes out of sync with
    the file systems on the pools?

2)  Is there a way to bind GSI DCAP to one NIC and gridftp to the
    other?

3)  How can I disable dcap and leave only GSI DCAP?

4)  Is it possible to avoid direct access to the pool nodes, i.e.
    the pool node opens the connection to the client when the
    request arrives from the head node through srm? I guess this
    might be possible only with ftp though.

thanks

cheers
Alessandra

Hi all,

At last we have gridftp to the pool node working :)))))))

Inspired by Alessandra's 4-line configuration of a pool node we
reinstalled our test pool node and tried again...

==========================================================================
STEP1 : Install pool-node using yaim script

mkdir /pool-path1

# We guessed this step from an obscure message in the gridftpdoor
# logs on the pool node after a failed transfer:

mkdir /tmp/dcache-ftp-tlog


/opt/lcg/yaim/scripts/install_node site-info-dcache.def lcg-SEDCache
mv *.pem /etc/grid-security
/opt/lcg/yaim/scripts/configure_node site-info-dcache.def SE_dcache

Then on the pool node we added something like the following to /etc/fstab:

gw04.hep.ph.ic.ac.uk:/fs  /pnfs/fs  nfs  hard,intr,rw,noac,auto  0 0

mkdir -p /pnfs/fs
mount -a

Set the /opt/d-cache/etc/door_config file on the pool node to 

ADMIN_NODE     gw04.hep.ph.ic.ac.uk

door          active
--------------------
GSIDCAP         no
GRIDFTP         yes
SRM             no

sh /opt/d-cache/install/install_doors.sh
ln -s /opt/d-cache/bin/dcache-opt /etc/init.d

chkconfig dcache-opt on
service dcache-opt start

========================================================
STEP2 : Change the pool size on the pool-node (this step was
necessary for us as the auto install set the pool size to 0GB)

Edit the file /pool-path1/pool/setup
set max diskspace 600m
service dcache-pool restart

on the admin node:
/sbin/service dcache-opt restart
/sbin/service dcache-core restart

And it works! 

Thanks to all for the various steps you have all contributed to the
above!


We will start our production setup from Monday.

Thanks once again,

Mona and Owen


Hi all,
Sorry for replying to my own stuff.  Just a couple of additions.
I strongly suggest that before installing dCache using yaim you
check that your first search domain is set to `hostname -d`.
Otherwise your yaim installation will break, showing PnfsManager
Offline even though pnfs will be mounted.

We have MY_DOMAIN=gridpp.rl.ac.uk set in the site-info.def.

For example on our headnode dev03:
$ hostname -d
gridpp.rl.ac.uk

$ cat /etc/resolv.conf
search gridpp.rl.ac.uk
nameserver 130.246.xxx.yyy
nameserver 130.246.xxx.yyy
nameserver 130.246.xxx.yyy

We'll soon publish patches so that this dependency is removed.
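A quick pre-install sanity check along these lines (a sketch; the domain shown is the MY_DOMAIN value from the example above and will differ at your site):

# both commands should print the same domain, e.g. gridpp.rl.ac.uk,
# and it should match MY_DOMAIN in site-info.def
hostname -d
awk '/^(search|domain)/ {print $2; exit}' /etc/resolv.conf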

If you run into this problem, your install can still be saved
without reinstalling the entire machine from scratch.  Here's a
short howto:

cd /pnfs
ln -s fs/usr `hostname -d`
/opt/d-cache/bin/grid-mapfile2dcache-kpwd
echo `hostname -d` > /pnfs/fs/admin/etc/config/serverId

As the unwary sysadmin could be held back for days not knowing what
/pnfs/fs should look like, I've attached an example.  Please note
especially the "gridpp.rl.ac.uk -> fs/usr/" link.

This is a major bug.  The yaim dCache installation uses *three*
methods of figuring out the domain of your admin node:
1) MY_DOMAIN, 2) grepping /etc/resolv.conf for search/domain,
3) `hostname -d`.  These need to be synchronised, ideally using the
MY_DOMAIN variable in site-info.def.

Thanks.

Regards Jiri and Owen

Hi Owen,

thanks.

Just in case you want to add it to your set of patches to report,
I've attached my config_gip. I don't think it is general enough to
propose as a real patch, because there is only one SE section and I
don't know how they want to integrate dcache, dpm and classic SE
together. It gives a good idea anyway of what I think needs to be
changed for dcache, according to the Tier 1 lcg-info.

Other changes I think should be done are:

1) Some glue schema fields are hardcoded in the script.
    They should go into a configuration file/template that can be
    safely edited; editing scripts is not good practice.

example of fields:

GlueSAPolicyFileLifeTime: permanent
GlueSAPolicyMaxFileSize: 10000
GlueSAPolicyMinFileSize: 1
GlueSAPolicyMaxData: 100
GlueSAPolicyMaxNumFiles: 10
GlueSAPolicyMaxPinDuration: 10
GlueSAPolicyQuota: 00
GlueSAStateAvailableSpace: 1

2) the dynamic part of the ldif contained in
/opt/lcg/var/gip/tmp/lcg-info-dynamic-se.ldif.XXXX doesn't get
created (the file is 0 size)

I can't find what is missing right now and I'm a bit tired; tomorrow
I'll have a better look. I think that it is the wrapper that doesn't
get called or run by some other program. If anyone has any idea....
let me know.

cheers
Alessandra

On Wed, 25 May 2005, Owen Synge wrote:

> Steve T gave me some more support on this issue of publishing the
> incorrect information. It should help you all. I attached the
>
> /opt/lcg/var/gip/lcg-info-generic.conf
>
> earlier in the thread, which comes from the tier one D-Cache
> install.
>
> From Laurence Field's home page there is some generic info on
> "gip", which is a very "Laurence" name for an application:
>
> http://lfield.home.cern.ch/lfield/cgi-bin/wiki.cgi?area=gip&page=documentation
>
> The command for regenerating the information is
>
> /opt/lcg/sbin/lcg-info-generic-config /opt/lcg/var/gip/lcg-info-generic.conf
>
> and should be run whenever /opt/lcg/var/gip/lcg-info-generic.conf
> is changed; then the information provider uses
>
> su - edginfo -c '/opt/lcg/libexec/lcg-info-wrapper'
>
> to launch the script hierarchy.
>
> Regards
>
> Owen
>
> On Wed, 25 May 2005 14:41:16 +0100
> Alessandra Forti <Alessandra.Forti@MANCHESTER.AC.UK> wrote:
>
>> Thanks, better than parsing ldap. :)
>>
>> cheers
>> alessandra
>>
>> On Wed, 25 May 2005, Owen Synge wrote:
>>
>>> On Wed, 25 May 2005 14:28:04 +0100
>>> Alessandra Forti <Alessandra.Forti@MANCHESTER.AC.UK> wrote:
>>>
>>>> Hi,
>>>>
>>>> I think there is a bug in
>>>>
>>>> /opt/lcg/yaim/functions/config_gip
>>>>
>>>> line 366
>>>>
>>>> dn: GlueSARoot=$VO:${storage#${CE_CLOSE_SE1_ACCESS_POINT}/},GlueSEUniqueID=${SE_HOST},Mds-Vo-name=local,o=grid
>>>>
>>>> doesn't produce the desired output. I've replaced it with this.
>>>>
>>>> dn: GlueSARoot=$VO:${storage},GlueSEUniqueID=${SE_HOST},Mds-Vo-name=local,o=grid
>>>>
>>>> I'm looking for other things that might need change.
>>>>
>>>> cheers
>>>> alessandra
>>>>
>>> Great, I shall add it to my patch collection. Attached is Steve's
>>> reference MDS provider written by hand for the Tier 1.
>>>
>>> Regards
>>>
>>> Owen S
>>>
>>> PS I went to a party for May Day which had 4 people called Owen
>>> in one party, an all time record for me.

Hi,

I am going to have a look at preventing dcache from having access to
the whole system. Does anyone have an idea which paths each java
process needs access to?

I am thinking something like:

java_options="...  -Djava.security.policy=dcap.policy ..."

And in dcap.policy something like this for each path that it needs
access to:

grant {
    permission java.io.FilePermission "/poolpath", "read";
    permission java.io.FilePermission "/poolpath", "write";
    permission java.io.FilePermission "/poolpath", "delete";
};

Cheers,
Kostas

gstat monitors your BDII information; if you follow the links to your site you should find an SRM endpoint, if you followed the BDII configuration thread between Ale


1 - Getting the MDS service to correctly advertise the SE as an SRM;

Have you tried the solution that I proposed earlier,


/opt/lcg/sbin/lcg-info-generic-config \
      /opt/lcg/var/gip/lcg-info-generic.conf

This should be run whenever /opt/lcg/var/gip/lcg-info-generic.conf
is changed; then the information provider uses

su - edginfo -c '/opt/lcg/libexec/lcg-info-wrapper'

Included is the lcg-info-generic that was hand edited by Steve T for
the Tier 1. I should do a comparison of diffs with your current
setup and see if your SRM registers correctly; we are behind
firewalls so I can't test the fixes adequately here, but it seems to
work for Alessandra.

Regards

Owen

Dear all,
another version of the yaim Dcache installation is ready.

http://storage.esc.rl.ac.uk/patches/yaim/yaim-2_4_0-4-gpp-0.3.diff.gz

Again, nothing earth-shattering, but it could save you (especially
Imperial that requested this particular feature) some time.

Please take a look at
http://storage.esc.rl.ac.uk/patches/yaim/ChangeLog

Note that if you use this patch, you'll need to change your
existing site-info.def DCACHE_POOLS variable in the following
fashion:

original (e.g.)
DCACHE_POOLS="dev02.$MY_DOMAIN:/pool dev06.$MY_DOMAIN:/pool"

new (note the extra colon!)
DCACHE_POOLS="dev02.$MY_DOMAIN::/pool dev06.$MY_DOMAIN::/pool"

(Sorry about that, I'll fix that soon so that you don't need to do
this.)

or
DCACHE_POOLS="dev02.$MY_DOMAIN:10:/pool dev06.$MY_DOMAIN:20:/pool"

if you want to limit the pool size on dev02 to 10GB and dev06 to
20GB.

I've added support for multiple pools on one machine for the yaim
support.  You could do (e.g.):

DCACHE_POOLS="dev02.$MY_DOMAIN:10:/pool/1
dev02.$MY_DOMAIN:10:/pool/2 dev06.$MY_DOMAIN:20:/pool"

which will give you two 10GB pools on dev02.

I'll look into assigning specific pools to specific VOs soon
(hopefully).

Thanks, good luck.

--
Jiri

On Mon, May 30, 2005 at 10:41:30AM +0100, Greig A Cowan wrote:

> Hi Kostas,
>
> > What i did for our pool node was to download the d-cache rpms
> > and installed them by hand. Configuration is trivial, you only
> > have to edit 5 files; here is the list from the d-cache
> > instructions:
>
> Thanks for the email. I had started doing a dCache install by hand
> but with all the focus being on the yaim method, I thought it
> would be good to give it a try. I am thinking about going back to
> the manual install, but surely I still need all of the edg, vdt,
> perl and postgresql installed as well? Surely if the yaim install
> failed due to unmatched dependencies on these, the same will
> happen with a manual install?

No, you don't need *any* of that. You only need to be able to create
a dcache.kpwd file, and you can get away with a cron job that copies
it from the admin node.

Yaim is *really* over-zealous and it installs more or less
everything. For example, our admin node ended up with the SE version
of the gridftp server installed and enabled to run!!! It also
installed a "random" version of postgres that someone downloaded
from the web, although RHEL provides and supports postgres (who is
responsible for security updates for that version now?).

As you can imagine I am not a huge fan of scripts that try to do my
job when at the end I end up doing more work. If you have a look at
what yaim does you'll realise that what it does is trivial and you
can easily do it yourself, while gaining some useful knowledge in
the process on how everything works and how to fix it if something
goes wrong.
</rant>

Cheers,
Kostas

Hi,

I am really worried about two problems with d-cache

1) At the moment there isn't a way (that I know of) to
find which user uploaded which files. Without that ability
the server is useless for anything in the grid world.

2) I could be mistaken here, but from what I know about
the java Globus implementation that d-cache is using
there is no CRL support. If this is the case it means
that revoked certificates can still be used to access
the server, and this is unacceptable.

I am not sure we will be able to deploy d-cache
if these problems aren't solved, since it's a clear
violation of Imperial's policy. We might be able to get
permission to run it but I won't bet on it.

Kostas

Hi,
I've played a little bit with my dCache admin node.
Unless I've missed something blatantly obvious (in which case
thanks in advance for pointing it out), there seems to be no
straightforward way of disabling GSI DCAP and dcap on admin nodes.

Running /opt/d-cache/install/install_doors.sh with GSI DCAP
disabled on the admin node resulted in *all* doors being
down, not just the selected ones.  After a few hours of trying to
fix this, I decided to reinstall...  I wouldn't recommend running
this script on your admin nodes.

OK, here's how I've disabled dcap and GSI DCAP on my admin node.
The bad news is that it is a hack, the good news is that you can
reverse the process should you wish to re-enable the doors.

Disabling dcap
~~~~~~~~~~~~~~
In /opt/d-cache/config/door.batch
I've commented out the following bunch of lines (70--82
counting the first line as 1, not 0).

create dmg.cells.services.login.LoginManager DCap \
            "${dCapPort} \
             diskCacheV111.doors.DCapDoor \
             -keepAlive=300 \
             -poolRetry=2700 \
             -prot=telnet -localOk \
             -truncate=${truncate} \
             -maxLogin=1500 \
             -brokerUpdateTime=30 \
             -protocolFamily=dcap \
             -protocolVersion=3.0 \
             -poolProxy=PoolManager \
             -loginBroker=LoginBroker"

Disabling GSI DCAP
~~~~~~~~~~~~~~~~~
mv /opt/d-cache/jobs/gsidcapdoor /opt/d-cache/jobs/gsidcapdoor.disabled

Then I brutally restarted the entire dCache using my last-resort
dCache script, as restarting only -opt/-core wouldn't help (doors
kept reappearing after I killed them and/or restarted).

Check that dcap is not running:
lsof -i tcp:22125 | tail -n 1 | awk '{print $2}'
should return no PID

Check that gsidcap is not running:
lsof -i tcp:22128 | tail -n 1 | awk '{print $2}'
should return no PID
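An equivalent check with netstat (an alternative sketch, not from the original mail; the port numbers are the ones above):

# nothing should be listening on the dcap (22125) or gsidcap (22128) ports
netstat -tlnp | egrep ':(22125|22128) '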

Hope that helps.

--
Jiri

> Hi,
> I've played a little bit with my Dcache admin node.
> Unless I've missed something blatantly obvious (in which case
> thanks in advance for pointing it out) there seems to be
> no straightforward way of disabling GSIDCAP and dcap on admin
> nodes.

In terms of the dcache-opt service there is no difference between
an "admin" node and a "pool" node. So install_doors.sh should work
just fine for disabling GSIDCAP on the admin node.

> 
> Running /opt/d-cache/install/install_doors.sh with GSIDCAP
> disabled on the admin node resulted in *all* doors being
> down, not just the selected ones.  After a few hours of trying to
> fix this, I decided to reinstall...  I wouldn't recommend running
> this script on your admin nodes.

How were you determining that the doors were down? If you've just
installed the dcache-opt rpm on the admin node then the doors will
be SRM, Dcap-GSI and GFTP. If you've run the install_doors script
they'll be SRM-<host>, Dcap-GSI-<host> and GFTP-<host> and won't
show as up on the web interface. If you want them to, you can edit
the bottom of /opt/d-cache/config/httpd.batch and then restart the
httpd service with

/opt/d-cache/jobs/httpd stop; /opt/d-cache/jobs/httpd -logfile=/opt/d-cache/log/http.log start


What I suspect happened is that you had the dcache-opt service
running when you ran the install_doors script. The install_doors
script changes the /etc/init.d script, so it was probably trying to
stop services that weren't started and the ones that were started
got lost.

> 
> OK, here's how I've disabled dcap and GSI DCAP on my admin node.
> The bad news is that it is a hack, the good news is that you can
> reverse the process should you wish to re-enable the doors.
> 
> Disabling dcap
> ~~~~~~~~~~~~~~
> In /opt/d-cache/config/door.batch
> I've commented out the following bunch of lines (70--82
> counting the first line as 1, not 0).
> 

Or you could comment out the 2 lines that start and stop the door
service in /etc/init.d/dcache-core

 
> Then I brutally restarted the entire dCache using my last-resort
> dCache script as restarting only -opt/-core wouldn't help (doors
> kept reappearing after I killed them or/and restarted).
 

Each java process is started by a script; if the script sees its
child java process end, it'll start another, so kill the scripts
first:

# ps aux | grep srm
root     19909  0.0  0.0  4348 1228 ?        S    May29   0:00 /bin/sh /opt/d-cache/jobs/srm-dcache -logfile=/opt/d-cache/log/srm-dcache.log start
root     19910 22.0 19.6 631948 404012 ?     S    May29 461:39 /usr/java/j2sdk1.4.2_08/bin/java -server -Xmx384m -XX:MaxDirectMemorySize=384m -Dorg.globus.tcp.port.range=50000,52000 dmg.cells.services.Domain srm-dcacheDomain -param setupFile=/opt/d-cache/config/srm-dcacheSetup ourHomeDir=/opt/d-cache ourName=srm-dcache
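So to stop the SRM by hand in this example you would kill the wrapper script before its java child (a sketch; the PIDs are the ones from the ps output above and will differ on your system):

kill 19909    # the /bin/sh wrapper first, so it cannot respawn the java process
kill 19910    # then the java process itself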



Derek 
> > Hi,
> > 
> > I am really worried about two problems with d-cache
> > 
> > 1) At the moment there isn't a way (that i know of) to
> > find which user uploaded which files. Without that ability
> > the server is useless for anything in the grid world.
> 
> For gridftp:
> 
> In the /opt/d-cache/config/gridftp-<host>.batch, change the first
> line from
> 
> set printout 2
> 
> to
> 
> set printout 3
> 
> and then restart the dcache-opt service. This increases the
> verbosity of the gridftp door and includes the DN in the output.

Thanks, this seems to work although the verbosity is increased to an
insane level :(

$ ls -al gridftpdoor-sedsk00.log
-rw-r--r--   1 root root  4562 Jun  1 17:14 gridftpdoor-sedsk00.log
$ uberftp -a gsi -H sedsk00 dir 
...
$ ls -al gridftpdoor-sedsk00.log

> Hi Derek,
> 
> copying to the srm it sleeps for longer and longer intervals of
> time. Here is the output:

In terms of the error, this suggests that the srm is up but the
internals of retrieving files are broken. What is happening is that
first the file is requested and given a request ID.

> Thu Jun 02 14:58:54 BST 2005:  srm returned requestId = -2147482384

but after this the srm module is waiting for other D-Cache modules
to state that the data is ready to be transferred. They are not
talking to each other correctly. I have never seen this outside
coding with srmcp; I suggest that the pnfs database is corrupt or
broken.

Regards

Owen

copying to the srm it sleeps for longer and longer intervals of
time. Here 
is the output:

[aforti@bohr0003 aforti]$ ./srmcp.sh pippo pippo14
pippo ===> pippo14
SRM  srmcp file:///./pippo 
srm://bohr0013.tier2.hep.man.ac.uk:8443//pnfs/tier2.hep.man.ac.uk/data/dteam/pippo14.srm
SRM Configuration:
         debug=true
         gsissl=true
         help=false
         pushmode=false
         userproxy=true
         buffer_size=2048
         tcp_buffer_size=0
         config_file=/home/aforti/.srmconfig/config.xml
         glue_mapfile=/opt/d-cache/srm/conf/SRMServerV1.map
         webservice_path=srm/managerv1.wsdl
         webservice_protocol=https
         gsiftpclinet=globus-url-copy
         protocols_list=http,gsiftp
         save_config_file=null
         srmcphome=/opt/d-cache/srm
         urlcopy=/opt/d-cache/srm/bin/url-copy.sh
         x509_user_cert=/home/aforti/.globus/usercert.pem
         x509_user_key=/home/aforti/.globus/userkey.pem
         x509_user_proxy=/tmp/x509up_u500
        
x509_user_trusted_certificates=/etc/grid-security/certificates
         retry_num=3
         retry_timeout=1000
         wsdl_url=null
         use_urlcopy_script=true
         connect_to_wsdl=false
         from[0]=file:///./pippo

to=srm://bohr0013.tier2.hep.man.ac.uk:8443//pnfs/tier2.hep.man.ac.uk/data/dteam/pippo14.srm

Thu Jun 02 14:58:51 BST 2005: starting SRMPutClient
Thu Jun 02 14:58:51 BST 2005:
SRMClient(https,srm/managerv1.wsdl,true)
Thu Jun 02 14:58:51 BST 2005: connecting to server
Thu Jun 02 14:58:51 BST 2005: connected to server, obtaining proxy
SRMClientV1 : connecting to srm at 
httpg://bohr0013.tier2.hep.man.ac.uk:8443/srm/managerv1
Thu Jun 02 14:58:52 BST 2005: got proxy of type class 
org.dcache.srm.client.SRMClientV1
SRMClientV1 :   put, sources[0]="./pippo"
SRMClientV1 :   put, 
dests[0]="srm://bohr0013.tier2.hep.man.ac.uk:8443//pnfs/tier2.hep.man.ac.uk/data/dteam/pippo14.srm"
SRMClientV1 :   put, protocols[0]="http"
SRMClientV1 :   put, protocols[1]="dcap"
SRMClientV1 :   put, protocols[2]="gsiftp"
SRMClientV1 :  put, contacting service 
httpg://bohr0013.tier2.hep.man.ac.uk:8443/srm/managerv1
doneAddingJobs is false
copy_jobs is empty
Thu Jun 02 14:58:54 BST 2005:  srm returned requestId = -2147482384
Thu Jun 02 14:58:54 BST 2005: sleeping 1 seconds ...
Thu Jun 02 14:58:55 BST 2005: sleeping 4 seconds ...
Thu Jun 02 14:58:59 BST 2005: sleeping 4 seconds ...
Thu Jun 02 14:59:04 BST 2005: sleeping 4 seconds ...
Thu Jun 02 14:59:08 BST 2005: sleeping 4 seconds ...
Thu Jun 02 14:59:12 BST 2005: sleeping 4 seconds ...
Thu Jun 02 14:59:17 BST 2005: sleeping 7 seconds ...

Hi Alessandra,

Are any put requests listed when you do an ls -put in the SRM module
in the admin interface? It's case sensitive; the module is called
SRM (or possibly SRM-${hostname of srm node}). If there are, does
cancelling them (cancel all -put .* to cancel them all) allow a new
transfer to work?

Derek
> Hi,
>
> The file I sent only reduces lifetimes for get requests to 1 hour;
> the lifetimes for put and copy requests were left at the default
> of 24 hours.
>
> Add this line to the SRM part of the srm.batch file to reduce put
> lifetimes to 12 hours
>
>        -put-lifetime=43200000 \
>
> Note that this only applies to new requests, old requests will
> still have the old lifetime.
>
> Derek
Hi Derek,

what is the unit of time, milliseconds?

cheers
alessandra

On Mon, 6 Jun 2005, Ross, D (Derek) wrote:
On Tuesday 07 June 2005 14:52, Matt Doidge wrote:
> hello,
>
> I'm planning to stick a new pool node onto our existing dcache
> set-up, which should be a simple enough process, but looking at
> the d-cache set-up it mentions editing the poollist file, which we
> have but it is empty!
> we have a:
> /opt/d-cache/config/fal-pygrid-20.poollist

I know that this file got set up by YAIM when we did it at Glasgow....

> but there is nothing in it. Our d-cache system seems to be working
> fine, it was set up using yaim.

...so that's weird.
...so that's weird.

What _should_ be in it, for a pool, is described at

http://www.physics.gla.ac.uk/gridpp/datamanagement/index.php/ScotgridDcacheDiskPoolAdd

(and certainly worked for us).

> We want to add a node to the setup that's almost exactly the same
> as the one already attached (6 pools on 6 separate partitions on
> one node). Would it be as simple as repeating the normal pool
> installation process with a slightly modified site-info.def file
> (with the extra pools added) or is it something more subtle?

That does work for adding new pool nodes. The site-info.def is only
used by 
YAIM, not by Dcache once it's running. See

http://www.physics.gla.ac.uk/gridpp/datamanagement/index.php/ScotgridDcachePoolNodeAdd

Hope that helps

Graeme

Hi Owen(s),

Mapping pools to vo's is mentioned in the how-to, on this page:
http://storage.esc.rl.ac.uk/documentation/html/D-Cache-Howto/ar01s11.html

The relevant part is:

Set storage group for each VO dir (there are 4 spaces between
StoreName and the VO name, don't know if it's significant)

  cd ${vo}
  echo "StoreName    ${vo}" >".(tag)(OSMTemplate)"       
  echo ${vo} > ".(tag)(sGroup)"
  cd ..

Set up pool groups and directory affinities, for each VO add the
following lines to /opt/d-cache/config/PoolManager.conf

psu create pgroup ${vo}-pgroup
psu create unit -store ${vo}:${vo}@osm
psu create ugroup ${vo}
psu addto ugroup ${vo} ${vo}:${vo}@osm
psu create link ${vo}-link world-net ${vo}
psu add link ${vo}-link ${vo}-pgroup
psu set link ${vo}-link -readpref=10 -writepref=10 -cachepref=10

Which makes perfect sense to me (I wrote it :-) ), but it is just a
wee bit terse. Here goes with a slightly more verbose version:

To map a VO to a pool firstly you have to tag the directory in the
pnfs file system that the VO will use. The tags will be inherited by
any directory created under the tagged directory after it has been
tagged. To tag a directory, change into it and run the following
commands
     echo "StoreName    ${vo}" >".(tag)(OSMTemplate)"       
     echo ${vo} > ".(tag)(sGroup)"
where ${vo} is the name of the VO e.g. dteam. Note that although we
use the same name both times here it isn't necessary to do so, for
instance the Tier 1 has a dteam directory where the .(tag)(sGroup)
contains the words tape, and this is used to map to a separate set
of pools for access to the Atlas Data Store.

The second part of configuring mappings between VOs and pools
involves the PoolManager. If your dcache instance is halted then
you can add them to the /opt/d-cache/config/PoolManager.conf on the
admin node, otherwise they should be entered into the PoolManager
modules of the admin interface, remembering to finish with save to
write the configuration to disk.
    psu create pgroup ${vo}-pgroup
    psu create unit -store ${vo}:${vo}@osm
    psu create ugroup ${vo}
    psu addto ugroup ${vo} ${vo}:${vo}@osm
    psu create link ${vo}-link world-net ${vo}
    psu add link ${vo}-link ${vo}-pgroup
    psu set link ${vo}-link -readpref=10 -writepref=10 -cachepref=10
Note that most of the names of things in the above commands are
convention, and there is no requirement to actually follow this
scheme. The first command creates a pool group; this is exactly what
it sounds like: a group of pools. The second command defines a unit;
this is something that matches against a property of the incoming
request, in this case the storage information of where the file
should be written. The names in this command do matter: they should
match those used to tag the directory earlier, with the name used in
the .(tag)(OSMTemplate) coming first. The third command creates a
unit group, which is just a group of units. The fourth command adds
the unit created to the new unit group. The fifth command creates a
link, which is the mapping between incoming requests and destination
pools, and adds two unit groups to it; world-net is an existing unit
group that matches requests coming from any IP address, and the
second unit group is the one just created. The sixth command adds
the pool group created to the new link. The seventh command sets
various properties of the link.

Once all those commands are done, psu addto pgroup ${vo}-pgroup
<poolname> will add a pool to the pool group. If this pool is not
for all VOs to access, you may wish to remove it from the default
pool group with psu remove from default <poolname>, to ensure that
files from other VOs cannot get written to that pool. Note that a
pool can belong to more than one pool group, so it is perfectly
possible to have two VOs writing to the same pool; however there is
no way to stop one VO using all of the space in the pool.
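For concreteness, here are the same commands with ${vo} expanded to dteam (purely a substitution into the recipe above; <poolname> is whatever name the pool has in your poollist file):

psu create pgroup dteam-pgroup
psu create unit -store dteam:dteam@osm
psu create ugroup dteam
psu addto ugroup dteam dteam:dteam@osm
psu create link dteam-link world-net dteam
psu add link dteam-link dteam-pgroup
psu set link dteam-link -readpref=10 -writepref=10 -cachepref=10
psu addto pgroup dteam-pgroup <poolname>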

Hope this helps,

Derek

On Thu, 9 Jun 2005 16:51:02 +0100
Greig A Cowan <g.cowan@ED.AC.UK> wrote:

> Hi Derek,
> 
> > Have you tried starting the srm with the pool turned off, i.e.
> > 
> > service dcache-pool stop
> > service dcache-opt start
> > 
> > Does that log anything different in srm.log?
> 
> SRM is now on line!! Surprisingly, there are now no error messages
> in srm.log.

Do you know what happened here? I am curious because I don't want to
have these problems again. Do you have any record of how you kicked
the service into behaving?

Regards

Owen
Hi,

At IC we are observing a strange behaviour of srmcopy. Diskpool-1
and diskpool-2 are both running gsiftp. However, the copy process
randomly selects the gsiftp on diskpool-1 to copy files to both disk
pools. Instead, the gsiftp on diskpool-1 should be used for copying
files onto diskpool-1, and similarly the gsiftp running on
diskpool-2 should be used for copying files to diskpool-2.

Please suggest if any configuration changes are needed.

Regards,

Mona 

No, this is the correct behaviour. Selection of gridftp door is
separate from selection of pool. The fact that there may be a
gridftp door on the same host as a pool isn't taken into account
anywhere. I'm afraid you'll just have to live with it.

Derek


On Mon, Jun 13, 2005 at 02:08:14PM +0100 or thereabouts, Owen Synge
wrote:
> On Mon, 13 Jun 2005 10:36:46 +0100
> Jiri Mencak <j.mencak@RL.AC.UK> wrote:
> 
> > Hi,
> > first of all my apologies to old-timers for stating the obvious.
> > 
> > If your application to join dteam has not been approved yet and
> > you want to test your dcache installation, you can add your DN
> > to /etc/grid-mapfile manually and run
> > /opt/d-cache/bin/grid-mapfile2dcache-kpwd.
> > 
> > Regards.
> > 
> > -- 
> > Jiri
> 
> Sorry to spell things out to this degree, but I am sure it is
> better to clarify all the details exactly. Unfortunately I don't
> have a personal test CA.
> 
> OK, so to add a new user to the grid map file do the following:
> 
> 1 Get the DN of the cert you wish to add; for me this cert DN is
> 
> "/C=UK/O=eScience/OU=CLRC/L=RAL/CN=owen synge"
> 
> 2 Decide which of the supported virtual organisations (VOs) you
> belong to. I am within the group of developer/sysadmin/deployment
> experts, so I chose, as I imagine you all will, "dteam".
> 
> 3 Add a single line to the gridmap file in the following format
> 
> ${DN} .${VO}
If you want to add single extra members then you change
/opt/edg/etc/grid_mapfile_local

edg-mkgridmap constructs /etc/grid-security/grid-mapfile out of
/opt/edg/etc/edg-mkgridmap.conf and the local file above.

Steve
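In practice Steve's suggestion means something like the following (a sketch; the DN is the example from the quoted mail, and the kpwd regeneration command is the one given in the quoted steps):

# append the extra mapping to the local file that edg-mkgridmap merges in
echo '"/C=UK/O=eScience/OU=CLRC/L=RAL/CN=owen synge" .dteam' >> /opt/edg/etc/grid_mapfile_local

# then let edg-mkgridmap rebuild /etc/grid-security/grid-mapfile (e.g. via its
# usual cron job) and regenerate the dCache kpwd file:
/opt/d-cache/bin/grid-mapfile2dcache-kpwd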

> 
> In my case the line looks as below:
> 
> "/C=UK/O=eScience/OU=CLRC/L=RAL/CN=owen synge" .dteam
> 
> 4 Use the method for pulling the gridmap file into the D-Cache
> system:
> 
> /opt/d-cache/bin/grid-mapfile2dcache-kpwd
> 
> This will affect all D-Cache components including GridFTP access.
> With Globus based solutions this stage is unnecessary.
> 
> Regards
> 
> Owen S

-- 
Steve Traylen
s.traylen@rl.ac.uk
http://www.gridpp.ac.uk/

Hi Matt,

> Is anyone else continuing to have SRM troubles? 

Yes. I'm having issues using globus-url-copy, I'm investigating
them just 
now. 

> whilst adding some new pools to my setup I noticed the srm had
> dropped offline, probably whilst the system was rebooted t'other
> day. I tried restarting the admin node, had no luck, tried a few
> more tricks and eventually resorted to a reinstall of dcache.
> Thankfully it's all working again now by the looks of it, with the
> additional pools included in the set up. But it was very annoying
> to have to do. Any ideas why the SRM is reluctant to come back on
> line sometimes?

I always find this problem. What works for me is to start and stop
various 
dcache services on the admin node:

service dcache-pool stop
service dcache-opt restart
service dcache-pool start

It seems that the pool services can break the SRM, so turn them off
first. I appear to be able to leave the pool node alone when doing
this, and everything comes back online again.
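
If you want to script that, a rough sketch follows. The service
names and port 8443 are the ones used elsewhere in this thread; the
use of nc and the timings are assumptions.

#!/bin/sh
# Sketch only: restart dCache on the admin node in the order that
# seems to bring the SRM back, then poll port 8443 until it answers.
service dcache-pool stop
service dcache-opt restart
service dcache-pool start

# Give the SRM up to about two minutes to start listening again.
for i in $(seq 1 24); do
    if nc -z localhost 8443 >/dev/null 2>&1; then
        echo "SRM port 8443 is answering"
        break
    fi
    sleep 5
done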

> Which comes to my second question: when we have our SE officially
> up and running, is there a procedure for rescuing the postgres
> database and other critical information if it all goes pear shaped
> and the admin node needs reinstalling or some similar drastic
> action? The only reason I could do my cowboy fix today is that our
> SE isn't in use yet....

Don't know about this....surely there is some way of doing this?
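
One option, untested here, would be to dump the PostgreSQL databases
the SRM uses with pg_dump and keep the dCache configuration
alongside. The database names below are assumptions; check them
against your own installation first.

# Sketch only: back up the dCache/SRM PostgreSQL databases.
# The database names ("dcache", "companion", "billing") are
# assumptions; list yours with:  su - postgres -c 'psql -l'
BACKUP_DIR=/var/backups/dcache
mkdir -p "$BACKUP_DIR"

for db in dcache companion billing; do
    su - postgres -c "pg_dump $db" > "$BACKUP_DIR/$db-$(date +%Y%m%d).sql"
done

# Configuration worth keeping alongside the dumps (standard
# /opt/d-cache layout assumed):
cp -a /opt/d-cache/config "$BACKUP_DIR/"
cp /etc/grid-security/grid-mapfile "$BACKUP_DIR/" 2>/dev/null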

Cheers,
Greig

On Thu, 16 Jun 2005 11:47:30 +0100
Jamie Kelvin Ferguson <jfergus7@PH.ED.AC.UK> wrote:

> Hi Andrew,
> 
> You seem to have a pretty comprehensive knowledge of SRMs. Do you
> know of any way, either by querying the info system or otherwise,
> to automatically detect if a site is an SRM or a classic SE?

It's available through the BDII; I shall find out more details soon.
 
> And is there a way to evaluate how much available/used space
> exists for each flavour of storage?

This information is available for classic SEs but is not yet
available for D-Cache.

> 
> Will the new middleware accommodate sites advertising the above
> values?

Yes, but we have yet to get this working.

> Any help would be appreciated.
> 
> Cheers,
> Jamie Ferguson.

Regards

Owen S

Hi Jamie,

if you want to know about everyone in the world you can parse the
output of this command

ldapsearch -x -H ldap://lcgbdii02.gridpp.rl.ac.uk:2170 -b o=grid

there is an srm command to get the information out of each SRM:

/opt/d-cache/srm/bin/srm-storage-element-info \
    https://<SRM-NODE>:8443/srm/infoProvider1_0.wsdl

cheers
alessandra
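
Combining the two, a rough sketch that asks the BDII for
SRM-flavoured SEs and then queries each one's info provider. The
GlueSEName filter and port 8443 are taken from this thread; the
attribute parsing is a best guess and needs a valid grid proxy.

# Sketch only: list SRM SEs from the BDII, then ask each for its
# storage information.
BDII=ldap://lcgbdii02.gridpp.rl.ac.uk:2170

for se in $(ldapsearch -LLL -x -H "$BDII" -b 'mds-vo-name=local,o=grid' \
              '(GlueSEName=*:srm*)' GlueSEUniqueID \
            | awk '/^GlueSEUniqueID:/ {print $2}' | sort -u); do
    echo "== $se =="
    /opt/d-cache/srm/bin/srm-storage-element-info \
        "https://$se:8443/srm/infoProvider1_0.wsdl"
done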

This might help:

ldapsearch -H ldap://lcgbdii02.gridpp.rl.ac.uk:2170 -x \
  -b 'mds-vo-name=local,o=grid' '(GlueSEName=*:srm*)'

vs.

ldapsearch -H ldap://lcgbdii02.gridpp.rl.ac.uk:2170 -x \
  -b 'mds-vo-name=local,o=grid' '(GlueSEName=*:disk)'

Though this search:

ldapsearch -H ldap://lcgbdii02.gridpp.rl.ac.uk:2170 -x \
  -b 'mds-vo-name=local,o=grid' '(&(GlueSEName=*:disk)(GlueSEPort=8443))'

does raise the question as to whether GlueSEPort is actually used!

In principle, you ought to be able to run a gsiftp server on 8443
and an SRM on 2811, I'd imagine, *as long as* you advertised it
correctly. But the fact that perfectly functional classic SEs are
advertising 8443 while actually running their gsiftp server on 2811
(I checked a few) seems to show that the GlueSEPort isn't being
picked up by the lcg replica management software. I suspect a yaim
default here...
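
A quick way to repeat that check, as a sketch: it assumes nc is
installed, and the awk parsing relies on attribute ordering within
each entry, so treat it as illustrative only.

# Sketch only: for each classic SE, compare the advertised
# GlueSEPort with whether anything answers on the gsiftp port 2811.
BDII=ldap://lcgbdii02.gridpp.rl.ac.uk:2170
ldapsearch -LLL -x -H "$BDII" -b 'mds-vo-name=local,o=grid' \
    '(GlueSEName=*:disk)' GlueSEUniqueID GlueSEPort \
  | awk '/^GlueSEUniqueID:/ {host=$2} /^GlueSEPort:/ {print host, $2}' \
  | while read host port; do
        if nc -z -w 5 "$host" 2811 >/dev/null 2>&1; then
            g=open
        else
            g=closed
        fi
        echo "$host advertises GlueSEPort $port; gsiftp on 2811 is $g"
    done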

There's also GlueSEType but it isn't really reliable (or even
defined...)

SRM:
> [maroney@gfe03 RB]$ ldapsearch -LLL -H
ldap://lcgbdii02.gridpp.rl.ac.uk:2170 -x -b
'mds-vo-name=local,o=grid' '(GlueSEName=*:srm*)' | grep GlueSEName
| wc
>      14      28     438
> [maroney@gfe03 RB]$ ldapsearch -LLL -H
ldap://lcgbdii02.gridpp.rl.ac.uk:2170 -x -b
'mds-vo-name=local,o=grid' '(GlueSEName=*:srm*)' | grep GlueSEPort
| sort | uniq -c
>      14 GlueSEPort: 8443
> [maroney@gfe03 RB]$ ldapsearch -LLL -H
ldap://lcgbdii02.gridpp.rl.ac.uk:2170 -x -b
'mds-vo-name=local,o=grid' '(GlueSEName=*:srm*)' | grep GlueSEType
| sort | uniq -c
>       2 GlueSEType: srm
>       2 GlueSEType: srm_v1

Classic SE:
> [maroney@gfe03 RB]$ ldapsearch -LLL -H
ldap://lcgbdii02.gridpp.rl.ac.uk:2170 -x -b
'mds-vo-name=local,o=grid' '(GlueSEName=*:disk)' | grep GlueSEName
| wc
>     147     294    4170
> [maroney@gfe03 RB]$ ldapsearch -LLL -H
ldap://lcgbdii02.gridpp.rl.ac.uk:2170 -x -b
'mds-vo-name=local,o=grid' '(GlueSEName=*:disk)' | grep GlueSEPort
| sort | uniq -c
>      27 GlueSEPort: 2811
>     120 GlueSEPort: 8443
> [maroney@gfe03 RB]$ ldapsearch -LLL -H
ldap://lcgbdii02.gridpp.rl.ac.uk:2170 -x -b
'mds-vo-name=local,o=grid' '(GlueSEName=*:disk)' | grep GlueSEType
| sort | uniq -c
>     117 GlueSEType: disk
>       1 GlueSEType: gsiftp
>       5 GlueSEType: srm



Hi,

Jamie Kelvin Ferguson wrote:

> Hi Owen,
> 
> Thanks for that - that's what I'm looking for. I assume all sites
> with an SRM will have the term 'srm' in their SE type?

I'm afraid not!  The GlueSEType field doesn't seem to be used and
isn't
actually defined for most SRMs (in fact, to get the SRM at IC to
publish
properly I had to drop the field completely, so I guess it's
deprecated
even...).

The GlueSEName is the reliable field, as the LCG replica manager
reads what follows the ":" to decide which type the SE is ("disk"
or "srm_v1" seem to be the only options...).
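
So if you just want to classify SEs the way the replica manager
appears to, something along these lines should do: a sketch that
only looks at the text after the last ":" in GlueSEName.

# Sketch only: print each GlueSEName and the type implied by its
# suffix ("disk" or "srm_v1").
ldapsearch -LLL -x -H ldap://lcgbdii02.gridpp.rl.ac.uk:2170 \
    -b 'mds-vo-name=local,o=grid' '(GlueSEName=*)' GlueSEName \
  | awk -F': ' '/^GlueSEName:/ {
        n = split($2, parts, ":");
        printf "%-50s %s\n", $2, parts[n];
    }'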

> Is there a way to tell which type of storage (permanent, durable,
> volatile) sites are offering? I.e. how does a site advertise this
> information?

There is a field, GlueSAPolicyFileLifeTime, advertising this
separately for *each* storage area (GlueSARoot), e.g.:

ldapsearch -LLL -H ldap://gfe02.hep.ph.ic.ac.uk:2135 -x \
    -b 'GlueSEUniqueID=gfe02.hep.ph.ic.ac.uk,mds-vo-name=local,o=grid' \
    GlueSARoot GlueSAPolicyFileLifeTime

> Also, Phil might have mentioned something about this. If a site
> advertises permanent space and it can be seen from the info system
> that it has a capacity of, say, 5TB, can it be assumed that all
> 5TB of storage there is permanent, or is it possible for a site to
> divide up its type of storage so that only a fraction of that 5TB
> will be permanent space?

I wish I knew the answer to this!  How do storage spaces map to
SARoots, and SARoots to VOs?

The GlueSARoot advertises only a single storage type.  Possibly you
could define different SARoots, with different storage types.

Also, at the moment, the GlueSARoot *seems* to be one-per-VO (and
vice 
versa), but there is also a GlueSAAccessControlBaseRule.  This
would 
indicate that an SARoot might support more than one VO, though I
can't 
imagine the use case for it...

cheers,
Owen M

This is correct; the only way around this is to run and publish two
SRM end points. We now have dcache.gridpp.rl.ac.uk and, I think,
dcache-tape.gridpp.rl.ac.uk. Each of them has a different storage
root and published parameters, or rather it should, but we have not
got around to that yet.
 
Steve
> /opt/d-cache/srm/bin/srm-advisory-delete \
>   -debug=true \
>   srm://gfe02.hep.ph.ic.ac.uk:8443/pnfs/hep.ph.ic.ac.uk/data/dteam/testfilejune21
> SRM Configuration:
>         debug=true
>         gsissl=true
>         help=false
>         pushmode=false
>         userproxy=true
>         buffer_size=2048
>         tcp_buffer_size=0
>         config_file=/home/aggarwa/.srmconfig/config.xml
>         glue_mapfile=/opt/d-cache/srm/conf/SRMServerV1.map
>         webservice_path=srm/managerv1.wsdl
>         webservice_protocol=https
>         gsiftpclinet=globus-url-copy
>         protocols_list=http,gsiftp
>         save_config_file=null
>         srmcphome=/opt/d-cache/srm
>         urlcopy=/opt/d-cache/srm/bin/url-copy.sh
>         x509_user_cert=/home/aggarwa/k5-ca-proxy.pem
>         x509_user_key=/home/aggarwa/k5-ca-proxy.pem
>         x509_user_proxy=/home/aggarwa/k5-ca-proxy.pem
>         x509_user_trusted_certificates=/home/aggarwa/.globus/certificates
>         retry_num=20
>         retry_timeout=10000
>         wsdl_url=null
>         use_urlcopy_script=false
>         connect_to_wsdl=false
>         from=null
>         to=null
> 
> Tue Jun 21 13:21:37 BST 2005: SRMClient(https,srm/managerv1.wsdl,true)
> Tue Jun 21 13:21:37 BST 2005: connecting to server
> Tue Jun 21 13:21:37 BST 2005: connected to server, obtaining proxy
> SRMClientV1 : connecting to srm at
> httpg://gfe02.hep.ph.ic.ac.uk:8443/srm/managerv1
> Tue Jun 21 13:21:38 BST 2005: got proxy of type class
> org.dcache.srm.client.SRMClientV1
> Tue Jun 21 13:21:38 BST 2005: calling srm.advisoryDelete()
> SRMClientV1 :   advisoryDelete
> SURLS[0]="srm://gfe02.hep.ph.ic.ac.uk:8443/pnfs/hep.ph.ic.ac.uk/data/dteam/testfilejune21"
> SRMClientV1 :  advisoryDelete, contacting service
> httpg://gfe02.hep.ph.ic.ac.uk:8443/srm/managerv1
> SRMClientV1 : sleeping for 0 milliseconds before retrying
> SRMClientV1 : sleeping for 0 milliseconds before retrying
> SRMClientV1 : sleeping for 0 milliseconds before retrying
> SRMClientV1 : sleeping for 0 milliseconds before retrying
> SRMClientV1 : sleeping for 0 milliseconds before retrying
> SRMClientV1 : sleeping for 0 milliseconds before retrying
> SRMClientV1 : sleeping for 0 milliseconds before retrying
> SRMClientV1 : sleeping for 0 milliseconds before retrying
> SRMClientV1 : sleeping for 0 milliseconds before retrying
> SRMClientV1 : sleeping for 0 milliseconds before retrying
> SRMClientV1 : sleeping for 0 milliseconds before retrying
> SRMClientV1 : sleeping for 0 milliseconds before retrying
> SRMClientV1 : sleeping for 0 milliseconds before retrying
> SRMClientV1 : sleeping for 0 milliseconds before retrying
> SRMClientV1 : sleeping for 0 milliseconds before retrying
> SRMClientV1 : sleeping for 0 milliseconds before retrying
> SRMClientV1 : sleeping for 0 milliseconds before retrying
> 
> ===============================================================
> 
> Any suggestions?
Hi Mona,

srm-advisory-delete works for me if I add -connect_to_wsdl=true to
the command line.
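
For reference, the full invocation with that flag added would look
something like this (same SURL as in the transcript above):

/opt/d-cache/srm/bin/srm-advisory-delete -debug=true \
    -connect_to_wsdl=true \
    srm://gfe02.hep.ph.ic.ac.uk:8443/pnfs/hep.ph.ic.ac.uk/data/dteam/testfilejune21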

Derek 

Tue, 21 Jun 2005 14:23:03 +0100