The objective of this document is to assist system administrators with installing dCache at Tier 2 sites and configuring it to work with the LHC Computing Grid.
Hi, Just looking for the opinion of others to see if we should report this. In the case of
srmcp gsiftp://gridftpp.gridka.de:2811//etc/group \ srm://dcache.gridpp.rl.ac.uk:8443//pnfs/gridpp.rl.ac.uk/data/dteam/exabyteasdfa
then our SRM does this:
Wed May 04 14:45:21 BST 2005: srm returned requestId = -2147135830
Wed May 04 14:45:21 BST 2005: sleeping 1 seconds ...
Wed May 04 14:45:22 BST 2005: sleeping 4 seconds ...
Wed May 04 14:45:27 BST 2005: sleeping 4 seconds ...
Wed May 04 14:45:31 BST 2005: sleeping 4 seconds ...
The reason it is retrying is that the transfer fails: I am not authorised to use gridftpp.gridka.de at all.
Does that not count as a permanent error? Should the SRM just give up and stop trying? Steve
The FTP server and the SRM use the same grid map file, as D-Cache integrates a Java Grid FTP server, and there is no other authorisation going on.
srmcp by default does pull when doing an SRM copy.
From the error message in the logs, the destination SRM (in this case one of the Dcache pools) is trying to access the source grid ftp server and failing because Steve (actually it's me in this particular case) isn't in the grid-map file of the source Grid FTP server:
05/04 12:55:48 Cell(dcache.gridpp.rl.ac.uk_1@dcachexDomain) : org.globus.ftp.exception.ServerException: Server refused performing the request. Custom message: User authorisation failed. (error code ) [Nested exception message: Custom message: Unexpected reply: 530 No local mapping for Globus ID]. Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException: Custom message: Unexpected reply: 530 No local mapping for Globus ID
When the SRM gets this error it doesn't consider it fatal:
05/04 12:55:48 Cell(RemoteGsiftpTransferManager@srmDomain) : [id=30148 store src=gsiftp://gridftpp.gridka.de/etc/group dest=///pnfs/gridpp.rl.ac.uk/data/dteam/dr35-fzk-20050504-01]:sending error reply, reply code=8 errorObject=tranfer failed :org.globus.ftp.exception.ServerException: Server refused performing the request. Custom message: User authorization failed. (error code 1) [Nested exception message: Custom message: Unexpected reply: 530 No local mapping for Globus ID] [Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException: Custom message: Unexpected reply: 530 No local mapping for Globus ID] for id=30148 store src=gsiftp://gridftpp.gridka.de/etc/group dest=///pnfs/gridpp.rl.ac.uk/data/dteam/dr35-fzk-20050504-01
And periodically retries the transfer.
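As an aside, a quick sanity check for this class of failure on the source Grid FTP host is just a grep of its grid-mapfile (a sketch; the DN pattern below is a placeholder, not a real entry):

grep "CN=some user" /etc/grid-security/grid-mapfile \
  || echo "DN not mapped here - expect '530 No local mapping for Globus ID' when the destination pulls"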
Hi Matt, Are the disks SCSI? You could recompile the kernel and add in the options:
#
# Some SCSI devices (e.g. CD jukebox) support multiple LUNs
#
CONFIG_SCSI_MULTI_LUN=y
CONFIG_SCSI_CONSTANTS=y
# CONFIG_SCSI_LOGGING is not set
Using the SL Linux config file it should be easy to add in this option while keeping the rest of the kernel configuration identical.
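A rough sketch of such a rebuild on a 2.4-era SL box (the source directory and make targets are the usual 2.4 ones and are assumptions to adapt to your kernel version):

cd /usr/src/linux-2.4
# start from the distribution config so everything else stays identical
cp /boot/config-`uname -r` .config
# enable scanning of all SCSI LUNs
sed -i 's/^# CONFIG_SCSI_MULTI_LUN is not set/CONFIG_SCSI_MULTI_LUN=y/' .config
make oldconfig
make dep bzImage modules modules_install install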
The reason we wish to have 2 pools on each node is that each node has to serve 2 separate raid arrays. As there's no way to merge these 2 raids into one volume we require 2 pool processes. Also as we would like to connect our SE to both the production grid and UK Light it would be useful if we could make intelligent use of the two separate connections on each pool node (like assign a Grid FTP door to each).
As for our raid problem, apparently Red Hat (and apparently therefore SL) doesn't support LUNs, which our raid controllers use to mark the raid partitions. Any ideas how to get round this? The tech support from our supplier suggested we might have to recompile the kernel, and as you can imagine we'd like to avoid having to do this.
(Steve: sorry if I'm repeating what Brian said, but I thought it useful to post to the entire list.)
cheers, matt
I've lost the mail about you suggesting to SL to change their default kernel, but I doubt they will do this, nor should they in my opinion. It is important people get as close as possible to Red Hat in this area: a bug-for-bug match. Steve
Hello, I have several Dcache installation questions that I hope you guys can answer. Forgive me for sounding a bit rambly, but I'm just looking for intelligent input.
i) What do I change to have 2 different Dcache processes on one node? I'm using the YAIM method of installation. I can see that I can identify the second pool process by pointing to a different pnfs file system (for example pool-node:/pnfs1/ pool-node:/pnfs2/ in the site-info.def), but how do you start these 2 different processes using YAIM to set things up? Or do I have to get my hands dirty and use something other than the YAIM install method?
One Dcache pool process manages all the Dcache pools on a node. Check the /opt/d-cache/config/<hostname>.poollist file, it should have one line for each pool:
csfnfs39_1 /exportstage/cms-data24//pool sticky=allowed recover-space recover-control recover-anyway lfs=precious tag.hostname=csfnfs39.rl.ac.uk
csfnfs39_2 /exportstage/cms-data25//pool sticky=allowed recover-space recover-control recover-anyway lfs=precious tag.hostname=csfnfs39.rl.ac.uk
> iii) Finally, not really a Dcache question but very related to storage: we here at Lancaster are having some troubles. We have set up one of the nodes that will be part of our final SE, got it running SL, but our 2 raid arrays attached to it were partitioned into 3 1.8 TB partitions each and we can only see 2 of the 6 partitions, so we're missing 7.2 TB! One of the sister nodes with the original OS (SUSE) still on it clearly shows 6 1.8 TB partitions, but we can't mount the corresponding partitions on our SL box. We've tried many things and our service people have been contacted, but it doesn't hurt to ask as many people as possible :-)
With so little info, I can only guess that the SL kernel does not support multiple LUNs; neither do the RH9 or 7.x varieties.
You can crack this nut in at least 2 ways.
1. Check that all LUNs are being exported from the SCSI controller card. The Adaptec HBAs have a switch to enable this. Once that is enabled then the Adaptec BIOS scan shows all the LUNs rather than the first. Other HBAs may not do this.
2. In my opinion, the best(?) way is to rebuild the kernel with multi-LUN support enabled. You can also include read streaming as well as pumping up the queue tags. i.e. There are at least 3 parameters to increase to get performant arrays in our case.
3. The nice and best way(?) to get multi-LUN support working is to add the following to /etc/modules.conf:
options scsi_mod max_scsi_luns=255
and then remake the initrd with mkinitrd and the usual incantation, if that is how they have their system configured.
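For reference, a minimal sketch of that incantation, assuming the 2.4.21-32.ELsmp kernel used elsewhere in this document:

# rebuild the initrd so scsi_mod picks up the new option at boot
/sbin/mkinitrd -f /boot/initrd-2.4.21-32.ELsmp.img 2.4.21-32.ELsmp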
However, the big gotcha with this is that one parameter in /etc/modules.conf has worked for me but not 2 or 3. Hence, I can get multi-LUN support but not read streaming or increasing the queue tags.
I'd suggest they try number 3 first, to validate that multi-LUN support has to be enabled in order to see all the LUNs; then, when they find it is not performant, they will end up having to build a kernel anyway.

Of course, all the above is written with the hindsight of working with Adaptec HBAs; if they have LSI or anyone else's, most of the above may not apply. Cheers, - Nick
You don't need to recompile the kernel. Just add the following to your modules.conf
options scsi_mod max_scsi_luns=255
recreate your initrd
/sbin/new-kernel-pkg --mkinitrd --depmod --install 2.4.21-32.ELsmp
and you are set.
RHEL update 2 (or 3?) disabled the querying of LUNs in a SCSI device because some devices have problems with this.
> Hello,
>
> The reason we wish to have 2 pools on each node is that each node has
> to serve 2 separate raid arrays. As there's no way to merge these 2
> raids into one volume we require 2 pool processes. Also as we would
> like to connect our SE to both the production grid and UK Light it
> would be useful if we could make intelligent use of the two separate
> connections on each pool node (like assign a Grid FTP door to each).
There currently is no way to bind a Grid FTP door to an interface. The Grid FTP door uses the system's default ip address when returning the connection details for the data channel to the client regardless of the interface the request was received on.
> Which leads me to a question, how easy is it to stick pool nodes into
> an existing Dcache setup? From what I understand you have to enter the
> pool into the poollist file, but after you do that how do you get the
> admin node to rescan its pool-list? Or do you have to restart the
> whole process?

To add a pool to a node by hand:

0. Ensure the d-cache-core rpm is installed.

1. Create the pool directories:
mkdir -p /path/to/pool/control
mkdir -p /path/to/pool/data
2. Create the pool config file. The easiest way is to copy one from another pool and fix the diskspace; otherwise put the following in /path/to/pool/setup (fixing the diskspace):
set max diskspace <Diskspace>
set heartbeat 30
set sticky allowed
set report remove off
set breakeven 250.0
set gap 4294967296
set duplicate request none
set p2p separated
#
# Flushing Thread setup
#
flush set max active 1000
flush set interval 60
flush set retry delay 60
#
# HsmStorageHandler2(diskCacheV111.pools.HsmStorageHandler2)
#
rh set max active 2
st set max active 2
rh set timeout 14400
st set timeout 14400
#
# Nothing from the diskCacheV111.pools.SpaceSweeper0
#
mover set max active 100
p2p set max active 2
#
# Pool to Pool (P2P)
#
pp set port 0
pp set max active 0
jtm set timeout -queue=io -lastAccess=0 -total=0
jtm set timeout -queue=p2p -lastAccess=0 -total=0
csm set checksumtype adler32
csm set policy -frequently=off
csm set policy -onread=off -onwrite=off -onrestore=off -ontransfer=off -enforcecrc=off -getcrcfromhsm=off
3. Add the pool to /opt/d-cache/config/<hostname>.poollist; the easiest way is again to adapt another entry.

4. Start (or restart) the pool service. There should be a script in /opt/d-cache/bin/ called dcache-pool. If you symlink that into /etc/init.d, the usual chkconfig and service mechanisms should work.
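A minimal sketch of those last steps (mirroring the dcache-opt commands used later in this document):

ln -s /opt/d-cache/bin/dcache-pool /etc/init.d/dcache-pool
chkconfig dcache-pool on
service dcache-pool start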
There may be some way to do all this using YAIM of course.
Hi all,
Having looked a little closer at the operation of the Dcache-written Grid FTP service, we've noticed the following:
- all people in the dteam VO seem to be mapped to dteam001.
- no log seems to be kept as to which certificate made which transaction.
So there is seemingly no record kept of who created or deleted which files... Sites may want/need to be made aware of this feature?
(Although as the Grid FTP door allows files to be created only in the appropriate /pnfs/<domain>/data/<VO> directory this doesn't seem to directly create a security hole)
Also, this begs the question: why on earth does the YAIM installation create all the pool accounts and the gridmapdir when there is absolutely no use made of them?
And, of course, these pool accounts have /home directories, by default. Although the Grid FTP service doesn't appear to be able to upload files to, say /home/dteam001/.ssh, this still probably isn't a great idea. And finally by default lcg-expiregridmapdir is installed (to recycle pool accounts that are never used?)
cheers, Owen.
ps. I do also hope the 'feature' of the Dcache Grid FTP service noted last Friday is being reported back up the chain?
I had a look at /opt/d-cache/billing and it doesn't keep *any* information about direct Grid FTP file transfers. Is there an option to increase logging or something similar?
I wonder if any of the other protocols (which ones are available?) can log any information. Are all of them GSI enabled? If not, it might not be possible to get a DN at all for them. For example, if your pool nodes are on a machine with user access (WN/CE/whatever), it's trivial to use a userland nfs program (nfsshell for example) to access any file in the system without logging or permission checking. People that are planning to use free space from their WNs should be aware of this.
Kostas
Hi,
You don't need pnfs installed on the pool node. There should only be one pnfs running in your entire dcache instance. What you need to do is nfs mount the pnfs service from your admin node on the pool node.
On the admin node, go to /pnfs/fs/admin/etc/exports and copy the file 127.0.0.1 to a file that has the ip address of your pool node as the name, then cd into the trusted directory and do the same.
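Spelled out, and assuming a pool node with the hypothetical address 192.168.0.20:

cd /pnfs/fs/admin/etc/exports
cp 127.0.0.1 192.168.0.20
cd trusted
cp 127.0.0.1 192.168.0.20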
Then on the pool node add something like
gw04.hep.ph.ic.ac.uk:/fs /pnfs/fs nfs hard,intr,rw,noac,auto 0 0
to your /etc/fstab, create the /pnfs/fs directory, mount it, and create the symlink (/pnfs/<your domain> pointing to fs/usr).
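Spelled out (the symlink target follows the /pnfs layout described later in this document; `hostname -d` is assumed to return your site domain):

mkdir -p /pnfs/fs
mount /pnfs/fs
cd /pnfs
ln -s fs/usr `hostname -d`    # e.g. /pnfs/hep.ph.ic.ac.uk -> fs/usr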
Then, to run a gridftp door on your pool node: Set the /opt/d-cache/etc/door_config file on the pool node to
ADMIN_NODE gw04.hep.ph.ic.ac.uk
door     active
--------------------
GSIDCAP  no
GRIDFTP  yes
SRM      no
and then run /opt/d-cache/install/install_doors.sh and then start dcache-opt.
Dcache uses the names of things to work out which one to use, so if you have two gridftp doors named GFTP then the one that started last gets used. What the install_doors script does is change the name to something more unique (GFTP-<hostname -s> basically) so you don't get name collisions. You can change the names yourself in the /opt/d-cache/config/*.batch files.
Derek
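If you want to see where those door names live before editing them, a plain grep over the batch files is enough (nothing dCache-specific is assumed here):

grep -n GFTP /opt/d-cache/config/*.batch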
There are some records in the /opt/d-cache/billing/ area.
There is also a config option to dump the records straight into a postgres database for post processing. We turned it on and it works.
I have a vague memory that it does not contain the DN, which is the main bit of info you are looking for. We do have a request in with the developers for this to be added, but it was only informal, so this should be looked at/verified/chased up.
Steve
Hi Derek,

Thanks for the advice. We have tried a few permutations on this as follows:

Ross, D (Derek) wrote:
> You don't need pnfs installed on the pool node. There should only be
> one pnfs running in your entire dcache instance. What you need to do
> is nfs mount the pnfs service from your admin node on the pool node.

OK. We stopped pnfs on the pool node.

> On the admin node, go to /pnfs/fs/admin/etc/exports and copy the
> file 127.0.0.1 to a file that has the ip address of your pool node as
> the name, then cd into the trusted directory and do the same.

Already done! We don't like pnfs being visible to the whole world...

> Then on the pool node add something like
> gw04.hep.ph.ic.ac.uk:/fs /pnfs/fs nfs hard,intr,rw,noac,auto 0 0
> to your /etc/fstab, create the /pnfs/fs directory, mount it and create the symlink

Done.

> Then, to run a gridftp door on your pool node: Set the
> /opt/d-cache/etc/door_config file on the pool node to
>
> ADMIN_NODE gw04.hep.ph.ic.ac.uk
>
> door     active
> --------------------
> GSIDCAP  no
> GRIDFTP  yes
> SRM      no
>
> and then run /opt/d-cache/install/install_doors.sh and then start
> dcache-opt.

Done.

> dCache uses the name of things to work out which one to use. so if you
> have two gridftp doors named GFTP then the one that started last gets
> used. What the install_doors script does is change the name to something
> more unique (GFTP-<hostname -s> basically) so you don't get name
> collisions. You can change the names yourself in the
> /opt/d-cache/config/*.batch files

OK - the door gridftpdoor-gw03Domain started on the pool node. However, after this transfers still seemed to go through gw04 (admin node). So, we stopped dcache-opt on the admin node and edited /opt/d-cache/etc/door-config on the admin node to something like:

ADMIN_NODE gw04.hep.ph.ic.ac.uk

door     active
--------------------
GSIDCAP  yes
GRIDFTP  no
SRM      yes

ran install_doors.sh and started dcache-opt. Now all transfers reliably go direct to gw03 (pool node). However, they still fail! Here is the error:

> copying CopyJob, source = file:////root/testfile destination = gsiftp://gw03.hep.ph.ic.ac.uk:2811//pnfs/hep.ph.ic.ac.uk/data/dteam/testfile1526
> GridftpClient: connecting to gw03.hep.ph.ic.ac.uk on port 2811
> GridftpClient: gridFTPClient tcp buffer size is set to 1048576
> GridftpClient: gridFTPWrite started, source file is java.io.RandomAccessFile@1ce669e destination path is /pnfs/hep.ph.ic.ac.uk/data/dteam/testfile1526
> GridftpClient: parallelism: 10
> GridftpClient: adler 32 for file java.io.RandomAccessFile@1ce669e is 77e90756
> GridftpClient: waiting for completion of transfer
> GridftpClient: gridFtpWrite: starting the transfer in emode to /pnfs/hep.ph.ic.ac.uk/data/dteam/testfile1526
> org.globus.ftp.exception.ServerException: Server refused performing the request. Custom message: (error code 1) [Nested exception message: Custom message: Unexpected reply: 553 /pnfs/hep.ph.ic.ac.uk/data/dteam/testfile1526: Cannot create file: CacheException(rc=666;msg=Path do not exist)]. Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException: Custom message: Unexpected reply: 553 /pnfs/hep.ph.ic.ac.uk/data/dteam/testfile1526: Cannot create file: CacheException(rc=666;msg=Path do not exist)
>   at org.globus.ftp.vanilla.TransferMonitor.run(TransferMonitor.java:167)
> GridftpClient: transfer exception
> org.globus.ftp.exception.ServerException: Server refused performing the request. Custom message: (error code 1) [Nested exception message: Custom message: Unexpected reply: 553 /pnfs/hep.ph.ic.ac.uk/data/dteam/testfile1526: Cannot create file: CacheException(rc=666;msg=Path do not exist)]. Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException: Custom message: Unexpected reply: 553 /pnfs/hep.ph.ic.ac.uk/data/dteam/testfile1526: Cannot create file: CacheException(rc=666;msg=Path do not exist)
>   at org.globus.ftp.vanilla.TransferMonitor.run(TransferMonitor.java:167)
> GridftpClient: closing client : org.dcache.srm.util.GridftpClient$FnalGridFTPClient@14d5bc9
> GridftpClient: closed client
> copy failed with the error
> org.globus.ftp.exception.ServerException: Server refused performing the request. Custom message: (error code 1) [Nested exception message: Custom message: Unexpected reply: 553 /pnfs/hep.ph.ic.ac.uk/data/dteam/testfile1526: Cannot create file: CacheException(rc=666;msg=Path do not exist)]. Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException: Custom message: Unexpected reply: 553 /pnfs/hep.ph.ic.ac.uk/data/dteam/testfile1526: Cannot create file: CacheException(rc=666;msg=Path do not exist)
>   at org.globus.ftp.vanilla.TransferMonitor.run(TransferMonitor.java:167)
> try again

which keeps repeating as the srmcp gets retried.

BTW: we also reset the gridftp root path back to "/" in case this was causing the 'Path do not exist' error, but this had no effect.

BTW2: After running install_doors.sh on the admin node, the nice cellInfo web page on the admin node is showing Offline for the dcap, gftp and srm cells, presumably as these are now dcap-gw04 and srm-gw04....

Any suggestions?

Mona and Owen.
On Tue, May 17, 2005 at 06:16:19PM +0100 or thereabouts, Alessandra Forti wrote:
> Hi,
>
> I'm trying to make dccp work but it keeps on giving me this error message:

Try these few:

1. dccp dcap://bohr0013.tier2.hep.man.ac.uk:22125//pnfs/tier2.hep.man.ac.uk/data/dteam/aNonExistantFile.srm /home/aforti/yeaa.dcap
2. dccp gsidcap://bohr0013.tier2.hep.man.ac.uk:22128//pnfs/tier2.hep.man.ac.uk/data/dteam/aNonExistantFile.srm /home/aforti/yeaa.dcap
3. srmcp srm://bohr0013.tier2.hep.man.ac.uk:8443/pnfs/tier2.hep.man.ac.uk/data/dteam/aNonExistantFile.srm file:////tmp/junk

You should be able to add -protocol=dcap to the second one to have the SRM use a dcap rather than a Grid FTP TURL, but the transfer part failed for me earlier today with a missing file on the client side. (1) will fail unless you have write access with your client-side uid/gid.

Steve
Hi,

> The port number for GSI DCAP is 22128, you'll need to specify it on the
> command-line: i.e. gsidcap://gw04.hep.ph.ic.ac.uk:22128/

OK, we now can read and write through a GSI DCAP door on the admin node. Still no joy understanding what is going wrong with the GSI FTP door on the pool node though.

cheers, Owen.
On Thu, 19 May 2005 13:03:04 +0100 Matt Doidge <matt.doidge@GMAIL.COM> wrote:
> Hello,
> Thanks for all your replies. JAVA_LOCATION is pointing towards the
> correct directory, but I have no JAVA_HOME variable set anywhere (but
> it sounds like they're the same)..

To set an environment variable:

export JAVA_LOCATION=/usr/java/j2sdk1.4.2_04
export JAVA_HOME=/usr/java/j2sdk1.4.2_04/jre/

then to test this type

env | grep $JAVA_HOME

If you skip the export line this will not work, but echo $JAVA_HOME will, as the export keyword tells sh-derived shells that this variable should be exported to child processes.

> The dcache port range is set as;
> DCACHE_PORT_RANGE="20000 25000"
> this looks correct to me (although the syntax for the commented out
> DPM_PORT_RANGE is "20000,25000")
> interestingly this is set the same as GLOBUS_TCP_PORT_RANGE, there
> isn't some weird clash or port reservation going on is there?

Yes, that's why Globus Grid FTP does not run on D-Cache setups.

> checking the logs brings up, in dCache.log (and in utility.log), an
> interesting java message:
>
> Exception in thread "main" java.lang.NoClassDefFoundError: 25000,20000

That may be interesting.

> maybe I've got the inverted port range thing going on after
> all....will try changing it around in the site-info.def and see if
> that works (although i could have sworn i had the latest version of
> YAIM).

I should not worry too much about that.

Regards
Owen
This is an automated notification sent by LCG Savannah. It relates to: bugs #8777, project LCG Operations

==============================================================================
OVERVIEW of bugs #8777:
==============================================================================
URL: <http://savannah.cern.ch/bugs/?func=detailitem&item_id=8777>

Summary: the information system on the SE dcache node doesn't work
Project: LCG Operations
Submitted by: aforti
Submitted on: 2005-May-27 20:27
Category: Information Service
Severity: 5 - Average
Priority: 5 - Normal
Item Group: Malfunctioning
Status: None
Privacy: Public
Assigned to: lfield
Open/Closed: Open
Release: None
Reproducibility: Every Time
Effort: 0.00

_______________________________________________________

Hi,

the information system part is not working.

YAIM
====

1) /opt/lcg/yaim/functions/config_gip line 366

dn: GlueSARoot=$VO:${storage#${CE_CLOSE_SE1_ACCESS_POINT}/},GlueSEUniqueID=${SE_HOST},Mds-Vo-name=local,o=grid

doesn't produce the desired output. I've replaced it with this.

dn: GlueSARoot=$VO:${storage},GlueSEUniqueID=${SE_HOST},Mds-Vo-name=local,o=grid

2) Some glue schema fields are hardcoded in the script. They should go into a configuration file/template that can be safely edited, editing scripts is not a good practise. Example of fields:

GlueSAPolicyFileLifeTime: permanent
GlueSAPolicyMaxFileSize: 10000
GlueSAPolicyMinFileSize: 1
GlueSAPolicyMaxData: 100
GlueSAPolicyMaxNumFiles: 10
GlueSAPolicyMaxPinDuration: 10
GlueSAPolicyQuota: 00
GlueSAStateAvailableSpace: 1

3) the dynamic part of the ldif contained in /opt/lcg/var/gip/tmp/lcg-info-dynamic-se.ldif.XXXX doesn't get created (the file is 0 size)

IS
==

lcg-info-dynamic-dcache seems to have the wrong srm command to get the info about the available space. In any case the available and used space are not published correctly.

SRM
===

srm commands complain if $HOME/.srmconfig/config.xml doesn't exist even when given the -conf=config_file option. This is really annoying. Anyway, to go back to the IS, when this command is run it fails because of this and because it doesn't seem to like the host certificates.

[root@bohr0013 root]# SRM_PATH=/opt/d-cache/srm /opt/d-cache/srm/bin/srm-storage-element-info -conf=/opt/d-cache/srm/conf/config.xml https://bohr0013.tier2.hep.man.ac.uk:8443/srm/infoProvider1_0.wsdl

AxisFault
  faultCode: {http://schemas.xmlsoap.org/soap/envelope/}Server
  faultSubcode:
  faultString: org.dcache.srm.SRMAuthorizationException: can not determine username from GlobusId=/C=UK/O=eScience/OU=Manchester/L=HEP/CN=bohr0013.tier2.hep.man.ac.uk/E=alessandra.forti@manchester.ac.uk
  faultActor:
.................................

plus another ton of java errors.

I've attached the config_gip I modified. This worked for my dcache installation (apart from the used and available space).

thanks
cheers
Alessandra

_______________________________________________________

File Attachments:
-------------------------------------------------------
Date: 2005-May-27 20:27  Name: config_gip  Size: 13.8KB  By: aforti
modifed config_gip for dcache to publish more correct info
<http://savannah.cern.ch/bugs/download.php?item_id=8777&item_file_id=883>

==============================================================================
This item URL is:
<http://savannah.cern.ch/bugs/?func=detailitem&item_id=8777>

_______________________________________________
Message sent via/by LCG Savannah
http://savannah.cern.ch/
Hi,

1) Are there any dcache/srm tools to delete files or should I use LD_PRELOAD and go with unix commands?
1a) Do I have to delete in /pnfs...?
1b) What happens if I delete files on the pools?
1c) Is there a way to recover if the pnfs db goes out of sync with the file systems on the pools?
2) Is there a way to bind GSI DCAP to one NIC and gridftp to the other?
3) How can I disable dcap and leave only GSI DCAP?
4) Is it possible to avoid direct access to the pool nodes? i.e. so that the pool node opens the connection with the client when the request arrives from the head node through srm? I guess this might be possible only with ftp though.

thanks
cheers
Alessandra
Hi all,

At last we have gridftp to the pool node working :)))))))

Inspired by Alessandra's 4 line configuration of a pool node we reinstalled our test pool node and tried again...

==========================================================================
STEP 1: Install pool node using the yaim script

mkdir /pool-path1

# We guessed this step from an obscure message in the gridftpdoor logs on the pool node after a failed transfer:
mkdir /tmp/dcache-ftp-tlog

/opt/lcg/yaim/scripts/install_node site-info-dcache.def lcg-SEDCache
mv *.pem /etc/grid-security
/opt/lcg/yaim/scripts/configure_node site-info-dcache.def SE_dcache

Then on the pool node we added something like the following to /etc/fstab:

gw04.hep.ph.ic.ac.uk:/fs /pnfs/fs nfs hard,intr,rw,noac,auto 0 0

mkdir -p /pnfs/fs
mount -a

Set the /opt/d-cache/etc/door_config file on the pool node to

ADMIN_NODE gw04.hep.ph.ic.ac.uk

door     active
--------------------
GSIDCAP  no
GRIDFTP  yes
SRM      no

sh /opt/d-cache/install/install_doors.sh
ln -s /opt/d-cache/bin/dcache-opt /etc/init.d
chkconfig dcache-opt on
service dcache-opt start

========================================================
STEP 2: Change the pool size on the pool node (this step was necessary for us as the auto install set the pool size to 0GB)

Edit the file /pool-path1/pool/setup:

set max diskspace 600m

service dcache-pool restart

On the admin node:

/sbin/service dcache-opt restart
/sbin/service dcache-core restart

And it works! Thanks to all for the various steps you have all contributed to the above! We will start our production setup from Monday.

Thanks once again,
Mona and Owen
Hi all,

Sorry for replying to my own stuff. Just a couple of additions.

I strongly suggest that before installing Dcache using yaim you check that your first search domain is set to `hostname -d`. Otherwise, your yaim installation will break, showing PnfsManager Offline, even though pnfs will be mounted. We have MY_DOMAIN=gridpp.rl.ac.uk set in the site-info.def. For example, on our headnode dev03:

$ hostname -d
gridpp.rl.ac.uk
$ cat /etc/resolv.conf
search gridpp.rl.ac.uk
nameserver 130.246.xxx.yyy
nameserver 130.246.xxx.yyy
nameserver 130.246.xxx.yyy

We'll soon publish patches so that this dependency is removed.

If you run into this problem, your install can still be saved without reinstalling the entire machine from scratch. Here's a short howto:

cd /pnfs
ln -s fs/usr `hostname -d`
/opt/d-cache/bin/grid-mapfile2dcache-kpwd
echo `hostname -d` > /pnfs/fs/admin/etc/config/serverId

As the unwary sysadmin could be held back for days not knowing what /pnfs/fs should look like, I've attached an example. Please note especially the "gridpp.rl.ac.uk -> fs/usr/" link.

This is a major bug. The yaim Dcache installation uses *three* methods of figuring out the domain of your admin node: 1) MY_DOMAIN, 2) grepping /etc/resolv.conf for search/domain, 3) `hostname -d`. This needs to be synchronised, ideally using the MY_DOMAIN variable in site-info.def.

Thanks.

Regards
Jiri and Owen
Hi Owen,

thanks. Just in case you want to add it to your set of patches to report, I've attached my config_gip. I don't think it is general enough to propose as a real patch because there is only one SE section and I don't know how they want to integrate dcache, dpm and classic SE together. It gives anyway a good idea of what I think needs to be changed for dcache, according to the Tier1 lcg-info. Other changes I think should be done are:

1) Some glue schema fields are hardcoded in the script. They should go into a configuration file/template that can be safely edited; editing scripts is not a good practice. Example of fields:

GlueSAPolicyFileLifeTime: permanent
GlueSAPolicyMaxFileSize: 10000
GlueSAPolicyMinFileSize: 1
GlueSAPolicyMaxData: 100
GlueSAPolicyMaxNumFiles: 10
GlueSAPolicyMaxPinDuration: 10
GlueSAPolicyQuota: 00
GlueSAStateAvailableSpace: 1

2) the dynamic part of the ldif contained in /opt/lcg/var/gip/tmp/lcg-info-dynamic-se.ldif.XXXX doesn't get created (the file is 0 size). I can't find what is missing right now and I'm a bit tired, tomorrow I'll look better. I think that it is the wrapper that doesn't get called or run by some other program. If anyone has any idea.... let me know.

cheers
Alessandra

On Wed, 25 May 2005, Owen Synge wrote:

> Steve T gave me some more support on this issue, of publishing the
> incorrect information. It should help you all. I attached the
>
> /opt/lcg/var/gip/lcg-info-generic.conf
>
> earlier in the thread, which comes from the tier one D-Cache install.
>
> From Laurence Field's home page, some generic info on "gip" which is a
> very "Laurence" name for an application:
>
> http://lfield.home.cern.ch/lfield/cgi-bin/wiki.cgi?area=gip&page=documentation
>
> This is the command for regenerating the information
>
> /opt/lcg/sbin/lcg-info-generic-config /opt/lcg/var/gip/lcg-info-generic.conf
>
> and should be done whenever /opt/lcg/var/gip/lcg-info-generic.conf is
> changed; then the information provider uses
>
> su - edginfo -c '/opt/lcg/libexec/lcg-info-wrapper'
>
> to launch the script hierarchy.
>
> Regards
>
> Owen
>
> On Wed, 25 May 2005 14:41:16 +0100
> Alessandra Forti <Alessandra.Forti@MANCHESTER.AC.UK> wrote:
>
>> Thanks, better than parsing ldap. :)
>>
>> cheers
>> alessandra
>>
>> On Wed, 25 May 2005, Owen Synge wrote:
>>
>>> On Wed, 25 May 2005 14:28:04 +0100
>>> Alessandra Forti <Alessandra.Forti@MANCHESTER.AC.UK> wrote:
>>>
>>>> Hi,
>>>>
>>>> I think there is a bug in
>>>>
>>>> /opt/lcg/yaim/functions/config_gip
>>>>
>>>> line 366
>>>>
>>>> dn: GlueSARoot=$VO:${storage#${CE_CLOSE_SE1_ACCESS_POINT}/},GlueSEUniqueID=${SE_HOST},Mds-Vo-name=local,o=grid
>>>>
>>>> doesn't produce the desired output. I've replaced it with this.
>>>>
>>>> dn: GlueSARoot=$VO:${storage},GlueSEUniqueID=${SE_HOST},Mds-Vo-name=local,o=grid
>>>>
>>>> I'm looking for other things that might need change.
>>>>
>>>> cheers
>>>> alessandra
>>>
>>> Great, I shall add it to my patch collection. Attached is Steve's
>>> reference MDS provider written by hand for the tier 1.
>>>
>>> Regards
>>>
>>> Owen S
>>>
>>> PS I went to a party for may day which had 4 people called Owen in
>>> one party, an all time record for me.
Hi,

I am going to have a look at stopping dcache from having access to the whole system. Does anyone have an idea which paths each java process needs access to? I am thinking something like:

java_options="... -Djava.security.policy=dcap.policy ..."

and in dcap.policy something like this for each path that it needs access to:

grant {
  permission java.io.FilePermission "/poolpath", "read";
  permission java.io.FilePermission "/poolpath", "write";
  permission java.io.FilePermission "/poolpath", "delete";
};

Cheers,
Kostas
gstat monitors your BDII information; if you follow the links to your site you should find an SRM endpoint if you followed the BDII configuration thread between Ale

1 - Getting the MDS service to correctly advertise the SE as an SRM:

Have you tried the solution that I proposed earlier?

/opt/lcg/sbin/lcg-info-generic-config \
    /opt/lcg/var/gip/lcg-info-generic.conf

This should be done whenever /opt/lcg/var/gip/lcg-info-generic.conf is changed; then the information provider uses

su - edginfo -c '/opt/lcg/libexec/lcg-info-wrapper'

Included is the lcg-info-generic that was hand edited by Steve T for the tier 1. I should do a comparison of diffs with your current set up and see if your SRM registers correctly. We are behind firewalls so can't test the fixes adequately here, but it seems to work for Alessandra.

Regards
Owen
Dear all,

another version of the yaim Dcache installation is ready.

http://storage.esc.rl.ac.uk/patches/yaim/yaim-2_4_0-4-gpp-0.3.diff.gz

Again, nothing earth-shattering, but it could save you (especially Imperial that requested this particular feature) some time. Please take a look at

http://storage.esc.rl.ac.uk/patches/yaim/ChangeLog

Note that if you use this patch, you'll need to change your existing site-info.def DCACHE_POOLS variable in the following fashion:

original (e.g.)
DCACHE_POOLS="dev02.$MY_DOMAIN:/pool dev06.$MY_DOMAIN:/pool"

new (note the extra colon!)
DCACHE_POOLS="dev02.$MY_DOMAIN::/pool dev06.$MY_DOMAIN::/pool"
(Sorry about that, I'll fix that soon so that you don't need to do this.)

or

DCACHE_POOLS="dev02.$MY_DOMAIN:10:/pool dev06.$MY_DOMAIN:20:/pool"

if you want to limit the pool size on dev02 to 10GB and dev06 to 20GB.

I've added support for multiple pools on one machine to the yaim support. You could do (e.g.):

DCACHE_POOLS="dev02.$MY_DOMAIN:10:/pool/1 dev02.$MY_DOMAIN:10:/pool/2 dev06.$MY_DOMAIN:20:/pool"

which will give you two 10GB pools on dev02. I'll look into assigning specific pools to specific VOs soon (hopefully).

Thanks, good luck.

-- Jiri
On Mon, May 30, 2005 at 10:41:30AM +0100, Greig A Cowan wrote:
> Hi Kostas,
>
> > What I did for our pool node was to download the d-cache rpms and
> > install them by hand. Configuration is trivial, you only have to edit 5
> > files, here is the list from the d-cache instructions:
>
> Thanks for the email. I had started doing a dCache install by hand but
> with all the focus being on the yaim method, I thought it would be good to
> give it a try. I am thinking about going back to the manual install, but
> surely I still need all of the edg, vdt, perl and postgresql installed as
> well? Surely if the yaim install failed due to unmatched dependencies on
> these, the same will happen with a manual install?

No, you don't need *any* of that. You only need to be able to create a dcache.kpwd file, and you can get away with a cron job that copies it from the admin node.

Yaim is *really* over-zealous and it installs more or less everything. For example, our admin node ended up with the SE version of the gridftp server installed and enabled to run!!! It also installed a "random" version of postgres that someone downloaded from the web, although RHEL provides and supports postgres (who is responsible for security updates for that version now?).

As you can imagine, I am not a huge fan of scripts that try to do my job when at the end I end up doing more work. If you have a look at what yaim does, you'll realise that it is trivial and you can easily do it yourself, gaining some useful knowledge in the process about how everything works and how to fix it if something goes wrong.

</rant>

Cheers,
Kostas
Hi,

I am really worried about two problems with d-cache:

1) At the moment there isn't a way (that I know of) to find which user uploaded which files. Without that ability the server is useless for anything in the grid world.

2) I could be mistaken here, but from what I know about the Java Globus implementation that d-cache is using, there is no CRL support. If this is the case it means that revoked certificates can still be used to access the server, and this is unacceptable.

I am not sure if we will be able to deploy d-cache if these problems aren't solved since it's a clear violation of Imperial's policy. We might be able to get permission to run it but I won't bet on it.

Kostas
Hi,

I've played a little bit with my Dcache admin node. Unless I've missed something blatantly obvious (in which case thanks in advance for pointing it out), there seems to be no straightforward way of disabling GSI DCAP and dcap on admin nodes.

Running /opt/d-cache/install/install_doors.sh with GSI DCAP disabled on the admin node resulted in *all* doors being down, not just the selected ones. After a few hours of trying to fix this, I decided to reinstall... I wouldn't recommend running this script on your admin nodes.

OK, here's how I've disabled dcap and GSI DCAP on my admin node. The bad news is that it is a hack, the good news is that you can reverse the process should you wish to re-enable the doors.

Disabling dcap
~~~~~~~~~~~~~~
In /opt/d-cache/config/door.batch I've commented out the following bunch of lines (70--82, counting the first line as 1, not 0):

create dmg.cells.services.login.LoginManager DCap \
            "${dCapPort} \
             diskCacheV111.doors.DCapDoor \
             -keepAlive=300 \
             -poolRetry=2700 \
             -prot=telnet -localOk \
             -truncate=${truncate} \
             -maxLogin=1500 \
             -brokerUpdateTime=30 \
             -protocolFamily=dcap \
             -protocolVersion=3.0 \
             -poolProxy=PoolManager \
             -loginBroker=LoginBroker"

Disabling GSI DCAP
~~~~~~~~~~~~~~~~~~

mv /opt/d-cache/jobs/gsidcapdoor /opt/d-cache/jobs/gsidcapdoor.disabled

Then I brutally restarted the entire dCache using my last-resort Dcache script, as restarting only -opt/-core wouldn't help (doors kept reappearing after I killed them and/or restarted).

Check that dcap is not running:

lsof -i tcp:22125 | tail -n 1 | awk '{print $2}'

should return no PID.

Check that gsidcap is not running:

lsof -i tcp:22128 | tail -n 1 | awk '{print $2}'

should return no PID.

Hope that helps.

-- Jiri
> Hi,
> I've played a little bit with my Dcache admin node.
> Unless I've missed something blatantly obvious (in which case
> thanks in advance for pointing it out) there seems to be
> no straightforward way of disabling GSIDCAP and dcap on admin
> nodes.

In terms of the dcache-opt service there is no difference between an "admin" node and a "pool" node, so install_doors.sh should work just fine for disabling GSIDCAP on the admin node.

> Running /opt/d-cache/install/install_doors.sh with GSIDCAP
> disabled on the admin node resulted in *all* doors being
> down, not just the selected ones. After a few hours of trying to
> fix this, I decided to reinstall... I wouldn't recommend running
> this script on your admin nodes.

How were you determining that the doors were down? If you've just installed the dcache-opt rpm on the admin node then the doors will be SRM, Dcap-GSI and GFTP. If you've run the install_doors script they'll be SRM-<host>, Dcap-GSI-<host> and GFTP-<host> and won't show as up on the web interface. If you want them to, you can edit the bottom of /opt/d-cache/config/httpd.batch and then restart the httpd service with

/opt/d-cache/jobs/httpd stop; /opt/d-cache/jobs/httpd -logfile=/opt/d-cache/log/http.log start

What I suspect happened is that you had the dcache-opt service running when you ran the install_doors script. The install_doors script changes the /etc/init.d script, so it was probably trying to stop services that weren't started and the ones that were started got lost.

> OK, here's how I've disabled dcap and GSI DCAP on my admin node. The
> bad news is that it is a hack, the good news is that you can reverse
> the process should you wish to re-enable the doors.
>
> Disabling dcap
> ~~~~~~~~~~~~~~
> In /opt/d-cache/config/door.batch
> I've commented out the following bunch of lines (70--82
> counting the first line as 1, not 0).

Or you could comment out the 2 lines that start and stop the door service in /etc/init.d/dcache-core.

> Then I brutally restarted the entire dCache using my
> last-resort dCache script as restarting only -opt/-core wouldn't help
> (doors kept reappearing after I killed them and/or restarted).

Each java process is started by a script; if the script sees its child java process end, it'll start another, so kill the scripts first:

# ps aux | grep srm
root 19909  0.0  0.0   4348   1228 ?  S  May29   0:00 /bin/sh /opt/d-cache/jobs/srm-dcache -logfile=/opt/d-cache/log/srm-dcache.log start
root 19910 22.0 19.6 631948 404012 ?  S  May29 461:39 /usr/java/j2sdk1.4.2_08/bin/java -server -Xmx384m -XX:MaxDirectMemorySize=384m -Dorg.globus.tcp.port.range=50000,52000 dmg.cells.services.Domain srm-dcacheDomain -param setupFile=/opt/d-cache/config/srm-dcacheSetup ourHomeDir=/opt/d-cache ourName=srm-dcache

Derek
> > Hi,
> >
> > I am really worried about two problems with d-cache
> >
> > 1) At the moment there isn't a way (that i know off) to
> > find which user upload which files. Without that ability
> > the server is useless for anything in the grid world.
>
> For gridftp:
>
> In the /opt/d-cache/config/gridftp-<host>.batch, change the first line from
>
> set printout 2
>
> to
>
> set printout 3
>
> and then restart the dcache-opt service. This increases the verbosity of the gridftp door and includes the DN in the output.

Thanks, this seems to work, although the verbosity is increased to an insane level :(

$ ls -al gridftpdoor-sedsk00.log
-rw-r--r--  1 root root 4562 Jun  1 17:14 gridftpdoor-sedsk00.log
$ uberftp -a gsi -H sedsk00 dir
...
$ ls -al gridftpdoor-sedsk00.log
> Hi Derek,
>
> copying to the srm it sleeps for longer and longer intervals of time. Here
> is the output:

In terms of the error, this suggests that the srm is up but the internals of retrieving files are broken. What is happening is that first the file is requested and given a request ID:

> Thu Jun 02 14:58:54 BST 2005: srm returned requestId = -2147482384

but after this the srm module is waiting for other D-Cache modules to state that the data is ready to be transferred. They are not talking to each other correctly. I have never seen this outside coding with srmcp; I suggest that the pnfs database is corrupt or broken.

Regards
Owen
copying to the srm it sleeps for longer and longer intervals of time. Here is the output:

[aforti@bohr0003 aforti]$ ./srmcp.sh pippo pippo14
pippo ===> pippo14
SRM
srmcp file:///./pippo srm://bohr0013.tier2.hep.man.ac.uk:8443//pnfs/tier2.hep.man.ac.uk/data/dteam/pippo14.srm
SRM Configuration:
        debug=true
        gsissl=true
        help=false
        pushmode=false
        userproxy=true
        buffer_size=2048
        tcp_buffer_size=0
        config_file=/home/aforti/.srmconfig/config.xml
        glue_mapfile=/opt/d-cache/srm/conf/SRMServerV1.map
        webservice_path=srm/managerv1.wsdl
        webservice_protocol=https
        gsiftpclinet=globus-url-copy
        protocols_list=http,gsiftp
        save_config_file=null
        srmcphome=/opt/d-cache/srm
        urlcopy=/opt/d-cache/srm/bin/url-copy.sh
        x509_user_cert=/home/aforti/.globus/usercert.pem
        x509_user_key=/home/aforti/.globus/userkey.pem
        x509_user_proxy=/tmp/x509up_u500
        x509_user_trusted_certificates=/etc/grid-security/certificates
        retry_num=3
        retry_timeout=1000
        wsdl_url=null
        use_urlcopy_script=true
        connect_to_wsdl=false
        from[0]=file:///./pippo
        to=srm://bohr0013.tier2.hep.man.ac.uk:8443//pnfs/tier2.hep.man.ac.uk/data/dteam/pippo14.srm
Thu Jun 02 14:58:51 BST 2005: starting SRMPutClient
Thu Jun 02 14:58:51 BST 2005: SRMClient(https,srm/managerv1.wsdl,true)
Thu Jun 02 14:58:51 BST 2005: connecting to server
Thu Jun 02 14:58:51 BST 2005: connected to server, obtaining proxy
SRMClientV1 : connecting to srm at httpg://bohr0013.tier2.hep.man.ac.uk:8443/srm/managerv1
Thu Jun 02 14:58:52 BST 2005: got proxy of type class org.dcache.srm.client.SRMClientV1
SRMClientV1 : put, sources[0]="./pippo"
SRMClientV1 : put, dests[0]="srm://bohr0013.tier2.hep.man.ac.uk:8443//pnfs/tier2.hep.man.ac.uk/data/dteam/pippo14.srm"
SRMClientV1 : put, protocols[0]="http"
SRMClientV1 : put, protocols[1]="dcap"
SRMClientV1 : put, protocols[2]="gsiftp"
SRMClientV1 : put, contacting service httpg://bohr0013.tier2.hep.man.ac.uk:8443/srm/managerv1
doneAddingJobs is false
copy_jobs is empty
Thu Jun 02 14:58:54 BST 2005: srm returned requestId = -2147482384
Thu Jun 02 14:58:54 BST 2005: sleeping 1 seconds ...
Thu Jun 02 14:58:55 BST 2005: sleeping 4 seconds ...
Thu Jun 02 14:58:59 BST 2005: sleeping 4 seconds ...
Thu Jun 02 14:59:04 BST 2005: sleeping 4 seconds ...
Thu Jun 02 14:59:08 BST 2005: sleeping 4 seconds ...
Thu Jun 02 14:59:12 BST 2005: sleeping 4 seconds ...
Thu Jun 02 14:59:17 BST 2005: sleeping 7 seconds ...
Hi Alessandra,

Are any Put Requests listed when you do an "ls -put" in the SRM module in the admin interface? It's case sensitive; the module is called SRM (or possibly SRM-${hostname of srm node}).

If there are, does cancelling them ("cancel all -put .*" to cancel them all) allow a new transfer to work?

Derek
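For anyone who has not used the admin interface before, a hedged sketch of such a session (the ssh port 22223 and the admin user are the usual dCache 1.x defaults, so treat them as assumptions for your install):

ssh -p 22223 -l admin <admin-node-hostname>   # dCache admin shell
cd SRM                # or SRM-<hostname of srm node>
ls -put               # list pending put requests
cancel all -put .*    # cancel them all
..                    # leave the cell
logoff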
On Mon, 6 Jun 2005, Ross, D (Derek) wrote:
> Hi,
>
> The file I sent only reduces lifetimes for get requests to 1 hour; the lifetimes for put and copy requests were left at the default of 24 hours.
>
> Add this line to the SRM part of the srm.batch file to reduce put lifetimes to 12 hours:
>
> -put-lifetime=43200000 \
>
> Note that this only applies to new requests, old requests will still have the old lifetime.
>
> Derek

Hi Derek,

what is the unit of time, milliseconds?

cheers
alessandra
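(For reference: 12 hours = 12 x 60 x 60 x 1000 ms = 43200000 ms, which matches the value above, so milliseconds looks right.)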
On Tuesday 07 June 2005 14:52, Matt Doidge wrote:
> hello,
>
> I'm planning to stick a new pool node onto our existing dcache set-up,
> which should be a simple enough process, but looking at the d-cache
> set-up it mentions editing the poollist file, which we have but it is
> empty!
> we have a:
> /opt/d-cache/config/fal-pygrid-20.poollist

I know that this file got set up by YAIM when we did it at Glasgow....

> but there is nothing in it. Our d-cache system seems to be working
> fine, it was set up using yaim.

...so that's weird. What _should_ be in it, for a pool, is described at

http://www.physics.gla.ac.uk/gridpp/datamanagement/index.php/ScotgridDcacheDiskPoolAdd

(and certainly worked for us).

> We want to add a node to the setup
> that's almost exactly the same as the one already attached, (6 pools
> on 6 separate partitions on one node). Would it be as simple as
> repeating the normal pool installation process with a slightly modified
> site-info.def file (with the extra pools added) or is it something
> more subtle?

That does work for adding new pool nodes. The site-info.def is only used by YAIM, not by Dcache once it's running. See

http://www.physics.gla.ac.uk/gridpp/datamanagement/index.php/ScotgridDcachePoolNodeAdd

Hope that helps
Graeme
Hi Owen(s),

Mapping pools to VOs is mentioned in the how-to, on this page:

http://storage.esc.rl.ac.uk/documentation/html/D-Cache-Howto/ar01s11.html

The relevant part is:

Set the storage group for each VO dir (there are 4 spaces between StoreName and the VO name, don't know if it's significant):

cd ${vo}
echo "StoreName    ${vo}" >".(tag)(OSMTemplate)"
echo ${vo} > ".(tag)(sGroup)"
cd ..

Set up pool groups and directory affinities; for each VO add the following lines to /opt/d-cache/config/PoolManager.conf:

psu create pgroup ${vo}-pgroup
psu create unit -store ${vo}:${vo}@osm
psu create ugroup ${vo}
psu addto ugroup ${vo} ${vo}:${vo}@osm
psu create link ${vo}-link world-net ${vo}
psu add link ${vo}-link ${vo}-pgroup
psu set link ${vo}-link -readpref=10 -writepref=10 -cachepref=10

Which makes perfect sense to me (I wrote it :-) ), but it is just a wee bit terse. Here goes with a slightly more verbose version:

To map a VO to a pool, firstly you have to tag the directory in the pnfs file system that the VO will use. The tags will be inherited by any directory created under the tagged directory after it has been tagged. To tag a directory, change into it and run the following commands:

echo "StoreName ${vo}" >".(tag)(OSMTemplate)"
echo ${vo} > ".(tag)(sGroup)"

where ${vo} is the name of the VO, e.g. dteam. Note that although we use the same name both times here, it isn't necessary to do so; for instance the Tier 1 has a dteam directory where the .(tag)(sGroup) contains the word tape, and this is used to map to a separate set of pools for access to the Atlas Data Store.

The second part of configuring mappings between VOs and pools involves the PoolManager. If your dcache instance is halted then you can add the commands to /opt/d-cache/config/PoolManager.conf on the admin node, otherwise they should be entered into the PoolManager module of the admin interface, remembering to finish with save to write the configuration to disk.

psu create pgroup ${vo}-pgroup
psu create unit -store ${vo}:${vo}@osm
psu create ugroup ${vo}
psu addto ugroup ${vo} ${vo}:${vo}@osm
psu create link ${vo}-link world-net ${vo}
psu add link ${vo}-link ${vo}-pgroup
psu set link ${vo}-link -readpref=10 -writepref=10 -cachepref=10

Note that most of the names of things in the above commands are convention, and there is no requirement to actually follow this scheme. The first command creates a pool group; this is exactly what it sounds like: a group of pools. The second command defines a unit; this is something that matches against a property of the incoming request, in this case the storage information of where the file should be written. The names in this command do matter: they should match those used to tag the directory earlier, with the name used in the .(tag)(OSMTemplate) coming first. The third command creates a unit group, which is just a group of units. The fourth command adds the unit created to the new unit group. The fifth command creates a link, which is the mapping between incoming requests and destination pools, and adds two unit groups to it: world-net is an existing unit group that matches requests coming from any ip address, and the second unit group is the one just created. The sixth command adds the pool group created to the new link. The seventh command sets various properties of the link.

Once all those commands are done,

psu addto pgroup ${vo}-pgroup <poolname>

will add a pool to the pool group.
If this pool is not for all vos to access, you may wish to remove it from the default pool group with psu remove from default <poolname>, to ensure that files from other VOs cannot get written to that pool. Note that a pool can belong to more than one pool group, so it is perfectly possible to have two VOs writing to the same pool, however there is no way to stop one VO using all of the space in the pool. Hope this helps, Derek
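To make the recipe above concrete, here it is instantiated for the dteam VO with a hypothetical pool called pool1_01 (pure substitution of ${vo}, nothing new):

psu create pgroup dteam-pgroup
psu create unit -store dteam:dteam@osm
psu create ugroup dteam
psu addto ugroup dteam dteam:dteam@osm
psu create link dteam-link world-net dteam
psu add link dteam-link dteam-pgroup
psu set link dteam-link -readpref=10 -writepref=10 -cachepref=10
psu addto pgroup dteam-pgroup pool1_01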
On Thu, 9 Jun 2005 16:51:02 +0100 Greig A Cowan <g.cowan@ED.AC.UK> wrote:
> Hi Derek,
>
> > Have you tried starting the srm with the pool turned off, i.e.
> >
> > service dcache-pool stop
> > service dcache-opt start
> >
> > Does that log anything different in srm.log?
>
> SRM is now on line!! Surprisingly, there are now no error messages in
> srm.log.

Do you know what happened here? I am curious because I don't want to have these problems again. Do you have any record of how you kicked the service into behaving?

Regards
Owen
Hi,

At IC we are observing a strange behaviour of srmcopy. Diskpool-1 and Diskpool-2 are both running gsiftp. However, the copy process randomly selects gsiftp on diskpool-1 to copy files to both of the disk pools. Instead, the gsiftp on diskpool-1 should be used for copying files onto diskpool-1, and similarly the gsiftp running on diskpool-2 should be used for copying files to diskpool-2. Please suggest if any configuration changes are needed.

Regards,
Mona

No, this is the correct behaviour. Selection of the gridftp door is separate from selection of the pool. The fact that there may be a gridftp door on the same host as a pool isn't taken into account anywhere. I'm afraid you'll just have to live with it.

Derek
On Mon, Jun 13, 2005 at 02:08:14PM +0100 or thereabouts, Owen Synge wrote:
> On Mon, 13 Jun 2005 10:36:46 +0100
> Jiri Mencak <j.mencak@RL.AC.UK> wrote:
>
> > Hi,
> > first of all my apologies to old-timers for stating the obvious.
> >
> > If your application to join dteam has not been approved yet and you
> > want to test your dcache installation, you can add your DN to
> > /etc/grid-mapfile manually and run
> > /opt/d-cache/bin/grid-mapfile2dcache-kpwd.
> >
> > Regards.
> >
> > --
> > Jiri
>
> Sorry to spell things out to this degree, but I am sure it is better to clarify all the details exactly. Unfortunately I don't have a personal test CA.
>
> OK, so to add a new user to the grid map file do the following:
>
> 1 Get the DN of the cert you wish to add; for me this cert DN is
>
> "/C=UK/O=eScience/OU=CLRC/L=RAL/CN=owen synge"
>
> 2 Decide which of the supported virtual organisations (VOs) is appropriate. I am within the group of developer/systemadmin/deployment experts so I chose, as I imagine you all will, "dteam".
>
> 3 Add a single line to the gridmap file in the following format
>
> ${DN} .${VO}

If you want to add single extra members then you change /opt/edg/etc/grid_mapfile_local. edg-mkgridmap constructs /etc/grid-security/grid-mapfile out of /opt/edg/etc/edg-mkgridmap.conf and the local one above.

Steve

> i.e. the following line as below
>
> "/C=UK/O=eScience/OU=CLRC/L=RAL/CN=owen synge" .dteam
>
> 4 Use the method for pulling the gridmap file into the D-Cache system:
>
> /opt/d-cache/bin/grid-mapfile2dcache-kpwd
>
> This will affect all D-cache components including GridFTP access. With Globus-based solutions this stage is unnecessary.
>
> Regards
>
> Owen S

--
Steve Traylen s.traylen@rl.ac.uk http://www.gridpp.ac.uk/
Hi Matt,

> Is anyone else continuing to have SRM troubles?

Yes. I'm having issues using globus-url-copy; I'm investigating them just now.

> whilst adding some
> new pools to my setup I noticed the srm had dropped offline, probably
> whilst the system was rebooted t'other day. I tried restarting the
> admin node, had no luck, tried a few more tricks and eventually
> resorted to a reinstall of dcache. Thankfully it's all working again
> now by the looks of it, with the additional pools included in to the
> set up. But it was very annoying to have to do. Any ideas why the SRM
> is reluctant to become on line sometimes?

I always find this problem. What works for me is to start and stop various dcache services on the admin node:

service dcache-pool stop
service dcache-opt restart
service dcache-pool start

It seems that the pool services can break SRM, so turn them off first. I appear to be able to leave the pool node alone when doing this and everything comes back on line again.

> Which comes to my second question, when we have our SE officially up
> and running is there a procedure for rescuing the postgres database
> and other critical information if it all goes pear shaped and the
> admin node needs reinstalling or some similar drastic action? The only
> reason I could do my cowboy fix today is that our SE isn't in use
> yet....

Don't know about this....surely there is some way of doing this?

Cheers,
Greig
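On that second question, nothing dCache-specific was suggested in this thread, but a plain pg_dump of the SRM's postgres database plus a copy of the configuration is the obvious starting point (a sketch only; the database name, user and paths here are assumptions for your installation):

# nightly dump of the SRM/billing postgres database, e.g. from cron
pg_dump -U postgres dcache > /var/backups/dcache-$(date +%Y%m%d).sql
# keep a copy of the dCache configuration as well
tar czf /var/backups/dcache-config-$(date +%Y%m%d).tar.gz /opt/d-cache/config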
On Thu, 16 Jun 2005 11:47:30 +0100 Jamie Kelvin Ferguson <jfergus7@PH.ED.AC.UK> wrote:
> Hi Andrew,
>
> You seem to have a pretty comprehensive knowledge of srm's. Do you know of
> any way, either by querying the info system or otherwise, to automatically
> detect if a site is srm or classic SE?

It's available through BDII; I shall find out more details soon.

> And is there a way to evaluate how much available/used space exists for
> each flavour of storage?

This information is available for Classic SEs but is not yet available for D-cache.

> Will the new middleware accommodate sites advertising the above values?

Yes, but we have yet to get this working.

> Any help would be appreciated.
>
> Cheers,
> Jamie Ferguson.

Regards
Owen S
Hi Jamie,

if you want to know about everyone in the world you can parse the output of this command:

ldapsearch -x -H ldap://lcgbdii02.gridpp.rl.ac.uk:2170 -b o=grid

There is also an srm command to get the information out of each srm:

/opt/d-cache/srm/bin/srm-storage-element-info https://<SRM-NODE>:8443/srm/infoProvider1_0.wsdl

cheers
alessandra
This might help:

ldapsearch -H ldap://lcgbdii02.gridpp.rl.ac.uk:2170 -x \
    -b 'mds-vo-name=local,o=grid' '(GlueSEName=*:srm*)'

vs.

ldapsearch -H ldap://lcgbdii02.gridpp.rl.ac.uk:2170 -x \
    -b 'mds-vo-name=local,o=grid' '(GlueSEName=*:disk)'
Though this search:

ldapsearch -H ldap://lcgbdii02.gridpp.rl.ac.uk:2170 -x \
    -b 'mds-vo-name=local,o=grid' '(&(GlueSEName=*:disk)(GlueSEPort=8443))'

does raise the question as to whether GlueSEPort is actually used!? In principle, you ought to be able to run a gsiftp server on 8443 and an srm on 2811, I'd imagine, *as long as* you advertised it correctly. But the fact that perfectly functional Classic SEs are advertising 8443 while actually running their gsiftp server on 2811 (I checked a few) seems to show that the GlueSEPort isn't being picked up by the lcg replica management software. I suspect a yaim default here...
There's also GlueSEType but it isn't really reliable (or even defined...)

SRM:

> [maroney@gfe03 RB]$ ldapsearch -LLL -H ldap://lcgbdii02.gridpp.rl.ac.uk:2170 -x -b 'mds-vo-name=local,o=grid' '(GlueSEName=*:srm*)' | grep GlueSEName | wc
>      14      28     438
> [maroney@gfe03 RB]$ ldapsearch -LLL -H ldap://lcgbdii02.gridpp.rl.ac.uk:2170 -x -b 'mds-vo-name=local,o=grid' '(GlueSEName=*:srm*)' | grep GlueSEPort | sort | uniq -c
>      14 GlueSEPort: 8443
> [maroney@gfe03 RB]$ ldapsearch -LLL -H ldap://lcgbdii02.gridpp.rl.ac.uk:2170 -x -b 'mds-vo-name=local,o=grid' '(GlueSEName=*:srm*)' | grep GlueSEType | sort | uniq -c
>       2 GlueSEType: srm
>       2 GlueSEType: srm_v1

Classic SE:

> [maroney@gfe03 RB]$ ldapsearch -LLL -H ldap://lcgbdii02.gridpp.rl.ac.uk:2170 -x -b 'mds-vo-name=local,o=grid' '(GlueSEName=*:disk)' | grep GlueSEName | wc
>     147     294    4170
> [maroney@gfe03 RB]$ ldapsearch -LLL -H ldap://lcgbdii02.gridpp.rl.ac.uk:2170 -x -b 'mds-vo-name=local,o=grid' '(GlueSEName=*:disk)' | grep GlueSEPort | sort | uniq -c
>      27 GlueSEPort: 2811
>     120 GlueSEPort: 8443
> [maroney@gfe03 RB]$ ldapsearch -LLL -H ldap://lcgbdii02.gridpp.rl.ac.uk:2170 -x -b 'mds-vo-name=local,o=grid' '(GlueSEName=*:disk)' | grep GlueSEType | sort | uniq -c
>     117 GlueSEType: disk
>       1 GlueSEType: gsiftp
>       5 GlueSEType: srm
Hi,

Jamie Kelvin Ferguson wrote:
> Hi Owen,
>
> Thanks for that - that's what I'm looking for. I assume all sites
> with an srm will have the term 'srm' in their SE type?

I'm afraid not! The GlueSEType field doesn't seem to be used and isn't actually defined for most SRMs (in fact, to get the SRM at IC to publish properly I had to drop the field completely, so I guess it's deprecated even...). The GlueSEName is the reliable field, as the LCG replica manager reads what follows the ":" to decide which type the SE is ("disk" or "srm_v1" seem the only options...).

> Is there a way to tell which type of storage (permanent, durable,
> volatile) sites are offering. I.e. how does a site advertise this
> information?

There is a field advertising this separately for *each* separate storage area (GlueSARoot): GlueSAPolicyFileLifeTime, e.g.:

ldapsearch -LLL -H ldap://gfe02.hep.ph.ic.ac.uk:2135 -x \
    -b 'GlueSEUniqueID=gfe02.hep.ph.ic.ac.uk,mds-vo-name=local,o=grid' \
    GlueSARoot GlueSAPolicyFileLifeTime

> Also, Phil might have mentioned something about this. If a site
> advertises permanent space and it can be seen from the info. system
> that it has a capacity of say 5TB, can it be assumed that all 5TB of
> storage there is permanent, or is it possible for a site to divide up
> its type of storage so that only a fraction of that 5TB will be
> permanent space?

I wish I knew the answer to this! How do storage spaces map to SARoots map to VOs?

The GlueSARoot advertises only a single storage type. Possibly you could define different SARoots with different storage types. Also, at the moment, the GlueSARoot *seems* to be one-per-VO (and vice versa), but there is also a GlueSAAccessControlBaseRule. This would indicate that an SARoot might support more than one VO, though I can't imagine the use case for it...

cheers,
Owen M
This is correct; the only way around this is to run and publish two SRM end points. We now have dcache.gridpp.rl.ac.uk and, I think, dcache-tape.gridpp.rl.ac.uk. Each of them has a different storage root and published parameters, or rather it should, but we have not got around to that yet. Steve
> /opt/d-cache/srm/bin/srm-advisory-delete -debug=true srm://gfe02.hep.ph.ic.ac.uk:8443/pnfs/hep.ph.ic.ac.uk/data/dteam/testfilejune21
> SRM Configuration:
>         debug=true
>         gsissl=true
>         help=false
>         pushmode=false
>         userproxy=true
>         buffer_size=2048
>         tcp_buffer_size=0
>         config_file=/home/aggarwa/.srmconfig/config.xml
>         glue_mapfile=/opt/d-cache/srm/conf/SRMServerV1.map
>         webservice_path=srm/managerv1.wsdl
>         webservice_protocol=https
>         gsiftpclinet=globus-url-copy
>         protocols_list=http,gsiftp
>         save_config_file=null
>         srmcphome=/opt/d-cache/srm
>         urlcopy=/opt/d-cache/srm/bin/url-copy.sh
>         x509_user_cert=/home/aggarwa/k5-ca-proxy.pem
>         x509_user_key=/home/aggarwa/k5-ca-proxy.pem
>         x509_user_proxy=/home/aggarwa/k5-ca-proxy.pem
>         x509_user_trusted_certificates=/home/aggarwa/.globus/certificates
>         retry_num=20
>         retry_timeout=10000
>         wsdl_url=null
>         use_urlcopy_script=false
>         connect_to_wsdl=false
>         from=null
>         to=null
>
> Tue Jun 21 13:21:37 BST 2005: SRMClient(https,srm/managerv1.wsdl,true)
> Tue Jun 21 13:21:37 BST 2005: connecting to server
> Tue Jun 21 13:21:37 BST 2005: connected to server, obtaining proxy
> SRMClientV1 : connecting to srm at httpg://gfe02.hep.ph.ic.ac.uk:8443/srm/managerv1
> Tue Jun 21 13:21:38 BST 2005: got proxy of type class org.dcache.srm.client.SRMClientV1
> Tue Jun 21 13:21:38 BST 2005: calling srm.advisoryDelete()
> SRMClientV1 : advisoryDelete SURLS[0]="srm://gfe02.hep.ph.ic.ac.uk:8443/pnfs/hep.ph.ic.ac.uk/data/dteam/testfilejune21"
> SRMClientV1 : advisoryDelete, contacting service httpg://gfe02.hep.ph.ic.ac.uk:8443/srm/managerv1
> SRMClientV1 : sleeping for 0 milliseconds before retrying
> SRMClientV1 : sleeping for 0 milliseconds before retrying
> SRMClientV1 : sleeping for 0 milliseconds before retrying
> SRMClientV1 : sleeping for 0 milliseconds before retrying
> SRMClientV1 : sleeping for 0 milliseconds before retrying
> SRMClientV1 : sleeping for 0 milliseconds before retrying
> SRMClientV1 : sleeping for 0 milliseconds before retrying
> SRMClientV1 : sleeping for 0 milliseconds before retrying
> SRMClientV1 : sleeping for 0 milliseconds before retrying
> SRMClientV1 : sleeping for 0 milliseconds before retrying
> SRMClientV1 : sleeping for 0 milliseconds before retrying
> SRMClientV1 : sleeping for 0 milliseconds before retrying
> SRMClientV1 : sleeping for 0 milliseconds before retrying
> SRMClientV1 : sleeping for 0 milliseconds before retrying
> SRMClientV1 : sleeping for 0 milliseconds before retrying
> SRMClientV1 : sleeping for 0 milliseconds before retrying
> SRMClientV1 : sleeping for 0 milliseconds before retrying
>
> ===============================================================
>
> Any suggestions?

Hi Mona,

srm-advisory-delete works for me if I add -connect_to_wsdl=true to the command line.

Derek