"Matt Doidge"
Before I go about upgrading our SE to the LCG 2_5_0 release, I was wondering if there is any sagely advice to be had from our storage gurus. Is there an easy and magical way to upgrade dCache to the latest version included in this release, or will it be easier to start again: remove dCache, then do the usual YAIM install procedure with the 2_5_0 release?
cheers,
matt
"Owen Synge"
"Owen Synge"
On Fri, 24 Jun 2005 16:47:44 +0100 Greig A Cowan wrote:

> Hi everyone,
>
> Previously at Edinburgh...
>
> People were able to copy files into our dCache, but were unable to copy files out. The source of this issue was diagnosed to be a problem with the pnfs database.
>
> To fix this problem I have just done a reinstall of dCache with yaim (at the same time updating to the LCG 2.5.0 middleware) and I was hoping that it would be possible for someone to test out srmcp's and globus-url-copy's to and from our dCache?
>
> srm://srm.epcc.ed.ac.uk:8443/pnfs/epcc.ed.ac.uk/data/dteam/
>
> Let me know of any errors.
>
> Cheers,
> Greig

It failed with

srm://srm.epcc.ed.ac.uk:8443/pnfs/epcc.ed.ac.uk/data/dteam/file_test_synge.20050628104900.epcc.ed.ac.uk

The response is

debug: response from gsiftp://srm.epcc.ed.ac.uk:2811//pnfs/epcc.ed.ac.uk/data/dteam/file_test_synge.20050628104900.epcc.ed.ac.uk3:
150 Openning BINARY data connection for /pnfs/epcc.ed.ac.uk/data/dteam/file_test_synge.20050628104900.epcc.ed.ac.uk3
debug: fault on connection to gsiftp://srm.epcc.ed.ac.uk:2811//pnfs/epcc.ed.ac.uk/data/dteam/file_test_synge.20050628104900.epcc.ed.ac.uk3: Handle not in the proper state
debug: error reading response from gsiftp://srm.epcc.ed.ac.uk:2811//pnfs/epcc.ed.ac.uk/data/dteam/file_test_synge.20050628104900.epcc.ed.ac.uk3: the handle 0x806b22c was already registered for closing
debug: data callback, no error, buffer 0x80703d8, length 0, offset=0, eof=true
debug: operation complete

Which suggests a networking issue. Does it work from the box?

Regards

Owen S
"Greig A Cowan"
Hi Owen,

> Which suggests a networking issue, does it work from the box?

Thanks to some help from Steve Thorn at NeSC, we may have begun tracking down the problem. He has been able to perform globus-url-copy's into our dCache from another ScotGrid machine (glenmorangie.epcc.ed.ac.uk). This goes through a ScotGrid switch, not through the wider SRIF network that traffic uses coming from outside of Edinburgh.

It looks like at least some of the middleware on our admin node (srm.epcc.ed.ac.uk) is using the default globus port range of 20000-25000 (see packet dump below). On the SRIF network these ports are blocked, as 50000-52000 is the allowed range. The globus config file /etc/sysconfig/globus contains the correct range, as does the GLOBUS_TCP_PORT_RANGE environment variable, but who can say what configuration files/hard coding is used in the various SRM doors.

To summarise: when not encumbered by firewalls it works consistently. srm.epcc is *not* using the port range defined in /etc/sysconfig/globus.

Is there any way of changing the default port range consistently?

Cheers,
Greig

Packet dump.
Irrelevant packets removed

glenmorangie# tcpdump -q host srm.epcc.ed.ac.uk
tcpdump: listening on eth0
16:11:35.312192 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 0 (DF)
16:11:35.312392 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 0 (DF)
16:11:35.312424 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 0 (DF)
16:11:35.319327 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 24 (DF)
16:11:35.319338 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 0 (DF)
16:11:35.319742 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 13 (DF)
16:11:35.319968 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 0 (DF)
16:11:35.364483 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 22 (DF)
16:11:35.383613 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 127 (DF)
16:11:35.383768 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 0 (DF)
16:11:35.459060 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 1448 (DF)
16:11:35.459068 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 1448 (DF)
16:11:35.459074 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 0 (DF)
16:11:35.459395 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 1448 (DF)
16:11:35.459404 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 1448 (DF)
16:11:35.459410 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 0 (DF)
16:11:35.459414 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 643 (DF)
16:11:35.491566 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 0 (DF)
16:11:35.500755 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 1448 (DF)
16:11:35.500772 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 1448 (DF)
16:11:35.500921 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 0 (DF)
16:11:35.500946 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 1448 (DF)
16:11:35.500952 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 479 (DF)
16:11:35.500957 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 0 (DF)
16:11:35.501111 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 0 (DF)
16:11:35.501119 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 0 (DF)
16:11:35.544562 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 111 (DF)
16:11:35.544688 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 0 (DF)
16:11:35.545911 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 47 (DF)
16:11:35.546096 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 0 (DF)
16:11:35.624240 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 467 (DF)
16:11:35.632006 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 707 (DF)
16:11:35.632284 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 0 (DF)
16:11:35.640070 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 8 (DF)
16:11:35.640349 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 78 (DF)
16:11:35.648938 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 90 (DF)
16:11:35.649278 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 58 (DF)
16:11:35.649973 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 78 (DF)
16:11:35.650311 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 58 (DF)
16:11:35.650918 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 130 (DF)
16:11:35.651256 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 58 (DF)
16:11:35.651881 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 66 (DF)
16:11:35.652171 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 58 (DF)
16:11:35.653098 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 90 (DF)
16:11:35.653555 glenmorangie.epcc.ed.ac.uk.50001 > srm.epcc.ed.ac.uk.20001: tcp 0 (DF)
16:11:35.653687 srm.epcc.ed.ac.uk.20001 > glenmorangie.epcc.ed.ac.uk.50001: tcp 0 (DF)
16:11:35.653705 glenmorangie.epcc.ed.ac.uk.50001 > srm.epcc.ed.ac.uk.20001: tcp 0 (DF)
16:11:35.654057 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 122 (DF)
16:11:35.686168 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 0 (DF)
16:11:35.968342 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 162 (DF)
16:11:35.968795 glenmorangie.epcc.ed.ac.uk.50001 > srm.epcc.ed.ac.uk.20001: tcp 6 (DF)
16:11:35.968968 srm.epcc.ed.ac.uk.20001 > glenmorangie.epcc.ed.ac.uk.50001: tcp 0 (DF)
16:11:35.969080 glenmorangie.epcc.ed.ac.uk.50001 > srm.epcc.ed.ac.uk.20001: tcp 0 (DF)
16:11:36.001600 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 0 (DF)
16:11:36.006172 srm.epcc.ed.ac.uk.20001 > glenmorangie.epcc.ed.ac.uk.50001: tcp 0 (DF)
16:11:36.048766 srm.epcc.ed.ac.uk.20001 > glenmorangie.epcc.ed.ac.uk.50001: tcp 0 (DF)
16:11:36.048781 glenmorangie.epcc.ed.ac.uk.50001 > srm.epcc.ed.ac.uk.20001: tcp 0 (DF)
16:11:36.272498 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 78 (DF)
16:11:36.272535 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 0 (DF)
16:11:36.273337 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 58 (DF)
16:11:36.273465 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 0 (DF)
16:11:36.273982 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 66 (DF)
16:11:36.274065 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 0 (DF)
16:11:36.274329 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 0 (DF)
16:11:36.274479 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 0 (DF)
73 packets received by filter
0 packets dropped by kernel
"Alessandra Forti"
Hi Greig,

there is a DCACHE_PORT_RANGE field in the site-info.def. Try to change that and rerun the config part of the installation.

cheers
alessandra

On Tue, 28 Jun 2005, Greig A Cowan wrote:

> Is there any way of changing the default port range consistently?
"Ross, D \(Derek\)"
Hi Greig,

Check the /opt/d-cache/config/dCacheSetup file. Check that the java options near the top are using the right ports, i.e.

java_options="-server -Xmx512m -XX:MaxDirectMemorySize=512m \
  -Dorg.globus.tcp.port.range=50000,52000"

Further down, there's also

clientDataPortRange=50000:52000

Derek

> -----Original Message-----
> From: GRIDPP2: Deployment and support of SRM and local storage management [mailto:GRIDPP-STORAGE@JISCMAIL.AC.UK] On Behalf Of Greig A Cowan
> Sent: 28 June 2005 17:53
> To: GRIDPP-STORAGE@JISCMAIL.AC.UK
> Subject: Re: New Edinburgh dCache install
>
> Is there any way of changing the default port range consistently?
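P.S. A quick way to double-check which range has actually ended up in the configuration, and which ports the java processes are really using during a transfer. The paths assume the standard /opt/d-cache layout and the netstat check is only a rough illustration:

# show the two port-range settings in dCacheSetup
grep -E 'org.globus.tcp.port.range|clientDataPortRange' /opt/d-cache/config/dCacheSetup

# while a transfer is running, list the TCP connections owned by the dCache java processes
netstat -tnp | grep java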
"Greig A Cowan"
Hi guys,

The DCACHE_PORT_RANGE field in the site-info.def file is commented out. Is this not the same for everyone? GLOBUS_TCP_PORT_RANGE was set to the correct value though: "50000 52000"

Anyway, I've altered the dCacheSetup file as Derek suggested and globus-url-copy now appears to be working for me using our relocatable UI. Can someone else try out srmcp/globus-url-copy commands to and from our dCache to test it?

srm://srm.epcc.ed.ac.uk:8443/pnfs/epcc.ed.ac.uk/data/dteam/
gsiftp://srm.epcc.ed.ac.uk/pnfs/epcc.ed.ac.uk/data/dteam/

Thanks,
Greig
"Owen Synge"
On Tue, 28 Jun 2005 18:30:23 +0100 Greig A Cowan wrote:

> Anyway, I've altered the dCacheSetup file as Derek suggested and globus-url-copy now appears to be working for me using our relocatable UI. Can someone else try out srmcp/globus-url-copy commands to and from our dCache to test it?
>
> srm://srm.epcc.ed.ac.uk:8443/pnfs/epcc.ed.ac.uk/data/dteam/
> gsiftp://srm.epcc.ed.ac.uk/pnfs/epcc.ed.ac.uk/data/dteam/

I just tested and

/opt/d-cache/srm/bin/srm -copy -webservice_protocol=httpg -x509_user_proxy=/tmp/x509up_u27529 srm://se2-gla.scotgrid.ac.uk:8443/pnfs/ph.gla.ac.uk/data/dteam/file_test_synge.20050628183135.ph.gla.ac.uk file://///tmp//file_test_synge.20050628183135.ph.gla.ac.uk

was failing although it worked earlier today, but the important news is

/opt/d-cache/srm/bin/srm -copy -webservice_protocol=httpg -x509_user_proxy=/tmp/x509up_u27529 file://///usr/lib/X11/rgb.txt srm://srm.epcc.ed.ac.uk:8443/pnfs/epcc.ed.ac.uk/data/dteam/file_test_synge.20050628183135.epcc.ed.ac.uk
transfer rc = 0

/opt/d-cache/srm/bin/srm -copy -webservice_protocol=httpg -x509_user_proxy=/tmp/x509up_u27529 srm://srm.epcc.ed.ac.uk:8443/pnfs/epcc.ed.ac.uk/data/dteam/file_test_synge.20050628183135.epcc.ed.ac.uk file://///tmp//file_test_synge.20050628183135.epcc.ed.ac.uk
md5sum match srm.epcc.ed.ac.uk

Which is WONDERFUL

thank you so much for going through this process of installation once more and finding the issues, I shall now repeat the test for ph.gla.ac.uk.

Regards

Owen Synge
"Greig A Cowan"
Hi Owen,

> /opt/d-cache/srm/bin/srm -copy -webservice_protocol=httpg -x509_user_proxy=/tmp/x509up_u27529 srm://srm.epcc.ed.ac.uk:8443/pnfs/epcc.ed.ac.uk/data/dteam/file_test_synge.20050628183135.epcc.ed.ac.uk file://///tmp//file_test_synge.20050628183135.epcc.ed.ac.uk
> md5sum match srm.epcc.ed.ac.uk
>
> Which is WONDERFUL

Excellent. What's the best way that people have found so far to thoroughly test their dCache? Owen, are your scripts sufficient for this task?

Once I (and everyone else) am satisfied that things are working properly, I will go ahead and add the rest of Edinburgh's available storage. Maybe we can speak about this at the phone conference tomorrow?

Cheers,
Greig
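P.S. For anyone who wants a quick way to exercise both directions, a loop along the lines of Owen's srm -copy commands can be run from a UI; the test file, target names and proxy handling below are only placeholders:

for i in 1 2 3; do
  f=file_test.$(date +%Y%m%d%H%M%S).$i
  # copy a small local file into the dCache...
  /opt/d-cache/srm/bin/srm -copy -webservice_protocol=httpg \
    file://///etc/group \
    srm://srm.epcc.ed.ac.uk:8443/pnfs/epcc.ed.ac.uk/data/dteam/$f
  # ...and back out again, then compare checksums
  /opt/d-cache/srm/bin/srm -copy -webservice_protocol=httpg \
    srm://srm.epcc.ed.ac.uk:8443/pnfs/epcc.ed.ac.uk/data/dteam/$f \
    file://///tmp/$f
  md5sum /etc/group /tmp/$f
done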
"Alessandra Forti"
Hi Greig,

> The DCACHE_PORT_RANGE field in the site-info.def file is commented out. Is this not the same for everyone? GLOBUS_TCP_PORT_RANGE was set to the correct value though: "50000 52000"

yes, it is the same for all the other sites, but all the other sites are using the range 20000-25000, which are the default values that are hard coded in dCache.

cheers
alessandra
"Greig A Cowan"
Hi Owen,

> Which suggests a networking issue, does it work from the box?

srmcp does work from the box itself. Have you seen anything like this before?

Greig
"Matt Doidge"
Sorry all for not attending the storage meeting. I somehow forgot about it despite the million reminders and went and booked a trip to the opticians.
Lancaster Status:
Still running on 2_4_0, reluctant to upgrade yet (it might break!). Planning on some tests that will completely thrash the system (continuous file transfers for a more vigorous test). We still aren't advertising, for various reasons, the main one being that once people start using us to keep their data we lose the freedom to mess about with our system. Our current set-up is not ideal and we might eventually want to physically move some of our machines.
Lancaster Plans:
We're planning on setting up a second SE, for use exclusively in SC3 testing. It will be the same size as the one we currently have (32 TB), will run the latest version of dCache (as included in 2_5_0) and will be a YAIM install.
Lancaster wish list:
What I would like to know is how to go about "dCache disaster recovery": back-up procedures in case it all goes wrong, and how we can go about getting data off our pools in the event that the admin node does something stupid like melt (can we attach a new admin node and keep our data?).
Once again my apologies for being forgetful,
matt
"Greig A Cowan"
Hi everyone,
I've been testing out our dCache install using globus-url-copy. I just tried copying a large file (>1GB) into our dCache and it worked fine; the md5sums of the source and target files are the same. The transfer was done using the -p 10 option with globus. However, when I attempt to copy the file back out, everything seems to go fine (i.e. the usual dialogue between client and server) until I get the error:
426 Transfer aborted, closing connection :java.net.ConnectException: Connection refused
I can copy small files in and out without any problem when I don't use the -p 10 option, but transfers out of the dCache fail even with small files if I use the -p 10 option. Has anyone seen this behaviour before? It looks like more Edinburgh network issues.
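For what it's worth, the two cases can be compared directly from a UI. The port-range export and the test file name below are assumptions based on the 50000-52000 range discussed earlier in the thread:

export GLOBUS_TCP_PORT_RANGE="50000,52000"

# single stream out of the dCache: works
globus-url-copy gsiftp://srm.epcc.ed.ac.uk/pnfs/epcc.ed.ac.uk/data/dteam/some_test_file file:///tmp/out.single

# ten parallel streams out of the dCache: fails with "Connection refused"
globus-url-copy -p 10 gsiftp://srm.epcc.ed.ac.uk/pnfs/epcc.ed.ac.uk/data/dteam/some_test_file file:///tmp/out.parallel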
Thanks, Greig
"Alessandra Forti"
Hi Jens,
to add to the documentation I found this page today.
http://grid-deployment.web.cern.ch/grid-deployment/gis/lcg-GCR/
It was released or updated last week and it might be useful to link it from the storage pages.
cheers alessandra
"Graeme A Stewart"
On Tuesday 28 June 2005 15:08, Jean-Philippe Baud wrote:

> Actually you need also a line
>
> RFIOD TRUST <short_hostname> <fqdn>
>
> This means that you need a shift.conf including 5 lines for RFIOD: TRUST, FTRUST, RTRUST, WTRUST and XTRUST. We will update the documentation. Please confirm. Thanks a lot.

Hi Jean-Philippe

Good news, inserting the "TRUST" line means it works:

grid07:~$ cat /etc/shift.conf
RFIOD TRUST grid07 grid07.ph.gla.ac.uk
RFIOD WTRUST grid07 grid07.ph.gla.ac.uk
RFIOD RTRUST grid07 grid07.ph.gla.ac.uk
RFIOD XTRUST grid07 grid07.ph.gla.ac.uk
RFIOD FTRUST grid07 grid07.ph.gla.ac.uk
RFIO DAEMONV3_WRMT 1

grid07:~$ globus-url-copy file:/etc/group gsiftp://grid07/dpm/ph.gla.ac.uk/home/dteam/testDir/newGroupTest
grid07:~$ dpns-ls -l /dpm/ph.gla.ac.uk/home/dteam/testDir/newGroupTest
-rw-rw-r--   1 dteam001 dteam        531 Jun 28 20:14 /dpm/ph.gla.ac.uk/home/dteam/testDir/newGroupTest

However, srmcp still isn't working:

grid07:~$ /opt/d-cache/srm/bin/srmcp file:////boot/vmlinux-2.4.21-20.EL srm://grid07:8443/dpm/ph.gla.ac.uk/home/dteam/testDir/srmtest
org.globus.ftp.exception.ServerException: Server refused performing the request. Custom message: (error code 1) [Nested exception message: Custom message: Unexpected reply: 553 grid07:/opt/dpmp/dteam/2005-06-28/srmtest.38.0: Permission denied.]. Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException: Custom message: Unexpected reply: 553 grid07:/opt/dpmp/dteam/2005-06-28/srmtest.38.0: Permission denied.
  at org.globus.ftp.vanilla.TransferMonitor.run(TransferMonitor.java:167)
GridftpClient: transfer exception

I realised this was a different error condition to the globus-url-copy error, so on a hunch I tried chmoding the DPM pool directories to 777 and then it succeeds:

grid07:~$ /opt/d-cache/srm/bin/srmcp file:////boot/vmlinux-2.4.21-20.EL srm://grid07:8443/dpm/ph.gla.ac.uk/home/dteam/testDir/srmtest2
grid07:~$ echo $?
0
grid07:~$ dpns-ls -l /dpm/ph.gla.ac.uk/home/dteam/testDir/srmtest2
-rw-rw-r--   1 dteam001 dteam    2922104 Jun 28 20:18 /dpm/ph.gla.ac.uk/home/dteam/testDir/srmtest2

However, looking at the underlying pool directory:

root of grid07:/opt/dpmp# v /opt/dpmp/dteam/2005-06-28/
total 8592
-rw-rw----   1 dpmmgr   dpmmgr       531 Jun 28 20:14 newGroupTest.37.0
-rw-rw----   1 dteam001 dteam    2922104 Jun 28 20:18 srmtest2.40.0

So, somehow srmcp (this is the dCache client, by the way) is copying in the file directly, using its own GSI pool mapped account, instead of copying to the dpns name space. I can also confirm this by contrasting the globus-url-copy log entry from dpm-gsiftp:

DATE=20050628191443.865506 HOST=grid07 PROG=wuftpd NL.EVNT=FTP_INFO START=20050628191443.818685 USER=dteam001 FILE=/dpm/ph.gla.ac.uk/home/dteam/testDir/newGroupTest BUFFER=87840 BLOCK=65536 NBYTES=531 VOLUME=?(rfio-file) STREAMS=1 STRIPES=1 DEST=1[194.36.1.137] TYPE=STOR CODE=226

with that from the srmcp:

DATE=20050628192444.426157 HOST=grid07 PROG=wuftpd NL.EVNT=FTP_INFO START=20050628192444.343060 USER=dteam001 FILE=grid07:/opt/dpmp/dteam/2005-06-28/srmtest3.41.0 BUFFER=87840 BLOCK=65536 NBYTES=2922104 VOLUME=?(rfio-file) STREAMS=10 STRIPES=1 DEST=1[194.36.1.137] TYPE=STOR CODE=226

Debug trace of "srmcp -debug" and the DPM srmv1 logs attached.

Thanks for your help so far - hopefully we can resolve this so I can report success with DPM to the GridPP 13 meeting next week in Durham!

Graeme

> Jean-Philippe
> P.S. "localhost" in shift.conf is useless. It is not recognised. Removed.
-- -------------------------------------------------------------------- Dr Graeme Stewart http://www.physics.gla.ac.uk/~graeme/ Department of Physics and Astronomy, University of Glasgow, Scotland
"Jiri Mencak"
Hi,
On some occasions when the `/opt/d-cache/install/install.sh' script is (re)run, some (and sometimes even all) of the following files end up being zero size:

/pnfs/fs/admin/etc/config/serverId
/pnfs/fs/admin/etc/config/serverName
/pnfs/fs/admin/etc/config/serverRoot

which results in PnfsManager being Offline.
Looking at the way serverName is written by `/opt/d-cache/install/install.sh' reveals a non-deterministic sleep (increasing the value from 5 up to 50 doesn't help), which I'm a bit concerned about:
<snip>
sleep 5
cd $PNFS_ROOT/fs
sr=`cat ".(id)(usr)"`
cd ./admin/etc/config
echo `hostname` > ./serverName
</snip>
Workaround
I've found that removing the /pnfs/fs/admin/etc/config/server* files helps. For example changing the previous code snippet to
<snip>
sleep 5
cd $PNFS_ROOT/fs
sr=`cat ".(id)(usr)"`
cd ./admin/etc/config
rm -f ./serverName
echo `hostname` > ./serverName
</snip>
fixed the situation.
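After rerunning the installer with a change like this, it is worth confirming that the three config files listed above are no longer empty, for example:

<snip>
ls -l /pnfs/fs/admin/etc/config/serverId \
      /pnfs/fs/admin/etc/config/serverName \
      /pnfs/fs/admin/etc/config/serverRoot
cat /pnfs/fs/admin/etc/config/serverName
</snip>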
A proper solution to this issue would be much appreciated.
Regards.
-- Jiri
"Greig A Cowan"
"Greig A Cowan"
Hi everyone,
What is the best way of specifying that the storage space in your dCache is volatile? We would like to be able to make sure that any files that are transferred into our storage during testing can be removed again. Is this even possible at the moment? I seem to remember reading that the different storage types are not available until SRM v2?
There was a previous discussion on this list regarding these issues (see the thread "Querying the information system for srm info"), but could anyone clarify the situation?
Thanks, Greig
"Owen Synge"
On Thu, 30 Jun 2005 19:59:58 +0100 Greig A Cowan wrote:

> What is the best way of specifying that the storage space in your dCache is volatile?

I don't think you need to yet.

http://www.cnaf.infn.it/~sergio/datatag/glue/v11/SE/index.htm

I could not see an attribute for "storage space type"; please correct me if you can.

> We would like to be able to make sure that any files that are transferred into our storage during testing can be removed again. Is this even possible at the moment?

I believe we are all fine with you doing this outside production runs.

> I seem to remember reading that the different storage types are not available until SRM v2?

Yes, but these storage types are about the services we expect from the systems you guys run, so at the moment the ideas are in our heads rather than formally specified as a computer-queryable and job-scheduling factor. If your service were in the future a "permanent space type", we would address these issues in the Glue schema before we could expect you to provide the higher levels of service. Really we should have this in the schema now (how else do you know that a site is Tier-2 or Tier-0/1?), but it is on the schema group's to-do list. If no one is sure, maybe I should check.

> There was a previous discussion on this list regarding these issues (see the thread: Querying the information system for srm info.), but would it be possible for anyone to clarify the situation.

I hope I have, but please feel free to pull me up on lack of clarity or on issues I have glossed over.

Regards

Owen
"Philip Clark"
>> I seem to remember reading that the different storage types are not available until SRM v2?
>
> Yes, but these storage types are about the services we expect from the systems you guys run, so at the moment the ideas are in our heads rather than formally specified as a computer-queryable and job-scheduling factor.

Is there a way to make sure the space is volatile at the moment?

-Phil
"Alessandra Forti"
Hi Phil,
you can declare it in the Glue schema. Then whoever puts important files on your SE should be aware that they can be deleted.
The field is
GlueSAPolicyFileLifeTime
and can be set for each VO in
/opt/lcg/var/gip/lcg-info-generic.conf
Be careful that this is overridden if you run the YAIM function config_gip again. It is one of those things I asked to be changed in bug 8777 that I keep on talking about.

If you put it in the Glue schema you are declaring your SE policy (as the name says), which means you can delete files whether the system can do it automatically for you (SRM v2) or not (SRM v1, classic SE). At least this is how I interpret it.
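As a rough illustration only (the exact block layout in lcg-info-generic.conf depends on the release, and the SE and path names below are made up), the published attribute would end up in the static LDIF looking something like:

dn: GlueSARoot=dteam:/pnfs/example.ac.uk/data/dteam,GlueSEUniqueID=se.example.ac.uk,Mds-Vo-name=local,o=grid
GlueSAPolicyFileLifeTime: volatile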
cheers alessandra
"Owen Synge"
On Fri, 1 Jul 2005 13:50:03 +0100 Alessandra Forti wrote:

> If you put it in the Glue schema you are declaring your SE policy (as the name says), which means you can delete files whether the system can do it automatically for you (SRM v2) or not (SRM v1, classic SE). At least this is how I interpret it.

Thank you, this is how I interpret it also, but I would advise that you should not delete files unless they are old and you either need to do something difficult that would be aided by deleting files, or need to free space. I am delighted that this field is available today.

Regards

Owen
"Graeme A Stewart"
An update on DPM here at Glasgow:

Yesterday I received a fix from the LCG team for the problems which were causing srmcp to fail. The fix, for the DPM gridftp daemon, is available at:

http://grid-deployment.web.cern.ch/grid-deployment/RpmDir_i386-sl3/external/DPM-gridftp-server-1.3.4-1sec_sl3.i386.rpm

So if you are trying DPM right now, I would apply this RPM and then restart dpm-gsiftp.

This also means that Glasgow's DPM is available for external testing at:

srm://grid07.ph.gla.ac.uk:8443/dpm/ph.gla.ac.uk/home/dteam

and

gsiftp://grid07.ph.gla.ac.uk/dpm/ph.gla.ac.uk/home/dteam

If you have the DPM client rpms (only part of 2.5.0 I think) then you might also try playing with the dpns-* commands, e.g.,

$ grid-proxy-init
$ export DPNS_HOST=grid07.ph.gla.ac.uk
$ dpns-ls -l /dpm/ph.gla.ac.uk/home/dteam

I will try and collate my DPM experience in the DM wiki:

http://www.physics.gla.ac.uk/gridpp/datamanagement/index.php/DiskPoolManager

so you don't have to trail through piles of old emails!

Cheers

Graeme

--
--------------------------------------------------------------------
Dr Graeme Stewart http://www.astro.gla.ac.uk/users/graeme/
Department of Physics and Astronomy, University of Glasgow, Scotland
"Kostas Georgiou"
"Kostas Georgiou"
One of our pool nodes is in a lot of pain after the local CMS people copied some files to it. Any ideas on what the problem is?

# ls -al /opt/d-cache/log/sedsk00Domain.log
-rw-r--r--   1 root  root  2072936448 Jul  1 14:54 /opt/d-cache/log/sedsk00Domain.log

07/01 11:14:25 Cell(sedsk00_1@sedsk00Domain) : getChecksumFromPnfs : No crc available for 00010000000000000000BBF8
07/01 11:14:26 Cell(sedsk00_1@sedsk00Domain) : getChecksumFromPnfs : No crc available for 00010000000000000000BC08
07/01 11:14:26 Cell(sedsk00_1@sedsk00Domain) : getChecksumFromPnfs : No crc available for 00010000000000000000BC28
07/01 11:14:29 Cell(sedsk00_1@sedsk00Domain) : getChecksumFromPnfs : No crc available for 00010000000000000000BC18
07/01 11:14:54 Cell(sedsk00_1@sedsk00Domain) : getChecksumFromPnfs : No crc available for 00010000000000000000BC78
07/01 11:15:10 Cell(sedsk00_1@sedsk00Domain) : getChecksumFromPnfs : No crc available for 00010000000000000000BC68
07/01 11:15:11 Cell(sedsk00_1@sedsk00Domain) : getChecksumFromPnfs : No crc available for 00010000000000000000BC88
07/01 11:15:15 Cell(sedsk00_1@sedsk00Domain) : getChecksumFromPnfs : No crc available for 00010000000000000000BC58
07/01 11:15:15 Cell(sedsk00_1@sedsk00Domain) : getChecksumFromPnfs : No crc available for 00010000000000000000BCA0
07/01 11:16:00 Cell(sedsk00_1@sedsk00Domain) : getChecksumFromPnfs : No crc available for 00010000000000000000BD40
07/01 11:16:00 Cell(sedsk00_1@sedsk00Domain) : getChecksumFromPnfs : No crc available for 00010000000000000000BD30
07/01 11:16:08 Cell(sedsk00_1@sedsk00Domain) : getChecksumFromPnfs : No crc available for 00010000000000000000BD70
07/01 11:16:13 Cell(sedsk00_1@sedsk00Domain) : getChecksumFromPnfs : No crc available for 00010000000000000000BD60
07/01 11:17:02 Cell(sedsk00_1@sedsk00Domain) : getChecksumFromPnfs : No crc available for 00010000000000000000BE20
07/01 11:17:06 Cell(sedsk00_1@sedsk00Domain) : getChecksumFromPnfs : No crc available for 00010000000000000000BE30
07/01 11:18:01 Cell(sedsk00_1@sedsk00Domain) : removeFiles : invalid syntax in remove filespec >null<
07/01 11:18:01 Cell(sedsk00_1@sedsk00Domain) : removeFiles : invalid syntax in remove filespec >null<
.....
07/01 14:23:39 Cell(sedsk00_1@sedsk00Domain) : removeFiles : invalid syntax in remove filespec >null<
07/01 14:23:39 Cell(sedsk00_1@sedsk00Domain) : removeFiles : invalid syntax in remove filespec >null<
07/01 14:23:39 Cell(sedsk00_1@sedsk00Domain) : removeFiles : invalid syntax in remove filespec >null<
07/01 14:23:39 Cell(sedsk00_1@sedsk00Domain) : removeFiles : invalid syntax in remove filespec >null<
07/01 14:23:39 Cell(sedsk00_1@sedsk00Domain) : removeFiles : invalid syntax in remove filespec >null<
07/01 14:23:39 Cell(sedsk00_1@sedsk00Domain) : remov
Disk partition full
......

Kostas
"Owen Synge"
I have never seen this error.
My bet is that dCache is complaining because it failed to store checksum information after the files were received.

dCache internally queues files and then processes them, moving them to pools according to affinity etc.

Maybe the checksums are also not stored at the time of writing, and dCache is very unhappy about not being able to write them to disk; rather than gracefully refusing to over-fill a pool, it allows files to start being written to a pool that is full and then complains that it does not have checksums.

It's only a guess.
Regards
Owen
"Matt Doidge"
"Matt Doidge"
heya guys and girls,
I have been trying to install a second SRM here at Lancaster for use with SC3 (so we can have an SE for production and an SE for SC stuff). I am not having much luck with the installation: the SRM isn't coming online, and nothing is listening on port 8443 when I check using netstat. Does the 2_5_0 version of YAIM still require patching? Also, when using YAIM to install a second SE, is there anything odd that you have to do? I just filled in the SE2 variables in the same manner I would have for a standard SE.
thanks for your time again guys.
matt
"Greig A Cowan"
Hi Matt,

> Am not having much luck with the installation - the srm isn't coming on line, and nothing is listening on port 8443 when i check using a netstat. Does the 2_5_0 version of YAIM still require patching?
Yep, I installed dCache from 2_5_0 and made sure I had Jiri's patch.
As for the SRM, have you tried the solution of switching off the pool services, restarting the opt services and then starting the pools again?
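Roughly the following, assuming the standard /opt/d-cache service scripts (the script names may differ between releases, so treat this as a sketch):

/opt/d-cache/bin/dcache-pool stop
/opt/d-cache/bin/dcache-opt restart    # SRM and the other optional door services
/opt/d-cache/bin/dcache-pool start
netstat -tln | grep 8443               # check that the SRM door is now listening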
Not sure about the second SE issue though.
Greig
"Greig A Cowan"
Hi Matt,
Have a look here for some YAIM install instructions that worked for me.
http://www.gridpp.ac.uk/deployment/admin/dcache/dcache_yaim_install.txt
They are pretty rough, but you should be able to follow them. Give me an email if you need anything clarified.
Greig
"Greig A Cowan"
From the GridPP pages:
http://www.gridpp.ac.uk/deployment/admin/dcache/dcache_yaim_install.txt
Create the relevant pointers to the rpm repositories
echo 'rpm http://storage.esc.rl.ac.uk/ apt/datastore/sl3.0.4 stable obsolete' \
  > /etc/apt/sources.list.d/gpp_storage.list
apt-get install d-cache-gpp-yaimlink
You can also do this by hand by creating the link:
ln -s /opt/d-cache-gpp/bin/lcg/config_gpp_sedcache \
  /opt/lcg/yaim/functions/local/config_sedcache
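With the link in place, the node can then be (re)configured with YAIM in the usual way. The node type name below is an assumption, so check it against the release notes for your version:

cd /opt/lcg/yaim/scripts
./configure_node /path/to/site-info.def SE_dcache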
Cheers,
Greig
"Greig A Cowan"
"Greig A Cowan"
Hi everyone,
I think we really need to combine the current dCache documentation in order to form an authoritative source for installing, administering and monitoring the system. It can be very confusing for anyone starting out with dCache when there is not one single document that they can go to in order to find all the relevant information.
Maybe some of us can discuss this issue before we leave GridPP13 today.
Also, I have started distilling some of the very useful information that this list has provided about dCache but that hasn't really been written down before, along with some things that I think are very useful for people running dCache to know. You can find it here:
http://www.gridpp.ac.uk/deployment/admin/dcache/faq.html
Don't worry, I am going to change the size of the font in the code blocks so that it is slightly easier to read! It is still very much a work in progress so any comments/contributions would be appreciated. You can also edit the pages yourself if you have been granted write access to this part of the GridPP site.
Cheers,
Greig
"Steve Traylen"
On Wed, Jul 06, 2005 at 10:17:14AM +0100 or thereabouts, Greig A Cowan wrote:

> I think we really need to combine the current dCache documentation in order to form an authoritative source for installing, administering and monitoring the system. It can be very confusing for anyone starting out with dCache when there is not one single document that they can go to in order to find all the relevant information.

I was thinking about this myself. Apart from anything else, I don't think many people outside the UK are aware of the documentation that exists. The LCG wiki is the sensible top-level place to bung nothing other than links.

Steve
"Jiri Mencak"
I suspect you mean the dCache that comes with the LCG 2.5.0 release. We have patches to the dCache YAIM installation scripts which come with the LCG 2.4.0 release. All of these have been accepted into the LCG 2.5.0 release, so as long as you're using that particular release you don't need any patches.
Regards.
Jiri
"Alessandra Forti"
"Alessandra Forti"
Hi,
I discovered the mystery. There are two openssl binaries installed: the standard /usr/bin/openssl and a globus one, /opt/globus/bin/openssl. I don't know if the globus version is modified, recompiled or just older, but they give different output.

When /opt/d-cache/bin/grid-mapfile2dcache-kpwd is run by hand it calls the globus one, and there is Email= and E= in /opt/d-cache/etc/dcache.kpwd.

But when it is run by the cron job the standard openssl is called, because the path doesn't contain /opt/globus/bin, and we get the standard emailAddress= in /opt/d-cache/etc/dcache.kpwd, which mismatches other parts of the code (I haven't looked into this yet and don't know if I want to).

The simplest way to correct this is to add /opt/globus/bin at the beginning of the PATH in /etc/cron.d/edg-mkgridmap.
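For illustration, the cron entry would then look something like the following; the schedule and the edg-mkgridmap options shown are placeholders, the point is only the PATH line at the top:

PATH=/opt/globus/bin:/usr/bin:/bin
05 */2 * * * root /opt/edg/sbin/edg-mkgridmap --output=/etc/grid-security/grid-mapfile --safe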
/opt/lcg/libexec/lcg-info-wrapper does work now, but there is still the problem of the ldif cache file being zero-sized, so ldap doesn't pick it up.
cheers
alessandra
"Jensen, J \(Jens\)"
I bet they are different versions.
/usr/bin/openssl version        => 0.9.7
/opt/globus/bin/openssl version => 0.9.6
-j
"Alessandra Forti"
Yes, they are different. I don't know if globus has anything else in it.

cheers
alessandra

On Mon, 11 Jul 2005, Jensen, J (Jens) wrote:

> I bet they are different versions.
>
> /usr/bin/openssl version        => 0.9.7
> /opt/globus/bin/openssl version => 0.9.6
>
> -j

>> On Fri, 8 Jul 2005, owen maroney wrote:
>>
>>> And for IC we have:
>>>
>>> mapping "/C=UK/O=eScience/OU=Imperial/L=Physics/CN=gfe02.hep.ph.ic.ac.uk/emailAddress=lcg-site-admin@imperial.ac.uk" edginfo
>>>
>>> login edginfo read-write 19491 19491 / / / /C=UK/O=eScience/OU=Imperial/L=Physics/CN=gfe02.hep.ph.ic.ac.uk/emailAddress=lcg-site-admin@imperial.ac.uk
>>>
>>> So, I try replacing this with:
>>>
>>>> mapping "/C=UK/O=eScience/OU=Imperial/L=Physics/CN=gfe02.hep.ph.ic.ac.uk/E=lcg-site-admin@imperial.ac.uk" edginfo
>>>>
>>>> login edginfo read-write 19491 19491 / / / /C=UK/O=eScience/OU=Imperial/L=Physics/CN=gfe02.hep.ph.ic.ac.uk/E=lcg-site-admin@imperial.ac.uk
>>>
>>> then
>>>
>>> su - edginfo
>>>> [edginfo@gfe02 edginfo]$ /opt/d-cache/srm/bin/srm-storage-element-info https://gfe02.hep.ph.ic.ac.uk:8443/srm/infoProvider1_0.wsdl
>>>
>>> produces stuff ending with:
>>>
>>>> StorageElementInfo :
>>>> totalSpace =2541546897408 (2481979392 KB)
>>>> usedSpace =30397658395 (29685213 KB)
>>>> availableSpace =2502704826621 (2444047682 KB)
>>>
>>> Hurrah! (although this will get overwritten at the next edg-mkgridmap update...)
>>>
>>> However, although now when I run, as edginfo:
>>>> [edginfo@gfe02 edginfo]$ /opt/lcg/libexec/lcg-info-dynamic-se
>>>
>>> I get a 3 second pause and output like:
>>>> dn: GlueSARoot=lhcb:/pnfs/hep.ph.ic.ac.uk/data/lhcb,GlueSEUniqueID=gfe02.hep.ph.ic.ac.uk,Mds-Vo-name=local,o=grid
>>>> GlueSAStateAvailableSpace: 2444047682
>>>> GlueSAStateUsedSpace: 37931710
>>>
>>> when I run, as edginfo, /opt/lcg/libexec/lcg-info-wrapper, it takes less than a second and produces output including:
>>>> GlueSAStateAvailableSpace: 00
>>>> GlueSAStateUsedSpace: 00
>>>
>>> in the output. I checked /opt/lcg/var/gip/tmp and the file lcg-info-dynamic-dcache.ldif.7010 is being updated but is only zero sized.
>>>
>>> So the output of the dynamic-se script does not seem to be getting incorporated into the output of the wrapper script.
>>>
>>> cheers,
>>> Owen.
>>>
>>> Alessandra Forti wrote:
>>>> no I have
>>>>
>>>> mapping "/C=UK/O=eScience/OU=Manchester/L=HEP/CN=bohr0013.tier2.hep.man.ac.uk/emailAddress=alessandra.forti@manchester.ac.uk" edginfo
>>>>
>>>> login edginfo read-write 18948 18948 / / / /C=UK/O=eScience/OU=Manchester/L=HEP/CN=bohr0013.tier2.hep.man.ac.uk/emailAddress=alessandra.forti@manchester.ac.uk
>>>>
>>>> cheers
>>>> alessandra
>>>>
>>>> On Fri, 8 Jul 2005, Steve Traylen wrote:
>>>>> On Thu, Jul 07, 2005 at 04:54:33PM +0100 or thereabouts, Philip Clark wrote:
>>>>>
>>>>>>> I don't think we have a workaround, it just works?
>>>>>>>
>>>>>>> I expect you have mentioned it before but what is the problem?
>>>>>>
>>>>>> http://savannah.cern.ch/bugs/?func=detailitem&item_id=8777
>>>>>>
>>>>>> We need to understand why you are not seeing this bug. IC, Manchester and Edinburgh all seem to have it. If we try to monitor your storage through the lcg information system then I expect it will show up too.
>>>>>
>>>>> Does your dcache.kpwd contain
>>>>>
>>>>> /C=UK/O=eScience/OU=Manchester/L=HEP/CN=bohr0013.tier2.hep.man.ac.uk/E=alessandra.forti@manchester.ac.uk
>>>>>
>>>>> i.e. are you seeing this one.
>>>>>
>>>>> https://savannah.cern.ch/bugs/?func=detailitem&item_id=5295
>>>>>
>>>>> but I'm sure I have already asked this question twice so feel free to scream if we are going through the same loop?
>>>>>
>>>>> Steve
>>>>>
>>>>>> -Phil
>>>>>
>>>>> --
>>>>> Steve Traylen
>>>>> s.traylen@rl.ac.uk
>>>>> http://www.gridpp.ac.uk/
"Alessandra Forti"
Hi,
I finally got it working. This line
$ENV{PATH} = "/opt/d-cache/srm/bin";
needs to be added to /opt/lcg/libexec/lcg-info-dynamic-dcache
I put it after
$ENV{HOME} = "/var/tmp"; $ENV{SRM_PATH} = "/opt/d-cache/srm";
for housekeeping.
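With that change in place, the relevant part of /opt/lcg/libexec/lcg-info-dynamic-dcache should read something like the following (a sketch assembled from the lines quoted above; the surrounding lines of the script are omitted):

$ENV{HOME} = "/var/tmp";
$ENV{SRM_PATH} = "/opt/d-cache/srm";
$ENV{PATH} = "/opt/d-cache/srm/bin";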
cheers
alessandra
"Kostas Georgiou"
Hi,
Once again our Dcache pool nodes started getting the following error message:
07/13 13:33:50 Cell(cmsdsk00_2@cmsdsk00Domain) : removeFiles : invalid syntax in remove filespec >null<
07/13 13:33:50 Cell(cmsdsk00_2@cmsdsk00Domain) : removeFiles : invalid syntax in remove filespec >null<
07/13 13:33:50 Cell(cmsdsk00_2@cmsdsk00Domain) : removeFiles : invalid syntax in remove filespec >null<
07/13 13:33:50 Cell(cmsdsk00_2@cmsdsk00Domain) : removeFiles : invalid syntax in remove filespec >null<
07/13 13:33:50 Cell(cmsdsk00_2@cmsdsk00Domain) : removeFiles : invalid syntax in remove filespec >null<
The admin node is also under load, presumably because it's sending the remove command non-stop, although I can't see anything in the logs.
Since the partition with the logs fills up pretty fast, this affects our ability to log file transfers, so we need a solution to the problem.
Can someone with access to the developers or the source code have a look at what is causing this?
Cheers,
kostas
"Ross, D \(Derek\)"
Hi Kostas,
We're also seeing this at the Tier 1 now; I've mailed the developers. Restarting the pool makes it stop (for a while).
Derek
"Greig A Cowan"
Hi everyone,
Just to let you all know that in order to get Edinburgh publishing the correct storage I had to add an extra step in addition to what Alessandra previously mentioned. Even after making Alessandra's changes, I was still finding that the wrong version of openssl (i.e. the non-globus one) was being used in the /opt/d-cache/bin/grid-mapfile2dcache-kpwd script. To rectify this, I added /opt/globus/bin to the PATH variable in /etc/crontab (this was in addition to adding /opt/globus/bin to PATH in /etc/cron.d/edg-mkgridmap).
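For reference, a minimal sketch of the kind of PATH line involved in both files; everything after /opt/globus/bin is an assumed example, since the rest of the value differs from node to node:

# /etc/crontab and /etc/cron.d/edg-mkgridmap: prepend the globus bin
# directory so that cron jobs find /opt/globus/bin/openssl first
PATH=/opt/globus/bin:/sbin:/bin:/usr/sbin:/usr/bin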
The correct version of openssl is now being used, meaning that there are no more references to emailAddress= in the /opt/d-cache/etc/dcache.kpwd file. You can see that our storage is now being correctly reported at:
http://www.ph.ed.ac.uk/~jfergus7/gridppDiscStatus.html
Mona: if you need a hand with Imperial's information publishing, let me know.
Thanks,
Greig
"Greig A Cowan"
"Greig A Cowan"
Hi everyone,
I am currently working on some scripts that should hopefully automate the process of draining a pool and then removing it (although I am surprised that dCache does not have some built in function for doing this). Has no one else done this before?
I just need to finish off a small part that allows me to make a pool read only before it is drained. Once this is done I can post the scripts to the list if people want. They are fairly basic, but it should probably be enough for people to understand what needs to be done so that they can modify them as they see fit.
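To give a flavour of the approach, here is a minimal sketch of the listing step only (not the finished scripts). It assumes the usual ssh admin interface on port 22223, a hypothetical pool name, and that the pool cell command 'rep ls' prints one replica per line starting with its pnfsid:

POOL=pool1_1     # hypothetical pool name
# batch the admin-interface commands and pipe them in over one ssh session
{ echo "cd $POOL"; echo "rep ls"; echo ".."; echo "logoff"; } \
    | ssh -p 22223 -c blowfish admin@localhost > $POOL.replicas
# keep only the pnfsids (24 hex characters at the start of a line)
grep '^[0-9A-F]\{24\}' $POOL.replicas | awk '{print $1}' > $POOL.pnfsids

The pnfsid list can then be fed into the pool-to-pool copy and 'rep rm' steps.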
Cheers,
Greig
"Steve Traylen"
Hi everyone,
> I am currently working on some scripts that should hopefully automate the process of draining a pool and then removing it (although I am surprised that dCache does not have some built in function for doing this). Has no one else done this before?
They are working on something slightly more automated. If you have a tape back end, things are much easier, and since they are in that situation this is currently available. It is something they are working on though, and they definitely accept it as something to be added.
As you say, it is currently possible but a pain; your scripts should make it very much easier. Thanks.
> I just need to finish off a small part that allows me to make a pool read only before it is drained. Once this is done I can post the scripts to the list if people want. They are fairly basic, but it should probably be enough for people to understand what needs to be done so that they can modify them as they see fit.
>
> Cheers,
> Greig
"Kostas Georgiou"
On Wed, Jul 13, 2005 at 01:55:12PM +0100, Ross, D (Derek) wrote: > Hi Kostas, > > We're also seeing this at the Tier 1 now, I've mailed the developers. Restarting the pool makes it stop (for a while). I think it shows up when a pool gets full and Dcache tries to reclaim the space. It's likely that the server gets confused somehow and it tries to remove files that have been removed by hand or something similar. At restart time i see that the pool has some files in a weird state. 07/13 15:15:37 Cell(cmsdsk00_1@cmsdsk00Domain) : Starting Flushing Thread 07/13 15:15:37 Cell(cmsdsk00_1@cmsdsk00Domain) : Constructor done (still waiting for 'inventory') 07/13 15:15:37 Cell(cmsdsk00_2@cmsdsk00Domain) : New Pool Mode : disabled(fetch,store,stage,p2p-client,p2p-server,) 07/13 15:15:38 Cell(cmsdsk00_2@cmsdsk00Domain) : 0001000000000000000168D8 : CacheException(rc=210;msg=Illegal Control State : receiving.cient) 07/13 15:15:38 Cell(cmsdsk00_2@cmsdsk00Domain) : Trying to recover : 0001000000000000000168D8 07/13 15:15:38 Cell(cmsdsk00_2@cmsdsk00Domain) : Recover 0001000000000000000168D8 : Trying to get storageinfo 07/13 15:15:38 Cell(cmsdsk00_2@cmsdsk00Domain) : PnfsHandler : CacheException (10001) : Pnfs error : Pnfs File not found : 0001000000000000000168D8 07/13 15:15:38 Cell(cmsdsk00_2@cmsdsk00Domain) : Recover 0001000000000000000168D8 : get storageinfo got CacheException(rc=10001;msg=Pnfs error : Pnfs File not found : 0001000000000000000168D8) 07/13 15:15:38 Cell(cmsdsk00_2@cmsdsk00Domain) : 0001000000000000000168D8 : recover : file not found -> removed 07/13 15:15:41 Cell(cmsdsk00_2@cmsdsk00Domain) : 0001000000000000000168E0 : CacheException(rc=210;msg=Illegal Control State : receiving.cient) 07/13 15:15:41 Cell(cmsdsk00_2@cmsdsk00Domain) : Trying to recover : 0001000000000000000168E0 07/13 15:15:41 Cell(cmsdsk00_2@cmsdsk00Domain) : Recover 0001000000000000000168E0 : Trying to get storageinfo 07/13 15:15:41 Cell(cmsdsk00_2@cmsdsk00Domain) : PnfsHandler : CacheException (10001) : Pnfs error : Pnfs File not found : 0001000000000000000168E0 07/13 15:15:41 Cell(cmsdsk00_2@cmsdsk00Domain) : Recover 0001000000000000000168E0 : get storageinfo got CacheException(rc=10001;msg=Pnfs error : Pnfs File not found : 0001000000000000000168E0) 07/13 15:15:41 Cell(cmsdsk00_2@cmsdsk00Domain) : 0001000000000000000168E0 : recover : file not found -> removed 07/13 15:15:44 Cell(cmsdsk00_2@cmsdsk00Domain) : 0001000000000000000168E8 : CacheException(rc=210;msg=Illegal Control State : receiving.cient) 07/13 15:15:44 Cell(cmsdsk00_2@cmsdsk00Domain) : Trying to recover : 0001000000000000000168E8 07/13 15:15:44 Cell(cmsdsk00_2@cmsdsk00Domain) : Recover 0001000000000000000168E8 : Trying to get storageinfo 07/13 15:15:44 Cell(cmsdsk00_2@cmsdsk00Domain) : PnfsHandler : CacheException (10001) : Pnfs error : Pnfs File not found : 0001000000000000000168E8 07/13 15:15:44 Cell(cmsdsk00_2@cmsdsk00Domain) : Recover 0001000000000000000168E8 : get storageinfo got CacheException(rc=10001;msg=Pnfs error : Pnfs File not found : 0001000000000000000168E8) 07/13 15:15:44 Cell(cmsdsk00_2@cmsdsk00Domain) : 0001000000000000000168E8 : recover : file not found -> removed 07/13 15:15:47 Cell(cmsdsk00_2@cmsdsk00Domain) : 000100000000000000016930 : CacheException(rc=210;msg=Illegal Control State : receiving.cient) 07/13 15:15:47 Cell(cmsdsk00_2@cmsdsk00Domain) : Trying to recover : 000100000000000000016930 07/13 15:15:47 Cell(cmsdsk00_2@cmsdsk00Domain) : Recover 000100000000000000016930 : Trying to get storageinfo 07/13 15:15:48 
Cell(cmsdsk00_2@cmsdsk00Domain) : PnfsHandler : CacheException (10001) : Pnfs error : Pnfs File not found : 000100000000000000016930 07/13 15:15:48 Cell(cmsdsk00_2@cmsdsk00Domain) : Recover 000100000000000000016930 : get storageinfo got CacheException(rc=10001;msg=Pnfs error : Pnfs File not found : 000100000000000000016930) 07/13 15:15:48 Cell(cmsdsk00_2@cmsdsk00Domain) : 000100000000000000016930 : recover : file not found -> removed 07/13 15:15:50 Cell(cmsdsk00_2@cmsdsk00Domain) : runInventory #=249;space=81011045121/483183820800 07/13 15:15:50 Cell(cmsdsk00_2@cmsdsk00Domain) : New Pool Mode : enabled 07/13 15:15:50 Cell(cmsdsk00_2@cmsdsk00Domain) : Pool enabled cmsdsk00_2 07/13 15:15:50 Cell(cmsdsk00_2@cmsdsk00Domain) : Repository finished 07/13 15:15:50 Cell(cmsdsk00_2@cmsdsk00Domain) : Starting Flushing Thread 07/13 15:15:50 Cell(cmsdsk00_2@cmsdsk00Domain) : Constructor done (still waiting for 'inventory')
"Owen Synge"
I spoke to Steve, Derek and Andrew at lunch about the issue of contacting production managers when your disks are full. The conclusion that Andrew suggested, and that Steve and Derek agreed to, is to let the system fill up.
I believe that you should publish your SRMs as volatile; after using the service for some time you may wish to upgrade your service status.
Regards
Owen
"Mona Aggarwal"
"Mona Aggarwal"
Hi all,
Following is a list of useful links to add and remove a pool from dCache.
1. dCache4SiteAdmins.pdf ==> Section 4
http://www.gridpp.ac.uk/deployment/admin/dcache/index.html
The new version of this guide will be included in the LCG 2.6.0 release.
2. UK dCache experiences FAQ
http://www.gridpp.ac.uk/deployment/admin/dcache/faq.html
Moreover, Greig is working on a script to automate file transfers between two pools.
Thanks Greig!
Regards,
Mona
"Greig A Cowan"
Hi everyone,
> Moreover, Greig is working on a script to automate file transfers between two pools.
I'm doing this right now, but I'm trying to make the process a little smoother. At the moment my scripts generate a list of the pnfs IDs of the files in the pool that has to be removed. The scripts then loop over this list, entering the admin interface, removing a file and then exiting. This process repeats for each file. Obviously this means that I need to manually enter the password each time I log in to the admin interface. This solution becomes very annoying if you have more than a handful of files (although it's probably not as bad as performing the entire process by hand!). Has anyone managed to successfully use ssh keys to speed up access to the admin interface?
At the moment, my ~/.ssh/config file on the admin node contains:
Host dcache_admin
    Hostname localhost
    user admin
    Port 22223
    Cipher blowfish
    Ciphers blowfish-cbc
Does anyone know how I would go about creating a key-pair for admin@localhost? I know how to do this when I am logging into a remote machine where I have access to a shell, but I'm not too sure how to do it in the case of the dCache admin interface.
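In case it is useful as a starting point, here is a sketch of the usual key-based approach. As far as I know the old admin door speaks SSH protocol 1, so an RSA1 identity is needed; the server-side location of the authorized keys file below is an assumption that should be checked against your install:

# client side: create an RSA1 identity with no passphrase
ssh-keygen -t rsa1 -f ~/.ssh/dcache_admin_identity -N ""
# then add to the dcache_admin entry in ~/.ssh/config:
#   Protocol 1
#   IdentityFile ~/.ssh/dcache_admin_identity
# server side (assumed path - verify on the admin node):
cat ~/.ssh/dcache_admin_identity.pub >> /opt/d-cache/config/authorized_keys

Failing that, the password prompt can at least be reduced to once per run by batching all of the admin commands into a single file and piping them in with something like 'ssh dcache_admin < commands', rather than opening a new session per file.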
Thanks,
Greig
"Philip Clark"
Hi Folks,
We are hopefully going to be making the GridPP storage monitoring more visible in the GridPP pages. Could you check the link below to make sure your site is being reported on correctly?
http://www.ph.ed.ac.uk/~jfergus7/gridppDiscStatus.html
Does anyone know why RAL jumped to 1PB? Is this tape being included? It seems suspiciously like too round a number.
-Phil
Greig A Cowan
Hi everyone,
At the moment, we at Edinburgh are re-evaluating our site policy for mapping VOs to particular pools/pool groups. I was hoping that the other sites with dCache/DPM installations could post to the list what their current policy is (if they have one). I would just like an idea of what is best practice.
For example, should we allocate 2TB to each of the LHC VOs and then leave the remaining space available for any VO to use? Do we even _need_ to have separate pools for VOs? As long as the storage is being utilised, does it matter who is using it?
What we do have to ensure is that all the experiments know the storage is volatile. This is pretty urgent, since we are going to be getting thousands of files and they are all from those different users.
Thanks,
Greig
Owen Synge
Owen Synge
Does anyone have a definitive way of checking the D-Cache version number? I tried the following commands:
[root@dev01 root]# head -n3 /opt/d-cache/bin/dcache-core
#!/bin/sh
# $Id: D-Cache-Howto-Email-Import2.xml,v 1.11 2005/08/25 16:51:43 synge Exp $
#
[root@dev01 root]# rpm -qa | grep cache
d-cache-opt-1.5.3-73
distcache-0.4.2-9.3
d-cache-client-1.0-76
d-cache-lcg-5.0.0-1
d-cache-gpp-1.1.2l-1
distcache-devel-0.4.2-9.3
d-cache-core-1.5.2-74
d-cache-gpp-admin-1.1.2l-1
hello,
I tried yesterday to update our Dcache on fal-pygrid-20 and attached node to 2_5_0; I did this over the top of the existing setup. Everything still works, and the cynic in me would like to know if the upgrade went successfully - so how can you tell what Dcache version you're using? I can't seem to find a nice VERSION file or any magic command that tells me. It seems like a silly thing not to know how to do!
cheers, hope you have a good weekend.
matt
As you can see, the numbers don't match. A definitive version number would be useful for all involved downstream of the core development team, and maybe for them too when they want to know which version a user's bug is in. We would use it to check upgrades and to make bug reports more specific.
This can be accomplished using plain old CVS version numbers and the CVS built-in mechanism of keyword variables being populated from tags:
https://www.cvshome.org/docs/manual/cvs-1.11.20/cvs_12.html#SEC97
If each file in
/opt/d-cache/dcap/bin /opt/d-cache/srm/bin /opt/d-cache/dcap/bin
had a version option it would be good. I should also like each file in
/opt/d-cache/etc
to contain the version number so we could see which version a file started as.
Also
/opt/d-cache/docs
Could probably benefit from some version numbering too.
These version numbers could also generate the RPM's version number, as well as the application's version number, from the same CVS tag, making everything clear with no fear of duplication or forking.
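As a sketch of what this could look like in one of those files (a hypothetical snippet, not something that exists in the current RPMs), a -version option driven by the CVS Name keyword, which CVS expands when the file is checked out from a tag:

#!/bin/sh
# hypothetical example: on checkout from a tag, CVS expands the Name
# keyword to something like "$Name: d-cache-core_1_5_2-74 $"
CVS_TAG='$Name:  $'
case "$1" in
  -version|--version)
    echo "built from CVS tag: $CVS_TAG"
    exit 0
    ;;
esac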
Regards
Owen Synge
Greig A Cowan
Sorry Owen, I am not sure about how to check the version number. I have just downloaded the current version, 1.6.5-2, and listing the contents of the archive gives:

tar -t --file=dcache-v1.6.5-2.tar
dcache_deploy/
dcache_deploy/d-cache-client-1.0-100-RH73.i386.rpm
dcache_deploy/d-cache-client-1.0-100.i386.rpm
dcache_deploy/d-cache-core-1.5.2-83.i386.rpm
dcache_deploy/d-cache-opt-1.5.3-84.i386.rpm
dcache_deploy/dCache-installation-instructions.txt
dcache_deploy/pnfs-3.1.10-15.i386.rpm
dcache_deploy/Release.notes

So it is not clear how the different component numbers are combined together to give the final version number. Could you speak to the dCache developers about this?

Cheers,
Greig
Greig A Cowan
Greig A Cowan
Hi everyone,
We are currently involved in the file transfers from RAL. However, we have been having trouble with our pool node in that all the CPU (8*1.9 GHz) and memory (physical RAM is 32 GB) resources have been quickly used up, grinding the machine to a halt. This has prevented us from accepting files.
When Steve Thorn (NeSC) analysed the machine, it appeared that dCache was spawning java processes:
1195 ?        S      0:00 /bin/sh /opt/d-cache/jobs/pool -pool=dcache -logfile
1197 ?        S      0:00  \_ /usr/java/j2sdk1.4.2_08/bin/java -server -Xmx256m
1200 ?        S      9:55      \_ /usr/java/j2sdk1.4.2_08/bin/java -server -Xmx
1201 ?        S      0:57          \_ /usr/java/j2sdk1.4.2_08/bin/java -server
1202 ?        S      0:00          \_ /usr/java/j2sdk1.4.2_08/bin/java -server
1203 ?        S      0:00          \_ /usr/java/j2sdk1.4.2_08/bin/java -server
1204 ?        S      0:00          \_ /usr/java/j2sdk1.4.2_08/bin/java -server
...
There were ~200 each using 57 MB RAM. At one point, the total RAM used was 31 GB. At the moment, Dcache services have been stopped on the pool node and after a reboot the machine appears to have returned to normal. Has anyone seen/heard of this before?
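A quick way to sanity-check figures like this (a sketch, nothing specific to our machine): where the JVM runs on LinuxThreads, as is typical on 2.4 kernels, every Java thread shows up as its own ps entry, but the threads of one JVM share a single address space, so summing the per-entry RSS overcounts heavily. Comparing against free(1) gives a truer picture:

# overall memory use, independent of how ps counts threads
free -m
# per-entry view: identical large RSS figures usually mean threads
# sharing one heap, not separate processes
ps -o pid,ppid,rss,vsz,args -C java | head -20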
Any advice would be useful.
Cheers,
Greig
Kostas Georgiou
On Tue, Jul 19, 2005 at 04:17:22PM +0100, Greig A Cowan wrote: > Hi everyone, > > We are currently involved in the file transfers from RAL. However, we have > been having trouble with our pool node in that all the CPU (8*1.9 GHz) > and memory (physical RAM is 32 GB) resources have been quickly used up, > grinding the machine to a halt. This has prevented us from accepting > files. > > When Steve Thorn (NeSC) analysed the machine, it appears that dCache was > spawning java processes: > > 1195 ? S 0:00 /bin/sh /opt/d-cache/jobs/pool -pool=dcache > -logfile > 1197 ? S 0:00 \_ /usr/java/j2sdk1.4.2_08/bin/java -server > -Xmx256m > 1200 ? S 9:55 \_ /usr/java/j2sdk1.4.2_08/bin/java > -server -Xmx > 1201 ? S 0:57 \_ /usr/java/j2sdk1.4.2_08/bin/java > -server > 1202 ? S 0:00 \_ /usr/java/j2sdk1.4.2_08/bin/java > -server > 1203 ? S 0:00 \_ /usr/java/j2sdk1.4.2_08/bin/java > -server > 1204 ? S 0:00 \_ /usr/java/j2sdk1.4.2_08/bin/java > -server > ... > > There were ~200 each using 57 MB RAM. At one point, the total RAM used was > 31 GB. At the moment, Dcache services have been stopped on the pool node > and after a reboot the machine appears to have returned to normal. Has > anyone seen/heard of this before? I can see around 400 threads from the two java processes in one of our pool nodes. Total memory in use is ~380MB for both of them (~100 for the pool, ~180 for gridftp). Are you sure that the problem was caused because of low memory? Threads share all the data so it's more likely to me that the process was only using 57MB total ;P In our pool node, there are also 165 connections from csfnfs*.rl.ac.uk and the disk spends most of it's seeking instead of doing something useful (writing) which causes a huge load. Cheers, Kostas
Philip Clark
Hi Kostas, Yes, this makes sense the threads all share the same memory. I have suggest Greig move this thread to lcg-support-dcache. Are you on this list? Maybe something else caused our downtime. Hope to find out soon. -Phil Kostas Georgiou writes: > On Tue, Jul 19, 2005 at 04:17:22PM +0100, Greig A Cowan wrote: > >> Hi everyone, >> >> We are currently involved in the file transfers from RAL. However, we have >> been having trouble with our pool node in that all the CPU (8*1.9 GHz) >> and memory (physical RAM is 32 GB) resources have been quickly used up, >> grinding the machine to a halt. This has prevented us from accepting >> files. >> >> When Steve Thorn (NeSC) analysed the machine, it appears that dCache was >> spawning java processes: >> >> 1195 ? S 0:00 /bin/sh /opt/d-cache/jobs/pool -pool=dcache >> -logfile >> 1197 ? S 0:00 \_ /usr/java/j2sdk1.4.2_08/bin/java -server >> -Xmx256m >> 1200 ? S 9:55 \_ /usr/java/j2sdk1.4.2_08/bin/java >> -server -Xmx >> 1201 ? S 0:57 \_ /usr/java/j2sdk1.4.2_08/bin/java >> -server >> 1202 ? S 0:00 \_ /usr/java/j2sdk1.4.2_08/bin/java >> -server >> 1203 ? S 0:00 \_ /usr/java/j2sdk1.4.2_08/bin/java >> -server >> 1204 ? S 0:00 \_ /usr/java/j2sdk1.4.2_08/bin/java >> -server >> ... >> >> There were ~200 each using 57 MB RAM. At one point, the total RAM used was >> 31 GB. At the moment, Dcache services have been stopped on the pool node >> and after a reboot the machine appears to have returned to normal. Has >> anyone seen/heard of this before? > > I can see around 400 threads from the two java processes in one of our pool > nodes. Total memory in use is ~380MB for both of them (~100 for the pool, ~180 > for gridftp). Are you sure that the problem was caused because of low memory? > Threads share all the data so it's more likely to me that the process was only > using 57MB total ;P > > In our pool node, there are also 165 connections from csfnfs*.rl.ac.uk and the disk > spends most of it's seeking instead of doing something useful (writing) which > causes a huge load. > > Cheers, > Kostas
Kostas Georgiou
On Tue, Jul 19, 2005 at 05:31:25PM +0100, Philip Clark wrote:
> Hi Kostas,
>
> Yes, this makes sense the threads all share the same memory. I have
> suggest Greig move this thread to lcg-support-dcache. Are you on this
> list?

I wasn't even aware that the mailing list existed. Any info on how to subscribe to it?

> Maybe something else caused our downtime. Hope to find out soon.

If I am not mistaken you are running AS2.1, which uses the old threading model, which is far less efficient than the new one, so I won't rule out the threads having caused the problem.

Kostas
Owen Synge
On Tue, 19 Jul 2005 17:31:25 +0100 Philip Clark wrote:
> Hi Kostas,
>
> Yes, this makes sense the threads all share the same memory. I have
> suggest Greig move this thread to lcg-support-dcache. Are you on this
> list?
I am not, and I should like the thread to be a cross-post so I can see it too without a flood of irrelevant emails; but a summary would be great.
>
> Maybe something else caused our downtime. Hope to find out soon.
>
> -Phil
I think the disk thrashing issue is worth flagging to others as a performance hit, as the experiments are trying to break things to find out what breaks under what circumstances, and then, I think, trying to find out the best way to work with the software stack.
Just my 2 pence worth
Regards
Owen S
Kostas Georgiou
Well, I have parallel streams set to 1 for srm and it seemed to work fine for the Phedex transfers from RAL (~480 Mbit/sec). Somehow the SC3 transfers that Derek is running at the moment use something between 2 and 5 streams (from my strace logs); the end result is that we haven't managed to get more than ~80 Mbit/sec :(
From the strace logs it looks like each thread in d-cache writes its own stream as it arrives, instead of merging everything back in a buffer, resulting in writes like:
lseek(23, 106792960, SEEK_SET) = 106792960
write(23, ..., 10240) = 10240
..
lseek(23, 106844160, SEEK_SET) = 106844160
write(23, ..., 10240) = 10240
<guesswork> The OS/RAID controller might be able to merge everything back together before writing to the disk, but with the 250 streams that we have at the moment I think it's unlikely to happen (iostat reports minimal merges compared to writes).
Since we are using RAID5 for the disks with a stripe of 64K, the non-merged 10K writes result in partial stripe writes, which cause Read-Modify-Write operations, slowing everything down even more :( </guesswork>
Too bad there is no source available to play with different settings :( I'll boot one of the pool nodes with the Anticipatory elevator, which might be able to do better than the other ones, but I don't expect it to make much difference :(
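(For reference, on a 2.6 kernel the elevator is normally chosen with a kernel boot parameter; a sketch of the grub.conf kernel line, where the kernel version and root device are placeholders:)

kernel /vmlinuz-2.6.x ro root=/dev/sda1 elevator=as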
Cheers, Kostas
"Sansum, RA (Andrew)"
So many streams to the RAID controller is bound to be a disaster. We need to reduce the number of parallel streams being generated by FTS/dCache.
Regards
Andrew
Greig A Cowan
Hi everyone,
Following on from my post yesterday regarding dCache spawning java processes, I have carried out some more tests of our dCache and have found the following:
The problem of excessive memory/CPU usage appears to occur during all file transfers into our dCache (using any level of parallelism). Once a transfer is complete, the CPU usage returns to normal but the memory used is not released.
The problem with the sustained RAL FTS transfers is that due to the continuous nature of the transfers, the CPU usage is always high and after enough time all the memory runs out. Presumably this is even the case when just performing periodic transfers: after enough of them the system will still run out of memory. This possibly helps to explain why we have previously had problems of this nature with our pool node, even before RAL started their FTS transfers.
Any information anyone has on this matter would be useful. We are currently looking into issues regarding the memory management of our pool node. I will keep you posted on our progress.
Thanks,
Greig
Steve Traylen
Hi Brian,
As you know, I was looking at the FTS logs for transfers from RAL to Lanc'.
They contain the exciting bit below. And from the srmcp below that, it looks like the 'set done' (setFileStatus) call is not working, failing with an Axis error.
I don't know.
Steve
2005-07-21 13:45:01,346 [WARN ] - Starting gsiftp transfer
TURL source = gsiftp://gftp0441.gridpp.rl.ac.uk:2811//pnfs/gridpp.rl.ac.uk/data/dteam/fts_test/fts_test-20
TURL dest = gsiftp://fal-pygrid-26.lancs.ac.uk:2811//pnfs/lancs.ac.uk/data/dteam/fts_ral/339e212941f1ff49d4332e7b90e4cfd8
FILE SIZE = 1055162368
2005-07-21 13:55:02,645 [INFO ] - STATUS:END fail:TRANSFER
2005-07-21 13:55:02,645 [INFO ] - STATUS:BEGIN:SRM_PUTDONE
2005-07-21 13:55:02,646 [DEBUG] - Performing Call to method srm__setFileStatus
2005-07-21 13:55:02,948 [DEBUG] - Call completed to srm__setFileStatus
2005-07-21 13:55:02,949 [INFO ] - STATUS:END:SRM_PUTDONE
2005-07-21 13:55:02,949 [DEBUG] - Performing Call to method srm__advisoryDelete
2005-07-21 13:55:03,881 [DEBUG] - Call completed to srm__advisoryDelete
2005-07-21 13:55:03,882 [INFO ] - STATUS:BEGIN:SRM_GETDONE
2005-07-21 13:55:03,882 [DEBUG] - Performing Call to method srm__setFileStatus
2005-07-21 13:55:04,281 [DEBUG] - Call completed to srm__setFileStatus
2005-07-21 13:55:04,282 [INFO ] - STATUS:END:SRM_GETDONE
2005-07-21 13:55:04,283 [INFO ] - STATUS:FAILED
2005-07-21 13:55:04,283 [DEBUG] - exiting listener thread which still seems active
2005-07-21 13:55:04,283 [ERROR] - FINAL:ABORT:TRANSFER - Transfer timed out.%
Also trying an srmcp on a light visible host:
RM Configuration: debug=true gsissl=true help=false pushmode=false userproxy=true buffer_size=2048 tcp_buffer_size=0 config_file=/home/traylens/.srmconfig/config.xml glue_mapfile=/opt/d-cache/srm/conf/SRMServerV1.map webservice_path=srm/managerv1.wsdl webservice_protocol=https gsiftpclinet=globus-url-copy protocols_list=http,gsiftp save_config_file=null srmcphome=/opt/d-cache/srm urlcopy=/opt/d-cache/srm/bin/url-copy.sh x509_user_cert=/home/csf/traylens/.globus/usercert.pem x509_user_key=/home/csf/traylens/.globus/userkey.pem x509_user_proxy=/tmp/x509up_u27532 x509_user_trusted_certificates=/etc/grid-security/certificates retry_num retry_timeout=10000 wsdl_url=null use_urlcopy_script=true connect_to_wsdl=false from[0]=file:////etc/group to=srm://fal-pygrid-26.lancs.ac.uk:8443//pnfs/lancs.ac.uk/data/dteam/bingo Thu Jul 21 14:54:50 BST 2005: starting SRMPutClient Thu Jul 21 14:54:50 BST 2005: SRMClient(https,srm/managerv1.wsdl,true) Thu Jul 21 14:54:50 BST 2005: connecting to server Thu Jul 21 14:54:50 BST 2005: connected to server, obtaining proxy SRMClientV1 : connecting to srm at httpg://fal-pygrid-26.lancs.ac.uk:8443/srm/managerv1 Thu Jul 21 14:54:51 BST 2005: got proxy of type class org.dcache.srm.client.SRMClientV1 SRMClientV1 : put, sources[0]="/etc/group" SRMClientV1 : put, dests[0]="srm://fal-pygrid-26.lancs.ac.uk:8443//pnfs/lancs.ac.uk/data/dteam/bingo" SRMClientV1 : put, protocols[0]="http" SRMClientV1 : put, protocols[1]="dcap" SRMClientV1 : put, protocols[2]="gsiftp" SRMClientV1 : put, contacting service httpg://fal-pygrid-26.lancs.ac.uk:8443/srm/managerv1 doneAddingJobs is false copy_jobs is empty Thu Jul 21 14:54:53 BST 2005: srm returned requestId = -2147480986 Thu Jul 21 14:54:53 BST 2005: sleeping 1 seconds ... Thu Jul 21 14:54:58 BST 2005: FileRequestStatus with SURL=srm://fal-pygrid-26.lancs.ac.uk:8443//pnfs/lancs.ac.uk/data/dteam/bingo is Ready Thu Jul 21 14:54:58 BST 2005: received TURL=gsiftp://fal-pygrid-26.lancs.ac.uk:2811//pnfs/lancs.ac.uk/data/dteam/bingo doneAddingJobs is false copy_jobs is not empty copying CopyJob, source = file:////etc/group destination = gsiftp://fal-pygrid-26.lancs.ac.uk:2811//pnfs/lancs.ac.uk/data/dteam/bingo trying script copy executing command /opt/d-cache/srm/bin/url-copy.sh -get-protocols exit value is 0 GridftpClient: connecting to fal-pygrid-26.lancs.ac.uk on port 2811 GridftpClient: gridFTPClient tcp buffer size is set to 1048576 GridftpClient: gridFTPWrite started, source file is java.io.RandomAccessFile@12c3327 destination path is /pnfs/lancs.ac.uk/data/dteam/bingo GridftpClient: parallelism: 10 GridftpClient: adler 32 for file java.io.RandomAccessFile@12c3327 is f6bacd15 GridftpClient: waiting for completion of transfer GridftpClient: gridFtpWrite: starting the transfer in emode to /pnfs/lancs.ac.uk/data/dteam/bingo GridftpClient: DiskDataSink.close() called GridftpClient: gridFTPWrite() wrote 649bytes GridftpClient: closing client : org.dcache.srm.util.GridftpClient$FnalGridFTPClient@a83a13 GridftpClient: closed client execution of CopyJob, source = file:////etc/group destination = gsiftp://fal-pygrid-26.lancs.ac.uk:2811//pnfs/lancs.ac.uk/data/dteam/bingo completed setting file request -2147480985 status to Done AxisFault faultCode: {http://xml.apache.org/axis/}HTTP faultSubcode: faultString: (0)null faultActor: faultNode: faultDetail: {}:return code: 0 {http://xml.apache.org/axis/}HttpErrorCode:0 (0)null at org.apache.axis.transport.http.HTTPSender.readFromSocket(HTTPSender.java:663) at 
org.apache.axis.transport.http.HTTPSender.invoke(HTTPSender.java:94) at org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:32) at org.apache.axis.SimpleChain.doVisiting(SimpleChain.java:118) at org.apache.axis.SimpleChain.invoke(SimpleChain.java:83) at org.apache.axis.client.AxisClient.invoke(AxisClient.java:147) at org.apache.axis.client.Call.invokeEngine(Call.java:2719) at org.apache.axis.client.Call.invoke(Call.java:2702) at org.apache.axis.client.Call.invoke(Call.java:2378) at org.apache.axis.client.Call.invoke(Call.java:2301) at org.apache.axis.client.Call.invoke(Call.java:1758) at org.dcache.srm.client.axis.ISRMStub.setFileStatus(ISRMStub.java:512) at org.dcache.srm.client.SRMClientV1.setFileStatus(SRMClientV1.java:1082) at gov.fnal.srm.util.CopyJob.done(CopyJob.java:152) at gov.fnal.srm.util.Copier.run(Copier.java:305) at java.lang.Thread.run(Thread.java:534) SRMClientV1 : getRequestStatus: try #0 failed with exception java.lang.RuntimeException: (0)null at org.dcache.srm.client.SRMClientV1.setFileStatus(SRMClientV1.java:1086) at gov.fnal.srm.util.CopyJob.done(CopyJob.java:152) at gov.fnal.srm.util.Copier.run(Copier.java:305) at java.lang.Thread.run(Thread.java:534) setting File Request to "Done" failed java.lang.RuntimeException: (0)null at org.dcache.srm.client.SRMClientV1.setFileStatus(SRMClientV1.java:1086) at gov.fnal.srm.util.CopyJob.done(CopyJob.java:152) at gov.fnal.srm.util.Copier.run(Copier.java:305) at java.lang.Thread.run(Thread.java:534) Exception in thread "main" java.lang.RuntimeException: (0)null at org.dcache.srm.client.SRMClientV1.setFileStatus(SRMClientV1.java:1086) at gov.fnal.srm.util.CopyJob.done(CopyJob.java:152) at gov.fnal.srm.util.Copier.run(Copier.java:305) at java.lang.Thread.run(Thread.java:534) setting file request -2147480985 status to Done java.lang.IllegalStateException: Shutdown in progress at java.lang.Shutdown.add(Shutdown.java:79) at java.lang.Runtime.addShutdownHook(Runtime.java:190) at gov.fnal.srm.util.Copier.run(Copier.java:229) at java.lang.Thread.run(Thread.java:534)
Kostas Georgiou
Kostas Georgiou
Hi,
After our cms people deleted 1.5-2.0TB from Dcache with GSI DCAP, we discovered that the pools hadn't freed the space, although the files are gone from the /pnfs name space. In the admin web page the pools show as full and the data is marked as precious.
Any ideas on how to clean up the mess?
Cheers,
Kostas
"Ross, D (Derek)"
Hi Kostas,
Have a look in /opt/pnfsdb/pnfs/trash/2 on your admin node. The name of every file in there is the pnfsid of a file that has been deleted from pnfs; rep rm <pnfsid> -force in the pool's cell in the admin node should delete the file.
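Spelled out, the manual procedure is roughly the following (a sketch; the pnfsid and pool name are placeholders, and the pool name can be read from the trash file itself, as shown later in this thread):

# on the admin node: each file name in the trash is a deleted pnfsid
ls /opt/pnfsdb/pnfs/trash/2
cat /opt/pnfsdb/pnfs/trash/2/<pnfsid>    # the contents name the pool
# then, from the admin interface, inside that pool's cell:
#   cd <poolname>
#   rep rm <pnfsid> -force
#   ..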
Derek
Kostas Georgiou
Ah and i was about to delete the pool to recover the space :) Is there a reason why Dcache doesn't delete the files automatically? Do we have to go through the cleanup exercise manually or it was some random failure in Dcache?
There are 1195 files in the trash :( I'll have to write some script i guess. Do i delete the files from the trash after i run rep rm .... or Dcache will do something about it?
Cheers,
Kostas
Greig A Cowan
> Ah and i was about to delete the pool to recover the space :) Is there a reason why Dcache doesn't delete the files automatically? Do we have to go through the cleanup exercise manually or it was some random failure in Dcache?
>
> There are 1195 files in the trash :( I'll have to write some script i guess. Do i delete the files from the trash after i run rep rm .... or Dcache will do something about it?
I had started to write a set of scripts to remove files from a dCache pool, but then Owen S mentioned that someone (Judith Novak?) at CERN had already done this. Has there been any progress on finding out about this, Owen?
Cheers,
Greig
"Ross, D (Derek)"
> Ah and i was about to delete the pool to recover the space :) Is there a reason why Dcache doesn't delete the files automatically?
It does normally; this may be related to the problem where the pool starts writing errors into the log file about deleting files.
> There are 1195 files in the trash :( I'll have to write some script i guess. Do i delete the files from the trash after i run rep rm .... or Dcache will do something about it?
I think you're free to delete them.
Derek
Kostas Georgiou
I see. I deleted everything from the pools, leaving only 13GB in /pnfs, but for some reason, even after deleting everything with rep rm ..., Dcache reports that ~500GB are still in use :(
I also noticed that some files in the trash don't have an associated pool and I can't delete them; am I right to guess that those files have already been deleted?
Any ideas what the fields mean? I can't seem to find anything in the "documentation".
$ cat 000100000000000000018AA0
2,0,0,0.0,0.0
:d=true;
 sedsk00_1
$ cat 000100000000000000023EB8
2,0,0,0.0,0.0
:
 sedsk00_1
$ cat 000100000000000000016EC8
2,0,0,0.0,0.0
:d=true;c=1:5dc00001;s=*;
 sedsk00_1
$ cat 000100000000000000023D08
2,0,0,0.0,0.0
:d=true;
$ cat 00010000000000000000F3D8
2,0,0,0.0,0.0
:
"Brian Davies"
Aren't these files deleted by the thresh command when the Dcache reaches some (settable) filled capacity? I.e. files are marked as deletable but are not removed until you reach x% filled capacity, at which point it starts removing these files until the filled capacity is reduced back to y%, or until all removable files have been removed (whichever comes first). Brian
Kostas Georgiou
> I see, i deleted everything from the pools leaving only 13GB in /pnfs for some reason even after deleting everything by rep rm ... Dcache reports that ~500GB are still in use :(
It seems that was caused by my script, which missed some files :( Here is the improved and simplified version, in case anyone needs to use it at some point.
Cheers, Kostas
$ ./dcache_emptytrash > commands
$ ssh dcacheadm < commands
$ cat dcache_emptytrash
#!/bin/bash

trash=/opt/pnfsdb/pnfs/trash/2
pools="sedsk00_1 cmsdsk00_1 cmsdsk00_2"

for pool in $pools; do
  echo "cd $pool"
  for file in `find $trash -type f -print0 | xargs -0 grep -sl $pool`; do
    echo "rep rm ${file##*/} -force"
  done
  echo ".."
done
echo logoff
Kostas Georgiou
On Mon, Jul 25, 2005 at 09:49:51AM +0100, Brian Davies wrote:
> aren't these files deleted by the thresh command when the Dcache
> reaches some (settable) filled capacity. ie files are marked as
> deletable but are not removed until you reach x% filled capacity at
> which point it starts removing these files until the filled capacity
> is reduced back to y% or when all removable files have been removed (
> whichever comes first).

From the three pools that we have, the usage was 100%, 90% and 80%, while around 80% of the total space was "deleted"; all the files in the full pool were marked as deletable and the cleanup never happened. I'll have a look at setting the threshold to something really low and see if it makes a difference.

Kostas
Kostas Georgiou
Hi,
Has anyone seen error messages like these before on the admin node? To me it smells like corruption in the pnfs database somewhere.
Cheers,
Kostas
nfs_refresh_inode: inode number mismatch expected (0xc/0x104327f), got (0xc/0x1043278) nfs_refresh_inode: inode number mismatch expected (0xc/0x1043497), got (0xc/0x1043490) nfs_refresh_inode: inode number mismatch expected (0xc/0x103e53f), got (0xc/0x103e538) nfs_refresh_inode: inode number mismatch expected (0xc/0x104367f), got (0xc/0x1043678) nfs_refresh_inode: inode number mismatch expected (0xc/0x100e827), got (0xc/0x100e820) nfs_refresh_inode: inode number mismatch expected (0xc/0x1001127), got (0xc/0x1001120) nfs_refresh_inode: inode number mismatch expected (0xc/0x1001067), got (0xc/0x1080) nfs_refresh_inode: inode number mismatch expected (0xc/0x1087), got (0xc/0x1080) nfs_refresh_inode: inode number mismatch expected (0xc/0x1047), got (0xc/0x1040) nfs_refresh_inode: inode number mismatch expected (0xc/0x1027), got (0xc/0x1020) nfs_refresh_inode: inode number mismatch expected (0xc/0x1043687), got (0xc/0x1043680) nfs_refresh_inode: inode number mismatch expected (0xc/0x1043697), got (0xc/0x1043690) nfs_refresh_inode: inode number mismatch expected (0xc/0x104373f), got (0xc/0x1043738) nfs_refresh_inode: inode number mismatch expected (0xc/0x1043717), got (0xc/0x1043710) nfs_refresh_inode: inode number mismatch expected (0xc/0x1043727), got (0xc/0x1043720) nfs_refresh_inode: inode number mismatch expected (0xc/0x100e827), got (0xc/0x100e820) nfs_refresh_inode: inode number mismatch expected (0xc/0x1001127), got (0xc/0x1001120) nfs_refresh_inode: inode number mismatch expected (0xc/0x1001067), got (0xc/0x1080) nfs_refresh_inode: inode number mismatch expected (0xc/0x1087), got (0xc/0x1080) nfs_refresh_inode: inode number mismatch expected (0xc/0x1047), got (0xc/0x1040) nfs_refresh_inode: inode number mismatch expected (0xc/0x1027), got (0xc/0x1020) nfs_refresh_inode: inode number mismatch expected (0xc/0x103385f), got (0xc/0x1033858)
Jiri Mencak
Dear all,
I've played a little bit with dual-homed machines and dCache, with mixed success. Nevertheless, I think it is worth reporting and I'm looking forward to your feedback.
Architecture
Pentium III 600
OS
Scientific Linux SL Release 3.0.4 (SL)
dCache
d-cache-client-1.0-100 d-cache-core-1.5.2-83 d-cache-lcg-5.0.0-1 d-cache-opt-1.5.3-84 (d-cache-gpp-v1.2.1-1)
I have done a simplified dCache installation using the GridPP storage dependency RPMs (no BDII etc.) to speed things up; an LCG yaim 2.5.0 installation should work equally well.
Scenario
Admin node: dual-homed box with a /pool on the same box (I know, 3 dual-homed boxes would be better with no pool on the admin node, but this should do as a proof of concept)
Pool node: dual-homed box with a /pool
Public Interfaces: E0a (admin.public.ac.uk), E0p (pool.public.ac.uk)
Private Interfaces: E1a (192.168.0.32), E1p (192.168.0.33)
[ASCII diagram: the admin node (with /pool) and the pool node (with /pool) are each connected to both the public network switch (via E0a/E0p) and the private network switch (via E1a/E1p), with the other pools attached below.]
Installation
1) Installed SL 3.0.4 and grid certificates
2) Made sure `hostname` returns FQDN associated with E0a and E0p, in other words, public FQDN.
3) To make internal dCache communication pass through the private interfaces I've set up an internal DNS server to fool the admin and pool nodes into thinking admin.public.ac.uk is 192.168.0.32 and pool.public.ac.uk is 192.168.0.33 (see the sketch just after this list).
4) Made sure
`hostname -d` = `grep ^search /etc/resolv.conf | awk '{print $2}'`
5) Set up site-info.def:
MY_DOMAIN=`hostname -d`
DCACHE_ADMIN=<E1a private FQDN>
DCACHE_POOLS="`hostname -f`:2:/pool"
6) Installed dCache using GridPP storage dependency RPMs.
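(Regarding step 3: an equivalent, simpler way to get the same name-fooling effect on just these two boxes - an alternative, not what was actually done here - is a pair of /etc/hosts entries on both nodes, assuming hosts is consulted before dns in /etc/nsswitch.conf:)

# /etc/hosts on both the admin and pool nodes (sketch)
192.168.0.32   admin.public.ac.uk   admin
192.168.0.33   pool.public.ac.uk    pool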
Testing
globus-url-copy and dCache SRM copy worked fine, including third-party copying (get) _from_ dual-homed boxes. Unfortunately, third-party copying (put) _to_ dual-homed boxes fails. Relevant dCache log snippets attached.
Tier 2 dual-homing requirements
It would be nice to hear what the architectural requirements from Tier 2 sites are with regard to dual-homing. I was working under the assumption that the purpose of dual-homed machines was to increase network throughput on the public interface by passing internal dCache communication through the private interface, and to shield dCache from the outside world, exposing only SRM and GridFTP on the public interface.
I suspect there will be other/different requirements with regard to the dual-homed architecture so it would be nice to hear them. Owen tells me that if you need dual-homing, your setup will almost certainly be Lightpath on the public interface, and university network on the private interface.
I'm now partly leaving dCache support and moving onto another project, so I cannot guarantee I'll be working on dual-homing in the future.
Regards.
Jiri
07/19 10:19:11 Cell(PnfsManager@pnfsDomain) : Failed : CacheException(rc=666;msg=can't get pnfsId (not a pnfsfile)) 07/19 10:19:11 Cell(PnfsManager@pnfsDomain) : CacheException(rc=666;msg=can't get pnfsId (not a pnfsfile)) 07/19 10:19:11 Cell(PnfsManager@pnfsDomain) : at diskCacheV111.cells.PnfsManager2.getStorageInfo(PnfsManager2.java:950) 07/19 10:19:11 Cell(PnfsManager@pnfsDomain) : at diskCacheV111.cells.PnfsManager2.processPnfsMessage(PnfsManager2.java:1597) 07/19 10:19:11 Cell(PnfsManager@pnfsDomain) : at diskCacheV111.cells.PnfsManager2$ProcessThread.run(PnfsManager2.java:1518) 07/19 10:19:11 Cell(PnfsManager@pnfsDomain) : at java.lang.Thread.run(Thread.java:534) 07/19 10:19:11 Cell(PnfsManager@pnfsDomain) : Error obtaining 'l' flag for getSimulatedFilesize : java.io.FileNotFoundException: /pnfs/fs/.(puse)(000100000000000000001120)(2) (Is a directory) 07/19 10:19:12 Cell(PnfsManager@pnfsDomain) : Error obtaining 'l' flag for getSimulatedFilesize : java.io.FileNotFoundException: /pnfs/fs/.(puse)(000100000000000000001120)(2) (Is a directory) 07/19 10:16:18 Cell(SRM@srmDomain) : Request id=-2147483523: copy request state changed to Done 07/19 10:16:18 Cell(SRM@srmDomain) : Request id=-2147483523: changing fr#-2147483522 to Done 07/19 10:18:35 Cell(SRM@srmDomain) : CopyRequest reqId # -2147483521Request.createCopyRequest : created new request succesfully 07/19 10:20:08 Cell(SRM@srmDomain) : remoing TransferInfo for callerId=20000 07/19 10:20:08 Cell(SRM@srmDomain) : org.dcache.srm.scheduler.NonFatalJobFailure: CacheException(rc=666;msg=tranfer failed :org.globus.ftp.exception.ServerException: Server refused performing the request. Custom message: Server reported transfer failure (error code 1) [Nested exception message: Custom message: Unexpected reply: 426 Transfer aborted, closing connection :java.net.NoRouteToHostException: No route to host] [Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException: Custom message: Unexpected reply: 426 Transfer aborted, closing connection :java.net.NoRouteToHostException: No route to host]) 07/19 10:20:08 Cell(SRM@srmDomain) : at org.dcache.srm.request.CopyFileRequest.runRemoteToLocalCopy(CopyFileRequest.java:666) 07/19 10:20:08 Cell(SRM@srmDomain) : at org.dcache.srm.request.CopyFileRequest.run(CopyFileRequest.java:770) 07/19 10:20:08 Cell(SRM@srmDomain) : at org.dcache.srm.scheduler.Scheduler$JobWrapper.run(Scheduler.java:1121) 07/19 10:20:08 Cell(SRM@srmDomain) : at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(PooledExecutor.java) 07/19 10:20:08 Cell(SRM@srmDomain) : at java.lang.Thread.run(Thread.java:534) 07/19 10:20:08 Cell(SRM@srmDomain) : CopyFileRequest #-2147483520: copy failed 07/19 10:20:08 Cell(SRM@srmDomain) : org.dcache.srm.scheduler.NonFatalJobFailure: org.dcache.srm.scheduler.NonFatalJobFailure: CacheException(rc=666;msg=tranfer failed :org.globus.ftp.exception.ServerException: Server refused performing the request. 
Custom message: Server reported transfer failure (error code 1) [Nested exception message: Custom message: Unexpected reply: 426 Transfer aborted, closing connection :java.net.NoRouteToHostException: No route to host] [Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException: Custom message: Unexpected reply: 426 Transfer aborted, closing connection :java.net.NoRouteToHostException: No route to host]) 07/19 10:20:08 Cell(SRM@srmDomain) : at org.dcache.srm.request.CopyFileRequest.run(CopyFileRequest.java:798) 07/19 10:20:08 Cell(SRM@srmDomain) : at org.dcache.srm.scheduler.Scheduler$JobWrapper.run(Scheduler.java:1121) 07/19 10:20:08 Cell(SRM@srmDomain) : at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(PooledExecutor.java) 07/19 10:20:08 Cell(SRM@srmDomain) : at java.lang.Thread.run(Thread.java:534) 07/19 10:20:36 Cell(SRM@srmDomain) : CopyRequest reqId # -2147483521copyRequest getter_putter is non null, stopping 07/19 10:20:36 Cell(SRM@srmDomain) : CopyRequest reqId # -2147483521changing fr#-2147483520 to Failed 07/19 10:20:36 Cell(SRM@srmDomain) : CopyRequest reqId # -2147483521error : 07/19 10:20:36 Cell(SRM@srmDomain) : org.dcache.srm.scheduler.IllegalStateTransition: g illegal state transition from Canceled to Failed 07/19 10:20:36 Cell(SRM@srmDomain) : at org.dcache.srm.scheduler.Job.setState(Job.java:532) 07/19 10:20:36 Cell(SRM@srmDomain) : at org.dcache.srm.scheduler.Job.setState(Job.java:417) 07/19 10:20:36 Cell(SRM@srmDomain) : at org.dcache.srm.request.CopyRequest.stateChanged(CopyRequest.java:952) 07/19 10:20:36 Cell(SRM@srmDomain) : at org.dcache.srm.scheduler.Job.setState(Job.java:566) 07/19 10:20:36 Cell(SRM@srmDomain) : at org.dcache.srm.scheduler.Job.setState(Job.java:417) 07/19 10:20:36 Cell(SRM@srmDomain) : at org.dcache.srm.request.Request.getRequestStatus(Request.java:521) 07/19 10:20:36 Cell(SRM@srmDomain) : at org.dcache.srm.SRM.getRequestStatus(SRM.java:868) 07/19 10:20:36 Cell(SRM@srmDomain) : at diskCacheV111.srm.server.SRMServerV1.getRequestStatus(SRMServerV1.java:360) 07/19 10:20:36 Cell(SRM@srmDomain) : at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 07/19 10:20:36 Cell(SRM@srmDomain) : at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) 07/19 10:20:36 Cell(SRM@srmDomain) : at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) 07/19 10:20:36 Cell(SRM@srmDomain) : at java.lang.reflect.Method.invoke(Method.java:324) 07/19 10:20:36 Cell(SRM@srmDomain) : at electric.util.reflect.Invocation.execute(Unknown Source) 07/19 10:20:36 Cell(SRM@srmDomain) : at electric.util.reflect.Invocation.invoke(Unknown Source) 07/19 10:20:36 Cell(SRM@srmDomain) : at electric.service.object.ObjectService.invoke(Unknown Source) 07/19 10:20:36 Cell(SRM@srmDomain) : at electric.net.soap.SOAPMessage.invoke(SOAPMessage.java:534) 07/19 10:20:36 Cell(SRM@srmDomain) : at electric.net.soap.SOAPMessage.invoke(SOAPMessage.java:508) 07/19 10:20:36 Cell(SRM@srmDomain) : at electric.net.soap.http.SOAPHTTPHandler.service(SOAPHTTPHandler.java:88) 07/19 10:20:36 Cell(SRM@srmDomain) : at electric.server.http.ServletServer.service(Unknown Source) 07/19 10:20:36 Cell(SRM@srmDomain) : at javax.servlet.http.HttpServlet.service(HttpServlet.java:853) 07/19 10:20:36 Cell(SRM@srmDomain) : at electric.net.servlet.Config.service(Unknown Source) 07/19 10:20:36 Cell(SRM@srmDomain) : at electric.net.http.HTTPContext.service(HTTPContext.java:84) 07/19 10:20:36 Cell(SRM@srmDomain) : at 
electric.net.servlet.ServletContainer.service(Unknown Source) 07/19 10:20:36 Cell(SRM@srmDomain) : at electric.net.http.WebServer.service(WebServer.java:87) 07/19 10:20:36 Cell(SRM@srmDomain) : at electric.net.socket.SocketServer.run(Unknown Source) 07/19 10:20:36 Cell(SRM@srmDomain) : at electric.net.socket.SocketRequest.run(Unknown Source) 07/19 10:20:36 Cell(SRM@srmDomain) : at electric.util.thread.ThreadPool.run(Unknown Source) 07/19 10:20:36 Cell(SRM@srmDomain) : at java.lang.Thread.run(Thread.java:534)
Hi all,
sorry for replying to my own email, but I thought I'd preserve the ``thread''.
After giving the dual-homed boxes some time to rest and discussing this with dCache developers, I've given the 3rd party copies (_to_ dual-homed boxes) another chance. The weird thing is that they started to work! I admit to having rebooted the boxes for kernel upgrade and therefore restarted dCache, so that might have helped. I spent some time trying to replicate the problem, with no luck unfortunately. Dual-homed dCache (as described below) just works for me now.
Thanks and regards.
Jiri
Greig A Cowan
Hi everyone,
I have been playing around with dCache and have recently found a problem that I wasn't previously experiencing.
If (as root) I list the contents of the /pnfs/fs directory, I get the expected output:
# ls /pnfs/fs
admin  README  usr
If I then try to view the contents of the README file, I get an error. I have previously been able to view the contents without any problem.
# cat /pnfs/fs/README
Command failed!
Server error message for [1]: "Couldn't determine hsmType" (errno 37).
Failed open file in the dCache.
cat: README: Input/output error
This same error is also generated in the directories lower down in the /pnfs tree when (for example) I try to copy the contents of the file /pnfs/fs/admin/etc/exports/127.0.0.1 to the IP address of a pool node (as you need to do to create a door on a pool node).
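For anyone following along, that step is just a plain file copy inside the pnfs admin tree; a minimal sketch (192.168.0.10 is a hypothetical pool node address, not one taken from this thread) would be:

# copy the localhost export entry so the pool node can mount pnfs and run a door
# (192.168.0.10 is a hypothetical pool node IP)
cd /pnfs/fs/admin/etc/exports
cp 127.0.0.1 192.168.0.10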
I have the correct dCache dcap libraries preloaded:
export LD_PRELOAD=/opt/d-cache/dcap/lib/libpdcap.so
Note that I am able to view the contents of test files that I have copied into /pnfs/epcc.ed.ac.uk/data/dteam.
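As a sanity check that is independent of the LD_PRELOAD trick, copying a file in and out with dccp exercises the same dcap path; a sketch, assuming the client lives under /opt/d-cache/dcap/bin, the door runs on the admin node and uses the default dcap port 22125:

# write a small test file through the dcap door and read it back
# (host name, port and client path as assumed above)
/opt/d-cache/dcap/bin/dccp /etc/group dcap://srm.epcc.ed.ac.uk:22125/pnfs/epcc.ed.ac.uk/data/dteam/dcap_test
/opt/d-cache/dcap/bin/dccp dcap://srm.epcc.ed.ac.uk:22125/pnfs/epcc.ed.ac.uk/data/dteam/dcap_test /tmp/dcap_test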
The only references to hsm that I can find are in the pool setup files which contain commented out lines such as:
hsm set osm -pnfs=/pnfs/fs
Does/has anyone else experienced this problem?
Thanks,
Greig
Greig A Cowan
Hi all,
Our pool node has been taking a bit of a pounding during the FTS transfer tests from RAL. The machine has 16GB of RAM, all of which was being used to handle the transfers. You can see the high memory and CPU usage on the Ganglia page for our pool node:
http://mon.epcc.ed.ac.uk/ganglia/?r=day&c=ScotGrid-Edinburgh&h=dcache.epcc.ed.ac.uk
8 parallel transfers were being used during these tests.
Have any of the other Tier-2s (IC, Lancaster) experienced this sort of behaviour during sustained transfers? What about RAL? Andrew Sansum mentioned that RAL and SARA may have seen a problem like this that was solved by reducing the number of parallel transfers.
I think it would be good to find out whether this is just an issue with Edinburgh's setup or a more general dCache issue, with java processes consuming large amounts of pool node resources. This may be an issue we see more of once all Tier-2s get an SRM (assuming they use dCache, that is!).
Any information would be useful.
Thanks,
Greig
Owen Synge
On Wed, 27 Jul 2005 11:39:05 +0100 Greig A Cowan wrote:
> Hi everyone,
>
> I have been playing around with dCache and have recently found a problem
> that I wasn't previously experiencing.
>
> If (as root) I list the contents of the /pnfs/fs directory, I get the
> expected output:
>
> # ls /pnfs/fs
> admin README usr
>
> If I then try to view the contents of the README file, I get an error.
> I have previously been able to view the contents without any problem.
>
> # cat /pnfs/fs/README
> Command failed!
> Server error message for [1]: "Couldn't determine hsmType" (errno 37).
> Failed open file in the dCache.
> cat: README: Input/output error

Odd I get the error

[root@dev01 root]# export LD_PRELOAD=/opt/d-cache/dcap/lib/libpdcap.so
[root@dev01 root]# ls /pnfs/
fs gridpp.rl.ac.uk
[root@dev01 root]# cat /pnfs/fs/
admin README usr
[root@dev01 root]# cat /pnfs/fs/README
Failed to create a control line
Failed open file in the dCache.
cat: /pnfs/fs/README: Connection refused

but it used to work when I installed D-Cache and wrote this down in the HOWTO

> This same error is also generated in the directories lower down in the
> /pnfs tree when (for example) I try to copy the contents of the file
> /pnfs/fs/admin/etc/exports/127.0.0.1 to the IP address of a pool node (as
> you need to do to create a door on a pool node).
>
> I have the correct dCache dcap libraries preloaded:
>
> export LD_PRELOAD=/opt/d-cache/dcap/lib/libpdcap.so
>
> Note that I am able to view the contents of test files that I have copied
> into /pnfs/epcc.ed.ac.uk/data/dteam.
>
> The only references to hsm that I can find are in the pool setup files
> which contain commented out lines such as:
>
> hsm set osm -pnfs=/pnfs/fs
>
> Does/has anyone else experienced this problem?

The hsm stuff is/should be a bit of a distraction as it stands for Hierarchical Storage Manager and is the term commonly used to describe a tape storage system.

Regards

Owen
Alessandra Forti
Hi Greig,
I suspect (and might be wrong) that the kernel tuning Kostas has applied to IC might be useful. I was waiting for him to say something more about it. :)
In the meantime you can look at this page
https://uimon.cern.ch/twiki/bin/view/LCG/ServiceChallengeTwoProgressSARALogbook
to see if you find anything useful that could help you.
cheers
alessandra
Greig A Cowan
Hi Alessandra,
> I suspect (and might be wrong) that the kernel tuning Kostas has applied to IC might be useful. I was waiting for him to say something more about it. :)
I was looking for more information from him as well!
> In the meantime you can look at this page
> https://uimon.cern.ch/twiki/bin/view/LCG/ServiceChallengeTwoProgressSARALogbook
Thanks for the link, I think this could prove useful.
Greig
Kostas Georgiou
Well some questions first :)
Are the java processes really using that much memory, or is it the buffer cache? The kernel will try to cache as many files as possible, so high memory usage is normal. (By the way, your Ganglia page isn't visible from the outside world.)
What type of disks/controllers do you have? From what I've seen, dCache generates really bad IO patterns with parallel transfers. Have you tried running with only one stream and multiple files instead?
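One quick way to tell the two apart (a sketch; adjust the process selection if the dCache JVMs don't show up under the command name "java"):

# buffer cache is reported in the "cached" column and is reclaimable
free -m
# resident (RSS) and virtual memory actually held by the java processes
ps -o pid,rss,vsz,args -C java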
Cheers,
Kostas
Kostas Georgiou
Have a look at this paper: http://people.redhat.com/nhorman/papers/rhel3_vm.pdf. It's RHEL3-specific, but it will give you an idea of what you need to tune.
Is it really a problem that the file cache is taking all the unused memory? As long as the pages are "clean" the kernel can throw them out easily without any problems. You might want to tune vm.bdflush though to be more aggressive so dirty pages are written to disk as fast as possible.
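On a 2.4/RHEL3-era kernel that knob is /proc/sys/vm/bdflush; a sketch of making flushing more aggressive follows (the numbers are purely illustrative, and the field layout should be checked against the kernel's Documentation/sysctl/vm.txt before changing anything):

# show the current nine bdflush fields (2.4.x kernels)
cat /proc/sys/vm/bdflush
# lower the dirty-buffer percentage thresholds and the buffer age so writes
# start sooner -- illustrative values only, verify against the kernel docs
sysctl -w vm.bdflush="5 500 0 0 500 1000 30 3 0"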
> During the FTS tests we were running with 8 files and 0 (?) streams. Can anyone point me to a web page with information regarding the difference between streams and files? With this setup we were seeing ~200Mb/s into our site.
With n multiple streams Dcache writes data like: stream: write 10K, seek ahead (n-1)*10K, write 10K, ....
With 1 stream you get: stream: write 10K, write 10K, ...
If the data from all the streams doesn't arrive at the same time, you end up with writes all over the place and your disk IO suffers as a result. The 3ware RAID cards that we use at IC get around ~70MB/s for sequential IO and around ~1-5MB/s for random IO in RAID5. It's only a guess on my part whether the parallel streams cause that much random IO, since I didn't have much time to test different settings during the FTS transfers :(
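To make that access pattern concrete, here is a tiny sketch (purely illustrative, assuming 10K blocks dealt out round-robin across four streams) of the file offsets each stream ends up writing:

# print the offsets touched by each of n parallel streams, assuming 10K
# blocks handed out round-robin as described above
n=4; block=10
for s in 1 2 3 4; do
  offsets=""
  for i in 0 1 2 3; do
    offsets="$offsets $(( (i * n + s - 1) * block ))K"
  done
  echo "stream $s writes at offsets:$offsets"
done

Each stream jumps (n-1)*10K between its own writes, so unless the streams stay perfectly in step the disk sees scattered rather than sequential writes.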
Cheers,
Kostas