"Matt Doidge"
Before I go about upgrading our SE to the LCG 2_5_0 release, I was wondering if there is any sagely advice to be had from our storage gurus. Is there an easy and magical way to upgrade dCache to the latest version included in this release, or will it be easier to start again: remove dCache, then do the usual YAIM install procedure with the 2_5_0 release?
cheers,
matt
"Owen Synge"
"Owen Synge"
On Fri, 24 Jun 2005 16:47:44 +0100 Greig A Cowan wrote:

> Hi everyone,
>
> Previously at Edinburgh...
>
> People were able to copy files into our dCache, but were unable to copy files out. The source of this issue was diagnosed to be a problem with the pnfs database.
>
> To fix this problem I have just done a reinstall of dCache with yaim (at the same time updating to the LCG 2.5.0 middleware) and I was hoping that it would be possible for someone to test out srmcp's and globus-url-copy's to and from our dCache?
>
> srm://srm.epcc.ed.ac.uk:8443/pnfs/epcc.ed.ac.uk/data/dteam/
>
> Let me know of any errors.
>
> Cheers,
> Greig

It failed with

srm://srm.epcc.ed.ac.uk:8443/pnfs/epcc.ed.ac.uk/data/dteam/file_test_synge.20050628104900.epcc.ed.ac.uk

The response is

debug: response from gsiftp://srm.epcc.ed.ac.uk:2811//pnfs/epcc.ed.ac.uk/data/dteam/file_test_synge.20050628104900.epcc.ed.ac.uk3:
150 Openning BINARY data connection for /pnfs/epcc.ed.ac.uk/data/dteam/file_test_synge.20050628104900.epcc.ed.ac.uk3
debug: fault on connection to gsiftp://srm.epcc.ed.ac.uk:2811//pnfs/epcc.ed.ac.uk/data/dteam/file_test_synge.20050628104900.epcc.ed.ac.uk3: Handle not in the proper state
debug: error reading response from gsiftp://srm.epcc.ed.ac.uk:2811//pnfs/epcc.ed.ac.uk/data/dteam/file_test_synge.20050628104900.epcc.ed.ac.uk3: the handle 0x806b22c was already registered for closing
debug: data callback, no error, buffer 0x80703d8, length 0, offset=0, eof=true
debug: operation complete

Which suggests a networking issue. Does it work from the box?

Regards

Owen S
"Greig A Cowan"
Hi Owen,

> Which suggests a networking issue, does it work from the box?

Thanks to some help from Steve Thorn at NeSC, we may have begun tracking down the problem. He has been able to perform globus-url-copy's into our dCache from another ScotGrid machine (glenmorangie.epcc.ed.ac.uk). This goes through a ScotGrid switch, not through the wider SRIF network that traffic uses coming from outside of Edinburgh.

It looks like at least some of the middleware on our admin node (srm.epcc.ed.ac.uk) is using the default globus port range of 20000-25000 (see packet dump below). On the SRIF network these ports are blocked, as 50000-52000 is the allowed range. The globus config file /etc/sysconfig/globus contains the correct range, as does the GLOBUS_TCP_PORT_RANGE environment variable, but who can say what configuration files/hard coding is used in the various SRM doors.

To summarise: when not encumbered by firewalls it works consistently. srm.epcc is *not* using the port range defined in /etc/sysconfig/globus.

Is there any way of changing the default port range consistently?

Cheers,
Greig

Packet dump.
Irrelevant packets removed

glenmorangie# tcpdump -q host srm.epcc.ed.ac.uk
tcpdump: listening on eth0
16:11:35.312192 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 0 (DF)
16:11:35.312392 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 0 (DF)
16:11:35.312424 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 0 (DF)
16:11:35.319327 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 24 (DF)
16:11:35.319338 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 0 (DF)
16:11:35.319742 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 13 (DF)
16:11:35.319968 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 0 (DF)
16:11:35.364483 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 22 (DF)
16:11:35.383613 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 127 (DF)
16:11:35.383768 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 0 (DF)
16:11:35.459060 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 1448 (DF)
16:11:35.459068 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 1448 (DF)
16:11:35.459074 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 0 (DF)
16:11:35.459395 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 1448 (DF)
16:11:35.459404 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 1448 (DF)
16:11:35.459410 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 0 (DF)
16:11:35.459414 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 643 (DF)
16:11:35.491566 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 0 (DF)
16:11:35.500755 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 1448 (DF)
16:11:35.500772 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 1448 (DF)
16:11:35.500921 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 0 (DF)
16:11:35.500946 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 1448 (DF)
16:11:35.500952 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 479 (DF)
16:11:35.500957 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 0 (DF)
16:11:35.501111 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 0 (DF)
16:11:35.501119 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 0 (DF)
16:11:35.544562 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 111 (DF)
16:11:35.544688 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 0 (DF)
16:11:35.545911 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 47 (DF)
16:11:35.546096 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 0 (DF)
16:11:35.624240 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 467 (DF)
16:11:35.632006 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 707 (DF)
16:11:35.632284 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 0 (DF)
16:11:35.640070 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 8 (DF)
16:11:35.640349 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 78 (DF)
16:11:35.648938 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 90 (DF)
16:11:35.649278 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 58 (DF)
16:11:35.649973 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 78 (DF)
16:11:35.650311 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 58 (DF)
16:11:35.650918 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 130 (DF)
16:11:35.651256 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 58 (DF)
16:11:35.651881 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 66 (DF)
16:11:35.652171 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 58 (DF)
16:11:35.653098 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 90 (DF)
16:11:35.653555 glenmorangie.epcc.ed.ac.uk.50001 > srm.epcc.ed.ac.uk.20001: tcp 0 (DF)
16:11:35.653687 srm.epcc.ed.ac.uk.20001 > glenmorangie.epcc.ed.ac.uk.50001: tcp 0 (DF)
16:11:35.653705 glenmorangie.epcc.ed.ac.uk.50001 > srm.epcc.ed.ac.uk.20001: tcp 0 (DF)
16:11:35.654057 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 122 (DF)
16:11:35.686168 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 0 (DF)
16:11:35.968342 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 162 (DF)
16:11:35.968795 glenmorangie.epcc.ed.ac.uk.50001 > srm.epcc.ed.ac.uk.20001: tcp 6 (DF)
16:11:35.968968 srm.epcc.ed.ac.uk.20001 > glenmorangie.epcc.ed.ac.uk.50001: tcp 0 (DF)
16:11:35.969080 glenmorangie.epcc.ed.ac.uk.50001 > srm.epcc.ed.ac.uk.20001: tcp 0 (DF)
16:11:36.001600 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 0 (DF)
16:11:36.006172 srm.epcc.ed.ac.uk.20001 > glenmorangie.epcc.ed.ac.uk.50001: tcp 0 (DF)
16:11:36.048766 srm.epcc.ed.ac.uk.20001 > glenmorangie.epcc.ed.ac.uk.50001: tcp 0 (DF)
16:11:36.048781 glenmorangie.epcc.ed.ac.uk.50001 > srm.epcc.ed.ac.uk.20001: tcp 0 (DF)
16:11:36.272498 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 78 (DF)
16:11:36.272535 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 0 (DF)
16:11:36.273337 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 58 (DF)
16:11:36.273465 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 0 (DF)
16:11:36.273982 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 66 (DF)
16:11:36.274065 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 0 (DF)
16:11:36.274329 glenmorangie.epcc.ed.ac.uk.50000 > srm.epcc.ed.ac.uk.2811: tcp 0 (DF)
16:11:36.274479 srm.epcc.ed.ac.uk.2811 > glenmorangie.epcc.ed.ac.uk.50000: tcp 0 (DF)
73 packets received by filter
0 packets dropped by kernel
"Alessandra Forti"
Hi Greig,

there is a DCACHE_PORT_RANGE field in the site-info.def. Try to change that and rerun the config part of the installation.

cheers
alessandra

On Tue, 28 Jun 2005, Greig A Cowan wrote:

> Is there any way of changing the default port range consistently?
"Ross, D \(Derek\)"
Hi Greig,

Check the /opt/d-cache/config/dCacheSetup file. Check that the java options near the top are using the right ports, i.e.

java_options="-server -Xmx512m -XX:MaxDirectMemorySize=512m \
  -Dorg.globus.tcp.port.range=50000,52000"

Further down, there's also

clientDataPortRange=50000:52000

Derek

> -----Original Message-----
> From: GRIDPP2: Deployment and support of SRM and local storage management [mailto:GRIDPP-STORAGE@JISCMAIL.AC.UK] On Behalf Of Greig A Cowan
> Sent: 28 June 2005 17:53
> To: GRIDPP-STORAGE@JISCMAIL.AC.UK
> Subject: Re: New Edinburgh dCache install
>
> Is there any way of changing the default port range consistently?
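P.S. A quick way to double-check which range has actually ended up in the configuration, and which ports the java processes are really using during a transfer. The paths assume the standard /opt/d-cache layout and the netstat check is only a rough illustration:

# show the two port-range settings in dCacheSetup
grep -E 'org.globus.tcp.port.range|clientDataPortRange' /opt/d-cache/config/dCacheSetup

# while a transfer is running, list the TCP connections owned by the dCache java processes
netstat -tnp | grep java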
"Greig A Cowan"
Hi guys,

The DCACHE_PORT_RANGE field in the site-info.def file is commented out. Is this not the same for everyone? GLOBUS_TCP_PORT_RANGE was set to the correct value though: "50000 52000"

Anyway, I've altered the dCacheSetup file as Derek suggested and globus-url-copy now appears to be working for me using our relocatable UI. Can someone else try out srmcp/globus-url-copy commands to and from our dCache to test it?

srm://srm.epcc.ed.ac.uk:8443/pnfs/epcc.ed.ac.uk/data/dteam/
gsiftp://srm.epcc.ed.ac.uk/pnfs/epcc.ed.ac.uk/data/dteam/

Thanks,
Greig
"Owen Synge"
On Tue, 28 Jun 2005 18:30:23 +0100 Greig A Cowan wrote:

> Anyway, I've altered the dCacheSetup file as Derek suggested and globus-url-copy now appears to be working for me using our relocatable UI. Can someone else try out srmcp/globus-url-copy commands to and from our dCache to test it?
>
> srm://srm.epcc.ed.ac.uk:8443/pnfs/epcc.ed.ac.uk/data/dteam/
> gsiftp://srm.epcc.ed.ac.uk/pnfs/epcc.ed.ac.uk/data/dteam/

I just tested and

/opt/d-cache/srm/bin/srm -copy -webservice_protocol=httpg -x509_user_proxy=/tmp/x509up_u27529 srm://se2-gla.scotgrid.ac.uk:8443/pnfs/ph.gla.ac.uk/data/dteam/file_test_synge.20050628183135.ph.gla.ac.uk file://///tmp//file_test_synge.20050628183135.ph.gla.ac.uk

was failing although it worked earlier today, but the important news is

/opt/d-cache/srm/bin/srm -copy -webservice_protocol=httpg -x509_user_proxy=/tmp/x509up_u27529 file://///usr/lib/X11/rgb.txt srm://srm.epcc.ed.ac.uk:8443/pnfs/epcc.ed.ac.uk/data/dteam/file_test_synge.20050628183135.epcc.ed.ac.uk
transfer rc = 0

/opt/d-cache/srm/bin/srm -copy -webservice_protocol=httpg -x509_user_proxy=/tmp/x509up_u27529 srm://srm.epcc.ed.ac.uk:8443/pnfs/epcc.ed.ac.uk/data/dteam/file_test_synge.20050628183135.epcc.ed.ac.uk file://///tmp//file_test_synge.20050628183135.epcc.ed.ac.uk
md5sum match srm.epcc.ed.ac.uk

Which is WONDERFUL

thank you so much for going through this process of installation once more and finding the issues, I shall now repeat the test for ph.gla.ac.uk.

Regards

Owen Synge
"Greig A Cowan"
Hi Owen,

> /opt/d-cache/srm/bin/srm -copy -webservice_protocol=httpg -x509_user_proxy=/tmp/x509up_u27529 srm://srm.epcc.ed.ac.uk:8443/pnfs/epcc.ed.ac.uk/data/dteam/file_test_synge.20050628183135.epcc.ed.ac.uk file://///tmp//file_test_synge.20050628183135.epcc.ed.ac.uk
> md5sum match srm.epcc.ed.ac.uk
>
> Which is WONDERFUL

Excellent. What's the best way that people have found so far to thoroughly test their dCache? Owen, are your scripts sufficient for this task?

Once I (and everyone else) am satisfied that things are working properly, I will go ahead and add the rest of Edinburgh's available storage. Maybe we can speak about this at the phone conference tomorrow?

Cheers,
Greig
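P.S. For anyone who wants a quick way to exercise both directions, a loop along the lines of Owen's srm -copy commands can be run from a UI; the test file, target names and proxy handling below are only placeholders:

for i in 1 2 3; do
  f=file_test.$(date +%Y%m%d%H%M%S).$i
  # copy a small local file into the dCache...
  /opt/d-cache/srm/bin/srm -copy -webservice_protocol=httpg \
    file://///etc/group \
    srm://srm.epcc.ed.ac.uk:8443/pnfs/epcc.ed.ac.uk/data/dteam/$f
  # ...and back out again, then compare checksums
  /opt/d-cache/srm/bin/srm -copy -webservice_protocol=httpg \
    srm://srm.epcc.ed.ac.uk:8443/pnfs/epcc.ed.ac.uk/data/dteam/$f \
    file://///tmp/$f
  md5sum /etc/group /tmp/$f
done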
"Alessandra Forti"
Hi Greig,

> The DCACHE_PORT_RANGE field in the site-info.def file is commented out. Is this not the same for everyone? GLOBUS_TCP_PORT_RANGE was set to the correct value though: "50000 52000"

yes, it is the same for all the other sites, but all the other sites are using the range 20000-25000, which are the default values that are hard coded in dCache.

cheers
alessandra
"Greig A Cowan"
Hi Owen,

> Which suggests a networking issue, does it work from the box?

srmcp does work from the box itself. Have you seen anything like this before?

Greig
"Matt Doidge"
Sorry all for not attending the storage meeting. I somehow forgot about it despite the million reminders and went and booked a trip to the opticians.
Lancaster Status:
Still running on 2_4_0, reluctant to upgrade yet (it might break!). Planning on some tests that will completely thrash the system (continuous file transfers for a more vigorous test). We still aren't advertising, for various reasons, the main one being that once people start using us to keep their data we lose the freedom to mess about with our system. Our current set-up is not ideal and we might eventually want to physically move some of our machines.
Lancaster Plans:
We're planning on setting up a second SE, for use exclusively in SC3 testing. It will be the same size as the one we currently have (32 TB), will run the latest version of dCache (as included in 2_5_0) and will be a YAIM install.
Lancaster wish list:
What I would like to know is how to go about "dCache disaster recovery": back-up procedures in case it all goes wrong, and how we can go about getting data off our pools in the event that the admin node does something stupid like melt (can we attach a new admin node and keep our data?).
Once again my apologies for being forgetful,
matt
"Greig A Cowan"
Hi everyone,
I've been testing out our dCache install using globus-url-copy. I just tried copying a large file (>1GB) into our dCache and it worked fine; the md5sums of the source and target files are the same. The transfer was done using the -p 10 option with globus. However, when I attempt to copy the file back out, everything seems to go fine (i.e. the usual dialogue between client and server) until I get the error:
426 Transfer aborted, closing connection :java.net.ConnectException: Connection refused
I can copy small files in and out without any problem when I don't use the -p 10 option, but transfers out of the dCache fail even with small files if I use the -p 10 option. Has anyone seen this behaviour before? It looks like more Edinburgh network issues.
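For what it's worth, the two cases can be compared directly from a UI. The port-range export and the test file name below are assumptions based on the 50000-52000 range discussed earlier in the thread:

export GLOBUS_TCP_PORT_RANGE="50000,52000"

# single stream out of the dCache: works
globus-url-copy gsiftp://srm.epcc.ed.ac.uk/pnfs/epcc.ed.ac.uk/data/dteam/some_test_file file:///tmp/out.single

# ten parallel streams out of the dCache: fails with "Connection refused"
globus-url-copy -p 10 gsiftp://srm.epcc.ed.ac.uk/pnfs/epcc.ed.ac.uk/data/dteam/some_test_file file:///tmp/out.parallel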
Thanks, Greig
"Alessandra Forti"
Hi Jens,
to add to the documentation I found this page today.
http://grid-deployment.web.cern.ch/grid-deployment/gis/lcg-GCR/
It was released or updated last week and it might be useful to link it from the storage pages.
cheers alessandra
"Graeme A Stewart"
On Tuesday 28 June 2005 15:08, Jean-Philippe Baud wrote:

> Actually you need also a line
>
> RFIOD TRUST <short_hostname> <fqdn>
>
> This means that you need a shift.conf including 5 lines for RFIOD: TRUST, FTRUST, RTRUST, WTRUST and XTRUST. We will update the documentation. Please confirm. Thanks a lot.

Hi Jean-Philippe

Good news, inserting the "TRUST" line means it works:

grid07:~$ cat /etc/shift.conf
RFIOD TRUST grid07 grid07.ph.gla.ac.uk
RFIOD WTRUST grid07 grid07.ph.gla.ac.uk
RFIOD RTRUST grid07 grid07.ph.gla.ac.uk
RFIOD XTRUST grid07 grid07.ph.gla.ac.uk
RFIOD FTRUST grid07 grid07.ph.gla.ac.uk
RFIO DAEMONV3_WRMT 1

grid07:~$ globus-url-copy file:/etc/group gsiftp://grid07/dpm/ph.gla.ac.uk/home/dteam/testDir/newGroupTest
grid07:~$ dpns-ls -l /dpm/ph.gla.ac.uk/home/dteam/testDir/newGroupTest
-rw-rw-r--   1 dteam001 dteam        531 Jun 28 20:14 /dpm/ph.gla.ac.uk/home/dteam/testDir/newGroupTest

However, srmcp still isn't working:

grid07:~$ /opt/d-cache/srm/bin/srmcp file:////boot/vmlinux-2.4.21-20.EL srm://grid07:8443/dpm/ph.gla.ac.uk/home/dteam/testDir/srmtest
org.globus.ftp.exception.ServerException: Server refused performing the request. Custom message: (error code 1) [Nested exception message: Custom message: Unexpected reply: 553 grid07:/opt/dpmp/dteam/2005-06-28/srmtest.38.0: Permission denied.]. Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException: Custom message: Unexpected reply: 553 grid07:/opt/dpmp/dteam/2005-06-28/srmtest.38.0: Permission denied.
  at org.globus.ftp.vanilla.TransferMonitor.run(TransferMonitor.java:167)
GridftpClient: transfer exception

I realised this was a different error condition to the globus-url-copy error, so on a hunch I tried chmoding the DPM pool directories to 777 and then it succeeds:

grid07:~$ /opt/d-cache/srm/bin/srmcp file:////boot/vmlinux-2.4.21-20.EL srm://grid07:8443/dpm/ph.gla.ac.uk/home/dteam/testDir/srmtest2
grid07:~$ echo $?
0
grid07:~$ dpns-ls -l /dpm/ph.gla.ac.uk/home/dteam/testDir/srmtest2
-rw-rw-r--   1 dteam001 dteam    2922104 Jun 28 20:18 /dpm/ph.gla.ac.uk/home/dteam/testDir/srmtest2

However, looking at the underlying pool directory:

root of grid07:/opt/dpmp# v /opt/dpmp/dteam/2005-06-28/
total 8592
-rw-rw----   1 dpmmgr   dpmmgr       531 Jun 28 20:14 newGroupTest.37.0
-rw-rw----   1 dteam001 dteam    2922104 Jun 28 20:18 srmtest2.40.0

So, somehow srmcp (this is the dCache client, by the way) is copying in the file directly, using its own GSI pool mapped account, instead of copying to the dpns name space. I can also confirm this by contrasting the globus-url-copy log entry from dpm-gsiftp:

DATE=20050628191443.865506 HOST=grid07 PROG=wuftpd NL.EVNT=FTP_INFO START=20050628191443.818685 USER=dteam001 FILE=/dpm/ph.gla.ac.uk/home/dteam/testDir/newGroupTest BUFFER=87840 BLOCK=65536 NBYTES=531 VOLUME=?(rfio-file) STREAMS=1 STRIPES=1 DEST=1[194.36.1.137] TYPE=STOR CODE=226

with that from the srmcp:

DATE=20050628192444.426157 HOST=grid07 PROG=wuftpd NL.EVNT=FTP_INFO START=20050628192444.343060 USER=dteam001 FILE=grid07:/opt/dpmp/dteam/2005-06-28/srmtest3.41.0 BUFFER=87840 BLOCK=65536 NBYTES=2922104 VOLUME=?(rfio-file) STREAMS=10 STRIPES=1 DEST=1[194.36.1.137] TYPE=STOR CODE=226

Debug trace of "srmcp -debug" and the DPM srmv1 logs attached.

Thanks for your help so far - hopefully we can resolve this so I can report success with DPM to the GridPP 13 meeting next week in Durham!

Graeme

> Jean-Philippe
> P.S. "localhost" in shift.conf is useless. It is not recognised. Removed.
-- -------------------------------------------------------------------- Dr Graeme Stewart http://www.physics.gla.ac.uk/~graeme/ Department of Physics and Astronomy, University of Glasgow, Scotland
"Jiri Mencak"
Hi,
On some occasions when the `/opt/d-cache/install/install.sh' script is (re)run, some (and sometimes even all) of the following files end up being zero size:

/pnfs/fs/admin/etc/config/serverId
/pnfs/fs/admin/etc/config/serverName
/pnfs/fs/admin/etc/config/serverRoot

which results in PnfsManager being Offline.
Looking at the way serverName is written by `/opt/d-cache/install/install.sh' reveals a non-deterministic sleep (increasing the value from 5 up to 50 doesn't help), which I'm a bit concerned about:
<snip>
sleep 5
cd $PNFS_ROOT/fs
sr=`cat ".(id)(usr)"`
cd ./admin/etc/config
echo `hostname` > ./serverName
</snip>
Workaround
I've found that removing the /pnfs/fs/admin/etc/config/server* files helps. For example changing the previous code snippet to
<snip>
sleep 5
cd $PNFS_ROOT/fs
sr=`cat ".(id)(usr)"`
cd ./admin/etc/config
rm -f ./serverName
echo `hostname` > ./serverName
</snip>
fixed the situation.
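After rerunning the installer with a change like this, it is worth confirming that the three config files listed above are no longer empty, for example:

<snip>
ls -l /pnfs/fs/admin/etc/config/serverId \
      /pnfs/fs/admin/etc/config/serverName \
      /pnfs/fs/admin/etc/config/serverRoot
cat /pnfs/fs/admin/etc/config/serverName
</snip>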
A proper solution to this issue would be much appreciated.
Regards.
-- Jiri
"Greig A Cowan"
"Greig A Cowan"
Hi everyone,
What is the best way of specifying that the storage space in your dCache is volatile? We would like to be able to make sure that any files that are transferred into our storage during testing can be removed again. Is this even possible at the moment? I seem to remember reading that the different storage types are not available until SRM v2?
There was a previous discussion on this list regarding these issues (see the thread "Querying the information system for srm info"), but could anyone clarify the situation?
Thanks, Greig
"Owen Synge"
On Thu, 30 Jun 2005 19:59:58 +0100 Greig A Cowan wrote:

> What is the best way of specifying that the storage space in your dCache is volatile?

I don't think you need to yet.

http://www.cnaf.infn.it/~sergio/datatag/glue/v11/SE/index.htm

I could not see an attribute for "storage space type"; please correct me if you can.

> We would like to be able to make sure that any files that are transferred into our storage during testing can be removed again. Is this even possible at the moment?

I believe we are all fine with you doing this outside production runs.

> I seem to remember reading that the different storage types are not available until SRM v2?

Yes, but these storage types are about the services we expect from the systems you guys run, so at the moment the ideas are in our heads rather than formally specified as a computer-queryable and job-scheduling factor. If your service were in the future a "permanent space type", we would address these issues in the Glue schema before we could expect you to provide the higher levels of service. Really we should have this in the schema now (how else do you know that a site is Tier-2 or Tier-0/1?), but it is on the schema group's to-do list. If no one is sure, maybe I should check.

> There was a previous discussion on this list regarding these issues (see the thread: Querying the information system for srm info.), but would it be possible for anyone to clarify the situation.

I hope I have, but please feel free to pull me up on lack of clarity or on issues I have glossed over.

Regards

Owen
"Philip Clark"
>> I seem to remember reading that the different storage types are not available until SRM v2?
>
> Yes, but these storage types are about the services we expect from the systems you guys run, so at the moment the ideas are in our heads rather than formally specified as a computer-queryable and job-scheduling factor.

Is there a way to make sure the space is volatile at the moment?

-Phil
"Alessandra Forti"
Hi Phil,
you can declare it in the Glue schema. Then whoever puts important files on your SE should be aware that they can be deleted.
The field is
GlueSAPolicyFileLifeTime
and can be set for each VO in
/opt/lcg/var/gip/lcg-info-generic.conf
Be careful that this is overridden if you run the YAIM function config_gip again. It is one of those things I asked to be changed in bug 8777 that I keep on talking about.

If you put it in the Glue schema you are declaring your SE policy (as the name says), which means you can delete files whether the system can do it automatically for you (SRM v2) or not (SRM v1, classic SE). At least this is how I interpret it.
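As a rough illustration only (the exact block layout in lcg-info-generic.conf depends on the release, and the SE and path names below are made up), the published attribute would end up in the static LDIF looking something like:

dn: GlueSARoot=dteam:/pnfs/example.ac.uk/data/dteam,GlueSEUniqueID=se.example.ac.uk,Mds-Vo-name=local,o=grid
GlueSAPolicyFileLifeTime: volatile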
cheers alessandra
"Owen Synge"
On Fri, 1 Jul 2005 13:50:03 +0100 Alessandra Forti wrote:

> If you put it in the Glue schema you are declaring your SE policy (as the name says), which means you can delete files whether the system can do it automatically for you (SRM v2) or not (SRM v1, classic SE). At least this is how I interpret it.

Thank you, this is how I interpret it also, but I would advise that you should not delete files unless they are old and you either need to do something difficult that would be aided by deleting files, or need to free space. I am delighted that this field is available today.

Regards

Owen
"Graeme A Stewart"
An update on DPM here at Glasgow:

Yesterday I received a fix from the LCG team for the problems which were causing srmcp to fail. The fix, for the DPM gridftp daemon, is available at:

http://grid-deployment.web.cern.ch/grid-deployment/RpmDir_i386-sl3/external/DPM-gridftp-server-1.3.4-1sec_sl3.i386.rpm

So if you are trying DPM right now, I would apply this RPM and then restart dpm-gsiftp.

This also means that Glasgow's DPM is available for external testing at:

srm://grid07.ph.gla.ac.uk:8443/dpm/ph.gla.ac.uk/home/dteam

and

gsiftp://grid07.ph.gla.ac.uk/dpm/ph.gla.ac.uk/home/dteam

If you have the DPM client rpms (only part of 2.5.0 I think) then you might also try playing with the dpns-* commands, e.g.,

$ grid-proxy-init
$ export DPNS_HOST=grid07.ph.gla.ac.uk
$ dpns-ls -l /dpm/ph.gla.ac.uk/home/dteam

I will try and collate my DPM experience in the DM wiki:

http://www.physics.gla.ac.uk/gridpp/datamanagement/index.php/DiskPoolManager

so you don't have to trail through piles of old emails!

Cheers

Graeme

--
--------------------------------------------------------------------
Dr Graeme Stewart http://www.astro.gla.ac.uk/users/graeme/
Department of Physics and Astronomy, University of Glasgow, Scotland
"Kostas Georgiou"
"Kostas Georgiou"
One of our pool nodes is in a lot of pain after the local CMS people copied some files to it. Any ideas on what the problem is?

# ls -al /opt/d-cache/log/sedsk00Domain.log
-rw-r--r--   1 root  root  2072936448 Jul  1 14:54 /opt/d-cache/log/sedsk00Domain.log

07/01 11:14:25 Cell(sedsk00_1@sedsk00Domain) : getChecksumFromPnfs : No crc available for 00010000000000000000BBF8
07/01 11:14:26 Cell(sedsk00_1@sedsk00Domain) : getChecksumFromPnfs : No crc available for 00010000000000000000BC08
07/01 11:14:26 Cell(sedsk00_1@sedsk00Domain) : getChecksumFromPnfs : No crc available for 00010000000000000000BC28
07/01 11:14:29 Cell(sedsk00_1@sedsk00Domain) : getChecksumFromPnfs : No crc available for 00010000000000000000BC18
07/01 11:14:54 Cell(sedsk00_1@sedsk00Domain) : getChecksumFromPnfs : No crc available for 00010000000000000000BC78
07/01 11:15:10 Cell(sedsk00_1@sedsk00Domain) : getChecksumFromPnfs : No crc available for 00010000000000000000BC68
07/01 11:15:11 Cell(sedsk00_1@sedsk00Domain) : getChecksumFromPnfs : No crc available for 00010000000000000000BC88
07/01 11:15:15 Cell(sedsk00_1@sedsk00Domain) : getChecksumFromPnfs : No crc available for 00010000000000000000BC58
07/01 11:15:15 Cell(sedsk00_1@sedsk00Domain) : getChecksumFromPnfs : No crc available for 00010000000000000000BCA0
07/01 11:16:00 Cell(sedsk00_1@sedsk00Domain) : getChecksumFromPnfs : No crc available for 00010000000000000000BD40
07/01 11:16:00 Cell(sedsk00_1@sedsk00Domain) : getChecksumFromPnfs : No crc available for 00010000000000000000BD30
07/01 11:16:08 Cell(sedsk00_1@sedsk00Domain) : getChecksumFromPnfs : No crc available for 00010000000000000000BD70
07/01 11:16:13 Cell(sedsk00_1@sedsk00Domain) : getChecksumFromPnfs : No crc available for 00010000000000000000BD60
07/01 11:17:02 Cell(sedsk00_1@sedsk00Domain) : getChecksumFromPnfs : No crc available for 00010000000000000000BE20
07/01 11:17:06 Cell(sedsk00_1@sedsk00Domain) : getChecksumFromPnfs : No crc available for 00010000000000000000BE30
07/01 11:18:01 Cell(sedsk00_1@sedsk00Domain) : removeFiles : invalid syntax in remove filespec >null<
07/01 11:18:01 Cell(sedsk00_1@sedsk00Domain) : removeFiles : invalid syntax in remove filespec >null<
.....
07/01 14:23:39 Cell(sedsk00_1@sedsk00Domain) : removeFiles : invalid syntax in remove filespec >null<
07/01 14:23:39 Cell(sedsk00_1@sedsk00Domain) : removeFiles : invalid syntax in remove filespec >null<
07/01 14:23:39 Cell(sedsk00_1@sedsk00Domain) : removeFiles : invalid syntax in remove filespec >null<
07/01 14:23:39 Cell(sedsk00_1@sedsk00Domain) : removeFiles : invalid syntax in remove filespec >null<
07/01 14:23:39 Cell(sedsk00_1@sedsk00Domain) : removeFiles : invalid syntax in remove filespec >null<
07/01 14:23:39 Cell(sedsk00_1@sedsk00Domain) : remov
Disk partition full
......

Kostas
"Owen Synge"
I have never seen this error.
My bet is that dCache is complaining because it failed to store checksum information after the files were received.

dCache internally queues files and then processes them, moving them to pools according to affinity etc.

Maybe the checksums are also not stored at the time of writing, and dCache is very unhappy about not being able to write them to disk; rather than gracefully refusing to over-fill a pool, it allows files to start being written to a pool that is full and then complains that it does not have checksums.

It's only a guess.
Regards
Owen
"Matt Doidge"
"Matt Doidge"
heya guys and girls,
I have been trying to install a second SRM here at Lancaster for use with SC3 (so we can have an SE for production and an SE for SC stuff). I am not having much luck with the installation: the SRM isn't coming online, and nothing is listening on port 8443 when I check using netstat. Does the 2_5_0 version of YAIM still require patching? Also, when using YAIM to install a second SE, is there anything odd that you have to do? I just filled in the SE2 variables in the same manner I would have for a standard SE.
thanks for your time again guys.
matt
"Greig A Cowan"
Hi Matt,

> Am not having much luck with the installation - the srm isn't coming on line, and nothing is listening on port 8443 when i check using a netstat. Does the 2_5_0 version of YAIM still require patching?
Yep, I installed dCache from 2_5_0 and made sure I had Jiri's patch.
As for the SRM, have you tried the solution of switching off the pool services, restarting the opt services and then starting the pools again?
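Roughly the following, assuming the standard /opt/d-cache service scripts (the script names may differ between releases, so treat this as a sketch):

/opt/d-cache/bin/dcache-pool stop
/opt/d-cache/bin/dcache-opt restart    # SRM and the other optional door services
/opt/d-cache/bin/dcache-pool start
netstat -tln | grep 8443               # check that the SRM door is now listening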
Not sure about the second SE issue though.
Greig
"Greig A Cowan"
Hi Matt,
Have a look here for some YAIM install instructions that worked for me.
http://www.gridpp.ac.uk/deployment/admin/dcache/dcache_yaim_install.txt
They are pretty rough, but you should be able to follow them. Give me an email if you need anything clarified.
Greig
"Greig A Cowan"
From the GridPP pages:
http://www.gridpp.ac.uk/deployment/admin/dcache/dcache_yaim_install.txt
Create the relevant pointers to the rpm repositories
echo 'rpm http://storage.esc.rl.ac.uk/ apt/datastore/sl3.0.4 stable obsolete' \
  > /etc/apt/sources.list.d/gpp_storage.list
apt-get install d-cache-gpp-yaimlink
You can also do this by hand by creating the link:
ln -s /opt/d-cache-gpp/bin/lcg/config_gpp_sedcache \
  /opt/lcg/yaim/functions/local/config_sedcache
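With the link in place, the node can then be (re)configured with YAIM in the usual way. The node type name below is an assumption, so check it against the release notes for your version:

cd /opt/lcg/yaim/scripts
./configure_node /path/to/site-info.def SE_dcache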
Cheers,
Greig
"Greig A Cowan"
"Greig A Cowan"
Hi everyone,
I think we really need to combine the current dCache documentation in order to form an authoritative source for installing, administering and monitoring the system. It can be very confusing for anyone starting out with dCache when there is not one single document that they can go to in order to find all the relevant information.
Maybe some of us can discuss this issue before we leave GridPP13 today.
Also, I have started distilling some of the very useful information that this list has provided about dCache but that hasn't really been written down before, along with some things that I think are very useful for people running dCache to know. You can find it here:
http://www.gridpp.ac.uk/deployment/admin/dcache/faq.html
Don't worry, I am going to change the size of the font in the code blocks so that it is slightly easier to read! It is still very much a work in progress so any comments/contributions would be appreciated. You can also edit the pages yourself if you have been granted write access to this part of the GridPP site.
Cheers,
Greig
"Steve Traylen"
On Wed, Jul 06, 2005 at 10:17:14AM +0100 or thereabouts, Greig A Cowan wrote:

> I think we really need to combine the current dCache documentation in order to form an authoritative source for installing, administering and monitoring the system. It can be very confusing for anyone starting out with dCache when there is not one single document that they can go to in order to find all the relevant information.

I was thinking about this myself. Apart from anything else, I don't think many people outside the UK are aware of the documentation that exists. The LCG wiki is the sensible top-level place to bung nothing other than links.

Steve
"Jiri Mencak"
I suspect you mean the dCache that comes with the LCG 2.5.0 release. We have patches to the dCache YAIM installation scripts which come with the LCG 2.4.0 release. All of these have been accepted into the LCG 2.5.0 release, so as long as you're using that particular release you don't need any patches.
Regards.
Jiri
"Alessandra Forti"
"Alessandra Forti"
Hi,
I discovered the mystery. There are two openssl binaries installed: the standard /usr/bin/openssl and a globus one, /opt/globus/bin/openssl. I don't know if the globus version is modified, recompiled or just older, but they give different output.

When /opt/d-cache/bin/grid-mapfile2dcache-kpwd is run by hand it calls the globus one, and there is Email= and E= in /opt/d-cache/etc/dcache.kpwd.

But when it is run by the cron job the standard openssl is called, because the path doesn't contain /opt/globus/bin, and we get the standard emailAddress= in /opt/d-cache/etc/dcache.kpwd, which mismatches other parts of the code (I haven't looked into this yet and don't know if I want to).

The simplest way to correct this is to add /opt/globus/bin at the beginning of the PATH in /etc/cron.d/edg-mkgridmap.
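For illustration, the cron entry would then look something like the following; the schedule and the edg-mkgridmap options shown are placeholders, the point is only the PATH line at the top:

PATH=/opt/globus/bin:/usr/bin:/bin
05 */2 * * * root /opt/edg/sbin/edg-mkgridmap --output=/etc/grid-security/grid-mapfile --safe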
/opt/lcg/libexec/lcg-info-wrapper does work now, but there is still the problem of the ldif cache file being zero-sized, so ldap doesn't pick it up.
cheers
alessandra
"Jensen, J \(Jens\)"
I bet they are different versions.
/usr/bin/openssl version        => 0.9.7
/opt/globus/bin/openssl version => 0.9.6
-j
"Alessandra Forti"
Yes, they are different. I don't know if globus has anything else in it.

cheers
alessandra

On Mon, 11 Jul 2005, Jensen, J (Jens) wrote:

> I bet they are different versions.
>
> /usr/bin/openssl version        => 0.9.7
> /opt/globus/bin/openssl version => 0.9.6
>
> -j

>> On Fri, 8 Jul 2005, owen maroney wrote:
>>
>>> And for IC we have:
>>>
>>> mapping "/C=UK/O=eScience/OU=Imperial/L=Physics/CN=gfe02.hep.ph.ic.ac.uk/emailAddress=lcg-site-admin@imperial.ac.uk" edginfo
>>>
>>> login edginfo read-write 19491 19491 / / / /C=UK/O=eScience/OU=Imperial/L=Physics/CN=gfe02.hep.ph.ic.ac.uk/emailAddress=lcg-site-admin@imperial.ac.uk
>>>
>>> So, I try replacing this with:
>>>
>>>> mapping "/C=UK/O=eScience/OU=Imperial/L=Physics/CN=gfe02.hep.ph.ic.ac.uk/E=lcg-site-admin@imperial.ac.uk" edginfo
>>>>
>>>> login edginfo read-write 19491 19491 / / / /C=UK/O=eScience/OU=Imperial/L=Physics/CN=gfe02.hep.ph.ic.ac.uk/E=lcg-site-admin@imperial.ac.uk
>>>
>>> then
>>>
>>> su - edginfo
>>>> [edginfo@gfe02 edginfo]$ /opt/d-cache/srm/bin/srm-storage-element-info https://gfe02.hep.ph.ic.ac.uk:8443/srm/infoProvider1_0.wsdl
>>>
>>> produces stuff ending with:
>>>
>>>> StorageElementInfo :
>>>> totalSpace =2541546897408 (2481979392 KB)
>>>> usedSpace =30397658395 (29685213 KB)
>>>> availableSpace =2502704826621 (2444047682 KB)
>>>
>>> Hurrah! (although this will get overwritten at the next edg-mkgridmap update...)
>>>
>>> However, although now when I run, as edginfo:
>>>> [edginfo@gfe02 edginfo]$ /opt/lcg/libexec/lcg-info-dynamic-se
>>>
>>> I get a 3 second pause and output like:
>>>> dn: GlueSARoot=lhcb:/pnfs/hep.ph.ic.ac.uk/data/lhcb,GlueSEUniqueID=gfe02.hep.ph.ic.ac.uk,Mds-Vo-name=local,o=grid
>>>> GlueSAStateAvailableSpace: 2444047682
>>>> GlueSAStateUsedSpace: 37931710
>>>
>>> when I run, as edginfo, /opt/lcg/libexec/lcg-info-wrapper, it takes less than a second and produces output including:
>>>> GlueSAStateAvailableSpace: 00
>>>> GlueSAStateUsedSpace: 00
>>>
>>> in the output. I checked /opt/lcg/var/gip/tmp and the file lcg-info-dynamic-dcache.ldif.7010 is being updated but is only zero sized.
>>>
>>> So the output of the dynamic-se script does not seem to be getting incorporated into the output of the wrapper script.
>>>
>>> cheers,
>>> Owen.
>>>
>>> Alessandra Forti wrote:
>>>> no I have
>>>>
>>>> mapping "/C=UK/O=eScience/OU=Manchester/L=HEP/CN=bohr0013.tier2.hep.man.ac.uk/emailAddress=alessandra.forti@manchester.ac.uk" edginfo
>>>>
>>>> login edginfo read-write 18948 18948 / / / /C=UK/O=eScience/OU=Manchester/L=HEP/CN=bohr0013.tier2.hep.man.ac.uk/emailAddress=alessandra.forti@manchester.ac.uk
>>>>
>>>> cheers
>>>> alessandra
>>>>
>>>> On Fri, 8 Jul 2005, Steve Traylen wrote:
>>>>> On Thu, Jul 07, 2005 at 04:54:33PM +0100 or thereabouts, Philip Clark wrote:
>>>>>
>>>>>>> I don't think we have a workaround, it just works?
>>>>>>>
>>>>>>> I expect you have mentioned it before but what is the problem?
>>>>>>
>>>>>> http://savannah.cern.ch/bugs/?func=detailitem&item_id=8777
>>>>>>
>>>>>> We need to understand why you are not seeing this bug. IC, Manchester and Edinburgh all seem to have it. If we try to monitor your storage through the lcg information system then I expect it will show up too.
>>>>>
>>>>> Does your dcache.kpwd contain
>>>>>
>>>>> /C=UK/O=eScience/OU=Manchester/L=HEP/CN=bohr0013.tier2.hep.man.ac.uk/E=alessandra.forti@manchester.ac.uk
>>>>>
>>>>> i.e. are you seeing this one.
>>>>>
>>>>> https://savannah.cern.ch/bugs/?func=detailitem&item_id=5295
>>>>>
>>>>> but I'm sure I have already asked this question twice so feel free to scream if we are going through the same loop?
>>>>>
>>>>> Steve
>>>>>
>>>>>> -Phil
>>>>>
>>>>> --
>>>>> Steve Traylen
>>>>> s.traylen@rl.ac.uk
>>>>> http://www.gridpp.ac.uk/
"Alessandra Forti"
Hi,
I finally got it working. This line
$ENV{PATH} = "/opt/d-cache/srm/bin";
needs to be added to /opt/lcg/libexec/lcg-info-dynamic-dcache
I put it after
$ENV{HOME} = "/var/tmp"; $ENV{SRM_PATH} = "/opt/d-cache/srm";
for housekeeping.
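With that change in place, the relevant part of /opt/lcg/libexec/lcg-info-dynamic-dcache should read something like the following (a sketch assembled from the lines quoted above; the surrounding lines of the script are omitted):

$ENV{HOME} = "/var/tmp";
$ENV{SRM_PATH} = "/opt/d-cache/srm";
$ENV{PATH} = "/opt/d-cache/srm/bin";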
cheers
alessandra
"Kostas Georgiou"
Hi,
Once again our Dcache pool nodes started getting the following error message:
07/13 13:33:50 Cell(cmsdsk00_2@cmsdsk00Domain) : removeFiles : invalid syntax in remove filespec >null<
07/13 13:33:50 Cell(cmsdsk00_2@cmsdsk00Domain) : removeFiles : invalid syntax in remove filespec >null<
07/13 13:33:50 Cell(cmsdsk00_2@cmsdsk00Domain) : removeFiles : invalid syntax in remove filespec >null<
07/13 13:33:50 Cell(cmsdsk00_2@cmsdsk00Domain) : removeFiles : invalid syntax in remove filespec >null<
07/13 13:33:50 Cell(cmsdsk00_2@cmsdsk00Domain) : removeFiles : invalid syntax in remove filespec >null<
The admin node is also under load, presumably because it's sending the remove command non-stop, although I can't see anything in the logs.
Since the partition with the logs fills up pretty fast, this affects our ability to log file transfers, so we need a solution to the problem.
Can someone with access to the developers or the source code have a look at what is causing this?
Cheers,
kostas
"Ross, D \(Derek\)"
Hi Kostas,
We're also seeing this at the Tier 1 now; I've mailed the developers. Restarting the pool makes it stop (for a while).
Derek
"Greig A Cowan"
Hi everyone,
Just to let you all know that in order to get Edinburgh publishing the correct storage I had to add an extra step in addition to what Alessandra previously mentioned. Even after making Alessandra's changes, I was still finding that the wrong version of openssl (i.e. the non-globus one) was being used in the /opt/d-cache/bin/grid-mapfile2dcache-kpwd script. To rectify this, I added /opt/globus/bin to the PATH variable in /etc/crontab (this was in addition to adding /opt/globus/bin to PATH in /etc/cron.d/edg-mkgridmap).
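For reference, a minimal sketch of the kind of PATH line involved in both files; everything after /opt/globus/bin is an assumed example, since the rest of the value differs from node to node:

# /etc/crontab and /etc/cron.d/edg-mkgridmap: prepend the globus bin
# directory so that cron jobs find /opt/globus/bin/openssl first
PATH=/opt/globus/bin:/sbin:/bin:/usr/sbin:/usr/bin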
The correct version of openssl is now being used, meaning that there are no more references to emailAddress= in the /opt/d-cache/etc/dcache.kpwd file. You can see that our storage is now being correctly reported at:
http://www.ph.ed.ac.uk/~jfergus7/gridppDiscStatus.html
Mona: if you need a hand with Imperial's information publishing, let me know.
Thanks,
Greig
"Greig A Cowan"
"Greig A Cowan"
Hi everyone,
I am currently working on some scripts that should hopefully automate the process of draining a pool and then removing it (although I am surprised that dCache does not have some built in function for doing this). Has no one else done this before?
I just need to finish off a small part that allows me to make a pool read only before it is drained. Once this is done I can post the scripts to the list if people want. They are fairly basic, but it should probably be enough for people to understand what needs to be done so that they can modify them as they see fit.
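To give a flavour of the approach, here is a minimal sketch of the listing step only (not the finished scripts). It assumes the usual ssh admin interface on port 22223, a hypothetical pool name, and that the pool cell command 'rep ls' prints one replica per line starting with its pnfsid:

POOL=pool1_1     # hypothetical pool name
# batch the admin-interface commands and pipe them in over one ssh session
{ echo "cd $POOL"; echo "rep ls"; echo ".."; echo "logoff"; } \
    | ssh -p 22223 -c blowfish admin@localhost > $POOL.replicas
# keep only the pnfsids (24 hex characters at the start of a line)
grep '^[0-9A-F]\{24\}' $POOL.replicas | awk '{print $1}' > $POOL.pnfsids

The pnfsid list can then be fed into the pool-to-pool copy and 'rep rm' steps.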
Cheers,
Greig
"Steve Traylen"
Hi everyone,
> I am currently working on some scripts that should hopefully automate the process of draining a pool and then removing it (although I am surprised that dCache does not have some built in function for doing this). Has no one else done this before?
They are working on something slightly more automated. If you have a tape back end, things are much easier, and since they are in that situation this is currently available. It is something they are working on though, and they definitely accept it as something to be added.
As you say, it is currently possible but a pain; your scripts should make it very much easier. Thanks.
> I just need to finish off a small part that allows me to make a pool read only before it is drained. Once this is done I can post the scripts to the list if people want. They are fairly basic, but it should probably be enough for people to understand what needs to be done so that they can modify them as they see fit.
>
> Cheers,
> Greig
"Kostas Georgiou"
On Wed, Jul 13, 2005 at 01:55:12PM +0100, Ross, D (Derek) wrote: > Hi Kostas, > > We're also seeing this at the Tier 1 now, I've mailed the developers. Restarting the pool makes it stop (for a while). I think it shows up when a pool gets full and Dcache tries to reclaim the space. It's likely that the server gets confused somehow and it tries to remove files that have been removed by hand or something similar. At restart time i see that the pool has some files in a weird state. 07/13 15:15:37 Cell(cmsdsk00_1@cmsdsk00Domain) : Starting Flushing Thread 07/13 15:15:37 Cell(cmsdsk00_1@cmsdsk00Domain) : Constructor done (still waiting for 'inventory') 07/13 15:15:37 Cell(cmsdsk00_2@cmsdsk00Domain) : New Pool Mode : disabled(fetch,store,stage,p2p-client,p2p-server,) 07/13 15:15:38 Cell(cmsdsk00_2@cmsdsk00Domain) : 0001000000000000000168D8 : CacheException(rc=210;msg=Illegal Control State : receiving.cient) 07/13 15:15:38 Cell(cmsdsk00_2@cmsdsk00Domain) : Trying to recover : 0001000000000000000168D8 07/13 15:15:38 Cell(cmsdsk00_2@cmsdsk00Domain) : Recover 0001000000000000000168D8 : Trying to get storageinfo 07/13 15:15:38 Cell(cmsdsk00_2@cmsdsk00Domain) : PnfsHandler : CacheException (10001) : Pnfs error : Pnfs File not found : 0001000000000000000168D8 07/13 15:15:38 Cell(cmsdsk00_2@cmsdsk00Domain) : Recover 0001000000000000000168D8 : get storageinfo got CacheException(rc=10001;msg=Pnfs error : Pnfs File not found : 0001000000000000000168D8) 07/13 15:15:38 Cell(cmsdsk00_2@cmsdsk00Domain) : 0001000000000000000168D8 : recover : file not found -> removed 07/13 15:15:41 Cell(cmsdsk00_2@cmsdsk00Domain) : 0001000000000000000168E0 : CacheException(rc=210;msg=Illegal Control State : receiving.cient) 07/13 15:15:41 Cell(cmsdsk00_2@cmsdsk00Domain) : Trying to recover : 0001000000000000000168E0 07/13 15:15:41 Cell(cmsdsk00_2@cmsdsk00Domain) : Recover 0001000000000000000168E0 : Trying to get storageinfo 07/13 15:15:41 Cell(cmsdsk00_2@cmsdsk00Domain) : PnfsHandler : CacheException (10001) : Pnfs error : Pnfs File not found : 0001000000000000000168E0 07/13 15:15:41 Cell(cmsdsk00_2@cmsdsk00Domain) : Recover 0001000000000000000168E0 : get storageinfo got CacheException(rc=10001;msg=Pnfs error : Pnfs File not found : 0001000000000000000168E0) 07/13 15:15:41 Cell(cmsdsk00_2@cmsdsk00Domain) : 0001000000000000000168E0 : recover : file not found -> removed 07/13 15:15:44 Cell(cmsdsk00_2@cmsdsk00Domain) : 0001000000000000000168E8 : CacheException(rc=210;msg=Illegal Control State : receiving.cient) 07/13 15:15:44 Cell(cmsdsk00_2@cmsdsk00Domain) : Trying to recover : 0001000000000000000168E8 07/13 15:15:44 Cell(cmsdsk00_2@cmsdsk00Domain) : Recover 0001000000000000000168E8 : Trying to get storageinfo 07/13 15:15:44 Cell(cmsdsk00_2@cmsdsk00Domain) : PnfsHandler : CacheException (10001) : Pnfs error : Pnfs File not found : 0001000000000000000168E8 07/13 15:15:44 Cell(cmsdsk00_2@cmsdsk00Domain) : Recover 0001000000000000000168E8 : get storageinfo got CacheException(rc=10001;msg=Pnfs error : Pnfs File not found : 0001000000000000000168E8) 07/13 15:15:44 Cell(cmsdsk00_2@cmsdsk00Domain) : 0001000000000000000168E8 : recover : file not found -> removed 07/13 15:15:47 Cell(cmsdsk00_2@cmsdsk00Domain) : 000100000000000000016930 : CacheException(rc=210;msg=Illegal Control State : receiving.cient) 07/13 15:15:47 Cell(cmsdsk00_2@cmsdsk00Domain) : Trying to recover : 000100000000000000016930 07/13 15:15:47 Cell(cmsdsk00_2@cmsdsk00Domain) : Recover 000100000000000000016930 : Trying to get storageinfo 07/13 15:15:48 
Cell(cmsdsk00_2@cmsdsk00Domain) : PnfsHandler : CacheException (10001) : Pnfs error : Pnfs File not found : 000100000000000000016930 07/13 15:15:48 Cell(cmsdsk00_2@cmsdsk00Domain) : Recover 000100000000000000016930 : get storageinfo got CacheException(rc=10001;msg=Pnfs error : Pnfs File not found : 000100000000000000016930) 07/13 15:15:48 Cell(cmsdsk00_2@cmsdsk00Domain) : 000100000000000000016930 : recover : file not found -> removed 07/13 15:15:50 Cell(cmsdsk00_2@cmsdsk00Domain) : runInventory #=249;space=81011045121/483183820800 07/13 15:15:50 Cell(cmsdsk00_2@cmsdsk00Domain) : New Pool Mode : enabled 07/13 15:15:50 Cell(cmsdsk00_2@cmsdsk00Domain) : Pool enabled cmsdsk00_2 07/13 15:15:50 Cell(cmsdsk00_2@cmsdsk00Domain) : Repository finished 07/13 15:15:50 Cell(cmsdsk00_2@cmsdsk00Domain) : Starting Flushing Thread 07/13 15:15:50 Cell(cmsdsk00_2@cmsdsk00Domain) : Constructor done (still waiting for 'inventory')
"Owen Synge"
I spoke to Steve, Derek and Andrew at lunch about the issue of contacting production managers when your disks are full. The conclusion that Andrew suggested, and that Steve and Derek agreed to, is to let the system fill up.
I believe that you should publish your SRMs as volatile; after using the service for some time you may wish to upgrade your service status.
Regards
Owen
"Mona Aggarwal"
"Mona Aggarwal"
Hi all,
Following is a list of useful links to add and remove a pool from dCache.
1. dCache4SiteAdmins.pdf ==> Section 4
http://www.gridpp.ac.uk/deployment/admin/dcache/index.html
The new version of this guide will be included in the LCG 2.6.0 release.
2. UK dCache experiences FAQ
http://www.gridpp.ac.uk/deployment/admin/dcache/faq.html
Moreover, Greig is working on a script to automate file transfers between two pools.
Thanks Greig!
Regards,
Mona
"Greig A Cowan"
Hi everyone,
> Moreover, Greig is working on a script to automate file transfers between two pools.
I'm doing this right now, but I'm trying to make the process a little smoother. At the moment my scripts generate a list of the pnfs IDs of the files in the pool that has to be removed. The scripts then loop over this list, entering the admin interface, removing a file and then exiting. This process repeats for each file. Obviously this means that I need to manually enter the password each time I log in to the admin interface. This solution becomes very annoying if you have more than a handful of files (although it's probably not as bad as performing the entire process by hand!). Has anyone managed to successfully use ssh keys to speed up access to the admin interface?
At the moment, my ~/.ssh/config file on the admin node contains:
Host dcache_admin
    Hostname localhost
    user admin
    Port 22223
    Cipher blowfish
    Ciphers blowfish-cbc
Does anyone know how I would go about creating a key-pair for admin@localhost? I know how to do this when I am logging into a remote machine where I have access to a shell, but I'm not too sure how to do it in the case of the dCache admin interface.
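In case it is useful as a starting point, here is a sketch of the usual key-based approach. As far as I know the old admin door speaks SSH protocol 1, so an RSA1 identity is needed; the server-side location of the authorized keys file below is an assumption that should be checked against your install:

# client side: create an RSA1 identity with no passphrase
ssh-keygen -t rsa1 -f ~/.ssh/dcache_admin_identity -N ""
# then add to the dcache_admin entry in ~/.ssh/config:
#   Protocol 1
#   IdentityFile ~/.ssh/dcache_admin_identity
# server side (assumed path - verify on the admin node):
cat ~/.ssh/dcache_admin_identity.pub >> /opt/d-cache/config/authorized_keys

Failing that, the password prompt can at least be reduced to once per run by batching all of the admin commands into a single file and piping them in with something like 'ssh dcache_admin < commands', rather than opening a new session per file.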
Thanks,
Greig
"Philip Clark"
Hi Folks,
We are hopefully going to be making the GridPP storage monitoring more visible in the GridPP pages. Could you check the link below to make sure your site is being reported on correctly?
http://www.ph.ed.ac.uk/~jfergus7/gridppDiscStatus.html
Does anyone know why RAL jumped to 1PB? Is this tape being included? It seems suspiciously like too round a number.
-Phil
Greig A Cowan
Hi everyone,
At the moment, we at Edinburgh are re-evaluating our site policy for mapping VOs to particular pools/pool groups. I was hoping that the other sites with dCache/DPM installations could post to the list what their current policy is (if they have one). I would just like an idea of what is best practice.
For example, should we allocate 2TB to each of the LHC VOs and then leave the remaining space available for any VO to use? Do we even _need_ to have separate pools for VOs? As long as the storage is being utilised, does it matter who is using it?
What we do have to ensure is that all the experiments know the storage is volatile. This is pretty urgent, since we are going to be getting thousands of files and they are all from those different users.
Thanks,
Greig
Owen Synge
Owen Synge
Does anyone have a definitive way of checking the D-Cache version number? I tried the following commands:
[root@dev01 root]# head -n3 /opt/d-cache/bin/dcache-core
#!/bin/sh
# $Id: D-Cache-Howto-Email-Import2.xml,v 1.11 2005/08/25 16:51:43 synge Exp $
#
[root@dev01 root]# rpm -qa | grep cache
d-cache-opt-1.5.3-73
distcache-0.4.2-9.3
d-cache-client-1.0-76
d-cache-lcg-5.0.0-1
d-cache-gpp-1.1.2l-1
distcache-devel-0.4.2-9.3
d-cache-core-1.5.2-74
d-cache-gpp-admin-1.1.2l-1
hello,
I tried yesterday to update our Dcache on fal-pygrid-20 and attached node to 2_5_0; I did this over the top of the existing setup. Everything still works, and the cynic in me would like to know if the upgrade went successfully - so how can you tell what Dcache version you're using? I can't seem to find a nice VERSION file or any magic command that tells me. It seems like a silly thing not to know how to do!
cheers, hope you have a good weekend.
matt
As you can see, the numbers don't match. A definitive version number would be useful for all involved downstream of the core development team, and maybe for them too when they want to know which version a user's bug is in. We would use it to check upgrades and to make bug reports more specific.
This can be accomplished using plain old CVS version numbers and the CVS built-in mechanism of keyword variables being populated from tags:
https://www.cvshome.org/docs/manual/cvs-1.11.20/cvs_12.html#SEC97
If each file in
/opt/d-cache/dcap/bin /opt/d-cache/srm/bin /opt/d-cache/dcap/bin
had a version option it would be good. I should also like each file in
/opt/d-cache/etc
to contain the version number so we could see which version a file started as.
Also
/opt/d-cache/docs
Could probably benefit from some version numbering too.
These version numbers could also generate the RPM's version number, as well as the application's version number, from the same CVS tag, making everything clear with no fear of duplication or forking.
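As a sketch of what this could look like in one of those files (a hypothetical snippet, not something that exists in the current RPMs), a -version option driven by the CVS Name keyword, which CVS expands when the file is checked out from a tag:

#!/bin/sh
# hypothetical example: on checkout from a tag, CVS expands the Name
# keyword to something like "$Name: d-cache-core_1_5_2-74 $"
CVS_TAG='$Name:  $'
case "$1" in
  -version|--version)
    echo "built from CVS tag: $CVS_TAG"
    exit 0
    ;;
esac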
Regards
Owen Synge
Greig A Cowan
Sorry Owen, I am not sure about how to check the version number. I have just downloaded the current version, 1.6.5-2, and listing the contents of the archive gives:

tar -t --file=dcache-v1.6.5-2.tar
dcache_deploy/
dcache_deploy/d-cache-client-1.0-100-RH73.i386.rpm
dcache_deploy/d-cache-client-1.0-100.i386.rpm
dcache_deploy/d-cache-core-1.5.2-83.i386.rpm
dcache_deploy/d-cache-opt-1.5.3-84.i386.rpm
dcache_deploy/dCache-installation-instructions.txt
dcache_deploy/pnfs-3.1.10-15.i386.rpm
dcache_deploy/Release.notes

So it is not clear how the different component numbers are combined together to give the final version number. Could you speak to the dCache developers about this?

Cheers,
Greig
Greig A Cowan
Greig A Cowan
Hi everyone,
We are currently involved in the file transfers from RAL. However, we have been having trouble with our pool node in that all the CPU (8*1.9 GHz) and memory (physical RAM is 32 GB) resources have been quickly used up, grinding the machine to a halt. This has prevented us from accepting files.
When Steve Thorn (NeSC) analysed the machine, it appeared that dCache was spawning java processes:
1195 ?        S      0:00 /bin/sh /opt/d-cache/jobs/pool -pool=dcache -logfile
1197 ?        S      0:00  \_ /usr/java/j2sdk1.4.2_08/bin/java -server -Xmx256m
1200 ?        S      9:55      \_ /usr/java/j2sdk1.4.2_08/bin/java -server -Xmx
1201 ?        S      0:57          \_ /usr/java/j2sdk1.4.2_08/bin/java -server
1202 ?        S      0:00          \_ /usr/java/j2sdk1.4.2_08/bin/java -server
1203 ?        S      0:00          \_ /usr/java/j2sdk1.4.2_08/bin/java -server
1204 ?        S      0:00          \_ /usr/java/j2sdk1.4.2_08/bin/java -server
...
There were ~200 each using 57 MB RAM. At one point, the total RAM used was 31 GB. At the moment, Dcache services have been stopped on the pool node and after a reboot the machine appears to have returned to normal. Has anyone seen/heard of this before?
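A quick way to sanity-check figures like this (a sketch, nothing specific to our machine): where the JVM runs on LinuxThreads, as is typical on 2.4 kernels, every Java thread shows up as its own ps entry, but the threads of one JVM share a single address space, so summing the per-entry RSS overcounts heavily. Comparing against free(1) gives a truer picture:

# overall memory use, independent of how ps counts threads
free -m
# per-entry view: identical large RSS figures usually mean threads
# sharing one heap, not separate processes
ps -o pid,ppid,rss,vsz,args -C java | head -20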
Any advice would be useful.
Cheers,
Greig
Kostas Georgiou
On Tue, Jul 19, 2005 at 04:17:22PM +0100, Greig A Cowan wrote: > Hi everyone, > > We are currently involved in the file transfers from RAL. However, we have > been having trouble with our pool node in that all the CPU (8*1.9 GHz) > and memory (physical RAM is 32 GB) resources have been quickly used up, > grinding the machine to a halt. This has prevented us from accepting > files. > > When Steve Thorn (NeSC) analysed the machine, it appears that dCache was > spawning java processes: > > 1195 ? S 0:00 /bin/sh /opt/d-cache/jobs/pool -pool=dcache > -logfile > 1197 ? S 0:00 \_ /usr/java/j2sdk1.4.2_08/bin/java -server > -Xmx256m > 1200 ? S 9:55 \_ /usr/java/j2sdk1.4.2_08/bin/java > -server -Xmx > 1201 ? S 0:57 \_ /usr/java/j2sdk1.4.2_08/bin/java > -server > 1202 ? S 0:00 \_ /usr/java/j2sdk1.4.2_08/bin/java > -server > 1203 ? S 0:00 \_ /usr/java/j2sdk1.4.2_08/bin/java > -server > 1204 ? S 0:00 \_ /usr/java/j2sdk1.4.2_08/bin/java > -server > ... > > There were ~200 each using 57 MB RAM. At one point, the total RAM used was > 31 GB. At the moment, Dcache services have been stopped on the pool node > and after a reboot the machine appears to have returned to normal. Has > anyone seen/heard of this before? I can see around 400 threads from the two java processes in one of our pool nodes. Total memory in use is ~380MB for both of them (~100 for the pool, ~180 for gridftp). Are you sure that the problem was caused because of low memory? Threads share all the data so it's more likely to me that the process was only using 57MB total ;P In our pool node, there are also 165 connections from csfnfs*.rl.ac.uk and the disk spends most of it's seeking instead of doing something useful (writing) which causes a huge load. Cheers, Kostas
Philip Clark
Hi Kostas, Yes, this makes sense the threads all share the same memory. I have suggest Greig move this thread to lcg-support-dcache. Are you on this list? Maybe something else caused our downtime. Hope to find out soon. -Phil Kostas Georgiou writes: > On Tue, Jul 19, 2005 at 04:17:22PM +0100, Greig A Cowan wrote: > >> Hi everyone, >> >> We are currently involved in the file transfers from RAL. However, we have >> been having trouble with our pool node in that all the CPU (8*1.9 GHz) >> and memory (physical RAM is 32 GB) resources have been quickly used up, >> grinding the machine to a halt. This has prevented us from accepting >> files. >> >> When Steve Thorn (NeSC) analysed the machine, it appears that dCache was >> spawning java processes: >> >> 1195 ? S 0:00 /bin/sh /opt/d-cache/jobs/pool -pool=dcache >> -logfile >> 1197 ? S 0:00 \_ /usr/java/j2sdk1.4.2_08/bin/java -server >> -Xmx256m >> 1200 ? S 9:55 \_ /usr/java/j2sdk1.4.2_08/bin/java >> -server -Xmx >> 1201 ? S 0:57 \_ /usr/java/j2sdk1.4.2_08/bin/java >> -server >> 1202 ? S 0:00 \_ /usr/java/j2sdk1.4.2_08/bin/java >> -server >> 1203 ? S 0:00 \_ /usr/java/j2sdk1.4.2_08/bin/java >> -server >> 1204 ? S 0:00 \_ /usr/java/j2sdk1.4.2_08/bin/java >> -server >> ... >> >> There were ~200 each using 57 MB RAM. At one point, the total RAM used was >> 31 GB. At the moment, Dcache services have been stopped on the pool node >> and after a reboot the machine appears to have returned to normal. Has >> anyone seen/heard of this before? > > I can see around 400 threads from the two java processes in one of our pool > nodes. Total memory in use is ~380MB for both of them (~100 for the pool, ~180 > for gridftp). Are you sure that the problem was caused because of low memory? > Threads share all the data so it's more likely to me that the process was only > using 57MB total ;P > > In our pool node, there are also 165 connections from csfnfs*.rl.ac.uk and the disk > spends most of it's seeking instead of doing something useful (writing) which > causes a huge load. > > Cheers, > Kostas
Kostas Georgiou
On Tue, Jul 19, 2005 at 05:31:25PM +0100, Philip Clark wrote:
> Hi Kostas,
>
> Yes, this makes sense the threads all share the same memory. I have
> suggest Greig move this thread to lcg-support-dcache. Are you on this
> list?

I wasn't even aware that the mailing list existed. Any info on how to subscribe to it?

> Maybe something else caused our downtime. Hope to find out soon.

If I am not mistaken you are running AS2.1, which uses the old threading model, which is far less efficient than the new one, so I won't rule out the threads having caused the problem.

Kostas
Owen Synge
On Tue, 19 Jul 2005 17:31:25 +0100 Philip Clark wrote:
> Hi Kostas,
>
> Yes, this makes sense the threads all share the same memory. I have
> suggest Greig move this thread to lcg-support-dcache. Are you on this
> list?
I am not, and I should like the thread to be a cross-post so I can see it too without a flood of irrelevant emails; but a summary would be great.
>
> Maybe something else caused our downtime. Hope to find out soon.
>
> -Phil
I think the disk thrashing issue is worth flagging to others as a performance hit, as the experiments are trying to break things to find out what breaks under what circumstances, and then, I think, trying to find out the best way to work with the software stack.
Just my 2 pence worth
Regards
Owen S
Kostas Georgiou
Well, I have parallel streams set to 1 for srm and it seemed to work fine for the Phedex transfers from RAL (~480 Mbit/sec). Somehow the SC3 transfers that Derek is running at the moment use something between 2 and 5 streams (from my strace logs); the end result is that we haven't managed to get more than ~80 Mbit/sec :(
From the strace logs it looks like each thread in d-cache writes its own stream as it arrives, instead of merging everything back in a buffer, resulting in writes like:
lseek(23, 106792960, SEEK_SET) = 106792960
write(23, ..., 10240) = 10240
..
lseek(23, 106844160, SEEK_SET) = 106844160
write(23, ..., 10240) = 10240
<guesswork> The OS/RAID controller might be able to merge everything back together before writing to the disk, but with the 250 streams that we have at the moment I think it's unlikely to happen (iostat reports minimal merges compared to writes).
Since we are using RAID5 for the disks with a stripe of 64K, the non-merged 10K writes result in partial stripe writes, which cause Read-Modify-Write operations, slowing everything down even more :( </guesswork>
Too bad there is no source available to play with different settings :( I'll boot one of the pool nodes with the Anticipatory elevator, which might be able to do better than the other ones, but I don't expect it to make much difference :(
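(For reference, on a 2.6 kernel the elevator is normally chosen with a kernel boot parameter; a sketch of the grub.conf kernel line, where the kernel version and root device are placeholders:)

kernel /vmlinuz-2.6.x ro root=/dev/sda1 elevator=as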
Cheers, Kostas
"Sansum, RA (Andrew)"
So many streams to the RAID controller is bound to be a disaster. We need to reduce the number of parallel streams being generated by FTS/dCache.
Regards
Andrew
Greig A Cowan
Hi everyone,
Following on from my post yesterday regarding dCache spawning java processes, I have carried out some more tests of our dCache and have found the following:
The problem of excessive memory/CPU usage appears to occur during all file transfers into our dCache (using any level of parallelism). Once a transfer is complete, the CPU usage returns to normal but the memory used is not released.
The problem with the sustained RAL FTS transfers is that due to the continuous nature of the transfers, the CPU usage is always high and after enough time all the memory runs out. Presumably this is even the case when just performing periodic transfers: after enough of them the system will still run out of memory. This possibly helps to explain why we have previously had problems of this nature with our pool node, even before RAL started their FTS transfers.
Any information anyone has on this matter would be useful. We are currently looking into issues regarding the memory management of our pool node. I will keep you posted on our progress.
Thanks,
Greig
Steve Traylen
Hi Brian,
As you know, I was looking at the FTS logs for transfers from RAL to Lanc'.
They contain the exciting bit below. And from the srmcp below that, it looks like the 'set done' (setFileStatus) call is not working, failing with an Axis error.
I don't know.
Steve
2005-07-21 13:45:01,346 [WARN ] - Starting gsiftp transfer
TURL source = gsiftp://gftp0441.gridpp.rl.ac.uk:2811//pnfs/gridpp.rl.ac.uk/data/dteam/fts_test/fts_test-20
TURL dest = gsiftp://fal-pygrid-26.lancs.ac.uk:2811//pnfs/lancs.ac.uk/data/dteam/fts_ral/339e212941f1ff49d4332e7b90e4cfd8
FILE SIZE = 1055162368
2005-07-21 13:55:02,645 [INFO ] - STATUS:END fail:TRANSFER
2005-07-21 13:55:02,645 [INFO ] - STATUS:BEGIN:SRM_PUTDONE
2005-07-21 13:55:02,646 [DEBUG] - Performing Call to method srm__setFileStatus
2005-07-21 13:55:02,948 [DEBUG] - Call completed to srm__setFileStatus
2005-07-21 13:55:02,949 [INFO ] - STATUS:END:SRM_PUTDONE
2005-07-21 13:55:02,949 [DEBUG] - Performing Call to method srm__advisoryDelete
2005-07-21 13:55:03,881 [DEBUG] - Call completed to srm__advisoryDelete
2005-07-21 13:55:03,882 [INFO ] - STATUS:BEGIN:SRM_GETDONE
2005-07-21 13:55:03,882 [DEBUG] - Performing Call to method srm__setFileStatus
2005-07-21 13:55:04,281 [DEBUG] - Call completed to srm__setFileStatus
2005-07-21 13:55:04,282 [INFO ] - STATUS:END:SRM_GETDONE
2005-07-21 13:55:04,283 [INFO ] - STATUS:FAILED
2005-07-21 13:55:04,283 [DEBUG] - exiting listener thread which still seems active
2005-07-21 13:55:04,283 [ERROR] - FINAL:ABORT:TRANSFER - Transfer timed out.%
Also trying an srmcp on a light visible host:
RM Configuration: debug=true gsissl=true help=false pushmode=false userproxy=true buffer_size=2048 tcp_buffer_size=0 config_file=/home/traylens/.srmconfig/config.xml glue_mapfile=/opt/d-cache/srm/conf/SRMServerV1.map webservice_path=srm/managerv1.wsdl webservice_protocol=https gsiftpclinet=globus-url-copy protocols_list=http,gsiftp save_config_file=null srmcphome=/opt/d-cache/srm urlcopy=/opt/d-cache/srm/bin/url-copy.sh x509_user_cert=/home/csf/traylens/.globus/usercert.pem x509_user_key=/home/csf/traylens/.globus/userkey.pem x509_user_proxy=/tmp/x509up_u27532 x509_user_trusted_certificates=/etc/grid-security/certificates retry_num retry_timeout=10000 wsdl_url=null use_urlcopy_script=true connect_to_wsdl=false from[0]=file:////etc/group to=srm://fal-pygrid-26.lancs.ac.uk:8443//pnfs/lancs.ac.uk/data/dteam/bingo Thu Jul 21 14:54:50 BST 2005: starting SRMPutClient Thu Jul 21 14:54:50 BST 2005: SRMClient(https,srm/managerv1.wsdl,true) Thu Jul 21 14:54:50 BST 2005: connecting to server Thu Jul 21 14:54:50 BST 2005: connected to server, obtaining proxy SRMClientV1 : connecting to srm at httpg://fal-pygrid-26.lancs.ac.uk:8443/srm/managerv1 Thu Jul 21 14:54:51 BST 2005: got proxy of type class org.dcache.srm.client.SRMClientV1 SRMClientV1 : put, sources[0]="/etc/group" SRMClientV1 : put, dests[0]="srm://fal-pygrid-26.lancs.ac.uk:8443//pnfs/lancs.ac.uk/data/dteam/bingo" SRMClientV1 : put, protocols[0]="http" SRMClientV1 : put, protocols[1]="dcap" SRMClientV1 : put, protocols[2]="gsiftp" SRMClientV1 : put, contacting service httpg://fal-pygrid-26.lancs.ac.uk:8443/srm/managerv1 doneAddingJobs is false copy_jobs is empty Thu Jul 21 14:54:53 BST 2005: srm returned requestId = -2147480986 Thu Jul 21 14:54:53 BST 2005: sleeping 1 seconds ... Thu Jul 21 14:54:58 BST 2005: FileRequestStatus with SURL=srm://fal-pygrid-26.lancs.ac.uk:8443//pnfs/lancs.ac.uk/data/dteam/bingo is Ready Thu Jul 21 14:54:58 BST 2005: received TURL=gsiftp://fal-pygrid-26.lancs.ac.uk:2811//pnfs/lancs.ac.uk/data/dteam/bingo doneAddingJobs is false copy_jobs is not empty copying CopyJob, source = file:////etc/group destination = gsiftp://fal-pygrid-26.lancs.ac.uk:2811//pnfs/lancs.ac.uk/data/dteam/bingo trying script copy executing command /opt/d-cache/srm/bin/url-copy.sh -get-protocols exit value is 0 GridftpClient: connecting to fal-pygrid-26.lancs.ac.uk on port 2811 GridftpClient: gridFTPClient tcp buffer size is set to 1048576 GridftpClient: gridFTPWrite started, source file is java.io.RandomAccessFile@12c3327 destination path is /pnfs/lancs.ac.uk/data/dteam/bingo GridftpClient: parallelism: 10 GridftpClient: adler 32 for file java.io.RandomAccessFile@12c3327 is f6bacd15 GridftpClient: waiting for completion of transfer GridftpClient: gridFtpWrite: starting the transfer in emode to /pnfs/lancs.ac.uk/data/dteam/bingo GridftpClient: DiskDataSink.close() called GridftpClient: gridFTPWrite() wrote 649bytes GridftpClient: closing client : org.dcache.srm.util.GridftpClient$FnalGridFTPClient@a83a13 GridftpClient: closed client execution of CopyJob, source = file:////etc/group destination = gsiftp://fal-pygrid-26.lancs.ac.uk:2811//pnfs/lancs.ac.uk/data/dteam/bingo completed setting file request -2147480985 status to Done AxisFault faultCode: {http://xml.apache.org/axis/}HTTP faultSubcode: faultString: (0)null faultActor: faultNode: faultDetail: {}:return code: 0 {http://xml.apache.org/axis/}HttpErrorCode:0 (0)null at org.apache.axis.transport.http.HTTPSender.readFromSocket(HTTPSender.java:663) at 
org.apache.axis.transport.http.HTTPSender.invoke(HTTPSender.java:94) at org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:32) at org.apache.axis.SimpleChain.doVisiting(SimpleChain.java:118) at org.apache.axis.SimpleChain.invoke(SimpleChain.java:83) at org.apache.axis.client.AxisClient.invoke(AxisClient.java:147) at org.apache.axis.client.Call.invokeEngine(Call.java:2719) at org.apache.axis.client.Call.invoke(Call.java:2702) at org.apache.axis.client.Call.invoke(Call.java:2378) at org.apache.axis.client.Call.invoke(Call.java:2301) at org.apache.axis.client.Call.invoke(Call.java:1758) at org.dcache.srm.client.axis.ISRMStub.setFileStatus(ISRMStub.java:512) at org.dcache.srm.client.SRMClientV1.setFileStatus(SRMClientV1.java:1082) at gov.fnal.srm.util.CopyJob.done(CopyJob.java:152) at gov.fnal.srm.util.Copier.run(Copier.java:305) at java.lang.Thread.run(Thread.java:534) SRMClientV1 : getRequestStatus: try #0 failed with exception java.lang.RuntimeException: (0)null at org.dcache.srm.client.SRMClientV1.setFileStatus(SRMClientV1.java:1086) at gov.fnal.srm.util.CopyJob.done(CopyJob.java:152) at gov.fnal.srm.util.Copier.run(Copier.java:305) at java.lang.Thread.run(Thread.java:534) setting File Request to "Done" failed java.lang.RuntimeException: (0)null at org.dcache.srm.client.SRMClientV1.setFileStatus(SRMClientV1.java:1086) at gov.fnal.srm.util.CopyJob.done(CopyJob.java:152) at gov.fnal.srm.util.Copier.run(Copier.java:305) at java.lang.Thread.run(Thread.java:534) Exception in thread "main" java.lang.RuntimeException: (0)null at org.dcache.srm.client.SRMClientV1.setFileStatus(SRMClientV1.java:1086) at gov.fnal.srm.util.CopyJob.done(CopyJob.java:152) at gov.fnal.srm.util.Copier.run(Copier.java:305) at java.lang.Thread.run(Thread.java:534) setting file request -2147480985 status to Done java.lang.IllegalStateException: Shutdown in progress at java.lang.Shutdown.add(Shutdown.java:79) at java.lang.Runtime.addShutdownHook(Runtime.java:190) at gov.fnal.srm.util.Copier.run(Copier.java:229) at java.lang.Thread.run(Thread.java:534)
Kostas Georgiou
Kostas Georgiou
Hi,
After our cms people deleted 1.5-2.0TB from Dcache with GSI DCAP, we discovered that the pools hadn't freed the space, although the files are gone from the /pnfs name space. In the admin web page the pools show as full and the data is marked as precious.
Any ideas on how to clean up the mess?
Cheers,
Kostas
"Ross, D (Derek)"
Hi Kostas,
Have a look in /opt/pnfsdb/pnfs/trash/2 on your admin node. The name of every file in there is the pnfsid of a file that has been deleted from pnfs; rep rm <pnfsid> -force in the pool's cell in the admin node should delete the file.
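Spelled out, the manual procedure is roughly the following (a sketch; the pnfsid and pool name are placeholders, and the pool name can be read from the trash file itself, as shown later in this thread):

# on the admin node: each file name in the trash is a deleted pnfsid
ls /opt/pnfsdb/pnfs/trash/2
cat /opt/pnfsdb/pnfs/trash/2/<pnfsid>    # the contents name the pool
# then, from the admin interface, inside that pool's cell:
#   cd <poolname>
#   rep rm <pnfsid> -force
#   ..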
Derek
Kostas Georgiou
Ah and i was about to delete the pool to recover the space :) Is there a reason why Dcache doesn't delete the files automatically? Do we have to go through the cleanup exercise manually or it was some random failure in Dcache?
There are 1195 files in the trash :( I'll have to write some script i guess. Do i delete the files from the trash after i run rep rm .... or Dcache will do something about it?
Cheers,
Kostas
Greig A Cowan
> Ah and i was about to delete the pool to recover the space :) Is there a reason why Dcache doesn't delete the files automatically? Do we have to go through the cleanup exercise manually or it was some random failure in Dcache?
>
> There are 1195 files in the trash :( I'll have to write some script i guess. Do i delete the files from the trash after i run rep rm .... or Dcache will do something about it?
I had started to write a set of scripts to remove files from a dCache pool, but then Owen S mentioned that someone (Judith Novak?) at CERN had already done this. Has there been any progress on finding out about this, Owen?
Cheers,
Greig
"Ross, D (Derek)"
> Ah and i was about to delete the pool to recover the space :) Is there a reason why Dcache doesn't delete the files automatically?
It does normally; this may be related to the problem where the pool starts writing errors into the log file about deleting files.
> There are 1195 files in the trash :( I'll have to write some script i guess. Do i delete the files from the trash after i run rep rm .... or Dcache will do something about it?
I think you're free to delete them.
Derek
Kostas Georgiou
I see. I deleted everything from the pools, leaving only 13GB in /pnfs, but for some reason, even after deleting everything with rep rm ..., Dcache reports that ~500GB are still in use :(
I also noticed that some files in the trash don't have an associated pool and I can't delete them; am I right to guess that those files have already been deleted?
Any ideas what the fields mean? I can't seem to find anything in the "documentation".
$ cat 000100000000000000018AA0
2,0,0,0.0,0.0
:d=true;
 sedsk00_1
$ cat 000100000000000000023EB8
2,0,0,0.0,0.0
:
 sedsk00_1
$ cat 000100000000000000016EC8
2,0,0,0.0,0.0
:d=true;c=1:5dc00001;s=*;
 sedsk00_1
$ cat 000100000000000000023D08
2,0,0,0.0,0.0
:d=true;
$ cat 00010000000000000000F3D8
2,0,0,0.0,0.0
:
"Brian Davies"
Aren't these files deleted by the thresh command when the Dcache reaches some (settable) filled capacity? I.e. files are marked as deletable but are not removed until you reach x% filled capacity, at which point it starts removing these files until the filled capacity is reduced back to y%, or until all removable files have been removed (whichever comes first). Brian
Kostas Georgiou
> I see, i deleted everything from the pools leaving only 13GB in /pnfs for some reason even after deleting everything by rep rm ... Dcache reports that ~500GB are still in use :(
It seems that was caused by my script, which missed some files :( Here is the improved and simplified version, in case anyone needs to use it at some point.
Cheers, Kostas
$ ./dcache_emptytrash > commands
$ ssh dcacheadm < commands
$ cat dcache_emptytrash
#!/bin/bash

trash=/opt/pnfsdb/pnfs/trash/2
pools="sedsk00_1 cmsdsk00_1 cmsdsk00_2"

for pool in $pools; do
  echo "cd $pool"
  for file in `find $trash -type f -print0 | xargs -0 grep -sl $pool`; do
    echo "rep rm ${file##*/} -force"
  done
  echo ".."
done
echo logoff
Kostas Georgiou
On Mon, Jul 25, 2005 at 09:49:51AM +0100, Brian Davies wrote:
> aren't these files deleted by the thresh command when the Dcache
> reaches some (settable) filled capacity. ie files are marked as
> deletable but are not removed until you reach x% filled capacity at
> which point it starts removing these files until the filled capacity
> is reduced back to y% or when all removable files have been removed (
> whichever comes first).

From the three pools that we have, the usage was 100%, 90% and 80%, while around 80% of the total space was "deleted"; all the files in the full pool were marked as deletable and the cleanup never happened. I'll have a look at setting the threshold to something really low and see if it makes a difference.

Kostas
Kostas Georgiou
Hi,
Has anyone seen error messages like these before on the admin node? To me it smells like corruption in the pnfs database somewhere.
Cheers,
Kostas
nfs_refresh_inode: inode number mismatch expected (0xc/0x104327f), got (0xc/0x1043278) nfs_refresh_inode: inode number mismatch expected (0xc/0x1043497), got (0xc/0x1043490) nfs_refresh_inode: inode number mismatch expected (0xc/0x103e53f), got (0xc/0x103e538) nfs_refresh_inode: inode number mismatch expected (0xc/0x104367f), got (0xc/0x1043678) nfs_refresh_inode: inode number mismatch expected (0xc/0x100e827), got (0xc/0x100e820) nfs_refresh_inode: inode number mismatch expected (0xc/0x1001127), got (0xc/0x1001120) nfs_refresh_inode: inode number mismatch expected (0xc/0x1001067), got (0xc/0x1080) nfs_refresh_inode: inode number mismatch expected (0xc/0x1087), got (0xc/0x1080) nfs_refresh_inode: inode number mismatch expected (0xc/0x1047), got (0xc/0x1040) nfs_refresh_inode: inode number mismatch expected (0xc/0x1027), got (0xc/0x1020) nfs_refresh_inode: inode number mismatch expected (0xc/0x1043687), got (0xc/0x1043680) nfs_refresh_inode: inode number mismatch expected (0xc/0x1043697), got (0xc/0x1043690) nfs_refresh_inode: inode number mismatch expected (0xc/0x104373f), got (0xc/0x1043738) nfs_refresh_inode: inode number mismatch expected (0xc/0x1043717), got (0xc/0x1043710) nfs_refresh_inode: inode number mismatch expected (0xc/0x1043727), got (0xc/0x1043720) nfs_refresh_inode: inode number mismatch expected (0xc/0x100e827), got (0xc/0x100e820) nfs_refresh_inode: inode number mismatch expected (0xc/0x1001127), got (0xc/0x1001120) nfs_refresh_inode: inode number mismatch expected (0xc/0x1001067), got (0xc/0x1080) nfs_refresh_inode: inode number mismatch expected (0xc/0x1087), got (0xc/0x1080) nfs_refresh_inode: inode number mismatch expected (0xc/0x1047), got (0xc/0x1040) nfs_refresh_inode: inode number mismatch expected (0xc/0x1027), got (0xc/0x1020) nfs_refresh_inode: inode number mismatch expected (0xc/0x103385f), got (0xc/0x1033858)
Jiri Mencak
Dear all,
I've played a little bit with dual-homed machines and dCache, with mixed success. Nevertheless, I think it is worth reporting and I'm looking forward to your feedback.
Architecture
Pentium III 600
OS
Scientific Linux SL Release 3.0.4 (SL)
dCache
d-cache-client-1.0-100 d-cache-core-1.5.2-83 d-cache-lcg-5.0.0-1 d-cache-opt-1.5.3-84 (d-cache-gpp-v1.2.1-1)
I have done a simplified dCache installation using the GridPP storage dependency RPMs (no BDII etc.) to speed things up; an LCG yaim 2.5.0 installation should work equally well.
Scenario
Admin node: dual-homed box with a /pool on the same box (I know, 3 dual-homed boxes would be better with no pool on the admin node, but this should do as a proof of concept)
Pool node: dual-homed box with a /pool
Public Interfaces: E0a (admin.public.ac.uk), E0p (pool.public.ac.uk)
Private Interfaces: E1a (192.168.0.32), E1p (192.168.0.33)
[ASCII diagram: the admin node (with /pool) and the pool node (with /pool) are each connected to both the public network switch (via E0a/E0p) and the private network switch (via E1a/E1p), with the other pools attached below.]
Installation
1) Installed SL 3.0.4 and grid certificates
2) Made sure `hostname` returns FQDN associated with E0a and E0p, in other words, public FQDN.
3) To make internal dCache communication pass through the private interfaces I've set up an internal DNS server to fool the admin and pool nodes into thinking admin.public.ac.uk is 192.168.0.32 and pool.public.ac.uk is 192.168.0.33 (see the sketch just after this list).
4) Made sure
`hostname -d` = `grep ^search /etc/resolv.conf | awk '{print $2}'`
5) Set up site-info.def:
MY_DOMAIN=`hostname -d`
DCACHE_ADMIN=<E1a private FQDN>
DCACHE_POOLS="`hostname -f`:2:/pool"
6) Installed dCache using GridPP storage dependency RPMs.
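(Regarding step 3: an equivalent, simpler way to get the same name-fooling effect on just these two boxes - an alternative, not what was actually done here - is a pair of /etc/hosts entries on both nodes, assuming hosts is consulted before dns in /etc/nsswitch.conf:)

# /etc/hosts on both the admin and pool nodes (sketch)
192.168.0.32   admin.public.ac.uk   admin
192.168.0.33   pool.public.ac.uk    pool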
Testing
globus-url-copy and dCache SRM copy worked fine, including third-party copying (get) _from_ dual-homed boxes. Unfortunately, third-party copying (put) _to_ dual-homed boxes fails. Relevant dCache log snippets attached.
Tier 2 dual-homing requirements
It would be nice to hear what the architectural requirements from Tier 2 sites are with regard to dual-homing. I was working under the assumption that the purpose of dual-homed machines was to increase network throughput on the public interface by passing internal dCache communication through the private interface, and to shield dCache from the outside world, exposing only SRM and GridFTP on the public interface.
I suspect there will be other/different requirements with regard to the dual-homed architecture so it would be nice to hear them. Owen tells me that if you need dual-homing, your setup will almost certainly be Lightpath on the public interface, and university network on the private interface.
I'm now partly leaving dCache support and moving onto another project, so I cannot guarantee I'll be working on dual-homing in the future.
Regards.
Jiri
07/19 10:19:11 Cell(PnfsManager@pnfsDomain) : Failed : CacheException(rc=666;msg=can't get pnfsId (not a pnfsfile)) 07/19 10:19:11 Cell(PnfsManager@pnfsDomain) : CacheException(rc=666;msg=can't get pnfsId (not a pnfsfile)) 07/19 10:19:11 Cell(PnfsManager@pnfsDomain) : at diskCacheV111.cells.PnfsManager2.getStorageInfo(PnfsManager2.java:950) 07/19 10:19:11 Cell(PnfsManager@pnfsDomain) : at diskCacheV111.cells.PnfsManager2.processPnfsMessage(PnfsManager2.java:1597) 07/19 10:19:11 Cell(PnfsManager@pnfsDomain) : at diskCacheV111.cells.PnfsManager2$ProcessThread.run(PnfsManager2.java:1518) 07/19 10:19:11 Cell(PnfsManager@pnfsDomain) : at java.lang.Thread.run(Thread.java:534) 07/19 10:19:11 Cell(PnfsManager@pnfsDomain) : Error obtaining 'l' flag for getSimulatedFilesize : java.io.FileNotFoundException: /pnfs/fs/.(puse)(000100000000000000001120)(2) (Is a directory) 07/19 10:19:12 Cell(PnfsManager@pnfsDomain) : Error obtaining 'l' flag for getSimulatedFilesize : java.io.FileNotFoundException: /pnfs/fs/.(puse)(000100000000000000001120)(2) (Is a directory) 07/19 10:16:18 Cell(SRM@srmDomain) : Request id=-2147483523: copy request state changed to Done 07/19 10:16:18 Cell(SRM@srmDomain) : Request id=-2147483523: changing fr#-2147483522 to Done 07/19 10:18:35 Cell(SRM@srmDomain) : CopyRequest reqId # -2147483521Request.createCopyRequest : created new request succesfully 07/19 10:20:08 Cell(SRM@srmDomain) : remoing TransferInfo for callerId=20000 07/19 10:20:08 Cell(SRM@srmDomain) : org.dcache.srm.scheduler.NonFatalJobFailure: CacheException(rc=666;msg=tranfer failed :org.globus.ftp.exception.ServerException: Server refused performing the request. Custom message: Server reported transfer failure (error code 1) [Nested exception message: Custom message: Unexpected reply: 426 Transfer aborted, closing connection :java.net.NoRouteToHostException: No route to host] [Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException: Custom message: Unexpected reply: 426 Transfer aborted, closing connection :java.net.NoRouteToHostException: No route to host]) 07/19 10:20:08 Cell(SRM@srmDomain) : at org.dcache.srm.request.CopyFileRequest.runRemoteToLocalCopy(CopyFileRequest.java:666) 07/19 10:20:08 Cell(SRM@srmDomain) : at org.dcache.srm.request.CopyFileRequest.run(CopyFileRequest.java:770) 07/19 10:20:08 Cell(SRM@srmDomain) : at org.dcache.srm.scheduler.Scheduler$JobWrapper.run(Scheduler.java:1121) 07/19 10:20:08 Cell(SRM@srmDomain) : at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(PooledExecutor.java) 07/19 10:20:08 Cell(SRM@srmDomain) : at java.lang.Thread.run(Thread.java:534) 07/19 10:20:08 Cell(SRM@srmDomain) : CopyFileRequest #-2147483520: copy failed 07/19 10:20:08 Cell(SRM@srmDomain) : org.dcache.srm.scheduler.NonFatalJobFailure: org.dcache.srm.scheduler.NonFatalJobFailure: CacheException(rc=666;msg=tranfer failed :org.globus.ftp.exception.ServerException: Server refused performing the request. 
Custom message: Server reported transfer failure (error code 1) [Nested exception message: Custom message: Unexpected reply: 426 Transfer aborted, closing connection :java.net.NoRouteToHostException: No route to host] [Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException: Custom message: Unexpected reply: 426 Transfer aborted, closing connection :java.net.NoRouteToHostException: No route to host]) 07/19 10:20:08 Cell(SRM@srmDomain) : at org.dcache.srm.request.CopyFileRequest.run(CopyFileRequest.java:798) 07/19 10:20:08 Cell(SRM@srmDomain) : at org.dcache.srm.scheduler.Scheduler$JobWrapper.run(Scheduler.java:1121) 07/19 10:20:08 Cell(SRM@srmDomain) : at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(PooledExecutor.java) 07/19 10:20:08 Cell(SRM@srmDomain) : at java.lang.Thread.run(Thread.java:534) 07/19 10:20:36 Cell(SRM@srmDomain) : CopyRequest reqId # -2147483521copyRequest getter_putter is non null, stopping 07/19 10:20:36 Cell(SRM@srmDomain) : CopyRequest reqId # -2147483521changing fr#-2147483520 to Failed 07/19 10:20:36 Cell(SRM@srmDomain) : CopyRequest reqId # -2147483521error : 07/19 10:20:36 Cell(SRM@srmDomain) : org.dcache.srm.scheduler.IllegalStateTransition: g illegal state transition from Canceled to Failed 07/19 10:20:36 Cell(SRM@srmDomain) : at org.dcache.srm.scheduler.Job.setState(Job.java:532) 07/19 10:20:36 Cell(SRM@srmDomain) : at org.dcache.srm.scheduler.Job.setState(Job.java:417) 07/19 10:20:36 Cell(SRM@srmDomain) : at org.dcache.srm.request.CopyRequest.stateChanged(CopyRequest.java:952) 07/19 10:20:36 Cell(SRM@srmDomain) : at org.dcache.srm.scheduler.Job.setState(Job.java:566) 07/19 10:20:36 Cell(SRM@srmDomain) : at org.dcache.srm.scheduler.Job.setState(Job.java:417) 07/19 10:20:36 Cell(SRM@srmDomain) : at org.dcache.srm.request.Request.getRequestStatus(Request.java:521) 07/19 10:20:36 Cell(SRM@srmDomain) : at org.dcache.srm.SRM.getRequestStatus(SRM.java:868) 07/19 10:20:36 Cell(SRM@srmDomain) : at diskCacheV111.srm.server.SRMServerV1.getRequestStatus(SRMServerV1.java:360) 07/19 10:20:36 Cell(SRM@srmDomain) : at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 07/19 10:20:36 Cell(SRM@srmDomain) : at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) 07/19 10:20:36 Cell(SRM@srmDomain) : at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) 07/19 10:20:36 Cell(SRM@srmDomain) : at java.lang.reflect.Method.invoke(Method.java:324) 07/19 10:20:36 Cell(SRM@srmDomain) : at electric.util.reflect.Invocation.execute(Unknown Source) 07/19 10:20:36 Cell(SRM@srmDomain) : at electric.util.reflect.Invocation.invoke(Unknown Source) 07/19 10:20:36 Cell(SRM@srmDomain) : at electric.service.object.ObjectService.invoke(Unknown Source) 07/19 10:20:36 Cell(SRM@srmDomain) : at electric.net.soap.SOAPMessage.invoke(SOAPMessage.java:534) 07/19 10:20:36 Cell(SRM@srmDomain) : at electric.net.soap.SOAPMessage.invoke(SOAPMessage.java:508) 07/19 10:20:36 Cell(SRM@srmDomain) : at electric.net.soap.http.SOAPHTTPHandler.service(SOAPHTTPHandler.java:88) 07/19 10:20:36 Cell(SRM@srmDomain) : at electric.server.http.ServletServer.service(Unknown Source) 07/19 10:20:36 Cell(SRM@srmDomain) : at javax.servlet.http.HttpServlet.service(HttpServlet.java:853) 07/19 10:20:36 Cell(SRM@srmDomain) : at electric.net.servlet.Config.service(Unknown Source) 07/19 10:20:36 Cell(SRM@srmDomain) : at electric.net.http.HTTPContext.service(HTTPContext.java:84) 07/19 10:20:36 Cell(SRM@srmDomain) : at 
electric.net.servlet.ServletContainer.service(Unknown Source) 07/19 10:20:36 Cell(SRM@srmDomain) : at electric.net.http.WebServer.service(WebServer.java:87) 07/19 10:20:36 Cell(SRM@srmDomain) : at electric.net.socket.SocketServer.run(Unknown Source) 07/19 10:20:36 Cell(SRM@srmDomain) : at electric.net.socket.SocketRequest.run(Unknown Source) 07/19 10:20:36 Cell(SRM@srmDomain) : at electric.util.thread.ThreadPool.run(Unknown Source) 07/19 10:20:36 Cell(SRM@srmDomain) : at java.lang.Thread.run(Thread.java:534)
Hi all,
sorry for replying to my own email, but I thought I'd preserve the ``thread''.
After giving the dual-homed boxes some time to rest and discussing this with dCache developers, I've given the 3rd party copies (_to_ dual-homed boxes) another chance. The weird thing is that they started to work! I admit to having rebooted the boxes for kernel upgrade and therefore restarted dCache, so that might have helped. I spent some time trying to replicate the problem, with no luck unfortunately. Dual-homed dCache (as described below) just works for me now.
Thanks and regards.
Jiri
Greig A Cowan
Hi everyone,
I have been playing around with dCache and have recently found a problem that I wasn't previously experiencing.
If (as root) I list the contents of the /pnfs/fs directory, I get the expected output:
# ls /pnfs/fs
admin  README  usr
If I then try to view the contents of the README file, I get an error. I have previously been able to view the contents without any problem.
# cat /pnfs/fs/README
Command failed!
Server error message for [1]: "Couldn't determine hsmType" (errno 37).
Failed open file in the dCache.
cat: README: Input/output error
This same error is also generated in the directories lower down in the /pnfs tree when (for example) I try to copy the contents of the file /pnfs/fs/admin/etc/exports/127.0.0.1 to the IP address of a pool node (as you need to do to create a door on a pool node).
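For anyone following along, that step is just a plain file copy inside the pnfs admin tree; a minimal sketch (192.168.0.10 is a hypothetical pool node address, not one taken from this thread) would be:

# copy the localhost export entry so the pool node can mount pnfs and run a door
# (192.168.0.10 is a hypothetical pool node IP)
cd /pnfs/fs/admin/etc/exports
cp 127.0.0.1 192.168.0.10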
I have the correct dCache dcap libraries preloaded:
export LD_PRELOAD=/opt/d-cache/dcap/lib/libpdcap.so
Note that I am able to view the contents of test files that I have copied into /pnfs/epcc.ed.ac.uk/data/dteam.
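As a sanity check that is independent of the LD_PRELOAD trick, copying a file in and out with dccp exercises the same dcap path; a sketch, assuming the client lives under /opt/d-cache/dcap/bin, the door runs on the admin node and uses the default dcap port 22125:

# write a small test file through the dcap door and read it back
# (host name, port and client path as assumed above)
/opt/d-cache/dcap/bin/dccp /etc/group dcap://srm.epcc.ed.ac.uk:22125/pnfs/epcc.ed.ac.uk/data/dteam/dcap_test
/opt/d-cache/dcap/bin/dccp dcap://srm.epcc.ed.ac.uk:22125/pnfs/epcc.ed.ac.uk/data/dteam/dcap_test /tmp/dcap_test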
The only references to hsm that I can find are in the pool setup files which contain commented out lines such as:
hsm set osm -pnfs=/pnfs/fs
Does/has anyone else experienced this problem?
Thanks,
Greig
Greig A Cowan
Hi all,
Our pool node has been taking a bit of a pounding during the FTS transfer tests from RAL. The machine has 16GB of RAM, all of which was being used to handle the transfers. You can see the high memory and CPU usage on the Ganglia page for our pool node:
http://mon.epcc.ed.ac.uk/ganglia/?r=day&c=ScotGrid-Edinburgh&h=dcache.epcc.ed.ac.uk
8 parallel transfers were being used during these tests.
Have any of the other Tier-2s (IC, Lancaster) experienced this sort of behaviour during sustained transfers? What about RAL? Andrew Sansum mentioned that RAL and SARA may have seen a problem like this that was solved by reducing the number of parallel transfers.
I think it would be good to find out whether this is just an issue with Edinburgh's setup or a more general dCache issue, with java processes consuming large amounts of pool node resources. This may be an issue we see more of once all Tier-2s get an SRM (assuming they use dCache, that is!).
Any information would be useful.
Thanks,
Greig
Owen Synge
On Wed, 27 Jul 2005 11:39:05 +0100 Greig A Cowan wrote:
> Hi everyone,
>
> I have been playing around with dCache and have recently found a problem
> that I wasn't previously experiencing.
>
> If (as root) I list the contents of the /pnfs/fs directory, I get the
> expected output:
>
> # ls /pnfs/fs
> admin README usr
>
> If I then try to view the contents of the README file, I get an error.
> I have previously been able to view the contents without any problem.
>
> # cat /pnfs/fs/README
> Command failed!
> Server error message for [1]: "Couldn't determine hsmType" (errno 37).
> Failed open file in the dCache.
> cat: README: Input/output error

Odd I get the error

[root@dev01 root]# export LD_PRELOAD=/opt/d-cache/dcap/lib/libpdcap.so
[root@dev01 root]# ls /pnfs/
fs gridpp.rl.ac.uk
[root@dev01 root]# cat /pnfs/fs/
admin README usr
[root@dev01 root]# cat /pnfs/fs/README
Failed to create a control line
Failed open file in the dCache.
cat: /pnfs/fs/README: Connection refused

but it used to work when I installed D-Cache and wrote this down in the HOWTO

> This same error is also generated in the directories lower down in the
> /pnfs tree when (for example) I try to copy the contents of the file
> /pnfs/fs/admin/etc/exports/127.0.0.1 to the IP address of a pool node (as
> you need to do to create a door on a pool node).
>
> I have the correct dCache dcap libraries preloaded:
>
> export LD_PRELOAD=/opt/d-cache/dcap/lib/libpdcap.so
>
> Note that I am able to view the contents of test files that I have copied
> into /pnfs/epcc.ed.ac.uk/data/dteam.
>
> The only references to hsm that I can find are in the pool setup files
> which contain commented out lines such as:
>
> hsm set osm -pnfs=/pnfs/fs
>
> Does/has anyone else experienced this problem?

The hsm stuff is/should be a bit of a distraction as it stands for Hierarchical Storage Manager and is the term commonly used to describe a tape storage system.

Regards

Owen
Alessandra Forti
Hi Greig,
I suspect (and might be wrong) that the kernel tuning Kostas has applied to IC might be useful. I was waiting for him to say something more about it. :)
In the meantime you can look at this page
https://uimon.cern.ch/twiki/bin/view/LCG/ServiceChallengeTwoProgressSARALogbook
to see if you find anything useful that could help you.
cheers
alessandra
Greig A Cowan
Hi Alessandra,
> I suspect (and might be wrong) that the kernel tuning Kostas has applied to IC might be useful. I was waiting for him to say something more about it. :)
I was looking for more information from him as well!
> In the meantime you can look at this page
> https://uimon.cern.ch/twiki/bin/view/LCG/ServiceChallengeTwoProgressSARALogbook
Thanks for the link, I think this could prove useful.
Greig
Kostas Georgiou
Well some questions first :)
Are the java processes really using that much memory, or is it the buffer cache? The kernel will try to cache as many files as possible, so high memory usage is normal. (By the way, your Ganglia page isn't visible from the outside world.)
What type of disks/controllers do you have? From what I've seen, dCache generates really bad IO patterns with parallel transfers. Have you tried running with only one stream and multiple files instead?
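One quick way to tell the two apart (a sketch; adjust the process selection if the dCache JVMs don't show up under the command name "java"):

# buffer cache is reported in the "cached" column and is reclaimable
free -m
# resident (RSS) and virtual memory actually held by the java processes
ps -o pid,rss,vsz,args -C java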
Cheers,
Kostas
Kostas Georgiou
Have a look at this paper: http://people.redhat.com/nhorman/papers/rhel3_vm.pdf. It's RHEL3-specific, but it will give you an idea of what you need to tune.
Is it really a problem that the file cache is taking all the unused memory? As long as the pages are "clean" the kernel can throw them out easily without any problems. You might want to tune vm.bdflush though to be more aggressive so dirty pages are written to disk as fast as possible.
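On a 2.4/RHEL3-era kernel that knob is /proc/sys/vm/bdflush; a sketch of making flushing more aggressive follows (the numbers are purely illustrative, and the field layout should be checked against the kernel's Documentation/sysctl/vm.txt before changing anything):

# show the current nine bdflush fields (2.4.x kernels)
cat /proc/sys/vm/bdflush
# lower the dirty-buffer percentage thresholds and the buffer age so writes
# start sooner -- illustrative values only, verify against the kernel docs
sysctl -w vm.bdflush="5 500 0 0 500 1000 30 3 0"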
> During the FTS tests we were running with 8 files and 0 (?) streams. Can anyone point me to a web page with information regarding the difference between streams and files? With this setup we were seeing ~200Mb/s into our site.
With n multiple streams Dcache writes data like: stream: write 10K, seek ahead (n-1)*10K, write 10K, ....
With 1 stream you get: stream: write 10K, write 10K, ...
If the data from all the streams doesn't arrive at the same time, you end up with writes all over the place and your disk IO suffers as a result. The 3ware RAID cards that we use at IC get around ~70MB/s for sequential IO and around ~1-5MB/s for random IO in RAID5. It's only a guess on my part whether the parallel streams cause that much random IO, since I didn't have much time to test different settings during the FTS transfers :(
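To make that access pattern concrete, here is a tiny sketch (purely illustrative, assuming 10K blocks dealt out round-robin across four streams) of the file offsets each stream ends up writing:

# print the offsets touched by each of n parallel streams, assuming 10K
# blocks handed out round-robin as described above
n=4; block=10
for s in 1 2 3 4; do
  offsets=""
  for i in 0 1 2 3; do
    offsets="$offsets $(( (i * n + s - 1) * block ))K"
  done
  echo "stream $s writes at offsets:$offsets"
done

Each stream jumps (n-1)*10K between its own writes, so unless the streams stay perfectly in step the disk sees scattered rather than sequential writes.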
Cheers,
Kostas