Brian reports that there was a discussion of the practice of running an SE:

* Draining disk servers, and coping with lost disk servers.
* How to decommission a VO (e.g. similar to the SE_Shutdown page in the wiki, but just for a single VO). What to do if the VO is dormant and cannot be contacted - you can't just wipe their data (or can you)?
* How to reduce the space a VO is using?
* How to manage non-GridPP-funded space (cf. IC and Oxford).
* Regarding the failure rates, we need to establish whether the current rate is the same as it was (i.e. 1/4 of all failures being SE timeout related).

Additional notes from Sam:

Hi Jens,

Some notes from the meeting follow:

GridPP Storage Meeting GridPP27 - topics:

Chris - file protocol support. Can "they" use it a bit more? (NFSv4.1 for DPM)
Brian: Chris is saying not to use StoRM, to use the underlying mechanism?
Chris: no, I'm saying use the SRM, but don't use GridFTP. (ATLAS use the local transfer mechanism, but that's because they have their own system for doing this.)
Brian: this is because there's a lot of metadata embedded in the ATLAS mechanisms - they've gotten around the issues with the GFAL tools. The small VOs have to use the mechanisms available to them. Until all SE types support a common protocol that isn't GridFTP? NFSv4.1 for StoRM. (The VOs are interested in xrootd more than NFSv4.1. What's the roadmap for this for DPM?)

Item: Roadmap to "local filesystem".
Chris: we've not had a disk server failure; recovery from that would be good! (Automated?) ATLAS have their consistency service, but it's not totally automated. (Replication back from ?) But what about files with no replicas? ("More resilient storage" area for temporary files - SCRATCHDISK?)
Brian - this needs discussion later?
Alessandra: this is an important point, though. When I asked ATLAS what to do when you lose files, they didn't have a clear answer. They said after 3 days you can declare them as lost, but there were 5 users rightly saying "where is my data?".
Brian: we've always said that, when a site temporarily or permanently loses data, the first thing is to produce the list of files that are affected. It's up to ATLAS to sort out from that list what to do with the files that do and don't have replicas. (One problem is that the T2s have more user data, and therefore more unique files.)
"Should LOCALGROUPDISK and SCRATCHDISK be made resilient at T2s?"

--

Item 2: a formalised process for deletion needs to be decided. (And with T2K we need to work better with them to move stuff into space tokens.)

--

Item 3: 1/4 of the failures that happen are puts at the end of the job - WN to local SE. The good news is that ATLAS can work around this with ATLAS job recovery (files are copied to local WN disk in a special cache). We also need to work out other methods for improving resiliency. Retries for transfers, for example, would reduce issues that are transient (see the sketch below). Can we get a feel for how this is affecting non-ATLAS VOs? It could be that there are problems with get errors from the site side, if a disk server is briefly down? There are also issues with gets of files that are erroneously believed to be there.
Chris: if a job fails, the most reliable way to make the job work seems to be to repeat the job elsewhere, or at the same site a few hours later! Everything else is a performance optimisation - if our storage is broken for an hour or two, there's a whole class of problems that you aren't going to solve without rerunning the job.
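On the "retries for transfers" point: a minimal sketch of what a retry wrapper around the end-of-job put could look like. The lcg-cp command line, the attempt count and the back-off delays are all illustrative assumptions, not anything agreed in the meeting; substitute whatever copy tool the VO normally uses.

#!/usr/bin/env python
"""Retry a storage transfer a few times with increasing back-off.

Sketch only: many put failures at the end of a job are transient
(e.g. a disk server briefly down), so re-attempting the copy after a
pause recovers them.  The lcg-cp invocation is just an example.
"""
import subprocess
import sys
import time

def copy_with_retries(source, destination, attempts=3, initial_delay=60):
    """Run the copy, retrying up to `attempts` times with growing delays."""
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        # Example command only - substitute the site's/VO's usual copy tool.
        command = ["lcg-cp", "-v", source, destination]
        result = subprocess.call(command)
        if result == 0:
            return True
        print("attempt %d of %d failed (exit code %d)" % (attempt, attempts, result))
        if attempt < attempts:
            time.sleep(delay)   # wait before retrying; delay doubles each time
            delay *= 2
    return False

if __name__ == "__main__":
    src, dst = sys.argv[1], sys.argv[2]
    sys.exit(0 if copy_with_retries(src, dst) else 1)

Whether three attempts over a few minutes is enough to ride out a brief disk server outage is exactly the sort of thing the proposed job recovery tests at the volunteer sites could tell us.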
Brian: Are there any T2s interested in seeing if we can get job recovery working again? (GLASGOW volunteered, as did MANCHESTER and QMUL.) And do we know how CMS is affected by the same issues? (No CMS sites are present in this meeting.) (We also need to be aware of LHCb's nascent reprocessing-at-T2s project - happening at Manchester, so we need to pay attention to the tests they are doing. At the moment, Alessandra thinks they're stuck with problems with the software.)

--

Item 4: SC11 attendance? Sam isn't going. Brian will go where Jens instructs. Chris wasn't planning to, but could be persuaded. (It is only 3 months away and in Seattle, so we guess this means "no" at this point.)

--

Item 6 (AOB):
Brian - is in conversation with people at SLAC about asymmetry in transfer routes. He may be asking people to divulge their kernel settings.
Sam - is installing WebDAV and NFSv4.1 support on DPM and wants some testing.
Govind - is worried that his DPM database is not shrinking any more, now that he's down to 3 months of records kept. Govind is still having his problem with FTS performance - timeouts quite often. Some of the DPM pool nodes are on 1.7.1; could this be causing the problem? Transfer preparation time is very large - 80 to 90 seconds. What's the LAN between the head node and the storage servers like? Brian will have a look to see if there's a correlation between the RHUL sources and the large preparation time, and try to see if there's a connection with the 1.7.1 disk servers. Govind will update them today in any case?

On 3 August 2011 06:41, Jens Jensen wrote:
> Dear all,
>
> I put together a quick agenda for today: http://storage.esc.rl.ac.uk/weekly/
>
> We should also talk about Elena's ls -l error? I'm sure we have seen those "reading token data header" errors but I can't remember if we know why they occur...
>
> I am going to try to connect, but I am also on a hotel network which is mostly OK but failed last night when I tried to upload an agenda, so I've made the items a little more self-explanatory than usual (hopefully) in case I have trouble connecting or similar.
>
> Incidentally, my more-than-storage OGF feedback session is currently scheduled for the 17th.
>
> Cheers
> --jens
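A follow-up note on Govind's DPM database item above: trimming the request history to 3 months deletes rows, but MySQL only hands the space back to the filesystem if the tables are rebuilt (OPTIMIZE TABLE), and with a shared InnoDB ibdata1 file the file never shrinks at all. Below is a small sketch to check how much reclaimable space the request tables are holding; the table names are from memory of the dpm_db schema and the credentials are placeholders, so both may need adjusting for Govind's version.

#!/usr/bin/env python
"""Check reclaimable space in the DPM request-history tables.

Sketch only, under the assumptions stated above: DATA_FREE in
information_schema shows how much space a table rebuild could return.
"""
import MySQLdb

# Placeholders - use the real head-node credentials (e.g. from DPMCONFIG).
db = MySQLdb.connect(host="localhost", user="dpmmgr", passwd="secret", db="dpm_db")
cur = db.cursor()

# Assumed request-history table names; adjust to what SHOW TABLES reports.
tables = ["dpm_req", "dpm_get_filereq", "dpm_put_filereq", "dpm_copy_filereq"]

for table in tables:
    # DATA_FREE is the space MySQL could give back if the table were rebuilt.
    cur.execute(
        "SELECT TABLE_ROWS, DATA_LENGTH, DATA_FREE FROM information_schema.TABLES "
        "WHERE TABLE_SCHEMA = 'dpm_db' AND TABLE_NAME = %s", (table,))
    row = cur.fetchone()
    if row:
        rows, length, free = row
        print("%-20s rows=%-10s data=%-12s reclaimable=%s" % (table, rows, length, free))
        # Uncomment to actually rebuild the table (note: it locks the table while running):
        # cur.execute("OPTIMIZE TABLE %s" % table)

db.close()

If the large "reclaimable" figures sit in the shared ibdata1 tablespace rather than per-table files, the disk footprint won't shrink whatever we do to the rows, which may be what Govind is seeing.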