Brian reports that there was a discussion of the practice of running an SE:

* Draining disk servers, and coping with lost disk servers.
* How to decommission a VO (e.g. similar to the SE_Shutdown page in the wiki, but just for a single VO). What to do if the VO is dormant and cannot be contacted - you can't just wipe their data (or can you)?
* How to reduce the space a VO is using?
* How to manage non-GridPP-funded space (cf. IC and Oxford).
* Regarding the failure rates, we need to establish whether the current rate is the same as it was (i.e. 1/4 of all failures being SE timeout related).

Additional notes from Sam:

Hi Jens,

Some notes from the meeting follow:

GridPP Storage Meeting GridPP27 - topics:

Chris - file protocol support. Can "they" use it a bit more? (NFSv4.1 for DPM)
Brian: Chris is saying not to use StoRM, to use the underlying mechanism?
Chris: no, I'm saying use the SRM, but don't use GridFTP. (ATLAS use the local transfer mechanism, but that's because they have their own system for doing this.)
Brian: this is because there's a lot of metadata embedded in the ATLAS mechanisms - they've gotten around the issues with the GFAL tools. The small VOs have to use the mechanisms available to them. Until all SE types support a common protocol that isn't GridFTP? NFSv4.1 for StoRM. (The VOs are interested in xrootd more than NFSv4.1. What's the roadmap for this for DPM?)

Item: Roadmap to "local filesystem".
Chris: we've not had a disk server failure; recovery from that would be good! (Automated?) ATLAS have their consistency service, but it's not totally automated. (Replication back from ?) But what about files with no replicas? ("More resilient storage" area for temporary files - SCRATCHDISK?)
Brian - this needs discussion later?
Alessandra: this is an important point, though. When I asked ATLAS what to do when you lose files, they didn't have a clear answer. They said after 3 days you can declare them as lost, but there were 5 users rightly saying "where is my data?".
Brian: we've always said that, when a site temporarily or permanently loses data, the first thing is to produce the list of files that are affected. It's up to ATLAS to sort out from that list what to do with the files that do and don't have replicas. (One problem is that the T2s have more user data, and therefore more unique files.)
"Should LOCALGROUPDISK and SCRATCHDISK be made resilient at T2s?"

--

Item 2: a formalised process for deletion needs to be decided. (And with T2K we need to work better with them to move stuff into space tokens.)

--

Item 3: 1/4 of the failures that happen are puts at the end of the job - WN to local SE. The good news is that ATLAS can work around this with ATLAS job recovery (files are copied to local WN disk in a special cache). We also need to work out other methods for improving resiliency. Retries for transfers, for example, would reduce issues that are transient (see the sketch below). Can we get a feel for how this is affecting non-ATLAS VOs? It could be that there are problems with get errors from the site side, if a disk server is briefly down? There are also issues with gets of files that are erroneously believed to be there.
Chris: if a job fails, the most reliable way to make the job work seems to be to repeat the job elsewhere, or at the same site a few hours later! Everything else is a performance optimisation - if our storage is broken for an hour or two, there's a whole class of problems that you aren't going to solve without rerunning the job.
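On the "retries for transfers" point: a minimal sketch of what a retry wrapper around the end-of-job put could look like. The lcg-cp command line, the attempt count and the back-off delays are all illustrative assumptions, not anything agreed in the meeting; substitute whatever copy tool the VO normally uses.

#!/usr/bin/env python
"""Retry a storage transfer a few times with increasing back-off.

Sketch only: many put failures at the end of a job are transient
(e.g. a disk server briefly down), so re-attempting the copy after a
pause recovers them.  The lcg-cp invocation is just an example.
"""
import subprocess
import sys
import time

def copy_with_retries(source, destination, attempts=3, initial_delay=60):
    """Run the copy, retrying up to `attempts` times with growing delays."""
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        # Example command only - substitute the site's/VO's usual copy tool.
        command = ["lcg-cp", "-v", source, destination]
        result = subprocess.call(command)
        if result == 0:
            return True
        print("attempt %d of %d failed (exit code %d)" % (attempt, attempts, result))
        if attempt < attempts:
            time.sleep(delay)   # wait before retrying; delay doubles each time
            delay *= 2
    return False

if __name__ == "__main__":
    src, dst = sys.argv[1], sys.argv[2]
    sys.exit(0 if copy_with_retries(src, dst) else 1)

Whether three attempts over a few minutes is enough to ride out a brief disk server outage is exactly the sort of thing the proposed job recovery tests at the volunteer sites could tell us.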
Brian: Are there any T2s interested in seeing if we can get job recovery working again? (GLASGOW volunteered, as did MANCHESTER and QMUL.) And do we know how CMS is affected by the same issues? (No CMS sites are present in this meeting.) (We also need to be aware of LHCb's nascent reprocessing-at-T2s project - happening at Manchester, so we need to pay attention to the tests they are doing. At the moment, Alessandra thinks they're stuck with problems with the software.)

--

Item 4: SC11 attendance? Sam isn't going. Brian will go where Jens instructs. Chris wasn't planning to, but could be persuaded. (It is only 3 months away and in Seattle, so we guess this means "no" at this point.)

--

Item 6 (AOB):
Brian - is in conversation with people at SLAC about asymmetry in transfer routes. He may be asking people to divulge their kernel settings.
Sam - is installing WebDAV and NFSv4.1 support on DPM and wants some testing.
Govind - is worried that his DPM database is not shrinking any more, now that he's down to 3 months of records kept. Govind is still having his problem with FTS performance - timeouts quite often. Some of the DPM pool nodes are on 1.7.1; could this be causing the problem? Transfer preparation time is very large - 80 to 90 seconds. What's the LAN between the head node and the storage servers like? Brian will have a look to see if there's a correlation between the RHUL sources and the large preparation time, and try to see if there's a connection with the 1.7.1 disk servers. Govind will update them today in any case?

On 3 August 2011 06:41, Jens Jensen wrote:
> Dear all,
>
> I put together a quick agenda for today: http://storage.esc.rl.ac.uk/weekly/
>
> We should also talk about Elena's ls -l error? I'm sure we have seen those "reading token data header" errors but I can't remember if we know why they occur...
>
> I am going to try to connect, but I am also on a hotel network which is mostly OK but failed last night when I tried to upload an agenda, so I've made the items a little more self-explanatory than usual (hopefully) in case I have trouble connecting or similar.
>
> Incidentally, my more-than-storage OGF feedback session is currently scheduled for the 17th.
>
> Cheers
> --jens
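A follow-up note on Govind's DPM database item above: trimming the request history to 3 months deletes rows, but MySQL only hands the space back to the filesystem if the tables are rebuilt (OPTIMIZE TABLE), and with a shared InnoDB ibdata1 file the file never shrinks at all. Below is a small sketch to check how much reclaimable space the request tables are holding; the table names are from memory of the dpm_db schema and the credentials are placeholders, so both may need adjusting for Govind's version.

#!/usr/bin/env python
"""Check reclaimable space in the DPM request-history tables.

Sketch only, under the assumptions stated above: DATA_FREE in
information_schema shows how much space a table rebuild could return.
"""
import MySQLdb

# Placeholders - use the real head-node credentials (e.g. from DPMCONFIG).
db = MySQLdb.connect(host="localhost", user="dpmmgr", passwd="secret", db="dpm_db")
cur = db.cursor()

# Assumed request-history table names; adjust to what SHOW TABLES reports.
tables = ["dpm_req", "dpm_get_filereq", "dpm_put_filereq", "dpm_copy_filereq"]

for table in tables:
    # DATA_FREE is the space MySQL could give back if the table were rebuilt.
    cur.execute(
        "SELECT TABLE_ROWS, DATA_LENGTH, DATA_FREE FROM information_schema.TABLES "
        "WHERE TABLE_SCHEMA = 'dpm_db' AND TABLE_NAME = %s", (table,))
    row = cur.fetchone()
    if row:
        rows, length, free = row
        print("%-20s rows=%-10s data=%-12s reclaimable=%s" % (table, rows, length, free))
        # Uncomment to actually rebuild the table (note: it locks the table while running):
        # cur.execute("OPTIMIZE TABLE %s" % table)

db.close()

If the large "reclaimable" figures sit in the shared ibdata1 tablespace rather than per-table files, the disk footprint won't shrink whatever we do to the rows, which may be what Govind is seeing.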