Attending: Brian, Elena, Gareth, John B, John H, Matt, Jens, Jeremy, Chris B, Sam, Ewan, David, Chris W. Apologies: Wahid.

There is a GridPP meeting coming up. Should we have an associated storage meeting (and, potentially, an associated storage sponsor)? No firm conclusion, but given it's at IC, it should be easy enough to extend the event if desired. The topic is GridPP5. Coincidentally, the discussions about "Big Data" (for lack of a better expression) are partly about whether our experiences with BD (not bande dessinée) are of interest, and in particular whether we can argue that we have an impact. There is also a new WLCG TDR, currently in draft form (see chat for link). It looks ahead, as GridPP5 must. In connection with this it is perhaps useful to look at our original vision and mission statement again: we are aiming to make GridPP (UK++) not just a participant in WLCG but a pretty darn good one, which I think we have achieved, thanks to the people who contribute to this group.

Apropos impact, industry big data vendors are looking to promote their products; in our context they are perhaps more interested in collaborations and shared experience than in selling things directly to us. Migrating GridPP user code onto an Industry Big Data Buzzword Compliant Product 5000(C) would be mad, but maybe we should investigate similar uses of our own stuff, or contribute lessons learned. Our philosophy tends more towards open source, and certainly towards not being locked in. There is also the EGI project called MAPPER: https://www.egi.eu/community/collaborations/MAPPER.html. Sam argues that most of the "disruptive" (sorry!) technology change happened during GridPP4, as with (say) HDFS. We were also able to evaluate things like WOS.

Regarding dark data checking (see also chat): Brian reports that ATLAS have changed their DD checking. Dumps on the catalogue end are now space-token specific, whereas the syncat covers the whole SE. We need a generic tool which can query any SE (the dump-comparison step is sketched after these notes). Chris W is talking to Frank Michel from biomed; this could build on the ATLAS work. Sam points out that SEs aren't built to dump their whole contents' metadata on stdout (or the web-services equivalent thereof). Different checks can be enabled at different levels, as we have discussed before. Slow background checks would be useful if they could be done without putting additional transfer load onto the networks. Or the disks.

QMUL found 15K files without checksums - possibly failed incoming transfers, or older files from before the Age of the Checksum. What does lcg-cr do? Presumably it sets the checksum in the LFC, or does it need to be told? Sam will check. There's always the question of checksumming in flight (also sketched below): as storage nodes have plenty of cores, it is usually quite feasible to checksum in flight without incurring noticeable overhead. Sam has a checksum-checking tool for DPM, but it is considered an administrative tool rather than one which runs autonomously in the background. Also, what should the storage system do when it finds a checksum problem? It may be useful to recover automatically (fail over to another replica) at access time, but it should also report the problem in case something needs fixing. There is a catch, though: if we checksum the file in flight we haven't detected the error until it has already been transferred, and if instead the file is checksummed before it is transferred, the file access is delayed, which will probably not be acceptable.

There was an MPI/ActiveMQ synchronisation effort started by J.-P., but he was then moved on to other things and the project seems to have been orphaned.
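The dump-comparison step referred to above is essentially a set difference between what the SE holds and what the catalogue thinks it holds. A minimal sketch in Python, assuming both dumps have already been reduced to plain text with one SURL or path per line (real DPM/LFC dumps would need normalising first, and all file names here are purely illustrative):

#!/usr/bin/env python
# Hypothetical sketch: compare an SE dump with a catalogue dump to find
# dark data (on the SE but not in the catalogue) and lost files (in the
# catalogue but with no replica on the SE). The input format is an
# assumption: plain text, one entry per line.
import sys

def load_entries(path):
    """Read one entry per line, ignoring blank lines."""
    with open(path) as f:
        return set(line.strip() for line in f if line.strip())

def main(se_dump, catalogue_dump):
    on_se = load_entries(se_dump)
    in_catalogue = load_entries(catalogue_dump)
    dark = sorted(on_se - in_catalogue)   # present on disk, unknown to the catalogue
    lost = sorted(in_catalogue - on_se)   # registered, but no replica found
    print("dark data entries: %d" % len(dark))
    for entry in dark:
        print("DARK %s" % entry)
    print("lost file entries: %d" % len(lost))
    for entry in lost:
        print("LOST %s" % entry)

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])

The hard part in practice is producing the two dumps consistently and at roughly the same time, not the comparison itself; as noted above, SEs are not really built to enumerate everything they hold.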
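On checksumming in flight, the point is simply that the checksum is accumulated over the data as it streams past, so no separate read pass is needed afterwards. A minimal illustration of the idea using adler32 (the checksum the grid tools generally use); this is not the gridftp implementation, just the shape of it:

#!/usr/bin/env python
# Sketch of checksumming "in flight": accumulate an adler32 checksum while
# copying the stream, instead of re-reading the file afterwards.
# Not the gridftp implementation - just an illustration of the idea.
import zlib

def copy_with_adler32(src, dst, blocksize=1 << 20):
    """Copy the src file object to dst, returning the adler32 of the data."""
    checksum = 1  # adler32 starting value
    while True:
        block = src.read(blocksize)
        if not block:
            break
        dst.write(block)
        checksum = zlib.adler32(block, checksum)
    return checksum & 0xffffffff  # keep it unsigned 32-bit

# Illustrative use (paths are made up):
# with open("incoming.dat", "rb") as s, open("/pool/incoming.dat", "wb") as d:
#     print("%08x" % copy_with_adler32(s, d))

As the chat below notes, this only guarantees that the copy received over the wire was good; verifying what actually hit the disk still requires reading it back.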
Chat log:

[10:02:07] Jens Jensen http://www.gridpp.ac.uk/gridpp31/
[10:09:08] Jeremy Coles http://indico.cern.ch/conferenceDisplay.py?confId=197803
[10:12:04] Jeremy Coles The WLCG 'TDR' timeline is for the period LS1 to LS2. Looks at the computing model evolutions. Questions/differences have (I believe) already arisen in discussing early drafts/input.
[10:17:14] Ewan Mac Mahon Sure - could we get someone selling DPM-in-a-box boxes, for example?
[10:21:10] Sam Skipsey sure, but Lustre is GPLv2 licensed!
[10:21:53] Jens Jensen Like Oracle...
[10:21:58] Ewan Mac Mahon Indeed, but the same argument might apply - a Dell or an Alces might be interested in selling services based on DPM etc.
[10:22:08] Ewan Mac Mahon But DDN wouldn't.
[10:22:11] Sam Skipsey Of course.
[10:22:32] Sam Skipsey Because DDN make their own software, while Dell don't.
[10:22:54] Ewan Mac Mahon Maybe Will would be interested in AlcesSTOR and AlcesTRANSFER boxes
[10:23:03] Sam Skipsey I did try to have a conversation with DDN about WoS->LFC/SRM bridges
[10:28:52] Ewan Mac Mahon There is a real problem with some possible academic collaborators, that, unlike us, some of them do very much like to spend a lot of money on expensive shiny things rather than getting their hands dirty with something that actually works better.
[10:29:14] Sam Skipsey My understanding of MAPPER was that they talked to us (GridPP), but depended on an entirely different CE middleware.
[10:29:59] Jeremy Coles I think that is correct. But it is still worth looking at what they proposed for the storage side.
[10:30:27] Ewan Mac Mahon It's always interesting to talk to interesting people, IMO; it's just not always immediately useful, but sometimes it pays off longer term.
[10:30:35] Sam Skipsey Other than that the CEs expected a cluster-wide shared filesystem, I don't know either, Jeremy
[10:36:20] Ewan Mac Mahon I think the key thing with the performance is to not hammer the SE; if the checker is slow but doesn't overload things, that could certainly work well.
[10:36:37] Ewan Mac Mahon It's if it's slow because it's thrashing something that it's a concern.
[10:37:43] Jens Jensen We had some thoughts about all the possible failure modes ... somewhere in the wiki
[10:37:44] Sam Skipsey Ewan: the problem is mostly that SRM (and LFC) are not designed for "tell me everything that is on this SE"
[10:47:38] Ewan Mac Mahon That's certainly true - we're certainly not short of idle CPUs on our disk servers.
[10:47:56] Ewan Mac Mahon (not sure that helps though)
[10:50:10] Jens Jensen https://www.gridpp.ac.uk/wiki/File_Integrity_Testing
[10:53:29] Ewan Mac Mahon Indeed. Something that just walks the namespace checksumming as it goes doesn't sound too hard/bad/scary.
[10:55:33] Ewan Mac Mahon OK; but I'm pretty sure I can read a file off disk three times in less time than it takes to shunt it over the WAN.
[10:55:58] Sam Skipsey I was fairly sure that gridftp could do rolling checksums on streams.
[10:56:22] Jeremy Coles Sorry I need to leave now. Thanks. Bye.
[10:56:27] Ewan Mac Mahon Indeed, but that guarantees that you received a good copy; not that it hit disk.
[10:56:45] Ewan Mac Mahon That really does require reading it back in.
[10:56:48] Sam Skipsey Sure, but it does mean you've not incurred any cost for one of your three potential checksums.
[10:58:28] Ewan Mac Mahon That CPU cost is especially small for those of us with many gridftp servers.
[10:58:41] Ewan Mac Mahon Because they're then multiple checksumming servers too.
[10:59:27] David Crooks Sorry, I need to drop out for another meeting; cheers.
[10:59:56] Jens Jensen We need to wrap up anyway...
[11:00:15] Ewan Mac Mahon Anyone worrying about individual file latency in an FTS transfer is Doing It Wrong.
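Following on from the "walk the namespace, checksumming as it goes" idea in the chat, a minimal sketch of a slow background scanner that throttles itself so it does not hammer the SE. The root path, the pause between files and the plain-text reporting are all placeholder assumptions; a real tool would also need to compare the results against the checksums recorded in the SE database or catalogue.

#!/usr/bin/env python
# Sketch of a slow background checksum scanner: walk a disk server's
# namespace, adler32 each file, and pause between files so the scan does
# not compete with production I/O. Paths and rates are illustrative only.
import os
import time
import zlib

def adler32_of_file(path, blocksize=1 << 20):
    """Return the adler32 of a file as an 8-digit hex string."""
    checksum = 1
    with open(path, "rb") as f:
        while True:
            block = f.read(blocksize)
            if not block:
                break
            checksum = zlib.adler32(block, checksum)
    return "%08x" % (checksum & 0xffffffff)

def scan(root, pause_seconds=5.0):
    """Walk root, printing one checksum per file, sleeping between files."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                print(adler32_of_file(path), path)
            except (IOError, OSError) as err:
                print("ERROR", path, err)
            time.sleep(pause_seconds)  # crude throttle; a bytes/s limit would be better

if __name__ == "__main__":
    scan("/gridstorage")  # placeholder pool filesystem root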