Minutes of the storage EVO meeting, 4 May 2011

Present:
  Glasgow: Sam, David
  Sheffield: Elena
  Lancaster: Matt
  Liverpool: Stephen, John, Rob
  Edinburgh: Wahid
  Oxford: Ewan
  RHUL: Govind
  RAL T1: Brian, Jens (chair + minutes)
Apologies:
  QMUL: Chris

1. Final pass (hopefully) through the proposed metrics

The old (GridPP3) metrics were:
* Space tokens defined correctly - OK in principle, but ATLAS-centric (for T2s), and even ATLAS are not especially keen.
* Transfer rate happiness - OK; a non-targeted metric, so always green :-)
* WLCG experiments happy (work on relevant problems) - originally based on "reports" from the experiments. Generally OK, except it is perhaps too WLCG-only; it should allow for non-WLCG experiments.
* Transfer success rate - OK.
* Deletion rate - this one can no longer fail, so it is not useful.
* Blog posts - useful. Should we increase the target? We are currently meeting the target rate of 6/Q, but only just.
* AHM/CHEP papers - target 2/Y, OK, but should recognise other conferences.
* Engage with WLCG (talks, meetings) - 2/Q talks, 2/Q meetings. Generally OK, but should also recognise non-meeting "meetings", e.g. email discussions with an agenda and an outcome. It also needs to recognise non-WLCG stakeholders, e.g. EMI, EGI, NGS, ...

In GridPP4, the current proposal is to have eight metrics:
* Time to react to requests for changes in space tokens - too space-token-centric, but timing the reaction to changes is a good idea; see the discussion below.
* Data transfer rates from T2s satisfy experiment requirements - again OK-ish, but lacking a target.
* Information providers publishing consistently - this does not seem like a measurable target, but perhaps something we should re-evaluate every so often.
* Conference paper produced each year - yes; agreed to increase to 4, but this should include other conferences. Note that writing papers takes a long time and is not a core service.
* Engage with storage and data management experts within WLCG (talks given, meetings, blog posts) - this metric should not include blog posts; see also the note above. 4/Q is probably about right.
* Work with experiment data management experts on relevant problems for LHC experiments - OK, but this relies on feedback from the experiments, or on our perception of their happiness. It makes sense to have a time-to-resolve metric (and if there is nothing to resolve, we have met it!).

We discussed whether to use GGUS tickets relating to storage - should we even assign tickets to the storage group? In the end we more or less converged on time-to-respond/time-to-resolve for incidents reported to the mailing list, measured from the time they were reported. We can also measure time-to-implement for proposed changes, which could include changing space tokens, or upgrading GridPP, or something similar. Targets should then take into account whether it is a small change (say 1 week) or a large one (say 6 weeks); a minimal sketch of such a measurement follows below. We could ask the PMB if they are interested in such a metric.

Can the "blog" items include other "publications", e.g. publishing on the wiki or on the list? The blog does have a high readership, and people can track it with RSS or a planet aggregator. Perhaps if you publish something noteworthy on the wiki you can blog about it - Sam also does this when the toolkit is updated. On the other hand, writing a good blog post generally takes time.

Should there be a measure of "usability" (of networks), "availability" (of storage/sites), and/or "efficiency", and if so, how? The current metric of meeting experiments' transfer rates is a bit like this.
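Returning to the time-to-respond/time-to-resolve idea above: the following is a minimal sketch of how such a metric could be tallied against the notional 1-week (small change) and 6-week (large change) targets. The incident records, field names and dates are made up for illustration; nothing here is an agreed implementation.

    # Sketch: compute time-to-resolve for incidents reported to the storage list
    # and check them against notional targets (1 week for small changes,
    # 6 weeks for large ones). All records below are hypothetical examples.
    from datetime import datetime, timedelta

    TARGETS = {"small": timedelta(weeks=1), "large": timedelta(weeks=6)}

    incidents = [
        # (description, size, reported, resolved) - made-up examples
        ("space token change at a T2", "small", "2011-04-01", "2011-04-05"),
        ("SE upgrade",                 "large", "2011-03-01", "2011-04-20"),
    ]

    def met_target(size, reported, resolved):
        """Return (elapsed time, whether the target for this size was met)."""
        fmt = "%Y-%m-%d"
        elapsed = datetime.strptime(resolved, fmt) - datetime.strptime(reported, fmt)
        return elapsed, elapsed <= TARGETS[size]

    for desc, size, reported, resolved in incidents:
        elapsed, ok = met_target(size, reported, resolved)
        print(f"{desc}: {elapsed.days} days, target {'met' if ok else 'missed'}")

If there are no incidents in a quarter, the metric is trivially met, as noted above.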
We have a whole separate task of improving the kind of reporting we get from FTS (I don't know if you can read this from outside RAL's firewall: http://lcgwww.gridpp.rl.ac.uk/cgi-bin/fts-mon/fts-mon.pl).

Can the quarterlies be published to the group? The main report is a spreadsheet which is available on the GridPP web site (somewhere) anyway. Here they are: http://www.gridpp.ac.uk/pmb/ProjectManagement/QuarterlyReports/reports.html

2. I also quite wanted a quick pass through HEPiX - we can volunteer someone who actually went to give a summary next week. There is already some material we may want to look at: http://indico.cern.ch/contributionListDisplay.py/filter?confId=118192

Jens will volunteer Ian or Martin from the T1 to give a report.

* The ASGC site report mentioned Ceph and pNFS: http://indico.cern.ch/contributionDisplay.py?contribId=51&sessionId=1&confId=118192
* There is a talk on an evaluation of distributed filesystems: http://indico.cern.ch/contributionDisplay.py?contribId=54&sessionId=6&confId=118192
* There is a talk on GlusterFS, a Lustre replacement(?): http://indico.cern.ch/contributionDisplay.py?contribId=22&sessionId=6&confId=118192
* There is a European Lustre consortium (is this the same as the other one?) - no slides yet: http://indico.cern.ch/contributionDisplay.py?contribId=4&sessionId=6&confId=118192
* CVMFS - no slides yet: http://indico.cern.ch/contributionDisplay.py?contribId=41&sessionId=5&confId=118192

3. Summary of cloud storage planning?

Jens will start a wiki page. The OGF had looked into this: deploying grids in the clouds, interoperation, and the similarity between grid and cloud storage protocols. The GridPP4 proposed cloud activities are not necessarily storage related; however, there may be room to do some storage work, and Stuart should be doing some. Some cloud resources are already available for free: there is one at Glasgow, ECDF has one, NGS has one (is it free?), and the T1 has a Hadoop install (not an IaaS cloud but PaaS). Jens will kickstart a plan on the wiki.

4. And related to that, the use case for HDFS in GridPP - what are we aiming to do?

HDFS gives you: (a) use of "idle" disk, (b) replication, (c) striping, (d) rebalancing.

I think we have now agreed that there are use cases for Hadoop, but the original primary use case of making use of storage on worker nodes (as with resilient dCache) is probably no longer needed. If a WN is born with a 1 TB drive and half of that goes to the OS, experiment software and so on, that leaves the other half; but the node may have 16 cores busy doing work, including using the network, so distributing files over the WN disks (where a file will rarely be on the host that needs it) would only put more pressure on the LAN - pressure already increased by the many cores of the WN. A back-of-envelope sketch of this effect follows below. We should keep watching and playing with Hadoop. There may be secondary use cases, such as deploying it behind BeStMan or StoRM (e.g. for a T3?), building a distributed filesystem on dedicated storage nodes (as OSG seem to have done), and running MapReduce. Simon at Bristol is interested in Hadoop; Brian went to the London workshop, and James has, as was announced last week, a full Hadoop install at the T1.
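As a back-of-envelope illustration of why reads from WN-hosted storage would rarely be local (a hypothetical sketch; the node count and replication factor are illustrative numbers, not measurements from any site, and it assumes blocks are placed uniformly at random and jobs land independently of where their data lives):

    # Sketch: fraction of HDFS block reads expected to be local to the reading
    # worker node, assuming uniform random block placement over the cluster and
    # job placement independent of data location. Numbers are illustrative only.

    def local_read_fraction(num_nodes: int, replication: int) -> float:
        """Probability that at least one replica of a block sits on the reading node."""
        return min(1.0, replication / num_nodes)

    nodes = 100          # hypothetical number of worker nodes in the farm
    replication = 3      # default HDFS replication factor

    frac = local_read_fraction(nodes, replication)
    print(f"~{frac:.0%} of reads local; ~{1 - frac:.0%} cross the LAN")
    # With 100 nodes and 3 replicas, roughly 97% of reads cross the LAN,
    # which is the extra network pressure noted above.

(Hadoop's MapReduce scheduler normally tries to place tasks near their data, which improves this picture; the point here is only that storage access from arbitrary grid jobs would mostly be remote.)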
The rest of the items were postponed till next week:

5. Small VOs - do we need to pick on someone again?
6. Technology review revisited - proposals
7. AOB

Open actions (are these still open?):

  401  02/06/2010  Clean up the wiki                                         ALL   Low   Open
  427  19/01/2011  Send Areca stuff to Sam                                   Ewan  High  Open
  430  19/01/2011  Test the rest of Chris's StoRM (cf 425)                   ALL   Med   Open
  431  27/04/2011  Contact NeISS                                             Sam   Med   Open
  432  27/04/2011  Find out who is currently pheno and get in touch          Jens  Med   Open
  433  27/04/2011  Contact T2K and see if talking to us a month ago helped   Jens  Med   Open

Chat log:

[09:47:27] Sam Skipsey brb
[09:59:06] John Bland joined
[10:00:48] Rob Fay joined
[10:00:55] Wahid Bhimji joined
[10:01:10] Ewan Mac Mahon joined
[10:01:14] Stephen Jones joined
[10:01:50] David Crooks joined
[10:03:35] Govind Songara joined
[10:07:09] Wahid Bhimji keep
[10:08:51] Ewan Mac Mahon left
[10:08:52] Wahid Bhimji drop
[10:08:57] Ewan Mac Mahon joined
[10:11:44] Brian Davies joined
[10:21:05] Ewan Mac Mahon And the periodic face to face meeting.
[10:21:54] Jens Jensen http://indico.cern.ch/contributionListDisplay.py/filter?confId=118192#contributions
[10:23:18] Jens Jensen http://indico.cern.ch/getFile.py/access?contribId=39&sessionId=6&resId=0&materialId=slides&confId=118192
[10:23:21] Jens Jensen Lustre
[10:27:43] Wahid Bhimji There is a setup at ECDF too
[10:27:46] Wahid Bhimji you could use that
[10:27:52] Wahid Bhimji contact Steve
[10:30:01] John Bland gotta go.
[10:30:03] John Bland left
[10:38:18] Ewan Mac Mahon I think StoRM over Lustre would be the massive favourite.
[10:39:02] Ewan Mac Mahon Lustre's also quite similar to DPM in a lot of ways, so it suits the same sort of hardware.
[10:39:11] Wahid Bhimji sorry to kick off a tangent discussion
[10:40:24] Elena Korolkova left
[10:43:52] Wahid Bhimji fair point
[10:44:30] Wahid Bhimji we need to subtract those things somehow
[10:47:44] Ewan Mac Mahon It's a bit woolly as to what counts as an 'incident'.
[10:48:12] Ewan Mac Mahon Not least because if someone asks a pre-emptive question and avoids trouble, that should count as a win.
[10:54:15] Stephen Jones left
[10:54:48] Wahid Bhimji Yeah, so I think we should cc direct correspondence back to the list (if storage related)
[10:56:42] Wahid Bhimji * Data transfer rates from T2s satisfy experiment requirements.
[10:56:49] Wahid Bhimji is what Jens put
[10:57:28] Wahid Bhimji but we should be a bit flexible about what the actual number is we put into that as we change from per file to aggregate or whatever
[10:58:39] Wahid Bhimji I think we should also have some cpu efficiency measure - that we look at - so more like management information
[10:58:48] Wahid Bhimji rather than performance information
[10:58:52] Ewan Mac Mahon Indeed. Do we have a concept of 'critical' and 'non-critical' metrics?
[10:58:57] Ewan Mac Mahon Like we do SAM tests?
[10:59:24] Jens Jensen yes this sounds sensible
[11:00:09] Jens Jensen we don't have critical vs non-critical metrics but we do have the concept of "nearly met"
[11:02:30] Wahid Bhimji should we look at the metrics within the group meeting as well
[11:04:14] Ewan Mac Mahon We should have a metric for the total running number of metrics.
[11:04:24] Ewan Mac Mahon 10% year on year increase in metrics.
[11:05:47] Jens Jensen could do -
[11:05:55] Wahid Bhimji thanks
[11:06:05] Brian Davies left
[11:06:06] David Crooks left
[11:06:09] Sam Skipsey left
[11:06:09] Rob Fay left
[11:06:12] Govind Songara left
[11:06:14] Ewan Mac Mahon left
[11:06:14] Matthew Doidge left