Minutes of the storage EVO meeting, 4 May 2011

Present:
  Glasgow: Sam, David
  Sheffield: Elena
  Lancaster: Matt
  Liverpool: Stephen, John, Rob
  Edinburgh: Wahid
  Oxford: Ewan
  RHUL: Govind
  RAL T1: Brian, Jens (chair + minutes)
Apologies:
  QMUL: Chris

1. Final pass (hopefully) through the proposed metrics

The old (GridPP3) metrics were:
* Space tokens defined correctly - OK in principle, but ATLAS-centric (for T2s), and even ATLAS are not especially keen.
* Transfer rate happiness - OK; a non-targeted metric, so always green :-)
* WLCG experiments happy (work on relevant problems) - originally based on "reports" from the experiments. Generally OK, except it is perhaps too WLCG-only; it should allow for non-WLCG experiments.
* Transfer success rate - OK.
* Deletion rate - this one can no longer fail, so it is not useful.
* Blog posts - useful. Should we increase the target? We are currently meeting the target rate of 6/Q, but only just.
* AHM/CHEP papers - target 2/Y, OK, but should recognise other conferences.
* Engage with WLCG (talks, meetings) - 2/Q talks, 2/Q meetings. Generally OK, but should also recognise non-meeting "meetings", e.g. email discussions with an agenda and an outcome. It also needs to recognise non-WLCG stakeholders, e.g. EMI, EGI, NGS, ...

In GridPP4, the current proposal is to have eight metrics:
* Time to react to requests for changes in space tokens - too space-token-centric, but timing the reaction to changes is a good idea; see the discussion below.
* Data transfer rates from T2s satisfy experiment requirements - again OK-ish, but lacking a target.
* Information providers publishing consistently - this does not seem like a measurable target, but perhaps something we should re-evaluate every so often.
* Conference paper produced each year - yes; agreed to increase to 4, but this should include other conferences. Note that writing papers takes a long time and is not a core service.
* Engage with storage and data management experts within WLCG (talks given, meetings, blog posts) - this metric should not include blog posts; see also the note above. 4/Q is probably about right.
* Work with experiment data management experts on relevant problems for LHC experiments - OK, but this relies on feedback from the experiments, or on our perception of their happiness. It makes sense to have a time-to-resolve metric (and if there is nothing to resolve, we have met it!).

We discussed whether to use GGUS tickets relating to storage - should we even assign tickets to the storage group? In the end we more or less converged on time-to-respond/time-to-resolve for incidents reported to the mailing list, measured from the time they were reported. We can also measure time-to-implement for proposed changes, which could include changing space tokens, or upgrading GridPP, or something similar. Targets should then take into account whether it is a small change (say 1 week) or a large one (say 6 weeks); a minimal sketch of such a measurement follows below. We could ask the PMB if they are interested in such a metric.

Can the "blog" items include other "publications", e.g. publishing on the wiki or on the list? The blog does have a high readership, and people can track it with RSS or a planet aggregator. Perhaps if you publish something noteworthy on the wiki you can blog about it - Sam also does this when the toolkit is updated. On the other hand, writing a good blog post generally takes time.

Should there be a measure of "usability" (of networks), "availability" (of storage/sites), and/or "efficiency", and if so, how? The current metric of meeting experiments' transfer rates is a bit like this.
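Returning to the time-to-respond/time-to-resolve idea above: the following is a minimal sketch of how such a metric could be tallied against the notional 1-week (small change) and 6-week (large change) targets. The incident records, field names and dates are made up for illustration; nothing here is an agreed implementation.

    # Sketch: compute time-to-resolve for incidents reported to the storage list
    # and check them against notional targets (1 week for small changes,
    # 6 weeks for large ones). All records below are hypothetical examples.
    from datetime import datetime, timedelta

    TARGETS = {"small": timedelta(weeks=1), "large": timedelta(weeks=6)}

    incidents = [
        # (description, size, reported, resolved) - made-up examples
        ("space token change at a T2", "small", "2011-04-01", "2011-04-05"),
        ("SE upgrade",                 "large", "2011-03-01", "2011-04-20"),
    ]

    def met_target(size, reported, resolved):
        """Return (elapsed time, whether the target for this size was met)."""
        fmt = "%Y-%m-%d"
        elapsed = datetime.strptime(resolved, fmt) - datetime.strptime(reported, fmt)
        return elapsed, elapsed <= TARGETS[size]

    for desc, size, reported, resolved in incidents:
        elapsed, ok = met_target(size, reported, resolved)
        print(f"{desc}: {elapsed.days} days, target {'met' if ok else 'missed'}")

If there are no incidents in a quarter, the metric is trivially met, as noted above.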
We have a whole separate task of improving the kind of reporting we get from FTS (I don't know if you can read this from outside RAL's firewall: http://lcgwww.gridpp.rl.ac.uk/cgi-bin/fts-mon/fts-mon.pl).

Can the quarterlies be published to the group? The main report is a spreadsheet which is available on the GridPP web site (somewhere) anyway. Here they are: http://www.gridpp.ac.uk/pmb/ProjectManagement/QuarterlyReports/reports.html

2. I also quite wanted a quick pass through HEPiX - we can volunteer someone who actually went to give a summary next week. There is already some material we may want to look at: http://indico.cern.ch/contributionListDisplay.py/filter?confId=118192

Jens will volunteer Ian or Martin from the T1 to give a report.

* The ASGC site report mentioned Ceph and pNFS: http://indico.cern.ch/contributionDisplay.py?contribId=51&sessionId=1&confId=118192
* There is a talk on an evaluation of distributed filesystems: http://indico.cern.ch/contributionDisplay.py?contribId=54&sessionId=6&confId=118192
* There is a talk on GlusterFS, a Lustre replacement(?): http://indico.cern.ch/contributionDisplay.py?contribId=22&sessionId=6&confId=118192
* There is a European Lustre consortium (is this the same as the other one?) - no slides yet: http://indico.cern.ch/contributionDisplay.py?contribId=4&sessionId=6&confId=118192
* CVMFS - no slides yet: http://indico.cern.ch/contributionDisplay.py?contribId=41&sessionId=5&confId=118192

3. Summary of cloud storage planning?

Jens will start a wiki page. The OGF had looked into this: deploying grids in the clouds, interoperation, and the similarity between grid and cloud storage protocols. The GridPP4 proposed cloud activities are not necessarily storage related; however, there may be room to do some storage work, and Stuart should be doing some. Some cloud resources are already available for free: there is one at Glasgow, ECDF has one, NGS has one (is it free?), and the T1 has a Hadoop install (not an IaaS cloud but PaaS). Jens will kickstart a plan on the wiki.

4. And related to that, the use case for HDFS in GridPP - what are we aiming to do?

HDFS gives you: (a) use of "idle" disk, (b) replication, (c) striping, (d) rebalancing.

I think we have now agreed that there are use cases for Hadoop, but the original primary use case of making use of storage on worker nodes (as with resilient dCache) is probably no longer needed. If a WN is born with a 1 TB drive and half of that goes to the OS, experiment software and so on, that leaves the other half; but the node may have 16 cores busy doing work, including using the network, so distributing files over the WN disks (where a file will rarely be on the host that needs it) would only put more pressure on the LAN - pressure already increased by the many cores of the WN. A back-of-envelope sketch of this effect follows below. We should keep watching and playing with Hadoop. There may be secondary use cases, such as deploying it behind BeStMan or StoRM (e.g. for a T3?), building a distributed filesystem on dedicated storage nodes (as OSG seem to have done), and running MapReduce. Simon at Bristol is interested in Hadoop; Brian went to the London workshop, and James has, as was announced last week, a full Hadoop install at the T1.
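As a back-of-envelope illustration of why reads from WN-hosted storage would rarely be local (a hypothetical sketch; the node count and replication factor are illustrative numbers, not measurements from any site, and it assumes blocks are placed uniformly at random and jobs land independently of where their data lives):

    # Sketch: fraction of HDFS block reads expected to be local to the reading
    # worker node, assuming uniform random block placement over the cluster and
    # job placement independent of data location. Numbers are illustrative only.

    def local_read_fraction(num_nodes: int, replication: int) -> float:
        """Probability that at least one replica of a block sits on the reading node."""
        return min(1.0, replication / num_nodes)

    nodes = 100          # hypothetical number of worker nodes in the farm
    replication = 3      # default HDFS replication factor

    frac = local_read_fraction(nodes, replication)
    print(f"~{frac:.0%} of reads local; ~{1 - frac:.0%} cross the LAN")
    # With 100 nodes and 3 replicas, roughly 97% of reads cross the LAN,
    # which is the extra network pressure noted above.

(Hadoop's MapReduce scheduler normally tries to place tasks near their data, which improves this picture; the point here is only that storage access from arbitrary grid jobs would mostly be remote.)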
The rest of the items were postponed till next week:

5. Small VOs - do we need to pick on someone again?
6. Technology review revisited - proposals
7. AOB

Open actions (are these still open?):

  401  02/06/2010  Clean up the wiki                                         ALL   Low   Open
  427  19/01/2011  Send Areca stuff to Sam                                   Ewan  High  Open
  430  19/01/2011  Test the rest of Chris's StoRM (cf 425)                   ALL   Med   Open
  431  27/04/2011  Contact NeISS                                             Sam   Med   Open
  432  27/04/2011  Find out who is currently pheno and get in touch          Jens  Med   Open
  433  27/04/2011  Contact T2K and see if talking to us a month ago helped   Jens  Med   Open

Chat log:

[09:47:27] Sam Skipsey brb
[09:59:06] John Bland joined
[10:00:48] Rob Fay joined
[10:00:55] Wahid Bhimji joined
[10:01:10] Ewan Mac Mahon joined
[10:01:14] Stephen Jones joined
[10:01:50] David Crooks joined
[10:03:35] Govind Songara joined
[10:07:09] Wahid Bhimji keep
[10:08:51] Ewan Mac Mahon left
[10:08:52] Wahid Bhimji drop
[10:08:57] Ewan Mac Mahon joined
[10:11:44] Brian Davies joined
[10:21:05] Ewan Mac Mahon And the periodic face to face meeting.
[10:21:54] Jens Jensen http://indico.cern.ch/contributionListDisplay.py/filter?confId=118192#contributions
[10:23:18] Jens Jensen http://indico.cern.ch/getFile.py/access?contribId=39&sessionId=6&resId=0&materialId=slides&confId=118192
[10:23:21] Jens Jensen Lustre
[10:27:43] Wahid Bhimji There is a setup at ECDF too
[10:27:46] Wahid Bhimji you could use that
[10:27:52] Wahid Bhimji contact Steve
[10:30:01] John Bland gotta go.
[10:30:03] John Bland left
[10:38:18] Ewan Mac Mahon I think StoRM over Lustre would be the massive favourite.
[10:39:02] Ewan Mac Mahon Lustre's also quite similar to DPM in a lot of ways, so it suits the same sort of hardware.
[10:39:11] Wahid Bhimji sorry to kick off a tangent discussion
[10:40:24] Elena Korolkova left
[10:43:52] Wahid Bhimji fair point
[10:44:30] Wahid Bhimji we need to subtract those things somehow
[10:47:44] Ewan Mac Mahon It's a bit woolly as to what counts as an 'incident'.
[10:48:12] Ewan Mac Mahon Not least because if someone asks a pre-emptive question and avoids trouble, that should count as a win.
[10:54:15] Stephen Jones left
[10:54:48] Wahid Bhimji Yeah, so I think we should cc direct correspondence back to the list (if storage related)
[10:56:42] Wahid Bhimji * Data transfer rates from T2s satisfy experiment requirements.
[10:56:49] Wahid Bhimji is what Jens put
[10:57:28] Wahid Bhimji but we should be a bit flexible about what the actual number is we put into that as we change from per file to aggregate or whatever
[10:58:39] Wahid Bhimji I think we should also have some cpu efficiency measure - that we look at - so more like management information
[10:58:48] Wahid Bhimji rather than performance information
[10:58:52] Ewan Mac Mahon Indeed. Do we have a concept of 'critical' and 'non-critical' metrics?
[10:58:57] Ewan Mac Mahon Like we do SAM tests?
[10:59:24] Jens Jensen yes this sounds sensible
[11:00:09] Jens Jensen we don't have critical vs non-critical metrics but we do have the concept of "nearly met"
[11:02:30] Wahid Bhimji should we look at the metrics within the group meeting as well
[11:04:14] Ewan Mac Mahon We should have a metric for the total running number of metrics.
[11:04:24] Ewan Mac Mahon 10% year on year increase in metrics.
[11:05:47] Jens Jensen could do -
[11:05:55] Wahid Bhimji thanks
[11:06:05] Brian Davies left
[11:06:06] David Crooks left
[11:06:09] Sam Skipsey left
[11:06:09] Rob Fay left
[11:06:12] Govind Songara left
[11:06:14] Ewan Mac Mahon left
[11:06:14] Matthew Doidge left