Minutes of the storage EVO meeting, 4 Aug 2010

Present:
Glasgow: Sam, David
Edinburgh: Wahid
Liverpool: John, Stephen
Bristol: Winnie
Sheffield: Elena
QMUL: Chris W, Ben
Oxford: Ewan
Lancaster: Matt
Manchester: Alessandra
RAL: Brian, James, Jens (chair+mins)

*** Don't forget the tasklist on ***
https://savannah.cern.ch/projects/srmsupportuk/

0. Review of very few actions, and related discussion

28/07/2010 Collar Tony about the data mgmt section at GPP25 Sam High Open
Done. Things we want to talk about:
* Experiments' use of storage - ideally get the experiments to present. Could also present what we think they think and get them to shout if they don't think what we think they think.
* In particular, we want to make sure the non-LHC VOs are presented. Ben (was) volunteered to talk about T2K - what's good, what's bad, what would be good in the future, etc. ACTION: Jens to volunteer more experiment reps.
* Reports from the Amsterdam and IC workshops - Sam went to both and was volunteered to talk about them. However, writing the material should be a collaborative effort between the people who went.
* Tasks - Jens - overview of what we do, etc.
* Discussion - time for discussion, e.g. 1/2 hour.

28/07/2010 Volunteer T2K for syncat exercise Jens Med Open
Here's the T2K position:
* Using the RAL LFC - and no other LFC
* Have seen files not copied correctly
* Would like regular checks of LFC correctness
* Heavy use of the LFC, also by collaborators using resources in Canada
* End users also use the catalogue

Site admins can check. This can be done from both ends: from the catalogue to the SEs, or from the SE against the catalogue. What action is taken if a mistake is found? It is flagged, but there is potential to plug in something that can automatically fix it (e.g. locating a healthy replica). We should first get a feel for the size of the problem (e.g. the fraction of GUIDs with missing replicas, and the fraction of replicas missing - perhaps a case for our budding statisticians to spring into action again). A rough sketch of such a check is appended after the chat log below.

How long does it take to run? If run directly as a client, it may take a day or two. If we access the backend directly, it may be quicker - e.g. comparing lists of files. We suggest trying the "proper" way first and seeing how long it takes. It should be linear in the number of files at the site.

Lancaster have volunteered to do this first for T2K. They have downtime coming up, so it would be good to complete the task before the downtime. See also the chat log for reports on file numbers. For StoRM, Sam is completing the syncat tool right now...

1. I would like to do a genuine round table, postponed from last week.

Glasgow: StoRM dumper. CHEP writeups. Silent corruption due to memory on a disk server; now offline.

Liverpool: XFS on SL5 ongoing issues. TCP window tuning didn't help. Considering trying ext4, but it would be better to fix XFS.

Sheffield and Bristol: See chat log.

Edinburgh: ECDF - looking at RAID sizes, and ext4 vs XFS performance. Also Hammercloud-in-a-box. A student is looking at StoRM with Hadoop and got it to work, albeit in a hacky way, and she is currently writing up.

QMUL: Looking at the ATLAS dark data problem. Issues with disk servers seemed to be resolved by changing network cards. Somewhat concerned about high load on the Lustre metadata server, but other Lustre sites are seeing this too. Will upgrade once the blessed 1.5 is out.

Oxford: Resurrecting the test DPM, to get back to testing Glasgow's db indexing and storing the db on SSD. However, being a test DPM it is empty, so considering cloning the production db onto the test one, and then testing. Feasibility of this to be discussed on the list.
Lancaster: XFS on SL5 problems. High load killed an rfiod, thus taking out a disk server. Noted that the client (lcg-cr) error message was distinctly unhelpful: "invalid argument". Got new hardware for the head node: a dual quad-core with RAID10 SAS drives and 24 GB of RAM.

2. Metrics discussion

OK, so we ran out of time again :-)

3. AOB

NOB

== CHAT ==

[09:59:39] John Bland joined
[10:00:47] Elena Korolkova joined
[10:00:55] Wahid Bhimji joined
[10:01:07] Stephen Jones joined
[10:02:23] Queen Mary, U London London, U.K. joined
[10:06:44] Winnie Lacesso joined
[10:07:30] Ewan Mac Mahon joined
[10:13:33] Matthew Doidge joined
[10:21:55] Alessandra Forti joined
[10:31:01] Winnie Lacesso Current storage-related activities: Waiting on v1.5 SL5 64-bit StoRM, in the meantime reading with ++interest all q+a on the StoRM lists (v1.5 is problematic still??)
[10:31:10] Winnie Lacesso Current level of (storage-related) happiness: v1.3 works well. StoRM seems (so far) easier to maintain than DPM (reading the DPM discussion on dpm-users-list)
[10:31:32] Winnie Lacesso Site bottleneck: The gpfs I/O servers on the HPC cluster are maxed out, causing intermittent gpfs sluggishness on HPC WNs, occasionally causing OPS & LHCb SAM failures. Not recent/ATM (thank goodness).
[10:32:03] Sam Skipsey Winnie: metadata or data bottlenecks for GPFS?
[10:32:50] Queen Mary, U London London, U.K. 7033 t2k.org files at QMUL
[10:32:55] Winnie Lacesso I'm pretty sure Dr Cregan said it's data - email Dr Cregan (away on Vac'n ATM) to clarify & cc me pls?
[10:33:25] Sam Skipsey Okay. Cheers.
[10:34:41] Wahid Bhimji We see the same thing at Edinburgh - ops SAM failure in CA test? LHCb failure in the test that checks root versions by listing a billion files
[10:35:13] Wahid Bhimji gpfs metadata nightmare (but I thought at Bristol you had your metadata on fast disk now?)
[10:36:05] Wahid Bhimji so maybe it is actually something else (we haven't really ever solved it here so not sure it's a metadata issue - would be interesting to see if it's the same reason - I will email Bob)
[10:37:46] Elena Korolkova We are upgrading storage pools to SL5.5 because they fixed the xfs problem in this version
[10:37:57] Brian Davies I've been looking at individual file transfers for files in FTS logs for sites, and possibly changing T1 WAN TCP settings. Looking into dark data at sites. Example: QMUL SCRATCHDISK. dq2 thinks 2TB and 70k files, the site sees 106k files and 8TB. Of the 106k files, 6k files contain 6TB; the remaining 100k files contain 2TB.
[10:37:59] Jens Jensen Thanks, Elena
[10:38:01] Elena Korolkova they=linux people
[10:38:24] Elena Korolkova left
[10:38:31] Matthew Doidge left
[10:38:35] Ben Still left
[10:38:38] James Thorne left
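
== SYNCAT SKETCH (illustrative) ==

A minimal sketch of the catalogue/SE cross-check discussed under the syncat action in section 0. This is not Sam's syncat tool; it assumes two plain-text dumps with one replica path per line - one from the LFC for the site, one from the SE namespace - and reports missing replicas, dark files, and the fractions mentioned above. The dump file names and format are hypothetical.

#!/usr/bin/env python
# Minimal sketch, not the syncat tool itself: compare a catalogue dump
# against an SE dump and report missing replicas and dark data.
# Assumes two plain-text files with one replica path per line; the
# file names and dump format are hypothetical.

import sys

def load(path):
    # Read one path per line into a set, ignoring blank lines.
    with open(path) as f:
        return set(line.strip() for line in f if line.strip())

def main(lfc_dump, se_dump):
    catalogued = load(lfc_dump)   # replicas the LFC says the site holds
    on_disk = load(se_dump)       # files the SE actually has

    missing = catalogued - on_disk   # registered but not on the SE
    dark = on_disk - catalogued      # on the SE but not registered

    print("Catalogued replicas: %d" % len(catalogued))
    print("Files on SE:         %d" % len(on_disk))
    print("Missing replicas:    %d (%.2f%% of catalogue)"
          % (len(missing), 100.0 * len(missing) / max(len(catalogued), 1)))
    print("Dark files:          %d (%.2f%% of SE contents)"
          % (len(dark), 100.0 * len(dark) / max(len(on_disk), 1)))

    # Flag only; any automatic repair (e.g. re-replicating from a healthy
    # copy elsewhere) would start from these two lists.
    for p in sorted(missing):
        print("MISSING %s" % p)
    for p in sorted(dark):
        print("DARK %s" % p)

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])

The comparison itself is linear in the number of files at the site, as noted above; the variable cost is producing the two dumps (LFC client vs. direct backend query on the catalogue side, StoRM/DPM dumper or a namespace listing on the SE side).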