Minutes of the storage EVO meeting, 4 Aug 2010

Present:
Glasgow: Sam, David
Edinburgh: Wahid
Liverpool: John, Stephen
Bristol: Winnie
Sheffield: Elena
QMUL: Chris W, Ben
Oxford: Ewan
Lancaster: Matt
Manchester: Alessandra
RAL: Brian, James, Jens (chair+mins)

*** Don't forget the tasklist on ***
https://savannah.cern.ch/projects/srmsupportuk/

0. Review of very few actions, and related discussion

28/07/2010 Collar Tony about the data mgmt section at GPP25 Sam High Open
Done. Things we want to talk about:
* Experiments' use of storage - ideally get the experiments to present. Could also present what we think they think and get them to shout if they don't think what we think they think.
* In particular, we want to make sure the non-LHC VOs are presented. Ben (was) volunteered to talk about T2K - what's good, what's bad, what would be good in the future, etc. ACTION: Jens to volunteer more experiment reps.
* Reports from the Amsterdam and IC workshops - Sam went to both and was volunteered to talk about them. However, writing the material should be a collaborative effort between the people who went.
* Tasks - Jens - overview of what we do, etc.
* Discussion - time for discussion, e.g. 1/2 hour.

28/07/2010 Volunteer T2K for syncat exercise Jens Med Open
Here's the T2K position:
* Using the RAL LFC - and no other LFC
* Have seen files not copied correctly
* Would like regular checks of LFC correctness
* Heavy use of the LFC, also by collaborators using resources in Canada
* End users also use the catalogue

Site admins can check. This can be done from both ends: from the catalogue to the SEs, or from the SE against the catalogue. What action is taken if a mistake is found? It is flagged, but there is potential to plug in something that can automatically fix it (e.g. locating a healthy replica). We should first get a feel for the size of the problem (e.g. the fraction of GUIDs with missing replicas, and the fraction of replicas missing - perhaps a case for our budding statisticians to spring into action again). A rough sketch of such a check is appended after the chat log below.

How long does it take to run? If run directly as a client, it may take a day or two. If we access the backend directly, it may be quicker - e.g. comparing lists of files. We suggest trying the "proper" way first and seeing how long it takes. It should be linear in the number of files at the site.

Lancaster have volunteered to do this first for T2K. They have downtime coming up, so it would be good to complete the task before the downtime. See also the chat log for reports on file numbers. For StoRM, Sam is completing the syncat tool right now...

1. I would like to do a genuine round table, postponed from last week.

Glasgow: StoRM dumper. CHEP writeups. Silent corruption due to memory on a disk server; now offline.

Liverpool: XFS on SL5 ongoing issues. TCP window tuning didn't help. Considering trying ext4, but it would be better to fix XFS.

Sheffield and Bristol: See chat log.

Edinburgh: ECDF - looking at RAID sizes, and ext4 vs XFS performance. Also Hammercloud-in-a-box. A student is looking at StoRM with Hadoop and got it to work, albeit in a hacky way, and she is currently writing up.

QMUL: Looking at the ATLAS dark data problem. Issues with disk servers seemed to be resolved by changing network cards. Somewhat concerned about high load on the Lustre metadata server, but other Lustre sites are seeing this too. Will upgrade once the blessed 1.5 is out.

Oxford: Resurrecting the test DPM, to get back to testing Glasgow's db indexing and storing the db on SSD. However, being a test DPM it is empty, so considering cloning the production db onto the test one, and then testing. Feasibility of this to be discussed on the list.
Lancaster: XFS on SL5 problems. High load killed an rfiod, thus taking out a disk server. Noted that the client (lcg-cr) error message was distinctly unhelpful: "invalid argument". Got new hardware for the head node: a dual quad-core with RAID10 SAS drives and 24 GB of RAM.

2. Metrics discussion

OK, so we ran out of time again :-)

3. AOB

NOB

== CHAT ==

[09:59:39] John Bland joined
[10:00:47] Elena Korolkova joined
[10:00:55] Wahid Bhimji joined
[10:01:07] Stephen Jones joined
[10:02:23] Queen Mary, U London London, U.K. joined
[10:06:44] Winnie Lacesso joined
[10:07:30] Ewan Mac Mahon joined
[10:13:33] Matthew Doidge joined
[10:21:55] Alessandra Forti joined
[10:31:01] Winnie Lacesso Current storage-related activities: Waiting on v1.5 SL5 64-bit StoRM, in the meantime reading with ++interest all q+a on the StoRM lists (v1.5 is problematic still??)
[10:31:10] Winnie Lacesso Current level of (storage-related) happiness: v1.3 works well. StoRM seems (so far) easier to maintain than DPM (reading the DPM discussion on dpm-users-list)
[10:31:32] Winnie Lacesso Site bottleneck: The gpfs I/O servers on the HPC cluster are maxed out, causing intermittent gpfs sluggishness on HPC WNs, occasionally causing OPS & LHCb SAM failures. Not recent/ATM (thank goodness).
[10:32:03] Sam Skipsey Winnie: metadata or data bottlenecks for GPFS?
[10:32:50] Queen Mary, U London London, U.K. 7033 t2k.org files at QMUL
[10:32:55] Winnie Lacesso I'm pretty sure Dr Cregan said it's data - email Dr Cregan (away on Vac'n ATM) to clarify & cc me pls?
[10:33:25] Sam Skipsey Okay. Cheers.
[10:34:41] Wahid Bhimji We see the same thing at Edinburgh - ops SAM failure in CA test? LHCb failure in the test that checks root versions by listing a billion files
[10:35:13] Wahid Bhimji gpfs metadata nightmare (but I thought at Bristol you had your metadata on fast disk now?)
[10:36:05] Wahid Bhimji so maybe it is actually something else (we haven't really ever solved it here so not sure it's a metadata issue - would be interesting to see if it's the same reason - I will email Bob)
[10:37:46] Elena Korolkova We are upgrading storage pools to SL5.5 because they fixed the xfs problem in this version
[10:37:57] Brian Davies I've been looking at individual file transfers for files in FTS logs for sites, and possibly changing T1 WAN TCP settings. Looking into dark data at sites. Example: QMUL SCRATCHDISK. dq2 thinks 2TB and 70k files, the site sees 106k files and 8TB. Of the 106k files, 6k files contain 6TB; the remaining 100k files contain 2TB.
[10:37:59] Jens Jensen Thanks, Elena
[10:38:01] Elena Korolkova they=linux people
[10:38:24] Elena Korolkova left
[10:38:31] Matthew Doidge left
[10:38:35] Ben Still left
[10:38:38] James Thorne left
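
== SYNCAT SKETCH (illustrative) ==

A minimal sketch of the catalogue/SE cross-check discussed under the syncat action in section 0. This is not Sam's syncat tool; it assumes two plain-text dumps with one replica path per line - one from the LFC for the site, one from the SE namespace - and reports missing replicas, dark files, and the fractions mentioned above. The dump file names and format are hypothetical.

#!/usr/bin/env python
# Minimal sketch, not the syncat tool itself: compare a catalogue dump
# against an SE dump and report missing replicas and dark data.
# Assumes two plain-text files with one replica path per line; the
# file names and dump format are hypothetical.

import sys

def load(path):
    # Read one path per line into a set, ignoring blank lines.
    with open(path) as f:
        return set(line.strip() for line in f if line.strip())

def main(lfc_dump, se_dump):
    catalogued = load(lfc_dump)   # replicas the LFC says the site holds
    on_disk = load(se_dump)       # files the SE actually has

    missing = catalogued - on_disk   # registered but not on the SE
    dark = on_disk - catalogued      # on the SE but not registered

    print("Catalogued replicas: %d" % len(catalogued))
    print("Files on SE:         %d" % len(on_disk))
    print("Missing replicas:    %d (%.2f%% of catalogue)"
          % (len(missing), 100.0 * len(missing) / max(len(catalogued), 1)))
    print("Dark files:          %d (%.2f%% of SE contents)"
          % (len(dark), 100.0 * len(dark) / max(len(on_disk), 1)))

    # Flag only; any automatic repair (e.g. re-replicating from a healthy
    # copy elsewhere) would start from these two lists.
    for p in sorted(missing):
        print("MISSING %s" % p)
    for p in sorted(dark):
        print("DARK %s" % p)

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])

The comparison itself is linear in the number of files at the site, as noted above; the variable cost is producing the two dumps (LFC client vs. direct backend query on the catalogue side, StoRM/DPM dumper or a namespace listing on the SE side).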