Minutes of the storage EVO meeting 02 March 2011

Present:
Glasgow: Sam, David
Edinburgh: Wahid
Sheffield: Elena
Lancaster: Matt
Oxford: Ewan
RHUL: Govind
RAL T1: James, Brian, Jens (chair+mins)

Agenda for the storage EVO meeting 02 Mar 2011

0. Review of actions (see below)... and tasklist
https://savannah.cern.ch/task/?group=srmsupportuk

#17931 Unbalanced datasets
J.-P. has now taken an interest in unbalanced datasets; both Wahid and Sam have been feeding back experiences to J.-P. See also discussion related to #16726 below.

#16729 StoRM dumper
Seems to be working, but a slash is missing that ATLAS' tool seems to expect. Is the slash required by the standard or not? Not clear; but it seems easier to make the dumper add a slash in the correct location.

#16728 Document boundary cases in testing
Something was done: http://www.gridpp.ac.uk/wiki/Storage_and_Data_Management_Testing (ie testing with srmcp so it doesn't need the information system; etc.) Was anything else intended for this task?

#16727 Plan for dealing with orphaned files
Done; needs following up.

#16726 ext4 evaluation
Done. ext4 currently in production on some disk pools at Glasgow. Should new DPMs use ext4? ext4 is limited to 16TB. Maybe a good idea to limit size of filesystem anyway; eg Glasgow using 8-15 TB. DPM ignores filesystem size. Manchester have 24 and 60 TB servers - with a single filesystem per server, DPM may get upset when filesystems reach 24 TB (eg small ones fill up). Also, in general it is not good to have the disk servers completely full.
As an aside, JET are short on storage and funding - could sites' decommissioned servers be used by JET to bring them to at least 10 TB? See chat log. Decommissioned servers are also used for testing distributed filesystems etc.

#16725 Need writeup on StoRM 1.5 upgrade
Could now be a 1.6 upgrade.

#16724 Nagios monitoring tailored to DPM
Lowest priority; skipped for now. It is probably done(?); much work has been done on DPM monitoring.
#16723 Evaluate fsprobe at T2s
Lowest priority; skipped for now.

#16722 Compare values published in SE info to atlasian values
This is a general issue, eg with space tokens publishing shared pools (shared between VOs), or resources not being published, or (DPM) space tokens being set up with confusion between TiB and TB (publishing should be TB, ie powers of 10).

#16721 Study storage effect on job efficiency
This is a separate discussion; postponed for later. On roadmap given to PMB.

#16683 Add alternative input methods to consistency checking scripts
Lowest priority; skipped for now.

#16351 SSD writeup
Done - see CHEP papers.

#16350 Hadoop writeup
Ongoing. Should we have Hadoop on the agenda for the workshop? Yes.

#16349 Test dCache
Done (functional, not performance); writeup as google doc. With Rob Harper from RAL PP.

#15359 Test pcache
Done.

#15358 Test Lustre with BeStMan
This was the task for which Simon from IC volunteered?

#15357 Test Hadoop with StoRM
Some work by Wahid's disciple at Edinburgh. Problem will be integration between SRM and the underlying layers.

#15356 Check SA.ACBR and VOInfo.ACBR
Basic checks (eyeball) done; do we need a tool for testing this? Probably not. Issue is resource discovery: VOs look for things with their names on them; either SA directly or (more likely) VOInfo object first to discover paths and space tokens.

1. Status of StoRM 1.6?
While StoRM 1.6 still has the publishing problem with the dynamic data not being updated, er, dynamically, for non-GPFS-based StoRMs, it is now our recommended version. Chris reports that it failed a staged rollout but there were no problems that some configuration fiddling didn't fix. But test it first if you can. Moreover, StoRM 1.6 works on SL5; 1.5 requires SL4.

2. DPM and MySQL?
Do we need to help people with MySQL, eg optimisation? For DPM 1.8, the situation should improve because old requests are cleaned up and the database is therefore smaller.
Is there any standard MySQL monitoring which could be integrated with existing DPM monitoring (given that all (UKI) DPM sites seem to be using MySQL)? Possibly; MySQL should integrate with Ganglia and Nagios.

3. File lifetimes in SRM - support status
Interestingly, DPM does appear to garbage collect files whose lifetimes have expired (as per the SRM protocol). Andrea Sciaba was doing some work on documenting the behaviour of these things:

4. ATLAS poor data rates for the UK?
And the transferological study - discussed following the meeting. See also chatlog from 10:40 onward. In particular, how should sites be firewalled? Maybe the best is to keep the grid outside the firewall.

5. T1 representation in this group - who will follow in James' footsteps?
See note in chatlog: James Adams is the lucky winner.

401  02/06/2010  Clean up the wiki                                          ALL   Low   Open
427  19/01/2011  Send Areca stuff to Sam                                    Ewan  High  Open
429  19/01/2011  Contact Sanger about storage (Lustre workshop and GridPP)  Jens  Med   Open
430  19/01/2011  Test the rest of Chris's StoRM (cf 425)                    ALL   Med   Open

Create baseline datapoints for transfers from Manchester, QMUL, Glasgow, source BNL and destination two remote T1s, one in Europe and one in the US. PIC or NIKHEF/SARA. 2GB size initially.
Sites to look at their disk server settings
Place/find appropriate datasets
Check changes for Glasgow
Sam to talk to Steve Lloyd about his network tests. And to Mark about firewall.

[10:00:44] Wahid Bhimji joined
[10:01:06] Wahid Bhimji I did
[10:01:09] John Bland joined
[10:02:21] Brian Davies joined
[10:02:56] Wahid Bhimji added Mehrez ALACHHEB to the list
[10:03:29] Govind Songara joined
[10:04:13] Stephen Jones joined
[10:04:15] Ewan Mac Mahon joined
[10:04:25] Queen Mary, U London London, U.K. joined
[10:04:38] Matthew Doidge joined
[10:05:53] Queen Mary, U London London, U.K. left
[10:05:59] Jens Jensen https://savannah.cern.ch/task/?group=srmsupportuk\
[10:06:24] Jens Jensen https://savannah.cern.ch/task/?group=srmsupportuk
[10:07:07] Queen Mary, U London London, U.K. joined
[10:13:38] Wahid Bhimji 16 T
[10:20:25] Wahid Bhimji cheeky
[10:20:39] Wahid Bhimji give them to your Edinburgh cousins
[10:21:05] James Thorne Send an email to Martin Bly
[10:21:20] James Thorne We're decommissioning some kit this year
[10:21:31] James Thorne (about 100 servers)
[10:21:38] Wahid Bhimji we'll have one !
[10:21:55] Wahid Bhimji and 5 of the RAL ones !
[10:21:57] James Thorne :q
[10:22:12] James Thorne (sorry meant to put that in vi)
[10:22:50] Ewan Mac Mahon James: I will indeed email/talk to Martin. Roughly when's this likely to happen?
[10:23:00] James Thorne TBD
[10:23:11] Ewan Mac Mahon OK
[10:23:17] John Bland it might be worth sending that to tb-support or something, lots of sites would be glad of some old kit
[10:23:54] Ewan Mac Mahon And recycling Tier 1 kit mught be less financially icky that Tier 2 kit.
[10:25:11] Wahid Bhimji yes
[10:25:32] Wahid Bhimji at least then you can shout at the developers
[10:26:30] James Thorne I'll speak to Martin and let him announce it depending how much we have to offer (some may need to be held back for spares until all are out of prod)
[10:27:40] Wahid Bhimji nagios monitoring there is a new guy in the DPM cern team working on it.
[10:27:56] Wahid Bhimji he has sent me a link to prototype probes - I will look at it
[10:28:03] Sam Skipsey The SSD writeup is done (if a CHEP paper counts...)
[10:28:11] Wahid Bhimji I have a student in the summer who may augment whats there if its not good enough for us
[10:28:18] Sam Skipsey The pcache testing is also done (there's a wiki page on it)
[10:29:12] Jens Jensen Ta
[10:41:46] Wahid Bhimji I said that too !
[10:41:58] Wahid Bhimji Ah yes - alessandra was the counter example
[10:42:51] Wahid Bhimji yes
[10:43:56] James Thorne Just FYI, James Adams will take over from me.
[10:43:56] Stephen Jones left
[10:44:12] Wahid Bhimji ok
[10:44:15] James Thorne left
[10:44:16] Govind Songara left
[10:45:40] John Bland left
[10:48:27] Jens Jensen I am back. There is no NGS surgery right now so either there was none or they finished early.
[10:52:56] Alessandra Forti http://ks.tier2.hep.manchester.ac.uk/T2/sonar/Manchester-networking.pdf
[10:53:03] Alessandra Forti this is the network in manchester
[10:53:53] Sam Skipsey What does the Cisco do? Just ACLs, nothing stateful?
[10:53:54] Alessandra Forti the sysctl.conf changes haven't had any effect on the sonar tests although they seem to have improved the lan transfers to the WNs
[10:55:08] Alessandra Forti it shouldn't do anything harmful
[10:55:27] Alessandra Forti but I haven' digged in the configuration yet
[10:56:05] Alessandra Forti transfers from data server to data server done with scp
[10:56:11] Alessandra Forti are 40MBs
[10:56:20] Alessandra Forti even 45MBs
[10:56:49] Alessandra Forti these transfers are going through the cisco
[10:57:50] Alessandra Forti Brian iperf from RAL was 100 MBs yesterday. I don't know if he has made systematic measures
[10:57:54] Alessandra Forti though
[10:59:00] Alessandra Forti not this week
[11:00:31] Wahid Bhimji Well it could be a combination of site specific and generic issues
[11:00:45] Sam Skipsey This was what I was trying to say, Wahid
[11:01:08] Wahid Bhimji indeed - no silver bullet
[11:02:28] Jens Jensen Is there a use case for asking for better resources for dteam? I have also experienced sites not supporting dteam well (when trying to debug stuff)
[11:03:29] Wahid Bhimji I'll make you an atlas proxy if you want to do it as atlas
[11:03:34] Wahid Bhimji (i have prod too)
[11:06:33] Wahid Bhimji http://pprc.qmul.ac.uk/~lloyd/gridpp/nettest.html
[11:06:39] Wahid Bhimji so we could add a test to a remote site
[11:07:30] Alessandra Forti are these test reliable?
[11:07:37] Sam Skipsey ish
[11:08:13] Alessandra Forti manchester doesn't even appear in the final T1 table
[11:08:29] Alessandra Forti and only man2 appers in other places
[11:08:44] Alessandra Forti last bit is not a problem cause the storage is the same
[11:08:55] Alessandra Forti but it's a bit indicative
[11:09:05] Alessandra Forti that it needs perfectioning
[11:09:07] Sam Skipsey I'm not sure when Steve last updated his list of endpoints to test, so that would be why the names are out of date.
[11:09:43] Wahid Bhimji - yeah well it should be changed to something useful since it is supposed to be testing what we are talking about here.
[11:10:00] Sam Skipsey Well, at least, within the UK.
[11:10:11] Wahid Bhimji If we suggest to steve some different endpoints (and maybe different filesizes ) then it should be hard for him to add them
[11:10:26] Wahid Bhimji our suggested endpoints can be without the uk
[11:10:33] Wahid Bhimji outwith
[11:11:05] Wahid Bhimji ie manchester / qmul / glasgow
[11:11:09] Sam Skipsey They can, but we should also get the existing internal tests improved...
[11:17:43] Alessandra Forti there are the sonar tests files we don't need to create much. We can use scratch disk rather than datadisk
[11:17:57] Alessandra Forti to make T1 less upset
[11:17:58] Sam Skipsey Well, exactly.
[11:18:26] Sam Skipsey (assuming there's no special tuning going on to prioritise transfers to different spacetokens)
[11:18:34] Sam Skipsey (which I doubt)
[11:18:45] Alessandra Forti i doubt that
[11:19:18] Sam Skipsey So, yeah, I don't think there's an issue with just sticking files in scratchdisk
[11:23:57] Wahid Bhimji actually ECDF don't trust anyone
[11:24:02] Wahid Bhimji even me and Andy
[11:24:02] Alessandra Forti
[11:24:23] Sam Skipsey yeah, but we trust You, Wahid.
[11:27:32] Wahid Bhimji Netherlands
[11:27:37] Wahid Bhimji ok