Present:
Oxford: Ewan
Edinburgh: Wahid
Cambridge: Santanu
Glasgow: Gareth, Sam, David
Liverpool: Stephen, Mark
RALT1: Brian, Jens (chair+mins)
Imperial: Duncan
QMUL: Chris
Lancaster: Matt

https://savannah.cern.ch/task/?group=srmsupportuk

20475 C
20474 ? Ricardo sent Matt something to test
17931 + unbalanced disk server load - Sam has beta code to do some rebalancing, making use of DPM improvements (fs waiting - no new writes but files can be deleted)
16729 - check with Chris
16728 - wiki page started?
16727 C
16725 C
16724 C exists in DPM core - should be prod (sites are not all using it)
16723 C DPM got better at checksumming, on demand
16722 O clarify which values are needed
16721 C Lots of studies around - ATLAS GDB before last - ATLAS monitoring available, Wahid will send links (see chat)
16683 C no longer needed
16350 C Done, wiki
16349 C Done, wiki
15359 C Done
15359 C No longer relevant
15357 C Isn't going to work: StoRM needs POSIX ACLs.
15356 C Done

hepsysman, T1 "EOS" evaluations:
** Criteria for evaluations - which candidates should the T1 be considering? How files will migrate from the disk system to the tape system? They don't, or it's managed from the outside eg with FTS.
- BeStMan on Lustre?
- Is BeStMan not being supported anymore? Will they move away from SRM?
** Puppet templates for DPM not out? Santanu found templates on CERN web site, trying to do everything from installation to configuration.
** Two solutions layered on top of storage, StorageD and SDB: would something git-like be suitable.
** iRODS can interface to other things, QMUL will be running an iRODS.

Sam and Wahid and Chris will be going to CHEP/WLCG
There shall be a meeting next, nonetheless.

[09:59:49] Stephen Jones joined
[10:00:29] David Crooks joined
[10:01:08] Wahid Bhimji hah
[10:01:19] Wahid Bhimji what a milestone 200m meetings!
[10:01:21] David Crooks left
[10:01:40] Brian Davies joined
[10:02:08] Wahid Bhimji I can't open it
[10:02:19] Ewan Mac Mahon joined
[10:02:37] Jens Jensen https://savannah.cern.ch/task/?group=srmsupportuk
[10:03:08] Mark Norman joined
[10:04:43] Duncan Rand joined
[10:06:32] Ewan Mac Mahon No, but you can set its fs weight to several bajillion.
[10:08:18] Ewan Mac Mahon So for the rebalancer if you just set every fs other than the target to weight 0, it'll definitely go where you want.
[10:12:21] Sam Skipsey https://svnweb.cern.ch/trac/lcgdm/wiki/Dpm/Dev/Components
[10:12:51] PPRC QMUL joined
[10:13:42] Ewan Mac Mahon I'm not sure anyone actually got fsprobe running, but I think we all decided that it was a) a good idea and b) should be fairly easy
[10:15:40] Ewan Mac Mahon On demand.
[10:15:46] Wahid Bhimji when an atlas job runs on the data it picks corruption up
[10:15:53] Wahid Bhimji which I think is good enough
[10:16:27] Ewan Mac Mahon Though you could imagine a DPM-fsprobe that would do the fsprobe style background checking, but via DPM and comparing with its stored checksums.
[10:16:59] Matthew Doidge joined
[10:20:27] Wahid Bhimji I meant this link
[10:20:28] Wahid Bhimji http://dashb-atlas-job.cern.ch/dashboard/request.py/terminatedjobsstatus_individual?sites=UK&sitesSort=8&start=null&end=null&timeRange=lastMonth&sortBy=0&granularity=Daily&generic=0&series=All&type=pfe
[10:20:40] Govind Songara joined
[10:20:58] Govind Songara left
[10:21:05] Brian Davies yes, actually below is an example for a single site (Oxford in this case)
[10:21:06] Brian Davies http://dashb-atlas-job-prototype.cern.ch/dashboard/request.py/dailysummary#button=successfailures&sites[]=UKI-SOUTHGRID-OX-HEP&sitesSort=0&start=null&end=null&timerange=lastWeek&granularity=Hourly&generic=0&sortby=0&series=All
[10:21:41] Wahid Bhimji My link takes you straight to the error codes for all UK sites
[10:23:08] Wahid Bhimji Bestman support is ending - no point in us using
[10:24:05] Ewan Mac Mahon But the action to test it is complete - it was tested, it didn't work.
[10:24:24] Duncan Rand dpm is being prepared for hdfs
[10:24:52] Wahid Bhimji DMLITE indeed may bring us a whole heap of new combos
[10:25:07] Sam Skipsey And yeah, one of them is a hadoop backend.
[10:25:22] Sam Skipsey (Which is actually interesting)
[10:25:32] Ewan Mac Mahon HepSysMan was good; we like it.
[10:25:55] Ewan Mac Mahon Though I'm not completely clear from Sam's talk - which fs is it we're supposed to be using?
[10:26:56] Sam Skipsey I'm assuming you're trolling, Ewan. :p
[10:27:04] Ewan Mac Mahon
[10:27:15] Duncan Rand didn't the consensus end up with storm & lustre?
[10:27:39] Sam Skipsey "Storm and Lustre" is "good enough", yes. And actually works.
[10:27:49] Wahid Bhimji so does DPM
[10:27:54] Sam Skipsey (Which I think is an underrated virtue)
[10:27:54] Gareth Roy left
[10:28:05] Duncan Rand so does dcache
[10:28:43] Sam Skipsey Sure: DMLITE + HDFS would also be good.
[10:28:47] Gareth Roy joined
[10:28:58] Wahid Bhimji so there are your candidates + EOS (shudder)
[10:29:05] Ewan Mac Mahon I think for doing it right now it's got to be STORM+lustre
[10:29:16] Wahid Bhimji bestman has no long term support
[10:29:16] Sam Skipsey (You might notice that I'm increasingly pro filesystem backends that can actually do block level parallelism)
[10:29:20] Ewan Mac Mahon All the others are potentially interesting for the future.
[10:29:21] Wahid Bhimji so I'd say forget it
[10:29:31] Duncan Rand don't forget gpfs
[10:29:43] Ewan Mac Mahon No, really, DO forget gpfs.
[10:29:51] Ewan Mac Mahon There be dragons.
[10:29:56] Sam Skipsey And expenses.
[10:30:43] Ewan Mac Mahon Not so much expenses, apparently. There are some academic deals around (the Oxford e-Research Centre seem to run it for mostly gratis)
[10:30:50] Ewan Mac Mahon But still dragons.
[10:31:40] Duncan Rand sam are you suggesting DMLITE + HDFS for the tier-1? if so how many replicas?
[10:31:52] Ewan Mac Mahon I think Andrew ruled HDFS out.
[10:31:52] Duncan Rand cos replicas cost money
[10:31:59] Ewan Mac Mahon For that very reason.
[10:32:24] Ewan Mac Mahon Something RAIN like might be interesting, but replicas of whole files are just too damn expensive.
[10:32:25] Sam Skipsey you don't have to replicate.
[10:32:53] Stephen Jones We use puppet, without any specific templates.
[10:32:57] Sam Skipsey In which case, it's the block level distribution that's still useful, as it smears load.
[10:33:28] Sam Skipsey (Annoyingly, all the things that do RAIN that aren't expensive are still beta.)
[10:33:42] Ewan Mac Mahon I think HDFS would still count as slightly weird compared with Lustre as a more mainstream feeling option.
[10:33:49] Wahid Bhimji well no replicas kind of rules EOS out too - anyway I found some of the criteria floated as mandatory as maybe not all that mandatory so would be interesting to see criteria
[10:34:09] Wahid Bhimji iterate on that a bit and then write down the options
[10:34:21] Sam Skipsey Lots of things rule out EOS, Wahid, last time I talked to the RAL guys they were unkeen on it.
[10:35:07] Brian Davies for last month in UK ~15% of jobs failed due to input file missing (possibly solved by federation copy in, also retry?) 15% of jobs failed copying from WN to local SE which would be solved by job recovery.
[10:36:42] Duncan Rand what's wrong with EOS?
[10:37:34] Ewan Mac Mahon iRODS on top of SRM might be fun.
[10:37:49] Ewan Mac Mahon For all the ex-NGS would-be iRODS users.
[10:38:47] Ewan Mac Mahon And Sanger is the place with the 16PB Lustre that I keep mentioning.
[10:39:06] Ewan Mac Mahon (and some grid front end nodes that don't get any use)
[10:42:24] Duncan Rand 16 PB is not an insignificant volume
[10:42:26] Ewan Mac Mahon Alternatively, get the responsible folks to blog each of the five bits separately.
[10:42:32] Ewan Mac Mahon What he said.
[10:43:07] Ewan Mac Mahon Jasmine.
[10:43:42] Ewan Mac Mahon What you need to do, Brian, is blog more
[10:45:10] Govind Songara left
[10:45:15] Wahid Bhimji left