Mark Norman
Duncan Rand
Ewan Mac Mahon
Wahid Bhimji
John Hill
Gareth Roy
John Bland
David Crooks
Sam Skipsey
Stephen Jones
Brian Davies
Chris Walker
Jens Jensen (chairing)
Matt Doidge
Someone on phone bridge?

1. Roundup of DPM issues.

Do we need to look at the documentation? When I say "documentation," I often mean "source code"... Documentation is out of date, and there's lots of new stuff which says how to install it. If we got more involved, we may have to read more source code and/or do more testing, but could also update the "actual" documentation.

Publishing negative values in BDII. The BDII output is using dpm-listspaces which is a python script. Maybe we need to look at the database - maybe there's something wrong in the database? The site in question (reminded by Daniela) seems to not have eliminated the problem yet (previously we had tebi/tera). Could Ricardo help resolve it...? There should be checks in listspaces, which should try to catch negative values and set them to zero. Maybe only does one of free? In InstalledCapacity, total = free + used, so we sometimes get free = total - used, but used is not straightforward to measure. Govind - send output of dpm-listspaces to list.

Cambridge issue seems to have been resolved - DPM can segfault if files are deleted under certain obscure circs, but 1.8.3 has fixed the bug (and John has 1.8.3 running). It's not the threading problem? Sam will send a mail to the list, with a list of things that John should send to the list.

2. FTS 3 for testing - hooray!

This is excellent news - testing plans? Presumably we can feed stuff back into the process. How does the small file globbing transfer work, does it work with SRM or only with GridFTP? Andrew Lahiff looked into the features; CERN firewalls blocked testing a bit, though. Only supports ATLAS, LHCb, and CMS. There are possibilities for changing the deployment models, because it is no longer channel based. We should continue to assume that T1 will run FTS.
Prototype 1 has no authentication, or could you just not renew delegated proxies? Brian to talk to Catalin and Andrew L to get one set up, even a prototype.

3. How good is "replica" really?

How to measure it...? Is it site dependent? Are there any statistical methods useful to decide whether a file is likely to go missing? Other faults may skew the results, eg zero length transfers, or timeouts. ATLAS keep stats on which files have gone missing. Do we need a plan to report this information? Chris and Wahid suggest that we can collect this without too much hassle. Maybe need to look both at ingest and the storage, if the ingest failure is at the T2.

Checking for files which have become corrupted would help the storage stats, but not files which have gone missing while stored. Checksumming a whole filesystem would take a long time, so instead we should maybe check files which were written six months ago (say), or similar? Interesting research on how to preserve files and recover from errors, eg the loss of a certain number of drives. If we find a corrupted file, it should be useful to compare to the original?

4. Back to stress testing SEs (postponed from last week, but might end up getting postponed again - how urgent is it?)

Benchmarking?

5. AOB

Brian - Default SE value published - are non-LHC VOs using them? Sam believes that lcg-cp (or used to be environment variables)?
Sam - John is applying the

[11:04:51] John Bland joined
[11:04:51] David Crooks joined
[11:04:51] Gareth Roy joined
[11:04:53] Mark Norman joined
[11:04:53] Duncan Rand joined
[11:04:56] Ewan Mac Mahon joined
[11:04:56] Wahid Bhimji joined
[11:04:56] John Hill joined
[11:04:57] Sam Skipsey joined
[11:04:58] Stephen Jones joined
[11:06:48] Brian Davies joined
[11:11:08] Christopher Walker joined
[11:11:13] Christopher Walker left
[11:11:13] Matt Doidge joined
[11:11:17] Christopher Walker joined
[11:11:34] Christopher Walker left
[11:11:37] Christopher Walker joined
[11:12:51] Phone Bridge joined
[11:13:02] Ewan Mac Mahon Yes, but, philosophically speaking, the sites are right and everyone else is wrong. So there :-P
[11:13:15] Ewan Mac Mahon *prthhrpp*
[11:13:19] Wahid Bhimji we've been around this before.
[11:15:19] Stephen Jones It's a case of GIGO, by the sound of it.
[11:15:54] Christopher Walker Wasn't there a way to weight storage - and if you gave it a weight of 0, no new files would go there.
[11:15:57] Ewan Mac Mahon Someone like Sam and/or Wahid need to get a login on the box and poke at it.
[11:16:14] Ewan Mac Mahon anything else is just pointlessly faffy.
[11:16:42] Ewan Mac Mahon Chris: I think so, but that's only in suitably recent DPM.
[11:17:23] Wahid Bhimji I have to leave at 10.30 btw...
[11:17:24] Sam Skipsey Indeed, the 1.8.x series allows that, Chris.
[11:20:26] Ewan Mac Mahon And by human being we mean Santanu.
[11:20:33] Ewan Mac Mahon Worth checking, I think.
[11:21:36] Wahid Bhimji ps the thing we are talking about is covered in the release notes here:
[11:21:36] Wahid Bhimji https://svnweb.cern.ch/trac/lcgdm/blog/official-release-lcgdm-183
[11:21:43] Wahid Bhimji export GLOBUS_THREAD_MODEL="pthread"
[11:22:01] Wahid Bhimji in the sysconfig for dpm and dpndaemon
[11:24:45] David Crooks Sorry, I'm going to have to drop out a bit early today.
[11:24:49] David Crooks left
[11:26:41] Wahid Bhimji please do
[11:29:50] Wahid Bhimji Can we schedule a transfer with the CERN prototype (if we have a CERN account etc)? Is there any instruction on how to do that?
[11:30:34] John Bland another meeting, bye
[11:30:37] John Bland left
[11:34:33] Wahid Bhimji I also have to go in a minute - sorry - hopefully can catch up on this later.
[11:35:37] Wahid Bhimji ...bye....
[11:35:42] Wahid Bhimji left
[11:36:28] Ewan Mac Mahon For things that aren't lustre there's the possibility of using the pool nodes to do the checksumming, so your bandwidth scales with storage.
[11:36:49] Ewan Mac Mahon Plus you avoid over-the-network transfers
[11:36:51] Duncan Rand exactly - do it in parallel
[11:37:30] Ewan Mac Mahon Sam should write a dm-lite plugin for that
[11:38:17] Sam Skipsey I should say that my checksumming tool *does* do the checksumming on the pool nodes
[11:38:37] Sam Skipsey That's why it needs you to annoyingly have a python-ssh module installed.
[11:40:13] John Hill left
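
Note added to the minutes: a minimal sketch of the idea from item 3 and the chat, i.e. checksum only files written about six months ago, locally on a pool node, in parallel. This is not Sam's checksumming tool and it uses no DPM interface. The function names, the `expected` mapping (path to previously recorded checksum) and the choice of Adler-32 are all assumptions made for illustration.

```python
import os
import time
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK = 1024 * 1024  # read files in 1 MiB pieces to keep memory use flat


def adler32_of(path):
    """Return the Adler-32 checksum of a file as 8 lowercase hex digits."""
    value = 1  # Adler-32 is defined to start from 1
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(CHUNK), b""):
            value = zlib.adler32(block, value)
    return format(value & 0xFFFFFFFF, "08x")


def files_older_than(root, min_age_days, now=None):
    """Yield regular files under root last modified at least min_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - min_age_days * 86400
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and os.path.getmtime(path) <= cutoff:
                yield path


def find_mismatches(root, expected, min_age_days=180, workers=4):
    """Checksum old files in parallel and return those that differ from expected.

    expected is a hypothetical dict of path -> recorded checksum; files with
    no recorded checksum are skipped rather than reported.
    """
    paths = [p for p in files_older_than(root, min_age_days) if p in expected]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        actual = dict(zip(paths, pool.map(adler32_of, paths)))
    return sorted(p for p in paths if actual[p] != expected[p])
```

Run on each pool node against its own filesystem, something like this keeps the reads local, so the checksumming bandwidth scales with the number of nodes. It can only flag files whose content has changed; files that have gone missing would have to be found by comparing the namespace against what is on disk.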