Minutes of the storage EVO meeting, 7 July 2011

Present:
Glasgow: Sam, David
Edinburgh: Wahid
Lancaster: Matt
Liverpool: John, Stephen
Manchester: Alessandra
RAL/Storage: Brian, Jens (chair + minutes)

1. T2K space tokens?

Can existing files be brought into the space token without physically moving them? Yes, but it requires a downtime, as DPM maintains a copy of its world view in memory. It is a slightly involved process and not something you would want to do more than once, so current efforts focus on ensuring that T2K can access their new data via space tokens, with their existing workflows, before their old data is moved.

2. DPM update (Wahid) and disk pool management revisited

Test suites are being written at CERN, and the logs are being instrumented more carefully, so it will be possible to delve into the timings of file actions. The NFS plugin is "ready" but maybe not performant; it will need testing - one of Wahid's students is looking into it. On the client side, DESY have a patched kernel for SL5.

As regards the old chestnut, managing data on disk pools (and draining in particular): there is code which uses rfcp instead of GridFTP to move files between nodes, but it is not yet in production. The code is in EMI release 1 (DPM 1.8.1, as clarified in the chat below; Wahid has an instance running). Note that, unlike most other upgrades, you will need to upgrade pool nodes as well: there is no compatibility between the two methods of moving data, which in turn implies that your pool nodes must be SL5.

Some people have been reporting problems with this release. It is known that the EMI release does certain things differently from the gLite way, e.g. it uses different uids/gids, does not check the BDII, and has a problem with a cron job path; config files are in different places. It may be best to wait for an officially sanctioned gLite version - there is no "official" gLite release, but at least unofficially there is one :-)

3. Site transfer performance problems debugging?

It is useful to have monitoring (and to be able to run tests), but it is necessary to do it in a standardised way: e.g. using the same packet sizes, etc. (see the sketch after these minutes). There will be a discussion next week at the WLCG workshop - there may be wider work we can report on. Otherwise Brian and Matt can draft a proposal for testing. T2 endpoints may also appear in the sonar tests. Moreover, QMUL and Oxford are currently going through some network-related upgrades (router at QMUL).

4. Any outcome from hepsysman that we can usefully pick up?

Mainly that Peter at Lancaster is running Hadoop, with some issues with FUSE. Would it be useful if Peter and James (from the T1) talked to each other?

5. The "data grid" - scientific data management

Possibly perplexing points: pensively pursuing parallel perspectives. This is about looking again at how users use us as a "data grid", and how we compare to other "data grids" for processing and storing research data. Possible studies include bioscience at Glasgow, and NeISS, also at Glasgow.

6. AOB

Chris reminds people about the Lustre workshop, and to register if they are attending.
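To make the "standardised way" of testing under item 3 concrete, here is a minimal sketch of a wrapper that runs the same iperf measurement against every endpoint with identical settings (TCP window size, number of parallel streams, duration). This is an illustration only, not an agreed procedure: the endpoint names, window size and stream count are hypothetical placeholders, and it assumes iperf (version 2) is installed on the client and that each remote end is running an iperf server.

    # Minimal sketch of a standardised inter-site throughput test (agenda item 3).
    # Assumptions, not agreed anywhere in these minutes: iperf v2 is installed,
    # each remote endpoint runs "iperf -s", and the endpoint names below are
    # hypothetical placeholders rather than real site hosts.
    import subprocess

    ENDPOINTS = ["se01.example-site1.ac.uk", "se01.example-site2.ac.uk"]  # placeholders
    WINDOW = "256K"    # same TCP window size for every site
    STREAMS = "4"      # same number of parallel streams
    DURATION = "30"    # seconds per test

    def run_test(host):
        """Run one iperf client test against 'host' with the common settings."""
        cmd = ["iperf", "-c", host, "-w", WINDOW, "-P", STREAMS,
               "-t", DURATION, "-f", "m"]   # report throughput in Mbit/s
        print("Running:", " ".join(cmd))
        result = subprocess.run(cmd, capture_output=True, text=True)
        print(result.stdout)
        if result.returncode != 0:
            print("Test against", host, "failed:", result.stderr)

    if __name__ == "__main__":
        for host in ENDPOINTS:
            run_test(host)

Recording the full command line alongside each result would let sites compare like with like when Brian and Matt draft the testing proposal.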
Chat log:

[09:57:28] Wahid Bhimji morning
[09:59:12] John Bland joined
[09:59:25] Stephen Jones joined
[10:00:09] Wahid Bhimji sorry I missed you jens - I was only there on thu
[10:01:49] Alessandra Forti joined
[10:03:20] Brian Davies joined
[10:17:04] Pete Gronbech joined
[10:19:02] Alessandra Forti is 1.8.2 out?
[10:19:31] Alessandra Forti it has the corrected info system and the --si tools
[10:19:37] Alessandra Forti options
[10:20:26] Wahid Bhimji No its 1.8.1
[10:20:49] Wahid Bhimji in EMI 1
[10:27:36] Christopher Walker joined
[10:30:23] Christopher Walker Network upgrade for QMUL has left WAN link slightly degraded. Am working on it with the site network team.
[10:37:36] Brian Davies wlcg workshop next week
[10:37:46] Christopher Walker QMUL Lustre workshop next week
[09:59:42] David Crooks joined
[09:59:57] Brian Davies joined
[10:00:15] Matthew Doidge joined
[10:01:16] Martin Bly joined
[10:01:45] Wahid Bhimji joined
[10:02:33] Ewan Mac Mahon joined
[10:02:50] Duncan Rand joined
[10:03:31] Queen Mary, U London London, U.K. joined
[10:04:40] Alessandra Forti joined
[10:06:12] David Crooks Sam's just in the office now
[10:06:40] Stephen Jones joined
[10:06:51] Govind Songara joined
[10:07:28] Peter Grandi joined
[10:11:22] Sam Skipsey joined
[10:12:44] Queen Mary, U London London, U.K. The GridPP Oversight committee is meeting next door - so don't say anything you don't want them to hear...
[10:13:42] Jens Jensen
[10:14:51] Peter Grandi i like glusterfs but it may not be the same thing
[10:15:00] Peter Grandi it is the same thing (bricks)
[10:15:57] Queen Mary, U London London, U.K. Lustre is opensource and posix too
[10:16:01] Peter Grandi glusterfs is very different from Lustre, it is more similar to Ceph IIRC.
[10:16:10] Queen Mary, U London London, U.K. Lustre doesn't have replication
[10:16:24] Sam Skipsey Peter's quite right - gluster is very much a "proper resilient distributed filesystem"
[10:16:56] Sam Skipsey Lustre relies on the underlying storage hardware for resilience instead (which is reasonable, in some situations).
[10:20:23] Ewan Mac Mahon It's pretty reasonable in our situation - it's hard to get more than about six disks in a box without needing a specialist controller and they're basically all RAID capable.
[10:20:31] Peter Grandi morbid curiosity about castor
[10:22:04] Sam Skipsey Ewan - but we do get RAID6 arrays failing (and rebuilding a 60TB RAID6 array takes... time.). We're more constrained by the available space and money to buy 2x the storage we need (for example) so we could replicate it.
[10:23:17] Ewan Mac Mahon Indeed, but those constraints are why we need the (more or less) high density RAID boxes.
[10:23:24] Sam Skipsey I agree.
[10:23:26] Peter Grandi Lustre is excellent, including the storage-does-replicas, for its "native" application, which is not necessarily archival of LHC data
[10:24:03] Sam Skipsey Which isn't what T2s should be expected to do, Peter (the fact that we end up doing so for certain dataset types for ATLAS is a bug, not a feature).
[10:24:31] Ewan Mac Mahon Replication is a terrible strategy for redundancy, btw, because it's hopelessly inefficient. We need something that can do RAID5/6 style parity across unreliable servers, not just dumb mirroring.
[10:25:34] Sam Skipsey Well, it also gives improved bandwidth and parallelisation (which is why, I agree, it only really makes sense if you're running code that is data-aware, like MapReduce etc).
[10:26:12] Sam Skipsey Oh, and I note that there was some discussion of ceph doing RAID5 across storage arrays or something equally awesome.
[10:26:36] Ewan Mac Mahon Even our test 'el cheapo' box is "only" about half the price per terabyte of Supermicro RAID servers, so if you double the raw storage requirements you're right back where you started.
[10:26:36] Peter Grandi usual scroll of banishment of parity raid: http://WWW.BAARF.com/
[10:26:41] Sam Skipsey (btrfs can already do RAID5 and RAID6 across different devices, but it's not network-aware, unfortunately)
[10:27:08] Sam Skipsey Ewan: but is it half the reliability?
[10:27:29] Ewan Mac Mahon And I'm not sure about the efficiency/bandwidth either. I rather suspect that a decent machine with a fast RAID controller and (e.g.) 10GbE will beat out a bunch of cheaper machines and replication.
[10:27:31] Sam Skipsey I'm all for keeping the current model, except for the storage tokens used by ATLAS for non-replicated data, which should be stuck on something better.
[10:27:58] Ewan Mac Mahon Obviously, shiny boxes and multiple replicas will beat that, but then we can't afford that.
[10:27:59] Sam Skipsey (the argument for efficiency is that you also run code on the data storage node that has the data it wants)
[10:28:38] Sam Skipsey The something better might be redundant storage using something like gluster/hadoop/ceph/NFS4.1 pNFS etc
[10:29:23] Queen Mary, U London London, U.K. btrfs is a possible replacement for ext4 as the underlying storage for lustre.
[10:29:56] Sam Skipsey I like btrfs, in principle.
[10:30:04] Sam Skipsey (It's the default underlying storage for ceph, too.)
[10:30:41] Wahid Bhimji (or underlying storage for dpm)
[10:31:26] Sam Skipsey (dpm isn't clever enough to take advantage of btrfs in the ways lustre and ceph do, though, Wahid.)
[10:31:55] Sam Skipsey (or rather, it's not low-enough level in its management of data)
[10:33:05] Ewan Mac Mahon "Ewan: but is it half the reliability?" - probably not quite that bad; el cheapo box had (software) RAID5 vs Supermicros at RAID6. Somewhat fewer drives though.
[10:34:32] Wahid Bhimji Yes Jiri Horky
[10:34:43] Govind Songara left
[10:34:48] Wahid Bhimji only works on POSIX filesystems
[10:35:11] Sam Skipsey It's basically just like the blktrace/fio stuff I played with a while back.
[10:35:24] Sam Skipsey never did the full set of tests like they seem to have, though.
[10:35:55] Wahid Bhimji that is tweakable
[10:36:12] Sam Skipsey It's also by design.
[10:36:16] Wahid Bhimji I saw that too for HDFS with atlas jobs - but Brian B told me the tweak
[10:36:29] Sam Skipsey Hadoop is designed on the principle that, by default, you'll be operating in append-only mode.
[10:45:49] Queen Mary, U London London, U.K. I believe OPENSFS.org has a subscription fee of around 1 million/year
[10:46:23] Jens Jensen That's a bit out of our range...
[10:46:49] Ewan Mac Mahon That's only dollars though.
[10:47:14] Jens Jensen Ah, that's OK then
[10:47:39] Peter Grandi the exchange used to be 4 dollars/pound
[10:48:28] Duncan Rand cambridge?
[10:48:33] Ewan Mac Mahon I have lustre on the non-grid systems.
[10:48:53] Wahid Bhimji sussex too - or maybe cambridge do too
[10:48:54] Sam Skipsey We have lustre, but not for anything critical (we use it for our ARC cache)
[10:49:06] Wahid Bhimji that's us (lancs + me) for dpm
[10:49:48] Ewan Mac Mahon Is this the same kit that lancs have now?
[10:51:44] Peter Grandi usual warning about DDN: they use a variant of RAID3
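As a footnote to the replication-versus-RAID exchange in the chat (around 10:24-10:27), the toy calculation below spells out the cost arithmetic behind "double the raw storage requirements and you're right back where you started". The prices and the 10+2 RAID6 layout are illustrative placeholders, not figures from any site.

    # Toy cost arithmetic for the replication-vs-parity discussion in the chat.
    # All numbers are illustrative placeholders, not real site costs.

    def cost_per_usable_tb(price_per_raw_tb, raw_per_usable):
        return price_per_raw_tb * raw_per_usable

    replication_overhead = 2.0    # 2x replication: 2 TB raw per usable TB
    raid6_overhead = 12.0 / 10.0  # RAID6 on a 10+2 array: 1.2 TB raw per usable TB

    cheap_box = 50.0     # hypothetical price per raw TB of an "el cheapo" box
    raid_server = 100.0  # hypothetical price per raw TB of a RAID server (about double)

    print("Replicated cheap boxes: %.0f per usable TB"
          % cost_per_usable_tb(cheap_box, replication_overhead))
    print("RAID6 server:           %.0f per usable TB"
          % cost_per_usable_tb(raid_server, raid6_overhead))
    # With these numbers the results are 100 vs 120 per usable TB: the cheap box's
    # factor-of-two price advantage is mostly eaten by the doubled raw capacity,
    # which is the point Ewan makes in the chat above.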