Minutes of the storage EVO meeting, 7 July 2011

Present:
Glasgow: Sam, David
Edinburgh: Wahid
Lancaster: Matt
Liverpool: John, Stephen
Manchester: Alessandra
RAL/Storage: Brian, Jens (chair + minutes)

1. T2K space tokens?

Can existing files be brought into the space token without physically moving them? Yes, but it requires a downtime, as DPM maintains a copy of its world view in memory. It is a slightly involved process and not something you would want to do more than once, so current efforts focus on ensuring that T2K can access their new data via space tokens, with their existing workflows, before their old data is moved.

2. DPM update (Wahid) and disk pool management revisited

Test suites are being written at CERN, and the logs are being instrumented more carefully, so it will be possible to delve into the timings of file actions. The NFS plugin is "ready" but maybe not performant; it will need testing - one of Wahid's students is looking into it. On the client side, DESY have a patched kernel for SL5.

As regards the old chestnut, managing data on disk pools (and draining in particular): there is code which uses rfcp instead of GridFTP to move files between nodes, but it is not yet in production. The code is in EMI release 1 (DPM 1.8.1, as clarified in the chat below; Wahid has an instance running). Note that, unlike most other upgrades, you will need to upgrade pool nodes as well: there is no compatibility between the two methods of moving data, which in turn implies that your pool nodes must be SL5.

Some people have been reporting problems with this release. It is known that the EMI release does certain things differently from the gLite way, e.g. it uses different uids/gids, does not check the BDII, and has a problem with a cron job path; config files are in different places. It may be best to wait for an officially sanctioned gLite version - there is no "official" gLite release, but at least unofficially there is one :-)

3. Site transfer performance problems debugging?

It is useful to have monitoring (and to be able to run tests), but it is necessary to do it in a standardised way: e.g. using the same packet sizes, etc. (see the sketch after these minutes). There will be a discussion next week at the WLCG workshop - there may be wider work we can report on. Otherwise Brian and Matt can draft a proposal for testing. T2 endpoints may also appear in the sonar tests. Moreover, QMUL and Oxford are currently going through some network-related upgrades (router at QMUL).

4. Any outcome from hepsysman that we can usefully pick up?

Mainly that Peter at Lancaster is running Hadoop, with some issues with FUSE. Would it be useful if Peter and James (from the T1) talked to each other?

5. The "data grid" - scientific data management

Possibly perplexing points: pensively pursuing parallel perspectives. This is about looking again at how users use us as a "data grid", and how we compare to other "data grids" for processing and storing research data. Possible studies include bioscience at Glasgow, and NeISS, also at Glasgow.

6. AOB

Chris reminds people about the Lustre workshop, and to register if they are attending.
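To make the "standardised way" of testing under item 3 concrete, here is a minimal sketch of a wrapper that runs the same iperf measurement against every endpoint with identical settings (TCP window size, number of parallel streams, duration). This is an illustration only, not an agreed procedure: the endpoint names, window size and stream count are hypothetical placeholders, and it assumes iperf (version 2) is installed on the client and that each remote end is running an iperf server.

    # Minimal sketch of a standardised inter-site throughput test (agenda item 3).
    # Assumptions, not agreed anywhere in these minutes: iperf v2 is installed,
    # each remote endpoint runs "iperf -s", and the endpoint names below are
    # hypothetical placeholders rather than real site hosts.
    import subprocess

    ENDPOINTS = ["se01.example-site1.ac.uk", "se01.example-site2.ac.uk"]  # placeholders
    WINDOW = "256K"    # same TCP window size for every site
    STREAMS = "4"      # same number of parallel streams
    DURATION = "30"    # seconds per test

    def run_test(host):
        """Run one iperf client test against 'host' with the common settings."""
        cmd = ["iperf", "-c", host, "-w", WINDOW, "-P", STREAMS,
               "-t", DURATION, "-f", "m"]   # report throughput in Mbit/s
        print("Running:", " ".join(cmd))
        result = subprocess.run(cmd, capture_output=True, text=True)
        print(result.stdout)
        if result.returncode != 0:
            print("Test against", host, "failed:", result.stderr)

    if __name__ == "__main__":
        for host in ENDPOINTS:
            run_test(host)

Recording the full command line alongside each result would let sites compare like with like when Brian and Matt draft the testing proposal.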
Chat log:

[09:57:28] Wahid Bhimji morning
[09:59:12] John Bland joined
[09:59:25] Stephen Jones joined
[10:00:09] Wahid Bhimji sorry I missed you jens - I was only there on thu
[10:01:49] Alessandra Forti joined
[10:03:20] Brian Davies joined
[10:17:04] Pete Gronbech joined
[10:19:02] Alessandra Forti is 1.8.2 out?
[10:19:31] Alessandra Forti it has the corrected info system and the --si tools
[10:19:37] Alessandra Forti options
[10:20:26] Wahid Bhimji No its 1.8.1
[10:20:49] Wahid Bhimji in EMI 1
[10:27:36] Christopher Walker joined
[10:30:23] Christopher Walker Network upgrade for QMUL has left WAN link slightly degraded. Am working on it with the site network team.
[10:37:36] Brian Davies wlcg workshop next week
[10:37:46] Christopher Walker QMUL Lustre workshop next week
[09:59:42] David Crooks joined
[09:59:57] Brian Davies joined
[10:00:15] Matthew Doidge joined
[10:01:16] Martin Bly joined
[10:01:45] Wahid Bhimji joined
[10:02:33] Ewan Mac Mahon joined
[10:02:50] Duncan Rand joined
[10:03:31] Queen Mary, U London London, U.K. joined
[10:04:40] Alessandra Forti joined
[10:06:12] David Crooks Sam's just in the office now
[10:06:40] Stephen Jones joined
[10:06:51] Govind Songara joined
[10:07:28] Peter Grandi joined
[10:11:22] Sam Skipsey joined
[10:12:44] Queen Mary, U London London, U.K. The GridPP Oversight committee is meeting next door - so don't say anything you don't want them to hear...
[10:13:42] Jens Jensen
[10:14:51] Peter Grandi i like glusterfs but it may not be the same thing
[10:15:00] Peter Grandi it is the same thing (bricks)
[10:15:57] Queen Mary, U London London, U.K. Lustre is opensource and posix too
[10:16:01] Peter Grandi glusterfs is very different from Lustre, it is more similar to Ceph IIRC.
[10:16:10] Queen Mary, U London London, U.K. Lustre doesn't have replication
[10:16:24] Sam Skipsey Peter's quite right - gluster is very much a "proper resilient distributed filesystem"
[10:16:56] Sam Skipsey Lustre relies on the underlying storage hardware for resilience instead (which is reasonable, in some situations).
[10:20:23] Ewan Mac Mahon It's pretty reasonable in our situation - it's hard to get more than about six disks in a box without needing a specialist controller and they're basically all RAID capable.
[10:20:31] Peter Grandi morbid curiosity about castor
[10:22:04] Sam Skipsey Ewan - but we do get RAID6 arrays failing (and rebuilding a 60TB RAID6 array takes... time.). We're more constrained by the available space and money to buy 2x the storage we need (for example) so we could replicate it.
[10:23:17] Ewan Mac Mahon Indeed, but those constraints are why we need the (more or less) high density RAID boxes.
[10:23:24] Sam Skipsey I agree.
[10:23:26] Peter Grandi Lustre is excellent, including the storage-does-replicas, for its "native" application, which is not necessarily archival of LHC data
[10:24:03] Sam Skipsey Which isn't what T2s should be expected to do, Peter (the fact that we end up doing so for certain dataset types for ATLAS is a bug, not a feature).
[10:24:31] Ewan Mac Mahon Replication is a terrible strategy for redundancy, btw, because it's hopelessly inefficient. We need something that can do RAID5/6 style parity across unreliable servers, not just dumb mirroring.
[10:25:34] Sam Skipsey Well, it also gives improved bandwidth and parallelisation (which is why, I agree, it only really makes sense if you're running code that is data-aware, like MapReduce etc).
[10:26:12] Sam Skipsey Oh, and I note that there was some discussion of ceph doing RAID5 across storage arrays or something equally awesome.
[10:26:36] Ewan Mac Mahon Even our test 'el cheapo' box is "only" about half the price per terabyte of Supermicro RAID servers, so if you double the raw storage requirements you're right back where you started.
[10:26:36] Peter Grandi usual scroll of banishment of parity raid: http://WWW.BAARF.com/
[10:26:41] Sam Skipsey (btrfs can already do RAID5 and RAID6 across different devices, but it's not network-aware, unfortunately)
[10:27:08] Sam Skipsey Ewan: but is it half the reliability?
[10:27:29] Ewan Mac Mahon And I'm not sure about the efficiency/bandwidth either. I rather suspect that a decent machine with a fast RAID controller and (e.g.) 10GbE will beat out a bunch of cheaper machines and replication.
[10:27:31] Sam Skipsey I'm all for keeping the current model, except for the storage tokens used by ATLAS for non-replicated data, which should be stuck on something better.
[10:27:58] Ewan Mac Mahon Obviously, shiny boxes and multiple replicas will beat that, but then we can't afford that.
[10:27:59] Sam Skipsey (the argument for efficiency is that you also run code on the data storage node that has the data it wants)
[10:28:38] Sam Skipsey The something better might be redundant storage using something like gluster/hadoop/ceph/NFS4.1 pNFS etc
[10:29:23] Queen Mary, U London London, U.K. btrfs is a possible replacement for ext4 as the underlying storage for lustre.
[10:29:56] Sam Skipsey I like btrfs, in principle.
[10:30:04] Sam Skipsey (It's the default underlying storage for ceph, too.)
[10:30:41] Wahid Bhimji (or underlying storage for dpm)
[10:31:26] Sam Skipsey (dpm isn't clever enough to take advantage of btrfs in the ways lustre and ceph do, though, Wahid.)
[10:31:55] Sam Skipsey (or rather, it's not low-enough level in its management of data)
[10:33:05] Ewan Mac Mahon "Ewan: but is it half the reliability?" - probably not quite that bad; el cheapo box had (software) RAID5 vs Supermicros at RAID6. Somewhat fewer drives though.
[10:34:32] Wahid Bhimji Yes Jiri Horky
[10:34:43] Govind Songara left
[10:34:48] Wahid Bhimji only works on POSIX filesystems
[10:35:11] Sam Skipsey It's basically just like the blktrace/fio stuff I played with a while back.
[10:35:24] Sam Skipsey never did the full set of tests like they seem to have, though.
[10:35:55] Wahid Bhimji that is tweakable
[10:36:12] Sam Skipsey It's also by design.
[10:36:16] Wahid Bhimji I saw that too for HDFS with atlas jobs - but Brian B told me the tweak
[10:36:29] Sam Skipsey Hadoop is designed on the principle that, by default, you'll be operating in append-only mode.
[10:45:49] Queen Mary, U London London, U.K. I believe OPENSFS.org has a subscription fee of around 1 million/year
[10:46:23] Jens Jensen That's a bit out of our range...
[10:46:49] Ewan Mac Mahon That's only dollars though.
[10:47:14] Jens Jensen Ah, that's OK then
[10:47:39] Peter Grandi the exchange used to be 4 dollars/pound
[10:48:28] Duncan Rand cambridge?
[10:48:33] Ewan Mac Mahon I have lustre on the non-grid systems.
[10:48:53] Wahid Bhimji sussex too - or maybe cambridge do too
[10:48:54] Sam Skipsey We have lustre, but not for anything critical (we use it for our ARC cache)
[10:49:06] Wahid Bhimji that's us (lancs + me) for dpm
[10:49:48] Ewan Mac Mahon Is this the same kit that lancs have now?
[10:51:44] Peter Grandi usual warning about DDN: they use a variant of RAID3
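As a footnote to the replication-versus-RAID exchange in the chat (around 10:24-10:27), the toy calculation below spells out the cost arithmetic behind "double the raw storage requirements and you're right back where you started". The prices and the 10+2 RAID6 layout are illustrative placeholders, not figures from any site.

    # Toy cost arithmetic for the replication-vs-parity discussion in the chat.
    # All numbers are illustrative placeholders, not real site costs.

    def cost_per_usable_tb(price_per_raw_tb, raw_per_usable):
        return price_per_raw_tb * raw_per_usable

    replication_overhead = 2.0    # 2x replication: 2 TB raw per usable TB
    raid6_overhead = 12.0 / 10.0  # RAID6 on a 10+2 array: 1.2 TB raw per usable TB

    cheap_box = 50.0     # hypothetical price per raw TB of an "el cheapo" box
    raid_server = 100.0  # hypothetical price per raw TB of a RAID server (about double)

    print("Replicated cheap boxes: %.0f per usable TB"
          % cost_per_usable_tb(cheap_box, replication_overhead))
    print("RAID6 server:           %.0f per usable TB"
          % cost_per_usable_tb(raid_server, raid6_overhead))
    # With these numbers the results are 100 vs 120 per usable TB: the cheap box's
    # factor-of-two price advantage is mostly eaten by the doubled raw capacity,
    # which is the point Ewan makes in the chat above.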