Notes from Storage EVO meeting 2011-11-23

Present:
Manchester: Alessandra
Edinburgh: Wahid
Glasgow: Sam, David
QMUL: Chris
RAL: Brian, Jens (chair+mins)
RHUL: Govind
Sheffield: Elena
Liverpool: Stephen
Oxford: Ewan
Lancaster: Matt

1. Review of actions

Wiki cleanup - ongoing (supposedly)

DPM moving stuff - needs testing.
- Are T2K actually using space tokens?
- Check with T2K - Jens will mail Jon
- Possibly work in new year planned for T2K

A quick glance at the srmsupportuk page in Savannah revealed the need to tidy it up a bit...

The unbalanced dataset problem is worth following up on again; Sam is looking at his rebalancer again to see if it works with more recent versions of DPM. The new generation of storage has three times the capacity, and rebalancing (which assumes that directory == dataset, which is broadly true for ATLAS) spreads each dataset over the disk servers. Does it change accounting? No, all the disk servers "belong" to the same space token. Non-trivial: there is an occasional need to move datasets manually.

2. Suggestions for tech topics (and tech speakers):
- Cloud storage, Amazon S3, Rackspace, comparing SRM to cloud (Jens had done work on the latter).
- WebDAV - also in the web seminar for DPM, now available online.
- Hardware purchasing (see also the discussion later and the chat window) - if people are buying the same kit, they may be interested in the management tools. Or we can get the person who most recently bought kit to talk about it.
- Aligning partitions on disk - Chris knows someone who claims one can improve performance this way, and will also try it himself.
- Shortage of disks - do we need to run kit for longer?

3. NFS 4 update
Brian is still working on the client setup; the client needs a kernel transplant.

4. GO discussion
You've seen DW's mail to the list. Collectively speaking, we are now happy(er). The user experience was considered paramount; this is now fixed by removing the GridPP endpoints, so transfers will no longer wrongly appear to have succeeded.
A problem is seen with limiting transfers, but this is not specific to GO, because users could move data "manually" with GridFTP anyway. Improvements to the announcement process would also be welcome. SRM support in GO would be interesting: DW spoke to GO at SC (SuperComputing), with input from JJ, about the feasibility of building SRM support.

5. EMI install revisited
Ewan has problems with his MySQL configuration (see mail on list), but it is unclear why, because Sam has an earlier EMI install with no problems - so it may be because Ewan's is a later one. Brian proposes that sites upgrade to gLite 3.2 before they upgrade to EMI - but we may need to understand the problems better. It may help to have SL6 support for DPM disk servers (which Wahid has heard will come in April); Ewan speculates that one could install them now if one bypasses YAIM. Alessandra is on 1.8.2 and has previously steered clear of EMI, but will have a stab at it and report back next week.

[10:01:41] Alessandra Forti joined
[10:01:45] Wahid Bhimji joined
[09:59:23] Wahid Bhimji morning
[10:01:47] Sam Skipsey joined
[10:01:47] Christopher Walker joined
[10:01:49] Brian Davies joined
[10:01:50] Govind Songara joined
[10:02:08] David Crooks joined
[10:02:28] Elena Korolkova joined
[10:03:57] Stephen Jones joined
[10:05:26] Ewan Mac Mahon joined
[10:05:56] Elena Korolkova I think if you give instructions to Jon he will be happy to move data to the space token
[10:06:08] Elena Korolkova I can talk to him
[10:06:57] Jens Jensen https://savannah.cern.ch/task/?group=srmsupportuk
[10:06:59] Elena Korolkova I think he hopes that we can move the data ourselves
[10:07:31] Matthew Doidge joined
[10:07:41] Brian Davies Of course it's easy to check whether they are using STs by checking lcg-stmd at sites that have enabled STs for T2K.
[10:08:42] Wahid Bhimji Test Hadoop with StoRM - not going to do any more on that than what we wrote up from an MSc student last year
[10:09:10] Wahid Bhimji https://savannah.cern.ch/task/?16724 - Nagios monitoring in DPM - there is now a package from DPM which we can use
[10:10:07] Wahid Bhimji Plan for dealing with orphaned files - what does this refer to? ATLAS data is now regularly cleaned up
[10:10:28] Wahid Bhimji "Test dcache" is a funny one.
[10:13:09] Wahid Bhimji Maybe we should add what we are actually working on to it - I will try and do that...
[10:14:02] Wahid Bhimji yes - that would be good - who do we know ...
[10:16:16] Alessandra Forti same hardware for us
[10:16:52] Alessandra Forti although could be 20x3TB rather than 30X20TB depending on the cost.
[10:17:02] Alessandra Forti 30x2TB
[10:17:05] Alessandra Forti of course
[10:17:32] Wahid Bhimji in what, R510s? can you get 30 disks in that
[10:17:47] Wahid Bhimji or you mean the same as you had before - rather than the same as other people
[10:17:50] Alessandra Forti viglen 36bay storage system
[10:17:53] Ewan Mac Mahon No, an R510 is 12 drives.
[10:18:10] Ewan Mac Mahon Though there is a point of interest about those with 3TB drives.
[10:21:53] Wahid Bhimji Very good to know
[10:22:03] Alessandra Forti can people mute?
[10:22:13] Alessandra Forti there is an echo
[10:22:18] Elena Korolkova Matt has a 3TB test disk server in Sheffield
[10:22:38] Ewan Mac Mahon You can, I wouldn't.
[10:22:43] Elena Korolkova with SW RAID as usual
[10:23:03] Wahid Bhimji Ewan - why - just cost effectiveness?
[10:23:26] Ewan Mac Mahon Yes, you don't benefit much on the money, and you have a problem with bandwidth.
[10:24:40] Wahid Bhimji even with 10 Gig (like we are all being encouraged to get) - bandwidth should be OK then - I mean people are OK with their big Viglens...
[10:25:07] Elena Korolkova we estimate 2TB servers are cheaper than 3TB servers (for the same capacity)
[10:26:11] Wahid Bhimji By April I heard
[10:26:22] Elena Korolkova 36x2TB are cheaper than 24x3TB
[10:26:37] Wahid Bhimji - indeed might work sooner ... (or even now - I can ask Ricardo)
[10:26:54] Ewan Mac Mahon Well, we have 10Gbit ethernet on each 12 drive R510. I wouldn't fancy having only one of those per (say) three sets of 12 drives.
[10:27:21] Sam Skipsey Also, we're more concerned about rackspace as well. I forget if R510s are more dense than 36bays, but I don't think they are (in terms of storage/U)
[10:27:30] Ewan Mac Mahon Sussex, by way of contrast, have systems with three sets of 12 drives, but they're connected over QDR InfiniBand, and that's about 36Gbit.
[10:28:09] Ewan Mac Mahon Sam: They're not. They're 12 drives in 2U, so 24 drives in 4U, whereas the SuperMicros get 36 in 4U.
[10:28:14] Wahid Bhimji From Ricardo on SL6:
[10:28:17] Wahid Bhimji the EPEL packaging will bring DPM on SL6 too; we already have a nightly build with it
[10:28:29] Sam Skipsey Right. So, that's a small minus for us against the R510s.
[10:28:37] Sam Skipsey Which is a pity, since they do look nice...
[10:29:00] Ewan Mac Mahon Wahid - will that have YAIM or not (I can't see YAIM making it into Fedora somehow)?
[10:30:44] Wahid Bhimji so SL6 DPM disk will be available (from DPM directly) by the end of the year - though it may not be released through EMI until later...
[10:30:55] Wahid Bhimji re yaim
[10:30:56] Wahid Bhimji not... it will be in a separate repository that you need to enable just for yaim, but we do test yaim in the nightlies of SL6
[10:31:16] Wahid Bhimji so it should work...
[10:31:20] Ewan Mac Mahon Well that sounds fine then, assuming it works. Which we should test.
[10:31:39] Ewan Mac Mahon We're unlikely to have uEFI systems in and running before that.
[10:31:53] Elena Korolkova What is the main problem with big disk servers?
[10:32:10] Wahid Bhimji Indeed .... I guess I'll test it - but I'm not sure I'll run the risk of a server that only works on SL6 just yet....
[10:32:50] Ewan Mac Mahon Elena: Two things really, speed (smaller servers are generally quicker (it's more complicated than that, but it's a fair approximation))
[10:33:46] Ewan Mac Mahon And secondly if you lose one server out of lots of them you can cope. If you lose one humungous disk server out of only a handful, then you're likely stuffed.
[10:34:09] Govind Songara Santnau is also having problems with the EMI disk server install
[10:35:11] Christopher Walker A smaller disk server is probably further from the hardware limits of things like the RAID card, PCI bus etc.
[10:36:34] Ewan Mac Mahon Plus it's worth remembering that the 36 bay ones are sort-of 34 bay ones, in that you have to set them up as two arrays; you can't have a single 36 drive RAID, so you lose four drives to parity.
[10:37:18] Christopher Walker Another advantage was that you could get away with cheaper networking - but perhaps that's not an advantage these days...
[10:37:58] Sam Skipsey Given that Ewan is sticking 10Gbit on his R510s, Chris...
[10:38:44] Alessandra Forti I'm going to put 10Gbit on the data servers whether we get the networking money or not.
[10:39:41] Ewan Mac Mahon And the R510s can make good use of the 10Gbit too - not saturate it, but they're quite capable of pushing a good few Gbit.
[10:40:25] Ewan Mac Mahon They probably could saturate it (not off disk though), but we've seen several Gbit off them in actual use by analysis jobs.
[10:42:27] Wahid Bhimji I can try and install an EMI version
[10:42:34] Wahid Bhimji and import my DB
[10:42:53] Sam Skipsey I need to update svr025 anyway, so I was going to move it to EMI ...
[10:43:15] Ewan Mac Mahon ^ and that would be good too. Just to check whether anyone can install the current EMI
[10:43:42] Christopher Walker I think we'd put 10Gig on R510s whether we get network money or not too.
[10:44:03] Wahid Bhimji I can't make it next week actually - stupid teaching assisting feedback thing
[10:44:11] Alessandra Forti left
[10:44:14] Ewan Mac Mahon You do need to think about switching for the 10Gbit though. Particularly the people that don't already have some.
[10:44:17] Stephen Jones left
[10:44:20] Matthew Doidge left
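The disk server sizing trade-offs raised in the chat (drives per U, capacity lost to parity, network bandwidth shared between drives) can be tabulated with a short script. This is a minimal sketch, not anything presented at the meeting. It assumes RAID6 (two parity drives per array), a single 12-drive array per R510 with 3TB drives, and a 36-bay 4U chassis with 2TB drives split into two arrays (following Ewan's point about losing four drives to parity). The 10Gbit and roughly 36Gbit (QDR InfiniBand) link figures are taken from the discussion. The class, field and variable names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DiskServer:
    """Hypothetical description of one disk server chassis."""
    name: str
    drives: int          # populated drive bays
    drive_tb: float      # capacity per drive in TB
    arrays: int          # number of RAID arrays the drives are split into
    rack_units: int      # chassis height in U
    network_gbit: float  # network bandwidth per chassis in Gbit/s

    def usable_tb(self) -> float:
        # Assumption: every array is RAID6, so two drives per array hold parity.
        return (self.drives - 2 * self.arrays) * self.drive_tb

    def drives_per_u(self) -> float:
        return self.drives / self.rack_units

    def gbit_per_drive(self) -> float:
        # Share of the network link per drive, not measured disk throughput.
        return self.network_gbit / self.drives


r510 = DiskServer("R510, 12x3TB, 10GbE", 12, 3.0, 1, 2, 10.0)
bay36_10g = DiskServer("36-bay, 36x2TB, one 10GbE", 36, 2.0, 2, 4, 10.0)
bay36_ib = DiskServer("36-bay, 36x2TB, QDR InfiniBand", 36, 2.0, 2, 4, 36.0)
SERVERS = [r510, bay36_10g, bay36_ib]

if __name__ == "__main__":
    for s in SERVERS:
        print(f"{s.name:<32} {s.usable_tb():5.1f} TB usable  "
              f"{s.drives_per_u():4.1f} drives/U  "
              f"{s.gbit_per_drive():4.2f} Gbit/s per drive")
```

Under these assumptions, two R510s (24x3TB in 4U, 60TB usable) and one 36-bay chassis (36x2TB in 4U, 64TB usable) end up close in usable capacity. However, a 36-bay box with a single 10Gbit link gives each drive a third of the network share an R510 does, which is the bandwidth concern raised above.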