Storage Meeting 20th April 2011

Minutes

(Round table Surgery)

Stephen Jones (Liverpool): everything okay.

Rob Fay (Liverpool): everything okay.

Ewan (Oxford): all fine except for 2 disk servers out of commission while being explored. Have updated the firmware on the machines, which seems to have resolved the problem on one of them. Another machine has started being odd (the same SuperMicro/Adaptec issue). Still running SL4 nodes (inc. head node). Storage will be upgraded in place, which shouldn't be a problem. Ordering more kit: 2/3 PB after ordering.

Initial report on the disk server consistency tools is positive. Need to check a few things manually. Generally positive: they do the two things that were missing from the disk check tools. (These tools will gain optional checksumming capability this week, probably.)

John Bland (Liverpool): some servers out with the Supermicro/Adaptec issue. Different solutions: Western Digital suggest downgrading the disk firmware to help compatibility with the Adaptec backplanes. This seems to maybe be okay after a week or so of testing. Once the storage is online, we can start dealing with ATLAS data.

Lots of activity from t2k. They are putting data into their t2k pool, filling it up, and then spilling into the shared pool and filling that too! We'd very much like a spacetoken for t2k, as we don't want to be horrible to them, but we do want them to not affect others. (Elena noted that they've been invited to use spacetokens, but have little effort.)

The Steve Lloyd tests come in without spacetokens and will cause the same overflow issue for ATLAS (he should be using the SCRATCHDISK token, and Chris will talk to him about this). Chris W threatened to not let Steve Lloyd install stuff outside of space tokens, but discovered that other things (on his Lustre system) are outside them too. Chris created some workaround for this. Wahid will write to Alessandro about spacetokens for his tar files.

Chris Walker (QMUL): Installed StoRM on the Monday (9 days ago).
Crashed after seeming fine: we share the gridmapfile across the nodes, but the user mappings were inconsistent, causing StoRM to be crashed by activity from the CE. Fixed, now okay.

Enabled checksums. Checksums in StoRM are synchronous, which causes backlogs on files as it waits for the checksum to complete after transfer. Upped the concurrent checksum limit and timeout.

Current aircon problems, which will delay storage onlining a bit.

DELL R510s: various tips for upgrading the BIOS. The tool to upgrade the BIOS checks the content of redhat-release, and won't work on non-RHEL machines! (But you can spoof it.) Still need to resolve the 10GB simplex issue.

Ewan: why BIOS upgrade?
Chris: partly because we'd like to update before bringing online. Secondly, 3 machines have rejected a disk, had the disk replaced, and in 2 cases this didn't resolve the problem. It seemed to be related to the slot in one case, not the disk. In another case, swapping disks between slots caused an entirely *different* disk to be rejected. A BIOS upgrade was recommended, but it didn't seem to fix it.

[09:53:11] Peter Grandi joined
[09:56:13] David Crooks joined
[09:59:01] John Bland joined
[10:00:05] Queen Mary, U London London, U.K. joined
[10:00:32] Rob Fay joined
[10:01:43] Ewan Mac Mahon joined
[10:01:45] Peter Grandi left
[10:01:47] Stephen Jones joined
[10:01:51] Peter Grandi joined
[10:03:06] Govind Songara joined
[10:06:01] Elena Korolkova joined
[10:06:47] Wahid Bhimji joined
[10:06:51] Wahid Bhimji: hi sorry
[10:10:05] Wahid Bhimji: does it take a long time to run ...
[10:10:38] Peter Grandi: EPEL should not be a dependency
[10:10:52] Sam Skipsey: Not that long - it's all database queries and metadata on the filesystem.
[10:11:13] Peter Grandi: gLite have decided to use DAG and they say that putting in EPEL as a repo causes conflicts...
[10:11:27] Wahid Bhimji: ok - so it could run as a daily cron and highlight any discrepant files
[10:11:36] Ewan Mac Mahon: gLite is about to cease to exist though, and using DAG is a lunatic idea anyway
[10:12:16] Wahid Bhimji: brilliant testing plan - cheers
[10:18:31] Elena Korolkova: I have advised t2k guys to use spacetokens.
[10:18:55] Elena Korolkova: I think they have lots of problems and very limited manpower.
[10:19:37] Elena Korolkova: If they set spacetokens they need to redistribute the data
[10:20:25] Wahid Bhimji: yes indeed the tar install files are outside
[10:20:41] John Bland: elena: that's the main headache with it, but if they're going to do it then getting it done now at the start could prevent headaches in the future
[10:21:49] Wahid Bhimji: -S
[10:29:41] Rob Fay left
[10:30:30] Peter Grandi: There is a pretty good PowerEdge mailing list with pointers etc.
[10:30:47] Ewan Mac Mahon: ^ And the address of that would be good too.
[10:31:26] Peter Grandi: https://lists.us.dell.com/mailman/listinfo/linux-poweredge/
[10:31:55] Wahid Bhimji: ps I found the yum repo for Svr Admin useful way of getting it
[10:31:56] Wahid Bhimji: http://support.dell.com/support/edocs/software/svradmin/
[10:32:09] Peter Grandi: http://linux.dell.com/wiki/index.php/Repository/hardware
[10:32:13] Wahid Bhimji: perfectly chaired!
[10:32:47] John Bland: t'ra
[10:32:47] Stephen Jones left
[10:32:50] John Bland left
[10:32:50] Elena Korolkova left
[10:32:58] Wahid Bhimji left
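
Note on the consistency check discussed above: the minutes do not describe how the disk server consistency tools work internally, so the following is only a rough sketch of the general idea, under stated assumptions. It assumes the "two things" are a two-way comparison (files recorded in the storage database but absent from disk, and files on disk that the database does not know about), plus the optional checksum verification. The function names, the `catalogue` dictionary standing in for the database query, and the choice of Adler-32 are all hypothetical and not taken from the actual tools.

```python
import zlib
from pathlib import Path


def adler32_of(path, chunk_size=1 << 20):
    """Return the Adler-32 checksum of a file as an 8-digit hex string."""
    value = 1  # standard Adler-32 starting value
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            value = zlib.adler32(chunk, value)
    return f"{value & 0xFFFFFFFF:08x}"


def compare(catalogue, root, verify_checksums=False):
    """Compare catalogue entries against the files found under root.

    catalogue is a hypothetical stand-in for a database query result:
    it maps a path relative to root to an expected checksum string,
    or to None when no checksum is recorded.
    """
    root = Path(root)
    on_disk = {
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    }
    known = set(catalogue)
    report = {
        # recorded in the catalogue but not present on the filesystem
        "missing_on_disk": sorted(known - on_disk),
        # present on the filesystem but unknown to the catalogue
        "not_in_catalogue": sorted(on_disk - known),
        "checksum_mismatch": [],
    }
    if verify_checksums:
        # Reading every file is slow, so this pass is optional.
        for name in sorted(known & on_disk):
            expected = catalogue[name]
            if expected is not None and adler32_of(root / name) != expected:
                report["checksum_mismatch"].append(name)
    return report
```

Without the checksum pass this only touches the file listing and the catalogue, which is consistent with the remark in the chat that the check is "all database queries and metadata on the filesystem"; a daily cron job could run it and report any non-empty lists.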