Minutes for GridPP Storage Meeting, 12 Jan 2011

Chair + Minutes: Sam Skipsey

Attending: John Bland, Wahid Bhimji, David Crooks, Brian Davies, Alessandra Forti, Stephen Jones, Winnie Lacesso, Duncan Rand, Govind Songara, James Thorne, Chris Walker

Action Review:

401 - Clean up Wiki. (Ongoing. Wahid and Sam have discussed some things they might update in the storage section.)

416 - Report to Sam on Areca problems. (Deadline: end of next week. James Thorne is writing something now!)

422 - Check membership of the European Lustre Consortium. (The list is in the Whamcloud discussion thread on the GridPP-Storage mailing list, and is repeated here: the (initial set of) members are FZJ, Bull GmbH, CEA/DAM, DataDirect Networks, the Universities of Zürich (Switzerland) and Paderborn (Germany), Helmholtzzentrum für Schwerionenforschung GmbH, credativ GmbH, T-Platforms, HPCFS, Mellanox, Whamcloud, Leibniz Rechenzentrum (LRZ) and ParTec.) CLOSED

423 - Talk to Martin about the disk stress testing tool. (James Thorne has followed up; the licensing issues were resolved by removing the problematic tool from the set. James will upload it to the GridPP Sysadmin Repository.) CLOSED

424 - Report on the Blue Sky Data Transfer work proposed by JANET. (Jens not present. We remember he did say some things about this last meeting.) CLOSED?

Item 1: DTeam action against the Storage Group to discuss how sites should deal with loss of disk servers.
The action is closed (Sam convinced Jeremy that the existing page on the GridPP wiki covering VO perspectives on data loss is sufficient), but we should discuss whether it really is sufficient. Wahid brought up the case of Lancaster's recent disk issue. One problem raised was the apparent lack of sufficient tools to properly investigate data loss events. The Storage Group consistency checker might be repurposable for this task (Wahid and Sam will discuss this next week). Brian noted that there will always be a tension between the (WLCG) VOs' wish to quickly remove corrupt or lost replicas so they can replace them and the site's wish to properly investigate the cause of the problem to prevent it recurring. Chris W further noted that the tension also exists between the differing depths of replication available to large and small VOs (a small VO is less likely to *have* redundant replicas of its data elsewhere on the Grid). (Brian further noted that this tension even exists within the large VOs - user-generated data, for example, is often unique to a site.)
ACTION: Sam to extend the wiki page to mention issues affecting smaller VOs (although it is a little out of scope for the page itself).
ACTION: Sam and Wahid to bring the consistency tool back into fighting shape, repurposed for local integrity checking of specific disk servers (a rough sketch of this kind of check appears after Item 3 below).

Item 2: Wahid brought up Lancaster's lcg-cr problem, which Matt posted to the group about. Brian and Sam both remember seeing this at other sites historically. It tends to be a transient issue, and Sam and Chris W note that, when they have investigated, it is correlated with BDII timeout issues. Chris W fixed a similar high rate of lcg-cp failures at QMUL by moving to the Imperial BDII when the RAL BDII was being flaky.
ACTION: Sam to reply to the email to reassure Matt (in the meantime, the issue seems to have gone away for Lancaster).

Item 3: DTeam action against the Storage Group to generate official recommendations for disk storage provisioning - for example, ratios of TB of disk to Gb/s of network bandwidth. Wahid, Brian and Sam will discuss this offline. (A worked example of what such a ratio implies follows below.)
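By way of a purely illustrative calculation for Item 3 (the capacity and bandwidth figures here are assumptions for the sake of arithmetic, not a recommendation), the sort of trade-off such a ratio captures is how long it takes to drain or re-replicate a server's data over its network link:

    # Illustrative only: drain time for a given amount of disk behind a given link.
    # The figures are assumptions, not GridPP recommendations.
    capacity_tb = 100        # hypothetical disk capacity, in TB (decimal)
    link_gbps = 10           # hypothetical network bandwidth, in Gb/s

    capacity_bits = capacity_tb * 1e12 * 8   # TB -> bits
    link_bps = link_gbps * 1e9               # Gb/s -> bits per second

    drain_seconds = capacity_bits / link_bps
    print(f"Draining {capacity_tb} TB over {link_gbps} Gb/s takes "
          f"~{drain_seconds / 3600:.1f} hours at line rate")
    # -> roughly 22 hours, ignoring protocol overheads and contention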
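For the Item 1 action on repurposing the consistency tool, a minimal sketch of what a per-disk-server integrity check could look like is below. It assumes a plain-text dump of the replica paths the SE namespace expects on that server (how that dump is produced is left open; the file name and mount point are hypothetical), and simply reports replicas missing from disk and "dark" files on disk that the namespace does not know about:

    # Minimal sketch, assuming 'namespace_dump.txt' lists one expected replica
    # path per line for this disk server, and data lives under DATA_ROOTS.
    # Both names are hypothetical placeholders.
    import os

    NAMESPACE_DUMP = "namespace_dump.txt"   # hypothetical dump from the SE namespace
    DATA_ROOTS = ["/gridstore"]             # hypothetical mount point(s) on the server

    with open(NAMESPACE_DUMP) as f:
        expected = {line.strip() for line in f if line.strip()}

    on_disk = set()
    for root in DATA_ROOTS:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                on_disk.add(os.path.join(dirpath, name))

    missing = expected - on_disk   # namespace thinks these exist; disk disagrees
    dark = on_disk - expected      # on disk but unknown to the namespace

    print(f"{len(missing)} replicas missing from disk")
    print(f"{len(dark)} dark files on disk")
    for path in sorted(missing):
        print("MISSING", path)
    for path in sorted(dark):
        print("DARK", path)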
Item 4: CVMFS Testing.
Current position: the T1 and QMUL both have testing installs of CVMFS for experiment software. While these appear to be very successful, the official position of the Storage Group is to recommend that other sites do NOT install CVMFS until it moves out of its beta testing phase. (Similarly, ATLAS internal policy on CVMFS is not to extend its use until it is moved to a more centrally supported position at CERN.)

AOB:
Brian brought up the possibility of revising our recommended minimum versions for storage middleware again (especially for DPM, which has had a few important releases since our last recommendation). This also relates to StoRM, which Chris W is investigating at QMUL. Duncan noted that Chris is planning to use their new storage acquisition to properly test StoRM 1.6. The group will follow this up in the next meeting.

Everyone is reminded that today's GDB includes the updates on all the Storage Demonstrators.
AGENDA ITEM: Summary of the demonstrator updates next week!

Chat log:
[09:56:06] Wahid Bhimji joined
[09:56:54] Sam Skipsey hi Wahid.
[09:57:33] David Crooks joined
[09:58:55] Wahid Bhimji hello
[09:59:19] James Thorne joined
[09:59:39] James Thorne left
[09:59:40] Alessandra Forti joined
[09:59:45] Duncan Rand joined
[09:59:51] John Bland joined
[10:00:05] James Thorne joined
[10:01:31] Winnie Lacesso joined
[10:02:01] James Thorne Phew, I was just writing our report this morning
[10:03:19] James Thorne No sound rejoining
[10:03:22] James Thorne left
[10:03:22] Queen Mary, U London London, U.K. joined
[10:03:27] James Thorne joined
[10:03:45] Stephen Jones joined
[10:04:48] Govind Songara joined
[10:06:10] Brian Davies joined
[10:16:13] Wahid Bhimji i'll try reconnecting too
[10:16:16] Wahid Bhimji left
[10:16:42] Wahid Bhimji joined
[10:18:14] Wahid Bhimji lancaster was suffering much more from this than other sites for a couple of days this week.
[10:19:36] Wahid Bhimji lancs has only 2% failure rate now - so ... no
[10:20:40] Wahid Bhimji ok bye
[10:20:45] Queen Mary, U London London, U.K. left
[10:24:20] James Thorne As far as I know
[10:24:40] James Thorne I'm only 2 feet form ian.
[10:28:17] Wahid Bhimji thanks sam
[10:28:19] James Thorne Cheers Sam. Bye.
[10:28:21] James Thorne left
[10:28:22] Winnie Lacesso left
[10:28:23] Govind Songara left
[10:28:23] David Crooks left
[10:28:24] Wahid Bhimji left
[10:28:25] Brian Davies left
[10:28:25] Alessandra Forti bye
[10:28:28] John Bland left
[10:28:28] Duncan Rand left
[10:28:30] Alessandra Forti left
[10:28:31] Stephen Jones left