Present: Jens (chair+mins), Duncan, Luke, Matt, RobC, Sam, Teng, Dan, Winnie, Brian, Steve, Govind

0. Operational blog posts and baseline versions

New blog post from Rob! No operational issues reported.

1. DPM workshop has been announced: https://indico.cern.ch/event/699602/

Sam was planning on going, but it should not necessarily stop others from going: if you work with the DPM developers, it is useful to meet them in person at least once to build a rapport with them. We should also have a think about what we can present - other than a generic "UK status".

2. Agenda for GridPP40: we have a goodish section to cover data topics, so let's make sure we have some interesting stuff to say! (In fact having a précis of the smorgasbord of goodnessy goodness for someone like Pete C to take to the UKT0 meeting in March some time might be useful?)

The agenda slot is two hours, so we could split it into six slots of 20 mins as an initial proposal; if there is something controversial we could have a panel discussion but the discussion today did not support that. Each slot could be done by one person but it would be good to do double or triple acts so more people get to say something, particularly if we can stick to the time.

Current proposals for topics are:
* Current status and evolutions of T2s, particularly if we can get an update from Bham.
* Hadoop - Bristol has one, Dan's been looking at Hadoop for Lustre, and James has/had one at T1.
* xroot proxy cache
* Data lakes (see also next topic), real and imagined. Storage for ML? or infrastructure for ML. CVMFS for experiment data?
* Hybrid infrastructures - BYOC, EUDAT/EGI(?), DiRAC, iRODS, CDMI
* Learning from non-LHC (e.g. climate) and supporting non-LHC (i.e. what can they learn from us, e.g. SKA, UKT0?)

Dan mentioned that iRODS is not currently widely used by physics but there are examples such as HyperK's JP/UK link.
Also EUDAT uses iRODS as the storage node for one of its services (= B2SAFE) with GridFTP as an interface (= B2STAGE). But there is now iRODS for Lustre... and there is a StoRM plugin for CDMI.

There doesn't seem to be a dedicated T1 slot on the agenda yet, so we should incorporate T1 contributions in the above. Conversely, we could also contribute to other topics elsewhere on the agenda, e.g. storage accounting or networking.

3. Technology watch vs buzzword watch.

We regularly talk about "UKT0", i.e. GridPP as a DTZ, and recently Data Lake. Now CERN has a "Data Lake" proposal; I have extended my demo that you caught a brief glimpse of at hepsysman. The PMB tells us to get engaged more, I think. If Pete G joins, he can tell us more (the minutes of the PMBs are sent around but only much later...)

* Andrew suggests the PMB want us to do more tracking of peripheral technologies, like the stuff we regularly talk about as potentially worth exploring but rarely have time to actually do something with (see topics above.)
* CERN proposal for "data lake" supporting SKA. The initial document calls for participation, but it doesn't seem radically different from what we are currently doing with linking data centres (e.g. T2s) through federated storage (either at the protocol level, like dynafed, or linked as in the NorduGrid dCache.)
* CERN proposal is linking data centres (in a performant way), prototyping and understanding different underlying storage platforms, and investigating hierarchical management. Brian points out (see chatlog) there are some thoughts on the analysis side as well. For the storage, however, CERN plan to use EOS, so for GridPP to be involved, our (limited) experiences with EOS may be useful. We should follow up.
* PMB wishes GridPP to be involved, possibly in part because the proposal is to support SKA: and supporting SKA is a Good Thing(tm) because (a) they benefit from our experiences, (b) big data projects should work together, and (c) they will be doing some cool science. Brian has been volunteered with some 10-20% of his time.

4. AOB

NOB

... thanks to Matt for saving the chatlog; due to a pasteo the minuter managed to lose it.

jens: (07/02/2018 10:03) https://indico.cern.ch/event/699602/ https://indico.cern.ch/event/684659/timetable/
Daniel Peter Traynor: (10:08 AM) interesting projects we could look at at QMUL in time for gridp: hadoop for lustre https://github.com/intel-hpdd/lustre-connector-for-hadoop ; irods for lustre https://github.com/irods-contrib/irods_tools_lustre ; with storm plugin for cdmmi https://www.snia.org/cdmi http://italiangrid.github.io/storm/documentation/sysadmin-guide/1.11.12/#cdmistorm
Duncan Rand: (10:18 AM) NDGF
Brian: (10:20 AM) https://indico.cern.ch/event/656157/contributions/2866315/attachments/1594891/2525563/Datalakes_EOS_Workshop_Feb2018.pdf
Lukasz Kreczko: (10:21 AM) More on Data Lake: https://indico.cern.ch/event/702591/contributions/2881715/attachments/1595179/2526156/DOMA-post-CWP.pdf
Brian: (10:22 AM) such as data transfer zones
Duncan Rand: (10:22 AM) https://news.aarnet.edu.au/cerns-new-data-services-for-science-and-collaborations-with-aarnet/
Ste Jones: (10:23 AM) A data lake is one step up from a data puddle, but smaller than a data sea.
Samuel Cadellin Skipsey: (10:24 AM) It turns out that, genuinely, Ste, there's a term "Data Swamp" for a Data Lake with lots of stale data
Lukasz Kreczko: (10:33 AM) and "Big CVMFS"