diff --git a/content/get-started/new-leads/index.md b/content/get-started/new-leads/index.md index f4d996d3eb..0cf4460b19 100644 --- a/content/get-started/new-leads/index.md +++ b/content/get-started/new-leads/index.md @@ -4,40 +4,84 @@ title: Set up your own community
-There are many communities established within Galaxy. We call them [Special Interest Groups](/community/sig/). We have SIGs based for example on -[**region**](/community/sig/#regional-communities) or [**domain of science**](/community/sig/#communities-of-practice) but -if you feel like you just don't see your own or don't feel identified with them, you can create your own community! +There exist many scientific groups within the Galaxy project. The members of these groups are organized into _communities of practice_ - usually referred to as [Special Interest Groups](/community/sig/) (SIGs). -Below you'll find a suggested path to follow before you can start your own community, a list of Do's and Don'ts and also useful links you should probably check out. +We encourage you to look through the existing [Special Interest Groups](/community/sig/) and find the one which is appropriate for your needs. You can always join an existing SIG and contribute to that community. However, if you don't find a SIG that fits your work, domain or region, or you don't identify with any of them, you can create your own community of practice SIG! -### Path +Are you still wondering whether and why you should start a new SIG? Find some arguments [here](/community/governance/gcb/#why-make-a-sig)! + +### What is _good to know_ before you start building a SIG + +The goal of building a new community is to establish a working environment related to your domain of expertise. In the process of creating and maintaining a SIG, regardless of your research domain, you will need to address the following issues: + +1. Identify the technical challenges you want to tackle. They will be the main focus of your SIG and will determine the major part of its activities. Here is a _non-exhaustive_ list of potential requirements you may want to meet: + + * create relevant tools and software + * define and manage special data formats + * manage big data (specific to your domain?)
+ * share and reproduce the processes and the results of your work + * obtain the optimal computational resources for achieving your goals + +2. Administer your group + + * Set up means of communication (mailing lists, chat channels) + * Set up a group organization and nominate members responsible for the SIG's routines (contact person, tool support person, etc.) + +3. Publish your results and promote your experience in order to: + + * extend and share the acquired expertise + * attract new people into your SIG by identifying other groups who might benefit from your experience + +4. Organise training for everyone interested in your SIG's activities + + * set up a [training network](https://training.galaxyproject.org/) for your domain + * maintain training documentation (slides, hands-on tutorials, automation routines) + +### What is _good to do_ before you start building a SIG #### Learn the basics about Galaxy -First of all, you should start with our dedicated [**Get Started**](/get-started/) page to get a head-start of what Galaxy is about. +First of all, you should start with our dedicated [**Get Started**](https://galaxyproject.org/get-started/) page to get a head start on what Galaxy is about. You'll find all the necessary resources and you could even participate in the [**Galaxy Mentorship Network**](https://galaxy-mentor-network.netlify.app/) to get started with the help of a Galaxy Mentor. +#### Get acquainted with the Galaxy Training Network + +The Galaxy Training Network ([GTN](https://training.galaxyproject.org)) is a collection of tutorials developed and maintained by the worldwide Galaxy community. +It will show you how various SIGs, e.g. **Climate** and **Metabolomics**, carry out their work, prepare training materials and organize their events. + #### Learn about Tool Development -Galaxy has many tools already installed and ready to be used but when creating your community, you'll probably need specific tools for your Galaxy analysis.
Don't worry, see our dedicated page for [**Tool Authors**](/tools/) where you'll find everything you'll need and also reach out to the [**Tools Working Group**](/community/wg/) via Gitter. +Galaxy has many tools already installed and ready to use, but when creating your community you'll probably need specific tools for your Galaxy analyses. Don't worry, see our dedicated page for [**Tool Authors**](/tools/) where you'll find information about tool development. Don't forget to follow the [**Best Practices**](https://galaxy-iuc-standards.readthedocs.io/en/latest/best_practices.html) for writing [**Galaxy Tools**](https://toolshed.g2.bx.psu.edu/). #### Get involved -- Introduce yourself and dive in our [Working Groups](/community/wg/) +- Introduce yourself and dive into our [Special Interest Groups](/community/sig/) - Participate in our [worldwide Galaxy Events](/events/) -- If you want to start your own [Special Interest Group](/community/sig/) we encourage you to read the [Galaxy Community Board](/community/governance/gcb/#creating-a-new-sig) page. -### Do's and Don'ts -
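When you do need to write a tool, it is wrapped for Galaxy in a small XML file. The sketch below shows the general shape of such a wrapper (the tool id, wrapped command and labels are hypothetical, for illustration only; see the Best Practices linked above for the authoritative conventions):

```xml
<tool id="line_count_example" name="Line count (example)" version="0.1.0+galaxy0">
    <!-- Dependencies are resolved via Conda / Biocontainers -->
    <requirements>
        <requirement type="package" version="9.3">coreutils</requirement>
    </requirements>
    <command detect_errors="exit_code"><![CDATA[
        wc -l '$input' > '$output'
    ]]></command>
    <inputs>
        <param name="input" type="data" format="tabular" label="Input table"/>
    </inputs>
    <outputs>
        <data name="output" format="txt" label="Line count of ${input.name}"/>
    </outputs>
    <help><![CDATA[
Counts the lines of the input file. Illustrative example only.
    ]]></help>
</tool>
```

A wrapper like this can be checked locally with `planemo lint` and `planemo test` before being submitted to the ToolShed.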
+### Please read the success stories below! They are full of inspiring examples and solutions! + +| __Special Interest Group__ | __Here is how they did it__ | +|---|---| +| Biodiversity | [the story](https://galaxyproject.org/get-started/new-leads/successful-stories/biodiversity.md) | +| Climate | [the story](https://galaxyproject.org/get-started/new-leads/successful-stories/climate.md) | +| Material Science | [the story](https://galaxyproject.org/get-started/new-leads/successful-stories/materials-science.md) | +| Astrophysics | [the story](https://galaxyproject.org/get-started/new-leads/successful-stories/astrophysics.md) | + + -This curated table was made with the help of Community member's experience in hope to avoid obstacles in your path to create your Community. +### Some useful recommendations -| __DO__ | __DON'T__ |
+ + | __DO__ | __DON'T__ | |---|---| | Be patient! Take the time to learn the basics about Galaxy. | Hurry and dive in straight to create without knowing a lot about Galaxy and its workarounds. | -| Ask if a tool for what you need already exists before starting to develop one. | Get hazy with all your tool's possible needs as there are many installed already that can be close to what you would like.| +| Join the Galaxy [Matrix channel](https://matrix.org/docs/chat_basics/matrix-for-im/) and consult the [Galaxy help pages](https://help.galaxyproject.org/) | Spend hours or days reading and debugging the Galaxy code | | Ask if a tool you need already exists before starting to develop one. | Get hazy with all your tool's possible needs as there are many installed already that can be close to what you would like.| | Request your tool to be reviewed, avoid managing your repo alone! | Be a lone wolf. The main servers may not want to install your tool as they want to make sure it follows the [**standards**](https://galaxy-iuc-standards.readthedocs.io/en/latest/best_practices.html) they have defined for developing tools. | -| When you happen to need a tool: Make the tool -> write the training -> set up an event. Repeat events and take advantage of Galaxy’s infrastructure (Smorgasbord, CoFests, GCC, etc.) to improve your tool ,maximize its use and receive feedback from the community. | Set to create very specific tools and avoid feedback. | +| When you happen to need a tool: Make the tool -> write the training -> set up an event. Repeat events and take advantage of Galaxy’s infrastructure ([Smorgasbord](https://gallantries.github.io/video-library/events/smorgasbord3/), [CoFests](https://galaxyproject.org/events/cofests/), [GCC](https://galaxyproject.org/gcc/), etc.) to improve your tool, maximize its use and receive feedback from the community. | Set out to create very specific tools and avoid feedback. | | Use already established servers.
Asking for a subdomain within the [**usegalaxy.eu**](https://usegalaxy.eu/) server is a great option! |Set up a server all by yourself (consider resources, complexity, security issues and more).| + + diff --git a/content/get-started/new-leads/successful-stories/astrophysics.md b/content/get-started/new-leads/successful-stories/astrophysics.md new file mode 100644 index 0000000000..dbcfb0b97f --- /dev/null +++ b/content/get-started/new-leads/successful-stories/astrophysics.md @@ -0,0 +1,70 @@ + +## Our community onboarding + +### How did we get to know about the Galaxy project and framework and its potential + +Diverse web-based data analysis platforms are reasonably well advanced and accepted in astrophysics. As we continued to develop our own research and infrastructure projects in this area, we got advice from some of our colleagues involved in EOSC about the new EuroScienceGateway project, which in particular advances the Galaxy platform in a broad range of communities. We joined EuroScienceGateway to learn more about Galaxy and make it useful in the astrophysical community. + +Prior to EuroScienceGateway we did not know about Galaxy, and had never heard about it from anyone in the broad astrophysical community. Later we learned that Galaxy was considered a possible “science platform” for SKA, but has not so far been selected (although the potential for adoption remains). + + +### What were our needs / challenges: + + +Each telescope/infrastructure relies on a data reduction pipeline, typically developed and maintained by the telescope collaboration and/or the telescope’s Science Data Center. **The telescope tools typically share few reusable components**, except for some common libraries for manipulating common data formats (such as [astropy](https://github.com/astropy/astropy)). A notable exception is the [HEASOFT](https://heasarc.gsfc.nasa.gov/lheasoft/) package, which includes software for data reduction of most NASA space telescopes.
The situation became more complex as **telescopes became more diverse** over the last decades, with the proliferation of Gravitational Wave and Neutrino observatories, which have very different data reduction techniques and practices. Telescope data reduction is often a **resource-consuming process** and requires dedicated infrastructures. + +Sky objects are more or less the same for all observers, and **telescopes often combine their observations**. Astronomers do not control their subject of study, observations of transient phenomena are often **opportunistic**, and even **small telescopes can make big discoveries**, meaning that **interoperability is relevant** for big and small infrastructures alike. The practice of combining observations from very different telescopes to derive a complete view of astrophysical sources is encompassed within the “**multi-wavelength**” and “**multi-messenger**” astrophysics disciplines. Inter-telescope interoperability is primarily concerned with applying joint analysis techniques (cross-correlation, broad-band modelling, etc.) to high-level scientific products: images, emission spectra, light curves. + +The understanding that astrophysical questions can best be answered by combining diverse data led to the adoption of FAIR practices in astrophysics. To enable them, platforms and portals were developed by big ([ESA](https://datalabs.esa.int/), [NASA](https://heasarc.gsfc.nasa.gov/)) and smaller (in Switzerland alone, [MMODA](https://www.astro.unige.ch/mmoda/), [DACE](https://dace.unige.ch/), [Renku](https://renkulab.io/)) actors. Several EOSC projects, such as ASTERICS-OBELICS and ESCAPE, made important strides in this direction. + +High-level data are usually (although not always) smaller and can be more easily shared. On the other hand, new astrophysical questions occasionally require re-analysis of lower-level data, meaning that low-level analysis also needs to fit into the FAIR paradigm.
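The kind of routine format transformation that common libraries like astropy (mentioned above) provide can be sketched as follows: converting a FITS table into CSV so that generic tabular tools can consume it. This is an illustrative sketch, not one of our actual tools; the file names and helper functions are hypothetical, and astropy must be installed for the conversion itself.

```python
def csv_name_for(fits_path):
    """Derive a CSV output name from a FITS file name (pure helper)."""
    base = fits_path.rsplit(".", 1)[0]
    return base + ".csv"

def fits_table_to_csv(fits_path, hdu=1):
    """Read the table stored in the given HDU of a FITS file and write it as CSV.

    Returns the number of rows converted.
    """
    from astropy.table import Table  # deferred import: astropy is an external dependency
    table = Table.read(fits_path, hdu=hdu)
    out_path = csv_name_for(fits_path)
    table.write(out_path, format="ascii.csv", overwrite=True)
    return len(table)
```

Called as, e.g., `fits_table_to_csv("events.fits")`; the resulting CSV is then consumable by generic tabular tools.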
+ +Since astrophysical data are inherently non-repeatable and unique, both low-level and high-level data need to be preserved in a reusable, long-term way. Astrophysical archives are taking care of this preservation in accordance with suitable standards, in particular those developed by the International Virtual Observatory Alliance ([IVOA](https://www.ivoa.net/)). + +### What were the steps we made: + +Since we were dealing with specialised data formats, it was suggested that we try making Galaxy recognize and visualise them, and [we did this](https://github.com/usegalaxy-eu/galaxy/issues/194), making Galaxy much more familiar to astronomers. + +As an initial tool development attempt, we [took](https://github.com/esg-epfl-apc/tools-astro/tree/main/tools/astropytools) simple transformations of astrophysical data, part of any real analysis workflow, as provided by the [astropy](https://github.com/astropy/astropy) package. These tools also allow converting astrophysical data into formats consumable by many of the Galaxy tools. + + +While considering the next tools to add, we realised that many of our [typical workflows](https://doi.org/10.48550/arXiv.2002.12895) contain **hundreds of different tools, repeated hundreds of thousands of times**. Furthermore, many of these tools are unique and not especially useful outside a single workflow. Instead, **we decided to focus on tools and workflows producing and consuming standard reusable data types** (images, spectra, light curves). This way, the analysis complexity is hidden inside some of the tools. + +While converting tools to the Galaxy format, we noticed that much of the work consists of mechanically mapping astronomical tool annotations onto Galaxy tools. To simplify this process, we decided to make a converter from existing collections of tools into Galaxy.
Examples of such tool collections are [EOSSR](https://pypi.org/project/eossr/), [HEASOFT](https://heasarc.gsfc.nasa.gov/lheasoft/) and [Renku Projects](https://gitlab.renkulab.io/astronomy/mmoda). The first case we took on creates [PRs for tools](https://github.com/esg-epfl-apc/tools-astro/pulls) for workflows in Renkulab. It quickly became apparent that some of our collaborators benefit from a public instance for previewing our tools before delivering them, so we implemented a [small Galaxy instance](https://galaxy.odahub.fr/). + +As of now, we are continuing the implementation of astronomical Galaxy tools, especially those useful in multi-messenger analysis, where interoperability and reusability of workflows are especially crucial. + +While reaching out to various astronomical RIs (e.g. CTA, SKA, ESA) we realized that one of the key concerns of the astro community is making use of large data volumes: getting them into Galaxy and running massively data-parallel workflows. For discovering and selecting data we adopted the **IVOA TAP protocol**, and implemented a **Galaxy tool** making use of it. + +We are also exploring the possibility of using Galaxy as a GUI for particular telescope workflows, in cases when such a GUI is needed and not yet available. + +### What we have achieved given our level of maturity: + +With the features and tools we added to Galaxy, it now provides a much more familiar environment for an astronomer. + +With the prototype versions of several astronomical telescope data reduction tools, we are demonstrating how Galaxy can produce publication-ready astronomical results fitting the real-life needs of astronomers. + +### Our setup (technical file) + +We rely on usegalaxy.eu to make our tools broadly available. +To facilitate live tool review in the Galaxy tool development flow, we set up a small [preview instance](https://galaxy.odahub.fr/).
+ +On the preview instance: local users only (**authentication**), **basic job configuration** and +**storage**. + + +### Problems to solve + +We are still trying to understand how best to use Galaxy with “**Big Data**”: data which is costly to relocate and hence should be treated by reference (following the “**deferred data**” Galaxy concept). We are reaching out to other communities with similar needs. + +We want to make use of our dedicated compute resources, much of which sit within the “Grid” computing paradigm (relying on technologies like DIRAC, ARC, Rucio), fitting well within the plans of ESG's WP4. + +There is a generic difficulty in choosing a suitable degree of detail for a workflow producing a scientific outcome. Since telescope data reduction pipelines are quite unique and have few reusable pieces, we focused on the workflows **combining high-level products**, while **specific telescope workflows are contained within individual tools**. It is to be determined whether a high degree of detail is needed for each individual telescope workflow. + +### Our community outreach + +- Personal contacts + - We are participating in the CTAO and SKAO research infrastructures. + - We have close contacts with ESA and are preparing tools for data reduction of several space telescopes. + + diff --git a/content/get-started/new-leads/successful-stories/biodiversity.md b/content/get-started/new-leads/successful-stories/biodiversity.md new file mode 100644 index 0000000000..4712c85fb8 --- /dev/null +++ b/content/get-started/new-leads/successful-stories/biodiversity.md @@ -0,0 +1,78 @@ + +## Our community onboarding + +### How did we get to know about the Galaxy project and framework and its potential + +Galaxy’s Biodiversity community focuses on one important aspect: the description of existing biodiversity through the sequencing, assembly and annotation of wild species.
+ +This community's first stage, the Galaxy Genome Annotation SIG, was born in 2017 during a GCC meeting, when different members of the Galaxy community met and discovered shared interests in using Galaxy for the genome annotation of various organisms, in particular phages or insects. At the time, some initial versions of the JBrowse and Apollo tools, as well as prokaryote genome annotation tools like Prokka, were already implemented and demonstrated how powerful Galaxy could be in this field. In particular, with Galaxy it was finally possible to execute these tools without the burden of installing and configuring them, and the JBrowse/Apollo tools make it possible to generate and share polished visualisations in a few clicks instead of multiple complex command lines. + +Over the years, this first stage of our community gave birth to several tools, workflows and visualisations that were integrated into Galaxy, and demonstrated that it was a perfect platform to improve the practices in this field, for all sorts of organisms. + +More recently, large-scale genome sequencing projects have emerged throughout the world, under the Earth BioGenome Project (EBP) umbrella (ERGA, BGE, VGP, ATLASEA, …). These projects aim to sequence thousands or even millions of new genomes in the coming years. This is a dramatic change in the volume of data to be treated, and we thought that Galaxy could address the computing challenge while respecting the FAIR and open science principles. This was a second stage for this Biodiversity community, and coincided with the start of the EuroScienceGateway (ESG) project. + +### What were our needs / challenges: + +Outside of Galaxy, a typical genome annotation analysis is performed using complex pipelines (widely available ones like Braker, or developed in-house), composed of many different steps, with tools that are often hard to install and hard to tweak (many options, many different ways to perform model training steps).
Finding and accessing the correct raw genome sequence data can also be a challenge when faced with millions of new genomes. + +This led to hard-to-reproduce results and a lack of standard procedures which could be applied routinely to new genome sequences. + +Finally, there was a great need for standard evaluation tools to compare results, and for good visualisations to interpret them. + + +### What were the steps we made: + +We started with the integration of state-of-the-art tools within Galaxy (e.g. Maker, Funannotate, Braker). As these tools are complex pipelines with huge lists of dependencies, integration was only possible after significant efforts on packaging. As Galaxy relies on the standard Conda and Biocontainers ecosystem, it was a good opportunity to package these tools properly, for Galaxy itself but also for the wider community of people willing to use these tools. + +In parallel to analysis tools, we also worked on the integration of visualisations within Galaxy, and on the integration of Dockerized external web applications (Tripal, JBrowse, Apollo, Chado), making it possible to integrate Galaxy analyses into a larger flow of data treatments. Users could then assemble a genome in Galaxy, perform an initial automatic annotation, load the results into external visualisation applications, set up a project of manual curation of the annotation with Apollo, and finally load the curated data back into Galaxy for the next round of analysis. + +We went on by working on training material: Galaxy proved to be a very powerful platform for training thanks to its Galaxy Training Network. Training material allowed us to train scientists to use these tools on their own data, but was also a great way to promote our work, attract new people to our community and learn more about their needs.
+ +The latest developments target the integration of new-generation annotation tools (Braker3, Helixer, Compleasm, Omark, …), the addition of external EBP data sources, and the creation of standard automated workflows. + +### What we have achieved given our level of maturity: + +At the time of writing this report, we have produced a very complete set of state-of-the-art Galaxy tools for genome annotation, which can be applied to the output of the recently published VGP assembly workflows. + +Assembly data from the VGP and ERGA projects can also be accessed directly from the European Galaxy server, as standard remote data sources. + +Advanced visualisations and manual curation projects can already be created and shared directly from usegalaxy.eu. + +Finally, a collection of training material is already available, allowing any researcher or student to learn how to make use of these developments for their own projects. + +### What we have in mind for the remainder of the project: +Our plans for the remainder of the ESG project are to: + + - Create new standard workflows, ready to be used at large scale on EBP data + - Reference these workflows in standard repositories, including WorkflowHub + - Apply our new workflows to real EBP data + - Rely on the new European Pulsar network, giving access to the computing resources needed for the annotation of thousands of new genomes + - Create new training material making use of our latest developments + +### Our setup (technical file) + +#### Setup +The Biodiversity community relies on UseGalaxy.* servers, and in particular on the UseGalaxy.eu infrastructure. + +#### Authentication + +#### Job configuration + +#### Storage + +#### Reference data + +#### Problems to solve + +The provision of additional CPU and GPU resources within EuroScienceGateway is a must, and should satisfy the compute needs of a vast majority of Galaxy users.
However, the main issue we see for some of the newest computationally demanding tools, such as Helixer on GPUs, is that there are compatibility problems with existing conda packages and related biocontainers. Without going into the details of the hardware and software requirements, and while NVIDIA, AMD and Intel are all major vendors of GPUs, most (if not all) conda packages are only adapted to NVIDIA GPUs: this currently makes it impossible to exploit powerful AMD GPUs, for instance. + +### Ideas/proposals/solutions + + +## Our community outreach + + - Personal contacts + - Publications, conferences, workshops + - The other communities are: + - how similar their needs are + - how difficult it is to satisfy their needs diff --git a/content/get-started/new-leads/successful-stories/climate.md b/content/get-started/new-leads/successful-stories/climate.md new file mode 100644 index 0000000000..6dc041e776 --- /dev/null +++ b/content/get-started/new-leads/successful-stories/climate.md @@ -0,0 +1,109 @@ +## Our community onboarding + +### How did we get to know about the Galaxy project and framework and its potential + +We, representing the Climate community within the EuroScienceGateway project, are a group of engineers and scientists from the University of Oslo in Norway (UiO), originally formed at the Geosciences Department; some of us have since transferred to the IT Department while others left academia.
+The Climate Science Workbench, which predated EuroScienceGateway.
+Landing page of https://climate.usegalaxy.eu/. + +This work was largely presented at conferences, workshops were organised, and a number of climate Galaxy users were taken on board and trained. We therefore had good working knowledge and practice of Galaxy, unlike the astrophysics, catalysis and muon communities. + + ### What were our needs / challenges: + +We quickly realised that the tools we had introduced into Galaxy, because we thought they were what the climate community needed, were not necessarily those that would attract more users. On the other hand, some of these tools, like those to display multi-dimensional data, were also appreciated by users from other disciplines. + +Basically, the climate community is formed of a wide range of people with a large variety of interests and a broad range of technical skills. Some are involved in developing and running large numerical models on supercomputers, typically for Earth System Modelling, or in analysing large amounts of spatiotemporal data, and these users are not afraid of the command-line interface. On the opposite side of the spectrum are field scientists, those going on site, doing measurements in the field or experiments in the laboratory, and these largely prefer a Graphical User Interface. + +Attracting both the computer-literate and those allergic to computing to Galaxy is challenging, because the former are used to high performance, see Galaxy as a mere toy and do not see the need to learn something new, whereas the latter will only come to Galaxy if they get immediate access to applications that they can use without much effort, via a nice GUI. In both cases users expect to be able to achieve more with Galaxy than on their laptop, with more resources and more facilities. For them it is not the number of tools available that matters, but access to a Workflow Management System making it possible to showcase their research and reproduce entire sequences of tasks in an automatic way.
+ + ### What were the steps we made: + +Based on this experience, and considering that there are “enough” generic tools in the Climate Galaxy already, we decided with EuroScienceGateway to focus on providing workflows and examples of applications that go beyond what climate scientists can easily achieve on their own laptop, thereby demonstrating that Galaxy is a perfectly suited place for performing complex tasks: for instance Machine Learning, with all the data preparation, model training and forecasting that is already difficult for most practitioners to carry out in a dedicated framework, or running a fully fledged Earth System Model in Galaxy, taking advantage of the large amount of resources to be made available through the Bring Your Own Compute (BYOC) service introduced by the EuroScienceGateway project. + +This meant that we had to select a number of use cases, touching on topical research and starting from scientific communications with related Jupyter notebooks, and devise a method to split these notebooks into manageable “chunks” which can be converted into Galaxy tools and reassembled into workflows: + + - IceNet is a complex probabilistic, deep learning sea ice forecasting system which learns how sea ice changes from climate simulations and observational data and is able to forecast, up to 6 months in advance, the monthly-averaged sea ice concentration maps at 25km resolution. + - FArLiG is also a machine learning model: it uses meteorological reanalysis data from ERA5-Land at ~10km resolution to detect winter warming periods, and combines it with a detailed satellite-derived vegetation cover classification (initially at 100m resolution) to forecast the concentration of moss and lichen in the Arctic (which is vital for wildlife and local reindeer herders).
+ +For both examples, we needed to sort out the provision of the input data, create new tools where appropriate and reuse existing ones where available, optimise the workflow and make it work as well as the original Jupyter notebooks, all at the click of a button. The idea here is to demonstrate that very intricate data analysis can be made accessible to climate Galaxy users, whereas it would have required significant work for them to achieve it by themselves, and that the entire workflow is reproducible but also reusable for other applications. + + ### What we have achieved given our level of maturity: + +At the time of writing this report, the machine learning part of the IceNet and FArLiG workflows, which forms the heart of these use cases, works; we now have to finalise the packaging, add the data preparation for FArLiG (for the downscaling) and sort out some persistent storage for the inputs. For this purpose we envisage several options, including using the Norwegian Infrastructure for Research Data (NIRD) archive which is being deployed, if individual entries can be accessed in an S3-compatible way, using the Lumi-O object storage partition (although what happens to the data after the end of the allocation period is not yet clear), or Bring Your Own Storage (BYOS) provided this also offers a permanent solution. We also have to think about how, and how often, the data will be updated for future forecasts, and we aim to develop the associated documentation and training material. +
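Whichever S3-compatible option is eventually adopted, fetching an input by reference would look roughly as follows. This is a sketch using boto3; the endpoint, bucket and object key are hypothetical placeholders, not a configured service.

```python
def s3_client_kwargs(endpoint_url, region="us-east-1"):
    """Configuration for an S3-compatible store (e.g. Lumi-O or a NIRD S3 gateway)."""
    return {
        "service_name": "s3",
        "endpoint_url": endpoint_url,  # the non-AWS endpoint is what makes the store "S3-compatible"
        "region_name": region,
    }

def fetch_input(endpoint_url, bucket, key, local_path):
    """Download one input file by reference instead of keeping a local copy."""
    import boto3  # deferred import: boto3 is an external dependency
    client = boto3.client(**s3_client_kwargs(endpoint_url))
    client.download_file(bucket, key, local_path)
```

A tool would call something like `fetch_input("https://s3.example.org", "forecast-inputs", "era5/input.nc", "input.nc")` at the start of a run; credentials would come from the usual environment variables or configuration files.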
+The updated NIRD API, which should now provide S3 links (work in progress). + +To further advertise Galaxy and highlight the potential benefits of Open Science, FAIR principles and reproducible workflows, we regularly communicate with the wider climate community via conferences (for example presenting IceNet and FArLiG at EGU and GCC) and gather feedback. We also tried to raise awareness among scientists in our own community, but also in the humanities, the academic disciplines that study human cultures, including history, philosophy, literature, arts and language, by taking part in a series of workshops where we introduced and illustrated what Open Science and the European Open Science Cloud (EOSC) are, also explaining what is behind the words “Findability, Accessibility, Interoperability, and Reusability” of research data, what Research Objects are and, more generally, how Galaxy portals and the EuroScienceGateway project can be useful to researchers in their day-to-day work and activities [https://zenodo.org/records/10478824](https://zenodo.org/records/10478824).
+One of the presentations made at the Digital Scholarship Days 2024. + +### What we have in mind for the remainder of the project: + +Regarding the possibility of running climate simulations within Galaxy, we already have a [tool for CESM](https://anaconda.org/bioconda/cesm), the Community Earth System Model; however, being built using generic conda packages, it can only be used on a single node (i.e., it lacks the system libraries specific to individual supercomputers that allow inter-node communications and hence the ability to run on a large number of processors). As a result this fantastic tool is not used much, despite offering climate Galaxy users access to a state-of-the-art model to test various scenarios themselves, without having to get hold of the model source code, then install and port it on their own computer. The main reason for that is that running climate simulations able to support sensible analysis requires several model years (in climate science we consider 25-30 years the minimum) and, given the current model throughput (i.e., number of simulated years per computation day) achievable with the limited compute resources available from Galaxy Europe (a couple of months per day at best), this is a no-go. Taking advantage of ESG’s BYOC we hope to leverage thousands of CPUs to make climate simulations in Galaxy practical. Note that this relies on revisiting the entire feedstock to add interconnect-protocol-aware packages, and hence a significant amount of work that we may not be able to finish within the timeframe of this project.
+The Community Earth System Model conda package used in the eponymous Galaxy tool. + +There is also an increased demand for tools related to the “Urban Heat Island” (UHI) effect, which occurs when cities replace the natural land cover and vegetation with a dense concentration of buildings and pavements, whose surfaces and materials absorb and retain heat, as opposed to nearby more rural areas. The UHI effect is normally most noticeable during the night; however, due to climate change it seems to affect territories that were previously unaffected, and is associated with increased energy consumption, elevated cooling demand, reduced air quality, and potential health and security risks for urban residents. We therefore endeavour, if time permits, to add a UHI tool to Galaxy. Its exact form and capabilities remain to be defined, but it has to be easy to use and produce something visually appealing, such as heat maps, to “condense” the information. + + + +### Our setup (technical file) + +#### Setup +UiO Galaxy server: + +Test server for the Climate Galaxy Pulsar node: 16 vCPUs (main Galaxy front end) +Job runner: 16 vCPUs (Pulsar) + +
+This was set up following the Galaxy Server administration tutorials and is intended to evaluate the possible deployment of a Pulsar node on Lumi; it is currently in standby. + + + +#### Authentication + +#### Job configuration + +#### Storage + +#### Reference data + +No satisfactory solution for “persistent” storage of reference data has been implemented yet. Several options will be explored, including the NIRD archive, Lumi-O and BYOS. + + +### Problems to solve + +The provision of additional CPU and GPU resources within EuroScienceGateway is a must, and should satisfy the compute needs of a vast majority of Galaxy users. However, the main issue we see for some computationally demanding tools, such as CESM (climate model) or Machine Learning on GPUs, is compatibility with existing conda packages and related biocontainers. Without going into the details of the hardware and software requirements: while NVIDIA, AMD and Intel are all major vendors of GPUs, most (if not all) conda packages are only adapted to NVIDIA GPUs, which currently makes it impossible to exploit the powerful AMD GPUs on the Lumi-G partition, for instance. + +The other, largely overlooked issue is the inability to run Galaxy jobs spanning multiple nodes over the high-speed interconnects of High-Performance Computers (HPCs) unless the tools themselves are compatible with the transport protocol implemented on the host. This came to light after Lumi’s recent interconnect upgrade to Slingshot-11, which uses OFI (OpenFabrics Interfaces) rather than UCX (Unified Communication X), and it stopped short all efforts to deploy a Pulsar node on Lumi. +Addressing this will require a significant community effort to completely rewrite the feedstock recipes for a number of tools, along with testing and experimenting; otherwise these tools will be restricted to running on a single node on the host (and will deliver extremely poor performance on multi-node applications).
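One way such hardware diversity is commonly handled is to route each job only to an endpoint whose hardware can accommodate it; Galaxy supports dynamic job destinations written in Python. The sketch below is a simplified, hypothetical illustration of that routing logic — the destination ids, capability dictionaries and tool list are all invented for this example and do not describe any actual configuration:

```python
# Hypothetical sketch: pick a compute endpoint whose hardware can run a tool.
# Destination ids, capability dicts and the NVIDIA-only tool set are invented
# for illustration; they are not part of an actual Galaxy job configuration.

NVIDIA_ONLY_TOOLS = {"gpu_ml_trainer"}  # tools whose conda packages need CUDA

def pick_destination(tool_id, destinations):
    """Return the id of the first destination able to accommodate the tool,
    or None if no listed endpoint is suitable.

    `destinations` is a list of (id, capabilities) pairs, e.g.
    [("pulsar_lumi_g", {"gpu": "rocm"}), ("pulsar_cuda", {"gpu": "cuda"})].
    """
    needs_cuda = tool_id in NVIDIA_ONLY_TOOLS
    for dest_id, capabilities in destinations:
        if needs_cuda and capabilities.get("gpu") != "cuda":
            continue  # CUDA-only conda packages cannot use AMD (ROCm) GPUs
        return dest_id
    return None
```

With such metadata in place, a CUDA-only tool would simply skip AMD-GPU Pulsar nodes rather than failing on them.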
+ + +### Ideas/proposals/solutions + +Regarding the hardware/software compatibility issue, we are looking into adapting the biocontainers to work on different platforms with minimal additional changes, and will test this approach in the coming months. Should this not prove achievable, we will have to add relevant metadata to cater for the diversity of Pulsar nodes’ hardware (drivers) and software stacks, and send jobs only to the hosts able to accommodate them. +We may also have to refine the scope of the CESM tool and focus more on training newcomers to climate modelling than on running long simulations on Galaxy. + + + +## Our community outreach + + - Personal contacts + - Publications, conferences, workshops + - Other communities, considering: + - how similar their needs are + - how difficult it is to satisfy their needs + + diff --git a/content/get-started/new-leads/successful-stories/materials-science.md b/content/get-started/new-leads/successful-stories/materials-science.md new file mode 100644 index 0000000000..a1d327947e --- /dev/null +++ b/content/get-started/new-leads/successful-stories/materials-science.md @@ -0,0 +1,119 @@ + + +## Our community onboarding + +### How did we get to know about the Galaxy project and framework and its potential + +We are based in the Scientific Computing Department (SCD) of the Rutherford Appleton Laboratory (RAL), at the Science and Technology Facilities Council (STFC) in the United Kingdom. RAL hosts the UK’s neutron and muon sources, as well as the UK’s synchrotron light source and the Central Laser Facility. + +As part of our role in the EuroScienceGateway (ESG) project, we have been working to engage and onboard two specific materials science communities: the muon science and the catalysis science communities. One of the members of our team had worked in bioinformatics; she knew about the Galaxy platform and its capabilities, and suggested we try it to improve the onboarding of these communities.
+ +The needs of the catalysis community were quite similar to those of the muon community, but a key difference was that we did not develop the base software tools: we worked only to make them easy to use by creating associated Galaxy tools. + + +### What were our needs / challenges: + +For both the muon and catalysis communities, we needed to increase the uptake and distribution of the associated software tools; improve the transparency and reproducibility of the research results obtained by using the tools; and develop a simple GUI for those tools. + +At the beginning of the ESG project, we had already developed command-line tools for modelling muon experiments. As for the catalysis community, the analysis tools already existed. All of these were desktop-based tools that needed to be downloaded and installed on the users’ computers. Moreover, the data analysis required the sequential application of these desktop-based tools, in what was a sort of unstructured workflow. The muon and catalysis communities comprise beamline scientists, domain specialists, experimental scientists and modellers; given this diversity, the transparency and reproducibility of the results depended a lot on each particular user. + +Those were some of the key needs for both communities, and to tackle them we developed Galaxy tools for muons and catalysis and moved the associated workflows into Galaxy. + +One of the advantages of Galaxy for building and running workflows is that the GUIs for all the tools follow the same template, which makes the tools easy to understand and connect into workflows. Galaxy is also quite good in terms of transparency and reproducibility: all of its workflows can be saved in Galaxy and shared or modified at any time, which makes the scientific work associated with a workflow more transparent and reproducible.
+ +All these Galaxy capabilities have been quite helpful for engaging and onboarding the muon and catalysis communities, because Galaxy has: + +- provided sustainable and user-friendly software tools that have improved the interpretation of the associated experiments; +- provided web-based tools that are easy to distribute among the community; +- provided the Galaxy Training Network (GTN) infrastructure that we are using to give tutorials about our tools. + + +### What were the steps we took: + +In the case of the muon community, we lead the Muon Spectroscopy Computational Project (MSCP), where we develop software tools for the interpretation of muon experiments. The MSCP started in 2017 and, by the time we decided to use Galaxy, we already knew the muon community quite well. What we needed to do for this community at the beginning of the ESG project was to improve the use of computational modelling in muon experiments. + +Galaxy has helped us to improve the use of modelling in muon science and, in the process of developing our Galaxy tools for muons, we built Galaxy expertise that we have since extended to other materials science communities; in particular the catalysis science community, for which we started to develop Galaxy tools for processing results from the x-ray absorption spectroscopy experiments used in catalysis. + +At the time of writing this report, we are in the process of building the Galaxy catalysis community. An important advantage we have here is that the ESG team is based at the Rutherford Appleton Lab, where the catalysis experiments are done. + +### What have we achieved given our level of maturity: + +#### Muon community: + +Muons are subatomic particles that are generated in target stations at the Rutherford Appleton Lab and then fired into the samples that we want to study.
A particular property of the muon is that it “stops” at a particular site inside the sample, and knowing this stopping site is very useful for interpreting muon experiments. The MSCP developed a method for finding this stopping site, and we have implemented this method in the Galaxy platform. + +We have also developed a tutorial explaining this method that is now part of the GTN, and members of the international muon community are using some of our tools. + +We are also working on bringing into the Galaxy platform a new command-line tool that we have developed for the muon community. + +#### Catalysis community: + +The challenge we are tackling here is the interpretation of the x-ray absorption spectroscopy (XAS) experiments used in catalysis. XAS experiments for catalysis generate data that is processed by several software tools connected in workflows. These workflows are quite complex, and we have used Galaxy to improve their transparency and reproducibility. + +The workflows that we are processing using Galaxy cover: +- (a) the processing and normalization of the raw data coming from the XAS instruments, and +- (b) its subsequent analysis using a set of well-established software tools. + +Currently, most of the catalysis community performs (b) using a set of desktop-based tools, with names such as DAWN, ATHENA, ARTEMIS and FEFF, which can be downloaded with the DEMETER package. These tools are connected in the workflow shown in the flow diagram below: + +
+ +To move this workflow into Galaxy, we have broken up the process into four associated prototype Galaxy tools that can be linked into an equivalent workflow. The picture below shows the names of the tools and which part of the workflow each tool represents. For instance, Larch Athena is used for processing and normalizing the raw data coming from the instrument. + +
+ +We are now testing these tools and the workflow by trying to reproduce a set of published XAS results for catalysis. + +### Our setup (technical file) + +#### Setup +Galaxy server: 4-core VM, CentOS 7 (dev and prod) +Job runner: 12-core VM, Rocky 8 (dev and prod) +Monitoring: 2-core VM, Rocky 8 (dev and prod) +Storage: 2-core VM, Rocky 8 (shared between dev and prod) + +#### Authentication +Local authentication currently in use. LDAP in use for other services onsite (federated). + +#### Job configuration +HPC (Slurm) exists onsite; we have considered integration but are not actively working towards this yet. + +All jobs use Slurm, with the “Galaxy server” acting as control and execution node. The “Job runner” is prioritized in Slurm (jobs will only run on the server if the runner queue is full). Job allocation is static (more cores are allocated for some tools). + +All jobs currently run within the Galaxy instance, as the galaxy user. + +#### Storage +Data uploaded to Galaxy lives on the storage VM and is NFS-mounted on the server and runner. + +#### Reference data +No use of reference data. + + +### Problems to solve + +Current muon workflows are relatively lightweight simulations starting from small plaintext files, so we have not yet needed to use HPC or transfer large amounts of data. However, X-ray use cases are likely to involve larger volumes of HDF5 data, so this is something we will need a solution for at some point. Some problems or work we anticipate (many are related): + +- Integration with existing HPC resources onsite + +- Implementing authentication using the existing LDAP service + - Likely a prerequisite for using HPC resources + +- Data transfer + - X-ray data is archived on tape, neutron/muon data on disk. We run a web application for browsing and restoring data, accessible via HTTPS (short lived), Globus transfer, restoring to our HPC system (since disabled due to lack of use) or direct to the facility.
This should be integrated with Galaxy somehow. + - No matter where the data ends up, there are always restrictions on how long it lives there, so we might have to duplicate data into Galaxy rather than using the Rucio-style approach? + +### Ideas/proposals/solutions +The ideas we’ve had for addressing some of our problems (we have not actioned any of them yet; they may not be feasible, or other options might be better): + +- Use Pulsar to integrate with HPC + - Get a node inside the existing Slurm network running Pulsar and send jobs there (probably using the “submit as user” option - dependent on setting up proper auth) + - We have not yet spoken with the people running our HPC service, so it is not clear what hoops we’d need to jump through + - There is precedent for similar projects in the past (restoring data to HPC, creating accounts and permissioning as needed) +- Set up LDAP auth + - Hopefully this just works out of the box + - But we have also considered whether we might need some manual configuration of Galaxy user roles etc. in order to allow both internal and external users while only allowing the internal users to access HPC +- Set up a data/filesource tool to integrate with our archive + - ideally it just redirects users to the archive UI; they log in separately, browse and select the data, and it is sent to (or registered in) Galaxy + - in cases where the data needs to be restored from tape, this might be more complicated - we don’t want to make the user wait for restoration, so maybe make restoration a tool that runs in the workflow? But then it would need to authenticate to the archive service - not sure what best practice would be.
+ - alternatively, rather than Galaxy pulling the data, could push it from the archive UI (might be easier since we directly develop that, so can tailor it to our needs) + diff --git a/content/images/successful-stories/climate1.png b/content/images/successful-stories/climate1.png new file mode 100644 index 0000000000..601a3354a4 Binary files /dev/null and b/content/images/successful-stories/climate1.png differ diff --git a/content/images/successful-stories/climate2.png b/content/images/successful-stories/climate2.png new file mode 100644 index 0000000000..df7cedefbb Binary files /dev/null and b/content/images/successful-stories/climate2.png differ diff --git a/content/images/successful-stories/climate3.png b/content/images/successful-stories/climate3.png new file mode 100644 index 0000000000..cebfefdd5e Binary files /dev/null and b/content/images/successful-stories/climate3.png differ diff --git a/content/images/successful-stories/climate4.png b/content/images/successful-stories/climate4.png new file mode 100644 index 0000000000..b0b3ec420d Binary files /dev/null and b/content/images/successful-stories/climate4.png differ diff --git a/content/images/successful-stories/climate5.png b/content/images/successful-stories/climate5.png new file mode 100644 index 0000000000..4580f17493 Binary files /dev/null and b/content/images/successful-stories/climate5.png differ diff --git a/content/images/successful-stories/climate6.png b/content/images/successful-stories/climate6.png new file mode 100644 index 0000000000..23db7bd1c4 Binary files /dev/null and b/content/images/successful-stories/climate6.png differ diff --git a/content/images/successful-stories/materials-science1.png b/content/images/successful-stories/materials-science1.png new file mode 100644 index 0000000000..9b7b980fa0 Binary files /dev/null and b/content/images/successful-stories/materials-science1.png differ diff --git a/content/images/successful-stories/materials-science2.png 
b/content/images/successful-stories/materials-science2.png new file mode 100644 index 0000000000..4fda12fa14 Binary files /dev/null and b/content/images/successful-stories/materials-science2.png differ