From a5f71ff8adf1188d6047bb918b184eeabb813a06 Mon Sep 17 00:00:00 2001 From: Quarto GHA Workflow Runner Date: Mon, 18 Mar 2024 06:18:50 +0000 Subject: [PATCH] Built site for gh-pages --- .nojekyll | 2 +- .well-known/security.txt | 10 - about.html | 2 + architecture.html | 132 ++-- get-started/automate-data-ingestion.html | 2 +- get-started/index.html | 11 +- index.html | 2 +- index.yaml | 366 ----------- install-magasin.sh | 789 ----------------------- search.json | 114 ++-- sitemap.xml | 68 +- uninstall-magasin.sh | 258 -------- why-magasin.html | 72 +-- 13 files changed, 175 insertions(+), 1653 deletions(-) delete mode 100644 .well-known/security.txt delete mode 100644 index.yaml delete mode 100755 install-magasin.sh delete mode 100755 uninstall-magasin.sh diff --git a/.nojekyll b/.nojekyll index d99fb4a2..bbcc53bd 100644 --- a/.nojekyll +++ b/.nojekyll @@ -1 +1 @@ -9354ee7a \ No newline at end of file +3d9b7359 \ No newline at end of file diff --git a/.well-known/security.txt b/.well-known/security.txt deleted file mode 100644 index ac13938a..00000000 --- a/.well-known/security.txt +++ /dev/null @@ -1,10 +0,0 @@ -# UNICEF security contacts and policy - -# Our security contact channels -Contact: mailto: d _ c _ o _ e _ @ _ u _ n _ i _ c _ e _ f _ . _ o _ r _ g (remove the _ and the spaces) - -# Link to our vulnerability disclosure policy -Policy: https://unicef.github.com/magasin/contributing/vulnerability-disclosure.html - -# Languages that our team speaks and understands -Preferred-Languages: en-US \ No newline at end of file diff --git a/about.html b/about.html index dc76aa09..82061386 100644 --- a/about.html +++ b/about.html @@ -370,6 +370,7 @@

About

1 About Magasin

Magasin (pronounced /ma.ɡa.zɛ̃/) is an end-to-end collection of open-source tools that are stitched together to provide enterprise-ready automated ingestion, transformation, and storage, as well as the enabling infrastructure and tools to perform data analysis, visualization of datasets, and ML/AI.

+

You can learn more about why we created magasin and its architecture.

2 Brief history

@@ -377,6 +378,7 @@

Also informed by experiences of the Magic Box applied data science initiative, a range of components were evaluated and trialed with country offices to arrive at the current components and architecture.

In 2021, ICTD engaged one of its long term agreement vendors to assist with evolving the proof-of-concept into a minimum viable product (MVP).

In 2023, UNICEF started the journey to detach magasin from its organizational and cloud infrastructure dependencies, and release it as an open-source platform with the goal of becoming a Digital Public Good.

+

On February 2nd, 2024, UNICEF released magasin as an open-source solution. This release included an easy installation process as well as getting-started documentation.

diff --git a/architecture.html b/architecture.html index 321e5ecd..7668b061 100644 --- a/architecture.html +++ b/architecture.html @@ -343,27 +343,18 @@

On this page

@@ -391,69 +382,80 @@

Architecture

Magasin is a scalable end-to-end data platform based on open-source components that is natively run in a Kubernetes cluster.

By end-to-end we mean a data processing pipeline that covers ingesting raw data from multiple data sources, transforming the data, running analyses on the processed data, storing it in a cloud or local filesystem, and enabling visualization.

-

Kubernetes is a container orchestration system designed to automate the deployment, scaling, and management of containerized applications. It is an integral part of services offered by major cloud providers. Kubernetes, being open source, can also be set up on-premises. For testing purposes, it is even possible to install it on a desktop computer.

-

Magasin uses Kubernetes in combination with Helm, a package manager for Kubernetes applications. Helm is the equivalent to apt, pip, npm, pacman, snap, conda. Using Helm, users specify the configuration of required Kubernetes resources to deploy magasin through a values file or command-line overrides. A package in helm is called a chart.

-

A fundamental contrast between magasin and other helm-based Kubernetes applications lies in their architectural approach. Typically, an application is characterized by a sole root helm chart governing all deployment rules. However, in magasin, each component operates as an autonomous helm chart. This design choice enables the establishment of a loosely-coupled architecture among its components. Rather than mandating a rigid structure for the entire architecture, magasin embraces a more open and modular approach, fostering flexibility in component selection and integration.

-

The core components of magasin are independent mature open source projects that support.

+

Kubernetes is a container orchestration system designed to automate the deployment, scaling, and management of containerized applications. It is an integral part of services offered by major cloud providers. Kubernetes, being open source, can also be set up on-premises. For testing purposes, it is even possible to install it on a desktop computer. By using Kubernetes, magasin ensures the scalability of the solution.

+

Magasin uses Kubernetes in combination with Helm, a package manager for Kubernetes applications that eases the initial setup of the different components of magasin. Helm is the equivalent of apt, pip, npm, pacman, snap, or conda. Using Helm, users specify the configuration of the required Kubernetes resources to deploy magasin through a values file or command-line overrides.
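For illustration, adding the magasin helm repository and installing a single chart with a custom values file could look like the following sketch; the repository URL matches the installer's default, while the chart name, namespace, and values file are example choices:

# Add the magasin helm repository (the installer's default URL) and refresh the index
helm repo add magasin https://unicef.github.io/magasin/
helm repo update
# Install one component, overriding its defaults with a local values file (names are illustrative)
helm install drill magasin/drill -f drill.yaml --namespace magasin-drill --create-namespace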

+

The main components of the magasin architecture are independent, mature open-source projects. They are loosely coupled and glued together through a set of conventions and tools.

+
+

1 magasin is a loosely-coupled architecture

+

Magasin identifies the set of needs for setting up an end-to-end data platform that can scale, and provides a set of open-source components and standard underlying technologies (Helm, Kubernetes) that can be used to address those needs.

+

However, organizations may already have some elements of the architecture in place. For example, an organization may already have a preferred data visualization platform such as Microsoft Power BI or Tableau, and switching to another visualization tool may entail costs beyond the licenses themselves. Magasin has organizations covered there: none of its components is strictly required for an implementation.

+

This is achieved by using Helm's packaging system in a slightly different way. Generally, a single helm chart includes all the components, which results in a more opinionated way of deploying a particular application. In magasin, each component operates as an autonomous helm chart. This design choice establishes a loosely-coupled architecture among its components and allows you to install each component independently. Therefore, rather than mandating a rigid structure for the entire architecture, magasin embraces a more open and adaptable approach, fostering flexibility in component selection and integration.
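For example, because every chart lives in its own namespace, a single component can be inspected or removed without touching the rest. A brief sketch (the namespace names follow the installer's realm-chart convention and are illustrative):

# See which magasin components are deployed and where
helm list --all-namespaces
# Remove just the dashboards component; the other components keep running
helm uninstall superset --namespace magasin-superset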

+
+
+

2 Magasin’s components

+

Magasin is built on top of a set of mature open-source projects that form a base architecture for deploying an end-to-end data platform.

-
Magasin Architecture
+
Magasin architecture
-
-

1 Underlying technologies

-
-

1.1 Kubernetes containerization

-

Kubernetes is a container orchestration system designed to automate the deployment, scaling, and management of containerized applications. It is an integral part of services offered by major cloud providers. Kubernetes, being open source, can also be set up on-premises. For testing purposes, it is even possible to install it on a desktop computer.

-
-
-

1.2 Helm charts

-

Magasin uses Kubernetes in combination with Helm, a package manager for Kubernetes applications. Helm is the equivalent to apt, pip, npm, pacman, snap, conda, etc. Using Helm, users specify the configuration of required Kubernetes resources to deploy magasin through a values file or command-line overrides. A package in helm is called chart.

-
-
-

1.3 Loosely-coupled architecture

-

A fundamental contrast between magasin and other helm-based Kubernetes applications lies in their architectural approach. Typically, an application is characterized by a sole root helm chart governing all deployment rules. However, in magasin, each component operates as an autonomous helm chart. This design choice enables the establishment of a loosely-coupled architecture among its components. Rather than mandating a rigid structure for the entire architecture, magasin embraces a more open and adaptable approach, fostering flexibility in component selection and integration.

+
+

2.1 Ingestion: Dagster

+

The Dagster framework is the primary tool for orchestration of data pipelines for ingestion, transformation, analysis, and machine learning. Each pipeline is isolated and encapsulated in its own container, so different tasks may utilize different versions of a library.

+

In addition to pipeline isolation, it provides several advantages (a minimal pipeline sketch follows this list):

+
  1. A user interface that provides visibility of pipelines’ tasks, scheduling, run status, debugging messages, materialized assets, resources, and modes.
  2. Dagster pipelines are written in Python, which is a very familiar language for data scientists and data engineers.
  3. A framework for creating pipelines that scale. Early in the data ingestion process, pipelines tend to be simple and straightforward; over time, as different sources are mixed, requiring more complex cleaning and the harmonization of identifiers, pipelines become prone to turning into spaghetti code. This limits maintainability, the discoverability of issues, and the efficiency of introducing new changes. With Dagster, you get a way of building pipelines that are more structured and easier to maintain.
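For a flavor of what a pipeline looks like, here is a minimal software-defined asset taken from the getting-started tutorial: a single function, decorated with @asset, that downloads the DPG registry and returns it as a DataFrame.

import requests
import pandas as pd
from pandas import DataFrame
from dagster import asset

@asset
def raw_dpgs() -> DataFrame:
    """DPGs data from the Digital Public Goods API."""
    dpgs_json_dict = requests.get("https://api.digitalpublicgoods.net/dpgs").json()
    return pd.DataFrame.from_dict(dpgs_json_dict)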
-
-

2 Core components

-
-

2.1 Ingestion: Dagster

-

The Dagster framework is the primary tool for orchestration of data pipelines for ingestion, transformation, analysis, and machine learning. Each pipeline is isolated and encapsulated, so different tasks may utilize different versions of the same library, for example, and each pipeline run is executed in a short-lived pod on a Kubernetes cluster.

-
-

2.1.1 Dagit

-

Dagster’s Dagit UI provides visibility of pipelines’ tasks, scheduling, run status, materialized assets, resources, and modes.

+
+

3 Store: A file based approach

+

In the magasin architecture, the general approach is to store data assets as files. In particular, we recommend using the Apache Parquet file format.

+

The main reasons to use a file-based approach are:

+
  1. It is an economical way to store data: storage services, in the cloud or on premises, are relatively cheap.
  2. It provides more flexibility when changes to the underlying structure are introduced, at least compared with setting up a SQL database downstream.
  3. It also makes it easy to store other types of data, such as documents or images.
  4. In terms of governance and sharing the datasets, the problem is simplified to setting up file-sharing permissions.
+

To support this file-based approach, two components are introduced in the architecture. The first is MinIO, which gives magasin a storage layer that is agnostic to the cloud provider. The second is Apache Drill, a SQL query engine that eases the extraction of insights from the files.

+
+

3.0.1 MinIO: A cloud agnostic approach

+

Magasin can be installed in any cloud provider or on premises. However, each cloud provider has a different service to store data. In order to provide a consistent way of storing data, MinIO, a high-performance object storage system designed for cloud-native and containerized applications, has been included as part of magasin’s standard list of components.

+

Founded in 2014, MinIO offers an S3-compatible API, enabling seamless integration with existing cloud storage ecosystems. It is known for its simplicity, scalability, and speed, making it a popular choice for organizations seeking efficient data storage solutions. MinIO’s architecture is optimized for modern data workloads, leveraging erasure coding and distributed techniques to ensure data resilience and high availability. With its lightweight footprint and easy deployment on standard hardware, MinIO empowers developers to build scalable storage infrastructures tailored to their specific needs, whether for on-premises, hybrid, or multi-cloud environments.
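As a sketch of how an “analyst-ready” dataset could end up in MinIO as a Parquet file — the endpoint, credentials, and bucket below are placeholders rather than magasin defaults, and writing Parquet requires pyarrow:

import pandas as pd
from minio import Minio  # MinIO Python client

# A toy analyst-ready dataset, written as a Parquet file on local disk
df = pd.DataFrame({"name": ["dpg-a", "dpg-b"], "deploymentCountries": ["KE", "BR"]})
df.to_parquet("deployment_countries.parquet")

# Placeholder endpoint and credentials: point the client at your MinIO tenant
client = Minio("minio.example.org:9000", access_key="ACCESS_KEY", secret_key="SECRET_KEY", secure=False)
if not client.bucket_exists("magasin-data"):
    client.make_bucket("magasin-data")
client.fput_object("magasin-data", "dpga/deployment_countries.parquet", "deployment_countries.parquet")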

+

Whereas MinIO comes out of the box to provide a uniform, cloud-agnostic storage layer, some organizations may prefer to use a cloud-native storage service such as Azure Blob Storage, S3 buckets, or Google Cloud Storage.

-
-
-

2.2 Cloud storage: MinIO

-

MinIO is an open-source, high-performance object storage system designed for cloud-native and containerized applications. Founded in 2014, MinIO offers an S3-compatible API, enabling seamless integration with existing cloud storage ecosystems. It is known for its simplicity, scalability, and speed, making it a popular choice for organizations seeking efficient data storage solutions. MinIO’s architecture is optimized for modern data workloads, leveraging erasure coding and distributed techniques to ensure data resilience and high availability. With its lightweight footprint and easy deployment on standard hardware, MinIO empowers developers to build scalable storage infrastructures tailored to their specific needs, whether for on-premises, hybrid, or multi-cloud environments.

-
-
-

2.3 Query engine: Apache Drill

-

Apache Drill is an open-source, schema-free query engine that provides a SQL interface to a wide range of non-relational datastores, such as NoSQL databases and collections of files such as JSON, CSV, ESRI shapefiles, SPSS & SAS formats, Parquet, and others.

+
+

3.0.2 Query engine: Apache Drill

+

The last piece of the file-based approach is Apache Drill. Apache Drill is an open-source, schema-free query engine that provides a SQL interface to a wide range of non-relational datastores, such as NoSQL databases and collections of files in formats like JSON, CSV, ESRI shapefiles, SPSS & SAS, Parquet, and others.

While data marts for specific business functions or locations traditionally require hosting and maintenance of a relational database on a server or virtual machine, Apache Drill enables comparable functionality without need for running and hosting a database or maintaining schema changes from source systems over time.

-

Instead, a Dagster ingestion and transformation pipeline stores an ‘analyst-ready’ dataset that Apache Drill can query directly.

-
-

2.4 Dashboards: Apache Superset

+
+

3.1 Visualization: Apache Superset

Apache Superset is an open-source business intelligence product with comprehensive charting, dashboarding, and querying capabilities.

+

Apache Superset consumes SQL data sources, so Apache Drill is a nice companion for Superset to work with magasin’s file-based approach.

-
-

2.5 Notebook environment: Daskhub

-

Daskhub is a Helm chart to easily install JupyterHub and Dask Gateway for multiple users on a Kubernetes cluster.

-
-

2.5.1 JupyterHub

-

The multi-tenant JupyterHub component creates on-demand, isolated pods for authenticated users, each with persistent storage for their R and Python notebook workspace.

+
+

3.2 Analysis: Notebook environment JupyterHub

+

A Jupyter notebook is an open-source tool that allows you to create and share documents that contain live code, equations, visualizations, and narrative text. It combines code execution, rich text, mathematics, plots, and rich media into a single document. Notebooks are widely used for data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.

+

The advantages of using Jupyter Notebooks are numerous. They provide an interactive computing environment that promotes exploratory analysis, making them ideal for data science projects. Notebooks can be easily shared, promoting reproducible research and facilitating collaboration. They also allow for inline plotting, providing immediate visual feedback on the data being analyzed. Typical uses of Jupyter Notebooks include data analysis and visualization, machine learning, statistical modeling, and educational purposes. For instance, a data scientist might use a Jupyter Notebook to clean a dataset, perform exploratory data analysis, build a machine learning model, and then visualize the results, all within a single, cohesive document.

+

The multi-tenant JupyterHub component creates on-demand, isolated containers for authenticated users, each with persistent storage for their R and Python notebook workspace.

-
-

2.5.2 Dask Gateway

-

Dask Gateway allows easy utilization of a Dask cluster from notebook environments for distributed computation of massive datasets or parallelizable operations.

- - +
+

3.3 Parallel computing: Dask Gateway

+

A Dask cluster is a flexible tool for parallel computing in Python. It allows you to write Python code that runs on multiple machines, taking advantage of all the compute resources of the Kubernetes cluster in which magasin is installed.

+

It is composed of a central scheduler and multiple distributed workers. Dask works well at many scales, ranging from a single machine to clusters of many machines, enabling parallel processing and extending the size of convenient datasets from “fits in memory” to “fits on disk”.

+

Dask Gateway allows easy utilization of a Dask cluster from notebook environments for distributed computation on massive datasets or parallelizable operations.
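A minimal sketch of using Dask Gateway from a notebook; it assumes the gateway address and authentication are already preconfigured in the JupyterHub environment, which is how Daskhub is typically set up:

from dask_gateway import Gateway
import dask.array as da

gateway = Gateway()              # uses the address/auth preconfigured in the notebook environment
cluster = gateway.new_cluster()  # request a new Dask cluster from the gateway
cluster.scale(4)                 # ask for four workers
client = cluster.get_client()    # connect this notebook session to the cluster

# A toy distributed computation: the mean of a large random array, split into chunks across workers
x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
print(x.mean().compute())

cluster.shutdown()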

+
+

4 Mag-cli

+

Mag-cli is the command-line interface of magasin. It helps manage the different modules of magasin and makes it easier to perform common administration tasks.
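Mag-cli is distributed as a Python package; the installer script installs it with pip, so getting it locally and checking that it works is just:

pip install mag-cli
mag --version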

+ +
diff --git a/get-started/automate-data-ingestion.html b/get-started/automate-data-ingestion.html index 7f819013..7bdbb42d 100644 --- a/get-started/automate-data-ingestion.html +++ b/get-started/automate-data-ingestion.html @@ -599,7 +599,7 @@

#... ) -

Ok, So now let’s test if this is working so far. To do that we will first install the pipeline package in editable mode (-e). This allows you to edit the package without needing to install it again.

+

Ok, so now let’s test if this is working so far. To do that we will first install the pipeline package in editable mode (-e). This allows you to edit the package without needing to install it again.

pip install -e '.[dev]'

Then, we will launch the Dagster user interface:

dagster dev
diff --git a/get-started/index.html b/get-started/index.html index 129c3f09..fb22f081 100644 --- a/get-started/index.html +++ b/get-started/index.html @@ -377,7 +377,7 @@

On this page

    -
  • 1 Pre-requisite: a Kubernetes cluster
  • +
  • 1 Install pre-requisite: a Kubernetes cluster
  • 2 Install magasin
  • 3 Verify everything is working
  • 4 Next steps
  • @@ -406,11 +406,12 @@

    Get started

    -

    Magasin is a scalable end-to-end data platform based on open-source components that is natively run in a Kubernetes cluster.

    +

    Magasin is a scalable end-to-end data platform based on open-source components that is natively run in a Kubernetes cluster.

    +

Magasin offers a value-for-money, end-to-end data solution implementing a loosely-coupled architecture for organizations that need to set up a framework to scale the ingestion, storage, analysis, and visualization of datasets. In addition, it includes parallel computing capabilities for analyzing large datasets or training AI models.

In this getting started guide, you will install magasin on your local machine for testing purposes. Then you will perform an end-to-end data processing task that involves exploratory analysis of a data source, creating a pipeline to automate data ingestion, and authoring a dashboard to present your findings.

    -

    Let’s start by getting the pre-requisites:

    -
    -

    1 Pre-requisite: a Kubernetes cluster

    +

Before you continue, you may want to learn more about why magasin and its technical architecture. Otherwise, let’s start with the prerequisite for installing magasin.

    +
    +

    1 Install pre-requisite: a Kubernetes cluster

Prior to installing magasin, you need to have a Kubernetes cluster. But don’t worry, you can set one up on your local machine very easily. In layman’s terms, Kubernetes is just a technology that enables managing cloud-ready applications, such as magasin.

In this getting started tutorial, we are going to set up a Kubernetes cluster through Docker Desktop, an application that can be installed on most computers. However, if you already have a cluster, you can go directly to the install magasin section.
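If you think you may already have a cluster configured, a quick way to check that kubectl can reach it (assuming kubectl is installed) is:

kubectl cluster-info
kubectl get nodes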

    First, install Docker Desktop. It is available for:

    diff --git a/index.html b/index.html index 8b9dfd8a..950caae9 100644 --- a/index.html +++ b/index.html @@ -137,7 +137,7 @@

    diff --git a/index.yaml b/index.yaml deleted file mode 100644 index 10fe0c63..00000000 --- a/index.yaml +++ /dev/null @@ -1,366 +0,0 @@ -apiVersion: v1 -entries: - dagster: - - apiVersion: v2 - appVersion: 1.6.4 - created: "2024-02-29T11:05:36.515187666Z" - dependencies: - - condition: dagster-user-deployments.enableSubchart - name: dagster-user-deployments - repository: "" - version: 1.6.4 - - condition: postgresql.enabled - name: postgresql - repository: https://raw.githubusercontent.com/bitnami/charts/eb5f9a9513d987b519f0ecd732e7031241c50328/bitnami - version: 8.1.0 - - condition: rabbitmq.enabled - name: rabbitmq - repository: https://raw.githubusercontent.com/bitnami/charts/eb5f9a9513d987b519f0ecd732e7031241c50328/bitnami - version: 6.16.3 - - condition: redis.internal - name: redis - repository: https://raw.githubusercontent.com/bitnami/charts/eb5f9a9513d987b519f0ecd732e7031241c50328/bitnami - version: 12.7.4 - description: The data orchestration platform built for productivity. - digest: 570c9c5ee42e9fad01af15a966c959ccefe7f4e1c627a5f31186e6226a77effe - icon: https://dagster.io/images/brand/logos/dagster-primary-mark.png - keywords: - - analytics - - data-orchestrator - - data-pipelines - - etl - - workflow - kubeVersion: '>= 1.18.0-0' - maintainers: - - email: support@dagsterlabs.com - name: Dagster Labs - url: https://dagster.io/about - name: dagster - sources: - - https://github.com/dagster-io/dagster/tree/master/helm/dagster - type: application - urls: - - https://github.com/unicef/magasin/releases/download/v0.3.0-charts/dagster-1.6.4.tgz - version: 1.6.4 - - apiVersion: v2 - appVersion: 1.6.0 - created: "2024-02-02T15:14:32.240273332Z" - dependencies: - - condition: dagster-user-deployments.enableSubchart - name: dagster-user-deployments - repository: "" - version: 1.6.0 - - condition: postgresql.enabled - name: postgresql - repository: https://raw.githubusercontent.com/bitnami/charts/eb5f9a9513d987b519f0ecd732e7031241c50328/bitnami - version: 8.1.0 - - condition: rabbitmq.enabled - name: rabbitmq - repository: https://raw.githubusercontent.com/bitnami/charts/eb5f9a9513d987b519f0ecd732e7031241c50328/bitnami - version: 6.16.3 - - condition: redis.internal - name: redis - repository: https://raw.githubusercontent.com/bitnami/charts/eb5f9a9513d987b519f0ecd732e7031241c50328/bitnami - version: 12.7.4 - description: The data orchestration platform built for productivity. 
- digest: 94b7b3fdd8ab14fc7bfde84afcf0324fcb151a44b8867ce29c6d0d03b6415469 - icon: https://dagster.io/images/brand/logos/dagster-primary-mark.png - keywords: - - analytics - - data-orchestrator - - data-pipelines - - etl - - workflow - kubeVersion: '>= 1.18.0-0' - maintainers: - - email: support@dagsterlabs.com - name: Dagster Labs - url: https://dagster.io/about - name: dagster - sources: - - https://github.com/dagster-io/dagster/tree/master/helm/dagster - type: application - urls: - - https://github.com/unicef/magasin/releases/download/v0.1.0-charts/dagster-1.6.0.tgz - version: 1.6.0 - - apiVersion: v2 - appVersion: 1.5.9 - created: "2023-11-29T10:16:06.344653394Z" - dependencies: - - condition: dagster-user-deployments.enableSubchart - name: dagster-user-deployments - repository: "" - version: 1.5.9 - - condition: postgresql.enabled - name: postgresql - repository: https://raw.githubusercontent.com/bitnami/charts/eb5f9a9513d987b519f0ecd732e7031241c50328/bitnami - version: 8.1.0 - - condition: rabbitmq.enabled - name: rabbitmq - repository: https://raw.githubusercontent.com/bitnami/charts/eb5f9a9513d987b519f0ecd732e7031241c50328/bitnami - version: 6.16.3 - - condition: redis.internal - name: redis - repository: https://raw.githubusercontent.com/bitnami/charts/eb5f9a9513d987b519f0ecd732e7031241c50328/bitnami - version: 12.7.4 - description: The data orchestration platform built for productivity. - digest: 1fa81a21539df12b8b0443387a453e367721c157aa20227f3421ab6d1d20cbf3 - icon: https://dagster.io/images/brand/logos/dagster-primary-mark.png - keywords: - - analytics - - data-orchestrator - - data-pipelines - - etl - - workflow - kubeVersion: '>= 1.18.0-0' - maintainers: - - email: support@dagsterlabs.com - name: Dagster Labs - url: https://dagster.io/about - name: dagster - sources: - - https://github.com/dagster-io/dagster/tree/master/helm/dagster - type: application - urls: - - https://github.com/unicef/magasin/releases/download/v0.0.1-charts/dagster-1.5.9.tgz - version: 1.5.9 - daskhub: - - apiVersion: v2 - appVersion: jh3.2.1-dg2023.9.0 - created: "2024-02-29T11:05:36.520388245Z" - dependencies: - - import-values: - - child: rbac - parent: rbac - name: jupyterhub - repository: https://jupyterhub.github.io/helm-chart/ - version: 3.2.1 - - name: dask-gateway - repository: https://helm.dask.org/ - version: 2023.9.0 - description: Multi-user JupyterHub and Dask deployment. - digest: 8867e788af67c3c4df28ac328afb36780bddfcd6295d8ceeb391e9c56b62c292 - icon: https://avatars3.githubusercontent.com/u/17131925?v=3&s=200 - maintainers: - - email: jtomlinson@nvidia.com - name: Jacob Tomlinson (Nvidia) - - email: jhamman@ucar.edu - name: Joe Hamman (NCAR) - - email: erik@sundellopensource.se - name: Erik Sundell - - email: tom.w.augspurger@gmail.com - name: Tom Augspurger - name: daskhub - urls: - - https://github.com/unicef/magasin/releases/download/v0.3.0-charts/daskhub-2024.1.0.tgz - version: 2024.1.0 - - apiVersion: v2 - appVersion: jh2.0.0-dg2023.1.1 - created: "2023-11-29T10:16:06.351722903Z" - dependencies: - - import-values: - - child: rbac - parent: rbac - name: jupyterhub - repository: https://jupyterhub.github.io/helm-chart/ - version: 2.0.0 - - name: dask-gateway - repository: https://helm.dask.org/ - version: 2023.1.1 - description: Multi-user JupyterHub and Dask deployment. 
- digest: 0419f358edea2e5a1e9005f1fbe6e1b60dcf964b6fd9f3745e06d54c4fb33e00 - icon: https://avatars3.githubusercontent.com/u/17131925?v=3&s=200 - maintainers: - - email: jtomlinson@nvidia.com - name: Jacob Tomlinson (Nvidia) - - email: jhamman@ucar.edu - name: Joe Hamman (NCAR) - - email: erik@sundellopensource.se - name: Erik Sundell - - email: tom.w.augspurger@gmail.com - name: Tom Augspurger - name: daskhub - urls: - - https://github.com/unicef/magasin/releases/download/v0.0.1-charts/daskhub-2023.1.0.tgz - version: 2023.1.0 - drill: - - apiVersion: v2 - appVersion: 1.21.1-3.9.1 - created: "2024-02-29T11:05:36.52112017Z" - description: Helm charts for deploying Apache Drill on a Kubernetes cluster - digest: 8b55b9de4530c7187a4255f35414d331e8a3b704f89c67a71cbfd319b3b568be - home: http://drill.apache.org/ - keywords: - - apache - - drill - - zookeeper - - magasin - - unicef - maintainers: - - email: merlos@users.noreply.github.com - name: merlos - name: drill - sources: - - https://github.com/unicef/magasin-drill - type: application - urls: - - https://github.com/unicef/magasin/releases/download/v0.3.0-charts/drill-0.7.1.tgz - version: 0.7.1 - - apiVersion: v2 - appVersion: 1.21.1-3.9.1 - created: "2024-02-09T09:47:54.038852893Z" - description: Helm charts for deploying Apache Drill on a Kubernetes cluster - digest: d9a2f30cfa6e38dbd63fd7a79854b3e08ae9a097867b7cf2bf6d71ea8bf25b56 - home: http://drill.apache.org/ - keywords: - - apache - - drill - - zookeeper - - magasin - - unicef - maintainers: - - email: merlos@users.noreply.github.com - name: merlos - name: drill - sources: - - https://github.com/unicef/magasin-drill - type: application - urls: - - https://github.com/unicef/magasin/releases/download/v0.2.0-charts/drill-0.6.1.tgz - version: 0.6.1 - operator: - - apiVersion: v2 - appVersion: v5.0.11 - created: "2024-02-29T11:05:36.522148177Z" - description: A Helm chart for MinIO Operator - digest: fa8c0c70966291cee728c044046a012103e8fe9c4804b8dba6a2f983c5b254b4 - home: https://min.io - icon: https://min.io/resources/img/logo/MINIO_wordmark.png - keywords: - - storage - - object-storage - - S3 - maintainers: - - email: dev@minio.io - name: MinIO, Inc - name: operator - sources: - - https://github.com/minio/operator - type: application - urls: - - https://github.com/unicef/magasin/releases/download/v0.3.0-charts/operator-5.0.11.tgz - version: 5.0.11 - superset: - - apiVersion: v2 - appVersion: 3.1.0 - created: "2024-02-29T11:05:36.531490735Z" - dependencies: - - condition: postgresql.enabled - name: postgresql - repository: https://charts.bitnami.com/bitnami - version: 12.1.6 - - condition: redis.enabled - name: redis - repository: https://charts.bitnami.com/bitnami - version: 17.9.4 - description: Apache Superset is a modern, enterprise-ready business intelligence - web application - digest: eb77730e65fdd74fd661ddc1f3f9d9f1b19ab3cc5f15c8a2be0f5d7c3eb496f8 - home: https://superset.apache.org/ - icon: https://artifacthub.io/image/68c1d717-0e97-491f-b046-754e46f46922@2x - keywords: - - business intelligence - - data science - maintainers: - - email: craig@craigrueda.com - name: craig-rueda - url: https://github.com/craig-rueda - name: superset - sources: - - https://github.com/apache/superset - urls: - - https://github.com/unicef/magasin/releases/download/v0.3.0-charts/superset-0.12.6.tgz - version: 0.12.6 - - apiVersion: v2 - appVersion: 3.1.0 - created: "2024-02-09T09:47:54.048858841Z" - dependencies: - - condition: postgresql.enabled - name: postgresql - repository: 
https://charts.bitnami.com/bitnami - version: 12.1.6 - - condition: redis.enabled - name: redis - repository: https://charts.bitnami.com/bitnami - version: 17.9.4 - description: Apache Superset is a modern, enterprise-ready business intelligence - web application - digest: 3b09b5aa9eff983f1f9ef76c08d1a43671b2bb8b839c7c509dc4233c5eb67e7c - home: https://superset.apache.org/ - icon: https://artifacthub.io/image/68c1d717-0e97-491f-b046-754e46f46922@2x - keywords: - - business intelligence - - data science - maintainers: - - email: craig@craigrueda.com - name: craig-rueda - url: https://github.com/craig-rueda - name: superset - sources: - - https://github.com/apache/superset - urls: - - https://github.com/unicef/magasin/releases/download/v0.2.0-charts/superset-0.12.1.tgz - version: 0.12.1 - - apiVersion: v2 - appVersion: 3.0.1 - created: "2023-11-29T10:16:06.36911367Z" - dependencies: - - condition: postgresql.enabled - name: postgresql - repository: https://charts.bitnami.com/bitnami - version: 12.1.6 - - condition: redis.enabled - name: redis - repository: https://charts.bitnami.com/bitnami - version: 17.9.4 - description: Apache Superset is a modern, enterprise-ready business intelligence - web application - digest: afb72faf3e0748348bd8dccfd70c99294d12d58ccec1271eb20bcbd276491c93 - home: https://superset.apache.org/ - icon: https://artifacthub.io/image/68c1d717-0e97-491f-b046-754e46f46922@2x - keywords: - - business intelligence - - data science - maintainers: - - email: craig@craigrueda.com - name: craig-rueda - url: https://github.com/craig-rueda - name: superset - sources: - - https://github.com/apache/superset - urls: - - https://github.com/unicef/magasin/releases/download/v0.0.1-charts/superset-0.10.15.tgz - version: 0.10.15 - tenant: - - apiVersion: v2 - appVersion: v5.0.11 - created: "2024-02-29T11:05:36.532325242Z" - description: A Helm chart for MinIO Operator - digest: c453a5a6081fd5e87768f843aa06d5a67c859cc00b7a7338d07c46d62cd83ad1 - home: https://min.io - icon: https://min.io/resources/img/logo/MINIO_wordmark.png - keywords: - - storage - - object-storage - - S3 - maintainers: - - email: dev@minio.io - name: MinIO, Inc - name: tenant - sources: - - https://github.com/minio/operator - type: application - urls: - - https://github.com/unicef/magasin/releases/download/v0.3.0-charts/tenant-5.0.11.tgz - version: 5.0.11 -generated: "2024-02-29T11:05:36.506937064Z" diff --git a/install-magasin.sh b/install-magasin.sh deleted file mode 100755 index 6966d638..00000000 --- a/install-magasin.sh +++ /dev/null @@ -1,789 +0,0 @@ -#!/bin/bash - -# This script checks and installs kubectl, helm, and Homebrew (if needed) on your system. -# It supports automatic installation with the `-y` flag and provides -# detailed user prompts for confirmation. Additionally, it verifies -# the functionality of all tools once installed. - - -# If curl piping is used the name of the script is set to the name of the shell (f.i zsh or bash) -# We replace the script_name variable with this value when zsh or bash -# It should be the name of the file containing the script. -DEFAULT_SCRIPT_NAME='install-magasin.sh' - -# Magasin base url -BASE_URL=http://unicef.github.io/magasin - -# Values folder. 
Use -f to overwrite -VALUES_FOLDER='./' - -# Helm repo URL (-u to overwrite) -MAGASIN_DEFAULT_HELM_REPO=$BASE_URL -MAGASIN_HELM_REPO=$BASE_URL -# If -u is set overwritten -CUSTOM_HELM_REPO=false - -# Link to documentation on how to install magasin manually -MANUAL_INSTALL_LINK=$BASE_URL/install/manual-installation.html -UNINSTALL_MAGASIN_LINK=$BASE_URL/install/uninstall.html -GET_STARTED_LINK=$BASE_URL/get-started/tutorial-overview.html - -# Skip prompting the user? -AUTO_INSTALL=false - -# Only install local dependencies. Skip installing magasin -# in the kubernetes cluster -ONLY_LOCAL_INSTALL=false - -# Only check if there is missing stuff -ONLY_CHECK=false - -# Debug mode -DEBUG=false - -# Default REALM -REALM_ARG='magasin' # default -REALM_PREFIX='magasin' -REALM_POSTFIX='' - -# Get PLATFORM -PLATFORM=$(uname) -LINUX="Linux" -MACOS="Darwin" - -# Function to display messages in red -echo_debug() { - if [ "$DEBUG" = true ]; then - printf "\033[38;5;208m%s\033[0m\n" "$@" - fi -} - -# Function to display a line of dashes with the width of the terminal window. -echo_line() { - local width=$(tput cols) # Get the width of the terminal window - printf "%${width}s\n" | tr ' ' '-' # Print dashes to fill the width -} - -# Function to display messages prepending [ v ] (v = check) -echo_success() { - printf "\033[32m[ \xE2\x9C\x93 ]\033[0m %s\n" "$@" -} - -# Information message prepended by [ i ] -echo_info() { - printf "\033[34m[ i ]\033[0m %s\n" "$@" -} - - -# Function to display failure to comply with a condition. -# Prepends and x. -echo_fail() { - printf "\033[31m[ \xE2\x9C\x97 ]\033[0m %s\n" "$@" # \e[31m sets the color to red, \e[0m resets the color -} - -# Function to display warning messages. -# Prepends two !! in orangish color. -echo_warning() { - printf "\033[38;5;208m[ W ]\033[0m %s\n" "$@" -} - -# Function to display error messages in red. Prepends ERROR -echo_error() { - printf "\033[31mERROR:\033[0m %s\n" "$@" -} - -# Exit displaying how to debug -exit_error() { - local code=$1 - echo_error "$code" - echo_error "You may get more information about the issue by running the script including the debug option (-d):" - echo_error " $script_name -d " - echo "" - exit $code -} - -echo_magasin() { - -echo "" -echo " Welcome to the world of tomorrow " -printf "\033[31m" -printf " ▄ \n" -printf " ███ \n" -printf " ▀ \033[0m\n" -echo " ▐█▙█▖ ▟██▖ ▟█▟▌ ▟██▖▗▟██▖ ██ ▐▙██▖ " -echo " ▐▌█▐▌ ▘▄▟▌▐▛ ▜▌ ▘▄▟▌▐▙▄▖▘ █ ▐▛ ▐▌ " -echo " ▐▌█▐▌▗█▀▜▌▐▌ ▐▌▗█▀▜▌ ▀▀█▖ █ ▐▌ ▐▌ " -echo " ▐▌█▐▌▐▙▄█▌▝█▄█▌▐▙▄█▌▐▄▄▟▌▗▄█▄▖▐▌ ▐▌ " -echo " ▝▘▀▝▘ ▀▀▝▘ ▞▀▐▌ ▀▀▝▘ ▀▀▀ ▝▀▀▀▘▝▘ ▝▘ " -echo " ▜█▛▘ " -echo "" - -} - -function usage { - echo "Usage: $1 [-y] [-c] [-i] [-r realm_prefix-realm_postfix (magasin)] [-f values_folder (./)] [-d] [-h]" - echo "" - echo "This script checks dependencies and installs magasin components" - echo "Each component is installed within its own namespace." - echo "" - echo "Options:" - echo " -y Skip prompting questions during installation" - echo " -c Only check if all pre-requisites are installed in the local machine." - echo " -i Only install all pre-requisites in the local machine. Does not install magasin in Kubernetes" - echo " -r Realm prefix and suffix (default: magasin). Prefix and suffix are separated by '-'." - echo " If more than one '-', the last one will be used as separator." - echo " The realm 'magasin-new-dev' will set 'magasin-new' as prefix and 'dev' as suffix." - echo " -f Folder with custom values.yaml files (default: ./)." 
- echo " Files within the folder shall have the same name as the component. Example:" - echo " drill.yaml, dagster.yaml, superset.yaml, daskhub.yaml" - echo " -u URL/path to the magasin's helm repository (default: https://unicef.github.io/magasin/)" - echo " " - echo " -d Enable debug mode (displays all commands run)." - echo " -h Display this help message and exit." - echo " " - echo "Examples:" - echo " - Only check if all requirements are installed" - echo " $1 -c " - echo " - Setup the realm 'test'. Will use test- as namespaces" - echo " $1 -r test" - echo " - Enable debug mode, skip being promted, and setup the realm 'magasin-dev'" - echo " (which results in magasin--dev as namespaces)" - echo " $1 -d -y -r magasin-dev" - exit 0 -} - -script_name=$(basename "$0") - -# When curl piping the name is set to the name of the shell, so we setup a default name. -if [[ "$script_name" == "zsh" || "$script_name" == "bash" ]]; then - script_name=$DEFAULT_SCRIPT_NAME -fi - - -while getopts ":f:u:r:yichd" opt; do - case $opt in - y) - AUTO_INSTALL=true - ;; - c) - ONLY_CHECK=true - ;; - i) - ONLY_LOCAL_INSTALL=true - ;; - d) - DEBUG=true - ;; - u) - CUSTOM_HELM_REPO=true - MAGASIN_HELM_REPO=$OPTARG - ;; - f) - # Check if the folder exists. - if [ -d $OPTARG ]; then - echo_debug "Values folder exists $OPTARG" - VALUES_FOLDER=$OPTARG - fi - ;; - esac -done - -# Function to display warning messages. -# Prepends two !! in orangish color. -echo_warning() { - printf "\033[38;5;208m[ W ]\033[0m %s\n" "$@" -} - -# Function to display error messages in red. Prepends ERROR -echo_error() { - printf "\033[31mERROR:\033[0m %s\n" "$@" -} - -# Exit displaying how to debug -exit_error() { - local code=$1 - echo_error "$code" - echo_error "You may get more information about the issue by running the script including the debug option (-d):" - echo_error " $script_name -d " - echo "" - exit $code -} - -echo_magasin() { - -echo "" -echo " Welcome to the world of tomorrow " -printf "\033[31m" -printf " ▄ \n" -printf " ███ \n" -printf " ▀ \033[0m\n" -echo " ▐█▙█▖ ▟██▖ ▟█▟▌ ▟██▖▗▟██▖ ██ ▐▙██▖ " -echo " ▐▌█▐▌ ▘▄▟▌▐▛ ▜▌ ▘▄▟▌▐▙▄▖▘ █ ▐▛ ▐▌ " -echo " ▐▌█▐▌▗█▀▜▌▐▌ ▐▌▗█▀▜▌ ▀▀█▖ █ ▐▌ ▐▌ " -echo " ▐▌█▐▌▐▙▄█▌▝█▄█▌▐▙▄█▌▐▄▄▟▌▗▄█▄▖▐▌ ▐▌ " -echo " ▝▘▀▝▘ ▀▀▝▘ ▞▀▐▌ ▀▀▝▘ ▀▀▀ ▝▀▀▀▘▝▘ ▝▘ " -echo " ▜█▛▘ " -echo "" - -} - -function usage { - echo "Usage: $1 [-y] [-c] [-i] [-r realm_prefix-realm_postfix (magasin)] [-f values_folder (./)] [-d] [-h]" - echo "" - echo "This script checks dependencies and installs magasin components" - echo "Each component is installed within its own namespace." - echo "" - echo "Options:" - echo " -y Skip prompting questions during installation" - echo " -c Only check if all pre-requisites are installed in the local machine." - echo " -i Only install all pre-requisites in the local machine. Does not install magasin in Kubernetes" - echo " -r Realm prefix and suffix (default: magasin). Prefix and suffix are separated by '-'." - echo " If more than one '-', the last one will be used as separator." - echo " The realm 'magasin-new-dev' will set 'magasin-new' as prefix and 'dev' as suffix." - echo " -f Folder with custom values.yaml files (default: ./)." - echo " Files within the folder shall have the same name as the component. Example:" - echo " drill.yaml, dagster.yaml, superset.yaml, daskhub.yaml" - echo " -u URL/path to the magasin's helm repository (default: https://unicef.github.io/magasin/)" - echo " " - echo " -d Enable debug mode (displays all commands run)." - echo " -h Display this help message and exit." 
- echo " " - echo "Examples:" - echo " - Only check if all requirements are installed" - echo " $1 -c " - echo " - Setup the realm 'test'. Will use test- as namespaces" - echo " $1 -r test" - echo " - Enable debug mode, skip being promted, and setup the realm 'magasin-dev'" - echo " (which results in magasin--dev as namespaces)" - echo " $1 -d -y -r magasin-dev" - exit 0 -} - -script_name=$(basename "$0") - -# Check if the script name is zsh or bash and replace it with the default name (install_magasin.sh) -if [[ "$script_name" == "zsh" || "$script_name" == "bash" ]]; then - script_name="install_magasin.sh" -fi - -# If script name is zsh or bash replace is with default name (install_magasin.sh) - - -while getopts ":f:u:r:yichd" opt; do - case $opt in - y) - AUTO_INSTALL=true - ;; - c) - ONLY_CHECK=true - ;; - i) - ONLY_LOCAL_INSTALL=true - ;; - d) - DEBUG=true - ;; - u) - CUSTOM_HELM_REPO=true - MAGASIN_HELM_REPO=$OPTARG - ;; - f) - # Check if the folder exists. - if [ -d $OPTARG ]; then - echo_debug "Values folder exists $OPTARG" - VALUES_FOLDER=$OPTARG - # Check if hte folder name ends with - if [[ "$VALUES_FOLDER" != */ ]]; then - echo_debug "Adding slash to values folder $VALUES_FOLDER" - VALUES_FOLDER="$VALUES_FOLDER/" - fi - echo_info "Folder with value files exists ($VALUES_FOLDER)." - else - echo_error "Folder $OPTARG does not exist." - exit 101 - fi - - ;; - r) - argument=$OPTARG - # Extracting prefix and postfix - last_dash=$(echo "$argument" | grep -o '[^-]*$') - if [[ "$last_dash" == "$argument" ]]; then - REALM_PREFIX=$argument - REALM_POSTFIX="" - else - REALM_PREFIX=${argument%-$last_dash} - REALM_POSTFIX=$last_dash - fi - REALM_ARG=$argument - echo_info "Magasin realm set:" - echo_info " Realm: $REALM_ARG" - echo_info " Realm prefix '$REALM_PREFIX'" - echo_info " Realm suffix '$REALM_SUFFIX'" - - ;; - h) - usage $script_name - ;; - \?) - echo_error "Invalid option: -$OPTARG" - echo "" - usage $script_name - exit 102 - ;; - :) - echo_error "Option -$OPTARG requires an argument." >&2 - exit 103 - ;; - esac -done - -# This is addded in helm command line. -HELM_DEBUG_FLAG='' -# If debug display all commands -if [ "$DEBUG" = true ]; then - echo_info Setting DEBUG mode ON - # This will enable debug in helm commands. It is added in all helm command calls - HELM_DEBUG_FLAG=" --debug" - set -x -fi - -echo_magasin -echo "Launching installer..." -sleep 4 - - -# Display vars -echo "-----------" -echo_info "BASE_URL: $BASE_URL" -echo_info "MAGASIN_HELM_REPO: $MAGASIN_HELM_REPO" -echo_info "CUSTOM_HELM_REPO: $CUSTOM_HELM_REPO" -echo_info "MANUAL_INSTALL_LINK: $MANUAL_INSTALL_LINK" -echo_info "REALM_ARG: $REALM_ARG" -echo_info "REALM_PREFIX: $REALM_PREFIX" -echo_info "REALM_POSTFIX: $REALM_POSTFIX" -echo_info "AUTO_INSTALL: $AUTO_INSTALL" -echo_info "ONLY_CHECK: $ONLY_CHECK" -echo_info "ONLY_LOCAL_INSTALL: $ONLY_LOCAL_INSTALL" -echo_info "PLATFORM: $PLATFORM" -echo_info "PATH: $PATH" -echo_info "HELM_DEBUG_FLAG: $HELM_DEBUG_FLAG" -echo "-----------" -# Initialize report variables -declare -A install_status - - -if [[ $- == *i* ]]; then - echo_info "Interactive shell" -else - echo_warning "Non-Interactive shell detected. Setting automatic install of dependencies..." - AUTO_INSTALL=true -fi - - -# if a command does not exist this variable is set to true -command_missing=false - -# Checks if a command exists. Returns "installed" and "not installed" -# if the command does not exist sets $command_missing to true -function command_exists { - local command="$1" - if ! 
command -v "$command" &> /dev/null; then - command_missing=true - echo_fail "$command not installed" - install_status["$command"]="not installed" - return 1 - fi - command_path="$(command -v "$command")" - echo_success "$command installed ($command_path)" - install_status["$command"]="installed" - return 0 -} - -# Checks if all the required commands to perform the installation are in linux -function check_linux_requirements_status { - echo_info "** magasin installer for a GNU/Linux system (Linux)" - command_exists "kubectl" - command_exists "helm" - command_exists "mc" - command_exists "mag" - # only check if any of the previous does not exist - if [[ "$command_missing" == true ]]; then - command_exists "apt-get" # debian like package installer - fi -} - -# Checks if all the required commands are available for MacOS -function check_macos_requirements_status { - echo_info "magasin installer for a MacOS system (Darwin)" - command_exists "kubectl" - command_exists "helm" - command_exists "pip3" - command_exists "mc" - command_exists "mag" - - if [[ "$command_missing" == true ]]; then - command_exists "brew" - fi -} - - -# Check based on system -echo "" -if [[ $PLATFORM == $LINUX ]]; then - check_linux_requirements_status - - # Only debian / apt-get systems supported - if [ "${install_status["apt-get"]}" = "not installed" ]; then - echo "" - echo_error "apt-get is not installed. Are you in a Debian like system?" - echo_info "This installation script only works on Debian GNU/Linux like systems (f.i. Debian, Ubuntu, raspbian, Kali...)." - echo_info "Please read the "Manual Installation" section in:" - echo "" - echo_info " $MANUAL_INSTALL_LINK" - echo "" - exit 1 - fi - -elif [[ $PLATFORM == $MACOS ]]; then - check_macos_requirements_status -else - echo "" - echo_error Platform = $PLATFORM - echo_error "This system is not supported by this installation script." - echo_info " Please visit $MANUAL_INSTALL_LINK" - echo "" - exit 2 -fi - - -echo_debug "Is there any command missing?" -if [[ "$command_missing" == true ]]; then - echo "" - echo_fail "There are missing dependencies." -else - echo "" - echo_success "All dependencies are installed." -fi - -# If -c option is set, then end. -echo_debug "Only check? $ONLY_CHECK" -if [[ "$ONLY_CHECK" == true ]]; then - echo_debug "ONLY_CHECK=true" - exit 0 -fi - -# -# Install missing pre-requisites -# -# Is there a command missing? - -echo_debug "Is there any dependency missing missing? $command_missing" -if [[ "$command_missing" == true ]]; then - # Perform installation based on report - if [[ $PLATFORM == $LINUX ]]; then - echo_debug "Install missing commands in $LINUX" - if [[ "$AUTO_INSTALL" == false ]]; then - # If not auto install Prompt for installation if any tool is missing and -y flag is not provided - echo "" - read -r -p "Do you want to install the missing tools (y/N)?" response - if [[ ! $response =~ ^[Yy]$ ]]; then - echo_fail "Installation aborted." - echo_info "For more information on how install manually: $MANUAL_INSTALL_LINK" - exit 0 - fi # response - fi # auto_install - echo_info "Installing pre-requisites for GNU/Linux.." - if [ "${install_status["kubectl"]}" == "not installed" ]; then - # https://kubernetes.io/docs/tasks/tools/install-kubectl-linux/ - echo_info "Installing kubectl..." 
- echo_line - sudo apt-get update - # apt-transport-https may be a dummy package; if so, you can skip that package - sudo apt-get install -y apt-transport-https ca-certificates curl - # Note: In releases older than Debian 12 and Ubuntu 22.04, /etc/apt/keyrings does not exist by default, and can be created using sudo mkdir -m 755 /etc/apt/keyrings - sudo mkdir -m 755 /etc/apt/keyrings - curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.29/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg - # This overwrites any existing configuration in /etc/apt/sources.list.d/kubernetes.list - echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.29/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list - sudo apt-get update - sudo apt-get install -y kubectl - echo_line - fi - - if [ "${install_status["helm"]}" == "not installed" ]; then - echo_info "Installing helm..." - echo_line - curl https://baltocdn.com/helm/signing.asc | gpg --dearmor | sudo tee /usr/share/keyrings/helm.gpg > /dev/null - sudo apt-get install apt-transport-https --yes - echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/helm.gpg] https://baltocdn.com/helm/stable/debian/ all main" | sudo tee /etc/apt/sources.list.d/helm-stable-debian.list - sudo apt-get update - sudo apt-get install helm - echo_line - fi - - if [ "${install_status["pip3"]}" == "not installed" ]; then - echo_info "Installing pip..." - echo_line - sudo apt-get install python3 python3-pip --yes - echo_line - fi - - if [ "${install_status["mc"]}" == "not installed" ]; then - echo_info "Installing mc at /usr/local/bin/mc..." - echo_line - sudo curl https://dl.min.io/client/mc/release/linux-amd64/mc \ - --create-dirs \ - -o /usr/local/bin/mc - sudo chmod +x /usr/local/bin/mc - echo_line - fi - - if [ "${install_status["mag"]}" == "not installed" ]; then - echo_info "Installing mag CLI..." - echo_info "Running: sudo pip install mag" - echo_line - sudo pip install mag-cli - echo_line - fi - - - elif [[ $PLATFORM == $MACOS ]]; then - if [[ "$AUTO_INSTALL" == false ]]; then - # If not auto install Prompt for installation if any tool is missing and -y flag is not provided - # zsh read format - echo "" - read "response?Do you want to install the missing tools (y/N)?" - if [[ ! $response =~ ^[Yy]$ ]]; then - echo_fail "Installation aborted." - echo_info "For more information on how install manually: $MANUAL_INSTALL_LINK" - exit 0 - fi # response - fi # auto install - echo_info "Installing pre-requisites for MacOS..." - # If brew does not exist => install it - if [[ "${install_status["brew"]}" == "not installed" ]]; then - echo_info "Installing brew..." - echo_line - /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)" - echo_line - fi - echo_info "Running: brew install kubectl helm python minio/stable/mc" - echo_line - brew install kubectl helm python minio/stable/mc - echo_line - - if [[ "${install_status["mag"]}" == "not installed" ]]; then - echo_info "Installing mag CLI..." - echo_info "Running pip install mag-cli" - echo_line - pip install mag-cli - echo_line - fi - - else - # this probably will never be reached - echo_error "System not supported ($PLATFORM)." - exit 1 - fi # else -fi # command missing - -# Verify installations - -# Verify kubectl functionality -not_working=false - - -echo "" -echo_info "Verifying commands are working..." -# Verify kubeclt -if ! 
kubectl &> /dev/null; then - echo_error "The kubectl command ($(command -v "kubectl")) is not working properly." - echo_error "Installation documentation:" - echo_error " - For Linux: https://kubernetes.io/docs/tasks/tools/install-kubectl-linux/" - echo_error " - For macOS: https://kubernetes.io/docs/tasks/tools/install-kubectl-macos/" - not_working=true -else - echo_success "kubectl is working" -fi - -# Verify helm functionality -if ! helm version &> /dev/null; then - echo_error "The helm command ($(command -v "helm")) is not working properly." - echo_error "Installation documentation:" - echo_error " https://helm.sh/docs/intro/install/" - not_working=true - -else - echo_success "helm is working" -fi - -# Verify mc functionality -if ! mc --version &> /dev/null; then - echo_error "The mc command ($(command -v "mc")) is not working properly." - echo_error "Installation documentation:" - echo_error " https://min.io/docs/minio/linux/reference/minio-mc.html#install-mc" - not_working=true - -else - echo_success "mc is working" -fi - -if ! mag --version &> /dev/null; then - echo_error "The mag command ($(command -v "mag")) is not working properly." - echo_error "Installation documentation:" - echo_error " https://min.io/docs/minio/linux/reference/minio-mc.html#install-mc" - not_working=true - -else - echo_success "mc is working" -fi - -if [[ "$not_working" == true ]]; then - echo_error "Some of the commands are not working." - exit_error 10 -fi - -if [[ "$ONLY_LOCAL_INSTALL" == true ]]; then - echo "" - - echo_success "All dependencies are installed and working" - echo_info "Skipping installing magasin in the kubernetes cluster" - echo "" - exit 0 -fi - - -# Check if repository already exists -if helm repo list | grep -q "magasin"; then - echo_info "magasin helm repository already exists. Resetting it..." - echo_info "Running: helm repo remove magasin; helm repo add magasin $MAGASIN_HELM_REPO $HELM_DEBUG_FLAG" - echo_line - helm repo remove magasin - helm repo add magasin $MAGASIN_HELM_REPO $HELM_DEBUG_FLAG - if [[ $? -ne 0 ]]; then - echo_line - echo_error "Failed to add magasin repo after removing the existing one." - exit_error 20 - else - echo_line - echo_success "magasin helm repo successfully re-added as 'magasin' pointing to $MAGASIN_HELM_REPO" - if "$CUSTOM_HELM_REPO" = true; then - echo_warning "magasin helm repo URL is set to $MAGASIN_HELM_REPO which is not the official one" - echo_warning "You can set back the official magasin helm repo by running" - echo_warning " helm repo remove magasin" - echo_warning " helm repo add magasin $MAGASIN_DEFAULT_HELM_REPO" - fi - fi -else - # magain helm repo does not exist => add it - echo_info "Adding magasin helm repo ($MAGASIN_HELM_REPO)..." - echo_info "Running: helm repo add $MAGASIN_HELM_REPO $HELM_DEBUG_FLAG" - echo_line - helm repo add magasin $MAGASIN_HELM_REPO $HELM_DEBUG_FLAG - if [[ $? -ne 0 ]]; then - echo_line - echo_error "Failed to add magasin repo." - exit_error 30 - else - echo_line - echo_success "magasin helm repo successfully added as 'magasin'." - fi -fi - - -# check if the realm namespace exits -if kubectl get namespace "$REALM_ARG" &> /dev/null; then - echo_error "The realm namespace '$REALM_ARG' exists." - echo_error " Do you have a magasin instance already installed? You can try: " - echo_error " 1. Install magasin in another realm: '$script_name -r myrealm'" - echo_error " 2. Uninstalling '$REALM_ARG' realm instance (see $UNINSTALL_MAGASIN_LINK)" - echo_error " 3. 
Remove the namespace: 'kubectl delete namespace $REALM_ARG'" - exit 60 -fi - - -# -# Add a configmap with a json with some metadata -# -echo_info "Creating the magasin realm namespace." -echo_info "kubectl create namespace $REALM_ARG" -kubectl create namespace $REALM_ARG - - -# Flag.Set to true when calling install_chart -# it fails to install the chart. -install_chart_failed=false - -# -# Install magasin helm charts in the kubernetes cluster -# $1 = chart name -# -function install_chart { - local chart=$1 - - echo_debug "Install_chart $chart" - - if [[ -n "$REALM_POSTFIX" ]]; then - # realm postfix is not empty - namespace="$REALM_PREFIX-$chart-$REALM_POSTFIX" - else - # realm postfix is empty - namespace="$REALM_PREFIX-$chart" - fi - - echo_info "Installing magasin/$chart in the namespace $namespace." - values_helm_arg="" - values_file="$VALUES_FOLDER$chart.yaml" - echo_debug "values_file = $values_file" - - # check if $chart.yaml file exists - if [ -f "$values_file" ]; then - echo_success "Custom values file for $chart exists ($values_file)" - # the -f option in helm allows you to set custom values - # Include it as part of the - values_helm_arg="-f $values_file" - else - echo_info "Custom values file for $chart does NOT exist ($values_file)" - fi - # Check if the namespace already exists if so ask and warn the user. - - - echo_info "helm install $chart magasin/$chart $values_helm_arg --namespace $namespace --create-namespace $HELM_DEBUG_FLAG" - echo_line - helm install $chart magasin/$chart $values_helm_arg --namespace $namespace --create-namespace $HELM_DEBUG_FLAG - if [[ $? -ne 0 ]]; then - echo_line - echo_error "Could not install magasin/$chart in the namespace $namespace" - install_chart_failed=true - #exit_error 7 - else - echo_line - echo_success "magasin/$chart installed in namespace $namespace" - fi - - -} - -install_chart dagster -install_chart drill -install_chart daskhub -install_chart superset -install_chart operator -install_chart tenant - -# -# Check if the variable is true -if [ "$install_chart_failed" = true ]; then - echo_line - echo_warning "Atention!!" - echo_warning "Some of the components were not installed successfully" - echo_warning "Check the messages above. You can try to install the failed charts manually" - echo_warning "More information about manual installation:" - echo_warning " $MANUAL_INSTALL_LINK" -fi - -echo_line -echo_info "Next step start using magasin. Take a look at the tutorial:" -echo_info " $GET_STARTED_LINK" -echo_line diff --git a/search.json b/search.json index 16a84815..622fe217 100644 --- a/search.json +++ b/search.json @@ -298,21 +298,21 @@ "href": "about.html", "title": "About", "section": "", - "text": "Magasin (pronounced /ma.ɡa.zɛ̃/) is an end-to-end collection of open-source tools that are stitched together to provide enterprise ready automated ingestion, transformation, and storage as well as the enabling infrastructure and tools to perform data analysis, visualization of datasets and ML / AI." + "text": "Magasin (pronounced /ma.ɡa.zɛ̃/) is an end-to-end collection of open-source tools that are stitched together to provide enterprise ready automated ingestion, transformation, and storage as well as the enabling infrastructure and tools to perform data analysis, visualization of datasets and ML / AI.\nYou can learn more about why we created magasin and its architecture." 
}, { "objectID": "about.html#about-magasin", "href": "about.html#about-magasin", "title": "About", "section": "", - "text": "Magasin (pronounced /ma.ɡa.zɛ̃/) is an end-to-end collection of open-source tools that are stitched together to provide enterprise ready automated ingestion, transformation, and storage as well as the enabling infrastructure and tools to perform data analysis, visualization of datasets and ML / AI." + "text": "Magasin (pronounced /ma.ɡa.zɛ̃/) is an end-to-end collection of open-source tools that are stitched together to provide enterprise ready automated ingestion, transformation, and storage as well as the enabling infrastructure and tools to perform data analysis, visualization of datasets and ML / AI.\nYou can learn more about why we created magasin and its architecture." }, { "objectID": "about.html#brief-history", "href": "about.html#brief-history", "title": "About", "section": "2 Brief history", - "text": "2 Brief history\nMagasin grew out of Office of Innovation’s and Information and Communication Technology Division (ICTD) work to establish a RapidPro Data Warehouse to enable aggregation of global metrics across vendor instances.\nAlso informed by experiences of the Magic Box applied data science initiative, a range of components were evaluated and trialed with country offices to arrive at the current components and architecture.\nIn 2021, ICTD engaged one of its long term agreement vendors to assist with evolving the proof-of-concept into a minimum viable product (MVP).\nIn 2023, UNICEF started the journey to detach magasin from its organizational and cloud infrastructure dependencies, and release it as an open-source platform with the goal of becoming a Digital Public Good." + "text": "2 Brief history\nMagasin grew out of Office of Innovation’s and Information and Communication Technology Division (ICTD) work to establish a RapidPro Data Warehouse to enable aggregation of global metrics across vendor instances.\nAlso informed by experiences of the Magic Box applied data science initiative, a range of components were evaluated and trialed with country offices to arrive at the current components and architecture.\nIn 2021, ICTD engaged one of its long term agreement vendors to assist with evolving the proof-of-concept into a minimum viable product (MVP).\nIn 2023, UNICEF started the journey to detach magasin from its organizational and cloud infrastructure dependencies, and release it as an open-source platform with the goal of becoming a Digital Public Good.\nIn February 2nd,2024, UNICEF released magasin as an open source solution. In this release, it was included an easy to install process, as well as how to get started documentation." }, { "objectID": "get-started/create-a-dashboard.html", @@ -427,7 +427,7 @@ "href": "get-started/automate-data-ingestion.html#create-a-dagster-pipeline", "title": "Step 2: Automate data ingestion", "section": "3 Create a Dagster pipeline", - "text": "3 Create a Dagster pipeline\nThe next step is to create a pipeline using Dagster. A pipeline is just a piece of code that moves data from place to another and that can introduce some changes before saving it in the destination place. 
In our case the pipeline will take the data from the DPGA API and store it in a MinIO bucket.\nThe first thing we need to do is to install Dagster.\npip install dagster==1.6.4 dagster-webserver==1.6.4\n\n\n\n\n\n\nNote\n\n\n\nDagster is a very agile product that is continuously evolving, this means that you have to be cognizant of the version you’re running.\nYou can check the version installed in your cluster by running helm list --all-namespaces and looking at the APP VERSION column.\nThen run pip install pip install dagster==<version>\n\n\n\n3.1 Add the pipeline code\nOnce Dagster is installed, we’re going to create a new project using the default structure prodivded by Dagster. This should be the default procedure for creating any new pipeline.\ndagster project scaffold --name dpga-pipeline\nCreating a Dagster project at /home/magasin/dpga-pipeline.\nCreating a Dagster code location at /home/magasin/dpga-pipeline.\nGenerated files for Dagster code location in /home/magasin/dpga-pipeline.\nGenerated files for Dagster project in /home/magasin/dpga-pipeline.\nSuccess! Created dpga-pipeline at /home/magasin/dpga-pipeline.\nBy scaffolding our project, Dagster creates a basic structure of a python package that could be installed using pip as any other package as well as some additional metadata files that will be used by Dagster to run the pipeline. You have some more info in the Dagster documentation.\nNow, lets add our code. Open the file dpga-pipeline/dpga_pipeline/assets.py\n\n\ndpga-pipeline/dpga_pipeline/assets.py\n\nimport requests\nimport pandas as pd\nfrom pandas import DataFrame\n\nfrom dagster import asset\n\n@asset\ndef raw_dpgs() -> DataFrame:\n \"\"\" DPGs data from the API\"\"\"\n dpgs_json_dict = requests.get(\"https://api.digitalpublicgoods.net/dpgs\").json()\n df = pd.DataFrame.from_dict(dpgs_json_dict)\n return df\n\n@asset\ndef deployment_countries(raw_dpgs: DataFrame) -> DataFrame:\n df = raw_dpgs\n df_loc = pd.merge(df, pd.json_normalize(df[\"locations\"]), left_index=True, right_index=True)\n df_deployment_countries = df_loc.explode(\"deploymentCountries\")\n df_deployment_countries[[\"name\",\"deploymentCountries\"]]\n\n return df_deployment_countries\n\nAs you can see the code seems pretty similar to what we wrote in our exploratory analysis.\nThe in the code we have defined two @assets. An asset according to the Dagster definition is:\n\nAn asset is an object in persistent storage, such as a table, file, or persisted machine learning model. A Software-defined Asset is a Dagster object that couples an asset to the function and upstream assets used to produce its contents.\n\nIn our case, raw_dpgs, stores the dpgs as they come from the API as a DataFrame, and deployment_countries that extracts the one row per country in which the DPG has been deplayed.\nAnother thing that you can notice in the code is that in the definition of the deployment_countries asset, we are passing raw_dpgs: DataFrame. That will tell Dagster that deployment_countries depends on the raw_dpgs and it will be used as input.\nAs you noticed, we are using a couple of packages that need to be installed pandas and requests. To install them, in dpga-pipeline/setup.py we add them in the install_requires array.\n\n\ndagster-pipeline/setup.py\n\nsetup(\n # ...\n install_requires=[\n \"dagster\",\n \"dagster-cloud\",\n \"pandas\", # <--- Add this line \n \"requests\" # <---- Add this line too\n ],\n #...\n)\n\nOk, So now let’s test if this is working so far. 
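Before launching the UI, you can also sanity-check both assets in-process from a short Python script. This is a minimal sketch, not part of the scaffolded project: `materialize` is Dagster's in-process test helper, and the import path assumes the package layout generated above.

```python
# quick_check.py -- illustrative only; run it from the dpga-pipeline folder.
from dagster import materialize

# The two assets defined in the scaffolded package above.
from dpga_pipeline.assets import raw_dpgs, deployment_countries

if __name__ == "__main__":
    # Materialize both assets in-process (this calls the live DPGA API).
    result = materialize([raw_dpgs, deployment_countries])
    assert result.success

    # Peek at the output of the second asset.
    df = result.output_for_node("deployment_countries")
    print(df.head())
```

For the rest of the tutorial, however, we will test through the Dagster UI.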
To do that we will first install the pipeline package in editable mode (-e). This allows you to edit the package without needing to install it again.\npip install -e '.[dev]'\nThen, we will launch the Dagster user interface:\ndagster dev\nThis launches a local instance of dagster server in port 3000 on localhost. So just open http://localhost:3000. Note, instance of dagster is similar to what you are running on the cluster but directly on your computer. In this case you are not using the one installed in the cluster.\nYou should see something like:\n\n\n\nDagster user interface\n\n\n\n\n\n\n\n\nNote\n\n\n\nYou have available the source code of the pipeline in magasin’s source code repository within the folder examples/dpga-pipeline/dpga-pipeline-store-local\n\n\n\n3.1.1 Save the assets in MinIO.\nTill now, we’ve been working on the development machine file system. The next step is to save the information we want to keep in MinIO.\nTo access the MinIO bucket we will use fsspec. This python library provides an standard interface regardless of the underlying filesystem. So, if you chose to use other file system to run this example, you can just change the environment variables and the address.\nMinIO provides an S3 compatible bucket file system, so we will use it. First we will add the dependencies fsspec and s3fs.\n\n\ndpga-pipeline/setup.py\n\nsetup(\n #...\n install_requires=[\n \"dagster\",\n \"dagster-cloud\",\n \"pandas\",\n \"requests\",\n \"fsspec\", # <---- New dependency\n \"s3fs\" # <---- New dependency\n ],\n #...\n)\n\nNow, we’re going to modify our assets to use the minio filesystem.\n\n\ndpga-pipeline/dpga_pipeline/assets.py\n\nimport fsspec\nimport requests\nimport pandas as pd\nfrom pandas import DataFrame\nfrom dagster import asset\n\n@asset\ndef raw_dpgs() -> DataFrame:\n \"\"\" DPGs data from the API\"\"\"\n # Load from API\n dpgs_json_dict = requests.get(\"https://api.digitalpublicgoods.net/dpgs\").json() \n\n # Convert to pandas dataframe\n df = pd.DataFrame.from_dict(dpgs_json_dict)\n return df\n\n@asset\ndef deployment_countries(raw_dpgs: DataFrame) -> DataFrame:\n \n df = raw_dpgs\n df_loc = pd.merge(df, pd.json_normalize(df[\"locations\"]), left_index=True, right_index=True)\n df_deployment_countries = df_loc.explode(\"deploymentCountries\")\n df_deployment_countries = df_deployment_countries[[\"id\", \"name\",\"deploymentCountries\"]]\n \n # Save to MinIO\n fs= fsspec.filesystem('s3')\n with fs.open('/magasin/data/deployment_countries.parquet','wb') as f:\n df_deployment_countries.to_parquet(f)\n \n return df_deployment_countries\n\nThen, we will setup some environment variables that will setup the Minio S3 bucket credentials. Add the .env file in the root of your project (same folder as setup.py).\nFSSPEC_S3_ENDPOINT_URL='http://localhost:9000'\nFSSPEC_S3_KEY='minio'\nFSSPEC_S3_SECRET='minio123'\nAs you can see we are indicating in the .env file that the endpoint of our minio is in localhost port 9000. To enable this service we need to run the following command\nmag minio api\nAs earlier, while this command is running it will forward any connection in our localhost:9000 to the our MinIO instance in the Kubernetes cluster. 
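While `mag minio api` is running, you can also confirm from Python that the credentials and endpoint in the `.env` file actually reach MinIO. A minimal sketch: it sets the same three values inline (before importing fsspec, so its configuration picks them up) and assumes the `magasin` bucket created during installation.

```python
import os

# Same values as in the .env file above; set them before importing fsspec.
os.environ.setdefault("FSSPEC_S3_ENDPOINT_URL", "http://localhost:9000")
os.environ.setdefault("FSSPEC_S3_KEY", "minio")
os.environ.setdefault("FSSPEC_S3_SECRET", "minio123")

import fsspec  # imported after the env vars on purpose

fs = fsspec.filesystem("s3")   # picks up the FSSPEC_S3_* settings
print(fs.ls("magasin"))        # lists the top level of the magasin bucket
```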
You shoud keep running during this till you are instructed to do close it.\nIn another terminal, we need to reinstall the pipeline so the new dependencies are loaded, and, then, we can run Dagster:\n\npip install -e '.[dev]'\ndagster dev\nNote that after you launch dagster dev you should see something like:\ndagster - INFO - Loaded environment variables from .env file: \nFSSPEC_S3_ENDPOINT_URL,FSSPEC_S3_KEY,FSSPEC_S3_SECRET\nThis is because Dagster loads all the .env file automatically and exposes the variables to the code.\nOpen again the browser pointing to http://localhost:3000 and in the dagster UI and run Materialize all.\nThis time, all files should have been materialized in the magasin bucket.\nTo test if the files are there. In a terminal run:\nmc ls myminio/magasin/data\n\n\n\n3.2 Adding a job scheduler\nUntil now, we have been materializing manually our assets. However, automating this task is indeed the ultimate goal of setting up a pipeline.\nIn Dagster, you have available schedulers which basically run your pipeline, or pieces of it, in a fixed interval. Dagster schedulers follow a cron style format.\n\n\ndpga-pipeline/dpga_pipeline/assets.py\n\n#__init__.py\nfrom dagster import Definitions, load_assets_from_modules, define_asset_job, ScheduleDefinition\nfrom . import assets\n\nall_assets = load_assets_from_modules([assets])\n\n# Create an asset job that materializes all assets of the pipeline\nall_assets_job = define_asset_job(name=\"all_assets_job\",\n selection=all_assets,\n description=\"Gets all the DPG assets\")\n# Create a scheduler\nmain_schedule = ScheduleDefinition(job=all_assets_job,\n cron_schedule=\"* * * * *\"\n )\n\ndefs = Definitions(\n assets=all_assets,\n jobs=[all_assets_job],\n schedules=[main_schedule]\n)\n\nWhat we did in the code above is to:\n\nAdd a job. A job, is basically a selection of assets that will be materialized together in the same run.\nDefine a schedule. The schedule will launch the job at specified time intervals. In our case every minute (* * * * *).\n\n\n\n\n\n\n\nTip: Understanding cron jobs\n\n\n\nThe job cron format is used to specify the schedule for recurring tasks or jobs in Unix-like operating systems and cron job scheduling systems. It consists of five fields separated by spaces, representing different aspects of the schedule:\n<minute> <hour> <day-of-month> <month> <day-of-week>\n\nMinute (0-59): Specifies the minute of the hour when the job should run. Valid values range from 0 to 59.\nHour (0-23): Specifies the hour of the day when the job should run. Valid values range from 0 to 23, where 0 represents midnight and 23 represents 11 PM.\nDay of Month (1-31): Specifies the day of the month when the job should run. Valid values range from 1 to 31, depending on the month.\nMonth (1-12): Specifies the month of the year when the job should run. Valid values range from 1 to 12, where 1 represents January and 12 represents December.\nDay of Week (0-7): Specifies the day of the week when the job should run. Both 0 and 7 represent Sunday, while 1 represents Monday, and so on, up to 6 representing Saturday.\n\nEach field can contain a single value, a list of values separated by commas, a range of values specified with a hyphen, or an asterisk (*) to indicate all possible values. Additionally, you can use special characters such as slashes (/) for specifying intervals and question marks (?) 
for leaving a field unspecified (e.g., for day of month or day of week when the other field should match).\nHere you have some examples of cron intervals\n\n\n\n\n\n\n\nCron Expression\nDescription\n\n\n\n\n0 0 * * *\nRun a task every day at midnight (00:00).\n\n\n15 2 * * *\nRun a task at 2:15 AM every day.\n\n\n0 0 * * 1\nRun a task every Monday at midnight (00:00).\n\n\n0 12 * * 1-5\nRun a task every weekday (Monday to Friday) at 12 PM (noon).\n\n\n*/15 * * * *\nRun a task every 15 minutes.\n\n\n0 */2 * * *\nRun a task every 2 hours, starting from midnight.\n\n\n30 3 * * 6\nRun a task every Saturday at 3:30 AM.\n\n\n0 0 1 * *\nRun a task at midnight on the first day of every month.\n\n\n0 0 1 1 *\nRun a task at midnight on January 1st every year.\n\n\n\n\n\nIf you launch again dagster dev and you go to Overview -> Jobs, you can enable the job.\n\n\n\nScheduled job\n\n\n\n\n3.3 Deploy the pipeline in the cluster\nUntil now we have been running dagster on our own computer by enabling the access to the MinIO installed in our kubernetes cluster through mag minio api. But we want our pipeline to run entirely within our Kubernetes cluster. To do that we will deploy a container (pod) in our cluster that Dagster will use to run our pipeline.\nWe will follow this steps:\n\nPrepare the Docker image. Our pipeline will reside in a container that will be called by Dagster to run the pipeline. So we need to create a Docker image that will hold all our code and is ready to be called by Dagster.\nAdd the environment variables as secrets. We need to provide to our image the environmental variables. In Kubernetes this is done through secrets. Secrets are a special type of resource for holding sensitive information that exists in Kubernetes.\nRe-Deploy Dagster. After we have prepared our image with the pipeline, we need to tell our Dagster instance to deploy it, and use it. The simplest way is to re-deploy magasin’s dagster helm chart.\n\n\n3.3.1 Prepare the Docker image\nEdit the setup.py file of your project and add a new dependency dagster-postgres:\n\n\ndpga-pipeline/setup.py\n\nfrom setuptools import find_packages, setup\n\nsetup(\n name=\"dpga_pipeline\",\n packages=find_packages(exclude=[\"dpga_pipeline_tests\"]),\n install_requires=[\n \"dagster\",\n \"dagster-cloud\",\n \"dagster-postgres\", #<------------- Add this line\n \"pandas\",\n \"requests\",\n \"fsspec\",\n \"s3fs\"\n ],\n extras_require={\"dev\": [\"dagster-webserver\", \"pytest\"]},\n)\n\nThis is because now Dagster is going to use now the PosgreSQL database that is used in the cluster for keeping the logs data. Earlier, when we were using the local Dagster setup.\n\nIn the same folder as the `setup.py` file of your dagster project create a new file called [`Dockerfile`](https://docs.docker.com/engine/reference/builder/) and add the following.\n\n```{.sh filename=\"dagster-pipeline/Dockerfile\"}\n# Select the base image \nFROM python:3.10-slim\n\n# Copy all our code into the container\nCOPY . /\n\n# Install the module within the container\n# This will install all the dependencies\nRUN pip install .\n\nNow we are going to build the image. To prevent issues while running it, we are going to build a multi-architecture image. Currently, there are two major architectures we have to deal with ARM64 (RaspberryPi’s and Apple M1/M2…) and AMD64 (regular Intel and AMD computers). 
By building a multi-architecture image it will run regardless of the architecture.\nIf you’re use to create Docker images, something that you may have noticed is that in our Dockerfile we did not define an ENTRYPOINT or launched command CMD, in our Dockerfile basically we just installed our pipeline code. Whereas in Docker it is common to end the Dockerfile with one of these two commands, in our case the command that launches dagster will be injected during the deployment of the image. We will set it up later.\n# first we create a builder. This just allows us to build for architectures different that our owns.\n# This only needs to be run once per computer.\ndocker buildx create --driver=docker-container --name=magasin-builder \n\n# In the command below replace <registry> by your registry.\n# If you are using docker hub, it is your user name (you need to login first.\n# In other registries such as Azure Container Registry (my-registry.azurecr.io)or Amazon ECR, please check the documentation of the provider.\n\ndocker buildx build --builder=magasin-builder --platform linux/amd64,linux/arm64 -t <registry>/dpga-pipeline:latest --push .\nNow in our registry we have a new image dpga-pipeline with the tag latest. Note that this image will be publicly available.\nFor the rest of the tutorial we will use this image: merlos/dpga-pipeline:latest, you can replace it with yours.\n\n\n\n\n\n\nTip\n\n\n\nIf you want to check what other architectures are supported run:\n docker buildx ls\nNAME/NODE DRIVER/ENDPOINT STATUS BUILDKIT PLATFORMS\nmagasin-builder docker-container \n magasin-builder0 desktop-linux running v0.12.3 linux/arm64, linux/amd64, linux/amd64/v2, linux/riscv64, linux/ppc64le, linux/s390x, linux/386, linux/mips64le, linux/mips64, linux/arm/v7, linux/arm/v6\n\n\n\n\n3.3.2 Add the environment variables as secrets\nPreviously, we set some environment variables with our credentials to access MinIO. When deploying an image to Kubernetes, the typical way to set sensitive information is through Secrets. Secrets is a simple way for us to set variables that are somewhat sensitive. For production deployments you should follow these good practices for Kubernetes secrets.\nkubectl create secret generic dpga-pipeline-secret \\\n --namespace magasin-dagster \\\n --from-literal=FSSPEC_S3_ENDPOINT_URL=http://myminio-ml.magasin-tenant.svc.cluster.local \\\n --from-literal=FSSPEC_S3_KEY='minio' \\\n --from-literal=FSSPEC_S3_SECRET='minio123'\nThis command will create a secret called dpga-pipeline-secret in the namespace magasin-dagster. Remember that a namespace in Kubernetes is something that can be compared to a folder.\nNote that the FSSPEC_S3_ENDPOINT_URL is no longer localhost, but the URL of the minio server on the cluster. Internal names follow this pattern <service-name>.<namespace>.svc.cluster.local.\nTo check the secret was created you can run this command:\nkubectl get secrets --namespace magasin-dagster\nAnd check there is a line with dpga-pipeline-secret with 3 in the data column:\nNAME TYPE DATA AGE\ndagster-postgresql Opaque 1 3d22h\ndagster-postgresql-secret Opaque 1 3d22h\ndpga-pipeline-secret Opaque 3 3m16s\nsh.helm.release.v1.dagster.v1 helm.sh/release.v1 1 3d22h\nTo see the contents of each data point:\nkubectl get secret dpga-pipeline-secret -n magasin-dagster -o jsonpath='{.data.FSSPEC_S3_ENDPOINT_URL}' | base64 --decode\nNotice the | base64 --decode, this is because the screts are encoded in base64. 
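The encoding is plain base64, not encryption, and you can reproduce the round trip in Python if you want to see exactly what Kubernetes stores:

```python
import base64

# A Secret value as Kubernetes stores it: just base64-encoded bytes.
encoded = base64.b64encode(b"minio").decode()
print(encoded)                               # bWluaW8=
print(base64.b64decode(encoded).decode())    # minio
```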
For example minio is encoded as bWluaW8=.\nIf you need to update the secret, one simple way is to delete and then add it back. To delete run the command:\n# kubectl delete secret <secretname> --namespace <namespace-name>\nkubectl delete secret dpga-pipeline-secret --namespace magasin-dagster\n\n\n3.3.3 Re-Deploy Dagster\nThe last thing we have to do is to re-deploy Dagster so that it includes our new pipeline.\nCreate a new file called dagster-helm-values.yml with the following contents:\ndagster-user-deployments:\n enabled: true\n deployments:\n - name: \"dpga-pipeline-k8s\"\n image:\n repository: \"merlos/dpga-pipeline\"\n tag: latest\n pullPolicy: Always\n dagsterApiGrpcArgs:\n - \"--package-name\"\n - \"dpga_pipeline\"\n port: 3030\n envSecrets:\n - name: dpga-pipeline-secret\n includeConfigInLaunchedRuns:\n enabled: true\nThis file can also hold ConfigMaps or labels. You have more details about the dagster user deployment options\nThis file telling to include in the deployment our pipeline image (merlos/dpga-pipeline) as well as the environment secret envSecret called dpga-pipeline-secret.\nAlso we have defined in the file dagsterApiGrpcArgs. This includes the arguments for dagster api grpc, which you can get by running dagster api grpc --help. As we said earlier, it is on the deployment where we set launch command for the image. This is the command. Dagster uses Remote Procedure Calls, which for the purposes of this tutorial you can understand as an regular API to communicate the main dagster daemon and our deployments. The daemon is the long-lasting process that keeps track of the sensor,shedules, etc. And this daemon communicates with theIn our case we tell the command that Dagster uses remote procedure calls between the dagster main process and our image.\nNow we have to update our kubernetes deployment to include this new pipeline (a.k.a. code location in Dagster terminology).\nGo to the folder where the dagster-helm-values.yaml is located, and then run:\nhelm upgrade dagster magasin/dagster --namespace magasin-dagster -f ./dagster-helm-values.yml\nThis will update the deployment of the dagster instance of magasin. You should see something like:\nRelease \"dagster\" has been upgraded. Happy Helming!\nNAME: dagster\nLAST DEPLOYED: Tue Feb 13 09:28:32 2024\nNAMESPACE: magasin-dagster\nSTATUS: deployed\nREVISION: 2\nTEST SUITE: None\nNOTES:\nLaunched. You can access the Dagster webserver/UI by running the following commands:\n\nexport DAGSTER_WEBSERVER_POD_NAME=$(kubectl get pods --namespace magasin-dagster -l \"app.kubernetes.io/name=dagster,app.kubernetes.io/instance=dagster,component=dagster-webserver\" -o jsonpath=\"{.items[0].metadata.name}\")\necho \"Visit http://127.0.0.1:8080 to open the Dagster UI\"\nkubectl --namespace magasin-dagster port-forward $DAGSTER_WEBSERVER_POD_NAME 8080:80\nTo open the Dagster user interface of the instance running in our Kubernetes cluster we need to run\nmag dagster ui\nNow, this will open the dagster instance in your Kubernetes cluster.\n\n\n\n\n\n\nNote\n\n\n\nYou have available the source code of the pipeline, Dockerfile, dagster-helm-values.yml in magasin’s source code repository within the folder examples/dpga-pipeline/dpga-pipeline-store-minio\n\n\n\n\n\n3.4 Troubleshooting the deployment\nIn case you face any issue here you have some ways of trying to find out what’s going on. This and seeking some help on a search engine or large language model, typically helps:\n\n\n3.5 Commands to inspect status\nCheck if everything is running fine. 
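Besides the kubectl commands below, the official Kubernetes Python client can run the same checks from a script, which is handy if you want to automate them. A sketch, assuming `pip install kubernetes` and the same kubeconfig that kubectl uses:

```python
from kubernetes import client, config

config.load_kube_config()          # same credentials and context as kubectl
v1 = client.CoreV1Api()

# Equivalent of: kubectl get pods --namespace magasin-dagster
for pod in v1.list_namespaced_pod("magasin-dagster").items:
    print(pod.metadata.name, pod.status.phase)
```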
You can check the status of the pods in the magasin-dagster namespace\nkubectl get pods --namespace magasin-dagster\nNAME READY STATUS RESTARTS AGE\ndagster-daemon-7c6474cbfd-7rgtr 1/1 Running 0 3h41m\ndagster-dagster-user-deployments-dpga-pipeline-k8s-5kqtc 1/1 Running 0 64m\ndagster-dagster-webserver-76ff9c7689-zv89b 1/1 Running 0 3h41m\ndagster-postgresql-0 1/1 Running 6 (5h53m ago) 4d2h\ndagster-run-745684fc-80c5-45e5-a238-ce5fdc0c0dbe-nzh8x 0/1 Error 0 124m\nHere you can see the run had an error.\nDescribe the dagster-run pod:\nkubectl describe pod dagster-run-745684fc-80c5-45e5-a238-ce5fdc0c0dbe-nzh8x -n magasin-dagster\nGet the logs of the run pod:\nkubectl logs dagster-run-745684fc-80c5-45e5-a238-ce5fdc0c0dbe-nzh8x -n magasin-dagster\nkubectl describe job dagster-run-745684fc-80c5-45e5-a238-ce5fdc0c0dbe-nzh8x -n \nmagasin-dagster\nInspect the logs of the deployed main pod:\nkubectl logs dagster-dagster-user-deployments-dpga-pipeline-k8s-5kqtc --namespace magasin-dagster\n2024-02-13 10:26:42 +0000 - dagster.code_server - INFO - Starting Dagster code server for package dpga_pipeline on port 3030 in process 1\n2024-02-13 10:26:42 +0000 - dagster.code_server - INFO - Started Dagster code server for package dpga_pipeline on port 3030 in process 1\nLastly, on the Dagster user interface (launched with mag dagster ui), in the Runs tab, within your failed run click on View run button.", + "text": "3 Create a Dagster pipeline\nThe next step is to create a pipeline using Dagster. A pipeline is just a piece of code that moves data from place to another and that can introduce some changes before saving it in the destination place. In our case the pipeline will take the data from the DPGA API and store it in a MinIO bucket.\nThe first thing we need to do is to install Dagster.\npip install dagster==1.6.4 dagster-webserver==1.6.4\n\n\n\n\n\n\nNote\n\n\n\nDagster is a very agile product that is continuously evolving, this means that you have to be cognizant of the version you’re running.\nYou can check the version installed in your cluster by running helm list --all-namespaces and looking at the APP VERSION column.\nThen run pip install pip install dagster==<version>\n\n\n\n3.1 Add the pipeline code\nOnce Dagster is installed, we’re going to create a new project using the default structure prodivded by Dagster. This should be the default procedure for creating any new pipeline.\ndagster project scaffold --name dpga-pipeline\nCreating a Dagster project at /home/magasin/dpga-pipeline.\nCreating a Dagster code location at /home/magasin/dpga-pipeline.\nGenerated files for Dagster code location in /home/magasin/dpga-pipeline.\nGenerated files for Dagster project in /home/magasin/dpga-pipeline.\nSuccess! Created dpga-pipeline at /home/magasin/dpga-pipeline.\nBy scaffolding our project, Dagster creates a basic structure of a python package that could be installed using pip as any other package as well as some additional metadata files that will be used by Dagster to run the pipeline. You have some more info in the Dagster documentation.\nNow, lets add our code. 
Open the file dpga-pipeline/dpga_pipeline/assets.py\n\n\ndpga-pipeline/dpga_pipeline/assets.py\n\nimport requests\nimport pandas as pd\nfrom pandas import DataFrame\n\nfrom dagster import asset\n\n@asset\ndef raw_dpgs() -> DataFrame:\n \"\"\" DPGs data from the API\"\"\"\n dpgs_json_dict = requests.get(\"https://api.digitalpublicgoods.net/dpgs\").json()\n df = pd.DataFrame.from_dict(dpgs_json_dict)\n return df\n\n@asset\ndef deployment_countries(raw_dpgs: DataFrame) -> DataFrame:\n df = raw_dpgs\n df_loc = pd.merge(df, pd.json_normalize(df[\"locations\"]), left_index=True, right_index=True)\n df_deployment_countries = df_loc.explode(\"deploymentCountries\")\n df_deployment_countries[[\"name\",\"deploymentCountries\"]]\n\n return df_deployment_countries\n\nAs you can see the code seems pretty similar to what we wrote in our exploratory analysis.\nThe in the code we have defined two @assets. An asset according to the Dagster definition is:\n\nAn asset is an object in persistent storage, such as a table, file, or persisted machine learning model. A Software-defined Asset is a Dagster object that couples an asset to the function and upstream assets used to produce its contents.\n\nIn our case, raw_dpgs, stores the dpgs as they come from the API as a DataFrame, and deployment_countries that extracts the one row per country in which the DPG has been deplayed.\nAnother thing that you can notice in the code is that in the definition of the deployment_countries asset, we are passing raw_dpgs: DataFrame. That will tell Dagster that deployment_countries depends on the raw_dpgs and it will be used as input.\nAs you noticed, we are using a couple of packages that need to be installed pandas and requests. To install them, in dpga-pipeline/setup.py we add them in the install_requires array.\n\n\ndagster-pipeline/setup.py\n\nsetup(\n # ...\n install_requires=[\n \"dagster\",\n \"dagster-cloud\",\n \"pandas\", # <--- Add this line \n \"requests\" # <---- Add this line too\n ],\n #...\n)\n\nOk, so now let’s test if this is working so far. To do that we will first install the pipeline package in editable mode (-e). This allows you to edit the package without needing to install it again.\npip install -e '.[dev]'\nThen, we will launch the Dagster user interface:\ndagster dev\nThis launches a local instance of dagster server in port 3000 on localhost. So just open http://localhost:3000. Note, instance of dagster is similar to what you are running on the cluster but directly on your computer. In this case you are not using the one installed in the cluster.\nYou should see something like:\n\n\n\nDagster user interface\n\n\n\n\n\n\n\n\nNote\n\n\n\nYou have available the source code of the pipeline in magasin’s source code repository within the folder examples/dpga-pipeline/dpga-pipeline-store-local\n\n\n\n3.1.1 Save the assets in MinIO.\nTill now, we’ve been working on the development machine file system. The next step is to save the information we want to keep in MinIO.\nTo access the MinIO bucket we will use fsspec. This python library provides an standard interface regardless of the underlying filesystem. So, if you chose to use other file system to run this example, you can just change the environment variables and the address.\nMinIO provides an S3 compatible bucket file system, so we will use it. 
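To see why fsspec is convenient here: the same write works against the local filesystem and against MinIO's S3 API just by changing the URL prefix. A small illustration with dummy data (the `s3://` write additionally needs the FSSPEC_S3_* credentials we configure below, and writing parquet needs an engine such as pyarrow, just like the pipeline code):

```python
import fsspec
import pandas as pd

df = pd.DataFrame({"name": ["example-dpg"], "deploymentCountries": ["Kenya"]})

# Same code path, two different storage backends.
for url in (
    "file:///tmp/deployment_countries.parquet",          # local filesystem
    "s3://magasin/data/deployment_countries.parquet",     # MinIO, S3-compatible
):
    with fsspec.open(url, "wb") as f:
        df.to_parquet(f)
```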
First we will add the dependencies fsspec and s3fs.\n\n\ndpga-pipeline/setup.py\n\nsetup(\n #...\n install_requires=[\n \"dagster\",\n \"dagster-cloud\",\n \"pandas\",\n \"requests\",\n \"fsspec\", # <---- New dependency\n \"s3fs\" # <---- New dependency\n ],\n #...\n)\n\nNow, we’re going to modify our assets to use the minio filesystem.\n\n\ndpga-pipeline/dpga_pipeline/assets.py\n\nimport fsspec\nimport requests\nimport pandas as pd\nfrom pandas import DataFrame\nfrom dagster import asset\n\n@asset\ndef raw_dpgs() -> DataFrame:\n \"\"\" DPGs data from the API\"\"\"\n # Load from API\n dpgs_json_dict = requests.get(\"https://api.digitalpublicgoods.net/dpgs\").json() \n\n # Convert to pandas dataframe\n df = pd.DataFrame.from_dict(dpgs_json_dict)\n return df\n\n@asset\ndef deployment_countries(raw_dpgs: DataFrame) -> DataFrame:\n \n df = raw_dpgs\n df_loc = pd.merge(df, pd.json_normalize(df[\"locations\"]), left_index=True, right_index=True)\n df_deployment_countries = df_loc.explode(\"deploymentCountries\")\n df_deployment_countries = df_deployment_countries[[\"id\", \"name\",\"deploymentCountries\"]]\n \n # Save to MinIO\n fs= fsspec.filesystem('s3')\n with fs.open('/magasin/data/deployment_countries.parquet','wb') as f:\n df_deployment_countries.to_parquet(f)\n \n return df_deployment_countries\n\nThen, we will setup some environment variables that will setup the Minio S3 bucket credentials. Add the .env file in the root of your project (same folder as setup.py).\nFSSPEC_S3_ENDPOINT_URL='http://localhost:9000'\nFSSPEC_S3_KEY='minio'\nFSSPEC_S3_SECRET='minio123'\nAs you can see we are indicating in the .env file that the endpoint of our minio is in localhost port 9000. To enable this service we need to run the following command\nmag minio api\nAs earlier, while this command is running it will forward any connection in our localhost:9000 to the our MinIO instance in the Kubernetes cluster. You shoud keep running during this till you are instructed to do close it.\nIn another terminal, we need to reinstall the pipeline so the new dependencies are loaded, and, then, we can run Dagster:\n\npip install -e '.[dev]'\ndagster dev\nNote that after you launch dagster dev you should see something like:\ndagster - INFO - Loaded environment variables from .env file: \nFSSPEC_S3_ENDPOINT_URL,FSSPEC_S3_KEY,FSSPEC_S3_SECRET\nThis is because Dagster loads all the .env file automatically and exposes the variables to the code.\nOpen again the browser pointing to http://localhost:3000 and in the dagster UI and run Materialize all.\nThis time, all files should have been materialized in the magasin bucket.\nTo test if the files are there. In a terminal run:\nmc ls myminio/magasin/data\n\n\n\n3.2 Adding a job scheduler\nUntil now, we have been materializing manually our assets. However, automating this task is indeed the ultimate goal of setting up a pipeline.\nIn Dagster, you have available schedulers which basically run your pipeline, or pieces of it, in a fixed interval. Dagster schedulers follow a cron style format.\n\n\ndpga-pipeline/dpga_pipeline/assets.py\n\n#__init__.py\nfrom dagster import Definitions, load_assets_from_modules, define_asset_job, ScheduleDefinition\nfrom . 
import assets\n\nall_assets = load_assets_from_modules([assets])\n\n# Create an asset job that materializes all assets of the pipeline\nall_assets_job = define_asset_job(name=\"all_assets_job\",\n selection=all_assets,\n description=\"Gets all the DPG assets\")\n# Create a scheduler\nmain_schedule = ScheduleDefinition(job=all_assets_job,\n cron_schedule=\"* * * * *\"\n )\n\ndefs = Definitions(\n assets=all_assets,\n jobs=[all_assets_job],\n schedules=[main_schedule]\n)\n\nWhat we did in the code above is to:\n\nAdd a job. A job, is basically a selection of assets that will be materialized together in the same run.\nDefine a schedule. The schedule will launch the job at specified time intervals. In our case every minute (* * * * *).\n\n\n\n\n\n\n\nTip: Understanding cron jobs\n\n\n\nThe job cron format is used to specify the schedule for recurring tasks or jobs in Unix-like operating systems and cron job scheduling systems. It consists of five fields separated by spaces, representing different aspects of the schedule:\n<minute> <hour> <day-of-month> <month> <day-of-week>\n\nMinute (0-59): Specifies the minute of the hour when the job should run. Valid values range from 0 to 59.\nHour (0-23): Specifies the hour of the day when the job should run. Valid values range from 0 to 23, where 0 represents midnight and 23 represents 11 PM.\nDay of Month (1-31): Specifies the day of the month when the job should run. Valid values range from 1 to 31, depending on the month.\nMonth (1-12): Specifies the month of the year when the job should run. Valid values range from 1 to 12, where 1 represents January and 12 represents December.\nDay of Week (0-7): Specifies the day of the week when the job should run. Both 0 and 7 represent Sunday, while 1 represents Monday, and so on, up to 6 representing Saturday.\n\nEach field can contain a single value, a list of values separated by commas, a range of values specified with a hyphen, or an asterisk (*) to indicate all possible values. Additionally, you can use special characters such as slashes (/) for specifying intervals and question marks (?) for leaving a field unspecified (e.g., for day of month or day of week when the other field should match).\nHere you have some examples of cron intervals\n\n\n\n\n\n\n\nCron Expression\nDescription\n\n\n\n\n0 0 * * *\nRun a task every day at midnight (00:00).\n\n\n15 2 * * *\nRun a task at 2:15 AM every day.\n\n\n0 0 * * 1\nRun a task every Monday at midnight (00:00).\n\n\n0 12 * * 1-5\nRun a task every weekday (Monday to Friday) at 12 PM (noon).\n\n\n*/15 * * * *\nRun a task every 15 minutes.\n\n\n0 */2 * * *\nRun a task every 2 hours, starting from midnight.\n\n\n30 3 * * 6\nRun a task every Saturday at 3:30 AM.\n\n\n0 0 1 * *\nRun a task at midnight on the first day of every month.\n\n\n0 0 1 1 *\nRun a task at midnight on January 1st every year.\n\n\n\n\n\nIf you launch again dagster dev and you go to Overview -> Jobs, you can enable the job.\n\n\n\nScheduled job\n\n\n\n\n3.3 Deploy the pipeline in the cluster\nUntil now we have been running dagster on our own computer by enabling the access to the MinIO installed in our kubernetes cluster through mag minio api. But we want our pipeline to run entirely within our Kubernetes cluster. To do that we will deploy a container (pod) in our cluster that Dagster will use to run our pipeline.\nWe will follow this steps:\n\nPrepare the Docker image. Our pipeline will reside in a container that will be called by Dagster to run the pipeline. 
So we need to create a Docker image that will hold all our code and is ready to be called by Dagster.\nAdd the environment variables as secrets. We need to provide to our image the environmental variables. In Kubernetes this is done through secrets. Secrets are a special type of resource for holding sensitive information that exists in Kubernetes.\nRe-Deploy Dagster. After we have prepared our image with the pipeline, we need to tell our Dagster instance to deploy it, and use it. The simplest way is to re-deploy magasin’s dagster helm chart.\n\n\n3.3.1 Prepare the Docker image\nEdit the setup.py file of your project and add a new dependency dagster-postgres:\n\n\ndpga-pipeline/setup.py\n\nfrom setuptools import find_packages, setup\n\nsetup(\n name=\"dpga_pipeline\",\n packages=find_packages(exclude=[\"dpga_pipeline_tests\"]),\n install_requires=[\n \"dagster\",\n \"dagster-cloud\",\n \"dagster-postgres\", #<------------- Add this line\n \"pandas\",\n \"requests\",\n \"fsspec\",\n \"s3fs\"\n ],\n extras_require={\"dev\": [\"dagster-webserver\", \"pytest\"]},\n)\n\nThis is because now Dagster is going to use now the PosgreSQL database that is used in the cluster for keeping the logs data. Earlier, when we were using the local Dagster setup.\n\nIn the same folder as the `setup.py` file of your dagster project create a new file called [`Dockerfile`](https://docs.docker.com/engine/reference/builder/) and add the following.\n\n```{.sh filename=\"dagster-pipeline/Dockerfile\"}\n# Select the base image \nFROM python:3.10-slim\n\n# Copy all our code into the container\nCOPY . /\n\n# Install the module within the container\n# This will install all the dependencies\nRUN pip install .\n\nNow we are going to build the image. To prevent issues while running it, we are going to build a multi-architecture image. Currently, there are two major architectures we have to deal with ARM64 (RaspberryPi’s and Apple M1/M2…) and AMD64 (regular Intel and AMD computers). By building a multi-architecture image it will run regardless of the architecture.\nIf you’re use to create Docker images, something that you may have noticed is that in our Dockerfile we did not define an ENTRYPOINT or launched command CMD, in our Dockerfile basically we just installed our pipeline code. Whereas in Docker it is common to end the Dockerfile with one of these two commands, in our case the command that launches dagster will be injected during the deployment of the image. We will set it up later.\n# first we create a builder. This just allows us to build for architectures different that our owns.\n# This only needs to be run once per computer.\ndocker buildx create --driver=docker-container --name=magasin-builder \n\n# In the command below replace <registry> by your registry.\n# If you are using docker hub, it is your user name (you need to login first.\n# In other registries such as Azure Container Registry (my-registry.azurecr.io)or Amazon ECR, please check the documentation of the provider.\n\ndocker buildx build --builder=magasin-builder --platform linux/amd64,linux/arm64 -t <registry>/dpga-pipeline:latest --push .\nNow in our registry we have a new image dpga-pipeline with the tag latest. 
Note that this image will be publicly available.\nFor the rest of the tutorial we will use this image: merlos/dpga-pipeline:latest, you can replace it with yours.\n\n\n\n\n\n\nTip\n\n\n\nIf you want to check what other architectures are supported run:\n docker buildx ls\nNAME/NODE DRIVER/ENDPOINT STATUS BUILDKIT PLATFORMS\nmagasin-builder docker-container \n magasin-builder0 desktop-linux running v0.12.3 linux/arm64, linux/amd64, linux/amd64/v2, linux/riscv64, linux/ppc64le, linux/s390x, linux/386, linux/mips64le, linux/mips64, linux/arm/v7, linux/arm/v6\n\n\n\n\n3.3.2 Add the environment variables as secrets\nPreviously, we set some environment variables with our credentials to access MinIO. When deploying an image to Kubernetes, the typical way to set sensitive information is through Secrets. Secrets is a simple way for us to set variables that are somewhat sensitive. For production deployments you should follow these good practices for Kubernetes secrets.\nkubectl create secret generic dpga-pipeline-secret \\\n --namespace magasin-dagster \\\n --from-literal=FSSPEC_S3_ENDPOINT_URL=http://myminio-ml.magasin-tenant.svc.cluster.local \\\n --from-literal=FSSPEC_S3_KEY='minio' \\\n --from-literal=FSSPEC_S3_SECRET='minio123'\nThis command will create a secret called dpga-pipeline-secret in the namespace magasin-dagster. Remember that a namespace in Kubernetes is something that can be compared to a folder.\nNote that the FSSPEC_S3_ENDPOINT_URL is no longer localhost, but the URL of the minio server on the cluster. Internal names follow this pattern <service-name>.<namespace>.svc.cluster.local.\nTo check the secret was created you can run this command:\nkubectl get secrets --namespace magasin-dagster\nAnd check there is a line with dpga-pipeline-secret with 3 in the data column:\nNAME TYPE DATA AGE\ndagster-postgresql Opaque 1 3d22h\ndagster-postgresql-secret Opaque 1 3d22h\ndpga-pipeline-secret Opaque 3 3m16s\nsh.helm.release.v1.dagster.v1 helm.sh/release.v1 1 3d22h\nTo see the contents of each data point:\nkubectl get secret dpga-pipeline-secret -n magasin-dagster -o jsonpath='{.data.FSSPEC_S3_ENDPOINT_URL}' | base64 --decode\nNotice the | base64 --decode, this is because the screts are encoded in base64. For example minio is encoded as bWluaW8=.\nIf you need to update the secret, one simple way is to delete and then add it back. To delete run the command:\n# kubectl delete secret <secretname> --namespace <namespace-name>\nkubectl delete secret dpga-pipeline-secret --namespace magasin-dagster\n\n\n3.3.3 Re-Deploy Dagster\nThe last thing we have to do is to re-deploy Dagster so that it includes our new pipeline.\nCreate a new file called dagster-helm-values.yml with the following contents:\ndagster-user-deployments:\n enabled: true\n deployments:\n - name: \"dpga-pipeline-k8s\"\n image:\n repository: \"merlos/dpga-pipeline\"\n tag: latest\n pullPolicy: Always\n dagsterApiGrpcArgs:\n - \"--package-name\"\n - \"dpga_pipeline\"\n port: 3030\n envSecrets:\n - name: dpga-pipeline-secret\n includeConfigInLaunchedRuns:\n enabled: true\nThis file can also hold ConfigMaps or labels. You have more details about the dagster user deployment options\nThis file telling to include in the deployment our pipeline image (merlos/dpga-pipeline) as well as the environment secret envSecret called dpga-pipeline-secret.\nAlso we have defined in the file dagsterApiGrpcArgs. This includes the arguments for dagster api grpc, which you can get by running dagster api grpc --help. 
As we said earlier, it is on the deployment where we set launch command for the image. This is the command. Dagster uses Remote Procedure Calls, which for the purposes of this tutorial you can understand as an regular API to communicate the main dagster daemon and our deployments. The daemon is the long-lasting process that keeps track of the sensor,shedules, etc. And this daemon communicates with theIn our case we tell the command that Dagster uses remote procedure calls between the dagster main process and our image.\nNow we have to update our kubernetes deployment to include this new pipeline (a.k.a. code location in Dagster terminology).\nGo to the folder where the dagster-helm-values.yaml is located, and then run:\nhelm upgrade dagster magasin/dagster --namespace magasin-dagster -f ./dagster-helm-values.yml\nThis will update the deployment of the dagster instance of magasin. You should see something like:\nRelease \"dagster\" has been upgraded. Happy Helming!\nNAME: dagster\nLAST DEPLOYED: Tue Feb 13 09:28:32 2024\nNAMESPACE: magasin-dagster\nSTATUS: deployed\nREVISION: 2\nTEST SUITE: None\nNOTES:\nLaunched. You can access the Dagster webserver/UI by running the following commands:\n\nexport DAGSTER_WEBSERVER_POD_NAME=$(kubectl get pods --namespace magasin-dagster -l \"app.kubernetes.io/name=dagster,app.kubernetes.io/instance=dagster,component=dagster-webserver\" -o jsonpath=\"{.items[0].metadata.name}\")\necho \"Visit http://127.0.0.1:8080 to open the Dagster UI\"\nkubectl --namespace magasin-dagster port-forward $DAGSTER_WEBSERVER_POD_NAME 8080:80\nTo open the Dagster user interface of the instance running in our Kubernetes cluster we need to run\nmag dagster ui\nNow, this will open the dagster instance in your Kubernetes cluster.\n\n\n\n\n\n\nNote\n\n\n\nYou have available the source code of the pipeline, Dockerfile, dagster-helm-values.yml in magasin’s source code repository within the folder examples/dpga-pipeline/dpga-pipeline-store-minio\n\n\n\n\n\n3.4 Troubleshooting the deployment\nIn case you face any issue here you have some ways of trying to find out what’s going on. This and seeking some help on a search engine or large language model, typically helps:\n\n\n3.5 Commands to inspect status\nCheck if everything is running fine. 
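In addition to the kubectl commands below, you can pull a pod's logs directly from Python with the official Kubernetes client, which is useful when collecting diagnostics. A sketch, assuming `pip install kubernetes`; replace the pod name with the one from your own run listing:

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Equivalent of: kubectl logs <run-pod-name> -n magasin-dagster
pod_name = "dagster-run-745684fc-80c5-45e5-a238-ce5fdc0c0dbe-nzh8x"  # replace with yours
print(v1.read_namespaced_pod_log(name=pod_name, namespace="magasin-dagster"))
```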
You can check the status of the pods in the magasin-dagster namespace\nkubectl get pods --namespace magasin-dagster\nNAME READY STATUS RESTARTS AGE\ndagster-daemon-7c6474cbfd-7rgtr 1/1 Running 0 3h41m\ndagster-dagster-user-deployments-dpga-pipeline-k8s-5kqtc 1/1 Running 0 64m\ndagster-dagster-webserver-76ff9c7689-zv89b 1/1 Running 0 3h41m\ndagster-postgresql-0 1/1 Running 6 (5h53m ago) 4d2h\ndagster-run-745684fc-80c5-45e5-a238-ce5fdc0c0dbe-nzh8x 0/1 Error 0 124m\nHere you can see the run had an error.\nDescribe the dagster-run pod:\nkubectl describe pod dagster-run-745684fc-80c5-45e5-a238-ce5fdc0c0dbe-nzh8x -n magasin-dagster\nGet the logs of the run pod:\nkubectl logs dagster-run-745684fc-80c5-45e5-a238-ce5fdc0c0dbe-nzh8x -n magasin-dagster\nkubectl describe job dagster-run-745684fc-80c5-45e5-a238-ce5fdc0c0dbe-nzh8x -n \nmagasin-dagster\nInspect the logs of the deployed main pod:\nkubectl logs dagster-dagster-user-deployments-dpga-pipeline-k8s-5kqtc --namespace magasin-dagster\n2024-02-13 10:26:42 +0000 - dagster.code_server - INFO - Starting Dagster code server for package dpga_pipeline on port 3030 in process 1\n2024-02-13 10:26:42 +0000 - dagster.code_server - INFO - Started Dagster code server for package dpga_pipeline on port 3030 in process 1\nLastly, on the Dagster user interface (launched with mag dagster ui), in the Runs tab, within your failed run click on View run button.", "crumbs": [ "Home", "Get started", @@ -715,7 +715,7 @@ "href": "why-magasin.html", "title": "Why magasin?", "section": "", - "text": "In today’s data-informed world, governments and organizations face a monumental challenge: fragmented data spread across siloed systems. Departments, divisions, and units gather data independently, leading to inefficiencies and risks:\n\n\n\nTool Fragmentation: Organizations, especially those without centralized structures, struggle with diverse technologies across teams, hindering resource mobility and causing technology duplication.\nCapacity Issues: Siloed work exacerbates resource allocation challenges, limiting the organization’s overall potential.\n\n\n\n\n\nSecurity Concerns: Without secure data storage and sharing mechanisms, organizations risk data breaches and unauthorized access to sensitive information.\n\n\n\n\n\nLack of Comprehensive Insights: Siloed data prevents organizations from gaining a holistic understanding of their operations and stakeholders, leading to shortsighted decision-making.\n\nTo overcome these challenges and unlock the full potential of modern data analysis, machine learning, and artificial intelligence, organizations need a comprehensive set of tools.", + "text": "In today’s data-informed world, governments and organizations face a monumental challenge: fragmented data spread across siloed systems. Departments, divisions, and units gather data independently, leading to inefficiencies and risks:\n\nFragmentation in data tools and capacity. Organizations, particularly those lacking centralized structures, often grapple with data tool fragmentation. This fragmentation is characterized by the use of diverse technologies across different teams, which can impede resource mobility and lead to technology duplication.\nMoreover, capacity issues frequently arise as a result of siloed work. This lack of integration exacerbates resource allocation challenges, thereby limiting the organization’s overall potential. 
Addressing these issues is crucial for enhancing operational efficiency and unlocking the full potential of the organization.\nMyopic data analysis and lack of comprehensive insights. Siloed data prevents organizations from gaining a holistic understanding of their operations and stakeholders, leading to shortsighted decision-making.\n\nTo overcome these challenges and unlock the full potential of modern data analysis, machine learning, and artificial intelligence, organizations need a comprehensive set of tools.", "crumbs": [ "Home", "Welcome", @@ -727,7 +727,7 @@ "href": "why-magasin.html#the-challenge", "title": "Why magasin?", "section": "", - "text": "In today’s data-informed world, governments and organizations face a monumental challenge: fragmented data spread across siloed systems. Departments, divisions, and units gather data independently, leading to inefficiencies and risks:\n\n\n\nTool Fragmentation: Organizations, especially those without centralized structures, struggle with diverse technologies across teams, hindering resource mobility and causing technology duplication.\nCapacity Issues: Siloed work exacerbates resource allocation challenges, limiting the organization’s overall potential.\n\n\n\n\n\nSecurity Concerns: Without secure data storage and sharing mechanisms, organizations risk data breaches and unauthorized access to sensitive information.\n\n\n\n\n\nLack of Comprehensive Insights: Siloed data prevents organizations from gaining a holistic understanding of their operations and stakeholders, leading to shortsighted decision-making.\n\nTo overcome these challenges and unlock the full potential of modern data analysis, machine learning, and artificial intelligence, organizations need a comprehensive set of tools.", + "text": "In today’s data-informed world, governments and organizations face a monumental challenge: fragmented data spread across siloed systems. Departments, divisions, and units gather data independently, leading to inefficiencies and risks:\n\nFragmentation in data tools and capacity. Organizations, particularly those lacking centralized structures, often grapple with data tool fragmentation. This fragmentation is characterized by the use of diverse technologies across different teams, which can impede resource mobility and lead to technology duplication.\nMoreover, capacity issues frequently arise as a result of siloed work. This lack of integration exacerbates resource allocation challenges, thereby limiting the organization’s overall potential. Addressing these issues is crucial for enhancing operational efficiency and unlocking the full potential of the organization.\nMyopic data analysis and lack of comprehensive insights. Siloed data prevents organizations from gaining a holistic understanding of their operations and stakeholders, leading to shortsighted decision-making.\n\nTo overcome these challenges and unlock the full potential of modern data analysis, machine learning, and artificial intelligence, organizations need a comprehensive set of tools.", "crumbs": [ "Home", "Welcome", @@ -739,7 +739,7 @@ "href": "why-magasin.html#marketplace-gaps", "title": "Why magasin?", "section": "Marketplace gaps", - "text": "Marketplace gaps\nWhen we go to the global market we find gaps.\n\nOverwhelming landscape\n\n\n\nBig Data Landascape 2019 by Matt Turck. 
Source\n\n\nEntering the world of data can be daunting, with a myriad of products each requiring trade-offs.\n\n\nLeaders are solving a specific problems set\nMost data systems are optimized for massive scale and low-latency, crucial for time-sensitive tasks like targeted advertising. However, not all organizations face such time-pressured scenarios.\n\n\nWith systems that require a high cost entry\nIt is important to note that these data systems are not designed for low-end hardware or low cost of entry, further complicating the landscape for organizations exploring data solutions without having a deep pocket.\n\n\nThat are proprietary\nTraditional end-to-end data platforms often come with proprietary restrictions, limiting flexibility and tying organizations to specific cloud vendors or industry niches.\nThis presents significant challenges for entities with decentralized structures and external collaborations, in particularly, UNICEF’s government partners who demand a cloud-agnostic, open-source solution that delivers maximum value for their investment.", + "text": "Marketplace gaps\nWhen we go to the global market we find several gaps.\n\nOverwhelming landscape. Entering the world of data can be daunting, with a myriad of products each requiring trade-offs. Just look at the picture below that displays which was the DATA & AI Landscape already in 2019\n\n\n\nBig Data Landascape 2019 by Matt Turck. Source\n\n\nData/AI Leaders are solving a specific problems set. Most data systems are optimized for massive scale and low-latency, crucial for time-sensitive tasks like targeted advertising. However, not all organizations face such time-pressured scenarios.\nSystems that require a high cost entry. These data systems are generally not designed for low-end hardware or low cost of entry, further complicating the landscape for organizations exploring data solutions without having a deep pocket.\nEnd-to-end solutions are proprietary Traditional end-to-end data platforms often come with proprietary restrictions, limiting flexibility and tying organizations to specific cloud vendors or industry niches.\n\nTherefore, find in the market that there is not an end-to-end, open source solution that has a low cost of entry, that sets a uniform but flexible approach for different teams and that can scale with the needs of the organizaitons.\nHence, it is observed in the market that there lacks a comprehensive, open-source solution. Such a solution would ideally have a low entry cost, establish a uniform yet adaptable approach for various teams, and possess the capability to scale in accordance with the evolving needs of organizations. This gap presents a significant opportunity for innovation and development in the field.", "crumbs": [ "Home", "Welcome", @@ -747,59 +747,23 @@ ] }, { - "objectID": "architecture.html", - "href": "architecture.html", - "title": "Architecture", - "section": "", - "text": "Magasin is a scalable end-to-end data platform based on open-source components that is natively run in a Kubernetes cluster.\nBy end-to-end this describes a data processing pipeline including from how to ingest raw data from multiple data sources, transform the data, run analyses on the processed data, storage in a cloud or local filesystem to enabling visualisation.\nKubernetes is a container orchestration system designed to automate the deployment, scaling, and management of containerized applications. It is an integral part of services offered by major cloud providers. 
Kubernetes, being open source, can also be set up on-premises. For testing purposes, it is even possible to install it on a desktop computer.\nMagasin uses Kubernetes in combination with Helm, a package manager for Kubernetes applications. Helm is the equivalent to apt, pip, npm, pacman, snap, conda. Using Helm, users specify the configuration of required Kubernetes resources to deploy magasin through a values file or command-line overrides. A package in helm is called a chart.\nA fundamental contrast between magasin and other helm-based Kubernetes applications lies in their architectural approach. Typically, an application is characterized by a sole root helm chart governing all deployment rules. However, in magasin, each component operates as an autonomous helm chart. This design choice enables the establishment of a loosely-coupled architecture among its components. Rather than mandating a rigid structure for the entire architecture, magasin embraces a more open and modular approach, fostering flexibility in component selection and integration.\nThe core components of magasin are independent mature open source projects that support.", - "crumbs": [ - "Home", - "Welcome", - "Architecture" - ] - }, - { - "objectID": "architecture.html#kubernetes-containerization", - "href": "architecture.html#kubernetes-containerization", - "title": "Architecture", - "section": "1.1 Kubernetes containerization", - "text": "1.1 Kubernetes containerization\nKubernetes is a container orchestration system designed to automate the deployment, scaling, and management of containerized applications. It is an integral part of services offered by major cloud providers. Kubernetes, being open source, can also be set up on-premises. For testing purposes, it is even possible to install it on a desktop computer.", - "crumbs": [ - "Home", - "Welcome", - "Architecture" - ] - }, - { - "objectID": "architecture.html#helm-charts", - "href": "architecture.html#helm-charts", - "title": "Architecture", - "section": "1.2 Helm charts", - "text": "1.2 Helm charts\nMagasin uses Kubernetes in combination with Helm, a package manager for Kubernetes applications. Helm is the equivalent to apt, pip, npm, pacman, snap, conda, etc. Using Helm, users specify the configuration of required Kubernetes resources to deploy magasin through a values file or command-line overrides. A package in helm is called chart.", - "crumbs": [ - "Home", - "Welcome", - "Architecture" - ] - }, - { - "objectID": "architecture.html#loosely-coupled-architecture", - "href": "architecture.html#loosely-coupled-architecture", - "title": "Architecture", - "section": "1.3 Loosely-coupled architecture", - "text": "1.3 Loosely-coupled architecture\nA fundamental contrast between magasin and other helm-based Kubernetes applications lies in their architectural approach. Typically, an application is characterized by a sole root helm chart governing all deployment rules. However, in magasin, each component operates as an autonomous helm chart. This design choice enables the establishment of a loosely-coupled architecture among its components. Rather than mandating a rigid structure for the entire architecture, magasin embraces a more open and adaptable approach, fostering flexibility in component selection and integration.", + "objectID": "why-magasin.html#whats-next", + "href": "why-magasin.html#whats-next", + "title": "Why magasin?", + "section": "What’s next", + "text": "What’s next\n\nGetting started with magasin. 
Learn how to install, and start using magasin with a simple example.\nMagasin architecture. Learn more about the different components that conform magasin.", "crumbs": [ "Home", "Welcome", - "Architecture" + "Why magasin?" ] }, { - "objectID": "architecture.html#ingestion-dagster", - "href": "architecture.html#ingestion-dagster", + "objectID": "architecture.html", + "href": "architecture.html", "title": "Architecture", - "section": "2.1 Ingestion: Dagster", - "text": "2.1 Ingestion: Dagster\nThe Dagster framework is the primary tool for orchestration of data pipelines for ingestion, transformation, analysis, and machine learning. Each pipeline is isolated and encapsulated, so different tasks may utilize different versions of the same library, for example, and each pipeline run is executed in a short-lived pod on a Kubernetes cluster.\n\n2.1.1 Dagit\nDagster’s Dagit UI provides visibility of pipelines’ tasks, scheduling, run status, materialized assets, resources, and modes.", + "section": "", + "text": "Magasin is a scalable end-to-end data platform based on open-source components that is natively run in a Kubernetes cluster.\nBy end-to-end this describes a data processing pipeline including from how to ingest raw data from multiple data sources, transform the data, run analyses on the processed data, storage in a cloud or local filesystem to enabling visualisation.\nKubernetes is a container orchestration system designed to automate the deployment, scaling, and management of containerized applications. It is an integral part of services offered by major cloud providers. Kubernetes, being open source, can also be set up on-premises. For testing purposes, it is even possible to install it on a desktop computer. By using kubernetes, we ensure the scalability of the solution.\nMagasin uses Kubernetes in combination with Helm, a package manager for Kubernetes applications that eases the initial setup of the different components of magasin. Helm is the equivalent to apt, pip, npm, pacman, snap, conda. Using Helm, users specify the configuration of required Kubernetes resources to deploy magasin through a values file or command-line overrides.\nThe main components of magasin architecture are independent mature open source projects. They are loosely coupled and glued through set of conventions and tools.", "crumbs": [ "Home", "Welcome", @@ -807,11 +771,11 @@ ] }, { - "objectID": "architecture.html#cloud-storage-minio", - "href": "architecture.html#cloud-storage-minio", + "objectID": "architecture.html#magasin-is-a-loosely-coupled-architecture", + "href": "architecture.html#magasin-is-a-loosely-coupled-architecture", "title": "Architecture", - "section": "2.2 Cloud storage: MinIO", - "text": "2.2 Cloud storage: MinIO\nMinIO is an open-source, high-performance object storage system designed for cloud-native and containerized applications. Founded in 2014, MinIO offers an S3-compatible API, enabling seamless integration with existing cloud storage ecosystems. It is known for its simplicity, scalability, and speed, making it a popular choice for organizations seeking efficient data storage solutions. MinIO’s architecture is optimized for modern data workloads, leveraging erasure coding and distributed techniques to ensure data resilience and high availability. 
With its lightweight footprint and easy deployment on standard hardware, MinIO empowers developers to build scalable storage infrastructures tailored to their specific needs, whether for on-premises, hybrid, or multi-cloud environments.", + "section": "1 magasin is a loosely-coupled architecture", + "text": "1 magasin is a loosely-coupled architecture\nMagasin identifies the sets of needs for setting up an end-to-end data platform that can scale, and provides the set of open source components and an standard underlying technologies (helm, kubernetes) that can be used to solve the needs.\nHowever, organizations may already have some of the elements in the architecture already in place. For example, an organization may already have a preferred data visualization platform such as Microsoft PowerBI or Tableau, and switching to another visualization tool may entail more costs than the licenses themselves. Magasin gets organizations covered on that. Each of the components of magasin is not strictly required for an implementation.\nThis is achieved using helm packaged system in a slightly different way. Generally, in one single helm package all the components are including, setting up an more opinionated way of deploying a particular application. In magasin, each component operates as an autonomous helm chart. This design choice enables the establishment of a loosely-coupled architecture among its components. It allows you to install each component independently. Therefore, rather than mandating a rigid structure for the entire architecture, magasin embraces a more open and adaptable approach, fostering flexibility in component selection and integration.", "crumbs": [ "Home", "Welcome", @@ -819,11 +783,11 @@ ] }, { - "objectID": "architecture.html#query-engine-apache-drill", - "href": "architecture.html#query-engine-apache-drill", + "objectID": "architecture.html#magasins-components", + "href": "architecture.html#magasins-components", "title": "Architecture", - "section": "2.3 Query engine: Apache Drill", - "text": "2.3 Query engine: Apache Drill\nApache Drill is an open-source, schema-free query engine that provides a SQL interface to a wide range of non-relational datastores, such as NoSQL databases and collections of files such as JSON, CSV, ESRI shapefiles, SPSS & SAS formats, Parquet, and others.\nWhile data marts for specific business functions or locations traditionally require hosting and maintenance of a relational database on a server or virtual machine, Apache Drill enables comparable functionality without need for running and hosting a database or maintaining schema changes from source systems over time.\nInstead, a Dagster ingestion and transformation pipeline stores an ‘analyst-ready’ dataset that Apache Drill can query directly.", + "section": "2 Magasin’s components", + "text": "2 Magasin’s components\nMagasin is built on top of a set of mature open source projects to conform an base architecture for deploying an end-to-end data platform.\n\n\n\nMagasin architecture\n\n\n\n2.1 Ingestion: Dagster\nThe Dagster framework is the primary tool for orchestration of data pipelines for ingestion, transformation, analysis, and machine learning. 
Each pipeline is isolated and encapsulated on its own container, so different tasks may utilize different versions of a library.\nIn addition to pipeline isolation, it provides some advantages:\n\nA user interface that provides visibility of pipelines’ tasks, scheduling, run status, debugging messages, materialized assets, resources, and modes…\nDagster pipelines are written in python, which is a very familiar language for data scientist and data engineer.\nA framework for creating pipelines that scale. Whereas early in the data ingestion processes, pipelines are simple and straightforward, with time, when mixing different sources, which requires more complex cleaning and the need of armonizing identifiiers, the pipelines become very prone to be built as spaguetty code. This lmiits the maintainability, discoverability of issues and efficiency of introducing new changes. With dagster, you get a way of building pipelines that will be more structured and easy to maintain.", "crumbs": [ "Home", "Welcome", @@ -831,11 +795,11 @@ ] }, { - "objectID": "architecture.html#dashboards-apache-superset", - "href": "architecture.html#dashboards-apache-superset", + "objectID": "architecture.html#store-a-file-based-approach", + "href": "architecture.html#store-a-file-based-approach", "title": "Architecture", - "section": "2.4 Dashboards: Apache Superset", - "text": "2.4 Dashboards: Apache Superset\nApache Superset is an open-source business intelligence product with comprehensive charting, dashboarding, and querying capabilities.", + "section": "3 Store: A file based approach", + "text": "3 Store: A file based approach\nIn the magasin architecture, as general approach, we stand to store data assets as files. In particular, we recommend the use of Apache parquet file format.\nThe main reason to use a file based approach is:\n\nFirst, because it is an economic way to store data. Storage services in the cloud or in premises is relatively cheap.\nSecond, because it does provide more flexibility when changes on the underlying structure are introduced, at least compared with setting up a SQL database downstream.\nIn addition, it allows also to easily store more types of data such as documents or images.\nLastly, in terms of governance and sharing the datasets, the problem is simplified to setting up file sharing permissions.\n\nTo support this file based approach, there are two components that are introduced in the architecture. The first one is MinIO, which provides magasin with a layer that introduces agnosticity against the cloud provider. The second one is Apache Drill, which provides a SQL query engine that eases the extraction of insights from the files.\n\n3.0.1 MinIO: A cloud agnostic approach\nMagasin can be installed in any cloud provider on in premises. However, each cloud provider has a different service to store data. In order to provide a consistent way of storing data we have included as part of the standard list of components MinIO, a high-performance object storage system designed for cloud-native and containerized applications has been included as part of the magasin.\nFounded in 2014, MinIO offers an S3-compatible API, enabling seamless integration with existing cloud storage ecosystems. It is known for its simplicity, scalability, and speed, making it a popular choice for organizations seeking efficient data storage solutions. MinIO’s architecture is optimized for modern data workloads, leveraging erasure coding and distributed techniques to ensure data resilience and high availability. 
With its lightweight footprint and easy deployment on standard hardware, MinIO empowers developers to build scalable storage infrastructures tailored to their specific needs, whether for on-premises, hybrid, or multi-cloud environments.\nWhereas MinIO comes out of the box to provide uniform and cloud agnostic storage layer, there may be organizations that prefer to use a cloud native storage such as Azure Blob, S3 Buckets or Google Cloud Storage.\n\n\n3.0.2 Query engine: Apache Drill\nThe last piece of the file based approach is Apache Drill. Apache Drill is an open-source, schema-free query engine that provides a SQL interface to a wide range of non-relational datastores, such as NoSQL databases and collections of files such as JSON, CSV, ESRI shapefiles, SPSS & SAS formats, Parquet, and others.\nWhile data marts for specific business functions or locations traditionally require hosting and maintenance of a relational database on a server or virtual machine, Apache Drill enables comparable functionality without need for running and hosting a database or maintaining schema changes from source systems over time.\n\n\n3.1 Visualization: Apache Superset\nApache Superset is an open-source business intelligence product with comprehensive charting, dashboarding, and querying capabilities.\nApache Superset consumes SQL data sources, so Apache Drill is a nice companion for Superset to work with magasin’s file based approach.\n\n\n3.2 Analysis: Notebook environment JupyterHub\nA Jupyter notebook is an open-source allows you to create and share documents that contain live code, equations, visualizations, and narrative text. It is a powerful tool that combines code execution, rich text, mathematics, plots, and rich media into a single document. They are widely used in data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.\nThe advantages of using Jupyter Notebooks are numerous. They provide an interactive computing environment that promotes exploratory analysis, making them ideal for data science projects. Notebooks can be easily shared, promoting reproducible research and facilitating collaboration. They also allow for inline plotting, providing immediate visual feedback on the data being analyzed. Typical uses of Jupyter Notebooks include data analysis and visualization, machine learning, statistical modeling, and educational purposes. For instance, a data scientist might use a Jupyter Notebook to clean a dataset, perform exploratory data analysis, build a machine learning model, and then visualize the results, all within a single, cohesive document.\nThe multi-tenant JupyterHub component creates on-demand, isolated containers for authenticated users, each with persistent storage for their R and Python notebook workspace.\n\n\n3.3 Parallel computing: Dask Gateway\nA Dask cluster is a flexible tool for parallel computing in Python. It allows you to write python code that will be run in multiples machines taking advantage of all the compute resources of the kubernetes cluster in which magasin is installed.\nIt is composed of a central scheduler and multiple distributed workers. Dask works well at many scales, ranging from a single machine to clusters of many machines. It enables parallel processing and extends the size of convenient datasets from “fits in memory” to “fits on disk”1. 
However, it can also work across a cluster of multiple machines.\nDask cluster Dask Gateway allows easy utilization of a Dask cluster from notebook environments for distributed computation of massive datasets or parallelizable operations.", "crumbs": [ "Home", "Welcome", @@ -843,11 +807,11 @@ ] }, { - "objectID": "architecture.html#notebook-environment-daskhub", - "href": "architecture.html#notebook-environment-daskhub", + "objectID": "architecture.html#mag-cli", + "href": "architecture.html#mag-cli", "title": "Architecture", - "section": "2.5 Notebook environment: Daskhub", - "text": "2.5 Notebook environment: Daskhub\nDaskhub is a Helm chart to easily install JupyterHub and Dask Gateway for multiple users on a Kubernetes cluster.\n\n2.5.1 JupyterHub\nThe multi-tenant JupyterHub component creates on-demand, isolated pods for authenticated users, each with persistent storage for their R and Python notebook workspace.\n\n\n2.5.2 Dask Gateway\nDask Gateway allows easy utilization of a Dask cluster from notebook environments for distributed computation of massive datasets or parallelizable operations.", + "section": "4 Mag-cli", + "text": "4 Mag-cli\nMag-cli is the command line interface of magasin. Helps to manage the different modules of magasin and it makes easier to perform common administration tasks.", "crumbs": [ "Home", "Welcome", @@ -1155,7 +1119,7 @@ "href": "get-started/index.html", "title": "Get started", "section": "", - "text": "Magasin is a scalable end-to-end data platform based on open-source components that is natively run in a Kubernetes cluster.\nIn this getting started you will install magasin on your local machine for testing purposes, then you will perform an end-to-end data processing task that involves: exploratory analysis of a data source, creating a pipeline to automate data ingestion and authoring a dashboard to present your findings.\nLet’s start by getting the pre-requisites:", + "text": "Magasin is a scalable end-to-end data platform based on open-source components that is natively run in a Kubernetes cluster.\nMagasin offers a value-for-money end-to-end data solution implementing a loosely-coupled architecture for organizations that need to setup a framework to scale the ingestion, storage, analysis and visualization of datasets. In addition, It also includes the capability of parallel computing for analyzing large datasets or AI model training.\nIn this getting started you will install magasin on your local machine for testing purposes, then you will perform an end-to-end data processing task that involves: exploratory analysis of a data source, creating a pipeline to automate data ingestion and authoring a dashboard to present your findings.\nBefore you continue, you may want to learn more about why magasin and its technical architecture, otherwise, let’s start by knowing the pre-requisite for installing magasin.", "crumbs": [ "Home", "Get started", @@ -1163,11 +1127,11 @@ ] }, { - "objectID": "get-started/index.html#pre-requisite-a-kubernetes-cluster", - "href": "get-started/index.html#pre-requisite-a-kubernetes-cluster", + "objectID": "get-started/index.html#install-pre-requisite-a-kubernetes-cluster", + "href": "get-started/index.html#install-pre-requisite-a-kubernetes-cluster", "title": "Get started", - "section": "1 Pre-requisite: a Kubernetes cluster", - "text": "1 Pre-requisite: a Kubernetes cluster\nPrior to installing magasin, you need to have a Kubernetes cluster. But don’t worry, you can setup one on your local machine very easily. 
In layman terms, Kubernetes is just a technology that enables managing cloud ready applications, such as magasin.\nIn this getting started tutorial, we are going to setup a Kubernetes cluster through Docker Desktop, an application that can be installed on most computers. However, if you already have a cluster you can go directly to the install magasin section.\nFirst, install Docker Desktop. It is available for:\n\nGNU/Linux\nMac OS X\nWindows\n\nOnce installed. Go to Settings / Kubernetes , and enable Kubernetes. It will automatically install everything required, including the command line utility kubectl.\n\n\n\nScreenshot of Docker Desktop Kubernetes Settings that allows to enable Kubernetes\n\n\nIn addition go to ** Settings / Resources ** and give it as much as CPU and Memory with a minimum of 14GB.\n\n\n\nScreenshot of Docker Desktop Resource Settings\n\n\nLastly, on a command line, create the new cluster and use it:\nkubectl config set-context magasin --namespace default --cluster docker-desktop --user=docker-desktop\nkubectl config use-context magasin\nTo ensure that the kubernetes cluster is the correct one check if the name corresponds to the\nkubectl get nodes\nNAME STATUS ROLES AGE VERSION\ndocker-desktop Ready control-plane 48m v1.28.2\nkubectl get namespaces\nNAME STATUS AGE\ndefault Active 49m\nkube-node-lease Active 49m\nkube-public Active 49m\nkube-system Active 49m\nAlternatively, you can also install minikube or if you have a cluster in any cloud provider you can also use it. At the end, you just need your kubectl to be setup to use whatever kubernetes cluster you want to use.", + "section": "1 Install pre-requisite: a Kubernetes cluster", + "text": "1 Install pre-requisite: a Kubernetes cluster\nPrior to installing magasin, you need to have a Kubernetes cluster. But don’t worry, you can setup one on your local machine very easily. In layman terms, Kubernetes is just a technology that enables managing cloud ready applications, such as magasin.\nIn this getting started tutorial, we are going to setup a Kubernetes cluster through Docker Desktop, an application that can be installed on most computers. However, if you already have a cluster you can go directly to the install magasin section.\nFirst, install Docker Desktop. It is available for:\n\nGNU/Linux\nMac OS X\nWindows\n\nOnce installed. Go to Settings / Kubernetes , and enable Kubernetes. It will automatically install everything required, including the command line utility kubectl.\n\n\n\nScreenshot of Docker Desktop Kubernetes Settings that allows to enable Kubernetes\n\n\nIn addition go to ** Settings / Resources ** and give it as much as CPU and Memory with a minimum of 14GB.\n\n\n\nScreenshot of Docker Desktop Resource Settings\n\n\nLastly, on a command line, create the new cluster and use it:\nkubectl config set-context magasin --namespace default --cluster docker-desktop --user=docker-desktop\nkubectl config use-context magasin\nTo ensure that the kubernetes cluster is the correct one check if the name corresponds to the\nkubectl get nodes\nNAME STATUS ROLES AGE VERSION\ndocker-desktop Ready control-plane 48m v1.28.2\nkubectl get namespaces\nNAME STATUS AGE\ndefault Active 49m\nkube-node-lease Active 49m\nkube-public Active 49m\nkube-system Active 49m\nAlternatively, you can also install minikube or if you have a cluster in any cloud provider you can also use it. 
At the end, you just need your kubectl to be setup to use whatever kubernetes cluster you want to use.", "crumbs": [ "Home", "Get started", @@ -1239,7 +1203,7 @@ "href": "index.html", "title": "", "section": "", - "text": "magasin is the cloud native open-source end-to-end data platform\n\n\nmagasin enables organizations to perform of automatic data ingestion, storage, analysis, ML/AI compute and visualization at scale\n\n\n\nGet Started" + "text": "magasin is the cloud native open-source end-to-end data platform\n\n\nmagasin enables organizations to perform of automatic data ingestion, storage, analysis, ML/AI compute and visualization at scale\n\n\n\nGet started Why magasin" }, { "objectID": "docs-home.html", diff --git a/sitemap.xml b/sitemap.xml index 492b25c1..1e96f0a1 100644 --- a/sitemap.xml +++ b/sitemap.xml @@ -2,138 +2,138 @@ https://unicef.github.io/magasin/admin-guides/superset.html - 2024-03-14T12:06:26.969Z + 2024-03-18T06:17:57.208Z https://unicef.github.io/magasin/admin-guides/minio.html - 2024-03-14T12:06:26.969Z + 2024-03-18T06:17:57.208Z https://unicef.github.io/magasin/admin-guides/drill.html - 2024-03-14T12:06:26.969Z + 2024-03-18T06:17:57.208Z https://unicef.github.io/magasin/admin-guides/daskhub.html - 2024-03-14T12:06:26.969Z + 2024-03-18T06:17:57.208Z https://unicef.github.io/magasin/install/index.html - 2024-03-14T12:06:27.029Z + 2024-03-18T06:17:57.268Z https://unicef.github.io/magasin/install/advanced.html - 2024-03-14T12:06:27.029Z + 2024-03-18T06:17:57.268Z https://unicef.github.io/magasin/install/troubleshooting.html - 2024-03-14T12:06:27.029Z + 2024-03-18T06:17:57.268Z https://unicef.github.io/magasin/privacy.html - 2024-03-14T12:06:27.029Z + 2024-03-18T06:17:57.268Z https://unicef.github.io/magasin/about.html - 2024-03-14T12:06:26.969Z + 2024-03-18T06:17:57.208Z https://unicef.github.io/magasin/get-started/create-a-dashboard.html - 2024-03-14T12:06:26.969Z + 2024-03-18T06:17:57.208Z https://unicef.github.io/magasin/get-started/automate-data-ingestion.html - 2024-03-14T12:06:26.969Z + 2024-03-18T06:17:57.208Z https://unicef.github.io/magasin/get-started/tutorial-overview.html - 2024-03-14T12:06:26.969Z + 2024-03-18T06:17:57.208Z https://unicef.github.io/magasin/contributing/vulnerability-disclosure.html - 2024-03-14T12:06:26.969Z + 2024-03-18T06:17:57.208Z https://unicef.github.io/magasin/contributing/index.html - 2024-03-14T12:06:26.969Z + 2024-03-18T06:17:57.208Z https://unicef.github.io/magasin/contributing/helm-repo-dev.html - 2024-03-14T12:06:26.969Z + 2024-03-18T06:17:57.208Z https://unicef.github.io/magasin/contributing/repository-workflows.html - 2024-03-14T12:06:26.969Z + 2024-03-18T06:17:57.208Z https://unicef.github.io/magasin/why-magasin.html - 2024-03-14T12:06:27.029Z + 2024-03-18T06:17:57.268Z https://unicef.github.io/magasin/architecture.html - 2024-03-14T12:06:26.969Z + 2024-03-18T06:17:57.208Z https://unicef.github.io/magasin/end-user-guides.html - 2024-03-14T12:06:26.969Z + 2024-03-18T06:17:57.208Z https://unicef.github.io/magasin/contributing/repositories.html - 2024-03-14T12:06:26.969Z + 2024-03-18T06:17:57.208Z https://unicef.github.io/magasin/contributing/installer-dev.html - 2024-03-14T12:06:26.969Z + 2024-03-18T06:17:57.208Z https://unicef.github.io/magasin/contributing/documentation.html - 2024-03-14T12:06:26.969Z + 2024-03-18T06:17:57.208Z https://unicef.github.io/magasin/deployment.html - 2024-03-14T12:06:26.969Z + 2024-03-18T06:17:57.208Z https://unicef.github.io/magasin/get-started/exploratory-analysis.html - 
2024-03-14T12:06:26.969Z + 2024-03-18T06:17:57.208Z https://unicef.github.io/magasin/get-started/index.html - 2024-03-14T12:06:26.969Z + 2024-03-18T06:17:57.208Z https://unicef.github.io/magasin/security.html - 2024-03-14T12:06:27.029Z + 2024-03-18T06:17:57.268Z https://unicef.github.io/magasin/index.html - 2024-03-14T12:06:27.029Z + 2024-03-18T06:17:57.268Z https://unicef.github.io/magasin/docs-home.html - 2024-03-14T12:06:26.969Z + 2024-03-18T06:17:57.208Z https://unicef.github.io/magasin/install/setup-kubernetes.html - 2024-03-14T12:06:27.029Z + 2024-03-18T06:17:57.268Z https://unicef.github.io/magasin/install/manual-installation.html - 2024-03-14T12:06:27.029Z + 2024-03-18T06:17:57.268Z https://unicef.github.io/magasin/install/uninstall.html - 2024-03-14T12:06:27.029Z + 2024-03-18T06:17:57.268Z https://unicef.github.io/magasin/admin-guides/mag-cli.html - 2024-03-14T12:06:26.969Z + 2024-03-18T06:17:57.208Z https://unicef.github.io/magasin/admin-guides/kubernetes.html - 2024-03-14T12:06:26.969Z + 2024-03-18T06:17:57.208Z https://unicef.github.io/magasin/admin-guides/dagster.html - 2024-03-14T12:06:26.969Z + 2024-03-18T06:17:57.208Z diff --git a/uninstall-magasin.sh b/uninstall-magasin.sh deleted file mode 100755 index 6f94c270..00000000 --- a/uninstall-magasin.sh +++ /dev/null @@ -1,258 +0,0 @@ -#!/bin/bash -# -# Uninstall magasin from a kubernetes cluster -# - - -# Default REALM -REALM_ARG='magasin' # default -REALM_PREFIX='magasin' -REALM_POSTFIX='' - -# Skip prompting -AUTO_INSTALL=false - -# Only check if working -ONLY_CHECK=false - -################### UTILITY FUNCTIONS ######### - - -# Function to display messages in red -echo_debug() { - if [ "$DEBUG" = true ]; then - printf "\033[38;5;208m%s\033[0m\n" "$@" - fi -} - -# Function to display failure to comply with a condition. -# Prepends and x. -echo_fail() { - printf "\033[31m[ \xE2\x9C\x97 ]\033[0m %s\n" "$@" # \e[31m sets the color to red, \e[0m resets the color -} - - -# Function to display error messages in red. Prepends ERROR -echo_error() { - printf " [\033[31mERROR:\033[0m %s\n" "$@" -} - -# Exit displaying how to debug -exit_error() { - local code=$1 - echo_error "$code" - echo_error "You may get more information about the issue by running the script including the debug option (-d):" - echo_error " $script_name -d " - echo "" - exit $code -} - -# Function to display messages in green -echo_success() { - printf "\033[32m[ \xE2\x9C\x93 ]\033[0m %s\n" "$@" -} - -# Information message in blue -echo_info() { - printf "\033[34m[ i ]\033[0m %s\n" "$@" -} - -is_namespace() { - local namespace="$1" - local exists=$(kubectl get namespace "$namespace" >/dev/null 2>&1 && echo 1 || echo 0) - echo "$exists" -} - - -##################################### - - -function usage { - echo "Usage: $1 [-c] [-r realm_prefix-realm_postfix (magasin)] [-d] [-h]" - echo "" - echo "This script uninstall all magasin components from a kubernetes cluster" - echo "" - echo "Options:" - echo " -y Skip prompting questions during uninstall." - echo " -c Only check if all pre-requisites are installed in the local machine." - echo " -r Realm prefix and suffix (default: magasin). Prefix and suffix are separated by '-'." - echo " If more than one '-', the last one will be used as separator." - echo " The realm 'magasin-new-dev' will set 'magasin-new' as prefix and 'dev' as suffix." - echo " -d Enable debug mode (displays all commands run)." - echo " -h Display this help message and exit." 
- echo " " - echo "Examples:" - echo " Only check if all requirements are installed" - echo " $1 -c " - echo " Uninstall 'magasin-dev' realm" - echo " $1 -r magasin-dev" - exit 0 -} - - -script_name=$(basename "$0") - -while getopts ":f:u:r:ychd" opt; do - case $opt in - y) - AUTO_INSTALL=true - ;; - c) - ONLY_CHECK=true - ;; - d) - DEBUG=true - ;; - r) - argument=$OPTARG - # Extracting prefix and postfix - last_dash=$(echo "$argument" | grep -o '[^-]*$') - if [[ "$last_dash" == "$argument" ]]; then - REALM_PREFIX=$argument - REALM_POSTFIX="" - else - REALM_PREFIX=${argument%-$last_dash} - REALM_POSTFIX=$last_dash - fi - REALM_ARG=$argument - echo_info "Magasin realm set:" - echo_info " Realm: $REALM_ARG" - echo_info " Realm prefix '$REALM_PREFIX'" - echo_info " Realm suffix '$REALM_SUFFIX'" - - ;; - h) - usage $script_name - ;; - \?) - echo_error "Invalid option: -$OPTARG" - echo "" - usage $script_name - exit 102 - ;; - :) - echo_error "Option -$OPTARG requires an argument." >&2 - exit 103 - ;; - esac -done - -# This is addded in helm command line. -HELM_DEBUG_FLAG='' -# If debug display all commands -if [ "$DEBUG" = true ]; then - echo_info Setting DEBUG mode ON - # This will enable debug in helm commands. It is added in all helm command calls - HELM_DEBUG_FLAG=" --debug" - set -x -fi - -# Display -echo "-----------" -echo_info "REALM_ARG: $REALM_ARG" -echo_info "REALM_PREFIX: $REALM_PREFIX" -echo_info "REALM_POSTFIX: $REALM_POSTFIX" -echo_info "ONLY_CHECK: $ONLY_CHECK" -echo_info "PLATFORM: $PLATFORM" -echo_info "PATH: $PATH" -echo_info "HELM_DEBUG_FLAG: $HELM_DEBUG_FLAG" -echo "-----------" - -# -# Verify kubectl and helm are working -# -not_working=false - -echo_info "Verifying pre-required commands are working..." -if ! kubectl &> /dev/null; then - echo_error "The kubectl command ($(command -v "kubectl")) is not working properly." - echo_error "Installation documentation:" - echo_error " - For Linux: https://kubernetes.io/docs/tasks/tools/install-kubectl-linux/" - echo_error " - For macOS: https://kubernetes.io/docs/tasks/tools/install-kubectl-macos/" - not_working=true -else - echo_success "kubectl is working" -fi - -# Verify helm functionality -if ! helm version &> /dev/null; then - echo_error "The helm command ($(command -v "helm")) is not working properly." - echo_error "Installation documentation:" - echo_error " https://helm.sh/docs/intro/install/" - not_working=true -else - echo_success "helm is working" -fi - -if [ "$not_working" = true ]; then - echo_error "Some of the commands are not working." - exit_error 3 -fi - -# If -c option is set, then end. -if [[ "$ONLY_CHECK" == true ]]; then - echo_debug "ONLY_CHECK=true" - exit 0 -fi - -function uninstall_chart { - local chart=$1 - - echo_info "Uninstalling chart $chart..." - - if [[ -n "$REALM_POSTFIX" ]]; then - # realm postfix is not empty - namespace="$REALM_PREFIX-$chart-$REALM_POSTFIX" - else - # realm postfix is empty - namespace="$REALM_PREFIX-$chart" - fi - - # Check if the namespace exists. If does not exist => stop - if [ "$(is_namespace "$namespace")" -eq 0 ]; then - echo_info "Namespace '$namespace' does not exist. Skipping $chart uninstallation." - return - fi - - - echo_info "Uninstalling magasin/$chart in the namespace $namespace." - echo_info "helm uninstall $chart --namespace $namespace" - helm uninstall $chart --namespace $namespace $HELM_DEBUG_FLAG - if [[ $? 
-ne 0 ]]; then - echo_fail "Could not uninstall magasin/$chart in the namespace $namespace" - #exit_error 7 - else - echo_success "magasin/$chart uninstalled from the namespace $namespace" - echo_info "Removing namespace '$namespace' (be patient, it may take a while)..." - kubectl delete namespace $namespace --wait=false - # Remove the namespace - if [[ $? -ne 0 ]]; then - echo_fail "Could not remove namespace $namespace" - else - echo_success " Namespace $namespace successfully deleted" - fi - fi - -} - -echo "" -echo_info "Starting magasin uninstallation..." -uninstall_chart drill -uninstall_chart daskhub -uninstall_chart superset -uninstall_chart dagster -uninstall_chart tenant -uninstall_chart operator - - -# Remove the magasin realm namespace. -namespace=$REALM_ARG -# Check if the namespace exists. If does not exist => stop -if [ "$(is_namespace "$namespace")" -eq 0 ]; then - echo_info "Namespace '$namespace' does not exist. Skipping removal" -else - kubectl delete namespace "$namespace" - echo_success "Namespace '$namespace' deleted." -fi - -echo_success "magasin uninstallaltion finished" \ No newline at end of file diff --git a/why-magasin.html b/why-magasin.html index 96c28504..114bdcfe 100644 --- a/why-magasin.html +++ b/why-magasin.html @@ -342,20 +342,12 @@

    On this page

    @@ -384,53 +376,30 @@

    Why magasin?

    The challenge

    In today’s data-informed world, governments and organizations face a monumental challenge: fragmented data spread across siloed systems. Departments, divisions, and units gather data independently, leading to inefficiencies and risks:

    -
    -

    Fragmentation in tools and capacity

    -
      -
    • Tool Fragmentation: Organizations, especially those without centralized structures, struggle with diverse technologies across teams, hindering resource mobility and causing technology duplication.

    • -
    • Capacity Issues: Siloed work exacerbates resource allocation challenges, limiting the organization’s overall potential.

    • -
    -
    -
    -

    Risks of Data Breaches

      -
    • Security Concerns: Without secure data storage and sharing mechanisms, organizations risk data breaches and unauthorized access to sensitive information.
    • -
    -
    -
    -

    Myopic Data Analysis

    -
      -
    • Lack of Comprehensive Insights: Siloed data prevents organizations from gaining a holistic understanding of their operations and stakeholders, leading to shortsighted decision-making.
    • +
    • Fragmentation in data tools and capacity. Organizations, particularly those lacking centralized structures, often grapple with data tool fragmentation. This fragmentation is characterized by the use of diverse technologies across different teams, which can impede resource mobility and lead to technology duplication.

      +

      Moreover, capacity issues frequently arise as a result of siloed work. This lack of integration exacerbates resource allocation challenges and limits overall effectiveness. Addressing these issues is crucial for enhancing operational efficiency and unlocking the organization’s full potential.

    • +
    • Myopic data analysis and lack of comprehensive insights. Siloed data prevents organizations from gaining a holistic understanding of their operations and stakeholders, leading to shortsighted decision-making.

    To overcome these challenges and unlock the full potential of modern data analysis, machine learning, and artificial intelligence, organizations need a comprehensive set of tools.

    -

    Marketplace gaps

    -

    When we go to the global market we find gaps.

    -
    -

    Overwhelming landscape

    +

    When we look at the global market, we find several gaps.

    +
      +
    1. Overwhelming landscape. Entering the world of data can be daunting, with a myriad of products each requiring trade-offs. Just look at the picture below, which shows what the Data & AI Landscape already looked like in 2019.

      Big Data Landscape 2019 by Matt Turck. Source
      -
      -

      Entering the world of data can be daunting, with a myriad of products each requiring trade-offs.

      -
    -
    -

    Leaders are solving a specific problems set

    -

    Most data systems are optimized for massive scale and low-latency, crucial for time-sensitive tasks like targeted advertising. However, not all organizations face such time-pressured scenarios.

    -
    -
    -

    With systems that require a high cost entry

    -

    It is important to note that these data systems are not designed for low-end hardware or low cost of entry, further complicating the landscape for organizations exploring data solutions without having a deep pocket.

    -
    -
    -

    That are proprietary

    -

    Traditional end-to-end data platforms often come with proprietary restrictions, limiting flexibility and tying organizations to specific cloud vendors or industry niches.

    -

    This presents significant challenges for entities with decentralized structures and external collaborations, in particularly, UNICEF’s government partners who demand a cloud-agnostic, open-source solution that delivers maximum value for their investment.

    -
    + +
  • Data/AI leaders are solving a specific problem set. Most data systems are optimized for massive scale and low latency, crucial for time-sensitive tasks like targeted advertising. However, not all organizations face such time-pressured scenarios.

  • +
  • Systems that require a high cost of entry. These data systems are generally not designed for low-end hardware or a low cost of entry, further complicating the landscape for organizations exploring data solutions without deep pockets.

  • +
  • End-to-end solutions are proprietary. Traditional end-to-end data platforms often come with proprietary restrictions, limiting flexibility and tying organizations to specific cloud vendors or industry niches.

  • + +

    Therefore, we find that the market lacks an end-to-end, open-source solution that has a low cost of entry, sets a uniform but flexible approach for different teams, and can scale with the needs of the organization.

    +

    In other words, there is no comprehensive, open-source solution that combines a low entry cost, a uniform yet adaptable approach for various teams, and the capability to scale with the evolving needs of organizations. This gap presents a significant opportunity for innovation and development in the field.

    Magasin was needed

    @@ -446,8 +415,15 @@

    Magasin was needed

    For the public sector and NGO organizations, it is a cross-sectoral tool that can be implemented at the national, subnational, or department level.

    For the private sector, it fits any organization that wants to leverage data science to stay relevant and improve business outcomes, with a low cost of entry and full control of the data.

    Don’t just keep up – lead the change with magasin and transform your organization’s digital future today.

    +
    +

    What’s next

    + +