
Evaluate Logging Options for provider and tenant (container) logs #250

Open
2 tasks done
anilmurty opened this issue Aug 26, 2024 · 2 comments
Labels
repo/provider Akash provider-services repo issues

Comments


anilmurty commented Aug 26, 2024

Is your feature request related to a problem? Please describe.

When debugging customer issues today, we have limited logging capabilities available. Broadly (ideally) we need logs from two places - the provider (including the Kubernetes cluster) and the container (the tenant's application container) - so that we can determine the source of the issue. We also want the logs to be retained for a reasonable amount of time (at least a few hours, if not several days or weeks), so that we don't have only a short window of opportunity to catch the issues.

This is what we have (TODAY) in terms of logs, retention and ability to query things:

Provider/ Cluster:

  • Logs from the various k8s control plane components (API server, scheduler, etcd, kubelet, kube-proxy, container-runtime)
  • Logs from akash provider code (?) - are there any?
  • Availability: typically for the last hour?
  • Key Challenge: If you don't jump in and look at the logs fast enough (within an hour of the incident), you lose them.

Tenant/ Container:

  • Container logs streamed into Console.
  • Available for the duration of the lease. If the lease is closed, logs are lost.

Limited query capabilities - we mostly have to grep for anything we want in the above two sets of logs.

Describe the solution you'd like

The logs from the provider software, the k8s control plane components and the tenant containers (apps) are collected, stored and made queryable through some logging platform (like Logdy, Grafana/Loki, the ELK stack, or similar).

The provider should be able to configure (via CLI or provider console) where to send provider and k8s logs.

The user (tenant) should be able to configure (via Console or API) where the tenant/ container logs should be sent.

As a precursor to implementing any UI and API changes to support this, we want to evaluate whether fluentd is a good option for us to use for log collection.

Benefits of using fluentd:

  1. Open Source
  2. Can run one fluentd pod per node and collect logs from all containers on the node (?)
  3. Can configure fluentd plugins to
    • Receive both (provider and container) types of logs (?)
    • Output logs to various destinations (Kafka, Elasticsearch, etc.)
    • Output various data export formats
    • Set resource limits (in terms of CPU and memory) for the fluentd pods

Goal of the exercise:

  1. Evaluate whether fluentd can be used for collecting Akash provider and tenant container logs. We will likely need to configure the DaemonSet accordingly.
  2. Evaluate resource load in terms of CPU and memory utilization on the nodes. We will want to run a multi-node cluster with some very chatty applications.
  3. Evaluate network bandwidth utilization (internal/ E-W as well as external/ N-S) for exporting logs.
  4. Ensure sensitive information can be masked (see the illustrative sketch after this list).
  5. Ensure logs can be visualized and queried with at least one common visualization tool (Kibana, Grafana, Logdy).
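
Purely as an illustration of goal 4, here is a minimal Python sketch of the kind of masking rules the pipeline would need to apply before logs leave the cluster. In practice this would likely be done with a fluentd filter plugin (e.g. record_transformer) rather than custom code, and the patterns below (account addresses, bearer tokens, IPv4s) are assumptions about what counts as sensitive:

```python
import re

# Illustrative masking rules only; the real list of sensitive fields is an
# assumption and would need to be agreed upon (addresses, tokens, IPs, etc.).
MASK_PATTERNS = [
    (re.compile(r"akash1[0-9a-z]{38}"), "akash1***MASKED***"),                  # bech32 account addresses
    (re.compile(r"(?i)(authorization:\s*bearer\s+)\S+"), r"\g<1>***MASKED***"),  # bearer tokens
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "***.***.***.***"),             # IPv4 addresses
]

def mask_line(line: str) -> str:
    """Apply every masking rule to a single log line before it is shipped."""
    for pattern, replacement in MASK_PATTERNS:
        line = pattern.sub(replacement, line)
    return line

if __name__ == "__main__":
    sample = "lease owner " + "akash1" + "q" * 38 + " connected from 10.0.12.7"
    print(mask_line(sample))
    # -> lease owner akash1***MASKED*** connected from ***.***.***.***
```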

Describe alternatives you've considered

Continuing to grep provider logs via kubectl (roughly what the sketch below does).
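
For context, the status quo amounts to something like the following: pull whatever logs the container runtime still holds and grep them. This is a rough sketch using the official Kubernetes Python client; the akash-services namespace and the app=akash-provider label selector are assumptions and may differ between providers:

```python
from kubernetes import client, config

# Assumes kubectl access to the provider cluster; the namespace and label
# selector below are assumptions and may differ between providers.
config.load_kube_config()
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod("akash-services", label_selector="app=akash-provider")
for pod in pods.items:
    # Only logs still held by the container runtime are available,
    # which in practice is roughly the last hour.
    logs = v1.read_namespaced_pod_log(
        name=pod.metadata.name,
        namespace="akash-services",
        since_seconds=3600,
    )
    for line in logs.splitlines():
        if "dseq" in line:  # the "grep" part
            print(pod.metadata.name, line)
```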

Search

  • I did search for other open and closed issues before opening this

Code of Conduct

  • I agree to follow this project's Code of Conduct

Additional context

No response

andy108369 (Contributor) commented Sep 18, 2024

I think we should look into managed, hassle-free solutions, so that any Akash Provider (K8s cluster) can install the agent and get all the logs available out of the box on a dashboard:

https://newrelic.com/
https://www.datadoghq.com/
https://logz.io/
https://coralogix.com/
https://betterstack.com/logs
https://opensearch.org/ -- an open-source fork of Elasticsearch, which might be useful to go with instead of ES if we want to go down the self-managed path

And maintaining Elasticsearch ourselves is a big pain. A few companies use cloud.elastic.co, which is a managed ES solution.

I think most (if not all) of them support K8s pod logging (including the akash-provider pod, etc.), so we probably just need to pick the one that is:

  1. ideally the cheapest
  2. easiest to install (similar to the netdata cloud agent installation, one-click install)
  3. easiest & fastest to render the logs, with a useful UI/UX for parsing through them (we can simply test querying some of the known dseq of recently deployed leases)

  • fluentd DaemonSet and managed Elasticsearch
    It should be very easy to install fluentd directly into K8s, since they provide a DaemonSet install: https://docs.fluentd.org/container-deployment/kubernetes
    Then we could just point it at the managed Elasticsearch (https://cloud.elastic.co) and that's it.
    It looks like the basic managed ES would cost us 0.0532*(24*(365/12)) ≈ $39/month, when looking at the defaults here: https://cloud.elastic.co/pricing?elektra=pricing-page
    It is likely that we'll have to scale it vertically over time, but probably not too much. I think we'll want at least a month or two of the most recent logs retained.
    Elastic Cloud uses HTTPS by default, and we will need to pass the Elasticsearch user & password to fluentd once we have the details from Elastic Cloud (a verification query sketch follows after these bullets).

  • OpenSearch Helm charts
    We can also try, in parallel, getting ES running within the existing K8s cluster on one of the providers without leases.
    Since the Elasticsearch Helm charts (https://github.com/elastic/helm-charts) were archived on May 16, 2023, we can use OpenSearch instead: https://opensearch.org/docs/latest/install-and-configure/install-opensearch/helm/
    That's not going to be a centralized, hassle-free managed solution yet, but at least we can evaluate whether the fluentd DaemonSet captures what we need (akash-provider pod logs, etc.) and see how many resources OpenSearch consumes.
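
Whichever backend we pick, a small verification script can confirm that fluentd-shipped logs actually land and are queryable for a known dseq. Below is a rough sketch using the elasticsearch-py 8.x client (OpenSearch has an equivalent opensearch-py client); the endpoint, credentials, index pattern, and field names are all placeholders that depend on how the fluentd Elasticsearch output plugin is configured:

```python
from elasticsearch import Elasticsearch  # elasticsearch-py 8.x

# Placeholder endpoint and credentials from the Elastic Cloud deployment.
es = Elasticsearch(
    "https://my-deployment.es.us-east-1.aws.found.io:9243",
    basic_auth=("elastic", "CHANGE_ME"),
)

# The index pattern and field names are assumptions; they depend on how the
# fluentd Elasticsearch output plugin is configured (logstash_format, etc.).
resp = es.search(
    index="logstash-*",
    query={"match_phrase": {"log": "1234567"}},  # a known dseq, searched as a literal string
    sort=[{"@timestamp": "desc"}],
    size=20,
)

for hit in resp["hits"]["hits"]:
    src = hit["_source"]
    print(src.get("@timestamp"), src.get("kubernetes", {}).get("pod_name"), src.get("log", "").strip())
```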

chainzero added the repo/provider (Akash provider-services repo issues) label and removed the awaiting-triage label Sep 18, 2024
andy108369 (Contributor) commented Sep 18, 2024

@shimpa1 installed fluentd + Loki + Grafana, and everything looks to be working. He is now fixing a small issue with fluentd not seeing the kubernetes_metadata plugin. Once that is working, we can install it across the rest of the providers. It is not going to be one centralized solution for ALL of the providers, but rather one per provider. Artur asked to have it in a centralized place, so we'll get a VM with Loki + Grafana and point the fluentd running on each provider to it.
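
Once the centralized Loki + Grafana VM is up, the dseq-query test mentioned above could look roughly like the sketch below. It goes against Loki's HTTP query_range API; the Loki host and the way a lease shows up in the log labels are assumptions that depend on the fluentd output configuration:

```python
import time
import requests

LOKI_URL = "http://loki.example.internal:3100"  # placeholder for the centralized Loki VM
DSEQ = "1234567"                                # placeholder dseq to look up

end = int(time.time() * 1e9)                    # Loki expects nanosecond timestamps
start = end - 24 * 3600 * 10**9                 # last 24 hours

resp = requests.get(
    f"{LOKI_URL}/loki/api/v1/query_range",
    params={
        # The label matcher is an assumption; it depends on how the fluentd
        # Loki output maps Kubernetes metadata to Loki labels.
        "query": f'{{namespace=~".+"}} |= "{DSEQ}"',
        "start": start,
        "end": end,
        "limit": 100,
    },
    timeout=30,
)
resp.raise_for_status()

for stream in resp.json()["data"]["result"]:
    labels = stream["stream"]
    for ts, line in stream["values"]:
        print(labels.get("pod", labels.get("namespace", "?")), line)
```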

Projects
Status: Up Next (prioritized)