
CloudWatchLogsLogGroup - stuck waiting #331

Open
martivo opened this issue Sep 27, 2024 · 12 comments · May be fixed by #332

Comments

@martivo

martivo commented Sep 27, 2024

Creating the same issue in this fork, since it was closed in the old repo: rebuy-de/aws-nuke#500

The problem still exists, tested with ghcr.io/ekristen/aws-nuke:v3.23.0
nuke.log

When I re-run aws-nuke, the deletion is successful (only the CloudWatchLogsLogGroup is deleted).

Do you want to continue? Enter account alias to continue.
> XXX

eu-north-1 - CloudWatchLogsLogGroup - /aws/eks/runner-main-pri/cluster - [CreatedTime: "1727419532008", LastEvent: "2024-09-27T06:45:43Z", logGroupName: "/aws/eks/runner-main-pri/cluster"] - triggered remove

Removal requested: 1 waiting, 0 failed, 113 skipped, 0 finished

eu-north-1 - CloudWatchLogsLogGroup - /aws/eks/runner-main-pri/cluster - [CreatedTime: "1727419532008", LastEvent: "2024-09-27T06:45:43Z", logGroupName: "/aws/eks/runner-main-pri/cluster"] - waiting

Removal requested: 1 waiting, 0 failed, 113 skipped, 0 finished

eu-north-1 - CloudWatchLogsLogGroup - /aws/eks/runner-main-pri/cluster - [CreatedTime: "1727419532008", LastEvent: "2024-09-27T06:45:43Z", logGroupName: "/aws/eks/runner-main-pri/cluster"] - removed

Removal requested: 0 waiting, 0 failed, 113 skipped, 1 finished

Nuke complete: 0 failed, 113 skipped, 1 finished.
@ekristen
Owner

Just so I understand: it does get deleted, but it comes back, simply due to the order in which deletions happen and how long it takes for things like clusters to get removed?

@martivo
Author

martivo commented Sep 28, 2024

I don't think it comes back. It seems like it never triggers the remove, or when it was triggering the remove, the resource was still being used and the remove failed.
In the attached log you can also see that it will stay waiting forever for the resource to be deleted.

eu-north-1 - CloudWatchLogsLogGroup - /aws/eks/runner-main-pri/cluster - [CreatedTime: "1727413603921", LastEvent: "2024-09-27T06:38:09Z", logGroupName: "/aws/eks/runner-main-pri/cluster", tag:Name: "/aws/eks/runner-main-pri/cluster", tag:Prefix: "runner-main", tag:Terraform: "true", tag:Workspace: "pri"] - waiting

Removal requested: 1 waiting, 0 failed, 113 skipped, 305 finished

This message is repeated until aws-nuke gives up.
When I start the program again, it will find the same CloudWatchLogsLogGroup resource and immediately delete it successfully.
EKS cluster removal usually takes around 5-15 minutes, so quite some time. EKS uses the CloudWatchLogsLogGroup resource, so the LogGroup cannot be deleted before EKS itself is removed. It seems to me that the nuke program does not even try to remove the LogGroup; it seems to just wait for it to be removed. If there were a way to trigger the removal of the LogGroup only after the EKS cluster is deleted, I am sure it would fix the issue.

@ekristen
Owner

Do you ever see a "triggered remove"?

@martivo
Author

martivo commented Sep 29, 2024

Yes:

eu-north-1 - CloudWatchLogsLogGroup - /aws/eks/runner-main-pri/cluster - [CreatedTime: "1727413603921", LastEvent: "2024-09-27T06:38:09Z", logGroupName: "/aws/eks/runner-main-pri/cluster", tag:Name: "/aws/eks/runner-main-pri/cluster", tag:Prefix: "runner-main", tag:Terraform: "true", tag:Workspace: "pri"] - triggered remove

Also, I found that EKS cluster removal was triggered after the CloudWatchLogsLogGroup removal (here you can see the LogGroup was already in the waiting status):

eu-north-1 - CloudWatchLogsLogGroup - /aws/eks/runner-main-pri/cluster - [CreatedTime: "1727413603921", LastEvent: "2024-09-27T06:38:09Z", logGroupName: "/aws/eks/runner-main-pri/cluster", tag:Name: "/aws/eks/runner-main-pri/cluster", tag:Prefix: "runner-main", tag:Terraform: "true", tag:Workspace: "pri"] - waiting
eu-north-1 - EKSCluster - runner-main-pri - [CreatedAt: "2024-09-27T05:07:01Z", tag:Prefix: "runner-main", tag:Terraform: "true", tag:Workspace: "pri"] - triggered remove

@ekristen
Owner

What is happening is that it is getting deleted, but it gets recreated by the cluster before the tool detects it is gone, so the tool is waiting for a removal that has technically already happened.

Not sure at the moment what the best way to handle it is. Need to give this some thought.

@ekristen
Owner

I actually think this can be solved very simply by a patch to libnuke. I'll put a PR together and make the binaries available.

Ultimately, what is happening is that the CloudWatch LogGroup is being deleted, but it comes back. However, due to how the resource matching currently works, the String() result is matched over Properties(), which in this case means the same log group name is found, and that results in waiting forever.

There are two fixes here: a) we need some sort of upper threshold on "waiting" resources, and b) we need to prioritize property matching over the stringer. The CloudWatch Log Group properties would mismatch due to the LastEvent and CreatedAt times, thus invalidating the match and triggering the removal again.
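
Roughly, the property-over-stringer idea would look something like this (a minimal sketch, not the actual libnuke code; the logGroup type and sameResource helper are made up purely to illustrate the idea):

```go
package main

import "fmt"

// logGroup is a hypothetical stand-in for the CloudWatchLogsLogGroup resource:
// a stringer identity (the name) plus a property map (CreatedTime, LastEvent, ...).
type logGroup struct {
	name  string
	props map[string]string
}

func (l logGroup) String() string                { return l.name }
func (l logGroup) Properties() map[string]string { return l.props }

// sameResource prefers property matching over the stringer: a recreated log
// group keeps the same name but gets a new CreatedTime, so the properties
// mismatch and it is treated as a new resource to remove, instead of being
// waited on forever.
func sameResource(a, b logGroup) bool {
	pa, pb := a.Properties(), b.Properties()
	if len(pa) > 0 && len(pb) > 0 {
		if len(pa) != len(pb) {
			return false
		}
		for k, v := range pa {
			if pb[k] != v {
				return false
			}
		}
		return true
	}
	// Fall back to the stringer only when no properties are available.
	return a.String() == b.String()
}

func main() {
	deleted := logGroup{"/aws/eks/runner-main-pri/cluster", map[string]string{"CreatedTime": "1727413603921"}}
	recreated := logGroup{"/aws/eks/runner-main-pri/cluster", map[string]string{"CreatedTime": "1727419532008"}}
	fmt.Println(sameResource(deleted, recreated)) // false: same name, different CreatedTime
}
```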

However, the ultimate fix is going to be some sort of DAG for deletions.
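
For reference, a very rough sketch of what dependency-ordered deletion could eventually look like (nothing like this exists in aws-nuke today; deleteAfter and order are hypothetical names just to illustrate the DAG idea):

```go
package main

import "fmt"

// deleteAfter is a hypothetical dependency map: each resource type lists the
// types that must be fully removed before it is deleted, so that a recreating
// writer (like an EKS cluster re-creating its log group) is gone first.
var deleteAfter = map[string][]string{
	"CloudWatchLogsLogGroup": {"EKSCluster"},
	"EKSCluster":             {},
}

// order returns a deletion order that respects deleteAfter (a topological
// sort; cycle handling omitted for brevity).
func order(deps map[string][]string) []string {
	var out []string
	visited := map[string]bool{}
	var visit func(string)
	visit = func(r string) {
		if visited[r] {
			return
		}
		visited[r] = true
		for _, d := range deps[r] {
			visit(d) // remove dependencies (e.g. the EKS cluster) first
		}
		out = append(out, r)
	}
	for r := range deps {
		visit(r)
	}
	return out
}

func main() {
	fmt.Println(order(deleteAfter)) // [EKSCluster CloudWatchLogsLogGroup]
}
```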

@ekristen
Owner

This might fix the issue -- #332

Builds should be available here -- https://github.com/ekristen/aws-nuke/actions/runs/11097585762

If you can run and test it, that would be appreciated. It's the best fix that can be done at the moment; the real fix will be a dependency graph, but that's a ways off.

@martivo
Author

martivo commented Sep 30, 2024

> This might fix the issue -- #332
>
> Builds should be available here -- https://github.com/ekristen/aws-nuke/actions/runs/11097585762
>
> If you can run and test that would be appreciated. It's the best fix that can be done at the moment, the real fix will be dependency graph, but that's a ways off.

I will give it a try on Wednesday evening - I have an account lined up for that time that needs to be nuked.

@martivo
Author

martivo commented Oct 3, 2024

It seems it did not help. With the build you referenced in #332, it now just said the log group was removed, but it was actually not removed.
I still had to run the nuke twice.

@ekristen
Owner

ekristen commented Oct 3, 2024

So it is removed, it just comes back; this at least fixed the problem of getting stuck waiting.

The problem of it coming back is only ever going to be solved by a dependency based delete.

@ekristen
Owner

@martivo I think we should merge #332, as it fixes the infinite waiting problem, and then close this and track it against the DAG feature issue I have open, as I believe that's the only way to truly solve this problem. Otherwise, it's pretty much a have-to-run-it-twice sort of scenario.

Thoughts?

@martivo
Author

martivo commented Oct 15, 2024

I guess it's ok to merge, but then you will most surely get another issue that the nuke is not actually deleting all the resources. From my perspective this behaviour is ok - at least I can now decently run it twice without having to wait a long time.
