
Investigate leaking allocation storage data #615

Closed
mpass99 opened this issue Jun 13, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@mpass99
Contributor

mpass99 commented Jun 13, 2024

On June 12, we saw 16922 objects in the nomad_allocations storage.


The case of 29-33eaa850-28a3-11ef-920d-fa163efe023e is one example of a runner that was added to this storage but never removed.
The runner was used multiple times by a user and then destroyed once the inactivity timer expired.
time="2024-06-12T11:08:04.171686Z" level=debug msg="Destroying Runner" destroy_reason="runner inactivity timeout exceeded" package=runner runner_id=29-33eaa850-28a3-11ef-920d-fa163efe023e

However, the Nomad Allocation events contain no hint that the allocation was removed.

InfluxDB Allocation Events

2024-06-12T10:04:49.35585745Z,map[Allocation:map[AllocModifyIndex:769789 AllocatedResources:map[Shared:map[DiskMB:10 Networks:<nil> Ports:<nil>] TaskLifecycles:map[default-task:<nil>] Tasks:map[default-task:map[Cpu:map[CpuShares:20 ReservedCores:<nil>] Devices:<nil> Memory:map[MemoryMB:30 MemoryMaxMB:256] Networks:<nil>]]] ClientStatus:pending CreateIndex:769789 CreateTime:1.718186688846078e+18 DesiredStatus:run EvalID:a4459036-4a27-2a99-94fd-19291c1013f1 ID:54750d38-7bb8-978c-1f0a-1ca64f1c70b4 JobID:29-33eaa850-28a3-11ef-920d-fa163efe023e Metrics:map[AllocationTime:542984 ClassExhausted:<nil> ClassFiltered:<nil> CoalescedFailures:0 ConstraintFiltered:<nil> DimensionExhausted:<nil> NodesAvailable:map[dc1:2] NodesEvaluated:2 NodesExhausted:0 NodesFiltered:0 NodesInPool:2 QuotaExhausted:<nil> ResourcesExhausted:<nil> ScoreMetaData:[map[NodeID:919ea902-e1f4-de7f-0743-2deeabb93628 NormScore:0.9286548337276099 Scores:map[binpack:0.9286548337276099 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]] map[NodeID:cb04341c-ea7d-5300-1a40-356801c6c1e8 NormScore:0.9267940453289819 Scores:map[binpack:0.9267940453289819 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]]] Scores:<nil>] ModifyIndex:769789 ModifyTime:1.718186688846078e+18 Name:29-33eaa850-28a3-11ef-920d-fa163efe023e.default-group[0] Namespace:poseidon NodeID:919ea902-e1f4-de7f-0743-2deeabb93628 NodeName:nomad-agent-terraform-4 Resources:map[CPU:20 Cores:0 Devices:<nil> DiskMB:10 IOPS:0 MemoryMB:30 MemoryMaxMB:256 NUMA:<nil> Networks:<nil>] SharedResources:map[CPU:0 Cores:0 Devices:<nil> DiskMB:10 IOPS:0 MemoryMB:0 MemoryMaxMB:0 NUMA:<nil> Networks:<nil>] 
SignedIdentities:map[default-task:eyJhbGciOiJSUzI1NiIsImtpZCI6IjE5OTNjNDcxLTQ3ZWQtMDlhZS1kMDI0LWQ2NTc4NzNiOThlNSIsInR5cCI6IkpXVCJ9.eyJhdWQiOiJub21hZHByb2plY3QuaW8iLCJpYXQiOjE3MTgxODY2ODgsImp0aSI6IjJlYWEwMzYxLTQyNDMtZjUxNS1mODFhLTNkZmQ5MWE3OWJhOCIsIm5iZiI6MTcxODE4NjY4OCwibm9tYWRfYWxsb2NhdGlvbl9pZCI6IjU0NzUwZDM4LTdiYjgtOTc4Yy0xZjBhLTFjYTY0ZjFjNzBiNCIsIm5vbWFkX2pvYl9pZCI6IjI5LTMzZWFhODUwLTI4YTMtMTFlZi05MjBkLWZhMTYzZWZlMDIzZSIsIm5vbWFkX25hbWVzcGFjZSI6InBvc2VpZG9uIiwibm9tYWRfdGFzayI6ImRlZmF1bHQtdGFzayIsInN1YiI6Imdsb2JhbDpwb3NlaWRvbjoyOS0zM2VhYTg1MC0yOGEzLTExZWYtOTIwZC1mYTE2M2VmZTAyM2U6ZGVmYXVsdC1ncm91cDpkZWZhdWx0LXRhc2s6ZGVmYXVsdCJ9.MKlzE5IOKYNHkUo5CgRA-7OdZXPUc1hv9h3qlvzoHyG9sYElBn1vHJeqW7qoDdRuEdlESVMPGy3LpB06s0XyPiyYiHgVnyiECEihBjkiqkRfFR8rNJTj2jYC9vubNFda2dBzjzCAGTok9ZtK9eChOFd_YqHZ8NNXnbxMh-ljAhsz24aAb_TfI2CU2WtO3IlGpTqpygZyztUoU2gHwNJ9F17p5R2sIBujFyNeP_0IrRdv3P3KPIk_jfVQGdMZGeHnQLAKVzd692UMX1wRNnD-VERayYDGIOVVRV8_XGLeqsH9M1G7EluefopxuG31SQP16NYudLcoP33IGiOCpsur6g] SigningKeyID:1993c471-47ed-09ae-d024-d657873b98e5 TaskGroup:default-group TaskResources:map[default-task:map[CPU:20 Cores:0 Devices:<nil> DiskMB:0 IOPS:0 MemoryMB:30 MemoryMaxMB:256 NUMA:<nil> Networks:<nil>]]]],payload,poseidon_nomad_events,54750d38-7bb8-978c-1f0a-1ca64f1c70b4,production,10:04:48.852712685,Allocation,PlanResult
2024-06-12T10:04:50.35542309Z,map[Allocation:map[AllocModifyIndex:769789 AllocatedResources:map[Shared:map[DiskMB:10 Networks:<nil> Ports:<nil>] TaskLifecycles:map[default-task:<nil>] Tasks:map[default-task:map[Cpu:map[CpuShares:20 ReservedCores:<nil>] Devices:<nil> Memory:map[MemoryMB:30 MemoryMaxMB:256] Networks:<nil>]]] ClientDescription:Tasks are running ClientStatus:running CreateIndex:769789 CreateTime:1.718186688846078e+18 DesiredStatus:run EvalID:a4459036-4a27-2a99-94fd-19291c1013f1 ID:54750d38-7bb8-978c-1f0a-1ca64f1c70b4 JobID:29-33eaa850-28a3-11ef-920d-fa163efe023e Metrics:map[AllocationTime:542984 ClassExhausted:<nil> ClassFiltered:<nil> CoalescedFailures:0 ConstraintFiltered:<nil> DimensionExhausted:<nil> NodesAvailable:map[dc1:2] NodesEvaluated:2 NodesExhausted:0 NodesFiltered:0 NodesInPool:2 QuotaExhausted:<nil> ResourcesExhausted:<nil> ScoreMetaData:[map[NodeID:919ea902-e1f4-de7f-0743-2deeabb93628 NormScore:0.9286548337276099 Scores:map[binpack:0.9286548337276099 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]] map[NodeID:cb04341c-ea7d-5300-1a40-356801c6c1e8 NormScore:0.9267940453289819 Scores:map[binpack:0.9267940453289819 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]]] Scores:<nil>] ModifyIndex:769793 ModifyTime:1.7181866897147538e+18 Name:29-33eaa850-28a3-11ef-920d-fa163efe023e.default-group[0] Namespace:poseidon NetworkStatus:map[Address: DNS:<nil> InterfaceName:] NodeID:919ea902-e1f4-de7f-0743-2deeabb93628 NodeName:nomad-agent-terraform-4 Resources:map[CPU:20 Cores:0 Devices:<nil> DiskMB:10 IOPS:0 MemoryMB:30 MemoryMaxMB:256 NUMA:<nil> Networks:<nil>] SharedResources:map[CPU:0 Cores:0 Devices:<nil> DiskMB:10 IOPS:0 MemoryMB:0 MemoryMaxMB:0 NUMA:<nil> Networks:<nil>] 
SignedIdentities:map[default-task:eyJhbGciOiJSUzI1NiIsImtpZCI6IjE5OTNjNDcxLTQ3ZWQtMDlhZS1kMDI0LWQ2NTc4NzNiOThlNSIsInR5cCI6IkpXVCJ9.eyJhdWQiOiJub21hZHByb2plY3QuaW8iLCJpYXQiOjE3MTgxODY2ODgsImp0aSI6IjJlYWEwMzYxLTQyNDMtZjUxNS1mODFhLTNkZmQ5MWE3OWJhOCIsIm5iZiI6MTcxODE4NjY4OCwibm9tYWRfYWxsb2NhdGlvbl9pZCI6IjU0NzUwZDM4LTdiYjgtOTc4Yy0xZjBhLTFjYTY0ZjFjNzBiNCIsIm5vbWFkX2pvYl9pZCI6IjI5LTMzZWFhODUwLTI4YTMtMTFlZi05MjBkLWZhMTYzZWZlMDIzZSIsIm5vbWFkX25hbWVzcGFjZSI6InBvc2VpZG9uIiwibm9tYWRfdGFzayI6ImRlZmF1bHQtdGFzayIsInN1YiI6Imdsb2JhbDpwb3NlaWRvbjoyOS0zM2VhYTg1MC0yOGEzLTExZWYtOTIwZC1mYTE2M2VmZTAyM2U6ZGVmYXVsdC1ncm91cDpkZWZhdWx0LXRhc2s6ZGVmYXVsdCJ9.MKlzE5IOKYNHkUo5CgRA-7OdZXPUc1hv9h3qlvzoHyG9sYElBn1vHJeqW7qoDdRuEdlESVMPGy3LpB06s0XyPiyYiHgVnyiECEihBjkiqkRfFR8rNJTj2jYC9vubNFda2dBzjzCAGTok9ZtK9eChOFd_YqHZ8NNXnbxMh-ljAhsz24aAb_TfI2CU2WtO3IlGpTqpygZyztUoU2gHwNJ9F17p5R2sIBujFyNeP_0IrRdv3P3KPIk_jfVQGdMZGeHnQLAKVzd692UMX1wRNnD-VERayYDGIOVVRV8_XGLeqsH9M1G7EluefopxuG31SQP16NYudLcoP33IGiOCpsur6g] SigningKeyID:1993c471-47ed-09ae-d024-d657873b98e5 TaskGroup:default-group TaskResources:map[default-task:map[CPU:20 Cores:0 Devices:<nil> DiskMB:0 IOPS:0 MemoryMB:30 MemoryMaxMB:256 NUMA:<nil> Networks:<nil>]] TaskStates:map[default-task:map[Events:[map[Details:map[] DiskLimit:0 DisplayMessage:Task received by client DownloadError: DriverError: DriverMessage: ExitCode:0 FailedSibling: FailsTask:false GenericSource: KillError: KillReason: KillTimeout:0 Message: RestartReason: SetupError: Signal:0 StartDelay:0 TaskSignal: TaskSignalReason: Time:1.7181866888913754e+18 Type:Received ValidationError: VaultError:] map[Details:map[message:Building Task Directory] DiskLimit:0 DisplayMessage:Building Task Directory DownloadError: DriverError: DriverMessage: ExitCode:0 FailedSibling: FailsTask:false GenericSource: KillError: KillReason: KillTimeout:0 Message:Building Task Directory RestartReason: SetupError: Signal:0 StartDelay:0 TaskSignal: TaskSignalReason: Time:1.7181866888980086e+18 Type:Task Setup 
ValidationError: VaultError:] map[Details:map[] DiskLimit:0 DisplayMessage:Task started by client DownloadError: DriverError: DriverMessage: ExitCode:0 FailedSibling: FailsTask:false GenericSource: KillError: KillReason: KillTimeout:0 Message: RestartReason: SetupError: Signal:0 StartDelay:0 TaskSignal: TaskSignalReason: Time:1.7181866892420828e+18 Type:Started ValidationError: VaultError:]] Failed:false FinishedAt:<nil> LastRestart:<nil> Paused: Restarts:0 StartedAt:2024-06-12T10:04:49.242139458Z State:running TaskHandle:<nil>]]]],payload,poseidon_nomad_events,54750d38-7bb8-978c-1f0a-1ca64f1c70b4,production,10:04:49.770645469,Allocation,AllocationUpdated
2024-06-12T11:00:00.314133593Z,map[Allocation:map[AllocModifyIndex:770055 AllocatedResources:map[Shared:map[DiskMB:10 Networks:<nil> Ports:<nil>] TaskLifecycles:map[default-task:<nil>] Tasks:map[default-task:map[Cpu:map[CpuShares:20 ReservedCores:<nil>] Devices:<nil> Memory:map[MemoryMB:30 MemoryMaxMB:256] Networks:<nil>]]] ClientDescription:Tasks are running ClientStatus:running CreateIndex:769789 CreateTime:1.718186688846078e+18 DesiredStatus:run EvalID:c5cfd724-34a1-e396-c37d-6dd9545c0e36 ID:54750d38-7bb8-978c-1f0a-1ca64f1c70b4 JobID:29-33eaa850-28a3-11ef-920d-fa163efe023e Metrics:map[AllocationTime:542984 ClassExhausted:<nil> ClassFiltered:<nil> CoalescedFailures:0 ConstraintFiltered:<nil> DimensionExhausted:<nil> NodesAvailable:map[dc1:2] NodesEvaluated:2 NodesExhausted:0 NodesFiltered:0 NodesInPool:2 QuotaExhausted:<nil> ResourcesExhausted:<nil> ScoreMetaData:[map[NodeID:919ea902-e1f4-de7f-0743-2deeabb93628 NormScore:0.9286548337276099 Scores:map[binpack:0.9286548337276099 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]] map[NodeID:cb04341c-ea7d-5300-1a40-356801c6c1e8 NormScore:0.9267940453289819 Scores:map[binpack:0.9267940453289819 job-anti-affinity:0 node-affinity:0 node-reschedule-penalty:0]]] Scores:<nil>] ModifyIndex:770055 ModifyTime:1.718189999979358e+18 Name:29-33eaa850-28a3-11ef-920d-fa163efe023e.default-group[0] Namespace:poseidon NetworkStatus:map[Address: DNS:<nil> InterfaceName:] NodeID:919ea902-e1f4-de7f-0743-2deeabb93628 NodeName:nomad-agent-terraform-4 Resources:map[CPU:20 Cores:0 Devices:<nil> DiskMB:10 IOPS:0 MemoryMB:30 MemoryMaxMB:256 NUMA:<nil> Networks:<nil>] SharedResources:map[CPU:0 Cores:0 Devices:<nil> DiskMB:10 IOPS:0 MemoryMB:0 MemoryMaxMB:0 NUMA:<nil> Networks:<nil>] 
SignedIdentities:map[default-task:eyJhbGciOiJSUzI1NiIsImtpZCI6IjE5OTNjNDcxLTQ3ZWQtMDlhZS1kMDI0LWQ2NTc4NzNiOThlNSIsInR5cCI6IkpXVCJ9.eyJhdWQiOiJub21hZHByb2plY3QuaW8iLCJpYXQiOjE3MTgxODY2ODgsImp0aSI6IjJlYWEwMzYxLTQyNDMtZjUxNS1mODFhLTNkZmQ5MWE3OWJhOCIsIm5iZiI6MTcxODE4NjY4OCwibm9tYWRfYWxsb2NhdGlvbl9pZCI6IjU0NzUwZDM4LTdiYjgtOTc4Yy0xZjBhLTFjYTY0ZjFjNzBiNCIsIm5vbWFkX2pvYl9pZCI6IjI5LTMzZWFhODUwLTI4YTMtMTFlZi05MjBkLWZhMTYzZWZlMDIzZSIsIm5vbWFkX25hbWVzcGFjZSI6InBvc2VpZG9uIiwibm9tYWRfdGFzayI6ImRlZmF1bHQtdGFzayIsInN1YiI6Imdsb2JhbDpwb3NlaWRvbjoyOS0zM2VhYTg1MC0yOGEzLTExZWYtOTIwZC1mYTE2M2VmZTAyM2U6ZGVmYXVsdC1ncm91cDpkZWZhdWx0LXRhc2s6ZGVmYXVsdCJ9.MKlzE5IOKYNHkUo5CgRA-7OdZXPUc1hv9h3qlvzoHyG9sYElBn1vHJeqW7qoDdRuEdlESVMPGy3LpB06s0XyPiyYiHgVnyiECEihBjkiqkRfFR8rNJTj2jYC9vubNFda2dBzjzCAGTok9ZtK9eChOFd_YqHZ8NNXnbxMh-ljAhsz24aAb_TfI2CU2WtO3IlGpTqpygZyztUoU2gHwNJ9F17p5R2sIBujFyNeP_0IrRdv3P3KPIk_jfVQGdMZGeHnQLAKVzd692UMX1wRNnD-VERayYDGIOVVRV8_XGLeqsH9M1G7EluefopxuG31SQP16NYudLcoP33IGiOCpsur6g] SigningKeyID:1993c471-47ed-09ae-d024-d657873b98e5 TaskGroup:default-group TaskResources:map[default-task:map[CPU:20 Cores:0 Devices:<nil> DiskMB:0 IOPS:0 MemoryMB:30 MemoryMaxMB:256 NUMA:<nil> Networks:<nil>]] TaskStates:map[default-task:map[Events:[map[Details:map[] DiskLimit:0 DisplayMessage:Task received by client DownloadError: DriverError: DriverMessage: ExitCode:0 FailedSibling: FailsTask:false GenericSource: KillError: KillReason: KillTimeout:0 Message: RestartReason: SetupError: Signal:0 StartDelay:0 TaskSignal: TaskSignalReason: Time:1.7181866888913754e+18 Type:Received ValidationError: VaultError:] map[Details:map[message:Building Task Directory] DiskLimit:0 DisplayMessage:Building Task Directory DownloadError: DriverError: DriverMessage: ExitCode:0 FailedSibling: FailsTask:false GenericSource: KillError: KillReason: KillTimeout:0 Message:Building Task Directory RestartReason: SetupError: Signal:0 StartDelay:0 TaskSignal: TaskSignalReason: Time:1.7181866888980086e+18 Type:Task Setup 
ValidationError: VaultError:] map[Details:map[] DiskLimit:0 DisplayMessage:Task started by client DownloadError: DriverError: DriverMessage: ExitCode:0 FailedSibling: FailsTask:false GenericSource: KillError: KillReason: KillTimeout:0 Message: RestartReason: SetupError: Signal:0 StartDelay:0 TaskSignal: TaskSignalReason: Time:1.7181866892420828e+18 Type:Started ValidationError: VaultError:]] Failed:false FinishedAt:<nil> LastRestart:<nil> Paused: Restarts:0 StartedAt:2024-06-12T10:04:49.242139458Z State:running TaskHandle:<nil>]]]],payload,poseidon_nomad_events,54750d38-7bb8-978c-1f0a-1ca64f1c70b4,production,10:59:59.279810253,Allocation,PlanResult
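
The observation about the Allocation events above can be checked mechanically. The following is a minimal sketch, not Poseidon code: it assumes each event has been reduced to an (event type, ClientStatus, DesiredStatus) tuple, and it treats Nomad's terminal client statuses (complete, failed, lost) or a desired status of stop or evict as a removal hint. The function name `has_removal_hint` and the tuple layout are hypothetical.

```python
# Hypothetical reduction of the three Allocation events listed above to
# (event type, ClientStatus, DesiredStatus); the values are copied from the log.
events = [
    ("PlanResult", "pending", "run"),
    ("AllocationUpdated", "running", "run"),
    ("PlanResult", "running", "run"),
]

# Status values that would indicate the allocation is stopping or gone.
TERMINAL_CLIENT_STATUSES = {"complete", "failed", "lost"}
STOPPING_DESIRED_STATUSES = {"stop", "evict"}


def has_removal_hint(allocation_events):
    """Return True if any event suggests the allocation was stopped or removed."""
    return any(
        client_status in TERMINAL_CLIENT_STATUSES
        or desired_status in STOPPING_DESIRED_STATUSES
        for _event_type, client_status, desired_status in allocation_events
    )


print(has_removal_hint(events))  # prints False
```

For the three events recorded here, no status leaves pending/running and the desired status stays at run, which matches the conclusion that the allocation stream alone gives no removal signal.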

Only the Job events indicate that the Job was deregistered.

InfluxDB Job events

2024-06-12T10:04:49.35585745Z,map[Job:map[Affinities:<nil> AllAtOnce:false Constraints:<nil> ConsulNamespace: ConsulToken: CreateIndex:769787 Datacenters:[dc1] DispatchIdempotencyToken: Dispatched:false ID:29-33eaa850-28a3-11ef-920d-fa163efe023e JobModifyIndex:769787 Meta:<nil> ModifyIndex:769787 Multiregion:<nil> Name:29-33eaa850-28a3-11ef-920d-fa163efe023e Namespace:poseidon NodePool:default NomadTokenID: ParameterizedJob:<nil> ParentID: Payload:<nil> Periodic:<nil> Priority:50 Region:global Spreads:<nil> Stable:false Status:pending StatusDescription: Stop:false SubmitTime:1.7181866887586696e+18 TaskGroups:[map[Affinities:<nil> Constraints:[map[LTarget:${attr.os.signals} Operand:set_contains RTarget:SIGKILL]] Consul:map[Cluster:default Namespace: Partition:] Count:1 Disconnect:<nil> EphemeralDisk:map[Migrate:false SizeMB:10 Sticky:false] MaxClientDisconnect:<nil> Meta:<nil> Migrate:<nil> Name:default-group Networks:<nil> PreventRescheduleOnLost:false ReschedulePolicy:map[Attempts:0 Delay:5e+09 DelayFunction:constant Interval:8.64e+13 MaxDelay:0 Unlimited:true] RestartPolicy:map[Attempts:3 Delay:0 Interval:8.64e+13 Mode:fail RenderTemplates:false] Scaling:map[CreateIndex:769787 Enabled:true ID:1e4b91f1-7ea2-ea3d-f063-a7c3435d1d1e Max:300 Min:0 ModifyIndex:769787 Policy:<nil> Target:map[Group:default-group Job:29-33eaa850-28a3-11ef-920d-fa163efe023e Namespace:poseidon] Type:horizontal] Services:<nil> ShutdownDelay:<nil> Spreads:[map[Attribute:${node.unique.name} SpreadTarget:<nil> Weight:100]] StopAfterClientDisconnect:<nil> Tasks:[map[Actions:<nil> Affinities:<nil> Artifacts:<nil> CSIPluginConfig:<nil> Config:map[args:[infinity] command:sleep force_pull:false image:openhpi/co_execenv_python:3.8 network_mode:none] Constraints:<nil> Consul:<nil> DispatchPayload:<nil> Driver:docker Env:<nil> Identities:<nil> Identity:map[Audience:[nomadproject.io] ChangeMode: ChangeSignal: Env:false File:false Name:default ServiceName: TTL:0] KillSignal:SIGKILL KillTimeout:5e+09 
Kind: Leader:false Lifecycle:<nil> LogConfig:map[Disabled:false MaxFileSizeMB:1 MaxFiles:1] Meta:<nil> Name:default-task Resources:map[CPU:20 Cores:0 Devices:<nil> DiskMB:0 IOPS:0 MemoryMB:30 MemoryMaxMB:256 NUMA:<nil> Networks:<nil>] RestartPolicy:map[Attempts:3 Delay:0 Interval:8.64e+13 Mode:fail RenderTemplates:false] ScalingPolicies:<nil> Schedule:<nil> Services:<nil> ShutdownDelay:0 Templates:<nil> User: Vault:<nil> VolumeMounts:<nil>]] Update:<nil> Volumes:<nil>] map[Affinities:<nil> Constraints:<nil> Consul:map[Cluster:default Namespace: Partition:] Count:0 Disconnect:<nil> EphemeralDisk:map[Migrate:false SizeMB:300 Sticky:false] MaxClientDisconnect:<nil> Meta:map[used:false] Migrate:<nil> Name:config Networks:<nil> PreventRescheduleOnLost:false ReschedulePolicy:map[Attempts:1 Delay:5e+09 DelayFunction:constant Interval:8.64e+13 MaxDelay:0 Unlimited:false] RestartPolicy:map[Attempts:3 Delay:1.5e+10 Interval:8.64e+13 Mode:fail RenderTemplates:false] Scaling:<nil> Services:<nil> ShutdownDelay:<nil> Spreads:<nil> StopAfterClientDisconnect:<nil> Tasks:[map[Actions:<nil> Affinities:<nil> Artifacts:<nil> CSIPluginConfig:<nil> Config:map[command:true] Constraints:<nil> Consul:<nil> DispatchPayload:<nil> Driver:exec Env:<nil> Identities:<nil> Identity:map[Audience:[nomadproject.io] ChangeMode: ChangeSignal: Env:false File:false Name:default ServiceName: TTL:0] KillSignal: KillTimeout:5e+09 Kind: Leader:false Lifecycle:<nil> LogConfig:map[Disabled:false MaxFileSizeMB:1 MaxFiles:1] Meta:<nil> Name:config Resources:map[CPU:1 Cores:0 Devices:<nil> DiskMB:0 IOPS:0 MemoryMB:10 MemoryMaxMB:0 NUMA:<nil> Networks:<nil>] RestartPolicy:map[Attempts:3 Delay:1.5e+10 Interval:8.64e+13 Mode:fail RenderTemplates:false] ScalingPolicies:<nil> Schedule:<nil> Services:<nil> ShutdownDelay:0 Templates:<nil> User: Vault:<nil> VolumeMounts:<nil>]] Update:<nil> Volumes:<nil>]] Type:batch UI:<nil> Update:map[AutoPromote:false AutoRevert:false Canary:0 HealthCheck: HealthyDeadline:0 
MaxParallel:0 MinHealthyTime:0 ProgressDeadline:0 Stagger:0] VaultNamespace: VaultToken: Version:0]],payload,poseidon_nomad_events,29-33eaa850-28a3-11ef-920d-fa163efe023e,production,10:04:48.763443928,Job,JobRegistered
2024-06-12T10:04:49.35585745Z,map[Job:map[Affinities:<nil> AllAtOnce:false Constraints:<nil> ConsulNamespace: ConsulToken: CreateIndex:769787 Datacenters:[dc1] DispatchIdempotencyToken: Dispatched:false ID:29-33eaa850-28a3-11ef-920d-fa163efe023e JobModifyIndex:769787 Meta:<nil> ModifyIndex:769789 Multiregion:<nil> Name:29-33eaa850-28a3-11ef-920d-fa163efe023e Namespace:poseidon NodePool:default NomadTokenID: ParameterizedJob:<nil> ParentID: Payload:<nil> Periodic:<nil> Priority:50 Region:global Spreads:<nil> Stable:false Status:running StatusDescription: Stop:false SubmitTime:1.7181866887586696e+18 TaskGroups:[map[Affinities:<nil> Constraints:[map[LTarget:${attr.os.signals} Operand:set_contains RTarget:SIGKILL]] Consul:map[Cluster:default Namespace: Partition:] Count:1 Disconnect:<nil> EphemeralDisk:map[Migrate:false SizeMB:10 Sticky:false] MaxClientDisconnect:<nil> Meta:<nil> Migrate:<nil> Name:default-group Networks:<nil> PreventRescheduleOnLost:false ReschedulePolicy:map[Attempts:0 Delay:5e+09 DelayFunction:constant Interval:8.64e+13 MaxDelay:0 Unlimited:true] RestartPolicy:map[Attempts:3 Delay:0 Interval:8.64e+13 Mode:fail RenderTemplates:false] Scaling:map[CreateIndex:769787 Enabled:true ID:1e4b91f1-7ea2-ea3d-f063-a7c3435d1d1e Max:300 Min:0 ModifyIndex:769787 Policy:<nil> Target:map[Group:default-group Job:29-33eaa850-28a3-11ef-920d-fa163efe023e Namespace:poseidon] Type:horizontal] Services:<nil> ShutdownDelay:<nil> Spreads:[map[Attribute:${node.unique.name} SpreadTarget:<nil> Weight:100]] StopAfterClientDisconnect:<nil> Tasks:[map[Actions:<nil> Affinities:<nil> Artifacts:<nil> CSIPluginConfig:<nil> Config:map[args:[infinity] command:sleep force_pull:false image:openhpi/co_execenv_python:3.8 network_mode:none] Constraints:<nil> Consul:<nil> DispatchPayload:<nil> Driver:docker Env:<nil> Identities:<nil> Identity:map[Audience:[nomadproject.io] ChangeMode: ChangeSignal: Env:false File:false Name:default ServiceName: TTL:0] KillSignal:SIGKILL KillTimeout:5e+09 
Kind: Leader:false Lifecycle:<nil> LogConfig:map[Disabled:false MaxFileSizeMB:1 MaxFiles:1] Meta:<nil> Name:default-task Resources:map[CPU:20 Cores:0 Devices:<nil> DiskMB:0 IOPS:0 MemoryMB:30 MemoryMaxMB:256 NUMA:<nil> Networks:<nil>] RestartPolicy:map[Attempts:3 Delay:0 Interval:8.64e+13 Mode:fail RenderTemplates:false] ScalingPolicies:<nil> Schedule:<nil> Services:<nil> ShutdownDelay:0 Templates:<nil> User: Vault:<nil> VolumeMounts:<nil>]] Update:<nil> Volumes:<nil>] map[Affinities:<nil> Constraints:<nil> Consul:map[Cluster:default Namespace: Partition:] Count:0 Disconnect:<nil> EphemeralDisk:map[Migrate:false SizeMB:300 Sticky:false] MaxClientDisconnect:<nil> Meta:map[used:false] Migrate:<nil> Name:config Networks:<nil> PreventRescheduleOnLost:false ReschedulePolicy:map[Attempts:1 Delay:5e+09 DelayFunction:constant Interval:8.64e+13 MaxDelay:0 Unlimited:false] RestartPolicy:map[Attempts:3 Delay:1.5e+10 Interval:8.64e+13 Mode:fail RenderTemplates:false] Scaling:<nil> Services:<nil> ShutdownDelay:<nil> Spreads:<nil> StopAfterClientDisconnect:<nil> Tasks:[map[Actions:<nil> Affinities:<nil> Artifacts:<nil> CSIPluginConfig:<nil> Config:map[command:true] Constraints:<nil> Consul:<nil> DispatchPayload:<nil> Driver:exec Env:<nil> Identities:<nil> Identity:map[Audience:[nomadproject.io] ChangeMode: ChangeSignal: Env:false File:false Name:default ServiceName: TTL:0] KillSignal: KillTimeout:5e+09 Kind: Leader:false Lifecycle:<nil> LogConfig:map[Disabled:false MaxFileSizeMB:1 MaxFiles:1] Meta:<nil> Name:config Resources:map[CPU:1 Cores:0 Devices:<nil> DiskMB:0 IOPS:0 MemoryMB:10 MemoryMaxMB:0 NUMA:<nil> Networks:<nil>] RestartPolicy:map[Attempts:3 Delay:1.5e+10 Interval:8.64e+13 Mode:fail RenderTemplates:false] ScalingPolicies:<nil> Schedule:<nil> Services:<nil> ShutdownDelay:0 Templates:<nil> User: Vault:<nil> VolumeMounts:<nil>]] Update:<nil> Volumes:<nil>]] Type:batch UI:<nil> Update:map[AutoPromote:false AutoRevert:false Canary:0 HealthCheck: HealthyDeadline:0 
MaxParallel:0 MinHealthyTime:0 ProgressDeadline:0 Stagger:0] VaultNamespace: VaultToken: Version:0]],payload,poseidon_nomad_events,29-33eaa850-28a3-11ef-920d-fa163efe023e,production,10:04:48.853090004,Job,PlanResult
2024-06-12T11:00:00.314133593Z,map[Job:map[Affinities:<nil> AllAtOnce:false Constraints:<nil> ConsulNamespace: ConsulToken: CreateIndex:769787 Datacenters:[dc1] DispatchIdempotencyToken: Dispatched:false ID:29-33eaa850-28a3-11ef-920d-fa163efe023e JobModifyIndex:770052 Meta:<nil> ModifyIndex:770052 Multiregion:<nil> Name:29-33eaa850-28a3-11ef-920d-fa163efe023e Namespace:poseidon NodePool:default NomadTokenID: ParameterizedJob:<nil> ParentID: Payload:<nil> Periodic:<nil> Priority:50 Region:global Spreads:<nil> Stable:false Status:running StatusDescription: Stop:false SubmitTime:1.718189999948456e+18 TaskGroups:[map[Affinities:<nil> Constraints:[map[LTarget:${attr.os.signals} Operand:set_contains RTarget:SIGKILL]] Consul:map[Cluster:default Namespace: Partition:] Count:1 Disconnect:<nil> EphemeralDisk:map[Migrate:false SizeMB:10 Sticky:false] MaxClientDisconnect:<nil> Meta:<nil> Migrate:<nil> Name:default-group Networks:<nil> PreventRescheduleOnLost:false ReschedulePolicy:map[Attempts:0 Delay:5e+09 DelayFunction:constant Interval:8.64e+13 MaxDelay:0 Unlimited:true] RestartPolicy:map[Attempts:3 Delay:0 Interval:8.64e+13 Mode:fail RenderTemplates:false] Scaling:map[CreateIndex:0 Enabled:true ID:1e4b91f1-7ea2-ea3d-f063-a7c3435d1d1e Max:300 Min:0 ModifyIndex:0 Policy:<nil> Target:map[Group:default-group Job:29-33eaa850-28a3-11ef-920d-fa163efe023e Namespace:poseidon] Type:horizontal] Services:<nil> ShutdownDelay:<nil> Spreads:[map[Attribute:${node.unique.name} SpreadTarget:<nil> Weight:100]] StopAfterClientDisconnect:<nil> Tasks:[map[Actions:<nil> Affinities:<nil> Artifacts:<nil> CSIPluginConfig:<nil> Config:map[args:[infinity] command:sleep force_pull:false image:openhpi/co_execenv_python:3.8 network_mode:none] Constraints:<nil> Consul:<nil> DispatchPayload:<nil> Driver:docker Env:<nil> Identities:<nil> Identity:map[Audience:[nomadproject.io] ChangeMode: ChangeSignal: Env:false File:false Name:default ServiceName: TTL:0] KillSignal:SIGKILL KillTimeout:5e+09 Kind: 
Leader:false Lifecycle:<nil> LogConfig:map[Disabled:false MaxFileSizeMB:1 MaxFiles:1] Meta:<nil> Name:default-task Resources:map[CPU:20 Cores:0 Devices:<nil> DiskMB:0 IOPS:0 MemoryMB:30 MemoryMaxMB:256 NUMA:<nil> Networks:<nil>] RestartPolicy:map[Attempts:3 Delay:0 Interval:8.64e+13 Mode:fail RenderTemplates:false] ScalingPolicies:<nil> Schedule:<nil> Services:<nil> ShutdownDelay:0 Templates:<nil> User: Vault:<nil> VolumeMounts:<nil>]] Update:<nil> Volumes:<nil>] map[Affinities:<nil> Constraints:<nil> Consul:map[Cluster:default Namespace: Partition:] Count:0 Disconnect:<nil> EphemeralDisk:map[Migrate:false SizeMB:300 Sticky:false] MaxClientDisconnect:<nil> Meta:map[timeout:180 used:true] Migrate:<nil> Name:config Networks:<nil> PreventRescheduleOnLost:false ReschedulePolicy:map[Attempts:1 Delay:5e+09 DelayFunction:constant Interval:8.64e+13 MaxDelay:0 Unlimited:false] RestartPolicy:map[Attempts:3 Delay:1.5e+10 Interval:8.64e+13 Mode:fail RenderTemplates:false] Scaling:<nil> Services:<nil> ShutdownDelay:<nil> Spreads:<nil> StopAfterClientDisconnect:<nil> Tasks:[map[Actions:<nil> Affinities:<nil> Artifacts:<nil> CSIPluginConfig:<nil> Config:map[command:true] Constraints:<nil> Consul:<nil> DispatchPayload:<nil> Driver:exec Env:<nil> Identities:<nil> Identity:map[Audience:[nomadproject.io] ChangeMode: ChangeSignal: Env:false File:false Name:default ServiceName: TTL:0] KillSignal: KillTimeout:5e+09 Kind: Leader:false Lifecycle:<nil> LogConfig:map[Disabled:false MaxFileSizeMB:1 MaxFiles:1] Meta:<nil> Name:config Resources:map[CPU:1 Cores:0 Devices:<nil> DiskMB:0 IOPS:0 MemoryMB:10 MemoryMaxMB:0 NUMA:<nil> Networks:<nil>] RestartPolicy:map[Attempts:3 Delay:1.5e+10 Interval:8.64e+13 Mode:fail RenderTemplates:false] ScalingPolicies:<nil> Schedule:<nil> Services:<nil> ShutdownDelay:0 Templates:<nil> User: Vault:<nil> VolumeMounts:<nil>]] Update:<nil> Volumes:<nil>]] Type:batch UI:<nil> Update:map[AutoPromote:false AutoRevert:false Canary:0 HealthCheck: HealthyDeadline:0 
MaxParallel:0 MinHealthyTime:0 ProgressDeadline:0 Stagger:0] VaultNamespace: VaultToken: Version:1]],payload,poseidon_nomad_events,29-33eaa850-28a3-11ef-920d-fa163efe023e,production,10:59:59.258575624,Job,JobRegistered
2024-06-12T11:08:04.318388598Z,map[Job:map[Affinities:<nil> AllAtOnce:false Constraints:<nil> ConsulNamespace: ConsulToken: CreateIndex:769787 Datacenters:[dc1] DispatchIdempotencyToken: Dispatched:false ID:29-33eaa850-28a3-11ef-920d-fa163efe023e JobModifyIndex:770052 Meta:<nil> ModifyIndex:770052 Multiregion:<nil> Name:29-33eaa850-28a3-11ef-920d-fa163efe023e Namespace:poseidon NodePool:default NomadTokenID: ParameterizedJob:<nil> ParentID: Payload:<nil> Periodic:<nil> Priority:50 Region:global Spreads:<nil> Stable:false Status:running StatusDescription: Stop:false SubmitTime:1.718189999948456e+18 TaskGroups:[map[Affinities:<nil> Constraints:[map[LTarget:${attr.os.signals} Operand:set_contains RTarget:SIGKILL]] Consul:map[Cluster:default Namespace: Partition:] Count:1 Disconnect:<nil> EphemeralDisk:map[Migrate:false SizeMB:10 Sticky:false] MaxClientDisconnect:<nil> Meta:<nil> Migrate:<nil> Name:default-group Networks:<nil> PreventRescheduleOnLost:false ReschedulePolicy:map[Attempts:0 Delay:5e+09 DelayFunction:constant Interval:8.64e+13 MaxDelay:0 Unlimited:true] RestartPolicy:map[Attempts:3 Delay:0 Interval:8.64e+13 Mode:fail RenderTemplates:false] Scaling:map[CreateIndex:0 Enabled:true ID:1e4b91f1-7ea2-ea3d-f063-a7c3435d1d1e Max:300 Min:0 ModifyIndex:0 Policy:<nil> Target:map[Group:default-group Job:29-33eaa850-28a3-11ef-920d-fa163efe023e Namespace:poseidon] Type:horizontal] Services:<nil> ShutdownDelay:<nil> Spreads:[map[Attribute:${node.unique.name} SpreadTarget:<nil> Weight:100]] StopAfterClientDisconnect:<nil> Tasks:[map[Actions:<nil> Affinities:<nil> Artifacts:<nil> CSIPluginConfig:<nil> Config:map[args:[infinity] command:sleep force_pull:false image:openhpi/co_execenv_python:3.8 network_mode:none] Constraints:<nil> Consul:<nil> DispatchPayload:<nil> Driver:docker Env:<nil> Identities:<nil> Identity:map[Audience:[nomadproject.io] ChangeMode: ChangeSignal: Env:false File:false Name:default ServiceName: TTL:0] KillSignal:SIGKILL KillTimeout:5e+09 Kind: 
Leader:false Lifecycle:<nil> LogConfig:map[Disabled:false MaxFileSizeMB:1 MaxFiles:1] Meta:<nil> Name:default-task Resources:map[CPU:20 Cores:0 Devices:<nil> DiskMB:0 IOPS:0 MemoryMB:30 MemoryMaxMB:256 NUMA:<nil> Networks:<nil>] RestartPolicy:map[Attempts:3 Delay:0 Interval:8.64e+13 Mode:fail RenderTemplates:false] ScalingPolicies:<nil> Schedule:<nil> Services:<nil> ShutdownDelay:0 Templates:<nil> User: Vault:<nil> VolumeMounts:<nil>]] Update:<nil> Volumes:<nil>] map[Affinities:<nil> Constraints:<nil> Consul:map[Cluster:default Namespace: Partition:] Count:0 Disconnect:<nil> EphemeralDisk:map[Migrate:false SizeMB:300 Sticky:false] MaxClientDisconnect:<nil> Meta:map[timeout:180 used:true] Migrate:<nil> Name:config Networks:<nil> PreventRescheduleOnLost:false ReschedulePolicy:map[Attempts:1 Delay:5e+09 DelayFunction:constant Interval:8.64e+13 MaxDelay:0 Unlimited:false] RestartPolicy:map[Attempts:3 Delay:1.5e+10 Interval:8.64e+13 Mode:fail RenderTemplates:false] Scaling:<nil> Services:<nil> ShutdownDelay:<nil> Spreads:<nil> StopAfterClientDisconnect:<nil> Tasks:[map[Actions:<nil> Affinities:<nil> Artifacts:<nil> CSIPluginConfig:<nil> Config:map[command:true] Constraints:<nil> Consul:<nil> DispatchPayload:<nil> Driver:exec Env:<nil> Identities:<nil> Identity:map[Audience:[nomadproject.io] ChangeMode: ChangeSignal: Env:false File:false Name:default ServiceName: TTL:0] KillSignal: KillTimeout:5e+09 Kind: Leader:false Lifecycle:<nil> LogConfig:map[Disabled:false MaxFileSizeMB:1 MaxFiles:1] Meta:<nil> Name:config Resources:map[CPU:1 Cores:0 Devices:<nil> DiskMB:0 IOPS:0 MemoryMB:10 MemoryMaxMB:0 NUMA:<nil> Networks:<nil>] RestartPolicy:map[Attempts:3 Delay:1.5e+10 Interval:8.64e+13 Mode:fail RenderTemplates:false] ScalingPolicies:<nil> Schedule:<nil> Services:<nil> ShutdownDelay:0 Templates:<nil> User: Vault:<nil> VolumeMounts:<nil>]] Update:<nil> Volumes:<nil>]] Type:batch UI:<nil> Update:map[AutoPromote:false AutoRevert:false Canary:0 HealthCheck: HealthyDeadline:0 
MaxParallel:0 MinHealthyTime:0 ProgressDeadline:0 Stagger:0] VaultNamespace: VaultToken: Version:1]],payload,poseidon_nomad_events,29-33eaa850-28a3-11ef-920d-fa163efe023e,production,11:08:04.308806678,Job,JobDeregistered

This raises the question of whether the Sentry issue (see #406) indicates a changed allocation ID when both Nomad and Poseidon crashed during a migration. Alternatively, we may have ignored an important event: time="2024-06-12T10:59:59.280162Z" level=debug msg="Ignoring duplicate event" allocID=54750d38-7bb8-978c-1f0a-1ca64f1c70b4 package=nomad.

This should be fixed together with #602 and #612.


Another case is 29-f6160f46-0e6b-11ef-97ca-fa163e7afdf8 on the 10th of May.

@mpass99
Contributor Author

mpass99 commented Sep 9, 2024

Let's have a look at recent events.

Flux Query

To identify the allocations that have been created but not deleted, we can create a Flux query.

import "join"

// All allocation events in the selected time range.
data = from(bucket: "poseidon")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "poseidon_nomad_allocations")

// Number of distinct allocation ids per event type.
areAllocationsLeaking = data
  |> keep(columns: ["id", "event_type", "_value"])
  |> group(columns: ["id", "event_type"])
  |> unique(column: "id")
  |> group(columns: ["event_type"])
  |> count()

// Distinct ids that have a creation event.
created = data
  |> filter(fn: (r) => r["event_type"] == "creation")
  |> group(columns: ["id"])
  |> unique(column: "id")
  |> keep(columns: ["id", "_value"])

// Distinct ids that have any event other than creation.
deleted = data
  |> filter(fn: (r) => r["event_type"] != "creation")
  |> group(columns: ["id"])
  |> unique(column: "id")
  |> keep(columns: ["id", "_value"])

// Left join keeps every created id, with or without a matching deletion.
joined = join.left(
    left: created,
    right: deleted,
    on: (l, r) => l.id == r.id,
    as: (l, r) => ({_value: l._value, r_value: r._value, id: l.id, r_id: r.id}),
)

// Created ids without a matching row on the right side.
createdNotDeleted = joined
  |> filter(fn: (r) => not exists r["r_value"])
  |> keep(columns: ["id"])

// areAllocationsLeaking
createdNotDeleted

areAllocationsLeaking tells us that, in the selected time range, there is a mismatch of 57 runners between created and deleted allocations.
The production environment statistics show that we currently have 57 idle runners.
These idle runners have been created but not yet deleted, which fully explains the mismatch. Therefore, we have not leaked any allocation data since today at 10:27 UTC.
Checking again soon.
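
For spot checks outside InfluxDB, the logic of the Flux query above can be sketched as a plain set difference. This is an illustration under assumptions, not part of Poseidon: the records are made-up (allocation id, event type) pairs, only the event type "creation" comes from the query, and both function names are hypothetical.

```python
# Hypothetical (allocation id, event_type) records. Only "creation" is taken
# from the query above; the ids and the "deletion" name are made up.
records = [
    ("alloc-a", "creation"),
    ("alloc-a", "deletion"),
    ("alloc-b", "creation"),
    ("alloc-c", "creation"),
    ("alloc-c", "deletion"),
]


def distinct_ids_per_event_type(rows):
    """Count distinct ids per event type (analogous to areAllocationsLeaking)."""
    ids = {}
    for alloc_id, event_type in rows:
        ids.setdefault(event_type, set()).add(alloc_id)
    return {event_type: len(id_set) for event_type, id_set in ids.items()}


def created_not_deleted(rows):
    """Return sorted ids with a creation event but no non-creation event."""
    created, deleted = set(), set()
    for alloc_id, event_type in rows:
        if event_type == "creation":
            created.add(alloc_id)
        else:
            # Mirrors the query's filter event_type != "creation".
            deleted.add(alloc_id)
    return sorted(created - deleted)


print(distinct_ids_per_event_type(records))
print(created_not_deleted(records))
```

An id in the second result is only a leak if its runner is no longer alive. As noted above, idle runners legitimately show up as created but not deleted, so the count has to be compared against the number of idle runners.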

@mpass99
Contributor Author

mpass99 commented Sep 11, 2024

No allocation data has leaked since the last Poseidon restart today at 09:04 AM UTC.

@MrSerth
Member

MrSerth commented Sep 25, 2024

We checked again today and did not identify any mismatches between the number of created and deleted allocations. Hence, we assume that the issue has been fixed and is no longer occurring. Closing it. 🙂

@MrSerth MrSerth closed this as completed Sep 25, 2024