Deadlock during publishing #1305
We merged a bunch of PRs lately, but did not observe this so far. Are you using async mode for the REST API (i.e. getting a task.ID and waiting for it)? I would suspect the queue to somehow miss a task or not finish in here: 4503580. What distribution are you on? goxz is still used for building: https://github.com/aptly-dev/aptly/actions/runs/9545612734/job/26307439641, not sure how to get the symbols there.
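For context, a hedged sketch of what the async flow looks like from the client side; the `_async` query parameter and the `/api/tasks` endpoint are assumptions based on the aptly task API (check the API docs for the exact names), and the port, repo, and directory names are placeholders:

```sh
# Hedged sketch: fire the call asynchronously and poll the returned task.ID
# (the _async parameter and /api/tasks/<id> endpoint are assumptions; requires jq)
TASK_ID=$(curl -fsS -X POST \
    "http://localhost:8080/api/repos/myrepo/file/mydir?_async=true" | jq -r '.ID')
curl -fsS "http://localhost:8080/api/tasks/${TASK_ID}"   # repeat until the task reports a final state
```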
I'm on Ubuntu 22.04, and no, we do not use async mode. Would this be preferable?
I would not use async mode if your code does not need to do other things in parallel. The nightly builds should come with debug symbols in the aptly binary, so you should be able to analyze the core dump; you may need to have the source code in the working directory.
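If it helps, a quick way to check whether a given aptly binary still carries symbols (the install path is an assumption):

```sh
# "not stripped" in the output means the symbols are still there
file /usr/bin/aptly
# or look for debug sections explicitly
readelf -S /usr/bin/aptly | grep -i debug
```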
It seems there are only debug symbols for the runtime itself. (Stacks of all threads attached.)
How did you obtain the core dump? Did aptly crash or did you trigger it? How are you invoking gdb? I see the source code when I pass the -d argument and provide the git repo (checkout b5bf2cb, "Fix functional tests' '--capture' on Python 3"), which shows the source info.
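For reference, a minimal sketch of that kind of invocation; the binary path, core file name, and checkout location are assumptions:

```sh
# Check out the matching source and point gdb at it with -d
git clone https://github.com/aptly-dev/aptly && git -C aptly checkout b5bf2cb
gdb -d "$PWD/aptly" /usr/bin/aptly core.aptly
# inside gdb:
#   (gdb) thread apply all bt    # dump the stacks of all threads
```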
I attached to it and generated it with gdb. I actually think the stack I sent is full and correct; it's just in the runtime on all threads. I downloaded delve, which can print goroutine stacks. (Details attached.)
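A sketch of how the goroutine stacks can be pulled out with delve, assuming the daemon is still running (the PID lookup and paths are assumptions):

```sh
# Attach to the live process (alternatively: dlv core /usr/bin/aptly core.aptly)
dlv attach "$(pidof aptly)"
# inside delve:
#   (dlv) goroutines         # list every goroutine and its current frame
#   (dlv) goroutine 15 bt    # full backtrace of a single goroutine
#   (dlv) quit
```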
Thanks for the backtrace! Does it look like aptly was shutting down because of a signal (goroutine 15)? How did the builds "hang": did the REST API not return, or time out? Was aptly still responding to other APIs?
Right, the REST API isn't returning. Both are stuck (details attached).
Re signals, I don't think so; I think that's just the goroutine responsible for waiting for signals. It seems like the goroutines with [...] I'll leave the process running today; let me know if there is any info I can get for you by attaching. Did a little bit of digging and didn't see anything super obvious, but I'm not terribly familiar with the codebase.
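For what it's worth, a quick way to check whether the API server is still serving anything at all while the publish calls hang (the port is an assumption):

```sh
# If this returns within the timeout, the HTTP layer itself is still alive
curl -m 5 -fsS http://localhost:8080/api/version
```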
Looking at the previous backtrace, I think aptly is running with -no-lock, probably as configured in the systemd service. This disables database locking, but regarding concurrency I think locking that database would make sense. Could you try modifying the service and removing the -no-lock flag? For running aptly commands the service then needs to be stopped. I tested with /api/publish but could not reproduce it; will try with /api/repos/{repo}/file/{dir} ...
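A possible way to drop the flag, assuming aptly runs as a systemd unit called aptly.service with the default serve command:

```sh
# Create a drop-in that redefines ExecStart without -no-lock
sudo systemctl edit aptly.service
# in the editor that opens, add:
#   [Service]
#   ExecStart=
#   ExecStart=/usr/bin/aptly api serve -listen=:8080
sudo systemctl restart aptly.service
```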
I see. I thought this meant it would just lock during each transaction, which would allow using aptly on the command line. I'll remove it and let you know if I hit more hangs.
I haven't hit this hang again since removing that. Do you consider this a code bug or a doc bug? I think at the very minimum, if this is expected behavior, it should be listed in the help text under the -no-lock flag. Honestly, I thought [...]
Thanks for reporting back! I think the problem is that with -no-lock, the database is locked for each API call and unlocked when the API call finishes. Since we are queuing the API calls now, this leads to the deadlock. I will try to solve that and maybe make -no-lock obsolete...
Haha, right after I said that, it hung again. Two builds both hitting [...]. The last few lines in the log are [...]
The command line is [...]. Ya know, I think I might be uploading two things to the same "files" folder simultaneously; I should probably stop doing that. (Backtrace attached.)
I'm changing it to have a random component in the dir name; that makes a lot more sense haha. Not sure exactly how aptly should function in this weird edge case, but hanging seems wrong :)
That would have been my next question: are you using the same upload directory from different builders? :-) aptly will remove that directory, which then causes concurrency problems. There is also an option to not remove the files, but that needs separate cleanup. I would also suggest using a random name for concurrent uploads. The edge case is difficult to handle with the current implementation, I guess; I would prefer if aptly assigned an upload slot so the user does not need to manage that. I will look into the backtrace to check whether this is expected with the current implementation. Thanks again for all the testing and analysis! We'll get there :)
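A sketch of what a per-build upload directory could look like against the files API; the port, repo name, and package file are placeholders:

```sh
# Upload into a directory that is unique per build, then add its contents to the repo;
# aptly removes the upload directory after the files are added.
DIR="build-$(uuidgen)"
curl -fsS -X POST -F "file=@mypackage_1.0_amd64.deb" \
    "http://localhost:8080/api/files/${DIR}"
curl -fsS -X POST "http://localhost:8080/api/repos/myrepo/file/${DIR}"
```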
Hi @russelltg, I think the problem was the concurrent usage of the same upload dir. It had nothing to do with --no-lock; I assume you should be able to put that back. Let us know if the latest CI build (see #1345 for the new APT sources) works fine for you and if we can close the issue!
Hi @russelltg, there were more fixes for race conditions and concurrent operations merged to master! Please update your CI build (1.6.0~beta1+20241022145612.767bc6bd). I don't think -no-lock was an issue; you could also try to put this one back... Let us know how this works!
I'd like to try to update and check if the issue is resolved, but the newest version referenced in http://repo.aptly.info/dists/nightly/main/binary-amd64/Packages is [...]
The APT URL has changed, see #1345.
Thanks! Installed the Ubuntu 24.04 noble version; will report back in about a week or so on whether we still encounter the lockup.
We recently updated our aptly to nightly to get #1271 (an issue we ran into a bunch), but woke up today to two builds hung during publish. We use the REST API.
Context
I did save a core file; are there binaries with symbols available? Happy to update this with backtraces if I can get symbols.