Webhook timeout with Gitlab #282
@esalberg The webhook does not return the 200 immediately because it determines the return code based on the result of the full deployment. It is possible that the webhook will fire and start to operate as you'd expect, but then fail near the end and return a non-200 code. This isn't necessarily the optimal way to do this, but that is the reasoning behind it. I will try and do some research to determine the correct method of response for webhooks to see if we should continue with this or change the response sequence. If you or anyone else finds some good information on this subject, please let me know. Thanks!
Thanks - even with the timeout on the webhook increased, we're still seeing evidence of overlapping r10k runs, just not quite as frequently. I'm trying to find time to do further testing to validate that the webhook is the culprit (as suggested by Puppet Support).

Per the Gitlab documentation, Gitlab doesn't care about the return code at all, so is that just being used for logging? I'm not 100% sure how to interpret the last sentence - unless they're treating response code and status code as two separate things, I'm guessing it means that any valid HTTP response would be sufficient (successful, unsuccessful, etc. - as long as it gets back to Gitlab)?

If I wanted to test returning an HTTP response immediately, would I just need to put a "status 202" line (or similar) right after this code line in the webhook bin template? That way, I might at least be able to confirm that it's the webhook taking time to return that's causing the issue with deploying branches consistently across the masters, versus something else in the environment.

Another related thread:
Re: overlapping runs - is that because two PRs merged at the same time, or because of GitLab retrying? If the former, try using the branch from #268 to see if it addresses it. If the latter, that won't fix the underlying issue but may reduce side effects.

Github does care about the return code as far as logging: you can view a webhook and see the data sent and the return code received for every event. It otherwise doesn't care. I think this is one of those things where we have to decide - are we signaling that the webhook was received, or that the action of the webhook completed successfully? We are doing the latter currently.
That branch looks great - I'll test it out ASAP. Since the webhook is a template, I would think it could be possible to have an option to forgo the extra logging in favor of a quicker webhook return? Basically, let us decide whether we want to signal on option #1 or #2 instead of only using option #2. Granted, this may be moot if the updated code helps.
@esalberg If that does not work, or you'd rather just have it return early always, I whipped up this gist. You could save that as a template somewhere in your controlrepo, maybe in the profile class, and then point the webhook at it. I wrote this on the road, away from anywhere I can test it myself, so please keep that in mind - you'll definitely want to try it in a lab environment or be ready to roll back quickly if it breaks.
@esalberg Any chance you have been able to try this? Thanks.
Sorry about that - I got pulled off onto some other things, one of which was upgrading to PE 2016.4.2. Could there be an update to r10k in 2016.4.2 (versus 2016.2.1) that might have addressed the issue with multiple r10k runs writing to the same place? Instead of the previous log lines about local changes needing to be checked in, I now see a new type of message:

Overwriting local modifications to /etc/puppetlabs/code/environments/logstash/modules/augeasproviders_grub

Unlike before, it does resolve the conflict (overwriting local changes is the correct action). I still see errors like the following:

I wanted to reproduce the issue before trying the patch, but I have not been able to reproduce it so far, even though it looks like multiple webhook runs can still occur. I can try reducing the timeout back down and see if that causes any issues - unfortunately, I'm out on PTO until January, so I won't be able to test until then.

UPDATE: when I ran simultaneous pushes for the puppet_control and puppet_hieradata repos (both using r10k), 3 of 6 prod servers and 1 of 3 test servers returned the following for my test branch:

Process output showing multiple webhook deploys:

peadmin 24171 9871 0 01:05 ? 00:00:00 sh -c umask 0022; mco r10k deploy zzzzbranch
I implemented the latest r10k module code + fork and was not able to reproduce either the multiple webhook deploys or the "parent directory does not exist" error, both of which I could reproduce pretty consistently before the patch (even with a quick test). So I'm quite optimistic that this will work for us.

However, I had been using "include ::pe_r10k", which is no longer allowed (we were on an older version pre-voxpupuli). For the test, I switched to "include ::r10k" - but is it possible to use the pe_r10k module bundled with Puppet Enterprise and the r10k module's webhook (r10k::mcollective, r10k::webhook, etc.) at the same time?

I won't have a chance to test dropping the gitlab timeout (which I need to do) or the gist for the webhook return (which I probably don't need to do, as long as the locking continues to work) until January.
I don't know anything about the pe_r10k class so cannot comment on that aspect. However, the signs look good for the patch fixing things. Given you are probably focused on any remaining PE upgrade issues and then some vacation time, let's pick this back up in January. Thanks!

--
Rob Nelson
I was referring to this in init.pp:

My masters all have:

and the MoM has in addition:

The issue appears to be in both webhook.pp and mcollective.pp, which both have:

Commits where this was added (we are using an older commit): webhook.pp:

I noticed that the locking code was just merged today, so that's great. Any idea what should be done about that new require (I'm not sure why it was added, so I don't want to just say remove it - I'm assuming it was put in for a reason) so that I can move to the latest version of the module? Do you want me to move this to a new issue?
Yeah, let's make that a new issue. I'll try to test that change before a new release goes out.

Edit: New issue opened. Is it okay to resolve this issue now?
I'm facing the same issue: GitLab times out because the webhook only answers after 10 seconds or so. Basically, mco r10k blocks until all deploys have finished, and this is far too slow to avoid GitLab timeouts. I can avoid the problem by forcing a fork instead of using Open3.capture3, but to do that I had to modify the code. Now I'm a bit perplexed about how to avoid this situation. Basically, it would be better, if not optimal, for the webhook to return immediately, with the subsequent operations also handling the locking. What do you think?
This should be fixed as of #425.
Close per previous comment?
We had an issue where the default webhook timeout on Gitlab (10 seconds) was causing the webhook to fire multiple times and interfere with the creation of new directory environments.
For the short term, this has been resolved by increasing the Gitlab webhook timeout, but in the future we might have other webhooks that need a timeout shorter than 180 seconds (we may be able to reduce the timeout a bit, but some of the initial runs can take a while).
I wanted to bring up this Gitlab ticket and see if it might be worth modifying the webhook, at least for Gitlab (or an option to do so, since the webhook is a template):
https://gitlab.com/gitlab-org/gitlab-ce/issues/2229
"It's probably a better practice to change your Webhook to return 200 as soon as the message is successfully received, unless you really want the hook to retry upon some failure in the execution."
Is modifying the webhook so there's at least an option to return 200 right away something that might be possible? (I haven't written a webhook before myself, but if it's something where a PR would be helpful, I could definitely look at it - although I might not write it the best way.)