-
Notifications
You must be signed in to change notification settings - Fork 2.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
nsqd: remove --worker-id, replace with --node-id #844
Conversation
22e3612
to
d940129
Compare
added another change contemplated in #741, removing the ID from the metadata filename |
nsqd/nsqd.go
Outdated
@@ -370,6 +379,11 @@ func (n *NSQD) PersistMetadata() error { | |||
return err | |||
} | |||
|
|||
err = os.Remove(oldFileName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My concern with this approach is that it breaks the ability to roll back from a nsqd upgrade (old nsqd would not see new file).
As much as i hate things sticking around, I'd rather see a phased deprecation where one release persists both versions with defined precedence on startup (or requiring equality between two files), and a subsequent release remove the old file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good. will attempt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does that mean we should have a point release between where we are and 1.0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
strictly speaking I think we have 3 options w/r/t version promises. We
a) backport a few small deprecation/forwards compatible things that get released as 0.3.9/0.4 and complete removal in 1.0 (documenting the intermediate version requirements for clean transitions)
b) 1.0 deprecates and provides forwards compatibility w/ removal in 2.0
c) we just cold turkey switch in 1.0 and give strong documentation for the transition risk
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The flag is an easy case because it's hard to accidentally have nsqd running with the wrong ID. Deploying a newer or older version can also mean deploying with new or old flags as well (and if you get it wrong, nsqd stops and tells you).
The metadata file case is harder. A missing metadata file is normal for first-time run, it's not an error. We may want to keep this safety rail in for a "long" time anyway, because some users may not be upgrading all instances every time. If someone didn't update and validate all nodes/instances on this new point release, and later they find out about 1.0, they might end up in an easy-silent-data-loss situation, I guess.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
an idea: I don't know what it means for windows, but we could just drop a symlink in when we remove the old filename; that way i think we preserve rollback. We could then refuse to run if both are present (and one isn't a symlink) to make sure rolling forward again is clean. The symlink would mean most importantly we only update a single file going forward when persisting the metdata
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They symlink idea sounds fine to me. It might even work on windows, though nsqd would probably have to be run as "Administrator".
https://blogs.windows.com/buildingapps/2016/12/02/symlinks-windows-10/#dSlfXGIY3B5ZQF0B.97
golang/go@cf521ce
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My preference is (a), I dislike the idea of maintaining compatibility (and this symlink proposal sounds questionable).
That said, I don't think it's necessary to remove the the old file? If we leave it alone, the rollback path becomes easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The concern is that some messages will be processed before the version rollback, so the old metadata file will have stale info, be missing new messages, have references to queue files that don't exist anymore...
apps/nsqd/nsqd.go
Outdated
@@ -80,7 +80,8 @@ func nsqdFlagSet(opts *nsqd.Options) *flag.FlagSet { | |||
flagSet.Bool("version", false, "print version string") | |||
flagSet.Bool("verbose", false, "enable verbose logging") | |||
flagSet.String("config", "", "path to config file") | |||
flagSet.Int64("worker-id", opts.ID, "unique seed for message ID generation (int) in range [0,4096) (will default to a hash of hostname)") | |||
flagSet.Int64("node-id", opts.ID, "unique part for message IDs, (int) in range [0,1024) (default is hash of hostname)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
has this range been wrong this whole time? sheesh 😭
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Funny, but it was checked correctly in nsqd.New()
anyway. Maybe everybody just lets the default "hash of hostname" take care of it.
Pushed up change to maintain old and new metadata files in parallel. It wasn't possible for metadata loading to abort nsqd startup before, I changed that so it is ... |
21edcea
to
e1cd409
Compare
(squashed in minor fixes, left major commits for discussion) |
nsqd/nsqd.go
Outdated
} | ||
if err != nil { | ||
fn = fnID | ||
data = dataID |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've looked at this block multiple times and I still can't really make sense of it.
Is there a simpler way to implement this? What if LoadMetadata
took a filename and the caller did some of this compatibility logic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I use err and errID as a shorthand to indicate if fn or fnID exist. If it was any other error it would have aborted/returned already
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... so if err != nil
here, then fn
did not exist and data
is empty, but fnID
did exist and dataID
was successfully read into. Below code can get whichever was found using just fn
and data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all that said I'll be glad to refactor once I figure out what the final strategy will be :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The location of the comment is misleading, I was referring to the whole new set of changes related to error handling, not this specific line or if
block.
re: strategy, it seems like regardless of version choices, we're going to land forward/backward compatible code, so these changes will be necessary in some form.
The "what to do with the old metadata file" discussion is by default hidden as a "comment on outdated changes" so I'll continue it here. I do kinda like the symlink idea. By the way, on both macOS and Linux you can use rename to atomically replace a plain file with a symlink and vice-versa. Older nsqd will read the file via the symlink, save to the .tmp file, then move it over the symlink. @judwhite does nsqd normally run as "administrator" when run as a service, and is it reasonable to have windows users run it as administrator when run manually? Any thoughts on the windows symlink situation? |
some interesting reading: go 1.5+ apparently uses http://www.weirdnet.nl/apple/rename.html |
I would like to make sure we support forwards and backwards and forwards again safely. The scenario i am concerned about is nuanced, but it's: a) running nsqd persisting to nsqd.1.dat I think there are two sane options to support this back and forth scenario safely. da) refuse to startup if both files are present (And old isn't a symlink). db) use the file w/ the newer timestamp (if timestamps are equal; refuse to start if file contents differ) For me, this is a real aspect of trying deploying a canary build; reverting, and then deploying an updated canary again. I also don't think this is extremely unusual. |
Given that we decided to support Let's just keep the file name as is, it only matters for pedantic correctness. |
I just hate having to explain how to predict what a new hosts ID will be when I see files be attempted to be moved to a new host. I really want to see that go! |
e1cd409
to
e9d936e
Compare
I think of the data-path as a portable package of nsqd state. It doesn't make sense that if you decide to change the node-id, or move the state to a new host (and use automatic node-id), then you have to rename just this one file. It would be nice to clean this up, and now would be the time to start the transition (though not necessarily in this PR). By the way, I think it would be fine to drop backwards compatibility in a 1.x minor update, I don't think we have to wait until 2.0. This is because 1.0 will be forwards-compatible with that, it would only break compatibility with pre-1.0 - so this is a bit different than supporting/deprecating a flag that you can still use in 1.0. |
OK, those seem rare but reasonable arguments. I'll leave code review to @jehiah, good catch on the rename changes! |
e9d936e
to
8ac9734
Compare
nsqd/nsqd.go
Outdated
if err != nil { | ||
return err | ||
} | ||
|
||
stat, err := os.Lstat(fileNameID) | ||
if err != nil || (stat.Mode()&os.ModeSymlink) == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you're curious about the spacing around &
, gofmt does this to emphasize the operator precedence. I guess that means I don't need the parentheses, I can remove them if preferred.
golang/go#12720
"version": "1.0.0-alpha" | ||
}` | ||
|
||
tmpDir, err := ioutil.TempDir("", "nsq-test-") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TempDir()
does the random-suffix part automatically. Docs say:
Multiple programs calling TempDir simultaneously will not choose the same directory.
(I mention this because the mustStartNSQD()
helper composes its own random-ish suffix for TempDir() if DataPath is unset.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, another nice find/cleanup
52b62ea
to
b715e6e
Compare
I tested on Windows 7. Everything seems to work, when nsqd is run as Administrator (I used an Administrator PowerShell, and a windows build of curl to test). When not run as Administrator it fails to PersistMetadata() on startup and aborts, as expected. Still waiting for windows folks to comment on the symlink / Administrator / service / container situation on Windows, to see how acceptable this is. cc @judwhite @elvarb @esiqveland |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, this LGTM, but I think we should wait for someone w/ Windows to confirm expectations around Administrator privileges.
d5955b2
to
fcb2a58
Compare
@ploxiln I'm inclined to land this and folks can follow up with opinions if we need to make changes before release. Want to squash? |
fcb2a58
to
d9e6ecc
Compare
Sounds good. IMHO these commits are logically split into separate topics, but I have no problem squashing them all together as is preferred. Done. |
nsqd/nsqd.go
Outdated
err = os.Remove(fileNameID) | ||
} | ||
return err | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ploxiln if os.IsNotExist(err)
is true
you'll return a value for err
, is that intentional, or would you prefer to return nil
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a bug, I was trying to be too concise, nice catch.
Looks like I'll back-out this windows-path variant anyway.
I think parity in the system between OS's is more important than my personal taste. Though we won't get there with symlinks, if there are going to be two files on Linux I'm fine with the same on Windows. I'll defer; whatever the core maintainers think is best. |
f16ede9
to
83ae2b2
Compare
I threw out the latest windows variant I mentioned yesterday, and squashed everything down as I found appropriate, just now. |
@jehiah what do you mean? It's deterministic, why can't it be scripted? I agree the symlink makes rollback easier, obviously. |
If you want the symlink, fine, let's just land this already. My only requirement is that I don't think the |
So in 1.0, you want it to
|
IMO it should read/write to the new metadata file and do nothing else. |
It seems harsh if one wouldn't be able to upgrade from 0.3.8 to 1.0 without either manually renaming the metadata file or losing topics/channels - unless there would be a pre-1.0 transition release series that was around for 6 months. I think the minimum that makes sense is to just fall back to reading the old metadata file if the new one does not yet exist. That's really not much code. And it's sufficient if you never rollback ... |
os.Rename() now does the same thing on Windows that atomicRename() did (since go 1.5)
no need to add our own pseudo-random "unique" suffix
symlink old metadata filename to new when loading, if both exist, ensure they match this makes rollback possible without losing messages (when rolling back forward, some manual intervention is required) on windows, Symlink() needs Administrator privs, so just write plain old metadata file includes tests
83ae2b2
to
3b4b6a5
Compare
For sake of discussion ... I've pushed up a commit that reduces this back down to what I think the minimum for 1.0 should be. |
With the "minimum back-compat" code, the rollback procedure would be something like:
I'd also like to point out that with either the symlink-style or minimal-style back-compat code, 1.0 being backward-compatible with 0.x would not necessitate all of 1.x keeping any back-compat code - if 1.0 was ever run it would write the new metadata file, which later 1.x would find even without any back-compat code. So this is not comparable to keeping a function present in a library. |
ping I think there are two options:
|
Sorry I wasn't more clear, I meant to land your changes including symlinks and stamp a non-1.0 release, then remove that temporary code and work towards 1.0. |
7678df0
to
3b4b6a5
Compare
I removed that last commit. This should be ready to go. |
ping |
<3 I heart a 1.0-migration release build for those who want easy rollout. |
FYI: i'm working to push this build out on the Bitly stack. I'm using the snippet below to prep our ability to rollback seamlessly (without causing complications rolling forward again)
|
@jehiah you building that with go 1.8 too by any chance? |
@mreiferson no unfortunately; this is my last pre-req task i want to finish before i start working on 1.8 on our stack. |
NP, just thinking in terms of a release for these changes I usually use the latest stable version of Go at the time. |
This is one of the for-1.0 changes contemplated in #741
This is the minimal-effort approach. If
--worker-id
is specified with an int, parsing options will fail, and the user will hopefully notice the change instead of just specifying "true" or "false" to work-around the failure. Or is backwards-compatibility (with just a "deprecated" warning) desired?