consumer: sendRDY error not propagating #199
For a little bit of extra context: this seems to require a pretty specific set of circumstances for us. When the tunnel drops, sometimes it's detected and a reconnect happens; other times we see this:
When we fall into this mode, we do not observe a reconnect (even though the tunnel would eventually come back up on its own, we'd need to reinitialize the connection to NSQ).
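As an aside, a minimal sketch of the kind of reinitialization being described: a watchdog loop that rebuilds the consumer when it has silently lost all of its connections. `Stats()`, `Stop()`, and `StopChan` are real go-nsq API; the polling interval and the `rebuild` hook are assumptions, not library code.

```go
package sketch

import (
	"log"
	"time"

	"github.com/nsqio/go-nsq"
)

// Hypothetical watchdog: if the consumer reports zero open connections,
// tear it down and build a fresh one. Stats(), Stop(), and StopChan are
// part of go-nsq's public API; everything else here is illustrative.
func watchConsumer(consumer *nsq.Consumer, rebuild func() (*nsq.Consumer, error)) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		if consumer.Stats().Connections > 0 {
			continue // still connected; nothing to do
		}
		consumer.Stop()
		<-consumer.StopChan // wait for the old consumer to finish shutting down
		fresh, err := rebuild()
		if err != nil {
			log.Printf("rebuilding consumer failed: %v", err)
			continue
		}
		consumer = fresh
	}
}
```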
@armcknight @bmhatfield thanks for the detailed info, I'll try to take a look at this!
I just poked around at this. Despite not handling the returned errors in the places noted above, the connection should still end up closed and cleaned up. The only reason why it wouldn't is if messages are in flight and never "complete", meaning the rest of the cleanup logic doesn't execute. This is probably a poor assumption, though; perhaps we should bound this with some timeout. Thoughts?
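A minimal sketch of the kind of bounded wait being proposed, assuming a hypothetical `drained` channel that closes once in-flight messages complete (illustrative only, not go-nsq internals):

```go
package sketch

import "time"

// Hedged sketch of the timeout bound proposed above: rather than waiting
// indefinitely for in-flight messages to complete, give up after a bound
// and run the remaining cleanup anyway. The drained channel, the cleanup
// hook, and the 30s value are all illustrative, not go-nsq internals.
func waitForDrain(drained <-chan struct{}, cleanup func()) {
	select {
	case <-drained:
		// normal path: all in-flight messages completed
	case <-time.After(30 * time.Second):
		// messages never "completed"; don't let cleanup hang forever
	}
	cleanup()
}
```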
We resolved this recently on our end; your explanation is pretty spot-on for what we were experiencing. We had messages still in flight when the RDY count was getting redistributed, which caused the connection with the in-flight messages to close prematurely. We fixed this by upping [...]. I don't understand enough of the inner workings of this package to comment on whether or not there should be a timeout on this operation in the client, so I'll leave that up to you, but hopefully the way we resolved this internally may provide some assistance in making that decision.
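The exact setting referenced above isn't preserved in the thread, so purely as an illustration of the mechanism: go-nsq exposes such knobs on `nsq.Config`. `MsgTimeout` and `MaxInFlight` are real fields; which setting actually applied here, and the values used, are assumptions.

```go
package sketch

import (
	"time"

	"github.com/nsqio/go-nsq"
)

// Illustrative only: two real nsq.Config fields that govern in-flight
// behavior. Which knob the commenters actually raised is unknown; the
// topic, channel, and values here are arbitrary.
func newConsumer() (*nsq.Consumer, error) {
	cfg := nsq.NewConfig()
	cfg.MsgTimeout = 5 * time.Minute // allow in-flight messages longer to complete
	cfg.MaxInFlight = 8              // bound how many messages can be in flight at once
	return nsq.NewConsumer("events", "workers", cfg)
}
```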
Hello,
first of all, apologies if my terminology is a bit off; I'm not a regular Go programmer :)
We run a process reading from NSQ servers over an SSH tunnel. While debugging an issue where this connection breaks, we found a potential problem with how an error from `sendRDY` will not fully propagate.

`sendRDY` possibly emits an error (go-nsq/consumer.go, lines 950 to 964 in d71fb89). `updateRDY`, which calls `sendRDY`, also possibly emits an error (go-nsq/consumer.go, line 907 in d71fb89). But that error isn't handled in its own recursive call (go-nsq/consumer.go, line 940 in d71fb89).

We were thinking that the failure of the error to fully propagate means our process doesn't pick up the loss of connection and doesn't know to attempt a mitigation.
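To make the shape of the problem concrete, here is a hedged, heavily simplified sketch of the pattern described above (not the actual go-nsq source; the names and the retry condition are stand-ins):

```go
package sketch

import "time"

// Simplified sketch of the propagation gap: updateRDY can return an
// error, but the retry it schedules for itself discards that error, so
// a persistent write failure never reaches anything that could react.
type conn struct{}

func sendRDY(c *conn, count int64) error {
	// writes the RDY command to the connection; returns an error on failure
	return nil
}

func updateRDY(c *conn, count int64) error {
	if mustRetryLater(c) {
		time.AfterFunc(5*time.Second, func() {
			updateRDY(c, count) // return value silently dropped -- the gap
		})
		return nil
	}
	return sendRDY(c, count)
}

func mustRetryLater(c *conn) bool { return false } // placeholder condition
```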
We also found a few other invocations of `updateRDY` that don't appear to handle errors; both appear in `startStopContinueBackoff`, which doesn't report that it can return an error (go-nsq/consumer.go, line 761 in d71fb89). The unhandled calls are at line 795 and line 810 in d71fb89.
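As an illustration only, one direction a fix could take at those call sites might look like the following (a sketch assuming it's acceptable to surface or act on the error there; these are not the real go-nsq signatures):

```go
package sketch

import "log"

type conn struct{}

func updateRDY(c *conn, count int64) error { return nil } // stand-in

// Hedged sketch of a fix direction for the call sites listed above:
// check updateRDY's returned error instead of discarding it, so a dead
// connection at least surfaces rather than stalling silently.
func backoffStep(c *conn, count int64) {
	if err := updateRDY(c, count); err != nil {
		log.Printf("failed to update RDY: %v", err)
		// optionally: close the connection here to force the normal
		// cleanup/reconnect path
	}
}
```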