Move test.arcticdata.io to Kubernetes #1932
Similar to #1797, this requires the following steps:
Found entries like this near the end of each indexer log (see https://www.rabbitmq.com/docs/consumers#acknowledgement-timeout):

```
dataone-indexer 20240718-20:40:37: [ERROR]: IndexWorker.indexOjbect - identifier:
resource_map_urn:uuid:9a4bed73-b844-4415-b379-fb57bb0accb0 , the index type:
create, sending acknowledgement back to rabbitmq failed since channel is already
closed due to channel error; protocol method: #method<channel.close>(reply-code=406,
reply-text=PRECONDITION_FAILED - delivery acknowledgement on channel 1 timed out.
Timeout value used: 1800000 ms. This timeout value can be configured, see consumers
doc guide to learn more, class-id=0, method-id=0). So rabbitmq may resend the
message again [org.dataone.cn.indexer.IndexWorker:indexOjbect:389]
```
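For context, this is RabbitMQ's delivery acknowledgement timeout: if a consumer holds a delivery longer than the broker's `consumer_timeout` (set in rabbitmq.conf; the 1800000 ms in the log is the 30-minute default), the broker closes the channel with `PRECONDITION_FAILED`, and any late `basicAck` then fails. Here's a minimal sketch with the RabbitMQ Java client showing how an ack-after-processing consumer hits this; the queue name and `indexObject` call are hypothetical stand-ins, not the actual IndexWorker code:

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;

public class SlowConsumerSketch {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");
        Connection conn = factory.newConnection();
        Channel channel = conn.createChannel();
        channel.basicQos(1); // one unacked delivery at a time

        DeliverCallback onDeliver = (consumerTag, delivery) -> {
            indexObject(delivery.getBody()); // may run for hours on a big resource map
            // If the work above outlived the broker's consumer_timeout, the
            // channel is already closed (reply-code=406, PRECONDITION_FAILED),
            // this ack throws, and the broker redelivers the message.
            channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
        };
        channel.basicConsume("index", false /* manual acks */, onDeliver, tag -> { });
        // (sketch only: a real worker keeps the process and connection alive)
    }

    private static void indexObject(byte[] body) { /* long-running indexing */ }
}
```

Raising `consumer_timeout` (the later log's 3600000 ms suggests it was bumped to an hour here) only widens the window; any task slower than the limit still triggers a redelivery.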
We had this same problem with MetaDIG and rabbitmq. @jeanetteclark solved this by handling long timeouts independently and tracking task completion in the database. It sort of defeats the purpose of a message queue if we have to track task completion outside of the queue. @jeanetteclark can provide details.
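For reference, a minimal sketch of that pattern, assuming a hypothetical `index_tasks` table and JDBC URL (not MetaDIG's actual schema): the worker acks each delivery as soon as it arrives, so the broker timeout never fires, and real completion is recorded in the database instead:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Assumed (hypothetical) tracking table:
//   CREATE TABLE index_tasks (pid TEXT PRIMARY KEY, status TEXT);
public class TaskTracker {
    private final Connection db;

    public TaskTracker(String jdbcUrl) throws Exception {
        this.db = DriverManager.getConnection(jdbcUrl);
    }

    // Called right after basicAck: the queue is already done with the
    // message, so completion is tracked here instead of by the broker.
    public void started(String pid) throws Exception {
        try (PreparedStatement st = db.prepareStatement(
                "INSERT INTO index_tasks (pid, status) VALUES (?, 'processing')")) {
            st.setString(1, pid);
            st.executeUpdate();
        }
    }

    public void completed(String pid) throws Exception {
        try (PreparedStatement st = db.prepareStatement(
                "UPDATE index_tasks SET status = 'complete' WHERE pid = ?")) {
            st.setString(1, pid);
            st.executeUpdate();
        }
    }
}
```

The trade-off noted above follows directly: retries now require sweeping the table for rows stuck in 'processing', work the broker would otherwise do via redelivery.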
Thanks, @mbjones. Having dug deeper, I now recall that Jing changed the code to (temporarily) ack rmq immediately, instead of waiting until after processing is finished, so there should be no timeouts -- see this PR. (I think this was a temporary fix, inspired by @jeanetteclark's learnings.) However, if this code is working as intended, it's still a mystery why I'm seeing timeouts...? Examples:

This resourcemap took 3 hours 9 minutes (!) to index, with no timeout message:

```
dataone-indexer 20240722-20:51:24: [INFO]: IndexWorker.indexOjbect with the thread id 46 - Completed the index task from the index queue with the identifier: resource_map_urn:uuid:e9eefb55-2e60-4331-9965-9a8ea09f122b , the index type: create, the file path (null means not to have): autogen.2016091422025267642.1, the priotity: 1 and the time taking is 11368354 milliseconds [org.dataone.cn.indexer.IndexWorker:indexOjbect:416]
```

...and it was inserted into solr successfully:

```
dataone-indexer 20240722-20:51:24: [INFO]: SolrIndex.insert - finished to insert the solrDoc to the solr server for object resource_map_urn:uuid:e9eefb55-2e60-4331-9965-9a8ea09f122b and it took 9811449 milliseconds. [org.dataone.cn.indexer.SolrIndex:insert:424]
```

9811449 ms == 2 hours 43 minutes for the insert.

This resourcemap was smaller, but had a timeout message:

```
dataone-indexer 20240722-20:23:14: [ERROR]: IndexWorker.indexOjbect - identifier: resource_map_urn:uuid:4949791f-00f0-4603-97c3-12d58a3cd211 , the index type: create, sending acknowledgement back to rabbitmq failed since channel is already closed due to channel error; protocol method: #method<channel.close>(reply-code=406, reply-text=PRECONDITION_FAILED - delivery acknowledgement on channel 1 timed out. Timeout value used: 3600000 ms. This timeout value can be configured, see consumers doc guide to learn more, class-id=0, method-id=0). So rabbitmq may resend the message again [org.dataone.cn.indexer.IndexWorker:indexOjbect:389]
```

...but it was also inserted into solr successfully:

```
dataone-indexer 20240722-20:51:11: [INFO]: SolrIndex.insert - finished to insert the solrDoc to the solr server for object resource_map_urn:uuid:4949791f-00f0-4603-97c3-12d58a3cd211 and it took 1314284 milliseconds. [org.dataone.cn.indexer.SolrIndex:insert:424]
```

1314284 ms == 21 minutes for the insert.
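For comparison, the ack-immediately ordering described above just swaps the two calls from the earlier sketch (reusing its hypothetical `channel` and `indexObject`; this is a sketch of the idea, not the PR's actual code):

```java
DeliverCallback onDeliver = (consumerTag, delivery) -> {
    // Ack first: the broker's consumer_timeout clock stops here.
    channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
    indexObject(delivery.getBody()); // hours-long work no longer races the timeout
    // Trade-off: if the worker dies mid-index, the message is NOT redelivered.
};
```

With that ordering in place, a 406 ack timeout should be impossible, so one thing worth verifying is that the deployed image actually contains the PR.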
Reindex finished on 7/24. See additional info in

7/24/24
TOTAL TIME TAKEN FOR FULL REINDEX OF TEST.ARCTICDATA.IO with 25 indexers:
As our initial real-world test case, deploy test.arcticdata.io in the dev Kubernetes cluster, and use it to permanently replace the current legacy deployment.
This exercise will highlight which areas of the helm chart still need work, and which are the highest priority.