
Move test.arcticdata.io to Kubernetes #1932

Closed
artntek opened this issue May 21, 2024 · 7 comments
Assignees: artntek
Labels: k8s Kubernetes/Helm Related

artntek commented May 21, 2024

As our initial real-world test case, deploy test.arcticdata.io in the dev Kubernetes cluster, and use it to permanently replace the current legacy deployment.

This exercise will highlight which areas of the helm chart still need work, and which are highest priority.

@artntek artntek added the k8s Kubernetes/Helm Related label May 21, 2024
@artntek artntek self-assigned this May 21, 2024

artntek commented May 21, 2024

Similar to #1797, this requires the following steps:

  1. Work with @nickatnceas to copy production data for testing:

    • Copy the production Postgres data (test.adc.io:/var/lib/postgresql) to the ceph volume at /mnt/ceph/repos/arctic/postgresql (treat it like a hot backup).
    • Copy the production data from test.adc.io:/var/metacat to the ceph volume at /mnt/ceph/repos/arctic/metacat.
      • Time how long this takes.
  2. Create Persistent Volumes (see the PV sketch after this list):

    • Set up a PV pointing to /mnt/ceph/repos/arctic/metacat for metacat.
    • Set up a PV pointing to /mnt/ceph/repos/arctic/postgres for postgres.
  3. Do the manual upgrade steps:

    • Ensure the data is compatible with PG-14 (it may already be on v14; check).
  4. Deployment:

    • Create new values.yaml overrides to mirror the existing test.adc.io installation.
    • Figure out metacatui deployment changes, if needed.
    • Deploy to the dev cluster.
  5. Testing: run any existing client-library test suites we have available.
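
A minimal sketch of what the metacat PV in step 2 might look like, assuming the dev-cluster nodes already mount the ceph share at /mnt/ceph and exposing it via hostPath. The name, capacity, access mode, and storageClassName are placeholders rather than values from the actual chart, and the real cluster may use a CephFS CSI volume instead:

    # Sketch only -- metadata.name, storage size, accessModes, and
    # storageClassName are placeholders; hostPath assumes the k8s nodes
    # mount the ceph share locally at /mnt/ceph.
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: test-adc-metacat-pv          # hypothetical name
    spec:
      capacity:
        storage: 1Ti                     # placeholder size
      accessModes:
        - ReadWriteMany                  # assumed; adjust to how the chart mounts the volume
      persistentVolumeReclaimPolicy: Retain
      storageClassName: manual           # placeholder class
      hostPath:
        path: /mnt/ceph/repos/arctic/metacat

The postgres PV would be identical apart from the name and the /mnt/ceph/repos/arctic/postgres path.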


artntek commented Jul 22, 2024

The installation was successful, and everything is working fine as far as testing is concerned. However, when I try to reindex the entire corpus, I see strange behavior...

Tried this twice:

  1. First try: 5 indexers.
  2. Second try: after a fresh rsync of data from test.arcticdata.io, with 10 indexers, plus some postgres config tweaks to match the VM instance setup. Results for this second try:
  • Started 7/17/2024 1:37pm PDT
  • First 24 hours:
    • all 10 indexers fully in use
    • message rates just under 8/sec.
    • Max queue size less than 350 -- hardly had any build-up before objects were de-queued
[Screenshot: first 24 hours]
  • after 2pm 7/18, message rates tailed off to zero, and queue ramped from zero to ~22.5k objects:
[Screenshot: Jul 18th, 11am-7pm]
  • Indexers appeared to be working on resource maps (very slowly) at this stage. Example log entry:

    dataone-indexer 20240718-20:51:39: [INFO]: IndexWorker.indexOjbect with the thread id 44 - Completed the 
    index task from the index queue with the identifier: resource_map_urn:uuid:085be1bc-74c9-4e47-aecc-c8568e6629c4 , 
    the index type: create, the file path (null means not to have): autogen.2016091421595214374.1, the priotity: 1 
    and the time taking is 6174237 milliseconds [org.dataone.cn.indexer.IndexWorker:indexOjbect:416]
    • 20:51:39 UTC is 1:51pm PDT (also 7/18)
    • 6174237 ms ≈ 1 hr 43 min!
  • However, by 7/22/24, it was obvious nothing was happening: the queue still held ~22.5k entries, and there was no further indexer activity at all:

    • MetacatUI shows: DATASETS 1 TO 25 OF 26,383

    • The last Metacat log entry was on 18 July at 21:05:05 (UTC):

      192.168.192.155 - - [18/Jul/2024:21:05:05 +0000] "GET /metacat/d1/mn/v2/meta/urn:uuid:e9cedcd5-eef8-49d1-a1b3-a5a7349a44cf HTTP/1.1" 200 2349
    • The last indexer log entry was on 18 July at 21:07:05 (UTC):

      dataone-indexer 20240718-21:07:05: [INFO]: IndexWorker.indexOjbect with the thread id 38 - 
      Completed the index task from the index queue with the identifier: 
      resource_map_urn:uuid:e9eefb55-2e60-4331-9965-9a8ea09f122b , the index type: create, the 
      file path (null means not to have): autogen.2016091422025267642.1, the priotity: 1
       and the time taking is 1048455 milliseconds [org.dataone.cn.indexer.IndexWorker:indexOjbect:416]
      

Trying to debug...


artntek commented Jul 22, 2024

Found entries like this near the end of each indexer log:

dataone-indexer 20240718-20:40:37: [ERROR]: IndexWorker.indexOjbect - identifier:
resource_map_urn:uuid:9a4bed73-b844-4415-b379-fb57bb0accb0 , the index type:
create, sending acknowledgement back to rabbitmq failed since channel is already
closed due to channel error; protocol method: #method<channel.close>(reply-code=406,
reply-text=PRECONDITION_FAILED - delivery acknowledgement on channel 1 timed out.
Timeout value used: 1800000 ms. This timeout value can be configured, see consumers
doc guide to learn more, class-id=0, method-id=0). So rabbitmq may resend the
message again [org.dataone.cn.indexer.IndexWorker:indexOjbect:389]

see https://www.rabbitmq.com/docs/consumers#acknowledgement-timeout

  • 7/22 @ 10:39:54 PDT - changed the timeout from 30 mins to 1 hr and did a helm upgrade:

    dataone-indexer:
      rabbitmq:
        ## @param dataone-indexer.rabbitmq.extraConfiguration extra config, to be appended to rmq config
        ## Default consumer_timeout is 30 mins (i.e. 1800000 ms). Increase this to 1 hr to avoid timeouts
        ## with large resource maps
        extraConfiguration: |-
          consumer_timeout = 3600000
  • helm restarted all the indexer pods except one, and the rmq pod

  • Indexing now appears to have resumed...


mbjones commented Jul 22, 2024

We had this same problem with MetaDIG and rabbitmq. @jeanetteclark solved this by handling long timeouts independently and tracking task completion in the database. It sort of defeats the purpose of a message queue if we have to track task completion outside of the queue. @jeanetteclark can provide details.


artntek commented Jul 22, 2024

Thanks, @mbjones. Having dug deeper, I now recall that Jing changed the code to (temporarily) ack rmq immediately, instead of waiting until after processing is finished, so there should be no timeouts -- see this PR. (I think this was a temporary fix, inspired by @jeanetteclark's learnings.)

However, if this code is working as intended, it's still a mystery why I'm seeing timeouts...?

Examples:

This resourcemap took 3 hours 9 minutes (!) to index, with no timeout message:

dataone-indexer 20240722-20:51:24: [INFO]: IndexWorker.indexOjbect with the thread id 46 - Completed the index task from the index queue with the identifier: resource_map_urn:uuid:e9eefb55-2e60-4331-9965-9a8ea09f122b , the index type: create, the file path (null means not to have): autogen.2016091422025267642.1, the priotity: 1 and the time taking is 11368354 milliseconds [org.dataone.cn.indexer.IndexWorker:indexOjbect:416]

...and it was inserted into solr successfully:

dataone-indexer 20240722-20:51:24: [INFO]: SolrIndex.insert - finished to insert the solrDoc to the solr server for object resource_map_urn:uuid:e9eefb55-2e60-4331-9965-9a8ea09f122b and it took 9811449 milliseconds. [org.dataone.cn.indexer.SolrIndex:insert:424]

9811449 ms == 2 hours 43 minutes for the insert

This resourcemap was smaller, but had a timeout message:

dataone-indexer 20240722-20:23:14: [ERROR]: IndexWorker.indexOjbect - identifier: resource_map_urn:uuid:4949791f-00f0-4603-97c3-12d58a3cd211 , the index type: create, sending acknowledgement back to rabbitmq failed since channel is already closed due to channel error; protocol method: #method<channel.close>(reply-code=406, reply-text=PRECONDITION_FAILED - delivery acknowledgement on channel 1 timed out. Timeout value used: 3600000 ms. This timeout value can be configured, see consumers doc guide to learn more, class-id=0, method-id=0). So rabbitmq may resend the message again [org.dataone.cn.indexer.IndexWorker:indexOjbect:389]

...but was also inserted into solr successfully:

dataone-indexer 20240722-20:51:11: [INFO]: SolrIndex.insert - finished to insert the solrDoc to the solr server for object resource_map_urn:uuid:4949791f-00f0-4603-97c3-12d58a3cd211 and it took 1314284 milliseconds. [org.dataone.cn.indexer.SolrIndex:insert:424]

1314284 ms == about 22 minutes for the insert

I'll discuss with Jing upon his return.
Added to the backend meeting agenda.


artntek commented Jul 24, 2024

Reindex finished on 7/24. See additional info in dataone-indexer issue #118.


artntek commented Jul 29, 2024

7/24/24

  1. Wed Jul 24 17:29 PDT: Started a full reindex again
  2. 25 indexers; 5 threads/worker (default), 4-hr channel timeout, and new code changes to reopen closed channels
  3. After approx 24h
    1. Message rates -> 0
    2. Queue size -> 0
    3. Some reconnects shown at this point (one? plot scale not clear).
    4. 25 consumers still shown
    5. Metacat logs still showing GETs (Resource Maps still being processed)
  4. Sat Jul 27 09:51 PDT: completely finished

TOTAL TIME TAKEN FOR FULL REINDEX OF TEST.ARCTICDATA.IO with 25 indexers:
64 hours 22 minutes
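
For the record, a rough sketch of the values overrides implied by this run (25 workers, 4-hour consumer timeout). The dataone-indexer.rabbitmq.extraConfiguration key matches the snippet earlier in this thread; the replicaCount key name and the threads comment are assumptions about the chart layout, not copied from the real overrides file:

    dataone-indexer:
      ## 25 indexer workers -- key name assumed, not taken from the real values file
      replicaCount: 25
      ## threads per worker left at the chart default (5)
      rabbitmq:
        extraConfiguration: |-
          # 4-hour consumer timeout: 4 * 60 * 60 * 1000 ms
          consumer_timeout = 14400000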
