
Move test.arcticdata.io to Kubernetes #1932

Closed
artntek opened this issue May 21, 2024 · 7 comments
Assignees: artntek
Labels: k8s Kubernetes/Helm Related

artntek commented May 21, 2024

As our initial real-world test case, deploy test.arcticdata.io in the dev Kubernetes cluster, and use it to permanently replace the current legacy deployment.

This exercise will highlight which areas of the helm chart still need work, and which are highest priority.

@artntek artntek added the k8s Kubernetes/Helm Related label May 21, 2024
@artntek artntek self-assigned this May 21, 2024

artntek commented May 21, 2024

Similar to #1797, this requires the following steps:

  1. Work with @nickatnceas to copy production data for testing:

    • Copy the production Postgres data (test.adc.io:/var/lib/postgresql) to the ceph volume at /mnt/ceph/repos/arctic/postgresql (treat it like a hot backup).
    • Copy the production data from test.adc.io:/var/metacat to the ceph volume at /mnt/ceph/repos/arctic/metacat.
      • Time how long this takes.
  2. Create Persistent Volumes (see the PV sketch after this list):

    • Set up a PV pointing to /mnt/ceph/repos/arctic/metacat for metacat.
    • Set up a PV pointing to /mnt/ceph/repos/arctic/postgres for postgres.
  3. Do the manual upgrade steps:

    • Ensure the data is compatible with PG-14 (it may already be on v14; check).
  4. Deployment:

    • Create new values.yaml overrides to mirror the existing test.adc.io installation.
    • Figure out metacatui deployment changes, if needed.
    • Deploy to the dev cluster.
  5. Testing: run any existing client-library test suites we have available.
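
A minimal sketch of what the metacat PV in step 2 might look like, assuming the dev-cluster nodes already mount the ceph share at /mnt/ceph and exposing it via hostPath. The name, capacity, access mode, and storageClassName are placeholders rather than values from the actual chart, and the real cluster may use a CephFS CSI volume instead:

    # Sketch only -- metadata.name, storage size, accessModes, and
    # storageClassName are placeholders; hostPath assumes the k8s nodes
    # mount the ceph share locally at /mnt/ceph.
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: test-adc-metacat-pv          # hypothetical name
    spec:
      capacity:
        storage: 1Ti                     # placeholder size
      accessModes:
        - ReadWriteMany                  # assumed; adjust to how the chart mounts the volume
      persistentVolumeReclaimPolicy: Retain
      storageClassName: manual           # placeholder class
      hostPath:
        path: /mnt/ceph/repos/arctic/metacat

The postgres PV would be identical apart from the name and the /mnt/ceph/repos/arctic/postgres path.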


artntek commented Jul 22, 2024

The installation was successful, and everything is working fine as far as testing is concerned. However, when I try to reindex the entire corpus, I see strange behavior...

Tried this twice:

  1. First try: 5 indexers.
  2. Second try: after a fresh rsync of data from test.arcticdata.io, with 10 indexers, plus some postgres config tweaks to match the VM instance setup. Results for this second try:
  • Started 7/17/2024 1:37pm PDT
  • First 24 hours:
    • all 10 indexers fully in use
    • message rates just under 8/sec.
    • Max queue size less than 350 -- hardly had any build-up before objects were de-queued
[Screenshot: first 24 hours]
  • after 2pm 7/18, message rates tailed off to zero, and queue ramped from zero to ~22.5k objects:
[Screenshot: Jul 18th, 11am-7pm]
  • Indexers appeared to be working on resource maps (very slowly) at this stage. Example log entry:

    dataone-indexer 20240718-20:51:39: [INFO]: IndexWorker.indexOjbect with the thread id 44 - Completed the 
    index task from the index queue with the identifier: resource_map_urn:uuid:085be1bc-74c9-4e47-aecc-c8568e6629c4 , 
    the index type: create, the file path (null means not to have): autogen.2016091421595214374.1, the priotity: 1 
    and the time taking is 6174237 milliseconds [org.dataone.cn.indexer.IndexWorker:indexOjbect:416]
    • 20:51:39 UTC is 1:51pm PDT (also 7/18)
    • 6174237 ms ≈ 1 hr 43 min!
  • However, by 7/22/24, it was obvious nothing was happening: the queue still held ~22.5k entries, and there was no further indexer activity at all:

    • MetacatUI shows: DATASETS 1 TO 25 OF 26,383

    • The last Metacat log entry was on 18 July at 21:05:05 (UTC):

      192.168.192.155 - - [18/Jul/2024:21:05:05 +0000] "GET /metacat/d1/mn/v2/meta/urn:uuid:e9cedcd5-eef8-49d1-a1b3-a5a7349a44cf HTTP/1.1" 200 2349
    • The last indexer log entry was on 18 July at 21:07:05 (UTC):

      dataone-indexer 20240718-21:07:05: [INFO]: IndexWorker.indexOjbect with the thread id 38 - 
      Completed the index task from the index queue with the identifier: 
      resource_map_urn:uuid:e9eefb55-2e60-4331-9965-9a8ea09f122b , the index type: create, the 
      file path (null means not to have): autogen.2016091422025267642.1, the priotity: 1
       and the time taking is 1048455 milliseconds [org.dataone.cn.indexer.IndexWorker:indexOjbect:416]
      

Trying to debug...


artntek commented Jul 22, 2024

Found entries like this near the end of each indexer log:

dataone-indexer 20240718-20:40:37: [ERROR]: IndexWorker.indexOjbect - identifier:
resource_map_urn:uuid:9a4bed73-b844-4415-b379-fb57bb0accb0 , the index type:
create, sending acknowledgement back to rabbitmq failed since channel is already
closed due to channel error; protocol method: #method<channel.close>(reply-code=406,
reply-text=PRECONDITION_FAILED - delivery acknowledgement on channel 1 timed out.
Timeout value used: 1800000 ms. This timeout value can be configured, see consumers
doc guide to learn more, class-id=0, method-id=0). So rabbitmq may resend the
message again [org.dataone.cn.indexer.IndexWorker:indexOjbect:389]

see https://www.rabbitmq.com/docs/consumers#acknowledgement-timeout

  • 7/22 @ 10:39:54 PDT - changed the timeout from 30 mins to 1 hr and did a helm upgrade:

    dataone-indexer:
      rabbitmq:
        ## @param dataone-indexer.rabbitmq.extraConfiguration extra config, to be appended to rmq config
        ## Default consumer_timeout is 30 mins (i.e. 1800000 ms). Increase this to 1 hr to avoid timeouts
        ## with large resource maps
        extraConfiguration: |-
          consumer_timeout = 3600000
  • helm restarted all the indexer pods except one, and the rmq pod

  • Indexing now appears to have resumed...


mbjones commented Jul 22, 2024

We had this same problem with MetaDIG and rabbitmq. @jeanetteclark solved this by handling long timeouts independently and tracking task completion in the database. It sort of defeats the purpose of a message queue if we have to track task completion outside of the queue. @jeanetteclark can provide details.


artntek commented Jul 22, 2024

Thanks, @mbjones. Having dug deeper, I now recall that Jing changed the code to (temporarily) ack rmq immediately, instead of waiting until after processing is finished, so there should be no timeouts -- see this PR. (I think this was a temporary fix, inspired by @jeanetteclark's learnings.)

However, if this code is working as intended, it's still a mystery why I'm seeing timeouts...?

Examples:

This resourcemap took 3 hours 9 minutes (!) to index, with no timeout message:

dataone-indexer 20240722-20:51:24: [INFO]: IndexWorker.indexOjbect with the thread id 46 - Completed the index task from the index queue with the identifier: resource_map_urn:uuid:e9eefb55-2e60-4331-9965-9a8ea09f122b , the index type: create, the file path (null means not to have): autogen.2016091422025267642.1, the priotity: 1 and the time taking is 11368354 milliseconds [org.dataone.cn.indexer.IndexWorker:indexOjbect:416]

...and it was inserted into solr successfully:

dataone-indexer 20240722-20:51:24: [INFO]: SolrIndex.insert - finished to insert the solrDoc to the solr server for object resource_map_urn:uuid:e9eefb55-2e60-4331-9965-9a8ea09f122b and it took 9811449 milliseconds. [org.dataone.cn.indexer.SolrIndex:insert:424]

9811449 ms == 2 hours 43 minutes for the insert

This resourcemap was smaller, but had a timeout message:

dataone-indexer 20240722-20:23:14: [ERROR]: IndexWorker.indexOjbect - identifier: resource_map_urn:uuid:4949791f-00f0-4603-97c3-12d58a3cd211 , the index type: create, sending acknowledgement back to rabbitmq failed since channel is already closed due to channel error; protocol method: #method<channel.close>(reply-code=406, reply-text=PRECONDITION_FAILED - delivery acknowledgement on channel 1 timed out. Timeout value used: 3600000 ms. This timeout value can be configured, see consumers doc guide to learn more, class-id=0, method-id=0). So rabbitmq may resend the message again [org.dataone.cn.indexer.IndexWorker:indexOjbect:389]

...but was also inserted into solr successfully:

dataone-indexer 20240722-20:51:11: [INFO]: SolrIndex.insert - finished to insert the solrDoc to the solr server for object resource_map_urn:uuid:4949791f-00f0-4603-97c3-12d58a3cd211 and it took 1314284 milliseconds. [org.dataone.cn.indexer.SolrIndex:insert:424]

1314284 ms == about 22 minutes for the insert

I'll discuss with Jing upon his return.
Added to the backend meeting agenda.


artntek commented Jul 24, 2024

Reindex finished on 7/24. See additional info in dataone-indexer issue #118.


artntek commented Jul 29, 2024

7/24/24

  1. Wed Jul 24 17:29 PDT: Started a full reindex again
  2. 25 indexers; 5 threads/worker (default), 4-hr channel timeout, and new code changes to reopen closed channels
  3. After approx 24h
    1. Message rates -> 0
    2. Queue size -> 0
    3. Some reconnects shown at this point (one? plot scale not clear).
    4. 25 consumers still shown
    5. Metacat logs still showing GETs (Resource Maps still being processed)
  4. Sat Jul 27 09:51 PDT: completely finished

TOTAL TIME TAKEN FOR FULL REINDEX OF TEST.ARCTICDATA.IO with 25 indexers:
64 hours 22 minutes
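
For the record, a rough sketch of the values overrides implied by this run (25 workers, 4-hour consumer timeout). The dataone-indexer.rabbitmq.extraConfiguration key matches the snippet earlier in this thread; the replicaCount key name and the threads comment are assumptions about the chart layout, not copied from the real overrides file:

    dataone-indexer:
      ## 25 indexer workers -- key name assumed, not taken from the real values file
      replicaCount: 25
      ## threads per worker left at the chart default (5)
      rabbitmq:
        extraConfiguration: |-
          # 4-hour consumer timeout: 4 * 60 * 60 * 1000 ms
          consumer_timeout = 14400000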
