Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SDAP-367: Updated deletebyquery tool to improve speed #155

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 3 additions & 1 deletion tools/deletebyquery/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -22,4 +22,6 @@ To delete a dataset called `my_dataset`, with SDAP deployed using the Helm chart
```
cd /incubator-sdap-nexus/tools/deletebyquery
python deletebyquery.py --solr sdap-solr-svc:8983 --cassandra sdap-cassandra --cassandraUsername cassandra --cassandraPassword cassandra --query 'dataset_s:"my_dataset"'
```
```

You can provide a flag `-f` or `--force` which will cause the script to skip all prompts before deleting.
21 changes: 13 additions & 8 deletions tools/deletebyquery/deletebyquery.py
Original file line number Diff line number Diff line change
Expand Up @@ -75,7 +75,8 @@ def delete_by_query(args):
se = SearchOptions()
se.commonparams.q(args.query) \
.fl(SOLR_UNIQUE_KEY) \
.fl('id')
.fl('id') \
.rows(50000)

for fq in args.filterquery if args.filterquery is not None else []:
se.commonparams.fq(fq)
Expand All @@ -89,11 +90,11 @@ def delete_by_query(args):
else:
raise RuntimeError("either query or jsonparams is required")

if check_query(query):
if args.force or check_query(query):
logging.info("Collecting tiles ....")
solr_docs = do_solr_query(query)

if confirm_delete(len(solr_docs)):
if args.force or confirm_delete(len(solr_docs)):
deleted_ids = do_delete(solr_docs, query)
logging.info("Deleted tile IDs %s" % json.dumps([str(doc_id) for doc_id in deleted_ids], indent=2))
else:
Expand Down Expand Up @@ -160,16 +161,15 @@ def do_solr_query(query):
break
else:
next_cursor_mark = solr_response.result.nextCursorMark

doc_ids.extend([uuid.UUID(doc['id']) for doc in solr_response.result.response.docs])
ids = [uuid.UUID(doc['id']) for doc in solr_response.result.response.docs]
delete_from_cassandra(ids)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you moved the cassandra delete into the do_solr_query function? This seems like it would break things. do_solr_query is collecting all of the tile ids that need to be deleted. Once that is done it asks the user to confirm the number of documents before executing the delete. With this new flow, the cassandra rows get removed before the solr docs and a user could decide to cancel without realizing data has already been deleted.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@frankinspace Whoops, good point. I'll refactor this

doc_ids.extend(ids)

return doc_ids


def do_delete(doc_ids, query):
logging.info("Executing Cassandra delete...")
delete_from_cassandra(doc_ids)
logging.info("Executing Solr delete...")
logging.info("Executing delete...")
delete_from_solr(query)
return doc_ids

Expand Down Expand Up @@ -260,6 +260,11 @@ def parse_args():
choices=['1', '2', '3', '4', '5'],
default='3')

parser.add_argument('-f', '--force',
help='The version of the Cassandra protocol the driver should use.',
required=False,
action='store_true')

return parser.parse_args()


Expand Down