Parquet to BigQuery import for GCP-backed AnVIL snapshots (#6355) #6392
base: develop
@@ -102,12 +102,16 @@ $(1)terraform: lambdas
.PHONY: $(1)deploy
$(1)deploy: check_python $(1)terraform
	python $(project_root)/scripts/post_deploy_tdr.py
endef

$(eval $(call deploy,))
$(eval $(call deploy,auto_))

.PHONY: import
import: check_python
	python $(project_root)/scripts/reindex.py --import --sources "tdr:parquet:gcp:${GOOGLE_PROJECT}:*"
	python $(project_root)/scripts/verify_tdr_sources.py

.PHONY: destroy
destroy:
	$(MAKE) -C terraform destroy
@@ -73,7 +73,7 @@ def mkdict(previous_catalog: dict[str, str],
anvil_sources = mkdict({}, 3, mkdelta([
    mksrc('bigquery', 'datarepo-dev-e53e74aa', 'ANVIL_1000G_2019_Dev_20230609_ANV5_202306121732', 6804),
    mksrc('parquet', 'platform-anvil-dev', 'ANVIL_1000G_2019_Dev_20230609_ANV5_202306121732', 6804),
Review comment: The source spec should state where the source is, not where it will be once it is imported. The logic should be to import every Parquet source; a hypothetical reading of that suggestion is sketched after this hunk.
    mksrc('bigquery', 'datarepo-dev-42c70e6a', 'ANVIL_CCDG_Sample_1_20230228_ANV5_202302281520', 28),
    mksrc('bigquery', 'datarepo-dev-97ad270b', 'ANVIL_CMG_Sample_1_20230225_ANV5_202302281509', 25)
]))
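A hypothetical rendering of that comment, assuming it means keeping the TDR project (datarepo-dev-e53e74aa) in the Parquet source spec rather than the destination project (platform-anvil-dev), might look as follows; this is an illustration, not the reviewer's verbatim suggestion:

# Hypothetical sketch only: the Parquet spec keeps naming the TDR project that
# hosts the snapshot, not the Google project it will be imported into.
anvil_sources = mkdict({}, 3, mkdelta([
    mksrc('parquet', 'datarepo-dev-e53e74aa', 'ANVIL_1000G_2019_Dev_20230609_ANV5_202306121732', 6804),
    mksrc('bigquery', 'datarepo-dev-42c70e6a', 'ANVIL_CCDG_Sample_1_20230228_ANV5_202302281520', 28),
    mksrc('bigquery', 'datarepo-dev-97ad270b', 'ANVIL_CMG_Sample_1_20230225_ANV5_202302281509', 25)
]))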
@@ -64,7 +64,7 @@ def mkdict(previous_catalog: dict[str, str],
anvil_sources = mkdict({}, 3, mkdelta([
    mksrc('bigquery', 'datarepo-dev-e53e74aa', 'ANVIL_1000G_2019_Dev_20230609_ANV5_202306121732', 6804),
    mksrc('parquet', 'platform-anvil-dev', 'ANVIL_1000G_2019_Dev_20230609_ANV5_202306121732', 6804),
    mksrc('bigquery', 'datarepo-dev-42c70e6a', 'ANVIL_CCDG_Sample_1_20230228_ANV5_202302281520', 28),
    mksrc('bigquery', 'datarepo-dev-97ad270b', 'ANVIL_CMG_Sample_1_20230225_ANV5_202302281509', 25)
]))
@@ -0,0 +1,103 @@
""" | ||
Export Parquet files from TDR and download them to local storage. | ||
""" | ||
from argparse import ( | ||
ArgumentParser, | ||
) | ||
import logging | ||
from pathlib import ( | ||
Path, | ||
) | ||
import sys | ||
from typing import ( | ||
Iterator, | ||
) | ||
from uuid import ( | ||
UUID, | ||
) | ||
|
||
import attrs | ||
from furl import ( | ||
furl, | ||
) | ||
|
||
from azul import ( | ||
cached_property, | ||
config, | ||
reject, | ||
) | ||
from azul.http import ( | ||
HasCachedHttpClient, | ||
) | ||
from azul.logging import ( | ||
configure_script_logging, | ||
) | ||
from azul.terra import ( | ||
TDRClient, | ||
TerraStatusException, | ||
) | ||
|
||
log = logging.getLogger(__name__) | ||
|
||
|
||
@attrs.frozen | ||
class ParquetDownloader(HasCachedHttpClient): | ||
snapshot_id: str | ||
|
||
@cached_property | ||
def tdr(self) -> TDRClient: | ||
return TDRClient.for_indexer() | ||
|
||
def get_download_urls(self) -> dict[str, list[furl]]: | ||
urls = self.tdr.export_parquet_urls(self.snapshot_id) | ||
reject(urls is None, | ||
'No Parquet access information is available for snapshot %r', self.snapshot_id) | ||
return urls | ||
|
||
def get_data(self, parquet_urls: list[furl]) -> Iterator[bytes]: | ||
for url in parquet_urls: | ||
response = self._http_client.request('GET', str(url)) | ||
if response.status != 200: | ||
raise TerraStatusException(url, response) | ||
if response.headers.get('x-ms-resource-type') == 'directory': | ||
log.info('Skipping Azure directory URL') | ||
else: | ||
yield response.data | ||
|
||
def download_table(self, | ||
table_name: str, | ||
download_urls: list[furl], | ||
location: Path): | ||
data = None | ||
for i, data in enumerate(self.get_data(download_urls)): | ||
output_path = location / f'{self.snapshot_id}_{table_name}_{i}.parquet' | ||
log.info('Writing to %s', output_path) | ||
with open(output_path, 'wb') as f: | ||
f.write(data) | ||
reject(data is None, | ||
'No Parquet files found for snapshot %r. Tried URLs: %r', | ||
self.snapshot_id, download_urls) | ||
|
||
|
||
def main(argv): | ||
parser = ArgumentParser(add_help=True, description=__doc__) | ||
parser.add_argument('snapshot_id', | ||
type=UUID, | ||
help='The UUID of the snapshot') | ||
parser.add_argument('-O', | ||
'--output-dir', | ||
type=Path, | ||
default=Path(config.project_root) / 'parquet', | ||
help='Where to save the downloaded files') | ||
args = parser.parse_args(argv) | ||
|
||
downloader = ParquetDownloader(args.snapshot_id) | ||
|
||
urls_by_table = downloader.get_download_urls() | ||
for table_name, urls in urls_by_table.items(): | ||
downloader.download_table(table_name, urls, args.output_dir) | ||
|
||
|
||
if __name__ == '__main__': | ||
configure_script_logging(log) | ||
main(sys.argv[1:]) |
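
As a usage sketch, the flow that main() drives from the command line can also be run directly against the class above. The snapshot UUID and output directory below are placeholder values, and creating the output directory is an extra step not present in main():

# Usage sketch with placeholder values; mirrors what main() does above.
from pathlib import Path

downloader = ParquetDownloader('00000000-0000-0000-0000-000000000000')
out_dir = Path('/tmp/parquet')
out_dir.mkdir(parents=True, exist_ok=True)  # main() assumes the directory already exists
for table_name, urls in downloader.get_download_urls().items():
    downloader.download_table(table_name, urls, out_dir)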
Review comment: I'm not sure that assuming every snapshot is as big as 1000G leads to a practical timeout. A timeout is a heuristic defense against hung workloads, i.e., workloads that stop making significant progress. We don't want to have to constantly update the timeout, we don't want it to prematurely kill workloads that are progressing at the average rate, and we don't want a workload to sit in the hung state for more than 80% of its running time. A 5-minute timeout goes against the first rule; a 30-hour timeout goes against the last.
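
A rough sketch of a timeout that follows those rules would scale with the expected size of the workload instead of being a fixed constant. The constants, the processing rate, and the use of the per-source counts above as a size measure are assumptions for illustration, not values from this PR:

# Sketch of a size-proportional timeout per the heuristic above. All constants
# are illustrative assumptions, not values taken from this PR.
def import_timeout(expected_items: int,
                   items_per_second: float = 100.0,
                   floor_seconds: float = 300.0,
                   slack_factor: float = 5.0) -> float:
    # Allow several times the expected duration before declaring the workload
    # hung, but never less than a fixed floor, so the timeout neither needs
    # constant retuning nor lets a hung job linger for most of its running time.
    expected_duration = expected_items / items_per_second
    return max(floor_seconds, slack_factor * expected_duration)

# For example, using the 6804 count attached to the 1000G source above as the
# size measure gives max(300, 5 * 68.04), i.e. about 340 seconds.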