feat: simpler indexing #8

Merged: 1 commit into main from feat/simplify-indexing, Aug 16, 2023

@trym-b (Contributor) commented Aug 16, 2023

# Motivation

Previously, indexing WARC files meant running quite a few lines of code
by hand. Now you can simply run `index_it.py`.

# Interface

```
$ python index_it.py --help
usage: index_it.py [-h] --collection COLLECTION
  --warc-file-directory WARC_FILE_DIRECTORY
  --number-of-threads NUMBER_OF_THREADS

options:
  -h, --help            show this help message and exit
  --collection COLLECTION
                        Name of collection
  --warc-file-directory WARC_FILE_DIRECTORY
                        Directory containing WARC files belonging to a collection
  --number-of-threads NUMBER_OF_THREADS
                        Number of threads to use for indexing
```
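
For readers who want to see how a help text like the one above can arise, here is a minimal `argparse` sketch that produces the same three required options. It is not the actual `index_it.py`; the function name `build_parser` and the choice of `type=int` for the thread count are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the three required options from the help text.

    Hypothetical helper: the real index_it.py may be structured differently.
    """
    parser = argparse.ArgumentParser(prog="index_it.py")
    parser.add_argument("--collection", required=True, help="Name of collection")
    parser.add_argument(
        "--warc-file-directory",
        required=True,
        help="Directory containing WARC files belonging to a collection",
    )
    parser.add_argument(
        "--number-of-threads",
        required=True,
        type=int,  # assumption: the help text does not state the type
        help="Number of threads to use for indexing",
    )
    return parser


# Parse an explicit argument list instead of sys.argv, for demonstration.
args = build_parser().parse_args(
    [
        "--collection", "my-collection",
        "--warc-file-directory", "/data/warcs",
        "--number-of-threads", "4",
    ]
)
print(args.collection, args.warc_file_directory, args.number_of_threads)
```

Because all three options are marked `required=True`, leaving any of them out makes `argparse` print the usage line and exit with an error, which matches the lack of brackets around them in the usage output.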

# Quirks

Indexing a collection does not leave `solr` or `tomcat` running, so to
view the results you have to start these services manually.

# Future work

Once the next version of `solrwayback` is released, we can remove
`"JAVA_TOOL_OPTIONS": "-Dfile.encoding=UTF8",`, as the underlying issue
has been fixed upstream.
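
The dictionary-entry syntax of that setting suggests it lives in an environment mapping handed to a child process, though the PR description does not show the surrounding code. As a rough sketch under that assumption, the following shows one way such a variable can be passed via `subprocess`. The helper name `indexing_environment` is hypothetical, and a small Python child process stands in for the real Java-based indexer.

```python
import os
import subprocess
import sys


def indexing_environment(base=None):
    """Return a copy of the environment with the encoding workaround added.

    Hypothetical helper; `base` defaults to the current process environment.
    """
    env = dict(os.environ if base is None else base)
    env["JAVA_TOOL_OPTIONS"] = "-Dfile.encoding=UTF8"
    return env


# A Python child process stands in for the real indexer here (assumption);
# it just echoes the variable to show that the child receives it.
result = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['JAVA_TOOL_OPTIONS'])"],
    env=indexing_environment(),
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())
```

Keeping the workaround in a single place like this would make it a one-line removal once the upstream fix is released.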

@trym-b merged commit 2f2a0a9 into main on Aug 16, 2023
1 check passed
@trym-b deleted the feat/simplify-indexing branch on August 16, 2023, 10:11