Integrates ACL Sentence Piece Model for calculating affinity scores #72

Open
wants to merge 180 commits into
base: master
7a42615
Add basic Flask functionality
haroldrubio Jul 15, 2021
ce443f9
Add flask to dependencies
haroldrubio Jul 15, 2021
60707c0
Add default config
haroldrubio Jul 16, 2021
869e951
Offload expertise model to separate class
haroldrubio Jul 16, 2021
ce72b4e
Add /expertise endpoint for dataset creation and modeling
haroldrubio Jul 19, 2021
0439f56
Add job queue scaffolding code
haroldrubio Jul 22, 2021
a8efeef
Update typing and queue semantics
haroldrubio Jul 22, 2021
014778f
Implement private job handler and job search
haroldrubio Jul 22, 2021
63ad2c9
Implement queue daemon
haroldrubio Jul 22, 2021
87f98e6
Implement public get and cancel functions
haroldrubio Jul 22, 2021
4fda4cf
Add directory setup on job creation
haroldrubio Jul 23, 2021
82d823d
Update status handling and documentation
haroldrubio Jul 23, 2021
74c25b2
Add not implemented yet exceptions to JobQueue
haroldrubio Jul 23, 2021
9f01168
Add separate expertise and dataset metadata objects
haroldrubio Jul 23, 2021
8b86795
Add scaffolding code for integrating queue with expertise
haroldrubio Jul 23, 2021
6389a3e
Update create dataset scaffolding
haroldrubio Jul 23, 2021
e01924d
Implement expertise get result function
haroldrubio Jul 25, 2021
8043daa
Chain dataset queue into expertise queue
haroldrubio Jul 25, 2021
ad21879
Create new abstraction to handle multi-step queues
haroldrubio Jul 25, 2021
958b238
Add logging functionality to queue objects
haroldrubio Jul 28, 2021
6be267a
Adjust default max_jobs semantics
haroldrubio Jul 28, 2021
181b0aa
Change logger formatting and add daemon thread start
haroldrubio Jul 28, 2021
723af6f
Add test for single job queue
haroldrubio Jul 28, 2021
42a2c8b
Add scaffolding for several unit tests
haroldrubio Jul 28, 2021
25b1787
Implement single job unit tests
haroldrubio Jul 28, 2021
198b11f
Implement mult-job and multithreaded unit tests
haroldrubio Jul 28, 2021
4b80eff
Adjust inner queue options and init
haroldrubio Jul 28, 2021
dd3047a
Add unit tests for two step queue
haroldrubio Jul 29, 2021
9028053
Add unique per-job id instead of only on config
haroldrubio Jul 30, 2021
7d84cf7
Update queue specifications
haroldrubio Jul 30, 2021
9b6c2d3
Change job ID and add prepare_job function
haroldrubio Jul 30, 2021
d9c25bd
Add integration queue tests
haroldrubio Aug 2, 2021
db43d7a
Update queue naming convention
haroldrubio Aug 2, 2021
fcfd760
Update expertise endpoints
haroldrubio Aug 2, 2021
7cb2301
Fix delete on get bug
haroldrubio Aug 3, 2021
b888f46
Logically split code and add more documentation
haroldrubio Aug 3, 2021
a417e34
Correct typo in documentation
haroldrubio Aug 4, 2021
cab2d33
Add endpoint documentation and change function typing
haroldrubio Aug 4, 2021
a86bdde
Address CircleCI tests
haroldrubio Aug 4, 2021
2c400a7
Replace specter+mfr tests with ELMo tests
haroldrubio Aug 4, 2021
16b45d8
Fix ELMo tests
haroldrubio Aug 4, 2021
b75769a
Increase test sleep time
haroldrubio Aug 4, 2021
5c7eaa0
Adjust tests and sleep time
haroldrubio Aug 4, 2021
49771cd
Clean directory before each test
haroldrubio Aug 5, 2021
342394c
Fix boolean value
haroldrubio Aug 5, 2021
2a26a64
Deprecate old queue and integrate Celery
haroldrubio Aug 9, 2021
7a6e3e5
Separate create_dataset and run_expertise steps
haroldrubio Aug 12, 2021
31f1d3a
Restore /jobs and /results endpoint functionality
haroldrubio Aug 12, 2021
4fd2cca
Prevent logging all fetched affinity scores
haroldrubio Aug 12, 2021
88a42de
Import test setup from Matcher
haroldrubio Aug 13, 2021
9b37823
Add endpoint for testing
haroldrubio Aug 13, 2021
0e805c2
Update handling for tests and add TODO tasks
haroldrubio Aug 13, 2021
01aea5b
Use mock client if in test mode
haroldrubio Aug 13, 2021
4e134e3
Prevent retrieving results from a data file
haroldrubio Aug 13, 2021
9e631dd
Add test mode check to all other endpoints
haroldrubio Aug 13, 2021
224097b
Change directory semantics
haroldrubio Aug 13, 2021
63b189d
Restrict writing to server for test mode only
haroldrubio Aug 13, 2021
d21c6b6
Add error logging and clean up for failed tasks
haroldrubio Aug 13, 2021
16abd95
Return an error status for jobs that have crashed
haroldrubio Aug 13, 2021
ea3f926
Add initial celery test
haroldrubio Aug 16, 2021
27ac89d
Finish complete workflow test and modify CircleCI
haroldrubio Aug 16, 2021
ef5750f
Add extra testing for error during the job execution
haroldrubio Aug 16, 2021
35cfc36
Unpack config into request body and expect required parameters
haroldrubio Aug 16, 2021
3261a3e
Update README.md
haroldrubio Aug 17, 2021
1d9d27f
Update README.md
haroldrubio Aug 17, 2021
9051084
Add brief API comments
haroldrubio Aug 17, 2021
3dfeef4
Throw forbidden error on non existant profile
haroldrubio Aug 23, 2021
caf1946
Offload SPECTER variables to config file
haroldrubio Aug 23, 2021
d1c0722
Nest all created directories within a config-stored dir
haroldrubio Aug 23, 2021
2c23667
Use randomized alphanumeric string for job IDs
haroldrubio Aug 24, 2021
331f3c0
Return metadata.json in response
haroldrubio Aug 24, 2021
7f89dd8
Change config pre-processing
haroldrubio Aug 24, 2021
d141b86
Re-organize /expertise logic
haroldrubio Aug 25, 2021
8875610
Throw profile forbidden errors in endpoints
haroldrubio Aug 25, 2021
d9c19da
Move mock client function to utils.py
haroldrubio Aug 25, 2021
b417c7f
Remove old queue system
haroldrubio Aug 25, 2021
d090f31
Move redis database and change working directory
haroldrubio Aug 25, 2021
7669fe9
add get_notes to the mock client implementation
melisabok Aug 25, 2021
b107fb4
Merge branch 'feature/api-dev' of https://github.com/openreview/openr…
haroldrubio Aug 25, 2021
2e31d1c
Re-organize config preprocessing logic
haroldrubio Aug 26, 2021
dd6cf5c
Add integration with get_notes from mock client
haroldrubio Aug 26, 2021
da82cca
Add get_profile functionality to the mock client
haroldrubio Aug 26, 2021
4b81699
Remove testing mode dependence
haroldrubio Aug 26, 2021
fa6fd3d
Clean up server logic
haroldrubio Aug 26, 2021
64144ef
Reverse logic in pre-processing config
haroldrubio Aug 26, 2021
c6297f1
Rename config parameter
haroldrubio Aug 27, 2021
7e72bb7
Remove old code
haroldrubio Aug 27, 2021
a2f4461
Add single job query to /jobs
haroldrubio Aug 27, 2021
6b084d8
Remove extra test data
haroldrubio Sep 1, 2021
154dfa9
Use shortuuid to generate job IDs
haroldrubio Sep 1, 2021
78aa691
Add extra verbosity to the error log
haroldrubio Sep 1, 2021
2b27699
Add more validation tests
haroldrubio Sep 2, 2021
a1975c7
Add helper functions
haroldrubio Sep 3, 2021
8eab8b3
Add start from existing directory
haroldrubio Sep 3, 2021
a6f1db6
Update docstrings
haroldrubio Sep 3, 2021
fcb39c1
Delete README.md
haroldrubio Sep 3, 2021
fa8c9ce
Return the name parameter when querying
haroldrubio Sep 3, 2021
12be692
Remove unnecessary logging statements
haroldrubio Sep 10, 2021
139b682
Add token and baseurl as allowable config fields
haroldrubio Sep 10, 2021
f266328
Throw 403 forbidden on no authentication
haroldrubio Sep 10, 2021
8efe0b4
Merge branch 'feature/api-dev' of github.com:openreview/openreview-ex…
purujitgoyal Sep 13, 2021
850c33b
Migrate from working_dir/profile/job to working_dir/job
haroldrubio Sep 13, 2021
8a5af66
Add guest profile access
haroldrubio Sep 15, 2021
fac895f
Create work functions for endpoints
haroldrubio Sep 15, 2021
afab2d6
Simplify routes endpoints
haroldrubio Sep 15, 2021
2acabaa
Added ACL sentence piece model
purujitgoyal Sep 16, 2021
db0271c
Merge branch 'feature/api-dev' of github.com:openreview/openreview-ex…
purujitgoyal Sep 16, 2021
f6b5ee7
Add eviction of stale/errors on server
haroldrubio Sep 19, 2021
60ef066
Simplify before first request route
haroldrubio Sep 20, 2021
f03a9ba
get user id instead
melisabok Sep 22, 2021
5db1f01
set token in the mockclient
melisabok Sep 22, 2021
b02a27d
fix paths
melisabok Sep 22, 2021
bcfccdf
add tmp and log files to git ignore
melisabok Sep 22, 2021
abbeb23
fix tests
melisabok Sep 22, 2021
e460901
Fix MagicMock error and no directory error
haroldrubio Sep 22, 2021
c3d144c
Fix queue_evict test
haroldrubio Sep 22, 2021
d5d7137
move utils functions to ExpertiseService class
melisabok Sep 22, 2021
3c14b00
Merge branch 'feature/api-dev' of github.com:openreview/openreview-ex…
melisabok Sep 22, 2021
8220de3
add import
melisabok Sep 22, 2021
9e92598
Deprecate stale data eviction test
haroldrubio Sep 24, 2021
4b9c2a4
Throw bad request on configs with unexpected fields
haroldrubio Sep 24, 2021
6176a80
Add job status enumeration and test for queued status
haroldrubio Sep 24, 2021
4bc5a47
Remove token from config before writing
haroldrubio Sep 24, 2021
1581b1d
Add superuser access to all jobs
haroldrubio Sep 24, 2021
98339d7
Fix exception handling
haroldrubio Sep 24, 2021
93033e5
Separate testing and default working dirs
haroldrubio Sep 27, 2021
cb7f91d
Separate out different test cases into functions
haroldrubio Sep 29, 2021
06495d3
Maintain token in the config after writing to disk
haroldrubio Sep 29, 2021
d7bbe3d
clean the working dir using the test configuration
melisabok Sep 30, 2021
4aa6dd0
use a test class and write each scenario as a different test
melisabok Oct 1, 2021
7a23286
rename test file and add constructor
melisabok Oct 1, 2021
8fb71c5
fix identation
melisabok Oct 1, 2021
7dbba13
remove init method
melisabok Oct 1, 2021
b312e92
move fixtures outside of the class
melisabok Oct 1, 2021
4c22f67
remove elmo fixture
melisabok Oct 1, 2021
e48dede
add missing self
melisabok Oct 1, 2021
f082eaa
Change handing of statuses
haroldrubio Oct 1, 2021
2245a43
Revert ID change
haroldrubio Oct 1, 2021
a2f310c
reduce de test dataset
melisabok Oct 1, 2021
5908662
check status is not an error
melisabok Oct 1, 2021
793157f
refactor fixtures
melisabok Oct 1, 2021
e46498a
Add more specific error testing and write queued status
haroldrubio Oct 4, 2021
042fdd1
Use session scoped queue
haroldrubio Oct 5, 2021
697c7fc
FIx tests
haroldrubio Oct 6, 2021
1fabd06
Clean up test directory
haroldrubio Oct 6, 2021
0418778
Change to single request test
haroldrubio Oct 8, 2021
065ff09
Add high load job test
haroldrubio Oct 8, 2021
4a62b3d
Point to service folder for API docs
haroldrubio Oct 8, 2021
cb69dba
Add API documentation
haroldrubio Oct 8, 2021
6a3dc66
Reduce logger calls
haroldrubio Oct 8, 2021
d701419
Return multiple field errors at once
haroldrubio Oct 14, 2021
269f568
Search for score and metadata more efficiently
haroldrubio Oct 14, 2021
5766ff7
Change how superusers retrieve jobs IDs
haroldrubio Oct 14, 2021
517238d
Change from error to info
haroldrubio Oct 14, 2021
aeb2758
Add timeout mechanisms
haroldrubio Oct 14, 2021
8d6f064
Make fetching status more efficient
haroldrubio Oct 14, 2021
448663d
Add logger calls
haroldrubio Oct 14, 2021
cc2c276
Only return all jobs for the Superusers
haroldrubio Oct 15, 2021
ddccf3d
Move validation to separate function
haroldrubio Oct 15, 2021
c1076b4
Fix tests and add all statuses endpoint
haroldrubio Oct 15, 2021
dbcfcde
Fix parameter name
purujitgoyal Oct 15, 2021
9ee9744
Adjust timeouts
haroldrubio Oct 15, 2021
6316847
Replace * import
haroldrubio Oct 15, 2021
0e00dc7
Increase timeout
haroldrubio Oct 18, 2021
3a5290d
Fix import
haroldrubio Oct 18, 2021
0dd7b98
Adjust high load timeout
haroldrubio Oct 18, 2021
5d75db5
Export the default job config to init
haroldrubio Oct 18, 2021
28d93cb
Let validation function build the config
haroldrubio Oct 18, 2021
62dab1b
moves expertise config fields to Expertise Service init
purujitgoyal Oct 19, 2021
e7faa27
Removes not used stuff
purujitgoyal Oct 19, 2021
3d804f4
Merge branch 'feature/api-dev' of github.com:openreview/openreview-ex…
purujitgoyal Oct 20, 2021
cd51d43
Formatting changes, adds pre-commit
purujitgoyal Oct 20, 2021
15dbe77
Merge branch 'master' of github.com:openreview/openreview-expertise i…
purujitgoyal Oct 20, 2021
d31ff17
Merge branch 'feature/api-dev' of github.com:openreview/openreview-ex…
purujitgoyal Oct 21, 2021
1ccbbd5
Merge branch 'feature/api-dev' into feature/acl-scorer
purujitgoyal Oct 21, 2021
2f3b5de
Merge branch 'feature/acl-scorer' of github.com:openreview/openreview…
purujitgoyal Oct 21, 2021
89f0625
Adds missing import
purujitgoyal Oct 21, 2021
8e3937a
Merge branch 'master' of github.com:openreview/openreview-expertise i…
purujitgoyal Oct 28, 2021
39d1879
Formatting fixes
purujitgoyal Oct 28, 2021
4f95846
test fix
purujitgoyal Oct 28, 2021
18 changes: 18 additions & 0 deletions .flake8
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
[flake8]
max-line-length = 79
exclude = .tox,*.egg,build,temp
select = E,W,F
max-complexity = 18
verbose = 2
# https://pep8.readthedocs.io/en/latest/intro.html#error-codes
format = pylint
ignore =
    E731
    E741
    W504
    F401
    F841
    E203  # E203 - whitespace before ':'. Opposite convention enforced by black
    E231  # E231: missing whitespace after ',', ';', or ':'; for black
    E501  # E501 - line too long. Handled by black, we have longer lines
    W503  # W503 - line break before binary operator, need for black
2 changes: 1 addition & 1 deletion .gitignore
@@ -3,4 +3,4 @@ openreview_expertise.egg-info
__pycache__

/tmp
*.log
*.log
18 changes: 18 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,18 @@
default_language_version:
  python: python3
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.0.1
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
  - repo: https://github.com/ambv/black
    rev: 21.7b0
    hooks:
      - id: black
        language_version: python3
  - repo: https://gitlab.com/pycqa/flake8
    rev: 3.9.2
    hooks:
      - id: flake8
        language_version: python3
53 changes: 49 additions & 4 deletions README.md
@@ -43,13 +43,13 @@ cd specter
wget https://ai2-s2-research-public.s3-us-west-2.amazonaws.com/specter/archive.tar.gz
tar -xzvf archive.tar.gz

conda install pytorch cudatoolkit=10.1 -c pytorch
conda install pytorch cudatoolkit=10.1 -c pytorch
pip install -r requirements.txt
python setup.py install
conda install filelock
cd ..
```
Pass the path to the cloned GitHub repository as `model_params.specter_dir`.
Pass the path to the cloned GitHub repository as `model_params.specter_dir`.

If you plan to use Multifacet-Recommender / SPECTER+MFR, download the checkpoint files from [here](https://drive.google.com/file/d/1_mWkQ1dr_Vl121WZkbNyNMV3G_bmoQ6s/view?usp=sharing), extract it, and pass the paths:
```
@@ -63,6 +63,19 @@ https://www.overleaf.com/read/ygmygwtjbzfg

https://www.overleaf.com/read/swqrxgqqvmyv

If you plan to use the SentencePiece model, follow the training procedure described [here](https://github.com/acl-org/reviewer-paper-matching) to train it, then pass the path to the trained model directory. The expertise package expects the following directory structure for the model files:
```
path_to_trained_model_dir/
    scratch/
        abstracts.sp.20k.model
        abstracts.sp.20k.model.model
        abstracts.sp.20k.model.vocab
        abstracts.sp.20k.vocab
        similarity-model.pt
```

The `path_to_trained_model_dir` should be passed as `model_params.model_dir` in the config discussed in the Configuration section.
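
Before pointing `model_params.model_dir` at a trained model, it may help to confirm that the files listed above are actually present. The following is a minimal sketch, not part of this PR: `missing_model_files` is a hypothetical helper, and it searches recursively so that it does not depend on the exact nesting under `scratch/`.

```python
from pathlib import Path

# File names copied from the directory listing in the README.
EXPECTED_FILES = [
    "abstracts.sp.20k.model",
    "abstracts.sp.20k.model.model",
    "abstracts.sp.20k.model.vocab",
    "abstracts.sp.20k.vocab",
    "similarity-model.pt",
]


def missing_model_files(model_dir):
    """Return the expected file names not found anywhere under model_dir."""
    root = Path(model_dir)
    # Collect the names of all regular files below the model directory.
    found = {p.name for p in root.rglob("*") if p.is_file()}
    return [name for name in EXPECTED_FILES if name not in found]
```

An empty list means every expected file was found; otherwise the returned names can be reported before starting a long scoring run.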

## Affinity Scores

There are two steps to create affinity scores:
@@ -80,7 +93,7 @@ python -m expertise.create_dataset config.json \
--username <your_username> \
```

For ELMo, SPECTER, Multifacet-Recommender and BM25 run the following command
For ELMo, SPECTER, Multifacet-Recommender, SentencePiece-ACL and BM25 run the following command
```
python -m expertise.run config.json
```
@@ -133,7 +146,7 @@ python -m expertise.service --host localhost --port 5000

By default, the app will run on `http://localhost:5000`. The endpoint `/expertise/test` should show a simple page indicating that Flask is running. Accessing the `/expertise` endpoint to compute affinity scores **requires** valid authentication in the headers of the request (i.e., submitted from a logged-in Python client).

In order to start the Celery queue worker, use:
In order to start the Celery queue worker, use:
```
celery --app expertise.service.server.celery_app worker
```
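
As a rough illustration of the authentication requirement described in the README text above, the sketch below builds requests for the service using only the standard library. It assumes the token travels in an `Authorization: Bearer <token>` header, which this README does not state; `build_request` is a hypothetical helper, not part of the service.

```python
from urllib.request import Request


def build_request(path, token=None, base_url="http://localhost:5000"):
    """Build a request for the expertise service without sending it."""
    headers = {}
    if token:
        # Assumption: the service reads a bearer token from this header.
        headers["Authorization"] = f"Bearer {token}"
    return Request(base_url + path, headers=headers)
```

With the Flask app running, passing `build_request("/expertise/test")` to `urllib.request.urlopen` should return the simple status page; calls to `/expertise` would need a valid token.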
@@ -431,6 +444,38 @@ Here is an example:
}
```

#### SentencePiece-ACL specific parameters (affinity scores):
- `model_params.model_dir`: Path to the unpacked model directory. The model checkpoint will be loaded relative to this directory.
- `model_params.batch_size`: Batch size when running SentencePiece Model. This defaults to 32.
- `model_params.publications_path`: When running SentencePiece, this is where the embedded abstracts/titles of the Reviewers (and Area Chairs) are stored.
- `model_params.submissions_path`: When running SentencePiece, this is where the embedded abstracts/titles of the Submissions are stored.
- `model_params.max_score` (boolean, defaults to `true`): When `true`, a reviewer's affinity for a submission is the maximum similarity between the submission and the reviewer's authored publication embeddings.
- `model_params.weighted_topk` (int, defaults to 0): When set, a reviewer's affinity for a submission is the weighted average of the top `k` similarity scores between the submission and the reviewer's authored publication embeddings. This is skipped if `model_params.max_score` is set to `true`.
- `model_params.skip_model`: Since running SentencePiece can take a significant amount of time, the vectors are saved in `model_params.submissions_path` and `model_params.publications_path`. When this is set, the model run is skipped and the vectors are loaded from these jsonl files.
- `model_params.use_cuda`: Boolean to indicate whether to use GPU (`true`) or CPU (`false`) when running the SentencePiece model. It defaults to CPU (`false`).

Here is an example:
```
{
    "name": "iclr2020_sentence_piece",
    "dataset": {
        "directory": "./data/"
    },
    "model": "sentence_piece_acl",
    "model_params": {
        "model_dir": "../acl-sentence-piece/",
        "batch_size": 16,
        "skip_model": false,
        "max_score": true,
        "publications_path": "./",
        "submissions_path": "./",
        "use_cuda": false,
        "scores_path": "./"
    }
}
```
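
To make the interaction between `max_score` and `weighted_topk` concrete, here is a small sketch of how one reviewer's per-publication similarities could be collapsed into a single affinity score. It is not the PR's implementation: the README does not say which weights the weighted average uses, so the `1, 1/2, 1/3, ...` rank weights and the fallback to the maximum when `weighted_topk` is 0 are assumptions, and `aggregate_score` is a hypothetical name.

```python
def aggregate_score(similarities, max_score=True, weighted_topk=0):
    """Collapse one reviewer's per-publication similarities into one score."""
    if not similarities:
        return 0.0
    ranked = sorted(similarities, reverse=True)
    # max_score takes precedence; weighted_topk is skipped when it is true.
    if max_score or weighted_topk <= 0:
        return ranked[0]
    top = ranked[:weighted_topk]
    # Assumed weights 1, 1/2, 1/3, ... (the README does not state the scheme).
    weights = [1.0 / (rank + 1) for rank in range(len(top))]
    return sum(w * s for w, s in zip(weights, top)) / sum(weights)
```

With similarities `[0.2, 0.9, 0.5]`, the default settings return the maximum, while `max_score=False, weighted_topk=2` averages the two highest scores under the assumed weights.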

#### ELMo specific parameters (duplicate detection):
- `model_params.other_submissions_path`: When running ELMo, this is where the embedded abstracts/titles of the other Submissions are stored.
All the other parameters are the same as in the affinity scores.
5 changes: 1 addition & 4 deletions expertise/__init__.py
@@ -1,10 +1,7 @@
from .core import *
from .core import load_model
from . import config
from . import dataset
from . import models
from . import preprocess
from . import setup
from . import test
from . import train
from . import utils

2 changes: 1 addition & 1 deletion expertise/config/__init__.py
@@ -1 +1 @@
from .core import *
from .core import ModelConfig
12 changes: 5 additions & 7 deletions expertise/config/__main__.py
@@ -1,6 +1,6 @@
'''
"""

'''
"""
from __future__ import absolute_import

import argparse
@@ -9,18 +9,16 @@
import expertise

parser = argparse.ArgumentParser()
parser.add_argument('model', help=f'select one of {expertise.available_models()}')
parser.add_argument('--outfile', '-o', help='file to write config')
parser.add_argument("model", help=f"select one of {expertise.available_models()}")
parser.add_argument("--outfile", "-o", help="file to write config")

args = parser.parse_args()

config = expertise.config.ModelConfig(model=args.model)

outfile = args.outfile if args.outfile else f'./{args.model}.json'
outfile = args.outfile if args.outfile else f"./{args.model}.json"

experiment_dir = os.path.dirname(os.path.abspath(outfile))

config.update(experiment_dir=experiment_dir)
config.save(outfile)


13 changes: 7 additions & 6 deletions expertise/config/core.py
@@ -6,15 +6,16 @@
import pkgutil
import expertise


class ModelConfig(UserDict):
def __init__(self, **kwargs):
super(UserDict, self).__init__()
if kwargs.get('config_file_path'):
config_file_path = Path(kwargs['config_file_path'])
if kwargs.get("config_file_path"):
config_file_path = Path(kwargs["config_file_path"])
with open(config_file_path) as file_handle:
self.data = json.load(file_handle)
elif kwargs.get('config_dict'):
self.data = kwargs['config_dict']
elif kwargs.get("config_dict"):
self.data = kwargs["config_dict"]

def __repr__(self):
return json.dumps(self.data, indent=4)
@@ -23,8 +24,8 @@ def update(self, **kwargs):
self.data = {**self.data, **kwargs}

def save(self, outfile):
with open(outfile, 'w') as f:
json.dump(self.data, f, indent=4, separators=(',', ': '))
with open(outfile, "w") as f:
json.dump(self.data, f, indent=4, separators=(",", ": "))

def update_from_file(self, file):
config_path = Path(file).resolve()
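
A note on the constructor in this hunk: `super(UserDict, self).__init__()` starts method lookup after `UserDict` in the MRO, so `UserDict.__init__` is not run and `self.data` only exists if one of the two branches assigns it. The sketch below shows one possible alternative under that reading; `ModelConfigSketch` is a hypothetical stand-in that covers only the constructor and `save`, and the rest of the real class is not shown in this diff.

```python
import json
from collections import UserDict
from pathlib import Path


class ModelConfigSketch(UserDict):
    def __init__(self, config_file_path=None, config_dict=None):
        # super().__init__() runs UserDict.__init__, which creates self.data.
        super().__init__()
        if config_file_path:
            with open(Path(config_file_path)) as file_handle:
                self.data = json.load(file_handle)
        elif config_dict:
            # Copy so later updates do not mutate the caller's dict.
            self.data = dict(config_dict)

    def save(self, outfile):
        # Same serialization settings as the save method in the diff.
        with open(outfile, "w") as f:
            json.dump(self.data, f, indent=4, separators=(",", ": "))
```

With this form, a config created without arguments starts from an empty mapping instead of lacking the `data` attribute.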
5 changes: 4 additions & 1 deletion expertise/core.py
@@ -1,11 +1,14 @@
import pkgutil
from . import models


def model_importers():
return {m: i for i, m, _ in pkgutil.iter_modules(models.__path__)}


def available_models():
return [k for k in model_importers().keys()]


def load_model(module_name):
return model_importers()[module_name].find_module(module_name).load_module()
return model_importers()[module_name].find_module(module_name).load_module()
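
The `find_module(...).load_module()` chain used in `load_model` relies on import hooks that are deprecated in recent Python versions. A possible replacement based on `importlib.import_module` is sketched below. It is an illustration rather than a drop-in patch: the functions take the package as a parameter so the example can run against any package, and the module is imported under its fully qualified name rather than the bare module name.

```python
import importlib
import pkgutil


def available_models(package):
    """List the names of the submodules found directly inside a package."""
    return [info.name for info in pkgutil.iter_modules(package.__path__)]


def load_model(package, module_name):
    """Import package.module_name and return the module object."""
    if module_name not in available_models(package):
        raise KeyError(module_name)
    return importlib.import_module(f"{package.__name__}.{module_name}")
```

In this repository the call would presumably look like `load_model(models, "bm25")`, with `models` being the `expertise.models` package imported at the top of `core.py`.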