Releases: data61/anonlink-entity-service
Version 1.15.1
Spring Cleaning Release
Dependency updates
Implemented in #687
Delete upload files on object store after ingestion
If a data provider uploads its data via the object store, we now clean up afterwards.
Implemented in #686
Fixed Record Linkage API tutorial
Adjusted to changes in the clkhash library.
Implemented in #684
Delete encodings from database at project deletion
Encodings will be deleted at project deletion, but only for projects created with this version or higher.
Implemented in #683
Version 1.15.0
Highlights
Similarity scores are deduplicated
Previously candidate pairs that appear in more than one block would produce more than one similarity score.
The iterator that processes similarity scores now de-duplicates them before storing.
Implemented in: #660
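As a rough illustration of the idea (not the service's actual implementation), de-duplication across blocks can be sketched as a generator that remembers which record pairs it has already yielded. The function name and the tuple layout below are assumptions made for this example.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical layout for illustration: (score, record_a, record_b),
# where each record is a (dataset_index, row_index) tuple.
Record = Tuple[int, int]
ScoredPair = Tuple[float, Record, Record]


def deduplicate_scores(scored_pairs: Iterable[ScoredPair]) -> Iterator[ScoredPair]:
    """Yield each candidate pair once, even if it appears in several blocks."""
    seen = set()
    for score, rec_a, rec_b in scored_pairs:
        key = (rec_a, rec_b)
        if key in seen:
            continue  # the same pair was already produced by an earlier block
        seen.add(key)
        yield score, rec_a, rec_b


# The pair ((0, 1), (1, 7)) shows up in two blocks but is kept only once.
pairs = [
    (0.9, (0, 1), (1, 7)),
    (0.8, (0, 2), (1, 3)),
    (0.9, (0, 1), (1, 7)),
]
unique_pairs = list(deduplicate_scores(pairs))
```

Keeping a set of seen pairs costs memory proportional to the number of unique pairs; that trade-off is a property of this sketch, not a statement about how the service does it.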
Provided Block Identifiers are now hashed
We now hash the user-provided block identifier before storing it in the database.
Implemented in: #633
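A minimal sketch of what hashing a block identifier before storage could look like. The release notes do not say which hash function the service uses, so SHA-256 and the helper name `hash_block_id` are assumptions for illustration only.

```python
import hashlib


def hash_block_id(block_id: str) -> str:
    """Return a hex digest to store in place of the raw block identifier."""
    # SHA-256 is an assumption for this sketch; the service may use another hash.
    return hashlib.sha256(block_id.encode("utf-8")).hexdigest()


# Hypothetical block identifier supplied by a data provider.
hashed = hash_block_id("postcode-2000")
```

Because the hash is deterministic, records that share a block identifier still end up in the same block after hashing.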
Failed runs return message indicating the failure reason
The run status for a failed run now includes a `message` attribute with information on what went wrong.
Implemented in: #624
Other changes
The run status endpoint now includes `total_number_of_comparisons` for completed runs.
Implemented in: #651
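The two new status fields could be consumed by a client roughly as follows. The `message` and `total_number_of_comparisons` keys come from the notes above; the `state` key, its `"error"` and `"completed"` values, and the function name are assumptions for this sketch.

```python
def describe_run_status(status: dict) -> str:
    """Summarise a run status payload that has already been parsed from JSON."""
    state = status.get("state")  # assumed field name, not confirmed by the notes
    if state == "error":
        # `message` is only expected on failed runs.
        return f"run failed: {status.get('message', 'no message provided')}"
    if state == "completed":
        total = status.get("total_number_of_comparisons")
        return f"run completed after {total} comparisons"
    return f"run is {state}"


# Hypothetical payloads for illustration.
failed = describe_run_status({"state": "error", "message": "too many candidate pairs"})
done = describe_run_status({"state": "completed", "total_number_of_comparisons": 1200})
```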
As usual, lots of version upgrades: the service now uses the latest stable Redis and PostgreSQL.
Version 1.14.0
Highlights
API now supports directly downloading similarity scores from the internal object store
If the request includes the header `RETURN-OBJECT-STORE-ADDRESS`, the response will be a small JSON payload with
temporary download credentials to pull the binary similarity scores directly from the object store. The JSON object
has `credentials` and `object` keys::
    {
      "credentials": {
        "AccessKeyId": "",
        "SecretAccessKey": "",
        "SessionToken": "",
        "Expiration": "<ISO 8601 datetime string>"
      },
      "object": {
        "endpoint": "<config.DOWNLOAD_OBJECT_STORE_SERVER>",
        "secure": "<config.DOWNLOAD_OBJECT_STORE_SECURE>",
        "bucket": "bucket_name",
        "path": "path"
      }
    }
The binary file is serialized using `anonlink.serialization`. You can convert the stream into Python types with::
    import anonlink.serialization
    from minio import Minio

    # file_info is the "object" part of the JSON payload
    mc = Minio(file_info['endpoint'], ...)
    candidate_pair_stream = mc.get_object(file_info['bucket'], file_info['path'])
    sims, (dset_is0, dset_is1), (rec_is0, rec_is1) = anonlink.serialization.load_candidate_pairs(candidate_pair_stream)
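The temporary credentials in the payload have to reach the object store client somehow. The sketch below maps the payload onto keyword arguments named after the minio-py v7 `Minio` constructor parameters (`endpoint`, `access_key`, `secret_key`, `session_token`, `secure`). The helper name is hypothetical, and because the notes do not state whether `secure` arrives as a boolean or a string, the helper accepts both as an assumption.

```python
def minio_client_kwargs(payload: dict) -> dict:
    """Translate the download payload into keyword arguments for a Minio client."""
    creds = payload["credentials"]
    obj = payload["object"]

    secure = obj["secure"]
    if isinstance(secure, str):
        # Assumption: a string value such as "true" or "false" may be returned.
        secure = secure.strip().lower() in ("true", "1", "yes")

    return {
        "endpoint": obj["endpoint"],
        "access_key": creds["AccessKeyId"],
        "secret_key": creds["SecretAccessKey"],
        "session_token": creds["SessionToken"],
        "secure": bool(secure),
    }


# Made-up values for illustration; real ones come from the API response.
example_payload = {
    "credentials": {
        "AccessKeyId": "AKIAEXAMPLE",
        "SecretAccessKey": "secret",
        "SessionToken": "token",
        "Expiration": "2021-01-01T00:00:00Z",
    },
    "object": {
        "endpoint": "minio.example.org:9000",
        "secure": "false",
        "bucket": "bucket_name",
        "path": "path",
    },
}
client_kwargs = minio_client_kwargs(example_payload)
```

With minio-py installed, `Minio(**client_kwargs)` would then build the client. The credentials expire at the time given in `Expiration`, so the download should start before then.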
The following settings control the optional feature of using an external object store:
======================================= ==========================================
Environment Variable                    Helm Config
======================================= ==========================================
DOWNLOAD_OBJECT_STORE_SERVER            anonlink.objectstore.downloadServer
DOWNLOAD_OBJECT_STORE_SECURE            anonlink.objectstore.downloadSecure
DOWNLOAD_OBJECT_STORE_ACCESS_KEY        anonlink.objectstore.downloadAccessKey
DOWNLOAD_OBJECT_STORE_SECRET_KEY        anonlink.objectstore.downloadSecretKey
DOWNLOAD_OBJECT_STORE_STS_DURATION      - (default 43200 seconds)
======================================= ==========================================
Implemented in: #594, #612, #613, #614
Service now uses sqlalchemy for database migrations
SQLAlchemy models have been added for all database tables, and initial database setup
now uses Alembic for migrations. The database and object store init scripts can now
be run multiple times without causing issues.
New configurable limits on maximum number of candidate pairs
These limits protect the service from running out of memory due to excessive numbers of
candidate pairs being processed. As a side effect, the service now keeps
track of the number of candidate pairs in a run (as well as the number of comparisons).
The limits are controlled by the following two environment variables, shown with their initial
default values::
SOLVER_MAX_CANDIDATE_PAIRS="100_000_000"
SIMILARITY_SCORES_MAX_CANDIDATE_PAIRS="500_000_000"
If a run exceeds these limits, the run is put into an error state and further processing is
abandoned.
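The default values use underscores as digit separators, which Python's `int()` accepts directly. A minimal sketch of reading one of these limits and checking a candidate pair count against it follows. The helper names are hypothetical, and the sketch does not model which limit the service applies to which kind of run.

```python
import os
from typing import Mapping

# Defaults as listed in the release notes.
DEFAULT_LIMITS = {
    "SOLVER_MAX_CANDIDATE_PAIRS": "100_000_000",
    "SIMILARITY_SCORES_MAX_CANDIDATE_PAIRS": "500_000_000",
}


def get_limit(name: str, environ: Mapping[str, str] = os.environ) -> int:
    """Read a limit from the environment, falling back to the documented default."""
    return int(environ.get(name, DEFAULT_LIMITS[name]))  # int() accepts "100_000_000"


def exceeds_limit(candidate_pairs: int, name: str,
                  environ: Mapping[str, str] = os.environ) -> bool:
    """Return True when a run should be moved to the error state (hypothetical check)."""
    return candidate_pairs > get_limit(name, environ)


# Example with an explicit environment mapping that overrides one limit.
env = {"SOLVER_MAX_CANDIDATE_PAIRS": "1_000"}
too_many = exceeds_limit(1_001, "SOLVER_MAX_CANDIDATE_PAIRS", env)
default_limit = get_limit("SIMILARITY_SCORES_MAX_CANDIDATE_PAIRS", {})
```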
Other changes
- Ingress now supports a user supplied path. We no longer assume an nginx ingress controller. #587
- Migrate off deprecated k8s chart repos #596, #588
- Helm chart now uses standard recommended Kubernetes labels. #616
- Fix an issue with case sensitivity in object store metadata #590
- If the object store bucket doesn't exist it is now automatically created. #577
- Ignore but log failures to delete from object store #576
- Many dependency updates #578, #579, #580, #582, #581, #583, #596, #604, #609, #615
- Updated the base image and all base dependencies, and migrated from minio-py v5 to v7 #601, #608, #610
- CI e2e tests on Kubernetes will now correctly fail if the tests don't run. #618
- Add optional pod annotations to init jobs. #619
Version 1.13.0
Highlights
- The entity service now supports user-provided blocking information. This can significantly reduce the number of required comparisons and thus allows for linkages between larger datasets.
- The server can be configured to use an object store for dataset uploads. This allows the use of libraries such as boto3 or minio to improve reliability, especially for large uploads.
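To see why blocking reduces the number of comparisons, the sketch below counts comparisons for two datasets with and without blocks. It assumes each record belongs to exactly one block, which is a simplification for illustration; the function names are hypothetical and this is not the service's own calculation.

```python
from collections import Counter
from typing import Sequence


def comparisons_without_blocking(size_a: int, size_b: int) -> int:
    """Every record in A is compared with every record in B."""
    return size_a * size_b


def comparisons_with_blocking(blocks_a: Sequence[str], blocks_b: Sequence[str]) -> int:
    """Only records that share a block identifier are compared.

    blocks_a[i] is the block of record i in dataset A (one block per record,
    an assumption made for this sketch).
    """
    count_a = Counter(blocks_a)
    count_b = Counter(blocks_b)
    shared = count_a.keys() & count_b.keys()
    return sum(count_a[block] * count_b[block] for block in shared)


blocks_a = ["x", "x", "y"]
blocks_b = ["x", "y", "y", "z"]
full = comparisons_without_blocking(len(blocks_a), len(blocks_b))
blocked = comparisons_with_blocking(blocks_a, blocks_b)
```

In this toy example the full comparison needs 12 comparisons while the blocked one needs 4; the saving grows with dataset size when blocks are small.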
Docker Images
data61/anonlink-app:v1.13.0
data61/anonlink-nginx:v1.4.6
data61/anonlink-benchmark:v0.3.3
Breaking Changes
- The `similarity_score` output type has been modified. It now returns a JSON array of JSON objects, where each object looks like `[[party_id_0, row_index_0], [party_id_1, row_index_1], score]`. #464
- Integration test configuration is now consistent with the benchmark config. Instead of setting `ENTITY_SERVICE_URL` including `/api/v1`, now just set the host address in `SERVER`. #495
- The `matching` output type was removed. Use the equivalent `groups` instead. #458
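A client reading the new `similarity_score` format might unpack each entry as sketched below. The sketch assumes the array of `[[party_id_0, row_index_0], [party_id_1, row_index_1], score]` entries has already been extracted from the response; the `SimilarityEntry` type and the function name are hypothetical.

```python
import json
from typing import List, NamedTuple


class SimilarityEntry(NamedTuple):
    party_a: int
    row_a: int
    party_b: int
    row_b: int
    score: float


def parse_similarity_scores(text: str) -> List[SimilarityEntry]:
    """Parse a JSON array of [[party, row], [party, row], score] entries."""
    entries = []
    for (party_a, row_a), (party_b, row_b), score in json.loads(text):
        entries.append(SimilarityEntry(party_a, row_a, party_b, row_b, score))
    return entries


# Made-up sample data in the documented shape.
sample = "[[[0, 5], [1, 12], 0.93], [[0, 8], [1, 2], 0.87]]"
entries = parse_similarity_scores(sample)
```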
Other Changes
Version 1.13.0-beta3
Version 1.13.0-beta2
Adds support for users to supply blocking information along with encodings. Data can now be uploaded to
an object store and pulled by the Anonlink Entity Service instead of uploaded via the REST API.
This release includes substantial internal changes as encodings are now stored in Postgres instead of
the object store.
- Feature to pull data from an object store and create temporary upload credentials. #537, #544, #551
- Blocking implementation #510, #527
- Benchmark container now includes support for blocking #478, #541
- Encodings are now stored in Postgres database instead of files in an object store. #516, #522
- Start to add integration tests to complement our end to end tests. #520, #528
- Use anonlink-client instead of clkhash #536
- Use Python 3.8 in base image. #518
- A base image is now used for all our Docker images. #506, #511, #517, #519
- Binary encodings now stored internally with their encoding id. #505
- REST API implementation for accepting clknblocks #503
- Update Open API spec to version 3. Add Blocking API #479
- CI Updates #476
- Chart updates #496, #497, #539
- Documentation updates (production deployment, debugging with PyCharm) #473, #504
- Fix Jaeger #500, #523
Misc changes/fixes:
- Detect invalid encoding size as early as possible #507
- Use local benchmark cache #531
- Cleanup docker-compose #533, #534, #547
- Calculate number of comparisons accounting for user supplied blocks. #543
Try it out
You can pull this repository and try with Docker Compose. The Docker images are all hosted on Docker Hub:
Component | Docker Hub |
---|---|
Base Image | data61/anonlink-base |
Backend/Worker | data61/anonlink-app |
E2E Tests | data61/anonlink-test |
Nginx Proxy | data61/anonlink-nginx |
Benchmark | data61/anonlink-benchmark |
Docs | data61/anonlink-docs-builder |
Using Kubernetes (follow the detailed docs):
helm repo add data61 https://data61.github.io/charts
helm repo update
helm install data61/entity-service --version 1.13.1 [--values...]
All the documentation, including tutorials can be found at https://anonlink-entity-service.readthedocs.io/en/latest/index.html
v1.13.0-beta
- Fixed a bug where a dataprovider could upload their clks multiple times in a project using the same upload token. (#463)
- Fixed a bug where workers accepted work after failing to initialize their database connection pool. (#477)
- Modified `similarity_score` output to follow the group format in preparation for extending this output type to more parties. (#464)
- Tutorials have been improved following an internal review. (#467)
- Database schema and CLK upload API have been modified to support blocking. (#470)
- Benchmarking results can now be saved to an object store without authentication, allowing an AWS user to save to S3 using node permissions. (#490)
- Removed duplicate/redundant tests. (#466)
- Updated dependencies:
Breaking Changes
- The `similarity_score` output type has been modified. It now returns a JSON array of JSON objects, where each object looks like `[[party_id_0, row_index_0], [party_id_1, row_index_1], score]`. (#464)
- Integration test configuration is now consistent with the benchmark config. Instead of setting `ENTITY_SERVICE_URL` including `/api/v1`, now just set the host address in `SERVER`. (#495)
Database Changes (Internal)
- The `dataproviders` table `uploaded` field has been modified from a BOOL to an ENUM type. (#463)
- The `projects` table has a new `uses_blocking` field. (#470)
Docker Images
data61/anonlink-app:v1.13.0-beta
data61/anonlink-nginx:v1.4.6-beta
data61/anonlink-benchmark:v0.3.1
Install to Kubernetes using the helm chart:
helm repo add data61 https://data61.github.io/charts
helm repo update
helm install data61/entity-service [--values...]
v1.13.0-alpha
- Fixed a bug where invalid state changes could occur when starting a run. (#459)
- The `matching` output type has been removed as redundant with the `groups` output with 2 parties. (#458)
- Update dependencies:
  - requests from 2.21.0 to 2.22.0 (#459)
Breaking Change
The `matching` output type is no longer available. (#458)
v1.12.0
Created docker images:
data61/anonlink-app:v1.12.0
data61/anonlink-nginx:v1.4.5
data61/anonlink-benchmark:v0.3.0
Changelog:
- Logging is configurable in the deployed entity service by using the key `loggingCfg`. (#448)
- Several old settings have been removed from the default values.yaml and docker files and replaced by `CHUNK_SIZE_AIM` (#414):
  - `SMALL_COMPARISON_CHUNK_SIZE`
  - `LARGE_COMPARISON_CHUNK_SIZE`
  - `SMALL_JOB_SIZE`
  - `LARGE_JOB_SIZE`
- Remove `ENTITY_MATCH_THRESHOLD` environment variable. (#444)
- Celery configuration updates to solve threads and memory leaks in deployment. (#427)
- Update docker-compose files to use these new preferred configurations.
- Update helm charts with preferred configuration; the default deployment is a minimal working deployment.
- New environment variables `CELERY_DB_MIN_CONNECTIONS`, `FLASK_DB_MIN_CONNECTIONS`, `CELERY_DB_MAX_CONNECTIONS` and `FLASK_DB_MAX_CONNECTIONS` to configure the database connection pool. (#405)
- Simplify access to the database from services by relying on a single way to get a connection via a connection pool. (#405)
- Deleting a run is now implemented. (#413)
- Added some missing documentation about the output type `groups`. (#449)
- Sentinel name is configurable. (#436)
- Improvements to the Kubernetes deployment test stage on Azure DevOps:
  - Re-order cleaning steps to first purge the deployment and then delete the remaining. (#426)
  - Run integration tests in parallel, reducing pipeline stage `Kubernetes deployment tests` from 30 minutes to 15 minutes. (#438)
  - Tests running on a deployed entity-service on k8s create an artifact containing all the logs of all the containers, useful for debugging. (#445)
  - Test container not restarted on test failure. (#434)
- Benchmark improvements:
- Improvements on Redis cache:
- Update dependencies:
- Add some release documentation. (#455)
v1.12 pre release
We are creating this tag to be able to deploy an entity-service having all the necessary configurations introduced in develop required for our testing service on kubernetes.