Add API endpoints for search #262

Closed. Wants to merge 7 commits.

Conversation

@jrwdunham (Contributor) commented Nov 10, 2017

26 May 2020: Review of current status against stable/1.11.x

  • Developed against Archivematica 1.7 qa branch (1.7 was released some months later).
  • The second PREMIS reader/writer module (yaprw) was never merged. It would need to be merged, or the existing metsrw PREMIS capabilities swapped in.
  • The PR associated with yaprw looks like it may also provide enhanced Dublin Core (DC) handling.
  • Only 9 .py files are updated in this PR; the conflicting files listed below by GitHub are:
     • storage_service/locations/api/resources.py
     • storage_service/locations/api/urls.py
     • storage_service/locations/models/event.py
     • storage_service/locations/models/package.py
     • storage_service/locations/tests/test_package.py
  • Additionally, the requirements management process has changed across versions.
  • Documentation with examples: rendered text.
  • File endpoint example.
  • Additional examples inline in this PR.

9 November 2017: Original PR

  • Locations, packages and files can all now be searched via GET requests to:
    • http://<storage service URL>/api/v2/search/location/
    • http://<storage service URL>/api/v2/search/package/
    • http://<storage service URL>/api/v2/search/file/
  • See documentation at docs/search.rst.
  • Includes migration to File model.
  • Uses Django REST Framework (new requirement); a rough registration sketch follows below.
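
As an illustration of the DRF dependency (a hedged sketch, not this PR's actual code; the module path and class name are hypothetical), endpoints like the above are typically wired up by registering a ViewSet with a DRF router:

    # Sketch only: FileViewSet and its import path are illustrative placeholders.
    from django.conf.urls import include, url
    from rest_framework import routers

    from locations.api.search import FileViewSet  # hypothetical module

    router = routers.DefaultRouter()
    router.register(r'file', FileViewSet)

    urlpatterns = [
        url(r'^api/v2/search/', include(router.urls)),
    ]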

For the requirements that motivated this work, see https://wiki.archivematica.org/Research_data_management#METS_parsing.

Note: much of this work comes from @mcantelon's PR #89. Since that PR is quite old and required extensive rebasing, its commits were squashed and brought into this PR, and its code review comments, where applicable, were transferred here.

Note 2: when metsrw v. 0.2.1 is released, requirements/base.txt will need to be updated to reference it. See artefactual-labs/mets-reader-writer#34.
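
For example, the pin would presumably be a one-line addition along these lines once the release exists (illustrative only):

    # requirements/base.txt
    metsrw==0.2.1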

Note 3: the asynchronous processing of the original PR #89 has been removed. See:

    p = Process(target=bundle.obj.index_file_data_from_aip_mets)
    p.start()

This was presumably failing due to a conflict with gevent. Logs:

    [2017-11-09 10:46:09 +0000] [32276] [INFO] Parent changed, shutting down: <Worker 28682>
    [2017-11-09 10:46:09 +0000] [32276] [INFO] Worker exiting (pid: 28682)

Fixes #261

Connected to #261

source_id = models.TextField(max_length=128)
source_package = models.TextField(blank=True,
    help_text=_l("Unique identifier of originating unit"))
size = models.IntegerField(default=0, help_text='Size in bytes of the file')
A Member commented:

i18n!
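
Presumably this refers to the size field's help_text not being marked for translation like the neighbouring fields. A minimal sketch of the fix, assuming the module aliases ugettext_lazy as _l, as the surrounding code suggests:

    from django.db import models
    from django.utils.translation import ugettext_lazy as _l

    class File(models.Model):
        # ...other fields as in the diff above...
        # Wrapping the string in _l() marks it for translation.
        size = models.IntegerField(default=0, help_text=_l('Size in bytes of the file'))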

@sevein (Member) left a comment:

@jrwdunham, I understand why we wanted to start using django-rest-framework, but do you know if the work in this PR had a particular requirement on the new framework? Some context here would be good. If we're going to adopt a new framework and they are going to coexist for a while, we should probably explain the motivations.

Is django-rest-framework different enough that it should be conceived in a new /api/v3 namespace, maybe? tastypie exposes details about the resources to allow discovery (e.g. http://django-tastypie.readthedocs.io/en/latest/interacting.html#api-wide or http://django-tastypie.readthedocs.io/en/latest/interacting.html#inspecting-the-resource-s-schema). Is this going to work well when sharing the same namespace with another framework? And other things like pagination, etc.: is it close enough? If they're not, I think we're going to make it very difficult to implement clients, and that's not good.

@jrwdunham (Contributor, Author) commented:

@sevein, I don't know the exact motivations for using DRF for this work. At a high level, I suspect it is because DRF won out over TastyPie historically.

One of the goals of this work (in this later stage) was to "Evaluate if this allows us to remove tastypie in favour of Django REST framework (yes/no/maybe?/both)". See https://wiki.archivematica.org/Improvements/Reporting. After that question it currently states "done, answer is both".

Ultimately this work needs to allow the client to be able to answer the following (types of) questions using the SS API:

  • How many files do I have with a given PRONOM id?
  • How many files do I have with a given PRONOM id that were ingested between date1 and date2?
  • How many files do I have that are ISO disk images?

I think this PR does that.

Given the small scope of the project behind this PR, I think we will have to deliver to the client a dev branch that they can test. That means leaving this PR unmerged until we can find the time to replace TastyPie with DRF or at least write tests to ensure that these changes provide a consistent API and do not break the current TastyPie API. Thoughts? @jhs @mcantelon ?

* enabled (whether the location is enabled)

For example, if you wanted to get details about the transfer source location
contained in the space 6d0b6cce-4372-4ef8-bf48-ce642761fd41 you could HTTP get::
@jrwdunham (Contributor, Author) commented:

I noticed that if you pass a nonsense param to one of these search endpoints, you get all of the resources. Is that expected/desired? For example http://192.168.168.192:8000/api/v2/search/location/?moonbeam=figaro returns all locations. As does http://192.168.168.192:8000/api/v2/search/location/
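
For context, this is an assumption about the cause rather than something verified on this branch: django-filter's FilterSet applies only the filters it declares and silently ignores unrecognised query parameters, so an otherwise-empty filter returns everything. If rejecting unknown parameters were wanted, a hypothetical approach would be:

    # Hypothetical strict-parameter check in a DRF ViewSet; not part of this PR.
    from rest_framework import status, viewsets
    from rest_framework.response import Response

    class LocationSearchViewSet(viewsets.ReadOnlyModelViewSet):
        allowed_params = {'uuid', 'enabled', 'page'}  # illustrative names only

        def list(self, request, *args, **kwargs):
            unknown = set(request.query_params) - self.allowed_params
            if unknown:
                return Response(
                    {'detail': 'Unknown query parameters: %s' % ', '.join(sorted(unknown))},
                    status=status.HTTP_400_BAD_REQUEST)
            return super(LocationSearchViewSet, self).list(request, *args, **kwargs)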

@jrwdunham (Contributor, Author) left a comment:

I have copied comments from the original #89 PR to the relevant places in this one. I have cited the reviewers if they are not me.

* max_size (maximum filesize)
* normalized (boolean: whether or not file was normalized)
* valid (nullable boolean: whether or not file was validated and, if so, its
validity)
@jrwdunham (Contributor, Author) commented:

I changed the original validated field to valid, a nullable boolean field where None indicates not validated, True indicates valid, and False indicates invalid. To see how this used to work, see https://github.com/artefactual/archivematica-storage-service/pull/89/files.
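
A sketch of what the tri-state field might look like (illustrative only, using the Django 1.x-era NullBooleanField; the actual definition in this PR may differ):

    from django.db import models

    class File(models.Model):
        # None = not yet validated, True = validated and valid,
        # False = validated and invalid.
        valid = models.NullBooleanField(default=None)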

filter_backends = (filters.DjangoFilterBackend,)
filter_class = FileFilter

@list_route(methods=['get'])
@jrwdunham (Contributor, Author) commented:

How is this used? TODO: find out.
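
For reference, @list_route is DRF's pre-3.8 decorator for adding an extra collection-level route to a ViewSet alongside the standard list/detail routes (it was later replaced by @action(detail=False)). A generic sketch, with a method name and body that are illustrative rather than taken from this PR:

    from rest_framework import viewsets
    from rest_framework.decorators import list_route
    from rest_framework.response import Response

    class FileViewSet(viewsets.ReadOnlyModelViewSet):
        # ...queryset, serializer_class and filter settings as in the diff above...

        @list_route(methods=['get'])
        def count(self, request):
            # Served at <prefix>/count/ in addition to the normal list route.
            return Response({'count': self.filter_queryset(self.get_queryset()).count()})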

"""
aip_dir_name = os.path.basename(os.path.splitext(self.full_path)[0])
relative_path = os.path.join(aip_dir_name, "data", "METS." + self.uuid + ".xml")
path_to_mets, temp_dir = self.extract_file(relative_path)
@jrwdunham (Contributor, Author) commented:

Comment from @Hwesta: Has this been tested with compressed & uncompressed AIPs? Does it work with AIPs not stored in the local filesystem? Since this is happening asynchronously, what happens when the AIP storage operation completes during this thread? E.g. self.full_path may not be accurate, or may change partway through.

@jrwdunham (Contributor, Author) commented:

For the original requirements that motivated this search API work, see https://wiki.archivematica.org/Research_data_management#METS_parsing.

@sromkey (Contributor) commented Nov 22, 2017

@jrwdunham can you create a sample curl command and JSON response that we can use for testing? Suggested questions:

  • How many files do I have with a given PRONOM id?
    • ...that were ingested between date1 and date2?
  • How many files do I have that are ISO disk images?

@jrwdunham added this to the 0.12.0 milestone Nov 29, 2017
@jrwdunham (Contributor, Author) commented Nov 30, 2017

@sromkey Here are some example searches. Something like this should probably be added to search.rst. (These assume you have curl and jq installed).

  1. Show me files with PRONOM id fmt/19. Note we have to escape the forward slash
    in the PRONOM id using %2F:

     curl http://127.0.0.1:62081/api/v2/search/file/?pronom_id=fmt%2F19 | jq
     {
       "count": 2,
       "next": null,
       "previous": null,
       "results": [
         {
           "uuid": "5b32e493-1e6a-4169-b248-f25fec387cff",
           "name": "wed4-25608bd2-cb70-4bce-b16f-4cb61ec2c3fb/objects/BBhelmet.ai",
           "file_type": "AIP",
           "size": 1080282,
           "format_name": "Acrobat PDF 1.5 - Portable Document Format",
           "pronom_id": "fmt/19",
           "pipeline": "b5fa8caf-aa94-4ed7-b38a-1bb12654e498",
           "source_package": "",
           "normalized": false,
           "valid": true,
           "ingestion_time": "2017-11-30T05:48:27Z"
         },
         {
           "uuid": "d02de26f-fd1c-4ee1-a6f4-3921c87611ef",
           "name": "wed5-a6db30b9-8606-4f60-a1d7-6e3d6a0653a1/objects/BBhelmet.ai",
           "file_type": "AIP",
           "size": 1080282,
           "format_name": "Acrobat PDF 1.5 - Portable Document Format",
           "pronom_id": "fmt/19",
           "pipeline": "b5fa8caf-aa94-4ed7-b38a-1bb12654e498",
           "source_package": "",
           "normalized": false,
           "valid": true,
           "ingestion_time": "2017-11-25T06:03:10Z"
         }
       ]
     }
    

    If we just want the number of files with PRONOM id fmt/19, we access the count
    attribute of the returned JSON object:

     curl http://127.0.0.1:62081/api/v2/search/file/?pronom_id=fmt%2F19 | jq '.count'
     2
    
  2. How many files do I have with PRONOM id fmt/19 that were ingested between 2017-11-24 and 2017-11-26?

     curl "http://127.0.0.1:62081/api/v2/search/file/?pronom_id=fmt%2F19&ingestion_time_at_or_before=2017-11-26&ingestion_time_at_or_after=2017-11-24" | jq '.count'
     1
    
  3. How many files do I have with PRONOM id fmt/19 that were ingested between 2017-11-24 and 2017-11-30?

     curl "http://127.0.0.1:62081/api/v2/search/file/?pronom_id=fmt%2F19&ingestion_time_at_or_before=2017-11-30&ingestion_time_at_or_after=2017-11-24" | jq '.count'
     2
    
  4. How many files do I have that are ISO disk images?

     curl "http://127.0.0.1:62081/api/v2/search/file/?pronom_id=fmt%2F468" | jq '.count'
     1
    

@sevein removed this from the 0.12.0 milestone Dec 15, 2017
@jrwdunham (Contributor, Author) commented:

I rebased this against qa/0.x and added a commit that makes AIP file indexing work with uncompressed AIPs.

@sromkey (Contributor) commented Jan 25, 2018

It looks like the UUID returned for file queries is one newly minted by the Storage Service when it creates the file in the database. It would be more meaningful to return the source_id, i.e. the UUID created by Archivematica and recorded in the METS file.
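
If that change were made, the simplest form would presumably be a serializer-level swap; a hypothetical sketch (the import path and field list are assumptions based on the excerpts and JSON examples in this thread):

    from rest_framework import serializers

    from locations.models import File  # hypothetical import path

    class FileSerializer(serializers.ModelSerializer):
        class Meta:
            model = File
            # 'source_id' (the Archivematica UUID recorded in the METS) replaces
            # the Storage Service's own 'uuid' in the API response.
            fields = ('source_id', 'name', 'file_type', 'size', 'format_name',
                      'pronom_id', 'pipeline', 'source_package', 'normalized',
                      'valid', 'ingestion_time')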

Prior to this the `index_file_data_from_aip_mets` method of `Package`
was assuming that the package was compressed. Now it correctly parses
the METS file of uncompressed AIPs as well.
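
A rough sketch of the kind of branching this implies (an assumption about the approach, not the commit's actual code):

    import os

    def mets_relative_path(full_path, uuid, is_compressed):
        """Return the path of METS.<uuid>.xml relative to the package root."""
        if is_compressed:
            # Compressed AIP: drop the archive extension to recover the AIP
            # directory name stored inside the archive.
            aip_dir_name = os.path.basename(os.path.splitext(full_path)[0])
        else:
            # Uncompressed AIP: full_path already points at the AIP directory.
            aip_dir_name = os.path.basename(full_path)
        return os.path.join(aip_dir_name, 'data', 'METS.' + uuid + '.xml')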
@jrwdunham mentioned this pull request Jun 5, 2018
@sallain added the Status: in progress label and removed the work-in-progress label Jul 17, 2018
@ross-spencer removed the Status: in progress label May 26, 2020
@sevein added the stalled label Mar 8, 2021
@sevein closed this Mar 25, 2021