Add API endpoints for search #262

Closed. Wants to merge 7 commits.

Conversation

@jrwdunham (Contributor) commented Nov 10, 2017

26 May 2020: Review of current status against stable/1.11.x

  • Developed against Archivematica 1.7 qa branch (1.7 was released some months later).
  • The second PREMIS reader/writer module (yaprw) was never merged. It would need to be merged, or the existing metsrw PREMIS capabilities swapped in.
  • The PR associated with yaprw looks like it may also provide enhanced Dublin Core (DC) handling.
  • Only 9 .py files are updated in this PR; the conflicting files listed below by GitHub are:
     • storage_service/locations/api/resources.py
     • storage_service/locations/api/urls.py
     • storage_service/locations/models/event.py
     • storage_service/locations/models/package.py
     • storage_service/locations/tests/test_package.py
  • Additionally, the requirements management process has changed across versions.
  • Documentation with examples: rendered text.
  • File endpoint example.
  • Additional examples inline in this PR.

9 November 2017: Original PR

  • Locations, packages and files can all now be searched via GET requests to:
    • http://<storage service URL>/api/v2/search/location/
    • http://<storage service URL>/api/v2/search/package/
    • http://<storage service URL>/api/v2/search/file/
  • See documentation at docs/search.rst.
  • Includes migration to File model.
  • Uses Django REST Framework (new requirement); a rough registration sketch follows below.
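
As an illustration of the DRF dependency (a hedged sketch, not this PR's actual code; the module path and class name are hypothetical), endpoints like the above are typically wired up by registering a ViewSet with a DRF router:

    # Sketch only: FileViewSet and its import path are illustrative placeholders.
    from django.conf.urls import include, url
    from rest_framework import routers

    from locations.api.search import FileViewSet  # hypothetical module

    router = routers.DefaultRouter()
    router.register(r'file', FileViewSet)

    urlpatterns = [
        url(r'^api/v2/search/', include(router.urls)),
    ]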

For the requirements that motivated this work, see https://wiki.archivematica.org/Research_data_management#METS_parsing.

Note: much of this work comes from @mcantelon's PR #89. Since that PR is quite old and required extensive rebasing, its commits were squashed and brought into this PR, and its code review comments, where applicable, were transferred here.

Note 2: when metsrw v. 0.2.1 is released, requirements/base.txt will need to be updated to reference it. See artefactual-labs/mets-reader-writer#34.
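
For example, the pin would presumably be a one-line addition along these lines once the release exists (illustrative only):

    # requirements/base.txt
    metsrw==0.2.1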

Note 3: the asynchronous processing of the original PR #89 has been removed. See:

    p = Process(target=bundle.obj.index_file_data_from_aip_mets)
    p.start()

This was presumably failing due to a conflict with gevent. Logs:

    [2017-11-09 10:46:09 +0000] [32276] [INFO] Parent changed, shutting down: <Worker 28682>
    [2017-11-09 10:46:09 +0000] [32276] [INFO] Worker exiting (pid: 28682)

Fixes #261

Connected to #261

source_id = models.TextField(max_length=128)
source_package = models.TextField(blank=True,
    help_text=_l("Unique identifier of originating unit"))
size = models.IntegerField(default=0, help_text='Size in bytes of the file')
A Member commented:

i18n!
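
Presumably this refers to the size field's help_text not being marked for translation like the neighbouring fields. A minimal sketch of the fix, assuming the module aliases ugettext_lazy as _l, as the surrounding code suggests:

    from django.db import models
    from django.utils.translation import ugettext_lazy as _l

    class File(models.Model):
        # ...other fields as in the diff above...
        # Wrapping the string in _l() marks it for translation.
        size = models.IntegerField(default=0, help_text=_l('Size in bytes of the file'))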

@sevein (Member) left a comment:

@jrwdunham, I understand why we wanted to start using django-rest-framework, but do you know if the work in this PR had a particular requirement on the new framework? Some context here would be good. If we're going to adopt a new framework and they are going to coexist for a while, we should probably explain the motivations.

Is django-rest-framework different enough that it should be conceived in a new /api/v3 namespace, maybe? tastypie exposes details about the resources to allow discovery (e.g. http://django-tastypie.readthedocs.io/en/latest/interacting.html#api-wide or http://django-tastypie.readthedocs.io/en/latest/interacting.html#inspecting-the-resource-s-schema). Is this going to work well when sharing the same namespace with another framework? And other things like pagination, etc.: is it close enough? If they're not, I think we're going to make it very difficult to implement clients, and that's not good.

@jrwdunham (Contributor, Author) commented:

@sevein, I don't know the exact motivations for using DRF for this work. At a high level, I suspect it is because DRF won out over TastyPie historically.

One of the goals of this work (in this later stage) was to "Evaluate if this allows us to remove tastypie in favour of Django REST framework (yes/no/maybe?/both)". See https://wiki.archivematica.org/Improvements/Reporting. After that question it currently states "done, answer is both".

Ultimately this work needs to allow the client to be able to answer the following (types of) questions using the SS API:

  • How many files do I have with a given PRONOM id?
  • How many files do I have with a given PRONOM id that were ingested between date1 and date2?
  • How many files do I have that are ISO disk images?

I think this PR does that.

Given the small scope of the project behind this PR, I think we will have to deliver to the client a dev branch that they can test. That means leaving this PR unmerged until we can find the time to replace TastyPie with DRF or at least write tests to ensure that these changes provide a consistent API and do not break the current TastyPie API. Thoughts? @jhs @mcantelon ?

* enabled (whether the location is enabled)

For example, if you wanted to get details about the transfer source location
contained in the space 6d0b6cce-4372-4ef8-bf48-ce642761fd41 you could HTTP get::
@jrwdunham (Contributor, Author) commented:

I noticed that if you pass a nonsense param to one of these search endpoints, you get all of the resources. Is that expected/desired? For example http://192.168.168.192:8000/api/v2/search/location/?moonbeam=figaro returns all locations. As does http://192.168.168.192:8000/api/v2/search/location/
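
For context, this is an assumption about the cause rather than something verified on this branch: django-filter's FilterSet applies only the filters it declares and silently ignores unrecognised query parameters, so an otherwise-empty filter returns everything. If rejecting unknown parameters were wanted, a hypothetical approach would be:

    # Hypothetical strict-parameter check in a DRF ViewSet; not part of this PR.
    from rest_framework import status, viewsets
    from rest_framework.response import Response

    class LocationSearchViewSet(viewsets.ReadOnlyModelViewSet):
        allowed_params = {'uuid', 'enabled', 'page'}  # illustrative names only

        def list(self, request, *args, **kwargs):
            unknown = set(request.query_params) - self.allowed_params
            if unknown:
                return Response(
                    {'detail': 'Unknown query parameters: %s' % ', '.join(sorted(unknown))},
                    status=status.HTTP_400_BAD_REQUEST)
            return super(LocationSearchViewSet, self).list(request, *args, **kwargs)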

@jrwdunham (Contributor, Author) left a comment:

I have copied comments from the original #89 PR to the relevant places in this one. I have cited the reviewers if they are not me.

* max_size (maximum filesize)
* normalized (boolean: whether or not file was normalized)
* valid (nullable boolean: whether or not file was validated and, if so, its
validity)
@jrwdunham (Contributor, Author) commented:

I changed the original validated field to valid, a nullable boolean field where None indicates not validated, True indicates valid, and False indicates invalid. To see how this used to work, see https://github.com/artefactual/archivematica-storage-service/pull/89/files.
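
A sketch of what the tri-state field might look like (illustrative only, using the Django 1.x-era NullBooleanField; the actual definition in this PR may differ):

    from django.db import models

    class File(models.Model):
        # None = not yet validated, True = validated and valid,
        # False = validated and invalid.
        valid = models.NullBooleanField(default=None)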

filter_backends = (filters.DjangoFilterBackend,)
filter_class = FileFilter

@list_route(methods=['get'])
@jrwdunham (Contributor, Author) commented:

How is this used? TODO: find out.
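
For reference, @list_route is DRF's pre-3.8 decorator for adding an extra collection-level route to a ViewSet alongside the standard list/detail routes (it was later replaced by @action(detail=False)). A generic sketch, with a method name and body that are illustrative rather than taken from this PR:

    from rest_framework import viewsets
    from rest_framework.decorators import list_route
    from rest_framework.response import Response

    class FileViewSet(viewsets.ReadOnlyModelViewSet):
        # ...queryset, serializer_class and filter settings as in the diff above...

        @list_route(methods=['get'])
        def count(self, request):
            # Served at <prefix>/count/ in addition to the normal list route.
            return Response({'count': self.filter_queryset(self.get_queryset()).count()})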

"""
aip_dir_name = os.path.basename(os.path.splitext(self.full_path)[0])
relative_path = os.path.join(aip_dir_name, "data", "METS." + self.uuid + ".xml")
path_to_mets, temp_dir = self.extract_file(relative_path)
@jrwdunham (Contributor, Author) commented:

Comment from @Hwesta: Has this been tested with compressed & uncompressed AIPs? Does it work with AIPs not stored in the local filesystem? Since this is happening asynchronously, what happens when the AIP storage operation completes during this thread? E.g. self.full_path may not be accurate, or may change partway through.

@jrwdunham (Contributor, Author) commented:

For the original requirements that motivated this search API work, see https://wiki.archivematica.org/Research_data_management#METS_parsing.

@sromkey (Contributor) commented Nov 22, 2017

@jrwdunham can you create a sample curl command and JSON response that we can use for testing? Suggested questions:

  • How many files do I have with a given PRONOM id?
    • ...that were ingested between date1 and date2?
  • How many files do I have that are ISO disk images?

@jrwdunham added this to the 0.12.0 milestone Nov 29, 2017
@jrwdunham (Contributor, Author) commented Nov 30, 2017

@sromkey Here are some example searches. Something like this should probably be added to search.rst. (These assume you have curl and jq installed).

  1. Show me files with PRONOM id fmt/19. Note we have to escape the forward slash
    in the PRONOM id using %2F:

     curl http://127.0.0.1:62081/api/v2/search/file/?pronom_id=fmt%2F19 | jq
     {
       "count": 2,
       "next": null,
       "previous": null,
       "results": [
         {
           "uuid": "5b32e493-1e6a-4169-b248-f25fec387cff",
           "name": "wed4-25608bd2-cb70-4bce-b16f-4cb61ec2c3fb/objects/BBhelmet.ai",
           "file_type": "AIP",
           "size": 1080282,
           "format_name": "Acrobat PDF 1.5 - Portable Document Format",
           "pronom_id": "fmt/19",
           "pipeline": "b5fa8caf-aa94-4ed7-b38a-1bb12654e498",
           "source_package": "",
           "normalized": false,
           "valid": true,
           "ingestion_time": "2017-11-30T05:48:27Z"
         },
         {
           "uuid": "d02de26f-fd1c-4ee1-a6f4-3921c87611ef",
           "name": "wed5-a6db30b9-8606-4f60-a1d7-6e3d6a0653a1/objects/BBhelmet.ai",
           "file_type": "AIP",
           "size": 1080282,
           "format_name": "Acrobat PDF 1.5 - Portable Document Format",
           "pronom_id": "fmt/19",
           "pipeline": "b5fa8caf-aa94-4ed7-b38a-1bb12654e498",
           "source_package": "",
           "normalized": false,
           "valid": true,
           "ingestion_time": "2017-11-25T06:03:10Z"
         }
       ]
     }
    

    If we just want the number of files with PRONOM id fmt/19, we access the count
    attribute of the returned JSON object:

     curl http://127.0.0.1:62081/api/v2/search/file/?pronom_id=fmt%2F19 | jq '.count'
     2
    
  2. How many files do I have with PRONOM id fmt/19 that were ingested between 2017-11-24 and 2017-11-26?

     curl "http://127.0.0.1:62081/api/v2/search/file/?pronom_id=fmt%2F19&ingestion_time_at_or_before=2017-11-26&ingestion_time_at_or_after=2017-11-24" | jq '.count'
     1
    
  3. How many files do I have with PRONOM id fmt/19 that were ingested between 2017-11-24 and 2017-11-30?

     curl "http://127.0.0.1:62081/api/v2/search/file/?pronom_id=fmt%2F19&ingestion_time_at_or_before=2017-11-30&ingestion_time_at_or_after=2017-11-24" | jq '.count'
     2
    
  4. How many files do I have that are ISO disk images?

     curl "http://127.0.0.1:62081/api/v2/search/file/?pronom_id=fmt%2F468" | jq '.count'
     1
    

@sevein removed this from the 0.12.0 milestone Dec 15, 2017
@jrwdunham (Contributor, Author) commented:

I rebased this against qa/0.x and added a commit that makes AIP file indexing work with uncompressed AIPs.

@sromkey (Contributor) commented Jan 25, 2018

It looks like the UUID returned for file queries is one newly minted by the Storage Service when it creates the file in the database. It would be more meaningful to return the source_id, i.e. the UUID created by Archivematica and recorded in the METS file.
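
If that change were made, the simplest form would presumably be a serializer-level swap; a hypothetical sketch (the import path and field list are assumptions based on the excerpts and JSON examples in this thread):

    from rest_framework import serializers

    from locations.models import File  # hypothetical import path

    class FileSerializer(serializers.ModelSerializer):
        class Meta:
            model = File
            # 'source_id' (the Archivematica UUID recorded in the METS) replaces
            # the Storage Service's own 'uuid' in the API response.
            fields = ('source_id', 'name', 'file_type', 'size', 'format_name',
                      'pronom_id', 'pipeline', 'source_package', 'normalized',
                      'valid', 'ingestion_time')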

Prior to this the `index_file_data_from_aip_mets` method of `Package`
was assuming that the package was compressed. Now it correctly parses
the METS file of uncompressed AIPs as well.
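
A rough sketch of the kind of branching this implies (an assumption about the approach, not the commit's actual code):

    import os

    def mets_relative_path(full_path, uuid, is_compressed):
        """Return the path of METS.<uuid>.xml relative to the package root."""
        if is_compressed:
            # Compressed AIP: drop the archive extension to recover the AIP
            # directory name stored inside the archive.
            aip_dir_name = os.path.basename(os.path.splitext(full_path)[0])
        else:
            # Uncompressed AIP: full_path already points at the AIP directory.
            aip_dir_name = os.path.basename(full_path)
        return os.path.join(aip_dir_name, 'data', 'METS.' + uuid + '.xml')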
@jrwdunham mentioned this pull request Jun 5, 2018
@sallain added the Status: in progress label and removed the work-in-progress label Jul 17, 2018
@ross-spencer removed the Status: in progress label May 26, 2020
@sevein added the stalled label Mar 8, 2021
@sevein closed this Mar 25, 2021