Dev/issue 8895 search api #89

Closed
wants to merge 14 commits into from
146 changes: 146 additions & 0 deletions docs/search.rst
@@ -0,0 +1,146 @@
:Authors:
Contributor commented:

💖 Docs

Mike Cantelon

Search API
================================================================================

In addition to the search functionality present in the web interface, the
storage service also includes a REST search API. Searches are performed by
sending an HTTP GET request.

Each response includes a count of the matching items, along with next and
previous properties that link to the adjacent pages of the result set.
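
As an illustration of the count, next, and previous properties, here is a minimal Python 3 sketch that walks a whole result set by following the next links. The helper names ``fetch_json`` and ``iter_results`` are hypothetical, and the sketch assumes the endpoint answers a plain GET without extra authentication headers, which this document does not state.

```python
import json
from urllib.request import urlopen


def fetch_json(url):
    """Fetch a URL with a plain HTTP GET and decode its JSON body."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def iter_results(first_url, fetch=fetch_json):
    """Yield every item in a result set by following the "next" links."""
    url = first_url
    while url:
        page = fetch(url)
        for item in page["results"]:
            yield item
        # "next" is null (None in Python) on the last page, which ends the loop.
        url = page["next"]
```

Passing ``fetch`` as a parameter keeps the paging logic separate from the HTTP call, so it can be exercised with canned pages instead of a live storage service.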

Location search
--------------------------------------------------------------------------------

The endpoint for searching locations is::

    http://<storage service URL>/api/v2/search/location/

Locations can be searched using the following search parameters:

* uuid (location UUID)
* space (space UUID)
* purpose (purpose code)
* enabled (whether the location is enabled)

For example, if you wanted to get details about the transfer source location
contained in the space 7ec3d5d9-23ec-4fd5-b9fb-df82da8de630 you could HTTP GET::

    http://<storage service URL>/api/v2/search/location/?space=7ec3d5d9-23ec-4fd5-b9fb-df82da8de630&purpose=TS
Contributor commented:

I noticed that if you pass a nonsense param to one of these search endpoints, you get all of the resources. Is that expected/desired? For example http://192.168.168.192:8000/api/v2/search/location/?moonbeam=figaro returns all locations. As does http://192.168.168.192:8000/api/v2/search/location/
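
One way a client could guard against the behaviour described in this comment is to check its own query string before sending it. The following is only a sketch: ``unknown_params`` is a hypothetical helper, the allowed names are copied from the location parameter list in the docs, and any pagination parameter the API may accept is not accounted for.

```python
from urllib.parse import parse_qs, urlsplit

# Parameter names copied from the documented location search parameters.
LOCATION_PARAMS = {"uuid", "space", "purpose", "enabled"}


def unknown_params(url, allowed=LOCATION_PARAMS):
    """Return the sorted query parameter names in url that are not allowed."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return sorted(set(query) - set(allowed))
```

For the URL quoted in the comment this returns ``["moonbeam"]``, which a caller could treat as an error instead of silently receiving every location.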


Here is an example JSON response::

    {
        "count": 1,
        "next": null,
        "previous": null,
        "results": [
            {
                "uuid": "f74c23e1-6737-4c24-a470-a003bc573051",
                "space": "7ec3d5d9-23ec-4fd5-b9fb-df82da8de630",
                "pipelines": [
                    "2a351be8-99b4-4f53-8ea5-8d6ace6e0243",
                    "b9d676ff-7c9d-4777-9a19-1b4b76a6542f"
                ],
                "purpose": "TS",
                "quota": null,
                "used": 0,
                "enabled": true
            }
        ]
    }


Package search
--------------------------------------------------------------------------------

The endpoint for searching packages is::

    http://<storage service URL>/api/v2/search/package/

Packages can be searched using the following search parameters:

* uuid (package UUID)
* pipeline (pipeline UUID)
* location (location UUID)
* package_type (package type code: "AIP", "AIC", "SIP", "DIP", "transfer", "file", "deposit")
* status (package status code: "PENDING", "STAGING", "UPLOADED", "VERIFIED",
"DEL_REQ", "DELETED", "RECOVER_REQ", "FAIL", or "FINALIZE")
* min_size (minimum package filesize)
* max_size (maximum package filesize)

For example, if you wanted to get details about all packages of type AIP
you could HTTP GET::

    http://<storage service URL>/api/v2/search/package/?package_type=AIP

Here is an example JSON response::

    {
        "count": 1,
        "next": null,
        "previous": null,
        "results": [
            {
                "uuid": "96365d3d-6656-4fdd-a247-f85c9e0ddd43",
                "current_path": "9636/5d3d/6656/4fdd/a247/f85c/9e0d/dd43/Apples-96365d3d-6656-4fdd-a247-f85c9e0ddd43.7z",
                "size": 7918099,
                "origin_pipeline": "b9d676ff-7c9d-4777-9a19-1b4b76a6542f",
                "current_location": "a3d95a1b-f8fb-4e34-9f15-60dcdf178470",
                "package_type": "AIP",
                "status": "UPLOADED",
                "pointer_file_location": "c2dfb32b-77dd-4597-abff-7c52e05e6d01",
                "pointer_file_path": "9636/5d3d/6656/4fdd/a247/f85c/9e0d/dd43/pointer.96365d3d-6656-4fdd-a247-f85c9e0ddd43.xml"
            }
        ]
    }
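
Several package parameters can be combined in one request. The sketch below builds such a URL with the Python 3 standard library; ``package_search_url`` and the host ``ss.example.org`` are hypothetical, and the sketch does not check the parameter names against the list above.

```python
from urllib.parse import urlencode


def package_search_url(base_url, **filters):
    """Build a package search URL from keyword filters, skipping None values."""
    # Sort by parameter name so the generated URL is predictable.
    pairs = sorted((name, value) for name, value in filters.items()
                   if value is not None)
    url = base_url.rstrip("/") + "/api/v2/search/package/"
    return url + "?" + urlencode(pairs) if pairs else url
```

For instance, ``package_search_url("http://ss.example.org:8000", package_type="AIP", min_size=1000)`` yields a URL ending in ``/api/v2/search/package/?min_size=1000&package_type=AIP``.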


File search
--------------------------------------------------------------------------------

The endpoint for searching files is::

    http://<storage service URL>/api/v2/search/file/

Files can be searched using the following search criteria:

* uuid (file UUID)
* package (package UUID)
* name (enter or partial filename)
Contributor commented:

Is this a typo? Should "enter" be "entire"?

Contributor commented:

Also, I think there's a typo in the line below: "PRONUM"

Contributor commented:

I fixed these typos in #262

* pronom_id (PRONUM PUID)
* format_name (format name)
* min_size (minimum filesize)
* max_size (maximum filesize)
* normalized (boolean: whether or not file was normalized)
* valid (boolean: whether or not file data is valid or malformed)
Contributor commented:

Wording sounds contradictory: "whether or not file data is valid or malformed"

Contributor commented:

Also, I think valid should be validated here. No? If so, I think the validated attribute of File only records whether the file has been validated, not whether it is valid (unless the help text is wrong in models/event.py::File).

Contributor commented:

Should source_package be listed as a search filter here? FileFilter seems to expose it. It's a bit confusing since it only references Transfers I believe. See the pre-existing Package.index_file_data_from_transfer_mets

Contributor commented:

I changed validated to valid, a nullable boolean field where None indicates not validated, True indicates valid, and False indicates invalid. Migrations and help text are updated accordingly in #262.
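
The three states described in this comment can be told apart on the client side as sketched here. ``validation_status`` is a hypothetical helper, and the sketch assumes the field arrives as JSON null, true, or false, which Python's ``json`` module decodes to None, True, or False.

```python
def validation_status(valid):
    """Describe the nullable valid flag: None, True, or False."""
    # Test for None first: both None and False are falsy in Python.
    if valid is None:
        return "not validated"
    return "valid" if valid else "invalid"
```

Checking ``is None`` before truthiness matters here, because a plain ``if not valid`` would lump unvalidated files together with invalid ones.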


For example, if you wanted to get details about files that are 29965171 bytes
or larger, you could HTTP GET::

    http://<storage service URL>/api/v2/search/file/?min_size=29965171

Here is an example JSON response::

    {
        "count": 1,
        "next": null,
        "previous": null,
        "results": [
            {
                "uuid": "bd2074bb-2086-40b5-9c3f-3657cb900681",
                "name": "Bodring-5f0fa831-a74b-4bf5-8598-779d49c3663a/objects/pictures/Landing_zone-e50c8452-0791-4fac-9f45-15b088a39b10.tif",
                "file_type": "AIP",
                "size": 29965171,
                "format_name": "TIFF",
                "pronom_id": "",
                "source_package": "",
                "normalized": null,
                "validated": null,
                "ingestion_time": "2015-10-30T04:16:39Z"
            }
        ]
    }
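
The FileFilter in this pull request maps min_size and max_size to ``gte`` and ``lte`` lookups on size, so both bounds are inclusive; that is why a file of exactly 29965171 bytes matches ``min_size=29965171``. A plain Python sketch of the same comparison, with the hypothetical name ``size_in_range``:

```python
def size_in_range(size, min_size=None, max_size=None):
    """Return True if size satisfies the inclusive min_size/max_size bounds."""
    if min_size is not None and size < min_size:
        return False
    if max_size is not None and size > max_size:
        return False
    return True
```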
6 changes: 5 additions & 1 deletion requirements/base.txt
@@ -7,10 +7,12 @@ setuptools
bagit==1.5.4
brotli==0.5.2 # Better compression library for WhiteNoise
defusedxml==0.5.0
djangorestframework==3.2.4
Django>=1.8,<1.9
django-annoying==0.10.3
django-braces==1.11.0
django-extensions==1.7.9
django-filter==0.11.0
django-model-utils==3.0.0
#tastypie 0.13.3 has breaking changes
django-tastypie==0.13.1
@@ -20,7 +22,8 @@ gunicorn==19.7.1
jsonfield==2.0.1
logutils==0.3.4.1
lxml==3.7.3
metsrw==0.2.0
#metsrw==0.2.0
git+https://github.com/artefactual-labs/mets-reader-writer.git@dev/issue-11581-premis-parsing#egg=metsrw
ndg-httpsclient==0.4.2
pyasn1==0.2.3
python-gnupg==0.4.0
@@ -36,3 +39,4 @@ git+https://github.com/Brown-University-Library/django-shibboleth-remoteuser.git
# This may not actually be needed as SS uses sqlite by default which doesn't really care about length.
# But better to make sure Django doesn't have any validation issues (and also, keep db backend easily swappable)
git+https://github.com/seatme/django-longer-username.git@seatme#egg=longerusername
-e git://github.com/mcantelon/mets-reader-writer.git@dev/issue-8894-premis-parsing#egg=metsrw
4 changes: 4 additions & 0 deletions storage_service/locations/api/resources.py
@@ -4,6 +4,7 @@
# stdlib, alphabetical
import json
import logging
from multiprocessing import Process
import os
import re
import shutil
@@ -570,6 +571,9 @@ def obj_create(self, bundle, **kwargs):
            bundle.obj.store_aip(origin_location, origin_path,
                                 related_package_uuid, premis_events=events,
                                 premis_agents=agents, aip_subtype=aip_subtype)
            # Asynchronously index AIP files
            p = Process(target=bundle.obj.index_file_data_from_aip_mets)
            p.start()
        elif bundle.obj.package_type in (Package.TRANSFER,) and bundle.obj.current_location.purpose in (Location.BACKLOG,):
            # Move transfer to backlog
            bundle.obj.backlog_transfer(origin_location, origin_path)
5 changes: 5 additions & 0 deletions storage_service/locations/api/search/__init__.py
@@ -0,0 +1,5 @@
# Common
# May have multiple models, so import * and use __all__ in file.
from router import router

__all__ = ['router']
164 changes: 164 additions & 0 deletions storage_service/locations/api/search/router.py
@@ -0,0 +1,164 @@
import django_filters
from rest_framework import routers, serializers, viewsets, filters
from rest_framework.decorators import list_route
from rest_framework.response import Response

from django.db.models import Sum

from locations import models


class CaseInsensitiveBooleanFilter(django_filters.Filter):
    """
    This allows users to query booleans without having to use "True" and "False"
    """
    def filter(self, qs, value):
        if value is not None:
            lc_value = value.lower()
            if lc_value == "true":
                value = True
            elif lc_value == "false":
                value = False
            return qs.filter(**{self.name: value})
Contributor commented:

What happens here if the string doesn't match either? It just passes the string back?

Contributor commented:

Just tested, looks like it's ignored if a supported value isn't passed. Maybe we want it to raise an error instead?

Member Author commented:

Yeah, it passes the string back as is and their search query will then fail (which makes sense given they've provided the wrong values for a boolean).

Contributor commented:

Okay, cool! What's the failure look like?

Member Author commented:

It'll pass the value back as-is and cause the query to be invalid (which is desirable given they've provided the wrong type of value).
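
If the stricter behaviour raised earlier in this thread were wanted, the string handling could look roughly like the sketch below. It is not the code in this pull request: ``parse_boolean`` is a hypothetical helper, and wiring it into the filter (for example, turning the ValueError into an HTTP 400 response) is left out.

```python
def parse_boolean(value):
    """Map "true"/"false" in any letter case to a bool; reject anything else."""
    normalized = value.strip().lower()
    if normalized == "true":
        return True
    if normalized == "false":
        return False
    # Unsupported input fails loudly instead of being passed through unchanged.
    raise ValueError("expected 'true' or 'false', got %r" % (value,))
```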

        return qs


class PipelineField(serializers.RelatedField):
    """
    Used to show UUID of related pipelines
    """
    def to_representation(self, value):
        return value.uuid


class LocationSerializer(serializers.HyperlinkedModelSerializer):
    """
    Serialize Location model data
    """
    space = serializers.ReadOnlyField(source='space.uuid')
    pipelines = PipelineField(many=True, read_only=True, source='pipeline')

    class Meta:
        model = models.Location
        fields = ('uuid', 'space', 'pipelines', 'purpose', 'quota', 'used', 'enabled')


class LocationFilter(django_filters.FilterSet):
    """
    Filter for searching Location data
    """
    uuid = django_filters.CharFilter(name='uuid')
    space = django_filters.CharFilter(name='space')
    purpose = django_filters.CharFilter(name='purpose')
    enabled = CaseInsensitiveBooleanFilter(name='enabled')

    class Meta:
        model = models.Location
        fields = ['uuid', 'space', 'purpose', 'enabled']


class LocationViewSet(viewsets.ReadOnlyModelViewSet):
    """
    Search API view for Location model data
    """
    queryset = models.Location.objects.all()
    serializer_class = LocationSerializer
    filter_backends = (filters.DjangoFilterBackend,)
    filter_class = LocationFilter


class PackageSerializer(serializers.HyperlinkedModelSerializer):
    """
    Serialize Package model data
    """
    origin_pipeline = serializers.ReadOnlyField(source='origin_pipeline.uuid')
    current_location = serializers.ReadOnlyField(source='current_location.uuid')
    pointer_file_location = serializers.ReadOnlyField(source='pointer_file_location.uuid')

    class Meta:
        model = models.Package
        fields = ('uuid', 'current_path', 'size', 'origin_pipeline', 'current_location', 'package_type', 'status', 'pointer_file_location', 'pointer_file_path')


class PackageFilter(django_filters.FilterSet):
    """
    Filter for searching Package data
    """
    min_size = django_filters.NumberFilter(name='size', lookup_type='gte')
    max_size = django_filters.NumberFilter(name='size', lookup_type='lte')
    pipeline = django_filters.CharFilter(name='origin_pipeline')
    location = django_filters.CharFilter(name='current_location')
    package_type = django_filters.CharFilter(name='package_type')

    class Meta:
        model = models.Package
        fields = ['uuid', 'min_size', 'max_size', 'pipeline', 'location', 'package_type', 'status', 'pointer_file_location']


class PackageViewSet(viewsets.ReadOnlyModelViewSet):
    """
    Search API view for Package model data
    """
    queryset = models.Package.objects.all()
    serializer_class = PackageSerializer
    filter_backends = (filters.DjangoFilterBackend,)
    filter_class = PackageFilter


class FileSerializer(serializers.HyperlinkedModelSerializer):
    """
    Serialize File model data
    """
    pipeline = serializers.ReadOnlyField(source='origin.uuid')

    class Meta:
        model = models.File
        fields = ('uuid', 'name', 'file_type', 'size', 'format_name', 'pronom_id', 'pipeline', 'source_package', 'normalized', 'validated', 'ingestion_time')


class FileFilter(django_filters.FilterSet):
    """
    Filter for searching File data
    """
    min_size = django_filters.NumberFilter(name='size', lookup_type='gte')
    max_size = django_filters.NumberFilter(name='size', lookup_type='lte')
    pipeline = django_filters.CharFilter(name='origin')
    package = django_filters.CharFilter(name='source_package')
    name = django_filters.CharFilter(name='name', lookup_type='icontains')
    normalized = CaseInsensitiveBooleanFilter(name='normalized')
    ingestion_time = django_filters.DateFilter(name='ingestion_time', lookup_type='contains')
    #ingestion_time_before = django_filters.DateFilter(name='ingestion_time', lookup_type='lt')
    #ingestion_time_after = django_filters.DateFilter(name='ingestion_time', lookup_type='gt')

    class Meta:
        model = models.File
        fields = ['uuid', 'name', 'file_type', 'min_size', 'max_size',
                  'format_name', 'pronom_id', 'pipeline', 'source_package',
                  'normalized', 'validated', 'ingestion_time']
                  #'ingestion_time_before', 'ingestion_time_after']


class FileViewSet(viewsets.ReadOnlyModelViewSet):
    """
    Search API view for File model data

    Custom endpoint "stats" provides total size of files searched for
    """
    queryset = models.File.objects.all()
    serializer_class = FileSerializer
    filter_backends = (filters.DjangoFilterBackend,)
    filter_class = FileFilter

    @list_route(methods=['get'])
Contributor commented:

How is this used? TODO: find out.
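
On the question of usage: with Django REST framework's DefaultRouter, a ``@list_route`` method named ``stats`` on a viewset registered under ``file`` is normally exposed at ``file/stats/``, so the request would presumably be a GET to ``/api/v2/search/file/stats/`` with the same filter parameters as a file search. That URL is an inference from the router configuration, not something this pull request documents. The response body is a count plus a summed size, which the following plain Python sketch mirrors for a list of file records; ``summarize_sizes`` is a hypothetical name.

```python
def summarize_sizes(files):
    """Mirror the stats response shape for an iterable of dicts with a "size" key."""
    sizes = [record["size"] for record in files]
    # Django's Sum aggregate returns None for an empty queryset, so an empty
    # input yields None here as well, not 0.
    total = sum(sizes) if sizes else None
    return {"count": len(sizes), "total_size": total}
```

A client should therefore be prepared for ``total_size`` to be null when no files match the filters.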

    def stats(self, request):
        filtered = FileFilter(request.GET, queryset=self.get_queryset())
        count = filtered.qs.count()
        summary = filtered.qs.aggregate(Sum('size'))
        return Response({'count': count, 'total_size': summary['size__sum']})


# Route location, package, and file search API requests
router = routers.DefaultRouter()
router.register(r'location', LocationViewSet)
router.register(r'package', PackageViewSet)
router.register(r'file', FileViewSet)
7 changes: 6 additions & 1 deletion storage_service/locations/api/urls.py
@@ -1,9 +1,11 @@
from django.conf.urls import include, url
from tastypie.api import Api
from locations.api import v1, v2

from locations.api import v1, v2
from locations.api.search import router
from locations.api.sword import views


v1_api = Api(api_name='v1')
v1_api.register(v1.SpaceResource())
v1_api.register(v1.LocationResource())
@@ -16,9 +18,12 @@
v2_api.register(v2.PackageResource())
v2_api.register(v2.PipelineResource())


urlpatterns = [
    url(r'', include(v1_api.urls)),
    url(r'v1/sword/$', views.service_document, name='sword_service_document'),
    url(r'', include(v2_api.urls)),
    url(r'v2/sword/$', views.service_document, name='sword_service_document'),
    url(r'v1/search/', include(router.urls)),
    url(r'v2/search/', include(router.urls))
]