Problem: there is no ADR for Archivematica reporting functionality #24

Open
peterVG opened this issue May 5, 2020 · 2 comments

@peterVG
Collaborator

peterVG commented May 5, 2020

There is pent-up demand from Archivematica users to introduce reporting functionality to Archivematica. They want statistics about what their Archivematica deployments are doing and when, as well as detailed breakdowns of the content in their Archivematica systems. While the existing Archival Storage search and hit display does provide some useful information, it does not aggregate this information or present it in management-style reports. Work on a comprehensive reporting feature has been delayed because it hasn't been clear where the canonical source of Archivematica statistical and content information is stored, or which of the available sources is the most convenient, comprehensive, and performant for building reports. There is also some confusion between logging and reporting functionality. All of this is complicated by the fact that production Archivematica deployments are often split over multiple processing pipelines. This ADR should address these problems and provide options for moving forward with one or more solutions.

@peterVG
Collaborator Author

peterVG commented May 6, 2020

See reporting/0011-reporting.md

@ross-spencer
Contributor

@peterVG this is starting to take good shape.

As you noted in Slack, using the PR functionality (you can now create a WIP/draft PR in GitHub) will be a good way to do more detailed revision.

Some thoughts that I hope will help until our meeting Monday:

Examples

In the context and problem statement, I wonder if you can break the examples of reporting out into categories with fewer specific examples, e.g. Repository maintenance reporting (might describe how many packages, how many deletions, etc.); File format reporting (might describe numbers of formats, numbers of significant properties).

I think that would then feed into the considered options as to what data is in, and which data is out, of scope in this ADR. (My feeling is that we won't be able to tackle it all).

Exhausting our data sources

It might start to look overwhelming, but I think we can add to the list of data sources in Archivematica. I'd like to exhaust them here, at least for discussion.

I was thinking we'd at least need to add:

  • System logs, e.g. Nginx/MCPServer (vs. logs which are also ancillary contents of an AIP (for now)).
  • Prometheus.
  • SIPs as a first-class package due to the SFU work. (I might need to describe my language around this in person! -- but yeah, I do see a world where folks can offload their SIPs from backlog instead of always creating AIPs as they do now.)

I was trying to think of others. I flip-flop on the API a lot. There is definitely information to be extracted from there, which can be a different rendering from what is in the database. It also might not be the same API as one we might create for this work?

I like that you've noted we might need to generate information. It's a good question, if we generate it in Archivematica where do we keep it? (Enhance the METS? Other DB tables?) Is Archivematica already saturated with regards to new information?

I wonder then if the Technical forces section could start to be split into:

  • Sources of information (and their longevity)
  • Long-term support of components, e.g. theoretically, our ES index could be replaced by any client with another indexing solution. We could replace it ourselves.
  • Challenges, e.g.
    • the aggregation of data across multiple clients/multiple servers is a really great point, and I think there's a topology which that conjures.
    • Securing the data is a great one you've picked out as well. (I wonder if future ADRs will have a separate Security section?)
    • Some data not being there is another great one.
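
To make the aggregation challenge concrete, here is a minimal sketch in Python. It assumes each pipeline can already produce a per-format file count; the pipeline names, the format identifiers used as keys, and the numbers are all made up for illustration and do not come from any real Archivematica deployment or API.

```python
from collections import Counter
from typing import Mapping


def aggregate_format_counts(per_pipeline: Mapping[str, Mapping[str, int]]) -> Counter:
    """Merge per-pipeline format counts into one deployment-wide tally."""
    total: Counter = Counter()
    for counts in per_pipeline.values():
        # Counter.update() adds counts for matching keys rather than replacing them.
        total.update(counts)
    return total


# Hypothetical summaries, one per processing pipeline (illustrative values only).
pipeline_counts = {
    "pipeline-a": {"fmt/43": 120, "fmt/18": 30},
    "pipeline-b": {"fmt/43": 80, "x-fmt/111": 12},
}

totals = aggregate_format_counts(pipeline_counts)
for format_id, count in totals.most_common():
    print(format_id, count)
```

The merge itself is trivial; the open questions raised above are where each pipeline's summary comes from, and how it is transported and secured across servers.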

There may be other sections after we chat. I think this will then help draw out more decisions we want to make.

Emphasizing the use of this data

One last thing: it would be good to keep in sight where this data ends up. I think there may be plenty of places (management reports, etc.), but it will be good to keep in mind in this ADR that Archivematica might then consume its own reports somehow to drive re-ingest or PAR-like actions.


One potential impact, hypothetically, is that we might write something that extracts the data, provides nice reports and, on top of that, nice visualizations. But we should also keep in mind that for the thing we write, we might also write an API so that its output can be worked back into Archivematica (or indeed into visualization tools). Certainly, we'll write some form of interface that we can cleanly work with and abstract from.
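
As a rough illustration of "some form of interface that we can cleanly work with and abstract from", the sketch below defines a small Python protocol that any data source could satisfy. Every name here (`ReportSource`, `InMemorySource`, `build_report`, `package_counts`) is hypothetical and invented for this example; nothing in it is an existing Archivematica API.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol


class ReportSource(Protocol):
    """Hypothetical interface: anything that can report package counts by type."""

    def package_counts(self) -> Dict[str, int]: ...


@dataclass
class InMemorySource:
    """Stand-in for a real source (database, index, METS, API); assumption only."""

    counts: Dict[str, int]

    def package_counts(self) -> Dict[str, int]:
        return dict(self.counts)


def build_report(sources: List[ReportSource]) -> Dict[str, int]:
    """Sum package counts across sources without knowing how each one stores data."""
    report: Dict[str, int] = {}
    for source in sources:
        for package_type, count in source.package_counts().items():
            report[package_type] = report.get(package_type, 0) + count
    return report


# Illustrative values only.
report = build_report([
    InMemorySource({"AIP": 10, "SIP": 2}),
    InMemorySource({"AIP": 5}),
])
print(report)
```

Because `build_report` depends only on the protocol, the same report could be fed to a management view, a visualization tool, or back into Archivematica, and an individual source could be swapped out without touching the consumers.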
