Data Audit
There are a few management commands available for finding and diagnosing data inconsistencies. These should be run on the production work server.
audit_db
This management command looks for inconsistencies by comparing the contents of the database with the information stored on DocumentCloud. (Last run on 10/19. It took about 10 hours, so either schedule a single execution with cron or run it under nohup so that it survives a closed session.)
audit_dc
This management command will look for inconsistencies by comparing the information stored on DocumentCloud with the contents of the database. (Not tested yet.)
data_lookup
This management command takes a single MatterAttachment.id or MatterAttachment.hyperlink and outputs details about what is in the database and what is on DocumentCloud.
Example output:
Database record
- hyperlink http://ord.legistar.com/Chicago/attachments/d0ac7580-3eca-4606-9aa0-ef80ff5d89e8.pdf
- last_modified 2015-03-19 01:04:52.117000+00:00
- link_obtained_at 2015-10-21 18:33:27.271190+00:00
- matter.id 72268
- matter.last_modified 2015-10-07 21:59:16.513000+00:00
- matter.attachments_obtained_at 2015-10-08 02:30:14.733211+00:00
Querying DocumentCloud [MatterAttachmentId:143975]
- source http://ord.legistar.com/Chicago/attachments/d0ac7580-3eca-4606-9aa0-ef80ff5d89e8.pdf
- created_at 2015-10-21 18:33:26+00:00
- updated_at 2015-10-21 20:30:32+00:00
* data added: set()
* data removed: set() << Differences between calculated and actual document data
* data changed: set()
Querying DocumentCloud [source: "http://ord.legistar.com/Chicago/attachments/d0ac7580-3eca-4606-9aa0-ef80ff5d89e8.pdf"]
- source http://ord.legistar.com/Chicago/attachments/d0ac7580-3eca-4606-9aa0-ef80ff5d89e8.pdf
- created_at 2015-10-21 18:33:26+00:00
- updated_at 2015-10-21 20:30:32+00:00
* data added: set()
* data removed: set() << Differences between calculated and actual document data
* data changed: set()
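The "data added", "data removed", and "data changed" lines in the output above report key-level differences between the document data calculated from the database and the data actually stored on DocumentCloud. The following is a minimal sketch of such a comparison, not the command's actual code. The function name, the key names other than MatterAttachmentId, and the direction of "added" versus "removed" are assumptions made for illustration.

```python
def diff_document_data(calculated, actual):
    """Compare two flat dicts and return (added, removed, changed) key sets.

    Assumption: "added" means present in the calculated data but missing
    from the actual document, and "removed" means the reverse.
    """
    calculated_keys = set(calculated)
    actual_keys = set(actual)
    added = calculated_keys - actual_keys
    removed = actual_keys - calculated_keys
    # Keys present on both sides whose values differ.
    changed = {
        key for key in calculated_keys & actual_keys
        if calculated[key] != actual[key]
    }
    return added, removed, changed


# Hypothetical data: only MatterAttachmentId appears in the output above.
calculated = {"MatterAttachmentId": "143975", "MatterStatus": "Passed"}
actual = {"MatterAttachmentId": "143975", "MatterStatus": "Introduced"}
added, removed, changed = diff_document_data(calculated, actual)
print("data added:", added)
print("data removed:", removed)
print("data changed:", changed)
```

Three empty sets, as in the example output, mean the calculated and actual data agree.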
Document not found in DocumentCloud, no match for hyperlink/source or MatterAttachmentId
Search for the document in DocumentCloud. It is possible that it exists but was not returned in the search results because it is "Processing" or "Failed Import". If the document does not actually exist...
Manual fix: Run pull_pdfs for the associated Matter id.
Document not found in DocumentCloud, no match for hyperlink/source but MatterAttachmentId is associated with different hyperlink/source
We are not sure why this happens. If the MatterAttachment.hyperlink changed, then the MatterAttachment.last_modified should have changed, and pull_pdfs would have seen that last_modified >= link_obtained_at. Spot checks indicate that the database matches the most recent information from Legistar.
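The timestamp comparison described above can be sketched as a small predicate. This is an illustration under stated assumptions, not the pull_pdfs implementation: the function name is hypothetical, and treating a missing link_obtained_at as "needs pull" is an assumption. The sample values are taken from the data_lookup example output.

```python
from datetime import datetime, timezone


def attachment_needs_pull(last_modified, link_obtained_at):
    """Return True when the attachment changed at or after the last fetch."""
    if link_obtained_at is None:
        # Assumption: an attachment that was never fetched always qualifies.
        return True
    return last_modified >= link_obtained_at


# Values from the example output: modified in March, link obtained in October.
last_modified = datetime(2015, 3, 19, 1, 4, 52, 117000, tzinfo=timezone.utc)
link_obtained_at = datetime(2015, 10, 21, 18, 33, 27, 271190, tzinfo=timezone.utc)
print(attachment_needs_pull(last_modified, link_obtained_at))
```

For the example record the predicate is false, so that attachment would not be picked up again unless its last_modified moved forward.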
Manual fix: Run pull_pdfs for the associated Matter id and manually change the old document to Access: Private on DocumentCloud.
Multiple documents in DocumentCloud with the same MatterAttachmentId but a different "source"
(The document was updated in Legistar with a new hyperlink but the same MatterAttachmentId. The MatterAttachment record matched the latest information from Legistar and the most recently uploaded document.)
This should be fixed by updates to the pull_attachments command, which now looks for changes to MatterAttachment.hyperlink and will privatize the old document on DocumentCloud.
Manual fix: Manually change the old document to Access: Private on DocumentCloud.
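When several DocumentCloud documents share one MatterAttachmentId, the ones to privatize are those whose "source" no longer matches the current MatterAttachment.hyperlink. The sketch below only selects those documents. It represents each document as a plain dict with hypothetical "id" and "source" keys, and it does not call DocumentCloud or reflect how pull_attachments is actually written.

```python
def documents_to_privatize(documents, current_hyperlink):
    """Return the documents whose "source" differs from the current hyperlink."""
    return [
        doc for doc in documents
        if doc.get("source") != current_hyperlink
    ]


# Hypothetical search results for a single MatterAttachmentId.
documents = [
    {"id": "old-doc", "source": "http://example.com/attachments/old.pdf"},
    {"id": "new-doc", "source": "http://example.com/attachments/new.pdf"},
]
stale = documents_to_privatize(documents, "http://example.com/attachments/new.pdf")
print([doc["id"] for doc in stale])
```

Each document returned here is a candidate for the manual Access: Private change.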
Data Mismatch
(The data associated with a Matter does not match the document data associated with a related document in DocumentCloud. The pull_pdfs query did not take changes to Matter-related data into account.)
Some mismatches will self-correct once the appropriate cron job runs. Others should be prevented by updates to the pull_pdfs command, which now also takes the Matter.last_modified timestamp into account.
Manual fix: Run pull_pdfs for the associated Matter id.