Data Audit

J Wilson edited this page Oct 22, 2015 · 5 revisions

Diagnostic Tools

There are a few management commands available for finding and diagnosing data inconsistencies. These should be run on the production work server.

audit_db

This management command looks for inconsistencies by comparing the contents of the database with the information stored on DocumentCloud. (Last run on 10/19. It took about 10 hours, so schedule it as a one-off cron job or run it under nohup.)

audit_dc

This management command looks for inconsistencies in the opposite direction, checking the information stored on DocumentCloud against the contents of the database. (Not tested yet.)

data_lookup

This management command takes a single MatterAttachment.id or MatterAttachment.hyperlink and outputs what is in the database and what is on DocumentCloud for that attachment.

Example output:

Database record
- hyperlink http://ord.legistar.com/Chicago/attachments/d0ac7580-3eca-4606-9aa0-ef80ff5d89e8.pdf
- last_modified 2015-03-19 01:04:52.117000+00:00
- link_obtained_at 2015-10-21 18:33:27.271190+00:00
- matter.id 72268
- matter.last_modified 2015-10-07 21:59:16.513000+00:00
- matter.attachments_obtained_at 2015-10-08 02:30:14.733211+00:00

Querying DocumentCloud [MatterAttachmentId:143975]
- source http://ord.legistar.com/Chicago/attachments/d0ac7580-3eca-4606-9aa0-ef80ff5d89e8.pdf
- created_at 2015-10-21 18:33:26+00:00
- updated_at 2015-10-21 20:30:32+00:00
* data added: set()        
* data removed: set()  << Differences between calculated and actual document data 
* data changed: set()

Querying DocumentCloud [source: "http://ord.legistar.com/Chicago/attachments/d0ac7580-3eca-4606-9aa0-ef80ff5d89e8.pdf"]
- source http://ord.legistar.com/Chicago/attachments/d0ac7580-3eca-4606-9aa0-ef80ff5d89e8.pdf
- created_at 2015-10-21 18:33:26+00:00
- updated_at 2015-10-21 20:30:32+00:00
* data added: set()
* data removed: set()  << Differences between calculated and actual document data 
* data changed: set()
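
The three "data" lines in the output read like a key-level comparison between the document data calculated from the database and the data actually stored on DocumentCloud. As a rough sketch of how such a comparison could be computed (the function name `diff_document_data` and the direction of "added" versus "removed" are assumptions, not taken from the command's source):

```python
def diff_document_data(calculated, actual):
    """Compare two flat dicts of document data and return the keys that differ.

    Assumption: "added" means keys present only in the calculated data,
    "removed" means keys present only in the actual DocumentCloud data.
    """
    calculated_keys = set(calculated)
    actual_keys = set(actual)

    added = calculated_keys - actual_keys
    removed = actual_keys - calculated_keys
    # Keys present on both sides whose values disagree.
    changed = {
        key
        for key in calculated_keys & actual_keys
        if calculated[key] != actual[key]
    }
    return added, removed, changed
```

When both dicts are identical, all three results are empty sets, which matches the `set()` values in the example output.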

Issues

Document not found in DocumentCloud, no match for hyperlink/source or MatterAttachmentId

Search for the document in DocumentCloud. It may exist but be missing from the search results because its status is "Processing" or "Failed Import". If the document really does not exist, apply the manual fix below.

Manual fix: Run pull_pdfs for the associated Matter id.

Document not found in DocumentCloud, no match for hyperlink/source but MatterAttachmentId is associated with different hyperlink/source

We are not sure why this happens. If MatterAttachment.hyperlink had changed, MatterAttachment.last_modified should have changed as well, and pull_pdfs would have seen that last_modified >= link_obtained_at. Spot checks indicate that the database matches the most recent information from Legistar.

Manual fix: Run pull_pdfs for the associated Matter id and manually change the old document to Access: Private on DocumentCloud.

Multiple documents in DocumentCloud with the same MatterAttachmentId but a different "source"

(The document was updated in Legistar with a new hyperlink but the same MatterAttachmentId. The MatterAttachment record matched the latest information from Legistar and the most recently uploaded document.)

This should be fixed by updates to the pull_attachments command, which now looks for changes to MatterAttachment.hyperlink and will privatize the old document on DocumentCloud.

Manual fix: Manually change the old document to Access: Private on DocumentCloud.
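
When several DocumentCloud documents share one MatterAttachmentId, the ones to privatize are those whose "source" no longer matches the current MatterAttachment.hyperlink. A minimal sketch of that selection step follows; the function name `stale_document_ids` and the `(document_id, source)` pair format are hypothetical, and the actual change to Access: Private is left out.

```python
def stale_document_ids(current_hyperlink, documents):
    """Return the ids of documents whose source differs from the current hyperlink.

    `documents` is an iterable of (document_id, source) pairs that all carry
    the same MatterAttachmentId. Both names are hypothetical placeholders.
    """
    return [
        document_id
        for document_id, source in documents
        if source != current_hyperlink
    ]
```

Each returned id is a candidate for Access: Private. The document whose source equals the current hyperlink is left untouched.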

Data Mismatch

(The data associated with a Matter does not match the document data associated with a related document in DocumentCloud. The pull_pdfs query did not take changes to Matter-related data into account.)

Some mismatches will self-correct once the appropriate cron job runs. Others should be prevented by updates to the pull_pdfs command, which now also takes the Matter.last_modified timestamp into account.

Manual fix: Run pull_pdfs for the associated Matter id.
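
The staleness rule described on this page (an attachment is re-pulled when last_modified >= link_obtained_at, with Matter.last_modified now also considered) could be sketched as below. This is an illustration only: the function name `needs_refresh`, the handling of a missing link_obtained_at, and the choice to compare Matter.last_modified against link_obtained_at are assumptions, not the actual pull_pdfs query.

```python
from datetime import datetime, timezone


def needs_refresh(attachment_last_modified, matter_last_modified, link_obtained_at):
    """Decide whether an attachment's document should be pulled again.

    Assumptions: an attachment that was never obtained (None) always needs a
    pull, and both timestamps are compared against link_obtained_at.
    """
    if link_obtained_at is None:
        return True
    return (
        attachment_last_modified >= link_obtained_at
        or matter_last_modified >= link_obtained_at
    )


# Timestamps taken from the example data_lookup output, truncated to seconds.
example = needs_refresh(
    attachment_last_modified=datetime(2015, 3, 19, 1, 4, 52, tzinfo=timezone.utc),
    matter_last_modified=datetime(2015, 10, 7, 21, 59, 16, tzinfo=timezone.utc),
    link_obtained_at=datetime(2015, 10, 21, 18, 33, 27, tzinfo=timezone.utc),
)
```

For the example record, both modification times precede link_obtained_at, so `example` is False and no re-pull is needed. A later change to the Matter alone would flip the result to True, which is the case the updated command is meant to catch.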
