Data preprocessing notes

JonThom edited this page Feb 6, 2022 · 3 revisions

Data issues

  • registered address
    • In 622 cases, the registered_address:is_address role is played by an org_entity rather than a node_address (468 distinct org_entities in total).
  • 1139 records appear in both the intermediary and officer tables, and play the intermediary_of:is_intermediary and officer_of:is_officer roles.
    • In 7186 cases, the player of the is_officer role is not found in the officer table.
      • In 7183 of these, the role player is found in the intermediary table.
        • 1186 intermediaries play is_officer, 50487 times in total.
          • 1116 of these are among the 1139 records (mentioned above) that appear in both the officer and intermediary tables.
      • The remaining 3 cases are all accounted for by a single org_entity, id 1099616.
    • All officers that play is_intermediary also occur in the intermediary table, whereas 70 intermediaries that play is_officer do NOT occur in the officer table.
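
Counts like the ones above can be reproduced by checking each role player's ID against the officer and intermediary tables. The following is a minimal sketch on toy data, not the script used to produce these notes; the function name `classify_role_players`, the variable names, and the sample IDs are all hypothetical.

```python
from collections import Counter


def classify_role_players(player_ids, officer_ids, intermediary_ids):
    """Count is_officer role players by the table in which their ID is found.

    All argument names are hypothetical; each argument is an iterable of node IDs.
    """
    officers = set(officer_ids)
    intermediaries = set(intermediary_ids)
    counts = Counter()
    for pid in player_ids:
        if pid in officers:
            counts["officer"] += 1            # expected case
        elif pid in intermediaries:
            counts["intermediary_only"] += 1  # intermediary acting as officer
        else:
            counts["other"] += 1              # e.g. an org_entity
    return counts


# Toy data (made up for illustration, not taken from the real tables).
officer_ids = [1, 2, 3]
intermediary_ids = [3, 4, 5]
is_officer_players = [1, 3, 4, 4, 5, 99]  # one entry per officer_of edge

counts = classify_role_players(is_officer_players, officer_ids, intermediary_ids)
in_both_tables = set(officer_ids) & set(intermediary_ids)

print(dict(counts))
print(sorted(in_both_tables))
```

On the real data, the `intermediary_only` and `other` buckets would correspond to the 7183 and 3 cases listed above, assuming one entry in `is_officer_players` per `officer_of` edge.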

Solutions

  1. Edit the schema to allow org_entity to play registered_address:is_address
  2. Edit the schema to allow intermediary to play officer_of:is_officer, and officer to play intermediary_of:is_intermediary
  3. Delete the 3 rows from the officer_of table whose role player is an org_entity (id 1099616), rather than modifying the schema
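
Solution 3 amounts to filtering the officer_of rows before loading. A minimal sketch, assuming each row is a dict with a hypothetical `officer_id` key holding the role player's ID; the function name and the toy rows are illustrative, and only the ID 1099616 comes from the notes above.

```python
def drop_org_entity_officers(officer_of_rows, org_entity_ids):
    """Return (kept_rows, n_dropped), removing rows whose role player is an org_entity.

    `officer_id` is a hypothetical column name for the is_officer role player.
    """
    org_entities = set(org_entity_ids)
    kept = [row for row in officer_of_rows if row["officer_id"] not in org_entities]
    return kept, len(officer_of_rows) - len(kept)


# Toy rows: three edges played by org_entity 1099616, two by ordinary players.
rows = [
    {"officer_id": 1099616, "entity_id": 10},
    {"officer_id": 42, "entity_id": 11},
    {"officer_id": 1099616, "entity_id": 12},
    {"officer_id": 43, "entity_id": 13},
    {"officer_id": 1099616, "entity_id": 14},
]

kept_rows, n_dropped = drop_org_entity_officers(rows, [1099616])
print(n_dropped, [row["officer_id"] for row in kept_rows])
```

Returning the number of dropped rows makes it easy to confirm that exactly 3 rows were removed on the real table.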