A collection of scrapers used to add data to https://findthatcharity.uk/. The scrapers use scrapy and can currently save data to an SQL database (designed for postgres), an elasticsearch index or mongodb database.
The standard format is based on the Organization object described in the threesixtygiving data standard.
Each spider (a particular part of the scraper) looks at a particular type of organisation (in the UK), and aims to transform a register of that type of organisation into a standard format that can be saved to the database powering findthatcharity.
The spiders are all designed to be run without human input and should fetch consistent data.
The spiders are found in the /findthatcharity_import/spiders directory and cover:
casc
: Community Amateur Sports Clubs regulated by HMRC
ccew
: Registered charities in England and Wales
ccni
: Registered charities in Northern Ireland
oscr
: Registered charities in Scotland
companies
: Companies registered with Companies House (the scraper only imports non-profit company types)
mutuals
: Mutual societies registered with the Financial Conduct Authority
gor
: A register of government organisations
grid
: Entries from the Global Research Identifier Database - only those that are based in the UK and are not a registered company are included.
hesa
: Organisations covered by the Higher Education Statistics Agency.
lae
: Register of local authorities in England
lani
: Register of local authorities in Northern Ireland
las
: Register of local authorities in Scotland
pla
: Register of principal local authorities in Wales
nhsods
: NHS organisations
rsl
: Registered social landlords
schools_gias
: Schools in England (also includes Universities)
schools_ni
: Schools in Northern Ireland
schools_scotland
: Schools in Scotland
schools_wales
: Schools in Wales
There is also a scraper for a set of files that link data together, hosted at https://github.com/drkane/charity-lookups and one that pulls data from the org-id register of identifier schemes.
If you have scrapy installed then you can run an individual scraper using:
scrapy crawl <spiderid>
For example to crawl charities in England and Wales you would run:
scrapy crawl ccew
To run all the scrapers, using the DB_URI environment variable to save to an SQL database, run:
sh ./crawl_all.sh
The code comes with three specialist pipelines to add the data to a database (SQL, elasticsearch and MongoDB), plus one to add postcode data.
Pipelines come with their own settings (see the scrapy docs for how to use these) and need to be activated in the ITEM_PIPELINES setting.
The SQL pipeline saves data to an SQL database. It has been designed for use with postgres, but uses SQLAlchemy to save data so could be used with other database engines.
The database schema is found in <findthatcharity_import/db.py> and can be managed using alembic. The tables used are:
organisation
: holds details about the organisations scraped.
source
: data sources
organisation_links
: a table of links between different organisations
identifier
: a list of identifiers from org-id.
scrape
: contains details of individual scraping runs.
To create the database with alembic, set the database connection as the environment variable DB_URI and then run alembic upgrade head.
To save a scraping run to the database, you need to include the DB_URI as a setting. For example:
scrapy crawl ccew -s DB_URI="postgres://postgres:postgres@localhost/ftc"
# or using the preset environment variable
scrapy crawl ccew -s DB_URI="$DB_URI"
The pipeline found in pipelines/postcode_lookup_pipeline.py uses https://postcodes.findthatcharity.uk/ to look up data about an organisation's postcode and adds it to the organisation's location attribute.
To activate this pipeline add
'findthatcharity_import.pipelines.postcode_lookup_pipeline.PostcodeLookupPipeline': 100,
to the ITEM_PIPELINES setting. It is recommended that the value given as 100 above is set as low as possible to ensure that the postcodes are fetched before any data is saved to the database.
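The ordering described above can be sketched as a settings fragment. Scrapy runs item pipelines in ascending order of the number assigned to them, so giving the postcode lookup a lower value than the database pipeline means the location data is attached before the item is saved. The second class path below is a hypothetical placeholder, not a name taken from this project.

```python
# settings.py fragment (sketch). Lower numbers run first.
ITEM_PIPELINES = {
    # Path given in this README: look up postcode data first.
    "findthatcharity_import.pipelines.postcode_lookup_pipeline.PostcodeLookupPipeline": 100,
    # HYPOTHETICAL placeholder: replace with the database pipeline you use.
    "findthatcharity_import.pipelines.example.ExampleDatabasePipeline": 900,
}

# Pipelines in the order Scrapy would apply them (ascending value).
pipeline_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
```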
The following settings are available for this pipeline:
PC_URL
: The URL used to fetch the data for a postcode. An empty set of brackets shows where the postcode will go. (Default: https://postcodes.findthatcharity.uk/postcodes/{}.json)
PC_FIELD
: The field in the Organisation Item that contains the postcode. (Default: postalCode)
PC_FIELDS_TO_ADD
: The area types that will be added to the item based on the postcode. NB in addition to this the lat/long is always added if present. (Default: ['cty', 'laua', 'ward', 'ctry', 'rgn', 'gor', 'pcon', 'ttwa', 'lsoa11', 'msoa11'])
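As a rough illustration of how these settings fit together, the sketch below reads the postcode from the field named by PC_FIELD and substitutes it into the empty brackets in PC_URL. The helper name build_postcode_url and the sample item are assumptions made for this example; the pipeline's real implementation may normalise postcodes or build the URL differently.

```python
# Default values as listed above.
PC_URL = "https://postcodes.findthatcharity.uk/postcodes/{}.json"
PC_FIELD = "postalCode"


def build_postcode_url(item, url_template=PC_URL, field=PC_FIELD):
    """Return the lookup URL for an item, or None if it has no postcode.

    Hypothetical helper: it trims surrounding whitespace and substitutes
    the postcode into the empty brackets of the template, nothing more.
    """
    postcode = item.get(field)
    if not postcode:
        return None
    return url_template.format(postcode.strip())


# Sample item, made up for illustration.
example_item = {"name": "Example Charity", "postalCode": "SW1A1AA"}
example_url = build_postcode_url(example_item)
```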
The elasticsearch pipeline saves data to an elasticsearch index. It is generic, so will work on any object that defines an id attribute, but the object returned can be customised by adding a to_elasticsearch method to the Item object.
The to_elasticsearch() method should ensure that the following attributes are set on each item that is to be saved:
_id
- a unique identifier for the item (mandatory)
_index
- the elasticsearch index to save to (will use the ES_INDEX setting if not set)
_type
- the elasticsearch type to save to (will use the ES_TYPE setting if not set)
_op_type
- the operation type (default is index)
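A to_elasticsearch method along these lines might look like the sketch below. It uses a plain dict subclass as a stand-in for a scrapy Item so that it runs without scrapy installed, and it assumes the method returns a dictionary carrying the attributes listed above alongside the item's own fields. The class name, field names, identifier value and return shape are all assumptions for illustration.

```python
class ExampleOrganisation(dict):
    """Stand-in for a scrapy Item (which also behaves like a dict)."""

    def to_elasticsearch(self):
        # Copy the item's own fields, then add the metadata attributes.
        doc = dict(self)
        doc["_id"] = self["id"]          # mandatory unique identifier
        doc["_index"] = "charitysearch"  # leave unset to fall back to ES_INDEX
        doc["_op_type"] = "index"        # the default operation type
        return doc


# Made-up identifier and name, for illustration only.
org = ExampleOrganisation(id="EXAMPLE-0000000", name="Example Charity")
es_doc = org.to_elasticsearch()
```

Here _type is left out, so under the rules above the ES_TYPE setting would apply.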
Every item is added to the elasticsearch index in bulk, without checking whether it already exists. This means it can overwrite data.
The following settings can be defined:
ES_URL
: The URL to access the elasticsearch service (Default: http://localhost:9200)
ES_INDEX
: The elasticsearch index that data will be written to (Default: charitysearch)
ES_TYPE
: The elasticsearch type that will be given to the organisation (Default: organisation)
ES_BULK_LIMIT
: The chunk size used for sending data to elasticsearch (Default: 500)
The MongoDB pipeline is very similar to the elasticsearch one, but instead saves data to a MongoDB instance. It saves records in bulk, and will overwrite any existing records with the same ID.
Defining a to_mongodb method on an Item will allow you to customise what is saved to the database. This method should return a tuple containing the name of the collection to save to, followed by the item itself.
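The tuple returned by to_mongodb can be sketched as follows. As a simplification, a plain dict subclass stands in for a scrapy Item so the example runs without scrapy installed; the class name, collection name and sample values are hypothetical.

```python
class ExampleOrganisation(dict):
    """Stand-in for a scrapy Item (which also behaves like a dict)."""

    def to_mongodb(self):
        # Return (collection name, document to save), in that order.
        # "example_organisations" is a hypothetical collection name.
        return ("example_organisations", dict(self))


# Made-up values, for illustration only.
org = ExampleOrganisation(id="EXAMPLE-0000000", name="Example Charity")
collection, document = org.to_mongodb()
```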
The following settings are defined:
MONGO_URI
: The URI to access the MongoDB instance (Default: mongodb://localhost:27017)
MONGO_DB
: The name of the MongoDB database (Default: charitysearch)
MONGO_COLLECTION
: The default name of the MongoDB collection (only used if not returned by item.to_mongodb()) (Default: organisation)
MONGO_BULK_LIMIT
: The chunk size used for sending data to MongoDB (Default: 50000)
By default, the HTTPCACHE extension is enabled, so any data downloaded or websites visited are cached for three hours to avoid overloading the sites. This makes it relatively risk-free to rerun a scrape after adjusting other settings, such as those for saving to a database. These settings can be changed if needed.
The scrapers are also set by default to ignore the robots.txt files used on sites; this can be changed.
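In Scrapy these defaults correspond to the built-in HTTPCACHE_ENABLED, HTTPCACHE_EXPIRATION_SECS and ROBOTSTXT_OBEY settings. The fragment below is a sketch of values matching the behaviour described above, not a copy of the project's settings file.

```python
# settings.py fragment (sketch) using Scrapy's built-in setting names.
HTTPCACHE_ENABLED = True                 # cache downloaded responses
HTTPCACHE_EXPIRATION_SECS = 3 * 60 * 60  # three hours, in seconds
ROBOTSTXT_OBEY = False                   # set to True to respect robots.txt
```

Individual settings can also be overridden for a single run with the -s option, for example scrapy crawl ccew -s HTTPCACHE_ENABLED=False.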