- Contents
- Contributor guidelines
- Adding a scraper
- Contact info
This document contains guidelines on contributing to the UKBCD project, including how the project works, how to set up the environment, how we use our issue tracker, and how you can develop more scrapers.
You will need to install Python on the system you plan to run the script from. Python 3.8 and 3.9 are officially supported. Python 3.10 and 3.11 should work, but your mileage may vary.
The project uses poetry to manage dependencies and set up the build environment.

```bash
pip install poetry

# Clone the repo
git clone https://github.com/robbrad/UKBinCollectionData
cd UKBinCollectionData

# Install dependencies
poetry install
poetry shell
```
- To provide a real-world environment to learn Python and/or web scraping
- To provide UK bin data in a standardised format for use (albeit not exclusively) with HomeAssistant
- The majority of project work comes from developing new scrapers for requested councils. These can be found on the issue tracker with the `council request` label.
- Tasks that require additional input have the `help wanted` label - these can be trickier requests or may have many smaller tasks.
- Easier tasks that would be a good fit for people new to the project or to web scraping are labelled with the `good first issue` label.
This repo uses a design pattern called the Template Method, which allows a structured base class to be extended. In our case, fetching the data from the council and presenting the JSON stay the same via the abstract class, while the scraping of each council differs - this allows us to have a class for each council. You can see this in action here.
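As a rough illustration of the pattern (the class and method names below are illustrative and may not match the project's exact API), the shared behaviour lives in an abstract base class and each council only overrides the scraping step:

```python
from abc import ABC, abstractmethod
import json

import requests


class AbstractGetBinDataClass(ABC):
    """Handles the parts shared by every council: fetching the page and printing the JSON."""

    def template_method(self, url: str, **kwargs) -> dict:
        page = self.get_data(url)                # same for every council
        data = self.parse_data(page, **kwargs)   # different for every council
        print(json.dumps(data, sort_keys=True, indent=4))
        return data

    def get_data(self, url: str) -> str:
        # Shared fetching logic
        response = requests.get(url)
        response.raise_for_status()
        return response.text

    @abstractmethod
    def parse_data(self, page: str, **kwargs) -> dict:
        """Each council subclass implements its own scraping logic here."""


class CouncilClass(AbstractGetBinDataClass):
    def parse_data(self, page: str, **kwargs) -> dict:
        # Council-specific scraping of `page` goes here
        return {"bins": []}
```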
There are a few different options for scraping, and you are free to choose whichever best suits the council:
- Using Beautiful Soup 4
- Using the requests module
- Reading data from external files
- Using Selenium to automate browser behaviour
To get started, first you will need to fork this repository and set up your own working environment before you can start developing.
Once your environment is ready, create a new branch from your master/main branch and then create a new .py file within the `uk_bin_collection\councils` directory. The new .py file will be used in the CLI to call the parser, so be sure to pick a sensible name - e.g. CheshireEastCouncil.py is called with:
```bash
python collect_data.py CheshireEastCouncil <web-url>
```
To simplify things somewhat, a template file has been created - open this file, copy the contents to your new .py file and start from there (a rough sketch is also shown after the list below). You are pretty much free to approach the scraping however you would like, but please ensure that:
- Your scraper returns a dictionary made up of the key "bins" and a value that is a list of bin types and collection dates (see outputs folder for examples).
- Any dates or times are formatted to standard UK formats (see below)
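For illustration only (the HTML selectors and incoming date format here are made up, and the real template may structure things differently), a `parse_data` implementation that satisfies both requirements might look like this:

```python
from datetime import datetime

from bs4 import BeautifulSoup


def parse_data(page: str, **kwargs) -> dict:
    soup = BeautifulSoup(page, features="html.parser")

    data = {"bins": []}
    # Hypothetical markup: one row per collection, each with a bin type and a date
    for row in soup.select(".collection-row"):
        bin_type = row.select_one(".bin-type").get_text(strip=True)
        raw_date = row.select_one(".collection-date").get_text(strip=True)
        # Normalise whatever the site provides into the standard UK date format
        collection_date = datetime.strptime(raw_date, "%A %d %B %Y")
        data["bins"].append({
            "type": bin_type,
            "collectionDate": collection_date.strftime("%d/%m/%Y"),
        })
    return data
```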
UKBCD has two mandatory parameters when it runs - the name of the parser (sans .py) and the URL from which to scrape. However, developers can also get the following data using `kwargs`:
| Parameter | Prompt | Notes | kwargs.get |
|---|---|---|---|
| UPRN | `-u` or `--uprn` | | `kwargs.get('uprn')` |
| House number | `-n` or `--number` | Sometimes called PAON | `kwargs.get('paon')` |
| Postcode | `-p` or `--postcode` | Needs to be wrapped in quotes on the CLI | `kwargs.get('postcode')` |
These parameters are useful if you're using something like the requests module and need to take additional user information into the request, such as:
```bash
python collect_data.py LeedsCityCouncil https://www.leeds.gov.uk/residents/bins-and-recycling/check-your-bin-day -p "LS1 2JG" -n 41
```
In the scraper, the following code takes the inputted parameters and uses them in two different variables:
```python
user_postcode = kwargs.get("postcode")
user_paon = kwargs.get("paon")
```
Each parameter also has its own validation method that should be called after the `kwargs.get`:

- `check_uprn()`
- `check_paon()`
- `check_postcode()`

The first two are simple validators - if the parameter is used but no value is given, they will throw an exception. `check_postcode()` works differently - instead, it makes a call to the postcodes.io API to check whether the postcode exists. An exception will only be thrown here if the response code is not HTTP `200`.
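Putting the two together, a scraper that needs a postcode and house number might start along these lines (a sketch only - the validators' import path and exact signatures are assumptions here):

```python
# Assumption: the validators live in common.py and take the raw value as their only argument
from uk_bin_collection.common import check_paon, check_postcode


def parse_data(page, **kwargs) -> dict:
    user_postcode = kwargs.get("postcode")
    user_paon = kwargs.get("paon")

    # Fail early if a value is missing or the postcode is not recognised by postcodes.io
    check_paon(user_paon)
    check_postcode(user_postcode)

    # ... use user_postcode and user_paon to build the council's search request ...
    return {"bins": []}
```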
The project has a small but growing library of functions (and the occasional variable) that are useful when scraping websites or calendars - aptly named `common.py`. Useful functions include:
- functions to add ordinals to dates (04 becomes 4th) or remove them (4th becomes 04)
- a function to check if a date is a holiday in a given part of the UK
- a function that returns the dates of a given weekday in N amounts of weeks
- a function that returns a list of dates every N days from a given start date
`common.py` also contains a standardised date format variable called `date_format`, which is useful to call when formatting datetimes.
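For example, a scraped date can be normalised with `date_format` along these lines (the import path is an assumption, and the underlying format string is assumed to match the `DD/MM/YYYY` dates in the example outputs below):

```python
from datetime import datetime

from uk_bin_collection.common import date_format  # assumed import path

# Parse whatever the council site provides, then output it in the standard UK format
scraped = datetime.strptime("Friday 29 July 2022", "%A %d %B %Y")
print(scraped.strftime(date_format))  # e.g. 29/07/2022
```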
Please feel free to contribute to this library as you see fit - added functions should include the following (see the example after this list):
- a clear, lowercase and underscored name
- parameter types
- a return type (if there is one)
- a docstring describing what the function does, as well as parameter and return type descriptors
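For example, a new helper meeting those points might look like this (the function itself is purely hypothetical and not part of `common.py`):

```python
from datetime import datetime, timedelta


def get_next_occurrence_of_weekday(weekday: int, start: datetime) -> datetime:
    """
    Returns the first date on or after `start` that falls on the given weekday.

    :param weekday: Day of the week as an integer, where Monday is 0 and Sunday is 6
    :param start: Date to begin searching from
    :return: The next matching date
    """
    days_ahead = (weekday - start.weekday()) % 7
    return start + timedelta(days=days_ahead)
```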
In order for your scraper to work with the project's testing suite, some additional files need to be provided or modified:
Note: from here on, anything containing `<council_name>` should be replaced with the scraper's name.
| Type | File location |
|---|---|
| Modify | `UKBinCollectionData/uk_bin_collection/tests/input.json` |
Each council should have a node that matches the scraper's name. The node should include arguments in curly braces - the URL is mandatory, but any additional parameters like UPRN or postcode should also be provided. Councils should be listed in alphabetical order.
A "wiki_name" argument with the council's full name should also be provided.
A "wiki_note" argument should be used where non-standard instructions of just providing UPRN/Postcode/House Number parameters are needed.
A "wiki_command_url_override" argument should be used where parts of the URL need to be replaced by the user to allow a valid URL to be left for the integration tests.
A new Wiki entry will be generated automatically from this file's details.
Note: if you want the integration test to work, you must supply real, working data (a business address is recommended - the council's address is usually a good one).
Example
"CheshireEastCouncil": {
"uprn": "100012791226",
"url": "https://online.cheshireeast.gov.uk/MyCollectionDay/SearchByAjax/GetBartecJobList?uprn=100012791226&onelineaddress=3%20COBBLERS%20YARD,%20SK9%207DZ&_=1621149987573",
"wiki_name": "Cheshire East Council",
"wiki_command_url_override": "https://online.cheshireeast.gov.uk/MyCollectionDay/SearchByAjax/GetBartecJobList?uprn=XXXXXXXX&onelineaddress=XXXXXXXX&_=1621149987573",
"wiki_note": "Both the UPRN and a one-line address are passed in the URL, which needs to be wrapped in double quotes. The one-line address is made up of the house number, street name and postcode.\nUse the form [here](https://online.cheshireeast.gov.uk/mycollectionday/) to find them, then take the first line and post code and replace all spaces with `%20`."
},
| Type | File location |
|---|---|
| Add | `UKBinCollectionData/uk_bin_collection/tests/outputs/<council_name>.json` |
A sample of what the scraper outputs should be provided in the outputs folder. This can be taken from your development environment's console or a CLI. Please only include the "bins" data.
Adding the `-d` or `--dev_mode` parameter to your CLI command enables development mode, which creates/updates the output JSON file for the council automatically on each run.
Example
```json
{
    "bins": [
        {
            "type": "Empty Standard Mixed Recycling",
            "collectionDate": "29/07/2022"
        },
        {
            "type": "Empty Standard Garden Waste",
            "collectionDate": "29/07/2022"
        },
        {
            "type": "Empty Standard General Waste",
            "collectionDate": "05/08/2022"
        }
    ]
}
```
| Type | File location |
|---|---|
| Add | `UKBinCollectionData/uk_bin_collection/tests/council_schemas/<council_name>.schema` |
Using the above output, you will need to generate a JSON schema that the integration test can run against. Luckily, this is pretty easy and can be done using an online tool.
Note: due to seasonal collections (entirely dependent on the council, of course), the schema may not include all bin types. If this is the case, you may need to add them to the bin type `enum` manually (usually near the end of the file).
Example
```json
{
    "$schema": "http://json-schema.org/draft-06/schema#",
    "$ref": "#/definitions/Welcome10",
    "definitions": {
        "Welcome10": {
            "type": "object",
            "additionalProperties": false,
            "properties": {
                "bins": {
                    "type": "array",
                    "items": {
                        "$ref": "#/definitions/Bin"
                    }
                }
            },
            "required": [
                "bins"
            ],
            "title": "Welcome10"
        },
        "Bin": {
            "type": "object",
            "additionalProperties": false,
            "properties": {
                "type": {
                    "$ref": "#/definitions/Type"
                },
                "collectionDate": {
                    "type": "string"
                }
            },
            "required": [
                "collectionDate",
                "type"
            ],
            "title": "Bin"
        },
        "Type": {
            "type": "string",
            "enum": [
                "Empty Standard Mixed Recycling",
                "Empty Standard Garden Waste",
                "Empty Standard General Waste"
            ],
            "title": "Type"
        }
    }
}
```
| Type | File location |
|---|---|
| Modify | `UKBinCollectionData/uk_bin_collection/tests/features/validate_council_outputs.feature` |
The council's name should be added to the feature file's example list. These names are alphabetically sorted, although `council` should always remain on line 10. The name should be wrapped in pipes.
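For illustration (the surrounding layout of the feature file may differ slightly), the new entry simply slots into the pipe-wrapped list in alphabetical order, below the `council` header:

```text
| council             |
| CheshireEastCouncil |
| <council_name>      |
```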
As with any web scraping project, there's a reliance on the council not changing their website - if this happens, Beautiful Soup will fail to read the site correctly and the expected data will not be returned. To mitigate this and stay on top of "what works and what needs work", we have created a set of integration tests driven by a feature file.
Based on the input.json, these do an actual live run against the council's site and validate whether the returned data is JSON and conforms to the JSON schema for that council.
```bash
cd UKBinCollectionData
poetry shell
poetry run pytest uk_bin_collection/tests/step_defs/ -n logical
```
The GitHub Actions workflow is set to run on push and pull_request events.
It uses a Makefile to run the Behave tests to ensure the councils are all still working.
The workflow publishes the Allure Behave test results to GitHub Pages at https://robbrad.github.io/UKBinCollectionData/<python_version>/ (e.g. https://robbrad.github.io/UKBinCollectionData/3.9/); you can check this to see whether a council is still working as expected.
The CodeCov.io report can be found here
As well as integration testing, the repo is set up to unit test some of the static methods to ensure basic core functionality.
```bash
cd UKBinCollectionData
poetry shell
poetry run coverage run --omit "*/tests/*" -m pytest uk_bin_collection/tests --ignore=uk_bin_collection/tests/step_defs/
poetry run coverage xml
```
If you have questions or comments, you can reach the project contributors in the following ways: