This is just a rough brain dump of things I could think of that are important when writing a crawler from scratch. It should be written up more cleanly and combined with an example crawler. Note that some of the points below refer to things that are not yet in the code base.
New crawler requirements
If you add a new data source:
- Update ACKNOWLEDGMENTS.md

If you add a new Python dependency:
- Update requirements.txt

Always required:
- Add/update the README
- Update the example config
- Check code style with pre-commit (e.g., pre-commit run --all-files)
- Test your crawler by actually running it
Points to note
Use sets to avoid creating duplicate nodes.
Consider whether you need to specify all=False when creating/fetching nodes.
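As a minimal sketch of both points, assuming a batch helper along the lines of IYP's batch_get_nodes_by_single_prop (verify the name, signature, and the exact meaning of all=False against the code base):

```python
# Inside a crawler's run() method: collect values in a set so each
# node is created only once, even if a value appears many times in
# the input data.
asns = set()
for entry in data:
    asns.add(int(entry['asn']))

# all=False is assumed here to mean "only return IDs for the given
# values" rather than fetching every node with this label.
asn_id = self.iyp.batch_get_nodes_by_single_prop('AS', 'asn', asns, all=False)
```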
Do not use sys.exit; raise an exception to kill a crawler instead, and log the error before raising it.
Do not log to stdout/stderr; use logging.{info,warning,error} as appropriate. You can log some steps at the info level, but do not be too verbose. warning should be reserved for unexpected cases that are not critical enough to justify killing the crawler; error should be followed by an exception. Batch functions automatically log node/relationship creations, so you do not have to do this manually.
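For example, a minimal sketch of the fetch-and-fail pattern (the function and exception type here are placeholders; use a project-specific exception if one exists):

```python
import logging

import requests


def fetch_data(url: str) -> dict:
    resp = requests.get(url)
    if resp.status_code != 200:
        # Log the error first, then raise; never call sys.exit().
        logging.error(f'Cannot fetch data from {url}: status {resp.status_code}')
        raise RuntimeError(f'Request to {url} failed')
    logging.info('Data file fetched successfully.')
    return resp.json()
```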
Be aware of property formatters that might change your ID property, and do the formatting yourself beforehand (e.g., IPv6 addresses; the crawler will crash anyway if you fail to do this).
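For IPv6, the standard library can produce the canonical form up front, e.g.:

```python
import ipaddress

# Normalize to the compressed, lowercase canonical form before using
# the address as an ID property, so it matches what a property
# formatter would produce.
raw = '2001:DB8:0:0:0:0:0:1'
canonical = str(ipaddress.ip_address(raw))  # '2001:db8::1'
```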
If possible, iterate over the data only once, gathering nodes and relationships in the process. Then iterate over the relationships and replace node values with their IDs. This way you do not have to perform the formatting twice (and it is probably faster).
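A sketch of this single-pass pattern, reusing the assumed batch helpers from above (the link dict layout with src_id/dst_id/props follows the pattern seen in IYP crawlers, but treat it as an assumption):

```python
# First (and only) pass over the data: gather node values and
# relationships together.
asns = set()
prefixes = set()
links = []
for entry in data:
    asn = int(entry['asn'])
    prefix = entry['prefix']
    asns.add(asn)
    prefixes.add(prefix)
    links.append({'src': asn, 'dst': prefix, 'props': [self.reference]})

# Create/fetch all nodes with two batch calls.
asn_id = self.iyp.batch_get_nodes_by_single_prop('AS', 'asn', asns, all=False)
prefix_id = self.iyp.batch_get_nodes_by_single_prop('Prefix', 'prefix', prefixes, all=False)

# Second pass over the relationships only: swap values for node IDs.
for link in links:
    link['src_id'] = asn_id[link.pop('src')]
    link['dst_id'] = prefix_id[link.pop('dst')]

self.iyp.batch_add_links('ORIGINATE', links)
```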
Use batch functions by default except when you are very sure you will only create a few nodes/relationships.
Specify reference_url_data as precisely as possible, especially if it changes for parts of the data within the same crawler. Also try to use URLs that point to the correct data even when accessed at a later point in time. Note: URL is used as the default value for reference_url_data. Always specify a URL, even if it is imprecise and gets updated in the code; it makes it easy to see where the crawler gets its data just by looking at the file header.
Try to specify a reference_url_info that gives an explanation of, or reference for, the data.
Try to give a precise reference_time_modification, but do not add it if you are unsure. For this field it is better to give no information than wrong information.
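Putting the three reference fields together, assuming the base crawler exposes them through a self.reference dict as in IYP (check the field names against the code base; the URLs are hypothetical):

```python
from datetime import datetime, timezone

# Point reference_url_data at the exact file, ideally a dated URL
# that still resolves to the same data later on.
self.reference['reference_url_data'] = 'https://example.com/data/2024-01-15.json'
# reference_url_info explains the data rather than containing it.
self.reference['reference_url_info'] = 'https://example.com/docs/data-format'
# Only set this if the source states the modification time explicitly.
self.reference['reference_time_modification'] = datetime(2024, 1, 15, tzinfo=timezone.utc)
```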
Check that you do not create empty nodes or duplicate relationships. Multiple relationships between two nodes are fine if the relationship properties differ.
Do not change the interface of the default crawler: create_db always uses the default Crawler(ORG, URL, NAME) call. The main function in the crawler file is only for testing or individual runs of the crawler and should not be modified.
NAME should always be of the form directory.file.
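A sketch of the expected module layout (the BaseCrawler import path and the main() body are assumptions based on the common IYP crawler pattern):

```python
from iyp import BaseCrawler

ORG = 'Example Org'
URL = 'https://example.com/data.json'  # hypothetical data URL
NAME = 'example.example'  # always directory.file


class Crawler(BaseCrawler):
    def run(self):
        # Fetch the data and push nodes/relationships here.
        ...


# create_db always instantiates Crawler(ORG, URL, NAME); main() below
# exists only for testing or individual runs and stays untouched.
def main() -> None:
    crawler = Crawler(ORG, URL, NAME)
    crawler.run()
    crawler.close()


if __name__ == '__main__':
    main()
```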
Cache data where appropriate, using the tmp directory (advanced usage; not required for most crawlers).
In general, do not manipulate the data, e.g., by removing entries or renaming properties. IYP is a tool that combines different data sources, and detecting differences between the data sources is part of that. If you are unsure, feel free to ask, since this is not always a clear line. For example, a crawler that adds IP prefixes from BGP should not filter out private IP prefixes, even though they do not belong in BGP, but it should remove prefixes with an invalid format.
Do not add data to nodes apart from the ID properties. Data-source-specific information can (and should) be attached to the created relationships. Nodes are accessed by different crawlers and thus should only contain information that all crawlers share.
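For example, extra fields from the source belong on the relationship, not the node (same assumed link layout as above):

```python
# The Prefix node keeps only its ID property ('prefix'); the
# source-specific 'visibility' value goes on the relationship.
links.append({
    'src_id': asn_id[asn],
    'dst_id': prefix_id[prefix],
    'props': [self.reference, {'visibility': entry['visibility']}],
})
```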