This is just a rough brain dump of things I could think of that are important when writing a crawler from scratch. It should be written up more cleanly and combined with an example crawler. Note that some of the points below refer to things that are not yet in the code base.
New crawler requirements
If you add a new data source:
- Update ACKNOWLEDGMENTS.md

If you add a new Python dependency:
- Update requirements.txt

Always required:
- Add/update the README
- Update the example config
- Check code style with pre-commit (e.g., pre-commit run --all-files)
- Test your crawler by actually running it
Points to note
Use sets to avoid creating duplicate nodes.
Consider whether you need to specify all=False when creating/fetching nodes.
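As a minimal sketch of both points, assuming a batch helper along the lines of IYP's batch_get_nodes_by_single_prop (verify the name, signature, and the exact meaning of all=False against the code base):

```python
# Inside a crawler's run() method: collect values in a set so each
# node is created only once, even if a value appears many times in
# the input data.
asns = set()
for entry in data:
    asns.add(int(entry['asn']))

# all=False is assumed here to mean "only return IDs for the given
# values" rather than fetching every node with this label.
asn_id = self.iyp.batch_get_nodes_by_single_prop('AS', 'asn', asns, all=False)
```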
Do not use sys.exit; raise an exception to kill a crawler instead, and log the error before raising it.
Do not log to stdout/stderr; use logging.{info,warning,error} as appropriate. You can log some steps at the info level, but do not be too verbose. warning should be reserved for unexpected cases that are not critical enough to justify killing the crawler; error should be followed by an exception. Batch functions automatically log node/relationship creations, so you do not have to do this manually.
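For example, a minimal sketch of the fetch-and-fail pattern (the function and exception type here are placeholders; use a project-specific exception if one exists):

```python
import logging

import requests


def fetch_data(url: str) -> dict:
    resp = requests.get(url)
    if resp.status_code != 200:
        # Log the error first, then raise; never call sys.exit().
        logging.error(f'Cannot fetch data from {url}: status {resp.status_code}')
        raise RuntimeError(f'Request to {url} failed')
    logging.info('Data file fetched successfully.')
    return resp.json()
```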
Be aware of property formatters that might change your ID property, and do the formatting yourself beforehand (e.g., IPv6 addresses; the crawler will crash anyway if you fail to do this).
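For IPv6, the standard library can produce the canonical form up front, e.g.:

```python
import ipaddress

# Normalize to the compressed, lowercase canonical form before using
# the address as an ID property, so it matches what a property
# formatter would produce.
raw = '2001:DB8:0:0:0:0:0:1'
canonical = str(ipaddress.ip_address(raw))  # '2001:db8::1'
```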
If possible, iterate over the data only once, gathering nodes and relationships in the process. Then iterate over the relationships and replace node values with their IDs. This way you do not have to perform the formatting twice (and it is probably faster).
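A sketch of this single-pass pattern, reusing the assumed batch helpers from above (the link dict layout with src_id/dst_id/props follows the pattern seen in IYP crawlers, but treat it as an assumption):

```python
# First (and only) pass over the data: gather node values and
# relationships together.
asns = set()
prefixes = set()
links = []
for entry in data:
    asn = int(entry['asn'])
    prefix = entry['prefix']
    asns.add(asn)
    prefixes.add(prefix)
    links.append({'src': asn, 'dst': prefix, 'props': [self.reference]})

# Create/fetch all nodes with two batch calls.
asn_id = self.iyp.batch_get_nodes_by_single_prop('AS', 'asn', asns, all=False)
prefix_id = self.iyp.batch_get_nodes_by_single_prop('Prefix', 'prefix', prefixes, all=False)

# Second pass over the relationships only: swap values for node IDs.
for link in links:
    link['src_id'] = asn_id[link.pop('src')]
    link['dst_id'] = prefix_id[link.pop('dst')]

self.iyp.batch_add_links('ORIGINATE', links)
```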
Use batch functions by default except when you are very sure you will only create a few nodes/relationships.
Specify reference_url_data as precisely as possible, especially if it changes for parts of the data within the same crawler. Also try to use URLs that point to the correct data even when accessed at a later point in time. Note: URL is used as the default value for reference_url_data. Always specify a URL, even if it is imprecise and gets updated in the code; it makes it easy to see where the crawler gets its data just by looking at the file header.
Try to specify a reference_url_info that gives an explanation of, or reference for, the data.
Try to give a precise reference_time_modification, but do not add it if you are unsure. For this field it is better to give no information than wrong information.
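Putting the three reference fields together, assuming the base crawler exposes them through a self.reference dict as in IYP (check the field names against the code base; the URLs are hypothetical):

```python
from datetime import datetime, timezone

# Point reference_url_data at the exact file, ideally a dated URL
# that still resolves to the same data later on.
self.reference['reference_url_data'] = 'https://example.com/data/2024-01-15.json'
# reference_url_info explains the data rather than containing it.
self.reference['reference_url_info'] = 'https://example.com/docs/data-format'
# Only set this if the source states the modification time explicitly.
self.reference['reference_time_modification'] = datetime(2024, 1, 15, tzinfo=timezone.utc)
```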
Check that you do not create empty nodes or duplicate relationships. Multiple relationships between two nodes are fine if the relationship properties differ.
Do not change the interface of the default crawler: create_db always uses the default Crawler(ORG, URL, NAME) call. The main function in the crawler file is only for testing or individual runs of the crawler and should not be modified.
NAME should always be of the form directory.file.
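A sketch of the expected module layout (the BaseCrawler import path and the main() body are assumptions based on the common IYP crawler pattern):

```python
from iyp import BaseCrawler

ORG = 'Example Org'
URL = 'https://example.com/data.json'  # hypothetical data URL
NAME = 'example.example'  # always directory.file


class Crawler(BaseCrawler):
    def run(self):
        # Fetch the data and push nodes/relationships here.
        ...


# create_db always instantiates Crawler(ORG, URL, NAME); main() below
# exists only for testing or individual runs and stays untouched.
def main() -> None:
    crawler = Crawler(ORG, URL, NAME)
    crawler.run()
    crawler.close()


if __name__ == '__main__':
    main()
```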
Cache data where appropriate, using the tmp directory (advanced usage; not required for most crawlers).
In general, do not manipulate the data, e.g., by removing entries or renaming properties. IYP is a tool that combines different data sources, and detecting differences between the data sources is part of that. If you are unsure, feel free to ask, since this is not always a clear line. For example, a crawler that adds IP prefixes from BGP should not filter out private IP prefixes, even though they do not belong in BGP, but it should remove prefixes with an invalid format.
Do not add data to nodes apart from the ID properties. Data-source-specific information can (and should) be attached to the created relationships. Nodes are accessed by different crawlers and thus should only contain information that all crawlers share.
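For example, extra fields from the source belong on the relationship, not the node (same assumed link layout as above):

```python
# The Prefix node keeps only its ID property ('prefix'); the
# source-specific 'visibility' value goes on the relationship.
links.append({
    'src_id': asn_id[asn],
    'dst_id': prefix_id[prefix],
    'props': [self.reference, {'visibility': entry['visibility']}],
})
```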