Add form submissions endpoint #96
base: master
Conversation
Hi @dorcieg, thanks for your contribution! In order for us to evaluate and accept your PR, we ask that you sign a contribution license agreement. It's all electronic and will take just minutes.
You did it @dorcieg! Thank you for signing the Singer Contribution License Agreement.
Thanks for the submission! Some quick comments on this.
For the most part this sounds like a useful stream to include, and looks like it's a solid foundation. There are some edge cases that can come up, especially when we're paging through a parent stream and syncing a child stream per parent object. I would imagine that this stream can take some time to get all the way through for large datasets, so it's likely vulnerable to potential interruption (network, timeouts, etc.).
For when we get the time to work on integrating this, a few pieces of information are usually helpful in that process.
- Some information about expected/actual row volume would be good to know. In order to be efficient with users' destinations, we like to be sure that the bookmark is being used correctly, and that there's minimal duplicate data being sent to ensure everything gets replicated.
- Some logs (with any data/PII/creds redacted, of course) of it running would be helpful to illustrate things like bookmark usage through the log messages.
- Snippets of logs of running it with target-stitch and an Import API connection in Stitch are also a good step to validate that the JSON Schema matches the data that comes through. The Stitch API validates each record against the schema and will return an error if a record does not match.
That's all for now! Please let me know if I can elaborate on any of the comments!
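The per-record schema check mentioned in the last bullet can be pictured with a toy validator. This is only a sketch of the idea; the real Import API performs full JSON Schema validation, and the schema fragment below is hypothetical:

```python
def record_matches_schema(record, schema):
    """Toy sketch of the per-record check: each property present on the
    record must have the JSON type its schema declares. (The real
    Import API does full JSON Schema validation.)"""
    type_map = {
        'string': str,
        'integer': int,
        'number': (int, float),
        'boolean': bool,
    }
    for name, spec in schema.get('properties', {}).items():
        expected = type_map.get(spec.get('type'))
        if expected and name in record and not isinstance(record[name], expected):
            return False
    return True

# Hypothetical fragment of a form_submissions schema
schema = {'properties': {'guid': {'type': 'string'},
                         'submittedAt': {'type': 'integer'}}}
record_matches_schema({'guid': 'abc', 'submittedAt': 1548430000000}, schema)  # True
record_matches_schema({'guid': 'abc', 'submittedAt': '2019-01-25'}, schema)   # False
```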
```python
    return STATE

def _sync_form_submissions_by_form_id(STATE, form_guid):
    schema = load_schema("form_submissions")
```
Looks like the `schemas/form_submissions.json` file needs to be added to the PR?
```python
        singer.write_state(STATE)
        LOGGER.info("No more submissions for this form")
        break
    STATE = singer.write_bookmark(STATE, form_guid, bookmark_key, max_bk_value.strftime("%Y-%m-%d %H:%M:%S"))
```
There's a subtle edge case with this replication strategy that I'm working on right now in #98. If you expect this stream to be high in volume, it could take a while to sync, and there's a kind of race condition in how the updates come in that can cause some to be skipped. #91 has a short illustration of how this can play out.
The best solution I have right now that's resilient to tap failures is storing the `current_sync_start` in the state and making sure the bookmark doesn't get set above that value (see #98). It might be worth trying that pattern here, too.
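The clamping behavior described here can be sketched with plain `datetime` values standing in for singer's state helpers (names and timestamps below are illustrative, not the #98 implementation):

```python
from datetime import datetime, timezone

def clamped_bookmark(max_bk_value, current_sync_start):
    """Never advance the bookmark past the moment this sync began.
    Updates that land while a long-running sync is in flight could
    otherwise slip past the bookmark without ever being replicated."""
    return min(max_bk_value, current_sync_start)

# Hypothetical timestamps: a submission arrives five minutes into the sync.
sync_start = datetime(2019, 3, 1, 12, 0, tzinfo=timezone.utc)
newest_seen = datetime(2019, 3, 1, 12, 5, tzinfo=timezone.utc)

bookmark = clamped_bookmark(newest_seen, sync_start)
# bookmark stays at sync_start, so anything that arrived mid-sync
# is re-fetched on the next run instead of being skipped
```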
Also, the bookmark should probably not be written to state until after all forms have been checked, to limit the strange edge cases that can come up if the tap is interrupted. This function can return the `max_bk_value` and accept it as a parameter to maintain it over the whole stream sync. My comment above was assuming that would be the case.
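Threading a single `max_bk_value` through the whole stream and persisting it only at the end might look like the sketch below. The state is a plain dict and `sync_one_form` is a hypothetical stand-in for `_sync_form_submissions_by_form_id`, reshaped to return the max it saw instead of writing the bookmark itself:

```python
def sync_form_submissions_stream(state, form_guids, sync_one_form):
    """Track one max bookmark across every form and persist it only
    after all forms have been checked, so an interruption mid-stream
    leaves the old bookmark intact and no records get skipped."""
    max_bk_value = state.get('form_submissions')
    for guid in form_guids:
        max_bk_value = sync_one_form(guid, max_bk_value)
    state['form_submissions'] = max_bk_value  # written once, at the end
    return state

# Hypothetical per-form syncer returning the newest timestamp it saw.
latest = {'form-a': 100, 'form-b': 300, 'form-c': 200}

def sync_one_form(guid, max_bk_value):
    seen = latest[guid]
    return seen if max_bk_value is None or seen > max_bk_value else max_bk_value

state = sync_form_submissions_stream({}, ['form-a', 'form-b', 'form-c'], sync_one_form)
# state['form_submissions'] == 300
```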
`tap_hubspot/__init__.py` (outdated)
```python
    while up_to_date == False:
        form_offset = singer.get_offset(STATE, form_guid)

        if bool(form_offset) and form_offset.get('after') != None:
```
`bool(form_offset)` can just be changed to `form_offset`.
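A minimal illustration of the simplification (the `!= None` comparison in the same line also reads better as the idiomatic `is not None`):

```python
def has_pending_offset(form_offset):
    # A non-empty dict is already truthy, so wrapping it in bool() is
    # redundant, and `is not None` is the idiomatic None check.
    return bool(form_offset and form_offset.get('after') is not None)

has_pending_offset({'after': '42'})   # True
has_pending_offset({})                # False
has_pending_offset({'after': None})   # False
```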
```python
    data = request(get_url("forms")).json()

    for row in data:
        STATE = _sync_form_submissions_by_form_id(STATE, row['guid'])
```
The docs say this can only be used for some types of forms. Can/Should that be checked here to prevent possible errors?
I use the endpoint `/forms/v2/forms` to get the list of forms to iterate through. By default, non-marketing forms are filtered out of this endpoint according to the docs, so I think we're okay.
`tap_hubspot/__init__.py` (outdated)
```python
    bookmark_key = 'last_max_submitted_at'

    singer.write_schema("form_submissions", schema, ['guid', 'submittedAt', 'pageUrl'], [bookmark_key])
    end = utils.strptime_with_tz(get_start(STATE, form_guid, bookmark_key))
```
`utils.strptime_to_utc` is the recommended function to use for consistency (unless you need the original timezone).
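What UTC normalization buys can be sketched with plain `datetime`. This mimics the behavior for illustration only; it is not singer's implementation:

```python
from datetime import datetime, timezone

def to_utc(dtime):
    """Sketch of strptime_to_utc-style normalization: parse an ISO-8601
    timestamp and convert it to UTC, so every bookmark comparison
    happens in a single timezone."""
    parsed = datetime.fromisoformat(dtime)
    if parsed.tzinfo is None:
        return parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)

to_utc("2019-01-25T10:00:00-05:00")
# → datetime(2019, 1, 25, 15, 0, tzinfo=timezone.utc)
```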
```python
            if submitted_at > max_bk_value:
                max_bk_value = submitted_at

            if submitted_at <= end:
```
This could use a comment documenting that the stream is returned in reverse order. (It just tripped me up reading it heh)
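The early-exit logic being asked about might read like this sketch, with plain dicts standing in for submission records and hypothetical names:

```python
def take_new_submissions(submissions, end):
    """The endpoint returns submissions newest-first, so the first
    record at or before the bookmark (`end`) means every remaining
    record is older too, and the loop can stop early."""
    new = []
    for sub in submissions:
        if sub['submittedAt'] <= end:
            break  # reached already-synced territory; the rest is older
        new.append(sub)
    return new

subs = [{'submittedAt': t} for t in (50, 40, 30, 20)]  # newest first
take_new_submissions(subs, end=30)
# → only the two records newer than the bookmark (50 and 40)
```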
`tap_hubspot/__init__.py` (outdated)
```python
    url = get_url("form_submissions", form_guid=form_guid)
    path = 'results'
    params = {
        'count': 50
```
Should this be `limit`? I don't see a `count` parameter available.
On January 25, 2019, HubSpot released a new endpoint to retrieve form submissions by form. This eliminates the need to rely on the form submission information on the contact object. To use this new endpoint you need to know the form id, so this new function retrieves all of the form ids and then loops through them to get the form submissions. This endpoint is also different in that it returns the newest submissions first, so the key used in the state file per form guid is `last_max_submitted_at`. The code compares this to the current form submission being processed and stops once the last submission is reached.
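The flow described above can be sketched end to end. The `list_forms` and `list_submissions` callables are hypothetical stand-ins for the two HubSpot endpoints, and state is a plain per-guid dict rather than singer's state structure:

```python
def sync_all_form_submissions(state, list_forms, list_submissions):
    """For each form: page its submissions newest-first, stop at the
    previously synced last_max_submitted_at, and record the new max."""
    for form in list_forms():
        guid = form['guid']
        last_max = state.get(guid)          # last_max_submitted_at for this form
        new_max = last_max
        for sub in list_submissions(guid):  # newest submissions first
            if last_max is not None and sub['submittedAt'] <= last_max:
                break                       # reached the last synced submission
            if new_max is None or sub['submittedAt'] > new_max:
                new_max = sub['submittedAt']
        state[guid] = new_max
    return state

# Toy data: one form, three submissions, the oldest already synced.
forms = [{'guid': 'f1'}]
subs = {'f1': [{'submittedAt': 30}, {'submittedAt': 20}, {'submittedAt': 10}]}
state = sync_all_form_submissions({'f1': 10}, lambda: forms, lambda g: subs[g])
# state == {'f1': 30}
```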