Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add form submissions endpoint #96

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

dorcieg
Copy link

@dorcieg dorcieg commented Aug 14, 2019

On January 25, 2019 Hubspot released a new endpoint to retrieve form submissions by form. This eliminates the need to rely on the form submission information on the contact object. To use this new endpoint you need to know the form id so this new function retrieves all of the form ids and then loops through them to get the form submissions. This endpoint is also different in that it returns the newest submissions first so the key used in the state file per form guid is last_max_submitted_at. The code compares this to the current form submission being processed and stops once the last submission is reached.

@cmerrick
Copy link
Contributor

Hi @dorcieg, thanks for your contribution!

In order for us to evaluate and accept your PR, we ask that you sign a contribution license agreement. It's all electronic and will take just minutes.

@cmerrick
Copy link
Contributor

You did it @dorcieg!

Thank you for signing the Singer Contribution License Agreement.

Copy link
Contributor

@dmosorast dmosorast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the submission! Some quick comments on this.

For the most part this sounds like a useful stream to include, and looks like it's a solid foundation. There are some edge cases that can come up, especially when we're paging through a parent stream and syncing a child stream per parent object. I would imagine that this stream can take some time to get all the way through for large datasets, so it's likely vulnerable to potential interruption (network, timeouts, etc.).

For when we get the time to work on integrating this, a few pieces of information are usually helpful in that process.

  1. Some information about expected/actual row volume would be good to know. In order to be efficient with users' destinations, we like to be sure that the bookmark is being used correctly, and that there's minimal duplicate data being sent to ensure everything gets replicated.
  2. Some logs (with any data/PII/creds redacted, of course) of it running would be helpful to illustrate things like bookmark usage through the log messages.
  3. Snippets of logs of running it with target-stitch and an Import API connection in Stitch are also a good step to validate that the JSON Schema matches the data that comes through. The Stitch API validates each record against the schema and will return an error if a record does not match.

That's all for now! Please let me know if I can elaborate on any of the comments!

return STATE

def _sync_form_submissions_by_form_id(STATE, form_guid):
schema = load_schema("form_submissions")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the schemas/form_submissions.json file needs added to the PR?

singer.write_state(STATE)
LOGGER.info("No more submissions for this form")
break
STATE = singer.write_bookmark(STATE, form_guid, bookmark_key, max_bk_value.strftime("%Y-%m-%d %H:%M:%S"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a subtle edge case with this replication strategy that I'm working on right now in #98. If you expect this stream to be high in volume, it could take awhile to sync and there's a kind of race condition with how the updates come in that can cause some to get skipped. #91 has a short illustration of how this can play out.

The best solution I have right now that's resilient to tap failures is storing the current_sync_start in the state and making sure the bookmark doesn't get set above that value. (see #98) It might be worth it to try that pattern here, too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the bookmark should probably not be written to state until after all forms have been checked, to limit the strange edge cases that can come up if the tap is interrupted. This can return the max_bk_value and accept it as a parameter to maintain it over the whole stream sync. My comment above was assuming that would be the case.

while up_to_date == False:
form_offset = singer.get_offset(STATE, form_guid)

if bool(form_offset) and form_offset.get('after') != None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bool(form_offset) can just be changed to form_offset

data = request(get_url("forms")).json()

for row in data:
STATE = _sync_form_submissions_by_form_id(STATE, row['guid'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docs say this can only be used for some types of forms. Can/Should that be checked here to prevent possible errors?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I use the endpoint /forms/v2/forms to get the list of forms to iterate through. By default non-marketing forms are filtered out of this endpoint according to the docs so I think we're okay.

bookmark_key = 'last_max_submitted_at'

singer.write_schema("form_submissions", schema, ['guid', 'submittedAt', 'pageUrl'], [bookmark_key])
end = utils.strptime_with_tz(get_start(STATE, form_guid, bookmark_key))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utils.strptime_to_utc is the recommended function to use for consistency (unless you need the original timezone).

if submitted_at > max_bk_value:
max_bk_value = submitted_at

if submitted_at <= end:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could use a comment documenting that the stream is returned in reverse order. (It just tripped me up reading it heh)

url = get_url("form_submissions", form_guid=form_guid)
path = 'results'
params = {
'count': 50
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be limit? I don't see a count parameter available.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants