can't scrape past a certain date #8
Comments
Hi @gtrane, I think the program is returning an empty iterator here:
I think you should explore what Pushshift is returning, e.g. if you execute the following code, what's the value of the variable
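One way to check for an empty iterator without losing its first element is sketched below. The generator `fake_search` is a hypothetical stand-in for whatever Pushshift query the downloader performs; it is not the project's code and makes no real API call. The helper name `peek` is also illustrative.

```python
from itertools import chain
from typing import Iterable, Iterator, Optional, Tuple, TypeVar

T = TypeVar("T")


def peek(results: Iterable[T]) -> Optional[Tuple[T, Iterator[T]]]:
    """Return (first_item, full_iterator), or None if `results` is empty."""
    iterator = iter(results)
    try:
        first = next(iterator)
    except StopIteration:
        return None
    # Put the first item back in front so the caller can still consume everything.
    return first, chain([first], iterator)


def fake_search(utc_after: int) -> Iterator[dict]:
    """Hypothetical stand-in for a Pushshift search; yields nothing past a cutoff."""
    cutoff = 1670743183  # the timestamp reported in this issue
    if utc_after < cutoff:
        yield {"id": "abc", "created_utc": utc_after + 1}


for start in (1670000000, 1670743183):
    peeked = peek(fake_search(start))
    if peeked is None:
        print(f"utc_after={start}: empty iterator")
    else:
        first, results = peeked
        print(f"utc_after={start}: first item {first['id']}")
```

If the real query comes back empty for a given `--utc-after` value, the problem is likely on the data-source side rather than in the downloader's own logic.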
Where do I execute this code? Within subreddit_downloader.py or in my terminal? Thank you!
Hello, thanks for this code. It has been really helpful for students like me. Whenever I try to extract data for any date after 11 December 2022, it returns empty files. I am not sure why this is happening. Do you have any idea? Looking forward to your response! Thank you so much for all your efforts and this brilliant piece of code.
Below is a piece of code I wrote to convert the timestamps in the output files to proper date formats. Attaching it here in case it comes in handy for anyone who wants to verify the dates after extraction.
import pandas as pd
import datetime
for y in list(df1['created_utc']):
df.to_csv('submissions_date_converted.csv')
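For anyone who wants to check dates without pandas, here is a minimal sketch of the same idea using only the standard library. It assumes the output rows carry a `created_utc` column of Unix timestamps in seconds (the column name comes from the snippet above). The `created_date` column, the function name, and the in-memory sample are illustrative.

```python
import csv
import io
from datetime import datetime, timezone


def add_readable_dates(rows):
    """Yield a copy of each row with an extra `created_date` column (UTC)."""
    for row in rows:
        ts = int(float(row["created_utc"]))  # tolerate values like "1670743183.0"
        new_row = dict(row)
        new_row["created_date"] = datetime.fromtimestamp(
            ts, tz=timezone.utc
        ).strftime("%Y-%m-%d %H:%M:%S")
        yield new_row


# In-memory sample standing in for a downloaded submissions CSV.
sample = io.StringIO("id,created_utc\nabc,1670743183\n")
converted = list(add_readable_dates(csv.DictReader(sample)))
print(converted[0]["created_date"])  # 2022-12-11 07:19:43
```

The sample timestamp is the one reported in this issue, and it falls on 11 December 2022, which matches the date after which the files come back empty.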
Hi @gtrane @smukherjee30, I have tested and investigated the issue. It looks like there is a major change underway in the Pushshift system [1] that is causing malfunctions [2] in some edge cases. An issue like yours is also being tracked, although with different date ranges [3]. My plan is to wait until the Pushshift migration is finished, then check whether any APIs have changed and patch the code if needed. Meanwhile, if you need the data now and cannot wait, there is a post on Reddit where people discuss possible alternatives. Hope this info is helpful! [1] https://www.reddit.com/r/pushshift/comments/zkggt0/update_on_colo_switchover_bug_fixes_reindexing/
Hi,
First of all, thank you for this great program!
I have used your code successfully to scrape a subreddit over specific UTC date ranges. However, I have run into a problem: I can't scrape anything past the UTC timestamp 1670743183.
My terminal input:
python src/subreddit_downloader.py --reddit-id --reddit-secret --reddit-username --debug --batch-size 500 --utc-after 1670743183
The error is below. I have no idea why this is occurring, any advice would be greatly appreciated! Thank you.
subreddit_downloader.py 308
typer.run(main)
main.py 859 run
app()
main.py 214 call
return get_command(self)(*args, **kwargs)
core.py 829 call
return self.main(*args, **kwargs)
core.py 782 main
rv = self.invoke(ctx)
core.py 1066 invoke
return ctx.invoke(self.callback, **ctx.params)
core.py 610 invoke
return callback(*args, **kwargs)
main.py 497 wrapper
return callback(**use_params) # type: ignore
contextlib.py 79 inner
return func(*args, **kwds)
subreddit_downloader.py 299 main
assert utc_lower_bound < utc_upper_bound, f"utc_lower_bound '{utc_lower_bound}' should be " \
TypeError:
'<' not supported between instances of 'NoneType' and 'str'
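The traceback shows the `<` comparison failing because `utc_lower_bound` is `None` while `utc_upper_bound` is a `str`. Below is a minimal sketch of a defensive check that would surface a clearer error. The helper `check_utc_bounds` and its messages are hypothetical, not part of subreddit_downloader.py, and it assumes both bounds are meant to be Unix timestamps in seconds.

```python
from typing import Optional, Tuple, Union

Bound = Optional[Union[int, str]]


def check_utc_bounds(lower: Bound, upper: Bound) -> Tuple[int, int]:
    """Coerce both bounds to int and verify lower < upper.

    Hypothetical helper, not part of subreddit_downloader.py.
    """
    if lower is None or upper is None:
        raise ValueError(f"missing UTC bound: lower={lower!r}, upper={upper!r}")
    lower_int, upper_int = int(lower), int(upper)
    if lower_int >= upper_int:
        raise ValueError(
            f"lower bound {lower_int} must be smaller than upper bound {upper_int}"
        )
    return lower_int, upper_int


# Reproduces the failure mode from the traceback: None compared with a str.
try:
    None < "1670743183"
except TypeError as exc:
    print(f"TypeError: {exc}")

print(check_utc_bounds("1670000000", "1670743183"))
```

Comparing integers rather than raw strings also avoids lexicographic surprises when two timestamps have a different number of digits.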