can't scrape past a certain date #8

Open
gtrane opened this issue Feb 24, 2023 · 5 comments

@gtrane

gtrane commented Feb 24, 2023

Hi,
First of all, thank you for this great program!
I have used your code successfully to scrape a subreddit for specific UTC date ranges. However, I have run into a problem: I can't scrape anything past the UTC timestamp 1670743183.

my input to terminal:
python src/subreddit_downloader.py --reddit-id --reddit-secret --reddit-username --debug --batch-size 500 --utc-after 1670743183

The error is below. I have no idea why this is occurring, any advice would be greatly appreciated! Thank you.

subreddit_downloader.py 308
typer.run(main)

main.py 859 run
app()

main.py 214 call
return get_command(self)(*args, **kwargs)

core.py 829 call
return self.main(*args, **kwargs)

core.py 782 main
rv = self.invoke(ctx)

core.py 1066 invoke
return ctx.invoke(self.callback, **ctx.params)

core.py 610 invoke
return callback(*args, **kwargs)

main.py 497 wrapper
return callback(**use_params) # type: ignore

contextlib.py 79 inner
return func(*args, **kwds)

subreddit_downloader.py 299 main
assert utc_lower_bound < utc_upper_bound, f"utc_lower_bound '{utc_lower_bound}' should be " \

TypeError:
'<' not supported between instances of 'NoneType' and 'str'

@pistocop
Owner

pistocop commented Mar 2, 2023

Hi @gtrane,

Before going into details:
You wrote "I can't scrape anything past the UTC: 1670743183", but your command uses "--utc-after 1670743183". I think you should use the program's --utc-before argument instead.

Anyway, I think the program is getting an empty iterator here:

for sub in submissions_generator:
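
To illustrate how an empty iterator could end in the TypeError from the traceback, here is a minimal sketch. The function `latest_created_utc` is a hypothetical stand-in, not the actual code of subreddit_downloader.py; the assumption is that the lower bound is only assigned inside the loop, so it stays None when nothing is returned.

```python
def latest_created_utc(submissions):
    """Hypothetical stand-in: return created_utc of the last item seen, or None."""
    last_seen = None
    for sub in submissions:
        last_seen = sub["created_utc"]
    return last_seen


utc_upper_bound = "1670743183"
# An empty iterator never enters the loop, so the bound stays None.
utc_lower_bound = latest_created_utc(iter([]))

try:
    assert utc_lower_bound < utc_upper_bound
except TypeError as err:
    message = str(err)
    print(message)
```

Comparing None with a str raises before the assertion message is ever built, which matches the last frame of the traceback.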

I think you should check what Pushshift is returning. For example, if you execute the following code, what is the value of the variable submissions_generator?

direction = "before"
utc_lower_bound = "1670743183"
submissions_generator = pushshift_api.search_submissions(
    subreddit=subreddit,
    limit=batch_size,
    sort='desc' if direction == "before" else 'asc',
    sort_type='created_utc',
    after=utc_upper_bound if direction == "after" else None,
    before=utc_lower_bound if direction == "before" else None,
)
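
Printing a generator only shows its object representation, so to see whether anything came back it helps to materialize a few items first. The sketch below is one way to do that in a Python session or a scratch script; `peek` is a hypothetical helper, and the two generators are stand-ins for the real submissions_generator.

```python
from itertools import islice


def peek(results, n=5):
    """Materialize at most n items from any iterable or generator."""
    return list(islice(results, n))


# Stand-in generators; in practice pass the real submissions_generator.
empty = peek(x for x in [])
some = peek({"created_utc": t} for t in (1670743000, 1670742000))

print(len(empty), len(some))
```

An empty list here would confirm that the API returned nothing for the requested range.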

@gtrane
Author

gtrane commented Mar 7, 2023

Where do I execute this code? Within subreddit_downloader.py or in my terminal? Thank you!

@smukherjee30

smukherjee30 commented Mar 29, 2023

Hello,

Thanks for this code. It has been really helpful for students like me.
Unfortunately, I have the same problem.

Whenever I try to extract data for any date after 11 December 2022, it returns empty files. I am not sure why this is happening. Do you have any idea?

Looking forward to your response! Thank you so much for all your efforts and this brilliant piece of code.
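
The two cutoffs mentioned in this thread (UTC 1670743183 and 11 December 2022) can be cross-checked with the standard library. This is a small sketch using only `datetime`; the variable names are illustrative.

```python
from datetime import datetime, timezone

cutoff = 1670743183
as_date = datetime.fromtimestamp(cutoff, tz=timezone.utc)
print(as_date.isoformat())  # 2022-12-11T07:19:43+00:00

# Going the other way: a UTC date to an epoch value for --utc-after / --utc-before.
start_of_day = int(datetime(2022, 12, 11, tzinfo=timezone.utc).timestamp())
print(start_of_day)  # 1670716800
```

Both reports therefore point at the same day.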

@smukherjee30

smukherjee30 commented Mar 29, 2023

Below is a piece of code I have written to convert the timestamps in the output files to proper date formats. Attaching it here, in case this comes in handy for anyone who wants to verify the dates post extraction.

import datetime

import pandas as pd

# Load the files produced by the downloader.
df = pd.read_csv('submissions.csv')
df1 = pd.read_csv('comments.csv')

# Convert each epoch timestamp to a datetime (fromtimestamp uses the local timezone).
df['created_utc'] = df['created_utc'].apply(datetime.datetime.fromtimestamp)
df1['created_utc'] = df1['created_utc'].apply(datetime.datetime.fromtimestamp)

df.to_csv('submissions_date_converted.csv')
df1.to_csv('comments_date_converted.csv')
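
Since `datetime.datetime.fromtimestamp` converts to the machine's local timezone, the converted values can shift by a few hours depending on where the script runs. A possible alternative, sketched below on a small in-memory frame instead of the real CSV files, is `pd.to_datetime` with `unit="s"` and `utc=True`, which keeps the values in UTC. The helper name `convert_created_utc` and the sample data are illustrative.

```python
import pandas as pd


def convert_created_utc(frame):
    """Return a copy with created_utc parsed as timezone-aware UTC datetimes."""
    out = frame.copy()
    out["created_utc"] = pd.to_datetime(out["created_utc"], unit="s", utc=True)
    return out


# Sample rows standing in for the contents of submissions.csv.
sample = pd.DataFrame({"id": ["a", "b"], "created_utc": [1670743183, 1670716800]})
converted = convert_created_utc(sample)
print(converted)
```

The same helper could be applied to the frames loaded from submissions.csv and comments.csv before writing them back out.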

@pistocop
Owner

pistocop commented Apr 5, 2023

Hi @gtrane @smukherjee30,

I have tested and investigated the issue. It looks like there has been an important change in the Pushshift system [1] that is causing malfunctions [2] in some "edge" cases. An issue like yours is also being tracked, although with different date ranges [3].

My plan is to wait until the Pushshift migration is complete, then check whether any APIs have changed and patch the code if needed. Meanwhile, if you need the data now and cannot wait, there is a post on Reddit where people discuss possible alternatives.

Hope this info could be helpful!

[1] https://www.reddit.com/r/pushshift/comments/zkggt0/update_on_colo_switchover_bug_fixes_reindexing/
[2] https://github.com/pushshift/api/issues
[3] pushshift/api#132
