Unable to download Huge corpus of papers #31

UddeshyaPandey · 2022-01-12T09:30:13Z

Describe the bug
I was downloading XML and CSV files for all papers published in 2021 for the query "Transcription factors". The limit was set to 100k papers and there were 99k hits. Ideally the download should start, with a warning at most, but instead it fails with this error:
TypeError: 'NoneType' object is not subscriptable

To Reproduce
Steps to reproduce the behaviour:

In your Windows command prompt, type
pygetpapers -q "Transcription factors" -x -c -o TF_database_2021 -k 100000 --startdate 2021-01-01 --enddate 2021-12-31
Press Enter
Scroll down to the end of the output
See an error like
TypeError: 'NoneType' object is not subscriptable

Expected behaviour

Ideally, it should start the download of all the available XML and CSV files related to the query

Screenshots

Desktop (please complete the following information):

OS: Windows 11
Browser : Firefox
Version : Firefox 95.0

Additional context
It usually works for a small corpus of roughly 100 to 1000 papers. For example, pygetpapers ran the above query smoothly for the year 2022 with the limit set to 1000 papers. There were only 458 hits, and it downloaded all 458 papers with CSV and XML files.
But for a huge corpus, usually more than 1k papers, it shows the above error message.
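
For readers unfamiliar with the message: the traceback is not included here, so where it arises inside pygetpapers is not known. In general, Python raises this TypeError whenever code indexes into a value that turned out to be None. The sketch below reproduces it in isolation; the `response` dictionary and its `hitCount` key are hypothetical stand-ins, not pygetpapers internals.

```python
def hit_count(response):
    # `response` is a hypothetical parsed reply; a failed request is modelled as None.
    return response["hitCount"]


def safe_hit_count(response):
    # Guard against a missing reply before indexing into it.
    if response is None:
        return 0
    return response["hitCount"]


try:
    hit_count(None)
except TypeError as err:
    message = str(err)
    print(message)
```

A guard like `safe_hit_count` only hides the symptom. The useful information is which call returned None, which the full traceback would show.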

ayush4921 · 2022-02-23T15:46:26Z

Can you check the same command in version 1.1.5?

petermr · 2022-02-24T08:47:12Z

Thanks both, I suggest that 100K is too large a chunk. Maybe 10K.
* it may put strain on the server and get blocked
* when errors occur it may be difficult to locate the documents responsible, as we have here
* make sure you can actually analyze the downloaded material. If you can't process 10K, downloading 100K won't gain anything.

-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
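
One way to act on the smaller-chunk suggestion is to split the year into monthly date windows and run the same command once per window. The sketch below only builds the argument lists, using the flags from the command in the report. The helper names, the per-month output directory naming, and the assumption that a month stays under roughly 10K hits are illustrative and not part of pygetpapers.

```python
import calendar
from datetime import date


def month_ranges(year):
    """Yield (start, end) ISO date strings for each calendar month of `year`."""
    for month in range(1, 13):
        last_day = calendar.monthrange(year, month)[1]
        yield (date(year, month, 1).isoformat(),
               date(year, month, last_day).isoformat())


def build_commands(query, year, limit, out_prefix):
    """Build one pygetpapers argument list per month (hypothetical helper)."""
    commands = []
    for start, end in month_ranges(year):
        out_dir = f"{out_prefix}_{start[:7]}"  # e.g. TF_database_2021-01
        commands.append([
            "pygetpapers", "-q", query, "-x", "-c",
            "-o", out_dir, "-k", str(limit),
            "--startdate", start, "--enddate", end,
        ])
    return commands


commands = build_commands("Transcription factors", 2021, 10000, "TF_database")
for args in commands:
    # Each list could be passed to subprocess.run(args, check=True) to execute it.
    print(args)
```

If one month fails, only that window needs to be rerun, which also narrows down which documents are responsible for the error.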
