Paginated github search response traversal does not work #5

Closed
theroys opened this issue Sep 11, 2014 · 7 comments

@theroys
Contributor

theroys commented Sep 11, 2014

In SearchGitRepositories.py, paginated search result traversal does not work. Only the first page of projects is traversed. Although the code has been committed as is, I am creating this as the first bug to fix.

@theroys theroys added the bug label Sep 11, 2014
@longhn
Contributor

longhn commented Sep 15, 2014

Hi @theroys,

Can you review and double-check the approach below? If it is OK, I can make the query configurable by taking user input, and send a pull request.

resultCount = jsonRespData['total_count']
print('total_count is: ' + str(resultCount))
current_page = 0
# Integer division: index of the last page to request
max_page = resultCount // per_page_result_size
print('max_page is: ' + str(max_page))

while current_page <= max_page:
    # Name each dump file after the cumulative result count reached so far
    fileName = 'github_search_dump_' + str(per_page_result_size * (current_page + 1)) + '.txt'

    print('file name is: ' + fileName)

    current_fetch_url = urljoin(GITHUB_REPO_SEARCH_URL, 'repositories?q=stars:>100&per_page=' + str(per_page_result_size) + '&page=' + str(current_page))
    print('current_fetch_url is: ' + current_fetch_url)

    current_fetch_res = requests.get(current_fetch_url, headers=headers)
    jsonRespData = json.loads(current_fetch_res.text)

    # Dump the data to a file, not to mongo
    filtered_proj_data = get_projects_metadata(jsonRespData)

    with open(fileName, 'w') as outfile:
        json.dump(filtered_proj_data, outfile)

    current_page = current_page + 1
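
For reference, a minimal sketch of the page arithmetic on its own, assuming the `page` query parameter is 1-based (GitHub's REST API documents its default as 1). The helper name `page_numbers` is hypothetical and not part of the project; it rounds up so that a partially filled last page is still requested.

```python
def page_numbers(total_count, per_page):
    """Return the 1-based page numbers needed to cover total_count results."""
    if per_page <= 0:
        raise ValueError("per_page must be positive")
    # Ceiling division, so a partially filled last page is still included.
    last_page = (total_count + per_page - 1) // per_page
    return list(range(1, last_page + 1))


if __name__ == "__main__":
    # 250 results at 100 per page need three requests.
    print(page_numbers(250, 100))
```

Keeping the arithmetic in a small function makes it possible to check the boundary cases (exact multiples, zero results) without calling the API.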

@theroys
Contributor Author

theroys commented Sep 17, 2014

Hi Long, yes, that is the correct approach for this specific fix. This code also calls the classifier, which in turn calls queryopenhub, and the openhub API key has a limit of 1000 calls per day.
With this specific query we will get more than 5000 results, so we have a problem when we run out of the maximum allowed calls.
To fix this issue there are two things to do:
1. Make it possible in the search code to set the maximum number of results we will iterate through.
2. You register yourself on openhub so that we get another 1000 calls per day (we will enhance our openhub query code so that it works with a number of API keys). We will track this as a separate issue.
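
The first point could be sketched roughly as follows. This is only an illustration: `iter_results` and the `fetch_page(page, per_page)` callable are hypothetical names, standing in for whatever function performs the actual search request and returns the list of items on a given 1-based page.

```python
def iter_results(fetch_page, per_page, max_results):
    """Yield at most max_results items, fetching one page at a time.

    fetch_page(page, per_page) is a hypothetical callable that returns the
    list of items on the given 1-based page, or an empty list when done.
    """
    yielded = 0
    page = 1
    while yielded < max_results:
        items = fetch_page(page, per_page)
        if not items:
            break  # no more results available
        for item in items:
            if yielded >= max_results:
                return  # cap reached in the middle of a page
            yield item
            yielded += 1
        page += 1


if __name__ == "__main__":
    data = list(range(250))  # stand-in for search results

    def fake_fetch(page, per_page):
        start = (page - 1) * per_page
        return data[start:start + per_page]

    print(len(list(iter_results(fake_fetch, 100, 120))))
```

Because the cap is checked before each page is requested, no further search calls are made once `max_results` items have been produced, which also bounds the number of downstream classifier calls.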

Thanks for working on this :)
..Anirban

@longhn
Contributor

longhn commented Sep 23, 2014

I sent a pull request (#7) for this issue and opened a new issue (#8) to make the query configurable by taking user input.
Please open a separate issue to enhance the openhub query.
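
One possible shape for the multi-key enhancement mentioned earlier in the thread, as a rough sketch only. The class `KeyPool` and its methods are hypothetical and not part of the project; the default of 1000 calls per key comes from the limit quoted above, and resetting the counters each day is left out.

```python
class KeyPool:
    """Hand out API keys while tracking a per-key call budget (hypothetical helper)."""

    def __init__(self, api_keys, daily_limit=1000):
        self.daily_limit = daily_limit
        # Number of calls already charged to each key.
        self.calls = {key: 0 for key in api_keys}

    def next_key(self):
        """Return a key with budget left and charge one call to it, or None."""
        for key, used in self.calls.items():
            if used < self.daily_limit:
                self.calls[key] = used + 1
                return key
        return None  # every key is exhausted


if __name__ == "__main__":
    pool = KeyPool(["key-a", "key-b"], daily_limit=2)
    print([pool.next_key() for _ in range(5)])
```

A caller would stop querying openhub as soon as `next_key()` returns `None`, rather than continuing to send requests that can only fail.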

@theroys
Contributor Author

theroys commented Sep 24, 2014

Hi @longhn, I get a 403 while fetching your fix from the branch.

@longhn
Contributor

longhn commented Sep 25, 2014

Hi @theroys,
I had some time to run the pull request again. However, I'm not able to reproduce the 403 error. I'm able to run up to github_search_dump_500.txt:

search for jquery-waypoints returned 0results
file name is: github_search_dump_500.txt
current_fetch_url is: https://api.github.com/search/repositories?q=stars:>100&per_page=100&page=4
 classifying SignalR
 best category match as per git desc ->Web Application Framework
search for SignalR returned 56results

The error I'm facing is located in another class. I think it should be a separate issue.

search for git-extras returned 3results
 classifying iosched

Traceback (most recent call last):
  File "/home/longhn/Project/OSSRank/gitdatacollection/SearchGitRepositories.py", line 78, in <module>
  File "/home/longhn/Project/OSSRank/gitdatacollection/SearchGitRepositories.py", line 70, in main
  File "/home/longhn/Project/OSSRank/gitdatacollection/SearchGitRepositories.py", line 26, in get_projects_metadata
  File "/home/longhn/Project/OSSRank/gitdatacollection/ProjectClassifier.py", line 160, in classify_project
    current_desc_words=get_desc_words(project_description)
  File "/home/longhn/Project/OSSRank/gitdatacollection/ProjectClassifier.py", line 57, in get_desc_words
    desc_words=set(wordpunct_tokenize(software_desc.replace('\n', '').lower()))
AttributeError: 'NoneType' object has no attribute 'replace'
>>> 

@theroys
Contributor Author

theroys commented Sep 25, 2014

Hi @longhn, yes, this is an error in ProjectClassifier.py; I will fix it. It occurs when the description is not present and Python returns a NoneType object. I need to change the logic there so that it does not categorize by the GitHub description when none is present.
Thanks for informing :).

@longhn
Contributor

longhn commented Oct 5, 2014

@theroys, @fsiddiqi
Can you take a look at this issue and my resolution in pull request #7? If the pull request is OK, we should merge it into the master branch and close this issue.

In the ProjectClassifier.get_desc_words() function, I added the check below and it seems to work:

def get_desc_words(software_desc, stopwords=[]):
    if (software_desc is None):
        return 'undefined description'
....
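
A possible variant of this guard, as a sketch under assumptions: since the traceback shows the function building a `set` of words, returning an empty set for a missing description keeps the return type consistent for callers. The regular expression below is only a stand-in for nltk's `wordpunct_tokenize` so the example runs without nltk, and the stopword filtering is a guess at what the elided part of the function does.

```python
import re


def get_desc_words(software_desc, stopwords=None):
    """Return the set of lowercase tokens in a description (empty if missing)."""
    if not software_desc:
        # Missing or empty description: nothing to classify on.
        return set()
    stopwords = set(stopwords or [])
    # Stand-in tokenizer (assumption): runs of word characters or punctuation.
    text = software_desc.replace("\n", " ").lower()
    tokens = re.findall(r"\w+|[^\w\s]+", text)
    return set(tokens) - stopwords


if __name__ == "__main__":
    print(get_desc_words(None))
    print(sorted(get_desc_words("A Web\nFramework", ["a"])))
```

Using `stopwords=None` instead of a list default also avoids sharing one mutable default list across calls.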

I'm able to reach the daily access limit of openhub

classifying codebox
 best category match as per git desc ->Application Development -IDE
openhub api returned an error while searching for project codebox<error>This api_key has exceeded its daily access limit.</error>

file name is: github_search_dump_1200.txt
current_fetch_url is: https://api.github.com/search/repositories?q=stars:>100&per_page=100&page=11
Traceback (most recent call last):
  File "/home/longhn/Project/worksapce/SearchGitRepos/src/gitdatacollection/SearchGitRepositories.py", line 78, in <module>
    main()
  File "/home/longhn/Project/worksapce/SearchGitRepos/src/gitdatacollection/SearchGitRepositories.py", line 70, in main
    filtered_proj_data = get_projects_metadata(jsonRespData)
  File "/home/longhn/Project/worksapce/SearchGitRepos/src/gitdatacollection/SearchGitRepositories.py", line 22, in get_projects_metadata
    for item in jsonContent['items']:
KeyError: 'items'
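
The `KeyError: 'items'` above suggests that the response for `page=11` carried no `items` field. A likely cause is that GitHub's search API only exposes the first 1,000 results of a query, so with `per_page=100` a request beyond the tenth page returns an error payload instead of results. A defensive sketch of the lookup follows; the helper name `extract_items` is hypothetical.

```python
def extract_items(json_content):
    """Return the 'items' list of a search response, or [] for an error payload."""
    items = json_content.get("items")
    if items is None:
        # GitHub error payloads describe the problem in a 'message' field.
        reason = json_content.get("message", "unknown error")
        print("search request returned no items: " + str(reason))
        return []
    return items


if __name__ == "__main__":
    print(extract_items({"total_count": 1, "items": [{"name": "codebox"}]}))
    print(extract_items({"message": "Only the first 1000 search results are available"}))
```

The calling loop can then stop paginating when an empty list comes back, instead of crashing in the middle of a run.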

@longhn longhn closed this as completed Oct 8, 2014