Skip to content

AlertaDengue/tweet_capture

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 

Repository files navigation

Tweet Capture

These notebooks are meant to retrieve tweets related to arboviruses (dengue, zika, chikungunya) from Twitter API. The retrieval is done through streaming (in real time) and the retrieved tweets are stored in a MongoDB collection.

Main files

  • twitter_retriever.ipynb: it captures data from Twitter by streaming method and saves it into a MongoDB collection.

  • data_update_and_analysis.ipynb: updates geolocation data, which are not well organized on source data. It also compares databases if data is collected from different machines, as data retrieval from twitter might be asyncronous.

  • twitter_geolocation_eda.ipynb: It evaluates the geolocation variables from twitter data, for instance comparing the proportion of each variable from total tweets.

About MongoDB

MongoDB is a NoSQL database program which uses JSON-like document (also the data format from twitter API). You might navigate a database by using the browser MongoDB Compass or by using Python library pymongo.

Installation

Tutorial

Dumping and restoring MongoDB Collection:

mongodump --db DataBaseName --collection collection -o "path/to/folder"

Copy the generated dump/DataBaseName folder to the new machine. Then, import using mongorestore:

mongorestore --db DataBaseName /path/to/DataBaseName

  • the dump is the bson unzipped file

Note that /path/to/DataBaseName should be a directory filled with .json and .bson representations of your data.

You can also compress the file on the fly:

mongodump --db somedb --collection somecollection --out "path/to/folder" --gzip

About redundancy

It was suggested by Fiocruz to use some redundancy to capture data. One option might be AWS, which offers services such as Lambda (FaaS) and EC2 (IaaS).

About

Captura de tweets para o projeto Infodengue

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published