Merged Geotweets

One of the objectives of this work is to store the merged geotweets from the Harvard and University of Salzburg tweet collections on the High Performance Cluster. This will not only make the data available for other data science projects but also enable high-performance computing on it.

Please follow the instructions below to upload merged geotweets to the Harvard HPC:

  • Log in to a FASRC login node.
  • Go to CGA's storage space:
cd /n/cga
  • Go to the merged geotweets storage folder:
cd /n/cga/geotweets_merged
  • Store the data here using whatever structure you see fit.

  • Restore the database using:

CREATE EXTENSION postgis;
CREATE USER postgres SUPERUSER;
CREATE DATABASE socialmedia
     WITH
     OWNER = postgres;

ALTER DATABASE socialmedia
     SET search_path TO "$user", public, sde;

CREATE SCHEMA archive
     AUTHORIZATION postgres;
CREATE DATABASE crawler;
CREATE SCHEMA crawler;
CREATE ROLE passau;
CREATE TABLESPACE crawler_tablespace LOCATION '/var/lib/pgsql/12/data';
pg_restore --host localhost --port $port --username "postgres" --no-acl --dbname "postgres" --verbose --schema "archive" "/n/cga/geotweets_merged/archive_backup_master"
pg_restore --host localhost --port $port -U postgres -d postgres --section=pre-data --section=data -1 /n/cga/geotweets_merged/archive_backup_2019_12

Query

  • Execute the following for both the postgres and socialmedia databases:
SELECT * FROM pg_catalog.pg_tables WHERE schemaname != 'pg_catalog' AND schemaname != 'information_schema';
SELECT * FROM archive.twitter WHERE ST_intersects(geom, ST_GeomFromText('POLYGON((5.96 49.57, 43.89 49.57, 43.89 29.25, 5.96 29.25, 5.96 49.57))', 4326)) and ST_GeometryType(geom) = 'ST_Point' and date between '2015-01-01'::timestamp and '2016-12-31'::timestamp and text ilike '%war%';

Processing Merged Geotweets

  • Restore the monthly dump to Postgres.
  • Connect to PostGIS using:
psql -h localhost -p $port postgres
  • Export to CSV. This converts the geometry to WKT format, which is very slow to do in PostGIS:
copy (Select message_id,date,text,tags,tweet_lang,source,place, ST_AsText(geom) as geom,retweets,tweet_favorites,photo_url,quoted_status_id,user_id,user_name,user_location,followers,friends,user_favorites,status,user_lang,latitude,longitude,data_source from archive.twitter_2020_05) TO '/n/holyscratch01/cga/dkakkar/data/twitter_2020_05.csv' DELIMITER '|' CSV HEADER;

Running the script

  • SSH to the PostGIS compute node.
  • Load Conda:
module load Anaconda3/5.0.1-fasrc02
  • Create the environment, activate it, and install the libraries:
conda create -n geotweets python=3.6
source activate geotweets
pip install pandas
pip install geopandas
pip install numpy
pip install shapely
pip install pymapd
  • Change the input file name to your CSV and run the script (a sketch of the processing step appears below):
python3 /n/holyscratch01/cga/dkakkar/scripts/geotweets.py
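
The contents of geotweets.py are not shown on this page. Below is a minimal sketch of the processing step, assuming the goal is to split the pipe-delimited monthly export into hourly CSVs; the paths, column names, and file naming are illustrative, not the actual script:

# Hypothetical sketch of the processing step: split a monthly export into hourly CSVs.
# INPUT_CSV and OUTPUT_DIR are illustrative paths; adjust to your own files.
import os
import pandas as pd

INPUT_CSV = "/n/holyscratch01/cga/dkakkar/data/twitter_2020_05.csv"
OUTPUT_DIR = "/n/holyscratch01/cga/dkakkar/data/hourly"

os.makedirs(OUTPUT_DIR, exist_ok=True)

# Read the pipe-delimited monthly export produced by the COPY command above
df = pd.read_csv(INPUT_CSV, sep="|", parse_dates=["date"])

# Group rows by hour and write one CSV per hour
for hour, chunk in df.groupby(df["date"].dt.floor("H")):
    out_path = os.path.join(OUTPUT_DIR, f"twitter_{hour:%Y_%m_%d_%H}.csv")
    chunk.to_csv(out_path, sep="|", index=False)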

Loading hourly CSVs to Omnisci

  • Start a screen session:
screen
  • SSH to the Omnisci compute node.
  • Load Conda and activate the environment:
module load Anaconda3/5.0.1-fasrc02
source activate geotweets
  • Edit the script to include your file path and Omnisci port, just like the GDELT script (a sketch of the loading step appears below).
  • Run the script:
python3 /n/holyscratch01/cga/dkakkar/scripts/geotweets_omnisci_geom.py
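
The loading script geotweets_omnisci_geom.py is not reproduced here. Below is a minimal sketch of the loading step using pymapd from the environment above; the host, port, credentials, and table name are assumptions, not values taken from the script:

# Hypothetical sketch of loading an hourly CSV into Omnisci with pymapd.
# Connection details and the table name are illustrative; adjust to your setup.
import pandas as pd
from pymapd import connect

HOURLY_CSV = "/n/holyscratch01/cga/dkakkar/data/hourly/twitter_2020_05_01_00.csv"

con = connect(user="admin", password="HyperInteractive",
              host="localhost", port=6274, dbname="omnisci")

# Read one hourly chunk and append it to the target table
df = pd.read_csv(HOURLY_CSV, sep="|")
con.load_table("geotweets", df)
con.close()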

Filter geotweets by country

  • First, upload the geotweets tables
  • Run the country script after changing the country name and the table name (a sketch of a possible filter appears below):
module load Anaconda3/5.0.1-fasrc02
source activate gdelt
python3 /n/holyscratch01/cga/dkakkar/scripts/geotweets_omnisci_countries.py
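
The country script geotweets_omnisci_countries.py is not shown here. Below is a minimal sketch of what a per-country filter could look like, assuming the Omnisci table has a place column that can be matched against a country name; the connection details, table names, and the ILIKE match are all assumptions:

# Hypothetical sketch of filtering geotweets by country in Omnisci.
# COUNTRY, SOURCE_TABLE, and the connection details are illustrative.
from pymapd import connect

COUNTRY = "Germany"          # change to your country of interest
SOURCE_TABLE = "geotweets"   # change to your table name
TARGET_TABLE = "geotweets_" + COUNTRY.lower()

con = connect(user="admin", password="HyperInteractive",
              host="localhost", port=6274, dbname="omnisci")

# Create a per-country table from the merged table
con.execute(
    "CREATE TABLE " + TARGET_TABLE + " AS "
    "SELECT * FROM " + SOURCE_TABLE + " WHERE place ILIKE '%" + COUNTRY + "%'"
)
con.close()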
  • Installing AWS CLI:
[francescounpriv@boslogin03 francescounpriv]$ curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
[francescounpriv@boslogin03 francescounpriv]$ unzip awscliv2.zip > log-unzip
[francescounpriv@boslogin03 francescounpriv]$ mkdir -p ~/PROGRAMS/{bin,aws}
[francescounpriv@boslogin03 francescounpriv]$ ./aws/install --install-dir $HOME/PROGRAMS/aws --bin-dir $HOME/PROGRAMS/bin
You can now run: /n/home14/francescounpriv/PROGRAMS/bin/aws --version
[francescounpriv@boslogin03 francescounpriv]$ aws --version
aws-cli/2.0.10 Python/3.7.3 Linux/3.10.0-957.12.1.el7.x86_64 botocore/2.0.0dev14
  • The database name is "geotweet", and the credentials for accessing it are the username "geotweet" and the password stored in Devika's home folder in the file ~/.pgsql_passwd.

  • You should be able to connect to this database from all compute nodes, from the login nodes, and from your laptop while connected to the VPN.

  • Here is an example:

[dkakkar@boslogin02 ~]$ module load postgresql/12.2-fasrc01
[dkakkar@boslogin02 ~]$ export PGPASSWORD=`cat .pgsql_passwd`
[dkakkar@boslogin02 ~]$ psql --host=rcdb-research.rc.fas.harvard.edu --dbname=geotweet --username=geotweet
psql (12.2, server 9.5.15)
Type "help" for help.
geotweet=> \q
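
The same credentials also work from Python, which can be handy inside batch jobs. A minimal sketch, assuming psycopg2 is available (it is not part of the Conda environment created above):

# Hypothetical example of connecting to the geotweet database from Python.
# Assumes psycopg2 is installed; reads the password from ~/.pgsql_passwd as above.
import os
import psycopg2

with open(os.path.expanduser("~/.pgsql_passwd")) as f:
    password = f.read().strip()

conn = psycopg2.connect(host="rcdb-research.rc.fas.harvard.edu",
                        dbname="geotweet", user="geotweet",
                        password=password)
with conn.cursor() as cur:
    cur.execute("SELECT version()")
    print(cur.fetchone()[0])
conn.close()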

PostGIS VM

The VM is cga-geotweets.rc.fas.harvard.edu (address 10.242.121.78).

You should be able to log in via SSH from the FASRC VPN Realm.

The network is set to allow connections on port 5432 from:

  • FASRC VPN Realm
  • Compute nodes in the cluster
  • Cluster login nodes

Docker is installed. You may prefer to use Docker to run your service rather than installing packages. I pulled the images for postgres and postgis (mainly to test that Docker was working and that the connection to Docker Hub was working). In any case, if you prefer to use native RPM packages rather than Docker, please do so.

You (actually the members of cga_admins) should have login access and sudo access:
[root@cga-geotweets ~]# getent group srv-cga-geotweets-root
srv-cga-geotweets-root::10514:dkakkar,thu,blewis,wguan
[root@cga-geotweets ~]# getent group srv-cga-geotweets-pam
srv-cga-geotweets-pam::10515:dkakkar,thu,blewis,wguan

Local storage is:
[root@cga-geotweets ~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        3.8G     0  3.8G   0% /dev
tmpfs           3.9G   12K  3.9G   1% /dev/shm
tmpfs           3.9G   17M  3.9G   1% /run
tmpfs           3.9G     0  3.9G   0% /sys/fs/cgroup
/dev/vda1        30G  3.5G   27G  12% /
/dev/vdb        100G   33M  100G   1% /data

/data is an extra volume that you can use as the storage location for your DB. If the data is small and you want to keep it on the root volume, feel free to do so.

Please remember that Puppet is running and keeps enforcing basic settings (authentication, firewall, and other basic things). We recommend that you leave it running, but if you don't want that, just issue the following commands:

$> puppet agent --disable
$> systemctl stop puppet
$> systemctl disable puppet