Merged Geotweets

One of the objectives of this work is to store the merged geotweets from the Harvard and University of Salzburg tweet collections on the High Performance Cluster. This will not only make the data available for other data science projects but also enable high-performance computing on it.

Please follow the instructions below to upload merged geotweets to the Harvard HPC:

  • Log in to a FASRC login node.
  • Go to CGA's storage space:
cd /n/cga
  • Go to the merged geotweets storage folder:
cd /n/cga/geotweets_merged
  • Store the data here using whatever structure you see fit.

  • Restore the database using:

CREATE EXTENSION postgis;
CREATE USER postgres SUPERUSER;
CREATE DATABASE socialmedia
     WITH
     OWNER = postgres;

ALTER DATABASE socialmedia
     SET search_path TO "$user", public, sde;

CREATE SCHEMA archive
     AUTHORIZATION postgres;
CREATE DATABASE crawler;
CREATE SCHEMA crawler;
CREATE ROLE passau;
CREATE TABLESPACE crawler_tablespace LOCATION '/var/lib/pgsql/12/data';
pg_restore --host localhost --port $port --username "postgres" --no-acl --dbname "postgres" --verbose --schema "archive" "/n/cga/geotweets_merged/archive_backup_master"
pg_restore --host localhost --port $port -U postgres -d postgres --section=pre-data --section=data -1 /n/cga/geotweets_merged/archive_backup_2019_12

Query

  • Execute the following for both the postgres and socialmedia databases:
SELECT * FROM pg_catalog.pg_tables WHERE schemaname != 'pg_catalog' AND schemaname != 'information_schema';
SELECT * FROM archive.twitter WHERE ST_intersects(geom, ST_GeomFromText('POLYGON((5.96 49.57, 43.89 49.57, 43.89 29.25, 5.96 29.25, 5.96 49.57))', 4326)) and ST_GeometryType(geom) = 'ST_Point' and date between '2015-01-01'::timestamp and '2016-12-31'::timestamp and text ilike '%war%';

Processing Merged Geotweets

  • Restore the monthly dump to Postgres.
  • Connect to PostGIS using:
psql -h localhost -p $port postgres
  • Export to CSV. This converts the geometry to WKT format, which is very slow to do in PostGIS:
copy (Select message_id,date,text,tags,tweet_lang,source,place, ST_AsText(geom) as geom,retweets,tweet_favorites,photo_url,quoted_status_id,user_id,user_name,user_location,followers,friends,user_favorites,status,user_lang,latitude,longitude,data_source from archive.twitter_2020_05) TO '/n/holyscratch01/cga/dkakkar/data/twitter_2020_05.csv' DELIMITER '|' CSV HEADER;

Running the script

  • SSH to the PostGIS compute node.
  • Load Conda:
module load Anaconda3/5.0.1-fasrc02
  • Create the environment, activate it, and install the libraries:
conda create -n geotweets python=3.6
source activate geotweets
pip install pandas
pip install geopandas
pip install numpy
pip install shapely
pip install pymapd
  • Change the input file name to your CSV and run the script (a sketch of the processing step appears below):
python3 /n/holyscratch01/cga/dkakkar/scripts/geotweets.py
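
The contents of geotweets.py are not shown on this page. Below is a minimal sketch of the processing step, assuming the goal is to split the pipe-delimited monthly export into hourly CSVs; the paths, column names, and file naming are illustrative, not the actual script:

# Hypothetical sketch of the processing step: split a monthly export into hourly CSVs.
# INPUT_CSV and OUTPUT_DIR are illustrative paths; adjust to your own files.
import os
import pandas as pd

INPUT_CSV = "/n/holyscratch01/cga/dkakkar/data/twitter_2020_05.csv"
OUTPUT_DIR = "/n/holyscratch01/cga/dkakkar/data/hourly"

os.makedirs(OUTPUT_DIR, exist_ok=True)

# Read the pipe-delimited monthly export produced by the COPY command above
df = pd.read_csv(INPUT_CSV, sep="|", parse_dates=["date"])

# Group rows by hour and write one CSV per hour
for hour, chunk in df.groupby(df["date"].dt.floor("H")):
    out_path = os.path.join(OUTPUT_DIR, f"twitter_{hour:%Y_%m_%d_%H}.csv")
    chunk.to_csv(out_path, sep="|", index=False)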

Loading hourly CSVs to Omnisci

  • Start a screen session:
screen
  • SSH to the Omnisci compute node.
  • Load Conda and activate the environment:
module load Anaconda3/5.0.1-fasrc02
source activate geotweets
  • Edit the script to include your file path and Omnisci port, just like the GDELT script (a sketch of the loading step appears below).
  • Run the script:
python3 /n/holyscratch01/cga/dkakkar/scripts/geotweets_omnisci_geom.py
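
The loading script geotweets_omnisci_geom.py is not reproduced here. Below is a minimal sketch of the loading step using pymapd from the environment above; the host, port, credentials, and table name are assumptions, not values taken from the script:

# Hypothetical sketch of loading an hourly CSV into Omnisci with pymapd.
# Connection details and the table name are illustrative; adjust to your setup.
import pandas as pd
from pymapd import connect

HOURLY_CSV = "/n/holyscratch01/cga/dkakkar/data/hourly/twitter_2020_05_01_00.csv"

con = connect(user="admin", password="HyperInteractive",
              host="localhost", port=6274, dbname="omnisci")

# Read one hourly chunk and append it to the target table
df = pd.read_csv(HOURLY_CSV, sep="|")
con.load_table("geotweets", df)
con.close()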

Filter geotweets by country

  • First, upload the geotweets tables
  • Run the country script after changing the country name and the table name (a sketch of a possible filter appears below):
module load Anaconda3/5.0.1-fasrc02
source activate gdelt
python3 /n/holyscratch01/cga/dkakkar/scripts/geotweets_omnisci_countries.py
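
The country script geotweets_omnisci_countries.py is not shown here. Below is a minimal sketch of what a per-country filter could look like, assuming the Omnisci table has a place column that can be matched against a country name; the connection details, table names, and the ILIKE match are all assumptions:

# Hypothetical sketch of filtering geotweets by country in Omnisci.
# COUNTRY, SOURCE_TABLE, and the connection details are illustrative.
from pymapd import connect

COUNTRY = "Germany"          # change to your country of interest
SOURCE_TABLE = "geotweets"   # change to your table name
TARGET_TABLE = "geotweets_" + COUNTRY.lower()

con = connect(user="admin", password="HyperInteractive",
              host="localhost", port=6274, dbname="omnisci")

# Create a per-country table from the merged table
con.execute(
    "CREATE TABLE " + TARGET_TABLE + " AS "
    "SELECT * FROM " + SOURCE_TABLE + " WHERE place ILIKE '%" + COUNTRY + "%'"
)
con.close()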
  • Installing AWS CLI:
[francescounpriv@boslogin03 francescounpriv]$ curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
[francescounpriv@boslogin03 francescounpriv]$ unzip awscliv2.zip > log-unzip
[francescounpriv@boslogin03 francescounpriv]$ mkdir -p ~/PROGRAMS/{bin,aws}
[francescounpriv@boslogin03 francescounpriv]$ ./aws/install --install-dir $HOME/PROGRAMS/aws --bin-dir $HOME/PROGRAMS/bin
You can now run: /n/home14/francescounpriv/PROGRAMS/bin/aws --version
[francescounpriv@boslogin03 francescounpriv]$ aws --version
aws-cli/2.0.10 Python/3.7.3 Linux/3.10.0-957.12.1.el7.x86_64 botocore/2.0.0dev14
  • The database name is "geotweet", and the credentials for accessing it are the username "geotweet" and the password stored in Devika's home folder in the file ~/.pgsql_passwd.

  • You should be able to connect to this database from all compute nodes, from the login nodes, and from your laptop while connected to the VPN.

  • Here is an example:

[dkakkar@boslogin02 ~]$ module load postgresql/12.2-fasrc01
[dkakkar@boslogin02 ~]$ export PGPASSWORD=`cat .pgsql_passwd`
[dkakkar@boslogin02 ~]$ psql --host=rcdb-research.rc.fas.harvard.edu --dbname=geotweet --username=geotweet
psql (12.2, server 9.5.15)
Type "help" for help.
geotweet=> \q
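
The same credentials also work from Python, which can be handy inside batch jobs. A minimal sketch, assuming psycopg2 is available (it is not part of the Conda environment created above):

# Hypothetical example of connecting to the geotweet database from Python.
# Assumes psycopg2 is installed; reads the password from ~/.pgsql_passwd as above.
import os
import psycopg2

with open(os.path.expanduser("~/.pgsql_passwd")) as f:
    password = f.read().strip()

conn = psycopg2.connect(host="rcdb-research.rc.fas.harvard.edu",
                        dbname="geotweet", user="geotweet",
                        password=password)
with conn.cursor() as cur:
    cur.execute("SELECT version()")
    print(cur.fetchone()[0])
conn.close()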

PostGIS VM

The VM is cga-geotweets.rc.fas.harvard.edu (address 10.242.121.78).

You should be able to log in via SSH from the FASRC VPN Realm.

The network is set to allow connections on port 5432 from:

  • FASRC VPN Realm
  • Compute nodes in the cluster
  • Cluster login nodes

Docker is installed. You may prefer to use Docker to run your service rather than installing packages. I pulled the images for postgres and postgis (mainly to test that Docker was working and that the connection to Docker Hub was working). In any case, if you prefer to use native RPM packages rather than Docker, please do so.

You (actually the members of cga_admins) should have login access and sudo access:
[root@cga-geotweets ~]# getent group srv-cga-geotweets-root
srv-cga-geotweets-root::10514:dkakkar,thu,blewis,wguan
[root@cga-geotweets ~]# getent group srv-cga-geotweets-pam
srv-cga-geotweets-pam::10515:dkakkar,thu,blewis,wguan

Local storage is:
[root@cga-geotweets ~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        3.8G     0  3.8G   0% /dev
tmpfs           3.9G   12K  3.9G   1% /dev/shm
tmpfs           3.9G   17M  3.9G   1% /run
tmpfs           3.9G     0  3.9G   0% /sys/fs/cgroup
/dev/vda1        30G  3.5G   27G  12% /
/dev/vdb        100G   33M  100G   1% /data

/data is an extra volume that you can use as the storage location for your DB. If the data is small and you want to keep it on the root volume, feel free to do so.

Please remember that Puppet is running and keeps enforcing basic settings (authentication, firewall, and other basic things). We recommend that you leave it running, but if you don't want that, just issue the following commands:

$> puppet agent --disable
$> systemctl stop puppet
$> systemctl disable puppet