This project aims to unite similar Twitter communities by identifying their shared interests, using unsupervised learning techniques on graph and tabular data.
This project was carried out as part of my Unsupervised Learning and Social Network Analysis (UL & SNA) course, under the guidance of Professor M. Lazaar at ENSIAS, Mohammed V University.
In choosing a project for this course, I opted to concentrate on clustering communities within Twitter. Coming from a traditional machine learning background involving tabular data, I was particularly intrigued by the challenge of handling graph data and constructing machine learning models that could uncover patterns without human guidance. While Facebook and Google+ were available data sources, Twitter stood out due to its simplicity and engaging nature.
The project consists of sample code demonstrating the following steps:
- Identification of Twitter communities in the Stanford Network Analysis Project (SNAP) Twitter graph data using two distinct approaches: edge-based and feature-based.
- Preprocessing and visualization of the data: building the graph and computing its adjacency matrix with networkx, scipy, and matplotlib.
- Edge-based approach:
  - Training of a Spectral Clustering model on the adjacency matrix, followed by its evaluation with the Silhouette score via Scikit-Learn.
- Feature-based approach:
  - Construction of a tabular dataset from the graph data, enriched with key graph centrality metrics, including degree, closeness, and betweenness centrality.
  - Training of several clustering algorithms (KMeans, SpectralClustering, and AgglomerativeClustering), followed by their evaluation with Silhouette scores.
- Assignment of labels to the clusters produced by the best-performing approach, based on the hashtags most commonly used by cluster members. These hashtags summarize key themes, such as 'Social Media Cluster', 'Gaming Cluster', and 'Music Cluster', reflecting the prevalent interests within each cluster.
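
The edge-based and feature-based steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code. It assumes Zachary's karate club graph (bundled with networkx) as a small stand-in for the SNAP Twitter data, two clusters, a dense adjacency matrix built with `nx.to_numpy_array`, and a Silhouette score computed on adjacency rows for the edge-based model. The feature table here holds only the three centrality metrics, standardized before clustering.

```python
import networkx as nx
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans, SpectralClustering
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Stand-in graph (assumption): the real project loads the SNAP Twitter edges instead.
G = nx.karate_club_graph()
nodes = list(G.nodes())

# Preprocessing: unweighted adjacency matrix, rows and columns ordered by `nodes`.
A = nx.to_numpy_array(G, nodelist=nodes, weight=None)

# Edge-based approach: spectral clustering directly on the adjacency matrix.
spectral = SpectralClustering(n_clusters=2, affinity="precomputed", random_state=0)
edge_labels = spectral.fit_predict(A)
# Assumption: adjacency rows are treated as feature vectors for the Silhouette score.
edge_score = silhouette_score(A, edge_labels)

# Feature-based approach: one row per node, one column per centrality metric.
degree = nx.degree_centrality(G)
closeness = nx.closeness_centrality(G)
betweenness = nx.betweenness_centrality(G)
X = np.array([[degree[n], closeness[n], betweenness[n]] for n in nodes])
X = StandardScaler().fit_transform(X)

models = {
    "KMeans": KMeans(n_clusters=2, n_init=10, random_state=0),
    "SpectralClustering": SpectralClustering(n_clusters=2, random_state=0),
    "AgglomerativeClustering": AgglomerativeClustering(n_clusters=2),
}
feature_scores = {
    name: silhouette_score(X, model.fit_predict(X)) for name, model in models.items()
}

print(f"edge-based SpectralClustering: {edge_score:.3f}")
for name, score in feature_scores.items():
    print(f"feature-based {name}: {score:.3f}")
```

Silhouette scores range from -1 to 1, with higher values indicating better-separated clusters, so the printed values give a simple way to compare the two approaches on the same graph.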
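
The cluster-labeling step could look something like the sketch below, which counts hashtags per cluster and keeps the most frequent ones. The user IDs, hashtags, and the `top_hashtags` helper are all hypothetical and exist only for illustration. The real project derives this information from the SNAP data.

```python
from collections import Counter

# Hypothetical inputs: a cluster label per user and the hashtags each user used.
cluster_of = {"u1": 0, "u2": 0, "u3": 1, "u4": 1}
hashtags_of = {
    "u1": ["#gaming", "#esports"],
    "u2": ["#gaming", "#twitch"],
    "u3": ["#music", "#nowplaying"],
    "u4": ["#music", "#concert"],
}


def top_hashtags(cluster_of, hashtags_of, k=3):
    """Return the k most common hashtags for each cluster (hypothetical helper)."""
    counts = {}
    for user, cluster in cluster_of.items():
        counts.setdefault(cluster, Counter()).update(hashtags_of.get(user, []))
    return {c: [tag for tag, _ in cnt.most_common(k)] for c, cnt in counts.items()}


top = top_hashtags(cluster_of, hashtags_of, k=1)
print(top)
```

A human-readable name such as 'Gaming Cluster' can then be chosen by inspecting the top hashtags of each cluster.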