
Finding Genre specific Actors and Clustering Film Genres (data from IMDB)

Mahmoud Ibrahim edited this page Jan 21, 2020 · 1 revision

genesorteR is an R package that ranks genes in single-cell clusters (from scRNA-Seq data). It relies on the detection rate of genes in each cluster. You can read more in genesorteR's pre-print on bioRxiv. While writing the package, I realized that it is also applicable to any sparse matrix whose columns are clustered (i.e. labeled in some coherent way).

I found this IMDB.com data set, which you can obtain from the SuiteSparse Matrix Collection at Texas A&M University (https://sparse.tamu.edu/Pajek/IMDB). The data set describes which actors appear in which films. Films are labeled by genre (comedy, romance, horror, etc.). As far as genesorteR is concerned, actors are like genes, films are like single cells, and film genres are like cell types or cell clusters. So if we run genesorteR on this data, we should be able to rank actors by their specificity to genres and find actors who were somewhat prolific but acted largely in a single genre (and were perhaps somewhat typecast).
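To make the analogy concrete, here is a minimal sketch of the detection-rate idea on a toy actor-by-film matrix. The data, the object names (`toy`, `genres`, `detection_rate`) and the calculation are illustrative assumptions. This is not genesorteR's actual scoring, only the kind of per-genre frequency that such a score can build on.

```r
library(Matrix)

# Hypothetical toy data: 3 actors (rows) x 6 films (columns); 1 = actor appears in film
toy <- sparseMatrix(
  i = c(1, 1, 1, 2, 2, 3),
  j = c(1, 2, 3, 4, 5, 6),
  x = 1,
  dims = c(3, 6)
)
rownames(toy) <- c("actorA", "actorB", "actorC")

# One genre label per film (column), playing the role of cell cluster labels
genres <- c("comedy", "comedy", "comedy", "horror", "horror", "horror")

# Detection rate: the fraction of films in each genre that feature each actor
detection_rate <- sapply(split(seq_along(genres), genres), function(cols) {
  Matrix::rowMeans(toy[, cols, drop = FALSE])
})
rownames(detection_rate) <- rownames(toy)

print(detection_rate)
```

In this toy example, actorA appears in every comedy and in no horror film, so a specificity ranking would be expected to place actorA at the top for comedy.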

Let's start. This will be a short tutorial.

What do we need?

Find genre-specific actors

I have repackaged the IMDB data in an easily digestible form, so we can load it directly from within R:

#get the matrix. This may take a couple of seconds while the file is downloaded. Films are in columns, actors in rows.
library(Matrix)
imdb = readMM("https://dl.dropboxusercontent.com/s/4whgx113zlxo2np/imdb.mtx")

#movie genre ids
colnames(imdb) = read.table("https://dl.dropboxusercontent.com/s/fftpzrgpw470ceg/coldat.txt")[[1]]

#actor names
rownames(imdb) = read.table("https://dl.dropboxusercontent.com/s/mwykez0pwwxwgwp/rowdat.txt", sep = "\t", quote = "")[[1]]

#convert imdb to a dgCMatrix. By default, since it's binary, it is read as an ngTMatrix (a pattern matrix type). But genesorteR understands only dgCMatrix objects.
imdb = imdb * 1

#now let's run genesorteR to get specificity scores for actors in movie genres
library(genesorteR)
sg = sortGenes(imdb, colnames(imdb), binarizeMethod="naive", cores = 16)

#if you run this command, you can see the top 2 actors for each genre. Feel free to google them :)
plotTopMarkerHeat(sg, top_n = 2, plotheat = FALSE, outs = TRUE)
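If the matrix conversion step in the code above does not give the expected class on your installation, the coercion can also be spelled out explicitly. The sketch below uses a small hypothetical pattern matrix rather than the IMDB data, and the object names (`pattern`, `numeric_csc`) are illustrative. It relies only on the Matrix package's virtual classes `dMatrix` and `CsparseMatrix`.

```r
library(Matrix)

# Hypothetical stand-in for a binary matrix read from a Matrix Market file:
# a pattern matrix in triplet form, which stores positions but no numeric values
pattern <- as(
  sparseMatrix(i = c(1, 2, 3), j = c(1, 1, 2), dims = c(3, 2)),
  "TsparseMatrix"
)
print(class(pattern))

# Step 1: give the entries numeric values (pattern -> double)
# Step 2: switch to compressed column storage
numeric_csc <- as(as(pattern, "dMatrix"), "CsparseMatrix")
print(class(numeric_csc))

# Check the class before handing the matrix to downstream code
stopifnot(is(numeric_csc, "dgCMatrix"))
```

Running the same `is(imdb, "dgCMatrix")` check on the real matrix before calling `sortGenes` is a cheap way to catch a class mismatch early.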

Correlate and cluster film genres

#now we can correlate film genres based on actor specificity scores, and this gives us a reasonable clustering of film genres.
plotCorrelationHeat(sg, markers = unlist(plotTopMarkerHeat(sg, outs = TRUE, plotheat = FALSE, top_n = 20)), corMethod = "spearman", outs = TRUE)

#note that the top_n = 20 parameter means the correlation will be based on the top 20 actors for each genre. Feel free to change this number and see how the correlation matrix changes.

And voilà: we can cluster genres, and the clustering makes sense semantically, all based on actor specificity scores as defined by genesorteR. Even though this matrix has nearly 900k actors and 220k films, the entire code took less than 1 minute to run!

IMDB Film Genre Correlation
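The general idea behind the genre correlation step can be illustrated with base R on made-up numbers. In the sketch below, the score matrix is hypothetical, and the choice of `hclust` with average linkage on `1 - correlation` is an assumption made for illustration. It is not necessarily what `plotCorrelationHeat` does internally.

```r
# Hypothetical specificity scores: 6 actors (rows) x 4 genres (columns)
scores <- matrix(
  c(0.9, 0.8, 0.1, 0.0,
    0.7, 0.9, 0.2, 0.1,
    0.8, 0.6, 0.0, 0.2,
    0.1, 0.0, 0.9, 0.7,
    0.0, 0.2, 0.8, 0.9,
    0.2, 0.1, 0.7, 0.8),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("actor", 1:6),
                  c("comedy", "romance", "horror", "thriller"))
)

# Spearman correlation between genres, computed across actors
genre_cor <- cor(scores, method = "spearman")

# Turn correlation into a distance and cluster the genres hierarchically
genre_tree <- hclust(as.dist(1 - genre_cor), method = "average")

# Cut the tree into two groups of genres
groups <- cutree(genre_tree, k = 2)
print(round(genre_cor, 2))
print(groups)
```

With these toy scores, genres that share the same high-scoring actors correlate positively and fall into the same group, which mirrors the behavior described for the IMDB data.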

Questions or Corrections or Comments?