This repository contains the complete scripts of the first six Star Wars movies, as well as a light version of each containing only the dialogue and the locations where it takes place.
You will also find images of every character who has an interaction in the films.
The `data` folder contains a CSV file for each film with information on the dialogue, such as the number of words per interaction, the types of words, the duration of the interaction, who speaks to whom, and the location.
If you like the work, do not hesitate to visit my site.
This started as a personal project, but given the work accomplished, I find it important to share it with the community.
The idea for the Star Wars project came to me after a conference on data visualization at the KIKK festival in Namur, Belgium. The speaker (Nadieh Bremer) made me want to create a data visualization, and what could be better than the theme of Star Wars? My data visualization will be published later.
First, I retrieved the scripts of the first six films. I laid them out in markdown files in order to keep only the dialogue, the speakers and the locations. Alongside each script file, I added one file per film listing all of its characters.
Next, I converted the markdown files to HTML so that I could automatically extract the data I wanted with JavaScript. At the same time, I wrote a small script that counts the words.
The second part of my work consisted of watching all the films and checking that the scripts were correct.
Once this was done, I encoded each film in a Numbers file with several fields, including the speaker, the interlocutor, the content, the duration, the place, the number of words and the type of words. Thanks to the subtitle files, I was able to recover the duration of each exchange and check all the data a second time.
A brief overview of the project's progress:
- Recover script files (3rd January 2020)
- Transcription and cleaning in markdown (8th January 2020)
- Adding data in the sheet (8th January 2020)
- Adding listeners
- Adding durations
- Adding sorts of words
- Creating CSV files for each movie
- Creating JSON files for each movie
In this repo, you can find several files about the Star Wars universe.
Folder | Description |
---|---|
📂 Sources | All source files that were used to collect the data |
📂 Markdown | Markdown files with dialogue, speakers and locations |
📂 JSON for sheet | JSON files used to populate the Sheet file |
📂 Data sheet | Sheet file which gathers all the information for each film |
📂 Data CSV | CSV file ready to use for each film |
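
As a starting point for working with the data, here is a minimal sketch that totals the words spoken by each character. It assumes records shaped like the objects built by the JSON code in this README (`from`, `to`, `text`, `where`, `number`, `time`); the files in the repo may use different field names. The `wordsBySpeaker` helper and the sample lines are invented for illustration.

```javascript
// Hypothetical sample records; real data would come from one of the JSON files.
const sample = [
  { from: "LUKE", to: "BEN", text: "I want to learn.", where: "TATOOINE", number: 4, time: 36 },
  { from: "BEN", to: "LUKE", text: "Then come with me.", where: "TATOOINE", number: 4, time: 36 },
  { from: "LUKE", to: "BEN", text: "Yes.", where: "TATOOINE", number: 1, time: 9 }
];

// Sum the `number` field (word count) per speaker.
function wordsBySpeaker(records) {
  const totals = {};
  for (const record of records) {
    totals[record.from] = (totals[record.from] || 0) + record.number;
  }
  return totals;
}

console.log(wordsBySpeaker(sample));
```

The same loop works for any other grouping, for example by `where` instead of `from`.
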
The word-counting function used for the project:
function countWords(s) {
  return s
    // Drop parenthetical stage directions, one pair of brackets at a time
    .replace(/\([^)]*\)/g, " ")
    // Treat periods and ellipses as separators
    .replace(/\./g, " ")
    // Split on any run of whitespace, including newlines
    .split(/\s+/)
    .filter(function (str) { return str !== ""; })
    .length
}
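The code used to generate the JSON:
// Each <ul> holds a group of lines; a <p> placed just before it gives the location
const uls = document.querySelectorAll('ul')
var global = []
var wordsGlobal = []
let where = null
// Time allotted per word, rounded (83672 / 9595 is about 8.72, so 9)
let timePerWords = Math.round(83672 / 9595);
uls.forEach(ul => {
  let previousEl = ul.previousElementSibling
  let lis = [...ul.getElementsByTagName('li')]
  // Update the location only when the list directly follows a paragraph
  if (previousEl !== null && previousEl.nodeName === "P") {
    where = previousEl.innerText
  }
  lis.forEach(li => {
    // Each item reads "speaker to listener : text"
    let content = li.innerText.split(" : ")
    let number = 0
    let text = content[1]
    let textFormat = null
    let peoples = content[0].split(' to ')
    if (typeof text !== 'undefined') {
      // formatSentence is a project helper that is not shown in this README
      textFormat = formatSentence(text)
      number = countWords(textFormat)
    }
    global.push({
      "from": peoples[0],
      "to": peoples[1],
      "text": text,
      "where": where,
      "number": number,
      // Estimated duration, derived from the word count
      "time": number * timePerWords
    })
  })
})
let data = JSON.stringify(global)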
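
The generator above relies on each list item reading `speaker to listener : text`. Here is a minimal sketch of that split outside the browser, so it can be tried in Node without a DOM. `parseLine` is a hypothetical helper, not part of the project, and the sample line is invented.

```javascript
// Hypothetical helper: split "SPEAKER to LISTENER : text" into its parts.
function parseLine(line) {
  const separator = line.indexOf(" : ");
  if (separator === -1) {
    // No dialogue text on this line; keep the raw content as the speaker part.
    return { from: line.trim(), to: undefined, text: undefined };
  }
  const people = line.slice(0, separator).split(" to ");
  return {
    from: people[0].trim(),
    to: people[1] ? people[1].trim() : undefined,
    // Everything after the first " : " is dialogue, even if " : " appears again.
    text: line.slice(separator + 3)
  };
}

console.log(parseLine("LUKE to BEN : I want to learn."));
```

Cutting at the first ` : ` with `indexOf` rather than `split(" : ")` is a design choice of this sketch: it keeps the full text when a line of dialogue itself contains the separator.
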
The PHP script used to get the total on-screen speech time from the SRT file:
<?php
// Read the subtitle file and reduce it to "start>end" timestamp pairs
$data = file_get_contents('./srt.txt', false);
$res = preg_replace("/\*([0-9])\*/", "", $data);
$res2 = preg_replace("/[^0-9:,>]/", " ", $res);
$res3 = preg_replace('!\s+!', ' ', $res2);
$res4 = preg_replace('/ , /', ' ', $res3);
$res5 = preg_replace('/ > /', '>', $res4);
$res6 = preg_replace('!\s+!', ' ', $res5);
$res7 = explode(" ", $res6);
$msGlobal = 0;
foreach ($res7 as $part) {
    $explode = explode('>', $part);
    if (sizeof($explode) == 2) {
        // Turn "hh:mm:ss,mmm" into "hh:mm:ss:mmm" and split it into four numbers
        $explodeNumbers = explode(':', preg_replace("/,/", ":", $explode[0]));
        $explodeNumbers2 = explode(':', preg_replace("/,/", ":", $explode[1]));
        // Skip tokens that are not complete timestamp pairs
        if (count($explodeNumbers) !== 4 || count($explodeNumbers2) !== 4) {
            continue;
        }
        $ms = (int)$explodeNumbers2[3] - (int)$explodeNumbers[3];
        $secondes = (int)$explodeNumbers2[2] - (int)$explodeNumbers[2];
        $minutes = (int)$explodeNumbers2[1] - (int)$explodeNumbers[1];
        $hours = (int)$explodeNumbers2[0] - (int)$explodeNumbers[0];
        // Add the full duration of this subtitle, in milliseconds
        $msGlobal += (($hours * 60 + $minutes) * 60 + $secondes) * 1000 + $ms;
    }
}
// Total speech time in milliseconds
echo $msGlobal;
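
The total on-screen speech time can also be sketched in JavaScript, the language used for the rest of the tooling. This version converts each `hh:mm:ss,mmm` timestamp to milliseconds and sums the length of every cue. `srtToMs` and `totalSpeechMs` are hypothetical names, and the sample subtitles are invented.

```javascript
// Convert an SRT timestamp such as "00:01:02,500" to milliseconds.
function srtToMs(timestamp) {
  const [hours, minutes, seconds, ms] = timestamp.split(/[:,]/).map(Number);
  return ((hours * 60 + minutes) * 60 + seconds) * 1000 + ms;
}

// Sum the duration of every "start --> end" cue found in the SRT text.
function totalSpeechMs(srt) {
  const cue = /(\d{2}:\d{2}:\d{2},\d{3})\s*-->\s*(\d{2}:\d{2}:\d{2},\d{3})/g;
  let total = 0;
  for (const match of srt.matchAll(cue)) {
    total += srtToMs(match[2]) - srtToMs(match[1]);
  }
  return total;
}

// Invented two-cue sample: 2.5 s + 1.2 s of speech.
const sampleSrt = [
  "1",
  "00:00:01,000 --> 00:00:03,500",
  "First line.",
  "",
  "2",
  "00:00:59,900 --> 00:01:01,100",
  "Second line."
].join("\n");

console.log(totalSpeechMs(sampleSrt)); // 3700
```

Matching whole timestamp pairs with one regular expression means cue numbers and digits inside the subtitle text are ignored without any extra cleaning step.
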
The storyline, characters and images represented in this repo belong to their respective owners.
Links | Description |
---|---|
DISNEY | All content belongs to Disney |
IMSDB | Scripts from the movies |
YIFY | Subtitle files |
MARKDOWN TO HTML | Convert markdown to HTML |
JSON PARSER | Parse the JSON |
CSV-JSON | Convert JSON to CSV |
WORDCOUNTER | Count words if needed |
REGEXR | For regular expressions |
WORDPOS | Part-of-speech tagging for word types |