This repository contains a Python pipeline that maintains an up-to-date catalogue of "The Big Picture" podcast episodes as Letterboxd lists. I love movies: I'm an avid Big Picture podcast listener and a longtime Letterboxd user, and this catalogue is an effort to share a valuable resource with other members of both communities.
The repo holds the project design and codebase; for more on what I learned from this side project, see my Medium writeup.
I have listened to The Big Picture podcast since 2018, and the knowledge and opinions I've absorbed have deepened my appreciation for cinema as an adult. The podcast format is engaging but has limitations for absorbing and referencing the dense information shared: the hosts and guests often discuss dozens of films per episode, of which I typically remember a few and miss the rest. This project aims to solve that problem by leveraging LangChain's LLM structured output to create a Letterboxd list to accompany each episode of the Big Picture podcast.
Letterboxd lists are a great way to structure and share film catalogues. In addition to identifying the films, this project leverages the list description to add value for listeners:
- Timestamp Links: A timestamp link for each film mentioned provides a frictionless interface between the audio version of the podcast and the Letterboxd list (see the sketch after this list). I hope this will continue to drive list viewers back to the podcast itself.
- Reference Context: A one-sentence summary, including the speaker's name, explains why each film was mentioned.
- Ranking and Drafting: When list-making or "drafting" occurs in the podcast, the final results are appended to the bottom of the description.
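As an illustration of the timestamp links, here is a minimal sketch that turns a WhisperX segment start time (in seconds) into a readable stamp and a deep link. The `?t=<seconds>` query parameter and the Markdown link format are assumptions for illustration; the real format depends on the platform the list links to.

```python
def format_timestamp_link(episode_url: str, start_seconds: float) -> str:
    """Build a display timestamp plus a deep link into the episode audio.

    The `?t=<seconds>` parameter is an assumption; the actual parameter
    name depends on the podcast platform being linked to.
    """
    total = int(start_seconds)
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    stamp = f"{hours:02d}:{minutes:02d}:{seconds:02d}"
    return f"[{stamp}]({episode_url}?t={total})"
```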
All steps are executed in a single Google Colab notebook that is scheduled to run daily, keeping the catalogue up to date. Colab was chosen for access to an A100 GPU, which is required to transcribe and diarize the audio files.
Most of this pipeline is adaptable to other podcasts with modifications to the RSS feed ingestion. The most interesting contribution is a generic solution for naming speakers in a diarized transcript: given a list of hosts and guests, an LLM identifies speaker names from conversational context. More info in the speaker-naming step (speaker_naming.py) below.
- Download podcasts from the RSS feed using podcast-downloader. This repo creates a catalogue of files to transcribe from an RSS feed. It is relatively plug and play: only the config file requires user input (sketched below), and it can save the files in .WAV format, which WhisperX requires.
- Transcribe and diarize the podcast audio using WhisperX (sketched below). There are many options, but as of October 2024 WhisperX proved the most popular and fastest, and the community support on GitHub was essential for quick learning and debugging. I was happy with the accuracy of the resulting diarization, but occasional, sometimes glaring, errors persist: the transcripts sometimes mis-transcribe movie names, especially new or niche movies, and the diarization struggles when multiple speakers have similar voice pitch or talk over one another. This step is handled by transcription.py.
- Name the speakers in the diarized transcript using OpenAI gpt-4o. Using a list of hosts and guests from the podcast, I ask gpt-4o to identify one speaker at a time from the conversational context of the transcript (sketched below). This is handled by speaker_naming.py.
- Summarize the podcast transcript using OpenAI gpt-4o. A 4-5 sentence summary of the subjects discussed in the episode provides basic context for the Letterboxd list and is saved in a CSV for upload to Letterboxd (sketched below). This is handled by list_creation.py.
- Create a table of films mentioned in the podcast using OpenAI gpt-4o. Letterboxd allows list creation from a CSV of film names and release years. I use gpt-4o + LangChain to return this information in a structured JSON format (sketched below), including a one-sentence summary of the reference and a timestamp link so that viewers can jump directly to that moment in the podcast. This is handled by list_creation.py.
- Upload the summary and film list to Letterboxd using Selenium. Selenium loops through the summary and film-table CSVs, executing the Letterboxd create-list flow (sketched below). This is handled by selenium_list_upload.py.
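Step 1 configuration: podcast-downloader is driven by a JSON config file. A minimal sketch might look like the following; the RSS URL and download path are placeholders, and the full option set is documented in the podcast-downloader repo.

```json
{
  "podcasts": [
    {
      "name": "The Big Picture",
      "rss_link": "https://example.com/the-big-picture/rss",
      "path": "podcast_downloader/downloads"
    }
  ]
}
```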
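Step 2: a condensed sketch of the transcription and diarization call sequence, following the usage shown in the WhisperX README; the model size, batch size, and file path are assumptions rather than the exact values in transcription.py.

```python
import os
import whisperx

device = "cuda"  # the Colab A100 runtime
audio = whisperx.load_audio("transcripts/episode.wav")  # path is illustrative

# Transcribe with the batched Whisper model
model = whisperx.load_model("large-v2", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# Align the output to get accurate word-level timestamps
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# Diarize, then attach speaker labels (SPEAKER_00, SPEAKER_01, ...) to each segment
diarize_model = whisperx.DiarizationPipeline(use_auth_token=os.environ["HF_TOKEN"], device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)
```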
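Step 3: the speaker-naming idea in miniature. This is a sketch rather than the code in speaker_naming.py; the prompt wording and the `name_speaker` helper are illustrative, and it assumes OPENAI_API_KEY is set in the environment.

```python
from openai import OpenAI

client = OpenAI()

def name_speaker(excerpt: str, speaker_label: str, candidates: list[str]) -> str:
    """Ask gpt-4o which host/guest a diarized label (e.g. SPEAKER_00) belongs to."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You identify podcast speakers from conversational context."},
            {"role": "user", "content": (
                f"The hosts and guests on this episode are: {', '.join(candidates)}.\n"
                f"Based on the transcript excerpt below, who is {speaker_label}? "
                f"Answer with the name only.\n\n{excerpt}"
            )},
        ],
    )
    return response.choices[0].message.content.strip()
```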
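Step 4 follows the same chat-completion pattern; this sketch reuses the `client` from the sketch above, and the prompt wording is illustrative.

```python
# `named_transcript` is the diarized transcript with real speaker names (step 3 output)
summary = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": (
        "Write a 4-5 sentence summary of the subjects discussed in this "
        f"podcast episode:\n\n{named_transcript}"
    )}],
).choices[0].message.content
```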
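Step 5: a sketch of the LangChain structured-output approach using a Pydantic schema. The field names here are illustrative, not the exact schema in list_creation.py.

```python
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI

class FilmMention(BaseModel):
    title: str = Field(description="Film title as released")
    year: int = Field(description="Release year, used for Letterboxd matching")
    context: str = Field(description="One-sentence summary of why the film came up, naming the speaker")
    timestamp_seconds: int = Field(description="Approximate moment in the audio where the mention occurs")

class FilmMentions(BaseModel):
    films: list[FilmMention]

llm = ChatOpenAI(model="gpt-4o", temperature=0)
extractor = llm.with_structured_output(FilmMentions)
mentions = extractor.invoke(
    "List every film mentioned in this transcript, with release year, "
    f"the reason it came up, and when:\n\n{named_transcript}"
)
```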
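Step 6: the shape of the Selenium flow. The element selectors below are placeholders (Letterboxd's real element names should be confirmed in a browser inspector), and the sign-in steps are omitted.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

list_title = "The Big Picture: <episode_title>"   # from summary_<episode_title>.csv
list_description = "..."                          # ditto

driver = webdriver.Chrome()
# (sign-in steps omitted)
driver.get("https://letterboxd.com/list/new/")

# Placeholder selectors -- inspect the page for the real element names
driver.find_element(By.NAME, "name").send_keys(list_title)
driver.find_element(By.NAME, "notes").send_keys(list_description)
driver.find_element(By.CSS_SELECTOR, "input[type=file]").send_keys("/path/to/json_episode.csv")
driver.find_element(By.CSS_SELECTOR, "input[type=submit]").click()
```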
Diarization vs. Transcription: While transcripts of the audio are publicly available from Apple Podcasts, a diarized transcript (including speaker names) increases the usefulness of the descriptions and the salience of the inferences and lists.
File structure for Letterboxd Upload Compatibility:
- CSV containing movie titles & year released - titled "json_<episode_title>"
- String containing List 'Title' - saved as column 1 in "summary_<episode_title>" CSV
- String containing List 'Description' - saved as column 2 in "summary_<episode_title>" CSV
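A sketch of how these files can be written with Python's csv module; the "Title"/"Year" headers are assumed here to match Letterboxd's list-import columns, and the variable names are illustrative.

```python
import csv

episode_title = "example_episode"   # illustrative
films = mentions.films              # structured-output results from step 5
list_title, list_description = "The Big Picture: Example", "..."

# Film table for the Letterboxd list importer ("Title"/"Year" headers assumed)
with open(f"csv/json_{episode_title}.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Title", "Year"])
    for film in films:
        writer.writerow([film.title, film.year])

# List title (column 1) and description (column 2), read by the Selenium step
with open(f"csv/summary_{episode_title}.csv", "w", newline="") as f:
    csv.writer(f).writerow([list_title, list_description])
```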
This pipeline preserves each intermediate .csv & .txt file produced before the final upload files. Diarized transcripts may be used in future steps for generative AI applications, including RAG. The pipeline requires these folders to organize the intermediate files:
.
├── csv (CSV files ready for Selenium upload)
├── csv_uploaded (CSV files after Selenium upload)
├── podcast_downloader (git clone of podcast-downloader repo with custom .config)
├── python_functions (python files containing functions for each step of pipeline)
└── transcripts (.wav, .csv, & .txt version of each episode)
- Transcription pipeline
- Letterboxd upload pipeline
- Backfill archive
- Automate pipeline execution
- More efficient change data capture to reduce cost of daily executions
- Automate reddit posting for audience growth
- Simplify podcast info collection in list_creation.py using RSS feed info.
- Collaborations with The Ringer?