Fusion-Based Video Summarization: Integrating Transcripts and Keyframes for YouTube Content Analysis
NADENDLA HARSHA VARDHAN*1, SHAIK REENA KOWSAR2, SHAIK NEHA3, PAMUJULA PAVAN NAGA SAI4, VASIREDDY SWATHI5
1Student, Department of CSE(AIML), Bapatla Engineering College, Bapatla 522101, AP, India
2Student, Department of CSE(AIML), Bapatla Engineering College, Bapatla 522101, AP, India
3Student, Department of CSE(AIML), Bapatla Engineering College, Bapatla 522101, AP, India
4Student, Department of CSE(AIML), Bapatla Engineering College, Bapatla 522101, AP, India
5Assistant Professor, Department of CSE(AIML), Bapatla Engineering College, Bapatla 522101, AP, India
Abstract— The rapid growth of video-sharing platforms has produced a flood of long-form multimedia content. Long videos are common in educational and technical domains, but users often find it difficult to extract relevant information from them efficiently. Most existing transcript-based summarization approaches rely mainly on textual features and neglect the viewer engagement cues that indicate which video segments matter. This work presents a novel fusion-based multimodal YouTube video summarization pipeline that leverages transcripts, engagement analysis, and generative AI. Our framework uses the TextRank algorithm with a TF-IDF-based similarity measure to rank the sentences of the video transcript. User engagement signals such as retention rates, engagement metrics, and sentiment scores are then combined with the sentence rank scores in an engagement fusion model to identify the important sentences in the video. Because there are multiple engagement signals, we apply principal component analysis (PCA) to reduce their dimensionality and lower the computational cost. We also produce summaries with generative AI models and use them as a benchmark for our extractive summaries. We built a user interface with Streamlit in which a user enters the URL of a YouTube video and views its summary along with other details. Our results show that adding engagement-aware signals helps generate better summaries with more context than traditional methods that consider only the video transcript.
Keywords— Video Summarization, TextRank, Engagement Fusion, Natural Language Processing, YouTube Analytics, Generative AI
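The ranking and fusion steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `alpha` fusion weight, and the weighted-sum fusion rule are all assumptions made for illustration. It also assumes that the engagement signals have already been reduced to a single score in [0, 1] per sentence (for example by PCA followed by rescaling), so the PCA step itself is not shown.

```python
import math
import re
from collections import Counter


def tfidf_vectors(sentences):
    """Build one sparse TF-IDF vector (term -> weight) per sentence."""
    tokenized = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]
    n = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            # Smoothed IDF so that terms present in every sentence keep weight.
            term: (count / total) * (math.log((1 + n) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences by weighted PageRank over a TF-IDF similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    vectors = tfidf_vectors(sentences)
    sim = [[cosine(vectors[i], vectors[j]) if i != j else 0.0
            for j in range(n)] for i in range(n)]
    out_weight = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n + damping * sum(
                sim[j][i] / out_weight[j] * scores[j]
                for j in range(n) if out_weight[j] > 0)
            for i in range(n)
        ]
    return scores


def min_max(values):
    """Rescale values to [0, 1]; constant input maps to all zeros."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]


def fuse(text_scores, engagement, alpha=0.6):
    """Hypothetical fusion rule: weighted sum of text rank and engagement."""
    return [alpha * t + (1 - alpha) * e
            for t, e in zip(min_max(text_scores), engagement)]


def summarize(sentences, engagement, k=2, alpha=0.6):
    """Return the k highest-scoring sentences in their original order."""
    fused = fuse(textrank(sentences), engagement, alpha)
    ranked = sorted(range(len(sentences)), key=lambda i: fused[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]
```

Calling `summarize(sentences, engagement, k=3)` with a list of transcript sentences and a parallel list of engagement scores returns the three selected sentences in transcript order. Setting `alpha=1.0` reduces the sketch to a transcript-only TextRank baseline, which gives one simple way to compare summaries with and without engagement signals.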