This project takes a YouTube podcast URL, extracts the transcript, identifies key topics and Q&A pairs, simplifies them for children, and generates an HTML report with the results.
- LLM Calls (utils/call_llm.py)
- YouTube Processing (utils/youtube_processor.py): Get video title, transcript and thumbnail
- HTML Generator (utils/html_generator.py): Create formatted report with topics, Q&As and simple explanations
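As an illustration of the kind of helper utils/youtube_processor.py might contain, the sketch below pulls the video ID out of a URL and derives a thumbnail URL from it. The function names `extract_video_id` and `thumbnail_url` are hypothetical, the thumbnail pattern is an assumption about YouTube's image host, and transcript fetching (which needs a network call or a third-party library) is deliberately left out.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_video_id(url: str) -> Optional[str]:
    """Return the video ID from common YouTube URL shapes, or None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the ID as the first path segment.
        return parsed.path.lstrip("/").split("/")[0] or None
    if host in ("youtube.com", "www.youtube.com", "m.youtube.com"):
        if parsed.path == "/watch":
            # Standard watch links carry the ID in the "v" query parameter.
            return parse_qs(parsed.query).get("v", [None])[0]
        for prefix in ("/embed/", "/shorts/"):
            if parsed.path.startswith(prefix):
                return parsed.path[len(prefix):].split("/")[0] or None
    return None


def thumbnail_url(video_id: str) -> str:
    """Build a thumbnail URL (assumed img.youtube.com pattern)."""
    return f"https://img.youtube.com/vi/{video_id}/hqdefault.jpg"
```

Returning None for unrecognized URLs lets the calling node decide whether to fail early or ask for a different link.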
The application flow consists of several key steps organized in a directed graph:
- Video Processing: Extract transcript and metadata from YouTube URL
- Topic Extraction: Identify the most interesting topics (max 5)
- Question Generation: For each topic, generate interesting questions (3 per topic)
- Topic Processing: Batch process each topic to:
  - Rephrase the topic title for clarity
  - Rephrase the questions
  - Generate ELI5 answers
- HTML Generation: Create final HTML output
flowchart TD
    videoProcess[Process YouTube URL] --> topicsQuestions[Extract Topics & Questions]
    topicsQuestions --> contentBatch[Content Processing]
    contentBatch --> htmlGen[Generate HTML]

    subgraph contentBatch[Content Processing]
        topicProcess[Process Topic]
    end
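The flowchart above is a straight line of four steps that all read from and write to one shared store. A minimal, framework-agnostic sketch of that idea follows; `run_flow` and the four placeholder step functions are hypothetical stand-ins, not the project's actual node classes, and their bodies only write dummy values.

```python
from typing import Any, Callable, Dict, List

Shared = Dict[str, Any]
Step = Callable[[Shared], None]


def run_flow(steps: List[Step], shared: Shared) -> Shared:
    """Run each step in order; every step reads and writes the shared dict."""
    for step in steps:
        step(shared)
    return shared


# Placeholder steps: each stands in for one box in the flowchart.
def process_video(shared: Shared) -> None:
    shared["video_info"]["transcript"] = "placeholder transcript"


def extract_topics_and_questions(shared: Shared) -> None:
    shared["topics"] = [{"title": "placeholder topic", "questions": []}]


def process_content(shared: Shared) -> None:
    for topic in shared["topics"]:  # the batch step visits each topic
        topic["rephrased_title"] = topic["title"].title()


def generate_html(shared: Shared) -> None:
    titles = ", ".join(t["rephrased_title"] for t in shared["topics"])
    shared["html_output"] = f"<html><body>{titles}</body></html>"


shared = run_flow(
    [process_video, extract_topics_and_questions, process_content, generate_html],
    {"video_info": {"url": "https://www.youtube.com/watch?v=example"}},
)
```

Because each step only touches the shared dict, any one of them can be swapped for a real implementation without changing the others.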
The shared memory structure will be organized as follows:
shared = {
    "video_info": {
        "url": str,            # YouTube URL
        "title": str,          # Video title
        "transcript": str,     # Full transcript
        "thumbnail_url": str,  # Thumbnail image URL
        "video_id": str        # YouTube video ID
    },
    "topics": [
        {
            "title": str,            # Original topic title
            "rephrased_title": str,  # Clarified topic title
            "questions": [
                {
                    "original": str,   # Original question
                    "rephrased": str,  # Clarified question
                    "answer": str      # ELI5 answer
                },
                # ... more questions
            ]
        },
        # ... more topics
    ],
    "html_output": str  # Final HTML content
}
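The same layout can be written as type definitions so that editors and type checkers catch misspelled keys. This is only a sketch: the class names and the `make_shared` helper are hypothetical, and `total=False` is used on the assumption that fields are filled in gradually as the flow runs.

```python
from typing import List, TypedDict


class Question(TypedDict, total=False):
    original: str   # Original question
    rephrased: str  # Clarified question
    answer: str     # ELI5 answer


class Topic(TypedDict, total=False):
    title: str
    rephrased_title: str
    questions: List[Question]


class VideoInfo(TypedDict, total=False):
    url: str
    title: str
    transcript: str
    thumbnail_url: str
    video_id: str


class Shared(TypedDict, total=False):
    video_info: VideoInfo
    topics: List[Topic]
    html_output: str


def make_shared(url: str) -> Shared:
    """Create the initial store: only the URL is known before the flow runs."""
    return {"video_info": {"url": url}, "topics": [], "html_output": ""}
```

At runtime these are ordinary dicts, so nodes can keep using plain key access.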
Process YouTube URL node:
- Purpose: Process YouTube URL to extract video information
- Design: Regular Node (no batch/async)
- Data Access:
  - Read: URL from shared store
  - Write: Video information to shared store
Extract Topics & Questions node:
- Purpose: Extract interesting topics from transcript and generate questions for each topic
- Design: Regular Node (no batch/async)
- Data Access:
  - Read: Transcript from shared store
  - Write: Topics with questions to shared store
- Implementation Details:
  - First extracts up to 5 interesting topics from the transcript
  - For each topic, immediately generates 3 relevant questions
  - Returns a combined structure with topics and their associated questions
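One way to picture the implementation details above is the sketch below. It assumes a single prompt returns both topics and questions as JSON, and that the LLM helper is a function taking a prompt string and returning a string; neither is confirmed by this document. `build_prompt`, `extract_topics`, and `fake_llm` are hypothetical names, and `fake_llm` only exists so the example runs offline.

```python
import json
from typing import Callable, Dict, List

MAX_TOPICS = 5
QUESTIONS_PER_TOPIC = 3


def build_prompt(transcript: str) -> str:
    return (
        f"From the transcript below, pick at most {MAX_TOPICS} interesting topics "
        f"and write {QUESTIONS_PER_TOPIC} questions for each.\n"
        'Reply with JSON only: [{"title": "...", "questions": ["...", "..."]}]\n\n'
        f"Transcript:\n{transcript}"
    )


def extract_topics(transcript: str, call_llm: Callable[[str], str]) -> List[Dict]:
    """Ask the model for topics and questions, then clamp to the stated limits."""
    raw = json.loads(call_llm(build_prompt(transcript)))
    topics = []
    for item in raw[:MAX_TOPICS]:
        questions = [
            {"original": q, "rephrased": "", "answer": ""}
            for q in item["questions"][:QUESTIONS_PER_TOPIC]
        ]
        topics.append(
            {"title": item["title"], "rephrased_title": "", "questions": questions}
        )
    return topics


def fake_llm(prompt: str) -> str:
    """Stand-in for the real model call; deliberately returns too many items."""
    return json.dumps(
        [
            {"title": f"Topic {i}", "questions": [f"Q{i}.{j}?" for j in range(4)]}
            for i in range(7)
        ]
    )


topics = extract_topics("some transcript text", fake_llm)
```

Clamping in code rather than trusting the prompt keeps the limits of 5 topics and 3 questions in force even when the model over-delivers.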
Content Processing node:
- Purpose: Batch process each topic for rephrasing and answering
- Design: BatchNode (process each topic)
- Data Access:
  - Read: Topics and questions from shared store
  - Write: Rephrased content and answers to shared store
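The batch pattern described here can be sketched as three plain functions: one that reads the items, one that handles a single topic, and one that writes the results back. The names `prep`, `exec_one`, and `post`, the prompt wording, and the use of one LLM call per field are all assumptions made for illustration; the real node may well use a single call per topic.

```python
from typing import Callable, Dict, List


def prep(shared: Dict) -> List[Dict]:
    """Read: one batch item per topic."""
    return shared["topics"]


def exec_one(topic: Dict, call_llm: Callable[[str], str]) -> Dict:
    """Process a single topic without touching the shared store."""
    return {
        "rephrased_title": call_llm(f"Rephrase this topic title clearly: {topic['title']}"),
        "questions": [
            {
                "original": q["original"],
                "rephrased": call_llm(f"Rephrase this question simply: {q['original']}"),
                "answer": call_llm(f"Answer for a 5-year-old: {q['original']}"),
            }
            for q in topic["questions"]
        ],
    }


def post(shared: Dict, results: List[Dict]) -> None:
    """Write: merge each result back into its topic, matched by position."""
    for topic, result in zip(shared["topics"], results):
        topic.update(result)


def fake_llm(prompt: str) -> str:
    """Stand-in model: echoes the text after the instruction in upper case."""
    return prompt.split(": ", 1)[1].upper()


shared = {"topics": [{"title": "black holes", "questions": [{"original": "why dark?"}]}]}
post(shared, [exec_one(t, fake_llm) for t in prep(shared)])
```

Keeping `exec_one` free of shared-store access means each topic can be retried or run independently of the others.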
Generate HTML node:
- Purpose: Create final HTML output
- Design: Regular Node (no batch/async)
- Data Access:
  - Read: Processed content from shared store
  - Write: HTML output to shared store
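A minimal sketch of the rendering step, assuming the processed content follows the shared structure described earlier. The function name `generate_html` and the choice of tags are illustrative only. Since titles and answers come from a model and from video metadata, every value is passed through `html.escape` before it is placed in the page.

```python
from html import escape
from typing import Dict


def generate_html(shared: Dict) -> str:
    """Render processed topics into a simple HTML page, escaping all text."""
    info = shared["video_info"]
    title = escape(info["title"])
    thumb = escape(info["thumbnail_url"], quote=True)
    parts = [
        "<!DOCTYPE html>",
        '<html><head><meta charset="utf-8">',
        f"<title>{title}</title></head><body>",
        f"<h1>{title}</h1>",
        f'<img src="{thumb}" alt="Video thumbnail">',
    ]
    for topic in shared["topics"]:
        parts.append(f"<h2>{escape(topic['rephrased_title'])}</h2>")
        for q in topic["questions"]:
            parts.append(f"<h3>{escape(q['rephrased'])}</h3>")
            parts.append(f"<p>{escape(q['answer'])}</p>")
    parts.append("</body></html>")
    return "\n".join(parts)
```

The node would then store the returned string under `shared["html_output"]`.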