[HN Gopher] Show HN: I created a free website for downloading YouTub...
       ___________________________________________________________________
        
        Show HN: I created a free website for downloading YouTube
        transcripts and subtitles
        
       Author : trungnx2605
       Score  : 97 points
       Date   : 2024-02-18 09:41 UTC (13 hours ago)
        
 (HTM) web link (www.downloadyoutubesubtitle.com)
 (TXT) w3m dump (www.downloadyoutubesubtitle.com)
        
       | tomcam wrote:
       | What a great service. Thanks!
        
       | rpastuszak wrote:
       | How are you getting the transcripts? Using the private YT API
       | like in https://www.npmjs.com/package/youtube-transcript?
        
         | trungnx2605 wrote:
         | youtube_transcript_api
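          | 
          | For reference, a minimal sketch of how that library is
          | typically called (the video ID below is just a placeholder):
          | 
          | ```
          | 
          | from youtube_transcript_api import YouTubeTranscriptApi
          | 
          | # each entry is a dict with "text", "start" and "duration"
          | transcript = YouTubeTranscriptApi.get_transcript("VIDEO_ID",
          |                                                  languages=["en"])
          | for entry in transcript:
          |     print(f'{entry["start"]:8.2f}  {entry["text"]}')
          | 
          | ```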
        
       | santamex wrote:
        | I also liked this one:
       | 
       | https://filmot.com/
       | 
       | Here you can search the subtitles of YouTube videos.
        
       | foobarqux wrote:
       | How do I actually search files with timestamps (preferably from
       | the CLI)?
       | 
        | I can use rg if the search terms happen to be on the same line,
        | but if the terms span multiple lines, the interleaved timestamp
        | metadata prevents the query from matching.
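        | 
        | One rough workaround (a sketch only, assuming SRT-style input; it
        | drops the timestamps instead of mapping matches back to them):
        | strip the counters and timestamp lines, join the caption text,
        | and search that. Run from the CLI as, say, `python
        | search_subs.py subs.srt 'search terms'` (hypothetical names):
        | 
        | ```
        | 
        | import re, sys
        | 
        | kept = []
        | with open(sys.argv[1], encoding="utf-8") as f:
        |     for line in f:
        |         line = line.strip()
        |         # skip blanks, numeric counters and "00:00:01,000 --> ..." lines
        |         if not line or line.isdigit() or "-->" in line:
        |             continue
        |         kept.append(line)
        | 
        | joined = " ".join(kept)
        | for m in re.finditer(sys.argv[2], joined, re.IGNORECASE):
        |     # print a bit of context around each match
        |     print(joined[max(0, m.start() - 40):m.end() + 40])
        | 
        | ```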
        
       | ldenoue wrote:
        | I'm also doing this, but mine adds punctuation, paragraphs and
        | chapter headers, because most raw YouTube transcripts lack
        | proper punctuation:
       | 
       | https://www.appblit.com/scribe
        
         | BigElephant wrote:
         | How are you deriving the punctuation?
        
         | undershirt wrote:
         | Wow! This is great
        
       | mmh0000 wrote:
       | yt-dlp[1] can also do this:
       | 
       | ```
       | 
        | $ yt-dlp --write-sub --sub-lang "en.*" --write-auto-sub \
        |     --skip-download 'https://www.youtube.com/watch?v=...'
       | 
       | ```
       | 
       | [1] https://github.com/yt-dlp/yt-dlp
        
       | tarasglek wrote:
       | Here is mine: https://www.val.town/v/taras/scrape2md
       | 
       | Use it like https://taras-
       | scrape2md.web.val.run/https://youtu.be/TJqeCpx...
       | 
        | This is meant to be a general-purpose content-to-markdown tool
        | for LLM interactions in https://chatcraft.org
        
         | hn_acker wrote:
         | What's the copyright license on your scrape2md code?
        
           | tarasglek wrote:
            | Updated the description with the license (MIT) and a link to
            | the more fully featured version.
        
       | wahnfrieden wrote:
       | Is there any way to extract the transcripts from JS state on
       | YouTube, instead of making API reqs for them?
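        | 
        | (The caption track URLs do sit in the page's embedded player
        | response, so one could read them from the already-loaded page's
        | JS state, or straight from the watch page HTML, rather than via
        | a separate transcript API call. A rough, untested sketch of the
        | HTML route, with the caveat that the internal JSON layout is not
        | a public interface and can change at any time:)
        | 
        | ```
        | 
        | import json, re, requests
        | 
        | # the watch page embeds ytInitialPlayerResponse as serialized JSON
        | html = requests.get("https://www.youtube.com/watch?v=VIDEO_ID").text
        | m = re.search(r"ytInitialPlayerResponse\s*=\s*(\{.*?\});", html,
        |               re.DOTALL)
        | if m:
        |     player = json.loads(m.group(1))
        |     tracks = (player.get("captions", {})
        |               .get("playerCaptionsTracklistRenderer", {})
        |               .get("captionTracks", []))
        |     for t in tracks:
        |         print(t.get("languageCode"), t.get("baseUrl"))
        | 
        | ```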
        
       | rspoerri wrote:
        | I use this script because automatically generated subtitles are
        | badly formatted as a transcript (they only work well as
        | subtitles). It works pretty well for archiving videos together
        | with their transcript and subtitles.
       | 
       | ```
       | 
       | #!/bin/zsh
       | 
       | # download as mp4, get normal subtitles
       | 
       | yt-dlp -f mp4 "$@" --write-auto-sub --sub-format best --write-sub
       | 
       | # download subtitles and convert them to transcript
       | 
       | yt-dlp --skip-download --write-subs --write-auto-subs --sub-lang
       | en -k --sub-format ttml --convert-subs srt --exec before_dl:"sed
       | -e '/^[0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9] -->
       | [0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9]$/d' -e
        | '/^[[:digit:]]\\{1,3\\}$/d' -e 's/<[^>]*> //g' -e
        | '/^[[:space:]]*$/d' -i '' %(requested_subtitles.:.filepath)#q"
       | "$@"
       | 
       | ```
        
       | araes wrote:
        | Checking online, this [1] appears to be one of the most heavily
        | referenced tools on Stack Overflow for downloading both
        | user-entered and automatically generated transcripts.
        | (Python-based.)
       | 
       | [1] https://github.com/jdepoix/youtube-transcript-api
       | 
        | Notably, Google really needs to have an obvious API endpoint for
        | this kind of call. If thousands of programmers are all rolling
        | their own implementations, there are probably also plenty who
        | simply download the full video and transcribe it themselves as
        | part of data-harvesting pipelines.
       | 
        | Honestly, I'm kind of surprised it's taken this long for YouTube
        | to fall prey to massive data-harvesting campaigns. From this
        | article [2] and this paper on YouTube data statistics [3], there
        | are ~14,000,000,000 videos on YouTube with a mean length of 615
        | seconds (~10 minutes).
       | 
        | You'd think people would be interested in:
        | 
        |       8,610,000,000,000 seconds
        |       143,500,000,000 minutes
        |       2,391,666,666 hours
        |       3,274,083 months
        |       272,840 years
        |       27,284 decades
        |       2,728 centuries
        |       273 millennia
       | 
       | Of live action video on nearly every single subject in human
       | existence.
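        | 
        | (Back-of-the-envelope from the cited figures, as a quick sanity
        | check of the numbers above:)
        | 
        | ```
        | 
        | videos = 14_000_000_000              # ~14 billion videos [3]
        | mean_length_s = 615                  # mean length ~615 seconds [3]
        | total_s = videos * mean_length_s
        | print(total_s)                       # 8,610,000,000,000 seconds
        | print(total_s / 60)                  # 143,500,000,000 minutes
        | print(total_s / 3600)                # ~2.39 billion hours
        | print(total_s / 3600 / 24 / 365.25)  # ~272,800 years
        | 
        | ```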
       | 
        | Also, the paper's really cool, and extremely sobering about being
        | a "content creator", given the finding that the top 1% of videos
        | get essentially all the views.
       | 
       | [2] "What We Discovered on 'Deep YouTube'",
       | https://www.theatlantic.com/technology/archive/2024/01/how-m...
       | 
       | [3] "Dialing for Videos: A Random Sample of YouTube",
       | https://journalqd.org/article/view/4066/3766
        
       | numpad0 wrote:
       | I see lots of yt-dlp commands here so...
       | 
        | PSA: yt-dlp exits non-zero if the destination filename or any of
        | the intermediate files' names are too long for the filesystem.
        | Use `-o "%(title).150B [%(id)s].%(ext)s"` to limit the filename
        | length (to 150 bytes in this example). "--trim-filenames" doesn't
        | work.
        
       | trungnx2605 wrote:
        | It uses youtube_transcript_api.
        
       | trungnx2605 wrote:
        | Hi guys, I still get an error that says "Client Error: Too Many
        | Requests for url". So YouTube has blocked the IP, right?
        
       ___________________________________________________________________
       (page generated 2024-02-18 23:01 UTC)