[HN Gopher] You-get: Dumb downloader that scrapes the web
       ___________________________________________________________________
        
       You-get: Dumb downloader that scrapes the web
        
       Author : Anon84
       Score  : 197 points
       Date   : 2024-10-27 12:45 UTC (10 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | politelemon wrote:
       | It seems they do not want you to report an issue without an
       | accompanying fix for it.
       | 
       | > If you would like to report a problem you find when using you-
       | get, please open a Pull Request, which should include [snip]
       | 
       | Can't say I've encountered this before.
        
         | kylecazar wrote:
         | They want you to just submit a PR with a test that, if passed,
         | would indicate the problem for you is fixed.
        
           | thangngoc89 wrote:
           | What happens if you don't know Python? Python is a relatively
           | easy language to learn but no way I'm gonna learn Python just
           | to report an issue
        
             | Filligree wrote:
             | Good chance you wouldn't be writing good bug reports
             | either, then. Github issues have enough noise that a first-
             | pass filter like this feels like a good idea, even if it
             | has some false positives.
        
               | papichulo2023 wrote:
               | I fail to see the logic in your comment. Just another
               | case of Goodhart's law.
        
               | achierius wrote:
               | This isn't really a metric though. It's a formal
               | existence proof that the bug exists. The key difference
               | IMO is that you have to create a test which A) looks (to
               | the maintainer) like it should pass, while simultaneously
               | B) not passing. It's much harder to game.
               | 
               | There are other cases where Goodhart's Law fails as well:
               | consider quant firms, where the "metric" used to judge a
               | trader is basically how much money you pull in. Seems to
               | be working fine for them
        
               | dartos wrote:
               | If you can't describe your bug in a test, then you
               | probably can't describe it sufficiently in English
               | either.
               | 
               | Seems to make sense
        
               | latexr wrote:
               | This in no way aligns with reality. I _frequently_
               | interact with users who can't code at all but make good
               | bug reports. One of the best ways to ensure success is to
               | have a form (GitHub allows creating those) which describes
               | exactly what is necessary and guides people in the right
               | direction.
               | 
               | What you're saying is even worse, since you're implying
               | someone could be an expert computer programmer or power
               | user, but because they're unfamiliar with the specific
               | language this project chose, they are incapable of making
               | good bug reports. That makes no sense.
        
             | js8 wrote:
             | The same thing that happens if the author of the OSS you
             | use doesn't know English.
        
             | dartos wrote:
             | Then you don't get to contribute bug reports.
             | 
             | Perfectly fine rule for a maintainer to have.
        
             | dotancohen wrote:
             | If the bug is egregious enough, somebody else will find it.
             | If the bug is important enough to you but esoteric, then
             | ask on a forum or enlist the help of someone you know who
             | does know Python.
             | 
             | How do you currently submit bug reports on e.g. MS Word or
             | Adobe Photoshop? This way is certainly more open than those
             | commonly-deployed software.
        
             | epcoa wrote:
             | Did you (or anyone) in this thread look to see exactly what
             | they are looking for with their provided examples?
             | 
             | https://github.com/soimort/you-get/pull/2680/commits/313b8d2...
             | 
             | You do not need to know Python deeply to construct what
             | they are expecting. They're not actually looking for a unit
             | test or something.
        
               | latexr wrote:
               | > Did you (or anyone) in this thread look to see exactly
               | what they are looking for with their provided examples?
               | 
               | I did. And I looked at all examples of "good commits",
               | not just the trivial ones.
               | 
               | https://github.com/soimort/you-get/pull/2685/files
               | 
               | That's already complex for someone unfamiliar with the
               | software (which might nonetheless be able to open a
               | competent bug report).
        
             | nunez wrote:
             | That's exactly it. They put up a gate that blocks low-
             | effort issues that only add busywork. I like it!
        
           | sigseg1v wrote:
           | I kind of like this. It's a more formal proof of concept. You
           | prove the bug exists by writing a failing test. If they
           | cannot construct a failing test then it's either too hard to
           | mock or reproduce (and therefore maybe not even worth fixing,
           | for a free tool), or it's impossible because it's not a bug.
           | Frees up maintainer time from dealing with reports that
           | aren't bugs.
        
             | latexr wrote:
             | > If they cannot construct a failing test then it's either
             | too hard to mock or reproduce (...), or it's impossible
             | because it's not a bug.
             | 
             | Or, you know, the user is not a developer. Or is unfamiliar
             | with Python, or their test suite, or git, or...
             | 
             | It is perfectly possible to be good at reporting bugs but
             | be incapable of submitting pull requests.
        
               | newaccount74 wrote:
               | The problem with popular tools is that they have more
               | bugs than can be fixed. So bug reports are pretty much
               | worthless: You know that there are 1000 bugs out there,
               | but you only have resources to fix 10 of them.
               | 
               | By asking users to provide reproducible test cases, you
               | can massively reduce the amount of work you have to do.
               | Of course that means 90% of bugs will never be reported.
               | But since you don't have the resources to fix them
               | anyway, why not just focus on the bugs that can be
               | reproduced and come with a test case...
        
         | onionisafruit wrote:
         | Interesting. I like the idea of encouraging people to try
         | creating a test or even a whole fix, but saying that's all you
         | will accept is a bit much. On the other hand, I'm not doing the
         | work to maintain you-get. I don't know what they deal with.
         | This may be an effective way to filter a flood of repetitive
         | issues from people who don't know how to run a command line
         | program.
        
           | probably_wrong wrote:
           | I believe there are two extremes. On one end you get a bunch
           | of repetitive non-issues, while on the other end you only get
           | issues about (say) bugs in FreeBSD 13.3 because only hard-
           | core users have the skills and patience to follow THE
           | PROCESS.
           | 
           | I know how to make an isolated virtual environment, install
           | the package, make a fork, create a test and make a PR. But I
           | don't know whether I care enough about a random project to
           | actually do it.
        
         | wccrawford wrote:
         | As the other commenter said, they want a failing test, not a
         | fix.
         | 
         |     A detailed description of the encountered problem;
         |     At least one commit, addressing the problem through some unit test(s).
         |     Examples of good commits: #2675, #2680, #2685
         | 
         | "Addressing" is probably a bad word to use here.
         | "Demonstrating" would have been better, IMO.
        
           | tylerchilds wrote:
           | the most expensive piece of writing software is scoping work.
           | 
           | i'm almost tempted to add a test suite just to give people
           | more agency over my output because right now i'm only
           | soliciting feedback in person to cut down on internet
           | bullshit, like what happened to xz-utils
        
         | thih9 wrote:
         | It's relatively easy to write a failing test and it massively
         | cuts down the work related to moderating issues. Also, reduces
         | the danger of github issues turning into a support forum.
         | 
         | If this results in the project being easier to maintain and
         | being maintained longer, then I'm fine with this.
        
           | seneca wrote:
           | > It's relatively easy to write a failing test and it
           | massively cuts down the work related to moderating issues.
           | 
           | Relative to what? Learning someone else's code base well
           | enough to write a useful test is not trivial.
           | 
           | It's not a bad method, but the vast majority of users won't
           | be capable of writing a test that encapsulates their issue.
        
             | chucksmash wrote:
             | In the case of this tool, adding a failing test case looks
             | trivial if you've got the URL of a page it fails on.
             | 
             | Provided the maintainer is willing to provide some minimal
             | guidance to issue reporters who lack the necessary know-
             | how, it even seems like a clever back door way of helping
             | people learn to contribute to open source.
        
         | zufallsheld wrote:
         | Serverspec does the same:
         | https://github.com/mizzy/serverspec?tab=readme-ov-file#maint...
        
         | omoikane wrote:
         | The Chinese version of the text has an extra header line that
         | translates to "to prevent abuse via GitHub Issues, we are not
         | accepting general issues". An earlier commit has this for the
         | English text:
         | 
         |     `you-get` is currently experimenting with an aggressive
         |     approach to handling issues. Namely, a bug report must be
         |     addressed with some code via a pull request.
         | 
         | https://github.com/soimort/you-get/commit/75b44b83826b3c2d9a...
         | 
         | Maybe they got too much spam.
         | 
         | By the way, `tests/test.py` seems to just run the extractors
         | against various websites directly. I can't find where it's
         | mocking out network requests and replies. Maybe this is to
         | simplify the process for people creating pull requests?
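
For comparison, a test that does mock out the network might look like the sketch below. `fetch_page_title` is a hypothetical helper, not part of you-get; the example only relies on `unittest.mock.patch` from the standard library to replace `urllib.request.urlopen` with a canned response.

```python
import io
import re
import unittest
import urllib.request
from unittest import mock


def fetch_page_title(url):
    """Hypothetical helper: download a page and return its <title> text."""
    with urllib.request.urlopen(url) as response:
        html = response.read().decode("utf-8")
    match = re.search(r"<title>(.*?)</title>", html)
    return match.group(1) if match else None


class FetchTitleTest(unittest.TestCase):
    def test_title_from_canned_response(self):
        url = "https://example.invalid/v/1"  # never actually contacted
        # BytesIO supports read() and the context-manager protocol,
        # so it can stand in for the HTTP response object.
        canned = io.BytesIO(b"<html><title>Example</title></html>")
        with mock.patch("urllib.request.urlopen", return_value=canned) as fake:
            self.assertEqual(fetch_page_title(url), "Example")
        fake.assert_called_once_with(url)


def run_mocked_test():
    """Run the test case without touching the network."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(FetchTitleTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

The trade-off is the one hinted at in the comment: a mocked test is stable and offline, but the reporter has to capture a response first, whereas hitting the live site only needs a URL.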
        
           | godelski wrote:
           | I can get this, but I aggressively report accounts and
           | issues. I'm not sure how GitHub handles them but they seem to
           | not come back.
           | 
           | Though what I'm unsure how to deal with is legitimate users
           | being idiotic. For example, recently one issue was opened
           | that asked where the source code was. Not only was there a
           | directory named "src" but there were some links in the readme
           | to specific parts. While I do appreciate GitHub and places
           | like hugging face [0], there are a lot of very aggressive and
           | demanding noobs.
           | 
           | I'd like ways to handle them better.... I'm tired of people
           | yelling at me because 5 year old research code no longer
           | works out of the box or because you've never touched code
           | before.
           | 
           | [0] check any hugging face issue and you'll see far more
           | spam. Same accounts will open multiple issues that just
           | berate owners and hugging face makes it difficult to report
           | these accounts.
        
             | throwaway314155 wrote:
             | The solution is to ignore them and close their issue. Open
             | source maintainers have enough to worry about and are
             | unpaid, it's okay to be a little dictatorial when it comes
             | to "bad questions".
        
       | KTibow wrote:
       | Can someone explain why this is better than yt-dlp?
        
         | uniqueuid wrote:
         | That's an interesting question. They only depend on a single
         | library, but I wonder how much code is really their own. I
         | found it curious, for example, that there is a dedicated mp4
         | joiner (I mean, if you already have ffmpeg, there is probably
         | no way you can do it better yourself).
         | 
         | https://github.com/soimort/you-get/blob/develop/src/you_get/...
        
         | grugagag wrote:
         | How did you infer better than yt-dlp? I think the more the
         | better when it comes to this space as google fights back.
        
           | xg15 wrote:
           | But some information on how it differs from yt-dlp, and on
           | the reasons for starting an entirely new project, would
           | still be helpful.
           | 
           | (Also, a multitude of tools isn't really all that helpful if
           | they all stop working in the same instant because they all
           | relied on the same APIs etc)
        
       | vanjajaja1 wrote:
       | > Search on Google Videos and download
       | >
       | > $ you-get "Richard Stallman eats"
       | 
       | I don't often read instruction manuals, but this time I did and I
       | found this gross easter egg
        
       | dotancohen wrote:
       | Can it back up a text webpage? Can it remove popups for
       | newsletters, or subscription, or logins, or cookies'
       | notifications? Can it read pages that require signing in?
        
       | demberto wrote:
       | Is this different from JDownloader2?
        
       | tcsenpai wrote:
       | I like this. I am imagining a companion extension for chrome/ff
       | that uses you-get as a backend to implement it in a seamless way.
       | Forward thinking idea: imagine going on youtube and have you-get
       | extension bypass the youtube player and playing the content
       | directly without ads. When I say youtube I might also say any
       | other platform.
        
         | mikojan wrote:
         | Sounds like FastStream Video Player
         | 
         | https://addons.mozilla.org/en-US/firefox/addon/faststream/?u...
        
       | xg15 wrote:
       | I wouldn't exactly call a ytdl-style media downloader with a
       | whole library of site-specific extractors and converters "dumb"
       | but still cool that more projects like ytdl exist.
        
       | andai wrote:
       | For a while I had expensive internet and low bandwidth, but I
       | loved listening to music and lectures on YouTube. At some point I
       | realized that getting only the audio stream would save me 90% in
       | bandwidth costs. [0]
       | 
       | youtube-dl (and yt-dlp) has a flag, I believe -g, which gives you
       | the URL(s) for the requested format/quality. I used the command
       | line on my computer and put the link in VLC. On my phone I had
       | this elaborate workaround involving downloading the file to my
       | VPS first over SSH, then downloading it to my phone, until I
       | realized my phone browser can consume the URL directly, so I set
       | up a PHP frontend for `youtube-dl -g -f bestaudio {url}`
       | 
       | It's no longer online and I lost the code, but it was like one
       | line of code.
       | 
       | I mention this because you-get seems to support the same usecase
       | (via --url / -u), so I wanted to let people know how useful this
       | is!
       | 
       | (While it was online I shared it on some forums and got very
       | positive feedback, people used it for audiobooks etc.)
       | 
       | [0] Also playing with screen off saves 80% battery life! YouTube
       | knows these facts and that's why they made background playback
       | (which fetches only audio stream) a paid feature...
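
The lost one-liner could be approximated along these lines. This is a sketch, not the original PHP: it assumes `yt-dlp` is on the PATH and uses its `--get-url` and `-f bestaudio` options; the function names and the injectable `runner` parameter are made up for illustration.

```python
import subprocess
from urllib.parse import urlparse


def build_command(video_url):
    """Build the argv that asks yt-dlp to print the direct audio URL."""
    parsed = urlparse(video_url)
    # Only accept plain web URLs; the input comes from untrusted users.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError("expected an http(s) URL")
    # An argv list (no shell) keeps the URL from being interpreted as code.
    return ["yt-dlp", "--get-url", "-f", "bestaudio", video_url]


def resolve_audio_url(video_url, runner=subprocess.run):
    """Return the first direct stream URL that yt-dlp prints."""
    completed = runner(
        build_command(video_url),
        capture_output=True,
        text=True,
        check=True,
        timeout=60,
    )
    return completed.stdout.strip().splitlines()[0]
```

A web front-end would call `resolve_audio_url` and redirect the browser to the result. Since any such service runs a downloader on arbitrary user input, the scheme check and the absence of a shell are the bare minimum, not a complete defense.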
        
         | 01HNNWZ0MV43FF wrote:
         | I think it's -x to just rip audio now
        
         | TechDebtDevin wrote:
         | Brave Mobile browser allows turning on background video audio
         | thus eliminating the need for YouTube Premium and similar
         | subscriptions.
        
           | l3x4ur1n wrote:
           | I don't know why your comment is downvoted because I use this
           | feature of Brave very often and I also exclusively watch YT
           | in Brave mobile (no ads).
        
             | gaudystead wrote:
             | For me, it was as easy as adding a shortcut to the YouTube
             | homepage on Brave that it basically acts like the YouTube
             | app, but with ad blocking built in. It's the only way I
             | watch YT videos on mobile.
        
               | icar wrote:
               | You might be interested in GrayJay app.
        
             | TechDebtDevin wrote:
             | There are a lot of people that don't like Brave's business
             | model. But I've never given Brave a dime and turn off their
             | ad network stuff and they've saved me hundreds of dollars
             | on Youtube Premium over the years.
        
           | cocok wrote:
           | For Firefox:
           | 
           | https://github.com/mozilla/video-bg-play
        
         | ww520 wrote:
         | That's the -F option to list all the formats, including the
         | audio streams. Pick the audio format with -f to download the
         | audio. I usually pick the .m4a format and then run it through
         | ffmpeg to convert to mp3.
        
           | KMnO4 wrote:
           | What's the point of converting it to mp3? AAC inside an m4a
           | container usually has better sound quality than similarly
           | compressed mp3, and definitely better than reencoding.
        
             | userbinator wrote:
             | MP3 is accepted by far more players.
        
           | krick wrote:
           | That's a really unnecessarily complicated workflow you have.
           | It's achievable by yt-dlp with just 3 flags:
           | 
           | --extract-audio
           | 
           | --format bestaudio
           | 
           | --audio-format mp3
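
Side by side, the two approaches from this subthread could be sketched as below. The flags are the ones named in the comments (`-f`, `--extract-audio`, `--format`, `--audio-format`) plus `-o` for the output name; the `bestaudio[ext=m4a]` selector, the file names, and the helper functions are assumptions for illustration. The code only builds argument lists; pass each one to `subprocess.run` to execute it.

```python
def two_step_commands(video_url, stem="audio"):
    """Download an m4a stream with yt-dlp, then convert it with ffmpeg."""
    m4a, mp3 = f"{stem}.m4a", f"{stem}.mp3"
    download = ["yt-dlp", "-f", "bestaudio[ext=m4a]", "-o", m4a, video_url]
    convert = ["ffmpeg", "-i", m4a, mp3]
    return [download, convert]


def one_step_command(video_url):
    """Let yt-dlp pick the best audio and convert it to mp3 itself."""
    return [
        "yt-dlp",
        "--extract-audio",
        "--format", "bestaudio",
        "--audio-format", "mp3",
        video_url,
    ]
```

Either way ffmpeg does the re-encoding, since yt-dlp's `--extract-audio` requires ffmpeg to be installed. The one-step form may start from a different source stream (whatever `bestaudio` resolves to) than an explicitly chosen m4a.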
        
             | knowitnone wrote:
             | you're unnecessarily making huge assumptions. Some people
             | don't want the bestaudio or mp3
        
               | krick wrote:
               | If I would make any assumptions, I would post another 30
               | options from my config that are nice to have when you
               | download audio from youtube. These 3 are exactly
               | equivalent to what gp does.
        
           | andai wrote:
           | Same but I converted to Opus, because I was trying to squeeze
           | it into as little bandwidth as possible. It was mostly speech
           | content and Opus auto detects and optimizes for speech at low
           | bitrates.
        
         | Synaesthesia wrote:
         | BTW if you browse YouTube with Firefox browser on Android you
         | can play back YouTube videos with the screen locked using
         | background player fix extension.
        
         | 6yyyyyy wrote:
         | NewPipe can do this very nicely, it even lets you build a
         | playlist of videos.
        
         | wutwutwat wrote:
         | A service that takes arbitrary user input and then attempts to
         | download/proxy whatever is at the end of that input. Brave
         | soul.
        
         | khimaros wrote:
         | on Android YTDLnis solves this very nicely. simply share the
         | video URL to the app and it can download whichever format you
         | like https://github.com/deniscerri/ytdlnis
        
         | cquintana92 wrote:
         | One of my last weekend projects was something similar: convert
         | youtube playlists into podcast-compatible URLs:
         | 
         | https://github.com/cquintana92/yt2pc
        
         | dredmorbius wrote:
         | mpv similarly has this option. I _listen_ to far more videos
         | than I _watch_.
         | 
         | <https://mpv.io/>
        
       | MattDaEskimo wrote:
       | Another library released which lies about what it is to
       | circumvent anti-bot security.
       | 
       | Let's just not act surprised when tighter attestation comes
       | into effect.
        
         | ajsnigrutin wrote:
         | This library/program solves problems that people have with
         | pages like youtube... too many ads, no way to download videos
         | for offline use (or archive for when they get removed), and
         | better performance with a native player.
         | 
         | If I was forced to watch all the ads on youtube, i wouldn't
         | watch videos there at all.
        
         | therein wrote:
         | A future in which YouTube will refuse to stream you data
         | because you didn't pass client attestation is definitely coming
         | and I wish we could stop it.
         | 
         | It is a dark future where some of us will accept it, and the
         | rest of us will be constantly taking part in a cat-and-mouse
         | chase in which we glitch out attestation tokens from
         | vulnerable devices to get by.
        
           | userbinator wrote:
           | We need laws against user-agent discrimination.
        
         | troupo wrote:
         | I used to "save" interesting links by emailing them to myself.
         | 
         | Now most of them are dead, twitter accounts removed, youtube
         | videos deleted, facebook pages bought by media management
         | companies, sites rebuilt etc.
         | 
         | Whatever the primary goal of this tool, it, and other similar
         | tools, are invaluable in actually saving and preserving content
        
       | krick wrote:
       | Given the title and the first few sentences from a description I
       | assumed that it's some heuristic-based tool to try and grab
       | whatever there is on the page, which would be useful if there's
       | no tool which implemented the support for this site (which in
       | most cases just means "yt-dlp doesn't support it"). But
       | apparently it's also extractor-based with a separate extractor
       | for each somewhat-popular source. So, basically it's just a
       | less sophisticated clone of yt-dlp?
        
       | jdthedisciple wrote:
       | Anybody else getting this error constantly?
       | 
       |     you-get: [error] oops, something went wrong.
       |     you-get: don't panic, c'est la vie. please try the following steps:
       |     you-get:   (1) Rule out any network problem.
       |     you-get:   (2) Make sure you-get is up-to-date.
       |     you-get:   (3) Check if the issue is already known, on
       |     you-get:         https://github.com/soimort/you-get/wiki/Known-Bugs
       |     you-get:         https://github.com/soimort/you-get/issues
       |     you-get:   (4) Run the command with '--debug' option,
       |     you-get:       and report this issue with the full output.
       | 
       | Tried with debug flag but didn't really help
       | 
       |     pattern = str(pattern, 'latin1')
       |               ^^^^^^^^^^^^^^^^^^^^^^
       |     TypeError: decoding to str: need a bytes-like object, NoneType found
       | 
       | I was curious to see if it can bypass age restriction (though I
       | tried on non-age-restricted video too with the same error).
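
The last line of that traceback can be reproduced in isolation. A sketch, assuming (this is a guess, not something the posted output confirms) that an earlier lookup returned `None` and that value was later passed along as the pattern to decode; the function names here are hypothetical.

```python
def decode_pattern(pattern):
    """Mirror the failing line: decode a bytes pattern to str."""
    return str(pattern, "latin1")


def describe_failure(pattern):
    """Return the TypeError message for a bad pattern, or None if it decodes."""
    try:
        decode_pattern(pattern)
    except TypeError as exc:
        return str(exc)
    return None
```

`decode_pattern(b"abc")` works, while `describe_failure(None)` returns a message naming `NoneType`. That would be consistent with the same error showing up regardless of whether the video is age-restricted, though the full `--debug` output would be needed to say where the `None` comes from.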
        
       | natch wrote:
       | Is this just a fork of yt-dlp with credits rewritten?
        
       | fnoobnar wrote:
       | I'm not sure I understand why Bandcamp is on the list of
       | supported sites: they allow you to just download the files on the
       | condition you first pay the artist for them.
       | 
       | The fact you can download it with this tool is because the artist
       | is letting you listen to it for free before buying it.
       | Downloading it with this tool seems totally unnecessary and a bit
       | of a jerk move. Bandcamp hosts mostly small and independent
       | artists and labels.
        
         | khaki54 wrote:
         | I presume you could subscribe and still use this tool? People
         | use automation tools like this to download things that they
         | already pay for because it saves them the effort of logging
         | into 5 different apps depending on which walled garden it's in.
        
           | hluska wrote:
           | Do artists get paid on Bandcamp if they bypass the login?
        
         | lovethevoid wrote:
         | Their list of supported sites isn't a declaration of where you
         | should use this tool for moralistic reasons. It's just a list
         | of popular sites it works on.
        
       | wanderingmind wrote:
       | Nice work. But as a consumer, why should I use you-get over
       | yt-dlp? What are its strengths over yt-dlp, which works quite
       | well on a huge range of websites[1]?
       | 
       | [1] https://github.com/yt-dlp/yt-dlp/blob/master/supportedsites....
        
       ___________________________________________________________________
       (page generated 2024-10-27 23:00 UTC)