[HN Gopher] 4.5M Suspected Fake Stars in GitHub
       ___________________________________________________________________
        
       4.5M Suspected Fake Stars in GitHub
        
       Author : qianli_cs
       Score  : 130 points
       Date   : 2024-12-29 14:30 UTC (4 days ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | mentalgear wrote:
       | In a world with so much fake PR stuff and AI slop, any project
       | that tries to verify what's real and what's not is an excellent
       | choice of topic, fostering integrity again in our industry.
       | 
       | Here's the actual repo: https://github.com/hehao98/StarScout
        
       | hobs wrote:
       | Allow it a week to finish all iterations and expect it to read >=
       | 40TB of data. You can use nohup to put it as a background
       | process. After a week, you can run the following command to
       | collect the results into MongoDB and local CSV files:
       | 
       | I just love the yolo nature of "well let's check in a week if
       | that 40TB of data processing worked"
        
         | queuebert wrote:
         | Reminds me of this famous paper:
         | https://arxiv.org/abs/astro-ph/9912202
        
       | prepend wrote:
       | I don't like stars as a metric. Or at least as a comparator. If
       | you brag about having a million stars, that says something, as a
       | million is a lot.
       | 
       | But if you brag that your project has a million and your
       | competitor has half a million, that is so illogical that I would
       | discount your project and think it's run by dummies.
       | 
       | Are there practical situations where people really need stars
       | enough to buy them?
        
         | hiccuphippo wrote:
         | My only guesses are people showing popular repos for their CV
         | or to appear legitimate to get access to another repo like what
         | happened with the xz utils backdoor.
        
           | bdcravens wrote:
           | There's also the third category of projects receiving
           | funding.
        
             | datadrivenangel wrote:
             | And for a while startups were using it as a traction metric
             | for open core projects when pitching to VCs.
        
         | pembrook wrote:
         | Once you start trying to make a living from anything you do
         | online, you start to realize that literally everything on the
         | internet is gamed to the extreme. Even this article was written
         | and posted here for a reason.
         | 
         | If your GitHub repo can in any way provide you with income
         | (from just having something to talk about in an interview or an
         | innocuous "buy me a coffee" link...all the way up to selling
         | $100,000/yr enterprise support plans), you now have a strong
         | incentive to game the system.
         | 
         | And if it's allowed by the system, then it's a prisoner's
         | dilemma. Because if you DON'T do it, your competition will do
         | it and eat your lunch.
         | 
         | That's why it's so important to design high integrity ranking
         | systems.
        
         | magic_smoke_ee wrote:
         | The "like" metric is dumbed down to self-amplifying popularity
         | that hovers around meaninglessness. It would be more valuable
         | to weight things based on how the people you respect rate a
         | particular item.
        
         | mrweasel wrote:
         | Github really wants to be a social network or something to that
         | effect and I get the feeling that most developers don't care. If
         | you log in it's pretty clear that the "front page" is supposed
         | to be something like a feed, but I don't know anyone who uses
         | it. Mine is completely blank and pointless. Stars, I suppose,
         | are meant to be something akin to a like, maybe.
         | 
         | I have plenty of Github projects bookmarked, but I never
         | "starred" one... Why would I?
        
       | dang wrote:
       | (We merged comments from
       | https://news.ycombinator.com/item?id=42573954 to this thread.)
       | 
       | https://www.bleepingcomputer.com/news/security/over-31-milli...
        
       | zitterbewegung wrote:
       | All metrics will be gamed at some point. I don't know exactly how
       | you could even fight this.
        
         | jasoneckert wrote:
         | Neither do I.
         | 
         | I believe the only thing anyone can do is take metrics of how
         | the metrics are gamed, as this particular paper has done.
        
         | jazzyjackson wrote:
         | there are various reasons webs-of-trust don't take off, but I
         | can imagine a system where the metrics I see are only aggregated
         | from friends-of-friends, and any other signal is just
         | considered untrustworthy and therefore not worth observing
        
           | drusepth wrote:
           | Do you still trust that system when your friends-of-friends
           | are the ones gaming the system? Given the inherent network
           | effects of manipulating webs of trust, I wouldn't be
           | surprised if everyone had at least one friend-of-a-friend
           | they shouldn't necessarily trust.
        
             | morkalork wrote:
             | Given all the obvious bots and sketchy recruiters that try
             | to connect with me on LinkedIn, who all appear to have at
             | least one mutual connection, it probably won't work.
        
               | jagged-chisel wrote:
               | Do we have a similar issue on GH? I think the nature of
               | the service and its target audience affect this problem
               | in a big way. You can follow anyone on GH, but there's no
               | mutual connection option at all. LI has following _and_
               | mutual connections. LI also has a much wider audience.
               | 
               | How might a 'connection' look on GH? Will people freely
               | connect, or will they appraise requests more closely?
        
           | wruza wrote:
           | I can imagine access to raw data instead of some stupid come-
           | on-game-me-able predefined indicator, and that I can run some
           | private statistical analysis over it. People would use (and
           | share) different algorithms and gamers will at least wander
           | through this collectively created mud without any
           | understanding except for the defaultest measures.
           | 
           | But of course this is too complex and "no one will use it"
           | (tm). So we'd better have a screwed up recommendation system
           | that doesn't work at all, cause that's simpler!
        
           | codetrotter wrote:
           | I can only speak for me personally. For me the way that I use
           | GitHub I don't think the concept of "friends of friends"
           | would be all that useful on GitHub.
           | 
           | There are a handful of people that I know IRL that I follow
           | on GitHub. And a few hundred that I follow in total. Out of
           | the handful of people I know IRL, and who I follow on GitHub,
           | only two or three of them are active there any given week.
           | All of the other people I follow I have very little idea who
           | they are. Usually I follow people I don't know if I come
           | across their profile and either the profile itself or their
           | projects make me follow them. But I star way more different
           | repos than the number of people I click follow on.
           | 
           | For me, the main ways of discovering new repos are:
           | 
           | - Frontpage of HN, and comments in posts on HN.
           | 
           | - Specific search results on Google when I have searched for
           | libraries or programs that do specific things.
           | 
           | - Libraries on crates.io that I think might be interesting to
           | look into in the future.
           | 
           | Maybe once or twice a month I happen to click on the main
           | page of GitHub itself and see mentions of repos that have
           | been committed to or starred or created by people I follow.
           | 
           | So for me I don't think "friends of friends" is a
           | particularly great signal for things to look at. Most of the
           | people I follow, I don't know much about them.
           | 
           | Likewise, for anyone that follows me it's not necessarily any
           | strong signal that I follow someone else in order to
           | determine if activity from that someone else should be shown
           | or weighted as more significant to my follower just because I
           | happen to follow that other person.
           | 
           | If you do want a strong signal for who to boost for my
           | followers based on my own activity, go and look at the
           | dependencies that I am using in my own projects. That's a
           | pretty good indicator that I put some amount of effort and
           | interest into looking at something. This could be done by
           | GitHub itself, parsing the Cargo.toml files of my projects
           | and extracting the dependencies section and looking up which
           | of those dependencies are hosted on GitHub.
        
           | kube-system wrote:
           | Maybe so, but in this case, I don't think 'stars' is a good
           | candidate for one of those metrics. I think the people
           | worried about 'fake stars' are doing it wrong, and should
           | just ignore the metric entirely.
        
         | begueradj wrote:
         | It comes down to fighting against human nature. And that's a
         | lost battle.
         | 
         | Set any law you want, our nature will push us to circumvent it
         | even legally.
        
           | thrance wrote:
           | Not nature no, it's all about incentives. Oftentimes it's
           | financial, for github stars it's prestige and visibility.
        
           | mentalgear wrote:
           | Most people are happy living in a fair ecosystem - it's only
           | the 1-2% of the population that seek control, money and power
           | that start trying to exploit the system.
           | 
           | Only if we let that minority keep manipulating the system
           | without consequences does it become the driving market force
           | that the rest of the population also feels they have to
           | comply with, to go along, as has already happened in
           | finance, academia, etc.
        
             | JumpCrisscross wrote:
             | > _Most people are happy living in a fair ecosystem_
             | 
             | For varying and self-serving definitions of fair. (Almost
             | everyone in the rich world is in an unfairly-advantaged
             | minority.)
        
           | vouaobrasil wrote:
           | I don't really think so. The Amish have a nice system. Their
           | society has many fewer bad actors compared to general
           | society.
           | 
           | Actually one of the keys is repeated contact. People who have
           | to interact again and again will try and game the system
           | less. Not sure how to build that into a star system but why
           | give up so easily? Do programmers give up when you say "this
           | algorithm can't be made any faster?"
        
             | JumpCrisscross wrote:
             | > _one of the keys is repeated contact_
             | 
             | The other is hierarchy. You can't automate reputation
             | scoring.
        
             | eddythompson80 wrote:
             | I don't think it's just the Amish. Collectivist cultures in
             | general have (or maybe are perceived to have, I don't know)
             | fewer bad actors compared to individualistic cultures.
             | 
             | It doesn't matter if people have to interact frequently if
             | there are no real consequences to that interaction. The
             | punishment in those collectivist cultures involves social
             | shunning, shaming, etc. Individualistic cultures almost
             | pride themselves on how much they can disregard social
             | shunning and shaming. Shameless people are celebrities and
             | elected officials. They are admired as opposed to shunned
             | and ignored. A bad actor in an Amish community is expelled
             | and loses access to what that community offers. That would
             | be illegal in the general society unless their "bad act"
             | was actually illegal. Discriminating against someone for
             | being a dickhead who exploits loopholes and unregulated
             | corner cases (without explicitly breaking the law) would be
             | illegal in many contexts.
             | 
             | > Not sure how to build that into a star system but why
             | give up so easily? Do programmers give up when you say
             | "this algorithm can't be made any faster?"
             | 
             | I don't think people have given up. Online fraud detection
             | is a massive industry as is. Spotify plays, YouTube views,
             | Google search, Amazon reviews, reddit upvotes, twitter's
             | retweets, facebook likes/shares, etc all fall exactly into
             | the same bucket. There is even a significant dollar amount
             | attached to many of those, more so than GitHub stars. All
             | are frequently gamed/faked and it's a battle between the
             | platforms and the adversary
        
         | nwienert wrote:
         | Only show "Active Developer Stars" by default:
         | 
         | - Only accounts that have a decent amount of activity (pushing
         | code, commenting, etc)
         | 
         | - Has set up SSH
         | 
         | - Older than 2 years
         | 
         | - Account active consistently for at least a year
         | 
         | - Must have 2-factor enabled
         | 
         | - Filled out profile
         | 
         | etc
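
A rough sketch of how the checklist above could be expressed as a filter.
Everything here is an assumption for illustration: the `Account` fields, the
thresholds (`min_events`, `min_active_months`), and the function names are
hypothetical, and nothing in the thread says GitHub exposes or uses such data
this way.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Account:
    """Hypothetical per-account signals matching the bullets above."""
    created: date
    events_last_year: int         # pushes, comments, etc.
    active_months_last_year: int  # months with at least one event
    has_ssh_key: bool
    has_two_factor: bool
    profile_filled: bool


def is_active_developer(a: Account, today: date,
                        min_events: int = 20,
                        min_active_months: int = 9) -> bool:
    """True only if the account passes every check (thresholds are made up)."""
    older_than_two_years = (today - a.created).days > 2 * 365
    return (
        a.events_last_year >= min_events
        and a.has_ssh_key
        and older_than_two_years
        and a.active_months_last_year >= min_active_months
        and a.has_two_factor
        and a.profile_filled
    )


def active_developer_stars(stargazers: list[Account], today: date) -> int:
    """Count only the stars given by accounts that pass the filter."""
    return sum(1 for a in stargazers if is_active_developer(a, today))
```

The raw star count would stay available; this would only change which number
is shown by default.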
        
           | stevage wrote:
           | So now all the bots are pushing code, have SSH etc...
        
           | zitterbewegung wrote:
           | I've heard of people gaming GitHub stars by asking their
           | friends to star their projects, which would get around all of
           | your bullets. Hence why I said it would be hard to fight.
        
           | eddythompson80 wrote:
           | All of those are very, very, easy to automate. There are
           | plenty of bot accounts that have _unintentionally_ checked
           | the full list.
        
             | nwienert wrote:
             | You can find a set of requirements that aren't. Eg 2-factor
             | can include phone number. And activity requirements can be
             | based on repo maturity (not just pushing to random empty
             | repos).
             | 
             | And while some bot accounts may have them, I doubt many
             | have most.
             | 
             | Also, you argue on semantics but the general idea of
             | setting up a legitimacy test that factors in various things
             | is very easily doable, the factors can be kept private, and
             | you definitely can find ones that are generally hard to
             | game.
        
               | gruez wrote:
               | >You can find a set of requirements that aren't. Eg
               | 2-factor can include phone number. And activity
               | requirements can be based on repo maturity (not just
               | pushing to random empty repos).
               | 
               | Then you have people complaining about being
               | "shadowbanned" (because there's no recourse if you're a
               | person and the algorithm thinks you're not active
               | enough), or that github is being anti-privacy (by
               | requiring phone number). It's hard to win here.
        
               | wholinator2 wrote:
               | I think the point is that these requirements are not
               | published, and they are not requirements to use stars.
               | Anyone can star, no one knows whether their account is
               | contributing to the star count. Now, presumably you could
               | star a thing and check if the number went up but maybe
               | introduce slight randomness or delay to obfuscate even
               | those details. I remember when reddit removed the total
               | upvote/downvote counts from the ui
        
               | eddythompson80 wrote:
               | The point is that this is not arguing on semantics nor is
               | it as simple as just a "set of requirements" that they
               | just follow. Battling fraud online is an entire business
               | in itself. Take Spotify plays, YouTube views, Google
               | search ranking, Amazon reviews, reddit votes, etc. These
               | organizations have significantly more incentives than
               | GitHub to reduce fraud in these metrics, and while they
               | do, it's still really really hard and it's very easy to
               | show how these metrics are gamed/faked all the time.
               | 
               | It's not a matter of "here is a list of requirements that
               | no one knows about, and here is slight randomness/delay
               | to obfuscate".
               | 
               | How much do you think it takes to pay an actual human
               | from a poor country to come to work each day at 8am,
               | create one github account after another, enter them in a
               | database, and leave at 5pm?
               | 
               | If you want to "study" how github handles stars because
               | there is legitimate financial incentive for you in it,
               | for $100 a day you can pay 10 or 20 of those people to
               | create a few thousand accounts a day. Do it a few times a
               | month, and throw these accounts in an automated system
               | that creates random repos, pushes a few commits here and
               | there, etc. Also "introduce some slight randomness or
               | delay to obfuscate these events". Do some A/B testing to
               | figure out how the 300k accounts under your control affect
               | a repo star system, then advertise a "GitHub stars
               | service": "$0.50 per guaranteed star on Github". Your
               | average VC funded startup could get 10k stars for $5k.
               | They probably give AWS 10 times that a month.
               | 
               | Once github changes their requirements, do more testing,
               | figure out what the requirements now are, then you're
               | back in the game. If people do it all the time to
               | Spotify, YouTube, Google, Amazon, Reddit, and Twitter,
               | why do you think GitHub would somehow crack that nut?
        
               | JumpCrisscross wrote:
               | > _the point is that these requirements are not
               | published_
               | 
               | Well-connected people will get the tip off. And your PR
               | team will have to keep batting down conspiracy theories,
               | since if there's one thing the nutters love it's black
               | boxes.
        
           | the__alchemist wrote:
           | Hmm. I don't have SSH, but have many GH projects, and have
           | been active for a decade. So, I would be filtered out as not
           | an active dev, with the spammers?
        
             | nwienert wrote:
             | Sure, but at least stars would be net more useful.
        
         | uludag wrote:
         | I believe networks of human individuals can solve this to a
         | good degree assuming a particular topology exists.
         | 
         | Like, imagine a decent-sized group of professionals, all
         | specializing in a similar field, and having lots of strong
         | connections between each other where they have ample
         | opportunities to share information. It would be hard for an
         | outsider to come in and astroturf their product without immense
         | effort (like hiring shills to attend conferences). In-person
         | networks also obviously solve the problem of stars as
         | reputation: reputation spreads naturally in these sorts of
         | networks.
         | 
         | I think the problem comes with algorithmic scale. Maybe a
         | solution would be to have more community building activities
         | (maybe preferably offline).
        
         | aydyn wrote:
         | Requiring real ID and showing _regional_ stars like
         | Apple/Google would be a start.
        
           | eddythompson80 wrote:
           | > Requiring real ID
           | 
           | Yeah, people would love that for sure.
           | 
           | > showing _regional_ stars like Apple/Google would be a
           | start.
           | 
           | What does that mean? I thought regions only impact ranking,
           | not the net amount of stars (assuming we're talking about
           | Apple/Google Maps). And as far as I know, github doesn't do
           | ranking.
        
           | stronglikedan wrote:
           | > Requiring real ID
           | 
           | Sir, this is an HN.
        
         | mentalgear wrote:
         | That doesn't mean we shouldn't fight back. That's exactly why
         | we need research projects like these: to maintain the balance.
        
         | 1propionyl wrote:
         | Any metric that becomes a target ceases to be a good metric.
         | 
         | The wrinkle is that measures that don't easily quantify are
         | more resistant. For example, showing provable use by other
         | reputable or trusted projects, or a significant amount of
         | resources allocated to maintenance, or ...
         | 
         | Really just anything that can't be reduced to a single number
         | in a canonical way will in the long run prove far more useful
         | for longer.
         | 
         | This of course shifts some of the burden onto potential users
         | to assess things more critically, and forecloses direct
         | numerical comparison. But the idea that you could just look at
         | a number and make such comparisons was faulty from the get go.
        
         | sedatk wrote:
         | Prioritize the stars given by accounts you follow in the UI.
         | Done.
        
           | p1esk wrote:
           | I don't want to follow anyone, but I do give stars to repos I
           | like.
        
             | sedatk wrote:
             | Then you'll have to start following the creators of repos
             | you like to build a web of trust.
        
         | awkward wrote:
         | I can see github platform internals caring about this for
         | anomaly detection, but as a developer, who cares? I suppose a
         | botnet could be making fake stars on a malware project or
         | supply chain attack, but the problem there doesn't seem like
         | it's the number of stars.
        
       | dzonga wrote:
       | do stars even count ?
       | 
       | my determination to use a project is 1. the readme 2. the issues
        
         | tonymet wrote:
         | recent commits and community engagement are better indicators
        
           | Retric wrote:
           | I'd generally rather use a library that hasn't needed to
           | update in 5 years than something in active development.
        
             | insane_dreamer wrote:
             | the challenge is differentiating between "haven't needed to
             | update it in 5 years because it's still solid and compatible
             | with its ecosystem" vs "haven't updated it in 5 years
             | because of any other reason"
        
             | sixothree wrote:
             | Sounds good in theory. But almost every time I use one of
             | these projects, it's in "abandoned" status and definitely
             | needs attention. There is 1 project I can point to that I
             | use that does not actually need any maintenance and another
             | that honestly makes me _extremely_ nervous to use because
             | of lack of maintenance.
        
             | mardifoufs wrote:
             | Can you give me some examples? Because in my experience
             | even very stable, very "foundational" libs and frameworks
             | that I know about and use almost never go 5 years without
             | any commit/change. There's always either a small bug fix,
             | or some update to a build script, updated documentation, or
             | something.
             | 
             | The only repos where that's not the case are usually very
             | niche, and in that case it becomes very hard to judge if
             | the library is just very stable or a minefield of bugs and
             | undesired behavior that no one else reported because no one
             | else is using it.
        
             | tonymet wrote:
             | openssl?
        
         | renewiltord wrote:
         | It used to be a heuristic VCs would use to gauge popularity.
         | You know how it is: if you have the revenue, talk about the
         | revenue; if you only have the users, talk about the users; if
         | you only have the stars, talk about the stars hehe
        
         | muglug wrote:
         | Sometimes projects get stars just because people like the
         | personality or company behind the project.
         | 
         | Case in point: https://github.com/facebook/hhvm/. It got 15,000
         | stars in its first few years, but roughly 10 non-Facebook
         | companies actually ever used it in production, and today only
         | one non-Facebook company uses it (I work at that company).
        
           | consumer451 wrote:
           | Sometimes, they are surreal stars for surrealist languages
           | that zero people actually use:
           | 
           | https://github.com/TodePond/DreamBerd - 11.7k stars
        
           | michaelmior wrote:
           | That doesn't mean that the stars are just because people like
           | the company. People may find the technology interesting even
           | if they have no intent of using it.
        
         | wildzzz wrote:
         | A star is just a bookmark for me. It says nothing beyond "I may
         | want to look at this again". When comparing two similar
         | projects, I may look at the star counts to see which one is
         | more popular but it's probably the last metric I'd consider.
        
         | glaucon wrote:
         | I agree, I am also interested in : date of most recent
         | substantive commit; date of first commit; number of
         | contributors.
         | 
         | I don't have hard and fast rules for how I interpret those
         | values, it depends on my intentions but I find them useful
         | things to consider.
         | 
         | Going back to the readme, nothing turns me off faster than a
         | skeletal readme, it doesn't have to be "War and Peace" but it
         | needs to be more than just how to install it.
        
       | attentionmech wrote:
       | I think number of clones is a much better metric (it's like proof
       | of work, it needs compute to clone a repo). For me starring a
       | repo is like bookmarking it, nothing else. They might as well
       | just mark it as "Bookmarked" instead of "Starred".
        
         | nejsjsjsbsb wrote:
         | A better metric until it becomes a target. Once it is a target,
         | getting a billion clones is trivial.
         | 
         | Github should just stop showing star counts. Who cares about
         | them.
        
           | attentionmech wrote:
           | I think it's like an "upvote" thing which shows whether
           | historically users have found the repo interesting. Even if
           | you hide stars, there needs to be a way for the collective
           | hivemind of github users to help each other with what repos
           | are high quality or not right?
        
             | rpdillon wrote:
             | You don't need to crowdsource everything. I've never used
             | stars as a good metric because it's literally zero effort.
             | Anybody who happens by just stars it, so all you can
             | really conclude from star count is that this is interesting
             | to this number of people.
             | 
             | Two metrics that I think correlate extremely highly with
             | quality: The number of commits in the repository and the
             | date of the most recent commit. I've used a metric based on
             | those two inputs for the past 15 years to evaluate repos
             | and I am not disappointed. Depending on the nature of the
             | project, I weigh the two attributes differently. Some
             | projects are arguably, 'done', and so the date of the most
             | recent commit is not very important in that case.
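
The comment above does not give its formula, so the following is only one
hypothetical way to combine the two inputs. The name `repo_score`, the log
scaling, the 10,000-commit cap, the one-year half-life, and the default
weight are all assumptions chosen for illustration. `recency_weight` plays
the role of "weighing the two attributes differently": set it near 0 for a
project that is arguably done.

```python
import math


def repo_score(commits: int, days_since_last_commit: float,
               recency_weight: float = 0.5,
               half_life_days: float = 365.0) -> float:
    """Blend commit volume and freshness into a score in [0, 1].

    Hypothetical formula, not the commenter's actual metric.
    """
    # Commit volume on a log scale, saturating at about 10,000 commits.
    volume = min(math.log10(1 + commits) / 4.0, 1.0)
    # Freshness halves every `half_life_days` since the last commit.
    freshness = 0.5 ** (days_since_last_commit / half_life_days)
    return (1.0 - recency_weight) * volume + recency_weight * freshness
```

With `recency_weight=0.0` the date of the last commit drops out entirely, so
a stable, finished library is not penalized for being quiet.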
        
               | michaelmior wrote:
               | I think "interesting to this number of people" is not a
               | meaningless metric, but I would agree on the two other
               | metrics you cite.
        
               | ryandrake wrote:
               | There is a big difference between "highest quality" and
               | "most popular." Online services constantly confuse the
               | two because it's easier to measure popularity.
        
             | LtWorf wrote:
             | Except that most people don't bother starring stuff, so the
             | few who do are drowned by noise of fake stars.
        
           | ghxst wrote:
           | I sort by most amount of stars quite frequently when I am
           | learning a new language and want to know what the most
           | popular package is for something. What do you think would be
           | a better metric for a use case like that?
        
             | arccy wrote:
             | number of actual imports in code
        
               | flippyhead wrote:
               | CodeRank(tm)!
        
               | nejsjsjsbsb wrote:
               | This might work but biases against languages whose
               | package managers are not used in the rank. As well as
               | code that is used a lot but not referenced via code
               | directly e.g. drop in dlls.
        
               | james_marks wrote:
               | Goodhart's law - this would just cause imports in junk
               | repos
        
             | michaelmior wrote:
             | I think it's a decent metric. I agree with the other
             | comment that actual imports is probably a better metric,
             | but that's not always as trivial to find.
             | 
             | That said, the package repositories for many popular
             | languages list stats of either declared dependencies or
             | package downloads, which helps.
        
               | LtWorf wrote:
               | rdeps are completely broken in github. I wrote a library
               | that I have used in other projects of my own and it was
               | always at 0 users.
               | 
               | Anyway if stuff is used by proprietary stuff it will also
               | sit at 0.
               | 
               | I now moved to codeberg where there is less spam,
               | although it does have stars
        
             | burnte wrote:
             | Don't count any of my stars then, I thought it was a
             | bookmark feature. Every repo I've starred is only starred
             | to find later, not an endorsement from me.
        
           | LtWorf wrote:
           | VC apparently.
        
         | pan69 wrote:
         | A similar thing happens on npmjs.com where it shows downloads
         | for packages, which is often used as a metric of quality.
         | However, every time a build pipeline runs and it pulls the
         | package, that's a download.
        
           | attentionmech wrote:
           | Maybe with these rules:
           | 
           | - Per user account we only count one clone
           | - We don't count anonymous clones
           | 
           | But I agree it's not like this is also without any issues
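
The two counting rules above can be sketched in a few lines. This is only an illustration under stated assumptions: the `(repo, user_id)` event shape and the `count_clones` name are hypothetical and do not correspond to any real GitHub API.

```python
from typing import Iterable, Optional


def count_clones(events: Iterable[tuple[str, Optional[str]]]) -> dict[str, int]:
    """Count clones per repo under the two proposed rules.

    Each event is (repo, user_id); user_id is None for an anonymous clone.
    This event shape is an assumption made for the sketch.
    """
    seen: dict[str, set[str]] = {}
    for repo, user in events:
        if user is None:
            # Rule 2: anonymous clones are not counted.
            continue
        # Rule 1: a set keeps at most one clone per user account.
        seen.setdefault(repo, set()).add(user)
    return {repo: len(users) for repo, users in seen.items()}


events = [
    ("octo/demo", "alice"),
    ("octo/demo", "alice"),  # repeat clone by the same account
    ("octo/demo", None),     # anonymous clone
    ("octo/demo", "bob"),
    ("octo/other", None),    # only anonymous clones, so never listed
]
print(count_clones(events))  # {'octo/demo': 2}
```

As the comment concedes, this does not stop someone who controls many accounts; it only removes repeat and anonymous clones from the count.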
        
           | michaelmior wrote:
           | I don't think it's a useless metric and it's one I use
           | myself, but it can also be gamed pretty easily. So the more
           | people making decisions based on downloads, the higher the
           | likelihood of bots generating downloads just to juice the
           | stats.
        
           | LtWorf wrote:
           | And if your users know about "a cache" you won't get
           | downloads. So it's more beneficial if your users are the kind
           | of noobs who redownload all the crap every single time rather
           | than having fast CI
        
         | Lerc wrote:
         | The weird thing is I forked Freepascal to add an architecture
         | of a VM I had written. It wasn't really useful to anyone else,
         | but every now and again it earns a star from a random
         | passerby.
        
           | attentionmech wrote:
           | Even I am curious now. Can you share the fork with me? I want
           | to see what you added there and how it's added.
        
         | GZGavinZhao wrote:
         | *sad noises from NixOS/nixpkgs, llvm/llvm-project, and all
         | other repos with an absurd commit log/branches that takes ages
         | to do a full clone
         | 
         | (just a joke that immediately came to mind, not intended to
         | undermine OP's idea)
        
           | attentionmech wrote:
           | defaulting to a shallow clone (git clone --depth 1) in the cli
           | can be one option here.
        
         | simoncion wrote:
         | > (it's like proof of work, it needs compute to clone a repo)
         | 
         | It's github's compute, so why do I (the person who's cloning
         | the repo) care about the compute? I don't pay for it!
        
           | david_allison wrote:
           | I suspect GP is referring to counting the occurrences of `git
           | clone` [on a fork?], rather than counting forks via the
           | GitHub UI
        
         | TZubiri wrote:
         | That is absolutely the wrong takeaway. The correct takeaway is
         | that supply chain attacks and spam are real threats, and that
         | these metrics can be gamed by malicious actors.
         | 
         | The work in cloning a repo is negligible, and the requirement
         | of work is not a security design guarantee in github. The
         | actual cost of liking projects is network: malicious actors
         | need to create fake accounts, waste IP addresses and ip blocks
         | in the process. Whether you are cloning or liking is just the
         | last mile.
         | 
         | To me the takeaway is not to trust a project based on its
         | github metrics, and by extension not to trust projects just
         | because they are linked and liked in hacker news for example.
         | And to be wary of how I introduce dependencies into my
         | projects.
         | 
         | Not just because of strictly malicious dependencies, but also
         | because of trash dependencies that don't add value.
        
           | james_marks wrote:
           | > because of trash dependencies that don't add value
           | 
           | And at best, will still need maintenance in the future. One
           | of the top lessons I preach to juniors.
        
             | galangalalgol wrote:
             | I like the idea of granular permissions for libraries. When
             | you include a dependency you whitelist permissions it gets.
             | Package managers could automate this if the language
             | supports it. But making it about permissions instead of
             | metrics makes it not arbitrary. This library gets no
             | filesystem access, that one gets no network access. This
             | one runs build time system commands... Austral is the only
             | language I know of that supports such a thing. While it
             | might be possible to bolt it on to rust, I think it would
             | take so much rework to make it infeasible.
        
           | yieldcrv wrote:
           | if a supply chain attack is susceptible to that, it's purely
           | the fault of the crowd that relies on those metrics
        
           | ATechGuy wrote:
           | For all speculation around supply chain attacks with fake
           | Github stars, the article says:
           | 
           | "our study does not find any evidence of fake stars being
           | used for social engineering attacks"
        
         | robinsonb5 wrote:
         | The weird thing is I've seen enough forks that have never seen
         | any development that I'm pretty sure some people are using
         | those as bookmarks rather than stars!
        
           | neom wrote:
           | I'm not a SWE but I use github still, I thought stars ARE
           | bookmarks, what are stars then???? They're not for
           | bookmarking????
        
             | diego_sandoval wrote:
             | I would think most people use it for bookmarking, but it
             | seems like another portion of users use it as a "like"
             | button.
        
               | notpushkin wrote:
               | It is kinda both. It also reposts the project for your
               | GitHub followers.
        
             | attentionmech wrote:
             | They are currency of reputation and status. If you have
             | enough stars, you get invited to private parties with
             | elites. (I am just joking, they are bookmarks who got
             | famous)
        
             | LtWorf wrote:
             | Nobody knows but since we are at the point where you can
             | get VC money if you have enough, there is an incentive to
             | get them.
        
           | Terr_ wrote:
           | AFAIK the "fork" option also helps guard against the original
           | project getting deleted or somehow moved.
        
           | datadrivenangel wrote:
           | And forks on github have some bad ergonomics! Weird places
           | where the upstream project still has control/influence over
           | your fork. A full clone is better if you actually want
           | control over the code fork.
        
         | kube-system wrote:
         | I would imagine those figures would mostly indicate which
         | projects are most likely to be used in scripts or CI pipelines.
        
         | burnte wrote:
         | > They might as well just mark it as "Bookmarked" instead of
         | "Starred".
         | 
         | This is how I always interpreted the star feature and have used
         | it as a bookmarking feature. I didn't know it was more akin to
         | a like button!
        
         | Suppafly wrote:
         | >For me starring a repo is liking bookmarking it, nothing else.
         | 
         | Literally all I ever use the stars for, I don't know what they
         | are 'supposed' to be used for if not that.
        
         | WA wrote:
         | For me, it's "bookmarking obscure stuff". Why would I bookmark,
         | say, React? I can find this easily. I only star stuff that has
         | few stars and isn't as easy to find later.
        
       | lprd wrote:
       | Do we need that type of metric anyways? Surely there are better
       | ways to measure a repo's activity...
        
         | topspin wrote:
         | It seems like a conceptually simple problem to grade a repo
         | given the vast number of metrics available, especially
         | considering the advanced code analysis tools available today. I
         | want a top-level analysis of some sort, based on: usage by
         | other software (if applicable), activity, issue frequency and
         | resolution, derivatives (forks, etc.), number of participants,
         | code maturity, code testing, release frequency, license
         | structure and many other parameters.
         | 
         | There is an opportunity here for a third party to do this well.
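
One way such a top-level grade might be assembled is a weighted average of normalized signals. The sketch below is purely illustrative: the `RepoSignals` fields, the `WEIGHTS` values, and the `grade` function are all hypothetical, each signal is assumed to be already scaled to the range 0..1, and the weights are arbitrary rather than taken from any study.

```python
from dataclasses import dataclass


@dataclass
class RepoSignals:
    # Every field is assumed to be pre-normalized to the range 0..1.
    dependents: float        # usage by other software
    recent_activity: float   # commit and review activity
    issue_resolution: float  # share of issues that get resolved
    contributors: float      # number of participants
    release_cadence: float   # release frequency


# Arbitrary illustrative weights; a real grader would have to justify these.
WEIGHTS = {
    "dependents": 3,
    "recent_activity": 2,
    "issue_resolution": 2,
    "contributors": 1,
    "release_cadence": 1,
}


def grade(signals: RepoSignals) -> float:
    """Return the weighted average of the signals, in the range 0..1."""
    total_weight = sum(WEIGHTS.values())
    weighted = sum(getattr(signals, name) * w for name, w in WEIGHTS.items())
    return weighted / total_weight


example = RepoSignals(
    dependents=0.5,
    recent_activity=1.0,
    issue_resolution=0.5,
    contributors=0.0,
    release_cadence=0.0,
)
print(grade(example))  # 0.5
```

Any public formula like this could itself be gamed once people optimize for it, which is the same objection raised elsewhere in the thread against stars and import counts.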
        
       | ocean_moist wrote:
       | The GitHub social media features are so weird. I get around 10
       | follow requests per week from random people who follow >2k
       | people. Something off is happening there.
        
         | mattbruv wrote:
         | I have the same thing happen to me often. Sometimes I get a
         | notification on my GitHub homepage that someone followed me a
         | day or so ago, and when I click to view their profile it seems
         | that they have already unfollowed me. For example, this guy did
         | it, and he has 6K+ followers and is only following ~200:
         | https://github.com/NobleMajo. It seems weird that he would
         | follow me to unfollow me right away. I have a feeling that
         | these accounts do this intentionally to harvest followers by
         | prompting Github to show a ton of different people that he is
         | following them in order to have them follow back in exchange. I
         | think most people will follow someone back who follows them
         | without really thinking about it. In my case I investigated who
         | it was who followed me and realized he isn't actually following
         | me and is probably harvesting followers. Why would someone
         | waste time out of their life to do this? Who knows. Probably
         | want to feel special or stand out from other people without
         | doing anything to earn it.
        
       | medv wrote:
       | This means 4.5M fake accounts. GitHub does a good job of
       | detecting bots, but room for improvements still exists.
        
         | elashri wrote:
         | That's not what the paper said. The numbers are much lower
         | because not all stars are by unique accounts.
         | 
         | > In total, StarScout identified 4.53 million fake stars across
         | 22,915 repositories (before the postprocessing step designed to
         | remove spurious ones), created by 1.32 million accounts; among
         | these stars, 0.95 million are identified with the low activity
         | signature and 3.58 million are identified by the clustering
         | signature. In the postprocessing step, StarScout further
         | identified 15,835 repositories with fake star campaigns
         | (corresponding to 3.1 million fake stars coming from 278k
         | accounts).
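
The quoted figures can be checked with a little arithmetic. The sketch below uses only the rounded numbers from the quote above, so the ratios are approximate; the variable names are made up for illustration.

```python
# Rounded figures taken from the quoted passage.
fake_stars_total = 4_530_000   # before postprocessing
low_activity_stars = 950_000
clustering_stars = 3_580_000
accounts_before = 1_320_000

stars_after = 3_100_000        # after postprocessing
accounts_after = 278_000

# The two detection signatures add up to the reported total.
assert low_activity_stars + clustering_stars == fake_stars_total

# Average number of fake stars given per flagged account.
per_account_before = fake_stars_total / accounts_before
per_account_after = stars_after / accounts_after

print(f"before postprocessing: {per_account_before:.1f} stars per account")
print(f"after postprocessing:  {per_account_after:.1f} stars per account")
```

So 4.5M fake stars does not mean 4.5M fake accounts: on these numbers each flagged account gave about 3.4 fake stars on average before postprocessing, and about 11 after it.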
        
       | ashvardanian wrote:
       | Not surprising at all, honestly. The incentive to farm stars is
       | massive. According to the article, 10K stars can cost just $1K,
       | whereas achieving those numbers organically often takes years of
       | work, millions in R&D, and countless deployments. When this
       | seemingly trivial metric becomes a key factor in unlocking
       | capital from VCs, it's no wonder people resort to shortcuts. In a
       | way, the real surprise is that not everyone is buying stars.
        
       | halamadrid wrote:
       | Another interesting way - and I personally think it's fraudulent.
       | This is how it goes - run hackathons or sponsor events in
       | universities. There are a ton of colleges who are constantly
       | seeking support to run events.
       | 
       | Some companies take advantage of this by asking for stars in
       | return for sponsorship. I have seen proposals that say for a $2000
       | sponsorship - 2000 stars guaranteed. The way it works is if a
       | participant registers in the event they also have to show proof
       | that they starred a specific repo that belongs to the company.
        
       | simoncion wrote:
       | IMO, Github stars and number of "forks" are just as good a metric
       | as "number of daily downloads" of a library or Docker image or
       | similar.
       | 
       | After noticing how many, many companies run many, many builds
       | through their CI systems and (for a variety of reasons) end up
       | re-downloading everything those builds require, regardless of
       | whether or not it has changed since the last time they ran the
       | build, I've come to the firm conclusion that these metrics are
       | just plain bad if one uses them as a basis to make any
       | significant decision.
        
       | semiinfinitely wrote:
       | sometimes I star my own github repos does that count as fake
        
         | bdangubic wrote:
         | it doesn't if you really like it :)
        
       | openrisk wrote:
       | If you were wondering about fake forks, spoiler alert
       | 
       | > accounts in Cluster 1 come from merchants that only sell stars,
       | while accounts in Cluster 2 come from merchants selling stars and
       | forks simultaneously
        
       | johncoltrane wrote:
       | "$PROJECT was bookmarked 666 times with GitHub's internal
       | bookmarking mechanism" doesn't say much about a project.
       | 
       | The fact that so many people give those bookmarks so much value
       | that an entire ecosystem was built around "fake" bookmarks is
       | mind boggling.
        
       | gitgud wrote:
       | GitHub Stars are just one of many signals that describe the
       | quality of a project.
       | 
       | If a project has 10,000 stars but 1 commit and a terrible
       | README... then the star count doesn't have as much weight...
       | 
       | You can't trust any signal in isolation (like star count), but
       | looking at many signals together is quite reliable
        
       | ivanjermakov wrote:
       | In my experience, open/closed issues ratio is much more important
       | than star count.
       | 
       | Star count shows how interested people are in a project; it does
       | not signify much about its quality. I would not star the repo of
       | a tool I use every day, but would star some obscure project to
       | try it out later.
        
       | Der_Einzige wrote:
       | I wrote a whole benchmark which is not only resistant to this,
       | but would automatically detect most fake stars!
       | 
       | https://github.com/Hellisotherpeople/Bright
        
       | casenmgreen wrote:
       | I'm rather surprised it's only 4.5m.
        
       ___________________________________________________________________
       (page generated 2025-01-02 23:00 UTC)