[HN Gopher] Facebook wants to 'normalize' the mass scraping of p...
       ___________________________________________________________________
        
       Facebook wants to 'normalize' the mass scraping of personal data
        
       Author : gbrown_
       Score  : 136 points
       Date   : 2021-04-20 18:48 UTC (4 hours ago)
        
 (HTM) web link (www.vice.com)
 (TXT) w3m dump (www.vice.com)
        
       | kuroguro wrote:
       | Starting to feel like my login manager will soon need a 'generate
       | random profile' button next to the 'generate password' one...
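
A minimal sketch of what such a 'generate random profile' button might do,
assuming only Python's standard `secrets` module. The word lists, the
`generate_profile` name and the `example.com` domain are hypothetical
placeholders, not part of any real login manager.

```python
import secrets
import string

# Hypothetical word lists; a real tool would ship much larger ones.
ADJECTIVES = ["quiet", "amber", "brisk", "lucid", "mossy"]
NOUNS = ["heron", "canyon", "lantern", "fjord", "maple"]


def generate_profile(domain: str = "example.com") -> dict:
    """Return a throwaway identity drawn from a cryptographic RNG."""
    suffix = secrets.randbelow(10_000)
    handle = f"{secrets.choice(ADJECTIVES)}-{secrets.choice(NOUNS)}-{suffix:04d}"
    alphabet = string.ascii_letters + string.digits
    password = "".join(secrets.choice(alphabet) for _ in range(20))
    return {
        "username": handle,
        "email": f"{handle}@{domain}",
        "password": password,
    }


profile = generate_profile()
print(profile["username"], profile["email"])
```

Using a fresh handle and address per site means a scraped record from one
service cannot be trivially joined with a record from another.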
        
         | arminiusreturns wrote:
         | What a great idea actually. Tired of manually managing my
         | firefox profiles.
        
       | novok wrote:
       | Lemme mass scrape facebook and linked in then ;)
        
         | pocket_cheese wrote:
         | Scraping facebook has been one of my dreams. I remember taking
          | a cursory crack at it and giving up after seeing how much
          | time, money and effort it would involve. The cost of storing
         | the data alone would be cost prohibitive, even if you have a
         | fancy FAANG salary.
         | 
         | Scraping facebook is an operation. One I wish to make happen
         | one day :D
        
         | spaced-out wrote:
         | Courts have already said you can
         | 
         | https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn
        
         | agogdog wrote:
         | I've tried for various reasons, they apparently have a team to
         | fight scraping so it's a game of rate-limit whack-a-mole.
         | 
         | This is a decent example of the imbalance of power too...
         | Facebook could probably scrape whatever they want with little
          | resistance from average sized sites... if they really wanted to
          | they could put a team of skilled engineers on the job to get it.
         | 
         | On the other end, if you want to scrape Facebook you're staring
         | down a team of skilled engineers that's actively trying to
         | prevent you from doing much. At this point in technology
         | they're more formidable than most countries.
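
The rate limiting mentioned above is often built on something like a token
bucket. The sketch below is a generic illustration of that technique, not a
description of Facebook's actual defenses; the per-client key, the rate and
the capacity are all assumptions.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the bucket can be tested
        self.tokens = capacity
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# One bucket per client key (IP, account, session); the key is an assumption.
buckets = {}


def allow_request(client: str) -> bool:
    if client not in buckets:
        buckets[client] = TokenBucket(rate=1.0, capacity=5)
    return buckets[client].allow()
```

The "whack-a-mole" follows from the key: a scraper that rotates IPs or
accounts gets a fresh bucket each time, so the defender keeps adding new
signals to key on.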
        
         | MattGaiser wrote:
         | If you can get away with it, you can.
        
         | jandrese wrote:
         | The only thing stopping you is Facebook's tech team. It's
         | basically an arms race between the scrapers and anti-scraping
         | tech.
        
           | asdff wrote:
           | It is so frustrating how many brilliant minds on either side
           | are just wasted fighting this bullshit war and countless
           | others like it for someone else's bonus check. There are so
           | many hard engineering problems that need to be solved, and
           | it's depressing how we as a society reward who can sell the
           | most widgets instead of who can help the most people.
        
       | freshpots wrote:
       | Instagram is a prime example of it. They started the platform by
       | making it default that you can't delete old comments. You can't
       | even easily view them. That makes it easy to change a user's
       | behavior as 'that is how it always has been'. It is scary how
       | much control these platforms have and how they are increasingly
       | preventing users from removing/viewing past content.
        
       | greyscale_2 wrote:
       | I'm confused by the conflation of 'scraping' and 'leaking'. Is
       | this FB email talking about the mass scraping of information
       | users put on their public profiles or the illicit acquisition of
       | non-public user data?
        
         | MattGaiser wrote:
         | > the mass scraping of information users put on their public
         | profiles
         | 
         | I believe it is this.
        
           | kuroguro wrote:
           | > the illicit acquisition of non-public user data
           | 
           | I thought they meant this, as the phone numbers were
           | 'scraped' through one of their public facing features (I think
           | it was contacts import this time, before that a lot of phones
           | were leaked through the search bar / forgot password).
           | 
           | I think they're misusing 'scrape' here intentionally as if to
           | say they did nothing wrong.
        
       | jwalton wrote:
       | Yes, it's a broad industry issue that companies like Facebook
       | will ask us for our phone numbers and promise they will only be
       | used for authentication [1], and then leak them by publishing
       | them on the Internet and letting third parties scrape them... I'm
       | not sure it's a problem we should normalize though.
       | 
       | [1] https://nakedsecurity.sophos.com/2019/03/05/facebook-
       | critici...
        
       | aviraldg wrote:
       | A better headline would be "Facebook wants to inform users about
       | the fact that scraping publicly available information is easy
       | (and there's nothing wrong about that.)" Of course, that wouldn't
       | get Vice as many clicks.
       | 
       | Don't publish information you don't want to be part of _some_
       | database publicly on the internet. I wish schools had some sort
       | of tech literacy class where they explained this stuff to
       | people...
        
         | jwalton wrote:
         | The information that was scraped here, though, are phone
         | numbers that Facebook requires you to add to your account and
         | that they promised would not be used for anything other than
         | authentication purposes. Most of these numbers should not have
         | been publicly available in the first place.
        
       | intricatedetail wrote:
       | I will vote for any party that will make online tracking illegal.
       | Ad targeting should be limited.
        
       | Ticklee wrote:
       | They already have, the internet is a mess.
       | 
       | Every website you visit wants more and more of your data.
       | Facebook played a huge role in making this level of data sharing
       | widespread.
        
         | joe_the_user wrote:
         | I would claim the opposite. Facebook normalized the belief that
         | the information people put on their FB page was NOT being
         | scraped by many people even though it was. The rise of Facebook
         | accompanied a whole belief system about "things I share with my
         | friends on the Internet".
         | 
         | It seems like Facebook is now large enough that they're
         | effectively owning up to the unavoidable truth - there's no way
         | that information made available to all subscribers of some
         | largish social network isn't going to be public to the world.
        
           | uoaei wrote:
           | That's not an accurate framing of the situation. Sure, they
           | relied on people's technological illiteracy to do things
           | people didn't really think were possible for a while. But in
           | the face of the news about recent leaks, and the Cambridge
           | Analytica scandal in particular, they have had to switch to a
           | more active PR strategy to quell the concerns people have
           | about their product(s).
        
             | joe_the_user wrote:
             | The Cambridge Analytica scandal was three years ago and
             | this article is about PR moves Facebook is doing now.
             | 
             | And sure, I don't give every gruesome detail in the rise
             | but I'd still claim that the overall situation is that
             | Facebook is large enough and its model porous enough that
             | a variety of actors have scraped it, are scraping it and
             | will scrape it. And given this, Facebook has to start
             | owning up to an inevitable situation. Keep in mind, The
             | Cambridge Analytica scandal was predicated on Facebook's
             | claimed data model (which I'd claim isn't just false but
             | also "can't be true"). Sure, the easiest way to scrape it
             | is having API access, which it's hard not to give to your
             | advertisers. But if Facebook gave no one API access,
             | various actors would be directly scraping.
             | 
             | And overall, I'd say The Cambridge Analytica scandal was
             | the thing that wasn't a good framing of the broad problems
             | of Facebook and privacy.
             | 
             | Edit: _" But in the face of the news about recent leaks,
             | and the Cambridge Analytica scandal in particular, they
             | have had to switch to a more active PR strategy to quell
             | the concerns people have about their product(s)."_
             | 
             | And I'd say, this is again actually the wrong frame.
             | Facebook is at the center of the storm, no doubt. But there
             | is no large social network possible that wouldn't be
             | subject to the general privacy problems of Facebook.
             | Facebook created the fantasy definition of privacy,
             | Facebook violated that definition but no one could satisfy
             | it.
        
               | Retric wrote:
               | Scandals linger far longer than 3 years.
               | 
               | A great example is M&M's dye choice became controversial
               | due to customer confusion over which red dyes were
               | harmful. So, the company couldn't simply change the dye
               | because what they were using wasn't problematic. In the
               | end they had to flat out stop selling red M&M's for over
               | a decade, and their reintroduction was surprisingly
               | controversial.
        
               | joe_the_user wrote:
               | If you read my gp, I'm not arguing the Cambridge
               | Analytica scandal didn't influence Facebook. I'm arguing
               | the real, larger frame is that Facebook can't help but be
               | porous and it's acknowledging that truth out of
               | self-interest. That helps them avoid scandal, yes, but contrary
               | to the earlier poster "it's cause of scandal" or "it's
               | cause Facebook bad" is a bad, distorting frame. And that
               | isn't saying Facebook is good, it's saying the entire
               | framework of social networks and things propagating on
               | the Internet creates a certain kind of "playing field".
               | 
               | I would speculate, in fact, that Facebook is acting now to
               | make the obvious point that of course people are going to
               | be scraping the data of their site because, after X many
               | scandals, it's becoming obvious that people will do that,
               | that they will do that to any site like Facebook, and that
               | Facebook will have much clearer cover if they "normalize"
               | things that are ... fricken normal.
               | 
               | I'd further speculate that they couldn't act when
               | Cambridge Analytica was fresher because then they'd be
               | seen as being self-justifying and then they had to be
               | seen as humble and apologetic.
        
         | api wrote:
         | It's not just Facebook. Every sales, marketing, or product
         | person in the world basically has an unlimited appetite for
         | data and will push to suck up as much data as possible.
         | 
         | There is a logical reason for this: one of the toughest things
         | is knowing what your users actually want and what their actual
         | pain points are. In advertising there's an analogous problem
         | often summarized as: "I know I am wasting 80% of my ad spend,
         | but I don't know which 80%."
         | 
         | Every single incentive on the business side incentivizes data
         | grabbing. This will never change unless users vote hard with
         | their wallets or unless there is protective legislation.
        
           | Ticklee wrote:
           | I wholeheartedly think http/s is irreparably damaged, for
           | example it is impossible to find good information on search
           | engines, even the free ones like Searx. If you are able to
           | find a website you can bet it includes 10MiB of trackers and
           | ads.
           | 
           | Hopefully someone writes a better protocol with no third
           | party cookies and heavily restricted javascript.
        
             | elzbardico wrote:
             | Hate to nitpick but those things are not features of the
             | HTTP protocol from the IETF but of HTML from the W3C
        
               | Ticklee wrote:
               | Nitpick away, when I am ignorant I'd rather be told than
               | stay ignorant.
        
             | phailhaus wrote:
             | As the other poster pointed out, those are properties of
             | HTML and not HTTP/S. But what I'd like to point out is that
             | this:
             | 
             | > heavily restricted javascript
             | 
             | Is basically impossible. Any useful subset of javascript
             | would be turing-complete, and therefore enough to do
             | whatever's necessary to track the user. Literally all you
             | need to be able to do is make an HTTP request and bam, you
             | can track.
        
               | a1369209993 wrote:
               | Turing-complete is (kind of) irrelevant, the question is
               | what (equivalent of) system calls it has access to. Eg,
               | javascript should not be able to set cookies or cause
               | network traffic after page load by default.
               | 
               | > all you need to be able to do is make an HTTP request
               | 
               | Precisely. Inability to do this is (part of) what
               | "heavily restricted javascript" _means_.
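
One existing mechanism in the spirit of restricting what scripts can reach,
rather than what they can compute, is a Content-Security-Policy header. The
sketch below only assembles such a header; `build_csp` is a hypothetical
helper, and the policy is an illustration rather than a complete
anti-tracking measure (navigation and same-origin requests can still carry
data out).

```python
def build_csp(allow_inline_scripts: bool = False) -> str:
    """Assemble a restrictive Content-Security-Policy header value."""
    script_sources = ["'self'"]
    if allow_inline_scripts:
        script_sources.append("'unsafe-inline'")
    directives = {
        "default-src": ["'none'"],   # deny anything not listed below
        "script-src": script_sources,
        "style-src": ["'self'"],
        "img-src": ["'self'"],
        "connect-src": ["'none'"],   # no fetch/XHR/WebSocket from scripts
        "form-action": ["'self'"],
    }
    return "; ".join(
        f"{name} {' '.join(values)}" for name, values in directives.items()
    )


headers = {"Content-Security-Policy": build_csp()}
print(headers["Content-Security-Policy"])
```

The script stays as expressive as before; what changes is the set of
"system calls" the browser will carry out on its behalf.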
        
       | kwdc wrote:
       | How different the world would be if companies that hold data
       | about you suddenly have to pay rent unless they have specific
       | explicit permission, eg direct association. Put a stinger that
       | means all permission granted requires a complete chain of custody
       | for the data. So no data brokers lurking in the shadows. And a
       | cost for non-compliance. This might get people thinking twice
       | about building databases "just because".
       | 
       | If the database has value then perhaps it should have a regular
       | cost?
       | 
       | Who knows what data is out there? My experience with just my
       | credit reports was that the files about me were full of errors.
       | At least I was able to correct them.
       | 
       | I also discovered a bunch of linkedin-scraped data about me that
       | was posted on various contact sites. Multiple errors.
        
         | joe_the_user wrote:
         | Your plan would put Facebook, which gets data from people, in a
         | better position, since they collect data on people with those
         | people's permission, in exchange for services. They just have
         | to assign a dollar value to their services and they would have
         | fulfilled your requirements. Where yeah, it would be nice if
         | credit companies had to get permission too.
         | 
         | And nearly every website already warns me they're going to
         | collect data. With your step, the next thing is signing away
         | that rent.
         | 
         | Or, if your plan involved rent that can't be signed away, well, no
         | one would host anyone for anything since they wouldn't want to
         | pay that.
        
       | throwawayfeaxcz wrote:
       | I am kind of with Facebook on this one, if you didn't want your
       | phone scraped you shouldn't have plastered it on the internet
       | next to your name.
        
       | 1vuio0pswjnm7 wrote:
       | This short article is somewhat misleading on Facebook's position.
       | They are against scraping. Not on behalf of users but on behalf
       | of mass user data collectors, what it calls "the industry", like
       | itself. That is why they engage in "anti-scraping". That is also
       | why LinkedIn has tried to sue others for scraping LinkedIn public
       | data.
       | 
       | Facebook does not want the public, outside of "the industry", to
       | have the same public data that Facebook has collected. If
       | everyone can potentially have the same data Facebook has, data
       | collection potentially becomes democratised and the world does
       | not need Facebook nor "the industry" anymore. These advertising
       | services companies no longer have any special value.
       | 
       | The problem with Facebook, and "the industry", is data
       | collection, not lack of "anti-scraping" competence. Once the
       | sensitive data is collected by private industry on a massive
       | scale, then liability is created. The data is not any safer than
       | if a government had collected it. In some jurisdictions it is
       | less safe, because there are restrictions on this type of
       | activity by government that do not apply to companies. This
       | liability is why some people take the position that the data
       | collection Google or Facebook does to further its "business" is
       | neither harmless nor "acceptable".
       | 
       | Facebook is framing this liability problem as one of "scraping",
       | not collection. It is not trying to further the interests of
       | users but instead to further its own interests. Facebook wants
       | the courts and regulatory authorities to see mass quantities of
       | public data about internet users as Facebook's semi-exclusive
       | asset, to be protected as if it was "private" data. Facebook is
       | arguing mass public data "leaks" are not acceptable and that's
       | why "the industry" must step up its "anti-scraping" measures.
       | 
       | However "scraping" the internet for public data is not the
       | problem, it is only a symptom. Massive data collection initiated
       | by these companies about internet users, for the purpose of
       | selling advertising services, is the problem.
        
       | readflaggedcomm wrote:
       | The strategy as worded isn't wrong, assuming I understand their
       | terminology. If an account with no special privileges can access
       | the information at least once, it's essentially public
       | information.
       | 
       | People are very worried about what is and isn't public, but these
       | in-between areas where a platform puts up hurdles still aren't
       | private.
        
         | uoaei wrote:
         | Of course, Facebook designs their platform to incentivize
         | sharing publicly whenever possible, and designs dark patterns
         | to dissuade people from understanding the full extent of what
         | they can control with respect to their privacy.
        
           | joe_the_user wrote:
           | Reality also incentivizes sharing things publicly. But
           | screenshot sharing is huge on Facebook and on the Internet
           | generally.
           | 
           | The only thing kind-of-like-privacy that exists on the
           | Internet is "encrypted messages sent to well-vetted actual
           | friends" and anonymously posted things well-scrubbed of
           | identifying information. Everything else is just something to
           | make people feel better. And most people's stuff doesn't come
           | out and create a scandal because most people's stuff is
           | boring and unimportant, that's the main protection the
           | average person has.
        
       | IceWreck wrote:
       | Web scraping is legal and even if it wasn't there is no way you
       | can prevent it.
       | 
       | Don't upload stuff on a public website if you don't want it
       | scraped/harvested.
        
         | Nextgrid wrote:
         | Agreed. The problem here as I understand is that Facebook
         | misled users about how "private" their info actually was.
        
           | jandrese wrote:
           | This is why I tell people to treat anything they put on the
           | internet as public information. This includes cloud storage.
           | If you have to put it up there then you encrypt it yourself
           | before uploading. It only takes one compromised
           | person/machine in the company to undermine all of that
           | company's promises to you.
        
       | joe_the_user wrote:
       | _Facebook wants to "normalize" the idea that large scale scraping
       | of user data from social networks like its own is a common
       | occurrence_
       | 
       | Get people used to the truth? Shock, horror!
       | 
       | I mean, certainly Facebook rose to its position through a sort
       | of opposite claim, that a user could be "public" (visible to a
       | wide circle of friends-of-friends-of-etc) but not public (visible
       | to Russian hackers, Brazilian botmasters or whoever). This claim
       | is kind of a fairy tale, something that not only isn't true but
       | couldn't be true. "This information is public to anyone who
       | creates an account but not public en masse to the world". Still,
       | the claim made an average FB user feel safer (and a lot of people
       | "got on the Internet" in a big way through FB circa ~2010). And
       | it's got a lot of traction now. But since the situation is
       | fundamentally porous, now that FB is large, it seems it's in
       | their legal interest to drop the bullshit and just say "if it's
       | public, it's public, what the hell else do you expect".
       | 
       | And yeah, the exploitation of public data arguably led to all
       | sorts of bad effects and it would have been and would be nice to
       | head this off in some fashion. But imagining you can head this
       | off by maintaining a "quote-public versus totally-public"
       | distinction isn't one of those ways.
        
       ___________________________________________________________________
       (page generated 2021-04-20 23:01 UTC)