hngopher.com

       [HN Gopher] I scraped all of OpenAI's Community Forum
       ___________________________________________________________________
        
       I scraped all of OpenAI's Community Forum
        
       Author : alt-glitch
       Score  : 236 points
       Date   : 2024-03-28 14:44 UTC (8 hours ago)
        
 (HTM) web link (julep-ai.github.io)
 (TXT) w3m dump (julep-ai.github.io)
        
       | xandrius wrote:
       | Love it, just for the sole reason of turning something OpenAI
       | made into a dataset for everyone else :D
        
         | codetrotter wrote:
         | I don't think OpenAI are gonna lose any sleep over this.
         | 
         | Isn't a "community forum" like this basically just: "we're not
         | gonna spend money on providing adequate customer support so
         | instead here is a forum where y'all can talk amongst yourselves
         | and we'll give you some badges and imaginary points for doing
         | the customer support yourselves"?
        
           | solardev wrote:
           | They probably just sic a customer service GPT on it and use
           | it to train the other ones...
        
       | miduil wrote:
       | That's an interesting write-up, I wonder how this would look for
       | other big Discourse communities such as NixOS.
        
       | throwaway98797 wrote:
       | did they have the right to use all thier data?
       | 
       | /s
        
       | SunlitCat wrote:
       | I didn't even knew they have community forums. Looking at the
       | main homepage (openai.com), the only external links I can find
       | are to chatgpt and their docs hosted on platform.openai.com. The
       | other links lead to their socials, github and soundcloud (of all
       | places).
       | 
       | Maybe I'm not looking thoroughly enough, so I may be wrong, tho!
        
         | hughesjj wrote:
         | I would also love to see these forums both to post and to lurk
        
           | djantje wrote:
           | https://community.openai.com/ (when you are logged in on
           | platform.openai.com, there is a link from the menu)
        
             | SunlitCat wrote:
             | Thank you!
             | 
             | Gone are the days when you simply saw all the important
             | links on the main page, it seems. :)
        
       | velid0 wrote:
       | Now train a gpt based on the data :D
        
         | testfrequency wrote:
         | But make sure to call it ClosedData or something so we know
         | it's not open source
         | 
         | (sorry, I think openai and sam are gross)
        
           | davely wrote:
           | Maybe I don't understand this sentiment, but are people
           | really that hung up on the name?
           | 
           | I see this sort of thing posted a lot (i.e., "it should be
           | ClosedAI instead of OpenAI, lol")
           | 
           | What if it just means "Open for Business" instead of "Open
           | Access for All"? Or maybe they should just make it an
           | acronym?
           | 
           | I'm sorry for the confusion on my part, but there's just been
           | a lot of words dedicated toward expressing frustration with
           | the company because they chose to use "open" in their name.
           | 
           | Personally, I don't find it frustrating that Apple doesn't
           | sell fruit and Intel doesn't actually give intelligence data.
        
             | rootusrootus wrote:
             | Is the frustration because of the name, or because open
             | [access] was part of their ethos at the beginning, and
             | people think they've abandoned it?
        
               | startupsfail wrote:
               | OpenAI is supposed to be a nonprofit. But, when the
               | nonprofit board tried to exercise control, it became very
               | clear that the nonprofit arm is not, in fact in control
               | any longer. The board was wiped out, nearly everyone in
               | the company seemingly was willing to join Microsoft or
               | Sam Altman or what not.
               | 
               | This doesn't seem to be compatible with continuing
               | loftily call themselves with the same name, as the
               | initial nonprofit mission.
        
             | woopsn wrote:
             | It's a gimmick. When the nonprofit was organized in 2015,
             | the name certainly did not mean open for business. It meant
             | (loftily) undertaking the quasi-religious quasi-humanist
             | mission "in the spirit of liberty" to generate a new kind
             | of super wealth as "broadly and evenly distributed as
             | possible".
             | 
             | As in prepare for the end... THE END OF HIGH PRICES!
             | 
             | > to benefit humanity as a whole, unconstrained by a need
             | to generate financial return
             | 
             | - https://openai.com/blog/introducing-openai
        
             | phyzome wrote:
             | "What if it just means" -- I mean, we don't have to ask
             | "what if". We can look at the original press release:
             | 
             | https://openai.com/blog/introducing-openai
             | 
             | << We're hoping to grow OpenAI into such an institution. As
             | a non-profit, our aim is to build value for everyone rather
             | than shareholders. Researchers will be strongly encouraged
             | to publish their work, whether as papers, blog posts, or
             | code, and our patents (if any) will be shared with the
             | world. We'll freely collaborate with others across many
             | institutions and expect to work with companies to research
             | and deploy new technologies. >>
             | 
             | They never give an explicit explanation for their name, but
             | it's pretty obvious.
        
       | enonimal wrote:
       | > Number of Posts with negative sentiment, grouped by Topic
       | 
       | > # 1 Result: Python Packaging
       | 
       | Checks out
        
         | minimaxir wrote:
         | A pro-tip for using the OpenAI API is to _not_ use the official
         | Python package for interfacing with it. The REST API
         | documentation is good, and just using it in your HTTP client of
         | choice like requests is roughly the same LOC without unexpected
         | issues, along with more control.
        
           | rockostrich wrote:
           | I've found this happens with a lot of first party clients. At
           | work, we use LaunchDarkly for feature flags and use their
           | code references tool to keep track of where flags are being
           | referenced. The tool uses their first party Go client to
           | interact with the API but the client doesn't handle rate
           | limiting at all even though they have rate limiting headers
           | clearly documented for their API.
        
         | doctorpangloss wrote:
         | The Python package is really well engineered, and the startup
         | that is making the OpenAPI client based on it, Stainless, is
         | doing a good job.
         | 
         | This shows laypeople piling into a hype thing and running
         | immediately into the roadblock of programming.
         | 
         | Normal people don't want to like, put in effort to feel like
         | they are a part of something.
         | 
         | They are used to "just" having to turn on Netflix to feel like
         | they are a part of the biggest TV show, or "just" having to
         | click a button to buy a Stanley Cup, or "just" having to click
         | a button to buy Bitcoin. The API and performance issues, IMO,
         | they're not noise, but they are meaningless. To me this also
         | signals how badly Grok and Stability are doing it, they are
         | doubling and tripling down on popular opinions that have a
         | strong, objective meaninglessness to them (like how fast the
         | tokens come out and how much porn you're allowed to make).
         | Whereas the Grok people are looking at this analysis and
         | feeling very validated right now.
         | 
         | I have no dog in this race, but I would hope that the OpenAI
         | people do not waste any time on Python APIs for dumb people;
         | instead, they should definitely improve their store and have a
         | firmer opinion on how that would look. They almost certainly
         | have a developing opinion on a programming paradigm for
         | chatbots, but I feel like they are hamstrung by needed to
         | quantize their models to meet demand, not decisions about the
         | look and feel of Python APIs or the crappiness of the Python
         | packaging ecosystem. Another POV is that the Apple development
         | experience persists to be notoriously crappy, and yet they are
         | the most valuable platform for most companies in the world
         | right now; and also, JetBrains could not sustain an audience
         | for the AppCode IDE, because everyone uses middlewares anyway;
         | so I really don't think Python APIs matter as much as the
         | community says they do. It's a Nice to Have, but it Does Not
         | Matter.
        
           | enonimal wrote:
           | we may think more similarly than you seem to think...
           | 
           | this was more a slam on python packaging in general, than it
           | is on the OpenAI implementation.
           | 
           | I wouldn't be surprised if many of the issues under this
           | topic are more related to Python package version nightmares,
           | than OpenAI's Python implementation itself.
        
       | wavyknife wrote:
       | (disclaimer: I work for Discourse)
       | 
       | Discourse has an AI plugin that admins can run on their community
       | to generate their own sentiment analysis (among other things),
       | though it's not quite as thorough as this write up!
       | https://meta.discourse.org/t/discourse-ai-plugin/259214
       | 
       | We're always interested to see how public data can be used like
       | this. It's something that can be a lot more difficult on closed
       | platforms.
        
         | Aachen wrote:
         | > helps you keep tabs on your community by analyzing posts and
         | providing sentiment and emotional scores to give you an overall
         | sense of your community for any period of time [...]
         | 
         | > Toxicity can scan both new posts and chat messages and
         | classify them on a toxicity score across a variety of labels
         | 
         | Is that within the defined data processing purposes of all
         | Discourse setups? Does the tool warn admins they might need to
         | update their policies before being able to run this tool,
         | perhaps needing to seek consent (depending on their
         | jurisdiction and ethics)? It sounds somewhat objectionable,
         | trying to guess my mental state from what I write without opt-
         | in
         | 
         | Edit: and apparently it also tries to flag NSFW chat messages,
         | does Discourse have PM chats where this would flag private
         | messages for admins to read or is it only public chats that
         | this bot runs on?
         | 
         | > tagging NSFW image content in posts and chat messages
        
           | BadHumans wrote:
           | More companies and communities than you think already do this
           | without your knowledge let alone consent.
        
             | david_allison wrote:
             | That doesn't mean we can't do better
        
               | BadHumans wrote:
               | Better at what though? I don't even think it's a problem
               | to begin with.
        
           | eddd-ddde wrote:
           | I don't think there's anything left for you to consent once
           | you decide to post on a public forum. If I can read your post
           | and guess your mental state so can any other bot.
        
           | wavyknife wrote:
           | Discourse is not a centralized platform, so it's up to
           | individual sites to ensure they're compliant with data and
           | privacy regulations.
        
           | xfalcox wrote:
           | > Is that within the defined data processing purposes of all
           | Discourse setups?
           | 
           | It's an optional plugin that can be enabled / disabled by the
           | site admin. Those modules are all disabled by default, and
           | each need to be enabled by the site owner.
           | 
           | > Edit: and apparently it also tries to flag NSFW chat
           | messages, does Discourse have PM chats where this would flag
           | private messages for admins to read or is it only public
           | chats that this bot runs on?
           | 
           | Discourse PMs can be read by admins, see the definition here:
           | https://meta.discourse.org/t/guidance-and-best-practices-
           | on-...
        
       | dorkwood wrote:
       | I did a bit of data scraping for fun in the past, but I was never
       | quite sure of the legality of what I was doing. What if I was
       | breaking some law in some jurisdiction of some country? Was
       | someone going to track me down and punish me?
       | 
       | OpenAI has taught me that no one gives a shit. Scrape the entire
       | internet if you want, and use the data for whatever you feel
       | like.
        
         | ifyoubuildit wrote:
         | Do you think it would be better if someone did track you down
         | and punish you? Which world do you want to live in?
        
           | n0sleep wrote:
           | I think large companies should be punished for stealing from
           | people to make themselves richer.
        
         | EcommerceFlow wrote:
         | A precursor to this would have been that Linkedin lawsuit
         | Microsoft lost, allowing that one company to scrape all of
         | Linkedin (technically "public information").
        
         | alt-glitch wrote:
         | We were really heading someplace with The Semantic Web aka The
         | Real Web 3.0 [1]
         | 
         | Alas we have to fight against the machines in order to properly
         | read the internet thru machines.
         | 
         | I believe Discourse knowingly keeps its data easy to scrape
         | though, so kudos to them!
         | 
         | [1]: https://en.wikipedia.org/wiki/Semantic_Web
        
         | bsuvc wrote:
         | > OpenAI has taught me that no one gives a shit. Scrape the
         | entire internet if you want, and use the data for whatever you
         | feel like.
         | 
         | Cloudflare gives a shit.
         | 
         | My household had to use our 5G internet for most things for a
         | week or two until our IP reputation recovered.
        
           | stoorafa wrote:
           | Yeah it's probably worth renting a server if there's any
           | doubt about whether it's wholly appropriate to do something
        
       | xfalcox wrote:
       | That's super cool, thanks for sharing! I will share this as an
       | easy to follow example of what we can with AI.
       | 
       | > Allowing a Q&A interface using these embeddings over the post
       | contents could speed up research over the community posts (if you
       | know the right questions to ask :P). Let's view some posts
       | similar to this one complaining about function calling
       | 
       | That's indeed a great thing to surface, and that's exactly how
       | the the OpenAI forum selects the "Related Topics" to show at the
       | end of every topic. We use embeddings for this feature, and the
       | entire thing is open-source:
       | https://github.com/discourse/discourse-ai/blob/main/lib/embe...
       | 
       | We also embeddings for suggesting tags, categories, HyDE search
       | and more. It's by far my favorite tech of this new AI/ML gen so
       | far in terms of applicability.
       | 
       | > Using Twitter-roBERTa-base for sentiment analysis, we generated
       | a post_sentiment label (negative, positive, neutral) and
       | post_sentiment_score confidence score for each post.
       | 
       | We do the same, with even the same model, and conveniently show
       | that information on the admin interface of the forum. Again all
       | open source: https://github.com/discourse/discourse-
       | ai/tree/main/lib/sent...
       | 
       | Disclaimer: I'm the tech lead on the AI parts of Discourse, the
       | open source software that powers OpenAI's community forum.
        
       | fzysingularity wrote:
       | So epic, thank you for making this dataset available to everyone!
        
       ___________________________________________________________________
       (page generated 2024-03-28 23:01 UTC)