[HN Gopher] I scraped all of OpenAI's Community Forum
___________________________________________________________________
I scraped all of OpenAI's Community Forum
Author : alt-glitch
Score : 236 points
Date : 2024-03-28 14:44 UTC (8 hours ago)
(HTM) web link (julep-ai.github.io)
(TXT) w3m dump (julep-ai.github.io)
| xandrius wrote:
| Love it, just for the sole reason of turning something OpenAI
| made into a dataset for everyone else :D
| codetrotter wrote:
| I don't think OpenAI are gonna lose any sleep over this.
|
| Isn't a "community forum" like this basically just: "we're not
| gonna spend money on providing adequate customer support so
| instead here is a forum where y'all can talk amongst yourselves
| and we'll give you some badges and imaginary points for doing
| the customer support yourselves"?
| solardev wrote:
| They probably just sic a customer service GPT on it and use
| it to train the other ones...
| miduil wrote:
| That's an interesting write-up, I wonder how this would look for
| other big Discourse communities such as NixOS.
| throwaway98797 wrote:
| did they have the right to use all thier data?
|
| /s
| SunlitCat wrote:
| I didn't even knew they have community forums. Looking at the
| main homepage (openai.com), the only external links I can find
| are to chatgpt and their docs hosted on platform.openai.com. The
| other links lead to their socials, github and soundcloud (of all
| places).
|
| Maybe I'm not looking thoroughly enough, so I may be wrong, tho!
| hughesjj wrote:
| I would also love to see these forums both to post and to lurk
| djantje wrote:
| https://community.openai.com/ (when you are logged in on
| platform.openai.com, there is a link from the menu)
| SunlitCat wrote:
| Thank you!
|
| Gone are the days when you simply saw all the important
| links on the main page, it seems. :)
| velid0 wrote:
| Now train a gpt based on the data :D
| testfrequency wrote:
| But make sure to call it ClosedData or something so we know
| it's not open source
|
| (sorry, I think openai and sam are gross)
| davely wrote:
| Maybe I don't understand this sentiment, but are people
| really that hung up on the name?
|
| I see this sort of thing posted a lot (i.e., "it should be
| ClosedAI instead of OpenAI, lol")
|
| What if it just means "Open for Business" instead of "Open
| Access for All"? Or maybe they should just make it an
| acronym?
|
| I'm sorry for the confusion on my part, but there's just been
| a lot of words dedicated toward expressing frustration with
| the company because they chose to use "open" in their name.
|
| Personally, I don't find it frustrating that Apple doesn't
| sell fruit and Intel doesn't actually give intelligence data.
| rootusrootus wrote:
| Is the frustration because of the name, or because open
| [access] was part of their ethos at the beginning, and
| people think they've abandoned it?
| startupsfail wrote:
| OpenAI is supposed to be a nonprofit. But, when the
| nonprofit board tried to exercise control, it became very
| clear that the nonprofit arm is not, in fact in control
| any longer. The board was wiped out, nearly everyone in
| the company seemingly was willing to join Microsoft or
| Sam Altman or what not.
|
| This doesn't seem to be compatible with continuing
| loftily call themselves with the same name, as the
| initial nonprofit mission.
| woopsn wrote:
| It's a gimmick. When the nonprofit was organized in 2015,
| the name certainly did not mean open for business. It meant
| (loftily) undertaking the quasi-religious quasi-humanist
| mission "in the spirit of liberty" to generate a new kind
| of super wealth as "broadly and evenly distributed as
| possible".
|
| As in prepare for the end... THE END OF HIGH PRICES!
|
| > to benefit humanity as a whole, unconstrained by a need
| to generate financial return
|
| - https://openai.com/blog/introducing-openai
| phyzome wrote:
| "What if it just means" -- I mean, we don't have to ask
| "what if". We can look at the original press release:
|
| https://openai.com/blog/introducing-openai
|
| << We're hoping to grow OpenAI into such an institution. As
| a non-profit, our aim is to build value for everyone rather
| than shareholders. Researchers will be strongly encouraged
| to publish their work, whether as papers, blog posts, or
| code, and our patents (if any) will be shared with the
| world. We'll freely collaborate with others across many
| institutions and expect to work with companies to research
| and deploy new technologies. >>
|
| They never give an explicit explanation for their name, but
| it's pretty obvious.
| enonimal wrote:
| > Number of Posts with negative sentiment, grouped by Topic
|
| > # 1 Result: Python Packaging
|
| Checks out
| minimaxir wrote:
| A pro-tip for using the OpenAI API is to _not_ use the official
| Python package for interfacing with it. The REST API
| documentation is good, and just using it in your HTTP client of
| choice like requests is roughly the same LOC without unexpected
| issues, along with more control.
| rockostrich wrote:
| I've found this happens with a lot of first party clients. At
| work, we use LaunchDarkly for feature flags and use their
| code references tool to keep track of where flags are being
| referenced. The tool uses their first party Go client to
| interact with the API but the client doesn't handle rate
| limiting at all even though they have rate limiting headers
| clearly documented for their API.
| doctorpangloss wrote:
| The Python package is really well engineered, and the startup
| that is making the OpenAPI client based on it, Stainless, is
| doing a good job.
|
| This shows laypeople piling into a hype thing and running
| immediately into the roadblock of programming.
|
| Normal people don't want to like, put in effort to feel like
| they are a part of something.
|
| They are used to "just" having to turn on Netflix to feel like
| they are a part of the biggest TV show, or "just" having to
| click a button to buy a Stanley Cup, or "just" having to click
| a button to buy Bitcoin. The API and performance issues, IMO,
| they're not noise, but they are meaningless. To me this also
| signals how badly Grok and Stability are doing it, they are
| doubling and tripling down on popular opinions that have a
| strong, objective meaninglessness to them (like how fast the
| tokens come out and how much porn you're allowed to make).
| Whereas the Grok people are looking at this analysis and
| feeling very validated right now.
|
| I have no dog in this race, but I would hope that the OpenAI
| people do not waste any time on Python APIs for dumb people;
| instead, they should definitely improve their store and have a
| firmer opinion on how that would look. They almost certainly
| have a developing opinion on a programming paradigm for
| chatbots, but I feel like they are hamstrung by needed to
| quantize their models to meet demand, not decisions about the
| look and feel of Python APIs or the crappiness of the Python
| packaging ecosystem. Another POV is that the Apple development
| experience persists to be notoriously crappy, and yet they are
| the most valuable platform for most companies in the world
| right now; and also, JetBrains could not sustain an audience
| for the AppCode IDE, because everyone uses middlewares anyway;
| so I really don't think Python APIs matter as much as the
| community says they do. It's a Nice to Have, but it Does Not
| Matter.
| enonimal wrote:
| we may think more similarly than you seem to think...
|
| this was more a slam on python packaging in general, than it
| is on the OpenAI implementation.
|
| I wouldn't be surprised if many of the issues under this
| topic are more related to Python package version nightmares,
| than OpenAI's Python implementation itself.
| wavyknife wrote:
| (disclaimer: I work for Discourse)
|
| Discourse has an AI plugin that admins can run on their community
| to generate their own sentiment analysis (among other things),
| though it's not quite as thorough as this write up!
| https://meta.discourse.org/t/discourse-ai-plugin/259214
|
| We're always interested to see how public data can be used like
| this. It's something that can be a lot more difficult on closed
| platforms.
| Aachen wrote:
| > helps you keep tabs on your community by analyzing posts and
| providing sentiment and emotional scores to give you an overall
| sense of your community for any period of time [...]
|
| > Toxicity can scan both new posts and chat messages and
| classify them on a toxicity score across a variety of labels
|
| Is that within the defined data processing purposes of all
| Discourse setups? Does the tool warn admins they might need to
| update their policies before being able to run this tool,
| perhaps needing to seek consent (depending on their
| jurisdiction and ethics)? It sounds somewhat objectionable,
| trying to guess my mental state from what I write without opt-
| in
|
| Edit: and apparently it also tries to flag NSFW chat messages,
| does Discourse have PM chats where this would flag private
| messages for admins to read or is it only public chats that
| this bot runs on?
|
| > tagging NSFW image content in posts and chat messages
| BadHumans wrote:
| More companies and communities than you think already do this
| without your knowledge let alone consent.
| david_allison wrote:
| That doesn't mean we can't do better
| BadHumans wrote:
| Better at what though? I don't even think it's a problem
| to begin with.
| eddd-ddde wrote:
| I don't think there's anything left for you to consent once
| you decide to post on a public forum. If I can read your post
| and guess your mental state so can any other bot.
| wavyknife wrote:
| Discourse is not a centralized platform, so it's up to
| individual sites to ensure they're compliant with data and
| privacy regulations.
| xfalcox wrote:
| > Is that within the defined data processing purposes of all
| Discourse setups?
|
| It's an optional plugin that can be enabled / disabled by the
| site admin. Those modules are all disabled by default, and
| each need to be enabled by the site owner.
|
| > Edit: and apparently it also tries to flag NSFW chat
| messages, does Discourse have PM chats where this would flag
| private messages for admins to read or is it only public
| chats that this bot runs on?
|
| Discourse PMs can be read by admins, see the definition here:
| https://meta.discourse.org/t/guidance-and-best-practices-
| on-...
| dorkwood wrote:
| I did a bit of data scraping for fun in the past, but I was never
| quite sure of the legality of what I was doing. What if I was
| breaking some law in some jurisdiction of some country? Was
| someone going to track me down and punish me?
|
| OpenAI has taught me that no one gives a shit. Scrape the entire
| internet if you want, and use the data for whatever you feel
| like.
| ifyoubuildit wrote:
| Do you think it would be better if someone did track you down
| and punish you? Which world do you want to live in?
| n0sleep wrote:
| I think large companies should be punished for stealing from
| people to make themselves richer.
| EcommerceFlow wrote:
| A precursor to this would have been that Linkedin lawsuit
| Microsoft lost, allowing that one company to scrape all of
| Linkedin (technically "public information").
| alt-glitch wrote:
| We were really heading someplace with The Semantic Web aka The
| Real Web 3.0 [1]
|
| Alas we have to fight against the machines in order to properly
| read the internet thru machines.
|
| I believe Discourse knowingly keeps its data easy to scrape
| though, so kudos to them!
|
| [1]: https://en.wikipedia.org/wiki/Semantic_Web
| bsuvc wrote:
| > OpenAI has taught me that no one gives a shit. Scrape the
| entire internet if you want, and use the data for whatever you
| feel like.
|
| Cloudflare gives a shit.
|
| My household had to use our 5G internet for most things for a
| week or two until our IP reputation recovered.
| stoorafa wrote:
| Yeah it's probably worth renting a server if there's any
| doubt about whether it's wholly appropriate to do something
| xfalcox wrote:
| That's super cool, thanks for sharing! I will share this as an
| easy to follow example of what we can with AI.
|
| > Allowing a Q&A interface using these embeddings over the post
| contents could speed up research over the community posts (if you
| know the right questions to ask :P). Let's view some posts
| similar to this one complaining about function calling
|
| That's indeed a great thing to surface, and that's exactly how
| the the OpenAI forum selects the "Related Topics" to show at the
| end of every topic. We use embeddings for this feature, and the
| entire thing is open-source:
| https://github.com/discourse/discourse-ai/blob/main/lib/embe...
|
| We also embeddings for suggesting tags, categories, HyDE search
| and more. It's by far my favorite tech of this new AI/ML gen so
| far in terms of applicability.
|
| > Using Twitter-roBERTa-base for sentiment analysis, we generated
| a post_sentiment label (negative, positive, neutral) and
| post_sentiment_score confidence score for each post.
|
| We do the same, with even the same model, and conveniently show
| that information on the admin interface of the forum. Again all
| open source: https://github.com/discourse/discourse-
| ai/tree/main/lib/sent...
|
| Disclaimer: I'm the tech lead on the AI parts of Discourse, the
| open source software that powers OpenAI's community forum.
| fzysingularity wrote:
| So epic, thank you for making this dataset available to everyone!
___________________________________________________________________
(page generated 2024-03-28 23:01 UTC)