Post B0FJh3r0cphZ4on0ds by ricci@discuss.systems
 (DIR) Post #B0DzkhpkFJtjhNhOD2 by ricci@discuss.systems
       2025-11-14T04:16:10Z
       
       1 likes, 0 repeats
       
       It sometimes surprises me to learn that there are people who don't know that one of the first really big datasets used to train and evaluate computer language and social models was (and still is) a bunch of internal emails from Enron. Yes, that Enron. Collected as part of the investigation into its collapse. https://en.wikipedia.org/wiki/Enron_Corpus
       
 (DIR) Post #B0Dzzp05sjvQGokNvc by falcennial@mastodon.social
       2025-11-14T04:18:53Z
       
       0 likes, 0 repeats
       
       @ricci I was one of them people who did not know right up until now! that's the power of telling people. I'm going to boost this intel and share it around.
       
 (DIR) Post #B0E07KHXybBHJbUOBs by ricci@discuss.systems
       2025-11-14T04:20:16Z
       
       0 likes, 0 repeats
       
       @falcennial It's kind of ironic, given that some people are pointing out comparisons between 2001 Enron and 2025 NVIDIA in terms of their business model. Time is a flat circle.
       
 (DIR) Post #B0E0Q9hRXsJ3zra34i by jawnsy@mastodon.social
       2025-11-14T04:23:39Z
       
       0 likes, 0 repeats
       
       @ricci oh wow, that's cool
       
 (DIR) Post #B0E0RvccrBoSumAkq0 by kims@mas.to
       2025-11-14T04:24:00Z
       
       0 likes, 0 repeats
       
       @ricci Wow, I did not know that
       And had you given me a five-answer multiple choice, I probably would have guessed wrong four times
       [This will make for a great trivia question next time I write a quiz: "The internal emails from which company...." Thank you.]
       
 (DIR) Post #B0E0U3DhZwJB3hWiJc by ricci@discuss.systems
       2025-11-14T04:24:15Z
       
       0 likes, 0 repeats
       
       @hypostase and that was resurrected in 2024 to pump a crypto rugpull? the very same
       
 (DIR) Post #B0E0XpmtShUsYbn7ei by ricci@discuss.systems
       2025-11-14T04:25:04Z
       
       0 likes, 0 repeats
       
       @jawnsy Where do you get a bunch of public-domain text with implied social connections? Court filings, it turns out.
       
 (DIR) Post #B0E0kK1AtMoln2QSNE by jawnsy@mastodon.social
       2025-11-14T04:27:18Z
       
       0 likes, 0 repeats
       
       @ricci I mostly want a model trained on your and @dan 's shitposts, including the unpublished drafts
       
 (DIR) Post #B0E0wmM7MIHN9YA5CK by ricci@discuss.systems
       2025-11-14T04:29:34Z
       
       0 likes, 0 repeats
       
       @jawnsy @dan When I was teaching myself to build neural nets I tried to download an archive of my toots to train my own personal toot machine. However, the archive function here on discuss was broken (still is)
       I made a spam filter instead
       
 (DIR) Post #B0E1Il1gywoNOu5W5o by jawnsy@mastodon.social
       2025-11-14T04:33:26Z
       
       0 likes, 0 repeats
       
       @ricci I think it would be convenient if @dan could blame his comically large shipping orders on runaway agents, but I think the human is fully accountable for each and every one of those
       
 (DIR) Post #B0E25J0hvuVshUmES0 by KentNavalesi@mstdn.social
       2025-11-14T04:42:15Z
       
       0 likes, 0 repeats
       
       @ricci . . . I knew that.
       
 (DIR) Post #B0E27AMJ4R7aRmlbaC by ricci@discuss.systems
       2025-11-14T04:42:40Z
       
       0 likes, 0 repeats
       
       @KentNavalesi ⭐
       
 (DIR) Post #B0E4h0hCfsBYoMlE0W by MrBirch@beige.party
       2025-11-14T05:11:30Z
       
       0 likes, 0 repeats
       
       @ricci Yup. I was at a startup in the 2009-2012 timeframe and that was one of the main data sets we used to train our models. We were working on business oriented communication apps, so it was very appropriate for the task. Sadly, we never got the product to market.
       
 (DIR) Post #B0E4rHkQ8AFkaoAHxY by ricci@discuss.systems
       2025-11-14T05:13:22Z
       
       0 likes, 0 repeats
       
       @Virginicus 👏
       
 (DIR) Post #B0E560a2F1QzgzKO1o by Npars01@mstdn.social
       2025-11-14T05:16:00Z
       
       0 likes, 0 repeats
       
       @ricci @jawnsy Wondering if the following investigations into tax evasion & financial fraud ever ran their database through AI?
       Lux Leaks
       https://www.icij.org/investigations/luxembourg-leaks/new-leak-reveals-luxembourg-tax-deals-disney-koch-brothers-empire/
       https://www.icij.org/investigations/luxembourg-leaks/ten-years-on-lux-leaks-remains-a-byword-for-corporate-tax-chicanery/
       https://en.wikipedia.org/wiki/LuxLeaks
       Paradise Papers
       https://www.democracynow.org/2017/11/8/headlines/paradise_papers_reveal_wealthy_donors_funneled_350m_into_2016_election
       https://www.theguardian.com/news/2017/nov/07/us-republican-donors-offshore-paradise-papers
       https://en.wikipedia.org/wiki/Paradise_Papers
       https://en.wikipedia.org/wiki/List_of_people_and_organisations_named_in_the_Paradise_Papers
       Panama Papers
       https://en.wikipedia.org/wiki/Panama_Papers
       https://www.salon.com/2016/05/17/how_koch_industries_is_scamming_america_investigation_highlights_a_global_web_of_tax_avoidance/
       https://www.icij.org/inside-icij/2025/05/how-a-pentagon-contractor-built-a-global-empire-and-a-massive-tax-evasion-scheme/
       Pandora Papers
       https://theconversation.com/the-pandora-papers-how-punishing-tax-cheats-can-serve-as-a-deterrent-170435
       https://law.queensu.ca/news/Pandora-Papers-reveals-parallel-financial-world-for-the-haves1/
       
 (DIR) Post #B0E5FvjB8bAdXRKsAC by ricci@discuss.systems
       2025-11-14T05:17:51Z
       
       0 likes, 0 repeats
       
       @Npars01 @jawnsy I don't know but I'd bet money (not a lot, but still) that all of these leaks are in the training sets for the LLMs on the market now
       
 (DIR) Post #B0EGxbWkhcgp7pYFFY by pluralistic@mamot.fr
       2025-11-14T07:28:57Z
       
       1 likes, 1 repeats
       
       @ricci Not collected - the company declared bankruptcy and got sued. As part of discovery, litigants demanded searches of the company's email. The bankruptcy trustees decided it would cost too much to hire outside counsel to review and redact irrelevant (but often highly sensitive and personal) messages in Enron's Outlook trove, and dumped dox on all of its workers, including things like personal emails agonizing about illness, divorce, etc.
       
 (DIR) Post #B0EHAzX8l8AhLO7H5U by InarticulateOtter@mastodon.social
       2025-11-14T07:31:22Z
       
       0 likes, 0 repeats
       
       @ricci xkcd 2501
       
 (DIR) Post #B0EJG7e2EGiBrAupLU by david_chisnall@infosec.exchange
       2025-11-14T07:54:44Z
       
       0 likes, 1 repeats
       
       @ricci Not just natural language processing. It’s also the largest public archive of spreadsheets. When I was at Microsoft, a bunch of projects used it. For example, when the TypeScript version of the Excel calc engine wanted to see how good their coverage was, they tried to see how many of the Enron sheets they could correctly calculate (as in, give the same answers as desktop Excel, not give the answer without all of the fraud).
       
 (DIR) Post #B0EJbNwioenAUnnMSO by swelljoe@mas.to
       2025-11-14T07:58:34Z
       
       0 likes, 0 repeats
       
       @ricci I bought my first Aeron chair from the Arthur Andersen bankruptcy auction. I thought about bidding on a big shredder, too, but my dad said, "Those shredders are worn out, son."
       But, I did not know about the Enron Corpus.
       
 (DIR) Post #B0ENQ7IUm81dUXrUZc by toxy@mastodon.acc.sunet.se
       2025-11-14T08:41:19Z
       
       0 likes, 0 repeats
       
       @ricci I had no idea. Fascinating.
       
 (DIR) Post #B0EQ5vkGJeQrjZup4i by gonzo_askold@mastodon.social
       2025-11-14T06:50:13Z
       
       0 likes, 0 repeats
       
       @ricci https://en.wikipedia.org/wiki/The_Pile_(dataset)
       > Some potential sub-datasets were excluded for various reasons, such as the US Congressional Record, which was excluded due to its racist content.[1]
       it's so very funny
       
 (DIR) Post #B0EQ5xF8kKj4NdMzqK by ricci@discuss.systems
       2025-11-14T09:11:18Z
       
       0 likes, 0 repeats
       
       @gonzo_askold a very 😭 kind of funny
       
 (DIR) Post #B0EcJIY82YkpfuNfvc by CaaS@infosec.exchange
       2025-11-14T11:28:12Z
       
       0 likes, 0 repeats
       
       @ricci So, one of the first things we train them on is lying, deceit, and fraud.
       
 (DIR) Post #B0EdLNyOwhSP1uEFyC by gonzo_askold@mastodon.social
       2025-11-14T11:39:48Z
       
       0 likes, 0 repeats
       
       @ricci it's called type 3 fun, grandpa
       
 (DIR) Post #B0ErK7gjPFVBajFw8G by ricci@discuss.systems
       2025-11-14T14:16:26Z
       
       1 likes, 0 repeats
       
       @pluralistic it amuses me greatly that I have one (1) mailbox today that is larger than this entire corpus
       
 (DIR) Post #B0F5PFCKJOelgjN2QK by tomjennings@tldr.nettime.org
       2025-11-14T16:54:16Z
       
       0 likes, 0 repeats
       
       @ricci The Toyota "unintended acceleration" lawsuit was fascinating reading. @jawnsy
       
 (DIR) Post #B0FFmAX8op0zQMg0Se by jackwilliambell@rustedneuron.com
       2025-11-14T18:50:14Z
       
       0 likes, 0 repeats
       
       @ricci TIL… 😲
       
 (DIR) Post #B0FI8NQVlMyZgsIDAG by jonathankoren@sfba.social
       2025-11-14T19:16:50Z
       
       0 likes, 0 repeats
       
       @ricci flipped through it in grad school, but never did anything with it.
       it has Ken Lay’s passcode for his house’s security gate in it
       
 (DIR) Post #B0FJh3r0cphZ4on0ds by ricci@discuss.systems
       2025-11-14T19:34:19Z
       
       0 likes, 0 repeats
       
       @jonathankoren just imagine, all future LLMs will have the Epstein files in their training sets
       
 (DIR) Post #B0FLR8AV16JZjTRBDM by dermoth@noc.social
       2025-11-14T16:09:28Z
       
       0 likes, 0 repeats
       
       @ricci @falcennial So we are training our AI with a model of failure. And then ppl expect that to take over our work? What could possibly go wrong?... 🤔
       
 (DIR) Post #B0FQjRxueNQS7C6YiW by ricci@discuss.systems
       2025-11-14T20:53:12Z
       
       0 likes, 0 repeats
       
       @standefer That's probably where I heard about it too
       
 (DIR) Post #B0FRAJ6JiSSiRPqHTc by silvermoon82@wandering.shop
       2025-11-14T20:58:02Z
       
       0 likes, 0 repeats
       
       @ricci That being a foundation for LLM research explains an awful lot about the state of the industry and the world today.
       
 (DIR) Post #B0FRClSZFWZmyRQCGG by ricci@discuss.systems
       2025-11-14T20:58:31Z
       
       0 likes, 0 repeats
       
       @silvermoon82 Yes, it does seem ... worth noting ... doesn't it?
       
 (DIR) Post #B0FUhdAkPZhS6uUpG4 by viq@social.hackerspace.pl
       2025-11-14T21:37:10Z
       
       0 likes, 0 repeats
       
       @ricci @pluralistic I need to look, I think I may have that in unread messages in one of my mail folders 🤣😭
       
 (DIR) Post #B0FjVm3D4gVdU3m6BE by jonathankoren@sfba.social
       2025-11-15T00:23:36Z
       
       0 likes, 0 repeats
       
       @ricci It will be glorious. https://slate.com/news-and-politics/2025/11/donald-trump-jeffrey-epstein-emails-files-news.html
       
 (DIR) Post #B0bwLQBszxB8zT5BYG by keithpjolley@discuss.systems
       2025-11-25T17:30:59Z
       
       0 likes, 0 repeats
       
       @ricci i was working at qcom when i started on my CS masters. at qcom we had used sendmail and had a mailing list for EVERYTHING, all of which was stored in NFS, dating back to eternity. i used this as my dataset for my thesis on graph theory (community detection). i plucked out the from, to, cc, and subject lines and dumped them into an enormous sqlite db. from that i used d3 and perl to create a website where you could type in a regex and it would create a graph from the results. it was incredible how you could type in a name of a project (like a modem) and have the entire project teams laid out for you. san diego execs, san diego hw engineering, bangalore hw eng, testers, finance team, marketing, and all the leads and SMEs highlighted. we had a lot of recreational mailing lists too so you could do things like "photography," "skiing," and "vlsi" to find a chip designer who liked to take pictures in the snow.
       when i left qcom i switched over to the enron dataset but since i didn't really know any of the people or projects it wasn't as meaningful to me. to demo that i'd type in "california" and the names that popped up were the same names that had been in the newspapers during the whole gray davis power fiasco.
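
A rough sketch of the pipeline described in the post above, for anyone who wants to try it on the Enron corpus. This is not the original code (that used perl and d3). It is a minimal Python stand-in using only the standard library. The table layout, the function names `load_headers` and `graph_for`, the sample messages, and the choice to match the regex against the subject line are all assumptions made for illustration.

```python
import re
import sqlite3
from collections import Counter
from email import message_from_string
from email.utils import getaddresses


def load_headers(conn, raw_messages):
    """Store one (sender, recipient, subject) row per From -> To/Cc pair.

    The table layout is hypothetical; the post only says that the from, to,
    cc, and subject lines went into sqlite.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mail (sender TEXT, recipient TEXT, subject TEXT)"
    )
    for raw in raw_messages:
        msg = message_from_string(raw)
        senders = getaddresses(msg.get_all("From", []))
        recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
        subject = msg.get("Subject", "")
        for _, sender in senders:
            for _, recipient in recipients:
                conn.execute(
                    "INSERT INTO mail VALUES (?, ?, ?)",
                    (sender.lower(), recipient.lower(), subject),
                )
    conn.commit()


def graph_for(conn, pattern):
    """Return weighted directed edges for messages whose subject matches pattern.

    Matching on the subject only is an assumption; the post does not say
    which fields the regex was applied to.
    """
    rx = re.compile(pattern, re.IGNORECASE)
    edges = Counter()
    rows = conn.execute("SELECT sender, recipient, subject FROM mail")
    for sender, recipient, subject in rows:
        if rx.search(subject):
            edges[(sender, recipient)] += 1
    return edges


# Made-up sample messages; real input would be the raw message files.
SAMPLE = [
    "From: alice@example.com\nTo: bob@example.com\nCc: carol@example.com\n"
    "Subject: modem bring-up schedule\n\nbody",
    "From: bob@example.com\nTo: alice@example.com\n"
    "Subject: Re: modem bring-up schedule\n\nbody",
    "From: dave@example.com\nTo: carol@example.com\n"
    "Subject: skiing this weekend\n\nbody",
]

conn = sqlite3.connect(":memory:")
load_headers(conn, SAMPLE)
modem_edges = graph_for(conn, r"modem")
for (sender, recipient), weight in sorted(modem_edges.items()):
    print(f"{sender} -> {recipient} ({weight})")
```

The edge counts are the kind of input a graph renderer or a community detection step would take next; neither is shown here.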
       
 (DIR) Post #B0bwcTAhAXHF9reewa by keithpjolley@discuss.systems
       2025-11-25T17:34:08Z
       
       0 likes, 0 repeats
       
       @ricci @pluralistic probably one attachment. a spreadsheet listing all the places nearby to order lunch from.