Post B2T5lL9wErMSwDcOY4 by hunter0one@annihilation.social
(DIR) Post #B2Sx7rxe42GNSDXIi8 by HatkeshiatorTND@annihilation.social
2026-01-20T05:11:38.272351Z
0 likes, 1 repeats
https://huggingface.co/collections/PleIAs/openculture

the english subset of this is, as far as i can tell, about 1-2 terabytes of parquet. claude tells me that's a 6-7 (HAHAHAHAHAHAHAHHAHA) tb of plaintext, lowball ~4, highball ~10. take that and strip every paragraph that wouldn't be at least split 2:2 by a council on which sat murray rothbard, aristotle, lothrop stoddard, and sam colt given options of "approve" and "disapprove". the paragraph-level granularity gives you the granularity at which you can keep / discard a fair amount of the content and still keep most of the authors. a further filter is applied whereby a fairly literate 1860s georgian must also be able to read and understand the text. no unelaborated reference to uncommon knowledge, no smuggling in "democracy" or whatever as a default principle, etc.

that cuts the corpus to between two and three fifths its size

add some amount (64 gb? 512 gb? 1 tb?) of modern content, if available, via:
- *selected* public blog posts, fediverse threads, logs of e.g. usenet, e-mail, irc, imageboards, etc.
- *selected* open source projects
- offer tom woods a large sum of money to release the transcripts of his shows (tom woods show, contra krugman) and his gratis ebooks (and maybe even some of his less-performing nongratis books) into the public domain, or at least license them to you under the gpl/0bsd/whatever.
(DIR) Post #B2Sy52dI1x2JdZELrs by hunter0one@annihilation.social
2026-01-20T05:22:21.247922Z
0 likes, 1 repeats
@HatkeshiatorTND I feel like such a corpus will come around not long from now.
(DIR) Post #B2SzyYg2gwS4u0YQi0 by HatkeshiatorTND@annihilation.social
2026-01-20T05:43:32.160376Z
0 likes, 1 repeats
@hunter0one i'm telling you, it's 6 terabytes of high-signal, plain english. maybe add another two terabytes for expired patents... you could train a pythia-1.4b or similar on this and you'd have a robust, literate, racist little guy to talk to. this is all i use claude for: it's just a store of conversation trees whose other half it generates. instant, stateless, autistic, but low-sentience and sometimes retarded.

the downsides of claude are:
- i don't have the source code to modify
- it doesn't run on my computer

and by extension,
- it spies on me
- it's woke
(DIR) Post #B2T5lL9wErMSwDcOY4 by hunter0one@annihilation.social
2026-01-20T06:48:25.507838Z
0 likes, 1 repeats
@HatkeshiatorTND I quite like Gemma abliterated models as is, but imagine if it was trained on that.