Post Ak7TBFiXAL2xSYABma by emilvolk@furry.engineer
 (DIR) Post #Ak7QU6Yz7mtYhyzjea by mothcompute@vixen.zone
       2024-07-20T10:19:09Z
       
       0 likes, 0 repeats
       
       there's probably a way to turn any binary data into a series of sentences that ai scrapers will absorb into their datasets. if you took the base64 or such of some binary data and used each character to index into a two-dimensional array of strings, where the other index is the number of words you've added to the current sentence, then with the right tables you could make sentences complex enough that bad nlp would let them be scraped. if you have access to the table, this transform is reversible
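
       a minimal Python sketch of that transform, assuming a hypothetical TABLE indexed by base64 symbol and by word position within the sentence; the table contents and the sentence length are placeholder assumptions, not part of the original idea:

       import base64

       WORDS_PER_SENTENCE = 8
       B64_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

       # Hypothetical lookup table: TABLE[symbol][position] -> word. Generated
       # mechanically here; a real table would hold natural-sounding words.
       TABLE = [[f"w{sym:02d}p{pos}" for pos in range(WORDS_PER_SENTENCE)]
                for sym in range(64)]

       def encode(data: bytes) -> str:
           symbols = base64.b64encode(data).decode().rstrip("=")
           sentences, words = [], []
           for i, ch in enumerate(symbols):
               pos = i % WORDS_PER_SENTENCE          # position within the current sentence
               words.append(TABLE[B64_ALPHABET.index(ch)][pos])
               if pos == WORDS_PER_SENTENCE - 1:
                   sentences.append(" ".join(words) + ".")
                   words = []
           if words:
               sentences.append(" ".join(words) + ".")
           return " ".join(sentences)

       def decode(text: str) -> bytes:
           symbols = []
           for sentence in text.split("."):
               for pos, word in enumerate(sentence.split()):
                   # reverse lookup: which base64 symbol maps to this word at this position?
                   sym = next(s for s in range(64) if TABLE[s][pos] == word)
                   symbols.append(B64_ALPHABET[sym])
           b64 = "".join(symbols)
           return base64.b64decode(b64 + "=" * (-len(b64) % 4))

       assert decode(encode(b"hello, world")) == b"hello, world"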
       
 (DIR) Post #Ak7QlRgxp11XA1zmq0 by emilvolk@furry.engineer
       2024-07-20T10:22:17Z
       
       0 likes, 0 repeats
       
       @mothcompute ngl this is amazing
       
 (DIR) Post #Ak7RUMKEcadnY1mSZ6 by mothcompute@vixen.zone
       2024-07-20T10:30:25Z
       
       0 likes, 0 repeats
       
       @emilvolk i have been thinking of ways to guard code from scraping for ages now, so it is nice to hit upon something that might actually, like, work while not being a complete pain to actually use. all this would need in order to work is a sufficiently large list of sentences of various lengths
       
 (DIR) Post #Ak7SHOXTBFXgbPLxFg by emilvolk@furry.engineer
       2024-07-20T10:39:15Z
       
       0 likes, 0 repeats
       
       @mothcompute connect with a cryptographic analyst/mathematician; this is seriously good shit.
       
 (DIR) Post #Ak7Sl7W722XEUttOVc by mothcompute@vixen.zone
       2024-07-20T10:44:39Z
       
       0 likes, 0 repeats
       
       @emilvolk i wish i knew any. the best i've got so far is that by using something like base64 you not only reduce the size of the lookup table, you also somewhat alleviate the distribution issues of bytewise lookups when encoding something like ascii, where all the character values are less than 128, since base64 obviously does not respect byte alignment. gzipping the data, or prepending a public key and encrypting the following data, before encoding it this way would also increase the entropy
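
       a short sketch of the gzip preprocessing step, reusing the encode/decode names from the earlier sketch (the encrypt-then-encode variant would slot into the same place):

       import gzip

       # Compressing first flattens the symbol distribution of plain-ascii inputs
       # before the base64 stage; both steps stay reversible.
       def encode_compressed(data: bytes) -> str:
           return encode(gzip.compress(data))

       def decode_compressed(text: str) -> bytes:
           return gzip.decompress(decode(text))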
       
 (DIR) Post #Ak7TBFiXAL2xSYABma by emilvolk@furry.engineer
       2024-07-20T10:49:21Z
       
       0 likes, 0 repeats
       
       @mothcompute give me some time; I will think about this from the mathematical point of view.
       
 (DIR) Post #Ak7TU4CeCRx3FbDhbM by mothcompute@vixen.zone
       2024-07-20T10:52:46Z
       
       0 likes, 0 repeats
       
       @emilvolk i haven't thought of any good ways to increase the computational complexity of the decoder; currently it just comes from having to manually look up the indices of each word in the sentence. using 3- or 4-dimensional tables with random or cryptographically derived indices might also help, but it makes creating the tables much harder, the entire idea gets into proof-of-work territory, and i don't know how i feel about reinventing the principle underlying cryptocurrency
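
       one speculative way a cryptographically derived extra index could work, sketched under the assumption of a shared secret key and a table with a third dimension of DEPTH layers:

       import hashlib, hmac

       DEPTH = 4  # hypothetical size of the third table dimension

       # The layer index depends only on the key and the word position, so a
       # decoder holding the key knows which layer each word was drawn from.
       def layer_index(key: bytes, position: int) -> int:
           digest = hmac.new(key, position.to_bytes(8, "big"), hashlib.sha256).digest()
           return digest[0] % DEPTH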
       
 (DIR) Post #Ak7WFCQfdHReHZNiUq by emilvolk@furry.engineer
       2024-07-20T11:23:40Z
       
       0 likes, 0 repeats
       
       @mothcompute one-time pads might be a solution
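
       a minimal sketch of the one-time-pad idea: XOR the data with a random pad of the same length before encoding; the pad has to be as long as the data, shared out of band, and never reused:

       import secrets

       def otp_encrypt(data: bytes) -> tuple[bytes, bytes]:
           pad = secrets.token_bytes(len(data))          # pad must be as long as the data
           return bytes(a ^ b for a, b in zip(data, pad)), pad

       def otp_decrypt(ciphertext: bytes, pad: bytes) -> bytes:
           return bytes(a ^ b for a, b in zip(ciphertext, pad))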
       
 (DIR) Post #Ak7WyZGJaRb483AfOi by mothcompute@vixen.zone
       2024-07-20T11:31:54Z
       
       0 likes, 0 repeats
       
       @emilvolk i imagine those would make for pretty big output files though. that was the problem with one of my older ideas, where you would overflow the context length between each word of the original text to make the content more difficult for an llm to interpret
       
 (DIR) Post #Ak7XMkKk4iolPrlrua by mothcompute@vixen.zone
       2024-07-20T11:36:16Z
       
       0 likes, 0 repeats
       
       @emilvolk maybe if you shuffled the bytes of the input in a deterministic way before encoding, that would both help increase the computational complexity and mess with the sort of linear nature of llms
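
       a sketch of one such deterministic shuffle, using a seeded permutation so that anyone holding the seed can invert it (the seed is an assumption, not part of the original suggestion):

       import random

       def shuffle_bytes(data: bytes, seed: int) -> bytes:
           order = list(range(len(data)))
           random.Random(seed).shuffle(order)            # reproducible permutation
           return bytes(data[i] for i in order)

       def unshuffle_bytes(shuffled: bytes, seed: int) -> bytes:
           order = list(range(len(shuffled)))
           random.Random(seed).shuffle(order)            # regenerate the same permutation
           out = bytearray(len(shuffled))
           for dst, src in enumerate(order):
               out[src] = shuffled[dst]
           return bytes(out)

       assert unshuffle_bytes(shuffle_bytes(b"some input", 1234), 1234) == b"some input"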