Post B0qNVRJMd1aOFUSXgm by wolf480pl@mstdn.io
 (DIR) Post #B0q841o5qXUe18IuLw by mntmn@mastodon.social
       0 likes, 0 repeats
       
       in terms of "finding things in large texts", for example "find a page in this pdf that mentions both shutdown mode and reg18", are there interesting alternatives to all that llm stuff beyond regex search? are there natural language processing systems that are precise/reliable and understandable? i imagine something like a fuzzy parser with stemming and some sort of ontologies, synonyms and logical inference
       
 (DIR) Post #B0q8Dfs0GJwSsQjTTE by jannem@fosstodon.org
       0 likes, 0 repeats
       
       @mntmn No, not really. And that's a reason why small LLMs as language processors (not chatbots) are exciting.
       
 (DIR) Post #B0q8GzPQ4bTmLQwYW8 by mntmn@mastodon.social
       0 likes, 0 repeats
       
       @jannem i somehow find that hard to believe
       
 (DIR) Post #B0q8euir9JCQlARWCm by mntmn@mastodon.social
       0 likes, 0 repeats
       
       i don't like llms because they consume a lot of power and are connected to all the ai greed hype, they have to be strangely trained, their representations are not introspectable, they make tons of errors/are not reliable at all, etc. i'd rather have a sharp, more mechanistic tool that just clearly says "error" when it can't do the job. grep is such a tool. would be nice to have a grep that can clean up and normalize messy human language a bit
       
 (DIR) Post #B0q92cEtWMgINvjLai by skyfaller@jawns.club
       0 likes, 0 repeats
       
       @mntmn This is probably not what you're looking for, but I think bloom filters are cool: https://endler.dev/2019/tinysearch/
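
The tinysearch approach linked above boils down to keeping one small bloom filter per page and testing query terms against each filter. A minimal Python sketch of that idea, assuming made-up sample pages and arbitrary filter parameters (real implementations size the bit array and hash count to the expected word count):

```python
import hashlib

SIZE, K = 1024, 3  # bits per filter, hashes per word (illustrative values)

def positions(word):
    """K deterministic bit positions for a word."""
    return [
        int(hashlib.sha256(f"{i}:{word}".encode()).hexdigest(), 16) % SIZE
        for i in range(K)
    ]

def build_index(pages):
    """One small bloom filter (bit list) per page."""
    index = []
    for text in pages:
        bits = [0] * SIZE
        for word in text.lower().split():
            for p in positions(word):
                bits[p] = 1
        index.append(bits)
    return index

def might_contain(bits, word):
    """True if the word may be on the page: false positives possible, no false negatives."""
    return all(bits[p] for p in positions(word))

def search(index, terms):
    """Page numbers whose filter may contain every query term."""
    return [
        n for n, bits in enumerate(index)
        if all(might_contain(bits, t.lower()) for t in terms)
    ]

pages = [
    "normal operation uses reg17 for clock gating",
    "shutdown mode is entered by writing reg18",
]
index = build_index(pages)
matches = search(index, ["shutdown", "reg18"])  # contains page 1; rare false positives possible
```

The appeal for the "precise tool" use case is that a bloom filter never misses a page that does contain the terms; the only failure mode is an occasional extra page, which is cheap to verify.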
       
 (DIR) Post #B0q9AGepe34TldVkpM by fmn@mastodon.social
       0 likes, 0 repeats
       
       @mntmn "are there natural language processing systems that are precise/reliable and understandable?" - let me wear my noam chomsky hat for a second: there are no such systems and never will be. natural language is ever changing and ambiguous, and the parties involved often don't have the sufficiently common context required for precise communication. this is why people talking or even reading have so many back-and-forths.
       
 (DIR) Post #B0qA2lIJxBo6QJxzMW by jannem@fosstodon.org
       0 likes, 0 repeats
       
       @mntmn I mean, there have been many attempts. Especially for constrained applications such as a corporate document store and things like that. As far as I know, none of those systems were ever a success.
       
 (DIR) Post #B0qMRB2f4lK8g53CK0 by lanodan@queer.hacktivis.me
       0 likes, 0 repeats
       
       @mntmn Reminds me that there's agrep (from https://github.com/laurikari/tre) for fuzzy matches, but I'm not even sure I've ever used it; I'm probably fluent enough in regex.
       
 (DIR) Post #B0qMRDmos7QzBof5Xs by mntmn@mastodon.social
       0 likes, 0 repeats
       
       @lanodan oh that's interesting
       
 (DIR) Post #B0qNVRJMd1aOFUSXgm by wolf480pl@mstdn.io
       0 likes, 0 repeats
       
       @jannem @mntmn AFAIK the way these LLM tools work is they have an embedding of words into a vector space. They index text by converting every word in every document to a vector and storing it in a database together with the ID of the document it came from. Then when you search, they turn each of the query words into a vector and search for the K nearest neighbors in the vector space for each of them. Then they feed the documents they found to an LLM. What if you skipped the last step?
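
That pipeline (embed every word, store (vector, doc-id) pairs, nearest-neighbor lookup per query word, skip the LLM) can be sketched with a toy hand-written embedding table. Real systems use learned embeddings such as word2vec or transformer encoders; the vectors and documents below are invented purely to show the shape of the index/search loop:

```python
from math import sqrt

# Toy stand-in for a learned word embedding; the vectors are made up so that
# "shutdown" and "poweroff" land near each other in the space.
EMBED = {
    "shutdown": (1.0, 0.1, 0.0),
    "poweroff": (0.9, 0.2, 0.1),
    "reg18":    (0.0, 1.0, 0.2),
    "clock":    (0.1, 0.0, 1.0),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def build_index(docs):
    """(vector, doc_id) pairs for every known word in every document."""
    index = []
    for doc_id, text in enumerate(docs):
        for word in text.lower().split():
            if word in EMBED:
                index.append((EMBED[word], doc_id))
    return index

def knn(index, query_word, k=2):
    """Doc IDs of the k index entries nearest to the query word's vector."""
    qv = EMBED[query_word]
    ranked = sorted(index, key=lambda e: cosine(qv, e[0]), reverse=True)
    return [doc_id for _, doc_id in ranked[:k]]

docs = ["the clock tree and reg17", "poweroff sequence writes reg18"]
index = build_index(docs)
# "shutdown" never appears literally, but its vector is near "poweroff":
result = knn(index, "shutdown")  # → [1, 0]: doc 1 first (via "poweroff")
```

Skipping the LLM step turns this into exactly the kind of inspectable ranking tool asked for upthread: the similarity scores are plain numbers you can print and audit.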
       
 (DIR) Post #B0qNVSXa3eYI3s7PUG by pixx@merveilles.town
       0 likes, 0 repeats
       
       @wolf480pl @jannem @mntmn Theoretically for this all you really need is "for each page, grab the text. Split into words. Store each word on a line in a file, e.g. /tmp/document/$PAGENUMBER. Then grep for files which contain both terms"? I think?
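
A rough Python equivalent of that recipe, using in-memory word sets instead of one file per page under /tmp (the sample pages are invented for illustration):

```python
def index_pages(pages):
    """One word set per page -- the moral equivalent of one file of words per page."""
    return [set(text.lower().split()) for text in pages]

def pages_with_all(index, terms):
    """Page numbers whose word set contains every query term."""
    wanted = {t.lower() for t in terms}
    return [n for n, words in enumerate(index) if wanted <= words]

pages = [
    "normal operation and reg17",
    "shutdown mode is selected via reg18",
    "shutdown timing diagram",
]
index = index_pages(pages)
result = pages_with_all(index, ["shutdown", "reg18"])  # → [1]
```

Like grep, this is exact and fully inspectable; the trade-off is that "shutdown" will not match "power-off" without an extra synonym/stemming layer on top.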
       
 (DIR) Post #B0qNVSzwMDElTpO3gu by mntmn@mastodon.social
       0 likes, 0 repeats
       
       @pixx @wolf480pl @jannem right. or split it by sections / chapters. i think it's mostly a matter of making a better PDF reader that can do these kinds of things
       
 (DIR) Post #B0rDSWKWhvv8YtvsHY by damjanovic@chaos.social
       0 likes, 0 repeats
       
       @mntmn @pixx @wolf480pl @jannem Are you looking for a PDF search tool? I like both of these:
       https://flathub.org/en/apps/de.leopoldluley.Clapgrep
       https://docfetcher.sourceforge.io
       
 (DIR) Post #B0rrhwF0eL2zQxG5eC by mntmn@mastodon.social
       0 likes, 0 repeats
       
       @damjanovic @pixx @wolf480pl @jannem interesting, thanks!
       
 (DIR) Post #B0sojvhhVqIRdste7c by keydelk@fosstodon.org
       0 likes, 0 repeats
       
       @mntmn very much how I feel about LLMs. I also don’t like how “sloppy” and imprecise they feel. With traditional programs, there is generally a pretty precise relationship between input and output. If you provide the right input, you can be pretty confident that you will get the right output. With LLMs, while they are more flexible in the type of input they will accept, you can never be confident about the output.
       
 (DIR) Post #B0st0liqHDgbQNiPZI by mntmn@mastodon.social
       0 likes, 0 repeats
       
       @keydelk yes exactly