[HN Gopher] Rga: Ripgrep, but also search in PDFs, E-Books, Offi...
       ___________________________________________________________________
        
       Rga: Ripgrep, but also search in PDFs, E-Books, Office documents,
       zip, etc.
        
       Author : bukacdan
       Score  : 250 points
       Date   : 2024-09-17 13:11 UTC (9 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | jedisct1 wrote:
       | Ever heard of ugrep?
        
         | seanthemon wrote:
         | Seems this project predates ugrep and has a nicer interface.
        
           | mbrubeck wrote:
           | Both projects were started around the same time in 2019. The
           | initial git commits were about five weeks apart:
           | 
           | https://github.com/Genivia/ugrep/commit/e37c986dd842adc3b2c2...
           | 
           | https://github.com/phiresky/ripgrep-all/commit/16b4277d361ce...
        
             | seanthemon wrote:
             | Wow! That's interesting.
        
       | skeptrune wrote:
       | This is sweet
        
       | sim7c00 wrote:
       | Wish I had known about this when I worked support jobs, sifting
       | through logs and lines in zipped 'support file' packages. Very nice!
        
       | Gehinnn wrote:
       | Love this for searching in movie subtitles!
        
       | rectang wrote:
       | To what extent does reading these formats accurately require the
       | execution of code _within the documents_? In other words, not
       | just stuff like zip expansion by a library dependency of rga, but
       | for example macros inside office documents or JavaScript inside
       | PDFs.
       | 
       | Note: I have no reason to believe such code execution is actually
       | happening -- so please don't take this as FUD. My assumption is
       | that a secure design would involve running only external code and
       | thus would sacrifice a small amount of accuracy, possibly
       | negligible.
        
         | fwip wrote:
         | Also note that it's not necessarily safe to read these
         | documents even if you don't intend on executing embedded code.
         | For example, reading from PDFs uses poppler, which has had a
         | few CVEs that could result in arbitrary code execution, mostly
         | around image decoding.
         | https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=poppler
         | 
         | (No shade to poppler intended, just the first tool on the list
         | I looked at.)
        
           | rectang wrote:
           | That's a qualitatively different kind of security topic,
           | though. On the one hand, we have a bug in a tool that reads a
           | passive format with complete accuracy. On the other we have
           | the need to sacrifice some amount of accuracy to avoid
           | executing embedded code in a dynamic file format.
        
           | sim7c00 wrote:
           | This is why I like to try and parse things myself for my own
           | tools. Not that that's without risk, but I don't share my code,
           | so it's untargeted. However, to support a wide variety of
           | formats like this, the tools are OK. Most malicious code in a
           | PDF will not target pdftotext, I think. It would target the
           | things people open PDFs with, like browsers and maybe a few
           | readers such as Adobe and Foxit Reader. pdftotext seems more
           | like an 'academic target': a nice exercise, but not very
           | fruitful in an actual attack. I might be wrong, though.
        
             | sadboi31 wrote:
             | Citation indexes are the devil and Google is hell. Try as
             | you might to avoid it but you're already on an index.
             | Security through obscurity isn't secure or obscure in this
             | modern age.
             | https://www.tandfonline.com/doi/full/10.1080/03054985.2024.2...
        
         | traverseda wrote:
         | None of them really execute "code". Pandoc has a pretty good
         | write-up of the security implications of running it, which I
         | think applies just as much to the other ones, with the added
         | caveat of zip bombs.
         | 
         | https://pandoc.org/MANUAL.html#a-note-on-security
         | 
         | It's just text: this isn't ripgrepping through your Excel
         | macros, just the data that's actually in the Excel file.
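
The zip bomb caveat mentioned above can be illustrated with a small pre-check. This is only a sketch, not how rga or pandoc handle archives: the function name `zip_looks_safe`, the helper `make_zip`, and both thresholds are hypothetical choices made for this example. It relies on the sizes declared in the archive's headers, which a hostile file can misstate, so it is a first filter rather than a guarantee.

```python
import io
import zipfile


def zip_looks_safe(source, max_total_bytes=100 * 1024 * 1024, max_ratio=100.0):
    """Return True if the archive's declared sizes stay within the limits.

    Hypothetical helper: the thresholds are arbitrary example values, and
    the sizes come from the zip headers, which a crafted file can falsify.
    """
    with zipfile.ZipFile(source) as archive:
        total_uncompressed = 0
        total_compressed = 0
        for info in archive.infolist():
            total_uncompressed += info.file_size
            total_compressed += info.compress_size

    # Reject archives that would expand past the absolute size budget.
    if total_uncompressed > max_total_bytes:
        return False
    # Reject archives with a suspiciously high overall compression ratio.
    if total_compressed > 0 and total_uncompressed / total_compressed > max_ratio:
        return False
    return True


def make_zip(payload):
    """Build an in-memory zip with one member (demo helper only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("data.txt", payload)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(zip_looks_safe(make_zip(b"error: disk full\n" * 3)))
    print(zip_looks_safe(make_zip(b"\0" * 10_000_000)))
```

A stricter variant would also count bytes while actually decompressing and abort once the budget is exceeded, since that does not depend on the headers being honest.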
        
       | anthk wrote:
       | Use Recoll for that; check the recommended dependencies from your
       | package manager. Synaptic is good for this: just right-click on
       | the package.
       | 
       | EDIT: For instance, under Trisquel/Ubuntu/Debian and derivatives,
       | select 'recollcmd', then right-click it and mark all the
       | dependencies.
       | 
       | Install RecollGUI for a nice UI.
       | 
       | Now you will have something like Google Search, but libre and on
       | your own desktop.
        
         | gjadi wrote:
         | For Emacs users in the room, there is consult-recoll.
        
       | nanna wrote:
       | Lazy question, but has anyone integrated this with Emacs Dired to
       | transparently search all the files?
        
         | hprotagonist wrote:
         | https://randomeffect.net/post/2022/10/07/use-ripgrep-all-fro...
         | possibly?
        
         | setopt wrote:
         | According to Reddit [1], you can use the existing rg.el
         | package, and just point it to the rga binary instead of the rg
         | binary, and it is supposed to just work.
         | 
         | [1]:
         | https://www.reddit.com/r/emacs/comments/1eghspj/comment/lg6q...
        
       | nullifidian wrote:
       | It's somewhat similar to 'recoll' in its functionality, except
       | that with recoll you need to index everything before searching.
       | It even uses the same approach of relying on third-party software
       | like poppler to extract the contents.
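
The trade-off described above, indexing everything first versus searching on demand, can be shown with a toy comparison. This is a sketch only: it does not reflect how recoll or rga are implemented, and `DOCS`, `scan_search`, `build_index`, and `index_search` are hypothetical names with made-up sample text standing in for extracted document contents.

```python
import re
from collections import defaultdict

# Made-up stand-ins for text already extracted from documents.
DOCS = {
    "report.pdf": "Quarterly revenue grew in the third quarter",
    "notes.docx": "Meeting notes about revenue targets",
    "subtitles.srt": "I'll be back",
}


def tokenize(text):
    """Split text into a set of lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def scan_search(docs, word):
    """On-demand search: no setup, but reads every document per query."""
    word = word.lower()
    return sorted(name for name, text in docs.items() if word in tokenize(text))


def build_index(docs):
    """Up-front indexing: map each token to the documents containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for token in tokenize(text):
            index[token].add(name)
    return index


def index_search(index, word):
    """Indexed search: a single lookup, once the index has been built."""
    return sorted(index.get(word.lower(), set()))


if __name__ == "__main__":
    index = build_index(DOCS)
    print(scan_search(DOCS, "revenue"))
    print(index_search(index, "revenue"))
```

Both functions return the same matches; the difference is where the cost lands. The scan pays on every query, while the index pays once up front and must be rebuilt when files change.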
        
       | usefulcat wrote:
       | Would be cool if it also searched metadata in images, audio,
       | video.
        
       | wanderingmind wrote:
       | Awesome tool, and I use it often. One underutilized feature of
       | rga is its integration with the fuzzy finder fzf, which gives
       | interactive results instead of running commands and collecting
       | their output one after another. In short, use rga-fzf instead of
       | rga on the CLI.
        
       ___________________________________________________________________
       (page generated 2024-09-17 23:00 UTC)