[HN Gopher] Rga: Ripgrep, but also search in PDFs, E-Books, Offi...
___________________________________________________________________
Rga: Ripgrep, but also search in PDFs, E-Books, Office documents,
zip, etc.
Author : bukacdan
Score : 250 points
Date : 2024-09-17 13:11 UTC (9 hours ago)
(HTM) web link (github.com)
| jedisct1 wrote:
| Ever heard of ugrep?
| seanthemon wrote:
| Seems this project predates ugrep and has a nicer interface.
| mbrubeck wrote:
| Both projects were started around the same time in 2019. The
| initial git commits were about five weeks apart:
|
| https://github.com/Genivia/ugrep/commit/e37c986dd842adc3b2c2...
|
| https://github.com/phiresky/ripgrep-all/commit/16b4277d361ce...
| seanthemon wrote:
| Wow! That's interesting.
| skeptrune wrote:
| This is sweet
| sim7c00 wrote:
| wish i knew about this when working support jobs, sifting for logs
| and lines in zip 'support file' packages. very nice!
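
For readers who want to see what "searching for lines inside a zip" amounts to, here is a minimal Python sketch using only the standard library. It is an illustration, not how rga works internally; the function name `grep_zip` is made up for this example, and it assumes every archive member is plain text that can be decoded as UTF-8 (undecodable bytes are replaced).

```python
import io
import re
import zipfile


def grep_zip(zip_source, pattern):
    """Yield (member_name, line_number, line) for each matching line.

    zip_source may be a path or a seekable binary file object.
    Assumption: every member is treated as UTF-8 text.
    """
    regex = re.compile(pattern)
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            with archive.open(info) as raw:
                # Decode leniently: support bundles often mix encodings.
                text = io.TextIOWrapper(raw, encoding="utf-8", errors="replace")
                for line_number, line in enumerate(text, start=1):
                    if regex.search(line):
                        yield info.filename, line_number, line.rstrip("\n")


if __name__ == "__main__":
    # Build a tiny in-memory "support file" package to search.
    bundle = io.BytesIO()
    with zipfile.ZipFile(bundle, "w") as archive:
        archive.writestr("logs/app.log", "start\nERROR disk full\nstop\n")
    for hit in grep_zip(bundle, r"ERROR"):
        print(hit)
```

A real tool would also have to handle nested archives and binary members, which this sketch deliberately ignores.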
| Gehinnn wrote:
| Love this for searching in movie subtitles!
| rectang wrote:
| To what extent does reading these formats accurately require the
| execution of code _within the documents_? In other words, not
| just stuff like zip expansion by a library dependency of rga, but
| for example macros inside office documents or JavaScript inside
| PDFs.
|
| Note: I have no reason to believe such code execution is actually
| happening -- so please don't take this as FUD. My assumption is
| that a secure design would run only code external to the documents
| and thus would sacrifice a small amount of accuracy, possibly
| negligible.
| fwip wrote:
| Also note that it's not necessarily safe to read these
| documents even if you don't intend on executing embedded code.
| For example, reading from pdfs uses poppler, which has had a
| few CVEs that could result in arbitrary code execution, mostly
| around image decoding.
| https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=poppler
|
| (No shade to poppler intended, just the first tool on the list
| I looked at.)
| rectang wrote:
| That's a qualitatively different kind of security topic,
| though. On the one hand, we have a bug in a tool that reads a
| passive format with complete accuracy. On the other we have
| the need to sacrifice some amount of accuracy to avoid
| executing embedded code in a dynamic file format.
| sim7c00 wrote:
| this is why i do like to try and parse shit myself for my own
| tools. not that that's without risk, but i don't share my code
| so it's untargeted. however, to support a wide variety like
| this the tools are ok. most code in a pdf honestly will not
| target pdftotext, i think. i think it would target the thing
| people open pdfs with, like browsers and maybe a few readers
| like adobe and foxit reader. pdftotext seems more like an
| 'academic target', like a nice exercise but not very fruitful
| in an actual attack. i might be wrong tho.
| sadboi31 wrote:
| Citation indexes are the devil and Google is hell. Try as
| you might to avoid it but you're already on an index.
| Security through obscurity isn't secure or obscure in this
| modern age.
| https://www.tandfonline.com/doi/full/10.1080/03054985.2024.2...
| traverseda wrote:
| None of them really execute "code". Pandoc has a pretty good
| write-up of the security implications of running it, which I
| think applies just as much to the other ones, with the added
| caveat of zip bombs.
|
| https://pandoc.org/MANUAL.html#a-note-on-security
|
| It's just text, this isn't ripgrepping through your excel
| macros, just the data that's actually in the excel file.
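
On the zip bomb caveat, a hedged Python sketch of one common precaution: inspect the sizes declared in the archive's directory before extracting anything. The function name and both thresholds (`max_total_bytes`, `max_ratio`) are hypothetical values chosen for illustration, not anything rga or Pandoc is known to use.

```python
import zipfile


def looks_like_zip_bomb(zip_source, max_total_bytes=1 << 30, max_ratio=100.0):
    """Return True if the archive's declared sizes exceed the given limits.

    Caveats: this trusts the sizes recorded in the zip directory, which a
    malicious archive can misstate, and it does not look inside nested
    archives. Treat it as a first filter, not a guarantee.
    """
    total = 0
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            # Sum the declared uncompressed sizes of all members.
            total += info.file_size
            if total > max_total_bytes:
                return True
            # Flag any single member with an extreme compression ratio.
            if info.compress_size > 0:
                if info.file_size / info.compress_size > max_ratio:
                    return True
    return False
```

A stricter approach would also cap the number of bytes actually read while decompressing, since the declared sizes are only a hint.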
| anthk wrote:
| Use Recoll for that; check the recommended dependencies from your
| package manager. Synaptic is good for this: right-click the
| package.
|
| EDIT: For instance, under Trisquel/Ubuntu/Debian and derivatives,
| right-click 'recollcmd' and mark all the dependencies.
|
| Install RecollGUI for a nice UI.
|
| Now you will have something like Google Search, but libre, on your
| own desktop.
| gjadi wrote:
| For emacs users in the room, there is consult-recoll.
| nanna wrote:
| Lazy question but anyone integrated this with Emacs Dired, to
| transparently search all the files?
| hprotagonist wrote:
| https://randomeffect.net/post/2022/10/07/use-ripgrep-all-fro...
| possibly?
| setopt wrote:
| According to Reddit [1], you can use the existing rg.el
| package, and just point it to the rga binary instead of the rg
| binary, and it is supposed to just work.
|
| [1]:
| https://www.reddit.com/r/emacs/comments/1eghspj/comment/lg6q...
| nullifidian wrote:
| It's somewhat similar to 'recoll' in its functionality, only with
| recoll you need to index everything before search. It even uses
| the same approach of using third-party software like poppler for
| extracting the contents.
| usefulcat wrote:
| Would be cool if it also searched metadata in images, audio,
| video.
| wanderingmind wrote:
| Awesome tool and I use it often. One underutilized feature of
| rga is its integration with fuzzy search (fzf), which provides
| interactive output instead of running commands and collecting
| their output in sequence. So in short, use rga-fzf instead of rga
| on the CLI.
___________________________________________________________________
(page generated 2024-09-17 23:00 UTC)