[HN Gopher] M2dir: Treating mails as files without going crazy
       ___________________________________________________________________
        
       M2dir: Treating mails as files without going crazy
        
       Author : cl3misch
       Score  : 96 points
       Date   : 2024-05-23 07:17 UTC (15 hours ago)
        
 (HTM) web link (bitfehler.srht.site)
 (TXT) w3m dump (bitfehler.srht.site)
        
       | chriscappuccio wrote:
       | As a CLI fan, I'm interested in where this could go
        
         | mathfailure wrote:
         | It would go only where you'd bring it to.
        
       | ksherlock wrote:
       | BeOS stored mail as individual files with extended attributes
       | holding the subject, date, sender, etc. The email app (BeMail)
       | was used to view/compose/send email but inbox management was
       | handled by Tracker (BeOS version of Macintosh Finder or Windows
       | Explorer). But the Tracker window was configured to display the
       | extended attributes instead of the file names. The actual
       | filename wasn't even displayed. Nobody went crazy.
       | 
       | Example: https://birdhouse.org/beos/refugee/bemail.jpg via
       | https://birdhouse.org/beos/refugee/ (which has some other images
       | of Tracker organization with extended attributes)
        
         | jauntywundrkind wrote:
         | Pushing concerns up to the OS level is that one thing we could
         | be doing in so many places, but haven't really tried in decades.
         | Should we use a universal format/protocol agnostic way of
         | having data attached to files? Naaaahhhh. /s
        
           | bobthecowboy wrote:
           | My gut reaction to this was "isn't that just sqlite"?
           | 
           | I don't think this is what you were thinking of, but I do
           | kind of love the idea of formalizing sqlite file formats
           | where the "metadata" is standardized and the "file" is stored
           | inside. Like a file format for a recipe, or a picture, or ...
        
             | brirec wrote:
             | Isn't _that_ just a container format, like what video and
             | audio files have used for decades?
             | 
             | I don't know of any existing container formats with support
             | for a relational DB as one of the embedded streams, but the
             | whole point of container formats is that you _can_ add
             | arbitrary metadata, which of course can be a whole
             | database.
             | 
             | Of course, the way BeOS does what OP is talking about is by
             | having many DB columns within the filesystem itself! (The
             | filesystem _is_ a queryable database).
        
               | bobthecowboy wrote:
               | Yes, I totally get the distinction (and I was among those
               | amazed by BeOS back in the day - I still show the old
               | demo videos to friends who haven't seen it). I hadn't
               | considered the container formats used by media, but in my
               | head it would be the other way around - each file would
               | be a sqlite file _first_ so that they all share some
               | commonality around access and inspection (I'm assuming
               | in my ignorance that the media container formats are
               | different).
               | 
               | Are there any database filesystems today? I haven't
               | really looked, but the last one I heard of was the one
               | that MS abandoned years ago. Actually I suppose Haiku
               | probably still has one? I can't imagine how difficult it
               | would be to get a DB Filesystem as a mainstream choice on
               | Linux, let alone across OSen.
        
           | ryandrake wrote:
           | I'm generally a fan of taking advantage of the filesystem,
           | especially when your application is just... storing and
           | viewing files. It irrationally upsets me when an application
           | grafts its own "Library" on top of my perfectly working
           | filesystem, requiring me to import my files into an
           | artificial thing that is just like a filesystem.
           | 
           | On the other hand, extended attributes and other filesystem-
           | specific features could be problematic if you want to share
           | files with other operating systems. If I copy a file to a
           | FAT32 formatted SDCard, I need to worry about what might not
           | copy over.
        
       | tracker1 wrote:
       | It's a somewhat interesting idea... I've had similar ideas in the
       | past regarding maildir replacement without resorting to a db
       | file. I like the idea of having directories represent email
       | folders, though you'll generally want some level of aggregation
       | and/or search... I've thought about having a separate eml file
       | (header + body) along with a .meta.json file for additional
       | tagging/details (deleted flag, tags, etc).
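       | 
       | A rough sketch of that layout in Python, assuming a hypothetical
       | store_message() helper and made-up sidecar keys "deleted" and
       | "tags" (none of these names come from an existing tool):
```python
import json
from email.message import EmailMessage
from pathlib import Path


def store_message(folder: Path, name: str, msg: EmailMessage, tags=()):
    """Write <name>.eml plus a <name>.meta.json sidecar into folder."""
    folder.mkdir(parents=True, exist_ok=True)
    eml_path = folder / f"{name}.eml"
    meta_path = folder / f"{name}.meta.json"
    # The .eml file holds the untouched message (headers + body).
    eml_path.write_bytes(msg.as_bytes())
    # Mutable state lives in the sidecar, so the message is written only once.
    meta = {"deleted": False, "tags": sorted(tags)}
    meta_path.write_text(json.dumps(meta, indent=2))
    return eml_path, meta_path
```
       | Keeping flags in the sidecar means the .eml file is written once
       | and never modified, which should play well with backups and sync.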
       | 
       | Search is a very different story, you wouldn't want to have to do
       | a full directory scan for text based search. So some level of
       | indexing would be useful for a client mail service.
       | 
       | Similarly, I've thought it would be really cool if Cloudflare
       | offered a TCP worker option; you could do a simple mail service
       | backed by R2. The web ui/ux could be pretty awesome and geo
       | distributed.
        
         | arp242 wrote:
         | > Search is a very different story, you wouldn't want to have
         | to do a full directory scan for text based search. So some
         | level of indexing would be useful for a client mail service.
         | 
         | I don't know; my ~/code directory has tons of stuff and
         | searching with ripgrep doesn't seem too slow:
         | 
         |     % time rg HelloWorld | wc -l
         |     4
         |     rg HelloWorld > /dev/null  0.13s user 0.12s system 99% cpu 0.251 total
         | 
         |     % time rg string | wc -l
         |     57813
         |     rg string > /dev/null  0.20s user 0.14s system 99% cpu 0.339 total
         | 
         | Rough estimate of files that rg will search:
         | 
         |     % scc
         |     ----------------------------------------------------------------
         |     Language      Files       Lines     Blanks    Comments      Code
         |     ...
         |     Total         11024     1864982     175565      208777   1480640
         |     ----------------------------------------------------------------
         | 
         | Finding close to 60k matches in 11k files/1.7M lines in about
         | 0.3 seconds isn't too bad.
         | 
         | It should be said I ran a few commands on that directory before
         | the above results, so there's probably some filesystem caching
         | going on, but I can't be bothered to reboot.
         | 
         | For many (not all, obviously) cases I think you may be able to
         | get away without an index. Most people aren't subscribed to tons
         | of email lists and get maybe a few emails a day at the most.
         | 
         | I'd consider anything below ~3 seconds to be fine for search,
         | so this scales to about 100k files/emails. At 10 emails/day on
         | average that's about a decade. Most people do not get 10
         | emails/day on average.
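         | 
         | Spelled out as a back-of-the-envelope calculation in Python
         | (assuming search time grows linearly with the number of files,
         | which is a simplification; the variable names are illustrative):
```python
# Timing reported above: rg scanned ~11k files in ~0.34 s (warm cache).
files_scanned = 11_024
seconds_taken = 0.339
budget_seconds = 3.0      # "anything below ~3 seconds" counts as fine
emails_per_day = 10

# Linear extrapolation: assumes scan time is proportional to file count.
max_files = int(files_scanned * budget_seconds / seconds_taken)
years = max_files / emails_per_day / 365

print(f"{max_files} files fit in the budget, about {years:.0f} years of mail")
```
         | With the timings above this comes out a little under 100k
         | files, or roughly 27 years at 10 emails/day, so a decade is on
         | the conservative side.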
         | 
         | And you can even do some "poor man indexing" by just making a
         | new directory every five or ten years. Most of the time you
         | want just emails from the last year or so.
        
           | Arelius wrote:
           | > Most people do not get 10 emails/day on average.
           | 
           | I'd like to see the stats, but I seem to average more than 40
           | emails a day (most are unactionable) and have always
           | considered my email load quite light. For people like my wife
           | who do much of their work communication over email, it
           | appears to be much higher.
        
           | tracker1 wrote:
           | I'm also considering a Server/Service that has a web ui
           | component, where it's shared server resources... yeah,
           | running a search on a local ssd/nvme is crazy fast... now do
           | it when there are 100k other users on that filesystem.
        
         | rakoo wrote:
         | > Search is a very different story, you wouldn't want to have
         | to do a full directory scan for text based search. So some
         | level of indexing would be useful for a client mail service.
         | 
         | While notmuch and mu exist, I myself use the mblaze suite
         | (https://github.com/leahneukirchen/mblaze) and it's more than
         | enough for me. As a totally unscientific benchmark, it takes
         | 300 ms to find 7 mails out of 24k when searching in headers, 4
         | seconds when searching in the body.
         | 
         | For everyday searching I take a different approach: I convert
         | the entire list of emails (all 24k of them) to one-liners with
         | Sender, Subject, Date, Folder and feed it to fzf, which gives me
         | a preview as well. The search is then instant; it only covers
         | the given fields, but I never need more than that. This is my
         | full MUA:
         | https://sr.ht/~rakoo/omail/
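         | 
         | The one-line-per-mail idea can be sketched in Python like this.
         | It is not taken from omail, just an illustration, and it
         | assumes one message per file with the parent directory name
         | standing in for the folder; one_line_index() is a made-up name:
```python
from email import policy
from email.parser import BytesParser
from pathlib import Path


def one_line_index(mail_root: Path):
    """Yield one tab-separated summary line per message file under mail_root."""
    parser = BytesParser(policy=policy.default)
    for path in sorted(mail_root.rglob("*")):
        if not path.is_file():
            continue
        with path.open("rb") as fh:
            # headersonly=True skips body parsing, which keeps the scan cheap.
            msg = parser.parse(fh, headersonly=True)
        fields = [msg.get("From", ""), msg.get("Subject", ""),
                  msg.get("Date", ""), path.parent.name]
        # Collapse whitespace so each message stays on exactly one line.
        yield "\t".join(" ".join(str(f).split()) for f in fields)
```
         | Printing those lines and piping them into fzf would give the
         | kind of instant search over Sender, Subject, Date and Folder
         | described above.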
        
         | mxuribe wrote:
         | @tracker1 If I'm not mistaken, Thunderbird and other email
         | clients that support conventional maildir often include a local
         | db (such as sqlite) whose purpose is mostly to index content to
         | ease some aspects of search. That being said, as others have
         | noted, search mostly tends to be fast enough at the filesystem
         | level. ;-)
        
         | geek_at wrote:
         | Not 100% related, but I have built OpenTrashmail [1], which
         | gives you the emails in 3 variants: as folders on disk (no DB
         | used), as an RSS feed, or as a JSON feed. That satisfied my
         | needs for local management of emails.
         | 
         | [1] https://github.com/HaschekSolutions/opentrashmail
        
       | jll29 wrote:
       | One wonders why email hasn't been kept in a well-thought-out
       | directory structure since the beginnings of UNIX, given that
       | almost everything is a file in UNIX, and especially given the
       | power of UNIX text processing tools.
        
         | technofiend wrote:
         | If you'd like to try it, MH uses directories and files for
         | managing your email:
         | https://www.gnu.org/software/emacs/manual/html_node/mh-e/ind...
         | As you mentioned this is the unix way and MH (according to
         | Wikipedia) dates back to 1979:
         | https://en.m.wikipedia.org/wiki/MH_Message_Handling_System
        
           | MassPikeMike wrote:
           | The parent's pointer is to "MH-E", the emacs package, which
           | is a great interface to MH for folks who use emacs to read
           | their email.
           | 
           | For folks who don't, I wanted to clarify that MH also works
           | great outside of emacs. Its command-line tools are
           | composable, so you can do things like reply to the first
           | message about chess sent this week:
           | 
           | repl `pick -subject chess -after "19 May 24 0000 PST"`
           | 
           | Using them in scripts is especially powerful.
           | 
           | The modern implementation is "nmh", "New Message Handler",
           | https://www.nongnu.org/nmh/. MH was the mail system within
           | MIT's Athena computing environment back in the day, so many
           | MIT folks developed a fondness for it and it retains a
           | following. There's even a very comprehensive O'Reilly book,
           | free online: https://rand-mh.sourceforge.io/book/
        
         | PurpleRamen wrote:
         | There have been multiple formats for storing mail over the
         | years, and many of them use folders. But each format has its
         | own problems and was optimized for certain benefits. And on
         | unix you have to make this workable with multiple programs
         | accessing them in parallel, because in the early days there
         | were no servers that had tight control over everything. So
         | formats were often designed around using or preventing file
         | locks, making efficient use of storage, or allowing fast
         | handling and management of mail flags.
        
           | ck45 wrote:
           | For quite a long time, a very popular format was mbox (the
           | most popular?), which is a single file. With the arrival of
           | qmail, it was slowly replaced by Maildir.
        
           | mbreese wrote:
           | Not even just common formats, but way back, Mail was
           | delivered by copying files from one server to another. I
           | (barely) remember using UUCP before SMTP/NNTP to sync Mail
           | and news. So, the format that you stored messages in was very
           | important. It's easy to copy a single message when it is a
           | complete file.
        
       | Gys wrote:
       | The mail protocol is plain text so it's not difficult to save
       | emails as individual files. I had such a setup some years ago for a
       | company. Emails were stored in one folder per week, each email in
       | its own subfolder with attachments extracted and a meta text
       | file. References were in a database.
       | 
       | I also remember working with a windows email server that saved
       | all emails only as files, no db, although the directory structure
       | was more complicated. But that was maybe 20 years ago...
        
       | zokier wrote:
       | Is there a reason why metadata and the message are stored so
       | separately? I.e. why
       | INBOX/2023-09-04_13:47_builds@sr.ht,GTfrlwJfN5vyR28R
       | INBOX/.meta/GTfrlwJfN5vyR28R.flags
       | 
       | instead of
       | INBOX/2023-09-04_13:47_builds@sr.ht,GTfrlwJfN5vyR28R/message
       | INBOX/2023-09-04_13:47_builds@sr.ht,GTfrlwJfN5vyR28R/.flags
       | 
       | The latter structure would allow creating/deleting the message
       | and flags atomically.
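       | 
       | A sketch of what "atomically" could look like in Python, assuming
       | a POSIX filesystem where renaming a directory within one
       | filesystem is atomic. The helper names and the ".tmp-" staging
       | prefix are hypothetical, not part of the m2dir spec:
```python
import os
import shutil
import tempfile
from pathlib import Path


def create_message_dir(folder: Path, name: str, raw: bytes, flags: str = ""):
    """Stage message + flags in a temp dir, then publish with one rename."""
    folder.mkdir(parents=True, exist_ok=True)
    # Stage inside the same folder so the rename stays on one filesystem.
    staging = Path(tempfile.mkdtemp(prefix=".tmp-", dir=folder))
    (staging / "message").write_bytes(raw)
    (staging / ".flags").write_text(flags)
    final = folder / name
    os.rename(staging, final)  # one step; assumes final does not exist yet
    return final


def delete_message_dir(folder: Path, name: str):
    """Hide the message with one rename, then remove the files afterwards."""
    trash = folder / f".tmp-del-{name}"
    os.rename(folder / name, trash)
    shutil.rmtree(trash)
```
       | Anything listing the folder would have to skip entries starting
       | with ".tmp-", since those are either not yet published or already
       | deleted.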
        
         | mathfailure wrote:
         | That'd require jumping between dirs when traversing multiple
         | messages.
        
       | graycat wrote:
       | Been thinking about this subject:
       | 
       | Of course, standard (usual, common) email is just text. Right,
       | for the pictures, to have them just as text, they are _encoded_
       | as _base64_. Right, it's MIME (Multipurpose Internet Mail
       | Extensions).
       | 
       | Soooo, okay, my ISP (Internet Service Provider) has an email
       | service. The service is a Web site, and it does offer getting the
       | "Source", that is, the text, all as just one file.
       | 
       | Now, suppose for each email message I send/receive, I keep the
       | text in its own file, with just the text, just as I got it from,
       | say, my ISP. I will handle the file naming, indexing,
       | summarization, etc.
       | 
       | Help!!!! Is there an _email_ program that I can run that, for
       | each of those files, can read it and _display_ it? Sure, it
       | should be able to display the text, as text, that is not one of
       | the MIME extensions but also be able to do _the right thing_ for
       | each of the rest, still images, video clips, audio, whatever.
       | Know of such a program???? Thanks!
        
       | Aloisius wrote:
       | For MacOS, extracting attachments into files is useful so that
       | Spotlight can index them for search. I believe the same is true
       | for Windows.
       | 
       | Mail.app uses a directory structure that looks similar* to this
       | for, say, gmail:
       | 
       |     {account-uuid}/[Gmail].mbox/All Mail.mbox/{mailbox-guid}/Data/Messages/{msguid}.partial.emlx
       |     {account-uuid}/[Gmail].mbox/All Mail.mbox/{mailbox-guid}/Data/Attachments/{msguid}/{mime part #}/{mime subpart #}/filename.ext
       | 
       | The emlx format is a bit different from eml. It contains the
       | number of bytes for the message at the top and an xml plist at
       | the end that has message flags, last viewed time, gmail labels,
       | etc. For partial.emlx files, the base64 content is removed from
       | the email itself and a content length is added.
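       | 
       | Going by that description, splitting an emlx file could look
       | roughly like this in Python. This is a sketch based only on the
       | layout above, not on Apple documentation; parse_emlx() is a
       | made-up name and partial.emlx files are not handled:
```python
import plistlib
from email import message_from_bytes, policy


def parse_emlx(data: bytes):
    """Split emlx bytes into (email message, metadata dict)."""
    # First line: length of the message in bytes, as ASCII digits.
    first_line, _, rest = data.partition(b"\n")
    length = int(first_line.strip())
    message = message_from_bytes(rest[:length], policy=policy.default)
    # Whatever follows the message is treated as the XML plist with the flags.
    metadata = plistlib.loads(rest[length:].strip())
    return message, metadata
```
       | The returned message is a regular email object, so the usual
       | header and MIME part accessors work on it.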
       | 
       | This format has its drawbacks, of course.
       | 
       | * Not shown is the hierarchy based on message uid used to keep
       | the number of files in the Messages directory down.
        
       | QasimK wrote:
       | I've been thinking about doing this myself, so it's fantastic to
       | see a project.
       | 
       | I find a files-centric (and more broadly filesystem-centric)
       | approach easier to grapple with than one that focuses on apps
       | (and hiding away the data). It makes it much easier to access my
       | own data for other purposes outside of what the app provides. In
       | particular when the files are in plain-text or otherwise human-
       | editable. I can reuse all of the existing tools that I'm familiar
       | with to search, modify or re-purpose the data.
        
         | skydhash wrote:
         | I can do away with files if the app provides scripting
         | capabilities (IPC, plugins,...). I know the average user won't
         | use it, but if you've nailed down your workflow, it's
         | liberating to be able to speed up parts of it.
        
       | robertlagrant wrote:
       | I was hoping the mailing list link would be to an FTP site I'd
       | upload my email to.
        
         | colinsane wrote:
         | my chief concern with the spec was actually "do FTP clients
         | generally support `:` in a filename?"
         | 
         | but then i realized i'm not likely to mount a remote M2dir so
         | i'm far less concerned with the answer.
        
       | AdieuToLogic wrote:
       | Whenever I see efforts to treat email as files, I fondly think of
       | my time using nmh[0]. Until the pervasive use of multimedia
       | email, nmh was a really nice way to communicate with email IMHO.
       | 
       | 0 - https://www.nongnu.org/nmh/
        
       ___________________________________________________________________
       (page generated 2024-05-23 23:01 UTC)