[HN Gopher] Imagining a personal data pipeline
       ___________________________________________________________________
        
       Imagining a personal data pipeline
        
       Author : surprisetalk
       Score  : 101 points
       Date   : 2024-08-07 18:06 UTC (4 days ago)
        
 (HTM) web link (www.joshcanhelp.com)
 (TXT) w3m dump (www.joshcanhelp.com)
        
       | curiousthought wrote:
       | This is actually a great use case for something like Windows
       | Recall. Ingestion of data after the fact requires the data to be
       | discoverable.
       | 
        | If there were a way to add a meta-prompt to Windows Recall like
        | "Create a log entry every time I watch something with its title
        | and URL", it could serve as a history whether things were watched
        | on YouTube, Vimeo, or any other site, without requiring plugging
        | into each service individually. Repeat ad nauseam for each thing
        | to be logged, or perhaps someone can come up with a cleverer
        | query than I that catches everything sufficiently.
       | 
        | The level of granularity on many services might be surprisingly
        | coarse, preventing introspection of the data at a useful level.
        
         | groby_b wrote:
         | This is a horrible use case for Windows Recall. Even if we
         | ignore all the privacy implications of having a third party
         | screenshot you every 30 seconds and making the files world
         | readable, it's a bad idea.
         | 
         | Recall has lost a ton of useful metadata you already have -
         | both URL visits and streaming are clearly discernible actions,
         | both at the network stack level, and from your browser history.
         | Throwing that away to trust an LLM to re-infer the same data is
         | both reducing data fidelity and significantly increasing
         | processing cost.
         | 
          | If you want to see this done reasonably well, I'd suggest
          | looking at e.g. https://beepb00p.xyz/promnesia.html (which,
          | not surprisingly, bears a strong similarity to what the
          | article discusses)
         | 
         | LLMs don't add much value here, outside of tightly locked down
         | systems where screenshots are the _only_ way of exporting.
        
           | curiousthought wrote:
           | Sorry when I said something like Windows Recall, I didn't
           | mean Windows Recall but software with similar capabilities. I
           | think in my mind I was imagining some sort of ongoing screen
           | capture along with a meta prompt or prompts, and some sort of
           | output.
           | 
           | The value the LLM adds is interpreting/processing data
           | without having to tailor input streams. Imagine if formats
           | change, fields get renamed, and so on. The maintenance would
           | be a headache if this was done on a per-service level. I
           | think the reduction in fidelity seems like a reasonable
           | tradeoff, but that's for the user to decide of course along
           | with local/cloud processing and proprietary/open source
           | software.
           | 
           | Even things like invoices from the same service change format
           | over time.
        
             | netsharc wrote:
             | I've been using https://www.manictime.com for maybe close
             | to 20 years now, although not the pro version that offers
             | screenshot recording (curiously the website doesn't mention
             | the existence of a free "standard" license). It records
             | window titles and presence/away times.
             | 
             | A prompt every few minutes that would ask "What are you
             | doing now?" would be interesting to me, as a professional
             | procrastinator. Maybe an even better one would be one that
             | says something like "In the last 10 minutes, you spent 90%
             | of it on Hacker News".
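Aggregating such a window-title log into a "you spent X% on Y" summary is a small exercise. A minimal Python sketch, with an invented log format and entries (not ManicTime's actual export):

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical samples of (timestamp, active window title), roughly as a
# tracker might record them. Both the format and the data are made up.
LOG = [
    (datetime(2024, 8, 11, 22, 50), "Hacker News - Mozilla Firefox"),
    (datetime(2024, 8, 11, 22, 53), "Hacker News - Mozilla Firefox"),
    (datetime(2024, 8, 11, 22, 56), "report.odt - LibreOffice Writer"),
    (datetime(2024, 8, 11, 22, 59), "Hacker News - Mozilla Firefox"),
]

def summarize(log, now, window=timedelta(minutes=10)):
    """Return each window title's share of the samples within `window` of `now`."""
    recent = [title for ts, title in log if now - ts <= window]
    if not recent:
        return {}
    return {title: n / len(recent) for title, n in Counter(recent).items()}

for title, share in summarize(LOG, now=datetime(2024, 8, 11, 23, 0)).items():
    print(f"In the last 10 minutes, you spent {share:.0%} on {title!r}")
```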
        
         | IncreasePosts wrote:
         | Browser history does that pretty well currently.
        
       | sevazhidkov wrote:
        | My friend and I had a similar idea a few years ago, so we built
        | a prototype of a tool that converts personal data exports into a
        | single SQLite database: https://github.com/bionic/bionic (the
        | repo includes a "popular Spotify songs when I'm in transit
        | according to Google Maps" query). Unfortunately, we haven't found
        | ourselves actually using the aggregated data: we've looked at it
        | a few times, but it didn't end up solving any real pain. It was
        | fun to build though!
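The exports-to-one-SQLite-database idea can be illustrated in a few lines of Python; the table name and fields below are invented for illustration, not the actual bionic schema:

```python
import json
import sqlite3

def load_export(conn, table, records):
    """Create a table from a list of flat JSON records and insert them."""
    if not records:
        return
    cols = sorted(records[0])
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({', '.join(cols)})")
    conn.executemany(
        f"INSERT INTO {table} VALUES ({', '.join('?' for _ in cols)})",
        [[r.get(c) for c in cols] for r in records],
    )

conn = sqlite3.connect(":memory:")  # or a file such as personal.db
# A made-up fragment of a service's JSON export:
spotify = json.loads('[{"played_at": "2024-08-07T18:00Z", "track": "Song A"}]')
load_export(conn, "spotify_plays", spotify)
print(conn.execute("SELECT track FROM spotify_plays").fetchall())
```

Once every export lands as a table, cross-service questions become ordinary SQL joins.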
        
       | plaidfuji wrote:
       | I also think about this general problem a fair amount, but this:
       | 
       | > ... There is a whole bunch of toil baked into this hobby and
       | I'm wary of creating an endless source of digital chores for
       | myself
       | 
       | Always stops me from pursuing it seriously. Also that I already
       | do a fair amount of ELT at work and couldn't tolerate it as a
       | hobby.
       | 
       | But this framework makes sense. Seems like the idea is connector,
       | schema mapping, and datatype standardization configured in one
       | place. It's a well thought-out framework and I actually have an
       | internal platform at work that accomplishes something very
       | similar, albeit for a totally different purpose.
       | 
       | But I also personally wouldn't see a ton of value from this,
       | _except_ if it were used for bills, taxes, and financial
       | management. But then the privacy aspect becomes paramount.
       | There's a reason people just have to do that stuff manually.
       | 
       | I would be surprised if Apple and Google didn't eventually start
       | to build something like this. Google is already pretty good at
       | unifying email and calendar. It's something that's really only
       | possible to deploy at the mobile OS level, because any other
       | alternative would involve sending all of your data to a third
       | party platform, which for most people these days is a non-
       | starter. Plus with LLMs being a thing, the perfect interface to
       | centralized/standardized personal data now exists.
        
       | jskherman wrote:
        | This is the main challenge of the quantified self movement all
        | over again.
        | 
        | There have been a lot of attempts to solve this problem, but not
        | much has stuck, possibly because the whole setup of ELT
        | processes is a lot of chores (just think about the inconsistent
        | formats of data across services). It's like having a second job
        | in data engineering, and I'm not even remotely in the
        | software/data industry! I just like and do coding as a hobby.
        
       | LeonB wrote:
        | HN user SimonW, who created Datasette, gave a talk in 2020 on
        | "dogsheep", his tool for harvesting and processing personal data
        | from a series of third parties.
       | 
       | https://simonwillison.net/2020/Nov/14/personal-data-warehous...
       | 
        | Here is more about dogsheep -- ("Dogsheep is a collection of
        | tools for personal analytics using SQLite and Datasette.")
       | 
       | https://dogsheep.github.io/
        
         | jhardy54 wrote:
         | I was surprised, I thought the article was going to be about
         | dogsheep, _especially_ when the article makes a point of
         | linking to prior work.
         | 
         | Absolutely no shade, it just goes to show how hard it is to
         | keep track of All The Things.
        
           | LeonB wrote:
           | I've got a small number of links to such things here --
           | (which is how I could track down dogsheep quickly)
           | 
           | https://wiki.secretgeek.net/personal-data-liberation
        
       | jonahbenton wrote:
       | See also Perkeep, from Brad Fitzpatrick
       | 
       | https://en.m.wikipedia.org/wiki/Perkeep
        
         | vineyardmike wrote:
         | I've always been really interested in Perkeep, and I think it
         | gets a lot of things right architecturally. I'm always
         | disappointed though whenever I go to use it. It seems like it's
         | just missing one or two things, and there is no community that
          | has grown around it. It's also kind of a pain to extend,
          | because all of the types are hard-coded as structs within the
          | project, so anything new basically needs to be written in
          | rather than added on top.
         | 
         | For example, the project docs are practically impossible to
         | use. You need to review source if you want to create a
         | configuration file yourself (needed to use any data store
         | besides sqlite index + local files). There are just conflicting
         | explanations, outdated descriptions, etc.
         | 
         | Additionally, the project spends a lot of time saying "objects
         | not files" but the only objects defined and used by the system
         | are... files.
        
       | yevpats wrote:
        | This is why we built CloudQuery
        | (https://github.com/cloudquery/cloudquery), an open source,
        | high-performance ELT framework powered by Apache Arrow (the
        | framework is open source; our connectors are closed source). You
        | can run a local pipeline, write plugins (extractors) in Go,
        | Python, JavaScript, or any other language, and save data to any
        | destination (files, SQLite, DuckDB, PostgreSQL, ...)
       | 
       | (Founder here)
        
       | smolder wrote:
       | The promise of computing that never materialized was having
       | software to track everything you've ever done, and leveraging
       | that data for _your own_ benefit and no one else 's.
        
       | nilirl wrote:
       | I made something for just this use. It's very simple, but it
        | gives you all your data in compressed JSON. I use it myself
        | mostly for diet and exercise, but I also log other things like
        | movies and books I'm reading. I made it because I wanted a nice
       | interface to review my logs.
       | 
       | https://www.idiotlamborghini.com/strategies/weave
        
       | jauntywundrkind wrote:
        | I'm struggling to remember/find the exact story, but about 6
        | months ago some dev had built a guestlist or Q&A and used some
        | off-the-shelf Notion-y thing - maybe some form builder tool.
        | 
        | There was, seemingly, IMHO a lot of protest that the dude didn't
        | make some kind of one-off, "simple" PHP script or something for
        | the job.
        | 
        | But they'd used existing tools to make a real data pipeline. And
        | potentially could keep making new tools around similar pipelines.
        | They had invested in some pipe-building technology, and it felt
        | like no one was interested in giving credit for that.
       | 
        | Separately, Karlicoss has HPI as their personal data
       | toolkit/pipeline, and a massive map of services & data systems
       | they've roped into HPI (and HPI-near) systems.
       | https://beepb00p.xyz/hpi.html https://beepb00p.xyz/myinfra.html
        
       | compsciphd wrote:
        | About a decade ago, I took on a personal toy project to try to
        | teach myself larger-scale programming in Java, as well as the
        | APIs provided by multiple internet services (Google, Twitter,
        | Facebook, ...).
        | 
        | The project was to try to collect and make searchable my
        | "internet self", something I called "personal search". I.e.,
        | the idea was to index every web page I look at, every e-mail I
        | get/see, and social media content shared with me by my social
        | graph (using said APIs), and to further index the pages shared.
       | 
        | The indexing itself wasn't the hard part. (At the time, the APIs
        | Facebook, Twitter, et al. offered were very expansive; they're
        | much more limited these days. One can attempt to deal with that
        | with intelligent DOM scraping, but that's a never-ending race
        | where your sources are constantly changing things and you are
        | chasing their changes.) The question is: how does this
        | information really create significant value for myself? I.e.,
        | how often am I actually going to be searching this personal
        | archive? I search Google many times a day to find new things (or
        | re-find things I already found through it), but how often do I
        | search for things that are within my personal index during a
        | normal day? A handful of times at most, I'd think (and many
        | days, not even once).
       | 
        | With that said, the concept I then decided to try to teach
        | myself was writing a browser extension that could do a
        | similarity search (of sorts) between the documents in my
        | personal index and the web page I'm currently looking at, plus
        | content related to it (e.g., looking at a news article about
        | current events, the idea is that it should surface other
        | articles your friends have shared on the topic and their
        | comments on it). That ended up being an area I didn't have the
        | time (or expertise) to really go far with, so it sort of ended
        | there.
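A first cut at that kind of similarity search can be plain bag-of-words cosine similarity, no external libraries. A self-contained sketch, with an invented personal index:

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts for a document."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Made-up stand-in for a personal index of friend-shared documents.
index = {
    "friend-share-1": "article about the new city budget and transit plans",
    "friend-share-2": "recipe for sourdough bread with rye flour",
}

def related(page_text, index, top=3):
    """Rank indexed documents by similarity to the current page."""
    page = vectorize(page_text)
    scored = [(cosine(page, vectorize(doc)), key) for key, doc in index.items()]
    return [key for score, key in sorted(scored, reverse=True) if score > 0][:top]

print(related("city council debates transit budget", index))
```

Real systems would weight terms (TF-IDF) or use embeddings, but the surfacing logic is the same shape.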
        
       | noelwelsh wrote:
       | I always wonder what people would do with this data. I have a
       | more "tears in the rain" approach. I just don't think there is
       | that much value in, say, my old workout logs. I cannot see how
       | capturing and analyzing it all would make my life much better. I
        | feel that if there were one crazy hack that would, say, make me
        | stronger, the people whose job is to get as strong as possible
        | would have already found it. (It's probably PEDs.)
        
         | zimpenfish wrote:
         | > I just don't think there is that much value in, say, my old
         | workout logs.
         | 
         | To you, perhaps, but witness the excitement of historians and
         | archaeologists when they find things like the Ea-nasir
         | complaint[0] or an Akkadian shopping list[1].
         | 
         | You never know what might be interesting to future peoples.
         | 
         | [0] https://en.wikipedia.org/wiki/Complaint_tablet_to_Ea-nasir
         | 
         | [1] https://news.artnet.com/art-world/akkadian-cuneiform-
         | tablet-...
        
         | nicbou wrote:
          | In my case, I use it as an enhanced photo stream slash diary.
          | It lets me see what I was up to on a given day. I see my
         | location, my photos, my diary, my search queries and my
         | transactions, among other things.
        
       | burakemir wrote:
       | For the subproblem of being able to unify and query various data
       | sources in different formats, I would suggest to take a look at
       | Datalog and specifically Mangle, my implementation of it. I don't
       | want to plug the project here but more describe the approach.
       | 
        | Usually your data will comfortably fit in a file. Your data
        | getter emits these files as facts (essentially relations). If
        | you want structured data, it can also be a single column that is
        | of some struct type (similar to protobuf).
       | 
        | With all data available, the problem becomes one of querying.
        | With a good enough query language and system, you can write
        | these data transformations via Datalog rules, which roughly
        | correspond to database views.
       | 
        | It is always possible to write queries in code in a
        | general-purpose language, but that is a bit clumsy and hard to
        | get an overview of or reuse. It may also be possible to use
        | SQL, but SQL is not very compositional, and you ask yourself
        | whether the base data representation should be adapted or
        | refactored. Essentially you do not want to think about the
        | optimal schema or set of structs but just do the
        | transformations you need in the lowest-friction way.
       | 
       | With Datalog you may benefit from a unified representation
       | (everything is "facts") and the transformations to useful
       | different formats (different kinds of facts) can be factored and
       | reused. It may mean duplication and denormalization but usually
       | that does not matter.
       | 
        | Mangle supports aggregation and even calling some custom
        | functions during query evaluation. The repo is at
        | https://github.com/google/mangle and obviously there remains a
        | lot to do; the API is unstable, there are bugs, and the type
        | checker is not finished... but a number of people and projects
        | seem to use it. Even if you do not use it, it may give you an
        | idea of how to use facts (relations) as a unified data structure
        | for your project.
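The facts-plus-rules idea can be mimicked in plain Python, independent of Mangle itself: relations as sets of tuples, and a Datalog-style rule as a join producing a derived relation (the data below is made up):

```python
# Relations as sets of tuples; made-up stand-ins for two "data getter"
# outputs (music plays and location-derived trips).
plays = {("2024-08-07", "Song A"), ("2024-08-07", "Song B"),
         ("2024-08-08", "Song C")}
trips = {("2024-08-07", "in_transit")}

# A rule in the Datalog spirit, written as a join:
#   transit_song(Day, Track) :- plays(Day, Track), trips(Day, "in_transit").
transit_song = {
    (day, track)
    for day, track in plays
    for trip_day, mode in trips
    if day == trip_day and mode == "in_transit"
}
print(sorted(transit_song))
```

A Datalog engine adds recursion, factoring, and reuse on top of this basic join shape.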
        
       | kkfx wrote:
        | Well... I have mine in Emacs/org-mode/org-roam managed notes,
        | all in a single integrated platform with no need for extra code,
        | along with my mails (notmuch), contacts (org-contacts),
        | financial transactions (beancount), files (org-attach-ed), etc.,
        | down to the infra/OS config (NixOS tangled from org-mode
        | org-babel blocks, as are the Emacs, zsh, mplayer, ... configs).
        | 
        | The point is that with classic tools it IS EASY, limited only by
        | the current sorry state of IT things; with modern tools it is a
        | nightmare that demands much effort.
        
       | kaz-inc wrote:
       | I have a project I've built that's somewhat like this, ironically
        | called Pipeline [0]. It's a manual-entry, timestamped
        | note-taking system, and the UI is like messaging yourself. I've
        | set it up over a WireGuard VPN server and it connects all of my
        | devices; it works offline as a PWA, and I've tested it on
        | Chrome/Firefox/Safari on iOS/Linux/Android/macOS/Windows. It
        | mostly works on all of those platforms, and some of my
        | friends/family use it to take notes for themselves.
       | 
        | The fundamental query I usually use is substring search. The
        | only content is text, because I believe in the primacy of
        | plaintext. The notes for the last 4 years of my life take up 60
        | megs; it takes half a second on a 5-year-old Android phone to
        | parse all of it, and less than 50ms to search through all of it,
        | so I can do it on every keystroke, incrementally.
       | 
       | [0] Pipeline Notes: https://github.com/kasrasadeghi/pipeline-js
       | 
       | I'm not a web developer by trade, so if anyone has any feedback
       | on security/UI/service workers, please let me know!
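The core of a timestamped-notes-plus-substring-search system like the one described fits in a few lines. A hedged sketch, not Pipeline's actual code or note format:

```python
# Hypothetical (timestamp, text) notes; the format is an assumption.
notes = [
    ("2024-08-07T18:06", "read the personal data pipeline article"),
    ("2024-08-08T09:12", "groceries: coffee, oats"),
    ("2024-08-09T21:30", "pipeline idea: log every film I watch"),
]

def search(notes, query):
    """Case-insensitive substring search over (timestamp, text) notes."""
    q = query.lower()
    return [(ts, text) for ts, text in notes if q in text.lower()]

for ts, text in search(notes, "pipeline"):
    print(ts, text)
```

A linear scan like this stays fast enough at tens of megabytes that re-running it on every keystroke is feasible, matching the incremental-search behavior described.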
        
       | nicbou wrote:
       | I have built a timeline thing to gather all of my data as an
       | augmented diary.
       | 
       | https://nicolasbouliane.com/projects/timeline
       | 
       | The newer version is basically a static site generator running on
       | top of my data. The older version was actively fetching the data
       | from various sources, but it was getting a little unwieldy.
       | 
       | The biggest challenge is to automatically get your data out of
       | Google Photos, social networks, and even your own phone. All of
       | my handwritten notes and my sketches are stuck on my iPad and
       | must be manually exported. It's tedious and unsustainable.
       | 
       | Same with social networks. Data must be manually exported. There
       | is also no simple, human-readable file format for social media
       | posts. You have to parse the export format they choose to use at
       | the moment.
        
       ___________________________________________________________________
       (page generated 2024-08-11 23:02 UTC)