[HN Gopher] Imagining a personal data pipeline
___________________________________________________________________
Imagining a personal data pipeline
Author : surprisetalk
Score : 101 points
Date : 2024-08-07 18:06 UTC (4 days ago)
(HTM) web link (www.joshcanhelp.com)
(TXT) w3m dump (www.joshcanhelp.com)
| curiousthought wrote:
| This is actually a great use case for something like Windows
| Recall. Ingestion of data after the fact requires the data to be
| discoverable.
|
| If there were a way to add a meta-prompt to Windows Recall like
| "Create a log entry every time I watch something with its title
| and URL", it could serve as a history of whether things were
| watched on YouTube, Vimeo, or any other site, without requiring
| plugging into each service individually. Repeat ad nauseam for
| each thing to be logged, or perhaps someone can come up with a
| more clever query than mine that catches everything
| sufficiently.
|
| The granularity on many services might be surprisingly coarse,
| preventing introspection of the data at a useful level.
| groby_b wrote:
| This is a horrible use case for Windows Recall. Even if we
| ignore all the privacy implications of having a third party
| screenshot you every 30 seconds and making the files world
| readable, it's a bad idea.
|
| Recall loses a ton of useful metadata you already have - both
| URL visits and streaming are clearly discernible actions, both
| at the network-stack level and from your browser history.
| Throwing that away and trusting an LLM to re-infer the same
| data both reduces data fidelity and significantly increases
| processing cost.
|
| If you want to see this done reasonably well, I'd suggest
| looking at e.g. https://beepb00p.xyz/promnesia.html (which not
| surprisingly bears a strong similarity to what the article
| discusses)
|
| LLMs don't add much value here, outside of tightly locked down
| systems where screenshots are the _only_ way of exporting.
| curiousthought wrote:
| Sorry, when I said something like Windows Recall, I didn't
| mean Windows Recall but software with similar capabilities. I
| think in my mind I was imagining some sort of ongoing screen
| capture along with a meta prompt or prompts, and some sort of
| output.
|
| The value the LLM adds is interpreting/processing data
| without having to tailor input streams. Imagine if formats
| change, fields get renamed, and so on. The maintenance would
| be a headache if this was done on a per-service level. I
| think the reduction in fidelity seems like a reasonable
| tradeoff, but that's for the user to decide of course along
| with local/cloud processing and proprietary/open source
| software.
|
| Even things like invoices from the same service change format
| over time.
| netsharc wrote:
| I've been using https://www.manictime.com for maybe close
| to 20 years now, although not the pro version that offers
| screenshot recording (curiously the website doesn't mention
| the existence of a free "standard" license). It records
| window titles and presence/away times.
|
| A prompt every few minutes that would ask "What are you
| doing now?" would be interesting to me, as a professional
| procrastinator. Maybe an even better one would be one that
| says something like "In the last 10 minutes, you spent 90%
| of it on Hacker News".
| IncreasePosts wrote:
| Browser history does that pretty well currently.
| sevazhidkov wrote:
| My friend and I had a similar idea a few years ago, so we've
| built a prototype of a tool that converts personal data exports
| to a single SQLite database: https://github.com/bionic/bionic
| (repo includes "popular Spotify songs when I'm in transit
| according to Google Maps" query). Unfortunately, we haven't found
| ourselves actually using the aggregated data: we've looked at it
| a few times, but it didn't end up solving any real pain. It was
| fun to build though!
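| (Hedged sketch, not the repo's actual schema: the table and
| column names below are invented, but a "Spotify plays while in
| transit according to Google Maps" query could look roughly like
| this via Python's sqlite3.)

```python
import sqlite3

# Invented schema; the real bionic export tables are named differently.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE spotify_plays (track TEXT, played_at INTEGER);
CREATE TABLE transit_segments (started_at INTEGER, ended_at INTEGER);
""")
conn.executemany("INSERT INTO spotify_plays VALUES (?, ?)",
                 [("Song A", 100), ("Song A", 260),
                  ("Song B", 250), ("Song B", 280)])
conn.execute("INSERT INTO transit_segments VALUES (?, ?)", (200, 300))

# Tracks played during a transit segment, most-played first: the join
# on the timestamp range is what ties the two exports together.
rows = conn.execute("""
    SELECT p.track, COUNT(*) AS plays
    FROM spotify_plays p
    JOIN transit_segments t
      ON p.played_at BETWEEN t.started_at AND t.ended_at
    GROUP BY p.track
    ORDER BY plays DESC
""").fetchall()
```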
| plaidfuji wrote:
| I also think about this general problem a fair amount, but this:
|
| > ... There is a whole bunch of toil baked into this hobby and
| I'm wary of creating an endless source of digital chores for
| myself
|
| Always stops me from pursuing it seriously. Also that I already
| do a fair amount of ELT at work and couldn't tolerate it as a
| hobby.
|
| But this framework makes sense. Seems like the idea is connector,
| schema mapping, and datatype standardization configured in one
| place. It's a well thought-out framework and I actually have an
| internal platform at work that accomplishes something very
| similar, albeit for a totally different purpose.
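| (A minimal sketch of that connector / schema-mapping /
| datatype-standardization split, configured in one place - all
| names here are invented:)

```python
from datetime import datetime, timezone

# Hypothetical names throughout; this only sketches the three-part split.
def spotify_connector():
    # Connector: yields raw records in the service's own format.
    yield {"trackName": "Song A", "endTime": "2024-08-07 18:06"}

# Schema mapping: service field names -> one shared vocabulary.
SPOTIFY_MAPPING = {"trackName": "title", "endTime": "timestamp"}

def normalize(record, mapping):
    # Datatype standardization: every timestamp becomes UTC ISO-8601.
    out = {mapping[k]: v for k, v in record.items() if k in mapping}
    ts = datetime.strptime(out["timestamp"], "%Y-%m-%d %H:%M")
    out["timestamp"] = ts.replace(tzinfo=timezone.utc).isoformat()
    return out

rows = [normalize(r, SPOTIFY_MAPPING) for r in spotify_connector()]
```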
|
| But I also personally wouldn't see a ton of value from this,
| _except_ if it were used for bills, taxes, and financial
| management. But then the privacy aspect becomes paramount.
| There's a reason people just have to do that stuff manually.
|
| I would be surprised if Apple and Google didn't eventually start
| to build something like this. Google is already pretty good at
| unifying email and calendar. It's something that's really only
| possible to deploy at the mobile OS level, because any other
| alternative would involve sending all of your data to a third
| party platform, which for most people these days is a non-
| starter. Plus with LLMs being a thing, the perfect interface to
| centralized/standardized personal data now exists.
| jskherman wrote:
| This is the main challenge of the quantified-self movement all
| over again.
|
| There have been a lot of attempts to solve this problem, but not
| much has stuck, possibly because setting up ELT processes is a
| lot of chores (just think about the inconsistent formats of data
| across services). It's like having a second job in data
| engineering, and I'm not even remotely in the software/data
| industry! I just like and do coding as a hobby.
| LeonB wrote:
| HN user SimonW, who created Datasette, gave a talk in 2020 on
| "Dogsheep", his tool for harvesting and processing personal data
| from a series of third parties.
|
| https://simonwillison.net/2020/Nov/14/personal-data-warehous...
|
| Here is more about Dogsheep ("Dogsheep is a collection of tools
| for personal analytics using SQLite and Datasette."):
|
| https://dogsheep.github.io/
| jhardy54 wrote:
| I was surprised; I thought the article was going to be about
| Dogsheep, _especially_ since the article makes a point of
| linking to prior work.
|
| Absolutely no shade, it just goes to show how hard it is to
| keep track of All The Things.
| LeonB wrote:
| I've got a small number of links to such things here --
| (which is how I could track down dogsheep quickly)
|
| https://wiki.secretgeek.net/personal-data-liberation
| jonahbenton wrote:
| See also Perkeep, from Brad Fitzpatrick
|
| https://en.m.wikipedia.org/wiki/Perkeep
| vineyardmike wrote:
| I've always been really interested in Perkeep, and I think it
| gets a lot of things right architecturally. I'm always
| disappointed though whenever I go to use it. It seems like it's
| just missing one or two things, and there is no community that
| has grown around it. It's also kind of a pain to extend, because
| all of the types are hard-coded as structs within the project,
| so anything new basically needs to be written into the project
| rather than added on top.
|
| For example, the project docs are practically impossible to
| use. You need to review the source if you want to create a
| configuration file yourself (needed to use any data store
| besides sqlite index + local files). There are just conflicting
| explanations, outdated descriptions, etc.
|
| Additionally, the project spends a lot of time saying "objects
| not files" but the only objects defined and used by the system
| are... files.
| yevpats wrote:
| This is why we built CloudQuery
| (https://github.com/cloudquery/cloudquery), an open-source,
| high-performance ELT framework powered by Apache Arrow (the
| framework is open source; our connectors are closed source). You
| can run pipelines locally, write plugins (extractors) in Go,
| Python, JavaScript, or any other language, and save data to any
| destination (files, SQLite, DuckDB, PostgreSQL, ...).
|
| (Founder here)
| smolder wrote:
| The promise of computing that never materialized was having
| software to track everything you've ever done, and leveraging
| that data for _your own_ benefit and no one else's.
| nilirl wrote:
| I made something for just this use. It's very simple, but it
| gives you all your data in compressed JSON. I use it myself
| mostly for diet and exercise, but I also log other things like
| the movies and books I'm reading. I made it because I wanted a
| nice interface to review my logs.
|
| https://www.idiotlamborghini.com/strategies/weave
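| (One plausible way to do "all your data in compressed JSON" -
| the tool's actual format isn't documented here, so this is just
| a gzip-plus-JSON sketch in Python:)

```python
import gzip
import io
import json

# Invented example entries; the real tool's fields will differ.
entries = [
    {"date": "2024-08-07", "type": "exercise", "note": "5k run"},
    {"date": "2024-08-08", "type": "diet", "note": "no sugar"},
]

# Write the log as gzip-compressed JSON (in memory here; a file works
# the same way with gzip.open).
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(json.dumps(entries).encode("utf-8"))
compressed = buf.getvalue()

# Read it back for review.
loaded = json.loads(gzip.decompress(compressed))
```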
| jauntywundrkind wrote:
| I'm struggling to remember/find the exact story, but about 6
| months ago some dev had built a guestlist or Q&A tool using
| some off-the-shelf Notion-y thing - maybe some form-builder
| tool.
|
| There was, seemingly, IMHO a lot of protest that the dude
| didn't make some kind of one-off, "simple" PHP script or
| something for the job.
|
| But they'd used existing tools to make a real data pipeline,
| and potentially could keep making new tools around similar
| pipelines. They had invested in some pipe-building technology,
| and it felt like no one was interested in giving credit for
| that.
|
| Separately, Karlicoss has HPI as their personal data
| toolkit/pipeline, and a massive map of services & data systems
| they've roped into HPI (and HPI-near) systems.
| https://beepb00p.xyz/hpi.html https://beepb00p.xyz/myinfra.html
| compsciphd wrote:
| About a decade ago, I took on a personal toy project to try to
| teach myself larger-scale programming in Java, as well as the
| APIs provided by multiple internet services (Google, Twitter,
| Facebook, ...).
|
| The project was to try to collect and make searchable my
| "internet self", something I called "personal search". I.e.,
| the idea was to index every web page I look at, every e-mail I
| get/see, and social media content shared with me by my social
| graph (using said APIs), further indexing the shared pages.
|
| The indexing itself wasn't the hard part, per se. (At the time,
| the Facebook, Twitter, et al. APIs were very expansive; they
| are much more limited these days. One can attempt to deal with
| that via intelligent DOM scraping, but that's a never-ending
| race where your sources are constantly changing things and you
| are chasing their changes.) The question is: how does this
| information really create significant value for me? I.e., how
| often am I going to actually search this personal archive? I
| search Google many times a day to find new things (or re-find
| things I already found through it), but how often do I search
| for things that are within my personal index during a normal
| day? A handful of times at most, I'd think (and many days, not
| even once).
|
| With that said, the concept I then decided to explore was a
| browser extension that could do a similarity search (of sorts)
| between the documents in my personal index and the web page I'm
| currently looking at, plus content related to it (e.g., when
| looking at a news article about current events, the idea is
| that it should surface other articles your friends have shared
| on the topic and their comments on it). That ended up being an
| area I didn't have the time (or expertise) to really go far
| with, so it sort of ended there.
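| (A toy sketch of that similarity search, assuming nothing about
| the original Java project: plain bag-of-words cosine similarity
| between the current page and indexed shared documents, with
| invented document contents:)

```python
import math
import re
from collections import Counter

def vectorize(text):
    # Bag-of-words term counts; a real index would use TF-IDF and stemming.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Invented personal index of friend-shared documents.
index = {
    "friend-share-1": "election results analysis and commentary",
    "friend-share-2": "banana bread recipe with walnuts",
}

# Rank indexed documents by similarity to the page being viewed.
page = vectorize("analysis of the election results")
ranked = sorted(index, key=lambda k: cosine(page, vectorize(index[k])),
                reverse=True)
```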
| noelwelsh wrote:
| I always wonder what people would do with this data. I have a
| more "tears in the rain" approach. I just don't think there is
| that much value in, say, my old workout logs. I cannot see how
| capturing and analyzing it all would make my life much better. I
| feel that if there were one crazy hack that would, say, make me
| stronger, the people whose job is to get as strong as possible
| would have already found it. (It's probably PEDs.)
| zimpenfish wrote:
| > I just don't think there is that much value in, say, my old
| workout logs.
|
| To you, perhaps, but witness the excitement of historians and
| archaeologists when they find things like the Ea-nasir
| complaint[0] or an Akkadian shopping list[1].
|
| You never know what might be interesting to future peoples.
|
| [0] https://en.wikipedia.org/wiki/Complaint_tablet_to_Ea-nasir
|
| [1] https://news.artnet.com/art-world/akkadian-cuneiform-
| tablet-...
| nicbou wrote:
| In my case, I use it as an enhanced photo stream slash diary.
| It lets me see what I was up to on a given day. I see my
| location, my photos, my diary, my search queries and my
| transactions, among other things.
| burakemir wrote:
| For the subproblem of being able to unify and query various data
| sources in different formats, I would suggest taking a look at
| Datalog, and specifically Mangle, my implementation of it. I
| don't want to plug the project here so much as describe the
| approach.
|
| Usually your data will comfortably fit in a file. Your data
| getter emits these files as facts (essentially relations). If
| you want structured data, a fact can also have a single column
| that is of some struct type (similar to protobuf).
|
| With all data available, the problem becomes one of querying.
| With a good enough query language and system, you can write
| these data transformations via Datalog rules, which roughly
| correspond to database views.
|
| It is always possible to write queries in code in a
| general-purpose language, but that is a bit clumsy and hard to
| get an overview of or reuse. It may also be possible to use
| SQL, but SQL is not very compositional, and you ask yourself
| whether the base data representation should be adapted or
| refactored. Essentially you do not want to think about the
| optimal schema or set of structs, but just do the
| transformations you need in the lowest-friction way.
|
| With Datalog you may benefit from a unified representation
| (everything is "facts"), and the transformations to different
| useful formats (different kinds of facts) can be factored and
| reused. It may mean duplication and denormalization, but
| usually that does not matter.
|
| Mangle supports aggregation and even calling some custom
| functions during query evaluation. The repo is at
| https://github.com/google/mangle and obviously there remains a
| lot to do; the API is unstable, there are bugs, and the type
| checker is not finished... but a number of people and projects
| seem to use it. Even if you do not use it, it may give you an
| idea of how to use facts (relations) as a unified data
| structure for your project.
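| (To make the "everything is facts" idea concrete - a rough
| Python analogue; Mangle itself uses Datalog syntax, and the
| relation names here are invented:)

```python
# Facts are tuples in relations (sets of tuples).
played = {("Song A", "2024-08-07"), ("Song B", "2024-08-08")}
in_transit = {("2024-08-07",)}

# A rule derives a new relation from existing ones, roughly:
#   transit_song(Track) :- played(Track, Day), in_transit(Day).
transit_song = {(track,) for (track, day) in played if (day,) in in_transit}
```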
| kkfx wrote:
| Well... I have mine in Emacs/org-mode/org-roam managed notes,
| all in a single integrated platform with no need for extra code,
| together with my mails (notmuch), contacts (org-contacts),
| financial transactions (beancount), files (org-attach-ed), etc.,
| down to the infra/OS config (NixOS tangled from org-mode
| org-babel blocks, as with the Emacs, zsh, mplayer, ... configs).
|
| The point is that with classic tools this IS EASY, limited only
| by the current sorry state of IT; with modern tools it is a
| nightmare that demands much effort.
| kaz-inc wrote:
| I have a project I've built that's somewhat like this,
| ironically called Pipeline [0]. It's a manual-entry,
| timestamped note-taking system, and the UI is like messaging
| yourself. I've set it up over a WireGuard VPN server and it
| connects all of my devices, it works offline as a PWA, and I've
| tested it on Chrome/Firefox/Safari on
| iOS/Linux/Android/macOS/Windows. It mostly works on all of
| those platforms, and some of my friends/family use it to take
| notes for themselves.
|
| The fundamental query I usually use is substring search. The
| only content is text, because I believe in the primacy of
| plaintext. The notes for the last 4 years of my life take up 60
| megs; it takes half a second on a 5-year-old Android phone to
| parse all of it, and less than 50ms to search through all of
| it, so I can do it on every keystroke, incrementally.
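| (A toy Python sketch of that per-keystroke substring search over
| timestamped plaintext notes - Pipeline itself is JavaScript, and
| these note entries are invented:)

```python
# In-memory timestamped notes; the real app parses them from plaintext.
notes = [
    ("2024-08-07T18:06", "met with the pipeline study group"),
    ("2024-08-08T09:00", "grocery run: bananas, oats"),
    ("2024-08-09T21:30", "pipeline bug: service worker cache"),
]

def search(query):
    # Case-insensitive substring search; cheap enough over a few tens of
    # megabytes to run on every keystroke.
    q = query.lower()
    return [ts for ts, text in notes if q in text.lower()]
```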
|
| [0] Pipeline Notes: https://github.com/kasrasadeghi/pipeline-js
|
| I'm not a web developer by trade, so if anyone has any feedback
| on security/UI/service workers, please let me know!
| nicbou wrote:
| I have built a timeline thing to gather all of my data as an
| augmented diary.
|
| https://nicolasbouliane.com/projects/timeline
|
| The newer version is basically a static site generator running on
| top of my data. The older version was actively fetching the data
| from various sources, but it was getting a little unwieldy.
|
| The biggest challenge is to automatically get your data out of
| Google Photos, social networks, and even your own phone. All of
| my handwritten notes and my sketches are stuck on my iPad and
| must be manually exported. It's tedious and unsustainable.
|
| Same with social networks. Data must be manually exported. There
| is also no simple, human-readable file format for social media
| posts. You have to parse the export format they choose to use at
| the moment.
___________________________________________________________________
(page generated 2024-08-11 23:02 UTC)