[HN Gopher] Show HN: I built an open-source data copy tool calle...
___________________________________________________________________
Show HN: I built an open-source data copy tool called ingestr
Hi there, Burak here. I built an open-source data copy tool called
ingestr (https://github.com/bruin-data/ingestr).

I have built quite a few data warehouses, both for the companies I
worked at and for consultancy projects. One of the more common pain
points I observed was that everyone had to rebuild the same data
ingestion bit over and over again, each in a different way:

- some wrote the ingestion code from scratch, to varying degrees

- some used off-the-shelf data ingestion tools like Fivetran /
Airbyte

I have always disliked both of these approaches, for different
reasons, but never got around to working on what I'd imagine to be
the better way forward.

The solutions that required writing code for copying the data came
with quite a bit of overhead: how to generalize them, what
language/library to use, where to deploy, how to monitor, how to
schedule, etc. I ended up figuring out solutions for each of these,
but the process always felt suboptimal. I like coding, but for more
novel stuff than copying a table from Postgres to BigQuery. There
are libraries like dlt (awesome lib btw, and awesome folks!), but
they still required me to write, deploy, and maintain the code.

Then there are solutions like Fivetran or Airbyte, where there's a
UI and everything is managed through it. While it was nice not
having to write code for copying the data, I still had to either
pay some unknown/hard-to-predict amount of money to these vendors
or host Airbyte myself, which is roughly back to square one (for
me, since I want to maintain the least amount of tech myself).
Nothing was versioned, people were changing things in the UI and
breaking the connectors, and what worked yesterday didn't work
today.

I had a bit of spare time a couple of weeks ago and wanted to take
a stab at the problem. I had been thinking about standardizing the
process for quite some time, and dlt had some abstractions that
allowed me to quickly prototype a CLI that copies data from one
place to another. I made a few decisions (that I hope I won't
regret in the future):

- everything is a URI: every source and every destination is
represented as a URI

- only one thing is copied at a time: a single table within a
single command, not a full database with an unknown number of
tables

- incremental loading is a must, but doesn't have to be super
flexible: I decided to support full-refresh, append-only, merge,
and delete+insert incremental strategies, because I believe this
covers 90% of the use cases out there

- it is CLI-only and can be configured with flags & env variables
so that it can be automated quickly, e.g. dropped into GitHub
Actions and run daily (an example invocation is shown below)
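To give a concrete picture, a single run that copies one table from
Postgres into BigQuery looks roughly like the following. The flag
names here are illustrative and the connection details are made up;
the README documents the exact options.

    # copy public.events from Postgres into raw.events in BigQuery,
    # picking up new/changed rows by updated_at and merging them on id
    ingestr ingest \
      --source-uri 'postgresql://admin:admin@localhost:5432/mydb' \
      --source-table 'public.events' \
      --dest-uri 'bigquery://my-project?credentials_path=/path/to/service_account.json' \
      --dest-table 'raw.events' \
      --incremental-strategy 'merge' \
      --incremental-key 'updated_at' \
      --primary-key 'id'

Since everything is on the command line, the same invocation can be
dropped into a cron job or a GitHub Actions workflow without any
extra glue.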
The result ended up being `ingestr`
(https://github.com/bruin-data/ingestr). I am pretty happy with how
the first version turned out, and I plan to add support for more
sources & destinations. ingestr is built to be flexible across
various source and destination combinations, and I plan to
introduce more non-DB sources such as Notion, GSheets, and custom
APIs that return JSON (I am not sure how exactly I'll do that, but
I'm open to suggestions!).

To be perfectly clear: I don't think ingestr covers 100% of data
ingestion/copying needs out there, and it doesn't aim to. My goal
is to cover most scenarios with a decent set of trade-offs, so that
common scenarios can be solved easily without having to write code
or manage infra. There will be more complex needs that require
engineering effort from others, and that's fine.

I'd love to hear your feedback on how ingestr could serve your data
copying needs better. Looking forward to hearing your thoughts!

Best, Burak
Author : karakanb
Score : 94 points
Date : 2024-02-27 15:18 UTC (7 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| matijash wrote:
| This looks pretty cool! What was the hardest part about building
| this?
| karakanb wrote:
| hey, thanks!
|
| I guess there were a couple of things that I found tricky:
|
| - deciding on the right way to represent sources and
| destinations was hard; before landing on URIs I thought of
| using config files, but that would have added extra complexity
|
| - the platforms had different quirks concerning different data
| types
|
| - dlt stores state on its own, which means that re-runs don't
| start from scratch after changing the incremental strategy and
| instead require a full refresh; it took me quite some time to
| figure out how exactly to work with it
|
| I think among these the hardest part was to get myself to build
| and release it, because I had it in my mind for a long time and
| it took me a _long while_ to build and share it :)
| jrhizor wrote:
| I like the idea of encoding complex connector configs into URIs!
| karakanb wrote:
| Glad to hear that! I am not 100% sure if it'll look pretty for
| all platforms but I hope it'll be an okay base to get started!
| javajosh wrote:
| Perhaps OP re-invented it, but it's been around for a long time
| in the Java world via JDBC URLs. See, for example, this writeup:
| https://www.baeldung.com/java-jdbc-url-format
| karakanb wrote:
| I don't think I invented anything tbh, I just relied on
| SQLAlchemy's URI formats, and I decided to abuse it slightly
| for even more config.
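|
| For example, a Postgres source and a BigQuery destination can be
| expressed roughly like this (credentials and paths are made up,
| and the exact query parameters differ per platform; the README
| documents each one):
|
|     postgresql://admin:admin@localhost:5432/mydb
|     bigquery://my-project?credentials_path=/path/to/service_account.json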
| kipukun wrote:
| Do you think you'll add local file support in the future? Also,
| do you have any plans on making the reading of a source parallel?
| For example, connectorx uses an optional partition column to read
| chunks of a table concurrently. Cool how it's abstracted.
| karakanb wrote:
| I am working on file support right now as a destination to
| begin with. I believe I should get local files as well as
| S3-compatible sources going by tonight.
|
| Reading the sources in parallel is an interesting idea, I'll
| definitely take a look at it. ingestr supports incremental
| loads by a partitioned column, but there's no parallelized
| partition reading at the moment.
|
| Thanks a lot for your comment!
| adawg4 wrote:
| I second this!
| karakanb wrote:
| I have just released v0.1.2 which supports CSV destinations
| with the URI format `csv://path/to/file.csv`, hope that's
| helpful!
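|
| As an illustration, dumping a Postgres table into a local CSV
| file looks roughly like this (connection details are made up;
| the README has the exact flags):
|
|     # export public.orders from Postgres into a local CSV file
|     ingestr ingest \
|       --source-uri 'postgresql://admin:admin@localhost:5432/mydb' \
|       --source-table 'public.orders' \
|       --dest-uri 'csv://exports/orders.csv'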
| simonw wrote:
| I was surprised to see SQLite listed as a source but not as a
| destination. Any big reasons for that or is it just something you
| haven't got around to implementing yet?
|
| I've been getting a huge amount of useful work done over the past
| few years sucking data from other systems into SQLite files on my
| own computer - I even have my own small db-to-sqlite tool for
| this (built on top of SQLAlchemy) -
| https://github.com/simonw/db-to-sqlite
| karakanb wrote:
| I do use the dlt library to support as many sources &
| destinations as possible, and they do not support SQLite as of
| today. I am interested in supporting SQLite simply because I
| love it as well, so that's definitely on the roadmap.
|
| db-to-sqlite looks lovely, I'll see if I can learn a thing or
| two from it!
| mritchie712 wrote:
| looks like dlt (which this is a wrapper around) doesn't support
| it as a destination
|
| https://dlthub.com/docs/dlt-ecosystem/destinations/
| MatthausK wrote:
| one of the dltHub founders here - we aim to address this in
| the coming weeks
| chinupbuttercup wrote:
| This looks pretty cool. Is there any schema management included
| or do schema changes need to be in place on both sides first?
| karakanb wrote:
| It does handle schema evolution wherever it can, including
| inferring the initial schema automatically based on the source
| and the destination, which means there's no need for manual
| schema changes anywhere; it will keep the schemas in sync
| wherever it can.
| ab_testing wrote:
| If you can add CSV as a source and a destination, it will
| increase the usefulness of this product manifold.
|
| There are many instances where people either have a csv that they
| want to load into a database or get a specific database table
| exported into csv.
| karakanb wrote:
| I agree, I am looking into it right now!
| karakanb wrote:
| I have just released v0.1.2 which supports CSV destinations
| with the URI format `csv://path/to/file.csv`, hope that's
| helpful!
| karakanb wrote:
| Also released local CSV file as a source in v0.1.3.
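|
| Going the other way, loading a local CSV into a Postgres table
| looks roughly like this (connection details are made up; the
| README has the exact flags):
|
|     # load a local CSV file into public.orders in Postgres
|     ingestr ingest \
|       --source-uri 'csv://data/orders.csv' \
|       --source-table 'orders' \
|       --dest-uri 'postgresql://admin:admin@localhost:5432/mydb' \
|       --dest-table 'public.orders'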
| skanga wrote:
| Hi Burak, I saw cx_Oracle in the requirements.txt but the support
| matrix did not mention it. Does this mean Oracle is coming? Or a
| typo?
| karakanb wrote:
| I added it as an experimental source a few hours ago, but I
| haven't had the chance to test it, that's why I haven't put it
| into the support matrix yet. Do you mind trying it out if you
| do use Oracle?
| Phlogi wrote:
| I'd love to see support for ODBC - any plans?
| karakanb wrote:
| Do you mean SQL Server? If that's the case, ingestr is already
| able to connect to Microsoft SQL Server and use it both as a
| source and a destination.
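|
| For reference, a SQL Server URI would look something like the
| following (server details are made up, and the driver parameter
| depends on the ODBC driver installed on your machine):
|
|     mssql://sa:password@localhost:1433/mydb?driver=ODBC+Driver+18+for+SQL+Server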
| PeterZaitsev wrote:
| Looks great Burak! Appreciate your contribution to Open Source
| Data ecosystem!
| karakanb wrote:
| thanks a lot Peter!
| hermitcrab wrote:
| I am very interested in data ingestion. I develop a desktop data
| wrangling tool in C++ (Easy Data Transform). So far it can
| import files in various formats (CSV, Excel, JSON, XML etc). But
| I am interested in being able to import from databases, APIs and
| other sources. Would I be able to ship your CLI as part of my
| product on Windows and Mac? Or can someone suggest some other
| approach to importing from lots of data sources without coding
| them all individually?
| karakanb wrote:
| hmm, that's an interesting question, I don't know the answer to
| be honest. Are you able to run external scripts on the device?
| If so, you might be able to install & run ingestr with a CSV
| destination (which I released literally 2 mins ago), but that
| seems like a lot of work as well, and will probably be way
| slower than your C++ application.
|
| Maybe someone else has another idea?
| hermitcrab wrote:
| I can start a CLI as a separate process. But ingesting to CSV
| and then reading the CSV would be slow. Maybe it would be
| better to ingest into DuckDB or into an in-memory Arrow format.
| If anyone has any other suggestions, I am all ears.
| ijidak wrote:
| Is there a reason CSV (as a source) isn't supported? I've been
| looking for exactly this type of tool, but that supports CSV.
|
| CSV support would be huge.
|
| Please please please provide CSV support. :)
| karakanb wrote:
| I have released v0.1.2 with the destination literally minutes
| ago, I'll take a look at CSV as a source!
|
| just so that I have a better understanding, do you mind
| explaining your use case?
| karakanb wrote:
| I have just released support for local CSV files as a source in
| v0.1.3, let me know if this helps! :)
| parkcedar wrote:
| This looks awesome. I had this exact problem just last week and
| had to write my own tool to perform the migration in Go. After
| creating the tool I thought this must be something others would
| use - glad to see someone beat me to it!
|
| I think it's clever to keep the tool simple and only copy one
| table at a time. My solution was to generate code based on an
| SQL schema, but it was going to be messy and require more user
| introspection before the tool could be run.
| karakanb wrote:
| thanks a lot for your comment, glad to hear we converged on a
| similar idea! :)
| yanko wrote:
| Db2 is missing, as if it weren't a database that exists in the
| real world.
| karakanb wrote:
| it's on my roadmap for sure!
___________________________________________________________________
(page generated 2024-02-27 23:01 UTC)