[HN Gopher] Show HN: Open source data discovery and observabilit...
___________________________________________________________________
Show HN: Open source data discovery and observability platform
Author : ndementev
Score : 93 points
Date : 2022-10-22 12:14 UTC (10 hours ago)
(HTM) web link (opendatadiscovery.org)
(TXT) w3m dump (opendatadiscovery.org)
| AnEro wrote:
| I can't wait until documentation gets filled out enough to see if
| I want to spend an afternoon importing all my pipelines to then
| see if it's useful for me
| germanosin wrote:
| You can find detailed documentation here
| https://docs.opendatadiscovery.org/. We would love to hear any
| feedback or questions you may have about our product! Here is a
| link to our Slack community
| https://go.opendatadiscovery.org/slack
| AnEro wrote:
| I did see that and was disappointed. It described the
| features in full, but I was looking to see the features being
| set up, to judge whether it is something I can easily integrate.
| Don't get me wrong, I love the idea, and it's young, so I don't
| fault the team, but I'm not going to get a Docker container
| running just to find out I have to reformat all my tests or
| write way too many new connections to enable any of these
| features.
| ndementev wrote:
| I see your point. As you mentioned, the product is rather
| young and we continue to develop it. I agree that
| documentation is one of the fundamental parts. Thank you
| for your input, we will find a way to make the
| documentation more straightforward and useful for cases
| such as the one you've described.
|
| Perhaps you would be interested in a call with us where we
| can answer all your questions including integration with
| your infrastructure, provide help configuring the platform
| if needed, etc?
| AnEro wrote:
| It's not confusing, it's well written, I just want more, but
| I get it's a community project and it takes time to do
| all this stuff for free. I'm just ~not so~ patiently
| waiting for docs aimed more at the person looking at setup
| than at the conceptual side.
| ndementev wrote:
| Well, for setting the platform's features up there's this
| page: https://docs.opendatadiscovery.org/configuration-
| and-deploym...
|
| It explains how to set up the platform so that
| certain features are enabled or disabled. Maybe you will
| find this useful.
|
| If I got you wrong, it'd be great if you could provide an
| example.
| AnEro wrote:
| Well, like adding a test from pandas/Great Expectations:
| software looks great when it's set up and the tests are already
| added. Trying to add one myself, I just have to imagine how.
| Great Expectations makes sense, probably an API hook gets set
| up. But I'm mostly using pandas, so how do I add one? Is
| it going to be more work to use existing tests? If so, how
| much? Really, I'm also trying to estimate how long it takes to
| set up everything from my pipeline inside the product.
| Since I want a junior doing a lot of this as well, is this
| going to be weirdly hard for a math uni grad? You know?
|
| In the online demo I didn't see a way to do that with that
| type of account, which I don't mind; it's just part of gauging
| how long it would take to go from zero to running to useful for
| the company.
|
| Thanks for the help, I'm going to keep an eye out for the
| product in the future for sure. It's just at the point where it
| is still more work for me to find out whether I want to do the
| work to use it.
| ndementev wrote:
| Gotcha!
|
| Thank you for the input, we are going to work on this.
| Cilvic wrote:
| Looks really useful. I'm confused about pricing.
|
| There is:
|
| - the "schedule a call" which sounds like there is some paid
| version of this
|
| - "Free an open source"
|
| Is this a volunteer project but you still offer to take a call?
| germanosin wrote:
| Hi Cilvic, this is an open source product. You can use it for
| free. If you have any questions or need any assistance, we would
| be happy to help you, and at the same time we hope you'll help
| our product with your feedback and real-world use cases.
| Cilvic wrote:
| Thanks. In the YouTube video you mention ODD v4 with
| "enterprise features". Do you plan to release paid enterprise
| features?
| germanosin wrote:
| No, that referred to role-based access control and support for
| enterprise databases.
| skrtskrt wrote:
| How much does the use case for this intersect with Pachyderm?
| ndementev wrote:
| While Pachyderm (a great product, by the way) helps teams
| automate transformation tasks, ODD is more of a
| discovery/observability/monitoring solution for your pipelines.
| Basically, if Pachyderm helps you build a pipeline, ODD helps
| you monitor all of your pipelines in the context of _your
| whole data infrastructure_.
| Cilvic wrote:
| I really like how you describe things from "use cases". What's
| missing for me is a clear highlight of what is part of ODD.
|
| For example: all the steps under 3. are not part of ODD, or are
| they?
|
| Only step 1 is performed in ODD, yes?
|
| Personally, I'm mostly interested in lineage and would love a
| use case that explains real-world lineage. Say we have
| Redshift/Postgres and a Tableau with a dataset. How is the
| lineage generated or manually maintained?
|
| Anyways great effort.
| ndementev wrote:
| Thank you for your kind words!
|
| May I ask what you mean by "all steps under 3"?
| Are you referring to
| https://docs.opendatadiscovery.org/use_cases/dq_visibility?
|
| As for the
|
| > How is the lineage generated or manually maintained
|
| All lineage in the platform is generated and _not_ manually
| handled by users in the UI. We rely on the ODD Specification
| (https://github.com/opendatadiscovery/opendatadiscovery-speci...),
| and all ODD Collectors (agents that scrape metadata
| from your data sources) send their payload to the ODD Platform
| in this specification's format. The ODD Specification introduces
| something called ODDRN -- OpenDataDiscovery Resource Names.
| These are basically strings that identify specific data
| entities. All ODD Collectors generate the same identifiers for
| the same entities, which allows us to build a lineage graph
| automatically in the ODD Platform.
|
| Not letting users manually change lineage in the UI is
| kind of our answer to one of the lineage problems. This way
| users can be sure that the lineage is correct, up to date, and
| that no one has tampered with it, at least in the UI.
|
| Of course, since there's a described API endpoint, there's a way
| to change the lineage by sending a request on your own (e.g.
| via curl or a custom script), but I wouldn't call that manual.
| This approach allows companies and users to write their own
| integrations, which keeps the system open.
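
A minimal sketch of the identifier idea described in the comment above: if independent collectors derive the same string for the same entity, lineage edges can be joined without any manual mapping. The function names, the payload shape, and the identifier layout below are illustrative assumptions, not the actual ODD Specification or ODDRN format.

```python
from collections import defaultdict


def make_resource_name(source_type, host, *path):
    """Hypothetical helper: derive a deterministic identifier string
    from an entity's coordinates (not the real ODDRN grammar)."""
    # Normalise the host so every collector produces the same string.
    return "//" + "/".join([source_type, host.lower(), *path])


# Collector A scrapes the database and reports the table itself.
table_id = make_resource_name(
    "postgresql", "db.internal", "databases", "sales", "tables", "orders"
)

# Collector B scrapes the BI tool; it reports a dashboard and names its input.
dashboard_id = make_resource_name("tableau", "bi.internal", "dashboards", "revenue")
input_id = make_resource_name(
    "postgresql", "DB.internal", "databases", "sales", "tables", "orders"
)

# Hypothetical payloads, one per collector.
payloads = [
    {"id": table_id, "inputs": []},
    {"id": dashboard_id, "inputs": [input_id]},
]


def build_lineage(items):
    """Map each upstream identifier to the set of entities that consume it."""
    edges = defaultdict(set)
    for entity in items:
        for upstream in entity["inputs"]:
            edges[upstream].add(entity["id"])
    return edges


lineage = build_lineage(payloads)
```

Because both collectors arrive at the same string for the `orders` table, the dashboard is attached to the table's node even though the two collectors never communicate with each other.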
| Cilvic wrote:
| Sorry I meant this page
| https://docs.opendatadiscovery.org/use_cases/viz_preparation
|
| If lineage is as automatic as you say, that wasn't clear to me
| after reading. Thanks for explaining!
| ndementev wrote:
| Gotcha!
|
| We are continuing to work on the documentation, thank you for
| bringing this up! We'll take a look at how this can be
| improved.
|
| > For example: all the steps under 3. are not part of ODD,
| or are they? Only step 1 is performed in ODD, yes?
|
| Yes, that's correct. In this scenario ODD acts as a source
| of knowledge about the problem.
| jethkl wrote:
| I see the motivation and skill of your group, and I hope you can
| retain that and build a useful contribution. I also see the
| extraordinary effort required to get to this point, and many
| projects fail to get this far, so congratulations. Something is
| obviously going well.
|
| However, I am unmoved by your list of key wins (details below).
| If you indeed built something useful, is there a different way to
| deliver your message about the functionality that you enable?
|
| Here are my reactions:
|
| 1) Shorten data discovery phase. In my experience, analysts and
| data scientists are always very familiar with what relevant data
| exists, or else they can find the right people to acquire what
| data they need. Often, kick-off meetings for new projects cover
| with stakeholders which data is useful.
|
| 2) Have transparency on how and by whom the data is used. For
| publicly available data, this is not something that a company
| usually cares about. Internal and proprietary data management is
| already a very mature space, and every company with such data
| already has processes in place to manage data access. I grant
| this is often a mess, but I also don't see any global solution on
| the horizon.
|
| 3) Foster data culture by continuous compliance and data quality
| monitoring. Data quality monitoring is extremely complex. I have
| seen many claims over many years of tools that solve this problem
| broadly, but I have yet to see any solution that matches the
| claims.
|
| 4) Accelerate data insights. This is a very bold claim for a new
| project, especially given the many (5+) decades of work and
| experience developing tools and techniques for data insights.
|
| 5) Know the sources of your dashboards and ad hoc reports. All
| dashboards I am aware of surface this sort of information.
|
| 6) Deprecate outdated objects responsibly by assessing and
| mitigating the risks. This is a good idea, but it is challenging
| in practice, as illustrated by several prominent examples
| [1,2,3].
|
| Finally, unrelated to the above, your project's name (ODD) is
| very similar to the name used by the Outlier Detection DataSets
| (ODDS) project [4]
|
| Good luck.
|
| [1] http://www.lenna.org/editor.html
|
| [2] https://scikit-learn.org/stable/modules/generated/sklearn.da...
|
| [3] https://deepai.org/dataset/fb15k and
| https://paperswithcode.com/dataset/fb15k-237
|
| [4] http://odds.cs.stonybrook.edu
| ndementev wrote:
| Thank you for your kind words and constructive feedback! We
| appreciate it.
|
| Let me cover some of your reactions from my perspective as a
| Data Engineer. Please feel free to add your opinion on those.
|
| > Shorten data discovery phase. In my experience, analysts and
| data scientists are always very familiar with what relevant
| data exists, or else they can find the right people to acquire
| what data they need. Often, kick-off meetings for new projects
| cover with stakeholders which data is useful.
|
| You're right, but in my experience that's not always the case.
| Sometimes finding the key person or team responsible for a
| dataset can be challenging. I agree with you about the kick-off
| meeting, but it's not always a silver bullet. Data becomes
| outdated or deprecated all the time, and we are trying to solve
| the problem of telling everyone who may be affected about it as
| soon and as easily as possible.
|
| > Know the sources of your dashboards and ad hoc reports. All
| dashboards I am aware of surface this sort of information
|
| Again, you are right. All dashboard services and BI tools can
| show you which data they get from which data source. But
| in my experience it's sometimes useful to look at the
| _origin_ of the data a dashboard uses. This is where end-to-end
| lineage comes in handy. I also find it useful to have the
| metadata of all my dashboards from all of my company's BI tools
| in one place.
|
| > Deprecate outdated objects responsibly by assessing and
| mitigating the risks. This is a good idea, however, it is
| challenging
|
| Couldn't agree more. We are working not only to improve our way
| of solving this problem but also, where it makes sense, the
| solution itself. We are basically trying to find the right
| approach to this and offer it to everyone else. I know it's
| ambitious and really is a bold statement, but I hope we are
| getting there.
|
| Overall, thank you for your input!
|
| @germanosin, would you like to add something I may have missed?
| weekay wrote:
| Why do demos need Google or other login? This is such friction.
| I should be able to get access to demos without having to log in.
| Also, the use case for pre-sales is interesting. Has anyone
| really had any success during pre-sales with an enterprise
| customer agreeing to install a collector on their applications?
| [deleted]
| germanosin wrote:
| Hi, we're the team behind this product! We updated our demo to
| include social logins so that spam doesn't get through. There
| are no logs being collected, and we're not selling this
| information either. If you don't want the online version,
| just head over to
| https://github.com/opendatadiscovery/odd-platform/tree/main/...
| and run it locally using docker-compose.
| maddynator wrote:
| Thanks, folks. How do I contribute to the project?
| ndementev wrote:
| Thank you!
|
| We have a lot of repositories on GitHub, please feel free
| to pick any issue from the list. Do not hesitate to ask us
| anything in GitHub issue threads or in our Slack
| community. I'll provide links for your convenience:
|
| 1. ODD Platform GitHub:
| https://github.com/opendatadiscovery/odd-platform
|
| 2. Slack Community: https://go.opendatadiscovery.org/slack
|
| 3. Documentation with information on how to contribute:
| https://docs.opendatadiscovery.org/developer-guides/how-to-c...
| aschwad wrote:
| Interesting initiative! Do I understand correctly, that any push
| mechanism is done via the ODD API and pull mechanisms check on
| the schema of data sources? Do you already have a standard for
| providing ETL metadata? On which level of detail are you
| collecting this metadata?
|
| At BMW, the data catalogue is continuously growing and the number
| of datasets is increasing rapidly. We therefore had a similar
| problem: finding out how datasets relate to each other and how
| they are transformed --> we needed coarse- and fine-grained data
| lineage. We found a way by leveraging the Spline Agent
| (https://github.com/AbsaOSS/spline) to make use of the execution
| plans, transforming them into a data model that suits our
| requirements, and developing a UI to explore these relationships.
| We also open-sourced our approach in a
|
| - paper:
| https://link.springer.com/article/10.1007/s13222-021-00387-7
|
| - and blog post:
| https://medium.com/@alex.schoenenwald/fishing-for-data-linea...
| ndementev wrote:
| Thank you!
|
| Actually, everything in ODD works on a push basis now. The ODD
| Platform implements the ODD Specification
| (https://github.com/opendatadiscovery/opendatadiscovery-speci...),
| and all agents, custom scripts and integrations,
| Airflow/Spark listeners, etc. push metadata to a specific
| ODD Platform endpoint
| (https://github.com/opendatadiscovery/opendatadiscovery-speci...).
| ODD Collectors (agents) push metadata on a
| configurable schedule.
|
| The ODD Specification is a standard for collecting and gathering
| such metadata, ETL included. We currently gather metadata for
| lineage at the entity level, but we plan to expand this to
| column-level lineage at the end of 2022 or the start of 2023. The
| Specification allows us to keep the system open, and it's really
| easy to write your own integration by looking at the format in
| which metadata needs to be pushed into the Platform.
|
| The ODD Platform has its own OpenAPI specification
| (https://github.com/opendatadiscovery/odd-platform/tree/main/...)
| so that the already indexed and layered
| metadata can be extracted via the platform's API.
|
| Also, thank you for sharing the links with us! I'm thrilled to
| take a look at how BMW solved the problem of gathering lineage
| from Spark; that's something we are improving in our product
| right now.
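
The push model described in the comment above (agents scrape a source and send metadata to a platform endpoint on a configurable schedule) can be sketched roughly as follows. Everything here is hypothetical: `collect_metadata`, `run_collector`, the payload fields, and the injected `push` callable are stand-ins for illustration and do not reflect the real ODD Collector code or ingestion API.

```python
import json
import time


def collect_metadata():
    """Hypothetical scrape step: return metadata for the entities a source exposes."""
    return [{"id": "//example/source/tables/orders", "type": "TABLE"}]


def run_collector(push, interval_seconds, iterations, sleep=time.sleep):
    """Scrape and push on a fixed schedule.

    `push` is any callable that delivers a serialized payload; in a real
    integration it would be an HTTP request to the platform's ingestion
    endpoint, which this sketch deliberately leaves out.
    """
    for i in range(iterations):
        payload = json.dumps({"items": collect_metadata()})
        push(payload)
        # Wait between runs, but not after the final one.
        if i < iterations - 1:
            sleep(interval_seconds)


# Stand-in for the platform: just remember what was pushed.
received = []
run_collector(received.append, interval_seconds=0, iterations=2)
```

Passing `push` in as a parameter keeps the schedule logic separate from transport, so the same loop could serve a custom script or a test harness.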
| O__________O wrote:
| Related website:
|
| - https://opendatadiscovery.org/
|
| Demo (requires GitHub/Google login) + Demo Video
|
| - https://demo.oddp.io/login
|
| - https://youtube.com/watch?v=ZSa2FWAyUic
|
| Use cases:
|
| - https://docs.opendatadiscovery.org/use_cases
|
| Presentation on ODD (by HN user germanosin):
|
| - https://youtube.com/watch?v=Y0aFqHd4h3k
| dang wrote:
| Ok, we've changed the URL above to the project home page, since
| it gives more background info, from
| https://github.com/opendatadiscovery/odd-platform. Thanks!
___________________________________________________________________
(page generated 2022-10-22 23:00 UTC)