[HN Gopher] Show HN: Open source data discovery and observabilit...
       ___________________________________________________________________
        
       Show HN: Open source data discovery and observability platform
        
       Author : ndementev
       Score  : 93 points
       Date   : 2022-10-22 12:14 UTC (10 hours ago)
        
 (HTM) web link (opendatadiscovery.org)
 (TXT) w3m dump (opendatadiscovery.org)
        
       | AnEro wrote:
       | I can't wait until documentation gets filled out enough to see if
       | I want to spend an afternoon importing all my pipelines to then
       | see if it's useful for me
        
         | germanosin wrote:
         | You can find detailed documentation here
         | https://docs.opendatadiscovery.org/. We would love to hear any
         | feedback or questions you may have about our product! Here is a
         | link to our Slack community
         | https://go.opendatadiscovery.org/slack
        
           | AnEro wrote:
           | I did see that and was disappointed. It described the
           | features in full but I was looking to see the features set up
           | to see if it is something I can easily integrate or not.
           | Don't get me wrong, I love the idea and it's young, so I
           | don't fault the team, but I'm not going to get a Docker
           | container running just to find out I have to reformat all my
           | tests or write way too many new connections to enable any of
           | these features.
        
             | ndementev wrote:
             | I see your point. As you mentioned the product is rather
             | young and we continue to develop it. I agree that
             | documentation is one of the fundamental parts. Thank you
             | for your input, we will find a way to make the
             | documentation more straightforward and useful for cases
             | such as the one you've described.
             | 
             | Perhaps you would be interested in a call with us where we
             | can answer all your questions including integration with
             | your infrastructure, provide help configuring the platform
             | if needed, etc?
        
               | AnEro wrote:
               | It's not confusing, it's well written; I just want more.
               | But I get it's a community project and it takes time to do
               | all this stuff for free. I'm just ~not so~ patiently
               | waiting for docs aimed more at the person looking at setup
               | than at the conceptual side.
        
               | ndementev wrote:
               | Well, for setting the platform's features up there's this
               | page: https://docs.opendatadiscovery.org/configuration-
               | and-deploym...
               | 
               | It explains how to set up the platform so that certain
               | features are enabled or disabled. Maybe this is something
               | you will find useful.
               | 
               | It'd be great if you could provide an example, if I got
               | you wrong
        
               | AnEro wrote:
               | Well, like adding a test from pandas/Great Expectations.
               | Software looks great when it's set up and the tests are
               | already added, but when trying to add one myself I just
               | have to imagine. Great Expectations makes sense, probably
               | an API hook to set up. But I'm using pandas mostly, so how
               | do I add one? Is it going to be more work to use existing
               | tests? If so, how much? Really I'm trying to put a
               | timeline on setting up everything from my pipeline inside
               | the product as well. Since I want a junior doing a lot of
               | this as well, is this going to be weirdly hard for a math
               | uni grad? You know?
               | 
               | In the online demo I didn't see a way to do that with that
               | type of account, which I don't mind; it's just part of
               | gauging how long it would take to go from 0 to running to
               | useful for the company.
               | 
               | Thanks for the help, I'm going to keep an eye out for the
               | product in the future for sure. It's just at the point
               | where it is still more work for me to see if I want to do
               | the work to use it.
        
               | ndementev wrote:
               | Gotcha!
               | 
               | Thank you for the input, we are going to work on this.
        
       | Cilvic wrote:
       | Looks really useful. I'm confused about pricing.
       | 
       | There is:
       | 
       | - the "schedule a call" which sounds like there is some paid
       | version of this
       | 
       | - "Free and open source"
       | 
       | Is this a volunteer project but you still offer to take a call?
        
         | germanosin wrote:
         | Hi Cilvic, this is an open source product. You can use it for
         | free. If you have any questions or need any assistance we would
         | be happy to help you, and at the same time we hope you'll help
         | our product with your feedback and real-world use cases.
        
           | Cilvic wrote:
           | Thanks, in the YouTube video you mention ODD v4 with
           | "enterprise features". Do you plan to release paid enterprise
           | features?
        
             | germanosin wrote:
             | No, those were role-based access control and support for
             | enterprise databases.
        
       | skrtskrt wrote:
       | How much does the use case for this intersect with Pachyderm?
        
         | ndementev wrote:
         | While Pachyderm (a great product by the way) helps teams to
         | automate transformation tasks, ODD is more of a
         | discovery/observability/monitoring solution for your pipelines.
         | Basically if Pachyderm helps you to build a pipeline, ODD helps
         | you to monitor all of your pipelines in the context of _your
         | whole data infrastructure_.
        
       | Cilvic wrote:
       | I really like how you describe things from "use cases". What's
       | missing for me is a clear highlight of what is part of ODD.
       | 
       | For example: all the steps under 3. are not part of ODD, or are
       | they?
       | 
       | Only step 1 is performed in ODD, yes?
       | 
       | Personally, I'm mostly interested in lineage and would love a
       | use case that explains real-world lineage. Say we have
       | Redshift/Postgres and a Tableau with a dataset. How is the
       | lineage generated or manually maintained?
       | 
       | Anyways great effort.
        
         | ndementev wrote:
         | Thank you for your kind words!
         | 
         | May I ask what you mean by "all steps under 3"?
         | Are you referring to
         | https://docs.opendatadiscovery.org/use_cases/dq_visibility?
         | 
         | As for the
         | 
         | > How is the lineage generated or manually maintained
         | 
         | All lineage in the platform is generated and _not_ manually
         | handled by the user in the UI. We are leveraging the ODD
         | Specification
         | (https://github.com/opendatadiscovery/opendatadiscovery-
         | speci...), and all ODD Collectors (agents that scrape metadata
         | from your data sources) send their payloads to the ODD Platform
         | in this specification's format. The ODD Specification
         | introduces something called ODDRN -- OpenDataDiscovery Resource
         | Names. These are basically strings, identifiers of specific
         | data entities. All ODD Collectors generate the same identifiers
         | for the same entities, allowing us to automatically build a
         | lineage graph in the ODD Platform.
         | 
         | Not letting users manually change lineage in the UI is
         | kinda our solution to one of the lineage problems. This way
         | users can be sure that the lineage is correct, up to date, and
         | that no one has messed with it, at least in the UI.
         | 
         | Of course, since there's a documented API endpoint, there's a
         | way to change the lineage by sending a request on your own
         | (e.g. via curl or a custom script), but I wouldn't call it
         | manual. This approach allows companies and users to write their
         | own integrations, making the system open.
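         | 
         | To illustrate the identifier idea described in this comment,
         | here is a minimal Python sketch of how shared identifiers let a
         | lineage graph be assembled without manual editing. It is not
         | ODD's actual code: the payload shape, the `build_lineage` and
         | `trace_downstream` helpers, and the identifier strings are all
         | assumptions made up for the example. Real ODDRNs follow the
         | format defined in the ODD Specification.

```python
from collections import defaultdict

# Hypothetical payloads from three independent collectors. The identifier
# strings are illustrative only and are NOT the real ODDRN format.
postgres_payload = [
    {"id": "//postgres/shop/orders", "inputs": [], "outputs": []},
]
etl_payload = [
    {
        "id": "//airflow/daily_orders_job",
        "inputs": ["//postgres/shop/orders"],
        "outputs": ["//redshift/dwh/orders_daily"],
    },
]
bi_payload = [
    {
        "id": "//tableau/sales_dashboard",
        "inputs": ["//redshift/dwh/orders_daily"],
        "outputs": [],
    },
]


def build_lineage(*payloads):
    """Merge payloads into a graph: identifier -> set of downstream identifiers."""
    downstream = defaultdict(set)
    for payload in payloads:
        for entity in payload:
            for source in entity["inputs"]:
                downstream[source].add(entity["id"])
            for target in entity["outputs"]:
                downstream[entity["id"]].add(target)
    return downstream


def trace_downstream(graph, start):
    """Return every identifier reachable from `start`, in breadth-first order."""
    seen, queue = [], [start]
    while queue:
        node = queue.pop(0)
        for nxt in sorted(graph.get(node, ())):
            if nxt not in seen:
                seen.append(nxt)
                queue.append(nxt)
    return seen


lineage = build_lineage(postgres_payload, etl_payload, bi_payload)
print(trace_downstream(lineage, "//postgres/shop/orders"))
```

         | Because each collector names the same table with the same
         | string, the edges join up on their own: the dashboard is found
         | downstream of the Postgres table even though no single payload
         | mentions both.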
        
           | Cilvic wrote:
           | Sorry I meant this page
           | https://docs.opendatadiscovery.org/use_cases/viz_preparation
           | 
           | If lineage is as automatic as you say that's not clear to me
           | after reading. Thanks for explaining!
        
             | ndementev wrote:
             | Gotcha!
             | 
             | We are continuing to work on the documentation, thank you
             | for bringing this up! We'll take a look at how this can be
             | improved.
             | 
             | > For example: all the steps under 3. are not part of ODD,
             | or are they? Only step 1 is performed in ODD, yes?
             | 
             | Yes, that's correct. In this scenario ODD acts as a source
             | of knowledge about the problem.
        
       | jethkl wrote:
       | I see the motivation and skill of your group, and I hope you can
       | retain that and build a useful contribution. I also see the
       | extraordinary effort required to get to this point, and many
       | projects fail to get this far, so congratulations. Something is
       | obviously going well.
       | 
       | However, I am unmoved by your list of key wins (details below).
       | If you indeed built something useful, is there a different way to
       | deliver your message about the functionality that you enable?
       | 
       | Here are my reactions:
       | 
       | 1) Shorten data discovery phase. In my experience, analysts and
       | data scientists are always very familiar with what relevant data
       | exists, or else they can find the right people to acquire what
       | data they need. Often, kick-off meetings for new projects cover
       | with stakeholders which data is useful.
       | 
       | 2) Have transparency on how and by whom the data is used. For
       | publicly available data, this is not something that a company
       | usually cares about. Internal and proprietary data management is
       | already a very mature space, and every company with such data
       | already has processes in place to manage data access. I grant
       | this is often a mess, but I also don't see any global solution on
       | the horizon.
       | 
       | 3) Foster data culture by continuous compliance and data quality
       | monitoring. Data quality monitoring is extremely complex. I have
       | seen many claims over many years of tools that solve this problem
       | broadly, but I have yet to see any solution that matches the
       | claims.
       | 
       | 4) Accelerate data insights. This is a very bold claim for a new
       | project, especially given the many (5+) decades of work and
       | experience developing tools and techniques for data insights.
       | 
       | 5) Know the sources of your dashboards and ad hoc reports. All
       | dashboards I am aware of surface this sort of information.
       | 
       | 6) Deprecate outdated objects responsibly by assessing and
       | mitigating the risks. This is a good idea, but it is challenging
       | in practice, as illustrated by several prominent examples
       | [1,2,3].
       | 
       | Finally, unrelated to the above, your project's name (ODD) is
       | very similar to the name used by the Outlier Detection DataSets
       | (ODDS) project [4]
       | 
       | Good luck.
       | 
       | [1] http://www.lenna.org/editor.html
       | 
       | [2] https://scikit-
       | learn.org/stable/modules/generated/sklearn.da...
       | 
       | [3] https://deepai.org/dataset/fb15k and
       | https://paperswithcode.com/dataset/fb15k-237
       | 
       | [4] http://odds.cs.stonybrook.edu
        
         | ndementev wrote:
         | Thank you for your kind words and constructive feedback! We
         | appreciate it.
         | 
         | Let me cover some of your reactions from my perspective as a
         | Data Engineer. Please feel free to add your opinion on those
         | 
         | > Shorten data discovery phase. In my experience, analysts and
         | data scientists are always very familiar with what relevant
         | data exists, or else they can find the right people to acquire
         | what data they need. Often, kick-off meetings for new projects
         | cover with stakeholders which data is useful.
         | 
         | You're right, but from my experience it's not always the case.
         | Sometimes finding the key person/team responsible for a dataset
         | might be challenging. You mentioned the kick-off meeting, about
         | which I agree, but it's not always the silver bullet. Data gets
         | outdated/deprecated all the time, and we are trying to solve
         | the problem of telling everyone who may be affected about this
         | as soon and as easily as possible.
         | 
         | > Know the sources of your dashboards and ad hoc reports. All
         | dashboards I am aware of surface this sort of information
         | 
         | Again, you are right. All dashboard services and BI tools can
         | show you which data they are getting from which data source.
         | But from my experience sometimes it's useful to take a look at
         | the _origin_ of the data some dashboard uses. This is where
         | end-to-end lineage comes in handy. Also, I find it useful to
         | have metadata of all of my dashboards from all of my company's
         | BI tools in one place.
         | 
         | > Deprecate outdated objects responsibly by assessing and
         | mitigating the risks. This is a good idea, however, it is
         | challenging
         | 
         | Couldn't agree more. We are working to improve not only our way
         | of solving this problem, but the solution itself, if that makes
         | sense. We are basically trying to find the right approach to
         | this and offer it to everyone else. I know it's ambitious and
         | really is a bold statement, but I hope we are getting there.
         | 
         | Overall, thank you for your input!
         | 
         | @germanosin, would you like to add something I may have missed?
        
       | weekay wrote:
       | Why do demos need Google or other login? This is such friction.
       | I should be able to get access to demos w/o having to log in.
       | Also, the use case for pre-sales is interesting. Has anyone
       | really had any success during pre-sales with an enterprise
       | customer agreeing to install a collector on their applications?
        
         | [deleted]
        
         | germanosin wrote:
         | Hi, we're the team behind this product! We updated our demo to
         | include social logins so that spam doesn't get through. There
         | are no logs being collected, and we're not selling this
         | information either. If you don't want the online version, just
         | head over to https://github.com/opendatadiscovery/odd-
         | platform/tree/main/... and run it locally using docker-compose.
        
           | maddynator wrote:
           | Thanks folks. How do I contribute to the project?
        
             | ndementev wrote:
             | Thank you!
             | 
             | We have a lot of repositories on GitHub, please feel free
             | to pick any issue from the list. Do not hesitate to ask us
             | anything in GitHub issues' threads or in our Slack
             | community. I'll provide links for your convenience:
             | 
             | 1. ODD Platform GitHub:
             | https://github.com/opendatadiscovery/odd-platform
             | 
             | 2. Slack Community: https://go.opendatadiscovery.org/slack
             | 
             | 3. Documentation with information on how to contribute:
             | https://docs.opendatadiscovery.org/developer-guides/how-
             | to-c...
        
       | aschwad wrote:
       | Interesting initiative! Do I understand correctly, that any push
       | mechanism is done via the ODD API and pull mechanisms check on
       | the schema of data sources? Do you already have a standard for
       | providing ETL metadata? On which level of detail are you
       | collecting this metadata?
       | 
       | At BMW, the data catalogue is continuously growing and the
       | number of datasets is increasing rapidly. Therefore we had a
       | similar problem of finding out how datasets relate to each other
       | and how they are transformed, so we needed coarse- and
       | fine-grained data lineage. We found a way by leveraging the
       | Spline Agent (https://github.com/AbsaOSS/spline) to make use of
       | the Execution Plans, transforming them into a data model suited
       | to our set of requirements, and developing a UI to explore these
       | relationships. We also open-sourced our approach in a
       | 
       | - paper:
       | https://link.springer.com/article/10.1007/s13222-021-00387-7
       | 
       | - and blog post: https://medium.com/@alex.schoenenwald/fishing-
       | for-data-linea...
        
         | ndementev wrote:
         | Thank you!
         | 
         | Actually everything is working on a push basis in ODD now. The
         | ODD Platform implements the ODD Specification
         | (https://github.com/opendatadiscovery/opendatadiscovery-
         | speci...), and all agents, custom scripts and integrations,
         | Airflow/Spark listeners, etc. push metadata to a specific ODD
         | Platform endpoint
         | (https://github.com/opendatadiscovery/opendatadiscovery-
         | speci...). ODD Collectors (agents) push metadata on a
         | configurable schedule.
         | 
         | The ODD Specification is a standard for collecting and
         | gathering such metadata, ETL included. We gather metadata for
         | lineage on an entity level now, but we plan to expand this to
         | column-level lineage at the end of 2022 or the start of 2023.
         | The Specification allows us to make the system open, and it's
         | really easy to write your own integration by looking at the
         | format in which metadata needs to be injected into the Platform.
         | 
         | The ODD Platform has its own OpenAPI specification
         | (https://github.com/opendatadiscovery/odd-
         | platform/tree/main/...) so that the already indexed and layered
         | metadata can be extracted via the platform's API.
         | 
         | Also, thank you for sharing links with us! I'm thrilled to take
         | a look at how BMW solved the problem of gathering lineage from
         | Spark; that's something we are improving in our product right
         | now.
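         | 
         | As a rough sketch of the push model described in this comment
         | (an agent collects metadata and POSTs it to the platform on a
         | schedule), something like the following Python could serve as
         | a starting point for a custom integration. Everything specific
         | here is an assumption for illustration: the endpoint URL, the
         | JSON body shape, and the `build_push_request` and
         | `run_collector` helpers are hypothetical, and the real payload
         | format is the one defined by the ODD Specification.

```python
import json
import urllib.request

# Hypothetical endpoint, used only for illustration. The real ingestion
# path and payload schema are defined by the ODD Specification.
INGEST_URL = "http://odd-platform.example/ingest"


def build_push_request(source_id, entities, url=INGEST_URL):
    """Build (but do not send) a POST request carrying collected metadata."""
    body = json.dumps({"source": source_id, "entities": entities}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_collector(collect, send, runs, wait):
    """Call `collect`, push the result through `send`, then wait; repeat `runs` times."""
    for _ in range(runs):
        send(build_push_request("demo-collector", collect()))
        wait()


# Dry run: capture the requests in a list instead of sending them.
sent = []
run_collector(
    collect=lambda: [{"id": "//example/table", "type": "TABLE"}],
    send=sent.append,
    runs=2,
    wait=lambda: None,  # a real agent would sleep for the configured interval
)
print(len(sent), sent[0].get_method(), sent[0].full_url)
```

         | Passing `urllib.request.urlopen` as `send` and a real sleep as
         | `wait` would turn the dry run into an actual push loop; keeping
         | both injectable makes the sketch easy to try without a running
         | platform.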
        
       | O__________O wrote:
       | Related website:
       | 
       | - https://opendatadiscovery.org/
       | 
       | Demo (requires GitHub/Google login) + Demo Video
       | 
       | - https://demo.oddp.io/login
       | 
       | - https://youtube.com/watch?v=ZSa2FWAyUic
       | 
       | Use cases:
       | 
       | - https://docs.opendatadiscovery.org/use_cases
       | 
       | Presentation on ODD (by HN user germanosin):
       | 
       | - https://youtube.com/watch?v=Y0aFqHd4h3k
        
         | dang wrote:
         | Ok, we've changed the URL above to the project home page, since
         | it gives more background info, from
         | https://github.com/opendatadiscovery/odd-platform. Thanks!
        
       ___________________________________________________________________
       (page generated 2022-10-22 23:00 UTC)