[HN Gopher] Launch HN: Cord (YC W21) - training data toolbox for...
       ___________________________________________________________________
        
       Launch HN: Cord (YC W21) - training data toolbox for computer
       vision
        
       Hey HN community - I'm Ulrik from Cord (https://cord.tech) in the
       current YC W21 batch [1] - we are building software that allows
       people to label their data intelligently using a toolbox of various
       'labeling algorithms'. Labeling algorithms are any units of
       intelligence (e.g. a pre-trained model, or an interpolation
       algorithm) that help automate the annotation process. This enables
       data science and machine learning teams to rapidly iterate on their
       ML models without having to farm out labeling tasks to an external
       workforce.
        
       Today we're launching the first part of our product, our Web App,
       which serves our initial set of automation features through a GUI.
       It also allows you to classify images and draw vector labels,
       visualize data, and perform collaborative QA.
       Computer vision ML algorithms are widely used for tasks like
       detecting everyday objects such as cars and pedestrians. However,
       they are yet to see widespread adoption for things like detecting
       cancerous polyps during an endoscopic procedure or blood clots in
       MRI scans. The lack of the massive-scale labeled training datasets
       that fuel contemporary approaches is often the blocking element in
       building ML applications that solve these more specialised tasks.
       We also believe that the core IP of an ML application stems from
       the labeled data used to train it.
        
       Creating these datasets is challenging for several reasons.
       Labeling the data requires expensive domain-expert annotators, and
       privacy concerns might prevent the data from being sent to an
       external workforce. Ultimately, most labeling work tends to be done
       using open-source tools that were built neither for speed nor to
       handle massive-scale datasets[2]. These tools also tend to provide
       a poor experience for the end consumer of the training data (e.g.,
       data scientists, ML engineers) because they lack intelligence and
       require a lot of manual input.
        
       The initial seed of the idea came while I was working on a CS
       master's project visualizing massive-scale medical image datasets.
       I saw how much time and effort was being spent by doctors on
       labeling data. I then met my co-founder Eric, who had worked as a
       quant researcher in finance, and together we realized we could take
       an algorithmic approach to tackling the labeling problem. Instead
       of writing trading algorithms, we turned our focus to writing
       labeling algorithms.
        
       For example, for
       a food calorie estimation project we translated image level
       classifications of food items to individualized bounding box labels
       using a labeling algorithm we wrote with our SDK, requiring only
       one manual label per food item. Although it was an image dataset
       (not video), our algorithm approximated noisy bounding box labels
       by running a CSRT object tracker across the images. It then trained
       a shallow Faster R-CNN 'micro-model' on the noisy labels, ran
       inference on the data, and suppressed the earlier noisy labels. We
       then quickly reviewed and adjusted the results visually in our Web
       App[3]. We have applied a similar approach in areas such as
       gastroenterology[4] and pathology.
        
       The days of relying on an army of human annotators and
       waiting to start the model building process are hopefully (soon)
       over. We are incredibly excited to be driving for that change - and
       are delighted to be sharing Cord with the HN community! We would
       love to hear your feedback. How are you going about creating and
       managing training data today? What are your key constraints? If you
       have used a creative method to label your data before, please
       share. Thank you so much in advance!
        
       [1] What I Learned From My First Month at Y Combinator -
       https://medium.com/swlh/what-i-learned-from-my-first-month-a...
        
       [2] Why You Should Ditch Your In-House Training Data Tools (And
       Avoid Building Your Own) - https://medium.com/p/ef78915ee84f
        
       [3] Label a Dataset with a Few Lines of Code -
       https://eric-landau.medium.com/label-a-dataset-with-a-few-li...
        
       [4] Pain Relief for Doctors Labelling Data -
       https://eric-landau.medium.com/pain-relief-for-doctors-label...
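       The propagate-train-suppress loop described above can be sketched in
       a few lines of Python. This is a hedged illustration, not Cord's
       actual SDK code: `propagate` is a stub standing in for a CSRT
       tracker, and the micro-model's predictions are hard-coded where a
       Faster R-CNN would normally be trained and run. Only the
       orchestration and the IoU-based suppression step are implemented
       for real.

       ```python
       # Sketch of a "propagate one manual label, then suppress the noisy
       # copies with micro-model predictions" labeling algorithm. The
       # tracker and model here are hypothetical stand-ins.

       def iou(a, b):
           """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
           ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
           ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
           inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
           area_a = (a[2] - a[0]) * (a[3] - a[1])
           area_b = (b[2] - b[0]) * (b[3] - b[1])
           union = area_a + area_b - inter
           return inter / union if union else 0.0

       def propagate(seed_box, n_images, drift=5):
           """Stub tracker: carries one manual box across images with a
           fixed drift, mimicking the noisy labels a CSRT tracker yields."""
           x1, y1, x2, y2 = seed_box
           return [(x1 + i * drift, y1, x2 + i * drift, y2)
                   for i in range(n_images)]

       def suppress(noisy, predicted, thresh=0.8):
           """Replace each noisy label with the best-overlapping model
           prediction; keep noisy boxes the model found nothing for."""
           out = []
           for nb in noisy:
               match = max(predicted, key=lambda pb: iou(nb, pb),
                           default=None)
               out.append(match if match and iou(nb, match) >= thresh
                          else nb)
           return out

       # One manual label, propagated over 4 images by the stub tracker:
       noisy = propagate((10, 10, 50, 50), n_images=4)
       # Pretend the micro-model, trained on the noisy boxes, predicts
       # corrected boxes for the first three images only:
       preds = [(12, 11, 52, 51), (16, 10, 56, 50), (21, 12, 61, 52)]
       final = suppress(noisy, preds)
       # First three labels are replaced by predictions; the last noisy
       # box has no confident match and survives for human review.
       ```

       In the real pipeline each round of review tightens the labels, so
       the micro-model can be retrained and the loop repeated.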
        
       Author : ulrikhansen54
       Score  : 33 points
       Date   : 2021-02-11 17:06 UTC (5 hours ago)
        
       | yesterday200 wrote:
       | At Lenus eHealth we have been following Cord closely for some
       | time and can only vouch for the quality of their product! I've
       | been trying out the public API and am impressed with the progress
       | so far.
        
         | ulrikhansen54 wrote:
         | Thanks! We've really enjoyed engaging with Lenus.
        
       | jrrb wrote:
       | Is this something we will be able to license and run on our own
       | servers? We are quite wary of sharing our data and labels
       | externally -- we've had bad experiences with that...
        
         | ulrikhansen54 wrote:
         | Yes! We do on-premise deployments for enterprise. But even with
         | our cloud solution, our philosophy and terms of service are
         | that all your data and any models you build through the
         | platform are 100% yours. There are some companies that are
         | squirrely around this point, but we don't use any client data
         | for other models. Also, even if you don't want to go for an
         | enterprise on-prem deployment, you can connect your cloud
         | buckets with the software -- so no need to upload anything.
        
           | jrrb wrote:
           | Oh that's great to hear, I'll reach out to find out more.
        
             | ulrikhansen54 wrote:
             | Looking forward to it!
        
       | ipsum2 wrote:
       | How is this different from Aquarium?
        
         | ulrikhansen54 wrote:
         | To my understanding Aquarium is focused on curation of datasets
         | rather than labeling (e.g. finding edge cases in data, etc.)
         | and have built an awesome product around that. Aquarium works
         | well in domains that are further along on the ML adoption curve
         | -- particularly AV -- where finding and solving edge cases in
         | your data is typically the key constraint. Less mature domains
         | (e.g. medical, agtech, etc.) are still stuck at the data
         | labeling stage, and due to the expense of the annotators they
         | have a hard time scaling.
         | 
         | Our focus is on minimising human involvement in the data
         | annotation process to make it more efficient and facilitate ML
         | development in these more specialised fields.
        
       | iceburg8 wrote:
       | how is this different from roboflow?
        
         | eric_landau wrote:
         | As far as I can tell, Roboflow is more focused on being an
         | end-to-end platform for AI. The customers we work with
         | generally want to retain more control over their model-building
         | process; we just help them automate as much of the annotation
         | process as we can, often with the help of their own models.
        
       | festinalente wrote:
       | Congrats! This seems like a sorely needed product in the
       | industry. Are there plans to expand this to other areas like
       | document tagging?
        
         | ulrikhansen54 wrote:
         | Thanks! We don't have any immediate plans to expand into other
         | data types. We've decided to remain focused on just one 'data
         | vertical' (anything computer vision related) for the
         | foreseeable future, as we want to build out a solid base of
         | labeling algorithms for CV before spreading a wider net.
        
       | rustastra wrote:
       | As a user, am I expected to write the labeling algorithms myself
       | or do you offer some in-built ones?
        
         | eric_landau wrote:
         | You can do either! We offer a bunch of automation features
         | directly through the Web App but people have also used the SDK
         | to write their own algorithms. We have seen a lot of different
         | annotation processes now so we can often direct people on the
         | best flow to automate their labeling.
        
       | 123molchun wrote:
       | Looks pretty cool, but how is this different from Scale AI?
        
         | eric_landau wrote:
         | Hi Eric from Cord here. Scale is a great company and they have
         | done really well in AV especially. The issue with them is that
         | they require you to send your data overseas to be annotated by
         | a human workforce. They also probably have a bunch of in-house
         | automated tools, but they don't pass the savings on to their
         | clients because they are paid on a per-label basis. We want to
         | pass the benefit of automation to anyone that needs labeled
         | data.
        
       | evilolive wrote:
       | Really cool stuff! Looking forward to trying it out.
       | 
       | My favorite commercial option so far is: https://www.v7labs.com/
       | 
       | prodi.gy is worth checking out for running small augmentation UIs
       | 
       | The workflow of the Toronto univ annotation suite [1] is sweet
       | for making polygons semi-automatically, but not widely available
       | yet
       | 
       | [1] https://youtu.be/3kFQJQicHxA
        
         | ulrikhansen54 wrote:
         | Thanks for sharing those evilolive! I haven't come across the
         | Toronto Univ annotation suite before -- polygon auto-
         | segmentation is an interesting problem. We have embedded a
         | simple auto-segmentation algorithm into our GUI that runs
         | without a model, but we are looking to add DL-enabled
         | auto-segmentation in the near future.
         | 
         | Would be curious to hear why V7 is your favourite?
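       The model-free auto-segmentation mentioned above can be as simple
       as region growing from a user's click: flood-fill outward from the
       seed pixel while intensities stay close to it. This toy sketch is
       an assumption about how such a feature might work, not Cord's
       actual GUI algorithm; it uses only the Python standard library on a
       list-of-lists grayscale image.

       ```python
       from collections import deque

       def region_grow(image, seed, tol=10):
           """Model-free segmentation: flood-fill from a clicked seed
           pixel, accepting 4-connected neighbours whose intensity is
           within `tol` of the seed's. Returns the set of (row, col)
           pixels in the mask."""
           rows, cols = len(image), len(image[0])
           sr, sc = seed
           target = image[sr][sc]
           mask, queue = {seed}, deque([seed])
           while queue:
               r, c = queue.popleft()
               for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                   if (0 <= nr < rows and 0 <= nc < cols
                           and (nr, nc) not in mask
                           and abs(image[nr][nc] - target) <= tol):
                       mask.add((nr, nc))
                       queue.append((nr, nc))
           return mask

       # A tiny grayscale image: a bright 2x2 blob on a dark background.
       img = [
           [10,  10,  10, 10],
           [10, 200, 205, 10],
           [10, 198, 202, 10],
           [10,  10,  10, 10],
       ]
       # Clicking inside the blob segments exactly the four bright pixels.
       mask = region_grow(img, seed=(1, 1), tol=10)
       ```

       DL-enabled versions replace the intensity test with a learned
       boundary, which is why they generalise far better to textured
       objects.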
        
       ___________________________________________________________________
       (page generated 2021-02-11 23:01 UTC)