[HN Gopher] Show HN: An annotation tool for ML and NLP
       ___________________________________________________________________
        
       Show HN: An annotation tool for ML and NLP
        
       Author : neiman1
       Score  : 61 points
       Date   : 2021-06-19 13:37 UTC (9 hours ago)
        
 (HTM) web link (www.getmarkup.com)
 (TXT) w3m dump (www.getmarkup.com)
        
       | rubatuga wrote:
       | What are some of your competitors, as well as any other open-
       | source alternatives? What makes your tool better?
        
       | hadsed wrote:
        | Beautiful. So many annotation tools focus on "text
        | classification", which assumes you've already got segmented
        | samples. In the real world of documents, that's a whole
        | challenge in itself.
       | 
       | Another challenge is that sometimes you're working with PDFs and
       | that means not only ingesting but also displaying. The difficulty
       | is in keeping track of annotations and predictions across the
       | PDF<->text string boundary, both ways.
       | 
       | There are understandably even fewer solutions to that problem
       | because it's a harder UI to build.
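        | 
        | To make that bookkeeping concrete, here's a minimal sketch of
        | one way to do it, assuming PyMuPDF (an assumption; no library
        | is named in this thread): extract a plain-text layer while
        | recording, for every character offset, the page and bounding
        | box it came from.
        | 
        |   # Illustrative sketch only; assumes PyMuPDF, which
        |   # nothing in this thread actually uses.
        |   import fitz  # PyMuPDF
        | 
        |   def extract_with_offsets(path):
        |       """Return the document text plus a map from each
        |       character offset to (page_no, word_bbox), so span
        |       annotations on the text can be projected back onto
        |       the rendered PDF (word-level granularity only)."""
        |       parts, offset_map, pos = [], {}, 0
        |       for page_no, page in enumerate(fitz.open(path)):
        |           for x0, y0, x1, y1, word, *_ in \
        |                   page.get_text("words"):
        |               for i in range(len(word)):
        |                   offset_map[pos + i] = \
        |                       (page_no, (x0, y0, x1, y1))
        |               parts.append(word)
        |               pos += len(word) + 1  # +1 joining space
        |       return " ".join(parts), offset_map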
        
         | gryn wrote:
          | AllenAI seems to be working on something like that for PDF
          | files.
         | 
         | https://github.com/allenai/pawls
        
         | neiman1 wrote:
          | Much appreciated! That's true, and lots of the tools that do
          | feature text annotation can be quite restrictive, in that
          | they don't let you add attributes or annotate the same span
          | of text more than once.
         | 
         | Support for PDFs and other doc types is definitely on the
         | backlog, but I keep holding off due to the challenges you
         | mentioned.
        
       | Delk wrote:
        | Looks like an interesting project. Would you have some kind of
        | summary of the methodology you're using for the annotation
        | suggestions? What kind of learning, and which kinds of
        | features?
        
         | neiman1 wrote:
          | Just to preface this summary: it's all a bit hacked together
          | at the moment, and I'm in the process of rewriting the tool
          | from scratch, so this description is likely to change.
         | 
          | To generate the suggestions, there's an active learner with
          | an underlying random forest classifier that has been fed ~60
          | seed sentences [1] and classifies sentences as positive
          | (e.g. contains a prescription) or negative (e.g. doesn't
          | contain a prescription).
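          | 
          | As a rough illustration of that first stage, a simplified
          | uncertainty-sampling sketch (scikit-learn and TF-IDF here
          | are just for illustration, not the exact pipeline):
          | 
          |   import numpy as np
          |   from sklearn.ensemble import RandomForestClassifier
          |   from sklearn.feature_extraction.text import \
          |       TfidfVectorizer
          | 
          |   seed_sentences = [
          |       "patient is on pheneturide 250mg twice a day",
          |       "no known drug allergies",
          |   ]
          |   seed_labels = [1, 0]  # 1 = contains a prescription
          | 
          |   vec = TfidfVectorizer()
          |   clf = RandomForestClassifier(n_estimators=100)
          |   clf.fit(vec.fit_transform(seed_sentences),
          |           seed_labels)
          | 
          |   def most_uncertain(unlabeled, k=10):
          |       """Pick the k sentences the forest is least
          |       sure about; these are the ones worth asking
          |       the user to label next."""
          |       proba = clf.predict_proba(
          |           vec.transform(unlabeled))
          |       uncertainty = 1.0 - proba.max(axis=1)
          |       order = np.argsort(-uncertainty)[:k]
          |       return [unlabeled[i] for i in order]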
         | 
          | All positive sentences are fed into a sequence-to-sequence
          | RNN that has been trained on ~50k synthetic rows of data [2]
          | and maps unstructured sentences (e.g. "patient is on
          | pheneturide 250mg twice a day") to a structured output with
          | the desired features (e.g. name: pheneturide; dose: 250;
          | unit: mg; frequency: 2). The synthetic data was generated
          | using Markup's in-built data generator [3].
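          | 
          | For a sense of the shape of that model, a heavily
          | stripped-down sketch (PyTorch, GRU cells, and teacher
          | forcing are illustrative choices here, not the exact
          | implementation):
          | 
          |   import torch.nn as nn
          | 
          |   class Seq2Seq(nn.Module):
          |       def __init__(self, vocab_size, hidden=256):
          |           super().__init__()
          |           self.embed = nn.Embedding(vocab_size, hidden)
          |           self.encoder = nn.GRU(hidden, hidden,
          |                                 batch_first=True)
          |           self.decoder = nn.GRU(hidden, hidden,
          |                                 batch_first=True)
          |           self.out = nn.Linear(hidden, vocab_size)
          | 
          |       def forward(self, src, tgt):
          |           # src: token ids for "patient is on ..."
          |           # tgt: token ids for "name: pheneturide; ..."
          |           # (teacher forcing during training)
          |           _, state = self.encoder(self.embed(src))
          |           dec, _ = self.decoder(self.embed(tgt), state)
          |           return self.out(dec)  # per-step vocab logits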
         | 
          | The outputs of the RNN are validated to ensure they meet the
          | expected structure and are consistent with the sentence
          | (e.g. the predicted drug name must appear somewhere within
          | the original sentence).
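          | 
          | That check might look something like this (a simplified
          | sketch, using the field names from the example above):
          | 
          |   def is_valid(sentence, pred):
          |       """Reject structurally or semantically junk
          |       RNN output; `pred` is a dict like {"name": ...,
          |       "dose": ..., "unit": ..., "frequency": ...}."""
          |       expected = {"name", "dose", "unit", "frequency"}
          |       if not expected.issubset(pred):
          |           return False
          |       # the predicted drug name must appear somewhere
          |       # in the original sentence
          |       return pred["name"].lower() in sentence.lower()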
         | 
          | All non-junk predictions are shown to the user, who can
          | accept, edit, or reject each one. Based on the user's
          | responses, the active learner is refined (currently nothing
          | is fed back into the RNN).
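          | 
          | One plausible shape for that refinement step (a sketch;
          | random forests have no incremental fit, so this simply
          | regrows the labelled pool and retrains):
          | 
          |   from sklearn.ensemble import RandomForestClassifier
          | 
          |   def refine(vec, sentences, labels,
          |              accepted, rejected):
          |       """Fold user feedback into the labelled pool
          |       and retrain the forest from scratch."""
          |       sentences = sentences + accepted + rejected
          |       labels = (labels + [1] * len(accepted)
          |                        + [0] * len(rejected))
          |       clf = RandomForestClassifier(n_estimators=100)
          |       clf.fit(vec.transform(sentences), labels)
          |       return clf, sentences, labels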
         | 
         | [1]
         | https://github.com/samueldobbie/markup/blob/master/data/text...
         | 
         | [2]
         | https://raw.githubusercontent.com/samueldobbie/markup/master...
         | 
         | [3] https://www.getmarkup.com/tools/data-generator/
        
         | alecst wrote:
         | Came here to post the same question. Great work by the way!
        
           | neiman1 wrote:
           | Thank you!
        
       | hbcondo714 wrote:
       | > Document to annotate - The document you intend to annotate
       | (must be .txt file)
       | 
        | Any thoughts on supporting additional file formats? I'm
        | actually interested in annotating HTML files / web pages. It
        | would be great if I could browse for a local HTML file or
        | enter a URL, and have the HTML content rendered so it can be
        | annotated using the entities.
        
         | neiman1 wrote:
          | For sure! I've just tested it locally, and HTML annotation
          | is possible with only a few minor changes. I've been overly
          | restrictive by limiting it to text files (primarily to avoid
          | PDFs and MS docs, which have some additional challenges).
          | Will deploy the updates later today.
         | 
         | Love the idea of being able to enter a URL to retrieve the HTML
         | for annotation.
        
           | hbcondo714 wrote:
           | Thanks for your quick reply, looking forward to this!
        
       | neiman1 wrote:
       | Hey HN!
       | 
       | Markup is an open-source annotation tool for transforming
       | unstructured documents into a structured format that can be used
       | for ML, NLP, etc.
       | 
       | Markup learns as you annotate in order to speed up the process by
       | suggesting complex annotations to you.
       | 
       | There are also a few different in-built tools, including:
       | 
        | - A data generator that helps you produce synthetic data for
        | training the suggestion model (sketched below)
        | 
        | - An annotator diff tool that helps you compare annotations
        | produced by multiple annotators (also sketched below)
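        | 
        | As a flavour of both (illustrative sketches, not the actual
        | implementations), a template-filling generator and a
        | span-level diff:
        | 
        |   import random
        | 
        |   DRUGS = ["pheneturide", "aspirin", "metformin"]
        |   TEMPLATE = ("patient is on {name} {dose}{unit} "
        |               "{frequency} times a day")
        | 
        |   def generate_row():
        |       """One synthetic (sentence, structured) pair."""
        |       fields = {"name": random.choice(DRUGS),
        |                 "dose": random.choice([50, 250, 500]),
        |                 "unit": "mg",
        |                 "frequency": random.choice([1, 2, 3])}
        |       return TEMPLATE.format(**fields), fields
        | 
        |   def diff_annotations(a, b):
        |       """Compare two annotators' outputs, where each
        |       annotation is a (start, end, label) tuple."""
        |       a, b = set(a), set(b)
        |       return {"agreed": a & b,
        |               "only_a": a - b,
        |               "only_b": b - a}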
       | 
       | It's still very much a work in progress (and the documentation is
       | severely lacking), but the ultimate goal is to make a tool that's
       | as useful as https://prodi.gy/, without the $400 price tag.
        
       | kwerk wrote:
        | This looks incredible! I've been following doccano for a while
        | but they were still working on active learning. Will you be
        | adding an open-source license like MIT?
        
         | neiman1 wrote:
         | Thanks a lot! I've just added an MIT license :)
        
       | forgingahead wrote:
       | Really nice tool - thanks for making this! What is your plan for
       | this? Is this a side-project that you'll potentially turn into a
       | business, or is this just a hobby on the side of your full-time
       | job?
       | 
        | Just asking because I think many folks would be happy to pay
        | to support a small ISV to ensure its long-term sustainability.
        | Not via donations, but actual pricing.
        
         | neiman1 wrote:
         | Thanks for your kind words! It's just a hobby project that I
         | work on alongside my full-time job right now, and to be honest,
         | I'm still trying to figure out a plan. My intention is to keep
         | the core functionality free forever, but I could definitely see
         | a future where there are premium collaborative features or some
         | cost for training custom suggestion models, for example.
         | 
         | If you're one of those folks who would consider supporting a
         | tool such as this, do you have an idea in mind as to what sort
         | of features you'd be willing to pay for?
        
           | psimm wrote:
           | I'm in the market for a tool like this. At the moment I'm
           | using Prodigy but interested in other options. Features that
           | I'd be willing to pay for (or rather my employer):
            | 1. Team functionality with multiple user accounts.
            | 
            | 2. An easy-to-use workflow for double annotation, where
            | each text is annotated by exactly two annotators. The
            | software should make sure that a text is never shown to
            | more than 2 annotators and never shown to the same
            | annotator twice (see the sketch after this list).
            | 
            | 3. Make it easy to review the 2 versions and resolve
            | conflicts.
            | 
            | 4. A smarter alternative to review would be a warning
            | system that identifies annotations that may have errors
            | (because a model trained on the other data predicts a
            | different result) and automatically flags them for review
            | by another annotator.
            | 
            | 5. Stats on the annotators: speed, accuracy, and
            | statistics on how frequently they assign different
            | labels, to detect potential misunderstandings of the
            | annotation schema.
            | 
            | 6. A GUI with an overview of all annotation datasets,
            | with stats like % finished annotating (with stages for
            | double annotation and review), the types of annotation
            | done, and frequencies of labels to detect imbalances.
            | 
            | 7. Functions to mass-edit the annotations, like renaming
            | or removing an entity type.
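            | 
            | To illustrate the constraint in point 2 (purely
            | illustrative, not an existing feature):
            | 
            |   from itertools import combinations, cycle
            | 
            |   def assign_double(texts, annotators):
            |       """Give each text exactly two distinct
            |       annotators, cycling through the possible
            |       pairs so work is spread evenly and nobody
            |       sees the same text twice."""
            |       pairs = cycle(combinations(annotators, 2))
            |       return {t: next(pairs) for t in texts}
            | 
            |   # assign_double(["t1", "t2"], ["a", "b", "c"])
            |   # -> {"t1": ("a", "b"), "t2": ("a", "c")}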
           | 
            | Another thing I'd be interested in is some integration
            | with a third-party annotation provider. There are
            | companies that offer annotation as a service, and it's
            | also available on Google Cloud and AWS. Having that
            | integrated into an annotation tool would make it very easy
            | to get large amounts of well-annotated training material.
           | 
           | But finally, and much more importantly: The workflow for
           | annotators has to be perfected first, so they can work as
           | efficiently and consistently as possible. Getting this right
           | is more important to me than any of the other features I
           | listed.
        
       ___________________________________________________________________
       (page generated 2021-06-19 23:01 UTC)