[HN Gopher] Show HN: An annotation tool for ML and NLP
___________________________________________________________________
Show HN: An annotation tool for ML and NLP
Author : neiman1
Score : 61 points
Date : 2021-06-19 13:37 UTC (9 hours ago)
(HTM) web link (www.getmarkup.com)
(TXT) w3m dump (www.getmarkup.com)
| rubatuga wrote:
| What are some of your competitors, as well as any other open-
| source alternatives? What makes your tool better?
| hadsed wrote:
| Beautiful. So many annotation tools focus on "text
| classification" which assumes you've already got segmented
| samples. In the real world of documents that's a whole challenge
| in itself.
|
| Another challenge is that sometimes you're working with PDFs and
| that means not only ingesting but also displaying. The difficulty
| is in keeping track of annotations and predictions across the
| PDF<->text string boundary, both ways.
|
| There are understandably even fewer solutions to that problem
| because it's a harder UI to build.
| gryn wrote:
| allenai seems to be working on something like that for pdf
| files.
|
| https://github.com/allenai/pawls
| neiman1 wrote:
| Much appreciated! That's true, and lots of the tools that do
| feature text annotation can be quite restrictive in that they
| don't allow you to add attributes / repeatedly annotate the
| same span of text.
|
| Support for PDFs and other doc types is definitely on the
| backlog, but I keep holding off due to the challenges you
| mentioned.
| Delk wrote:
| Looks like an interesting project. Would you have some kind of a
| summary of the methodology you're using for the annotation
| suggestions? What kind of learning, and which kinds of features?
| neiman1 wrote:
| Just to preface this summary, it's all a bit hacked together at
| the moment, and I'm in the process of rewriting the tool from
| scratch so this description is likely to change.
|
| To generate the suggestions, there's an active learner with an
| underlying random forest classifier that has been fed ~60 seed
| sentences [1] to classify sentences as positive (e.g. contains
| a prescription) or negative (e.g. doesn't contain a
| prescription).
|
| All positive sentences are fed into a sequence-to-sequence RNN
| that has been trained on ~50k synthetic rows of data [2] and
| maps unstructured sentences (e.g. patient is on pheneturide
| 250mg twice a day) to a structured output with the desired
| features (e.g. name: pheneturide; dose: 250; unit: mg;
| frequency: 2). The synthetic data was generated using Markup's
| in-built data generator [3].
|
| The outputs of the RNN are validated to ensure they meet the
| expected structure and are valid for the sentence (e.g. the
| predicted drug name must exist somewhere within the original
| sentence).
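The validation just described (a structure check plus grounding the prediction in the source sentence) could be sketched roughly as follows; the field names come from the example above, but the function itself is a hypothetical illustration, not Markup's actual code:

```python
def validate_prediction(sentence: str, prediction: dict) -> bool:
    """Reject RNN outputs that are malformed or ungrounded."""
    required = {"name", "dose", "unit", "frequency"}
    # Structural check: every expected field must be present.
    if not required.issubset(prediction):
        return False
    # Grounding check: the predicted drug name must exist
    # somewhere within the original sentence.
    if prediction["name"].lower() not in sentence.lower():
        return False
    # The predicted dose should also appear in the text.
    if str(prediction["dose"]) not in sentence:
        return False
    return True

print(validate_prediction(
    "patient is on pheneturide 250mg twice a day",
    {"name": "pheneturide", "dose": 250, "unit": "mg", "frequency": 2},
))  # True
```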
|
| All non-junk predictions are shown to the user, who can accept,
| edit, or reject each. Based on the user's responses, the active
| learner is refined (currently nothing is fed back into the
| RNN).
|
| [1]
| https://github.com/samueldobbie/markup/blob/master/data/text...
|
| [2]
| https://raw.githubusercontent.com/samueldobbie/markup/master...
|
| [3] https://www.getmarkup.com/tools/data-generator/
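The accept/edit/reject loop described above is pool-based active learning: the model surfaces its most uncertain sentences, the user labels them, and the model is retrained. A minimal, framework-free sketch under that reading (the real tool uses a random forest; here the classifier is a toy keyword scorer, and every name is hypothetical):

```python
class KeywordClassifier:
    """Toy stand-in for the random forest: scores a sentence by the
    fraction of its words seen in positively labeled sentences."""

    def __init__(self):
        self.positive_words = set()

    def fit(self, labeled):
        self.positive_words = {
            word for sentence, label in labeled if label == 1
            for word in sentence.split()
        }

    def predict_proba(self, sentence):
        words = sentence.split()
        return sum(w in self.positive_words for w in words) / len(words)


def active_learning_round(classifier, labeled, pool, ask_user, batch=1):
    """One round: suggest the most uncertain sentences, fold the
    user's accept (1) / reject (0) answers in, and retrain."""
    classifier.fit(labeled)
    # Uncertainty sampling: probabilities closest to 0.5 come first.
    ranked = sorted(pool,
                    key=lambda s: abs(classifier.predict_proba(s) - 0.5))
    for sentence in ranked[:batch]:
        labeled.append((sentence, ask_user(sentence)))
        pool.remove(sentence)
    classifier.fit(labeled)  # refine on the newly labeled data
    return classifier


labeled = [("patient is on pheneturide 250mg", 1),
           ("follow up in two weeks", 0)]
pool = ["on aspirin 75mg daily", "no known allergies"]
clf = active_learning_round(KeywordClassifier(), labeled, pool,
                            ask_user=lambda s: 1 if "mg" in s else 0)
print(pool)  # the most uncertain sentence was labeled and removed
```

The key design point is that only the cheap outer classifier is refined each round, which matches the comment's note that nothing is currently fed back into the RNN.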
| alecst wrote:
| Came here to post the same question. Great work by the way!
| neiman1 wrote:
| Thank you!
| hbcondo714 wrote:
| > Document to annotate - The document you intend to annotate
| (must be .txt file)
|
| Any thoughts on supporting additional file formats? I'm actually
| interested in annotating HTML files / web pages. It would be
| great if I could browse for a local HTML file or enter a URL,
| and the HTML content would be rendered so it could be annotated
| using the entities.
| neiman1 wrote:
| For sure! I've just tested it locally and HTML annotation is
| possible with only a few minor changes. I've just been overly
| restrictive by limiting it to text files (primarily to avoid
| PDFs and MS docs that have some additional challenges). Will
| deploy the updates later today.
|
| Love the idea of being able to enter a URL to retrieve the HTML
| for annotation.
| hbcondo714 wrote:
| Thanks for your quick reply, looking forward to this!
| neiman1 wrote:
| Hey HN!
|
| Markup is an open-source annotation tool for transforming
| unstructured documents into a structured format that can be used
| for ML, NLP, etc.
|
| Markup learns as you annotate in order to speed up the process by
| suggesting complex annotations to you.
|
| There are also a few different in-built tools, including:
|
| - A data generator that helps you to produce synthetic data for
| training the suggestion model
|
| - An annotator diff tool that helps you to compare annotations
| produced by multiple annotators
|
| It's still very much a work in progress (and the documentation is
| severely lacking), but the ultimate goal is to make a tool that's
| as useful as https://prodi.gy/, without the $400 price tag.
| kwerk wrote:
| This looks incredible! I've been following doccano for a while but
| they were still working on active learning. Will you be adding an
| open source license like MIT?
| neiman1 wrote:
| Thanks a lot! I've just added an MIT license :)
| forgingahead wrote:
| Really nice tool - thanks for making this! What is your plan for
| this? Is this a side-project that you'll potentially turn into a
| business, or is this just a hobby on the side of your full-time
| job?
|
| Just asking because I think many folks would be happy to pay to
| support a small ISV to ensure its long-term sustainability. Not
| via donations, but actual pricing.
| neiman1 wrote:
| Thanks for your kind words! It's just a hobby project that I
| work on alongside my full-time job right now, and to be honest,
| I'm still trying to figure out a plan. My intention is to keep
| the core functionality free forever, but I could definitely see
| a future where there are premium collaborative features or some
| cost for training custom suggestion models, for example.
|
| If you're one of those folks who would consider supporting a
| tool such as this, do you have an idea in mind as to what sort
| of features you'd be willing to pay for?
| psimm wrote:
| I'm in the market for a tool like this. At the moment I'm
| using Prodigy but interested in other options. Features that
| I'd be willing to pay for (or rather my employer):
| 1. Team functionality with multiple user accounts
| 2. An easy-to-use workflow for double annotation, where each
| text is annotated by exactly two annotators. The software
| should make sure that a text is never shown to more than two
| annotators and never shown to the same annotator twice
| 3. An easy way to review the two versions and resolve conflicts
| 4. A smarter alternative to review: a warning system that
| identifies annotations that may have errors (because a model
| trained on the other data predicts a different result) and
| automatically flags them for review by another annotator
| 5. Stats on the annotators: speed, accuracy, and how frequently
| they assign different labels, to detect potential
| misunderstandings of the annotation schema
| 6. A GUI with an overview of all annotation datasets, with
| stats like % finished annotating (with stages for double
| annotation and review), the types of annotation done, and
| label frequencies to detect imbalances
| 7. Functions to mass-edit the annotations, like renaming or
| removing an entity type
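The disagreement-based warning in item 4 could work by training a model on one annotator's labels and flagging any text where it disagrees with the other annotator. A minimal sketch of that idea (the model here is a toy keyword rule; all names and labels are hypothetical):

```python
def flag_for_review(annotations, predict):
    """Return indices where a model trained on the other annotator's
    data disagrees with this annotator's label."""
    return [i for i, (text, label) in enumerate(annotations)
            if predict(text) != label]

# Toy stand-in for a model trained on the other annotator's data.
model = lambda text: "DRUG" if "mg" in text else "OTHER"

annotations = [
    ("aspirin 75mg daily", "DRUG"),    # model agrees
    ("follow up in 2 weeks", "DRUG"),  # model disagrees -> flagged
]
print(flag_for_review(annotations, model))  # [1]
```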
|
| Another thing I'd be interested in is some integration with a
| third party annotation provider. There are companies that
| offer annotation as a service and it's also available on
| Google Cloud and AWS. Having that integrated into an
| annotation tool would make it very easy to get large amounts
| of well annotated training material.
|
| But finally, and much more importantly: The workflow for
| annotators has to be perfected first, so they can work as
| efficiently and consistently as possible. Getting this right
| is more important to me than any of the other features I
| listed.
___________________________________________________________________
(page generated 2021-06-19 23:01 UTC)