[HN Gopher] Automatic Text Summarization in PDF Documents with F...
___________________________________________________________________
Automatic Text Summarization in PDF Documents with Faster R-CNN and
PEGASUS
Author : konfuzio
Score : 53 points
Date : 2021-02-26 14:27 UTC (8 hours ago)
(HTM) web link (konfuzio.com)
(TXT) w3m dump (konfuzio.com)
| Der_Einzige wrote:
| Ever thought about trying to do extractive summarization the same
| way? I'm constantly frustrated that there is no extractive
| PEGASUS variant and that all the existing transformer-based
| extractive models rank (or rather highlight) sentences rather than
| highlighting/underlining at the word level like most humans do.
| davide_v wrote:
| Do you offer this service of text summarization via API? I didn't
| exactly catch that, but we would be interested.
|
| Note: I think the "Register for free" button for the webinar is
| broken.
| konfuzio wrote:
| Hi David, thanks for reporting the link issue! We'll fix it. It
| should be https://app.konfuzio.com
|
| The page segmentation API is already live. The PDF
| summarization API is a work in progress. We wanted to share our
| approach early so that we can incorporate feedback. We are
| also working on the retraining loop to fine-tune our model on a
| small sample of other documents. We support this for custom NER
| models and document classification so far.
|
| Best Chris
| notafraudster wrote:
| The use of the CNN for page segmentation is great. I recently had
| to extract text from a few hundred thousand pages of PDFs which
| had actual text (so it's an easier problem than raw images), but
| which moved around between one-, two-, and three-column layouts,
| sometimes within a page. I ended up building what was basically a
| probabilistic model: I searched coordinate grids and looked for
| low-density columns in the grid. It worked well enough for my
| exact dataset but I think would not generalize very well, and at
| the time I was looking I wasn't satisfied with anything off the
| shelf. Kudos.
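
The comment above does not include code, so the following is only a rough sketch of the general idea (finding low-density vertical strips between text columns), not notafraudster's actual model. It assumes word bounding boxes are already available as (x0, x1) pairs from some PDF text extractor. The function name `find_column_gaps` and all thresholds are hypothetical, and this version is a plain histogram rather than a probabilistic model.

```python
from typing import List, Tuple


def find_column_gaps(
    word_spans: List[Tuple[float, float]],
    page_width: float,
    n_bins: int = 100,
    max_density: float = 0.05,
    min_gap_bins: int = 2,
) -> List[Tuple[float, float]]:
    """Return (x_start, x_end) ranges of interior low-density vertical strips.

    word_spans holds the horizontal extent (x0, x1) of each word on one page.
    All defaults are illustrative guesses, not tuned values.
    """
    bin_width = page_width / n_bins
    counts = [0] * n_bins

    # Count how many words cover each vertical strip of the page.
    for x0, x1 in word_spans:
        first = max(0, min(n_bins - 1, int(x0 / bin_width)))
        last = max(0, min(n_bins - 1, int(x1 / bin_width)))
        for b in range(first, last + 1):
            counts[b] += 1

    peak = max(counts)
    if peak == 0:
        return []  # no text on the page

    # A strip is "low density" if it holds few words relative to the busiest strip.
    low = [c <= max_density * peak for c in counts]

    # Ignore the page margins: only look between the first and last dense strip.
    left = low.index(False)
    right = n_bins - 1 - low[::-1].index(False)

    gaps = []
    run_start = None
    for b in range(left, right + 1):
        if low[b]:
            if run_start is None:
                run_start = b
        elif run_start is not None:
            # A run of low-density strips just ended; keep it if it is wide enough.
            if b - run_start >= min_gap_bins:
                gaps.append((run_start * bin_width, b * bin_width))
            run_start = None
    return gaps
```

Each returned range is a candidate gutter between columns. Layouts that change within a page would need this run per horizontal band rather than once per page, which is roughly where a fixed heuristic like this stops generalizing.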
| konfuzio wrote:
| Hi notafraudster, is this dataset or your approach public?
| Perhaps we can collaborate to expand our approach. FYI: We
| automatically detect whether a PDF has embedded text and use
| that to decide whether we need OCR. Thanks for the feedback!
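
The reply above does not say how the OCR decision is made, so the following is only one plausible heuristic, not konfuzio's implementation. It assumes the text layer of each page has already been extracted into a string by some PDF library. The function name `pages_needing_ocr` and the `min_chars` threshold are hypothetical.

```python
from typing import List


def pages_needing_ocr(page_texts: List[str], min_chars: int = 20) -> List[int]:
    """Return the indices of pages whose embedded text layer looks empty.

    page_texts[i] is whatever text a PDF library extracted from page i.
    min_chars is an illustrative threshold, not a tuned value.
    """
    needs_ocr = []
    for index, text in enumerate(page_texts):
        # Count only letters and digits so stray whitespace does not count as text.
        usable = sum(1 for ch in text if ch.isalnum())
        if usable < min_chars:
            needs_ocr.append(index)
    return needs_ocr
```

A character count alone will miss scanned pages that carry a poor-quality hidden text layer, so a production check would likely need extra signals.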
___________________________________________________________________
(page generated 2021-02-26 23:01 UTC)