[HN Gopher] Automatic Text Summarization in PDF Documents with F...
       ___________________________________________________________________
        
       Automatic Text Summarization in PDF Documents with Faster R-CNN and
       PEGASUS
        
       Author : konfuzio
       Score  : 53 points
       Date   : 2021-02-26 14:27 UTC (8 hours ago)
        
 (HTM) web link (konfuzio.com)
 (TXT) w3m dump (konfuzio.com)
        
       | Der_Einzige wrote:
       | Ever thought about trying to do extractive summarization the same
       | way? I'm constantly frustrated that there is no extractive
       | PEGASUS variant, and that all the existing transformer-based
       | extractive models rank (or rather highlight) sentences rather
       | than highlighting or underlining at the word level like most
       | humans do.
        
       | davide_v wrote:
       | Do you offer this service of text summarization via API? I didn't
       | exactly catch that, but we would be interested.
       | 
       | Note: I think the "Register for free" button for the webinar is
       | broken.
        
         | konfuzio wrote:
         | Hi David, thanks for reporting the link issue! We will fix it.
         | It should be https://app.konfuzio.com
         | 
         | The page segmentation API is already live. The PDF
         | summarization API is a work in progress. We wanted to share
         | our approach early so we can incorporate any feedback. We are
         | also working on a retraining loop to fine-tune our model on a
         | small sample of other documents. So far we support this for
         | custom NER models and document classification.
         | 
         | Best Chris
        
       | notafraudster wrote:
       | The use of the CNN for page segmentation is great. I recently had
       | to extract text from a few hundred thousand pages of PDFs which
       | had actual text (so it's an easier problem than raw images), but
       | which moved around between one-, two-, and three-column layouts,
       | sometimes within a page. I ended up building what was basically
       | a probabilistic model: I searched coordinate grids and looked
       | for low-density columns in the grid. It worked well enough for
       | my exact dataset, but I think it would not generalize very well,
       | and at the time I was looking I wasn't satisfied with anything
       | off the shelf. Kudos.
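
The grid search for low-density columns described in the comment above can be sketched roughly as follows. This is only an illustration under stated assumptions, not notafraudster's actual model: the function name `find_gutters`, its parameters, and the input format (horizontal extents of word or line boxes already extracted from one page) are all hypothetical. It bins the page width, counts how many boxes cover each bin, and reports interior runs of sparse bins as candidate column gutters.

```python
from typing import List, Tuple

Box = Tuple[float, float]  # (x_left, x_right) of one word or text-line box


def find_gutters(
    boxes: List[Box],
    page_width: float,
    bins: int = 100,           # hypothetical grid resolution
    max_density: float = 0.05, # a bin is "sparse" if <= 5% of boxes cover it
    min_bins: int = 2,         # ignore gaps narrower than this many bins
) -> List[Tuple[float, float]]:
    """Return (x_start, x_end) ranges of sparse vertical strips between text."""
    if not boxes:
        return []

    bin_width = page_width / bins
    counts = [0] * bins

    # Count how many boxes overlap each vertical strip of the page.
    for x0, x1 in boxes:
        first = max(0, int(x0 // bin_width))
        last = min(bins - 1, int(x1 // bin_width))
        for b in range(first, last + 1):
            counts[b] += 1

    threshold = max_density * len(boxes)
    occupied = [i for i, c in enumerate(counts) if c > threshold]
    if not occupied:
        return []

    # Scan only between the first and last dense bins, so that the
    # left and right page margins are not reported as gutters.
    gutters = []
    run_start = None
    for b in range(occupied[0], occupied[-1] + 1):
        if counts[b] <= threshold:
            if run_start is None:
                run_start = b
        elif run_start is not None:
            if b - run_start >= min_bins:
                gutters.append((run_start * bin_width, b * bin_width))
            run_start = None
    return gutters


# Toy two-column page, 600 units wide, with a gap between x=280 and x=320.
two_columns = [(50.0, 280.0)] * 20 + [(320.0, 550.0)] * 20
print(find_gutters(two_columns, page_width=600.0))
```

For pages that switch layout partway down, as the comment mentions, one option is to run the same function separately on horizontal bands of the page and compare the gutters found in each band. The thresholds here are guesses and would need tuning per dataset.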
        
         | konfuzio wrote:
         | Hi notafraudster, is this dataset or your approach public?
         | Perhaps we can collaborate to expand our approach. FYI: We
         | automatically detect whether a PDF has embedded text and use
         | that to decide if we need OCR. Thanks for the feedback!
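
One simple way to make the kind of OCR decision mentioned in the reply above is to check how much text each page already yields. The sketch below is an assumption about how such a check could look, not Konfuzio's implementation: `pages_needing_ocr` and the `min_chars` cutoff are hypothetical, and the per-page strings are assumed to come from whatever PDF text extractor is already in use.

```python
from typing import List


def pages_needing_ocr(page_texts: List[str], min_chars: int = 20) -> List[int]:
    """Return indices of pages whose extracted text is too short to trust.

    page_texts: one string per page, as returned by a PDF text extractor.
    min_chars:  hypothetical cutoff for non-whitespace characters per page.
    """
    needs_ocr = []
    for index, text in enumerate(page_texts):
        # Count only visible characters; whitespace-only pages look empty.
        visible = sum(1 for ch in text if not ch.isspace())
        if visible < min_chars:
            needs_ocr.append(index)
    return needs_ocr


pages = [
    "A long paragraph of embedded text on page one.",
    "",        # scanned page: the extractor found nothing
    "  \n ",   # only whitespace survived extraction
]
print(pages_needing_ocr(pages))
```

A character count is a crude signal: a scanned page with a short embedded header or watermark would pass the check, so a real pipeline would likely combine it with other evidence, such as how much of the page area is covered by images.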
        
       ___________________________________________________________________
       (page generated 2021-02-26 23:01 UTC)