[HN Gopher] USPTO to add surcharge on non-DOCX patent applicatio...
___________________________________________________________________
USPTO to add surcharge on non-DOCX patent applications in 2023
Author : zinekeller
Score : 100 points
Date : 2022-08-26 13:50 UTC (9 hours ago)
(HTM) web link (unblock.federalregister.gov)
(TXT) w3m dump (unblock.federalregister.gov)
| NotYourLawyer wrote:
| Can't wait to see what kinds of unredacted metadata people start
| uploading without a thought.
| nfriedly wrote:
| It sounds like they're going to try to catch and remove that -
| one of the bullet points under DOCX Benefits on
| https://www.uspto.gov/patents/docx reads:
|
| > _Privacy: provides automatic metadata detection (e.g. author
| and comments) and removal features to support the submission of
| only substantive information in the DOCX file._
|
| And, then further down in the FAQ it says:
|
| > _What happens to the metadata in DOCX files?_
|
| > _Metadata is generally removed by applicants prior to
| submission. However, if metadata is found during the validation
| process, it is automatically removed prior to submission.
| Examples of metadata include author, company, last modified by,
| etc. The only information that is preserved is the size, page
| count, and word count._
|
| > _Outgoing DOCX documents (i.e. Office actions) from the USPTO
| to applicants will also have metadata removed._
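The FAQ does not say how the USPTO's validation step removes metadata. As a rough sketch only, and assuming the usual DOCX packaging where author and "last modified by" live in `docProps/core.xml`, a scrubber could rewrite the archive with that part blanked. The function name `scrub_core_properties` and the toy archive are made up for illustration; reviewer comments (`word/comments.xml`) and `docProps/app.xml` are not handled here.

```python
import io
import zipfile

# Assumption: a core-properties part with no child elements is an acceptable
# "empty" replacement; this is a sketch, not the USPTO's actual procedure.
EMPTY_CORE = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<cp:coreProperties xmlns:cp="http://schemas.openxmlformats.org/'
    'package/2006/metadata/core-properties"/>'
)


def scrub_core_properties(docx_bytes: bytes) -> bytes:
    """Return a copy of the archive with docProps/core.xml blanked out."""
    src = zipfile.ZipFile(io.BytesIO(docx_bytes))
    out_buf = io.BytesIO()
    with zipfile.ZipFile(out_buf, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "docProps/core.xml":
                # Replace author, company, etc. with an empty properties part.
                data = EMPTY_CORE.encode("utf-8")
            dst.writestr(item.filename, data)
    return out_buf.getvalue()


def _toy_docx() -> bytes:
    """Build a tiny stand-in archive (not a complete, Word-openable DOCX)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("word/document.xml", "<doc>claims...</doc>")
        z.writestr("docProps/core.xml", "<core><creator>Jane Doe</creator></core>")
    return buf.getvalue()


cleaned = scrub_core_properties(_toy_docx())
with zipfile.ZipFile(io.BytesIO(cleaned)) as z:
    core_after = z.read("docProps/core.xml").decode("utf-8")
    body_after = z.read("word/document.xml").decode("utf-8")
```

Blanking the part instead of deleting it keeps the package's relationship entries pointing at something that exists, which seemed the safer choice for a sketch.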
| lesuorac wrote:
| I don't think you submitted the link you wanted to.
|
| I see https://unblock.federalregister.gov/ which doesn't take me
| anywhere useful.
| bsimpson wrote:
| That explains why I get a 503 every time I pass the captcha.
| zinekeller wrote:
| Sorry, I thought that the second time (!) I submitted this, the
| HN filter hadn't messed up the link. Here it is
| (https://www.federalregister.gov/documents/2022/04/28/2022-09...),
| hopefully!
|
| In case the overaggressive HN filter ate it again, it's
| https www-federalregister-gov
| documents/2022/04/28/2022-09027/filing-patent-applications-in-
| docx-format
|
| Also dang, recently HN's filters are too aggressive - from
| sentence-casing USPTO (to Uspto which isn't even a
| pronounceable acronym) to removing Twitter's search queries.
| altairprime wrote:
| You should email dang using the footer Contact link about
| this thread, so that he sees it.
| nazgulsenpai wrote:
| I was wondering why docx format would be chosen instead of PDF
| but they answer it pretty completely here if anyone else is
| interested: https://www.uspto.gov/patents/docx
| [deleted]
| jedberg wrote:
| It looks like most of those translate to "We build our
| automated systems around DOCX so you get all our features if
| you use it".
|
| But it doesn't really say why they chose to build on docx.
| bragr wrote:
| >But it doesn't really say why they chose to build on docx.
| Is requiring the DOCX format just adding another step in the
| process for applicants? Actually, it's the opposite.
| The USPTO conducted a study and found that over 80% of
| applicants are authoring their applications in DOCX format
| (through writing tools such as Microsoft Word). Because the
| files are originally in a DOCX format, uploading the original
| file eliminates the step for the applicant to convert the
| document to PDF prior to submission. Instead, the applicant
| is able to save the step of converting because our system
| will do that automatically.
| apocalyptic0n3 wrote:
| > But it doesn't really say why they chose to build on docx.
|
| Having worked directly with their teams in the past (although
| not on this), a lot of their systems seemed to evolve
| naturally over time based on the needs present. In that
| industry, a large majority of the documents being passed back
| and forth are DOCX. So my semi-educated guess is someone
| built a system to handle some simple intake tasks for DOCX
| applications because a large majority already were, it
| evolved over a few years, and when they finally decided to
| fully automate the process, they decided to build upon what
| they have which only supported DOCX and it was cheaper/easier
| to mandate everyone submit in that format than to build a new
| system or add support for others.
| ramoz wrote:
| I get that you've worked with them, it seems, but I would argue
| that your hunch is wrong here.
|
| Regulatory processes, business systems, and international
| integration are plagued by PDF OCR complexities. OCR
| creates systemic issues and an anatomy of complex system
| architectures. I'm sure XML is a typical downstream for
| parsing anyways. Use DOCX to enhance quality of the overall
| scope of integrations.
| apocalyptic0n3 wrote:
| They could just use standardized application forms the
| way they do for research reports they require (the "ISA
| ###" forms). Those forms are easily parsable by things
| like pdftk and don't require any OCR.
|
| I don't necessarily disagree with your point (since it
| makes complete sense), just wanting to point out that
| they already have a system in place for this using other
| means (although even there they are moving toward XML
| instead, likely because of what a pain it is to deal with
| text that exceeds the area of the input in PDFs)
| jedberg wrote:
| This makes the most sense.
| joshstrange wrote:
| As a few others have mentioned, the parsing alone means DOCX
| is a huge win over PDF. I had to parse a bunch of PDF data
| related to COVID and it was always a PITA. Every time they
| changed their layout even a little bit I had to rewrite parts
| of my extractor. The worst part? The headers/metadata showed
| it was all made in Word so they could have exported to DOCX
| as well as PDF if they wanted to but they only provided PDF.
| meragrin_ wrote:
| I guess you have little exposure to the industry. My
| experience is the vast majority already use Word or something
| else which supports DOCX. I cannot think of another format
| which practitioners have easy access to and would use. PDF
| just needs to go away for this process.
| jedberg wrote:
| I actually do have a lot of exposure to patents, and I know
| everyone uses DocX already. I'm just saying that that web
| page doesn't say why _they_ chose DocX, only why you should
| _use_ DocX.
| [deleted]
| [deleted]
| meragrin_ wrote:
| Sure it does. I'll give you it does not say why they
| chose it over other alternatives which I'm thinking is
| what you are looking for. Are there really any
| alternatives? The only real alternative I can think of is
| OpenDocument Format and I don't consider it an alternative.
| As they say on that page, 80% of their users already deal
| with DOCX so 80+% of them will have to convert to ODF. I
| can't imagine ODF having any sort of benefit worth
| requiring most people to convert their documents before
| sending.
| nescioquid wrote:
| To me, the salient question is why is the government
| officially adopting a proprietary file format? Why is it
| important to optimize for the trivial convenience of
| patent applicants?
|
| It seems more like rationalization than reason.
| dataflow wrote:
| It actually seems like a sane choice to me. PDF is good for
| rendering, but horrible for parsing. DOCX is a ZIP file with
| XML data. Maybe ODT or whatever would've been a better
| choice, I don't know what the format is like. But if you
| disregard the usual knee-jerk "but it's Microsoft!" reaction,
| it doesn't seem like a bad choice.
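To make the "ZIP file with XML data" point concrete, here is a minimal sketch using only the Python standard library. It assumes the main text lives in `word/document.xml` as `w:p` paragraphs containing `w:t` text runs, which is the usual WordprocessingML layout. The helper names and the toy archive are invented for the example; the toy archive has only that one part and is not a complete DOCX.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def make_toy_docx(paragraphs):
    """Build a stripped-down archive holding only word/document.xml."""
    body = "".join(f"<w:p><w:r><w:t>{p}</w:t></w:r></w:p>" for p in paragraphs)
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("word/document.xml", xml)
    return buf.getvalue()


def extract_paragraphs(docx_bytes):
    """Unzip the archive and return the text of each paragraph."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as z:
        root = ET.fromstring(z.read("word/document.xml"))
    out = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # Join every text run inside the paragraph; formatting is ignored.
        out.append("".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t")))
    return out


paragraphs = extract_paragraphs(
    make_toy_docx(["A widget.", "A method of using the widget."])
)
print(paragraphs)
```

Real files carry many more parts (styles, settings, relationships), but a text scraper along these lines can leave most of them unread.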
| ndiddy wrote:
| The Office Open XML file format is extremely complex, and
| takes up around 6,500 pages (compared to ~1000 for ODF).
| One thing you notice when reading the DOCX spec is that
| they designed it with the sole constraint that DOC files
| could easily be converted to DOCX. For example, you'll
| frequently see compatibility tags like
| "autoSpaceLikeWord95", "footnoteLayoutLikeWW8",
| "useWord2002TableStyleRules", and "lineWrapLikeWord6" that
| expose internal implementation details. Rather than
| creating a useful standard allowing all users to store
| their documents in a clean, portable way, Microsoft decided
| to make their standard faithfully reproduce all of the
| quirks and bugs of their legacy binary formats. It's so
| difficult to correctly implement the Office Open XML
| standard that even Microsoft took until Office 2013 to do
| so (the standard was approved in 2006).
| notriddle wrote:
| > "autoSpaceLikeWord95", "footnoteLayoutLikeWW8",
| "useWord2002TableStyleRules", and "lineWrapLikeWord6"
|
| I expect that whatever tooling the USPTO uses can
| probably just ignore those things. They're extracting
| metadata, not actually rendering it.
| dataflow wrote:
| Interesting! How do they compare feature-wise? I feel
| like there must be things each of them support that the
| other one doesn't, but I don't know how consequential
| they are.
| Kye wrote:
| I don't know how thorough or accurate it is, but
| Microsoft has a list.
|
| https://support.microsoft.com/en-us/office/differences-
| betwe...
|
| The list was only a few lines the last time I looked
| years ago, so maybe they're actually trying to make a
| complete list.
| not2b wrote:
| I think they pretty much had to do that to preserve the
| formatting of existing documents for users who are force-
| upgraded by their employers to new Office versions. But
| it seems a scraper that just wants the information in the
| document can ignore almost all of those tags.
|
| edit: ninja'd.
| jfk13 wrote:
| So true. "It's XML, so it must be easy to parse and
| manipulate" is such a naive, even misleading attitude. If
| what you do is take a byzantine, legacy-encrusted
| implementation and just serialise its data structures to
| an XML representation, very little has been gained.
|
| [edit: but I will grant that almost anything is better
| than attempting to parse useful content from PDF.]
| fezfight wrote:
| I think the knee-jerk is against any alternative to Office,
| not against Office. Statistically speaking, trying to use
| anything reasonable that doesn't genuflect to Microsoft's
| monopoly is what seems to be met with a knee-jerk reply
| such as yours. As in, there's probably more people who
| don't care but hate the complaining about libre stuff than
| there are advocates for libre stuff.
| molsongolden wrote:
| Was just about to post this. Unzipping DOCX and parsing XML
| is much easier than accurately processing PDF submissions.
| oneplane wrote:
| OCR has come a long way, so much so that visually
| interpreting a PDF is about as error-prone as parsing XML
| output from Microsoft in non-Microsoft software.
| programmarchy wrote:
| Try extracting tabular data from a PDF! With XML it's
| trivial, but for PDF you need highly specialized software
| packages to do this. One of the best, pdfplumber, is
| largely based [1] on a Master's thesis titled Algorithmic
| Extraction of Data in Tables in PDF Documents [2].
|
| [1] https://github.com/jsvine/pdfplumber/blob/stable/pdfp
| lumber/...
|
| [2] https://trepo.tuni.fi/bitstream/handle/123456789/2152
| 0/Nurmi...
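For contrast with the PDF case, here is a hedged sketch of the XML side. It assumes tables in `word/document.xml` are `w:tbl` elements made of `w:tr` rows and `w:tc` cells, as in standard WordprocessingML. The sample XML and its contents ("Part", "Bolt", and so on) are made up, and the function `table_rows` is illustrative only: it ignores merged cells and would double-count nested tables.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Hand-written sample; a real document.xml carries far more markup.
SAMPLE = """<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:tbl>
      <w:tr>
        <w:tc><w:p><w:r><w:t>Part</w:t></w:r></w:p></w:tc>
        <w:tc><w:p><w:r><w:t>Qty</w:t></w:r></w:p></w:tc>
      </w:tr>
      <w:tr>
        <w:tc><w:p><w:r><w:t>Bolt</w:t></w:r></w:p></w:tc>
        <w:tc><w:p><w:r><w:t>4</w:t></w:r></w:p></w:tc>
      </w:tr>
    </w:tbl>
  </w:body>
</w:document>"""


def table_rows(document_xml):
    """Return every table as a list of rows, each row a list of cell strings."""
    root = ET.fromstring(document_xml)
    tables = []
    for tbl in root.iter(W + "tbl"):
        rows = []
        for tr in tbl.findall(W + "tr"):
            # Concatenate all text runs found inside each cell.
            row = ["".join(t.text or "" for t in tc.iter(W + "t"))
                   for tc in tr.findall(W + "tc")]
            rows.append(row)
        tables.append(rows)
    return tables


tables = table_rows(SAMPLE)
print(tables)
```

The point is that rows and cells are explicit in the markup, so nothing has to be inferred from glyph positions on a page.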
| jedberg wrote:
| I never said it was a bad choice, only that they didn't say
| why they chose it.
| dataflow wrote:
| I guess, but you were replying to "I was wondering why
| docx format would be chosen instead of PDF" seemingly
| unconvinced, so I assumed you thought PDF would've made
| more sense.
| Noted wrote:
| Nice to see they call out LibreOffice as a usable application
| as well.
| AdmiralAsshat wrote:
| INSTEAD of Open Office, no less!
| hedora wrote:
| They say that 80% of the submissions used to be converted to
| PDF from word. I'd be interested to know where the other 20%
| came from.
| meragrin_ wrote:
| There's this one guy I've dealt with. He uses an editor he
| wrote himself. He'll convert his documents to Pages and then
| use Pages for any other conversion needed.
| NotYourLawyer wrote:
| I'm guessing Google Docs is the next most common, and then
| probably LibreOffice.
| meragrin_ wrote:
| My experience is a large distrust of cloud environments. I
| would expect Pages and Libre/Open Office to be more common
| than Google Docs.
| NotYourLawyer wrote:
| Oh right, I bet Pages is up there. I dunno though--I know
| lots of patent attorneys who are surprisingly non-
| technical and probably haven't given a thought to cloud
| security issues.
| bonyt wrote:
| My guess is that printed and scanned documents from Word make
| up a large component of this.
| pavon wrote:
| Do patent attorneys love Word Perfect as much as some other
| people in the legal profession?
| deathanatos wrote:
| > _Due to aggressive automated scraping of FederalRegister.gov
| and eCFR.gov, programmatic access to these sites is limited to
| access to our extensive developer APIs._
|
| Apparently. Then a captcha and a button to request access, which
| if you complete, returns a 500 Internal Server Error.
|
| ... my tax dollars are _hard_ at work, I see.
|
| The Wayback Machine hasn't got a snapshot, either, it seems.
| zevra wrote:
| It's apparently the wrong link; see:
| https://news.ycombinator.com/item?id=32609165
|
| The correct one is:
| https://www.federalregister.gov/documents/2022/04/28/2022-09...
| batmaniam wrote:
| Seriously, I've had my share of horror stories on government
| systems. Do they just not dogfood their own product? Where is
| their QA team? It's atrocious how basic tasks are so broken all
| the time.
| ceeplusplus wrote:
| Government can't afford to hire the competent talent, only
| the scraps after everyone else (even the consulting
| bodyshops) are done. The top GS pay bracket is lower than
| entry level engineers at many companies (not just FAANG, but
| also defense, F500 companies, etc.)
| CobrastanJorji wrote:
| One of the many things that I found tempting about working
| for the U.S. Digital Service was that, while the GS-15 pay
| grade is definitely way less than I'd make in the private
| sector, my spouse's family is military/government and the
| difference between "hippy programming thingy" and "has a
| GS-15/O-6 job" would've been night and day. The one puts me
| in a pile of stereotypes, but the other says "oh, he's
| basically the bureaucratic equivalent of a Captain, that's
| very respectable."
| ceeplusplus wrote:
| Different circles I guess. While my spouse's family
| considers government jobs to be stable and somewhat
| respectable, there is a lot more respect for FAANG and
| other high paying jobs. One is respectable, the other is
| prestigious.
| nullc wrote:
| Machine learning is coming for the examiners' jobs. :P
| apocalyptic0n3 wrote:
| I've been working on some tools that integrate with USPTO (both
| from the application side and the validation side) for quite a
| few years now and they've been making a TON of formatting changes
| recently. A lot of their PDF forms have changed, they're
| requiring XML versions of all data we submit, they're handling
| classifications differently, etc. Their process always felt like
| it was stuck in the past and being handled manually by humans
| before and now it feels like they're moving everything toward
| automated intake and initial reviews. I imagine this change is
| for the same reasons, and that's a hefty fee to force it.
|
| This also likely means I will need to rework our systems to spit
| out docx instead of/in addition to PDFs, which will be a
| nightmare to do. So that's fun.
| MrLeap wrote:
| The consolation is that, if I remember correctly, docx is just
| a zip file containing xml.
|
| I made an xlsx exporter in actionscript3 (lol) years ago and it
| worked like this. What I ultimately did was make a "template"
| document, and my code just injected strings into key spots,
| zipped it up in memory and gave it to you as a file.xlsx.
| Probably took me 3 days?
|
| I didn't have the benefit of libraries so I imagine this is
| significantly easier in less hobbled environments, nodejs or
| whatever probably has a kitchen sink package to do it.
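The template approach described above can be sketched in a few lines of Python. This is not the original ActionScript code: the `{{key}}` placeholder syntax, the `fill_template` name, and the toy template are all assumptions made for the example. One known catch with real Word-authored templates is that Word may split a placeholder across several runs, in which case a plain string replace will not find it.

```python
import io
import zipfile
from xml.sax.saxutils import escape


def fill_template(template_bytes, values):
    """Copy the archive in memory, substituting {{key}} in word/document.xml."""
    src = zipfile.ZipFile(io.BytesIO(template_bytes))
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for name in src.namelist():
            data = src.read(name)
            if name == "word/document.xml":
                text = data.decode("utf-8")
                for key, value in values.items():
                    # Escape &, < and > so injected strings stay valid XML.
                    text = text.replace("{{" + key + "}}", escape(value))
                data = text.encode("utf-8")
            dst.writestr(name, data)
    return out.getvalue()


def toy_template():
    """A one-part stand-in for a template (not a complete DOCX)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("word/document.xml", "<w:t>Title: {{title}}</w:t>")
    return buf.getvalue()


filled = fill_template(toy_template(), {"title": "Widgets & Gears"})
with zipfile.ZipFile(io.BytesIO(filled)) as z:
    result_xml = z.read("word/document.xml").decode("utf-8")
print(result_xml)
```

The returned bytes can be written to disk or sent as a download; every other part of the template is copied through untouched.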
| lofatdairy wrote:
| That's exactly right. There are definitely nodejs docx
| templating packages (I've worked on codebases that used them
| in the past), but they're certainly not required provided
| your documents are reasonably simple.
|
| If anything, generating a pdf from various input
| files/structured text has been a much harder task. We
| generated docx files to allow for easy modification by non-
| technical staff, but to generate a pdf we had to use a
| headless instance of libreoffice since pandoc was struggling
| with the rendering.
| peteradio wrote:
| I'm a masochist in need of work.
| aaaddaaaaa1112 wrote:
| Ironlink wrote:
| In case anyone is looking for the size of the surcharge, I found
| it in the last row of this table:
| https://www.federalregister.gov/d/2020-16559/p-555
| colejohnson66 wrote:
| So, $100-400 depending on the size (CFR section 1.16(u)). That
| feels... excessive, but if processing a DOCX is automatic and
| PDFs require humans (I'm assuming), it makes sense.
| nfriedly wrote:
| Note: the correct link is
| https://www.federalregister.gov/documents/2022/04/28/2022-09...
|
| The fee is $100, $200, or $400 depending on the size of the
| document.
| emgeee wrote:
| The fee is $400
___________________________________________________________________
(page generated 2022-08-26 23:00 UTC)