[HN Gopher] When XML in Word Became Illegal
___________________________________________________________________
When XML in Word Became Illegal
Author : ejz
Score : 92 points
Date : 2023-10-12 14:49 UTC (6 hours ago)
(HTM) web link (blog.withedge.com)
(TXT) w3m dump (blog.withedge.com)
| jkaptur wrote:
| "Microsoft.... built a custom XML tool into its word processor in
| 2007... this was a tool for power users, and was only used by a
| small percentage of its user base."
|
| I'm definitely confused by that statement and its link, because
| it implies the relevant tool is _the disk format for every Office
| file_ , which has been described by an Excel program manager as
| "complicated enough to reduce a grown programmer to tears."
| https://www.joelonsoftware.com/2008/02/19/why-are-the-micros...
| colejohnson66 wrote:
| Nitpick: Joel is referring to the _old_ BIFF-style format (from
| 2003 and before) in that quote. The new "Office Open XML"
| formats are not mentioned in that post at all. However, one of
| the many criticisms of the Office Open XML formats is that they
| are, in some areas, nothing more than an XML serialization of
| the BIFF records.
| xanathar wrote:
| The article says the feature has been removed; if it was the
| disk format:
|
| 1) it has never been removed, afaik Word still uses OOXML, so
| Word would still be infringing
|
| 2) LibreOffice would probably be infringing too, as ODF is also
| XML based
|
| So... it has to be some other form of XML tool and not the file
| format.
|
| As for Joel's comment, IIRC he was an Excel PM _before_ OOXML;
| in any case his blog post refers to the binary format that
| precedes OOXML. I'm pretty sure OOXML is equally if not even
| more complicated, as the products themselves are way more
| complicated than they appear, but the fact is that he was
| talking about a different thing.
|
| Edit: as many users pointed out, it's not the file format
| itself, but the ability to add arbitrary attributes/elements to
| the file format XML as additional data.
| blackboxlogic wrote:
| I believe I ran into this issue a few years ago and discovered
| the patent case when trying to work around it. The XML file
| format allowed for arbitrary properties to be added (as XML
| does), and we were trying to embed metadata in Word files. But
| when MS Word opened a file with anything extra in it, it gave a
| warning like "this file has extra stuff in it" and
| automatically removed anything that wasn't explicitly expected.
| msp_yc wrote:
| Not sure why this is downvoted, it's absolutely correct. I
| tried this myself; it would have -greatly- simplified
| scraping Word docs because the custom tags would have been
| available for XPath querying. Alas, Word strips it all on
| open.
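
To illustrate the XPath point above: a minimal sketch, assuming a custom element had survived inside the document XML. The `urn:example:meta` namespace, the `customer` element, and the sample fragment are all hypothetical; only the WordprocessingML namespace URI is real. It uses the limited XPath subset in Python's `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Real WordprocessingML namespace; META is a made-up namespace for illustration.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
META = "urn:example:meta"

# Hypothetical fragment: a custom <m:customer> element wrapping a text run.
SAMPLE = f"""<w:document xmlns:w="{W}" xmlns:m="{META}">
  <w:body>
    <w:p><w:r><w:t>Invoice for </w:t></w:r>
      <m:customer id="42"><w:r><w:t>Acme Corp</w:t></w:r></m:customer>
    </w:p>
  </w:body>
</w:document>"""


def find_custom(xml_text, tag):
    """Return (id, text) pairs for every custom element named `tag`."""
    root = ET.fromstring(xml_text)
    ns = {"w": W, "m": META}
    results = []
    for el in root.findall(f".//m:{tag}", ns):
        # Collect the visible text from all runs nested inside the custom tag.
        text = "".join(t.text or "" for t in el.findall(".//w:t", ns))
        results.append((el.get("id"), text))
    return results


print(find_custom(SAMPLE, "customer"))  # [('42', 'Acme Corp')]
```

With the tags stripped on open, as described above, the same query would simply return an empty list.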
| richard_todd wrote:
| It's not referring to the XML formats. It was a feature of
| Word specifically which allowed you to embed a user-defined XML
| schema in your Word document, and use XML data that fits the
| schema in your document.
|
| See https://www.zdnet.com/article/custom-xml-the-key-to-patent-s...
|
| (edit: grammar)
| jkaptur wrote:
| Ah, thanks for explaining.
| tannhaeuser wrote:
| Looking at the patent application, it doesn't appear to mention
| XML at all (it does talk about SGML, though), and the
| application appears to claim any mapping of a symbolic name to
| style properties (think Word styles or CSS classes); in other
| words, technical trivialities, reflecting poorly on US lawyers
| and their patent law.
| rbehrends wrote:
| It's not about storing XML, it's (as far as I understand the
| patent) about a specific representation of XML that can be more
| efficient to read.
|
| The patent is about representing documents with markup (XML or
| otherwise) not by embedding them in the text, but rather having
| them stripped and maintained as a separate list of (tag,
| position) pairs, with the document only containing the raw
| text.
|
| I'm only surprised that Microsoft couldn't find prior art,
| because having a (content-type, address) index at the beginning
| of a file is not exactly an unusual representation. It also
| reminds me that the USPTO's idiosyncratic usage of non-
| obviousness doesn't really match my intuition.
| ejz wrote:
| This is a huge issue with the patent world in general.
| There's just so much prior art out there, and you have to be
| really clear about showing that it applies. This isn't a
| patent case, but I have a great Google Maps case involving
| Wi-Fi where a judge completely borked it. As for this
| particular patent, I'm not enough of an XML expert to say
| whether the court got it right here. But it is worth noting
| that Microsoft tried to invalidate the patent several times
| with the USPTO and failed to do so there as well. So perhaps
| there's something more to the patent than meets the eye, or
| it was novel at that time but would not be given modern XML. Remember,
| the actual i4i patent at issue was filed in 1994, and it only
| matters if there was prior art from before 1994. It might
| have been novel at the time.
| rbehrends wrote:
| > Remember, the actual i4i patent at issue was filed in
| 1994, and it only matters if there was prior art from
| before 1994. It might have been novel at the time.
|
| I am aware of the date of the "invention". I was
| programming on 8- and 16-bit computers in the 1980s and I
| was using this and similar kinds of formats for non-textual
| data, simply because it was easier to do this in assembler
| than writing a parser, paired with the difficulty of
| finding unused special bytes in binary data to separate
| meta-information from the data proper.
|
| And I was also talking about non-obviousness, not novelty.
| ejz wrote:
| Fair enough. I haven't seen the invalidation proceedings
| and am clearly less of an expert than you. So I don't know
| whether they got it right. Non-obviousness is, erm, non-
| obvious.
| cm2187 wrote:
| Am I right to understand that it would be the equivalent of
| Visual Studio's WPF designer [1], where you have the WYSIWYG
| editor side by side with an XML editor and you can make the
| change in either of them and it translates into the other?
|
| If it is, it would have been really really cool.
|
| [1] https://i.stack.imgur.com/8pJnn.png
| rbehrends wrote:
| No. It's more like what the following piece of code
| produces:
|
|     import re
|
|     def convert(xml):
|         parsed = re.split(r"(<.+?>)", xml)
|         output = parsed[0]
|         tags_with_pos = []
|         for i in range(1, len(parsed), 2):
|             tags_with_pos.append((parsed[i], len(output)))
|             output += parsed[i+1]
|         return tags_with_pos, output
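
The (tag, position) representation described above can be round-tripped. A minimal sketch, restating the split step so the block runs on its own and adding an inverse, `join_markup`, which is a hypothetical helper and not part of the original comment. Like the original, the regex is a toy: it does not handle comments, CDATA, or `>` inside attribute values.

```python
import re


def split_markup(xml):
    """Split markup into a list of (tag, position) pairs plus the raw text."""
    parts = re.split(r"(<.+?>)", xml)  # odd indices are tags, even are text
    text = parts[0]
    tags = []
    for i in range(1, len(parts), 2):
        # Record each tag with its offset into the tag-free text.
        tags.append((parts[i], len(text)))
        text += parts[i + 1]
    return tags, text


def join_markup(tags, text):
    """Rebuild the original markup from the (tag, position) list and raw text."""
    out = []
    prev = 0
    for tag, pos in tags:
        out.append(text[prev:pos])  # text between the previous tag and this one
        out.append(tag)
        prev = pos
    out.append(text[prev:])
    return "".join(out)


doc = "<p>Hello <b>world</b>!</p>"
tags, text = split_markup(doc)
print(text)  # Hello world!
print(tags)  # [('<p>', 0), ('<b>', 6), ('</b>', 11), ('</p>', 12)]
print(join_markup(tags, text) == doc)  # True
```

The raw text can be read or searched without touching the markup, and the tag list can be swapped out without touching the text.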
| robertlagrant wrote:
| > the USPTO's idiosyncratic usage of non-obviousness doesn't
| really match my intuition
|
| Remember that the USPTO gets paid for each patent application,
| and is not penalised when the patent is later invalidated.
| rbehrends wrote:
| Well, it was apparently upheld twice on reexamination,
| where they could have fixed that. The problem is more that
| the bar for non-obviousness is so low, it's basically on
| the floor. Paired with a discipline (software development),
| where independent reinvention is common, this is just a
| recipe for disaster.
| Karellen wrote:
| > it implies the relevant tool is _the disk format for every
| Office file_
|
| Does it imply that?
|
| Another commenter has already pointed out why it's likely not
| the case.
|
| But also, I don't think the article is well written. Partly
| because it doesn't clearly explain what the infringing tool
| was, or did, or how it operated. Also I'm pretty sure there's a
| typo in "ex part" instead of "ex parte". But another major
| issue is the following:
|
| > $40 million of that judgment [against Microsoft] was imposed
| by the court as punishment for continually arguing that i4i was
| a patent troll even though it had an operating business in a
| manner that was "persistent, legally improper, and in direct
| violation of the Court's instructions."
|
| What?
|
| Why would i4i operating in a manner that was persistent,
| improper and in violation of the court's instructions preclude
| it from also being a patent troll? It could do both?
|
| Or is the "persistent..." descriptor meant to apply to
| Microsoft? That might make more sense, but the "even though"
| seems to be a comparison between two types of activity by one
| entity - namely i4i.
|
| But then again, I might be reading "it had an operating
| business in a manner" wrong, because it feels ungrammatical to
| me. I might not be putting the emphasis in the right place, and
| that's what's causing me to misread the sentence?
|
| The whole thing just feels confusing.
| ejz wrote:
| Thanks for reading. Sorry if this was confusing! Microsoft
| said that i4i was a patent troll despite the court repeatedly
| telling Microsoft to not do that. The judge referred to
| Microsoft's repeated ignoring of its instructions as
| "persistent" etc. i4i had an operating business; it wasn't a
| patent troll. That operating business is niche and small, but
| it is real. I have updated that sentence to make it clearer.
| Thanks for your feedback!
| ackfoobar wrote:
| Depends on one's definition. I don't think "not having a
| real product/service" is the defining characteristic of
| "patent troll". Here's what Wikipedia says.
|
| > attempts to enforce patent rights against accused
| infringers far beyond the patent's actual value or
| contribution to the prior art
|
| > often do not manufacture products or supply services
| based upon the patents in question
| ejz wrote:
| This isn't what Joel is talking about here.
|
| On the backend, all .docx files use XML. Joel is saying the
| root XML format was difficult to work with.
|
| What my article is about is this: Microsoft used to allow users
| to write their own custom XML rules on top of Word. (This was
| mostly app developers using XML for macros rather than end
| users, and overall it was very rare.) This is the feature that
| was at issue with the patent.
|
| Sorry if this was not clear!
| jkaptur wrote:
| Thanks for clarifying!
| Jtsummers wrote:
| > Joel is saying the root XML format was difficult to work
| with.
|
| Joel wasn't writing about the XML version of MS Office
| documents, he was writing about the binary versions.
| londons_explore wrote:
| Anyone got a screenshot of this feature?
| dbavaria wrote:
| See here: https://learn.microsoft.com/en-us/office/troubleshoot/word/c...
| jandrese wrote:
| > Indeed, as you work on your Excel clone, you'll discover all
| kinds of subtle details about date handling. When does Excel
| convert numbers to dates? How does the formatting work? Why is
| 1/31 interpreted as January 31 of this year, while 1/50 is
| interpreted as January 1st, 1950? All of these subtle bits of
| behavior cannot be fully documented without writing a document
| that has the same amount of information as the Excel source code.
|
| A quick note to anybody building an Excel clone: If you want to
| turn this insane date handling behavior of Excel into an optional
| feature that can be disabled everybody will appreciate it.
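
The two cases in the quote (1/31 versus 1/50) can be reproduced with a very small heuristic. A minimal sketch, not Excel's actual algorithm: `guess_date` is a hypothetical name, it only handles US-style `month/number` input, and the two-digit-year pivot at 30 is an assumption chosen for illustration.

```python
import calendar
from datetime import date


def guess_date(entry, today):
    """Toy guess at how 'a/b' might be read as a date (not Excel's real rules)."""
    first, second = (int(p) for p in entry.split("/"))
    if not 1 <= first <= 12:
        return None  # first number cannot be a month
    days_in_month = calendar.monthrange(today.year, first)[1]
    if 1 <= second <= days_in_month:
        # Plausible day of the month: month/day in the current year.
        return date(today.year, first, second)
    if 0 <= second <= 99:
        # Otherwise read it as month/two-digit-year; pivot of 30 is assumed.
        century = 2000 if second < 30 else 1900
        return date(century + second, first, 1)
    return None


today = date(2023, 10, 12)
print(guess_date("1/31", today))  # 2023-01-31
print(guess_date("1/50", today))  # 1950-01-01
```

Even this toy version depends on the current year and on locale conventions, which is the kind of detail the quote says is hard to document fully.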
| atoav wrote:
| I always wondered why they won't just make it a popup button?
|
| Default should be to not change anything, if a date is
| recognized offer a button right next to the cell that allows
| you to accept the suggestion to turn it into a fully fledged
| date. Just make it so that pressing tab or shift enter or a
| similar combination accepts that suggestion.
| xigoi wrote:
| https://xkcd.com/1172/
| numpad0 wrote:
| Just do https://xkcd.com/927/, happened once and it was
| okay.
| xigoi wrote:
| What comes after .docx? .docxx? .docy? .docxi?
| jimmaswell wrote:
| docxEx, in Win32 fashion.
| WirelessGigabit wrote:
| Scientists will thank you:
| https://www.theverge.com/2020/8/6/21355674/human-genes-renam...
| qclibre22 wrote:
| > Scientists will thank you
|
| Scientists gave up and changed the gene names:
| https://duckduckgo.com/?q=excel+gene+names+changed+septin1
| jahav wrote:
| It's also country specific.
|
| I work on an Excel library and the text to number/date feature was
| one of the less fun things to implement at least semi-correctly.
|
| I remember my comment on the PR back then:
|
| https://github.com/ClosedXML/ClosedXML/pull/1899
| pjungwir wrote:
| If only someone had filed a patent that blocked Word from
| inserting curly quotes the wrong way, like '449.
| willcipriano wrote:
| The United States is #1 for protection of intellectual property
| in the world according to the property rights index:
| https://www.internationalpropertyrightsindex.org/
|
| Real property on the other hand? The US is ranked 14th.
| breakfastduck wrote:
| I'm completely baffled as to how it's allowed to get a patent on
| stuff like this.
|
| Can I patent sending REST requests using JSON?
| empath-nirvana wrote:
| No, that's not how it works, you can't patent a specific
| technology that's already been invented, what you do is wait
| for a new technology to be invented and then patent doing some
| obvious thing with the new technology.
|
| Like, a good patent today would be: "Using a computer text
| prediction engine to automatically review and approve code."
|
| It probably would have been pretty smart to skim through all
| the Hacker News threads after ChatGPT came out, patenting every
| other comment.
| donatj wrote:
| XML editing had already been invented
| jahav wrote:
| You can try to patent anything, but the patent might not be
| accepted.
|
| The thing is that the patent office is funded by patent fees, so
| there is an incentive to accept the patent plus they are often
| hard to read.
| svachalek wrote:
| What I understand of US law is that there's very little in
| the way of filing a patent. It's not really tested until
| someone challenges it.
| lucozade wrote:
| You could apply for that patent but I would expect it to be
| rejected due to prior art i.e. someone came up with it before
| you. Even if it was accepted, if you tried to enforce it, it'd
| definitely be challenged on prior art and you would very likely
| lose because it wouldn't be hard to prove you weren't the first.
|
| Now, why this particular patent exists, and seems so general,
| is also likely related to prior art. What could be patented for
| software was a bit murky until the late 1990s when it was
| established that business methods implemented in software were
| allowed. This led to a large flood of patents in that space.
|
| One of the issues is that the Patent Office tends to look at
| prior art as being "things that have already been patented" so
| when rules change, a lot of things that seem obvious are up for
| grabs because there's no prior patent. Now, these can be (and
| are) challenged in court and, in court, they're more likely to
| accept blatant prior usage in the wild. I don't know whether
| this case won its challenge but it's possible that it didn't
| because XML was quite new in the late 90s too.
|
| Source: I have a patent from around that time that basically
| covers anything in finance that's data driven from an XML
| document. For about a decade, that covered a fairly large chunk
| of finance. I never did anything about it as I disagreed in
| principle with the premise of such an absurdly broad patent. I
| agreed to it being patented solely for defensive reasons, i.e. it
| might prevent a competitor from egregiously attacking my
| employer with patents.
| renewiltord wrote:
| If you're sufficiently creative, certainly. Some of my friends
| patented something totally absurd: there's a transformation you
| can easily do in software and lots of software does it quite
| routinely. They did it twice. Patent issued.
| yarone wrote:
| A classic Joel on Software article about funny backwards
| compatibility built into Excel:
| https://www.joelonsoftware.com/2006/06/16/my-first-billg-rev...
| Macha wrote:
| I realised there is more time between now and that article,
| than there is between that article and the events described
| within.
| FpUser wrote:
| Looking at the patent abstract [0], it basically patents separation
| of information and structure. The latter can be used to present
| information in various ways.
|
| My take is that it is fucking obvious and I just simply do not
| believe that the concept did not have prior art. It just shows
| what a crooked business this whole modern patent system is.
|
| [0] - "A system and method for the separate manipulation of the
| architecture and content of a document, particularly for data
| representation and transformations. The system, for use by
| computer software developers, removes dependency on document
| encoding technology. A map of metacodes found in the document is
| produced and provided and stored separately from the document.
| The map indicates the location and addresses of metacodes in the
| document. The system allows of multiple views of the same
| content, the ability to work solely on structure and solely on
| content, storage efficiency of multiple versions and efficiency
| of operation."
| ClearDayDev wrote:
| I've not read the patent, but it's definitely inaccurate to say
| "Microsoft removed custom XML from Word." It's still possible to
| create custom XML parts programmatically, and I suspect it's
| quite commonly done. Also, I just checked, and Microsoft 365 has
| a custom XML mapping tool on the developer tab. So it would be
| interesting to know how Microsoft complied with the judgment and
| the subsequent history of the feature.
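
For context on "custom XML parts": a .docx file is a ZIP package, and custom XML parts conventionally sit under `customXml/` as `item1.xml`, `item2.xml`, and so on. A minimal sketch of listing them with the standard library. The archive built by `make_toy_package` is a hypothetical stand-in, not a valid Word document (it lacks content types and relationships), and the `urn:example:orders` payload is invented for illustration.

```python
import io
import re
import zipfile

# Conventional location of custom XML parts inside an Office package.
CUSTOM_PART = re.compile(r"customXml/item\d+\.xml")


def list_custom_xml_parts(package_bytes):
    """Map each custom XML part name in a package to its decoded contents."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as zf:
        return {
            name: zf.read(name).decode("utf-8")
            for name in zf.namelist()
            if CUSTOM_PART.fullmatch(name)
        }


def make_toy_package():
    """Build an in-memory ZIP that only mimics the layout of a .docx."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", "<document/>")
        zf.writestr("customXml/item1.xml", '<order xmlns="urn:example:orders" id="7"/>')
    return buf.getvalue()


print(list_custom_xml_parts(make_toy_package()))
```

Pointing `list_custom_xml_parts` at the bytes of a real .docx would show whether a given document carries any such parts.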
___________________________________________________________________
(page generated 2023-10-12 21:00 UTC)