[HN Gopher] The semantic web is now widely adopted
___________________________________________________________________
The semantic web is now widely adopted
Author : todsacerdoti
Score : 415 points
Date : 2024-08-21 05:22 UTC (17 hours ago)
(HTM) web link (csvbase.com)
(TXT) w3m dump (csvbase.com)
| tossandthrow wrote:
| In all honesty, LLMs are probably going to make all of this
| entirely redundant.
|
| As such, the semantic web was not a natural successor to what we
| had before, and not web 3.0.
| asymmetric wrote:
| Have you read the article? It addresses this point towards the
| end.
| tannhaeuser wrote:
| And it fails to address why SemWeb failed in its heyday:
| there's no business case for releasing open data of any kind
| "on the web" (unless you're Wikidata or otherwise financed
| via public money), the only consequences being that 1. you get
| fewer clicks and 2. you make it easier for your competitors
| (including Google) to aggregate your data. And that hasn't
| changed with LLMs, quite the opposite.
|
| To think a turd such as JSON-LD can save the "SemWeb" (which
| doesn't really exist), and even add CSV as yet another RDF
| format to appease "JSON scientists" lol seems beyond absurd.
| Also, Facebook's Open Graph annotations in HTML meta-links
| are/were probably the most widespread (trivial)
| implementation of SemWeb. SemWeb isn't terrible but is
| entirely driven by TBL's long-standing enthusiasm for edge-
| labelled graph-like databases (predating even his WWW efforts
| eg [1]), plus academia's need for topics to produce papers
| on. It's a good thing to let it go in the last decade and re-
| focus on other/classic logic apps such as Prolog and SAT
| solvers.
|
| [1]: https://en.wikipedia.org/wiki/ENQUIRE
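For concreteness: the Open Graph meta tags and JSON-LD blocks discussed in this subthread are just annotations embedded in the page markup, and pulling them out needs no GPU at all. A minimal sketch using only Python's stdlib (the page content below is made up for illustration):

```python
import json
from html.parser import HTMLParser

class MetadataExtractor(HTMLParser):
    """Collects JSON-LD blocks and Open Graph meta tags from HTML."""
    def __init__(self):
        super().__init__()
        self.in_jsonld = False
        self.buf = []         # raw text of the current JSON-LD script
        self.jsonld = []      # parsed JSON-LD objects
        self.opengraph = {}   # og:* property -> content

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("type") == "application/ld+json":
            self.in_jsonld = True
        elif tag == "meta" and (attrs.get("property") or "").startswith("og:"):
            self.opengraph[attrs["property"]] = attrs.get("content", "")

    def handle_data(self, data):
        if self.in_jsonld:
            self.buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self.in_jsonld:
            self.jsonld.append(json.loads("".join(self.buf)))
            self.buf = []
            self.in_jsonld = False

html_doc = """
<html><head>
<meta property="og:title" content="The semantic web is now widely adopted">
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Article",
 "headline": "The semantic web is now widely adopted"}
</script>
</head><body>...</body></html>
"""

parser = MetadataExtractor()
parser.feed(html_doc)
print(parser.opengraph["og:title"])
print(parser.jsonld[0]["@type"])
```

This is the whole trick of the metadata approach: a dumb, deterministic parser recovers the publisher's stated meaning without any model in the loop.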
| tossandthrow wrote:
| yes
| peterlk wrote:
| The article addresses this point with the following:
|
| > It would of course be possible to sic Chatty-Jeeps on the raw
| markup and have it extract all of this stuff automatically. But
| there are some good reasons why not.
|
| > The first is that large language models (LLMs) routinely get
| stuff wrong. If you want bots to get it right, provide the
| metadata to ensure that they do.
|
| > The second is that requiring an LLM to read the web is
| thoroughly disproportionate and exclusionary. Everyone parsing
| the web would need to be paying for pricy GPU time to parse out
| the meaning of the web. It would feel bizarre if "technological
| progress" meant that fat GPUs were required for computers to
| read web pages.
| tsimionescu wrote:
| The first point is moot, because human annotation would also
| have some amount of error, either through mistakes (interns
| being paid nothing to add it) or maliciously (SEO). Plus,
| human annotation would be multi-lingual, which leads to a
| host of other problems that LLMs don't have to the same
| extent.
|
| The second point is silly, because there is no reason for
| everyone to train their own LLMs on the raw web. You'd have a
| few companies or projects that handle the LLM training, and
| everyone else uses those LLMs.
|
| I'm not a big fan of LLMs, and not even a big believer in
| their future, but I still think they have a much better
| chance of being useful for these types of tasks than the
| semantic web. Semantic web is a dead idea, people should
| really allow it to rest.
| tossandthrow wrote:
| While both of these points are valid _today_, they are likely
| to be invalidated going forward - assume that whatever you can
| conceive of as technically possible will become technically
| possible.
|
| In 5 years the resource cost will likely be negligible and the
| accuracy high enough that you just trust it.
| null_investor wrote:
| It's HN, most people don't read the article and jump to
| whatever conclusion they already have, despite not being
| experts in the field.
| xoac wrote:
| He had it summarized by chatgpt
| tossandthrow wrote:
| As I already pointed out, none of the arguments the author
| brings up are really relevant. Resources and accuracy will
| not be a concern in 5 years.
|
| What makes you think that I am not an expert btw?
|
| It indeed seems like you believe that what's written on the
| internet is true. So if someone writes that LLMs are not a
| contender to the semantic web - then it might be true.
|
| Could it be that I merely challenge the author of the blog
| article and don't take his predictions for granted?
| mg wrote:
| The author gives two reasons why AI won't replace the need for
| metadata:
|
| 1: LLMs "routinely get stuff wrong"
|
| 2: "pricy GPU time"
|
| 1: I run a lot of tests on how well LLMs get categorization and
| data extraction right or wrong for my Product Chart
| (https://www.productchart.com) project. And they already get
| pretty hard stuff right 99% of the time. This will only improve.
|
| 2: Loading the frontpage of Reddit takes hundreds of HTTP
| requests and parses megabytes of text, images and JavaScript
| code. In the past, this would have been seen as an impossible
| amount of work just to show some links to articles. In the near
| future, nobody will see passing a text through an LLM as a
| noteworthy amount of compute anymore.
| monero-xmr wrote:
| LLMs have no soul, so I like content and curation from real
| people
| doe_eyes wrote:
| The main problem is that the incentive for well-intentioned
| people to add detailed and accurate metadata is much lower
| than the incentive for SEO dudes to abuse the system if the
| metadata is used for anything of consequence. There's a
| reason why search engines that trusted website metadata went
| extinct.
|
| That's the whole benefit of using LLMs for categorization:
| they work for you, not for the SEO guy... well, prompt
| injection tricks aside.
| monero-xmr wrote:
| There is value-add if you can prove whatever content you
| are producing is from an authentic human, because I dislike
| LLM produced garbage
| usrusr wrote:
| The point is that metadata lies. Intentionally, instead
| of just being coincidentally wrong. For example everybody
| who wants to spew LLM produced garbage in your face will
| go out of their way to attach metadata claiming the
| opposite. The value proposition of LLM categorization
| would be that the LLM looks at the same content as the
| eventual human (if, in fact, it does - which is a related
| but different problem)
| tsimionescu wrote:
| All the web metadata I consume is organic and responsibly
| farmed.
| amarant wrote:
| Huh, it's not often you hear a religious argument in a
| technical discussion. Interesting viewpoint!
| MrVandemar wrote:
| I don't see it as anything religious. I see the comment
| about something having an intrinsic, instinctive quality,
| which we can categorise as having "soul".
| amarant wrote:
| That's even more interesting! The only non-religious
| meaning of soul I've ever heard is a music genre, but
| then English is my second language. I tried googling it
| and found this meaning I wasn't aware of:
|
| emotional or intellectual energy or intensity, especially
| as revealed in a work of art or an artistic performance.
| "their interpretation lacked soul"
|
| Is this the definition used? I'm not sure how a JSON
| document is supposed to convey emotional or intellectual
| energy, especially since it's basically a collection of
| tags. Maybe I also lack soul?
|
| Or is there yet another definition I didn't find?
| pessimizer wrote:
| It's early 20th century (and later) black American
| dialect to say things "have soul" or "don't have soul."
| In the West, Black Americans are associated with a
| mystical connection to the Earth, deeper understandings,
| and suffering.
|
| So LLMs are not gritty and down and dirty, and don't get
| down. They're not the real stuff.
| amarant wrote:
| Mystical connection? Now you're back to religion.
|
| If you wanna be down you gotta keep it real, and
| mysticism is categorically not that.
| Eisenstein wrote:
| > intrinsic, instinctive quality,
|
| What are a few examples of things with an 'intrinsic,
| instinctive quality'?
| rapsey wrote:
| GPU compute price is dropping fast and will continue to do so.
| philjohn wrote:
| But is it dropping faster than the requirements of the next
| model that needs to be trained?
| tossandthrow wrote:
| Short answer is yes.
|
| Also, GPU pricing is hardly relevant. From now on we will
| see dedicated co-processors on the GPU to handle these
| things.
|
| They will keep up with the demand until we meet actual
| physical limits.
| dspillett wrote:
| The cost of GPU time isn't just the cost that you see (buying
| them initially, paying for service if they are not yours,
| paying for electricity if they are) but the cost to the
| environment. Data centre power draws are increasing
| significantly and the recent explosion in LLM model creation
| is part of that.
|
| Yes, things are getting better per unit (GPUs get more
| efficient, and AI-optimised chipsets are an order of magnitude
| more efficient than GPUs, etc.) but are they getting better
| per unit of compute faster than the number of compute units
| being used is increasing ATM?
| menzoic wrote:
| How does Product Chart use LLMs?
| mg wrote:
| We research all product data manually and then have AI cross-
| check the data and see how well it can replicate what the
| human has researched and whether it can find errors.
|
| Actually, building the AI agent for data research takes up
| most of my time these days.
| viraptor wrote:
| Have you seen https://superagent.sh/ ? It's an interesting
| one and not terrible in the test cases I tried. (Requires
| pretty specific descriptions for the fields though)
| throwme_123 wrote:
| For my part, I stopped reading at the gratuitous bashing of
| blockchain*.
|
| It reminded me of the angst and negativity of the original
| "Web3" people, already bashing everything that was not to
| their taste back then.
|
| * The crypto ecosystem is shady, I know, but the tech is great
| ashkankiani wrote:
| As someone who stopped getting involved in blockchain "tech"
| 12 years ago because of the prevalence of scams and bad
| actors and lack of interesting tech beyond the merkle tree,
| what's great about it?
|
| FWIW I am genuinely asking. I don't know anything about the
| current tech. There's something about "zero knowledge proofs"
| but I don't understand how much of that is used in practice
| for real blockchain things vs just being research.
|
| As far as I know, the throughput of blockchain transactions
| at scale is miserably slow and expensive and their usual
| solution is some kind of side channel that skips the full
| validation.
|
| Distributed computation on the blockchain isn't really used
| for anything other than converting between currencies and
| minting new ones, AFAIK.
|
| What is the great tech that we got from the blockchain
| revolution?
| throwme_123 wrote:
| Scams and bad actors haven't changed sadly.
|
| But zk-based, genuinely decentralized consensus now does 400
| tps, which is extraordinary when you think about it and about
| all the safety and security properties it brings.
|
| And that's with proof-of-stake, of course, with decentralized
| sequencers for L2.
|
| But I get that people here prefer centralized databases,
| managed by admins and censorship-empowering platforms. Your
| bank stack looks like it's designed for fraud too. Manual
| operations and months-long audits with errors, but that is
| by design. Thanks everyone for all the downvotes.
| dspillett wrote:
| _> But I get that people here prefer_
|
| For many of us it isn't that we think the status quo is
| the RightWay(tm) - we just aren't convinced that crypto
| as it currently is presents a better answer. It fixes
| some problems, but adds a number of its own that many of
| us don't think are currently worth the compromise for our
| needs.
|
| As you said yourself:
|
| _> The crypto ecosystem is shady, I know, but the tech
| is great_
|
| That _but_ is not enough for me to want to take part. Yes,
| the tech is useful, heck I use it for other things
| (blockchains existed as auditing mechanisms long before
| cryptocurrencies), but I'm not going to encourage
| others to take part in an ecosystem that is as shady as
| crypto is.
|
| _> Thanks everyone for all the downvotes._
|
| I don't think you are getting downvoted for supporting
| crypto; more likely because you basically said "you know
| that article you are all discussing? Well, I think you'll
| want to know that I didn't bother to read it", then
| without a hint of irony made assertions of "angst and
| negativity".
|
| And if I might make a mental health suggestion: caring
| about online downvotes is seldom part of a path to
| happiness :)
| nottorp wrote:
| The main problem with blockchain is identical to the one
| with LLMs. When snake oil salesmen try to apply the same
| solution to every problem, you stop wasting your time
| with those salesmen.
|
| Both can be useful now and then, but the legit uses are
| lost in the noise.
|
| And for blockchain... it was launched with the promise of
| decentralized currency. But we had decentralized currency
| in the physical world until the past few hundred years, when
| we abandoned it in favor of centralized currency for some
| reason. I don't know, reliability perhaps?
| dspillett wrote:
| _> And for blockchain... it was launched with the promise
| of decentralized currency._
|
| _Cryptocurrencies_ were launched with that promise.
|
| They are but one use [1] of blockchains / merkle trees,
| which existed long before them [2].
|
| ----
|
| [1] https://en.wikipedia.org/wiki/Merkle_tree#Uses
|
| [2] 1982 for blockchains/trees as part of a distributed
| protocol, as people generally mean when they use the words
| now [3]; hash chains/trees themselves go back at least as
| far as 1979, when Ralph Merkle patented the idea
|
| [3] https://en.wikipedia.org/wiki/Blockchain#History
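As an aside, the hash-chain mechanism behind all of this fits in a few lines. A sketch in Python (illustrative only, not any particular protocol; block contents are made up):

```python
import hashlib

def hash_chain(blocks):
    """Link each block to its predecessor: digest = SHA-256(prev_digest + data)."""
    prev = b"\x00" * 32  # genesis placeholder
    chain = []
    for data in blocks:
        digest = hashlib.sha256(prev + data).digest()
        chain.append((data, digest))
        prev = digest
    return chain

ledger = hash_chain([b"alice->bob:5", b"bob->carol:2"])
tampered = hash_chain([b"alice->bob:9", b"bob->carol:2"])

# Changing an early block changes every later digest, which is
# what makes the chain tamper-evident:
print(ledger[1][1] != tampered[1][1])  # True
```

The tamper-evidence property is the whole design: you can't rewrite history without recomputing, and being caught by, every digest downstream.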
| nottorp wrote:
| But if you put it that way neural networks were defined
| in the 70s too :)
| dspillett wrote:
| Very much so. Is there a problem with that? To what time
| period would you attribute their creation?
|
| In fact it is only the 70s if you mean networks that
| learn via backprop & similar methods. Some theoretical
| work on artificial neurons was done in the 40s.
| nottorp wrote:
| The point is whatever you said in defense of
| blockchain/crypto applies or does not apply to neural
| networks/LLMs in equal measure.
|
| I for one fail to see the difference between these two
| kinds of snake oil.
|
| > Some theoretical work on artificial neurons was done in
| the 40s.
|
| "The perceptron was invented in 1943 by Warren McCulloch
| and Walter Pitts. The first hardware implementation was
| Mark I Perceptron machine built in 1957"
| everforward wrote:
| Gold is and has been a decentralized currency for a very
| long time. It's mostly just very inconvenient to
| transport.
|
| > Then we abandoned it in favor of centralized currency
| for some reason. I don't know, reliability perhaps?
|
| The global economy practically requires a centralized
| currency, because the value of your currency vs other
| countries becomes extremely important for trading in a
| global economy (importers want high value currency,
| exporters want low).
|
| It's also a requirement to do financial meddling like
| what the US has been doing with interest rates to curb
| inflation. None of that is possible on the blockchain
| without a central authority.
| zaik wrote:
| > Reddit takes hundreds of http requests, parses megabytes of
| text, image and JavaScript code [...] to show some links to
| articles
|
| Yes, and I hate it. I closed Reddit many times because the wait
| time wasn't worth it.
| rfl890 wrote:
| https://old.reddit.com ?
| jeltz wrote:
| Gets buggier for every year.
| dspillett wrote:
| That UI definitely seems to be getting less reliable these
| days. A number of times I've found it refusing to work or
| redirecting me to the primary UI arbitrarily, and a few months
| ago there was a spell when you couldn't log in via that UI
| (though logging in on main and going back worked for me).
|
| These instances seem to be temporary bugs, but they show
| that it isn't getting any love (why would it? they only
| maintain it at all under sufferance), so at some point it'll
| no doubt be cut off as a cost-cutting exercise during a
| time when ad revenue is low.
| atoav wrote:
| Let's hope you never write articles about court cases then:
| https://www.heise.de/en/news/Copilot-turns-a-court-reporter-...
|
| The alleged low error rate of 1% can still ruin your
| day/life/company if it hits the wrong person, concerns the
| wrong problem, etc. And that risk is not adequately addressed
| by hand-waving and pointing people to low error rates. In fact,
| if anything, such claims would make me less confident in your
| product.
|
| 1% error is still a lot if they are the wrong kind of error in
| the wrong kind of situation. Especially if in that 1% of cases
| the system is not just _slightly_ wrong, but catastrophically
| mind-bogglingly wrong.
| kqr wrote:
| This is the thing with errors and automation. A 1 % error
| rate in a human process is basically fine. A 1 % error rate
| in an automated process is hundreds of thousands of errors
| per day.
|
| (See also why automated face recognition in public
| surveillance cameras might be a bad idea.)
| atoav wrote:
| Exactly. If your system monitors a place like a halfway
| decent railway station, half a million people per day is a
| number you could expect. Even with an amazingly low error
| rate of 1% that would result in 5000 wrong signals a day.
| If we assume that the people are uniformly spread out
| throughout a 24-hour cycle, that means a false alarm roughly
| _every 17 seconds_.
|
| In reality most of the people are there during the day (a
| false alarm roughly every 9 seconds) and the error
| percentages are nowhere near 1%.
|
| If you do the math to figure out the staff needed to react
| to those false alarms in any meaningful way you have to
| come to the conclusion that just putting people there
| instead of cameras would be a safer way to reach the goal.
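The base-rate arithmetic in this subthread is easy to check, taking the comment's assumed figures (500,000 people per day, 1% error rate):

```python
# Back-of-envelope check of the surveillance false-alarm numbers
# above (assumed figures: 500,000 people/day, 1% error rate).
people_per_day = 500_000
error_rate = 0.01

false_alarms = people_per_day * error_rate          # wrong signals per day
seconds_per_alarm_24h = 24 * 3600 / false_alarms    # uniform over 24h
seconds_per_alarm_12h = 12 * 3600 / false_alarms    # crowd concentrated in 12h

print(false_alarms)                      # 5000.0
print(round(seconds_per_alarm_24h, 1))   # 17.3
print(round(seconds_per_alarm_12h, 1))   # 8.6
```

So even under generous assumptions the alarm stream is far too dense for any plausible staffing level to vet.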
| Terr_ wrote:
| Another part is that artificial systems can screw up in
| fundamentally different ways and modes compared to a human
| baseline, even if the raw count of errors is lower.
|
| A human might fail to recognize another person in a photo,
| but at least they won't insist the person is definitely a
| cartoon character, or blindly follow "I am John Doe"
| written on someone's cheek in pen.
| Retr0id wrote:
| Human error rates are also not a constant.
|
| If you're about to publish a career-ending allegation,
| you're going to spend some extra time fact-checking it.
| atoav wrote:
| Can you point to where that claim was made? I can't find
| it. The parent post assumes 1% for the sake of argument
| to underline that the impact of the 1% error depends on
| the number to which the 1% are applied -- automation
| reduces the effort and increases the number.
|
| Hypothetical example: cops shoot the wrong person in x% of
| cases. If we equipped all surveillance cameras with guns
| that _also_ shoot the wrong person in x% of cases, the
| world would be a nightmare pandemonium, simply because
| there are more cameras and they are running 24/7.
|
| Mind that the precise value of x, and whether it is constant
| or not, does not impact the argument at all.
| yen223 wrote:
| Isn't this just saying "humans are slow" in a different
| way?
| 8organicbits wrote:
| Is product search a high risk activity? LLMs could be the
| right tool for building a product search database while also
| being libelously terrible for news reporting.
| intended wrote:
| Only slightly tongue in cheek, but if your measure of success
| is Reddit, perhaps a better example may serve your argument?
| ramon156 wrote:
| The argument for "LLMs get it right 99% of the time" is also
| very generalized and doesn't take into account smaller
| websites
| klabb3 wrote:
| It's baffling how defeatist and ignorant engineering
| culture has become when someone else's non-deterministic,
| proprietary and non-debuggable code, running on someone
| else's machine, that uses an enormous amount of currently
| VC-subsidized resources, is touted as a general solution to
| a data annotation problem.
|
| Back in my day people used to bash on JavaScript. Today one
| can only dream of a world where JS is the worst of our
| engineering problems.
| 8organicbits wrote:
| Oh nice, Product Chart looks like a great fit for what LLMs can
| actually do. I'm generally pretty skeptical about LLMs getting
| used, but looking at the smart phone tool: this is the sort of
| product search missing from online stores.
|
| Critically, if the LLM gets something wrong, a user can notice
| and flag it, then someone can manually fix it. That's 100x less
| work than manually curating the product info (assuming 1% error
| rate).
| esjeon wrote:
| > I make a lot of tests on how well LLMs get categorization and
| data extraction right or wrong for my Product Chart
| (https://www.productchart.com) project.
|
| In fact, what you're doing there is building a local semantic
| database by automatically mining metadata using an LLM. The
| search part is entirely based on the metadata you gathered,
| so the GP's point 1 is still perfectly valid.
|
| > In the near future, nobody will see passing a text through an
| LLM as a noteworthy amount of compute anymore.
|
| Even with all that technological power, LLMs won't replace most
| simple searching over an index, as they are bad at adapting to
| ever-changing datasets. They can only make it easier.
| Devasta wrote:
| > Before JSON-LD there was a nest of other, more XMLy, standards
| emitted by the various web steering groups. These actually have
| very, very deep support in many places (for example in library
| and archival systems) but on the open web they are not a goer.
|
| If archival systems and libraries are using XML, wouldn't it be
| preferable to follow their lead and whatever standards they are
| using? Since they are the ones who are going to use this stuff
| most, most likely.
|
| If nothing else, you can add a processing instruction to the
| document they use to convert it to HTML.
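The processing-instruction route is a one-liner in the XML prolog. A sketch, assuming a hypothetical stylesheet named `record-to-html.xsl` sits next to the document:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="record-to-html.xsl"?>
<record>
  <title>Example bibliographic record</title>
</record>
```

A browser that honors the `xml-stylesheet` instruction transforms the record to HTML client-side, so the canonical XML stays the single source of truth.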
| whartung wrote:
| The format really isn't much of an issue. From an information
| point of view, the content of the different formats are
| identical, and translation among them is straightforward.
|
| Promoting JSON-LD potentially makes it more palatable to the
| modern web creators, perhaps increasing adoption. The bots have
| already adapted.
| cess11 wrote:
| You're aware of straightforward translations to and from
| E-ARK SIP and CSIP? Between what formats?
|
| As far as I can tell archivists don't care about "modern web
| creators", and they likely shouldn't, since archiving is a
| long term project. I know I don't, and I'm only building
| software for digital archiving.
| tannhaeuser wrote:
| If by that the author means JSON-LD has replaced MARCXML,
| BibTeX records, and other bibliographic information systems,
| then that's very much not the case.
| AlecSchueler wrote:
| They recognise that in the quoted paragraph. The JSON-LD
| thing was only about the open web:
|
| > [MARCXML, BibTeX etc] actually have very, very deep support
| in many places (for example in library and archival systems)
| but on the open web they are not a goer.
| _heimdall wrote:
| > If nothing else, you can add a processing instruction to the
| document they use to convert it to HTML.
|
| Like XSLT?
| npunt wrote:
| The argument about LLMs is wrong, not because of reasons stated
| but because semantic meaning shouldn't solely be defined by the
| publisher.
|
| The real question is whether the average publisher is better than
| an LLM at accurately classifying their content. My guess is, when
| it comes to categorization and summarization, an LLM is going to
| handily win. An easy test is: are publishers experts on topics
| they talk about? The truth of the internet is no, they're not
| usually.
|
| The entire world of SEO hacks, blogspam, etc. exists because
| publishers were the only source of truth that the search engine
| used to determine meaning and quality, which has created all
| sorts of misaligned incentives that we've lived with for the
| past 25 years. At best there are some things publishers can
| provide as guidance for an LLM, social cards, etc., but they
| can't be the only truth about the content.
|
| Perhaps we will only really reach the promise of 'the semantic
| web' when we've adequately overcome the principal-agent problem
| of who gets to define the meaning of things on the web. My sense
| is that requires classifiers that are controlled by users.
| atoav wrote:
| Yet LLMs fail to make these simple but sometimes meaningful
| distinctions. See for example this case in which a court
| reporter is described by Copilot as _being_ all the things he
| reported about: a child molester, a psychiatric escapee, a
| cheater of widows. Presumably because his name appeared in a
| lot of articles about those things, and LLMs simply associated
| his name with the crimes without making the connection that he
| could in fact be simply the messenger and not the criminal. If
| LLMs had the semantic understanding that the name at the top or
| bottom of a news article is the author's, they would not have
| made that mistake.
|
| https://www.heise.de/en/news/Copilot-turns-a-court-reporter-...
| npunt wrote:
| Absolutely! Today's LLMs can sometimes(/often?) enormously
| suck and should not be relied upon for critical information.
| There's a long way to go to make them better, and I'm happy
| that a lot of people are working on that. Finding meaning in
| a sea of information is a highly imperfect enterprise
| regardless of the tech we use.
|
| My point though was that the core problem we should be trying
| to solve is overcoming the fundamental misalignment of
| incentives between publisher and reader, not whether we can
| put a better schema together that we hope people adopt
| intelligently & non-adversarially, because we know that won't
| happen in practice. I liked what the author wrote but they
| also didn't really consider this perspective and as such I
| think they haven't hit upon a fundamental understanding of
| the problem.
| mandmandam wrote:
| Humans do something very similar, fwiw. It's called
| spontaneous trait association: https://www.sciencedirect.com/
| science/article/abs/pii/S00221...
| thuuuomas wrote:
| > fwiw
|
| What do you think this sort of observation is worth?
| mandmandam wrote:
| Really depends on what sort of person you are I guess.
|
| Some people appreciate being shown fascinating aspects of
| human nature. Some people don't, and I wonder why they're
| on a forum dedicated to curiosity and discussion. And
| then, some people get weirdly aggressive if they're shown
| something that doesn't quite fit in their worldview. This
| topic in particular seems to draw those out, and it's
| fascinating to me.
|
| Myself, I thought it was great to learn about spontaneous
| trait association, because it explains so much weird
| human behavior. The fact that LLMs do something so
| similar is, at the very least, an interesting parallel.
| pickledoyster wrote:
| >My guess is, when it comes to categorization and
| summarization, an LLM is going to handily win. An easy test is:
| are publishers experts on topics they talk about? The truth of
| the internet is no, they're not usually.
|
| LLMs are not experts either. Furthermore, from what I gather,
| LLMs are trained on:
|
| >The entire world of SEO hacks, blogspam, etc
| npunt wrote:
| This is an excellent rebuttal. I think it is an issue that
| can be overcome but I appreciate the irony of what you point
| out :)
| peoplefromibiza wrote:
| > because semantic meaning shouldn't solely be defined by the
| publisher
|
| LLMs are not that great at understanding semantics though
| hmottestad wrote:
| Metadata in PDFs is also typically based on semantic web
| standards.
|
| https://www.meridiandiscovery.com/articles/pdf-forensic-anal...
|
| Instead of using JSON-LD it uses RDF written as XML. Still uses
| the same concept of common vocabularies, but instead of
| schema.org it uses a collection of various vocabularies including
| Dublin Core.
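For a flavor of what that looks like, a minimal RDF/XML sketch with Dublin Core fields (illustrative values; real XMP packets wrap this in an `x:xmpmeta` envelope and often use `rdf:Alt`/`rdf:Seq` containers for titles and authors):

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="">
    <dc:title>Example Document</dc:title>
    <dc:creator>Jane Doe</dc:creator>
    <dc:date>2024-08-21</dc:date>
  </rdf:Description>
</rdf:RDF>
```

Same triples, same vocabulary idea as JSON-LD plus schema.org, just serialized as XML.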
| kkfx wrote:
| Ehm... The semantic web as an idea was/is a totally different
| thing: the idea is the old library of Babel / Bibliotheca
| Universalis of Conrad Gessner (~1545) [1], the ability to
| "narrow"|"select"|"find" just "the small bit of information I
| want". A book is excellent for developing and sharing a
| specific topic, and it has some indexes to help find specific
| information directly, but that's not enough: a library of books
| can't be traversed quickly enough to find a very specific bit
| of information, like when John Smith was born and where.
|
| The semantic web's original idea was the interconnection of
| every bit of information, in a format a machine can traverse on
| behalf of a human, so the human can find any specific bit ever
| written with little to no effort, without having to humanly
| scan pages of moderately related stuff.
|
| We never achieved that goal. Some have tried to be more on the
| machine side, like WikiData; some have pushed to the extreme
| the library-science SGML idea of universal classification, not
| ending with JSON, but all are failures because they are neither
| universal nor easy to use to "select and assemble specific bits
| of information" for human queries.
|
| LLMs are a failed attempt to achieve the same result another
| way; their hallucinations and the slow formation of a model
| prove their substantial failure. They SEEM to succeed to a
| distracted eye perceiving just the wow effect, but in practice
| they fail.
|
| Aside from that, the issue with ALL the attempts on the
| metadata side of the spectrum so far is simple: in theory we
| could all be good citizens and carefully label everything, even
| classify every single page following Dublin Core et al.; in
| practice very few do so. All the rest don't care, either
| ignoring classification entirely or implementing it badly, and
| the result is like an archive with some missing documents:
| you'll always have holes in the information, breaking the
| credibility/practical usefulness of the tool.
|
| Essentially that's why we keep using search engines every day,
| with classic keyword-based matches and some extras around them.
| Words are the common denominator of textual information, and
| the larger slice of our information is textual.
|
| [1] https://en.wikipedia.org/wiki/Bibliotheca_universalis
| DrScientist wrote:
| The problem I find with semantic search is that first I have to
| read and understand somebody else's definitions before I can
| search within the confines of the ontology.
|
| The problem I have with ML-guided search is that the ML takes
| the web-average view of what I mean, which I sometimes need to
| understand and then try to work around if it's wrong. It can
| become impossible to find stuff off the beaten track.
|
| The nice thing about keyword and exact-text searching with fast
| iteration is that it's _my_ mental model that is driving the
| results. However, if it's an area I don't know much about,
| there is a chicken-and-egg problem of knowing which words to
| use.
| kkfx wrote:
| Personally I think the limitation of keyword search it's not
| in the model per se but in the human langue: we have
| synonymous witch are relatively easy to handle but we also
| have gazillion of different way to express the very same
| concept that simply can't be squeezed in some "nearby keyword
| list".
|
| Personally I take notes on the news, importing articles into
| org-mode, so I have a "trail" of the news I consider relevant
| in a timeline. Sometimes I remember I've noted something but
| can't find it immediately in my own notes with local full-text
| search, on a tiny corpus compared to the entire web, simply
| because one title expresses something with very different
| words than another, and at the moment of searching I don't
| think of that possible phrasing.
|
| For casual searches we do not notice, but for some specific
| searches this emerges very clearly as a big limitation. So far
| LLMs do not solve it; they are even LESS able to extract the
| relevant information, and "semantic" classification does not
| seem to be effective either. This is even easier to spot if
| you use Zotero and really try to use tags to look for
| something: in the end you'll resort to mere keyword search for
| anything.
|
| That's why, IMVHO, it's a so-far-unsolved problem.
| DrScientist wrote:
| For me the search problem isn't so much about making sure I
| get back all potentially relevant hits (more than I could ever
| read), it's how I get the specific ones I want...
|
| So effective search is more about _excluding_ than including.
|
| Exact phrases or particular keywords are great tools here.
|
| Note there is also a difference between finding an answer
| to a particular question and finding web pages around a
| particular topic. Perhaps LLM's are more useful for the
| former - where there is a need to both map the question to
| an embedding, and summarize the answer - but for the latter
| I'm not interested in a summary/quick answer, I'm
| interested in the source material.
|
| Sometimes you can combine the two - LLM's for a quick route
| into the common jargon, which can then be used as keywords.
| gostsamo wrote:
| So much jumping to defend llms as the future. I'd like to
| point out that llms hallucinate, can be injected, and often
| lack context which well-structured metadata can provide. At
| the least, I don't want an llm to hallucinate the author's
| picture and bio based on hints in the article, thank you very
| much.
|
| I don't think that one is necessarily better than the other,
| but imagining that llms are a silver bullet, when another
| trending story on the front page is about prompt injection
| used against the Slack AI bots, sounds a bit overoptimistic.
| IshKebab wrote:
| Sure, but do hallucinations matter that much just for
| categorisation? Hardly the end of the world if they make up a
| published date occasionally.
|
| And prompt injection is irrelevant because the alternative
| we're considering is letting publishers directly choose the
| metadata.
| gostsamo wrote:
| Prompt injection is highly relevant because you end up
| achieving the same as the publisher choosing the metadata,
| but on a much higher price for the user. Price which needs to
| be paid by each user separately instead of using one already
| generated.
|
| LLMs are much better when the user adapts the categories to
| their needs or crunches the text to pull only the info
| relevant to them. Communicating those categories and the
| cutoff criteria would be an issue in some contexts, but still
| better if communication is not the goal. Domain knowledge is
| also important, because nitch topics are not represented in
| the llm datasets and their abilities fail in such scenarios.
|
| As I said above, one is not necessarily better than the other
| and it depends on the use cases.
| IshKebab wrote:
| > Prompt injection is highly relevant because you end up
| achieving the same as the publisher choosing the metadata,
| but on a much higher price for the user.
|
| How does price affect the relevance of prompt injection?
| That doesn't make sense.
|
| > nitch
|
| Niche. Pronounced neesh.
| gostsamo wrote:
| My question is: how does price not matter? If you are given
| the choice to pay either a dollar or a million dollars for the
| same good from an untrustworthy merchant, why would you pay
| the million? And the difference between parsing some JSON and
| sending a few megabytes of a webpage to chatgpt is of that
| order, if not bigger. For a dishonest SEO engineer it does not
| matter whether they post boastful metadata or a prompt
| convincing chatgpt of the same. The difference is for the
| user.
|
| I don't mind the delusions of most people, but the idea that
| llms will deal with spam if you throw a million times more
| electricity at it is what keeps the planet burning.
| IshKebab wrote:
| Price matters, but you said prompt injection is relevant
| _because of price_. Maybe a typo...
| tsimionescu wrote:
| If even the semantic web people are declaring victory based on a
| post title and a picture for better integration with Facebook,
| then it's clear that Semantic Web as it was envisioned is fully
| 100% dead and buried.
|
| The concept of OWL and the other standards was to annotate the
| content of pages, that's where the real values lie. Each
| paragraph the author wrote should have had some metadata about
| its topic. At the very least, the article metadata was supposed
| to have included information about the categories of information
| included in the article.
|
| Having a bit of info on the author, title (redundant, as HTML
| already has a tag for that), picture, and publication date is
| almost completely irrelevant for the kinds of things Web 3.0 was
| supposed to be.
| lynx23 wrote:
| I had pretty much the same reaction while reading the article.
| "BlogPosting" isn't particularly informative. The rest of the
| metadata looked like it could/should be put in <meta> tags,
| done.
|
| A very bad example if the intention was to demonstrate how cool
| and useful semweb is :-)
| oneeyedpigeon wrote:
| The schema.org data is much richer than meta tags, though.
| Using the latter, an author is just a string of text
| containing who-knows-what. The former lets you specify a
| name, email address, and url. And that's just for the Person
| type--you can specify an Organization too.
| tsimionescu wrote:
| That's still just tangential metadata. The point of a
| semantic web would be to annotate the semantic content of
| text. The vision was always that you can run a query like,
| say, "physics:particles: proton-mass", over the entire web,
| and it would retrieve parts of web pages that talk about
| the proton mass.
| rakoo wrote:
| Which was already possible with RDF. It is hard to see
| JSON-LD as anything other than "RDF but in JSON because we
| don't like XML".
| jll29 wrote:
| The blog post does not address why the Semantic Web failed:
|
| 1. Trust: How should one know that any available data marked
| up according to Semantic Web principles can be trusted? This
| is an
| even more pressing question when the data is free. Sir Berners-
| Lee (AKA "TimBL") designed the Semantic Web in a way that makes
| "trust" a component, when in truth it is an emergent relation
| between a well-designed system and its users (my own
| definition).
|
| 2. Lack of Incentives: There is no way to get paid for
| uploading content that is financially very valuable. I know
| many financial companies that would like to offer their data in
| a "Semantic Web" form, but they cannot, because they would not
| get compensated, and their existence depends on selling that
| data; some even use Semantic Web standards for internal-only
| sharing.
|
| 3. A lot of SW stuff is either boilerplate or re-discovered
| formal logic from the 1970s. I read lots of papers that propose
| some "ontology" but no application that needs it.
| oneeyedpigeon wrote:
| > title (redundant, as HTML already has a tag for that)
|
| Note that `title` isn't one of the properties that BlogPosting
| supports. It supports `headline`, which may well be different
| from the `<title/>`. It's probably analogous to the page's
| `<h1/>`, but more reliable.
| jerf wrote:
| Yeah, this is hiking the original Semantic Web goal post over
| the horizon, across the ocean, up a mountain, and cutting it
| down to a little stump downhill in front of the kicker compared
| to the original claims. "It's going to change the world!
| Everything will be contained in RDF files that anyone can
| trivially embed and anyone can run queries against the
| Knowledge Graph to determine anything they want!"
|
| "We've achieved victory! After over 25 years, if you want to
| know who wrote a blog post, you can get it from a few sites
| this way!"
|
| I'd call it damning with faint success, except it really isn't
| even success. Relative to the promises of "Semantic Web" it's
| simply a failure. And it's not like Semantic Web was
| overpromised a bit, but there were good ideas there and the
| reality is perhaps more prosaic but also useful. No, it's just
| useless. It failed, and LLMs will be the complete death of it.
|
| The "Semantic Web" is not the idea that the web contains
| "semantics" and someday we'll have access to them. That the web
| has information on it is not the solution statement, it's the
| _problem_ statement. The semantic web is the idea that all this
| information on the web will be organized, by the owners of the
| information, voluntarily, and correctly, into a big cross-site
| Knowledge Graph that can be queried by anybody. To the point
| that visiting Wikipedia would, behind the scenes, not fetch a
| big chunk of formatted text but a download of "facts" embedded
| in RDF tuples, with the screen you read as a human a rendered
| result of that, where Wikipedia doesn't just use self-hosted
| data but could grab "the Knowledge Graph" and directly embed
| other RDF information from the US government or companies or
| universities. Compare this dream to reality and you can see it
| doesn't even resemble reality.
|
| Nobody was sitting around twenty years ago going "oh, wow, if
| we really work at this for 20 years some people might annotate
| their web blogs with their author and people might be able to
| write bespoke code to query it, sometimes, if we achieve this
| it will have all been worth it". The idea is precisely that
| such an act would be so mundane as to not be something you
| would think of calling out, just as I don't wax poetic about
| the <b> tag in HTML being something that changes the world
| every day. That it would not be something "possible" but that
| it would be something your browser is automatically doing
| behind the scenes, along with the other vast amount of RDF-
| driven stuff it is constantly doing for you all the time. The
| very fact that someone thinks something so trivial is worth
| calling out is proof that the idea has utterly failed.
| tsimionescu wrote:
| Beautifully said.
|
| I'll also add that I wouldn't even call what he's showing
| "semantic web", even in this limited form. I would bet that
| most of the people who add that metadata to their pages view
| it instead as "implementing the nice sharing-link API". The
| fact that Facebook, Twitter and others decided to converge on
| JSON-LD with a schema.org schema as the API is mostly an
| accident of history, rather than someone mining the Knowledge
| Graph for useful info.
| trainyperson wrote:
| Are there any tools that employ LLMs to _fill out_ the Semantic
| Web data? I can see that being a high-impact use case: people
| don't generally like manually filling out all the fields in a
| schema (it is indeed "a bother"), but an LLM could fill it out
| for you - and then you could tweak for correctness /
| editorializing. Voila, bother reduced!
|
| This would also address the two reasons why the author thinks AI
| is not suited to this task:
|
| 1. human stays in the loop by (ideally) checking the JSON-LD
| before publishing; so fewer hallucination errors
|
| 2. LLM compute is limited to one run per published piece of
| content, and it's done by the publisher. The bots can continue
| to be low-GPU crawlers just as they are now, since they can
| traverse the neat and tidy JSON-LD.
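| A minimal sketch of that human-in-the-loop step (the field
| list, helper name, and draft are all hypothetical): flag what
| the LLM left out before the author publishes:

```python
import json

# Hypothetical publish-time check on LLM-generated JSON-LD:
# the human reviews whatever the model missed or invented.
REQUIRED = {"@context", "@type", "headline", "author", "datePublished"}

def missing_fields(jsonld_text: str) -> set:
    """Return the BlogPosting fields the generated JSON-LD lacks."""
    data = json.loads(jsonld_text)
    return REQUIRED - data.keys()

draft = '{"@context": "https://schema.org", "@type": "BlogPosting", "headline": "Hello"}'
print(sorted(missing_fields(draft)))  # → ['author', 'datePublished']
```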
|
| ------------
|
| The author makes a good case for The Semantic Web and I'll be
| keeping it in mind for the next time I publish something, and in
| general this will add some nice color to how I think about the
| web.
| safety1st wrote:
| Bringing an LLM into the picture is just silly. There's zero
| need.
|
| The author (and much of HN?) seems to be unaware that it's not
| just thousands of websites using JSON-LD, it's millions.
|
| For example: install WordPress, install an SEO plugin like
| Yoast, and boom you're done. Basic JSON-LD will be generated
| expressing semantic information about all your blog posts,
| videos, etc. It only takes a few lines of code to extend what
| shows up by default, and other CMSes support this too.
|
| SEOs know all about this topic because Google looks for JSON-LD
| in your document and it makes a significant difference to how
| your site is presented in search results as well as all those
| other fancy UI modules that show up on Google.
|
| Anyone who wants to understand how this is working massively,
| at scale, across millions of websites today, implemented
| consciously by thousands of businesses, should start here:
|
| https://developers.google.com/search/docs/appearance/structu...
|
| https://search.google.com/test/rich-results
|
| Is this the "Semantic Web" that was dreamed of in yesteryear?
| Well it hasn't gone as far and as fast as the academics hoped,
| but does anything?
|
| The rudimentary semantic expression is already out there on the
| Web, deployed at scale today. Someone creative with market pull
| could easily expand on this e.g. maybe someday a competitor to
| Google or another Big Tech expands the set of semantic
| information a bit if it's relevant to their business scenarios.
|
| It's all happening, it's just happening in the way that
| commercial markets make things happen.
| Spivak wrote:
| I guess where do you go from basic info that can be machine
| generated, to rich information that's worth consuming for
| things other than link previews and specific Google Search
| integrations?
| cpdomina wrote:
| The Semantic Web is now revived in its new marketing
| incarnation, called Knowledge Graphs. There's actually a lot
| of work on building KGs with LLMs, especially in the RAG
| space, e.g. Microsoft's GraphRAG and llama_index's
| KnowledgeGraphIndex.
| nox101 wrote:
| No ... because the incentives to lie in metadata are too high
| swiftcoder wrote:
| As much as I like the ideas behind the semantic web, JSON-LD
| feels like the least friendly of all semantic markup options
| (compared to something like, say, microformats)
| MrVandemar wrote:
| Microformats feel like they're ugly retrofitted kludges, where
| it would have been way more elegant if in among all the crazy
| helter-skelter competing development of HTML, someone had
| thought to invent a <person> tag, maybe an <organisation>
| tag. That
| would have solved a few problems that <blink> certainly didn't.
| fabianholzer wrote:
| They certainly are retrofitted, but the existing semantic
| tags are largely abandoned for div soups that are beaten into
| shape and submission by lavish amounts of JS and a few
| sprinkles of CSS (and the latter often as CSS-in-JS). For
| microformats there is at least a little ecosystem already,
| and the vendor-driven committees don't need to be involved.
| swiftcoder wrote:
| I mean, is anything actually stopping one from adding
| something like those tags today? Web components use custom
| tags all the time
| giantrobot wrote:
| I think the main issue with microformats is most CMSes don't
| really have a good way of adding them. You need a very capable
| rich editor to add semantic data inline or edit the output HTML
| by hand. Simple markup like WikiText and Markdown don't support
| microformat annotation.
|
| JSON-LD in a page's header is much easier for a CMS to present
| to the page author for editing. It can be a form in the editing
| UI. Wordpress et al have SEO plugins that make editing the
| JSON-LD data pretty straightforward.
| swiftcoder wrote:
| That's a good point. I adopted microformats in a static site
| generator, with a handful of custom shortcodes. It would be
| much harder to adopt in a WYSIWYG context
| renegat0x0 wrote:
| I think that if you want your page to be well discoverable,
| well advertised, and well positioned in search engines and
| social media, you have to support standards like the Open
| Graph protocol or JSON-LD.
|
| Be nice to bots. This is advertisement, after all.
|
| Support standards even if Google does not. Other bots might
| not be as sophisticated.
|
| For me, yes, it is worth the bother
| jillesvangurp wrote:
| Did json-ld get a lot of traction for link previews? I haven't
| really encountered it much.
|
| I actually implemented a simple link preview system a while ago.
| It uses opengraph and twitter cards meta data that is commonly
| added to web pages for SEO. That works pretty well.
|
| Ironically, I did use chat gpt for helping me implement this
| stuff. It did a pretty good job too. It suggested some libraries
| I could use and then added some logic to extract titles,
| descriptions, icons, images, etc. with some fallbacks between
| various fields people use for those things. It did not suggest
| adding logic for JSON-LD.
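| A rough sketch of that kind of extraction (not the commenter's
| actual implementation; the sample page is made up), pulling
| og:* properties with Python's stdlib parser:

```python
from html.parser import HTMLParser

class OpenGraphParser(HTMLParser):
    """Collect og:* <meta> properties, roughly as a link-preview
    system might do before falling back to other fields."""
    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        d = dict(attrs)
        prop = d.get("property", "")
        if prop.startswith("og:") and "content" in d:
            self.og[prop] = d["content"]

page = """<html><head>
<meta property="og:title" content="The semantic web is now widely adopted">
<meta property="og:image" content="https://example.com/card.png">
</head><body></body></html>"""

p = OpenGraphParser()
p.feed(page)
print(p.og["og:title"])  # → The semantic web is now widely adopted
```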
| conzept wrote:
| I think the future holds a synthesis of LLM functions with
| semantic entities and logic from knowledge graphs (this is called
| "neuro-symbolic AI"), so each topic/object can have a clear
| context, upon which you can start prompting the AI for the
| preferred action/intention.
|
| Already implemented in part on my Conzept Encyclopedia project
| (using OpenAI): https://conze.pt/explore/%22Neuro-
| symbolic%20AI%22?l=en&ds=r...
|
| Something like this is much easier done using the semantic web
| (3D interactive occurence map for an organism):
| https://conze.pt/explore/Trogon?l=en&ds=reference&t=link&bat...
|
| On Conzept, one or more bookmarks you create can be used in
| various LLM functions. One of the next steps is to integrate a
| local WebGPU-based frontend LLM and see what 'free' prompting
| can unlock.
|
| JSON-LD is also created dynamically for each topic, based on
| Wikidata data, to set the page metadata.
| knallfrosch wrote:
| Here I was, thinking the machines would make our lives easier.
| Now we have to make our websites Reader-Mode friendly,
| ARIA[1]-labelled, rendered server-side and now semantic web on
| top, just so that bots and non-visitors can crawl around?
|
| [1] This is also something the screen assist software should do,
| not the publisher.
| MrVandemar wrote:
| ARIA is something that really shouldn't have been necessary,
| but today it is absolutely crucial that content publishers get
| it right, because the screen-assist software can't do it.
|
| Why? Because a significant percentage of people working on web
| development think a webpage is composed as many <spans> and
| <divs> as you like, styled with CSS and the content is injected
| into it with JavaScript.
|
| These people don't know what an <img> tag is, let alone alt-
| text, or semantic heading hierarchy. And yet, those are exactly
| the things that Screen Reader software understands.
| Vinnl wrote:
| The question is: does this bring any of the purported benefits of
| the Semantic Web? Does it suddenly allow "agents" to understand
| the _meaning_ of your web pages, or are we just complying with a
| set of pre-defined schemas that predefined software (or more
| specifically, Google, in practice) understands and knows how to
| render. In other words, was all the SemWeb rigmarole actually
| necessary, or could the same results have been achieved using any
| of the mentioned simpler alternatives (microdata, OpenGraph tags,
| or even just JSON schemas)?
| sebstefan wrote:
| Is that really what Discord, Whatsapp & co are using to display
| the embed widgets they have or is it just <meta> tags like I
| would expect...?
| johneth wrote:
| There are several methods they may use:
|
| - OpenGraph (by Facebook, probably used by Whatsapp) -
| https://ogp.me/
|
| - Schema.org markup (the main point of this blog) -
| https://schema.org/
|
| - oEmbed (used to embed media in another page, e.g. YouTube
| videos on a WordPress blog) - https://oembed.com/
| vouaobrasil wrote:
| > The first is that large language models (LLMs) routinely get
| stuff wrong. If you want bots to get it right, provide the
| metadata to ensure that they do.
|
| Yet another reason NOT to use the semantic web. I don't want to
| help any LLMs.
| bigiain wrote:
| I laughed at this bit:
|
| "Googlers, if you're reading this, JSON-LD could have the same
| level of public awareness as RSS if only you could release, and
| then shut down, some kind of app or service in this area. Please,
| for the good of the web: consider it."
| peter_retief wrote:
| Not totally sure if it is needed, nice to have? RSS feeds are
| great but seen less and less.
| druskacik wrote:
| There's a project [0] that parses Commoncrawl data for various
| schemas, it contains some interesting datasets.
|
| [0] http://webdatacommons.org/
| undefinedblog wrote:
| That's a really useful link, thanks for sharing. We're
| building a scraping service and currently rely only on native
| HTML tags and Open Graph metadata; based on this link we
| should definitely take the next step and parse JSON-LD as
| well.
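| That step is indeed small; a hedged sketch (with made-up
| sample markup) of pulling JSON-LD blocks out of a page with
| the stdlib:

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Collect the contents of <script type="application/ld+json">
    tags as parsed JSON objects."""
    def __init__(self):
        super().__init__()
        self.in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self.in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_jsonld = False

    def handle_data(self, data):
        if self.in_jsonld and data.strip():
            self.blocks.append(json.loads(data))

html_doc = '''<head><script type="application/ld+json">
{"@type": "BlogPosting", "headline": "Example post"}
</script></head>'''

ex = JsonLdExtractor()
ex.feed(html_doc)
print(ex.blocks[0]["headline"])  # → Example post
```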
| openrisk wrote:
| The semantic web standards are sorely lacking (for decades now) a
| killer application. Not in a theoretical universe of
| decentralized philosopher-computer-scientists but in the dumbed
| down, swipe-the-next-30sec-video, adtech oligopolized digital
| landscape of walled gardens. Providing better search metadata is
| hardly that killer app. Not in 2024.
|
| The lack of adoption has, imho, two components.
|
| 1. bad luck: the Web got worse, a lot worse. There hasn't been a
| Wikipedia-like event for many decades. This was not pre-ordained.
| Bad stuff happens to societies when they don't pay attention. In
| a parallel universe where the good Web won, the semantic path
| would have been much more traveled and developed.
|
| 2. incompleteness of vision: if you dig to their nuclear core,
| semantic apps offer things like SPARQL queries and reasoners.
| Great, these functionalities are both unique and have definite
| utility but there is a reason (pun) that the excellent Protege
| project [1] is not the new spreadsheet. The calculus of cognitive
| cost versus tangible benefit to the average user is not
| favorable. One thing that is missing is abstractions that will
| help bridge that divide.
|
| Still, if we aspire to a better Web, the semantic web
| direction (if not its current state) is our friend. The
| original visionaries of the semantic web were not out of their
| minds; they just did not account for the complex
| socio-economics of digital technology adoption.
|
| [1] https://protege.stanford.edu/
| austin-cheney wrote:
| A killer app is still not enough.
|
| People can't even get HTML right for basic accessibility, so
| something like the semantic web would be super-science that
| people will go out of their way to intentionally ignore,
| whatever the profit, so long as they can indulge their
| laziness despite the class-action lawsuit liability.
| PaulHoule wrote:
| I see RDF as a basis to build on. If I think RDF is pretty
| good but needs a way to keep track of provenance or
| temporality or something I can probably build something
| augmented that does that.
|
| If it really works for my company and it is a competitive
| advantage I would keep quiet about it and I know of more than
| one company that's done exactly that. The standards process
| is so exhausting and you have to fight with so many systems
| programmers who never wrote an application that it's just
| suicide to go down that road.
|
| BTW, RSS is an RDF application that nobody knows about
|
| https://web.resource.org/rss/1.0/spec
|
| you can totally parse RSS feeds with a RDF-XML parser and do
| SPARQL and other things with them.
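| Since RSS 1.0 is RDF/XML, a full RDF toolkit (e.g. rdflib)
| can load a feed as triples; even the stdlib makes the
| RDF-namespaced structure visible. A sketch with a made-up
| minimal feed:

```python
import xml.etree.ElementTree as ET

# RSS 1.0 lives in these two namespaces; an RDF parser would
# treat the same document as a set of triples.
RSS1 = "http://purl.org/rss/1.0/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

feed = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns="{RSS1}">
  <channel rdf:about="https://example.com/feed">
    <title>Example feed</title>
  </channel>
  <item rdf:about="https://example.com/post-1">
    <title>First post</title>
    <link>https://example.com/post-1</link>
  </item>
</rdf:RDF>"""

root = ET.fromstring(feed)
titles = [item.findtext(f"{{{RSS1}}}title")
          for item in root.findall(f"{{{RSS1}}}item")]
print(titles)  # → ['First post']
```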
| ttepasse wrote:
| 99% of the time you'll get an RSS 2.0 feed which is an XML
| format. Of course you can convert, but RSS 1.0 seems, like
| you said, forgotten from the world.
| burningChrome wrote:
| >> People can't get HTML right for basic accessibility.
|
| Not only has this gotten much worse; even when you put in
| stopgaps for developers, such as linters or other plugins,
| they willfully ignore them and will actually implement code
| they know is detrimental to accessibility.
| DrScientist wrote:
| I think the problem with _any_ sort of ontology-type approach
| is that the problem isn't solved when you have defined the one
| ontology to rule them all after many years of wrangling
| between experts.
|
| What you have actually done is spend many years _generating a
| shared understanding_ of what that ontology means among the
| experts. Once that's done, you have the much harder task of
| pushing that shared understanding to the rest of the world.
|
| i.e. the problem isn't defining a tag for a cat - it's having
| a globally shared vision of what a cat is.
|
| I mean, we can't even agree on what a man or a woman is.
| openrisk wrote:
| You point out a real problem but it does not feel like an
| unsurmountable and terminal one. By that argument we would
| never have human language unless everybody spoke the same
| language. It turns out that once you have well-developed
| languages
| (and you do, because they are useful even when not universal)
| you can translate between them. Not perfectly, but generally
| good enough.
|
| Developing such linking tools between ontologies would be
| worthwhile if there are multiple ontologies covering the same
| domain, _provided they are actually used_ (i.e., there are
| large datasets for each). Alas, instead of a bottom-up,
| organic approach people try to solve this with top-down,
| formal (upper-level) ontologies [1] and Leibnizian dreams of
| an underlying universality [2], which only adds to the
| cognitive load.
|
| [1] https://en.wikipedia.org/wiki/Formal_ontology
|
| [2] https://en.wikipedia.org/wiki/Characteristica_universalis
| rapnie wrote:
| > You point out a real problem but it does not feel like an
| unsurmountable and terminal one
|
| In our spoken language the agents doing the parsing are
| human AIs (_actual_ intelligences), able to deal with most
| of the finer nuances of semantics, and still making
| numerous errors in many contexts that lead to
| misunderstanding, i.e. parse errors.
|
| There was this hand-waving promise in the semantic web
| movement of "if only we make everything machine-readable,
| then .." magic would happen, undoubtedly unlocking numerous
| killer apps, if only we had these (increasingly complex)
| linked-data standards and related tools to define and parse
| 'universal meaning'.
|
| An overreach, imho. The semantic web was always overpromising
| yet underdelivering. There may be new use cases in combining
| it with ML/LLMs, but I don't think they'll be a vNext of the
| web anytime soon.
| vasco wrote:
| > There hasn't been a Wikipedia-like event for many decades
|
| I'll give you two examples: Internet Archive. Let's Encrypt.
| KolmogorovComp wrote:
| Hardly a good reference, Internet Archive is older than
| Wikipedia.
| Vinnl wrote:
| Wikipedia itself is only a little over two decades old. I
| don't think anyone would parse "many decades" as "two
| decades".
|
| There's also OpenStreetMap, exactly two decades old and
| thus four years younger than Wikipedia.
| bawolff wrote:
| > Wikipedia itself is only a little over two decades old
|
| The world wide web (but not the internet) is only 3
| decades old!
| Retr0id wrote:
| Let's Encrypt is very good but it's not exactly a web app,
| semantic-web or otherwise.
| conzept wrote:
| Not true: Wikidata, OpenAlex, Europeana, ... and many
| smaller projects making use of all that data, such as my
| project Conzept (https://conze.pt)
| debarshri wrote:
| At TU Delft, I was supposed to do my PhD on the semantic web,
| specifically in shipping logistics. It was funded by the Port
| of Rotterdam 10 years ago. The idea was to theorize and build
| various concepts around discrete data sharing, data discovery,
| classification, ontology building, query optimization,
| automation, and similar use cases. I decided not to pursue the
| PhD a month into it.
|
| I believe in the semantic web. The biggest problem is that,
| due to the lack of tooling and ease of use, it takes a lot of
| effort and time to see the value in building something like
| that across various parties. You don't see the value right
| away.
| jsdwarf wrote:
| Funny you bring up logistics and (data) ontologies. I'm a PM
| at a logistics software company and I'd say the lack of
| proper ontologies and standardized data exchange formats is
| the biggest effort driver for integrating 3rd party
| carrier/delivery services such as DHL, Fedex etc.
|
| It starts with the lack of a common terminology. For tool A a
| "booking" might be a reservation e.g. of a dock at a
| warehouse. For tool B the same word means a movement of goods
| between two accounts.
|
| In terms of data integration, things have gotten A LOT worse
| since EDIFACT became de facto deprecated. Every carrier in
| the parcel business is cooking up their own API, but with
| insufficient means. I've come across things like Polish-only
| endpoint names/error messages, or country organisations of
| big parcel couriers using different APIs.
|
| IMHO the EU has to step in here because integration costs
| skyrocket. They forced cellphone manufacturers to use USB-Cs
| for charging, why can't they force carriers to use a common
| API?
| openrisk wrote:
| The EU is doing its part in some domains. There is e.g.,
| the eProcurement ontology [1] that aims to harmonize public
| procurement data flows. But I suppose it helped alot that
| (by EU law) everybody is obliged to submit to a central
| repository.
|
| [1] https://docs.ted.europa.eu/epo-home/index.html
| PaulHoule wrote:
| Good choice. The semantic web really brought me to the brink.
|
| The community has its head in the sands about... just about
| everything.
|
| Document databases and SQL are popular because of all the
| affordances around "records". That is, instead of deleting,
| inserting, and updating facts, you get primitives that let
| you update records _in a transaction_ even if you don't
| explicitly use transactions.
|
| It's very possible to define rules that will cut out a small
| piece of a graph that defines an individual "record"
| pertaining to some "subject" in the world even when blank
| nodes are in use. I've done it. You would go 3-4 years into
| your PhD and probably not find it in the literature, not get
| told about it by your prof, or your other grad students. (boy
| I went through the phase where I discovered most semantic web
| academics couldn't write hard SPARQL queries or do anything
| interesting with OWL)
|
| Meanwhile people who take a bootcamp can be productive with
| SQL in just a few days because SQL was developed long ago to
| give the run-of-the-mill developer superpowers. (imagine how
| lost people were trying to develop airline reservation
| systems in the 1960s!)
| WolfOliver wrote:
| Graph Based RAG systems look promising
| https://www.ontotext.com/knowledgehub/fundamentals/what-is-g...
| jl6 wrote:
| Killer applications solve real problems. What is the biggest
| real problem on the web today? The noise flood. Can semantic
| web standards help with that? Maybe! Something about trust,
| integrity, and lineage, perhaps.
| rakoo wrote:
| Semantic Web doesn't help with the most basic thing: how do
| you get information? If I want to know when the Matrix was
| shot, where do I go? Today we have for-profit centralized
| points to get all information, because it's the only way this
| can be sustainable. Semantic Web might make it more feasible,
| by instead having lots of small interconnected agents that
| trust each other, much like... a Web of Trust. Except we know
| where the last experiment went (nowhere).
| rakoo wrote:
| Over on lobste.rs, someone cited another article retracing the
| history of the Semantic Web:
| https://twobithistory.org/2018/05/27/semantic-web.html
|
| An interesting read in itself, and also points to Cory Doctorow
| giving seven reasons why the Semantic Web will never work:
| https://people.well.com/user/doctorow/metacrap.htm. They are
| all good reasons and are unfortunately still valid (although
| one of his observations towards the end of the text has turned
| out to be comically wrong, I'll let you read what it is)
|
| Your comment and the two above links point to the same
| conclusion: again and again, Worse is Better
| (https://en.wikipedia.org/wiki/Worse_is_better)
| domh wrote:
| Thanks for sharing that Doctorow post, I had not seen that
| before. While the specific examples are of course dated
| (hello altavista and Napster), it still rings mostly true.
| openrisk wrote:
| > An interesting read in itself...
|
| Indeed a good read, thanks for the link!
|
| > [Cory Doctorow's] seven insurmountable obstacles
|
| I think his context is the narrower "Web of individuals"
| where many of his seven challenges are real (and ongoing).
|
| The elephant in the digital room is the "Web of
| organizations", whether that is companies, the public sector,
| civil society etc. If you revisit his objections in that
| light they are less true or even relevant. E.g.,
|
| > People lie
|
| Yes. But public companies are increasingly reporting online
| their audited financials via standards like iXBRL and
| prescribed taxonomies. Increasingly they need to report
| environmental impact etc. I mentioned in another comment
| common EU public procurement ontologies. Think also the
| millions of education and medical institutions and their
| online content. In institutional context lies do happen, but
| at a slightly deeper level :-)
|
| > People are lazy
|
| This only raises the stakes. As somebody mentioned already,
| the cost of navigating random API's is high. The reason we
| still talk about the semantic web despite decades of no-show
| is precisely the persistent need to overcome this friction.
|
| > People are stupid
|
| We are who we are individually, but again this ignores the
| collective intelligence of groups. Besides the hordes of
| helpless individuals and a handful of "big techs"(=the random
| entities that figured out digital technology ahead of others)
| there is a vast universe of interests. They are not stupid
| but there is a learning curve. For the vast part of society
| the so-called digital transformation is only at its
| beginning.
| rakoo wrote:
| You have a very charitable view of this whole thing and I
| want to believe like you. Perhaps there is a virtuous cycle
| to be built where infrastructure that relies on people
| being more honest helps change the culture to actually be
| more honest which makes the infrastructure better. You
| don't wait for people to be nice before you create the gpl,
| the gpl changes mindsets towards opening up which fosters a
| better culture for creating more.
|
| It's also very important to think in macro systems and
| societies, as you point out, rather than at the individual
| level
| kayo_20211030 wrote:
| Every time I read a post like this I'm inclined to post
| Doctorow's Metacrap piece in response. You got there ahead of
| me. His reasoning is still valid and continues to make sense
| to me. Where do you think he's "comically wrong"?
| unconed wrote:
| The implicit metrics of quality and pedigree he believed
| were superior to human judgement have since been gamified
| into obsolescence by bots.
| kayo_20211030 wrote:
| I think that the jury is still out on that one. Human
| judgement is too often colored by human incentives. I
| still think there's an opportunity for mechanical
| assessments of quality and pedigree to excel, and exceed
| what humans can do; at least, at scale. But, it'll always
| be an arms race and I'm not convinced that bots are in it
| except in the sense of lying through metadata, which
| brings us back to the assessment of quality and pedigree
| - right/wrong, good/bad, relevant/garbage.
| pessimizer wrote:
| Link counting being reliable for search. After going
| through people's not-so-noble qualities and how they make
| the semantic web impossible, he declares counting links as
| an exception. It was to a comical degree not an exception.
| kayo_20211030 wrote:
| Yes. There is that. Ignobility wins out again.
| monknomo wrote:
| item 2.6 kneecapped item 3
| PaulHoule wrote:
| One major problem RDF has is that people hate anything with
| namespaces. It's a "freedom is slavery" kind of thing. People
| will accept it grudgingly if Google says it will help their
| search rankings or if you absolutely have to deal with them
| to code Java but 80% of people will automatically avoid
| anything if it has namespaces. (See namespaces in XML)
|
| Another problem is that it's always ignored the basic
| requirements of most applications like:
|
| 1. Getting the list of authors in a publication as references
| to authority records in the right order (Dublin Core makes
| the 1970 MARC standard look like something from the Starship
| Enterprise)
|
| 2. Updating a data record reliably and transactionally
|
| 3. Efficiently unioning graphs for inference so you can
| combine a domain database with a few database records
| relevant to a problem + a schema easily
|
| 4. Inference involving arithmetic (Gödel warned you about
| first-order logic plus arithmetic but for boring fields like
| finance, business, logistics that is the lingua franca, OWL
| comes across as too heavyweight but completely deficient at
| the same time and nobody wants to talk about it)
|
| things like that. Try to build an application and you have to
| invent a lot of that stuff. You have the tools to do it and
| it's not that hard if you understand the math inside and out
| but if you don't oh boy.
|
| If RDF got a few more features it would catch up with where
| JSON-based tools like
|
| https://www.couchbase.com/products/n1ql/
|
| were 10 years ago.
| cyanydeez wrote:
| i think you're confused. the killer app is everyone following
| the same format, and such, capitalists can extract all that
| information and sell LLMs that no one wants in place of more
| deterministic search and data products.
| h4ck_th3_pl4n3t wrote:
| Say what you want, but Macromedia Dreamweaver came pretty close
| to being "that killer app". Microsoft attempted the same with
| Frontpage, but abandoned it pretty quickly as they always do.
|
| I think that Web Browsers need to change what they are. They
| need to be able to understand content, correlate it, and
| distribute it. If a Browser sees itself not as a consuming app,
| but as a _contributing_ and _seeding_ app, it could influence
| the semantic web pretty quickly, and make it much more awesome.
|
| Beaker Browser came pretty close to that idea (but it was
| abandoned, too).
|
| Humans won't give a damn about hand-written semantic code, so
| you need to make the tools better that produce that code.
| ricardo81 wrote:
| There's another element, trusting the data.
|
| Often that may require some web scale data, like Pagerank but
| also any other authority/trust metric where you can say "this
| data is probably quality data".
|
| A rather basic example, published/last modified dates. It's
| well known in SEO circles at least in the recent past that
| changing them is useful to rank in Google, because Google
| prefers fresh content. Unless you're Google or have a less than
| trivial way of measuring page changes, the data may be less
| than trustworthy.
| lxgr wrote:
| Not even Google seems to be making use of that capability, if
| they even have it in the first place. I'm regularly annoyed
| by results claiming to be from this year, only to find that
| it's a years-old article with fake metadata.
| account42 wrote:
| Yeah, dates in Google results have become all but useless.
| It's just another meaningless knob for SEOtards to abuse.
| ricardo81 wrote:
| They are quite good at near content duplicate detection so
| I imagine it's within their capabilities. Whether they care
| about recency, maybe not as long as the user metrics say
| the page is useful. Maybe a fallacy about content recency.
|
| You don't see many geocities style sites nowadays, even
| though there's many older sites with quality (and original)
| content. Maybe mobile friendliness plays into that though.
| echelon wrote:
| Search and ontologies weren't the only goals. Microformats
| enabled standardized data markup that lots of applications
| could consume and understand.
|
| RSS and Atom were semantic web formats. They had a ton of
| applications built to publish and consume them, and people
| found the formats incredibly useful.
|
| The idea was that if you ran into ingestible semantic content,
| your browser, a plugin, or another application could use that
| data in a specialized way. It worked because it was a
| standardized and portable data layer as opposed to a soup of
| meaningless HTML tags.
|
| There were ideas for a distributed P2P social network built on
| the semantic web, standardized ways to write articles and blog
| posts, and much more.
|
| If that had caught on, we might have saved ourselves a lot of
| trouble continually reinventing the wheel. And perhaps we would
| be in a world without walled gardens.
| recursivedoubts wrote:
| The semantic web has been, in my opinion, a category error.
| Semantics means meaning and computers/automated systems don't
| really do meaning very well and certainly don't do intention
| very well.
|
| Mapping the incredible success of The Web onto automated
| systems hasn't worked because the defining and unique
| characteristic of The Web is REST and, in particular, the
| uniform interface of REST. This uniform interface is wasted on
| non-intentional beings like software (that I'm aware of):
|
| https://intercoolerjs.org/2016/05/08/hatoeas-is-for-humans.h...
|
| Maybe this all changes when AI takes over, but AI seems to do
| fine without us defining ontologies, etc.
|
| It just hasn't worked out the way that people expected, and
| that's OK.
| dboreham wrote:
| I take the other side of this trade, and have since c. 1980.
| I say that semantics is a delusion our brains create.
| Doesn't really exist. Or conversely is not the magical thing
| we think it is.
| recursivedoubts wrote:
| man
| lo_zamoyski wrote:
| How are you oblivious of the performative contradiction
| that is that statement?
|
| Please tell me you're not an eliminativist. There is
| nothing respectable about eliminativism. Self-refuting, and
| Procrustean in its methodology, denying observation it
| cannot explain or reconcile. Eliminativism is what you get
| when a materialist refuses or is unable to revise his
| worldview despite the crushing weight of contradiction and
| incoherence. It is obstinate ideology.
| Pet_Ant wrote:
| TIL:
|
| https://en.wikipedia.org/wiki/Eliminative_materialism
|
| > Eliminative materialism (also called eliminativism) is
| a materialist position in the philosophy of mind. It is
| the idea that the majority of mental states in folk
| psychology do not exist. Some supporters of eliminativism
| argue that no coherent neural basis will be found for
| many everyday psychological concepts such as belief or
| desire, since they are poorly defined. The argument is
| that psychological concepts of behavior and experience
| should be judged by how well they reduce to the
| biological level. Other versions entail the nonexistence
| of conscious mental states such as pain and visual
| perceptions.
| naasking wrote:
| > Eliminativism is what you get when a materialist
| refuses or is unable to revise his worldview despite the
| crushing weight of contradiction and incoherence.
|
| Funny, because eliminativism to me is the inevitable
| conclusion that follows from the requirement of logical
| consistency + the crushing weight of objective evidence
| when pitted against my personal perceptions.
| ftlio wrote:
| > The semantic web has been, in my opinion, a category error.
|
| Hard agree.
|
| > Maybe this all changes when AI takes over, but AI seems to
| do fine without us defining ontologies, etc.
|
| I think about it as:
|
| - Hypermedia controls were deemphasized, leading to a
| ton of workarounds to REST
|
| - REST is a perfectly suitable interface for AI Agents,
| especially to audit for governance
|
| - AI is well suited to the task of mapping the web as it
| exists today to REST
|
| - AI is well suited to mapping this layout ontologically
|
| The semantic web is less interesting than what is traversable
| and actionable via REST, which may expose some higher level,
| reusable structures.
|
| The first thing I can think of is `User` as a PKI type
| structure that allows us to build things that are more
| actionable for agents while still allowing humans to grok
| what they're authorized to.
| thomastjeffery wrote:
| > Maybe this all changes when AI takes over, but AI seems to
| do fine without us defining ontologies, etc.
|
| If you say "AI" in 2024, you are probably talking about an
| LLM. An LLM is a program that pretends to solve semantics by
| actually entirely avoiding semantics. You feed an LLM a
| semantically meaningful input, and it will generate a
| statistically meaningful output _that just so happens to look
| like_ a semantically meaningful transformation. Just to
| really sell this facade, we go around calling this program a
| "transformer" and a "language model", even though it
| truthfully does nothing of the sort.
|
| The entire goal of the semantic web was to dodge the exact
| same problem: ambiguous semantics. By asking everyone to
| rewrite their content as an ontology, you compel the writer
| to transform the semantics of their content into explicit
| unambiguous logic.
|
| That's where the category error comes in: the writer can't do
| it. Interesting content can't just be trivially rewritten as
| a simple universally-compatible ontology that is actually
| rooted in meaningfully unambiguous axioms. That's precisely
| the hard problem we were trying to dodge in the first place!
|
| So the writer does the next best thing: they write an
| ontology that _isn't_ rooted. There are no really useful
| axioms at the root of this tree, but it's a tree, and that's
| good enough. Right?
|
| What use is an ontology when it isn't rooted in useful
| axioms? Instead of dodging the problem of ambiguous
| semantics, the "semantic web" moves that problem right in
| front of the user. That's probably useful for _something_,
| just not what the user is expecting it to be useful for.
|
| ---
|
| I have this big abstract idea I've been working on that might
| actually solve the problem of ambiguous semantics. The
| trouble is, I've been having a really hard time tying the
| idea itself down to reality. It's a deceptively challenging
| problem space.
| jancsika wrote:
| > There hasn't been a Wikipedia-like event for many decades.
|
| Off the top of head...
|
| OpenStreetMap was in 2004. Mastodon and the associated spec-
| thingy was around 2016. One/two decades is not the same as many
| decades.
|
| Oh, and what about asm.js? Sure, archive.org is many decades
| old. But suddenly I'm using it to play every retro game under
| the sun on my browser. And we can try out a lot of FOSS
| software in the browser without installing things. Didn't
| someone post a blog to explain X11 where the examples were
| running a javascript implementation of the X window system?
|
| Seems to me the entire web-o-sphere leveled up over the past
| decade. I mean, it's so good in fact that I can run an LLM
| _clientside_ in the browser. (Granted, it's probably trained
| in part on your public musing that the web is worse.)
|
| And all this while still rendering Berkshire Hathaway website
| correctly for _many_ decades. How many times would the Gnome
| devs have broken it by now? How many upgrades would Apple have
| forced an "iweb" upgrade in that time?
|
| Edit: typo
| openrisk wrote:
| The web browser (or an app with a vague likeness to a
| browser) would indeed be in the epicenter of a "semantic"
| leap if that happens.
|
| The technical capability of the browser to be an OS within an
| OS is more than proven by now, but not sure I am impressed
| with the utility thus far.
|
| At the same time even basic features in the "right
| direction", empowering the users information processing
| ability (bookmarks, rss, etc) have stagnated or regressed.
| glenstein wrote:
| I am not sure I understand the fixation on a "killer app" in
| the context of web standards. We are talking about things like,
| say, XML, or SVG or HTTP/2. They can have their rationale and
| their value simply by serving to enable organic growth of a web
| ecosystem. I think I agree most with your last sentence and
| should define success more in those terms, aspiring to a better
| web.
| openrisk wrote:
| The idea (or hope) is that apps based on semantic standards
| would kick off a virtuous cycle where publishers of
| information keep investing in both generating metadata and
| evolving the standards themselves. As many have mentioned in
| the thread, that's not a trivial step.
|
| People sort of try. A concrete example are the
| Activitypub/Fediverse standards which dared to use json-ld.
| To my knowledge so far the social media experience of
| mastodon and friends is not qualitatively different from the
| old web stuff.
| EGreg wrote:
| Why do we need web standards for the semantic web anymore when
| we have LLMs?
|
| Just make LLMs more ubiquitous and train them on the Web.
| Rather than crawling or something. The LLMs are a lot more
| resilient.
| 627467 wrote:
| > The Semantic Web is the old Web 3.0. Before "Web 3.0" meant
| crypto-whatnot, it meant "machine-readable websites".
|
| Using contemporary AI models, aren't all websites machine-
| readable? - or potentially even more readable than semantic web
| unless an ai model actually does the semantic classification
| while reading it?
| CaptArmchair wrote:
| I'm a bit surprised that the author doesn't mention key concepts
| such as linked data, RDF, federation and web querying. Or even
| the five stars of linked open data. [1] Sure, JSON-LD is part of
| it, but it's just a serialization format.
|
| The really neat part is when you start considering universal
| ontologies and linking to resources published on other domains.
| This is where your data becomes interoperable and reusable. Even
| better, through linking you can contextualize and enrich your
| data. Since linked data is all about creating graphs, creating a
| link in your data, or publishing data under a specific domain are
| acts that involve concepts like trust, authority, authenticity
| and so on. All those murky social concepts that define what we
| consider more or less objective truths.
|
| LLM's won't replace the semantic web, nor vice versa. They are
| complementary to each other. Linked data technologies allow
| humans to cooperate and evolve domain models with a salience and
| flexibility which wasn't previously possible behind the walls and
| moats of discrete digital servers or physical buildings. LLM's
| work because they are based on large sets of ground truths, but
| those sets are always limited which makes inferring new knowledge
| and asserting its truthiness independent from human intervention
| next to impossible. LLM's may help us to expand linked data
| graphs, and linked data graphs fashioned by humans may help
| improve LLM's.
|
| Creating a juxtaposition between both? Well, that's basically
| comparing apples against pears. They are two different things.
|
| [1] https://5stardata.info/en/
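A minimal JSON-LD document makes the point above concrete (the values and URLs below are illustrative, except schema.org and the Wikidata entity, which are real): the `@context` maps short keys to IRIs, and `@id`/`sameAs` links point at resources on *other* domains. That cross-domain linking is what turns an isolated record into part of a graph.

```python
import json

# Illustrative JSON-LD: a local resource linked to an external
# authority record on another domain (Wikidata's Tim Berners-Lee).
doc = {
    "@context": "https://schema.org",
    "@type": "Book",
    "@id": "https://example.com/books/weaving-the-web",  # made-up IRI
    "name": "Weaving the Web",
    "author": {
        "@type": "Person",
        "name": "Tim Berners-Lee",
        "sameAs": "https://www.wikidata.org/entity/Q80",
    },
}

print(json.dumps(doc, indent=2))
```

Because JSON-LD is plain JSON, any consumer that doesn't care about the RDF semantics can still read it with an ordinary JSON parser, which is much of why it won out over RDF/XML.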
| ThinkBeat wrote:
| I don't like the use of a JSON "script" inside an HTML page. I
| understand the flexibility it grants, but markup tags are what
| HTML is based on, and the design would be more consistent if the
| HTML tags we have had for decades also handled this extra
| metadata.
| M2Ys4U wrote:
| JSON-LD isn't the only way one can embed these metadata (though
| I think most tooling prefers it now).
|
| For example, Microdata[0] is one in-line way to do it, and
| RDFa[1] is another.
|
| [0] https://en.wikipedia.org/wiki/Microdata_(HTML)
|
| [1] https://en.wikipedia.org/wiki/RDFa
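To make the contrast with JSON-LD visible: with Microdata the metadata rides on the visible HTML via `itemscope`/`itemtype`/`itemprop` rather than living in a separate script block. The snippet and extractor below are a toy sketch of my own (not a conformant Microdata parser: no nesting, `itemref`, or `itemid` handling), using only Python's stdlib `html.parser`.

```python
from html.parser import HTMLParser

# Illustrative inline Microdata markup (values are made up).
html_doc = """
<div itemscope itemtype="https://schema.org/Person">
  <span itemprop="name">Grace Hopper</span>
  <span itemprop="jobTitle">Rear Admiral</span>
</div>
"""

class ItempropExtractor(HTMLParser):
    """Collect flat itemprop -> text pairs from a page."""
    def __init__(self):
        super().__init__()
        self.current = None
        self.props = {}

    def handle_starttag(self, tag, attrs):
        # Remember the itemprop name (if any) of the tag just opened.
        self.current = dict(attrs).get("itemprop")

    def handle_data(self, data):
        if self.current and data.strip():
            self.props[self.current] = data.strip()
            self.current = None

p = ItempropExtractor()
p.feed(html_doc)
print(p.props)
```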
| est wrote:
| I've been playing with RSS feeds recently, and suddenly it
| occurred to me: XML can be transformed into anything with XSL.
| For statically hosted personal blogs, I can save articles into
| the feeds directly, then serve a frontend single-page application
| with some static XSLT+JS. This is content-presentation separation
| at its best.
|
| Is JSON-LD just a reinvention of this?
| martin_a wrote:
| That is exactly the thought behind SGML/XML and its
| derivatives. XSL is kind of clumsy but very powerful and the
| most direct way to transform documents.
|
| JSON-LD to me looks more like trying to glue different
| documents together, its not about the transformation itself.
| rakoo wrote:
| > This is content-presentation separation at best.
|
| The idea is the best, but arguably the implementation is
| lacking.
|
| > Is JSON-LD just reinventation of this?
|
| Yup. It's "RDF/XML but we don't like XML"
| ttepasse wrote:
| Back in the optimistic 2000s there was the brief idea of GRDDL
| - using XSLT stylesheets and XPath selectors for extracting
| stuff from HTML, e.g. microformats, HTML meta, FOAF, etc, and
| then transforming it into RDF or other things:
|
| https://www.w3.org/TR/grddl/
| mcswell wrote:
| But why? Isn't most of the information you can extract from
| those tags stuff that's pretty obvious, like title and author
| (the examples the linked page uses)? How do you extract
| really useful information _using that methodology_,
| supporting searches that answer queries like "110 volt socket
| accepting grounding plugs"? Of course search engines _can_
| (and do) get such info, but afaik it doesn't require or use
| XSLT beyond extracting the plain text.
| anonymous344 wrote:
| Well worth, for whom? As a blogger, these things are 99% for
| companies making a profit by scraping my content; maybe 1% of the
| users will need them. Or am I wrong?
| _heimdall wrote:
| This has been my hang up as well. Providing metadata seems
| extremely useful and powerful, but coming into web development
| in the mid 10s rather than mid 00s made it more clear that the
| metadata would largely just help a handful of massive
| corporations.
|
| I will still include JSON-LD when it makes financial sense for a
| site. In practice that usually just means business metadata for
| search results and product data for any ecommerce pages.
| Lutger wrote:
| Everyone is optimizing for their own local use-case. Even open-
| source. Standards get adopted sometimes, but only if they solve a
| specific problem.
|
| There is an additional cost to making or using ontologies, making
| them available and publishing open data on the semantic web. The
| cost is quite high, the returns aren't immediate, obvious or
| guaranteed at all.
|
| The vision of the semantic web is still valid. The incentives to
| get there are just not in place.
| codelion wrote:
| I started this thread on the w3c list almost 20 years ago -
| https://lists.w3.org/Archives/Public/semantic-web/2005Dec/00...
|
| Unfortunately, it is unlikely we will ever get something like a
| Semantic web. It seemed like a good idea in the beginning of
| 2000s but now there is honestly no need for it as it is quite
| cheap and easy to attach meaning to text due to the progress in
| LLMs and NLP.
| mcswell wrote:
| Exactly. Afaik, there are certain corners of the Web that
| benefit from some kind of markup. I think real estate is one,
| where you can generate searches of the MLS on sites like Redfin
| or Zillow (or any realtor's site, really) such that you can set
| parameters: between 1000 and 1500 square feet (or meters in
| Europe), with a garage and no basement. That's very helpful
| (although I don't know whether that searching is done over
| indexed web pages, or on the MLS itself). But most of the Web,
| afaict, has nothing like that---and doesn't need it, because NLP
| can distinguish different senses of 'bank' (financial vs.
| river), etc.
| BiteCode_dev wrote:
| The article talks about JSON-LD, but there is also shema.org and
| open graph.
|
| Which one should you use, and why?
|
| Should you use several? How does that impact the site?
| dangoodmanUT wrote:
| JSON-LD uses schema.org schema
| giantrobot wrote:
| But very helpfully Google supports...mostly schema.org except
| when they don't when they feel like it.
| kvgr wrote:
| I did my bachelor thesis 10 years ago on some semantic file
| conversions; we had a lot of projects at school. And it looks
| like there has not been much progress for the end user...
| grumbel wrote:
| I don't see how one can have any hope in a Semantic Web ever
| succeeding when we haven't even managed to get HTML tags for
| extremely common Internet things: pricetags, comments, units,
| avatars, usernames, advertisement and so on. Even things like
| pagination are generally just a bunch of links, not any kind of
| semantic thing holding multiple documents together (<link rel>
| exists, but I haven't seen browsers doing anything with it). Take
| your average website and look at all the <div>s and <span>s and
| there is a whole lot more low hanging fruit one could turn
| semantic, but there seems little interest in even trying to.
| rakoo wrote:
| I don't think we necessarily need new tags: they narrow down
| the list of possible into an immutable set and require changing
| the structure of your already existing content. What exists
| instead are microformats
| (http://microformats.org/wiki/microformats2), a bunch of
| classes you sprinkle in your current HTML to "augment" it.
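The "sprinkle classes on existing HTML" approach can be seen in a tiny sketch (my own toy, not a conformant mf2 parser, which has many more rules): `h-card`, `p-name`, and `u-url` are real microformats2 vocabulary, and the existing markup structure is untouched.

```python
from html.parser import HTMLParser

# Illustrative microformats2 markup: ordinary HTML with mf2 classes
# added alongside whatever classes were already there.
html_doc = """
<a class="h-card" href="https://example.com/alice">
  <span class="p-name">Alice</span>
</a>
"""

class MF2Scanner(HTMLParser):
    """Toy scanner: record every mf2-style class (h-*, p-*, u-*,
    e-*, dt-*) encountered, ignoring real mf2 parsing rules."""
    PREFIXES = ("h-", "p-", "u-", "e-", "dt-")

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        for cls in (dict(attrs).get("class") or "").split():
            if cls.startswith(self.PREFIXES):
                self.found.append(cls)

s = MF2Scanner()
s.feed(html_doc)
print(s.found)
```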
| _heimdall wrote:
| I include microformats on blog sites, but at scale the
| challenge with microformats is that most existing tooling
| doesn't consider class names at all for semantics.
|
| Browsers, for example, completely ignore classes when
| building the accessibility tree for a web page. Only the HTML
| structure and a handful of CSS properties have an impact on
| accessibility.
|
| Class names were always meant as an ease-of-use feature for
| styling; overloading them with semantic meaning could break a
| number of sites built over the last few decades.
| ttepasse wrote:
| There is also RDFa and even more obscure Microdata to augment
| HTML elements. Google's schema.org vocabulary originally used
| these before switching to JSON-LD.
|
| The trick, as always, is to get people to use it.
| dsmurrell wrote:
| "Googlers, if you're reading this, JSON-LD could have the same
| level of public awareness as RSS if only you could release, and
| then shut down, some kind of app or service in this area. Please,
| for the good of the web: consider it." - lol
| dgellow wrote:
| Companies use open-graph because it gives them something in
| return (nice integration in other products when linking to your
| site). That's nice and all but outside of this niche use case
| there is no incentives for a semantic web from the point of view
| of publishers. You just make it simpler to crawl your website
| (something you cannot really monetize) instead of offering a
| strict API you can monetize to access structured data.
| 1f60c wrote:
| This has been invented a number of times. Facebook's version is
| called Open Graph.
|
| https://ogp.me/
| ttepasse wrote:
| Back then Facebook said their Open Graph Protocol was only an
| application of RDFa - and syntax wise it seemed so.
| patagnome wrote:
| worth the bother. "preview" on the capitalocenic web without any
| mention of the Link Relation Types does not a semantic web
| adoption make. no mention of the economic analysis and impact of
| monopoly, no intersectional analysis with #a11y.
|
| if the "preview" link relation type is worth mentioning it's
| worth substantiating the claims about adoption. when did the big
| players adopt? why? what of the rest of the types and their
| relation to would-be "a.i." claims?
|
| how would we write html differently and what capabilities would
| we expose more readily to driving by links, like carousels only
| if written with a11y in mind? how would our world-wild web look
| different if we wrote html like we know it? than only give big
| players a pass when we view source?
| hoosieree wrote:
| > If Web 3.0 is already here, where is it, then? Mostly, it's
| hidden in the markup.
|
| I feel like this is so obvious to point out that I must be
| missing something, but the whole article goes to heroic lengths
| to avoid... HTML. Is it because HTML is difficult and scary? Why
| invent a custom JSON format _and_ a custom JSON-to-HTML compiler
| toolchain rather than just write HTML?
|
| The semantics aren't _hidden_ in the markup. The semantics _are_
| the markup.
| wepple wrote:
| I think that's what we're doing today, and it's a phenomenal
| mess.
|
| The typical HTML page these days is horrifically bloated, and
| whilst it's machine parsable, it's often complicated to
| actually understand what's what. It's random nested divs and
| minified everything. All the way down.
|
| But I do wonder if adding context to existing HTML might be
| better than a whole other JSON blob that'll get out of sync
| fast.
| hoosieree wrote:
| I'm just not convinced that swapping out "<ol></ol>" for "[]"
| actually addresses any of the problems.
| Lutger wrote:
| I must have missed your point, isn't the answer obviously that
| HTML is very, very limited and intended as a way to markup
| text? Semantic data is a way to go further and make machine-
| readable what actually is inside that text: recipes, places,
| people, posts, animals, etc, etc and all their various
| attributes and how they relate to each other.
|
| Basically, what you are saying is already RDF/XML, except that
| devs don't like XML, so JSON-LD came along as a friendlier way,
| for both humans and machines, to do RDF/XML.
|
| There are also various microdata formats that allow you to
| annotate html in a way the machines can parse it as rdf. But
| that can be limited in some cases if you want to convey more
| metadata.
| rchaud wrote:
| Why should anybody do that though? It doesn't benefit
| individual users, it benefits web scrapers mostly. Search
| bots are pretty sophisticated at parsing HTML so it isn't an
| issue there.
| hanniabu wrote:
| Web 1.0 = read
|
| Web 2.0 = read/write
|
| Web 3.0 = read/write/own
| DarkNova6 wrote:
| You could make the case that we already are in Web 3.0, or
| that we have regressed into Web 1.0 territory.
|
| Back in actual Web 2.0, the internet was not dominated by
| large platforms, but more spread out by ppl hosting their own
| websites. Interaction was everywhere and the spirit revolved
| around "p2p exchange" (not technologically speaking).
|
| Now, most traffic goes over large companies which own your
| data, tell you what to see and severely limit genuine
| exchange. Unless you count out the willingness of "content
| monkeys", that is.
|
| What has changed? The internet has settled for a lowest common
| denominator and moved away from a space of tech-savvy people
| (primarily via the arrival of smartphones). The WWW
| used to be the unowned land in the wild west, but has now
| been colonized by an empire from another world.
| matheusmoreira wrote:
| I wish there was a better alternative to JSON-LD. I want to avoid
| duplication by reusing the data that's already in the page by
| marking them up with appropriate tags and properties. Stuff like
| RDF exists but is extremely complex and verbose.
| ttepasse wrote:
| Originally you could use the schema.org vocabulary with RDFa or
| Microdata which embed the structured data right at the element.
| But that can be brittle: Markup structures change, get copy-
| and-pasted and editing attributes is not really great in CMS. I
| may not like it aesthetically but embedded JSON-LD makes some
| sense.
|
| See also this comment above:
| https://news.ycombinator.com/item?id=41309555
| makkes wrote:
| Semantic Web technology (RDF, RDFS, OWL, SHACL) is widely used in
| the European electricity industry to exchange grid models:
| https://www.entsoe.eu/data/cim/cim-for-grid-models-exchange/
| etimberg wrote:
| I have experience using this back when I worked for a startup
| that did distribution grid optimization. The specs are
| unfortunately useless in practice because, while the terminology
| is standardized, the actual use of each object and how to relate
| objects to each other is not.
|
| Thus, every tool makes CIM documents slightly differently and
| there are no guarantees that a document created in one tool
| will be usable in another.
| ubertaco wrote:
| Well, the immediate initial test failed for me: I thought, "why
| not apply this on one of my own sites, where I have a sort of
| journal of poetry I've written?"...and there's no category for
| "Poem", and the request to add Poem as a type [1] is at least 9
| years old, links to an even older issue in an unreadable issue
| tracker without any resolution (and seemingly without much effort
| to resolve it), and then dies off without having accomplished
| anything.
|
| [1] https://github.com/schemaorg/suggestions-questions-
| brainstor...
| tossandthrow wrote:
| Having worked in this field for a bit, this uncovers an even
| more fundamental flaw: The idea that we can have a single
| static ontology.
| lambdaba wrote:
| What kind of work do you do?
| tossandthrow wrote:
| Various. Notably, some years ago I had a project that
| explored automatic consolidation of ontologies based on
| meta-ontologies and heuristics.
|
| The idea being that everyone has their own ontology for
| the data they release, and the system would build a
| consolidated ontology that could be used for automatic
| integration of data from different data sources.
|
| Regardless, that project did not get traction, so now it
| sits.
| codewithcheese wrote:
| Domain-driven design is well aware that it is not feasible
| to have a single schema for everything; it uses bounded
| contexts. Is there something similar for the semantic web?
| kitsune_ wrote:
| Isn't that the point of RDF / Owl etc.?
| klntsky wrote:
| In the Semantic Web, things like ontologies and namespaces
| play a role similar to bounded contexts in DDD. There's no
| exact equivalent, but these tools help different schemas
| coexist and work together
| maxerickson wrote:
| There is also the problem that structure doesn't guarantee
| meaning.
| wslh wrote:
| Most of the problems of a semantic web are covered in the
| history of Cyc[1].
|
| When I started to use LLMs I thought they were the missing
| link for converting content to semantic representations, even
| taking into account their errors/hallucinations.
|
| [1] https://en.wikipedia.org/wiki/Cyc
| lukev wrote:
| That's only schema.org! Linked data is so much bigger than
| that.
|
| Many ontologies have a "poem" type (for example dbpedia
| (https://dbpedia.org/ontology/Poem) has one), as well as other
| publishing or book-oriented ontologies.
| lolinder wrote:
| Every time I've read up on semantic web it's been treated as
| more or less synonymous with schema.org. Are these other
| ontologies used by anything?
| mdaniel wrote:
| My mental model of that question is: how would anyone know
| if an ontology was used by something? One cannot have a
| search facet in any engine that I'm aware of to search by
| namespace qualified nouns, and markup is only as good as
| the application which is able to understand it
| renonce wrote:
| Looks like a perfect use case for LLMs: generate that JSON-LD
| metadata from the HTML via an LLM, either on the website owner's
| side or on the crawler's. If crawlers do it, website owners
| don't need to do anything to enter the Semantic Web, and
| crawlers can specify whatever metadata format they want to
| extract. This promises an appealing future for Web 3.0: defined
| not by crypto or by metadata, but by LLMs.
| eadmund wrote:
| Embedding data as JSON as program text inside a <script> tag
| inside a tagged data format just seems like such a terrible hack.
| Among other things, it stutters: it repeats information already
| in the document. The microdata approach seems much less insane. I
| don't know if it is recognised nearly as often.
|
| TFA mentions it at the end: 'There is also "microdata." It's very
| simple but I think quite hard to parse out.' I disagree: it's no
| harder to parse than HTML, and one already must parse HTML in
| order to correctly extract JSON-LD from a script tag (yes, one
| can _incorrectly_ parse HTML, and it will work most of the time).
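The point that extracting JSON-LD already forces you to parse HTML can be sketched with the standard library alone (a minimal illustration of my own, not from the thread; the sample document is made up):

```python
# Minimal sketch: pulling JSON-LD blocks out of a page requires a
# real HTML parser first, just as the comment argues. Stdlib only.
import json
from html.parser import HTMLParser

class JSONLDExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        # JSON-LD is embedded as <script type="application/ld+json">
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self.in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_jsonld = False

    def handle_data(self, data):
        # Script contents arrive as raw character data
        if self.in_jsonld:
            self.blocks.append(json.loads(data))

html_doc = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Article", "headline": "Hello"}
</script>
</head><body><p>Hello</p></body></html>"""

parser = JSONLDExtractor()
parser.feed(html_doc)
print(parser.blocks[0]["headline"])  # prints: Hello
```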
| ryukoposting wrote:
| Pardon my naivete, but what exactly is JSON-LD doing that the
| HTML meta tags don't do already? My blog doesn't implement JSON-
| LD but if you link to my blog on popular social media sites, you
| still get a fancy link.
| ttepasse wrote:
| JSON-LD / RDFa and such can use the full type hierarchy of
| schema.org (and other languages) and can build a tree or even a
| graph of data. Meta elements are limited to property/value
| pairs.
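A minimal sketch of that difference (my own example using schema.org types, not from the thread): meta elements can only state flat name/value pairs, while JSON-LD can nest typed objects into a tree.

```json
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "An example headline",
  "author": {
    "@type": "Person",
    "name": "Jane Doe",
    "worksFor": { "@type": "Organization", "name": "Example Corp" }
  }
}
```

Expressing the `author` → `worksFor` relationship with flat `<meta>` name/value pairs would require inventing ad hoc naming conventions.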
| _heimdall wrote:
| Monetization is the elephant in the room in my opinion.
|
| IMDB could easily be a service entirely dedicated to hosting
| movie metadata as RDF or JSON-LD. They need to fund it though,
| and the go-to seems to be advertising and API access. Advertising
| means needing a human-readable UI, not metadata, and if they put
| the data behind an API it's a tough sell to use a standardized and
| potentially limiting format.
| jrochkind1 wrote:
| > Semantic Web information on websites is a bit of a "living
| document". You tend to publish something, then have a look to see
| how people have parsed (or failed to parse) it, and then you try
| to improve it a bit.
|
| Hm.
| physicsguy wrote:
| Semantic web suffers from organisational capture. If there's a
| big org, they get to define the standard at the expense of
| everyone else's use cases.
| gdegani wrote:
| There is a lot of value in Enterprise Knowledge Graphs:
| applying the semantic web standards to the "self-contained"
| world of enterprise data. Many large enterprises are doing it,
| and there is an interesting video from UBS on how they consider
| it a competitive advantage.
| bawolff wrote:
| If this counts as the "semantic web", then <meta
| name="description"... should too.
|
| In which case we have all been on it since the mid 90s.
| PaulHoule wrote:
| It's real RDF. You can process this with RDF tools. Certainly
| do SPARQL queries. Probably add a schema and have valid OWL DL
| and do OWL inference if the data is squeaky clean. Certainly
| use SPIN or Jena rules.
|
| It leans too hard on text and doesn't have enough concepts
| defined as resources but what do you expect, Python didn't have
| a good package manager for decades because 2 + 2 = 3.9 with
| good vibes beats 2 + 2 = 4 with honest work and rigor for too
| many people.
|
| The big trouble I have with RDF tooling is inadequate handling
| of ordered lists. Funny enough 90% of the time or so when you
| have a list you don't care about the order of the items and
| frequently people use a list for things that should have set
| semantics. On the other hand, you have to get the names of the
| authors of a paper in the right order or they'll get mad.
| There's a reasonable way to turn native JSON lists into RDF
| lists
|
| https://www.w3.org/TR/json-ld11/#lists
|
| although unfortunately this uses the slow LISP-style lists with
| O(N) item access and not the fast RDF Containers (rdf:Seq) that
| have O(1) access. (What do you expect from M.I.T.?)
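For illustration, a made-up JSON-LD 1.1 fragment using the `@list` container mentioned above, which preserves author order when the document is converted to RDF:

```json
{
  "@context": {
    "author": { "@id": "http://schema.org/author", "@container": "@list" }
  },
  "author": ["A. First", "B. Second", "C. Third"]
}
```

Without the `@list` container, the array would expand to an unordered set of triples, losing exactly the author ordering the comment worries about.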
|
| The trouble is that SPARQL doesn't support the list operations
| that are widespread in document-based query languages like
|
| https://www.couchbase.com/products/n1ql/
|
| https://docs.arangodb.com/3.11/aql/
|
| or even Postgresql. There is a SPARQL 1.2 which has some nice
| additions like
|
| https://www.w3.org/TR/sparql12-query/#func-triple
|
| The community badly needs a SPARQL 2 that catches up to
| today's query languages, but the semantic web community has been
| so burned by pathological standards processes that anyone who
| can think rigorously or code their way out of a paper bag won't
| go near it.
|
| A substantial advantage of RDF is that properties live in
| namespaces so if you want to add a new property you can do it
| and never stomp on anybody else's property. Tools that don't
| know about those properties can just ignore them, but SPARQL,
| RDFS and all that ought to "just work" though OWL takes some
| luck. That's got a downside too which is that adding namespaces
| to a system seems to reduce adoption by 80% in many cases
| because too many people think it's useless and too hard to
| understand.
| bawolff wrote:
| My point is that even if it's technically RDF, if all anyone
| does is use a few specific properties from a closed, pre-agreed
| schema, we might as well just be using meta tags.
| PaulHoule wrote:
| But there's the question of who is responsible for it and
| who sets the standards. These days the consortium behind
| HTML 5 is fairly quick and responsive compared to the W3C's
| HTML activity in the day (e.g. fight with a standards
| process for a few months as opposed to "talk to the hand")
| but schema.org can evolve without any of that.
|
| If there's anything that sucks today it is that people feel
| they have to add all kinds of markup for different vendors
| (such as Facebook's Open Graph). I remember the Semweb folks
| who didn't think it was a problem that my pages had about
| 20k of visible markup and 150k of repeated semantic markup.
| It's like the folks who don't mind that an article with 5k
| worth of text has 50M worth of Javascript, ads, trackers
| and other junk.
|
| On the other hand I have no trouble turning
|
|     <meta name="description" content="A brief description
|     of your webpage content.">
|
| into
|
|     @prefix meta: <http://example.com/my/name/space> .
|     <http://example.com/some/web/page> meta:description
|     "A brief description of your webpage content." .
|
| where meta: is some namespace I made up, if I want to access
| it with RDF tools without making you do anything.
| ChrisMarshallNY wrote:
| I suspect that AI training data standards will make this much
| more prevalent.
|
| Just today, I am working on an experimental training/consuming
| app pair. The training part will leverage JSON data from a
| backend I designed.
| taeric wrote:
| It is hilarious to see namespaces trying to creep into JSON.
|
| I do wonder how any of this is better than using the meta tags of
| the html, though? Especially for such use cases as the preview.
| Seems the only thing that isn't really there for the preview is
| the image? (Well, title would come from the title tag, but
| still...)
| esbranson wrote:
| Arguing against standard vocabularies (part of the Semantic Web)
| is like arguing against standard libraries. "Cool story bro."
|
| But it is true, if you can't make sense of your data, then the
| Semantic Web probably isn't for you. (It's the least of your
| problems.)
| rchaud wrote:
| > Googlers, if you're reading this, JSON-LD could have the same
| level of public awareness as RSS if only you could release, and
| then shut down, some kind of app or service in this area. Please,
| for the good of the web: consider it.
|
| Google has been pushing JSON-LD to webmasters for better SEO for
| at least 5 years, if not more:
| https://developers.google.com/search/docs/appearance/structu...
|
| There really isn't a need to do it as most of the relevant page
| metadata is already captured as part of the Open Graph
| protocol[0] that Twitter and Facebook popularized 10+ years ago
| as webmasters were attempting to set up rich link previews for
| URLs posted to those networks. Markup like this:
|
| <meta property="og:type" content="video.movie" />
|
| is common on most sites now, so what benefit is there in doing
| the additional work to generate JSON-LD with the same data?
|
| [0]https://ogp.me/
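For comparison, a hypothetical JSON-LD snippet carrying roughly the same fact as that Open Graph tag, using the schema.org Movie type (the title is made up):

```json
{
  "@context": "https://schema.org",
  "@type": "Movie",
  "name": "An Example Movie"
}
```

This duplication of the same fact in two formats is the extra work the comment is questioning.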
| weego wrote:
| 'It makes social sharing look a bit nicer' being the only
| benefit that can be scraped from the barrel undermines the
| entire premise.
|
| It's not widely adopted; it's used as an attempted growth hack
| in a few places that may or may not be of use (with the value
| being relative to how US-centric your and your audience's
| internet use is).
| pablomendes wrote:
| That statement is both kind of true and, well, revisionist.
| Originally there was a strong focus on logics, clean
| comprehensive modeling of the world through large complicated
| ontologies, and the adoption of super impractical representation
| languages, etc. It wasn't until rebellious sub-communities went
| rogue and pushed for pragmatic simplifications that any of it
| had widespread impact at all. So here's to the crazy ones, I
| guess.
| jgalt212 wrote:
| My fear around JSON-LD is that too much of our content will end up on
| a SERP, and we'll attract less traffic.
___________________________________________________________________
(page generated 2024-08-21 23:01 UTC)