[HN Gopher] Wuffs the Language
___________________________________________________________________
Wuffs the Language
Author : bshanks
Score : 85 points
Date : 2021-04-07 21:07 UTC (1 days ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| grawprog wrote:
| To be honest, I'm not sure what to make of this. Wuff the library
| makes sense as a drop in for the C standard library, but the
| language, I'm not sure how it fits.
|
| It seems to offer some of the features offered by languages like
| D and Rust, while staying more C like, but also removing one of
| the few actual reasons to use C, which both D and Rust also
| provide on top of the other features offered by Wuff.
|
| It's cool and all but it seems confused as to whether it wants to
| be a library for C, an extension to C or a standalone language.
| As a stand alone language, I'm not sure I really see the benefits
| over alternatives as a C library, it does have some interesting
| ideas.
| zokier wrote:
| The readme has more explanation:
|
| > Wuffs (Wrangling Untrusted File Formats Safely) is formerly
| known as Puffs (Parsing Untrusted File Formats Safely).
|
| > Wuffs is a memory-safe programming language (and a standard
| library written in that language) for wrangling untrusted file
| formats safely. Wrangling includes parsing, decoding and
| encoding. Example file formats include images, audio, video,
| fonts and compressed archives.
|
| > Wuffs is not a general purpose programming language. It is
| for writing libraries, not programs. The idea isn't to write
| your whole program in Wuffs, only the parts that are both
| performance-conscious and security-conscious. For example,
| while technically possible, it is unlikely that a Wuffs
| compiler would be worth writing entirely in Wuffs.
| chrchang523 wrote:
| The purpose of the language is clear to me: make it practical
| to prove that >90% of a C file-munging library is safe from
| several common types of errors. I will be looking into
| reorganizing some of my existing C library code of this type
| into large Wuffs components and small interaction-with-outside-
| world components.
| pnathan wrote:
| This is a fascinating spin: a pure language, designed for
| _libraries_ , not for complete programs. A tip of the hat to
| whoever was able to break out of the "a language has to do x y
| and z" thinking and perceive that this is a possibility.
| handsaway wrote:
| Arguably Elm takes this approach to SPA development. The
| language spec itself is general purpose but in practice it's
| been developed to serve a narrow purpose (too narrow for some!)
| tomthe wrote:
| Yes, an interesting concept. But I guess it won't see a wide
| adoption: Who wants to learn a new language, if you can't use
| it for anything but X? And who will use this language for X, if
| you haven't been able to learn it while doing Y?
| brundolf wrote:
| I disagree. It's just a question of how many people are
| trying to do what it's designed for, and how much benefit it
| provides for that focused domain.
| enneff wrote:
| People who want to do X well will learn the language. You can
| learn more than one language.
| dkersten wrote:
| Who would want to learn Javascript for the frontend and
| Python/Ruby/C#/Java/PHP/Whatever for the backend? And HTML
| for UI with CSS for styling?
|
| You say this, but people do it all the time.
| dkersten wrote:
| Interestingly, I was thinking about this exact thing a few
| hours ago, before I saw this on HN. The thought was that you
| can design much more interesting languages by not making them
| be everything for everyone. Not quite domain specific, but also
| not quite general purpose.
|
| I made a language a while back that was used to implement
| custom logic for a product (I've since replaced it with a more
| declarative system that's basically TOML but where values can
| be expressions that get evaluated to generate the actual
| values). One goal of this language was that it should always
| terminate[1], so it had no unbounded loops. Another goal was
| that it should be deterministic, so all input was gathered
| before execution and all output was accumulated to be processed
| at the end. The entire thing ran in a database transaction (so
| input could be queried, then the code was executed as a pure
| function of this input, then the result would be written back
| to the database or sent elsewhere). Externally triggered events
| would cause this to run. Essentially an event driven
| synchronous[2] language of the transformational system variety.
| It was slightly inspired by Lustre[3], which is used in
| critical systems like aeroplanes, trains and power plants. I'm
| a big fan of this style of language.
|
| Basically, by constraining what the language can do or can be
| used for, you can design much more powerful semantics or
| language features for the things that it is designed for,
| similar to a domain specific language. I guess it really would
| be a somewhat more general domain specific language, or at
| least domain specific to multiple domains.
|
| I was thinking about this while walking home from the shops and
| was wondering if such a language would be beneficial to solve
| some challenges I hit in my work and I was going to spend some
| time thinking what semantics would be useful, but haven't done
| so yet and then came across this HN submission. :)
|
| [1] I was also thinking about how the halting problem doesn't
| really say that determining if a program halts is impossible,
| just that there are programs that are not computable. If you
| add constraints (like not being able to feed the program to
| itself as is done in the halting problem and not allowing
| unbounded loops) then it is possible to determine if a program
| will terminate or not.
|
| [2]
| https://en.wikipedia.org/wiki/Synchronous_programming_langua...
|
| [3] https://en.wikipedia.org/wiki/Lustre_(programming_language)
| lancepioch wrote:
| So a similar sense to Haxe: https://haxe.org
| chubot wrote:
| Wuffs seems fascinating and I really wanted to like it. But when
| I look at the code for the JSON decoder it seems so low level,
| and full of places for bugs to hide. JSON is a pretty simple spec
| and this obscures it (although to be fair it's also handling
| UTF-8).
|
| https://github.com/google/wuffs/blob/main/std/json/decode_js...
|
| Yes it prevents buffer overflows and integer overflow, but it
| can't prevent logical errors.
|
| I'd rather see efficient code generated from a short high level
| spec, not an overwhelming amount of detail in a language verified
| along a few dimensions.
|
| ---
|
| Logical errors in parsing also lead to security vulnerabilities.
| For example, here is an example of parser differentials in HTTP
| parsing:
|
| https://about.gitlab.com/blog/2020/03/30/how-to-exploit-pars...
|
| The canonical example of this class of bug is forging SSL
| certificates to take advantage of buggy parsers, but I don't have
| a link handy. There should be one off of https://langsec.org/ if
| anyone can help dig it up.
|
| Again, this has nothing to do with buffer or integer overflows.
|
| (aside: while googling for that I found the claim that mRNA
| vaccines work by parser differentials:
| https://twitter.com/maradydd/status/1342891437537505280?lang...
| If anyone understands that I'd be curious on an opinion/analysis
| :) )
|
| At the very least, any language for parsing should include
| support for regular languages (regexes). The RFCs for many
| network protocols use this metalanguage, and there's no reason it
| shouldn't be executable. They compile easily to efficient code.
|
| The VPRI project claimed to generate a TCP/IP implementation from
| 200 lines of code, although it's not really a fair comparison
| because it hasn't been tested in the wild:
| https://news.ycombinator.com/item?id=846028 .
|
| Still I think that style has better engineering properties. Oil's
| lexer, which understands essentially all of bash, is generated
| from a short source file
|
| https://www.oilshell.org/release/0.8.8/source-code.wwz/front...
|
| which generates
|
| https://www.oilshell.org/release/0.8.8/source-code.wwz/_devb...
|
| which goes on to generate 28,000 lines of C code. It's short, but
| it really needs a better regex metalanguage to be readable:
| https://www.oilshell.org/release/latest/doc/eggex.html
|
| A large part of JSON can be described by regular languages, and
| same with HTTP, etc.
|
| -----
|
| edit: An re2c target for wuffs could make sense. The generated
| code already doesn't allocate any memory, although it uses tons
| of pointers which could be dangling.
|
| And in fact that was a problem Cloudflare, which sprayed the user
| data of their customers all over the Internet back in 2017:
| https://en.wikipedia.org/wiki/Cloudbleed
|
| That was with Ragel and not re2c, which perhaps has a more error
| prone API.
| summerlight wrote:
| > I'd rather see efficient code generated from a short high
| level spec
|
| This is a holy grail for many PL researchers, but I don't think
| that there's any languages that reached this level of
| sophistication with expressiveness/practicality enough for
| production usages. At least with the status quo, you will
| probably need to write a massive amount of formal proofs if you
| want logical correctness, even with deceivingly simple
| specifications.
| chubot wrote:
| It doesn't have to be the same metalanguage for every
| program. You can write your own code generators adapted to
| the specific problems. They should have a "pit of success",
| and the knowledge of the domain is used to ensure that.
|
| There are few research-level issues here; it's just good
| engineering.
|
| The 0% or 100% mindset is bad engineering. You want something
| that's short, and that you can explain to other people, and
| that other people can write an independent implementation of.
| If the proof is 10x longer than normal code, and it's written
| in a metalanguage that the relevant people don't know, then
| it's not very useful.
|
| Proofs are not guarantees. CompCERT has had logic bugs
| despite being written in a formal language. (It does reduce
| the number of bugs drastically in general, but it's also an
| extremely expensive technique, and not what I'm advocating.)
|
| There are no guarantees in engineering, just good practices.
| Groveling through bytes one at a time in imperative languages
| is not an ideal engineering practice, even if the imperative
| language comes with more guarantees than most.
| dmitriid wrote:
| > JSON is a pretty simple spec a
|
| Yet there are no proper implementations because it's too
| simple, sometimes ambiguous, and there are several standards.
| JSON parsing is a minefield, http://seriot.ch/parsing_json.php
| oconnor663 wrote:
| > There is no operator precedence. A bare a * b + c is an invalid
| expression. You must explicitly write either (a * b) + c or a *
| (b + c).
|
| Honestly I've often wished for this in mainstream languages. It
| seems like operator precedence should go the way of bracketless
| if and implicit int casts. (Though I wonder if they wind up
| making exceptions here for chains of method calls? I guess
| technically those rely on operator precedence sort of?)
|
| Edit: Yeah I see the example code has "args.src.read_u8?()". So
| it looks like they figured out how to keep the good stuff.
| nestorD wrote:
| Me too! So far I have seen four actual bugs in large numerical
| code bases that were caused by overlooking operator precedence.
| I expect to see more in the years to come.
|
| I think that precedence of '*' over '+' is acceptable (as
| everyone knows it instinctively) but I would love a way to
| require parenthesis for everything else.
| sleepydog wrote:
| APL (and, its derivatives, I think) evaluate strictly right to
| left, so
|
| a * b + c
|
| is a * (b + c). It might be jarring at first but I really came
| to enjoy the consistency, I never had to remember operator
| precedence, which helps in a language like APL where most
| functions are infix.
| tragomaskhalos wrote:
| Conversely, Smalltalk is left-to-right, so
|
| a + b * c is (a + b) * c
|
| which is simply a result of every operation being a message
| send - muddling the rules with precedence would be likewise
| confusing, and would ruin the simplicity of the grammar.
| puzzlingcaptcha wrote:
| Have there been attempts at creating languages that use a
| postfix (RPN) notation?
| benhoyt wrote:
| Forth is (I think?) the oldest and most well-known.
| Postscript, the printer control language, is possibly more
| widely-deployed. And Factor is a modern take on Forth.
| robobro wrote:
| like forth?
| mekkkkkk wrote:
| Yes please! I'm always using parenthesis for every compound
| expression, and I've heard so many times from coworkers or code
| reviewers smuggly going "you know you can skip that, right?".
| At the same time I've heard the same people having discussions
| and scratching their heads about precedence in some attempt to
| code golf their way through a feature. Not to mention bugs
| caused by incorrect assumptions. Or pausing to figure out what
| some previously written expression actually does. Meanwhile,
| I'll gladly write `X + (Y / Z)`. You can thank me later.
| wuschel wrote:
| LISP-like languages have enforced operator precedence due to
| polish notation e.g. (+ (* a b) (+ c d))
| sqrt17 wrote:
| there's no operator precedence if you don't have (multiple)
| operators that could precede each other. In LISP-like
| languages these are simply functions (or more correctly,
| forms) which have other expressions as arguments, like any
| other functions or forms. LISP works just fine without much
| of the things we take for granted in ALGOL-like languages.
| aidenn0 wrote:
| In addition the variadic prefix-notation means the operators
| are not limited to being binary: 3*x*y*z+w
|
| becomes: (+ (* 3 x y z) w)
| brundolf wrote:
| I tend to think it's fine for the very most common and obvious
| operators (MDAS, etc), but as soon as you get outside of those
| I agree. In particular I've been bitten by the precedence of
| JavaScript's ?? operator: function foo(a) {
| return a ?? 10 + " is the num"; // a ?? (10 + " is the num")
| } foo(12) // 12
| dang wrote:
| Surprisingly little discussed so far, aside from these past
| related threads:
|
| _Wuffs' PNG image decoder_ -
| https://news.ycombinator.com/item?id=26714831 - April 2021 (135
| comments)
|
| _C performance mystery: delete unused string constant_ -
| https://news.ycombinator.com/item?id=23633583 - June 2020 (105
| comments)
|
| That first one was just yesterday but this is a rare case where
| we would not downweight the follow-up post (https://hn.algolia.co
| m/?dateRange=all&page=0&prefix=true&sor...).
| zokier wrote:
| Apparently the language was renamed at some point, google/puffs
| redirects to google/wuffs?
|
| Puffs was discussed few years back
| https://news.ycombinator.com/item?id=15711767
| teraflop wrote:
| Yeah, I was curious about that as well. The README file links
| to a Google Groups discussion about the name change that
| seems to have been memory-holed, but apparently it was
| renamed to avoid confusion with a NetBSD component:
| https://news.ycombinator.com/item?id=15712659
| pjmlp wrote:
| Thankfully they changed the name, in Germany it would be
| quite impossible to use it in any public discussion, as Der
| Puff is a special kind of boys club.
| mjevans wrote:
| The only part of the Wuffs spec I just read that I dislike:
|
| Strings. I would really prefer strings to work like existing C
| and 'bash' style quoting. At least the simple aspects of it, the
| parts of the rules that are easy to remember and simple. A string
| should always be a sequence of octets, but easily coerced by a
| casting operator to a numeric format from any index. I'm not sure
| what the syntax for that would be offhand.
| rattray wrote:
| Yeah, when I got to the bottom and saw that they don't have
| strings, I immediately decided not to spend any more time
| learning about the language.
|
| I'm not sure what I would want them to be like for this
| language, but I'd definitely want something.
___________________________________________________________________
(page generated 2021-04-08 23:00 UTC)