[HN Gopher] Wuffs the Language
       ___________________________________________________________________
        
       Wuffs the Language
        
       Author : bshanks
       Score  : 85 points
       Date   : 2021-04-07 21:07 UTC (1 days ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | grawprog wrote:
       | To be honest, I'm not sure what to make of this. Wuffs the library
       | makes sense as a drop in for the C standard library, but the
       | language, I'm not sure how it fits.
       | 
       | It seems to offer some of the features offered by languages like
       | D and Rust, while staying more C like, but also removing one of
       | the few actual reasons to use C, which both D and Rust also
       | provide on top of the other features offered by Wuffs.
       | 
       | It's cool and all but it seems confused as to whether it wants to
       | be a library for C, an extension to C or a standalone language.
       | As a standalone language, I'm not sure I really see the benefits
       | over alternatives. As a C library, it does have some interesting
       | ideas.
        
         | zokier wrote:
         | The readme has more explanation:
         | 
         | > Wuffs (Wrangling Untrusted File Formats Safely) is formerly
         | known as Puffs (Parsing Untrusted File Formats Safely).
         | 
         | > Wuffs is a memory-safe programming language (and a standard
         | library written in that language) for wrangling untrusted file
         | formats safely. Wrangling includes parsing, decoding and
         | encoding. Example file formats include images, audio, video,
         | fonts and compressed archives.
         | 
         | > Wuffs is not a general purpose programming language. It is
         | for writing libraries, not programs. The idea isn't to write
         | your whole program in Wuffs, only the parts that are both
         | performance-conscious and security-conscious. For example,
         | while technically possible, it is unlikely that a Wuffs
         | compiler would be worth writing entirely in Wuffs.
        
         | chrchang523 wrote:
         | The purpose of the language is clear to me: make it practical
         | to prove that >90% of a C file-munging library is safe from
         | several common types of errors. I will be looking into
         | reorganizing some of my existing C library code of this type
         | into large Wuffs components and small interaction-with-outside-
         | world components.
        
       | pnathan wrote:
       | This is a fascinating spin: a pure language, designed for
       | _libraries_ , not for complete programs. A tip of the hat to
       | whoever was able to break out of the "a language has to do x y
       | and z" thinking and perceive that this is a possibility.
        
         | handsaway wrote:
         | Arguably Elm takes this approach to SPA development. The
         | language spec itself is general purpose but in practice it's
         | been developed to serve a narrow purpose (too narrow for some!)
        
         | tomthe wrote:
         | Yes, an interesting concept. But I guess it won't see wide
         | adoption: Who wants to learn a new language, if you can't use
         | it for anything but X? And who will use this language for X, if
         | you haven't been able to learn it while doing Y?
        
           | brundolf wrote:
           | I disagree. It's just a question of how many people are
           | trying to do what it's designed for, and how much benefit it
           | provides for that focused domain.
        
           | enneff wrote:
           | People who want to do X well will learn the language. You can
           | learn more than one language.
        
           | dkersten wrote:
           | Who would want to learn Javascript for the frontend and
           | Python/Ruby/C#/Java/PHP/Whatever for the backend? And HTML
           | for UI with CSS for styling?
           | 
           | You say this, but people do it all the time.
        
         | dkersten wrote:
         | Interestingly, I was thinking about this exact thing a few
         | hours ago, before I saw this on HN. The thought was that you
         | can design much more interesting languages by not making them
         | be everything for everyone. Not quite domain specific, but also
         | not quite general purpose.
         | 
         | I made a language a while back that was used to implement
         | custom logic for a product (I've since replaced it with a more
         | declarative system that's basically TOML but where values can
         | be expressions that get evaluated to generate the actual
         | values). One goal of this language was that it should always
         | terminate[1], so it had no unbounded loops. Another goal was
         | that it should be deterministic, so all input was gathered
         | before execution and all output was accumulated to be processed
         | at the end. The entire thing ran in a database transaction (so
         | input could be queried, then the code was executed as a pure
         | function of this input, then the result would be written back
         | to the database or sent elsewhere). Externally triggered events
         | would cause this to run. Essentially an event driven
         | synchronous[2] language of the transformational system variety.
         | It was slightly inspired by Lustre[3], which is used in
         | critical systems like aeroplanes, trains and power plants. I'm
         | a big fan of this style of language.
         | 
         | Basically, by constraining what the language can do or can be
         | used for, you can design much more powerful semantics or
         | language features for the things that it is designed for,
         | similar to a domain specific language. I guess it really would
         | be a somewhat more general domain specific language, or at
         | least domain specific to multiple domains.
         | 
         | I was thinking about this while walking home from the shops and
         | was wondering if such a language would be beneficial to solve
         | some challenges I hit in my work and I was going to spend some
         | time thinking what semantics would be useful, but haven't done
         | so yet and then came across this HN submission. :)
         | 
         | [1] I was also thinking about how the halting problem doesn't
         | really say that determining if a program halts is impossible,
         | just that no single algorithm can decide it for every program.
         | If you add constraints (like not being able to feed the program
         | to itself as is done in the halting problem and not allowing
         | unbounded loops) then it is possible to determine if a program
         | will terminate or not.
         | 
         | [2]
         | https://en.wikipedia.org/wiki/Synchronous_programming_langua...
         | 
         | [3] https://en.wikipedia.org/wiki/Lustre_(programming_language)
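The execution model described in this comment (gather all input first, run the logic as a pure function with only bounded loops, accumulate output to apply at the end) can be sketched roughly as follows. This is a minimal illustration in Python, not the commenter's language: `run_event`, `restock`, and the snapshot keys are hypothetical names, and Python itself does not enforce the "no unbounded loops" rule, so here it is only a convention.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration only.
Inputs = Dict[str, int]
Outputs = List[Tuple[str, int]]

def run_event(handler: Callable[[Inputs, Outputs], None], snapshot: Inputs) -> Outputs:
    """Run one event handler as a pure function of a frozen input snapshot."""
    frozen = dict(snapshot)   # all input is gathered before execution
    outputs: Outputs = []     # all output is accumulated and applied afterwards
    handler(frozen, outputs)
    return outputs

def restock(inputs: Inputs, out: Outputs) -> None:
    # Only bounded iteration: the trip count is fixed before the loop starts,
    # so this handler always terminates.
    for shelf in range(inputs["shelves"]):
        if inputs["stock"] < inputs["minimum"]:
            out.append(("order", shelf))

snapshot = {"shelves": 3, "stock": 1, "minimum": 5}
result = run_event(restock, snapshot)
print(result)
```

Because the handler only sees the snapshot and only writes to the accumulator, running it twice on the same snapshot gives the same output, which is the determinism property the comment is after.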
        
       | lancepioch wrote:
       | So a similar sense to Haxe: https://haxe.org
        
       | chubot wrote:
       | Wuffs seems fascinating and I really wanted to like it. But when
       | I look at the code for the JSON decoder it seems so low level,
       | and full of places for bugs to hide. JSON is a pretty simple spec
       | and this obscures it (although to be fair it's also handling
       | UTF-8).
       | 
       | https://github.com/google/wuffs/blob/main/std/json/decode_js...
       | 
       | Yes it prevents buffer overflows and integer overflow, but it
       | can't prevent logical errors.
       | 
       | I'd rather see efficient code generated from a short high level
       | spec, not an overwhelming amount of detail in a language verified
       | along a few dimensions.
       | 
       | ---
       | 
       | Logical errors in parsing also lead to security vulnerabilities.
       | For example, here is an example of parser differentials in HTTP
       | parsing:
       | 
       | https://about.gitlab.com/blog/2020/03/30/how-to-exploit-pars...
       | 
       | The canonical example of this class of bug is forging SSL
       | certificates to take advantage of buggy parsers, but I don't have
       | a link handy. There should be one off of https://langsec.org/ if
       | anyone can help dig it up.
       | 
       | Again, this has nothing to do with buffer or integer overflows.
       | 
       | (aside: while googling for that I found the claim that mRNA
       | vaccines work by parser differentials:
       | https://twitter.com/maradydd/status/1342891437537505280?lang...
       | If anyone understands that I'd be curious on an opinion/analysis
       | :) )
       | 
       | At the very least, any language for parsing should include
       | support for regular languages (regexes). The RFCs for many
       | network protocols use this metalanguage, and there's no reason it
       | shouldn't be executable. They compile easily to efficient code.
       | 
       | The VPRI project claimed to generate a TCP/IP implementation from
       | 200 lines of code, although it's not really a fair comparison
       | because it hasn't been tested in the wild:
       | https://news.ycombinator.com/item?id=846028 .
       | 
       | Still I think that style has better engineering properties. Oil's
       | lexer, which understands essentially all of bash, is generated
       | from a short source file
       | 
       | https://www.oilshell.org/release/0.8.8/source-code.wwz/front...
       | 
       | which generates
       | 
       | https://www.oilshell.org/release/0.8.8/source-code.wwz/_devb...
       | 
       | which goes on to generate 28,000 lines of C code. It's short, but
       | it really needs a better regex metalanguage to be readable:
       | https://www.oilshell.org/release/latest/doc/eggex.html
       | 
       | A large part of JSON can be described by regular languages, and
       | same with HTTP, etc.
       | 
       | -----
       | 
       | edit: An re2c target for wuffs could make sense. The generated
       | code already doesn't allocate any memory, although it uses tons
       | of pointers which could be dangling.
       | 
       | And in fact that was a problem for Cloudflare, which sprayed the
       | user data of their customers all over the Internet back in 2017:
       | https://en.wikipedia.org/wiki/Cloudbleed
       | 
       | That was with Ragel and not re2c, which perhaps has a more error
       | prone API.
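To make the "a large part of JSON can be described by regular languages" point concrete, here is a minimal sketch of a JSON tokenizer driven by a single regular expression, written with Python's `re` module rather than re2c or Ragel. The names `TOKEN_RE` and `tokenize` are made up for this example. Only the lexical layer is regular; matching nested brackets still needs a parser on top.

```python
import re

# One alternative per JSON token class. A sketch, not a validated lexer.
TOKEN_RE = re.compile(r"""
    (?P<ws>      [ \t\n\r]+ )
  | (?P<punct>   [{}\[\],:] )
  | (?P<literal> true | false | null )
  | (?P<number>  -? (?:0|[1-9][0-9]*) (?:\.[0-9]+)? (?:[eE][+-]?[0-9]+)? )
  | (?P<string>  " (?: [^"\\\x00-\x1f] | \\["\\/bfnrt] | \\u[0-9a-fA-F]{4} )* " )
""", re.VERBOSE)

def tokenize(text):
    """Split JSON text into (kind, lexeme) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"bad JSON token at offset {pos}")
        if m.lastgroup != "ws":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()  # every alternative consumes at least one character
    return tokens

tokens = tokenize('{"a": [1, -2.5e3, true]}')
print(tokens)
```

The token definitions fit in a handful of lines and can be read against the grammar in the JSON spec, which is the engineering property being argued for here.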
        
         | summerlight wrote:
         | > I'd rather see efficient code generated from a short high
         | level spec
         | 
         | This is a holy grail for many PL researchers, but I don't think
         | that there are any languages that have reached this level of
         | sophistication with enough expressiveness/practicality for
         | production usage. At least with the status quo, you will
         | probably need to write a massive amount of formal proofs if you
         | want logical correctness, even with deceptively simple
         | specifications.
        
           | chubot wrote:
           | It doesn't have to be the same metalanguage for every
           | program. You can write your own code generators adapted to
           | the specific problems. They should have a "pit of success",
           | and the knowledge of the domain is used to ensure that.
           | 
           | There are few research-level issues here; it's just good
           | engineering.
           | 
           | The 0% or 100% mindset is bad engineering. You want something
           | that's short, and that you can explain to other people, and
           | that other people can write an independent implementation of.
           | If the proof is 10x longer than normal code, and it's written
           | in a metalanguage that the relevant people don't know, then
           | it's not very useful.
           | 
           | Proofs are not guarantees. CompCert has had logic bugs
           | despite being written in a formal language. (It does reduce
           | the number of bugs drastically in general, but it's also an
           | extremely expensive technique, and not what I'm advocating.)
           | 
           | There are no guarantees in engineering, just good practices.
           | Groveling through bytes one at a time in imperative languages
           | is not an ideal engineering practice, even if the imperative
           | language comes with more guarantees than most.
        
         | dmitriid wrote:
         | > JSON is a pretty simple spec
         | 
         | Yet there are no proper implementations because it's too
         | simple, sometimes ambiguous, and there are several standards.
         | JSON parsing is a minefield, http://seriot.ch/parsing_json.php
        
       | oconnor663 wrote:
       | > There is no operator precedence. A bare a * b + c is an invalid
       | expression. You must explicitly write either (a * b) + c or a *
       | (b + c).
       | 
       | Honestly I've often wished for this in mainstream languages. It
       | seems like operator precedence should go the way of bracketless
       | if and implicit int casts. (Though I wonder if they wind up
       | making exceptions here for chains of method calls? I guess
       | technically those rely on operator precedence sort of?)
       | 
       | Edit: Yeah I see the example code has "args.src.read_u8?()". So
       | it looks like they figured out how to keep the good stuff.
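As a rough illustration of the "no operator precedence" rule quoted above, the following Python sketch rejects any expression in which two different binary operators appear at the same parenthesis depth. It is a checker, not a parser, and it is not how the Wuffs compiler works: `check_no_precedence` is a hypothetical name, unary minus is not handled, and accepting chains of the same operator (such as `a + b + c`) is an assumption of this sketch.

```python
import re

BINARY_OPS = {"+", "-", "*", "/"}

def check_no_precedence(expr):
    """Raise ValueError if two different binary operators share a nesting level."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|[-+*/()]", expr)
    seen = [None]  # stack: the operator seen so far at each nesting level
    for tok in tokens:
        if tok == "(":
            seen.append(None)
        elif tok == ")":
            if len(seen) == 1:
                raise ValueError("unbalanced ')'")
            seen.pop()
        elif tok in BINARY_OPS:
            if seen[-1] is not None and seen[-1] != tok:
                raise ValueError(
                    f"mixing {seen[-1]!r} and {tok!r} needs parentheses")
            seen[-1] = tok
    if len(seen) != 1:
        raise ValueError("unbalanced '('")

check_no_precedence("(a * b) + c")   # accepted
check_no_precedence("a * (b + c)")   # accepted
```

Calling `check_no_precedence("a * b + c")` raises `ValueError`, mirroring the quoted rule that the bare form is invalid.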
        
         | nestorD wrote:
         | Me too! So far I have seen four actual bugs in large numerical
         | code bases that were caused by overlooking operator precedence.
         | I expect to see more in the years to come.
         | 
         | I think that precedence of '*' over '+' is acceptable (as
         | everyone knows it instinctively) but I would love a way to
         | require parentheses for everything else.
        
         | sleepydog wrote:
         | APL (and its derivatives, I think) evaluates strictly right to
         | left, so
         | 
         | a * b + c
         | 
         | is a * (b + c). It might be jarring at first but I really came
         | to enjoy the consistency: I never had to remember operator
         | precedence, which helps in a language like APL where most
         | functions are infix.
        
           | tragomaskhalos wrote:
           | Conversely, Smalltalk is left-to-right, so
           | 
           | a + b * c is (a + b) * c
           | 
           | which is simply a result of every operation being a message
           | send - muddling the rules with precedence would be likewise
           | confusing, and would ruin the simplicity of the grammar.
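The two fixed evaluation orders mentioned in this subthread (APL's right-to-left and Smalltalk's left-to-right) can be compared with a small Python sketch. It works on a pre-split flat token list with no parentheses; `eval_left_to_right` and `eval_right_to_left` are illustrative names, not anything from APL or Smalltalk.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def eval_left_to_right(tokens):
    """Smalltalk-style grouping: a + b * c is (a + b) * c."""
    result = tokens[0]
    for i in range(1, len(tokens), 2):
        result = OPS[tokens[i]](result, tokens[i + 1])
    return result

def eval_right_to_left(tokens):
    """APL-style grouping: a + b * c is a + (b * c), a * b + c is a * (b + c)."""
    result = tokens[-1]
    for i in range(len(tokens) - 2, 0, -2):
        result = OPS[tokens[i]](tokens[i - 1], result)
    return result

expr = [2, "+", 3, "*", 4]
print(eval_left_to_right(expr))   # (2 + 3) * 4 = 20
print(eval_right_to_left(expr))   # 2 + (3 * 4) = 14
```

Neither function needs a precedence table: the grouping follows from the direction of the scan alone, which is the consistency both commenters describe.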
        
         | puzzlingcaptcha wrote:
         | Have there been attempts at creating languages that use a
         | postfix (RPN) notation?
        
           | benhoyt wrote:
           | Forth is (I think?) the oldest and most well-known.
           | Postscript, the printer control language, is possibly more
           | widely-deployed. And Factor is a modern take on Forth.
        
           | robobro wrote:
           | Like Forth?
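For readers who have not seen postfix notation, here is a minimal stack-based evaluator in Python. It only shows the idea behind Forth-style evaluation (operands are pushed, operators pop their arguments); `eval_rpn` is a made-up name and this is not a Forth implementation.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def eval_rpn(source):
    """Evaluate a whitespace-separated postfix expression."""
    stack = []
    for word in source.split():
        if word in OPS:
            right = stack.pop()   # the most recently pushed operand
            left = stack.pop()
            stack.append(OPS[word](left, right))
        else:
            stack.append(float(word))
    if len(stack) != 1:
        raise ValueError("malformed expression")
    return stack[0]

print(eval_rpn("3 4 * 5 +"))   # (3 * 4) + 5 = 17.0
print(eval_rpn("3 4 5 + *"))   # 3 * (4 + 5) = 27.0
```

As with the fixed-direction schemes above, grouping is determined entirely by the order of the words, so neither precedence rules nor parentheses are needed.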
        
         | mekkkkkk wrote:
         | Yes please! I'm always using parentheses for every compound
         | expression, and I've heard so many times from coworkers or code
         | reviewers smugly going "you know you can skip that, right?".
         | At the same time I've heard the same people having discussions
         | and scratching their heads about precedence in some attempt to
         | code golf their way through a feature. Not to mention bugs
         | caused by incorrect assumptions. Or pausing to figure out what
         | some previously written expression actually does. Meanwhile,
         | I'll gladly write `X + (Y / Z)`. You can thank me later.
        
         | wuschel wrote:
         | LISP-like languages have enforced operator precedence due to
         | Polish notation, e.g. (+ (* a b) (+ c d))
        
           | sqrt17 wrote:
           | there's no operator precedence if you don't have (multiple)
           | operators that could precede each other. In LISP-like
           | languages these are simply functions (or more correctly,
           | forms) which have other expressions as arguments, like any
           | other functions or forms. LISP works just fine without much
           | of the things we take for granted in ALGOL-like languages.
        
           | aidenn0 wrote:
           | In addition the variadic prefix-notation means the operators
           | are not limited to being binary:
           | 
           |     3*x*y*z+w
           | 
           | becomes:
           | 
           |     (+ (* 3 x y z) w)
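A small Python sketch of variadic prefix evaluation, using nested tuples in place of s-expressions. `eval_prefix` and the tuple encoding are assumptions made for this example; real Lisps differ in many details.

```python
import operator
from functools import reduce

# Each operator maps to its binary function and its identity element,
# so (+) with no arguments is 0 and (*) with no arguments is 1.
PREFIX_OPS = {"+": (operator.add, 0), "*": (operator.mul, 1)}

def eval_prefix(form, env):
    """Evaluate a nested-tuple form such as ("+", ("*", 3, "x"), "w")."""
    if isinstance(form, tuple):
        op, *args = form
        fn, identity = PREFIX_OPS[op]
        values = [eval_prefix(arg, env) for arg in args]
        return reduce(fn, values, identity)   # any number of operands
    if isinstance(form, str):
        return env[form]                      # variable lookup
    return form                               # numeric literal

env = {"x": 2, "y": 5, "z": 7, "w": 1}
form = ("+", ("*", 3, "x", "y", "z"), "w")    # (+ (* 3 x y z) w)
print(eval_prefix(form, env))                 # 3*2*5*7 + 1 = 211
```

The operator position is just the head of the form, so accepting four operands for `*` takes no extra grammar.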
        
         | brundolf wrote:
         | I tend to think it's fine for the very most common and obvious
         | operators (MDAS, etc), but as soon as you get outside of those
         | I agree. In particular I've been bitten by the precedence of
         | JavaScript's ?? operator:
         | 
         |     function foo(a) {
         |       return a ?? 10 + " is the num";  // a ?? (10 + " is the num")
         |     }
         | 
         |     foo(12) // 12
        
       | dang wrote:
       | Surprisingly little discussed so far, aside from these past
       | related threads:
       | 
       |  _Wuffs' PNG image decoder_ -
       | https://news.ycombinator.com/item?id=26714831 - April 2021 (135
       | comments)
       | 
       |  _C performance mystery: delete unused string constant_ -
       | https://news.ycombinator.com/item?id=23633583 - June 2020 (105
       | comments)
       | 
       | That first one was just yesterday but this is a rare case where
       | we would not downweight the follow-up post
       | (https://hn.algolia.com/?dateRange=all&page=0&prefix=true&sor...).
        
         | zokier wrote:
         | Apparently the language was renamed at some point, google/puffs
         | redirects to google/wuffs?
         | 
         | Puffs was discussed a few years back:
         | https://news.ycombinator.com/item?id=15711767
        
           | teraflop wrote:
           | Yeah, I was curious about that as well. The README file links
           | to a Google Groups discussion about the name change that
           | seems to have been memory-holed, but apparently it was
           | renamed to avoid confusion with a NetBSD component:
           | https://news.ycombinator.com/item?id=15712659
        
           | pjmlp wrote:
           | Thankfully they changed the name, in Germany it would be
           | quite impossible to use it in any public discussion, as Der
           | Puff is a special kind of boys club.
        
       | mjevans wrote:
       | The only part of the Wuffs spec I just read that I dislike:
       | 
       | Strings. I would really prefer strings to work like existing C
       | and 'bash' style quoting. At least the simple aspects of it, the
       | parts of the rules that are easy to remember and simple. A string
       | should always be a sequence of octets, but easily coerced by a
       | casting operator to a numeric format from any index. I'm not sure
       | what the syntax for that would be offhand.
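For comparison, this is roughly what "a sequence of octets, coerced to a numeric format from any index" looks like in Python's `bytes` type. It is not Wuffs syntax and not a proposal for it. The sample data is the 8-byte PNG signature followed by the start of an IHDR chunk, whose length field is 13.

```python
import struct

data = b"\x89PNG\r\n\x1a\n\x00\x00\x00\x0dIHDR"

# Indexing a byte string yields the octet as an integer.
first_letter = data[1]
print(first_letter == ord("P"))

# Read a 32-bit big-endian unsigned integer starting at index 8.
(chunk_length,) = struct.unpack_from(">I", data, 8)
print(chunk_length)

# The same read without struct, using a slice and an explicit byte order.
same_length = int.from_bytes(data[8:12], "big")
print(same_length)
```

The byte order and width have to be stated explicitly in both forms, which is one of the design questions any such casting syntax would need to answer.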
        
         | rattray wrote:
         | Yeah, when I got to the bottom and saw that they don't have
         | strings, I immediately decided not to spend any more time
         | learning about the language.
         | 
         | I'm not sure what I would want them to be like for this
         | language, but I'd definitely want something.
        
       ___________________________________________________________________
       (page generated 2021-04-08 23:00 UTC)