[HN Gopher] Learn Python ASTs, by building your own linter
___________________________________________________________________
Learn Python ASTs, by building your own linter
Author : tusharsadhwani
Score : 145 points
Date : 2021-12-29 13:04 UTC (9 hours ago)
(HTM) web link (sadh.life)
(TXT) w3m dump (sadh.life)
| mgdlbp wrote:
| Quite a few languages can access their own ASTs, but I don't know
| of one other than C# (and VB.NET--Roslyn is the compiler for
| both) where the API is so deeply integrated and hence useful.
|
| The Roslyn SDK exposes its syntax tree, symbol table, and
| semantic model, with the primary use being for custom code
| analysis. I surprisingly easily made a linter ('analyzer') for a
| personal style preference, along with 'code fix' (lightbulb
| suggestion that appears in Visual Studio) through the quick-start
| tutorial. The resulting .NET assembly integrated impressively
| with msbuild and Visual Studio, my custom analyzer being
| indistinguishable in UX from the built-in ones. Seeing the actual
| syntax tree, especially where the compiler had recovered from
| syntax errors, also seemed a great learning experience for
| getting a feel of how the compiler treated errors.
|
| It seems to now be fairly common for .NET projects to develop
| their own analyzers to enforce specialized best-practices; I
| wonder if other languages have similar customs?
|
| https://docs.microsoft.com/en-us/dotnet/csharp/roslyn-sdk/
| tusharsadhwani wrote:
| That's amazing. I'm interested in C# now
| benhoff wrote:
| I've often thought it would be cool to use ASTs and perhaps code
| embeddings generated from machine learning as a tool to help
| students improve.
|
| If you've ever taught an intro-level Python course, it quickly
| becomes apparent how repeatable the mistakes are, or where you
| didn't spend enough time. As a student, this is frustrating
| because the correction comes too late; it's why having someone
| knowledgeable over your shoulder can speed up your learning.
|
| The challenge that I believe ASTs present is that a parser only
| produces one for compliant code. So if someone makes a syntax
| error, it becomes a whole new ball game. I'd glanced at
| Tree-sitter to see if this could fix some of these issues, but I
| think it's a more fundamental problem than that.
| tusharsadhwani wrote:
| Tree-sitter can definitely help with this problem, but so can
| regular AST parsers. The idea is the same: just add code or
| grammar that will parse the "invalid" grammar, mark it as
| invalid, and continue parsing valid code as soon as possible.
|
| Existing code editors like VSCode do exactly this for better
| syntax highlighting of incomplete code.
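For reference, a minimal sketch of the limitation being discussed, assuming only the standard-library `ast` module: `ast.parse` raises `SyntaxError` on non-compliant code instead of returning a partial tree. The exception still carries a line number and message, which may be enough for basic feedback. The helper name `first_syntax_error` is hypothetical.

```python
import ast


def first_syntax_error(source):
    """Return (line, message) for the first syntax error, or None if the code parses."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        # ast.parse stops at the first error; no partial tree is available.
        return exc.lineno, exc.msg
    return None


valid = "x = 1\nprint(x)\n"
broken = "x = 1\ny = = 2\n"

valid_result = first_syntax_error(valid)
broken_result = first_syntax_error(broken)
```

Here `valid_result` is `None`, while `broken_result` points at line 2. An error-tolerant parser would go further and keep building nodes for the code after the error.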
| hsbauauvhabzb wrote:
| Wouldn't that be impossible? The structure of Python is finite,
| and invalid deviations are infinite. Sure, any language's AST
| compiler could be more helpful, but it can't take trash and
| turn it into gold.
| stevekemp wrote:
| That's a pretty awesome read, and the approach is pretty
| flexible.
|
| I've written simple code using the AST-visitor approach to
| enforce some common standards on code within our company. Simple
| things like ensuring that when we use Troposphere to generate AWS
| CloudFormation templates we always set up some specific values.
| (For example I wrote a checker to ensure that every time an ECR
| repository is created we must enable ScanOnPush, or every time we
| declare a security-group we must have a comment "[cloudformation]
| ..." with it - so that manual edits stand out.)
| tusharsadhwani wrote:
| Thanks!
|
| The stuff that ASTs let you do really flexibly is almost always
| lost to people because they're not aware of it. A lot of other
| developers would try to do this with string or regex matching,
| and that often leads to painful experiences.
| stevekemp wrote:
| Agreed. Simple checks like these are trivial:
|
| A call to function "Foo" must always have an argument
| matching the regexp "/blah/". Otherwise raise an error.
|
| And they're so lightweight you can add them to any
| CI/CD/automation steps in your repository. Once you get a few
| things like that, or validating naming-standards, you can
| roll them up into a simple "linter".
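A check of this kind could be sketched roughly as follows, assuming Python 3.8+ (where string literals parse as `ast.Constant`). The names `FooArgumentChecker` and `check` are hypothetical, and the rule only looks at string-literal arguments, which is one possible reading of "an argument matching the regexp".

```python
import ast
import re


class FooArgumentChecker(ast.NodeVisitor):
    """Report calls to `Foo` that have no string-literal argument matching a pattern."""

    def __init__(self, pattern):
        self.pattern = re.compile(pattern)
        self.errors = []

    def visit_Call(self, node):
        # Only plain-name calls such as Foo(...); attribute calls are ignored here.
        if isinstance(node.func, ast.Name) and node.func.id == "Foo":
            values = list(node.args) + [kw.value for kw in node.keywords]
            has_match = any(
                isinstance(value, ast.Constant)
                and isinstance(value.value, str)
                and self.pattern.search(value.value)
                for value in values
            )
            if not has_match:
                self.errors.append((node.lineno, "Foo() needs an argument matching the pattern"))
        # Keep walking so nested calls are checked too.
        self.generic_visit(node)


def check(source, pattern=r"blah"):
    """Parse `source` and return a list of (line, message) violations."""
    checker = FooArgumentChecker(pattern)
    checker.visit(ast.parse(source))
    return checker.errors


sample = 'Foo("blah-1")\nFoo("other")\nbar(Foo(x))\n'
errors = check(sample)
```

With this sample, the calls on lines 2 and 3 are reported and line 1 passes. A CI step could run `check` over each file and fail the build when the list is non-empty.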
| ambrose2 wrote:
| This was a really nice read! The best part was learning that
| there's no need to actually parse tokens when building a Python
| linter (well, maybe there's an exception) because you can
| leverage the already parsed AST or CST.
| tusharsadhwani wrote:
| True! Although there are some lints that would require you to
| parse tokens, such as checking for single vs. double quotes, or
| the number of spaces used for indentation.
|
| However, Python has a built-in tokenize module for that as well.
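As a rough sketch of a token-level lint, the following uses the standard-library `tokenize` module to find string literals written with single quotes. The function name `single_quoted_strings` is hypothetical. One caveat to assume: from Python 3.12 onward f-strings are split into separate token types, so this sketch only sees them on earlier versions.

```python
import io
import tokenize


def single_quoted_strings(source):
    """Yield (line, text) for string tokens written with single quotes."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    for tok in tokens:
        if tok.type == tokenize.STRING:
            # Skip any prefix such as r, b, u, or f to find the opening quote.
            literal = tok.string.lstrip("rRbBuUfF")
            if literal.startswith("'"):
                yield tok.start[0], tok.string


source = "a = 'one'\nb = \"two\"\nc = r'three'\n"
found = list(single_quoted_strings(source))
```

The AST cannot answer this question because both quote styles produce the same node; the token stream keeps the original spelling.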
| jmac01 wrote:
| > "So what is an AST?"
|
| I had to google... because it doesn't actually say that it stands
| for Abstract Syntax Tree haha
|
| Would be nice to highlight what AST stands for in the first
| sentence of that section! :D
| fintler wrote:
| In the context of this article -- it's mostly just talking
| about Python-specific ASTs.
|
| Reading this article might be confusing to someone who's trying
| to learn what an AST is. ASTs are not unique to Python, they're
| just a common data structure used in compiler design.
|
| ASTs are used by compilers like this:
|
| 1) A compiler will take source code and process it into little
| pieces called tokens (e.g., a number, an equals sign, a
| variable type, etc) with a little program called a "lexer".
|
| 2) Then, those tokens are processed by a "parser" -- which is a
| little program that takes the tokens from the lexer, as well
| as a description of a programming language (e.g. a
| context-free grammar in Backus-Naur Form) and outputs an AST.
|
| 3) Then finally, the AST nodes are walked and machine code is
| generated.
|
| This article hooks into the AST inside the Python "compiler"
| between steps 2 and 3 to do some analysis on the AST instead of
| converting it to something that can be executed (e.g. machine
| code or some other IR). That is a very useful thing, but
| probably not a good introduction to compilers.
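The three steps can be observed from within Python itself, as in this minimal sketch using only the standard library. Two assumptions worth stating: `ast.parse` works from the source text and does its own tokenizing internally, so the `tokenize` call is shown only to make step 1 visible, and CPython's `compile` produces bytecode rather than machine code.

```python
import ast
import io
import tokenize

source = "answer = 6 * 7\n"

# Step 1: lexing. The tokenize module splits source text into tokens.
tokens = [
    (tokenize.tok_name[tok.type], tok.string)
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
]

# Step 2: parsing. ast.parse turns the source into a tree of nodes.
tree = ast.parse(source)

# Step 3: code generation. CPython emits bytecode for its virtual machine.
code = compile(tree, filename="<example>", mode="exec")

# Running the generated code fills in the namespace.
namespace = {}
exec(code, namespace)
```

A linter stops after step 2 and inspects `tree` instead of calling `compile`.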
|
| If you're new to compilers, I suggest staying away from the
| Python "ast" module until you're comfortable with general
| compiler design. Maybe start with playing around with something
| like PLY instead -- create a simple little language yourself
| and write a compiler for it:
|
| <https://www.dabeaz.com/ply/ply.html#ply_nn2>
| tusharsadhwani wrote:
| I'll agree, it's not a good introduction to compilers, but it
| isn't meant to be.
|
| PLY on the other hand is an amazing resource, thanks for
| linking it here.
| tusharsadhwani wrote:
| My bad xD, to my credit it's mentioned later in the article.
| But you're right, I should add that in the beginning.
| apurtbapurt wrote:
| I maintain some code that relies on Python AST for finding and
| packaging modules with appropriate class signatures when building
| customer specific distributions. It works really well most of the
| time. And, it is a lot easier to maintain than 50+ separate wheel
| definitions.
|
| The one big drawback is that the AST for even trivial code
| patterns has had a history of changing between Python versions.
| This makes it more annoying than usual to support multiple
| versions at the same time. Luckily 3.9 and 3.10 haven't brought
| any changes that impacted my codebase, as far as I've noticed.
| tusharsadhwani wrote:
| The only major changes that I'm aware of since Python 3 have
| been the change with keyword arguments in 3.6, and the
| deprecation of Index and introduction of Constant more recently.
| Those are breaking changes, but relatively small and
| maintainable imo. What challenges have you faced?
| masklinn wrote:
| > the deprecation of Index and introduction of Constant more
| recently.
|
| The introduction of Constant also deprecated everything it
| replaced (Str, Num, Bytes, and NameConstant).
|
| There's also the introduction of f-strings (represented as
| JoinedStr), and various nodes being duplicated for their async
| versions.
|
| Probably more relevant to automatically discovering
| signatures would be the addition of positional-only arguments
| to the `arguments` object.
|
| But messing with the AST is definitely a lot more stable than
| messing with the bytecode.
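A small sketch of two of the changes mentioned above, assuming Python 3.8 or later (earlier versions cannot parse the `/` marker at all). It inspects the `arguments` node of a function definition and the node type used for a string default.

```python
import ast

tree = ast.parse("def f(a, /, b, *, c='x'): pass")
func = tree.body[0]

# Python 3.8+ records positional-only parameters in their own list.
positional_only = [arg.arg for arg in func.args.posonlyargs]
regular = [arg.arg for arg in func.args.args]
keyword_only = [arg.arg for arg in func.args.kwonlyargs]

# On 3.8+ the default 'x' is parsed as ast.Constant, not the older Str node.
default = func.args.kw_defaults[0]
default_is_constant = isinstance(default, ast.Constant)
```

Code that builds signatures from `args.args` alone would silently drop `a` here, which is the kind of cross-version surprise described in this subthread.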
| wbkang wrote:
| This is a cool post, thank you. I knew about ASTs but did not know
| how to build them easily for Python so the second half was very
| useful for me.
| tusharsadhwani wrote:
| I really wasn't expecting anyone to read all of it. I was
| afraid people would either find it too trivial or too complex
| based on skill level. So that's great to hear.
| popotamonga wrote:
| In general, for me at least, I find the best way to learn about
| something is to work in the 'internals'. For instance, when React
| came out I couldn't wrap my head around it, so I started my own JS
| framework, and it ended up almost exactly like React (then I
| dumped it as it ended up just being a learning exercise).
| agumonkey wrote:
| I forgot whoever coined the saying (Feynman or else) but I'm
| definitely in the camp that needs to build something to feel at
| home with it.
| alansammarone wrote:
| I believe you are referring to "What I cannot create I don't
| understand", which is indeed by Feynman.
| agumonkey wrote:
| Most probably
| sarupbanskota wrote:
| You'll enjoy https://codecrafters.io
| nefitty wrote:
| I'm trying to become a top 0.01% JS user and creating linters,
| flavors, etc is my plan as well. I've read through and
| annotated the React codebase but it didn't stick very well. I
| would have done better to create my own framework! I keep
| having to relearn that lesson... I can have a lot of knowledge
| about a thing through reading, but knowledge of the thing
| requires some practical application.
|
| A tangent, but as it relates to that, if anyone reading has
| ideas on how to apply traditional computer science curriculum,
| I would love to hear it. I can think of toy CPU emulators,
| system architecture diagramming, language creation... But not
| sure if there's a thing I can build that would say, "I
| understand computer science."
___________________________________________________________________
(page generated 2021-12-29 23:01 UTC)