https://discuss.python.org/t/pep-750-tag-strings-for-writing-domain-specific-languages/60408
Discussions on Python.org
PEP 750: Tag Strings For Writing Domain-Specific Languages
lys.nikolaou (Lysandros Nikolaou) August 9, 2024, 3:40pm 1
Hi! :wave:
We are very excited to present PEP 750 - Tag Strings For Writing
Domain-Specific Languages. We believe that tag strings will be a
great addition to Python, which will make string processing and
writing Python-based DSLs much easier. We look forward to hearing
everyone's feedback!
Abstract
This PEP introduces tag strings for custom, repeatable string
processing. Tag strings are an extension to f-strings, with a custom
function - the "tag" - in place of the f prefix. This function can
then provide rich features such as safety checks, lazy evaluation,
domain-specific languages (DSLs) for web templating, and more.
Tag strings are similar to JavaScript tagged template literals and
related ideas in other languages. The following tag string usage
shows how similar it is to an f-string, albeit with the ability to
process the literal string and embedded values:
name = "World"
greeting = greet"hello {name}"
assert greeting == "Hello WORLD!"
Tag functions accept prepared arguments and return a string:
def greet(*args):
    """Tag function to return a greeting with an upper-case recipient."""
    salutation, recipient, *_ = args
    _, getvalue = recipient
    return f"{salutation.title().strip()} {getvalue().upper()}!"
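Since tag-string syntax isn't available in any released Python, the call the compiler would make can be sketched by hand. The interpolation tuple shape used below (raw text first, then a zero-argument getvalue callable) is inferred from the unpacking in the tag function above and is an assumption about the prototype's exact layout:

```python
def greet(*args):
    """Tag function to return a greeting with an upper-case recipient."""
    salutation, recipient, *_ = args
    _, getvalue = recipient
    return f"{salutation.title().strip()} {getvalue().upper()}!"

name = "World"
# Hand-built stand-in for: greet"hello {name}"
greeting = greet("hello ", ("name", lambda: name))
assert greeting == "Hello WORLD!"
```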
Below you can find richer examples. As a note, an implementation
based on CPython 3.12 exists, as discussed in this document.
PEP 750 - Tag Strings For Writing Domain-Specific Languages | peps.python.org
11 Likes
steve.dower (Steve Dower) August 9, 2024, 4:23pm 2
Nice work!
My main concern with this idea is filling up the namespace with a
variety of short names. It's clearly convenient to have xml or html
as a prefix,^[1] but I'm not convinced it's significantly better than
a function.
And of course, we can't use dots in these names, which means we're
doing a function lookup on a top-level function with a short name
that could very easily be overwritten at some point.
I'm inclined towards a generic tag that converts the interpolations
into structured data. Something like i"str {x:format}" becoming
Interpolated(s="str {0}", args=[(x, "format"), ...]).
Then the tags can be regular functions that handle the Interpolated
type - xml.parse(i""). It's not quite as much syntactic sugar, but
it's more extensible and safer to use.
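Steve's generic i-string idea can be sketched with a plain dataclass. `Interpolated` and `render` are hypothetical names, and the (value, format_spec) pair layout is an assumption based on the example above:

```python
from dataclasses import dataclass, field

@dataclass
class Interpolated:
    s: str          # template text with positional placeholders
    args: list = field(default_factory=list)  # (value, format_spec) pairs

def render(interp: Interpolated) -> str:
    # A "tag" under this scheme is just a regular function that
    # consumes an Interpolated -- no new names in the caller's namespace.
    formatted = [format(value, spec or "") for value, spec in interp.args]
    return interp.s.format(*formatted)

x = 3.14159
# Stand-in for what i"str {x:.2f}" might produce:
print(render(Interpolated("str {0}", [(x, ".2f")])))  # str 3.14
```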
And maybe there's an additional layer of sugar that can be added on
top, for rarer cases where you can live with the namespace pollution?
But if the only option is to create top-level namespace pollution in
order to use it at all, I don't think this will move as quickly as
you'd like.
---------------------------------------------------------------------
1. greet is a poorly motivated example, IMHO. You should lead with
something that's useful.
9 Likes
pauleveritt (Paul Everitt) August 9, 2024, 5:07pm 3
Just to be clear...these tag functions don't go in built-ins or the
standard library. They are user-defined functions. In your own code
or from a package you install. (I'm not saying this was your point,
more in case someone reads it that way.)
steve.dower (Steve Dower) August 9, 2024, 5:10pm 4
# Paul Everitt:
these tag functions don't go in built-ins or the standard library
Yeah, it's your own local namespace that I'm worried about polluting.
For example, you couldn't both import the xml module and have an xml
tag available, because the names collide.
3 Likes
jimbaker (Jim Baker) August 9, 2024, 6:03pm 5
Because both of these are currently invalid syntax, it's possible for
tag strings to additionally both support dotted names:
lazy.f"Like f-string, but this tag function lazily evaluates {expr}"
and atomic expressions
(resolve_to_tag_function())"The velocity of an unladen sparrow is {velocity} m/s"
I can see arguments either way (shorter names vs avoid namespace
cluttering). In practice, it is somewhat more involved/difficult to
implement these two additional cases, but if the ergonomics make
sense to the community, that's likely a reasonable cost.
erictraut (Eric Traut) August 9, 2024, 6:09pm 6
Thanks, I can tell a lot of good thought went into this PEP. Overall,
I like the design and think it would be a good addition to the
language.
However, I have one significant concern. It relates to the deferred
evaluation of interpolation expressions. I understand the motivation
for this design choice, but I think it has many downsides that are
not recognized or addressed in the PEP. The good news is that there
are reasonable ways to accommodate lazy evaluation use cases without
any of these downsides (more on that later).
Let me start by trying to convince you that deferring the evaluation
of interpolation expressions in the common case is a bad idea.
First, it complicates the mental model for users of tag strings. They
have to assume that all of their interpolation expressions may not be
evaluated immediately. Deferred evaluation requires special
considerations, so doing it implicitly leads to surprises. It places
a higher burden on users to make their code correct in all
circumstances. If they are the author of both the tag function and
the tag string, this is less of an issue, but in many cases these
will be different developers.
Users of tag strings cannot reassign or modify any of the values used
in interpolation expressions after the tag string definition because
the final evaluated string may be affected.
name = "Bob"
my_str = greet"hello {name}"
name = "Ellen"
print(my_str) # What would you expect here?
del name
print(my_str) # Will this crash?
There is no guarantee that evaluations will be performed by the tag
function exactly once and in a left-to-right order.
id = 1

def get_next_id():
    global id
    id += 1
    return id

my_str = tag_fn"{get_next_id()}: a, {get_next_id()}: b"
print(my_str) # Is this guaranteed to print "1: a, 2: b"?
# What if the tag function evaluates them multiple times?
# What if the tag function evaluates them in a different order?
# What if the tag function skips evaluating some of them?
print(my_str) # Will this print "3: a, 4: b"?
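The hazard can be made concrete with a hypothetical tag result that re-evaluates its interpolations on every render; `LazyStr` and its list-of-parts layout are purely illustrative, not part of the PEP:

```python
class LazyStr:
    """Illustrative lazy result: re-runs callables on every render."""
    def __init__(self, parts):
        self.parts = parts  # plain strings or zero-argument callables
    def __str__(self):
        return "".join(p() if callable(p) else p for p in self.parts)

counter = 0
def next_id():
    global counter
    counter += 1
    return str(counter)

my_str = LazyStr([next_id, ": a, ", next_id, ": b"])
print(str(my_str))  # 1: a, 2: b  -- first render
print(str(my_str))  # 3: a, 4: b  -- the counter advances again
```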
Second, when users encounter problems like this, the issue will be
difficult to debug. Debugging deferred execution is challenging in
general (much like async code), and it is even more challenging here
because debuggers will typically lack the context to show users what
is executing at the time the bug occurs.
Third, static analysis tools like mypy and pyright will need to
conservatively assume that all interpolation expressions are
evaluated in a deferred manner. This means they'll inevitably produce
false positives in cases where evaluation isn't deferred (the common
case). Here's an example:
def func(name_or_id: str | int):
    if isinstance(name_or_id, str):
        # The type of name_or_id is narrowed to `str` here,
        # but the narrowed type cannot be used if `name_or_id`
        # is evaluated in a deferred manner. This will result in
        # a static type error when calling the `upper` method.
        print(greet"Hello {name_or_id.upper()}!")
---------------------------------------------------------------------
I see two potential ways to avoid some or all of the above problems.
Fix 1 (my recommendation): Make deferred execution explicit by having
the user provide a callable in the interpolation expression. This fix
addresses all of the problems I mentioned above.
With this fix, all interpolation expressions are evaluated
immediately by the runtime, as they are with f-strings. This
guarantees that they are all evaluated exactly once in left-to-right
order, preserving the common-sense mental model of f-strings.
In the (relatively rare) case where the user wants their
interpolation expression to be evaluated lazily (e.g. because it's an
expensive call or the information is not yet available) and this
functionality is supported by the tag function, they can provide a
lambda or other callable in their interpolation expression. If a tag
function supports deferred (lazy) evaluation, it can look at the
evaluated value of the interpolation expression and determine whether
it's callable. If it's callable, it should call it to retrieve the
value. If the value is not callable, the tag function should assume
that the value can be used directly in a non-deferred manner.
name = "Ralph"
# `name` is evaluated immediately
greet"Hello {name}"
# The expression `lambda: name` is evaluated immediately.
# It is callable, so the tag function calls it in a deferred
# manner to retrieve the final value.
greet"Hello {lambda: name}!"
# Here `tag` is evaluated immediately, but `fetch_body_deferred`
# is callable, so it is called in a deferred manner.
tag = "body"
def fetch_body_deferred() -> str: ...
html"<{tag}>{fetch_body_deferred}</{tag}>"
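The tag-function side of Fix 1 can be sketched as follows. Since tag syntax doesn't exist yet, the tag is emulated as a plain call, and the argument layout (literal strings interleaved with already-evaluated interpolation values) is an assumption for illustration:

```python
def greet(*parts):
    # Fix 1 contract: every part arrives already evaluated; if a part
    # is callable, the tag treats it as an explicitly deferred value
    # and calls it to obtain the final result.
    resolved = [p() if callable(p) else p for p in parts]
    return "".join(str(p) for p in resolved).title() + "!"

name = "Ralph"
print(greet("hello ", name))          # Hello Ralph!
print(greet("hello ", lambda: name))  # Hello Ralph! (evaluated on demand)
```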
The nice thing about this approach is that the common case (where
deferred evaluation isn't needed or desired) is much simpler. It puts
the user of the tag string in control. In the less-common situation
where deferred evaluation is desired, it's clear to the user -- and to
static analysis tools -- that deferred evaluation is being used. Users
can be more cautious when deferred evaluation is intended, and static
analysis tools can detect potential programming errors that result
from deferred evaluation without generating false positives in the
common immediate-evaluation case.
---------------------------------------------------------------------
(Partial) Fix 2: Document clear expectations for tag functions. This
mitigates some, but not all, of the problems. I don't recommend this
fix unless there are objections to fix 1.
This fix involves a clear, documented contract for tag functions.
They should be expected to evaluate every interpolation expression
exactly once and in order. If they fail to honor this contract, users
of tag strings may see unexpected behaviors.
This approach is still problematic because it addresses only some of
the problems. There's also no way for the runtime or static analysis
tools to enforce this contract, so there's still potential for bugs.
4 Likes
MegaIng (Cornelius Krupp) August 9, 2024, 6:25pm 7
# Eric Traut:
First, it complicates the mental model for users of tag strings.
They have to assume that all of their interpolation expressions
may not be evaluated immediately
The goal of this feature is to define DSLs. Users will need to learn
the rule of each individual tag function (i.e. DSL) in isolation, and
they should not carry over assumptions about when a value gets
evaluated from one to the other.
# Eric Traut:
name = "Bob"
my_str = greet"hello {name}"
name = "Ellen"
print(my_str) # What would you expect here?
Hello Bob
# Eric Traut:
del name
print(my_str) # Will this crash?
No, because the greet DSL eagerly evaluates its arguments. This
should be spelt out in the documentation of greet.
# Eric Traut:
Fix 1 (my recommendation): Make deferred execution explicit by
having the user provide a callable in the interpolation
expression. This fix addresses all of the problems I mentioned
above.
Strong disagree. This limits the usefulness of this feature so much
that I think it would make it useless. Maybe the PEP failed to give a
good example, but IMO one of the strongest use cases is lazily
evaluated templates (potentially even re-evaluating intentionally
changed variables). Explicitly requiring users to write lambda:
breaks the normal reading flow and adds an unnecessary burden to a
feature that already requires careful documentation reading to use
correctly.
# Eric Traut:
(Partial) Fix 2: Document clear expectations for tag functions.
This mitigates some, but not all, of the problems. I don't
recommend this fix unless there are objections to fix 1.
This fix involves a clear, documented contract for tag functions.
They should be expected to evaluate every interpolation
expression exactly once and in order. If they fail to honor this
contract, users of tag strings may see unexpected behaviors.
Again, strong disagree. Setting this expectation limits the
usability of this feature way too much. There should be a strong
recommendation to clearly document when and how the expressions
will be evaluated, but there shouldn't be a warning that not
following the normal rules of f-strings will lead to bugs.
pauleveritt (Paul Everitt) August 9, 2024, 7:16pm 8
FWIW, earlier drafts had an example of "lazy f-strings." Our
companion repo has some example code. We removed it for brevity, but
I believe StackOverflow would show there's an audience for that.
pauleveritt (Paul Everitt) August 9, 2024, 7:22pm 9
If you're interested in kicking the tires, there's actually an
implementation based on 3.14 (thanks to Jim, Guido, and Lysandros.)
You can try it:
* With this JupyterLite notebook (thanks Koudai and also Hood for
getting Pyodide fixes for 3.14)
* A quick tutorial and a longer HTML templating tutorial
* A Docker build in Docker Hub (again, thanks Koudai)
* Codespaces (thanks Dave) and a link to the branch in the repo
README
2 Likes
Rosuav (Chris Angelico) August 9, 2024, 8:58pm 10
# Steve Dower:
For example, you couldn't both import the xml module and have an
xml tag available, because the names collide.
But presumably you could have an XML tag if that's what you want.
devdanzin (Daniel Diniz) August 9, 2024, 10:07pm 11
# Paul Everitt:
+ With this JupyterLite notebook
That's really fun! A powerful feature, no doubt.
After a bit of playing, semantics about what gets executed when
became clearer to me:
def _greet(*args):
    exec(compile(args[0], "s", "exec"), globals())
    return f"{args[0]}"

greet = _greet
print(
    greet"""a=1
greet'''
b=2
greet'greet = lambda *args: "We gone... "'
'''
"""
    + greet'Ok'
    + _greet("greet = _greet ")
    + greet"print('We back?') "
    + greet"c = 3 "
    + greet"print(a, b, c)"
)
# This outputs:
# We back?
# 1 2 3
# a=1
# greet'''
# b=2
# greet'greet = lambda *args: "We gone... "'
# '''
# We gone... greet = _greet print('We back?') c = 3 print(a, b, c)
But I'm not sure I like all it makes possible:
print(str"a") # a
print(int"2") # 2
print"b" # b
print(type"a") #
import sys
write = sys.stdout.write
write"Hmm" # Hmm
list'defabc' # ['d', 'e', 'f', 'a', 'b', 'c']
list'''{list"{sorted'defabc'}"}'''[0]()[0]() # ['a', 'b', 'c', 'd', 'e', 'f']
2 Likes
brettcannon (Brett Cannon) August 9, 2024, 10:15pm 12
# Jim Baker:
it's possible for tag strings to additionally both support dotted
names:
lazy.f"Like f-string, but this tag function lazily evaluates {expr}"
I personally would definitely want dotted name support.
3 Likes
trey (Trey Hunner) August 9, 2024, 10:27pm 13
I enjoyed playing with this, but I found the first example a bit
confusing initially. After reading the "Proposal" section I realized
the real power of this feature.
Here's an example that might demonstrate the power of this a bit more
readily:
import re
from typing import Decoded, Interpolation, Pattern
def regex(*args: Decoded | Interpolation) -> Pattern:
    result = []
    for arg in args:
        match arg:
            case Decoded() as decoded:
                result.append(decoded.raw)
            case Interpolation() as interpolation:
                value = interpolation.getvalue()
                result.append(re.escape(value))
    return re.compile(f"{''.join(result)}")
Here's an example use of that tag (though someone can probably think
of a better one):
def find_word(string, word):
    return regex"\b{word}\b".findall(string)

And here's what it's effectively equivalent to:

def find_word(string, word):
    return re.findall(rf"\b{re.escape(word)}\b", string)
Fully working code here. Feel free to borrow that code or some
variation of it as an example in the PEP.
Note: I tried to come up with an example that used logging or sqlite3
to avoid the common "don't use f-strings when logging" and "string
formatting leads to SQL injection" issues, but I had trouble due to
the inability to use . in the tag. I worked around it with x =
cursor.x in this example.
---------------------------------------------------------------------
Personal preference note, as someone teaching Python:
I'm torn on the syntax. The use of quotes makes me think "this
returns a string" and also seems like it could lead to a challenge in
knowing what to look up in a search engine to learn about this
syntax.
If backticks or another symbol were used instead of quotes, I could
imagine beginners searching for "Python backtick syntax". But the
current syntax doesn't have a simple answer to "what should I type
into Google/DDG/etc. to look this up".
4 Likes
pauleveritt (Paul Everitt) August 10, 2024, 11:44am 14
Nice example, thanks (and thanks for looking at it.) You're also
right about logging, it came up in earlier discussion.
Interesting point about a different character instead of quote.
JavaScript also uses backtick, but uses it for both template literal
(f-string) and tagged template literal (tag string). Also: tagged
template literals, unlike template literals, don't have to return a
string.
jimbaker (Jim Baker) August 10, 2024, 2:34pm 15
# Trey Hunner:
Personal preference note, as someone teaching Python:
I'm torn on the syntax. The use of quotes makes me think "this
returns a string" and also seems like it could lead to a
challenge in knowing what to look up in a search engine to learn
about this syntax.
The tag name is just a regular name that has been defined or imported
in the namespace. So the discovery process of what is this tag string
and how does it work, starts there. So I would imagine someone in a
code editor could click on the specific regex name for the tag
string, go to its definition, read its docs, etc. Code editors could
also usefully summarize usage as well through hovering.
Tag strings are often stringified, or return subclasses of str, but
not necessarily. We didn't want to deter the innovation that was
possible by returning other objects, and especially provide hard
edges that would require someone to go back to using sys._getframe
for their actual problems.
If backticks or another symbol were used instead of quotes, I
could imagine beginners searching for "Python backtick syntax".
But the current syntax doesn't have a simple answer to "what
should I type into Google/DDG/etc. to look this up".
Given the use of backticks in JS tagged template literals, it was the
first approach we considered for Python. But each of the current
string variants, especially f-strings, required understanding of this
new syntax. Starting with the same syntax as f-strings, but targeted
at DSLs, seems to be straightforward to teach and learn.
brandtbucher (Brandt Bucher) August 10, 2024, 4:08pm 16
# Brett Cannon:
I personally would definitely would want dotted name support.
At that point, we should just allow any expression (we've been down
this road before).
pablogsal (Pablo Galindo Salgado) August 10, 2024, 5:52pm 17
Unfortunately, it's not that easy because of the tag: the lexer uses
it to decide to enter f-string mode (now tag-string mode), and it's
included in the FSTRING_START token. Using dotted_name won't do,
because the lexer doesn't know what dotted_name is; therefore, it
cannot know if it needs to enter tag-string mode or normal string
mode. The parser cannot drive the lexer, so the lexer must do the
lexing on its own without any grammatical information (the same way
the parser can be driven by a bunch of tokens alone, without the
lexer). Anything that couples both pieces will be a nightmare.
There are hacky ways around it. I suggested one way to make that
work: basically, when the lexer detects the start of a string, it
asks "What's the last token I emitted?". If it is NAME or ) or one of
the other tokens that are currently illegal in that position, it
emits tag-string tokens. But @lys.nikolaou noticed that the way he
implemented this idea was backwards incompatible; it was too big of a
change, since it broke all code that tokenizes STRING tokens.
pf_moore (Paul Moore) August 10, 2024, 6:25pm 18
# Daniel Diniz:
But I'm not sure I like all it makes possible:
On the plus side, it offers a form of "decimal literal" (which has
been requested many times) for free:
from decimal import Decimal as D
dec_num = D"2.71828"
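This works because, under the PEP, the tag receives the literal text as a str subclass, so Decimal itself could plausibly serve as the tag for literals with no interpolations. A sketch, emulated with a hypothetical `Decoded` stand-in since the tag syntax doesn't exist yet:

```python
from decimal import Decimal

class Decoded(str):
    """Stand-in for the PEP's str subclass carrying raw literal text."""

# Stand-in for: D"2.71828" -- Decimal happily accepts a str subclass.
dec_num = Decimal(Decoded("2.71828"))
print(dec_num)  # 2.71828
```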
1 Like
thejcannon (Josh Cannon) August 10, 2024, 6:49pm 19
Thanks for the PEP! I'm excited to see and use what comes out the
other end!
The two thoughts I had:
1. This means we'll never get another string prefix again (in
reality). It also means people are likely going to use other
single-letter prefixes at some point (which, as someone else
points out, probably hurts discoverability). So I think it
deserves at least a section in the rejected alternatives on why
not something more obviously different (greet!"foo" or backticks
as strawmen).
2. The title of the PEP mentions this is for DSLs; however, one
reason to desire this feature has nothing to do with runtime
semantics: annotating what the string is to tools. E.g.
py"""...""" without any runtime implication is still useful, as
editors can syntax highlight and formatters/linters can format/
lint. Same for sql"""...""" etc.
So, I'd love to see that get some attention, which could be as
much as a new stdlib module with a __getattr__ for any name
that just returns a function that is mostly just string identity
(strawman on whether it acts like a bare string or an f-string),
or as little as mentioning this is a use case which can be
explored in another PEP, and that this PEP lays the obvious
groundwork and otherwise doesn't block future improvements.
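Josh's strawman stdlib module can be sketched with PEP 562's module-level __getattr__. The module name `strtags` and the bare-string identity behavior are assumptions; the module is built dynamically here only so the sketch is self-contained, whereas a real module would just define `__getattr__` at its top level:

```python
import sys
import types

# Build a throwaway module standing in for the hypothetical stdlib one.
strtags = types.ModuleType("strtags")

def _module_getattr(name):
    def identity_tag(*parts):
        # Acts like a bare string: concatenates the literal chunks,
        # adding no runtime semantics of its own.
        return "".join(str(p) for p in parts)
    identity_tag.__name__ = name
    return identity_tag

# PEP 562: attribute misses on a module fall back to __getattr__.
strtags.__getattr__ = _module_getattr
sys.modules["strtags"] = strtags

import strtags  # noqa: E402
sql = strtags.sql       # any name works; tools could key off it
print(sql("SELECT 1"))  # SELECT 1
```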
3 Likes
yoavdw (Yoav) August 10, 2024, 7:58pm 20
# Josh Cannon:
This means we'll never get another string prefix again (in
reality).
I also thought about this, and I think it's another (not negligible)
advantage of Steve's idea of i-strings just returning an Interpolated
object.
1 Like