Commit Graph

111 Commits

Author SHA1 Message Date
Damien George d4751a164e py: Add support for PEP 750's t-strings.
This commit adds support for t-strings by leveraging the existing f-string
parser in the lexer.  It includes:
- t-string parsing in `py/lexer.c`
- new built-in `__template__()` function to construct t-string objects
- new built-in `Template` and `Interpolation` classes which implement all
  the functionality from PEP 750
- new built-in `string` module with `templatelib` sub-module, which
  contains the classes `Template` and `Interpolation`

The way the t-string parser works is that an input t-string like:

    t"hello {name:5}"

is converted character-by-character by the lexer/tokenizer to:

    __template__(("hello ", "",), name, "name", None, "5")

For reference, if it were an f-string it would be converted to:

    "hello {:5}".format(name)

Some properties of this implementation:
- it's enabled by default at the full feature level,
  MICROPY_CONFIG_ROM_LEVEL_AT_LEAST_FULL_FEATURES
- when enabled on a Cortex-M bare-metal port it costs about +3000 bytes
- there are no limits on the size or complexity of t-strings, and it allows
  arbitrary levels of nesting of f-strings and t-strings (up to the memory
  available to the compiler)
- the 'a' (ascii) conversion specifier is not supported (MicroPython does
  not have the built-in `ascii` function)
- space after conversion specifier, eg t"{x!r :10}", is not supported
- arguments to `__template__` and `Interpolation` are not fully validated
  (it's not necessary, it won't crash if the wrong arguments are passed in)

Otherwise the implementation here matches CPython.

Signed-off-by: Damien George <damien@micropython.org>
2026-03-09 23:47:33 +11:00
Damien George 58436b2882 py/lexer: Fix parsing of f'{{'.
Prior to this fix, f'{{' would raise a SyntaxError.

Signed-off-by: Damien George <damien@micropython.org>
2026-03-09 23:14:42 +11:00
Damien George c9f747cccf py/lexer: Move f-string completion code to more logical location.
This way, the use of `lex->fstring_args` is fully self contained within the
string literal parsing section of `mp_lexer_to_next()`.

Signed-off-by: Damien George <damien@micropython.org>
2026-02-04 23:19:09 +11:00
Damien George 7ea88f7da2 py/lexer: Add support for nested f-strings within f-strings.
It turns out that it's relatively simple to support nested f-strings, which
is what this commit implements.

The way the MicroPython f-string parser works at the moment is:
1. it extracts the f-string arguments (things in curly braces) into a
   temporary buffer (a vstr)
2. once the f-string ends (reaches its closing quote) the lexer switches to
   tokenizing the temporary buffer
3. once the buffer is empty it switches back to the stream.

The temporary buffer can easily hold f-strings itself (ie nested f-strings)
and they can be re-parsed by the lexer using the same algorithm.  The only
thing stopping that from working is that the temporary buffer can't be
reused for the nested f-string because it's currently being parsed.

This commit fixes that by adding a second temporary buffer, which is the
"injection" buffer.  That allows arbitrary number of nestings with a simple
modification to the original algorithm:
1. when an f-string is encountered the string is parsed and its arguments
   are extracted into `fstring_args`
2. when the f-string finishes, `fstring_args` is inserted into the current
   position in `inject_chrs` (which is the start of that buffer if no
   injection is ongoing)
3. `fstring_args` is now cleared and ready for any further f-strings
   (nested or not)
4. the lexer switches to `inject_chrs` if it's not already reading from it
5. if an f-string appeared inside the f-string then it is in `inject_chrs`
   and can be processed as before, extracting its arguments into
   `fstring_args`, which can then be inserted again into `inject_chrs`
6. once `inject_chrs` is exhausted (meaning that all levels of f-strings
   have been fully processed) the lexer switched back to tokenizing the
   stream.

Amazingly, this scheme supports arbitrary numbers of nestings of f-strings
using the same quote style.

This adds some code size and a bit more memory usage for the lexer.  In
particular for a single (non-nested) f-string it now makes an extra copy of
the `fstring_args` data, when copying it across to `inject_chrs`.
Otherwise, memory use only goes up with the complexity of nested f-strings.

Signed-off-by: Damien George <damien@micropython.org>
2026-02-04 23:19:09 +11:00
Damien George 617c7dba3b py/lexer: Use null char as lexer EOF sentinel.
The null byte cannot exist in source code (per CPython), so use it to
indicate the end of the input stream (instead of `(mp_uint_t)-1`).  This
allows the cache chars (chr0/1/2 and their saved versions) to be 8-bit
bytes, making it clear that they are not `unichar` values.  It also saves a
bit of memory in the `mp_lexer_t` data structure.  (And in a future commit
allows the saved cache chars to be eliminated entirely by storing them in
a vstr instead.)

In order to keep code size down, the frequently used `chr0` is still of
type `uint32_t`.  Having it 32-bit means that machine instructions to load
it are smaller (it adds about +80 bytes to Thumb code if `chr0` is changed
to `uint8_t`).

Also add tests for invalid bytes in the input stream to make sure there are
no regressions in this regard.

Signed-off-by: Damien George <damien@micropython.org>
2026-02-04 23:19:09 +11:00
Damien George 96007e7de5 py/lexer: Add static assert that token enum values all fit in a byte.
Signed-off-by: Damien George <damien@micropython.org>
2024-07-18 12:44:44 +10:00
Damien George 3c8089d1b1 py/lexer: Support raw f-strings.
Support for raw str/bytes already exists, and extending that to raw
f-strings is easy.  It also reduces code size because it eliminates an
error message.

Signed-off-by: Damien George <damien@micropython.org>
2024-06-06 17:34:28 +10:00
Damien George a066f2308f py/lexer: Support concatenation of adjacent f-strings.
This is quite a simple and small change to support concatenation of
adjacent f-strings, and improve compatibility with CPython.

Signed-off-by: Damien George <damien@micropython.org>
2024-06-06 14:58:46 +10:00
Angus Gratton decf8e6a8b all: Remove the "STATIC" macro and just use "static" instead.
The STATIC macro was introduced a very long time ago in commit
d5df6cd44a.  The original reason for this was
to have the option to define it to nothing so that all static functions
become global functions and therefore visible to certain debug tools, so
one could do function size comparison and other things.

This STATIC feature is rarely (if ever) used.  And with the use of LTO and
heavy inline optimisation, analysing the size of individual functions when
they are not static is not a good representation of the size of code when
fully optimised.

So the macro does not have much use and it's simpler to just remove it.
Then you know exactly what it's doing.  For example, newcomers don't have
to learn what the STATIC macro is and why it exists.  Reading the code is
also less "loud" with a lowercase static.

One other minor point in favour of removing it, is that it stops bugs with
`STATIC inline`, which should always be `static inline`.

Methodology for this commit was:

1) git ls-files | egrep '\.[ch]$' | \
   xargs sed -Ei "s/(^| )STATIC($| )/\1static\2/"

2) Do some manual cleanup in the diff by searching for the word STATIC in
   comments and changing those back.

3) "git-grep STATIC docs/", manually fixed those cases.

4) "rg -t python STATIC", manually fixed codegen lines that used STATIC.

This work was funded through GitHub Sponsors.

Signed-off-by: Angus Gratton <angus@redyak.com.au>
2024-03-07 14:20:42 +11:00
Mathieu Serandour c85db05244 py/lexer: Change token position for new lines.
Set the position of new line tokens as the end of the preceding line
instead of the beginning of the next line.  This is done by first moving
the pointer to the end of the current line to skip any whitespace, record
the position for the token, then finaly skip any other line and whitespace.

The previous behavior was to skip every new line and whitespace, including
the indent of the next line, before recording the token position.

(Note that both lex->emit_dent and lex->nested_bracket_level equal 0 if
had_physical_newline == true, which allows simplifying the if-logic for
MP_TOKEN_NEWLINE.)

And update the cmd_parsetree.py test expected output, because the position
of the new-line token has changed.

Fixes issue #12792.

Signed-off-by: Mathieu Serandour <mathieu.serandour@numworks.fr>
2023-11-03 15:56:10 +11:00
Jim Mussared 5015779a6f py/builtinevex: Handle invalid filenames for execfile.
If a non-string buffer was passed to execfile, then it would be passed
as a non-null-terminated char* to mp_lexer_new_from_file.

This changes mp_lexer_new_from_file to take a qstr instead (as in almost
all cases a qstr will be created from this input anyway to set the
`__file__` attribute on the module).

This now makes execfile require a string (not generic buffer) argument,
which is probably a good fix to make anyway.

Fixes issue #12522.

This work was funded through GitHub Sponsors.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
2023-10-12 15:17:59 +11:00
Jim Mussared 276bfa3146 py/lexer: Add missing initialisation for fstring_args_idx.
This was missed in 692d36d779. Probably
never noticed because everything enables `MICROPY_GC_CONSERVATIVE_CLEAR`,
but found via ASAN thanks to @gwangmu & @chibinz.

This work was funded through GitHub Sponsors.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
2023-09-29 13:58:26 +10:00
Jared Hancock b3cd41dd4b py/lexer: Allow conversion specifiers in f-strings (e.g. !r).
PEP-498 allows for conversion specifiers like !r and !s to convert the
expression declared in braces to be passed through repr() and str()
respectively.

This updates the logic that detects the end of the expression to also stop
when it sees "![rs]" that is either at the end of the f-string or before
the ":" indicating the start of the format specifier. The "![rs]" is now
retained in the format string, whereas previously it stayed on the end
of the expression leading to a syntax error.

Previously: `f"{x!y:z}"` --> `"{:z}".format(x!y)`
Now: `f"{x!y:z}"` --> `"{!y:z}".format(x)`

Note that "!a" is not supported by `str.format` as MicroPython has no
`ascii()`, but now this will raise the correct error.

Updated cpydiff and added tests.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
2023-06-14 19:11:04 +10:00
Jim Mussared fb8792c095 py/lexer: Wrap in parenthesis all f-string arguments passed to format.
This is important for literal tuples, e.g.

    f"{a,b,}, {c}" --> "{}".format((a,b), (c),)

which would otherwise result in either a syntax error or the wrong result.

Fixes issue #9635.

This work was funded through GitHub Sponsors.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
2023-01-20 17:54:32 +11:00
Damien George c49d5207e9 py/persistentcode: Remove unicode feature flag from .mpy file.
Prior to this commit, even with unicode disabled .py and .mpy files could
contain unicode characters, eg by entering them directly in a string as
utf-8 encoded.

The only thing the compiler disallowed (with unicode disabled) was using
\uxxxx and \Uxxxxxxxx notation to specify a character within a string with
value >= 0x100; that would give a SyntaxError.

With this change mpy-cross will now accept \u and \U notation to insert a
character with value >= 0x100 into a string (because the -mno-unicode
option is now gone, there's no way to forbid this).  The runtime will
happily work with strings with such characters, just like it already works
with strings with characters that were utf-8 encoded directly.

This change simplifies things because there are no longer any feature
flags in .mpy files, and any bytecode .mpy will now run on any target.

Signed-off-by: Damien George <damien@micropython.org>
2022-05-17 12:51:54 +10:00
Damien George 11ed94797d py/lexer: Support nested [] and {} characters within f-string params.
Signed-off-by: Damien George <damien@micropython.org>
2021-11-25 21:50:58 +11:00
Jim Mussared 5555f147df py/lexer: Clear fstring_args vstr on lexer free.
This was missed in 692d36d779.  It's not
strictly necessary as the GC will clean it anyway, but it's good to
pre-emptively gc_free() all the blocks used in lexing/parsing.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
2021-08-19 17:31:02 +10:00
Jim Mussared 692d36d779 py: Implement partial PEP-498 (f-string) support.
This implements (most of) the PEP-498 spec for f-strings and is based on
https://github.com/micropython/micropython/pull/4998 by @klardotsh.

It is implemented in the lexer as a syntax translation to `str.format`:
  f"{a}" --> "{}".format(a)

It also supports:
  f"{a=}" --> "a={}".format(a)

This is done by extracting the arguments into a temporary vstr buffer,
then after the string has been tokenized, the lexer input queue is saved
and the contents of the temporary vstr buffer are injected into the lexer
instead.

There are four main limitations:
- raw f-strings (`fr` or `rf` prefixes) are not supported and will raise
  `SyntaxError: raw f-strings are not supported`.

- literal concatenation of f-strings with adjacent strings will fail
    "{}" f"{a}" --> "{}{}".format(a)    (str.format will incorrectly use
                                         the braces from the non-f-string)
    f"{a}" f"{a}" --> "{}".format(a) "{}".format(a) (cannot concatenate)

- PEP-498 requires the full parser to understand the interpolated
  argument, however because this entirely runs in the lexer it cannot
  resolve nested braces in expressions like
    f"{'}'}"

- The !r, !s, and !a conversions are not supported.

Includes tests and cpydiffs.

Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
2021-08-14 16:58:40 +10:00
Emil Renner Berthing ccd92335a1 py, extmod: Introduce and use MP_FALLTHROUGH macro.
Newer GCC versions are able to warn about switch cases that fall
through.  This is usually a sign of a forgotten break statement, but in
the few cases where a fall through is intended we annotate it with this
macro to avoid the warning.
2020-10-22 11:53:16 +02:00
Damien George 1783950311 py/compile: Implement PEP 572, assignment expressions with := operator.
The syntax matches CPython and the semantics are equivalent except that,
unlike CPython, MicroPython allows using := to assign to comprehension
iteration variables, because disallowing this would take a lot of code to
check for it.

The new compile-time option MICROPY_PY_ASSIGN_EXPR selects this feature and
is enabled by default, following MICROPY_PY_ASYNC_AWAIT.
2020-06-16 22:02:24 +10:00
Jim Mussared def76fe4d9 all: Use MP_ERROR_TEXT for all error messages. 2020-04-05 15:02:06 +10:00
Damien George 69661f3343 all: Reformat C and Python source code with tools/codeformat.py.
This is run with uncrustify 0.70.1, and black 19.10b0.
2020-02-28 10:33:03 +11:00
Damien George 2069c563f9 py: Add support for matmul operator @ as per PEP 465.
To make progress towards MicroPython supporting Python 3.5, adding the
matmul operator is important because it's a really "low level" part of the
language, being a new token and modifications to the grammar.

It doesn't make sense to make it configurable because 1) it would make the
grammar and lexer complicated/messy; 2) no other operators are
configurable; 3) it's not a feature that can be "dynamically plugged in"
via an import.

And matmul can be useful as a general purpose user-defined operator, it
doesn't have to be just for numpy use.

Based on work done by Jim Mussared.
2019-09-26 15:12:39 +10:00
Damien George 6a445b60fa py/lexer: Add support for underscores in numeric literals.
This is a very convenient feature introduced in Python 3.6 by PEP 515.
2018-06-12 12:17:43 +10:00
Damien George a3dc1b1957 all: Remove inclusion of internal py header files.
Header files that are considered internal to the py core and should not
normally be included directly are:
    py/nlr.h - internal nlr configuration and declarations
    py/bc0.h - contains bytecode macro definitions
    py/runtime0.h - contains basic runtime enums

Instead, the top-level header files to include are one of:
    py/obj.h - includes runtime0.h and defines everything to use the
        mp_obj_t type
    py/runtime.h - includes mpstate.h and hence nlr.h, obj.h, runtime0.h,
        and defines everything to use the general runtime support functions

Additional, specific headers (eg py/objlist.h) can be included if needed.
2017-10-04 12:37:50 +11:00
Javier Candeira 35a1fea90b all: Raise exceptions via mp_raise_XXX
- Changed: ValueError, TypeError, NotImplementedError
  - OSError invocations unchanged, because the corresponding utility
    function takes ints, not strings like the long form invocation.
  - OverflowError, IndexError and RuntimeError etc. not changed for now
    until we decide whether to add new utility functions.
2017-08-13 22:52:33 +10:00
Alexander Steffen 55f33240f3 all: Use the name MicroPython consistently in comments
There were several different spellings of MicroPython present in comments,
when there should be only one.
2017-07-31 18:35:40 +10:00
Tom Collins 145796f037 py,extmod: Some casts and minor refactors to quiet compiler warnings. 2017-07-07 11:32:22 +10:00
Tom Collins 6f56412ec3 py/lexer: Process CR earlier to allow newlines checks on chr1.
Resolves an issue where lexer failed to accept CR after line continuation
character.  It also simplifies the code.
2017-05-12 15:14:24 +10:00
Tom Collins 2998647c4e py/lexer: Simplify lexer startup by using dummy bytes and next_char().
Now consistently uses the EOL processing ("\r" and "\r\n" convert to "\n")
and EOF processing (ensure "\n" before EOF) provided by next_char().

In particular the lexer can now correctly handle input that starts with CR.
2017-05-09 14:43:23 +10:00
Damien George 5010d1958f py/lexer: Simplify and reduce code size for operator tokenising.
By removing the 'E' code from the operator token encoding mini-language the
tokenising can be simplified.  The 'E' code was only used for the !=
operator which is now handled as a special case; the optimisations for the
general case more than make up for the addition of this single, special
case.  Furthermore, the . and ... operators can be handled in the same way
as != which reduces the code size a little further.

This simplification also removes a "goto".

Changes in code size for this patch are (measured in bytes):

bare-arm:       -48
minimal x86:    -64
unix x86-64:   -112
unix nanbox:    -64
stmhal:         -48
cc3200:         -48
esp8266:        -76
2017-03-29 10:56:52 +11:00
Damien George f64a3e296e py/lexer: Remove obsolete comment, since lexer can now raise exceptions. 2017-03-23 16:40:24 +11:00
Damien George 1831034be1 py: Allow lexer to raise exceptions during construction.
This patch refactors the error handling in the lexer, to simplify it (ie
reduce code size).

A long time ago, when the lexer/parser/compiler were first written, the
lexer and parser were designed so they didn't use exceptions (ie nlr) to
report errors but rather returned an error code.  Over time that has
gradually changed, the parser in particular has more and more ways of
raising exceptions.  Also, the lexer never really handled all errors without
raising, eg there were some memory errors which could raise an exception
(and in these rare cases one would get a fatal nlr-not-handled fault).

This patch accepts the fact that the lexer can raise exceptions in some
cases and allows it to raise exceptions to handle all its errors, which are
for the most part just out-of-memory errors during construction of the
lexer.  This makes the lexer a bit simpler, and also the persistent code
stuff is simplified.

What this means for users of the lexer is that calls to it must be wrapped
in a nlr handler.  But all uses of the lexer already have such an nlr
handler for the parser (and compiler) so that doesn't put any extra burden
on the callers.
2017-03-14 11:52:05 +11:00
Damien George 5124a94067 py/lexer: Convert mp_uint_t to size_t where appropriate. 2017-02-17 12:44:24 +11:00
Damien George 534b7c368d py: Do adjacent str/bytes literal concatenation in lexer, not compiler.
It's much more efficient in RAM and code size to do implicit literal string
concatenation in the lexer, as opposed to the compiler.

RAM usage is reduced because the concatenation can be done right away in the
tokeniser by just accumulating the string/bytes literals into the lexer's
vstr.  Prior to this patch adjacent strings/bytes would create a parse tree
(one node per string/bytes) and then in the compiler a whole new chunk of
memory was allocated to store the concatenated string, which used more than
double the memory compared to just accumulating in the lexer.

This patch also significantly reduces code size:

bare-arm: -204
minimal:  -204
unix x64: -328
stmhal:   -208
esp8266:  -284
cc3200:   -224
2017-02-17 12:12:40 +11:00
Damien George 773278ec30 py/lexer: Simplify handling of line-continuation error.
Previous to this patch there was an explicit check for errors with line
continuation (where backslash was not immediately followed by a newline).

But this check is not necessary: if there is an error then the remaining
logic of the tokeniser will reject the backslash and correctly produce a
syntax error.
2017-02-17 11:30:14 +11:00
Damien George ae43679792 py/lexer: Use strcmp to make keyword searching more efficient.
Since the table of keywords is sorted, we can use strcmp to do the search
and stop part way through the search if the comparison is less-than.

Because all tokens that are names are subject to this search, this
optimisation will improve the overall speed of the lexer when processing
a script.

The change also decreases code size by a little bit because we now use
strcmp instead of the custom str_strn_equal function.
2017-02-17 11:10:35 +11:00
Damien George a68c754688 py/lexer: Move check for keyword to name-tokenising block.
Keywords only needs to be searched for if the token is a MP_TOKEN_NAME, so
we can move the seach to the part of the code that does the tokenising for
MP_TOKEN_NAME.
2017-02-17 10:59:57 +11:00
Damien George 98b3072da5 py/lexer: Simplify handling of indenting of very first token. 2017-02-17 10:56:06 +11:00
Damien George c264414746 py/lexer: Don't generate string representation for period or ellipsis.
It's not needed.
2017-02-16 20:23:41 +11:00
Damien George 8beba7310f extmod/vfs_fat: Remove MICROPY_READER_FATFS component. 2017-01-30 12:26:07 +11:00
Damien George dcb9ea7215 extmod: Add generic VFS sub-system.
This provides mp_vfs_XXX functions (eg mount, open, listdir) which are
agnostic to the underlying filesystem type, and just require an object with
the relevant filesystem-like methods (eg .mount, .open, .listidr) which can
then be mounted.

These mp_vfs_XXX functions would typically be used by a port to implement
the "uos" module, and mp_vfs_open would be the builtin open function.

This feature is controlled by MICROPY_VFS, disabled by default.
2017-01-27 17:19:06 +11:00
Damien George c305ae3243 py/lexer: Permanently disable the mp_lexer_show_token function.
The lexer is very mature and this debug function is no longer used.  If
it's really needed one can uncomment it and recompile.
2016-12-22 10:49:54 +11:00
Damien George f4aebafe7a py/lexer: Remove unnecessary check for EOF in lexer's next_char func.
This check always fails (ie chr0 is never EOF) because the callers of this
function never call it past the end of the input stream.  And even if they
did it would be harmless because 1) reader.readbyte must continue to
return an EOF char if the stream is exhausted; 2) next_char would just
count the subsequent EOF's as characters worth 1 column.
2016-12-22 10:39:06 +11:00
Damien George b9c4783273 py/lexer: Remove unreachable code in string tokeniser. 2016-12-22 10:37:13 +11:00
Damien George adccafb42a tests/basics/lexer: Add a test for newline-escaping within a string. 2016-12-22 10:32:06 +11:00
Damien George 5bdf1650de py/lexer: Make lexer use an mp_reader as its source. 2016-11-16 18:35:01 +11:00
Damien George 66d955c218 py/lexer: Rewrite mp_lexer_new_from_fd in terms of mp_reader. 2016-11-16 18:13:51 +11:00
Damien George e5ef15a9d7 py/lexer: Provide generic mp_lexer_new_from_file based on mp_reader.
If a port defines MICROPY_READER_POSIX or MICROPY_READER_FATFS then
lexer.c now provides an implementation of mp_lexer_new_from_file using
the mp_reader_new_file function.
2016-11-16 18:13:51 +11:00
Damien George 511c083811 py/lexer: Rewrite mp_lexer_new_from_str_len in terms of mp_reader_mem. 2016-11-16 18:13:50 +11:00