It turns out that it's relatively simple to support nested f-strings, which
is what this commit implements.
The way the MicroPython f-string parser works at the moment is:
1. it extracts the f-string arguments (things in curly braces) into a
temporary buffer (a vstr)
2. once the f-string ends (reaches its closing quote) the lexer switches to
tokenizing the temporary buffer
3. once the buffer is empty it switches back to the stream.
The temporary buffer can easily hold f-strings itself (ie nested f-strings)
and they can be re-parsed by the lexer using the same algorithm. The only
thing stopping that from working is that the temporary buffer can't be
reused for the nested f-string because it's currently being parsed.
This commit fixes that by adding a second temporary buffer, which is the
"injection" buffer. That allows arbitrary number of nestings with a simple
modification to the original algorithm:
1. when an f-string is encountered the string is parsed and its arguments
are extracted into `fstring_args`
2. when the f-string finishes, `fstring_args` is inserted into the current
position in `inject_chrs` (which is the start of that buffer if no
injection is ongoing)
3. `fstring_args` is now cleared and ready for any further f-strings
(nested or not)
4. the lexer switches to `inject_chrs` if it's not already reading from it
5. if an f-string appeared inside the f-string then it is in `inject_chrs`
and can be processed as before, extracting its arguments into
`fstring_args`, which can then be inserted again into `inject_chrs`
6. once `inject_chrs` is exhausted (meaning that all levels of f-strings
have been fully processed) the lexer switched back to tokenizing the
stream.
Amazingly, this scheme supports arbitrary numbers of nestings of f-strings
using the same quote style.
This adds some code size and a bit more memory usage for the lexer. In
particular for a single (non-nested) f-string it now makes an extra copy of
the `fstring_args` data, when copying it across to `inject_chrs`.
Otherwise, memory use only goes up with the complexity of nested f-strings.
Signed-off-by: Damien George <damien@micropython.org>
The null byte cannot exist in source code (per CPython), so use it to
indicate the end of the input stream (instead of `(mp_uint_t)-1`). This
allows the cache chars (chr0/1/2 and their saved versions) to be 8-bit
bytes, making it clear that they are not `unichar` values. It also saves a
bit of memory in the `mp_lexer_t` data structure. (And in a future commit
allows the saved cache chars to be eliminated entirely by storing them in
a vstr instead.)
In order to keep code size down, the frequently used `chr0` is still of
type `uint32_t`. Having it 32-bit means that machine instructions to load
it are smaller (it adds about +80 bytes to Thumb code if `chr0` is changed
to `uint8_t`).
Also add tests for invalid bytes in the input stream to make sure there are
no regressions in this regard.
Signed-off-by: Damien George <damien@micropython.org>
Support for raw str/bytes already exists, and extending that to raw
f-strings is easy. It also reduces code size because it eliminates an
error message.
Signed-off-by: Damien George <damien@micropython.org>
If a non-string buffer was passed to execfile, then it would be passed
as a non-null-terminated char* to mp_lexer_new_from_file.
This changes mp_lexer_new_from_file to take a qstr instead (as in almost
all cases a qstr will be created from this input anyway to set the
`__file__` attribute on the module).
This now makes execfile require a string (not generic buffer) argument,
which is probably a good fix to make anyway.
Fixes issue #12522.
This work was funded through GitHub Sponsors.
Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
The following changes are made:
- If MICROPY_VFS is enabled then mp_vfs_import_stat and mp_vfs_open are
automatically used for mp_import_stat and mp_builtin_open respectively.
- If MICROPY_PY_IO is enabled then "open" is automatically included in the
set of builtins, and points to mp_builtin_open_obj.
This helps to clean up and simplify the most common port configuration.
Signed-off-by: Damien George <damien@micropython.org>
This implements (most of) the PEP-498 spec for f-strings and is based on
https://github.com/micropython/micropython/pull/4998 by @klardotsh.
It is implemented in the lexer as a syntax translation to `str.format`:
f"{a}" --> "{}".format(a)
It also supports:
f"{a=}" --> "a={}".format(a)
This is done by extracting the arguments into a temporary vstr buffer,
then after the string has been tokenized, the lexer input queue is saved
and the contents of the temporary vstr buffer are injected into the lexer
instead.
There are four main limitations:
- raw f-strings (`fr` or `rf` prefixes) are not supported and will raise
`SyntaxError: raw f-strings are not supported`.
- literal concatenation of f-strings with adjacent strings will fail
"{}" f"{a}" --> "{}{}".format(a) (str.format will incorrectly use
the braces from the non-f-string)
f"{a}" f"{a}" --> "{}".format(a) "{}".format(a) (cannot concatenate)
- PEP-498 requires the full parser to understand the interpolated
argument, however because this entirely runs in the lexer it cannot
resolve nested braces in expressions like
f"{'}'}"
- The !r, !s, and !a conversions are not supported.
Includes tests and cpydiffs.
Signed-off-by: Jim Mussared <jim.mussared@gmail.com>
The syntax matches CPython and the semantics are equivalent except that,
unlike CPython, MicroPython allows using := to assign to comprehension
iteration variables, because disallowing this would take a lot of code to
check for it.
The new compile-time option MICROPY_PY_ASSIGN_EXPR selects this feature and
is enabled by default, following MICROPY_PY_ASYNC_AWAIT.
To make progress towards MicroPython supporting Python 3.5, adding the
matmul operator is important because it's a really "low level" part of the
language, being a new token and modifications to the grammar.
It doesn't make sense to make it configurable because 1) it would make the
grammar and lexer complicated/messy; 2) no other operators are
configurable; 3) it's not a feature that can be "dynamically plugged in"
via an import.
And matmul can be useful as a general purpose user-defined operator, it
doesn't have to be just for numpy use.
Based on work done by Jim Mussared.
The code conventions suggest using header guards, but do not define how
those should look like and instead point to existing files. However, not
all existing files follow the same scheme, sometimes omitting header guards
altogether, sometimes using non-standard names, making it easy to
accidentally pick a "wrong" example.
This commit ensures that all header files of the MicroPython project (that
were not simply copied from somewhere else) follow the same pattern, that
was already present in the majority of files, especially in the py folder.
The rules are as follows.
Naming convention:
* start with the words MICROPY_INCLUDED
* contain the full path to the file
* replace special characters with _
In addition, there are no empty lines before #ifndef, between #ifndef and
one empty line before #endif. #endif is followed by a comment containing
the name of the guard macro.
py/grammar.h cannot use header guards by design, since it has to be
included multiple times in a single C file. Several other files also do not
need header guards as they are only used internally and guaranteed to be
included only once:
* MICROPY_MPHALPORT_H
* mpconfigboard.h
* mpconfigport.h
* mpthreadport.h
* pin_defs_*.h
* qstrdefs*.h
Previous to this patch there was an explicit check for errors with line
continuation (where backslash was not immediately followed by a newline).
But this check is not necessary: if there is an error then the remaining
logic of the tokeniser will reject the backslash and correctly produce a
syntax error.
Since the table of keywords is sorted, we can use strcmp to do the search
and stop part way through the search if the comparison is less-than.
Because all tokens that are names are subject to this search, this
optimisation will improve the overall speed of the lexer when processing
a script.
The change also decreases code size by a little bit because we now use
strcmp instead of the custom str_strn_equal function.
They are sugar for marking function as generator, "yield from"
and pep492 python "semantically equivalents" respectively.
@dpgeorge was the original author of this patch, but @pohmelie made
changes to implement `async for` and `async with`.
Previous to this patch, a big-int, float or imag constant was interned
(made into a qstr) and then parsed at runtime to create an object each
time it was needed. This is wasteful in RAM and not efficient. Now,
these constants are parsed straight away in the parser and turned into
objects. This allows constants with large numbers of digits (so
addresses issue #1103) and takes us a step closer to #722.
This patch consolidates all global variables in py/ core into one place,
in a global structure. Root pointers are all located together to make
GC tracing easier and more efficient.
mp_lexer_t type is exposed, mp_token_t type is removed, and simple lexer
functions (like checking current token kind) are now inlined.
This saves 784 bytes ROM on 32-bit unix, 348 bytes on stmhal, and 460
bytes on bare-arm. It also saves a tiny bit of RAM since mp_lexer_t
is a bit smaller. Also will run a bit more efficiently.
Blanket wide to all .c and .h files. Some files originating from ST are
difficult to deal with (license wise) so it was left out of those.
Also merged modpyb.h, modos.h, modstm.h and modtime.h in stmhal/.
A big change. Micro Python objects are allocated as individual structs
with the first element being a pointer to the type information (which
is itself an object). This scheme follows CPython. Much more flexible,
not necessarily slower, uses same heap memory, and can allocate objects
statically.
Also change name prefix, from py_ to mp_ (mp for Micro Python).