The null byte cannot exist in source code (per CPython), so use it to
indicate the end of the input stream (instead of `(mp_uint_t)-1`). This
allows the cache chars (chr0/1/2 and their saved versions) to be 8-bit
bytes, making it clear that they are not `unichar` values. It also saves a
bit of memory in the `mp_lexer_t` data structure. (And in a future commit
allows the saved cache chars to be eliminated entirely by storing them in
a vstr instead.)
In order to keep code size down, the frequently used `chr0` is still of
type `uint32_t`. Having it 32-bit means that machine instructions to load
it are smaller (it adds about +80 bytes to Thumb code if `chr0` is changed
to `uint8_t`).
Also add tests for invalid bytes in the input stream to make sure there are
no regressions in this regard.
Signed-off-by: Damien George <damien@micropython.org>
This includes making int("01") parse in base 10 like standard Python.
When a base of 0 is specified it means auto-detect based on the prefix, and
literals begining with 0 (except when the literal is all 0's) like "01" are
then invalid and now throw an exception.
The new error message is different from CPython. It says e.g.,
`SyntaxError: invalid syntax for integer with base 0: '09'`
Additional test cases were added to cover the changed & added code.
Co-authored-by: Damien George <damien@micropython.org>
Signed-off-by: Jeff Epler <jepler@gmail.com>
Tests for an issue with line continuation failing in paste mode due to the
lexer only checking for \n in the "following" character position, before
next_char() has had a chance to convert \r and \r\n to \n.