diff --git a/guests/python/DEVELOPMENT.md b/guests/python/DEVELOPMENT.md
new file mode 100644
index 0000000..967e3cd
--- /dev/null
+++ b/guests/python/DEVELOPMENT.md
@@ -0,0 +1,61 @@
+# Development Notes
+## Execution Models
+### Embedded VM
+#### Pyodide
+Website: .
+
+Pros:
+- Supports loads of dependencies
+- Runs in the browser
+
+Cons:
+- Doesn't seem to be working with freestanding WASM runtimes / servers, esp. not without Node.js
+
+#### Official CPython WASM Builds
+Links:
+-
+-
+-
+-
+
+Pros:
+- Official project, so it has a somewhat stable future and it is easier to get buy-in from the community
+
+Cons:
+- Can only run as a WASI CLI-like app (so we would need to interact with it via stdio or a fake network)
+- Currently only offered as wasip1
+
+#### pyo3 + Official CPython WASM Builds
+Instead of using stdio to drive a Python interpreter, we use [pyo3].
+
+Pros:
+- We can interact w/ Python more efficiently.
+
+Cons:
+- Needs pre-released Python 3.14, because 3.13 seems to rely on "thread parking", which is implemented as WASM exceptions, which are not supported by wasmtime yet. Relevant code is .
+
+#### webassembly-language-runtimes
+Website:
+
+This was formerly a VMware project.
+
+Cons:
+- Seems dead?
+
+### Ahead-of-Time Compilation
+This is only going to work if
+
+- the ahead-of-time compiler itself is lightweight enough to be embedded within a database (esp. it should not call to some random C host toolchain)
+- the Python compiler/transpiler is solid and supports enough features
+
+#### componentize-py
+Website:
+
+#### py2wasm
+Website:
+
+### Other Notes
+-
+
+
+[pyo3]: https://pyo3.rs/
diff --git a/guests/python/README.md b/guests/python/README.md
index c7950da..5f98310 100644
--- a/guests/python/README.md
+++ b/guests/python/README.md
@@ -13,63 +13,195 @@ or
 just release
 ```
 
-## Execution Models
-### Embedded VM
-#### Pyodide
-Website: .
+## Python Version
+We currently bundle [Python 3.14.0rc2].
+
+## Python Standard Library
+In contrast to a normal Python installation, there are a few notable public[^public] modules **missing** from the [Python Standard Library]:
+
+- [`curses`](https://docs.python.org/3/library/curses.html)
+- [`ensurepip`](https://docs.python.org/3/library/ensurepip.html)
+- [`fcntl`](https://docs.python.org/3/library/fcntl.html)
+- [`grp`](https://docs.python.org/3/library/grp.html)
+- [`idlelib`](https://docs.python.org/3/library/idle.html)
+- [`mmap`](https://docs.python.org/3/library/mmap.html)
+- [`multiprocessing`](https://docs.python.org/3/library/multiprocessing.html)
+- [`pip`](https://pip.pypa.io/)
+- [`pwd`](https://docs.python.org/3/library/pwd.html)
+- [`readline`](https://docs.python.org/3/library/readline.html)
+- [`resource`](https://docs.python.org/3/library/resource.html)
+- [`syslog`](https://docs.python.org/3/library/syslog.html)
+- [`termios`](https://docs.python.org/3/library/termios.html)
+- [`tkinter`](https://docs.python.org/3/library/tkinter.html)
+- [`turtledemo`](https://docs.python.org/3/library/turtle.html#module-turtledemo)
+- [`venv`](https://docs.python.org/3/library/venv.html)
+- [`zlib`](https://docs.python.org/3/library/zlib.html)
+
+Some low-level modules like [`os`](https://docs.python.org/3/library/os.html) may not offer all methods, types, and constants.
+
+## Dependencies
+We do not bundle any additional libraries at the moment. It is currently NOT possible to install your own dependencies.
+
+## Methods
+Currently we only support [Scalar UDF]s. You can write one as a simple Python function:
+
+```python
+def add_one(x: int) -> int:
+    return x + 1
+```
+
+You may register multiple methods in one Python source text. Imported methods and private methods starting with `_` are ignored.
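As a rough illustration of the registration rule above, the sketch below mimics it in plain Python. Everything here is hypothetical: the `SOURCE` text, the `udf_source` module name, and the `public_functions` helper are made up for this example and are only an assumption about how such a filter could behave, not the guest's actual implementation.

```python
import types

# Hypothetical UDF source text: two public methods, one private helper,
# and one imported function.
SOURCE = '''
from textwrap import dedent

def _scale(x: int, factor: int) -> int:
    return x * factor

def double(x: int) -> int:
    return _scale(x, 2)

def triple(x: int) -> int:
    return _scale(x, 3)
'''


def public_functions(source: str) -> list[str]:
    """Return the names a filter like the one described above might keep.

    Assumption: a function counts as "imported" when its `__module__`
    differs from the (made-up) module name the source is executed under.
    """
    namespace = {"__name__": "udf_source"}
    exec(source, namespace)
    return sorted(
        name
        for name, obj in namespace.items()
        if isinstance(obj, types.FunctionType)
        and obj.__module__ == "udf_source"  # skip imported functions
        and not name.startswith("_")  # skip private helpers
    )


print(public_functions(SOURCE))
```

Under these assumptions only `double` and `triple` are kept; `_scale` is skipped as private and `dedent` as imported.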
+
+## Types
+Types are mapped to/from [Apache Arrow] as follows:
+
+| Python       | Arrow       |
+| ------------ | ----------- |
+| [`bool`]     | [`Boolean`] |
+| [`datetime`] | [`Timestamp`] w/ [`Microsecond`] and NO timezone |
+| [`float`]    | [`Float64`] |
+| [`int`]      | [`Int64`]   |
+| [`str`]      | [`Utf8`]    |
+
+Additional types may be supported in the future.
+
+## NULLs
+NULLs are rather common in database contexts and are first-class citizens in [Apache Arrow] and [Apache DataFusion]. If you do not want to deal with them, just define your method with simple scalar types and we will skip NULL rows for you:
+
+```python
+def add_simple(x: int, y: int) -> int:
+    return x + y
+```
 
-Pros:
-- Supports loads of dependencies
-- Runs in the browser
+However, you can opt into full NULL handling. In Python, NULLs are expressed as optionals:
 
-Cons:
-- Doesn't seem to be working with freestanding WASM runtimes / servers, esp. not without Node.js
+```python
+def add_nulls(x: int | None, y: int | None) -> int | None:
+    if x is None or y is None:
+        return None
+    return x + y
+```
 
-#### Official CPython WASM Builds
-Links:
--
--
--
--
+or via the older syntax:
 
-Pros:
-- Official project, so it has a somewhat stable future and it is easier to get buy-in from the community
+```python
+from typing import Optional
 
-Cons:
-- Can only run as a WASI CLI-like app (so we would need to interact with it via stdio or a fake network)
-- Currently only offered as wasip1
+def add_old(x: Optional[int], y: Optional[int]) -> Optional[int]:
+    if x is None or y is None:
+        return None
+    return x + y
+```
 
-#### pyo3 + Official CPython WASM Builds
-Instead of using stdio to drive a Python interpreter, we use [pyo3].
+You may also partially opt into NULL handling for one parameter:
 
-Pros:
-- We can interact w/ Python more efficiently.
+```python
+def add_left(x: int | None, y: int) -> int | None:
+    if x is None:
+        return None
+    return x + y
 
-Cons:
-- Needs pre-released Python 3.14, because 3.13 seems to rely on "thread parking", which is implemented as WASM exceptions, which are not supported by wasmtime yet. Relevant code is .
+def add_right(x: int, y: int | None) -> int | None:
+    if y is None:
+        return None
+    return x + y
+```
 
-#### webassembly-language-runtimes
-Website:
+Note that if you define the return type as non-optional, you MUST NOT return `None`. Otherwise, the execution will fail.
 
-This was formally a VMWare project.
+To give you a better idea of when a Python method is called, consult this table:
 
-Cons:
-- Seems dead?
+| `x`    | `y`    | `add_simple` | `add_nulls` | `add_left` | `add_right` |
+| ------ | ------ | ------------ | ----------- | ---------- | ----------- |
+| `None` | `None` | 𐄂            | ✓           | 𐄂          | 𐄂           |
+| `None` | some   | 𐄂            | ✓           | ✓          | 𐄂           |
+| some   | `None` | 𐄂            | ✓           | 𐄂          | ✓           |
+| some   | some   | ✓            | ✓           | ✓          | ✓           |
 
-### Ahead-of-Time Compilation
-This is only going to work if
+You may find this feature helpful when you want to control default values for NULLs:
 
-- the ahead-of-time compiler itself is lightweight enough to be embedded within a database (esp. it should not call to some random C host toolchain)
-- the Python compiler/transpiler is solid and supports enough features
+```python
+def half(x: float | None) -> float:
+    # zero might be a sensible default
+    if x is None:
+        return 0.0
 
-#### componentize-py
-Website:
+    return x / 2.0
+```
+
+or if you want to turn a value into NULL:
 
-#### py2wasm
-Website:
+```python
+def add_one_limited(x: int) -> int | None:
+    # do not go beyond 100
+    if x >= 100:
+        return None
 
-### Other Notes
--
+    return x + 1
+```
 
+## Default Parameters and Kwargs
+Default parameters, `*args`, keyword-only parameters, and `**kwargs` are currently NOT supported. So these methods will be rejected:
+
+```python
+def m1(x: int = 1) -> int:
+    return x + 1
+
+def m2(*x: int) -> int:
+    return x + 1
+
+def m3(*, x: int) -> int:
+    return x + 1
+
+def m4(**x: int) -> int:
+    return x + 1
+```
+
+## State
+We give no guarantees on the lifetime of the Python VM, but you may use state in your Python methods for performance reasons (e.g. to cache results):
+
+```python
+_cache = {}
+
+def compute(x: int) -> int:
+    try:
+        return _cache[x]
+    except KeyError:
+        y = x * 100
+        _cache[x] = y
+        return y
+```
+
+You may also use a builtin solution like [`functools.cache`]:
+
+```python
+from functools import cache
+
+@cache
+def compute(x: int) -> int:
+    return x * 100
+```
 
-[pyo3]: https://pyo3.rs/
+## I/O
+There is NO I/O available that escapes the sandbox. The [Python Standard Library] is mounted as a read-only filesystem.
+
+
+[^public]: Modules not starting with a `_`.
+
+[Apache Arrow]: https://arrow.apache.org/
+[Apache DataFusion]: https://datafusion.apache.org/
+[`bool`]: https://docs.python.org/3/library/stdtypes.html#boolean-type-bool
+[`Boolean`]: https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#variant.Boolean
+[`datetime`]: https://docs.python.org/3/library/datetime.html#datetime.datetime
+[`float`]: https://docs.python.org/3/library/stdtypes.html#numeric-types-int-float-complex
+[`Float64`]: https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#variant.Float64
+[`functools.cache`]: https://docs.python.org/3/library/functools.html#functools.cache
+[`int`]: https://docs.python.org/3/library/stdtypes.html#numeric-types-int-float-complex
+[`Int64`]: https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#variant.Int64
+[`Microsecond`]: https://docs.rs/arrow/latest/arrow/datatypes/enum.TimeUnit.html#variant.Microsecond
+[Python 3.14.0rc2]: https://www.python.org/downloads/release/python-3140rc2/
+[Python Standard Library]: https://docs.python.org/3/library/index.html
+[Scalar UDF]: https://docs.rs/datafusion/latest/datafusion/logical_expr/struct.ScalarUDF.html
+[`str`]: https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str
+[`Timestamp`]: https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#variant.Timestamp
+[`Utf8`]: https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#variant.Utf8
diff --git a/host/tests/integration_tests/python/runtime/dependencies.rs b/host/tests/integration_tests/python/runtime/dependencies.rs
index c33ff66..ed984b2 100644
--- a/host/tests/integration_tests/python/runtime/dependencies.rs
+++ b/host/tests/integration_tests/python/runtime/dependencies.rs
@@ -42,3 +42,37 @@ def foo(x: int) -> int:
         &Int64Array::from_iter([Some(12), Some(23), Some(34)]) as &dyn Array,
     );
 }
+
+#[tokio::test(flavor = "multi_thread")]
+async fn functools_cache() {
+    const CODE: &str = "
+from functools import cache
+
+_counter = 0
+
+@cache
+def foo(x: int) -> int:
+    global _counter
+    _counter += 1
+    return x + _counter
+";
+
+    let udf = python_scalar_udf(CODE).await.unwrap();
+    let array = udf
+        .invoke_with_args(ScalarFunctionArgs {
+            args: vec![ColumnarValue::Array(Arc::new(Int64Array::from_iter([
+                Some(10),
+                Some(20),
+                Some(10),
+            ])))],
+            arg_fields: vec![Arc::new(Field::new("a1", DataType::Int64, true))],
+            number_rows: 3,
+            return_field: Arc::new(Field::new("r", DataType::Int64, true)),
+        })
+        .unwrap()
+        .unwrap_array();
+    assert_eq!(
+        array.as_ref(),
+        &Int64Array::from_iter([Some(11), Some(22), Some(11)]) as &dyn Array,
+    );
+}