61 changes: 61 additions & 0 deletions guests/python/DEVELOPMENT.md
I've extracted this from the README since it's usually not important for most people.

@@ -0,0 +1,61 @@
# Development Notes
## Execution Models
### Embedded VM
#### Pyodide
Website: <https://pyodide.org/>.

Pros:
- Supports loads of dependencies
- Runs in the browser

Cons:
- Does not seem to work with freestanding WASM runtimes / servers, esp. not without Node.js

#### Official CPython WASM Builds
Links:
- <https://github.com/python/cpython/tree/main/Tools/wasm>
- <https://devguide.python.org/getting-started/setup-building/#wasi>
- <https://github.com/psf/webassembly>
- <https://github.com/brettcannon/cpython-wasi-build/releases>

Pros:
- Official project, so it has a somewhat stable future and it is easier to get buy-in from the community

Cons:
- Can only run as a WASI CLI-like app (so we would need to interact with it via stdio or a fake network)
- Currently only offered as wasip1
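
To make the "interact with it via stdio" con concrete, here is a minimal sketch of what driving an interpreter over stdio looks like. It is an illustration only: the child process is the host Python started via `sys.executable`, standing in for a WASI CLI build, and the JSON-lines protocol, `CHILD`, and `call_over_stdio` are hypothetical names invented for this example.

```python
import json
import subprocess
import sys

# Hypothetical child program: reads one JSON request per line from stdin
# and writes one JSON reply per line to stdout.
CHILD = r'''
import json, sys
for line in sys.stdin:
    req = json.loads(line)
    print(json.dumps({"result": req["x"] + 1}), flush=True)
'''


def call_over_stdio(values: list[int]) -> list[int]:
    """Send every value to the child interpreter and collect the replies."""
    payload = "".join(json.dumps({"x": v}) + "\n" for v in values)
    proc = subprocess.run(
        [sys.executable, "-c", CHILD],
        input=payload,
        capture_output=True,
        text=True,
        check=True,
    )
    return [json.loads(line)["result"] for line in proc.stdout.splitlines()]
```

Every value is serialized, piped, and parsed again on both sides, which is the overhead the pyo3 approach avoids.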

#### pyo3 + Official CPython WASM Builds
Instead of using stdio to drive a Python interpreter, we use [pyo3].

Pros:
- We can interact w/ Python more efficiently.

Cons:
- Needs a pre-release of Python 3.14, because 3.13 seems to rely on "thread parking", which is implemented via WASM exceptions that wasmtime does not support yet. Relevant code is <https://github.com/PyO3/pyo3/blob/52554ce0a33321893af17577a3ea0d179ad1b563/pyo3-ffi/src/pystate.rs#L87-L94>.

#### webassembly-language-runtimes
Website: <https://github.com/webassemblylabs/webassembly-language-runtimes>

This was formerly a VMware project.

Cons:
- Seems dead?

### Ahead-of-Time Compilation
This is only going to work if

- the ahead-of-time compiler itself is lightweight enough to be embedded within a database (esp. it should not call out to some random C host toolchain)
- the Python compiler/transpiler is solid and supports enough features

#### componentize-py
Website: <https://github.com/bytecodealliance/componentize-py>

#### py2wasm
Website: <https://github.com/wasmerio/py2wasm>

### Other Notes
- <https://wasmlabs.dev/articles/python-wasm-rust/>


[pyo3]: https://pyo3.rs/
216 changes: 174 additions & 42 deletions guests/python/README.md
@@ -13,63 +13,195 @@ or
just release
```

## Python Version
We currently bundle [Python 3.14.0rc2].

## Python Standard Library
In contrast to a normal Python installation, there are a few notable public[^public] modules **missing** from the [Python Standard Library]:

- [`curses`](https://docs.python.org/3/library/curses.html)
- [`ensurepip`](https://docs.python.org/3/library/ensurepip.html)
- [`fcntl`](https://docs.python.org/3/library/fcntl.html)
- [`grp`](https://docs.python.org/3/library/grp.html)
- [`idlelib`](https://docs.python.org/3/library/idle.html)
- [`mmap`](https://docs.python.org/3/library/mmap.html)
- [`multiprocessing`](https://docs.python.org/3/library/multiprocessing.html)
- [`pip`](https://pip.pypa.io/)
- [`pwd`](https://docs.python.org/3/library/pwd.html)
- [`readline`](https://docs.python.org/3/library/readline.html)
- [`resource`](https://docs.python.org/3/library/resource.html)
- [`syslog`](https://docs.python.org/3/library/syslog.html)
- [`termios`](https://docs.python.org/3/library/termios.html)
- [`tkinter`](https://docs.python.org/3/library/tkinter.html)
- [`turtledemo`](https://docs.python.org/3/library/turtle.html#module-turtledemo)
- [`venv`](https://docs.python.org/3/library/venv.html)
- [`zlib`](https://docs.python.org/3/library/zlib.html)

Some low-level modules like [`os`](https://docs.python.org/3/library/os.html) may not offer all methods, types, and constants.
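
If your code needs to adapt to a missing module or a missing `os` function, feature detection is one option. The following is a minimal sketch using only standard-library calls; the function names `module_available` and `os_feature_available` are made up for this example.

```python
import importlib.util
import os


def module_available(name: str) -> bool:
    # True if the import system can locate the top-level module `name`.
    return importlib.util.find_spec(name) is not None


def os_feature_available(name: str) -> bool:
    # True if the `os` module exposes the attribute `name` in this build.
    return hasattr(os, name)
```

For example, `module_available("zlib")` lets a method pick a fallback instead of failing on `import zlib`.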

## Dependencies
We do not bundle any additional libraries at the moment. It is currently NOT possible to install your own dependencies.

## Methods
Currently we only support [Scalar UDF]s. You can write one as a plain Python function:

```python
def add_one(x: int) -> int:
    return x + 1
```

You may register multiple methods in one Python source text. Imported methods and private methods starting with `_` are ignored.
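
As an illustration of these rules, the hypothetical source text below would, per the description above, register `hypotenuse` and `double`, while the imported `sqrt` and the private `_square` are ignored:

```python
from math import sqrt  # imported name: ignored


def _square(x: float) -> float:
    # Private helper (leading underscore): ignored.
    return x * x


def hypotenuse(a: float, b: float) -> float:
    # Public function: registered.
    return sqrt(_square(a) + _square(b))


def double(x: int) -> int:
    # Public function: registered alongside `hypotenuse`.
    return x * 2
```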

## Types
Types are mapped to/from [Apache Arrow] as follows:

| Python | Arrow |
| ------------ | ----------- |
| [`bool`] | [`Boolean`] |
| [`datetime`] | [`Timestamp`] w/ [`Microsecond`] and NO timezone |
| [`float`] | [`Float64`] |
| [`int`] | [`Int64`] |
| [`str`] | [`Utf8`] |

Additional types may be supported in the future.
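
A short sketch of methods that use the less obvious `datetime` mapping. The function names are hypothetical. Both work on naive `datetime` values, which matches a timestamp without a timezone:

```python
from datetime import datetime, timedelta


def next_day(ts: datetime) -> datetime:
    # Naive datetime in, naive datetime out; microseconds are preserved.
    return ts + timedelta(days=1)


def is_weekend(ts: datetime) -> bool:
    # weekday() returns 5 for Saturday and 6 for Sunday.
    return ts.weekday() >= 5
```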

## NULLs
NULLs are rather common in database contexts and are first-class citizens in [Apache Arrow] and [Apache DataFusion]. If you do not want to deal with them, just define your method with simple scalar types and we will skip NULL rows for you:

```python
def add_simple(x: int, y: int) -> int:
    return x + y
```

However, you can opt into full NULL handling. In Python, NULLs are expressed as optionals:

```python
def add_nulls(x: int | None, y: int | None) -> int | None:
    if x is None or y is None:
        return None
    return x + y
```

or via the older syntax:

```python
from typing import Optional

def add_old(x: Optional[int], y: Optional[int]) -> Optional[int]:
    if x is None or y is None:
        return None
    return x + y
```

You may also partially opt into NULL handling for one parameter:

```python
def add_left(x: int | None, y: int) -> int | None:
    if x is None:
        return None
    return x + y

def add_right(x: int, y: int | None) -> int | None:
    if y is None:
        return None
    return x + y
```

Note that if you define the return type as non-optional, you MUST NOT return `None`. Otherwise, the execution will fail.

To give you a better idea of when a Python method is called, consult this table:

| `x` | `y` | `add_simple` | `add_nulls` | `add_left` | `add_right` |
| ------ | ------ | ------------ | ----------- | ---------- | ----------- |
| `None` | `None` | 𐄂 | ✓ | 𐄂 | 𐄂 |
| `None` | some | 𐄂 | ✓ | ✓ | 𐄂 |
| some | `None` | 𐄂 | ✓ | 𐄂 | ✓ |
| some | some | ✓ | ✓ | ✓ | ✓ |

You may find this feature helpful when you want to control default values for NULLs:

```python
def half(x: float | None) -> float:
    # zero might be a sensible default
    if x is None:
        return 0.0

    return x / 2.0
```

or if you want to turn a value into NULL:

```python
def add_one_limited(x: int) -> int | None:
    # do not go beyond 100
    if x >= 100:
        return None

    return x + 1
```

## Default Parameters and Kwargs
Default parameters, `*args`, and `**kwargs` are currently NOT supported, so these methods will be rejected:

```python
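# Rejected: default parameter value.
def m1(x: int = 1) -> int:
    return x + 1

# Rejected: variadic positional parameters (`*args`).
def m2(*x: int) -> int:
    return sum(x) + 1

# Rejected: keyword-only parameter.
def m3(*, x: int) -> int:
    return x + 1

# Rejected: variadic keyword parameters (`**kwargs`).
def m4(**x: int) -> int:
    return sum(x.values()) + 1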
```

## State
We give no guarantees on the lifetime of the Python VM, but you may use state in your Python methods for performance reasons (e.g. to cache results):

```python
_cache = {}

def compute(x: int) -> int:
    try:
        return _cache[x]
    except KeyError:
        y = x * 100
        _cache[x] = y
        return y
```

You may also use a builtin solution like [`functools.cache`]:

```python
from functools import cache

@cache
def compute(x: int) -> int:
return x * 100
```
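
If unbounded growth of the cache is a concern, the standard library also offers `functools.lru_cache` with a size limit. This is a sketch, assuming that a method decorated with `lru_cache` is handled the same way as one decorated with `cache`; the limit of 1024 entries is an arbitrary choice for illustration:

```python
from functools import lru_cache


@lru_cache(maxsize=1024)
def compute(x: int) -> int:
    # Results for at most 1024 distinct inputs are kept; the least
    # recently used entry is evicted first.
    return x * 100
```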

## I/O
There is NO I/O available that escapes the sandbox. The [Python Standard Library] is mounted as a read-only filesystem.


[^public]: Modules not starting with a `_`.

[Apache Arrow]: https://arrow.apache.org/
[Apache DataFusion]: https://datafusion.apache.org/
[`bool`]: https://docs.python.org/3/library/stdtypes.html#boolean-type-bool
[`Boolean`]: https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#variant.Boolean
[`datetime`]: https://docs.python.org/3/library/datetime.html#datetime.datetime
[`float`]: https://docs.python.org/3/library/stdtypes.html#numeric-types-int-float-complex
[`Float64`]: https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#variant.Float64
[`functools.cache`]: https://docs.python.org/3/library/functools.html#functools.cache
[`int`]: https://docs.python.org/3/library/stdtypes.html#numeric-types-int-float-complex
[`Int64`]: https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#variant.Int64
[`Microsecond`]: https://docs.rs/arrow/latest/arrow/datatypes/enum.TimeUnit.html#variant.Microsecond
[Python 3.14.0rc2]: https://www.python.org/downloads/release/python-3140rc2/
[Python Standard Library]: https://docs.python.org/3/library/index.html
[Scalar UDF]: https://docs.rs/datafusion/latest/datafusion/logical_expr/struct.ScalarUDF.html
[`str`]: https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str
[`Timestamp`]: https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#variant.Timestamp
[`Utf8`]: https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#variant.Utf8
34 changes: 34 additions & 0 deletions host/tests/integration_tests/python/runtime/dependencies.rs
@@ -42,3 +42,37 @@ def foo(x: int) -> int:
        &Int64Array::from_iter([Some(12), Some(23), Some(34)]) as &dyn Array,
    );
}

#[tokio::test(flavor = "multi_thread")]
async fn functools_cache() {
    const CODE: &str = "
from functools import cache

_counter = 0

@cache
def foo(x: int) -> int:
    global _counter
    _counter += 1
    return x + _counter
";

    let udf = python_scalar_udf(CODE).await.unwrap();
    let array = udf
        .invoke_with_args(ScalarFunctionArgs {
            args: vec![ColumnarValue::Array(Arc::new(Int64Array::from_iter([
                Some(10),
                Some(20),
                Some(10),
            ])))],
            arg_fields: vec![Arc::new(Field::new("a1", DataType::Int64, true))],
            number_rows: 3,
            return_field: Arc::new(Field::new("r", DataType::Int64, true)),
        })
        .unwrap()
        .unwrap_array();
    assert_eq!(
        array.as_ref(),
        &Int64Array::from_iter([Some(11), Some(22), Some(11)]) as &dyn Array,
    );
}