- Overview
- Getting Started
- Why Not Protobufs?
- No Copy
- Supported Types
- Inheritance
- Msgpack
- Namespaces
- Generated Code
- Other Languages
[WORK IN PROGRESS!!!]
Have you ever been working in Python and wanted to use dataclasses or pydantic objects for your data model, but been stuck with an existing data representation? This could arise in a few ways:
- Ill-conceived classes, with getFoo(), setFoo() ... or worse ... getAttribute(), setAttribute() type methods, created by misguided Java programmers, accessing an opaque data schema
- Hideous, non-python generated code (see protobufs and Avro), which you would like to wrap but keep the internal data structure
- A data representation in another language, e.g. a C++ object, exposed via pybind, that you would like to appear as a dataclass
If none of this rings any bells, then read no further and spare your braincells!
First of all, it's worth noting the motivation for this project. Dataclasses, pydantic objects and MsgStruct give you access to some pretty nice functionality (such as JSON schemas and serialisation) and interact seamlessly with countless other packages. What I wanted was a way to define my data model as dataclasses/pydantic (maybe MsgStruct in future) and have it redirect attribute access through to some existing data representation.
Sounds pretty easy, right: python is infinitely flexible? Yes and no - it is the degrees of flexibility and the different ways that attribute access can occur which make things tricky.
Getting and setting attributes in python can happen via __getattribute__() and __setattr__(). Under the covers these functions do either:
- Find a descriptor on the class object and call __get__ or __set__
- For an ordinary field, set via the class instance's __dict__
It is also possible to directly alter the object instance's __dict__:
- Set __dict__, e.g. foo.__dict__ = {"a": 1, "b": "foo"}
- Mutate __dict__, e.g. foo.__dict__["a"] = 2
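To make these routes concrete, here is a small sketch. The class Foo and the calls log are purely illustrative and not part of this project. It shows that ordinary assignment goes through __setattr__(), whereas mutating __dict__ bypasses it entirely and replacing __dict__ is only seen as a single write to the name "__dict__", never as writes to the individual fields:

```python
calls = []  # illustrative log of attribute names seen by __setattr__


class Foo:
    def __setattr__(self, name, value):
        calls.append(name)
        object.__setattr__(self, name, value)


foo = Foo()
foo.a = 1                            # routed through __setattr__("a", 1)
foo.__dict__["a"] = 2                # mutates the instance dict; __setattr__ is not called
foo.__dict__ = {"a": 3, "b": "foo"}  # one __setattr__ call, for the name "__dict__" only

print(calls)         # the individual fields "a" and "b" are not logged for the last two writes
print(foo.a, foo.b)
```

This is why intercepting __setattr__() alone is not enough if every write has to reach an external representation.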
After a lot of experimentation and much infinite recursion, I decided on the following approach:
- Create a metaclass to replace all the model fields with descriptors per field (to handle object.__getattribute__()/__setattr__() usage)
- Add a descriptor for __dict__
- Create a small derivation of __dict__, which redirects __getitem__()/__setitem__() calls for model fields
This seems to cover all cases.
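The first step of that approach could look roughly like the sketch below. It is not the project's actual implementation: Backing, RedirectedField, RedirectMeta and MyModel are hypothetical names, Backing stands in for whatever existing representation you are stuck with, and the __dict__ handling from the second and third steps is left out for brevity.

```python
class Backing:
    """Hypothetical existing representation with getter/setter style access."""

    def __init__(self):
        self._values = {}

    def get_attribute(self, name):
        return self._values[name]

    def set_attribute(self, name, value):
        self._values[name] = value


class RedirectedField:
    """Data descriptor that forwards one field to the backing object."""

    def __init__(self, name):
        self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self  # accessed on the class itself
        return instance._backing.get_attribute(self.name)

    def __set__(self, instance, value):
        instance._backing.set_attribute(self.name, value)


class RedirectMeta(type):
    """Metaclass that swaps every annotated field for a RedirectedField."""

    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        for field in getattr(cls, "__annotations__", {}):
            setattr(cls, field, RedirectedField(field))
        return cls


class MyModel(metaclass=RedirectMeta):
    my_int: int
    my_string: str

    def __init__(self, backing):
        self._backing = backing  # not annotated, so stored on the instance as usual


backing = Backing()
model = MyModel(backing)
model.my_int = 123                           # lands in the backing object
backing.set_attribute("my_string", "hello")  # visible through the model
print(model.my_int, model.my_string)
```

Because the descriptors define __set__, they take precedence over the instance's own __dict__, so the field values never end up stored on the python object itself.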
[ANYTHING BELOW HERE IS OLD AND NEEDS TO BE UPDATED]
pydantic_bind adds a custom cmake rule: pydantic_bind_add_package(<package path>)
This rule will do the following:
- scan for sub-packages
- scan each sub-package for all .py files
- add custom steps for generating .cpp/.h files from any of the following, encountered in the .py files:
  - dataclasses
  - classes derived from pydantic's BaseModel
  - enums
C++ directory and namespace structure will match the python package structure (see Namespaces).
You can create an instance of the pybind11 class from your original using get_pybind_value(), e.g.,
my_class.py:
from dataclasses import dataclass
@dataclass
class MyClass:
    my_int: int
    my_string: str | None
CMakeLists.txt:
cmake_minimum_required(VERSION 3.9)
project(my_project)
set(CMAKE_CXX_STANDARD 20)
find_package(Python3 REQUIRED COMPONENTS Interpreter Development)
find_package(pydantic_bind REQUIRED COMPONENTS HINTS "${Python3_SITELIB}")
pydantic_bind_add_package(my_package)
my_util.py:
from pydantic_bind import get_pybind_value
from my_package.my_class import MyClass
orig = MyClass(my_int=123, my_string="hello")
generated = get_pybind_value(orig)
print(f"my_int: {orig.my_int}, {generated.my_int}")
I personally find protobufs to be a PITA to use: they have poor to no variant support, the generated code is ugly and idiosyncratic, they're large and painful to copy around etc.
AVRO is more friendly but generates python classes dynamically, which confuses IDEs like PyCharm. I do think a good solution is something like pydantic_avro, where one can define the classes using pydantic, generate the AVRO schema and then the generated C++ etc. I might well try and converge this project with that approach.
I was inspired to some degree by this blog.
One annoyance of multi-language representations of data objects is that you often end up copying data around where you'd prefer to share a single copy. This is the raison d'etre for Protobufs and its ilk. In this project I've created implementations of BaseModel and dataclass which allow python to use the underlying C++ data representation, rather than holding its own copy.
Deriving from this BaseModel will give you functionality equivalent to pydantic's BaseModel. The annotations are re-written using computed_field, with property getters and setters operating on the generated pybind class, which is instantiated behind the scenes in __init__. Note that this will make some operations (especially those that access __dict__) less efficient. I've also plumbed the computed fields into the JSON schema, so these objects can be used with FastAPI.
dataclass works similarly, adding properties to the dataclass, so that the existing get and set functionality works seamlessly in accessing the generated pybind11 class (also set via a shimmed __init__).
Using regular dataclass or BaseModel as members of classes defined with the pydantic_bind versions is very inefficient and not recommended.
The following python -> C++ mappings are supported (there are likely others I should consider):
- bool --> bool
- float --> double
- int --> int
- str --> std::string
- datetime.date --> std::chrono::system_clock::time_point
- datetime.datetime --> std::chrono::system_clock::time_point
- datetime.time --> std::chrono::system_clock::time_point
- datetime.timedelta --> std::chrono::duration
- pydantic.BaseModel --> struct
- pydantic_bind.BaseModel --> struct
- dataclass --> struct
- pydantic_bind.dataclass --> struct
- Enum --> enum class
I have tested single inheritance (see Generated Code). Multiple inheritance may work ... or it may not. I'd generally advise against using it for data classes.
A rather rudimentary msgpack implementation is added to the generated C++ structs, using a slightly modified version of cpppack. It wasn't clear to me whether this package is maintained or accepting submissions, so I copied and slightly modified msgpack.h (also, I couldn't work out how to add it to my project with my rather rudimentary cmake skillz!) Changes include:
- Fixing includes
- Support for std::optional
- Support for std::variant
- Support for enums
A likely future enhancement will be to use cereal and add a msgpack adaptor.
The no-copy python objects add to_msg_pack() and from_msg_pack() (the latter being a class method), to access this functionality.
Directory structure and namespaces in the generated C++ match the python package and module names.
cmake requires unique target names and pybind11 requires that the filename (minus the OS-specific qualifiers) matches the module name.
Code is generated into a directory structure underneath <top level>/generated.
Headers are installed to <top level>/include.
Compiled pybind11 modules are installed into <original module path>/__pybind__.
For C++ usage, you need only the headers; the compiled code is for pybind/python usage only.
For the example below, common_object_model/common_object_model/v1/common/__pybind__/foo.cpython-311-darwin.so will be installed (obviously with corresponding qualifiers for Linux/Windows). get_pybind_value() searches this directory.
Imports/includes should work seamlessly (the python import scheme will be copied). I have tested this but not completely rigorously.
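As a rough illustration of the install layout described above, the sketch below works out where the compiled module for a given .py file would be expected to sit. expected_pybind_path is a hypothetical helper written for this README, not a function provided by pydantic_bind, and it assumes the extension suffix reported by the running interpreter is the one used for the build.

```python
from pathlib import PurePosixPath
import sysconfig


def expected_pybind_path(py_file: str) -> PurePosixPath:
    """Hypothetical helper: expected location of the compiled module for py_file."""
    source = PurePosixPath(py_file)
    # e.g. ".cpython-311-darwin.so"; fall back to ".so" if the interpreter reports nothing
    suffix = sysconfig.get_config_var("EXT_SUFFIX") or ".so"
    return source.parent / "__pybind__" / (source.stem + suffix)


path = expected_pybind_path("common_object_model/common_object_model/v1/common/foo.py")
print(path)
```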
common_object_model/common_object_model/v1/common/foo.py:
from dataclasses import dataclass
import datetime as dt
from enum import Enum, auto
from typing import Union
from pydantic_bind import BaseModel
class Weekday(Enum):
    MONDAY = auto()
    TUESDAY = auto()
    WEDNESDAY = auto()
    THURSDAY = auto()
    FRIDAY = auto()
    SATURDAY = auto()
    SUNDAY = auto()
@dataclass
class DCFoo:
    my_int: int
    my_string: str | None
class Foo(BaseModel):
    my_bool: bool = True
    my_day: Weekday = Weekday.SUNDAY
class Bar(Foo):
    my_int: int = 123
    my_string: str
    my_optional_string: str | None = None
class Baz(BaseModel):
    my_variant: Union[str, float] = 123.
    my_date: dt.date
    my_foo: Foo
    my_dc_foo: DCFoo
will generate the following files:
common_object_model/generated/common_object_model/v1/common/foo.h:
#ifndef COMMON_OBJECT_MODEL_FOO_H
#define COMMON_OBJECT_MODEL_FOO_H
#include <string>
#include <optional>
#include <variant>
#include <msgpack/msgpack.h>
#include <chrono>
namespace common_object_model::v1::common
{
    enum Weekday { MONDAY = 1, TUESDAY = 2, WEDNESDAY = 3, THURSDAY = 4, FRIDAY = 5, SATURDAY = 6, SUNDAY = 7 };
    struct DCFoo
    {
        DCFoo() :
            my_string(), my_int()
        {
        }
        DCFoo(std::optional<std::string> my_string, int my_int) :
            my_string(my_string), my_int(my_int)
        {
        }
        std::optional<std::string> my_string;
        int my_int;
        MSGPACK_DEFINE(my_string, my_int);
    };
    struct Foo
    {
        Foo(bool my_bool=true, Weekday my_day=SUNDAY) :
            my_bool(my_bool), my_day(my_day)
        {
        }
        bool my_bool;
        Weekday my_day;
        MSGPACK_DEFINE(my_bool, my_day);
    };
    struct Bar : public Foo
    {
        Bar() :
            Foo(),
            my_string(), my_int(123), my_optional_string(std::nullopt)
        {
        }
        Bar(std::string my_string, bool my_bool=true, Weekday my_day=SUNDAY, int my_int=123,
            std::optional<std::string> my_optional_string=std::nullopt) :
            Foo(my_bool, my_day),
            my_string(std::move(my_string)), my_int(my_int), my_optional_string(my_optional_string)
        {
        }
        std::string my_string;
        int my_int;
        std::optional<std::string> my_optional_string;
        MSGPACK_DEFINE(my_string, my_bool, my_day, my_int, my_optional_string);
    };
    struct Baz
    {
        Baz() :
            my_dc_foo(), my_foo(), my_date(), my_variant(123.0)
        {
        }
        Baz(DCFoo my_dc_foo, Foo my_foo, std::chrono::system_clock::time_point my_date,
            std::variant<std::string, double> my_variant=123.0) :
            my_dc_foo(std::move(my_dc_foo)), my_foo(std::move(my_foo)), my_date(my_date),
            my_variant(my_variant)
        {
        }
        DCFoo my_dc_foo;
        Foo my_foo;
        std::chrono::system_clock::time_point my_date;
        std::variant<std::string, double> my_variant;
        MSGPACK_DEFINE(my_dc_foo, my_foo, my_date, my_variant);
    };
} // namespace common_object_model::v1::common
#endif // COMMON_OBJECT_MODEL_FOO_H
common_object_model/generated/common_object_model/v1/common/foo.cpp:
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>
#include <pybind11/chrono.h>
#include "foo.h"
namespace py = pybind11;
using namespace common_object_model::v1::common;
PYBIND11_MODULE(common_object_model_v1_common_foo, m)
{
    py::enum_<Weekday>(m, "Weekday")
        .value("MONDAY", Weekday::MONDAY)
        .value("TUESDAY", Weekday::TUESDAY)
        .value("WEDNESDAY", Weekday::WEDNESDAY)
        .value("THURSDAY", Weekday::THURSDAY)
        .value("FRIDAY", Weekday::FRIDAY)
        .value("SATURDAY", Weekday::SATURDAY)
        .value("SUNDAY", Weekday::SUNDAY);
    py::class_<DCFoo>(m, "DCFoo")
        .def(py::init<>())
        .def(py::init<std::optional<std::string>, int>(), py::arg("my_string"), py::arg("my_int"))
        .def("to_msg_pack", &DCFoo::to_msg_pack)
        .def_static("from_msg_pack", &DCFoo::from_msg_pack<DCFoo>)
        .def_readwrite("my_string", &DCFoo::my_string)
        .def_readwrite("my_int", &DCFoo::my_int);
    py::class_<Foo>(m, "Foo")
        .def(py::init<bool, Weekday>(), py::arg("my_bool")=true, py::arg("my_day")=SUNDAY)
        .def("to_msg_pack", &Foo::to_msg_pack)
        .def_static("from_msg_pack", &Foo::from_msg_pack<Foo>)
        .def_readwrite("my_bool", &Foo::my_bool)
        .def_readwrite("my_day", &Foo::my_day);
    py::class_<Bar>(m, "Bar")
        .def(py::init<>())
        .def(py::init<std::string, bool, Weekday, int, std::optional<std::string>>(), py::arg("my_string"), py::arg("my_bool")=true,
             py::arg("my_day")=SUNDAY, py::arg("my_int")=123, py::arg("my_optional_string")=std::nullopt)
        .def("to_msg_pack", &Bar::to_msg_pack)
        .def_static("from_msg_pack", &Bar::from_msg_pack<Bar>)
        .def_readwrite("my_string", &Bar::my_string)
        .def_readwrite("my_int", &Bar::my_int)
        .def_readwrite("my_optional_string", &Bar::my_optional_string);
    py::class_<Baz>(m, "Baz")
        .def(py::init<>())
        .def(py::init<DCFoo, Foo, std::chrono::system_clock::time_point, std::variant<std::string, double>>(), py::arg("my_dc_foo"),
             py::arg("my_foo"), py::arg("my_date"), py::arg("my_variant")=123.0)
        .def("to_msg_pack", &Baz::to_msg_pack)
        .def_static("from_msg_pack", &Baz::from_msg_pack<Baz>)
        .def_readwrite("my_dc_foo", &Baz::my_dc_foo)
        .def_readwrite("my_foo", &Baz::my_foo)
        .def_readwrite("my_date", &Baz::my_date)
        .def_readwrite("my_variant", &Baz::my_variant);
}
When time allows, I will look at adding support for Rust. There is limited value in generating Java or C# classes; calling those VM-based languages in-process from python has never worked well, in my experience.