Syntax Brainstorming
The syntax depends on the kind of implementation. For example, we might want the type system to be valid Python and use Python classes, functions and annotations. In that case, the syntax would be restricted to the set of valid Python expressions that have no direct effect on the program.
Moreover, if we do use Python's builtin syntax, there are two possible approaches: either define real Python classes and functions, or simply use the syntax and parse it externally, without any real Python semantics.
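As a rough illustration of the second approach, the sketch below reads a schema out of an annotation with the standard `ast` module, without importing or evaluating `Frame` or `Column`. The helper name `extract_columns` is hypothetical, and the code assumes Python 3.9+ (for `ast.unparse` and the simplified subscript nodes).

```python
import ast

SOURCE = '''
df: Annotated[pd.DataFrame, Frame[
    Column["verified", bool],
    Column["height", float],
]] = pd.read_csv("data.csv")
'''


def extract_columns(source: str) -> dict[str, dict[str, str]]:
    """Map each annotated variable to its {column name: type name} schema."""
    schemas = {}
    for node in ast.walk(ast.parse(source)):
        # Only consider annotated assignments to a plain name, e.g. `df: ... = ...`
        if not isinstance(node, ast.AnnAssign) or not isinstance(node.target, ast.Name):
            continue
        columns = {}
        for sub in ast.walk(node.annotation):
            # Match subscripts of the form Column["name", type]
            if (
                isinstance(sub, ast.Subscript)
                and isinstance(sub.value, ast.Name)
                and sub.value.id == "Column"
                and isinstance(sub.slice, ast.Tuple)
                and len(sub.slice.elts) == 2
            ):
                name_node, type_node = sub.slice.elts
                if isinstance(name_node, ast.Constant):
                    # The type is kept as source text; nothing is evaluated
                    columns[name_node.value] = ast.unparse(type_node)
        if columns:
            schemas[node.target.id] = columns
    return schemas


print(extract_columns(SOURCE))
```

Because the source is only parsed, the names used in the annotation never need to exist at runtime, which is what makes this approach independent of real Python semantics.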
Finally, there is also the option to define a new ad-hoc syntax, which may or may not reuse constructs present in Python. This would require a program to be compiled before it becomes valid, parsable and executable Python code. It also means that an extension of the Python Language Server would need to be created for developers to use the framework effectively.
NB: The option to define the annotations in Python comments will not be considered. Although it would allow custom syntax while keeping the code valid Python, it does not fit the vision for this project, nor is it suitable for a full type system implementation.
The framework must not only allow defining data-frame schemas and custom types, but also operations (e.g. scaling a length), inter-compatibility (e.g. adding latitudes doesn't make sense but adding lengths does), and ad-hoc transformations (e.g. applying a scaler from sklearn should be allowed, and should transform the type).
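To make the idea of a type-changing transformation concrete, here is a minimal runtime sketch. It does not use sklearn: `min_max_scale` is a hand-written stand-in for a scaler, and `Height` and `Scaled` are hypothetical names chosen for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Height:
    metres: float


@dataclass(frozen=True)
class Scaled:
    """Hypothetical result type: a unit-less value derived from `source`."""
    value: float
    source: type


def min_max_scale(heights: list[Height]) -> list[Scaled]:
    """Rescale heights to [0, 1]; the result is no longer a Height."""
    values = [h.metres for h in heights]
    low, high = min(values), max(values)
    span = (high - low) or 1.0  # avoid dividing by zero when all values are equal
    return [Scaled((v - low) / span, Height) for v in values]


scaled = min_max_scale([Height(1.5), Height(1.75), Height(2.0)])
print(scaled)
```

The point is only that the output carries a different type from the input, so operations that are valid on `Height` are no longer implicitly valid on the scaled values.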
Comparison
| Syntax | Using Python constructs | Valid Python code |
|---|---|---|
| Python | Yes | Yes |
| Python | No | Yes |
| Custom | Yes | No |
| Custom | No | No |
In terms of integration, the first option seems the best suited, as it provides a simple Python package that can be added to any Python project, but it has multiple disadvantages:
- May be complex to work well with Python's builtin type and annotation system
- Can be quite verbose
- Doesn't involve the creation of a custom parser
Looking at the following examples, my personal preference would go towards the last option. The only notable downsides of that option are the need to compile the code to make it valid Python, and the fact that it doesn't integrate with any Python LSP as is.
Required syntax elements
- Defining a data-frame schema
  - Defining a column with a type
  - Giving a column a name
  - Specifying constraints on a column (could be defined in the type itself for simplicity)
- Defining a custom type
  - A type must be based on an underlying Python type
  - A type can have properties (e.g. a GeoCoordinate has a latitude and a longitude)
- Defining operations
  - Defining allowed operations between the same or different types, and the resulting type
  - Defining ad-hoc transformations (e.g. sklearn scaler)
Defining operations needs to be simple and concise. Many types will support basic mathematical operations with unit-less factors (e.g. scaling), or self-operations (e.g. addition, subtraction, ratio).
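One way this conciseness could be approached in plain Python is with small mixins, so that a type opts into a family of operations in a single line. The sketch below is a runtime illustration only; `Quantity`, `Additive` and `Scalable` are hypothetical names, not part of any existing package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantity:
    value: float


def _is_factor(x) -> bool:
    # A unit-less factor is a plain number (bool is excluded on purpose)
    return isinstance(x, (int, float)) and not isinstance(x, bool)


class Additive:
    """Self-operations: values of the same type can be added and subtracted."""

    def __add__(self, other):
        if type(other) is not type(self):
            return NotImplemented
        return type(self)(self.value + other.value)

    def __sub__(self, other):
        if type(other) is not type(self):
            return NotImplemented
        return type(self)(self.value - other.value)


class Scalable:
    """Scaling by a unit-less factor, and ratio between two values of the same type."""

    def __mul__(self, factor):
        if not _is_factor(factor):
            return NotImplemented
        return type(self)(self.value * factor)

    __rmul__ = __mul__

    def __truediv__(self, other):
        if type(other) is type(self):
            return self.value / other.value  # a ratio is a unit-less float
        if _is_factor(other):
            return type(self)(self.value / other)
        return NotImplemented


class Length(Additive, Scalable, Quantity):
    """Lengths can be added, subtracted, scaled and divided."""


class Latitude(Quantity):
    """No mixin: Latitude + Latitude raises TypeError."""


print(Length(2.0) + Length(3.0), 2 * Length(3.0), Length(6.0) / Length(3.0))
```

With this layout, declaring a new type costs one line, and leaving a mixin out is what forbids the corresponding operations.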
In a further development, we may want the framework to support units. Units would be a more general kind of type, with many similar operations. A dedicated unit management system might be useful to avoid redundant and verbose code.
Data-frame definition examples
Python syntax - using Python constructs
from datetime import datetime
from typing import Annotated
from midas import Frame, Column
import pandas as pd
df: Annotated[pd.DataFrame, Frame[
    Column["verified", bool],
    Column["birth_year", int],
    Column["height", float],
    Column["name", str],
    Column["date", datetime]
]] = pd.read_csv("data.csv")
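For this notation to be real Python, `Frame[...]` and `Column[...]` must be subscriptable at runtime. The sketch below shows one possible way to do that with `__class_getitem__`; the `Frame` and `Column` classes here are hypothetical stand-ins, not the actual `midas` implementation, and `Table` replaces `pd.DataFrame` so the example has no dependencies.

```python
from typing import Annotated, get_args


class Column:
    """Hypothetical column descriptor built by Column["name", type]."""

    def __init__(self, name: str, dtype: type):
        self.name = name
        self.dtype = dtype

    def __class_getitem__(cls, item):
        name, dtype = item
        return cls(name, dtype)


class Frame:
    """Hypothetical schema built by Frame[Column[...], ...]."""

    def __init__(self, columns):
        self.columns = {column.name: column.dtype for column in columns}

    def __class_getitem__(cls, item):
        # A single column arrives as one object, several columns as a tuple
        columns = item if isinstance(item, tuple) else (item,)
        return cls(columns)


class Table:
    """Stand-in for pd.DataFrame."""


annotation = Annotated[Table, Frame[
    Column["verified", bool],
    Column["height", float],
]]

# Annotated keeps the schema as metadata next to the underlying type
underlying, schema = get_args(annotation)
print(underlying.__name__, schema.columns)
```

Since `Annotated` accepts arbitrary objects as metadata, an external checker can recover the schema while regular type checkers still see the underlying frame type.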
Python syntax - without Python constructs
from __future__ import annotations
from datetime import datetime
from typing import Annotated
import pandas as pd
df: Annotated[pd.DataFrame, Frame[
    Column["verified", bool],
    Column["birth_year", int],
    Column["height", float],
    Column["name", str],
    Column["date", datetime]
]] = pd.read_csv("data.csv")
# or
df: pd.DataFrame = pd.read_csv("data.csv")
"""midas
column 'verified' bool
column 'birth_year' int
column 'height' float
column 'name' str
column 'date' datetime
"""
Custom syntax - using Python constructs
from datetime import datetime
import pandas as pd
Frame[
    Column["verified", bool],
    Column["birth_year", int],
    Column["height", float],
    Column["name", str],
    Column["date", datetime]
]
df: pd.DataFrame = pd.read_csv("data.csv")
Custom syntax - without Python constructs
from datetime import datetime
import pandas as pd
Frame[
    Column<bool>{name: "verified"},
    Column<int>{name: "birth_year"},
    Column<float>{name: "height"},
    Column<str>{name: "name"},
    Column<datetime>{name: "date"}
]
df: pd.DataFrame = pd.read_csv("data.csv")
Custom types examples
Python syntax - using Python constructs
from midas import Type
class Latitude(Type[float]): ...
class Longitude(Type[float]): ...
class GeoCoordinates(Type[tuple[Latitude, Longitude]]):
    @property
    def lat(self) -> Latitude:
        return self[0]

    @property
    def lon(self) -> Longitude:
        # The longitude is the second element of the tuple
        return self[1]
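For `class Latitude(Type[float])` to be executable, `Type[...]` has to return something that can be subclassed. One possible sketch, assuming Python 3.9+ and a hypothetical `Type` class that is not the actual `midas` one, creates a subclass of the underlying runtime type on the fly:

```python
from typing import get_origin


class Type:
    """Hypothetical stand-in for midas.Type."""

    def __class_getitem__(cls, base):
        # tuple[Latitude, Longitude] is a generic alias; its runtime class is tuple
        runtime_base = get_origin(base) or base
        return type(
            f"Type[{runtime_base.__name__}]",
            (runtime_base,),
            {"__midas_base__": base},  # keep the full declared base for a checker
        )


class Latitude(Type[float]): ...
class Longitude(Type[float]): ...


class GeoCoordinates(Type[tuple[Latitude, Longitude]]):
    @property
    def lat(self) -> Latitude:
        return self[0]

    @property
    def lon(self) -> Longitude:
        return self[1]


paris = GeoCoordinates((Latitude(48.8566), Longitude(2.3522)))
print(paris.lat, paris.lon)
```

Values created this way still behave like their underlying Python type (a `Latitude` is a `float`), so nothing here prevents invalid arithmetic at runtime; that would remain the job of the checker.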
Python syntax - without Python constructs
...
Custom syntax - using Python constructs
type Latitude[float] = ... # `= ...` is just for syntax highlighting
type Longitude[float] = ...
type GeoCoordinates[Latitude, Longitude]:
    lat: Latitude
    lon: Longitude
Custom syntax - without Python constructs
type Latitude<float>
type Longitude<float>
type GeoCoordinates<Latitude, Longitude>{
    lat Latitude
    lon Longitude
}
Operations
Custom syntax - without Python constructs
type Latitude<float>
type Longitude<float>
type LatitudeDiff<float>
type LongitudeDiff<float>
type Distance<float>
op <Latitude> - <Latitude> = <LatitudeDiff>
op <Longitude> - <Longitude> = <LongitudeDiff>
op <LatitudeDiff> + <LatitudeDiff> = <LatitudeDiff>
op <LongitudeDiff> + <LongitudeDiff> = <LongitudeDiff>
op <LatitudeDiff> - <LatitudeDiff> = <LatitudeDiff>
op <LongitudeDiff> - <LongitudeDiff> = <LongitudeDiff>
op <GeoCoordinates>.distance(<GeoCoordinates>) = <Distance>
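As a sketch of how such declarations might be consumed, the code below parses the binary `op <A> x <B> = <C>` lines into a lookup table and uses it to resolve result types. The function names are hypothetical, the grammar is only what the lines above suggest, and the method form (`.distance(...)`) is deliberately not handled.

```python
import re

# Matches: op <Left> + <Right> = <Result>
OP_RE = re.compile(r"^op\s+<(\w+)>\s*([-+*/])\s*<(\w+)>\s*=\s*<(\w+)>$")


def parse_ops(text: str) -> dict[tuple[str, str, str], str]:
    """Build a {(left type, operator, right type): result type} table."""
    table = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = OP_RE.match(line)
        if match is None:
            raise ValueError(f"cannot parse operation: {line!r}")
        left, operator, right, result = match.groups()
        table[(left, operator, right)] = result
    return table


def result_type(table, left: str, operator: str, right: str) -> str:
    """Return the declared result type, or reject the operation."""
    try:
        return table[(left, operator, right)]
    except KeyError:
        raise TypeError(f"operation not allowed: {left} {operator} {right}") from None


OPS = parse_ops("""
op <Latitude> - <Latitude> = <LatitudeDiff>
op <LatitudeDiff> + <LatitudeDiff> = <LatitudeDiff>
""")

print(result_type(OPS, "Latitude", "-", "Latitude"))
```

Any combination that was not declared is rejected, which matches the intent that adding two latitudes should be an error while subtracting them yields a difference type.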