Cell Types & Annotations
Kale reads Jupyter cell tags — strings stored under metadata.tags in the
.ipynb — to decide what role each cell plays in the generated pipeline.
This page documents every tag Kale understands, with examples.
You can set these tags visually through the Kale JupyterLab side panel, or by editing the notebook JSON directly.
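For the direct-JSON route, a minimal sketch using only Python's standard json module might look like the following. The add_tag helper and the in-memory notebook are hypothetical illustrations, not part of Kale; the only thing taken from the notebook format is that tags are a list of strings under each cell's metadata.tags.

```python
import json

# A minimal in-memory notebook with two code cells (illustrative content only).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "code", "metadata": {}, "source": "import pandas as pd",
         "outputs": [], "execution_count": None},
        {"cell_type": "code", "metadata": {}, "source": "df = pd.read_csv('data.csv')",
         "outputs": [], "execution_count": None},
    ],
}


def add_tag(cell: dict, tag: str) -> None:
    """Hypothetical helper: add a tag to cell['metadata']['tags'] at most once."""
    tags = cell.setdefault("metadata", {}).setdefault("tags", [])
    if tag not in tags:
        tags.append(tag)


add_tag(notebook["cells"][0], "imports")
add_tag(notebook["cells"][1], "step:load_data")

# Serialize back to .ipynb-style JSON; write this string to a file to save it.
serialized = json.dumps(notebook, indent=1)
```

For a notebook on disk, the same helper could be applied after json.load and before json.dump.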
The full tag vocabulary
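| Tag | Example | Effect |
|---|---|---|
| imports | - | Cell is prepended to every pipeline step. All imports must live here. |
| functions | - | Cell is prepended to every step after the imports cell. |
| pipeline-parameters | - | Variables defined here become KFP pipeline parameters. |
| pipeline-metrics | - | print(...) statements in the cell are emitted as KFP pipeline metrics. |
| step:<name> | step:load_data | Declares (or appends to) a pipeline step named <name>. |
| prev:<name> | prev:load_data | Adds a dependency from the current step to <name>. |
| skip | - | Cell is excluded from the pipeline entirely. |
| annotation:<k>:<v> | annotation:team:ml | Adds a Kubernetes annotation to the step’s pod. |
| label:<k>:<v> | label:env:prod | Adds a Kubernetes label to the step’s pod. |
| limit:<resource>:<value> | limit:nvidia.com/gpu:1 | Adds a Kubernetes resource limit to the step’s pod. |
| image:<image> | image:pytorch/pytorch:2.0-cuda12 | Overrides the base image for this step only. |
| cache:enabled, cache:disabled | cache:disabled | Opts the step into or out of KFP’s built-in caching. |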
Per-cell-type details
imports
The imports cell is where every module import in your notebook must
live. Kale prepends this cell’s source to every pipeline step’s generated
component, so any step can assume those imports are available.
# tag: imports
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
Warning
If you import a library in a step or functions cell, Kale will not
add it to the step’s packages_to_install list, and the step will fail at
runtime with ModuleNotFoundError unless the base image happens to include
the package.
functions
Put function and class definitions here. Like imports, this cell is
prepended to every step.
# tag: functions
def clean(df: pd.DataFrame) -> pd.DataFrame:
    return df.dropna()


class FeaturePipeline:
    def __init__(self, model):
        self.model = model
Keep these definitions pure: no top-level executable statements, no prints, no imports, no global state mutation.
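To make the "prepended to every step" behaviour concrete, here is a rough sketch of how a step's source could be assembled from the shared cells plus the step's own cells. The assemble_step_source function is a hypothetical illustration and not Kale's actual code generation; the sample cells use only the standard library so the result can be executed anywhere.

```python
def assemble_step_source(imports_src: str, functions_src: str, step_cells: list[str]) -> str:
    """Hypothetical helper: join imports, functions, then the step's cells, in that order."""
    parts = [imports_src, functions_src, *step_cells]
    return "\n\n".join(p.strip() for p in parts if p.strip())


# Illustrative cell contents (assumptions, not taken from a real notebook).
imports_src = "import math"
functions_src = "def area(r):\n    return math.pi * r ** 2"
step_cells = ["radius = 2.0", "result = area(radius)"]

source = assemble_step_source(imports_src, functions_src, step_cells)

# Run the assembled step in an isolated namespace, as a stand-in for a pipeline component.
namespace = {}
exec(source, namespace)
```

Because the shared cells run again in every step, any side effect placed in them would also repeat in every step, which is one reason to keep them free of executable statements.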
pipeline-parameters
Variables defined in a pipeline-parameters cell become top-level KFP
pipeline parameters. They are exposed as arguments of the @kfp_dsl.pipeline
function and can be overridden at submission time.
# tag: pipeline-parameters
learning_rate = 0.01
batch_size = 128
num_epochs = 10
Supported parameter types are int, float, str, and bool. If you
declare the same parameter in multiple pipeline-parameters cells, the last
value wins.
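The type restriction and the last-value-wins rule can be sketched with the standard ast module. The extract_parameters function below is a hypothetical illustration of those two rules, not Kale's parser, and it assumes every statement in the cell is a simple name = literal assignment.

```python
import ast

SUPPORTED_TYPES = (int, float, str, bool)


def extract_parameters(cell_sources: list[str]) -> dict:
    """Collect `name = literal` assignments; later cells override earlier ones."""
    params = {}
    for source in cell_sources:
        for node in ast.parse(source).body:
            is_simple = (
                isinstance(node, ast.Assign)
                and len(node.targets) == 1
                and isinstance(node.targets[0], ast.Name)
            )
            if not is_simple:
                raise ValueError("only simple `name = value` assignments are allowed")
            name = node.targets[0].id
            value = ast.literal_eval(node.value)  # raises ValueError for non-literals
            if not isinstance(value, SUPPORTED_TYPES):
                raise TypeError(f"unsupported type for {name}: {type(value).__name__}")
            params[name] = value  # a later assignment replaces an earlier one
    return params


# Two pipeline-parameters cells that both define batch_size (illustrative values).
cells = ["learning_rate = 0.01\nbatch_size = 128", "batch_size = 256\nuse_gpu = True"]
params = extract_parameters(cells)
```

In this sketch a list or dict default raises TypeError, mirroring the supported-types list above.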
pipeline-metrics
Any print(...) statements in a pipeline-metrics cell are parsed out by
Kale’s AST helper
(kale.common.astutils.parse_metrics_print_statements()) and emitted
as KFP pipeline metrics, making them visible in the KFP UI’s run metrics
tab.
# tag: pipeline-metrics
print("accuracy:", accuracy)
print("f1:", f1_score)
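As an illustration of the general idea of pulling print(...) calls out of a cell with the AST, consider the sketch below. It is not the implementation of kale.common.astutils.parse_metrics_print_statements(), whose exact behaviour is not described here; find_printed_names is a hypothetical function that simply collects the variable names passed to top-level print calls.

```python
import ast


def find_printed_names(source: str) -> list[str]:
    """Hypothetical helper: names of variables passed to top-level print(...) calls."""
    names = []
    for node in ast.parse(source).body:
        is_print_call = (
            isinstance(node, ast.Expr)
            and isinstance(node.value, ast.Call)
            and isinstance(node.value.func, ast.Name)
            and node.value.func.id == "print"
        )
        if is_print_call:
            # String literals such as "accuracy:" are ast.Constant nodes and are skipped.
            names.extend(arg.id for arg in node.value.args if isinstance(arg, ast.Name))
    return names


cell = 'print("accuracy:", accuracy)\nprint("f1:", f1_score)'
metric_names = find_printed_names(cell)
```

Parsing rather than executing the cell means the variables do not need to exist when the notebook is converted.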
step:<name>
The workhorse tag. Any cell tagged step:data_processing contributes code
to a pipeline step named data_processing. Multiple cells can share the
same step name — they will be concatenated in notebook order.
# tag: step:load_data
df = pd.read_csv("data.csv")
df = clean(df)
Dependencies are declared with prev: tags:
# tags: step:train, prev:load_data
model = RandomForestClassifier()
model.fit(df.drop("y", axis=1), df["y"])
You can add as many prev: tags as you want — one per dependency.
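The grouping and ordering rules above can be sketched with the standard library's graphlib module (Python 3.9+). This is an illustrative model of how step: and prev: tags could be turned into an execution order, not Kale's internal code; the cell list and the evaluate step are hypothetical.

```python
from graphlib import TopologicalSorter

# (tags, source) pairs in notebook order; contents are illustrative.
cells = [
    (["step:load_data"], "df = pd.read_csv('data.csv')"),
    (["step:train", "prev:load_data"], "model = RandomForestClassifier()"),
    (["step:load_data"], "df = clean(df)"),
    (["step:evaluate", "prev:train", "prev:load_data"], "score = model.score(X, y)"),
]

steps = {}  # step name -> list of cell sources, in notebook order
deps = {}   # step name -> set of upstream step names

for tags, source in cells:
    name = next((t.split(":", 1)[1] for t in tags if t.startswith("step:")), None)
    if name is None:
        continue  # not a step cell
    steps.setdefault(name, []).append(source)
    deps.setdefault(name, set()).update(
        t.split(":", 1)[1] for t in tags if t.startswith("prev:")
    )

# Upstream steps come before the steps that depend on them.
order = list(TopologicalSorter(deps).static_order())
```

TopologicalSorter also raises graphlib.CycleError if the prev: tags form a cycle, which is a convenient way to catch a mis-declared dependency early.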
Per-step configuration
A step cell can carry additional tags to customize its pod spec:
# tags: step:train_gpu, prev:prepare_data,
# image:pytorch/pytorch:2.0-cuda12,
# limit:nvidia.com/gpu:1,
# annotation:team:ml,
# label:env:prod,
# cache:disabled
- image:<image> uses a custom base image for just this step.
- limit:<resource>:<value> sets a limit for GPU, memory, or any other resource (e.g. limit:memory:8Gi).
- annotation:<k>:<v> and label:<k>:<v> add Kubernetes metadata to the step’s pod, which is useful for cost allocation, scheduling hints, or integration with observability tooling.
- cache:disabled opts the step out of KFP’s caching. Use cache:enabled to force caching when it has been disabled globally.
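A short sketch of how such tags could be split into a per-step configuration follows. The parse_step_config function and the shape of the returned dict are hypothetical, and the splitting rule (first colon separates the tag kind, the next colon separates key from value) is an assumption for illustration rather than Kale's documented parsing.

```python
def parse_step_config(tags: list[str]) -> dict:
    """Hypothetical helper: turn per-step tags into a plain configuration dict."""
    config = {"image": None, "limits": {}, "annotations": {}, "labels": {}, "cache": None}
    for tag in tags:
        kind, _, rest = tag.partition(":")
        if kind == "image":
            config["image"] = rest  # keep the remainder whole: image refs contain ':'
        elif kind == "cache":
            config["cache"] = rest == "enabled"
        elif kind in ("limit", "annotation", "label"):
            key, _, value = rest.partition(":")
            config[kind + "s"][key] = value
        # step: and prev: tags are handled elsewhere and ignored here
    return config


tags = [
    "step:train_gpu",
    "prev:prepare_data",
    "image:pytorch/pytorch:2.0-cuda12",
    "limit:nvidia.com/gpu:1",
    "annotation:team:ml",
    "label:env:prod",
    "cache:disabled",
]
config = parse_step_config(tags)
```

Splitting only on the first colon for image: matters because the image reference carries its own colon before the version tag.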
skip
Cells tagged skip are dropped from the pipeline. Use them for exploratory
code or debugging that you want to keep in the notebook but not run on the
cluster.
# tag: skip
df.describe()
df.plot.hist()
Best practices
- Keep imports at the top of the notebook. Don’t spread imports across cells, because Kale won’t pick them up.
- Never mutate global state from inside a step. If you need to configure a library (e.g. warnings.simplefilter), do it once in an imports or functions cell.
- Use pipeline-parameters for values you want to tweak between runs. Resist hard-coding hyperparameters inside step cells.
- Use skip liberally during development for cells that don’t belong in the pipeline, like df.head() or plotting code.
- Name your steps explicitly (step:load_data, step:train, step:evaluate) rather than leaving them auto-named.
See Troubleshooting for the common failure modes these practices prevent.