[DRAFT] Telemetry Policy #4738

jsuereth · 2025-11-17T21:15:01Z

DO NOT MERGE

This is a DRAFT from many discussions during KubeCon NA. This is meant to collect feedback and provide a location for collaboration on the formal proposal that was started with @trask @jaronoff97 @andykellr and others.

Update 2025-11-17: This just includes a motivation section and a basic sketch from my memory of KubeCon. @jaronoff97 took great notes, so I'm passing this off for further fleshing out and expansion from here.

jaronoff97 · 2025-11-17T21:16:07Z

Thank you for opening this! Feel free to assign me here as I begin working on this.

jsuereth · 2025-11-17T21:43:47Z

@jaronoff97 Honestly - any piece you're comfortable taking, feel free to expand. It still needs a lot of fleshing out, and I know you had a lot of great ideas on the high level design, so let's start there and then create break-out sections for details.

I was thinking of the following divisions (with the caveat that I'm happy if more folks want to help with this proposal, but we need to flesh out the idea much further before I think that would work):

@jaronoff97 / @jsuereth Can pair on the following:
- Defining key aspects of Policy (e.g. idempotent, mergeable, etc. in more detail).
- Providing an example policy / configuration-setup (for Operator, Collector + SDK)
- OpenTelemetry Operator interaction with Policy
@andykellr / @jaronoff97 Can pair on the following:
- Interaction between Policy + OpAMP
- Necessary components/setup in OTEL Collector
@jsuereth / @trask
- SDK necessary components
- Alternatives considered / prior art

dashpole · 2025-11-18T19:45:17Z

oteps/4738-telemetry-policy.md

+
+Note that mitigations do not need to be complete *solutions*, and that they do not need to be accomplished directly through your proposal. A suggested mitigation may even warrant its own OTEP!
+
+## Prior art and alternatives


I would love to see alternatives here. We've discussed things like dynamically-reloadable or merge rules for declarative config, and it would help reinforce why we need a new concept to solve the problems you are interested in.

Agreed, I need to write down the treatment of why dynamically reloaded config doesn't solve the problems that motivate the proposal.

My answer to your other comment, hopefully, hints at that, but it'll be a longer write-up.

dashpole · 2025-11-18T20:13:23Z

oteps/4738-telemetry-policy.md

+- TODO - *implicily* a policy has a target resource / signal it is aimed at.
+  This will be used to route policies to destinations.
+
+Example policy types include: 


Could we think of this as new "subtypes" of declarative config that can be used in a standalone way? E.g. if we think of the current declarative config as configuration as type "SDK", we could define sub-types like "sampler", "view", or "log-record-processor"?

If we can, I would love to keep the same yaml structure / definitions for these policies that we currently have in the declarative config so we avoid introducing another structured definition of what a "sampler" is. Or do you think because this is targeted at the collector as well that isn't feasible?

I'd expect the declarative config for a policy-component to be used directly in declarative config:

So something like:

    - my_policy_component:
        - default_policies
          - type: xyz
            ... the policy yaml...

The primary difference between the policy for sampling and a "sampler" will actually be in flexibility. A sampler component could be written in any language, allow any code and its configuration must be open. A sampler policy MUST have a well-defined behavior, have the same configuration and behavior in all languages or implementations.

So primarily, a policy is highly limited in a way extension points are not.

oteps/9999-telemetry-policy.md

jack-berg · 2025-11-19T17:22:34Z

oteps/4738-telemetry-policy.md

+control the sampling of spans in SDKs. However, File-based configuration does
+not require dynamic reloading of configuration. This means attempting to
+provide a solution like Jaeger-remote-sampler with just OpAMP + file-based
+config is impossible, today.


Dynamic control of SDKs is something that should be able to be built on top of or as an evolution of declarative config. I / we have been conscious of this eventually while building declarative config and I don't think anything will get in the way. Also, I hope that minimally, the declarative config data model can be used as a way for servers to communicate the desired configuration state of components in a dynamic config scenario.

While I agree to a degree, the type of control and abstraction this proposal seeks to enable is NOT possible without agreement on semantics and use-cases across diverse implementations.

E.g. the declarative config + OpAMP could be used to send any config to any component. What it doesn't do, and what we need to sort out, is how to understand what config can be sent to what component, and how to drive control / policy independent of implementation or pipeline set-up. For example:

Imagine a world where we can control the reporting of metrics across open telemetry SDKs, custom implementations and Prometheus SDKs because we agreed to the semantics of policy independent of configuration.

In Declarative config I'd expect things that cannot be shared between different implementations:

- Queue/Buffer sizes that are specific to my pipeline setup
- Threading / GC configuration specific to my language

In a Policy we should be limited to ONLY things that can be shared broadly, across implementations and have well-defined semantics for how to enforce them.

So I see Declarative config as encompassing more than just policies, where policies would be a subset of what you'd find. Additionally, Policies can be independent things that you can bundle together. I should be able to "add" a policy at any point without needing to understand how it interacts with other components.

An example of this: if I have a configuration reporting metrics, that configuration would have a MetricReader->MetricExporter, right? What if there are multiple? How would I know what to change generically if I just wanted to say "stop producing metric X"? Policies are ignorant of this. They just push a policy down, and the SDK would be expected to enforce it via a PolicyMetricReader that's configured to pay attention to a metric filter policy.
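To make the "stop producing metric X" case concrete, here is a minimal Python sketch under stated assumptions. `MetricFilterPolicy` and `PolicyMetricReader` are hypothetical names for illustration, not existing OpenTelemetry SDK APIs, and metrics are modeled as a plain name-to-value dict. The only point is that the policy names the metric and never references readers or exporters.

```python
from dataclasses import dataclass, field


@dataclass
class MetricFilterPolicy:
    """Hypothetical policy: names of metrics that must not be reported."""
    dropped_metrics: set[str] = field(default_factory=set)


class PolicyMetricReader:
    """Hypothetical reader wrapper that enforces the current filter policy."""

    def __init__(self, collect):
        self._collect = collect  # callable returning {metric_name: value}
        self._policy = MetricFilterPolicy()

    def apply_policy(self, policy: MetricFilterPolicy) -> None:
        # Replacing the policy wholesale means applying it twice
        # has the same effect as applying it once.
        self._policy = policy

    def read(self) -> dict[str, float]:
        metrics = self._collect()
        return {name: value for name, value in metrics.items()
                if name not in self._policy.dropped_metrics}


def collect_metrics():
    # Stand-in for whatever the SDK would actually collect.
    return {"http.server.duration": 12.5, "jvm.gc.count": 3.0}


reader = PolicyMetricReader(collect_metrics)
reader.apply_policy(MetricFilterPolicy({"jvm.gc.count"}))
print(reader.read())
```

Pushing an empty `MetricFilterPolicy()` turns the metric back on, without the sender knowing how many readers or exporters the pipeline has.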

Apologies not all of this is fleshed out, as it's a working draft, and one we're working on in the repo. Please continue to ask questions and I'll use that to flesh out the motivation more.

This is an exact case I have: turning off metrics, and turning them back on. I implement this by having a flag in a custom exporter which stops/restarts exports. A generic solution to turning it off would be to change the exporter config to none, then I guess you could re-enable by setting it again to otlp, but that implies a much more complex action in the SDK than switching a boolean on/off.

I added this to the alternatives considered discussion

Imagine a world where we can control the reporting of metrics across open telemetry SDKs, custom implementations and Prometheus SDKs

A bold vision. I think I was definitely misunderstanding the scope. I'll revise my position: If we want dynamic control solutions specifically for otel SDKs, the declarative config data model should play a role, because not using it means introducing yet another config interface (YACI 😛). With a broader scope targeting other tools besides otel SDKs, we would of course need something not loaded with otel SDK vocabulary / baggage.

Should this type of thing even live in otel or in some neutral territory? (It reminds me of the relationship between W3C Trace Context and OpenTelemetry.) Are there other ecosystems that have expressed interest, or that we've reached out to for collaborating?

Great questions!

If we want dynamic control solutions specifically for otel SDKs, the declarative config data model should play a role

100%!

Should this type of thing even live in otel or in some neutral territory?

Great question. I personally think this belongs in OTEL and should "feel native" to otel, but allow any component in o11y space to interact with it. This can increase the reach of "effective opentelemetry" as components which support writing OTLP can also participate with policies. However, to your question above, if this wasn't first-class in otel, how would we make sure our declarative config data model plays an important role?

Are there other ecosystems that have expressed interest in or that we've reached out to for collaborating?

The idea is the outcome of discussions with both Envoy (and their xDS control plane folks) and Google's Monarch team (see #4672). I would love to pull in more folks to collaborate for sure. First, I want to make sure we all understand the vision, scope and goals.

This PR was meant to be a place for those of us who started discussing to flesh out the proposal in place (as draft), so this PR is meant to be collecting that interest and refining the message. Apologies it was rough when you first reviewed it.

jack-berg · 2025-11-19T17:27:53Z

oteps/4738-telemetry-policy.md

+- `log-filter`: define how logs are sampled/filtered
+- `attribute-redaction`: define attributes which need redaction/removal.
+- `metric-aggregation`: define how metrics should be aggregated (i.e. views).
+- `exemplar-sampling`: define how exemplars are sampled


This reads like a subset of declarative configuration capabilities. Wouldn't it be easier to unify on one data model (i.e. declarative config) for expressing the desired configuration, and build tooling to detect / apply diffs when a change is pushed from a remote server?

I.e. an app starts with:

    file_format: 1.0
    tracer_provider:
      processors:
        - batch:
            exporter:
              otlp_http:
      sampler:
        parent_based:
          root:
            trace_based:
              ratio: 1.0

Later, a remote server pushes a new configuration state with an updated ratio for the trace id ratio sampler:

    file_format: 1.0
    tracer_provider:
      processors:
        - batch:
            exporter:
              otlp_http:
      sampler:
        parent_based:
          root:
            trace_based:
              ratio: 0.5 # reduce ratio from 1.0 to 0.5

Some controller is responsible for evaluating the diff between the current state and the desired state, and computing / executing update steps as allowed. In this case, substitute the sampler.
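A rough Python sketch of what such a controller could compute, purely as an illustration: the two configs above are mirrored as plain dicts, and `diff_config`, `allowed_changes`, and `UPDATABLE_PREFIXES` are hypothetical names, not part of declarative config or any SDK.

```python
import copy


def diff_config(current, desired, path=()):
    """Return {path: (old, new)} for every leaf that differs.

    Lists are compared as whole values. A key that is missing on one side
    is treated like a key set to None on that side.
    """
    changes = {}
    if isinstance(current, dict) and isinstance(desired, dict):
        for key in sorted(set(current) | set(desired)):
            changes.update(
                diff_config(current.get(key), desired.get(key), path + (key,)))
    elif current != desired:
        changes[path] = (current, desired)
    return changes


# Hypothetical allow-list: subtrees the SDK is willing to swap at runtime.
UPDATABLE_PREFIXES = [("tracer_provider", "sampler")]


def allowed_changes(changes):
    """Keep only changes that fall under an updatable subtree."""
    return {path: change for path, change in changes.items()
            if any(path[:len(prefix)] == prefix for prefix in UPDATABLE_PREFIXES)}


current = {
    "file_format": "1.0",
    "tracer_provider": {
        "processors": [{"batch": {"exporter": {"otlp_http": None}}}],
        "sampler": {"parent_based": {"root": {"trace_based": {"ratio": 1.0}}}},
    },
}
desired = copy.deepcopy(current)
desired["tracer_provider"]["sampler"]["parent_based"]["root"]["trace_based"]["ratio"] = 0.5

changes = diff_config(current, desired)
print(allowed_changes(changes))
```

The diff itself is the easy part. The open question in this thread is who decides what belongs in the allow-list, and whether that decision can be made the same way across implementations.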

You can read some of my rationale at the bottom of the OTEP.

Effectively:

- I think policies will be used as-is in declarative config. There would be a component in Declarative config that can be configured with a default set of policies.
- I think policies will be highly limited in expected behavior / security profile vs. declarative config.
- I think SDK configuration will opt in to allow remote-policy control, with explicit permissions per policy.
- I do not think policies will alter pipeline setup or shape. Policies should have well-defined insertion points, already defined via Config, where they will be enforced.
- Policies will need a mechanism via OpAMP to advertise that they can be accepted and handled; we can use "custom capabilities" for this.

So there are a lot of similarities, but the key difference is the limitations.
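As a sketch of the opt-in and advertising points above, assuming a hypothetical `RemotePolicyGate` and an invented capability naming scheme (neither comes from the OpAMP spec or any SDK):

```python
class RemotePolicyGate:
    """Hypothetical SDK-side gate: remote policies are opt-in per policy type."""

    # Assumed naming scheme for illustration only; not defined by any spec.
    CAPABILITY_PREFIX = "io.opentelemetry.policy."

    def __init__(self, permitted_types):
        self._permitted = frozenset(permitted_types)

    def capabilities(self):
        """Strings the agent could advertise, e.g. as OpAMP custom capabilities."""
        return sorted(self.CAPABILITY_PREFIX + t for t in self._permitted)

    def accept(self, policy):
        """Accept only policies whose type was explicitly permitted in config."""
        return policy.get("type") in self._permitted


gate = RemotePolicyGate(["trace-sampling", "log-filter"])
print(gate.capabilities())
print(gate.accept({"type": "trace-sampling", "ratio": 0.5}))
print(gate.accept({"type": "metric-aggregation"}))
```

With this shape, a server can check the advertised list before sending anything, and the SDK still rejects a policy type it never opted into.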

Co-authored-by: Jack Berg <[email protected]>

jack-berg · 2025-11-20T15:31:20Z

oteps/4738-telemetry-policy.md

+meter_provider:
+  readers:
+    - my_custom_metric_filtering_reader:
+        my_filter_config: # defines what to filter


You want to filter metrics using a filtering reader (this component doesn't exist in the SDK spec and so would have to be custom) vs. views or meter config?

I'm not sure, I can update this to use views instead as well. I was taking from the proposed OTEP where you can control both the reporting of a metric and the report interval (i.e. periodic metric reader would need configuration for how often to report each set of metrics).

jack-berg · 2025-11-20T15:32:15Z

oteps/4738-telemetry-policy.md

+Here, I've created a custom component in java to allow filtering which metrics are read.
+However, to insert / use this component I need to have all of the following:
+
+- Know that this component exists in the java SDK


If this is a popular use case we should extend the SDK spec to add an additional built in component. We're too reluctant to do this right now.

That still won't tell me if it's safe to send configuration to an SDK or not. I need to know, at runtime, that the version of the SDK I'm trying to control will support that config or if I'll crash a key component.

Additionally, it doesn't help me ignore the implementation detail. E.g. what if I also want to control the Prometheus client library? We don't own their config or their specification. However, we could build something that interacts with remote policies, similar to today's Jaeger-Remote-Sampler for traces.

thompson-tomo · 2025-11-21T02:30:11Z

oteps/4738-telemetry-policy.md

+  - `PolicyProvider`
+    - Can "push" policies into the provider.
+    - Provides "observable" access to policies (e.g. notify on change)


I would foresee 2 SDK components:

- Policy Provider: constructs the identity of the agent and contains a collection of policy detectors. It exposes methods to access the collection of policies provided by the detectors and to notify a component of an updated policy/profile.
- Policy Agent: provides a detect method enabling components to report their policy. Designed to be embedded in components.

thompson-tomo · 2025-11-21T03:49:15Z

oteps/4738-telemetry-policy.md

+Every policy is defined with the following:
+
+- A `type` denoting the use case for the policy
+- A json schema denoting what a valid definitin of the policy entails.


Why don't we define a policy as follows:

- Policy Definitions: an array of policy definitions.
- Instrumentation scope: identifies the component this policy is for, and mirrors the OTLP definition.

The policy definition would contain the type & schema properties as above.

Having the scope allows restricting who the policy applies to. Information about the agent/resource is left out, as OpAMP natively provides this info.

thompson-tomo · 2025-11-21T04:06:56Z

oteps/4738-telemetry-policy.md

+  - Extension Points
+    - `PolicySampler`: Pulls relevant `trace-sampling` policies from
+      PolicyProvider, and uses them.
+    - `PolicyLogProcessor`: Pulls Relevant `log-filter` policies from
+      PolicyProvider and uses them.
+    - `PolicyPeriodicMetricReader`: Pulls Relevant `metric-rate` policies
+      from PolicyProvider and uses them to export metrics.
+    - TODO: SDK-wide attribute processors
+    - TODO: SDK-view policies


Could this be simplified to instead use an applyPolicy method on the PolicyAgent? The PolicyProvider can use the data (scope & policy type) from the detect method to invoke apply only on the correct audience.
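Roughly, the routing described here could look like the following Python sketch. The class and method names follow the comment (`PolicyProvider`, `PolicyAgent`, detect, applyPolicy), but the shapes are assumptions: `Detection`, the `push` method, and the `scope` / `type` keys on a policy are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """What an agent reports about itself: its scope and the policy type it handles."""
    scope: str
    policy_type: str


class PolicyAgent:
    """Hypothetical agent embedded in a component (sampler, log processor, ...)."""

    def __init__(self, scope, policy_type):
        self._detection = Detection(scope, policy_type)
        self.applied = []  # policies this agent has been asked to enforce

    def detect(self):
        return self._detection

    def apply_policy(self, policy):
        self.applied.append(policy)


class PolicyProvider:
    """Hypothetical provider that routes each policy to the matching agents only."""

    def __init__(self, agents):
        self._agents = list(agents)

    def push(self, policy):
        """Deliver the policy to matching agents and return how many received it."""
        target = Detection(policy["scope"], policy["type"])
        matched = [agent for agent in self._agents if agent.detect() == target]
        for agent in matched:
            agent.apply_policy(policy)
        return len(matched)


sampler_agent = PolicyAgent("my.library", "trace-sampling")
log_agent = PolicyAgent("my.library", "log-filter")
provider = PolicyProvider([sampler_agent, log_agent])

delivered = provider.push({"scope": "my.library", "type": "trace-sampling", "ratio": 0.5})
print(delivered, sampler_agent.applied, log_agent.applied)
```

The trade-off versus dedicated extension points is that components no longer pull policies themselves. They only need to embed an agent and implement apply.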

thompson-tomo · 2025-11-21T04:15:29Z

oteps/4738-telemetry-policy.md

+- OpAmp Interaction
+  - Policy = custom extension
+  - Can we safely "roll back" a policy if it caused a breakage?


The OpAMP agent/client should be able to report supported policies back to the OpAMP server.

The OpAMP server should be able to inform the client when a policy is updated, including the scope it applies to.

Signed-off-by: jaronoff97 <[email protected]>

Create OTEP based on kubecon discussions on policy control.

e70377e

jsuereth requested review from a team as code owners November 17, 2025 21:15

jsuereth assigned jsuereth and jaronoff97 Nov 17, 2025

dashpole reviewed Nov 18, 2025


thompson-tomo mentioned this pull request Nov 19, 2025

Markdown specification based on schema open-telemetry/opentelemetry-configuration#266

Closed

jack-berg reviewed Nov 19, 2025


jsuereth and others added 2 commits November 19, 2025 18:00

Update oteps/9999-telemetry-policy.md

da34640

Co-authored-by: Jack Berg <[email protected]>

Add more justification.

6396605

jack-berg reviewed Nov 20, 2025


Rename to PR number.

4a092aa

thompson-tomo reviewed Nov 21, 2025


update with more details from kubecon

4063e20

Signed-off-by: jaronoff97 <[email protected]>

