Monitoring Cloudera Streaming Analytics (CSA) with Prometheus
Updated:
Danger! This is a work-in-progress article; content and code are updated frequently until this notice is removed.
If you followed our previous guides on monitoring Cloudera Streams Messaging (CSM) and Cloudera Flow Management (CFM), you now have visibility into your data ingestion (NiFi) and event streaming (Kafka). But what about monitoring the stream processing jobs (Flink) in Cloudera Streaming Analytics (CSA)?
When running Flink and SQL Stream Builder (SSB) via the CSA Operator, Flink jobs spin up dynamically on Kubernetes. Because these dynamically generated TaskManager pods don't explicitly declare metric ports in their Kubernetes spec, standard Prometheus PodMonitors will silently drop the targets—making metric discovery a bit of a K8s networking puzzle.
In this third and final post of the series, we're going to wire up our CSA Flink jobs to our existing Prometheus + Grafana stack. By using a Headless Service to work around the strict pod-spec matching that PodMonitors rely on, we will finish plugging the CFM NiFi Operator, CSA Flink Operator, and CSM Kafka Operator into a single Prometheus and Grafana stack for monitoring.
Create the Prometheus Values File
Create this file in the root of your repo. This forces Flink to open port 9249 for metrics scraping.
csa-prometheus-values.yaml
# csa-prometheus-values.yaml
# Enables native PrometheusReporter for ALL SQL Stream Builder (SSB) jobs
ssb:
  flinkConfiguration:
    flink-conf.yaml: |
      metrics.reporters: prom
      metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
      metrics.reporter.prom.port: "9249"
      taskmanager.network.detailed-metrics: "true"
      # Optional: cleaner metric labels for Grafana dashboards
      metrics.scope.jm: "flink.jobmanager.<host>"
      metrics.scope.tm: "flink.taskmanager.<host>.<tm_id>"
      metrics.scope.job: "flink.job.<job_id>.<job_name>"
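The optional `metrics.scope.*` patterns control the identifier Flink builds for each metric by substituting variables such as `<host>` and `<tm_id>`. A minimal sketch of that expansion (the host and TaskManager ID values below are made up for illustration):

```python
# Illustrative sketch: how Flink conceptually expands a metrics.scope pattern
# by replacing each <var> placeholder with its runtime value. The variable
# values here are invented examples, not output from a real cluster.
def expand_scope(pattern: str, variables: dict) -> str:
    """Replace each <name> placeholder in the scope pattern with its value."""
    for name, value in variables.items():
        pattern = pattern.replace(f"<{name}>", value)
    return pattern

prefix = expand_scope(
    "flink.taskmanager.<host>.<tm_id>",
    {"host": "taskmanager-1-3", "tm_id": "tm_01"},
)
print(prefix)  # flink.taskmanager.taskmanager-1-3.tm_01
```

Shorter, predictable scope strings make it easier to group series in Grafana.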
Exact Helm Install Command (Fresh Install)
Run this exact command:
helm install csa-operator \
  oci://container.repository.cloudera.com/cloudera-helm/csa-operator/csa-operator \
--namespace cld-streaming \
--create-namespace \
--version 1.5.0-b275 \
--values ./csa-prometheus-values.yaml \
--set 'flink-kubernetes-operator.imagePullSecrets[0].name=cloudera-creds' \
--set 'ssb.sse.image.imagePullSecrets[0].name=cloudera-creds' \
--set 'ssb.sqlRunner.image.imagePullSecrets[0].name=cloudera-creds' \
--set 'ssb.mve.image.imagePullSecrets[0].name=cloudera-creds' \
--set 'ssb.database.imagePullSecrets[0].name=cloudera-creds' \
--set 'ssb.flink.image.imagePullSecrets[0].name=cloudera-creds' \
--set-file flink-kubernetes-operator.clouderaLicense.fileContent=./license.txt
Verify the Install
# 1. Helm release
helm list -n cld-streaming
# 2. All pods running
kubectl get pods -n cld-streaming
# 3. Confirm Prometheus config was applied
helm get values csa-operator -n cld-streaming | grep -A 20 "flink-conf.yaml"
Discovery with Headless Service & ServiceMonitor
Because Flink Native Kubernetes does not explicitly declare port 9249 in its dynamic pod specs, standard PodMonitors will drop the targets. Instead, we bridge the gap using a Headless Service and a ServiceMonitor.
A. Create the Headless Service (csa-flink-service.yaml)
apiVersion: v1
kind: Service
metadata:
  name: csa-flink-metrics-service
  namespace: cld-streaming
  labels:
    app: csa-flink-metrics
spec:
  clusterIP: None  # Makes it a headless service
  selector:
    # This automatically captures ALL Flink pods (JobManagers & TaskManagers)
    type: flink-native-kubernetes
  ports:
    - name: prom-metrics
      port: 9249
      targetPort: 9249
B. Create the ServiceMonitor (csa-flink-service-monitor.yaml)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: csa-flink-metrics-monitor
  namespace: cld-streaming
  labels:
    release: prometheus  # Must match your Prometheus Operator release label
spec:
  selector:
    matchLabels:
      app: csa-flink-metrics
  namespaceSelector:
    matchNames:
      - cld-streaming
  endpoints:
    - port: prom-metrics
      interval: 15s
      scrapeTimeout: 10s
      relabelings:
        # Extracts labels so Grafana dashboards automatically map deployments
        - sourceLabels: [__meta_kubernetes_pod_label_app]
          targetLabel: flink_deployment
        - sourceLabels: [__meta_kubernetes_pod_label_component]
          targetLabel: component
        - sourceLabels: [__meta_kubernetes_pod_name]
          targetLabel: pod
        - sourceLabels: [__meta_kubernetes_namespace]
          targetLabel: namespace
C. Apply both files:
kubectl apply -f csa-flink-service.yaml -n cld-streaming
kubectl apply -f csa-flink-service-monitor.yaml -n cld-streaming
Wait ~30 seconds, then check Prometheus UI (Status -> Targets). You should see your JobManagers and TaskManagers listed as UP under serviceMonitor/cld-streaming/csa-flink-metrics-monitor/0.
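Each relabeling rule in the ServiceMonitor copies a `__meta_kubernetes_*` discovery label onto a stable target label that your dashboards can filter on. A rough sketch of that mapping (the discovered metadata below is invented for illustration; real relabeling also supports regex matching and other actions this ignores):

```python
# Rough sketch of what the ServiceMonitor's relabelings do: each rule copies
# a Kubernetes discovery label onto a final target label. The sample
# discovery metadata below is made up for illustration.
RELABELINGS = [
    ("__meta_kubernetes_pod_label_app", "flink_deployment"),
    ("__meta_kubernetes_pod_label_component", "component"),
    ("__meta_kubernetes_pod_name", "pod"),
    ("__meta_kubernetes_namespace", "namespace"),
]

def apply_relabelings(discovered: dict) -> dict:
    """Return the target labels produced from discovered pod metadata."""
    return {target: discovered[source]
            for source, target in RELABELINGS
            if source in discovered}

discovered = {
    "__meta_kubernetes_pod_label_app": "ssb-session-admin",
    "__meta_kubernetes_pod_label_component": "taskmanager",
    "__meta_kubernetes_pod_name": "ssb-session-admin-taskmanager-1-3",
    "__meta_kubernetes_namespace": "cld-streaming",
}
print(apply_relabelings(discovered))
```

This is why the PromQL queries later in this post can filter on labels like `namespace="cld-streaming"` without any per-job configuration.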
Test Prometheus Metrics (Run an SSB Job)
- Open SSB UI:
minikube service ssb-sse --namespace cld-streaming
- Run any SQL job in SQL Stream Builder.
- Verify metrics are exposed directly from a pod:
# Replace with your actual taskmanager pod name
kubectl exec -it ssb-session-admin-taskmanager-1-3 -n cld-streaming -- \
  curl -s http://localhost:9249/metrics | head -20
You should see metrics prefixed with flink_.
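What curl returns is Prometheus text exposition format: `# HELP`/`# TYPE` comments followed by one sample per line. A minimal sketch of picking out the flink_ series from such output (the sample payload below is invented):

```python
# Minimal sketch: filter flink_ series out of Prometheus text exposition
# output, like what curl http://localhost:9249/metrics returns.
# The sample payload below is invented for illustration.
sample = """\
# HELP flink_taskmanager_Status_JVM_CPU_Load Recent CPU usage of the JVM.
# TYPE flink_taskmanager_Status_JVM_CPU_Load gauge
flink_taskmanager_Status_JVM_CPU_Load{host="tm-1"} 0.042
jvm_threads_current 57
"""

flink_series = [line for line in sample.splitlines()
                if line.startswith("flink_")]
print(flink_series)
```

If the list is empty against a real pod, the PrometheusReporter configuration from the values file has not been applied.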
Querying SSB / Flink Metrics in Prometheus UI
Sample Query 1: JVM CPU Load
flink_taskmanager_Status_JVM_CPU_Load{namespace="cld-streaming"}
Sample Query 2: Job Uptime
flink_jobmanager_job_uptime{namespace="cld-streaming"}
Sample Query 3: Records In/Out Per Second
sum(flink_taskmanager_job_task_operator_numRecordsInPerSecond{namespace="cld-streaming"}) by (job_name)
End-to-End Pipeline (NiFi → SSB → Kafka)
sum(rate(nifi_bytes_sent{namespace="cfm-streaming"}[5m]))
or
sum(flink_taskmanager_job_task_operator_numRecordsInPerSecond{namespace="cld-streaming"})
or
sum(rate(kafka_server_brokertopicmetrics_bytesin_total{namespace="cld-streaming"}[5m]))
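The `rate(...[5m])` queries above estimate per-second throughput from raw counter samples in a 5-minute window. A simplified sketch of that calculation (real PromQL rate() also handles counter resets and extrapolates to the window boundaries, which this ignores; the samples are invented):

```python
# Simplified sketch of PromQL rate(): per-second increase between the first
# and last counter samples in the window. Ignores counter resets and
# boundary extrapolation. The (timestamp, value) samples are invented.
def simple_rate(samples):
    """samples: list of (unix_seconds, counter_value), oldest first."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# A bytes-sent counter observed over a 5-minute (300 s) window.
window = [(0, 1_000_000), (150, 4_000_000), (300, 7_000_000)]
print(simple_rate(window))  # 20000.0 bytes/second
```

Comparing these per-second rates across the NiFi, Flink, and Kafka queries is what lets you spot the stage where throughput drops.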
Visualizing in Grafana
In Grafana, import any of these community dashboards by ID:
- ID: 11049 (Recommended First Test)
  - Name: Flink Dashboard
  - Description: The standard, most reliable community dashboard built explicitly for the Flink Prometheus Exporter. It tracks JobManagers, TaskManagers, JVM metrics, and basic job health.
- ID: 14911
  - Name: Apache Flink Dashboard for Job / Task Manager
  - Description: A slightly more modern layout that breaks down metrics specifically between the Job Manager and Task Manager. Good for digging into CPU/memory constraints.
- ID: 14840
  - Name: Flink Metrics (with Kafka) on K8S
  - Description: Since you are running CSA alongside Kafka, this dashboard is built to monitor Flink applications and includes Kafka throughput panels alongside Kubernetes memory/CPU stats.
Summary
With this final piece in place, you have successfully built a complete, end-to-end observability pipeline across your entire Cloudera Streaming Operators architecture. By bridging CFM (NiFi) for ingestion, CSA (SQL Stream Builder / Flink) for real-time processing, and CSM (Kafka) for event streaming, you now have a unified view of your data’s lifecycle within a single Prometheus and Grafana stack.
In this specific guide we implemented a Headless Service and a ServiceMonitor to bypass the strict pod-spec limitations of Flink Native Kubernetes. This ensures that every dynamically provisioned JobManager and TaskManager is automatically discovered and scraped by Prometheus, completely eliminating the silent “0 targets” discovery failures.
You can now reliably execute complex PromQL queries in Prometheus across namespaces and correlate behavior across entirely different engines. Whether you are tracking backpressure in NiFi, measuring checkpoint durations and records-per-second in Flink, or monitoring consumer lag in Kafka, you finally have the single pane of glass required to confidently debug, tune, scale, and monitor your streaming data pipelines.
Appendix
1. Cleanup / Re-install
helm uninstall csa-operator -n cld-streaming
kubectl delete servicemonitor csa-flink-metrics-monitor -n cld-streaming --ignore-not-found
kubectl delete service csa-flink-metrics-service -n cld-streaming --ignore-not-found
2. Force Prometheus to Re-discover
kubectl rollout restart deployment prometheus-kube-prometheus-operator -n cld-streaming
3. Quick Verification Commands
kubectl get servicemonitor -n cld-streaming
kubectl get service csa-flink-metrics-service -n cld-streaming
kubectl get pods -n cld-streaming -l type=flink-native-kubernetes