---
name: "GCP Cloud Monitoring"
description: "Skill for GCP Cloud Monitoring — auto-generated from documentation"
version: "1.0.0"
author: "skynet"
category: "infrastructure"
agents: ["claude-code", "codex", "gemini"]
tags: ["gcp-monitoring", "infrastructure", "auto-generated"]
---
# GCP Cloud Monitoring
---
name: GCP Cloud Monitoring
description: Use this skill when you need to set up monitoring, alerting, and observability for GCP resources. Essential for tracking performance metrics, creating dashboards, configuring alerts, and troubleshooting system health issues across Google Cloud services.
category: infrastructure
metadata:
author: skynet
version: 1.0.0
---
# GCP Cloud Monitoring
## Overview
Cloud Monitoring provides visibility into performance, uptime, and health of your applications and infrastructure on Google Cloud.
## Prerequisites
```bash
# Authenticate and select a project (assumes the gcloud CLI is already installed)
gcloud auth login
gcloud config set project YOUR_PROJECT_ID
# Enable Cloud Monitoring API
gcloud services enable monitoring.googleapis.com
```
## Core Commands
### Metric Operations
```bash
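# Metric descriptors are managed through the Monitoring API v3 (REST), not a gcloud subcommand
TOKEN="$(gcloud auth print-access-token)"
BASE="https://monitoring.googleapis.com/v3/projects/YOUR_PROJECT_ID"
# List available metrics
curl -s -G -H "Authorization: Bearer ${TOKEN}" "${BASE}/metricDescriptors" \
  --data-urlencode 'filter=metric.type = starts_with("compute.googleapis.com/")'
# Get metric descriptor
curl -s -H "Authorization: Bearer ${TOKEN}" \
  "${BASE}/metricDescriptors/compute.googleapis.com/instance/cpu/utilization"
# Create custom metric descriptor
curl -s -X POST -H "Authorization: Bearer ${TOKEN}" -H "Content-Type: application/json" \
  "${BASE}/metricDescriptors" \
  -d '{"type": "custom.googleapis.com/my_app/requests", "metricKind": "GAUGE", "valueType": "DOUBLE", "description": "Application request count"}'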
```
### Time Series Data
```bash
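# Query time series data (Monitoring API v3 timeSeries.list)
curl -s -G -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://monitoring.googleapis.com/v3/projects/YOUR_PROJECT_ID/timeSeries" \
  --data-urlencode 'filter=metric.type="compute.googleapis.com/instance/cpu/utilization"' \
  --data-urlencode 'interval.startTime=2024-01-01T00:00:00Z' \
  --data-urlencode 'interval.endTime=2024-01-01T01:00:00Z'
# Write custom metric data point (Monitoring API v3 timeSeries.create)
curl -s -X POST -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://monitoring.googleapis.com/v3/projects/YOUR_PROJECT_ID/timeSeries" \
  -d @timeseries.json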
```
### Alert Policies
```bash
# List alert policies
gcloud alpha monitoring policies list
# Create alert policy
gcloud alpha monitoring policies create \
--policy-from-file=alert-policy.json
# Delete alert policy
gcloud alpha monitoring policies delete POLICY_ID
```
## Configuration Files
### Alert Policy JSON
```json
{
  "displayName": "High CPU Usage",
  "combiner": "OR",
  "conditions": [{
    "displayName": "CPU usage above 80%",
    "conditionThreshold": {
      "filter": "metric.type=\"compute.googleapis.com/instance/cpu/utilization\" AND resource.type=\"gce_instance\"",
      "comparison": "COMPARISON_GT",
      "thresholdValue": 0.8,
      "duration": "300s",
      "aggregations": [{
        "alignmentPeriod": "60s",
        "perSeriesAligner": "ALIGN_MEAN"
      }]
    }
  }],
  "notificationChannels": ["projects/PROJECT_ID/notificationChannels/CHANNEL_ID"],
  "alertStrategy": {
    "autoClose": "1800s"
  }
}
```
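Before submitting a policy file, a quick local check can catch the most common structural mistakes. The following is a minimal sketch, not an official validator: `check_alert_policy` is a hypothetical helper, and the rules it applies (a `combiner` is present, threshold filters name both a metric type and a resource type, and the comparison is one of the `ComparisonType` enum values) are a small subset of what the API enforces.

```python
# Comparison values defined by the Monitoring API v3 ComparisonType enum.
VALID_COMPARISONS = {
    "COMPARISON_GT", "COMPARISON_GE", "COMPARISON_LT",
    "COMPARISON_LE", "COMPARISON_EQ", "COMPARISON_NE",
}


def check_alert_policy(policy: dict) -> list[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for field in ("displayName", "combiner", "conditions"):
        if not policy.get(field):
            problems.append(f"missing top-level field: {field}")
    for i, cond in enumerate(policy.get("conditions", [])):
        threshold = cond.get("conditionThreshold")
        if threshold is None:
            continue  # other condition types are not checked in this sketch
        flt = threshold.get("filter", "")
        if "metric.type" not in flt:
            problems.append(f"conditions[{i}]: filter has no metric.type")
        if "resource.type" not in flt:
            problems.append(f"conditions[{i}]: filter has no resource.type")
        if threshold.get("comparison") not in VALID_COMPARISONS:
            problems.append(f"conditions[{i}]: unknown comparison value")
    return problems


example = {
    "displayName": "High CPU Usage",
    "combiner": "OR",
    "conditions": [{
        "conditionThreshold": {
            "filter": 'metric.type="compute.googleapis.com/instance/cpu/utilization" '
                      'AND resource.type="gce_instance"',
            "comparison": "COMPARISON_GT",
            "thresholdValue": 0.8,
        }
    }],
}
print(check_alert_policy(example))
```

To check a file on disk, load it with `json.load` and pass the resulting dict to the same function.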
### Custom Metric Time Series
```json
{
  "timeSeries": [{
    "metric": {
      "type": "custom.googleapis.com/my_app/requests",
      "labels": {
        "environment": "production"
      }
    },
    "resource": {
      "type": "global",
      "labels": {
        "project_id": "YOUR_PROJECT_ID"
      }
    },
    "points": [{
      "interval": {
        "endTime": "2024-01-01T12:00:00Z"
      },
      "value": {
        "doubleValue": 42.5
      }
    }]
  }]
}
```
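A hard-coded `endTime` goes stale quickly, so it can help to generate this payload at write time. The sketch below builds the same structure with the current UTC timestamp. `build_time_series_payload` is a hypothetical helper name, and the `global` resource type simply mirrors the example above.

```python
import json
from datetime import datetime, timezone


def build_time_series_payload(project_id, metric_type, value, labels=None, end_time=None):
    """Build a timeSeries.create request body with a single double-valued point."""
    end_time = end_time or datetime.now(timezone.utc)
    return {
        "timeSeries": [{
            "metric": {"type": metric_type, "labels": labels or {}},
            "resource": {"type": "global", "labels": {"project_id": project_id}},
            "points": [{
                # RFC 3339 UTC timestamp, the format used in the JSON example
                "interval": {"endTime": end_time.strftime("%Y-%m-%dT%H:%M:%SZ")},
                "value": {"doubleValue": float(value)},
            }],
        }]
    }


payload = build_time_series_payload(
    "YOUR_PROJECT_ID",
    "custom.googleapis.com/my_app/requests",
    42.5,
    labels={"environment": "production"},
)
print(json.dumps(payload, indent=2))
```

Writing the printed JSON to `timeseries.json` yields a file in the shape shown above.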
## Common Workflows
### Setting Up Basic Monitoring
```bash
# 1. Enable APIs
gcloud services enable monitoring.googleapis.com
gcloud services enable logging.googleapis.com
# 2. Create notification channel (email)
gcloud alpha monitoring channels create \
--display-name="DevOps Team" \
--type=email \
--channel-labels=email_address=devops@company.com
# 3. List notification channels to get ID
gcloud alpha monitoring channels list
# 4. Create uptime check (confirm flags with `gcloud monitoring uptime create --help`)
gcloud monitoring uptime create "Website Health Check" \
  --resource-type=uptime-url \
  --resource-labels=host=example.com,project_id=YOUR_PROJECT_ID \
  --protocol=https \
  --port=443 \
  --path=/health
```
### Dashboard Creation
```bash
# Create dashboard from JSON
gcloud monitoring dashboards create --config-from-file=dashboard.json
# List dashboards
gcloud monitoring dashboards list
# Export dashboard as YAML
gcloud monitoring dashboards describe DASHBOARD_ID \
  --format=yaml > dashboard-backup.yaml
```
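The `dashboard.json` file passed to `dashboards create` is not shown above. As a rough illustration, the sketch below generates a single-chart dashboard definition. The field names follow the Dashboards API grid layout as best understood here; `build_dashboard` is a hypothetical helper, and the result should be checked against the current Dashboard schema before use.

```python
import json


def build_dashboard(display_name, chart_title, metric_filter):
    """Build a one-widget grid-layout dashboard definition as a dict."""
    return {
        "displayName": display_name,
        "gridLayout": {
            "columns": "2",
            "widgets": [{
                "title": chart_title,
                "xyChart": {
                    "dataSets": [{
                        "timeSeriesQuery": {
                            "timeSeriesFilter": {
                                "filter": metric_filter,
                                "aggregation": {
                                    "alignmentPeriod": "60s",
                                    "perSeriesAligner": "ALIGN_MEAN",
                                },
                            }
                        },
                        "plotType": "LINE",
                    }]
                },
            }],
        },
    }


dashboard = build_dashboard(
    "Compute Overview",  # illustrative names, not taken from a real project
    "CPU utilization",
    'metric.type="compute.googleapis.com/instance/cpu/utilization" '
    'AND resource.type="gce_instance"',
)
print(json.dumps(dashboard, indent=2))
```

Redirecting the output to `dashboard.json` gives a file that can be passed to `--config-from-file`.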
### Log-based Metrics
```bash
# Create log-based metric
gcloud logging metrics create error_count \
--description="Count of application errors" \
--log-filter='severity>=ERROR AND resource.type="gce_instance"'
# List log metrics
gcloud logging metrics list
# Update log metric
gcloud logging metrics update error_count \
--description="Updated error count metric"
```
## Decision Trees
### Choosing Metric Types
```
Need to track a value?
├─ Value accumulates over time? → CUMULATIVE
├─ Value represents current state? → GAUGE
└─ Need to distribute values in buckets? → GAUGE or CUMULATIVE kind with DISTRIBUTION value type
```
### Alert Policy Strategy
```
What triggers the alert?
├─ Metric threshold exceeded?
│ ├─ Single resource → Condition Threshold
│ └─ Multiple resources → Condition Threshold with grouping
├─ Resource becomes unavailable? → Uptime Check
└─ Log pattern detected? → Log-based metric + threshold
```
### Notification Channel Selection
```
How urgent is the alert?
├─ Critical (immediate) → SMS + PagerDuty
├─ Important (< 1 hour) → Email + Slack
└─ Informational → Email only
```
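The urgency tree above can be encoded as a small lookup so that policy-generation scripts pick channels consistently. This is a sketch only: the urgency names and the `channels_for` helper are hypothetical, and the channel identifiers are lowercase stand-ins for the channel types named in the tree.

```python
# Mapping taken directly from the decision tree above.
CHANNELS_BY_URGENCY = {
    "critical": ["sms", "pagerduty"],
    "important": ["email", "slack"],
    "informational": ["email"],
}


def channels_for(urgency: str) -> list[str]:
    """Return the channel types for an urgency level, rejecting unknown levels."""
    key = urgency.strip().lower()
    if key not in CHANNELS_BY_URGENCY:
        raise ValueError(f"unknown urgency: {urgency!r}")
    return list(CHANNELS_BY_URGENCY[key])


print(channels_for("Critical"))
```

Raising on unknown levels is a deliberate choice here: a typo in an urgency value should fail loudly rather than silently fall back to email.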
## Monitoring Best Practices
### Resource Labeling
```bash
# Add monitoring labels to compute instances
gcloud compute instances add-labels INSTANCE_NAME \
--labels=environment=prod,team=backend,service=api
# Create alerts with label filters (instance labels surface as metadata.user_labels)
# Filter: metadata.user_labels.environment="prod" AND metadata.user_labels.service="api"
```
### Custom Metrics Implementation
```python
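# Python example for writing custom metrics (google-cloud-monitoring 2.x)
import time

from google.cloud import monitoring_v3

PROJECT_ID = "YOUR_PROJECT_ID"  # replace with your project ID

client = monitoring_v3.MetricServiceClient()
project_name = f"projects/{PROJECT_ID}"

# Describe the series: metric type plus the monitored resource it belongs to
series = monitoring_v3.TimeSeries()
series.metric.type = "custom.googleapis.com/my_app/active_users"
series.resource.type = "global"
series.resource.labels["project_id"] = PROJECT_ID

# Build one data point stamped with the current time
interval = monitoring_v3.TimeInterval({"end_time": {"seconds": int(time.time())}})
point = monitoring_v3.Point({"interval": interval, "value": {"int64_value": 1234}})
series.points = [point]

# Write custom metric
client.create_time_series(name=project_name, time_series=[series])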
```
## Advanced Configuration
### Multi-Condition Alerts
```json
{
  "displayName": "Complex Service Health Alert",
  "combiner": "OR",
  "conditions": [
    {
      "displayName": "High Error Rate",
      "conditionThreshold": {
        "filter": "metric.type=\"logging.googleapis.com/user/error_count\" AND resource.type=\"gce_instance\"",
        "comparison": "COMPARISON_GT",
        "thresholdValue": 10
      }
    },
    {
      "displayName": "Low Request Rate",
      "conditionThreshold": {
        "filter": "metric.type=\"loadbalancing.googleapis.com/https/request_count\" AND resource.type=\"https_lb_rule\"",
        "comparison": "COMPARISON_LT",
        "thresholdValue": 100
      }
    }
  ]
}
```
### MQL (Monitoring Query Language)
```bash
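# Query with MQL (Monitoring API v3 timeSeries:query endpoint)
curl -s -X POST -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://monitoring.googleapis.com/v3/projects/YOUR_PROJECT_ID/timeSeries:query" \
  -d '{"query": "fetch gce_instance::compute.googleapis.com/instance/cpu/utilization | group_by 1m, [value_utilization_mean: mean(value.utilization)] | every 1m"}'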
```
## Troubleshooting
### Common Errors
**Error: "Permission denied" when creating alerts**
```bash
# Solution: Add required IAM roles
gcloud projects add-iam-policy-binding PROJECT_ID \
--member="user:user@example.com" \
--role="roles/monitoring.alertPolicyEditor"
```
**Error: "Metric not found" for custom metrics**
```bash
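# Check if metric descriptor exists (Monitoring API v3 metricDescriptors.list)
curl -s -G -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://monitoring.googleapis.com/v3/projects/YOUR_PROJECT_ID/metricDescriptors" \
  --data-urlencode 'filter=metric.type = starts_with("custom.googleapis.com/")'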
# Verify metric type spelling and wait up to 2 minutes for propagation
```
**Error: "Invalid time series" when writing metrics**
```bash
# Ensure timestamp is not older than 25 hours
# Verify resource type and labels match metric descriptor
# Check that metric kind matches the operation (GAUGE vs CUMULATIVE)
```
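The timestamp rule can be checked locally before a write is attempted. The sketch below applies the 25-hour limit mentioned above. The 5-minute future allowance is an assumption added for illustration and should be confirmed against the current API documentation; `timestamp_problem` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=25)      # limit stated in the checklist above
MAX_FUTURE = timedelta(minutes=5)  # assumption: verify against the API docs


def timestamp_problem(end_time, now=None):
    """Return a short description of the problem, or None if the timestamp looks writable."""
    now = now or datetime.now(timezone.utc)
    if end_time < now - MAX_AGE:
        return "timestamp is older than 25 hours"
    if end_time > now + MAX_FUTURE:
        return "timestamp is too far in the future"
    return None


reference = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
print(timestamp_problem(reference - timedelta(hours=26), now=reference))
print(timestamp_problem(reference - timedelta(hours=1), now=reference))
```

Both arguments are expected to be timezone-aware datetimes; mixing naive and aware values raises a `TypeError` on comparison.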
**Alert not firing despite threshold being exceeded**
```bash
# Debug checklist:
# 1. Verify notification channels are valid
gcloud alpha monitoring channels list
# 2. Check alert policy status
gcloud alpha monitoring policies list --filter="displayName:YOUR_ALERT_NAME"
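# 3. Verify metric data exists in time range (GNU date shown; adjust on macOS/BSD)
curl -s -G -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://monitoring.googleapis.com/v3/projects/YOUR_PROJECT_ID/timeSeries" \
  --data-urlencode 'filter=YOUR_FILTER' \
  --data-urlencode "interval.startTime=$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --data-urlencode "interval.endTime=$(date -u +%Y-%m-%dT%H:%M:%SZ)"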
# 4. Check alert policy condition duration vs actual breach duration
```
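Step 4 of the checklist is often the culprit: a breach shorter than the condition's `duration` never opens an incident. The sketch below approximates that rule on a list of already-aligned samples. It assumes each sample covers one alignment period and that the condition needs `ceil(duration / period)` consecutive violating samples, which simplifies the service's actual evaluation. `condition_would_fire` is a hypothetical helper.

```python
import math


def condition_would_fire(values, threshold, alignment_period_s=60, duration_s=300):
    """Approximate whether a greater-than threshold condition would fire.

    values: aligned samples in time order, one per alignment period.
    """
    needed = max(1, math.ceil(duration_s / alignment_period_s))
    run = 0  # length of the current streak of violating samples
    for value in values:
        run = run + 1 if value > threshold else 0
        if run >= needed:
            return True
    return False


# Four minutes above 0.8, a dip, then three more minutes: never five in a row.
samples = [0.9, 0.9, 0.9, 0.9, 0.5, 0.9, 0.9, 0.9]
print(condition_would_fire(samples, threshold=0.8))
```

If a check like this returns `False` for data that looks alarming on a chart, shortening `duration` or lengthening the alignment period in the policy is usually the next thing to try.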
### Debugging Queries
```bash
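# Test metric filters (instance_name is a metric label; GNU date shown)
curl -s -G -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://monitoring.googleapis.com/v3/projects/YOUR_PROJECT_ID/timeSeries" \
  --data-urlencode 'filter=metric.type="compute.googleapis.com/instance/cpu/utilization" AND metric.labels.instance_name="my-instance"' \
  --data-urlencode "interval.startTime=$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --data-urlencode "interval.endTime=$(date -u +%Y-%m-%dT%H:%M:%SZ)"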
# Validate custom metric ingestion
gcloud logging read 'jsonPayload.message:"time series"' \
--limit=10 \
--format="table(timestamp, jsonPayload.message)"
```
### Performance Optimization
```bash
# Use appropriate aggregation periods
# 1m for real-time alerts
# 5m for standard monitoring
# 1h for trend analysis
# Limit metric cardinality
# Avoid high-cardinality labels (user IDs, timestamps)
# Use resource labels instead of metric labels when possible
```
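To see why high-cardinality labels matter, it helps to estimate how many distinct time series a metric can produce. The sketch below uses the worst case, where every combination of label values occurs, so the count is the product of the distinct values per label. `estimate_series_count` and the label counts are illustrative, not measured data.

```python
import math


def estimate_series_count(distinct_values_per_label: dict) -> int:
    """Upper bound on time series: product of distinct values for each label."""
    return math.prod(distinct_values_per_label.values())


low_cardinality = {"environment": 3, "region": 5}
high_cardinality = {**low_cardinality, "user_id": 10_000}

print(estimate_series_count(low_cardinality))
print(estimate_series_count(high_cardinality))
```

Adding one per-user label multiplies the series count by the number of users, which is the reason the guidance above steers away from labels such as user IDs and timestamps.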