Troubleshooting Coolify Deployments
How to Build a Grafana Incident Response Dashboard
Overview
This guide walks you through creating a Grafana dashboard for troubleshooting server outages and performance issues. You’ll learn how to build panels that help you quickly identify problems, trace them back to specific users, and take action during incidents. The Pro6PP project is used throughout as a reference example.
Prerequisites:
- Access to Grafana instance
- Loki data source configured
- Logs from Coolify proxy available in Loki
- Logs from application available in Loki
Note: See the setup guide for information on how to set up the monitoring stack.
Understanding Log Structure
Our Coolify proxy logs are structured in JSON format with nested objects. Here’s what a typical log entry looks like:
```json
{
  "level": "info",
  "ts": 1696077600.123,
  "logger": "http.log.access",
  "msg": "handled request",
  "log": {
    "request_client_ip": "89.149.208.150",
    "request_path": "/api/v1/lookup",
    "request_method": "GET",
    "status": 200,
    "duration": 0.045,
    "auth_key": "abc123xyz789"
  }
}
```
Note: This is a specific Caddy proxy log example.
Important: Notice the nested structure. The actual request data is inside a `log` object. This is crucial for building queries.
To check the exact log structure for a specific container, run the following Loki query:

```logql
{container_name="/coolify-proxy"} | json
```

This helps when creating queries based on the log structure.
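If the fields you need live inside the nested `log` object, you can also chain a second parse to confirm they are extractable. This is just a quick verification sketch reusing the structure shown above; the panel queries later in this guide rely on the same pattern:

```logql
{container_name="/coolify-proxy"} | json | line_format "{{.log}}" | json
```

After the second `json` stage, inner fields such as `request_client_ip` and `auth_key` should appear as parsed fields in Explore.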
Step 1: Create a New Dashboard
- Navigate to Dashboards -> New Dashboard
- Click Add visualization
- Select your Loki data source
- Before adding panels, configure dashboard settings:
- Click the gear icon at the top
- General tab:
- Name: “Pro6PP Incident Dashboard”
- Description: “Troubleshooting Pro6PP production deployment if there is an outage or server gets overloaded”
- Tags: Add `incident-response`, `monitoring`, `<project-name>`
- Time options tab:
- Timezone: Browser Time
- Auto refresh: 30s (optional, useful during active incidents)
- Click Save dashboard
Step 2: Panel 1 - Requests Per Minute
This panel shows your traffic volume over time, helping you spot traffic spikes or drops.
Panel 1 Configuration
- Click Add -> Visualization
- Select Loki data source
- Panel type (on the right side): Time series
- Query configuration:
- Switch to Code mode (toggle in query builder)
- Enter the following query:
```logql
sum(rate({container_name="/fwgss4sw8ck4kw4owk084ggk-142613463131"} [1m])) * 60
```
Note: This container name is the pro6pp-api container, shown here as a real-world example.
Understanding the query:
- `{container_name="/fwgss4sw8ck4kw4owk084ggk-142613463131"}` - Filters logs from your API service container
- `rate([1m])` - Calculates the per-second rate over 1-minute windows
- `sum()` - Aggregates all streams together
- `* 60` - Converts the per-second rate to per-minute
- Panel settings:
  - Title: “Requests Per Minute (API service)”
  - Panel options -> Legend: Hidden (cleaner view for a single metric)
  - Tooltip mode: Single (shows one value at a time)
- Visual customization:
  - Graph styles -> Line width: 1
  - Fill opacity: 0 (no fill under the line)
  - Line interpolation: Linear
- Click Apply to save the panel
Finding Your Container Name
If you don’t know your container name:
```bash
# SSH into your server
ssh root@your-server

# Or use the terminal in Coolify for your server

# List running containers
docker ps --format "table {{.Names}}\t{{.Image}}"
```
Step 3: Panel 2 - Top 10 IPs Accessing Proxy
This is your most critical panel for identifying abusive users or traffic sources.
Panel 2 Configuration
- Click Add -> Visualization
- Select Loki data source
- Panel type: Table
- Query configuration:
- Switch to Code mode
- Enter this query:
```logql
topk(10,
  sum by (request_client_ip) (
    count_over_time(
      {container_name="/coolify-proxy"} | json | line_format "{{.log}}" | json | request_client_ip != "" [2h]
    )
  )
)
```
Understanding this complex query (step by step):
- `{container_name="/coolify-proxy"}` - Get logs from the Coolify proxy
- `| json` - Parse the outer JSON structure
- `| line_format "{{.log}}"` - Critical step: extract the nested `log` object and make it the new log line
- `| json` - Parse the inner JSON (now we can access `request_client_ip`)
- `| request_client_ip != ""` - Filter out empty IPs
- `[2h]` - Look back 2 hours
- `count_over_time()` - Count log lines in the time range
- `sum by (request_client_ip)` - Group counts by IP address
- `topk(10, ...)` - Return only the top 10 IPs
Add transformation (this converts the time series to a table):
- Click Transform tab
- Click Add transformation
- Select Reduce
- Mode: Series to rows
- Calculations: Sum
- Labels: Labels to fields (yes)
- Include time field: (no)
Panel settings:
- Title: “Top IPs accessing Caddy Proxy last 2h”
Table options:
- Show header: check
- Cell height: Small (fits more rows)
- Sort by: Total (descending) - this is set automatically
Click Apply to save the panel
Interpreting the Results
The table will show:
| request_client_ip | Total |
|---|---|
| 89.149.208.150 | 15,234 |
| 188.166.16.67 | 8,456 |
| … | … |
Important 1: Check whether any single IP is overloading the server. In the context of Pro6PP, take the IP, use it in the next panel’s query, and find the account it belongs to. If that is not possible in your project and the IP shows irregular activity compared to its usual volume, it can be blocked at the proxy level until the server is back to normal (a quick per-IP count query is sketched after these notes).
Important 2: For Pro6PP, make sure not to block our own proxy.pro6pp.nl IP.
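Before taking action, it helps to quantify how much traffic a single IP is actually producing. The query below is a minimal sketch, reusing the example IP from the table above and the same nested-JSON parsing as the Top IPs panel, that counts that IP’s requests over the last hour so you can compare it against its usual volume:

```logql
sum(count_over_time({container_name="/coolify-proxy"} | json | line_format "{{.log}}" | json | request_client_ip = "89.149.208.150" [1h]))
```

Dividing the result by 60 gives an approximate req/min figure for that single IP.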
Step 4: Panel 3 - Find Account by IP
This panel helps you trace suspicious IPs back to specific accounts by searching for their auth_key in the logs.
Important: If an IP cannot be related to an account in your project, this panel can be skipped.
Panel 3 Configuration
- Click Add -> Visualization
- Select Loki data source
- Panel type: Logs
- Query configuration:
- Switch to Builder mode
- Label filters: `container_name` = `/coolify-proxy`
- Line contains: `89.149.208.150` (example IP)

The builder mode creates this query:

```logql
{container_name="/coolify-proxy"} |= `89.149.208.150`
```
- Panel settings:
  - Title: “Find Pro6PP Account by IP”
  - Description: “Search the logs for lines containing the IP we want to filter on, and find the account `auth_key` that performs the requests”
- Logs options:
  - Enable log details: Yes (allows expanding log lines)
  - Show time: No (cleaner view)
  - Show labels: No (reduces clutter)
  - Wrap log message: No (prevents line wrapping)
  - Deduplication: None
  - Order: Descending (newest first)
- Click Apply to save the panel (a query-side variant for isolating the auth_key is sketched just below)
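If you prefer to narrow the results from the query side instead of scrolling through every line, the following variant is only a sketch, assuming the nested `log` structure and the `auth_key` field from the log example at the top of this guide; it keeps only lines where an auth_key is present:

```logql
{container_name="/coolify-proxy"} |= `89.149.208.150` | json | line_format "{{.log}}" | json | auth_key != ""
```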
How to Use This Panel During an Incident
- Get a suspicious IP from the “Top IPs” panel (created in the previous step)
- Click Edit on this panel
- Replace `89.149.208.150` with the suspicious IP
- Click Run query
- Look through the logs for the `auth_key` field
- Use that auth_key to identify which Pro6PP account is responsible
Example log line you’ll see:
```
{
  "log": {
    "request_client_ip": "89.149.208.150",
    "auth_key": "sk_live_abc123xyz789",
    ...
  }
}
```
Step 5: Arrange Your Dashboard Layout
Now organize your panels for optimal incident response:
- Click the dashboard settings icon
- Drag panels to arrange them:
- Top row: Requests Per Minute (full width)
- Second row: Find Pro6PP Account by IP (full width)
- Third row: Top IPs table (full width)
Or use a 2-column layout:
- Left column: Requests Per Minute, Top IPs
- Right column: Find Pro6PP Account by IP
- Resize panels by dragging the bottom-right corner
Recommended layout for incidents:
```
┌─────────────────────────────────────────┐
│ Requests Per Minute (full width)        │
│ (Shows traffic trends)                  │
└─────────────────────────────────────────┘
┌──────────────────┬──────────────────────┐
│ Top IPs          │ Find Account by IP   │
│ (Identify        │ (Trace to account)   │
│  abusers)        │                      │
└──────────────────┴──────────────────────┘
```
Step 6: Configure Time Range
Set up appropriate time ranges for incident investigation:
- Click the time picker in the top-right corner
- Default view: Last 6 hours
- Quick ranges to add:
  - Last 15 minutes (active incident)
  - Last 1 hour (recent issue)
  - Last 6 hours (investigation)
  - Last 24 hours (post-mortem)
- Refresh rate:
  - During incidents: 10s or 30s
  - Normal monitoring: 1m or 5m
Common Issues and Solutions
Issue 1: “No data” in Top IPs Panel
Symptom: Table shows no results or “No data”
Causes:
- Wrong container name
- Logs aren’t in the expected nested JSON format
- Field name is different (not `request_client_ip`)
Debug steps:
- First, verify logs exist:
```logql
{container_name="/coolify-proxy"}
```
If this shows logs, the container name is correct.
- Check log structure:
```logql
{container_name="/coolify-proxy"} | json
```
Look at the raw logs. Is there a nested `log` object?
- If logs are NOT nested, use this simpler query:
```logql
topk(10,
  sum by (request_client_ip) (
    count_over_time(
      {container_name="/coolify-proxy"} | json | request_client_ip != "" [2h]
    )
  )
)
```
(Remove the line_format step)
- If the field name is different (e.g., `remote_ip`):
```logql
topk(10,
  sum by (remote_ip) (
    count_over_time(
      {container_name="/coolify-proxy"} | json | line_format "{{.log}}" | json | remote_ip != "" [2h]
    )
  )
)
```
Issue 2: “Find Account by IP” Shows Raw JSON
Symptom: Logs show full JSON blobs, hard to read
Solution: Enable Prettify JSON in logs panel options:
- Edit the panel
- Logs options -> Prettify log message: (check)
- Enable log details: (check)
This makes the JSON readable and collapsible.
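An alternative to Prettify is to reshape each log line on the query side with a final `line_format`. This is only a sketch, assuming the field names from the log example at the top of this guide:

```logql
{container_name="/coolify-proxy"} |= `89.149.208.150` | json | line_format "{{.log}}" | json | line_format "ip={{.request_client_ip}} key={{.auth_key}} {{.request_method}} {{.request_path}} -> {{.status}}"
```

Each matching line then renders as a compact one-line summary instead of a JSON blob.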
Issue 3: Queries Are Slow
Symptom: Panels take >10 seconds to load
Solutions:
- Reduce time range: Change `[2h]` to `[1h]`
- Add more filters: Be more specific, e.g. `{container_name="/coolify-proxy", namespace="production"}`
- Use larger intervals: Change `[1m]` to `[5m]` in rate calculations
- Limit results: `topk(5, ...)` instead of `topk(10, ...)`

Advanced: Creating Alerts from These Panels
Alert 1: Traffic Spike Detection
Based on the “Requests Per Minute” panel:
- Edit the panel -> Alert tab
- Create alert rule from this panel
- Configure:
- Name: “Pro6PP Traffic Spike”
- Condition:
  - WHEN `last()` of query A
  - IS ABOVE `3000` (adjust based on your normal traffic; see the baseline sketch after this list)
- Evaluate every: 1m
- For: 5m (must be true for 5 minutes to fire)
- Annotations:
- Summary: “Traffic spike detected: {{$values.A}} req/min”
- Description: “Check Top IPs panel for potential abuse”
- Notification: Select and set up Alertmanager to send Mattermost alerts
- Save
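To pick the threshold from data instead of guessing, you can first check your average traffic over a longer window. The following is a rough sketch, reusing the Panel 1 container selector:

```logql
sum(rate({container_name="/fwgss4sw8ck4kw4owk084ggk-142613463131"} [24h])) * 60
```

This returns the average requests per minute over the last 24 hours; setting the alert threshold a comfortable margin above it (the guide uses 3000 as an example) helps avoid false positives.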
Alert 2: Abusive IP Detection
Create a new panel specifically for this alert:
- Add new panel with query:
```logql
max by (request_client_ip) (
  sum by (request_client_ip) (
    rate({container_name="/coolify-proxy"} | json | line_format "{{.log}}" | json [1m])
  ) * 60
)
```
This shows req/min per IP.
- Create alert:
  - Condition: WHEN `max()` IS ABOVE `1000`
  - For: 3m
  - Summary: “Abusive IP detected: {{$labels.request_client_ip}} making {{$values.A}} req/min”
- Hide this panel from dashboard view (it’s just for alerting)
Using the Dashboard During an Incident
Real-World Scenario: Pro6PP September 30th Outage
Here’s how you would have used this dashboard during the actual outage:
9:30 AM - Alerts fire
- Check “Requests Per Minute”:
  - See traffic at a normal ~2,500 req/min
  - Notice performance degradation, not a traffic spike
  - Conclusion: Not a DDoS, likely server overload
- Check “Top 10 IPs”:
  - See `89.149.208.150` at the top with 8,500 requests (2 hours)
  - Normal for a heavy user (about 70 req/min average)
  - Conclusion: No single abusive IP
- Investigate server resources (in the NodeExporter Dashboard):
  - CPU at 100%
  - Conclusion: Server can’t handle the load
Action taken: Rescale server from CX42 back to CPX41
Result: Service restored at 11:40 AM
Best Practices
- Check panels in order:
  - Requests Per Minute (is traffic abnormal?)
  - Top IPs (is someone abusing?)
  - Find Account (who is the user?)
  - Server resources (is the server struggling?)
- Take screenshots:
  - Panel menu -> Share -> Link -> Copy
  - Include in your RCA
- Export data:
  - Click panel title -> Inspect -> Data -> Download CSV
  - Save for post-incident analysis
- Document actions:
  - Add comments to the dashboard: Dashboard settings -> Version history -> Add note
Troubleshooting Quick Reference
| Problem | Quick Fix |
|---|---|
| No data in panels | Verify container name with docker ps |
| Queries timeout | Reduce time range or add more filters |
| Can’t find auth_key | Enable “Prettify” and “Log details” in logs panel |
| Alert not firing | Check Alerting -> Alert rules for errors |
| Dashboard slow | Increase refresh interval to 1m or 5m |
Complete Dashboard JSON
Your final dashboard JSON should look like the one provided, with these key elements:
```
{
  "title": "Pro6PP Incident Dashboard",
  "tags": ["incident-response", "monitoring", "pro6pp"],
  "panels": [
    // Panel 1: Traffic volume
    // Panel 2: Top IPs identification
    // Panel 3: IP-to-account investigation
  ],
  "time": { "from": "now-6h", "to": "now" },
  "refresh": "30s"
}
```
Additional Resources
- Grafana Loki Docs: https://grafana.com/docs/loki/latest/
- LogQL Syntax: https://grafana.com/docs/loki/latest/logql/
- Panel Transformations: https://grafana.com/docs/grafana/latest/panels-visualizations/query-transform-data/
- Alerting Guide: https://grafana.com/docs/grafana/latest/alerting/
Summary
You’ve now learned to create a production-ready incident response dashboard with:
- Traffic monitoring - Spot spikes and drops
- IP tracking - Identify suspicious sources
- Account tracing - Connect IPs to Pro6PP users
- Alert configuration - Get notified proactively
- Incident workflow - Investigate systematically
Key takeaway: The nested JSON structure in your logs requires the `line_format "{{.log}}"` step to access inner fields. This is the most common stumbling block when creating queries.
Next steps: Practice using this dashboard during normal operations, so you’re comfortable with it when an incident occurs.