
How to Build a Grafana Incident Response Dashboard

Overview

This guide walks you through creating a Grafana dashboard for troubleshooting server outages and performance issues. You’ll learn how to build panels that help you quickly identify problems, trace them back to specific users, and take action during incidents. The Pro6PP project is used throughout as a reference example.

Prerequisites:

  • Access to a Grafana instance
  • A Loki data source configured
  • Logs from the Coolify proxy available in Loki
  • Logs from your application available in Loki

Note: See the setup guide for information on how to set up the monitoring stack.


Understanding Log Structure

Our Coolify proxy logs are structured in JSON format with nested objects. Here’s what a typical log entry looks like:

{
  "level": "info",
  "ts": 1696077600.123,
  "logger": "http.log.access",
  "msg": "handled request",
  "log": {
    "request_client_ip": "89.149.208.150",
    "request_path": "/api/v1/lookup",
    "request_method": "GET",
    "status": 200,
    "duration": 0.045,
    "auth_key": "abc123xyz789"
  }
}

NOTE: This is a Caddy-specific proxy log example.

Important: Notice the nested structure. The actual request data is inside a log object. This is crucial for building queries.
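The effect of that nesting can be shown with a short Python sketch (the log line below is a trimmed, hypothetical example): parsing only the outer JSON does not expose the request fields; they become reachable only after extracting the nested log object, which is the role `line_format "{{.log}}"` plays in the LogQL queries later in this guide.

```python
import json

# Trimmed, hypothetical log line mirroring the nested structure above.
line = ('{"level": "info", "msg": "handled request", '
        '"log": {"request_client_ip": "89.149.208.150", "status": 200}}')

outer = json.loads(line)                 # first parse: only top-level fields
assert "request_client_ip" not in outer  # request data is still nested
inner = outer["log"]                     # the step line_format "{{.log}}" performs
assert inner["request_client_ip"] == "89.149.208.150"
```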

To check the exact log structure for your own containers, run the following Loki query:

{container_name="/coolify-proxy"} | json

This helps when building queries that depend on the exact log structure.

Step 1: Create a New Dashboard

  1. Navigate to Dashboards -> New Dashboard
  2. Click Add visualization
  3. Select your Loki data source
  4. Before adding panels, configure dashboard settings:
    • Click the gear icon at the top
    • General tab:
      • Name: “Pro6PP Incident Dashboard”
      • Description: “Troubleshooting Pro6PP production deployment if there is an outage or server gets overloaded”
      • Tags: Add incident-response, monitoring, <project-name>
    • Time options tab:
      • Timezone: Browser Time
      • Auto refresh: 30s (optional, useful during active incidents)
    • Click Save dashboard

Step 2: Panel 1 - Requests Per Minute

This panel shows your traffic volume over time, helping you spot traffic spikes or drops.

Panel 1 Configuration

  1. Click Add -> Visualization
  2. Select Loki data source
  3. Panel type (on the right side): Time series
  4. Query configuration:
    • Switch to Code mode (toggle in query builder)
    • Enter the following query:
sum(rate({container_name="/fwgss4sw8ck4kw4owk084ggk-142613463131"} [1m])) * 60

NOTE: This is the pro6pp-api container name, used here as a real-world example.

Understanding the query:

  • {container_name="/fwgss4sw8ck4kw4owk084ggk-142613463131"} - Filters logs from your API service container
  • rate([1m]) - Calculates the per-second rate over 1-minute windows
  • sum() - Aggregates all streams together
  • * 60 - Converts per-second rate to per-minute
  5. Panel settings:

    • Title: “Requests Per Minute (API service)”
    • Panel options -> Legend: Hidden (cleaner view for single metric)
    • Tooltip mode: Single (shows one value at a time)
  6. Visual customization:

    • Graph styles -> Line width: 1
    • Fill opacity: 0 (no fill under line)
    • Line interpolation: Linear
  7. Click Apply to save the panel
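The rate-to-per-minute conversion above can be sanity-checked with a minimal Python sketch (the sample count is made up):

```python
# rate({...}[1m]) reports an average per-second rate over the window;
# multiplying by 60 turns it into the per-minute value the panel plots.
lines_in_window = 150          # hypothetical log lines in a 1-minute window
window_seconds = 60
per_second = lines_in_window / window_seconds   # what rate() returns
per_minute = per_second * 60                    # what the panel shows
```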

Finding Your Container Name

If you don’t know your container name:

# SSH into your server
ssh root@your-server
# Or use the terminal in Coolify of your server
# List running containers
docker ps --format "table {{.Names}}\t{{.Image}}"

Step 3: Panel 2 - Top 10 IPs Accessing Proxy

This is your most critical panel for identifying abusive users or traffic sources.

Panel 2 Configuration

  1. Click Add -> Visualization
  2. Select Loki data source
  3. Panel type: Table
  4. Query configuration:
    • Switch to Code mode
    • Enter this query:
topk(10, sum by (request_client_ip) (
  count_over_time(
    {container_name="/coolify-proxy"}
      | json
      | line_format "{{.log}}"
      | json
      | request_client_ip!=""
    [2h]
  )
))

Understanding this complex query (step by step):

  1. {container_name="/coolify-proxy"} - Get logs from Coolify proxy
  2. | json - Parse the outer JSON structure
  3. | line_format "{{.log}}" - Critical step: Extract the nested log object and make it the new log line
  4. | json - Parse the inner JSON (now we can access request_client_ip)
  5. | request_client_ip!="" - Filter out empty IPs
  6. [2h] - Look back 2 hours
  7. count_over_time() - Count log lines in the time range
  8. sum by (request_client_ip) - Group counts by IP address
  9. topk(10, ...) - Return only the top 10 IPs
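Conceptually, the aggregation part of this pipeline behaves like the following Python sketch over a toy set of parsed entries (IPs and counts are made up):

```python
from collections import Counter

# Toy parsed inner-log entries; the empty IP mimics lines dropped by
# the request_client_ip!="" filter.
entries = [
    {"request_client_ip": "89.149.208.150"},
    {"request_client_ip": "89.149.208.150"},
    {"request_client_ip": "89.149.208.150"},
    {"request_client_ip": "188.166.16.67"},
    {"request_client_ip": ""},
]

# count_over_time + sum by (request_client_ip): count lines per IP
counts = Counter(e["request_client_ip"] for e in entries if e["request_client_ip"])

# topk(10, ...): keep only the highest-count IPs, largest first
top = counts.most_common(10)
```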

Add transformation (this converts the time series to a table):

  • Click Transform tab
  • Click Add transformation
  • Select Reduce
  • Mode: Series to rows
  • Calculations: Sum
  • Labels: Labels to fields (yes)
  • Include time field: (no)

Panel settings:

  • Title: “Top IPs accessing Caddy Proxy last 2h”

Table options:

  • Show header: check
  • Cell height: Small (fits more rows)
  • Sort by: Total (descending) - this is set automatically

Click Apply to save the panel

Interpreting the Results

The table will show:

| request_client_ip | Total |
| --- | --- |
| 89.149.208.150 | 15,234 |
| 188.166.16.67 | 8,456 |

Important 1: Check whether any single IP is overloading the server. For Pro6PP, we use the IP in the next panel’s query to find the account it belongs to. If your project cannot relate an IP to an account and the IP shows unusually irregular activity, it can be blocked at the proxy level until the server is back to normal.

Important 2: For Pro6PP, take care not to block our own proxy.pro6pp.nl IP.

Step 4: Panel 3 - Find Account by IP

This panel helps you trace suspicious IPs back to specific accounts by searching for their auth_key in the logs.

Important: If an IP cannot be related to an account in your project, this panel can be skipped.

Panel 3 Configuration

  1. Click Add -> Visualization
  2. Select Loki data source
  3. Panel type: Logs
  4. Query configuration:
    • Switch to Builder mode
    • Label filters: container_name = /coolify-proxy
    • Line contains: 89.149.208.150 (example IP)

The builder mode creates this query:

{container_name="/coolify-proxy"} |= `89.149.208.150`
  5. Panel settings:

    • Title: “Find Pro6PP Account by IP”
    • Description: “Search the logs for lines containing the IP you want to trace, then find the account auth_key performing the requests”
  6. Logs options:

    • Enable log details: Yes (allows expanding log lines)
    • Show time: No (cleaner view)
    • Show labels: No (reduces clutter)
    • Wrap log message: No (prevents line wrapping)
    • Deduplication: None
    • Order: Descending (newest first)
  7. Click Apply to save the panel

How to Use This Panel During an Incident

  1. Get a suspicious IP from the “Top IPs” panel created in Step 3
  2. Click Edit on this panel
  3. Replace 89.149.208.150 with the suspicious IP
  4. Click Run query
  5. Look through the logs for the auth_key field
  6. Use that auth_key to identify which Pro6PP account is responsible

Example log line you’ll see:

{
  "log": {
    "request_client_ip": "89.149.208.150",
    "auth_key": "sk_live_abc123xyz789",
    ...
  }
}
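Pulling the auth_key out of such a matched line amounts to the following Python sketch (line and key value are hypothetical):

```python
import json

# Hypothetical log line matched by the panel's |= IP filter.
matched = ('{"log": {"request_client_ip": "89.149.208.150", '
           '"auth_key": "sk_live_abc123xyz789"}}')

# The auth_key field identifies the account behind the IP.
auth_key = json.loads(matched)["log"]["auth_key"]
```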

Step 5: Arrange Your Dashboard Layout

Now organize your panels for optimal incident response:

  1. Click the dashboard settings icon
  2. Drag panels to arrange them:
    • Top row: Requests Per Minute (full width)
    • Second row: Find Pro6PP Account by IP (full width)
    • Third row: Top IPs table (full width)

Or use a 2-column layout:

  • Left column: Requests Per Minute, Top IPs
  • Right column: Find Pro6PP Account by IP
  3. Resize panels by dragging the bottom-right corner

Recommended layout for incidents:

┌─────────────────────────────────────────┐
│ Requests Per Minute (full width)        │
│ (Shows traffic trends)                  │
└─────────────────────────────────────────┘
┌──────────────────┬──────────────────────┐
│ Top IPs          │ Find Account by IP   │
│ (Identify        │ (Trace to account)   │
│  abusers)        │                      │
└──────────────────┴──────────────────────┘

Step 6: Configure Time Range

Set up appropriate time ranges for incident investigation:

  1. Click the time picker in the top-right corner

  2. Default view: Last 6 hours

  3. Quick ranges to add:

    • Last 15 minutes (active incident)
    • Last 1 hour (recent issue)
    • Last 6 hours (investigation)
    • Last 24 hours (post-mortem)
  4. Refresh rate:

    • During incidents: 10s or 30s
    • Normal monitoring: 1m or 5m

Common Issues and Solutions

Issue 1: “No data” in Top IPs Panel

Symptom: Table shows no results or “No data”

Causes:

  1. Wrong container name
  2. Logs aren’t in the expected nested JSON format
  3. Field name is different (not request_client_ip)

Debug steps:

  1. First, verify logs exist:
{container_name="/coolify-proxy"}

If this shows logs, the container name is correct.

  2. Check log structure:
{container_name="/coolify-proxy"} | json

Look at the raw logs. Is there a nested log object?

  3. If logs are NOT nested, use this simpler query:
topk(10, sum by (request_client_ip) (
  count_over_time(
    {container_name="/coolify-proxy"}
      | json
      | request_client_ip!=""
    [2h]
  )
))

(Remove the line_format step)

  4. If the field name is different (e.g., remote_ip):
topk(10, sum by (remote_ip) (
  count_over_time(
    {container_name="/coolify-proxy"}
      | json
      | line_format "{{.log}}"
      | json
      | remote_ip!=""
    [2h]
  )
))

Issue 2: “Find Account by IP” Shows Raw JSON

Symptom: Logs show full JSON blobs, hard to read

Solution: Enable Prettify JSON in logs panel options:

  1. Edit the panel
  2. Logs options -> Prettify log message: (check)
  3. Enable log details: (check)

This makes the JSON readable and collapsible.

Issue 3: Queries Are Slow

Symptom: Panels take >10 seconds to load

Solutions:

  1. Reduce time range: Change [2h] to [1h]
  2. Add more filters: Be more specific
{container_name="/coolify-proxy", namespace="production"}
  3. Use wider rate windows: Change [1m] to [5m] in rate calculations
  4. Limit results:
topk(5, ...) # Instead of topk(10, ...)

Advanced: Creating Alerts from These Panels

Alert 1: Traffic Spike Detection

Based on the “Requests Per Minute” panel:

  1. Edit the panel -> Alert tab
  2. Create alert rule from this panel
  3. Configure:
    • Name: “Pro6PP Traffic Spike”
    • Condition:
      • WHEN last() of query A
      • IS ABOVE 3000 (adjust based on your normal traffic)
    • Evaluate every: 1m
    • For: 5m (must be true for 5 minutes to fire)
  4. Annotations:
    • Summary: “Traffic spike detected: {{$values.A}} req/min”
    • Description: “Check Top IPs panel for potential abuse”
  5. Notification: Select and set up Alertmanager to send Mattermost alerts
  6. Save
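The interplay of Evaluate every and For can be sketched in Python (the sample values are illustrative): with Evaluate every: 1m and For: 5m, the rule only fires once the condition has held on consecutive evaluations spanning the For window.

```python
threshold = 3000                              # the IS ABOVE value from the rule
evaluations = [3200, 3400, 3100, 3500, 3600]  # five consecutive 1m evaluations

# With "For: 5m" the alert stays Pending until the condition has held for
# every evaluation in the window, then transitions to Firing.
fires = all(v > threshold for v in evaluations)
```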

Alert 2: Abusive IP Detection

Create a new panel specifically for this alert:

  1. Add new panel with query:
max by (request_client_ip) (
  sum by (request_client_ip) (
    rate({container_name="/coolify-proxy"} | json | line_format "{{.log}}" | json [1m])
  ) * 60
)

This shows req/min per IP.
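A toy Python emulation of the per-IP rate (IPs and volumes are made up) shows why a single runaway client trips a 1000 req/min threshold while normal traffic does not:

```python
from collections import Counter

# Hypothetical client IPs seen in a single 1-minute window.
window_ips = ["89.149.208.150"] * 1200 + ["188.166.16.67"] * 40

req_per_min = Counter(window_ips)   # sum by (request_client_ip) ... * 60
worst = max(req_per_min.values())   # the max(...) the alert condition checks

abusive = worst > 1000              # IS ABOVE 1000 condition
```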

  2. Create alert:

    • Condition: WHEN max() IS ABOVE 1000
    • For: 3m
    • Summary: “Abusive IP detected: {{$labels.request_client_ip}} making {{$values.A}} req/min”
  3. Hide this panel from dashboard view (it’s just for alerting)


Using the Dashboard During an Incident

Real-World Scenario: Pro6PP September 30th Outage

Here’s how you would have used this dashboard during the actual outage:

9:30 AM - Alerts fire
  1. Check “Requests Per Minute”:

    • See traffic at normal ~2,500 req/min
    • Notice performance degradation, not traffic spike
    • Conclusion: Not a DDoS, likely server overload
  2. Check “Top 10 IPs”:

    • See 89.149.208.150 at top with 8,500 requests (2 hours)
    • Normal for a heavy user (about 70 req/min average)
    • Conclusion: No single abusive IP
  3. Investigate server resources (in the NodeExporter Dashboard):

    • CPU at 100%
    • Conclusion: Server can’t handle the load
Action taken: Rescale server from CX42 back to CPX41
Result: Service restored at 11:40 AM

Best Practices

  1. Check panels in order:

    • Requests Per Minute (is traffic abnormal?)
    • Top IPs (is someone abusing?)
    • Find Account (who is the user?)
    • Server resources (is server struggling?)
  2. Take screenshots:

    • Panel menu -> Share -> Link -> Copy
    • Include in your RCA
  3. Export data:

    • Click panel title -> Inspect -> Data -> Download CSV
    • Save for post-incident analysis
  4. Document actions:

    • Add comments to dashboard: Dashboard settings -> Version history -> Add note

Troubleshooting Quick Reference

| Problem | Quick Fix |
| --- | --- |
| No data in panels | Verify container name with docker ps |
| Queries timeout | Reduce time range or add more filters |
| Can’t find auth_key | Enable “Prettify” and “Log details” in logs panel |
| Alert not firing | Check Alerting -> Alert rules for errors |
| Dashboard slow | Increase refresh interval to 1m or 5m |

Complete Dashboard JSON

Your final dashboard JSON should include these key elements:

{
  "title": "Pro6PP Incident Dashboard",
  "tags": ["incident-response", "monitoring", "pro6pp"],
  "panels": [
    // Panel 1: Traffic volume
    // Panel 2: Top IPs identification
    // Panel 3: IP-to-account investigation
  ],
  "time": {
    "from": "now-6h",
    "to": "now"
  },
  "refresh": "30s"
}

Summary

You’ve now learned to create a production-ready incident response dashboard with:

  1. Traffic monitoring - Spot spikes and drops
  2. IP tracking - Identify suspicious sources
  3. Account tracing - Connect IPs to Pro6PP users
  4. Alert configuration - Get notified proactively
  5. Incident workflow - Investigate systematically

Key takeaway: The nested JSON structure in your logs requires the line_format "{{.log}}" step to access inner fields. This is the most common stumbling block when creating queries.

Next steps: Practice using this dashboard during normal operations, so you’re comfortable with it when an incident occurs.