Troubleshooting Coolify Deployments
How to Build a Grafana Incident Response Dashboard
Overview
This guide walks you through creating a Grafana dashboard for troubleshooting server outages and performance issues. You’ll learn how to build panels that help you quickly identify problems, trace them back to specific users, and take action during incidents. The Pro6PP project is used throughout as a reference example.
Prerequisites:
- Access to Grafana instance
- Loki data source configured
- Logs from Coolify proxy available in Loki
- Logs from application available in Loki
Note: See the setup guide for information on how to set up the monitoring stack.
Understanding Log Structure
Our Coolify proxy logs are structured in JSON format with nested objects. Here’s what a typical log entry looks like:
```json
{
  "level": "info",
  "ts": 1696077600.123,
  "logger": "http.log.access",
  "msg": "handled request",
  "log": {
    "request_client_ip": "89.149.208.150",
    "request_path": "/api/v1/lookup",
    "request_method": "GET",
    "status": 200,
    "duration": 0.045,
    "auth_key": "abc123xyz789"
  }
}
```
Note: This is a specific Caddy proxy log example.
Important: Notice the nested structure. The actual request data is inside a `log` object. This is crucial for building queries.
To check the exact log structure for a specific container, run the following Loki query:

```logql
{container_name="/coolify-proxy"} | json
```

This helps when creating queries based on the log structure.
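If the fields you need live inside the nested `log` object, you can also chain a second parse to confirm they are extractable. This is just a quick verification sketch reusing the structure shown above; the panel queries later in this guide rely on the same pattern:

```logql
{container_name="/coolify-proxy"} | json | line_format "{{.log}}" | json
```

After the second `json` stage, inner fields such as `request_client_ip` and `auth_key` should appear as parsed fields in Explore.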
Step 1: Create a New Dashboard
- Navigate to Dashboards -> New Dashboard
- Click Add visualization
- Select your Loki data source
- Before adding panels, configure dashboard settings:
- Click the gear icon at the top
- General tab:
- Name: “Pro6PP Incident Dashboard”
- Description: “Troubleshooting Pro6PP production deployment if there is an outage or server gets overloaded”
- Tags: Add `incident-response`, `monitoring`, `<project-name>`
- Time options tab:
- Timezone: Browser Time
- Auto refresh: 30s (optional, useful during active incidents)
- Click Save dashboard
Step 2: Panel 1 - Requests Per Minute
This panel shows your traffic volume over time, helping you spot traffic spikes or drops.
Panel 1 Configuration
- Click Add -> Visualization
- Select Loki data source
- Panel type (on the right side): Time series
- Query configuration:
- Switch to Code mode (toggle in query builder)
- Enter the following query:
```logql
sum(rate({container_name="/fwgss4sw8ck4kw4owk084ggk-142613463131"} [1m])) * 60
```
Note: This container name is the pro6pp-api container, shown here as a real-world example.
Understanding the query:
- `{container_name="/fwgss4sw8ck4kw4owk084ggk-142613463131"}` - Filters logs from your API service container
- `rate([1m])` - Calculates the per-second rate over 1-minute windows
- `sum()` - Aggregates all streams together
- `* 60` - Converts the per-second rate to per-minute
- Panel settings:
  - Title: “Requests Per Minute (API service)”
  - Panel options -> Legend: Hidden (cleaner view for a single metric)
  - Tooltip mode: Single (shows one value at a time)
- Visual customization:
  - Graph styles -> Line width: 1
  - Fill opacity: 0 (no fill under the line)
  - Line interpolation: Linear
- Click Apply to save the panel
Finding Your Container Name
If you don’t know your container name:
```bash
# SSH into your server
ssh root@your-server

# Or use the terminal in Coolify for your server

# List running containers
docker ps --format "table {{.Names}}\t{{.Image}}"
```
Step 3: Panel 2 - Top 10 IPs Accessing Proxy
This is your most critical panel for identifying abusive users or traffic sources.
Panel 2 Configuration
- Click Add -> Visualization
- Select Loki data source
- Panel type: Table
- Query configuration:
- Switch to Code mode
- Enter this query:
```logql
topk(10,
  sum by (request_client_ip) (
    count_over_time(
      {container_name="/coolify-proxy"} | json | line_format "{{.log}}" | json | request_client_ip != "" [2h]
    )
  )
)
```
Understanding this complex query (step by step):
- `{container_name="/coolify-proxy"}` - Get logs from the Coolify proxy
- `| json` - Parse the outer JSON structure
- `| line_format "{{.log}}"` - Critical step: extract the nested `log` object and make it the new log line
- `| json` - Parse the inner JSON (now we can access `request_client_ip`)
- `| request_client_ip != ""` - Filter out empty IPs
- `[2h]` - Look back 2 hours
- `count_over_time()` - Count log lines in the time range
- `sum by (request_client_ip)` - Group counts by IP address
- `topk(10, ...)` - Return only the top 10 IPs
Add transformation (this converts the time series to a table):
- Click Transform tab
- Click Add transformation
- Select Reduce
- Mode: Series to rows
- Calculations: Sum
- Labels: Labels to fields (yes)
- Include time field: (no)
Panel settings:
- Title: “Top IPs accessing Caddy Proxy last 2h”
Table options:
- Show header: check
- Cell height: Small (fits more rows)
- Sort by: Total (descending) - this is set automatically
Click Apply to save the panel
Interpreting the Results
The table will show:
| request_client_ip | Total |
|---|---|
| 89.149.208.150 | 15,234 |
| 188.166.16.67 | 8,456 |
| … | … |
Important 1: Check whether any single IP is overloading the server. In the context of Pro6PP, take the IP, use it in the next panel’s query, and find the account it belongs to. If that is not possible in your project and the IP shows irregular activity compared to its usual volume, it can be blocked at the proxy level until the server is back to normal (a quick per-IP count query is sketched after these notes).
Important 2: For Pro6PP, make sure not to block our own proxy.pro6pp.nl IP.
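Before taking action, it helps to quantify how much traffic a single IP is actually producing. The query below is a minimal sketch, reusing the example IP from the table above and the same nested-JSON parsing as the Top IPs panel, that counts that IP’s requests over the last hour so you can compare it against its usual volume:

```logql
sum(count_over_time({container_name="/coolify-proxy"} | json | line_format "{{.log}}" | json | request_client_ip = "89.149.208.150" [1h]))
```

Dividing the result by 60 gives an approximate req/min figure for that single IP.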
Step 4: Panel 3 - Find Account by IP
This panel helps you trace suspicious IPs back to specific accounts by searching for their auth_key in the logs.
Important: If an IP cannot be related to an account in your project, this panel can be skipped.
Panel 3 Configuration
- Click Add -> Visualization
- Select Loki data source
- Panel type: Logs
- Query configuration:
- Switch to Builder mode
- Label filters: `container_name` = `/coolify-proxy`
- Line contains: `89.149.208.150` (example IP)

The builder mode creates this query:

```logql
{container_name="/coolify-proxy"} |= `89.149.208.150`
```
- Panel settings:
  - Title: “Find Pro6PP Account by IP”
  - Description: “Search the logs for lines containing the IP we want to filter on, and find the account `auth_key` that performs the requests”
- Logs options:
  - Enable log details: Yes (allows expanding log lines)
  - Show time: No (cleaner view)
  - Show labels: No (reduces clutter)
  - Wrap log message: No (prevents line wrapping)
  - Deduplication: None
  - Order: Descending (newest first)
- Click Apply to save the panel (a query-side variant for isolating the auth_key is sketched just below)
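If you prefer to narrow the results from the query side instead of scrolling through every line, the following variant is only a sketch, assuming the nested `log` structure and the `auth_key` field from the log example at the top of this guide; it keeps only lines where an auth_key is present:

```logql
{container_name="/coolify-proxy"} |= `89.149.208.150` | json | line_format "{{.log}}" | json | auth_key != ""
```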
How to Use This Panel During an Incident
- Get a suspicious IP from the “Top IPs” panel (created in the previous step)
- Click Edit on this panel
- Replace `89.149.208.150` with the suspicious IP
- Click Run query
- Look through the logs for the `auth_key` field
- Use that auth_key to identify which Pro6PP account is responsible
Example log line you’ll see:
```
{
  "log": {
    "request_client_ip": "89.149.208.150",
    "auth_key": "sk_live_abc123xyz789",
    ...
  }
}
```
Step 5: Arrange Your Dashboard Layout
Now organize your panels for optimal incident response:
- Click the dashboard settings icon
- Drag panels to arrange them:
- Top row: Requests Per Minute (full width)
- Second row: Find Pro6PP Account by IP (full width)
- Third row: Top IPs table (full width)
Or use a 2-column layout:
- Left column: Requests Per Minute, Top IPs
- Right column: Find Pro6PP Account by IP
- Resize panels by dragging the bottom-right corner
Recommended layout for incidents:
```
┌─────────────────────────────────────────┐
│ Requests Per Minute (full width)        │
│ (Shows traffic trends)                  │
└─────────────────────────────────────────┘
┌──────────────────┬──────────────────────┐
│ Top IPs          │ Find Account by IP   │
│ (Identify        │ (Trace to account)   │
│  abusers)        │                      │
└──────────────────┴──────────────────────┘
```
Step 6: Configure Time Range
Set up appropriate time ranges for incident investigation:
- Click the time picker in the top-right corner
- Default view: Last 6 hours
- Quick ranges to add:
  - Last 15 minutes (active incident)
  - Last 1 hour (recent issue)
  - Last 6 hours (investigation)
  - Last 24 hours (post-mortem)
- Refresh rate:
  - During incidents: 10s or 30s
  - Normal monitoring: 1m or 5m
Common Issues and Solutions
Issue 1: “No data” in Top IPs Panel
Symptom: Table shows no results or “No data”
Causes:
- Wrong container name
- Logs aren’t in the expected nested JSON format
- Field name is different (not `request_client_ip`)
Debug steps:
- First, verify logs exist:
```logql
{container_name="/coolify-proxy"}
```
If this shows logs, the container name is correct.
- Check log structure:
```logql
{container_name="/coolify-proxy"} | json
```
Look at the raw logs. Is there a nested `log` object?
- If logs are NOT nested, use this simpler query:
```logql
topk(10,
  sum by (request_client_ip) (
    count_over_time(
      {container_name="/coolify-proxy"} | json | request_client_ip != "" [2h]
    )
  )
)
```
(Remove the line_format step)
- If the field name is different (e.g., `remote_ip`):
```logql
topk(10,
  sum by (remote_ip) (
    count_over_time(
      {container_name="/coolify-proxy"} | json | line_format "{{.log}}" | json | remote_ip != "" [2h]
    )
  )
)
```
Issue 2: “Find Account by IP” Shows Raw JSON
Symptom: Logs show full JSON blobs, hard to read
Solution: Enable Prettify JSON in logs panel options:
- Edit the panel
- Logs options -> Prettify log message: (check)
- Enable log details: (check)
This makes the JSON readable and collapsible.
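An alternative to Prettify is to reshape each log line on the query side with a final `line_format`. This is only a sketch, assuming the field names from the log example at the top of this guide:

```logql
{container_name="/coolify-proxy"} |= `89.149.208.150` | json | line_format "{{.log}}" | json | line_format "ip={{.request_client_ip}} key={{.auth_key}} {{.request_method}} {{.request_path}} -> {{.status}}"
```

Each matching line then renders as a compact one-line summary instead of a JSON blob.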
Issue 3: Queries Are Slow
Symptom: Panels take >10 seconds to load
Solutions:
- Reduce time range: Change `[2h]` to `[1h]`
- Add more filters: Be more specific, e.g. `{container_name="/coolify-proxy", namespace="production"}`
- Use larger intervals: Change `[1m]` to `[5m]` in rate calculations
- Limit results: `topk(5, ...)` instead of `topk(10, ...)`

Advanced: Creating Alerts from These Panels
Alert 1: Traffic Spike Detection
Based on the “Requests Per Minute” panel:
- Edit the panel -> Alert tab
- Create alert rule from this panel
- Configure:
- Name: “Pro6PP Traffic Spike”
- Condition:
  - WHEN `last()` of query A
  - IS ABOVE `3000` (adjust based on your normal traffic; see the baseline sketch after this list)
- Evaluate every: 1m
- For: 5m (must be true for 5 minutes to fire)
- Annotations:
- Summary: “Traffic spike detected: {{$values.A}} req/min”
- Description: “Check Top IPs panel for potential abuse”
- Notification: Select and set up Alertmanager to send Mattermost alerts
- Save
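To pick the threshold from data instead of guessing, you can first check your average traffic over a longer window. The following is a rough sketch, reusing the Panel 1 container selector:

```logql
sum(rate({container_name="/fwgss4sw8ck4kw4owk084ggk-142613463131"} [24h])) * 60
```

This returns the average requests per minute over the last 24 hours; setting the alert threshold a comfortable margin above it (the guide uses 3000 as an example) helps avoid false positives.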
Alert 2: Abusive IP Detection
Create a new panel specifically for this alert:
- Add new panel with query:
```logql
max by (request_client_ip) (
  sum by (request_client_ip) (
    rate({container_name="/coolify-proxy"} | json | line_format "{{.log}}" | json [1m])
  ) * 60
)
```
This shows req/min per IP.
- Create alert:
  - Condition: WHEN `max()` IS ABOVE `1000`
  - For: 3m
  - Summary: “Abusive IP detected: {{$labels.request_client_ip}} making {{$values.A}} req/min”
- Hide this panel from dashboard view (it’s just for alerting)
Using the Dashboard During an Incident
Real-World Scenario: Pro6PP September 30th Outage
Here’s how you would have used this dashboard during the actual outage:
9:30 AM - Alerts fire
- Check “Requests Per Minute”:
  - See traffic at a normal ~2,500 req/min
  - Notice performance degradation, not a traffic spike
  - Conclusion: Not a DDoS, likely server overload
- Check “Top 10 IPs”:
  - See `89.149.208.150` at the top with 8,500 requests (2 hours)
  - Normal for a heavy user (about 70 req/min average)
  - Conclusion: No single abusive IP
- Investigate server resources (in the NodeExporter Dashboard):
  - CPU at 100%
  - Conclusion: Server can’t handle the load
Action taken: Rescale server from CX42 back to CPX41
Result: Service restored at 11:40 AM
Best Practices
- Check panels in order:
  - Requests Per Minute (is traffic abnormal?)
  - Top IPs (is someone abusing?)
  - Find Account (who is the user?)
  - Server resources (is the server struggling?)
- Take screenshots:
  - Panel menu -> Share -> Link -> Copy
  - Include in your RCA
- Export data:
  - Click panel title -> Inspect -> Data -> Download CSV
  - Save for post-incident analysis
- Document actions:
  - Add comments to the dashboard: Dashboard settings -> Version history -> Add note
Troubleshooting Quick Reference
| Problem | Quick Fix |
|---|---|
| No data in panels | Verify container name with docker ps |
| Queries timeout | Reduce time range or add more filters |
| Can’t find auth_key | Enable “Prettify” and “Log details” in logs panel |
| Alert not firing | Check Alerting -> Alert rules for errors |
| Dashboard slow | Increase refresh interval to 1m or 5m |
Complete Dashboard JSON
Your final dashboard JSON should look like the one provided, with these key elements:
```
{
  "title": "Pro6PP Incident Dashboard",
  "tags": ["incident-response", "monitoring", "pro6pp"],
  "panels": [
    // Panel 1: Traffic volume
    // Panel 2: Top IPs identification
    // Panel 3: IP-to-account investigation
  ],
  "time": { "from": "now-6h", "to": "now" },
  "refresh": "30s"
}
```
Additional Resources
- Grafana Loki Docs: https://grafana.com/docs/loki/latest/
- LogQL Syntax: https://grafana.com/docs/loki/latest/logql/
- Panel Transformations: https://grafana.com/docs/grafana/latest/panels-visualizations/query-transform-data/
- Alerting Guide: https://grafana.com/docs/grafana/latest/alerting/
Summary
You’ve now learned to create a production-ready incident response dashboard with:
- Traffic monitoring - Spot spikes and drops
- IP tracking - Identify suspicious sources
- Account tracing - Connect IPs to Pro6PP users
- Alert configuration - Get notified proactively
- Incident workflow - Investigate systematically
Key takeaway: The nested JSON structure in your logs requires the `line_format "{{.log}}"` step to access inner fields. This is the most common stumbling block when creating queries.
Next steps: Practice using this dashboard during normal operations, so you’re comfortable with it when an incident occurs.