-
Notifications
You must be signed in to change notification settings - Fork 304
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Don't exit if parsing of /proc/PID/maps fails #401
base: main
Are you sure you want to change the base?
Conversation
What is assumed failure case here? The
If it's temporary and it doesn't repeat then continuing on parsing failures makes sense. If it repeats, then we'll have an agent that keeps running into the issue and we should probably treat it as a fatal failure and exit. We should do some research and find out if |
The PR description has been amended to make the purpose of this PR more clear. Additionally, I don't think it is worth the time to investigate the time for research whether A different issue - and not in scope for this PR - is Whether it makes sense or not to detect general breaking changes in the maps files and how to deal with it. IMO, it is not worth the time as a different format breaks all consumers of those files and kernel devs try hard to avoid that. |
I'd just make them debug if any. But yes, having code removed - especially code that can abort the agent unexpected - is good. Even if technically that code should be unreachable, its better to remove the code. |
Understanding how the underlying system works is useful here I think. I spent a few minutes looking into https://elixir.bootlin.com/linux/v4.19/source/fs/proc/task_mmu.c. My current (non exhaustive) understanding is that a read lock is taken which means that the contents of
A metric SGTM, esp since if we're running inside OTel collector, logging may not be under our full control. |
@fabled @christos68k Changed to debug logs and added a metric for parsing/format errors. |
@@ -638,6 +638,9 @@ const ( | |||
// Number of times a trace event read failed (trace_events) | |||
IDTraceEventReadError = 274 | |||
|
|||
// Number of format errors seen during parsing /proc/<PID>/maps | |||
IDErrProcFormat = 275 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IDErrProcFormat = 275 | |
IDErrProcParse = 275 |
@@ -638,6 +638,9 @@ const ( | |||
// Number of times a trace event read failed (trace_events) | |||
IDTraceEventReadError = 274 | |||
|
|||
// Number of format errors seen during parsing /proc/<PID>/maps |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Number of format errors seen during parsing /proc/<PID>/maps | |
// Number of parsing errors seen during processing /proc/<PID>/maps |
"description": "Number of format errors seen during parsing /proc/<PID>/maps", | ||
"type": "counter", | ||
"name": "ErrProcFormat", | ||
"field": "agent.errors.proc_format", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"description": "Number of format errors seen during parsing /proc/<PID>/maps", | |
"type": "counter", | |
"name": "ErrProcFormat", | |
"field": "agent.errors.proc_format", | |
"description": "Number of parsing errors seen during processing /proc/<PID>/maps", | |
"type": "counter", | |
"name": "ErrProcParse", | |
"field": "agent.errors.proc_parse", |
@@ -79,12 +80,13 @@ func trimMappingPath(path string) string { | |||
return path | |||
} | |||
|
|||
func parseMappings(mapsFile io.Reader) ([]Mapping, error) { | |||
func parseMappings(mapsFile io.Reader) ([]Mapping, uint32, error) { | |||
numFormatErrors := uint32(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
numFormatErrors := uint32(0) | |
numParseErrors := uint32(0) |
"strings" | ||
"sync" | ||
|
||
"github.com/sirupsen/logrus" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For consistency with the rest of the code
"github.com/sirupsen/logrus" | |
log "github.com/sirupsen/logrus" |
inode := util.DecToUint64(fields[4]) | ||
inode, err := strconv.ParseUint(fields[4], 10, 64) | ||
if err != nil { | ||
logrus.Debugf("inode: failed to convert %s to uint64: %v", fields[4], err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logrus.Debugf("inode: failed to convert %s to uint64: %v", fields[4], err) | |
log.Debugf("inode: failed to convert %s to uint64: %v", fields[4], err) |
} | ||
minor, err := strconv.ParseUint(devs[1], 16, 64) | ||
if err != nil { | ||
logrus.Debugf("minor device: failed to convert %s to uint64: %v", devs[0], err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logrus.Debugf("minor device: failed to convert %s to uint64: %v", devs[0], err) | |
logrus.Debugf("minor device: failed to convert %s to uint64: %v", devs[1], err) |
@@ -174,6 +174,8 @@ func collectInterpreterMetrics(ctx context.Context, pm *ProcessManager, | |||
metrics.MetricValue(pm.mappingStats.maxProcParseUsec.Swap(0)) | |||
summary[metrics.IDTotalProcParseUsec] = | |||
metrics.MetricValue(pm.mappingStats.totalProcParseUsec.Swap(0)) | |||
summary[metrics.IDErrProcFormat] = | |||
metrics.MetricValue(pm.mappingStats.numProcFormatErrors.Swap(0)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
metrics.MetricValue(pm.mappingStats.numProcFormatErrors.Swap(0)) | |
metrics.MetricValue(pm.mappingStats.numProcParseErrors.Swap(0)) |
numProcAttempts atomic.Uint32 | ||
maxProcParseUsec atomic.Uint32 | ||
totalProcParseUsec atomic.Uint32 | ||
numProcFormatErrors atomic.Uint32 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
numProcFormatErrors atomic.Uint32 | |
numProcParseErrors atomic.Uint32 |
elapsed := time.Since(start) | ||
pm.mappingStats.numProcFormatErrors.Add(numFormatErrors) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pm.mappingStats.numProcFormatErrors.Add(numFormatErrors) | |
pm.mappingStats.numProcParseErrors.Add(numFormatErrors) |
Fixes #366
I believe this is a remnant from old days.
The agent should not exit if there is a parsing error. Instead, log an error and continue with the next line of the maps file.
Scope
exit()
util.HexToUint64()
andutil.DecToUint64()
Not in scope