Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't exit if parsing of /proc/PID/maps fails #401

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

rockdaboot
Copy link
Contributor

@rockdaboot rockdaboot commented Mar 14, 2025

Fixes #366

I believe this is a remnant from old days.
The agent should not exit if there is a parsing error. Instead, log an error and continue with the next line of the maps file.

Scope

  • treat parsing errors gracefully
  • be more consistent with other parsing code and do not exit()
  • remove unnecessarily exported functions util.HexToUint64() and util.DecToUint64()
  • the formerly exported functions did not document the exiting behavior

Not in scope

  • dealing with breaking changes of the maps format (assumption is that this won't happen as it will break many different workflows / tools)

@rockdaboot rockdaboot self-assigned this Mar 14, 2025
@rockdaboot rockdaboot requested review from a team as code owners March 14, 2025 16:43
@rockdaboot rockdaboot enabled auto-merge (squash) March 14, 2025 16:52
@christos68k
Copy link
Member

christos68k commented Mar 14, 2025

What is assumed failure case here? The /proc/PID/maps format is specified, so if the agent runs into parsing errors we have to consider the root-cause:

  • Is it temporary or will it repeat?
  • Is it something that we expect to see once in a while or something exceptional?

If it's temporary and it doesn't repeat then continuing on parsing failures makes sense. If it repeats, then we'll have an agent that keeps running into the issue and we should probably treat it as a fatal failure and exit.

We should do some research and find out if /proc/PID/maps is guaranteed to be consistent on every read. If that's the case then parsing failures to me seem like exceptional fatal failures and the fail-safe approach is to exit (and if that's what we decide to do, we need a better approach than Fatalf to avoid taking down the OTel collector).

@rockdaboot
Copy link
Contributor Author

What is assumed failure case here? The /proc/PID/maps format is specified, so if the agent runs into parsing errors we have to consider the root-cause:

* Is it temporary or will it repeat?

* Is it something that we expect to see once in a while or something exceptional?

If it's temporary and it doesn't repeat then continuing on parsing failures makes sense. If it repeats, then we'll have an agent that keeps running into the issue and we should probably treat it as a fatal failure and exit.

We should do some research and find out if /proc/PID/maps is guaranteed to be consistent on every read. If that's the case then parsing failures to me seem like exceptional fatal failures and the fail-safe approach is to exit (and if that's what we decide to do, we need a better approach than Fatalf to avoid taking down the OTel collector).

The PR description has been amended to make the purpose of this PR more clear.

Additionally, I don't think it is worth the time to investigate the time for research whether proc/PID/maps is "guaranteed to be consistent". Because even if you could "prove" that a certain kernel version has mathematically safe code... code is not set in stone and there could be future changes with subtle regressions or bugs.

A different issue - and not in scope for this PR - is Whether it makes sense or not to detect general breaking changes in the maps files and how to deal with it. IMO, it is not worth the time as a different format breaks all consumers of those files and kernel devs try hard to avoid that.
But even if it happens, and the agent isn't able to read any mapping, the agent still reports this via logging. We currently do not have a metric for this error case, if you look at SynchronizeProcess(). We could drop the logs and instead introduce a metric, to avoid spamming of logs.

@fabled
Copy link
Contributor

fabled commented Mar 17, 2025

The agent should not exit if there is a parsing error. Instead, log an error and continue with the next line of the maps file.

I'd just make them debug if any.

But yes, having code removed - especially code that can abort the agent unexpected - is good. Even if technically that code should be unreachable, its better to remove the code.

@christos68k
Copy link
Member

Additionally, I don't think it is worth the time to investigate the time for research whether proc/PID/maps is "guaranteed to be consistent". Because even if you could "prove" that a certain kernel version has mathematically safe code... code is not set in stone and there could be future changes with subtle regressions or bugs.

Understanding how the underlying system works is useful here I think. I spent a few minutes looking into https://elixir.bootlin.com/linux/v4.19/source/fs/proc/task_mmu.c. My current (non exhaustive) understanding is that a read lock is taken which means that the contents of /proc/PID/maps are guaranteed to be consistent within a single read system call. This is also mentioned here. This also means that the agent can see inconsistencies/races across multiple calls.

We currently do not have a metric for this error case, if you look at SynchronizeProcess(). We could drop the logs and instead introduce a metric, to avoid spamming of logs.

A metric SGTM, esp since if we're running inside OTel collector, logging may not be under our full control.

@rockdaboot
Copy link
Contributor Author

@fabled @christos68k Changed to debug logs and added a metric for parsing/format errors.

@@ -638,6 +638,9 @@ const (
// Number of times a trace event read failed (trace_events)
IDTraceEventReadError = 274

// Number of format errors seen during parsing /proc/<PID>/maps
IDErrProcFormat = 275
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
IDErrProcFormat = 275
IDErrProcParse = 275

@@ -638,6 +638,9 @@ const (
// Number of times a trace event read failed (trace_events)
IDTraceEventReadError = 274

// Number of format errors seen during parsing /proc/<PID>/maps
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Number of format errors seen during parsing /proc/<PID>/maps
// Number of parsing errors seen during processing /proc/<PID>/maps

Comment on lines +1983 to +1986
"description": "Number of format errors seen during parsing /proc/<PID>/maps",
"type": "counter",
"name": "ErrProcFormat",
"field": "agent.errors.proc_format",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"description": "Number of format errors seen during parsing /proc/<PID>/maps",
"type": "counter",
"name": "ErrProcFormat",
"field": "agent.errors.proc_format",
"description": "Number of parsing errors seen during processing /proc/<PID>/maps",
"type": "counter",
"name": "ErrProcParse",
"field": "agent.errors.proc_parse",

@@ -79,12 +80,13 @@ func trimMappingPath(path string) string {
return path
}

func parseMappings(mapsFile io.Reader) ([]Mapping, error) {
func parseMappings(mapsFile io.Reader) ([]Mapping, uint32, error) {
numFormatErrors := uint32(0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
numFormatErrors := uint32(0)
numParseErrors := uint32(0)

"strings"
"sync"

"github.com/sirupsen/logrus"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency with the rest of the code

Suggested change
"github.com/sirupsen/logrus"
log "github.com/sirupsen/logrus"

inode := util.DecToUint64(fields[4])
inode, err := strconv.ParseUint(fields[4], 10, 64)
if err != nil {
logrus.Debugf("inode: failed to convert %s to uint64: %v", fields[4], err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logrus.Debugf("inode: failed to convert %s to uint64: %v", fields[4], err)
log.Debugf("inode: failed to convert %s to uint64: %v", fields[4], err)

}
minor, err := strconv.ParseUint(devs[1], 16, 64)
if err != nil {
logrus.Debugf("minor device: failed to convert %s to uint64: %v", devs[0], err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logrus.Debugf("minor device: failed to convert %s to uint64: %v", devs[0], err)
logrus.Debugf("minor device: failed to convert %s to uint64: %v", devs[1], err)

@@ -174,6 +174,8 @@ func collectInterpreterMetrics(ctx context.Context, pm *ProcessManager,
metrics.MetricValue(pm.mappingStats.maxProcParseUsec.Swap(0))
summary[metrics.IDTotalProcParseUsec] =
metrics.MetricValue(pm.mappingStats.totalProcParseUsec.Swap(0))
summary[metrics.IDErrProcFormat] =
metrics.MetricValue(pm.mappingStats.numProcFormatErrors.Swap(0))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
metrics.MetricValue(pm.mappingStats.numProcFormatErrors.Swap(0))
metrics.MetricValue(pm.mappingStats.numProcParseErrors.Swap(0))

numProcAttempts atomic.Uint32
maxProcParseUsec atomic.Uint32
totalProcParseUsec atomic.Uint32
numProcFormatErrors atomic.Uint32
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
numProcFormatErrors atomic.Uint32
numProcParseErrors atomic.Uint32

elapsed := time.Since(start)
pm.mappingStats.numProcFormatErrors.Add(numFormatErrors)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pm.mappingStats.numProcFormatErrors.Add(numFormatErrors)
pm.mappingStats.numProcParseErrors.Add(numFormatErrors)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Don't exit if parsing of /proc/PID/maps fails
3 participants