healthcheck feature #599

samifruit514 · 2025-09-13T21:28:11Z

context:
We are running a multi-container application with docker-compose, but through podman (podman system service --log-level debug unix:///tmp/podman.sock). The apps definitions inside the docker-compose.yml file contains a bunch of health checks and dependencies. Since we run that in a CI, WITHOUT systemd, there is no healthchecks (no unit are created for healthchecks). Because of that, the multi-container app doesnt run.

According to podman people, conmon should handle health checks, or at least, conmon would be a great candidate to do the healthchecks.

This PR accepts --enable-healthcheck in conmon args, enabling the healthchecks from conmon to podman, through the unix pipe (the same one that sends the PID to link).

For more info on healthcheck handling by podman: containers/podman#27033

mheon · 2025-09-13T21:30:05Z

I think this won't work for the reasons #598 doesn't - no support for swapping timers to transition startup healthchecks to regular healthchecks.

packit-as-a-service · 2025-09-13T21:30:26Z

Ephemeral COPR build failed. @containers/packit-build please check.

samifruit514 · 2025-09-13T22:38:33Z

Thanks for taking a look mheon! its really appreciated. I've made the changes to handle healthchecks in starting phase. Let me know your feedback! Thanks

samifruit514 · 2025-09-13T23:37:42Z

I've made a PR on podman side: It would read from the conmon's pipe: containers/podman#27067
Sorry if its too noisy. Set that PR to draft for now. Its to give a global overview on how it would work

src/healthcheck.c

mheon · 2025-09-13T23:46:37Z

I'm a bit hesitant about this one because we are strongly contemplating a Conmon rewrite to be released alongside Podman 6 in May - containers/podman#27053 - so additional work on the existing, C conmon seems a bit wasted in light of that. But if this can be done for our November release of 5.7 and that's of use to you, I'm not entirely opposed?

jnovy · 2025-09-15T14:26:07Z

src/healthcheck.c

-        }
+    if (!cJSON_IsArray(test_array) || cJSON_GetArraySize(test_array) < 2) {
+        nwarn("Healthcheck configuration missing required 'test' command");
+        cJSON_Delete(json);


healthcheck_config_free(config); is missing here?

config is allocated on stack in conmon.c and later freed with healthcheck_config_free, because of some inner attributes allocated in heap. The free function was not at the right place though and changed it.

jnovy · 2025-09-15T14:26:51Z

src/healthcheck.c

+
+    if (!cJSON_IsString(cmd_type) || !cJSON_IsString(cmd_value)) {
+        nwarn("Healthcheck test command must be an array of strings");
+        cJSON_Delete(json);


healthcheck_config_free(config); is missing here too?

It shouldnt free there AFAIU, same reason from above

jnovy · 2025-09-15T14:32:53Z

src/healthcheck.c


-    /* Parse Interval (now in seconds) */
+    if (strcmp(cmd_type->valuestring, "CMD") == 0 || strcmp(cmd_type->valuestring, "CMD-SHELL") == 0) {
+        /* Create test command array */


Please validate the command string here, e.g. strlen(cmd_value->valuestring) == 0 || strlen(cmd_value->valuestring) > 4096 ?

jnovy · 2025-09-15T14:33:16Z

src/healthcheck.c

+    if (strcmp(cmd_type->valuestring, "CMD") == 0 || strcmp(cmd_type->valuestring, "CMD-SHELL") == 0) {
+        /* Create test command array */
+        config->test = calloc(2, sizeof(char*));
+        config->test[0] = strdup(cmd_value->valuestring);


Test calloc and strdup for failure here.

jnovy · 2025-09-15T14:33:41Z

src/healthcheck.c

+        config->test[1] = NULL;
+    } else {
+        nwarnf("Unsupported healthcheck command type: %s (only CMD and CMD-SHELL supported)", cmd_type->valuestring);
+        cJSON_Delete(json);


missing healthcheck_config_free(config);

same reason from above

jnovy · 2025-09-15T14:35:02Z

src/healthcheck.c

    cJSON *interval = cJSON_GetObjectItem(json, "interval");
-    if (cJSON_IsNumber(interval)) {
-        config->interval = (int)interval->valuedouble;
+    if (!cJSON_IsNumber(interval) || interval->valuedouble <= 0) {


Maybe add here also check for the max value here?

jnovy · 2025-09-15T14:35:33Z

src/healthcheck.c

    cJSON *timeout = cJSON_GetObjectItem(json, "timeout");
-    if (cJSON_IsNumber(timeout)) {
-        config->timeout = (int)timeout->valuedouble;
+    if (!cJSON_IsNumber(timeout) || timeout->valuedouble <= 0) {


Also max value check here please + missing healthcheck_config_free(config); again

jnovy · 2025-09-15T14:35:43Z

src/healthcheck.c

    cJSON *start_period = cJSON_GetObjectItem(json, "start_period");
-    if (cJSON_IsNumber(start_period)) {
-        config->start_period = (int)start_period->valuedouble;
+    if (!cJSON_IsNumber(start_period) || start_period->valuedouble < 0) {


Also max value check here please.

jnovy · 2025-09-15T14:35:54Z

src/healthcheck.c

    cJSON *retries = cJSON_GetObjectItem(json, "retries");
-    if (cJSON_IsNumber(retries)) {
-        config->retries = (int)retries->valuedouble;
+    if (!cJSON_IsNumber(retries) || retries->valuedouble < 0) {


Also max value check here please.

hey @jnovy thanks for the review! im wondering though if we should go forward with this PR, given that comment from mheon: #599 (comment)

im a bit green on contributing to such project. what are the changes that feature gets merged if all is good?
thanks

@samifruit514 It will get merged if it's ok - it's almost there. Please also squash all the commits into one. Thanks!

done! made the changes and squashed. let me know what you think

jnovy · 2025-09-16T09:06:47Z

src/healthcheck.c

+    /* need to execute the command inside the container using the container runtime API */
+
+    /* Simple healthcheck: just return success for now */
+    /* TODO: Implement proper container command execution */


If you are using Claude to generate this code, please always check it does what you want - it tends to oversimplify up to a state where it defeats purpose - which is the case here. Always invoke conmon with healthchecks enabled to verify what it does according to your use case - what you need to achieve by this PR.

Claude usually picks cJSON which is not widely adopted. Please tell it to rewrite this using json-c instead. Also Claude tends to generate loads of trailing spaces which will fail in our CI - please use make fmt to fix this before force pushing the update.

Also, you'd need to install json-c-devel in .cirrus otherwise CI will break.

There is nothing wrong with using AI - just please don't be negligent in thorough review and functionality testing.

wow. This is crazy. I can't believe I didnt catch that.
for json, cursor was doing a custom parsing without any JSON lib, so I asked to use cJSON because it was the only one I knew 😆

Thanks for all the tips. I will do the changes. Once I go back into the related podmon PR, I will make sure I didnt mess up with AI

samifruit514 · 2025-09-18T13:14:34Z

integration tests failed for cri-o. Would it be possible that its some race condition? I don't see how this is related to the changes in this PR 🤔
AI suggests to put some sleep in the bats tests on cri-o side

samifruit514 · 2025-09-19T21:56:41Z

I was able to run the "crio restore with missing config.json" test locally:

Running the test...
restore_test_cgroupfs.bats
 ✓ crio restore with missing config.json

1 test, 0 failures

✓ Test passed successfully!
Cleaning up...
Test completed successfully!

its unclear what happened but I have the feeling that running it again could pass? Do we experience flaky in CI sometimes?

@jnovy Sorry to bother you again, could you run the integration test if you get a chance?

jnovy · 2025-09-22T09:55:35Z

You are right @samifruit514 - integration failures don't seem related to the PR.

Now let's simplify the code as it is super-bloated in the current state - we can remove about two thirds of the code still maintaining the functionality and also drop the JSON parsing functionality completely:

remove the hash-table - there is only one instance of conmon for one container - no hashtable required.

allow podman to invoke heathcheck via CLI parameters instead of JSON parsing, e.g.

--healthcheck-cmd "curl -f http://localhost:8080/health"
--healthcheck-interval 30
--healthcheck-timeout 5
--healthcheck-retries 3
--healthcheck-start-period 60

Use GLib timer instead of pthreads:

 // Use GLib timeout source instead of pthread
 healthcheck_timer_id = g_timeout_add_seconds(healthcheck_interval,
                                             healthcheck_timer_callback, NULL);

Use static string constants:

static const char *status_strings[] = {
    "none", "starting", "healthy", "unhealthy"
};

Then error handling and memory management can be significantly simplified. Also, there is no test coverage but let's cover it once the above is done - and code simplified. Does it make sense @samifruit514 ?

samifruit514 · 2025-09-22T11:36:37Z

Awsome. makes sense! I will do the changes

jnovy · 2025-09-23T07:11:57Z

Thanks, there are merge conflicts in src/cli.[ch] - these need to be addressed first. Also I see the hash_table functions were not removed yet?

samifruit514 · 2025-09-23T12:50:34Z

Indeed, conflicts 😅. I still need to get more comfortable with PRs against upstreams. They are now resolved, thanks

As for the hash_table function, I cleaned it up—it was just leftover code from when I mistakenly thought conmon handled multiple containers 🤦.

I also reviewed the rest of the code with my best effort (and my somewhat limited C knowledge, plus a little AI help 😉) to make sure everything is as clean as possible.

samifruit514 · 2025-09-26T18:19:15Z

Sorry to bother you again @jnovy , do you have the time to review the changes?

jnovy

The BATS tests primarily cover CLI parsing but miss:

Timeout enforcement testing
Start period transition logic
Error handling scenarios
Resource cleanup verification

jnovy · 2025-09-29T10:55:11Z

src/healthcheck.c

+
+	/* Create simple JSON message for status update */
+	char json_msg[1024];
+	snprintf(json_msg, sizeof(json_msg),


I'd remove this JSON part and use just minimalistic conmon protocol - JSON is over-complication here and poses injection risks.

Something like:

/* Include exit code in message for debugging when status indicates failure */ char message[256]; if (exit_code != 0 && status == HEALTHCHECK_UNHEALTHY) { snprintf(message, sizeof(message), "%s (exit_code: %d)", status_str, exit_code); write_or_close_sync_fd(&sync_pipe_fd, status, message); } else { write_or_close_sync_fd(&sync_pipe_fd, status, status_str); }

jnovy · 2025-09-29T10:56:22Z

src/healthcheck.c

+		/* Parent process - wait for child to complete */
+		close(stderr_pipe[1]); /* Close write end of stderr pipe */
+		int status;
+		pid_t wait_result = waitpid(pid, &status, 0);


The healthcheck_execute_command() function doesn't enforce the configured timeout parameter. The `waitpid(pid, &status, 0) which waits indefinitely, making the timeout parameter non-functional.

jnovy · 2025-09-29T10:57:48Z

src/healthcheck.c

+
+	/* Check if we're still in start period */
+	if (timer->start_period_remaining > 0) {
+		timer->start_period_remaining -= timer->config.interval;


This countdown approach creates timing inconsistencies. If the interval is 30s but start period is 45s, the logic becomes unreliable, use actual elapsed time: time(NULL) - timer->start_time

jnovy · 2025-09-29T10:58:34Z

src/healthcheck.c

+
+/* Healthcheck validation constants */
+#define HEALTHCHECK_INTERVAL_MIN 1
+#define HEALTHCHECK_INTERVAL_MAX 3600


Review these constants again - HEALTHCHECK_INTERVAL_MIN/MAX, etc. are defined but never used to validate user input, allowing invalid parameters to pass through.

hey @jnovy , once again, thank you. I've made a bunch of changes; I made the changes from the findings you reported. While working in start_period, noticed that interval was not starting at the beginning, so I fixed that. for the minimalistic conmon protocol, I took id -100 + the health status. Also spent time on "integration" BATS tests. Trying to cover your points above. Im not just sure what you meant by "Resource cleanup verification"

Thanks for your patience.

draft PR updated as well, in case you want to try conmon+podman: https://github.com/containers/podman/pull/27067/files

jnovy

Looking much better now. FYI - I will be AFK for the next two weeks..

jnovy · 2025-10-10T06:58:55Z

src/healthcheck.c

+	timer->start_time = time(NULL); /* Record start time for elapsed time calculation */
+
+	/* Run the first healthcheck immediately */
+	healthcheck_timer_callback(timer);


Maybe we want to remove this to not to run health check immediately but allowing some time for container to initialize?

Tried to look for the container state, but it was way too much code (i.e. calling runc state, with g_timer until its read). so your idea of allowing some time is great

jnovy · 2025-10-10T06:59:58Z

src/conmon.c

+		config.interval = opt_healthcheck_interval != -1 ? opt_healthcheck_interval : 30;
+		config.timeout = opt_healthcheck_timeout != -1 ? opt_healthcheck_timeout : 30;
+		config.retries = opt_healthcheck_retries != -1 ? opt_healthcheck_retries : 3;
+		config.start_period = opt_healthcheck_start_period != -1 ? opt_healthcheck_start_period : 0;


With start_period=0, the very first health check (which runs immediately) counts as a failure if the container isn't ready. This is problematic because:

Container startup time varies

Network/storage initialization may not be complete

First check almost always fails for real applications

Maybe let's do something like 10 seconds default allowing the container to initialize?

It would be good to add a comment, e.g.:

/* First healthcheck runs after 'interval' seconds. * During the first 'start_period' seconds, failures don't count toward retries. * This gives containers time to initialize before health monitoring begins. */

podman (systemd) is running the healthcheck immediately when container is ready (i.e. 0s,2s,4s,6s,...)
but docker is starting at the first interval (i.e. 2s,4s,6s)

So I think we should do like podman instead of docker? so I took your comment but slightly change it according to the podman behavior.

having 10s of start_period by default makes sense and if user is confident that their container boots up fast, they can decrease it

src/healthcheck.c

Signed-off-by: Samuel Archambault <[email protected]>

samifruit514 mentioned this pull request Sep 13, 2025

healthcheck for non-systemd podman containers/podman#27033

Open

samifruit514 force-pushed the main branch from 88089f4 to 81ac182 Compare September 13, 2025 22:35

samifruit514 mentioned this pull request Sep 13, 2025

Reading healthchecks result from conmon's pipe containers/podman#27067

Draft

mheon reviewed Sep 13, 2025

View reviewed changes

src/healthcheck.c Outdated Show resolved Hide resolved

mheon reviewed Sep 13, 2025

View reviewed changes

src/healthcheck.c Outdated Show resolved Hide resolved

jnovy requested changes Sep 15, 2025

View reviewed changes

samifruit514 force-pushed the main branch from fc3ec5a to 44adc43 Compare September 15, 2025 20:21

jnovy requested changes Sep 16, 2025

View reviewed changes

samifruit514 force-pushed the main branch 4 times, most recently from 999af70 to 040f3e7 Compare September 17, 2025 11:31

samifruit514 force-pushed the main branch 2 times, most recently from f165d2f to efb4380 Compare September 22, 2025 22:47

samifruit514 requested a review from jnovy September 22, 2025 22:48

samifruit514 force-pushed the main branch from efb4380 to 657087b Compare September 22, 2025 23:28

samifruit514 force-pushed the main branch 2 times, most recently from 0cbd85f to 65eba53 Compare September 23, 2025 12:39

jnovy requested changes Sep 29, 2025

View reviewed changes

samifruit514 force-pushed the main branch 9 times, most recently from 9e63335 to c68bc82 Compare October 3, 2025 12:32

samifruit514 requested a review from jnovy October 5, 2025 19:27

jnovy requested changes Oct 10, 2025

View reviewed changes

samifruit514 force-pushed the main branch from 0865f46 to e053cd3 Compare October 10, 2025 18:51

healthcheck feature

e62dd5c

Signed-off-by: Samuel Archambault <[email protected]>

samifruit514 force-pushed the main branch from 4a3360f to e62dd5c Compare October 10, 2025 22:54

samifruit514 requested a review from jnovy October 11, 2025 02:18

healthcheck feature #599

Are you sure you want to change the base?

healthcheck feature #599

Conversation

samifruit514 commented Sep 13, 2025

Uh oh!

mheon commented Sep 13, 2025

Uh oh!

packit-as-a-service bot commented Sep 13, 2025

Uh oh!

samifruit514 commented Sep 13, 2025

Uh oh!

samifruit514 commented Sep 13, 2025

Uh oh!

Uh oh!

Uh oh!

mheon commented Sep 13, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

samifruit514 commented Sep 18, 2025

Uh oh!

samifruit514 commented Sep 19, 2025

Uh oh!

jnovy commented Sep 22, 2025

Uh oh!

samifruit514 commented Sep 22, 2025

Uh oh!

jnovy commented Sep 23, 2025

Uh oh!

samifruit514 commented Sep 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

samifruit514 commented Sep 26, 2025

Uh oh!

jnovy left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

samifruit514 commented Sep 23, 2025 •

edited

Loading

samifruit514 Oct 10, 2025 •

edited

Loading