Skip to content

analytics: fix mac not sending reports #10026

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Oct 16, 2023
Merged

analytics: fix mac not sending reports #10026

merged 4 commits into from
Oct 16, 2023

Conversation

dberenbaum
Copy link
Contributor

Not sure if this is the best approach to fix the issue, but currently we do not seem to be sending analytics reports for mac since March. I have tested and found that it seems to be some combination of forking and the _popen method used. Analytics get sent successfully if I either:

  1. Drop all the forking logic, OR
  2. Call main like it does for linux.

Here I dropped all the forking and conditions here and just use _popen directly. This successfully sends reports for me, and I'm not sure why we need forking here since it should happen in a separate, non-blocking process when using _popen (and dropping .communicate()).

Not sure if I miss some context for why forking or other logic is needed (I took a look back at #4294 and #4262 but wasn't clear why this is all needed).

dvc/daemon.py Outdated

os._exit(0) # pylint: disable=protected-access


def _spawn(cmd, env):
logger.debug("Trying to spawn '%s'", cmd)

if os.name == "nt":
_spawn_windows(cmd, env)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a check here for sys.platform == "win32". Does this mean we don't report on 64-bit windows?

@dberenbaum dberenbaum requested review from efiop and pmrowla October 16, 2023 15:00
@skshetry
Copy link
Collaborator

Not sure if I miss some context for why forking or other logic is needed

Forking is cheaper than popen. Let's limit the change to macOS for now.

@skshetry
Copy link
Collaborator

Also, could you please try commenting out this line and see if that is the problem? I don't remember any other recent changes than that addition.

os.closerange(0, 3)

@dberenbaum
Copy link
Contributor Author

Also, could you please try commenting out this line and see if that is the problem? I don't remember any other recent changes than that addition.

That doesn't help. I'm wondering if it's something in a mac os release rather than on our side causing the issue.

@dberenbaum
Copy link
Contributor Author

Forking is cheaper than popen. Let's limit the change to macOS for now.

Done

@codecov
Copy link

codecov bot commented Oct 16, 2023

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Files Coverage Δ
dvc/daemon.py 40.27% <0.00%> (+0.27%) ⬆️

📢 Thoughts on this report? Let us know!.

@efiop
Copy link
Contributor

efiop commented Oct 16, 2023

Analytics get sent successfully if I either:

Drop all the forking logic, OR
Call main like it does for linux.

So just reverting changes from #4294 (aka do the same thing we do on linux) fixes it for you, right?

dvc/daemon.py Outdated
Comment on lines 61 to 64
if platform.system() == "Darwin":
# workaround for MacOS bug
# https://github.com/iterative/dvc/issues/4294
_popen(cmd, env=env).communicate()
_popen(cmd, env=env)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like https://bugs.python.org/issue33725 got fixed in 3.8 and we can drop this workaround, which will make it work the same for linux and mac again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to drop the mac workaround

@efiop
Copy link
Contributor

efiop commented Oct 16, 2023

currently we do not seem to be sending analytics reports for mac since March

Were you able to find any hints on why since march specifically?

@dberenbaum
Copy link
Contributor Author

dberenbaum commented Oct 16, 2023

So just reverting changes from #4294 (aka do the same thing we do on linux) fixes it for you, right?

Yes. Not sure if #4294 was related to certain versions of the mac os. edit: see comment from @efiop above.

Were you able to find any hints on why since march specifically?

No. My best guess is some mac os update.

@efiop
Copy link
Contributor

efiop commented Oct 16, 2023

@dberenbaum So just to confirm I understand correctly, in this current form, the fix works for you, right? If so, this is super clean 🔥

Not sure if I miss some context for why forking or other logic is needed (I took a look back at #4294 and #4262 but wasn't clear why this is all needed).

This is mostly to do the analytics sending in a separate detached process, so the main one doesn't have to wait for that if the network is slow. There was a big emphasis on quick command execution in the past that now definitely feels redundant and even misguided. But since we have the mechanism already, I guess it is nice to have one...

@dberenbaum
Copy link
Contributor Author

dberenbaum commented Oct 16, 2023

So just to confirm I understand correctly, in this current form, the fix works for you, right? If so, this is super clean 🔥

Yes.

This is mostly to do the analytics sending in a separate detached process, so the main one doesn't have to wait for that if the network is slow. There was a big emphasis on quick command execution in the past that now definitely feels redundant and even misguided. But since we have the mechanism already, I guess it is nice to have one...

My point was that popen will still be non-blocking (without .communicate()) and running in a separate subprocess, so I don't think it should have any performance impact besides the minor one @skshetry noted that forking is cheaper than a new subprocess. Not sure it's worth all the extra code but I understand the hesitance to change more than necessary here.

@efiop
Copy link
Contributor

efiop commented Oct 16, 2023

My point was that popen will still be non-blocking (without .communicate()) and running in a separate subprocess, so I don't think it should have any performance impact besides the minor one @skshetry noted that forking is cheaper than a new subprocess. Not sure it's worth all the extra code but I understand the hesitance to change more than necessary here.

Ah, I see. That extra code is mostly about daemonizing the process to properly detach from parent/session/tty/etc so that there are no orphans and the child is not killed when the main process finishes and stuff like that.

Copy link
Contributor

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outstanding research, thanks for fixing analytics! 🔥

@efiop efiop merged commit ef226a9 into main Oct 16, 2023
@efiop efiop deleted the debug-analytics branch October 16, 2023 20:28
@efiop efiop added bug Did we break something? P: macOS labels Oct 16, 2023
@skshetry
Copy link
Collaborator

skshetry commented Oct 17, 2023

I now understand why it was failing.

I added following code in #9204:

os.closerange(0, 3)

which closes the streams in the child. So, the popen does not inherit any streams, and all sys.{stdin,stdout,stderr} are set to None.

Couple that with the changes to #8993, where I am not checking if the sys.{stdout,stderr} is not None, the daemon is failing silently:

Traceback (most recent call last):
  File "/Users/user/Projects/dvc/dvc/__main__.py", line 15, in <module>
    from dvc.cli import main
  File "/Users/user/Projects/dvc/dvc/__init__.py", line 10, in <module>
    dvc.logger.setup()
  File "/Users/user/Projects/dvc/dvc/logger.py", line 187, in setup
    formatter = ColorFormatter(log_colors=log_colors and sys.stdout.isatty())
                                                         ^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'isatty'
Traceback (most recent call last):
  File "/Users/user/Projects/dvc/dvc/__main__.py", line 26, in <module>
    ex = main(sys.argv[1:])
         ^^^^^^^^^^^^^^^^^^
  File "/Users/user/Projects/dvc/dvc/cli/__init__.py", line 168, in main
    if sys.stderr.closed:  # pylint: disable=using-constant-test
       ^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'closed'

I think we should create a pidfile, and pass it to dvc daemon which gets deleted if it exits cleanly.

@dberenbaum
Copy link
Contributor Author

dberenbaum commented Oct 17, 2023

If it helps, here are the steps I'm using to debug:

  1. Download the latest data (I have been downloading the latest hour to debug) from s3://dvc-analytics/events
  2. Run the following:
import dask.bag as db
import json


def flatten_json(record: dict) -> dict:
    for k, v in record["body"].items():
        if isinstance(v, dict):
            for subk, subv in v.items():
                record[f"{k}.{subk}"] = subv
        else:
            record[k] = v
    del record["body"]
    return record

def to_dict(record: tuple) -> dict:
    # Separate record into json data and path.
    jsondata, fpath = record
    # Convert to dicts.
    dictdata = json.loads(jsondata)
    dictdata["path"] = fpath.split("sample")[-1]
    # Merge dicts.
    return dictdata

b = db.read_text(<path_to_downloaded_data>, include_path=True)
b_dict = b.map(to_dict)
b_flat = b_dict.map(flatten_json)
df = b_flat.to_dataframe().compute()
df[df["user_id"] == <my_user_id>]

This will output all records for that user in the downloaded data.

@dberenbaum
Copy link
Contributor Author

dberenbaum commented Oct 17, 2023

I added following code in #9204:

os.closerange(0, 3)

@skshetry Do you recommend we drop it? I worry what other effects it could have if we keep it.

@skshetry
Copy link
Collaborator

skshetry commented Oct 17, 2023

I am getting crashes after pulling this changes. Found some related issues in python/cpython#105912, python/cpython#58037, bpo-30385 and in bpo-31818 unless I set export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES.

Details
Process:               Python [78519]
Path:                  /usr/local/Cellar/[email protected]/3.11.6/Frameworks/Python.framework/Versions/3.11/Resources/Python.app/Contents/MacOS/Python
Identifier:            org.python.python
Version:               3.11.6 (3.11.6)
Code Type:             X86-64 (Native)
Parent Process:        launchd [1]
Responsible:           iTerm2 [76996]
User ID:               501

Date/Time:             2023-10-17 23:17:20.3625 +0545
OS Version:            macOS 14.0 (23A344)
Report Version:        12
Bridge OS Version:     8.0 (21P365)
Anonymous UUID:        6A3DF180-9632-4B9F-6400-6880549C8BC2

Sleep/Wake UUID:       91092C2E-81E2-4630-8FC3-B7B41899C7DB

Time Awake Since Boot: 160000 seconds
Time Since Wake:       11571 seconds

System Integrity Protection: enabled

Crashed Thread:        0  Dispatch queue: com.apple.main-thread

Exception Type:        EXC_CRASH (SIGABRT)
Exception Codes:       0x0000000000000000, 0x0000000000000000

Termination Reason:    Namespace OBJC, Code 1 

Application Specific Information:
crashed on child side of fork pre-exec


Thread 0 Crashed::  Dispatch queue: com.apple.main-thread
0   libsystem_kernel.dylib        	    0x7ff815f49112 __abort_with_payload + 10
1   libsystem_kernel.dylib        	    0x7ff815f678c2 abort_with_payload_wrapper_internal + 82
2   libsystem_kernel.dylib        	    0x7ff815f67870 abort_with_reason + 19
3   libobjc.A.dylib               	    0x7ff815bda26e _objc_fatalv(unsigned long long, unsigned long long, char const*, __va_list_tag*) + 114
4   libobjc.A.dylib               	    0x7ff815bda1fc _objc_fatal(char const*, ...) + 138
5   libobjc.A.dylib               	    0x7ff815bc956d performForkChildInitialize(objc_class*, objc_class*) + 299
6   libobjc.A.dylib               	    0x7ff815bb51c6 initializeNonMetaClass + 410
7   libobjc.A.dylib               	    0x7ff815bcb6eb initializeAndMaybeRelock(objc_class*, objc_object*, locker_mixin<lockdebug::lock_mixin<objc_lock_base_t>>&, bool) + 231
8   libobjc.A.dylib               	    0x7ff815bb4d6c lookUpImpOrForward + 787
9   libobjc.A.dylib               	    0x7ff815bb449b _objc_msgSend_uncached + 75
10  CoreFoundation                	    0x7ff81600b619 _CFStringGetFormatSpecifierConfiguration + 25
11  CoreFoundation                	    0x7ff816007445 __CFStringAppendFormatCore + 203
12  CoreFoundation                	    0x7ff816132439 _CFStringCreateWithFormatAndArgumentsReturningMetadata + 169
13  CoreFoundation                	    0x7ff81604fea1 CFStringCreateWithFormatAndArguments + 107
14  CoreFoundation                	    0x7ff816007321 CFStringCreateWithFormat + 128
15  CoreFoundation                	    0x7ff8161a9718 -[_CFXPreferences withManagedSourceForIdentifier:user:perform:] + 113
16  CoreFoundation                	    0x7ff81600727e -[CFPrefsSearchListSource addManagedSourceForIdentifier:user:] + 99
17  CoreFoundation                	    0x7ff816178e71 __108-[_CFXPreferences(SearchListAdditions) withSearchListForIdentifier:container:cloudConfigurationURL:perform:]_block_invoke.160 + 252
18  CoreFoundation                	    0x7ff816178b78 -[_CFXPreferences withSearchLists:] + 60
19  CoreFoundation                	    0x7ff816007058 __108-[_CFXPreferences(SearchListAdditions) withSearchListForIdentifier:container:cloudConfigurationURL:perform:]_block_invoke + 278
20  CoreFoundation                	    0x7ff816178cf0 -[_CFXPreferences withSearchListForIdentifier:container:cloudConfigurationURL:perform:] + 349
21  CoreFoundation                	    0x7ff816006ae2 -[_CFXPreferences copyAppValueForKey:identifier:container:configurationURL:] + 124
22  CoreFoundation                	    0x7ff816006a28 _CFPreferencesCopyAppValueWithContainerAndConfiguration + 101
23  SystemConfiguration           	    0x7ff816c518bb SCDynamicStoreCopyProxiesWithOptions + 155
24  _scproxy.cpython-311-darwin.so	       0x110ee08c1 get_proxy_settings + 32
25  Python                        	       0x10fb9477a cfunction_vectorcall_NOARGS + 91
26  Python                        	       0x10fc26e3f _PyEval_EvalFrameDefault + 45338
27  Python                        	       0x10fc2b859 _PyEval_Vector + 92
28  Python                        	       0x10fb510c1 method_vectorcall + 343
29  Python                        	       0x10fb4e584 _PyVectorcall_Call + 134
30  Python                        	       0x10fc290d4 _PyEval_EvalFrameDefault + 54191
31  Python                        	       0x10fc2b859 _PyEval_Vector + 92
32  Python                        	       0x10fb4e584 _PyVectorcall_Call + 134
33  Python                        	       0x10fc290d4 _PyEval_EvalFrameDefault + 54191
34  Python                        	       0x10fc1b02e PyEval_EvalCode + 188
35  Python                        	       0x10fc176ed builtin_exec + 479
36  Python                        	       0x10fb94874 cfunction_vectorcall_FASTCALL_KEYWORDS + 85
37  Python                        	       0x10fc26e3f _PyEval_EvalFrameDefault + 45338
38  Python                        	       0x10fc2b859 _PyEval_Vector + 92
39  Python                        	       0x10fc8fee6 pymain_run_module + 212
40  Python                        	       0x10fc8fa8a Py_RunMain + 804
41  Python                        	       0x10fc90a54 Py_BytesMain + 42
42  dyld                          	    0x7ff815bf53a6 start + 1942


Thread 0 crashed with X86 Thread State (64-bit):
  rax: 0x0000000002000209  rbx: 0x0000000000000001  rcx: 0x00007ff7b0e52888  rdx: 0x0000000000000000
  rdi: 0x0000000000000008  rsi: 0x0000000000000001  rbp: 0x00007ff7b0e528d0  rsp: 0x00007ff7b0e52888
   r8: 0x00007fd7451aeeb0   r9: 0x0000000000000080  r10: 0x0000000000000000  r11: 0x0000000000000246
  r12: 0x0000000000000080  r13: 0x0000000000000000  r14: 0x0000000000000008  r15: 0x0000000000000000
  rip: 0x00007ff815f49112  rfl: 0x0000000000000246  cr2: 0x00007fd744fdb04e
  
Logical CPU:     0
Error Code:      0x02000209 
Trap Number:     133

IMHO the safest fix is to not run any Python program after fork(). Using use subprocess to use fork() immmediately followed by exec(). It's not safe to execute code after fork(), many functions are not "fork safe". But I know that many applications don't care, since there is also a lot of functions which are fork safe...

Many Mac OS X frameworks do not work reliably if you call fork but do not call exec.

So it seems like we should really be using subprocess in macOS.

EDIT: so looks like we need changes from #4262.

@shcheklein
Copy link
Member

@skshetry could this bug affected other platforms somehow after all? I see that the fix after all is not Darwin specific, right?

@skshetry
Copy link
Collaborator

skshetry commented Feb 9, 2024

@skshetry could this bug affected other platforms somehow after all? I see that the fix after all is not Darwin specific, right?

See #10026 (comment).

On macOS, we were sending analytics via a subprocess. And, in that case, stderr/stdout/stdin would set to None, as we were closing the file descriptors.

On Linux, we were running it in a (double) forked process, so the file descriptors' objects did exist, even if they were closed. Which we could handle already.

Windows wasn't affected as it has a separate implementation.

@shcheklein
Copy link
Member

@skshetry thanks! we see a dip in analytics the same time in other platforms ... not as drastic, but still .. do you know if were doing other changes to the telemetry? overall, what is the best way to see / map all the changes we've done to the telemetry / configs that could have affected this?

@skshetry
Copy link
Collaborator

skshetry commented Feb 9, 2024

I don't think so. The only major change was db70d1c. But that probably should have increased telemetry.

Could that be 3.0 related, as it was a major version?

@skshetry
Copy link
Collaborator

skshetry commented Feb 9, 2024

FYI, today we have a test server to see whether or not analytics (and, updater) is working or not.

python -m tests.func.test_daemon 8000 &
export DVC_ANALYTICS_ENDPOINT=http://127.0.0.1:8000
export DVC_UPDATER_ENDPOINT=http://127.0.0.1:8000

dvc version

Which will log what was received. POST request is for analytics and GET request is for updater.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Did we break something?
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

4 participants