This repository was archived by the owner on Jul 23, 2020. It is now read-only.

[f8a] It's possible to ingest the same package multiple times, under different names #3339

Closed

msrb opened this issue Apr 27, 2018 · 10 comments


msrb commented Apr 27, 2018

Planner ID: # 2099

Description

Package names in PyPI are case insensitive, i.e. PyJWT and pyjwt are the same package in the PyPI world. We normalize Python package names when an analysis request comes in, but later we seem to work with the original package name that was given to us by the requester. This means that we can analyze the same package multiple times and thus probably end up with multiple entries in the graph database.

It's possible that this issue also affects other ecosystems, not just PyPI. We need to check and either fix the issue for the other ecosystems as well (if easy), or create separate issues so we can tackle them later.
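For reference, PyPI's canonical form is defined by PEP 503: lowercase the name and collapse runs of `-`, `_`, and `.` into a single hyphen. A minimal sketch of that normalization (not the worker's actual code):

```python
import re

def normalize_pypi_name(name: str) -> str:
    """Normalize a PyPI package name per PEP 503: lowercase it and
    collapse runs of '-', '_' and '.' into a single hyphen."""
    return re.sub(r"[-_.]+", "-", name).lower()

# PyJWT and pyjwt normalize to the same canonical name:
print(normalize_pypi_name("PyJWT"))  # pyjwt
print(normalize_pypi_name("pyjwt"))  # pyjwt
```

If this function were applied once at every entry point (and the result used everywhere downstream), the duplicate-ingestion problem described above could not occur.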


msrb commented Apr 27, 2018

This is where we try to normalize package names now: https://github.com/fabric8-analytics/fabric8-analytics-worker/blob/6b9dd98004b1fcc63ae3257cc182dc2b146b05d2/f8a_worker/workers/init_analysis_flow.py#L32

But it clearly doesn't work as expected.

@tuxdna tuxdna self-assigned this May 2, 2018

tuxdna commented May 2, 2018

PyPI package distribution is case insensitive - https://stackoverflow.com/questions/26503509/is-pypi-case-sensitive

However, it is possible that in subsequent steps the package name that is used comes from metadata generated by Mercator. I will investigate where that might happen.


tuxdna commented May 2, 2018

Encountered this error:

FatalTaskError: ("No content was found at '%s' for PyPI package '%s'", 'pyjwt')
  File "selinon/task_envelope.py", line 114, in run
    result = task.run(node_args)
  File "f8a_worker/base.py", line 54, in run
    result = self.execute(node_args)
  File "f8a_worker/workers/repository_description.py", line 86, in execute
    return collector(self, arguments['name'])
  File "f8a_worker/workers/repository_description.py", line 53, in collect_pypi
    raise FatalTaskError("No content was found at '%s' for PyPI package '%s'", name)

Looks like the HTML structure of pypi.org has changed, so the parsing fails to find the repository description.

Here: https://github.com/fabric8-analytics/fabric8-analytics-worker/blob/6b9dd98004b1fcc63ae3257cc182dc2b146b05d2/f8a_worker/workers/repository_description.py#L50
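One way to sidestep the brittle HTML scraping (a hypothetical alternative, not what the worker currently does) is to take the repository URL from PyPI's JSON API (`https://pypi.org/pypi/<name>/json`), whose schema is far more stable than the website markup:

```python
from typing import Optional

def extract_repo_url(pypi_json: dict) -> Optional[str]:
    """Pull a repository/home-page URL out of a PyPI JSON API response
    (https://pypi.org/pypi/<name>/json) instead of parsing pypi.org
    HTML, whose structure changes over time."""
    info = pypi_json.get("info") or {}
    project_urls = info.get("project_urls") or {}
    return project_urls.get("Source") or info.get("home_page")

# Truncated illustration of the response shape:
sample = {"info": {
    "home_page": "https://github.com/jpadilla/pyjwt",
    "project_urls": {"Source": "https://github.com/jpadilla/pyjwt"},
}}
print(extract_repo_url(sample))  # https://github.com/jpadilla/pyjwt
```

The function name and the fallback order (`Source` link first, then `home_page`) are assumptions for illustration only.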


tuxdna commented May 2, 2018

I analyzed both pyjwt/1.6.1 and PyJWT/1.6.1.

curl -XPOST http://localhost:32000/api/v1/component-analyses/pypi/PyJWT/1.6.1
{}
curl -XPOST http://localhost:32000/api/v1/component-analyses/pypi/pyjwt/1.6.1
{}

After a while, when both ingestions had completed, I saw only one package and one version entry in the graph. I haven't been able to reproduce the issue with these steps.

Let me try something more.


msrb commented May 2, 2018

I was testing this via the jobs service.


tuxdna commented May 2, 2018

My local setup is giving the following errors with the latest Docker images:

coreapi-jobs            | + f8a-jobs.py initjobs
coreapi-jobs            | Traceback (most recent call last):
coreapi-jobs            |   File "/usr/bin/f8a-jobs.py", line 11, in <module>
coreapi-jobs            |     from f8a_jobs.scheduler import Scheduler
coreapi-jobs            |   File "/usr/lib/python3.4/site-packages/f8a_jobs/scheduler.py", line 20, in <module>
coreapi-jobs            |     import f8a_jobs.handlers as handlers
coreapi-jobs            |   File "/usr/lib/python3.4/site-packages/f8a_jobs/handlers/__init__.py", line 17, in <module>
coreapi-jobs            |     from .nuget_popular_analyses import NugetPopularAnalyses
coreapi-jobs            |   File "/usr/lib/python3.4/site-packages/f8a_jobs/handlers/nuget_popular_analyses.py", line 8, in <module>
coreapi-jobs            |     from f8a_worker.solver import NugetReleasesFetcher
coreapi-jobs            |   File "/usr/lib/python3.4/site-packages/f8a_worker/solver.py", line 10, in <module>
coreapi-jobs            |     from pip._internal.req.req_file import parse_requirements
coreapi-jobs            | ImportError: No module named 'pip._internal'

Due to the above error I am not able to run ingestion on my system.

Apparently, many others have been facing the same issue with the latest pip3 recently.
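The traceback stems from pip moving its internals in pip 10: `parse_requirements` lived under `pip.req` before and under `pip._internal.req.req_file` afterwards, so `f8a_worker/solver.py` breaks when the image ships a pip older than 10. A guarded import is a common workaround (sketched here; not necessarily the fix that was applied):

```python
# Compatibility shim for pip's reshuffled internals: pip >= 10 moved
# everything under pip._internal, so don't pin either import path.
try:
    # pip >= 10
    from pip._internal.req.req_file import parse_requirements
except ImportError:
    # pip < 10
    from pip.req import parse_requirements
```

Even with the shim, importing pip internals is explicitly unsupported by the pip project; long term, a dedicated parser (e.g. the `packaging` library for requirement specifiers) avoids the problem entirely.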


tuxdna commented May 3, 2018

This time I used the Jobs API to schedule the analyses for pyjwt and PyJWT, and the issue has been reproduced locally.

I can see the entries in Postgres, Minio and Graph.

@msrb msrb modified the milestones: Sprint 148, Sprint 149 May 9, 2018

tuxdna commented May 11, 2018

The issue is that the Jobs API sends parameters directly to the flow engine without performing case normalization.

Here is how to inspect this issue.

Step 0

Run component analysis for some package to ensure the S3 buckets for package and version data are initialized (this is only required if you are starting afresh):

curl -XPOST "http://localhost:32000/api/v1/component-analyses/pypi/urllib3/1.22"
{}

Wait for analyses to complete.

Step 1

Now invoke component analysis for pyjwt:

curl -XPOST "http://localhost:32000/api/v1/component-analyses/pypi/pyjwt/1.6.1"
{}

Wait for analyses to complete.

Go to the Jobs UI and check the bookkeeping data for this package.

You will see all the workers that were executed for pyjwt.

Step 2

Finally, run component analysis for PyJWT:

curl -XPOST "http://localhost:32000/api/v1/component-analyses/pypi/PyJWT/1.6.1"
{}

Wait for analyses to complete.

Go to the Jobs UI and check the bookkeeping data for this package.

You will see no workers executed for PyJWT:

{
  "error": "No result found."
}

Step 3

Now run the analyses using the Jobs UI: http://localhost:34000/api/v1/jobs/selective-flow-scheduling?state=running

{
  "flow_arguments": [
    {
      "ecosystem": "pypi",
      "force": true,
      "force_graph_sync": true,
      "name": "pyjwt",
      "recursive_limit": 0,
      "version": "1.6.1"
    },
    {
      "ecosystem": "pypi",
      "force": true,
      "force_graph_sync": true,
      "name": "PyJWT",
      "recursive_limit": 0,
      "version": "1.6.1"
    }
  ],
  "flow_name": "bayesianFlow",
  "run_subsequent": false,
  "task_names": []
}

Wait for analyses to complete.

This time, if you check the worker data for PyJWT as in Step 2 above, you will see that there is an entry for this package, but no workers were run with the name PyJWT.

{
  "summary": {
    "analysed_versions": [
      "1.6.1"
    ],
    "ecosystem": "pypi",
    "package": "PyJWT",
    "package_level_workers": [],
    "package_version_count": 1
  }
}

However, if you check the worker data for pyjwt as in Step 1 above, you will see that workers were run for this package.

This clearly means that case normalization is happening properly at the worker level, but not at the Jobs API level.

Solution

Perform case normalization before submitting jobs via the Jobs API, or perform case normalization at the topmost node of the flow.
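A minimal sketch of the first option, normalizing at job-submission time (the function names and argument shape here are hypothetical, not the actual Jobs API code):

```python
import re

def normalize_package_name(ecosystem: str, name: str) -> str:
    """Canonicalize a package name before it reaches the flow engine.
    Only the PyPI rule (PEP 503) is sketched here; other ecosystems
    may need their own rules, or none at all."""
    if ecosystem == "pypi":
        return re.sub(r"[-_.]+", "-", name).lower()
    return name

def prepare_flow_arguments(flow_arguments: list) -> list:
    """Hypothetical pre-submission hook: rewrite each job's 'name'
    so every scheduled flow sees only canonical package names."""
    for args in flow_arguments:
        args["name"] = normalize_package_name(args["ecosystem"], args["name"])
    return flow_arguments

jobs = prepare_flow_arguments([
    {"ecosystem": "pypi", "name": "PyJWT", "version": "1.6.1"},
    {"ecosystem": "pypi", "name": "pyjwt", "version": "1.6.1"},
])
print([j["name"] for j in jobs])  # ['pyjwt', 'pyjwt']
```

With this hook in place, the two selective-flow-scheduling requests above would collapse into a single canonical package, so only one graph entry could ever be created.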

@sivaavkd sivaavkd assigned msrb and unassigned tuxdna May 14, 2018
@msrb msrb removed their assignment May 14, 2018
@humaton humaton self-assigned this May 15, 2018
@sivaavkd sivaavkd modified the milestones: Sprint 149, Sprint 150 May 31, 2018
@sivaavkd sivaavkd modified the milestones: Sprint 150, Sprint 151 Jun 26, 2018
@GeorgeActon GeorgeActon added the priority/P4 Normal label Jul 31, 2018

sivaavkd commented Aug 31, 2018

@humaton any update on this bug? We have been carrying this Sev2 issue across sprints for a while now. cc @msrb

@sivaavkd sivaavkd removed this from the Sprint 151 milestone Aug 31, 2018

humaton commented Sep 4, 2018

@sivaavkd there is a fix for it here: fabric8-analytics/fabric8-analytics-jobs/pull/287

But this issue will still be present in any future code that schedules flows directly, without going through the Jobs or Server API.
