Skip to content

Conversation

karynzv
Copy link
Contributor

@karynzv karynzv commented Mar 5, 2024

Here is a short notebook on anomaly detection using Pycaret and CrateDB. This was based on this blog post by Christian

Summary of the changes / Why this is an improvement

Checklist

  • Link to issue this PR refers to (if applicable):
  • CLA is signed

@karynzv karynzv self-assigned this Mar 5, 2024
karynzv added 2 commits March 5, 2024 12:35
Comment out pip install to remove long output from cell
Copy link
Contributor

@marijaselakovic marijaselakovic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Update README.md file with the description of the new notebook and link to Google Colab
  • Add a paragraph at the beginning explaining what the Anomaly Detection is and key learning points in this notebook
  • It would be good to execute step 3 with sqlalchemy: first create a table and then run COPY FROM statement
  • Show rendered image in step 7 with anomalies

ckurze
ckurze previously requested changes Mar 6, 2024
Copy link
Contributor

@ckurze ckurze left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your work on this notebook!

I would suggest that we structure it the same way as EDA and Timeseries Decomposition, so they are almost the same. Nitty-picky here, but would make it easier for users to recognize the same pattern, imho.

The original blog post also visualises the known anomalies first and then compares with the anomaly detection, could this be included as well?

Not sure, but usually the huge amount of lines comes from the plots. Other notebooks (like topic/timeseries/exploratory_data_analysis.ipynb) set the renderer to PNG to avoid it, maybe it solves the large size of the notebook/many lines?

# Plotly plots will be rendered as PNG images, uncomment the following line to crate an interactive chart
plotly.io.renderers.default = 'png'

Thank you, great work!

@amotl
Copy link
Member

amotl commented Mar 6, 2024

Not sure, but usually the huge amount of lines comes from the plots. Other notebooks (like topic/timeseries/exploratory_data_analysis.ipynb) set the renderer to PNG to avoid it, maybe it solves the large size of the notebook/many lines?

Yes, that makes them smaller, at least in terms of "lines of code". Most often, they don't get much larger in terms of size, depending on individual image compression capacities I guess. We will need to inspect the outcome if it will get better after applying @ckurze's suggestion, thanks!

Current size: 49183 loc · 1.08 MB

@karynzv
Copy link
Contributor Author

karynzv commented Mar 12, 2024

Hey, thanks for everyone's comments, I'll start working on this again now.

karynzv added 3 commits March 14, 2024 12:37
This is not done yet, so no need to review! I'm just sharing the latest changes after your comments. Thank you all again for the input.
I added the initial plot for the anomalies and also changed the text and structure of the notebook
@amotl
Copy link
Member

amotl commented Mar 19, 2024

Previous size: 49183 loc · 1.08 MB

New size: 732 loc · 222 KB

Excellent, thanks! 1

Footnotes

  1. Can't repeat that too often: Please make sure to amend your commit, or squash before merging, otherwise the issue will not be remedied completely.

Copy link
Member

@amotl amotl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wonderful. Just providing a few more suggestions to use at your own disposal. Thanks again!

Comment on lines 207 to 210
"with engine.connect() as conn:\n",
" result = conn.execute(sa.text(query))\n",
" columns = result.keys() # Extract column names\n",
" df = pd.DataFrame(result.fetchall(), columns=columns)\n",
Copy link
Member

@amotl amotl Mar 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just spotted a little code smell here.

Is it possible to use the improved variant offered by modern SA, using .mappings() already, to request and process the record as a map directly?

with engine.connect() as connection:
    with connection.execute(sa.text(query)) as result:
        pp(result.mappings().fetchall())

-- https://cratedb.com/docs/python/en/latest/#sqlalchemy

On the other hand, why not use pandas' native .read_sql(), providing the most compact representation to load database table data into a data frame in Python?

with engine.connect() as connection:
    df = pd.read_sql(sql=sa.text(query), con=connection)

-- https://cratedb.com/docs/python/en/latest/#sqlalchemy

karynzv and others added 5 commits March 20, 2024 11:29
Apply suggestions by Moll

Co-authored-by: Andreas Motl <[email protected]>
Changed variable names to match the chosen model
Changed the connection approach as suggested by @amotl
@karynzv karynzv requested a review from ckurze March 20, 2024 17:15
@karynzv karynzv dismissed ckurze’s stale review March 25, 2024 17:21

I've already address this

@karynzv karynzv merged commit d92d514 into main Mar 25, 2024
@karynzv karynzv deleted the timeseries-anomaly-detection branch March 25, 2024 17:26
@amotl
Copy link
Member

amotl commented Mar 25, 2024

Thanks again for your efforts, and for merging! 💯

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants