Commit 01601d4

chanedwin (Edwin Chan) authored and Edwin Chan committed
initial commit for the spark-backend, with a working Spark pipeline, tests and CI up, and the visions integration pulled in
Update integrations.rst (ydataai#544)

fix ydataai#537 ValueError race condition when running multiprocessing with describe1d (ydataai#549)
* include tests for issue 537
* fix hidden side effect from the previous series.fillna(inplace=True) call by explicitly dropping NAs

Give visibility to our support (ydataai#536)
* Add support mention

Change formatters for overview (ydataai#535)

Fix 523 (ydataai#533)

Incompatible with pandas 1.1.0 (ydataai#557)

Notebook update instructions (ydataai#556)

Fix 545 and test pandas 1.0.5 and >=1.1 (ydataai#558)

Bump visions[type_image_path] from 0.4.4 to 0.5.0 (ydataai#547)
Bumps [visions[type_image_path]](https://github.com/dylan-profiler/visions) from 0.4.4 to 0.5.0.
- [Release notes](https://github.com/dylan-profiler/visions/releases)
- [Commits](dylan-profiler/visions@v0.4.4...0.5.0)

Update frequent issues (ydataai#564)

Fix warning from cmap (ydataai#565)

Feature/distinct unique (ydataai#566)
* Fix ydataai#539

v2.9.0 details (ydataai#567) [skip ci]

Code formatting

Visions integration

Build summary from graph structure

Fix a few more tests

Typeset changes + test updates

Type checking

Correlations

Handler, warning structure, random sample, test fix

Test fix

Fixes

Fix warning

Captions missing diagrams

Fix 51 Unhashable

Process comments

Fix tests

Update messages.py

Add threshold to all correlation configs

Remove unused renderers (ydataai#580)

Update README.md

Fix check for infinite values (ydataai#588)

Bump visions[type_image_path] from 0.5.0 to 0.6.0
Bumps [visions[type_image_path]](https://github.com/dylan-profiler/visions) from 0.5.0 to 0.6.0.
- [Release notes](https://github.com/dylan-profiler/visions/releases)
- [Commits](dylan-profiler/visions@0.5.0...v0.6.0)
Signed-off-by: dependabot-preview[bot] <[email protected]>

Update get_scatter_matrix for sparse dataframes

For a dataframe like:

       A     B     C
  0  1.0   7.0   NaN
  1  2.0   8.0   NaN
  2  3.0   9.0   NaN
  3  4.0   NaN  13.0
  4  5.0   NaN  14.0
  5  6.0   NaN  15.0
  6  NaN  10.0  16.0
  7  NaN  11.0  17.0
  8  NaN  12.0  18.0

the 'Interactions' tab would not display any data (since every row contains a NaN), even though every pair of columns contains valid data to plot. This change allows columns A, B, and C to be plotted pairwise against each other by dropping only the rows with NaNs in the two columns of each pair.

Update plot.py

Notation
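The pairwise-dropna idea described in the get_scatter_matrix note can be sketched in a few lines of pandas. This only illustrates the described behaviour; the real change lives in plot.py:

import numpy as np
import pandas as pd

# Sketch of the pairwise-dropna behaviour described above, using the same
# 9-row A/B/C frame from the commit message.
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, 6, np.nan, np.nan, np.nan],
    "B": [7, 8, 9, np.nan, np.nan, np.nan, 10, 11, 12],
    "C": [np.nan, np.nan, np.nan, 13, 14, 15, 16, 17, 18],
})

print(len(df.dropna()))  # 0 -> the old behaviour had nothing to plot

for x in df.columns:
    for y in df.columns:
        if x == y:
            continue
        pair = df[[x, y]].dropna()  # drop NaNs only within this pair
        print(x, y, len(pair))      # each pair keeps its 3 valid rows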
1 parent f8333d7 commit 01601d4

27 files changed (+1806, -84 lines)

.travis.yml

Lines changed: 97 additions & 0 deletions
@@ -0,0 +1,97 @@
+os: linux
+dist: bionic
+language: python
+cache:
+  pip: true
+  directories:
+    - data/
+
+jobs:
+  include:
+    - os: linux
+      name: "Python 3.9-dev on Linux"
+      python: 3.9-dev
+      env: TEST=examples PANDAS=">=1"
+      before_install:
+        - sudo apt-get -y install libopenblas-dev
+
+  allow_failures:
+    - name: "Python 3.9-dev on Linux"
+    - env: TEST=spark PANDAS=">=1.1" SPARK_VERSION=2.4.7 HADOOP_VERSION=2.7
+      python: 3.8
+    - env: TEST=spark PANDAS=">=1.1" SPARK_VERSION=2.3.0 HADOOP_VERSION=2.7
+      python: 3.8
+
+python:
+  - 3.6
+  - 3.7
+  - 3.8
+
+env:
+  global:
+    - JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
+  jobs:
+    - TEST=unit PANDAS="<1"
+    - TEST=issue PANDAS="<1"
+    - TEST=console PANDAS="<1"
+    - TEST=examples PANDAS="<1"
+    - TEST=unit PANDAS="==1.0.5"
+    - TEST=issue PANDAS="==1.0.5"
+    - TEST=unit PANDAS=">=1.1"
+    - TEST=issue PANDAS=">=1.1"
+    - TEST=console PANDAS=">=1.1"
+    - TEST=examples PANDAS=">=1.1"
+    - TEST=lint PANDAS=">=1.1"
+    - TEST=typing PANDAS=">=1.1"
+    - TEST=spark PANDAS=">=1.1" SPARK_VERSION=2.3.0 HADOOP_VERSION=2.7
+    - TEST=spark PANDAS=">=1.1" SPARK_VERSION=2.4.7 HADOOP_VERSION=2.7
+    - TEST=spark PANDAS=">=1.1" SPARK_VERSION=3.0.1 HADOOP_VERSION=2.7
+
+before_install:
+  - pip install --upgrade pip setuptools wheel
+  - pip install -r requirements.txt
+  - pip install -r requirements-test.txt
+  - pip install "pandas$PANDAS"
+  - sudo apt-get -y install curl
+
+install:
+  - check-manifest
+  - python setup.py sdist bdist_wheel
+  - twine check dist/*
+  - pip install -e .[notebook,app]
+
+script:
+  - >
+    if [ $TEST == 'unit' ];
+    then pytest -m "not sparktest" --cov=. tests/unit/;
+    fi
+  - >
+    if [ $TEST == 'issue' ];
+    then pytest --cov=. tests/issues/;
+    fi
+  - >
+    if [ $TEST == 'examples' ];
+    then pytest --cov=. --nbval tests/notebooks/;
+    fi
+  - >
+    if [ $TEST == 'console' ];
+    then pandas_profiling -h;
+    fi
+  - >
+    if [ $TEST == 'typing' ];
+    then make typing;
+    fi
+  - >
+    if [ $TEST == 'lint' ];
+    then python -m black --check --diff --quiet .;
+    isort --check-only --profile black .;
+    flake8 . --select=E9,F63,F7,F82 --show-source --statistics;
+    fi
+  - >
+    if [ $TEST == 'spark' ];
+    then SPARK_VERSION=${SPARK_VERSION} HADOOP_VERSION=${HADOOP_VERSION} make install-spark-ci;
+    JAVA_HOME=${JAVA_HOME} SPARK_HOME=${TRAVIS_BUILD_DIR}/spark/ make test-spark;
+    fi
+
+after_success:
+  - codecov -F $TEST

Makefile

Lines changed: 9 additions & 0 deletions
@@ -24,6 +24,9 @@ test_cov:
 	pandas_profiling -h
 	make typing
 
+test-spark:
+	pytest -m sparktest --black tests/unit/
+
 examples:
 	find ./examples -maxdepth 2 -type f -name "*.py" -execdir python {} \;
 
@@ -37,6 +40,12 @@ pypi_package:
 install:
 	pip install -e .[notebook]
 
+install-spark-ci:
+	sudo apt-get -y install openjdk-8-jdk
+	curl https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
+		--output ${TRAVIS_BUILD_DIR}/spark.tgz
+	tar -xvzf ${TRAVIS_BUILD_DIR}/spark.tgz && mv spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION} spark
+
 lint:
 	pre-commit run --all-files

pyproject.toml

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+[tool.pytest.ini_options]
+markers = ["sparktest",]
+[tool.pytest.ini_options.spark_options]
+"spark.executor.id" = "driver"
+"spark.app.name" = "PySparkShell"
+"spark.executor.instances" = 1
+"master" = "local[*]"
+"spark.driver.host" = "192.168.1.78"
+"spark.sql.catalogImplementation" = "in-memory"

requirements-spark.txt

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+# this provides the recommended pyspark and pyarrow versions for spark to work on pandas-profiling
+# note that if you are using pyspark 2.3 or 2.4 and pyarrow >= 0.15, you might need to
+# set ARROW_PRE_0_15_IPC_FORMAT=1 in your conf/spark-env.sh for toPandas functions to work properly
+pyspark>=2.3.0
+pyarrow>=0.8.0
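The ARROW_PRE_0_15_IPC_FORMAT note concerns Arrow-backed toPandas(). A minimal sketch of that path, assuming pyspark 2.3/2.4 where the relevant switch is spark.sql.execution.arrow.enabled:

from pyspark.sql import SparkSession

# Sketch of the Arrow-backed toPandas() conversion the comment above refers to.
# With pyarrow >= 0.15 on pyspark 2.3/2.4 you may additionally need
# ARROW_PRE_0_15_IPC_FORMAT=1 in conf/spark-env.sh, exactly as the note says.
spark = SparkSession.builder.master("local[*]").appName("arrow-check").getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

pdf = spark.range(1000).toPandas()  # Arrow-accelerated conversion when enabled
print(pdf.shape)                    # (1000, 1)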

requirements-test.txt

Lines changed: 13 additions & 3 deletions
@@ -1,11 +1,21 @@
 pytest
 coverage<5
 codecov
-pytest-mypy
+pytest-mypy>=0.7.0
+
+# this is because mypy had an issue where singledispatch _ usage resulted in errors
+# https://github.com/python/mypy/issues/4117
+mypy>=0.761
+
 pytest-cov
+pytest-black
 nbval
 fastparquet==0.4.1
 flake8
-check-manifest>=0.41
+check-manifest>=0.42
 twine>=3.1.1
-kaggle
+kaggle
+
+# spark dependency
+pytest-spark>=0.6.0
+pyarrow>=0.8.0
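The mypy pin above refers to the conventional singledispatch registration style, where every overload is named _; older mypy reported this as a redefinition (python/mypy#4117). A minimal sketch of the pattern:

from functools import singledispatch

# The `_` registration style the mypy pin is about (python/mypy#4117):
# each overload is conventionally named `_`, which older mypy treated as
# an invalid redefinition.
@singledispatch
def describe(value) -> str:
    return "unsupported"

@describe.register
def _(value: int) -> str:
    return "numeric"

@describe.register
def _(value: str) -> str:
    return "categorical"

print(describe(3), describe("a"))  # numeric categorical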

requirements.txt

Lines changed: 3 additions & 1 deletion
@@ -22,4 +22,6 @@ requests>=2.24.0
 tqdm>=4.48.2
 # Jupyter notebook
 ipywidgets>=7.5.1
-seaborn>=0.10.1
+seaborn>=0.10.1
+# Single dispatch lib
+singledispatchmethod>=1.0.0
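singledispatchmethod backports functools.singledispatchmethod (added in Python 3.8) to the 3.6/3.7 interpreters in the build matrix. A minimal sketch of method-level dispatch, with a hypothetical Summarizer class that is not part of pandas-profiling:

from singledispatchmethod import singledispatchmethod  # backport for Python < 3.8

# Method-level single dispatch, the mechanism the new correlations code
# leans on. `Summarizer` is illustrative only.
class Summarizer:
    @singledispatchmethod
    def summarize(self, value):
        raise NotImplementedError(f"no summary for {type(value)!r}")

    @summarize.register
    def _(self, value: int):
        return {"type": "integer", "value": value}

    @summarize.register
    def _(self, value: str):
        return {"type": "string", "length": len(value)}

print(Summarizer().summarize(42))  # {'type': 'integer', 'value': 42}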

setup.py

Lines changed: 3 additions & 0 deletions
@@ -42,6 +42,9 @@
     package_data={
         "pandas_profiling": ["py.typed"],
     },
+    package_data={
+        "pandas_profiling": ["py.typed"],
+    },
     include_package_data=True,
     classifiers=[
         "Development Status :: 5 - Production/Stable",

src/pandas_profiling/model/correlations.py

Lines changed: 10 additions & 0 deletions
@@ -7,10 +7,14 @@
 import pandas as pd
 from pandas.core.base import DataError
 from scipy import stats
+from singledispatchmethod import singledispatchmethod
 
 from pandas_profiling.config import config
 from pandas_profiling.model.typeset import Boolean, Categorical, Numeric, Unsupported
 
+        Args:
+            df:
+            summary:
 
 class Correlation:
     @staticmethod
@@ -35,6 +39,7 @@ class Kendall(Correlation):
     def compute(df, summary) -> Optional[pd.DataFrame]:
         return df.corr(method="kendall")
 
+    """
 
 class Cramers(Correlation):
     @staticmethod
@@ -128,6 +133,11 @@ def compute(df, summary) -> Optional[pd.DataFrame]:
 
     return correlation
 
+    @compute.register(SparkDataFrame)
+    @staticmethod
+    def _compute_spark(df: SparkDataFrame, summary) -> Optional[pd.DataFrame]:
+        """
+        Use pandasUDF to compute this first, but probably can be optimised further
 
 def warn_correlation(correlation_name: str, error):
     warnings.warn(
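The diff truncates _compute_spark after its docstring, so the body is not visible here. One plausible way to compute a Spark-side Spearman matrix, using pyspark.ml rather than the pandas UDF the docstring mentions; spearman_matrix is a hypothetical name, not the commit's actual implementation:

import pandas as pd
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation as SparkCorrelation

# Hedged sketch of a Spark-side Spearman computation; numeric, non-null
# columns are assumed, and the alias avoids clashing with the module's own
# Correlation class.
def spearman_matrix(df) -> pd.DataFrame:
    cols = [f.name for f in df.schema.fields]
    assembled = VectorAssembler(inputCols=cols, outputCol="features").transform(df)
    matrix = SparkCorrelation.corr(assembled, "features", method="spearman").head()[0]
    return pd.DataFrame(matrix.toArray(), index=cols, columns=cols)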
