Kaggle notebooks analysis #22

Open
@datapythonista

This is a first version of the analysis of pandas usage in Kaggle notebooks.

We fetched Python notebooks from Kaggle and ran them with record_api to count the calls to the main objects of the pandas API. A total of 895 notebooks could be analyzed.

A separate column shows the page views in the pandas documentation. The page views are normalized to a maximum of 1,000, so the most viewed page in the pandas documentation has a value of 1,000 in that column.
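As an illustration of that normalization, here is a minimal sketch. The raw counts and the name `raw_views` are made up for the example (the real page-view numbers are not part of this issue), and the rounding to integers is an assumption based on the integer values in the tables.

```python
import pandas as pd

# Hypothetical raw page-view counts; these numbers are invented for illustration.
raw_views = pd.Series({"groupby": 5000, "drop": 4350, "merge": 2870, "attrs": 0})

# Scale so that the most viewed page gets exactly 1,000, then round to integers.
normalized = (raw_views / raw_views.max() * 1000).round().astype(int)

print(normalized)
```

With this scaling, a page with no views stays at 0 and every other page is expressed relative to the most viewed one.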

For simplicity, only the attributes of DataFrame, Series and the pandas top-level module are included, and attributes with the same name have been merged. So, pandas.sum(), Series.sum() and DataFrame.sum() appear in the list simply as sum.
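The merge by attribute name could be sketched as below. This is not the code from the repo: the column names (`owner`, `attribute`, `calls`) and the counts are hypothetical, and a plain `groupby` sum is assumed to be an adequate stand-in for the actual aggregation.

```python
import pandas as pd

# Hypothetical per-object call counts; names and numbers are invented.
calls = pd.DataFrame(
    {
        "owner": ["DataFrame", "Series", "DataFrame", "Series", "pandas"],
        "attribute": ["sum", "sum", "head", "head", "concat"],
        "calls": [10, 5, 7, 3, 4],
    }
)

# Drop the owner and add up the calls for each attribute name.
merged = calls.groupby("attribute")["calls"].sum().sort_values(ascending=False)

print(merged)
```

One consequence of this choice is that the tables cannot distinguish, for example, calls on a Series from calls on a DataFrame.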

The sections are only there to make the document easier to read; they are not an "official" categorization of the API. Feedback is welcome if something feels misplaced.

The source code to generate the table is available at this repo.

Top 25 called methods

Notes:

  • Operators (e.g. __add__) are merged with their equivalent method (e.g. add)
  • __getitem__ is used both to access a column (df[col]) and to filter rows (df[condition])
  • Accessing a column is also possible via __getattr__ (e.g. df.col_name), but this has not been captured
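To make the notes above concrete, the following sketch shows the access patterns they refer to. The DataFrame and its column names (`price`, `qty`) are invented for the example.

```python
import pandas as pd

# Small made-up frame for illustration.
df = pd.DataFrame({"price": [10, 20, 30], "qty": [1, 2, 3]})

# __getitem__ with a column label returns that column as a Series.
col = df["price"]

# __getitem__ with a boolean mask filters rows; the comparison itself
# goes through __gt__, which the tables report under gt.
filtered = df[df["price"] > 15]

# Attribute access returns the same column, but goes through __getattr__,
# which the analysis did not capture.
via_attr = df.price

# The + operator calls __add__, which is reported together with add().
total_op = df["price"] + df["qty"]
total_method = df["price"].add(df["qty"])

print(filtered)
print(total_op.equals(total_method))
```

Since both column access and row filtering use __getitem__, the very large __getitem__ count in the table covers both uses.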

Object Kaggle calls
__getitem__ 143992
__setitem__ 40059
eq 3018
mul 2799
add 2768
groupby 2267
loc 1667
drop 1618
fillna 1609
columns 1583
head 1575
truediv 1442
shape 1267
sub 1144
isnull 1057
sort_values 1015
and 957
values 953
sum 898
astype 728
value_counts 706
index 664
gt 622
apply 538
to_frame 479

Main items by category

Data summary and info

Object Kaggle calls Docs views
info 275 22
empty 0 32
describe 303 146
value_counts 706 161
dtypes 175 64
memory_usage 83 2
ndim 0 1
shape 1267 17
size 3 45
values 953 113
attrs 0 0
array 0 0
unique 193 106
dtype 149 8
nbytes 0 0

Indexing

Object Kaggle calls Docs views
__getitem__ 143992 0
__setitem__ 40059 0
axes 0 4
columns 1583 31
set_index 72 278
swapaxes 0 0
select_dtypes 180 36
lookup 0 11
xs 5 16
loc 1667 232
iloc 427 122
index 664 164
reindex 11 136
reindex_like 0 2
reset_index 305 279
add_prefix 16 6
add_suffix 0 3
get 0 16
iat 1 17
keys 13 16
at 4 40
filter 3 170
rename 401 355
rename_axis 0 13
idxmax 7 49
idxmin 0 10
droplevel 0 0
truncate 0 7
swaplevel 0 7
take 0 5
reorder_levels 0 5
sort_index 32 90
set_axis 0 1
pop 14 9
searchsorted 0 3
name 113 13
item 0 3
argmax 0 2
argmin 0 1
argsort 0 3

Filter, select, sort

Object Kaggle calls Docs views
nlargest 25 17
nsmallest 1 8
head 1575 108
tail 60 12
drop_duplicates 20 194
sort_values 1015 457
sample 63 102
query 12 69

Operators

Object Kaggle calls Docs views
add 2768 104
div 2 10
dot 0 9
eq 3018 1
equals 0 35
floordiv 3 0
ge 68 1
gt 622 1
le 197 0
lt 8 0
mod 11 1
mul 2799 4
ne 163 1
pow 29 2
product 0 3
radd 0 6
rdiv 0 0
rfloordiv 0 0
rmod 0 0
rmul 0 2
rpow 0 0
rsub 0 2
rtruediv 0 2
sub 1144 7
truediv 1442 0

Missing values

Object Kaggle calls Docs views
isnull 1057 90
notnull 60 40
dropna 193 346
fillna 1609 248
interpolate 3 39
isna 108 27
notna 5 11
hasnans 0 0

Map

Object Kaggle calls Docs views
cut 59 84
eval 0 12
corrwith 1 11
applymap 2 49
astype 728 234
rank 2 34
clip 4 13
where 10 105
mask 14 25
combine 0 12
combine_first 0 11
isin 86 138
abs 25 12
replace 463 216
apply 538 379
round 14 68
transform 10 39
factorize 3 15
map 420 91
between 1 12

Reduce

Object Kaggle calls Docs views
cov 0 9
quantile 47 78
var 4 11
skew 88 5
std 140 39
sum 898 114
kurt 60 1
kurtosis 23 3
count 109 107
max 131 70
mean 390 107
median 228 21
min 107 26
mode 205 18
prod 1 1
nunique 15 27
all 9 16
any 87 22
mad 3 2
sem 0 2
corr 239 105
is_monotonic 0 0
is_monotonic_decreasing 0 0
is_monotonic_increasing 0 0
is_unique 0 1
autocorr 0 7

Misc

Object Kaggle calls Docs views
iterrows 39 102
style 84 76
itertuples 0 36
bool 0 5
squeeze 0 2
update 8 56
pipe 3 7
__iter__ 0 1
items 1 6
iteritems 3 37
view 0 0

Reshape / Join / Concat...

Object Kaggle calls Docs views
get_dummies 258 152
crosstab 58 40
concat 432 315
merge_asof 0 16
merge_ordered 0 4
wide_to_long 0 7
pivot 29 95
pivot_table 54 144
join 159 225
melt 18 75
stack 0 36
transpose 9 76
assign 19 74
insert 17 57
merge 425 413
drop 1618 625
explode 0 0
align 3 10
append 439 515
T 55 6
unstack 17 58
repeat 0 5
ravel 0 5

Group

Object Kaggle calls Docs views
agg 0 16
aggregate 3 58
groupby 2267 719

Window

Object Kaggle calls Docs views
cummax 0 2
cummin 0 0
cumprod 0 5
cumsum 8 29
pct_change 0 34
rolling 42 140
ewm 0 33
expanding 0 11
duplicated 14 90
diff 1 54
