Commit 8e4cbe1

Merge pull request #6 from ray310/package
Prepared for package distribution
2 parents 9505f5d + 6f0be82 commit 8e4cbe1

14 files changed: +438 -66 lines changed

.pylintrc

Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
+# Pylint configuration settings
+
+[MASTER]
+fail-under=9.0
+jobs=0
+
+[MESSAGES CONTROL]
+disable=raw-checker-failed,
+bad-inline-option,
+locally-disabled,
+file-ignored,
+suppressed-message,
+useless-suppression,
+deprecated-pragma,
+use-symbolic-message-instead
+
+
+[REPORTS]
+output-format=colorized
+
+[BASIC]
+good-names=i,
+j,
+k,
+s,
+x,
+y,
+z,
+df,
+fh,
+_
+
+[FORMAT]
+max-line-length=88
+
+[STRING]
+check-quote-consistency=yes
+
+[DESIGN]
+max-args=5

CHANGELOG.md

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+# Changelog
+
+## Unreleased
+### Added
+- Improved documentation
+____
+## 0.0.1 - 2022-06-04
+### Added
+- First version of Panda-Helper

LICENSE

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 MIT License
 
-Copyright (c) 2021 Ray310
+Copyright (c) 2022 Ray310
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

MANIFEST.in

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+recursive-include tests *.py *.txt *csv
+include requirements.txt
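
Note on the `recursive-include` line: the third pattern, `*csv`, has no dot, unlike `*.py` and `*.txt`. MANIFEST.in patterns are glob-style, so `*csv` matches any file name that ends in `csv`, not only files with a `.csv` extension. The sketch below illustrates this with Python's `fnmatch`, used here as a stand-in for the packaging tool's own matcher (an assumption). The file names are hypothetical and do not come from the repository.

```python
from fnmatch import fnmatch

# Patterns copied from the recursive-include line in MANIFEST.in.
patterns = ["*.py", "*.txt", "*csv"]

# Hypothetical file names, used only to illustrate the matching.
candidates = ["test_reports.py", "expected.txt", "sample.csv", "notes_csv", "image.png"]


def is_included(name):
    """Return True if any pattern matches the bare file name."""
    return any(fnmatch(name, pattern) for pattern in patterns)


included = [name for name in candidates if is_included(name)]
print(included)
```

In this sketch `notes_csv` is included alongside `sample.csv`, while `image.png` is not.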

README.md

Lines changed: 217 additions & 34 deletions
@@ -1,47 +1,230 @@
-# Quickly inspect Pandas DataFrames and Series with Panda-Helper data profiles
-- Perform initial data exploration
-- Detect data issues and help with quality control
+# Panda-Helper: Quickly and easily inspect data
+Panda-Helper creates data profiles for data in Pandas DataFrames and Series
 
-### DataFrameProfile:
-- Reports DataFrame shape, Series names, and Series data types
-- Checks for obvious duplicates
-- Provides distribution statistics on null values per row
+Assess data quality and usefulness with minimal effort
 
-```
-DataFrameProfile(df)
-```
-![Sample DataFrameProfile](https://github.com/ray310/Panda-Helper/blob/main/images/df_profile.png)
+Effortlessly perform initial data exploration, _so you can move on to more in-depth analysis_
 
+-----
+### DataFrame profiles quickly and easily:
+- Report shape
+- Detect duplicated rows
+- Display series names and data types
+- Provide distribution statistics on null values per row providing a view on data completeness
 
-### SeriesProfile:
-- Reports data type, number of unique values, and number of null values
-- Displays a frequency table of the most and least common values
-- Provides distribution statistics (for numeric data)
+__Sample DataFrame profile__<br>
+_Vehicles passing through toll stations_
 
-#### Catgorical data
-```
-SeriesProfile(df["Direction"])
-```
-![Sample Categorical SeriesProfile](https://github.com/ray310/Panda-Helper/blob/main/images/series_profile_direction.png)
+DataFrame-Level Info
+------------------------- ------------
+DF Shape (1586280, 6)
+Duplicated Rows 2184
+
+Column Name Data Type
+-------------------------- -----------
+Plaza ID int64
+Date object
+Hour int64
+Direction object
+# Vehicles - ETC (E-ZPass) int64
+# Vehicles - Cash/VToll int64
+
+Summary of Nulls Per Row
+-------------------------- -----------
+count 1.58628e+06
+min 0
+1% 0
+5% 0
+25% 0
+50% 0
+75% 0
+95% 0
+99% 0
+max 0
+median 0
+mean 0
+median absolute deviation 0
+standard deviation 0
+skew 0
 
+-----
+### Series profiles quickly and easily report the:
+- Series data type
+- Count of non-null values in the series
+- Number of unique values
+- Count of null values
+- Counts and frequency of the most and least common values
+- Distribution statistics for numeric data
 
-#### Numeric data
-```
-SeriesProfile(df["# Vehicles - ETC (E-ZPass)"])
-```
-![Sample Numeric SeriesProfile](https://github.com/ray310/Panda-Helper/blob/main/images/series_profile_ez.png)
+__Sample profile of categorical data__<br>
+_Direction vehicles are traveling_
 
+Direction Info
+---------------- -------
+Data Type object
+Count 1586280
+Unique Values 2
+Null Values 0
+
+Value Count % of total
+------- ------- ------------
+I 814100 51.32%
+O 772180 48.68%
 
-### Using Panda-Helper
-- Note that Panda-Helper is not currently a package
-- Install any required dependencies to your environment of choice
-- Copy `reports.py` (in `src/pandahelper` directory) and incorporate into your analyses
-- Cite this repo or let me know if this is helpful
+__Sample profile of numeric data__<br>
+_Hourly vehicle counts at tolling points_
 
+# Vehicles - ETC (E-ZPass) Info
+--------------------------------- -------
+Data Type int64
+Count 1586280
+Unique Values 8987
+Null Values 0
+
+Value Count % of total
+------- ------- ------------
+0 3137 0.20%
+43 1762 0.11%
+44 1743 0.11%
+40 1712 0.11%
+42 1699 0.11%
+41 1682 0.11%
+39 1676 0.11%
+37 1673 0.11%
+48 1659 0.10%
+46 1654 0.10%
+38 1646 0.10%
+45 1641 0.10%
+36 1636 0.10%
+52 1574 0.10%
+47 1572 0.10%
+50 1571 0.10%
+51 1555 0.10%
+53 1547 0.10%
+55 1543 0.10%
+34 1534 0.10%
+8269 1 0.00%
+8438 1 0.00%
+8876 1 0.00%
+8261 1 0.00%
+8694 1 0.00%
+
+Statistic Value
+------------------------- ---------------
+count 1.58628e+06
+min 0
+1% 25
+5% 68
+25% 407
+50% 1054
+75% 2071
+95% 3583
+99% 6308
+max 16854
+median 1054
+mean 1373.16
+median absolute deviation 751
+standard deviation 1253.1
+skew 1.69154
 
-<br><br>Demonstration data obtained from: <br>
-https://data.ny.gov/Transportation/Hourly-Traffic-on-Metropolitan-Transportation-Auth/qzve-kjga/data
+-----
+### Installing Panda-Helper
+`pip install panda-helper`
 
+-----
+### Using Panda-Helper
+__Profiling a DataFrame__<br>
+Create the DataFrameProfile and then display it or save the profile.
+```python
+import pandas as pd
+import pandahelper.reports as ph
+
+data = {
+"user_id": [1, 2, 3, 4, 4],
+"transaction": ["purchase", "return", "purchase", "exchange", "exchange"],
+"amount": [100.00, None, 1400.00, 85.12, 85.12],
+"survey": [None, None, None, "online", "online"],
+}
+df = pd.DataFrame(data)
+df_profile = ph.DataFrameProfile(df)
+df_profile
+```
 
-<br><br>Test data obtained from: <br>
-https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95
+DataFrame-Level Info
+------------------------- ------
+DF Shape (5, 4)
+Obviously Duplicated Rows 1
+
+Column Name Data Type
+------------- -----------
+user_id int64
+transaction object
+amount float64
+survey object
+
+Summary of Nulls Per Row
+-------------------------- --------
+count 5
+min 0
+1% 0
+5% 0
+25% 0
+50% 1
+75% 1
+95% 1.8
+99% 1.96
+max 2
+median 1
+mean 0.8
+median absolute deviation 1
+standard deviation 0.83666
+skew 0.512241
+
+```python
+df_profile.save_report("df_profile.txt")
+```
+
+__Profiling a Series__<br>
+Create the SeriesProfile and then display it or save it. That's it!
+```python
+series_profile = ph.SeriesProfile(df["amount"])
+series_profile
+```
+amount Info
+------------- -------
+Data Type float64
+Count 4
+Unique Values 3
+Null Values 1
+
+Value Count % of total
+------- ------- ------------
+85.12 2 50.00%
+100 1 25.00%
+1400 1 25.00%
+
+Statistic Value
+------------------------- ----------
+count 4
+min 85.12
+1% 85.12
+5% 85.12
+25% 85.12
+50% 92.56
+75% 425
+95% 1205
+99% 1361
+max 1400
+median 92.56
+mean 417.56
+median absolute deviation 7.44
+standard deviation 654.998
+skew 1.99931
+
+```python
+series_profile.save_report("amount_profile.txt")
+```
+____
+### Sample data obtained from:
+- https://data.ny.gov/Transportation/Hourly-Traffic-on-Metropolitan-Transportation-Auth/qzve-kjga/data
+- https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95
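
Note on the new README example: the headline figures in the five-row sample profile (shape, duplicated rows, and the mean and standard deviation of nulls per row) can be checked by hand. The sketch below recomputes them with the Python standard library only. It is an independent illustration of what those figures mean, not Panda-Helper's implementation, and it assumes the reported standard deviation is the sample (n - 1) form.

```python
import statistics

# Sample data copied from the README usage example.
data = {
    "user_id": [1, 2, 3, 4, 4],
    "transaction": ["purchase", "return", "purchase", "exchange", "exchange"],
    "amount": [100.00, None, 1400.00, 85.12, 85.12],
    "survey": [None, None, None, "online", "online"],
}

# Turn the column-oriented dict into a list of row tuples.
rows = list(zip(*data.values()))

# Nulls per row: count the None entries in each row.
nulls_per_row = [sum(value is None for value in row) for row in rows]

# Duplicated rows: each repeat of a row already seen counts once.
seen = set()
duplicated_rows = 0
for row in rows:
    if row in seen:
        duplicated_rows += 1
    seen.add(row)

shape = (len(rows), len(data))
mean_nulls = statistics.mean(nulls_per_row)
stdev_nulls = statistics.stdev(nulls_per_row)  # sample standard deviation (assumed)

print(shape, duplicated_rows, nulls_per_row)
print(mean_nulls, round(stdev_nulls, 5))
```

These values agree with the `DF Shape`, `Obviously Duplicated Rows`, `mean`, and `standard deviation` entries in the README's sample output.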

conda_environment_dev.yaml

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+name: panda_helper
+channels:
+  - defaults
+  - conda-forge
+dependencies:
+  - python=3.9
+  - black
+  - build
+  - coverage
+  - jupyter
+  - pandas
+  - pip
+  - pydocstyle
+  - pylint
+  - pytest
+  - notebook
+  - scipy
+  - twine
+  - pip:
+    - tabulate

conda_requirements.yaml

Lines changed: 0 additions & 8 deletions
This file was deleted.

images/df_profile.png

-10.1 KB
Binary file not shown.
-4.07 KB
Binary file not shown.

images/series_profile_ez.png

-15.4 KB
Binary file not shown.
