Commit 7ccb70c

Merge pull request #31 from ray310/010
010
2 parents 2b8152d + 13fc513 commit 7ccb70c

26 files changed, +1332 -1047 lines changed

.github/workflows/deploy_page.yml

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
+name: Deploy Project Site
+on:
+  push:
+    branches:
+      - main
+permissions:
+  contents: write
+jobs:
+  deploy:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Configure Git Credentials
+        run: |
+          git config user.name github-actions[bot]
+          git config user.email 41898282+github-actions[bot]@users.noreply.github.com
+      - uses: actions/setup-python@v5
+        with:
+          python-version: 3.x
+      - run: echo "cache_id=$(date --utc '+%V')" >> $GITHUB_ENV
+      - uses: actions/cache@v4
+        with:
+          key: mkdocs-material-${{ env.cache_id }}
+          path: .cache
+          restore-keys: |
+            mkdocs-material-
+      - run: pip install mkdocs-material mkdocstrings mkdocstrings-python
+      - run: mkdocs gh-deploy --force
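
Note on the cache key: the workflow derives `cache_id` from `date --utc '+%V'`, the ISO 8601 week number, so the key changes once a week while `restore-keys` still allows falling back to an older `mkdocs-material-` cache. A minimal Python sketch of the same computation follows; the helper name `weekly_cache_key` is hypothetical and not part of this commit.

```python
from datetime import datetime, timezone


def weekly_cache_key(now=None, prefix="mkdocs-material-"):
    """Build a cache key that changes once per ISO 8601 week (hypothetical helper)."""
    if now is None:
        now = datetime.now(timezone.utc)
    # isocalendar()[1] is the ISO week number, the value `date '+%V'` prints
    week = now.isocalendar()[1]
    return f"{prefix}{week:02d}"


# Sunday 2024-07-14 and Monday 2024-07-15 fall in different ISO weeks
print(weekly_cache_key(datetime(2024, 7, 14, tzinfo=timezone.utc)))
print(weekly_cache_key(datetime(2024, 7, 15, tzinfo=timezone.utc)))
```

ISO weeks start on Monday, so the two dates above produce different keys.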

.gitignore

Lines changed: 4 additions & 2 deletions
@@ -1,19 +1,21 @@
 # file types
 *.docx
 *.env
-*.ipynb
 *.pages
 
 # patterns
 *.egg-info*
+*ipynb*
+conda_environment_dev_*
 
 # folders
 .coverage
 .idea
 data
-dev
+notes
 dist
 htmlcov
+site
 __pycache__
 
 # files
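
Note on the pattern change: replacing `*.ipynb` with `*ipynb*` widens the rule, so names such as `.ipynb_checkpoints` are ignored as well. The sketch below uses Python's `fnmatch` as a rough approximation of gitignore basename matching. Real gitignore rules have extra semantics for slashes and negation that are not modelled here, and the file names are invented for illustration.

```python
from fnmatch import fnmatchcase

# Invented file names, used only to compare the two patterns
names = ["analysis.ipynb", ".ipynb_checkpoints", "ipynb_notes.txt", "report.md"]


def matches(pattern):
    """Return the names that the glob-style pattern matches."""
    return [name for name in names if fnmatchcase(name, pattern)]


print(matches("*.ipynb"))  # old rule: only names ending in .ipynb
print(matches("*ipynb*"))  # new rule: any name containing "ipynb"
```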

CHANGELOG.md

Lines changed: 11 additions & 3 deletions
@@ -1,14 +1,22 @@
 # Changelog
 
 ## Unreleased
+
+## 0.1.1 - Unreleased
 ### Added
 - functionality to detect time series gaps
 
 ____
-## 0.1.0 - Unreleased
+## 0.1.0 - 2024-07-14
 ### Added
-- Split reports module
-- Improved project documentation
+- Add memory usage to `DataFrameProfile` [gh-30](https://github.com/ray310/Panda-Helper/issues/30)
+- Improve formatting of `distribution_stats` function output [gh-29](https://github.com/ray310/Panda-Helper/issues/29)
+- Improved project documentation with project website [gh-2](https://github.com/ray310/Panda-Helper/issues/2)
+
+### Changed
+- [Split reports module into `profiles` and `stats`](https://github.com/ray310/Panda-Helper/commit/93320860834e757ab18d86c2b9334efb05738662)
+- [Renamed `save_report` method to `save`](https://github.com/ray310/Panda-Helper/commit/876c5f5af8906081f96aff1f1f0ba9d5754a719a)
+- [Refactored tests to use pytest fixtures](https://github.com/ray310/Panda-Helper/commit/ff2bf2dd6e73dd4747b62faef4bd350949866a91)
 
 ____
 ## 0.0.4 - 2024-07-09
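
Note on the rename: since 0.1.0 renames `save_report` to `save`, code that must run against both older and newer releases could dispatch on whichever method is present. The sketch below assumes only the rename recorded in this changelog. `save_profile` and the two stand-in classes are hypothetical, and the stand-ins return strings purely so the example is self-contained.

```python
def save_profile(profile, path):
    """Call `save` when available, otherwise fall back to the older `save_report`."""
    saver = getattr(profile, "save", None) or getattr(profile, "save_report")
    return saver(path)


class NewStyleProfile:
    """Stand-in for a profile object exposing the renamed method."""

    def save(self, path):
        return f"save:{path}"


class OldStyleProfile:
    """Stand-in for a profile object exposing the pre-rename method."""

    def save_report(self, path):
        return f"save_report:{path}"


print(save_profile(NewStyleProfile(), "df_profile.txt"))
print(save_profile(OldStyleProfile(), "df_profile.txt"))
```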

README.md

Lines changed: 2 additions & 224 deletions
@@ -5,232 +5,10 @@
 ![Lint/Format Status](https://github.com/ray310/Panda-Helper/actions/workflows/format_lint.yml/badge.svg)
 
 # Panda-Helper: Quickly and easily inspect data
-Panda-Helper is a simple data-profiling utility for Pandas' DataFrames and Series.
+Panda-Helper is a simple, open-source, Python data-profiling utility for Pandas' DataFrames and Series.
 
 Assess data quality and usefulness with minimal effort.
 
 Quickly perform initial data exploration, _so you can move on to more in-depth analysis_.
 
------
-### DataFrame profiles:
-- Report shape
-- Detect duplicated rows
-- Display series names and data types
-- Calculate distribution statistics on null values per row providing a view on data completeness
-
-__Sample DataFrame profile__<br>
-_Vehicles passing through toll stations_
-
-DataFrame-Level Info
--------------------------  ------------
-DF Shape                   (1586280, 6)
-Duplicated Rows            2184
-
-Column Name                 Data Type
---------------------------  -----------
-Plaza ID                    int64
-Date                        object
-Hour                        int64
-Direction                   object
-# Vehicles - ETC (E-ZPass)  int64
-# Vehicles - Cash/VToll     int64
-
-Summary of Nulls Per Row
---------------------------  -----------
-count                       1.58628e+06
-min                         0
-1%                          0
-5%                          0
-25%                         0
-50%                         0
-75%                         0
-95%                         0
-99%                         0
-max                         0
-median                      0
-mean                        0
-median absolute deviation   0
-standard deviation          0
-skew                        0
-
------
-### Series profiles report the:
-- Series data type
-- Count of non-null values in the series
-- Number of unique values
-- Count of null values
-- Counts and frequency of the most and least common values
-- Distribution statistics for numeric-like data
-
-__Sample profile of categorical data__<br>
-_Direction vehicles are traveling_
-
-Direction Info
-----------------  -------
-Data Type         object
-Count             1586280
-Unique Values     2
-Null Values       0
-
-Value    Count    % of total
--------  -------  ------------
-I        814100   51.32%
-O        772180   48.68%
-
-__Sample profile of numeric data__<br>
-_Hourly vehicle counts at tolling points_
-
-# Vehicles - ETC (E-ZPass) Info
----------------------------------  -------
-Data Type                          int64
-Count                              1586280
-Unique Values                      8987
-Null Values                        0
-
-Value    Count    % of total
--------  -------  ------------
-0        3137     0.20%
-43       1762     0.11%
-44       1743     0.11%
-40       1712     0.11%
-42       1699     0.11%
-41       1682     0.11%
-39       1676     0.11%
-37       1673     0.11%
-48       1659     0.10%
-46       1654     0.10%
-38       1646     0.10%
-45       1641     0.10%
-36       1636     0.10%
-52       1574     0.10%
-47       1572     0.10%
-50       1571     0.10%
-51       1555     0.10%
-53       1547     0.10%
-55       1543     0.10%
-34       1534     0.10%
-8269     1        0.00%
-8438     1        0.00%
-8876     1        0.00%
-8261     1        0.00%
-8694     1        0.00%
-
-Statistic                  Value
--------------------------  ---------------
-count                      1.58628e+06
-min                        0
-1%                         25
-5%                         68
-25%                        407
-50%                        1054
-75%                        2071
-95%                        3583
-99%                        6308
-max                        16854
-median                     1054
-mean                       1373.16
-median absolute deviation  751
-standard deviation         1253.1
-skew                       1.69154
-
------
-### Installing Panda-Helper
-`pip install panda-helper`
-
------
-### Using Panda-Helper
-__Profiling a DataFrame__<br>
-Create the DataFrameProfile and then display it or save the profile.
-```python
-import pandas as pd
-import pandahelper as ph
-
-data = {
-    "user_id": [1, 2, 3, 4, 4],
-    "transaction": ["purchase", "return", "purchase", "exchange", "exchange"],
-    "amount": [100.00, None, 1400.00, 85.12, 85.12],
-    "survey": [None, None, None, "online", "online"],
-}
-df = pd.DataFrame(data)
-df_profile = ph.DataFrameProfile(df)
-df_profile
-```
-
-DataFrame-Level Info
--------------------------  ------
-DF Shape                   (5, 4)
-Obviously Duplicated Rows  1
-
-Column Name    Data Type
--------------  -----------
-user_id        int64
-transaction    object
-amount         float64
-survey         object
-
-Summary of Nulls Per Row
---------------------------  --------
-count                       5
-min                         0
-1%                          0
-5%                          0
-25%                         0
-50%                         1
-75%                         1
-95%                         1.8
-99%                         1.96
-max                         2
-median                      1
-mean                        0.8
-median absolute deviation   1
-standard deviation          0.83666
-skew                        0.512241
-
-```python
-df_profile.save_report("df_profile.txt")
-```
-
-__Profiling a Series__<br>
-Create the SeriesProfile and then display it or save it. That's it!
-```python
-series_profile = ph.SeriesProfile(df["amount"])
-series_profile
-```
-amount Info
--------------  -------
-Data Type      float64
-Count          4
-Unique Values  3
-Null Values    1
-
-Value    Count    % of total
--------  -------  ------------
-85.12    2        50.00%
-100      1        25.00%
-1400     1        25.00%
-
-Statistic                  Value
--------------------------  ----------
-count                      4
-min                        85.12
-1%                         85.12
-5%                         85.12
-25%                        85.12
-50%                        92.56
-75%                        425
-95%                        1205
-99%                        1361
-max                        1400
-median                     92.56
-mean                       417.56
-median absolute deviation  7.44
-standard deviation         654.998
-skew                       1.99931
-
-```python
-series_profile.save_report("amount_profile.txt")
-```
-____
-### Sample data obtained from:
-- https://data.ny.gov/Transportation/Hourly-Traffic-on-Metropolitan-Transportation-Auth/qzve-kjga/data
-- https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95
+Please see [project website](https://ray310.github.io/Panda-Helper/).

conda_environment_dev.yaml

Lines changed: 6 additions & 2 deletions
@@ -10,16 +10,20 @@ dependencies:
   - codespell
   - coverage=7.2.2
   - jupyter=1.0.0
-  - mkdocs
+  - lxml=5.2.1
+  - mkdocs=1.6.0
   - notebook=7.0.8
   - numpy=1.26.4
   - pandas=2.2.2
   - pip=24.0
   - pre-commit=3.4.0
-  - pydocstyle=6.3.0
   - pylint=3.2.2
   - pytest=7.4.4
+  - ruff=0.3.5
   - scipy=1.13.1
   - twine=4.0.2
   - pip:
+    - mkdocs-material==9.5.28
+    - mkdocstrings==0.25.1
+    - mkdocstrings-python==1.10.5
     - tabulate==0.9.0

docs/api.md

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+---
+description: Panda-Helper API Reference. Detailed description of the Panda-Helper API.
+---
+
+# API Reference
+::: pandahelper.profiles
+
+<br>
+
+::: pandahelper.stats
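
Note on the `:::` lines: they are mkdocstrings directives, which pull the docstrings of the named module into the rendered page. mkdocstrings-python parses Google-style docstrings by default, so a documented function would look roughly like the sketch below. `count_nulls` is a hypothetical illustration and not a function from `pandahelper`.

```python
def count_nulls(values):
    """Count the ``None`` entries in a sequence.

    Args:
        values: Any iterable of Python objects.

    Returns:
        The number of items that are ``None``.
    """
    return sum(1 for value in values if value is None)


print(count_nulls([100.0, None, 1400.0, 85.12, None]))
```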

docs/assets/images/panda.png

10.2 KB

docs/index.md

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
+---
+hide:
+  - navigation
+  - toc
+description: Panda-Helper Documentation. Panda-Helper is a simple, Python data-profiling tool for Pandas’ DataFrames and Series that allows you to assess data quality and usefulness with minimal effort.
+---
+
+# Panda-Helper
+___Quickly and easily inspect data so you can move on to more in-depth analysis.___
+
+Panda-Helper is a simple, open-source, Python data-profiling tool for Pandas’ DataFrames and Series
+that allows you to assess data quality and usefulness with minimal effort.
+
+<div class="grid cards" markdown>
+
+- [:material-hammer-wrench:{ .lg .middle } __Install Panda-Helper__](install.md)
+
+    ---
+
+    Install `pandahelper` with `pip` or `anaconda` and get up
+    and running in minutes
+
+- [:octicons-book-16:{ .lg .middle} __API Reference__](api.md)
+
+    ---
+
+    Detailed description of the Panda-Helper API
+
+- [:material-television-guide:{ .lg .middle } __User Guide__](user_guide.md)
+
+    ---
+
+    How to use Panda-Helper with examples
+
+- [:simple-github:{ .lg .middle } __Source Code__](https://github.com/ray310/Panda-Helper)
+
+    ---
+
+    Review, clone, or fork the source code
+
+</div>
