|
1 | | -# Quickly inspect Pandas DataFrames and Series with Panda-Helper data profiles |
2 | | -- Perform initial data exploration |
3 | | -- Detect data issues and help with quality control |
| 1 | +# Panda-Helper: Quickly and easily inspect data |
| 2 | +Panda-Helper creates data profiles for Pandas DataFrames and Series
4 | 3 |
|
5 | | -### DataFrameProfile: |
6 | | -- Reports DataFrame shape, Series names, and Series data types |
7 | | -- Checks for obvious duplicates |
8 | | -- Provides distribution statistics on null values per row |
| 4 | +Assess data quality and usefulness with minimal effort |
9 | 5 |
|
10 | | -``` |
11 | | -DataFrameProfile(df) |
12 | | -``` |
13 | | - |
| 6 | +Effortlessly perform initial data exploration, _so you can move on to more in-depth analysis_ |
14 | 7 |
|
| 8 | +----- |
| 9 | +### DataFrame profiles quickly and easily: |
| 10 | +- Report shape |
| 11 | +- Detect duplicated rows |
| 12 | +- Display series names and data types |
| 13 | +- Provide distribution statistics on null values per row, giving a view of data completeness
15 | 14 |
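To make the bullet list above concrete, here is a minimal sketch of how the same DataFrame-level figures can be computed with plain pandas. This is not necessarily how Panda-Helper computes them internally; it is an illustration that assumes the small sample data from the README's usage example, and the variable names are hypothetical.

```python
import pandas as pd

# Same small sample data as the README's usage example.
df = pd.DataFrame(
    {
        "user_id": [1, 2, 3, 4, 4],
        "transaction": ["purchase", "return", "purchase", "exchange", "exchange"],
        "amount": [100.00, None, 1400.00, 85.12, 85.12],
        "survey": [None, None, None, "online", "online"],
    }
)

# Shape: (rows, columns).
shape = df.shape

# Rows that exactly repeat an earlier row.
duplicated_rows = int(df.duplicated().sum())

# Series (column) names and data types.
dtypes = df.dtypes

# Number of null values in each row, then distribution statistics over rows.
nulls_per_row = df.isna().sum(axis=1)
null_summary = nulls_per_row.describe(
    percentiles=[0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99]
)

print(shape, duplicated_rows)
print(dtypes)
print(null_summary)
```

The mean of `nulls_per_row` here is 0.8 and the maximum is 2, which agrees with the "Summary of Nulls Per Row" table shown for this data in the usage section.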
|
16 | | -### SeriesProfile: |
17 | | -- Reports data type, number of unique values, and number of null values |
18 | | -- Displays a frequency table of the most and least common values |
19 | | -- Provides distribution statistics (for numeric data) |
| 15 | +__Sample DataFrame profile__<br> |
| 16 | +_Vehicles passing through toll stations_ |
20 | 17 |
|
21 | | -#### Catgorical data |
22 | | -``` |
23 | | -SeriesProfile(df["Direction"]) |
24 | | -``` |
25 | | - |
| 18 | + DataFrame-Level Info |
| 19 | + ------------------------- ------------ |
| 20 | + DF Shape (1586280, 6) |
| 21 | + Duplicated Rows 2184 |
| 22 | + |
| 23 | + Column Name Data Type |
| 24 | + -------------------------- ----------- |
| 25 | + Plaza ID int64 |
| 26 | + Date object |
| 27 | + Hour int64 |
| 28 | + Direction object |
| 29 | + # Vehicles - ETC (E-ZPass) int64 |
| 30 | + # Vehicles - Cash/VToll int64 |
| 31 | + |
| 32 | + Summary of Nulls Per Row |
| 33 | + -------------------------- ----------- |
| 34 | + count 1.58628e+06 |
| 35 | + min 0 |
| 36 | + 1% 0 |
| 37 | + 5% 0 |
| 38 | + 25% 0 |
| 39 | + 50% 0 |
| 40 | + 75% 0 |
| 41 | + 95% 0 |
| 42 | + 99% 0 |
| 43 | + max 0 |
| 44 | + median 0 |
| 45 | + mean 0 |
| 46 | + median absolute deviation 0 |
| 47 | + standard deviation 0 |
| 48 | + skew 0 |
26 | 49 |
|
| 50 | +----- |
| 51 | +### Series profiles quickly and easily report the: |
| 52 | +- Series data type |
| 53 | +- Count of non-null values in the series |
| 54 | +- Number of unique values |
| 55 | +- Count of null values |
| 56 | +- Counts and frequency of the most and least common values |
| 57 | +- Distribution statistics for numeric data |
27 | 58 |
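The Series-level figures listed above can be approximated with plain pandas as well. The sketch below is an assumption about one reasonable way to compute them, not Panda-Helper's actual implementation; it uses the `amount` values from the README's usage example, and it assumes percentages are taken relative to the non-null count, which matches the sample output for that series.

```python
import pandas as pd

# The "amount" column from the README's usage example.
amount = pd.Series([100.00, None, 1400.00, 85.12, 85.12], name="amount")

dtype = amount.dtype
count = int(amount.count())            # non-null values
unique_values = int(amount.nunique())  # distinct non-null values
null_values = int(amount.isna().sum())

# Frequency table: counts and share of the non-null total.
counts = amount.value_counts()
freq = pd.DataFrame(
    {"Count": counts, "% of total": (counts / count * 100).round(2)}
)

# Distribution statistics for numeric data.
stats = amount.describe(percentiles=[0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99])
median = amount.median()
mad = (amount - median).abs().median()  # median absolute deviation
skew = amount.skew()

print(dtype, count, unique_values, null_values)
print(freq)
print(stats)
print(median, mad, skew)
```

For a large Series, showing only `counts.head(n)` and `counts.tail(m)` gives the most and least common values without printing the whole table.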
|
28 | | -#### Numeric data |
29 | | -``` |
30 | | -SeriesProfile(df["# Vehicles - ETC (E-ZPass)"]) |
31 | | -``` |
32 | | - |
| 59 | +__Sample profile of categorical data__<br> |
| 60 | +_Direction vehicles are traveling_ |
33 | 61 |
|
| 62 | + Direction Info |
| 63 | + ---------------- ------- |
| 64 | + Data Type object |
| 65 | + Count 1586280 |
| 66 | + Unique Values 2 |
| 67 | + Null Values 0 |
| 68 | + |
| 69 | + Value Count % of total |
| 70 | + ------- ------- ------------ |
| 71 | + I 814100 51.32% |
| 72 | + O 772180 48.68% |
34 | 73 |
|
35 | | -### Using Panda-Helper |
36 | | -- Note that Panda-Helper is not currently a package |
37 | | -- Install any required dependencies to your environment of choice |
38 | | -- Copy `reports.py` (in `src/pandahelper` directory) and incorporate into your analyses |
39 | | -- Cite this repo or let me know if this is helpful |
| 74 | +__Sample profile of numeric data__<br> |
| 75 | +_Hourly vehicle counts at tolling points_ |
40 | 76 |
|
| 77 | + # Vehicles - ETC (E-ZPass) Info |
| 78 | + --------------------------------- ------- |
| 79 | + Data Type int64 |
| 80 | + Count 1586280 |
| 81 | + Unique Values 8987 |
| 82 | + Null Values 0 |
| 83 | + |
| 84 | + Value Count % of total |
| 85 | + ------- ------- ------------ |
| 86 | + 0 3137 0.20% |
| 87 | + 43 1762 0.11% |
| 88 | + 44 1743 0.11% |
| 89 | + 40 1712 0.11% |
| 90 | + 42 1699 0.11% |
| 91 | + 41 1682 0.11% |
| 92 | + 39 1676 0.11% |
| 93 | + 37 1673 0.11% |
| 94 | + 48 1659 0.10% |
| 95 | + 46 1654 0.10% |
| 96 | + 38 1646 0.10% |
| 97 | + 45 1641 0.10% |
| 98 | + 36 1636 0.10% |
| 99 | + 52 1574 0.10% |
| 100 | + 47 1572 0.10% |
| 101 | + 50 1571 0.10% |
| 102 | + 51 1555 0.10% |
| 103 | + 53 1547 0.10% |
| 104 | + 55 1543 0.10% |
| 105 | + 34 1534 0.10% |
| 106 | + 8269 1 0.00% |
| 107 | + 8438 1 0.00% |
| 108 | + 8876 1 0.00% |
| 109 | + 8261 1 0.00% |
| 110 | + 8694 1 0.00% |
| 111 | + |
| 112 | + Statistic Value |
| 113 | + ------------------------- --------------- |
| 114 | + count 1.58628e+06 |
| 115 | + min 0 |
| 116 | + 1% 25 |
| 117 | + 5% 68 |
| 118 | + 25% 407 |
| 119 | + 50% 1054 |
| 120 | + 75% 2071 |
| 121 | + 95% 3583 |
| 122 | + 99% 6308 |
| 123 | + max 16854 |
| 124 | + median 1054 |
| 125 | + mean 1373.16 |
| 126 | + median absolute deviation 751 |
| 127 | + standard deviation 1253.1 |
| 128 | + skew 1.69154 |
41 | 129 |
|
42 | | -<br><br>Demonstration data obtained from: <br> |
43 | | -https://data.ny.gov/Transportation/Hourly-Traffic-on-Metropolitan-Transportation-Auth/qzve-kjga/data |
| 130 | +----- |
| 131 | +### Installing Panda-Helper |
| 132 | +`pip install panda-helper` |
44 | 133 |
|
| 134 | +----- |
| 135 | +### Using Panda-Helper |
| 136 | +__Profiling a DataFrame__<br> |
| 137 | +Create the DataFrameProfile, then display it or save it.
| 138 | +```python |
| 139 | +import pandas as pd |
| 140 | +import pandahelper.reports as ph |
| 141 | + |
| 142 | +data = { |
| 143 | + "user_id": [1, 2, 3, 4, 4], |
| 144 | + "transaction": ["purchase", "return", "purchase", "exchange", "exchange"], |
| 145 | + "amount": [100.00, None, 1400.00, 85.12, 85.12], |
| 146 | + "survey": [None, None, None, "online", "online"], |
| 147 | +} |
| 148 | +df = pd.DataFrame(data) |
| 149 | +df_profile = ph.DataFrameProfile(df) |
| 150 | +df_profile |
| 151 | +``` |
45 | 152 |
|
46 | | -<br><br>Test data obtained from: <br> |
47 | | -https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95 |
| 153 | + DataFrame-Level Info |
| 154 | + ------------------------- ------ |
| 155 | + DF Shape (5, 4) |
| 156 | + Obviously Duplicated Rows 1 |
| 157 | + |
| 158 | + Column Name Data Type |
| 159 | + ------------- ----------- |
| 160 | + user_id int64 |
| 161 | + transaction object |
| 162 | + amount float64 |
| 163 | + survey object |
| 164 | + |
| 165 | + Summary of Nulls Per Row |
| 166 | + -------------------------- -------- |
| 167 | + count 5 |
| 168 | + min 0 |
| 169 | + 1% 0 |
| 170 | + 5% 0 |
| 171 | + 25% 0 |
| 172 | + 50% 1 |
| 173 | + 75% 1 |
| 174 | + 95% 1.8 |
| 175 | + 99% 1.96 |
| 176 | + max 2 |
| 177 | + median 1 |
| 178 | + mean 0.8 |
| 179 | + median absolute deviation 1 |
| 180 | + standard deviation 0.83666 |
| 181 | + skew 0.512241 |
| 182 | + |
| 183 | +```python |
| 184 | +df_profile.save_report("df_profile.txt") |
| 185 | +``` |
| 186 | + |
| 187 | +__Profiling a Series__<br> |
| 188 | +Create the SeriesProfile and then display it or save it. That's it! |
| 189 | +```python |
| 190 | +series_profile = ph.SeriesProfile(df["amount"]) |
| 191 | +series_profile |
| 192 | +``` |
| 193 | + amount Info |
| 194 | + ------------- ------- |
| 195 | + Data Type float64 |
| 196 | + Count 4 |
| 197 | + Unique Values 3 |
| 198 | + Null Values 1 |
| 199 | + |
| 200 | + Value Count % of total |
| 201 | + ------- ------- ------------ |
| 202 | + 85.12 2 50.00% |
| 203 | + 100 1 25.00% |
| 204 | + 1400 1 25.00% |
| 205 | + |
| 206 | + Statistic Value |
| 207 | + ------------------------- ---------- |
| 208 | + count 4 |
| 209 | + min 85.12 |
| 210 | + 1% 85.12 |
| 211 | + 5% 85.12 |
| 212 | + 25% 85.12 |
| 213 | + 50% 92.56 |
| 214 | + 75% 425 |
| 215 | + 95% 1205 |
| 216 | + 99% 1361 |
| 217 | + max 1400 |
| 218 | + median 92.56 |
| 219 | + mean 417.56 |
| 220 | + median absolute deviation 7.44 |
| 221 | + standard deviation 654.998 |
| 222 | + skew 1.99931 |
| 223 | + |
| 224 | +```python |
| 225 | +series_profile.save_report("amount_profile.txt") |
| 226 | +``` |
| 227 | +-----
| 228 | +### Sample data obtained from: |
| 229 | +- https://data.ny.gov/Transportation/Hourly-Traffic-on-Metropolitan-Transportation-Auth/qzve-kjga/data |
| 230 | +- https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95 |