graph LR
Data_Ingestion_Preprocessing["Data Ingestion & Preprocessing"]
Data_Labeling_Module["Data Labeling Module"]
Data_Profiling_Engine["Data Profiling Engine"]
Reporting_Visualization["Reporting & Visualization"]
Data_Ingestion_Preprocessing -- "Provides Raw Data" --> Data_Labeling_Module
Data_Ingestion_Preprocessing -- "Feeds Preprocessed Data" --> Data_Profiling_Engine
Data_Labeling_Module -- "Enriches Profiled Data" --> Data_Profiling_Engine
Data_Profiling_Engine -- "Outputs Profiled Results" --> Reporting_Visualization
click Data_Ingestion_Preprocessing href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/DataProfiler/Data_Ingestion_Preprocessing.md" "Details"
click Data_Labeling_Module href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/DataProfiler/Data_Labeling_Module.md" "Details"
click Data_Profiling_Engine href "https://github.com/CodeBoarding/GeneratedOnBoardings/blob/main/DataProfiler/Data_Profiling_Engine.md" "Details"
DataProfiler is organized as a data processing pipeline that ingests, labels, profiles, and reports on datasets. The Data Ingestion & Preprocessing component is the entry point: it reads raw data in several formats and converts it to a standard structure. Its output goes to two consumers: the Data Labeling Module, which detects sensitive information, and the Data Profiling Engine, which computes statistics. The labeling results are then passed to the Data Profiling Engine, where they enrich the profile. Finally, the Data Profiling Engine hands its results to the Reporting & Visualization component, which renders reports and visual summaries. Each stage has a distinct boundary, so the flow of data from ingestion through analysis to reporting maps directly onto the diagram above.
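The flow described above can be sketched as four small functions wired together. This is a minimal illustration using only the standard library; the function names (`ingest`, `label`, `profile`, `report`) and the email pattern are hypothetical stand-ins for the four components, not DataProfiler's actual API.

```python
from collections import Counter
import re

# Hypothetical stand-ins for the four components; not DataProfiler's API.

def ingest(rows):
    """Ingestion: normalize raw records into a dict of column -> list of str."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(str(value).strip())
    return columns

# Deliberately loose pattern, for illustration only.
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def label(columns):
    """Labeling: tag a column EMAIL if every value looks like an email address."""
    return {
        name: "EMAIL" if values and all(EMAIL.match(v) for v in values) else "UNKNOWN"
        for name, values in columns.items()
    }

def profile(columns, labels):
    """Profiling: per-column statistics, enriched with the labeler's output."""
    return {
        name: {
            "count": len(values),
            "unique": len(set(values)),
            "most_common": Counter(values).most_common(1)[0][0],
            "label": labels[name],
        }
        for name, values in columns.items()
    }

def report(profiles):
    """Reporting: render one summary line per column, sorted by column name."""
    return "\n".join(
        f"{name}: count={p['count']} unique={p['unique']} label={p['label']}"
        for name, p in sorted(profiles.items())
    )

raw = [
    {"id": 1, "contact": "a@example.com "},
    {"id": 2, "contact": "b@example.com"},
]
columns = ingest(raw)
summary = report(profile(columns, label(columns)))
print(summary)
```

Note that the labels are computed first and then passed into `profile`, mirroring the "Enriches Profiled Data" edge in the diagram.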
Data Ingestion & Preprocessing
Responsible for reading raw data from diverse sources (CSV, JSON, Parquet, Text, Graph) and transforming it into a standardized, structured format (e.g., Pandas DataFrame) suitable for subsequent profiling and labeling.
Related Classes/Methods:
dataprofiler/data_readers/base_data.py
dataprofiler/data_readers/csv_data.py
dataprofiler/data_readers/json_data.py
dataprofiler/data_readers/parquet_data.py
dataprofiler/data_readers/text_data.py
dataprofiler/data_readers/graph_data.py
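One way to picture the "many formats in, one structure out" idea is a registry that picks a reader by file extension. The sketch below is an assumption-laden illustration using the standard `csv` and `json` modules; `READERS`, `load`, and the reader functions are hypothetical names and do not reflect how the listed reader modules are implemented.

```python
import csv
import io
import json

def read_csv(text):
    """Parse CSV text into a list of row dicts keyed by the header line."""
    return list(csv.DictReader(io.StringIO(text)))

def read_json(text):
    """Parse a JSON object or array of objects into a list of row dicts."""
    data = json.loads(text)
    return data if isinstance(data, list) else [data]

# Hypothetical registry mapping a file extension to its reader function.
READERS = {".csv": read_csv, ".json": read_json}

def load(filename, text):
    """Pick a reader from the file extension and return rows in one common shape."""
    for ext, reader in READERS.items():
        if filename.lower().endswith(ext):
            return reader(text)
    raise ValueError(f"no reader registered for {filename!r}")

csv_rows = load("people.csv", "name,age\nAda,36\n")
json_rows = load("people.json", '[{"name": "Ada", "age": "36"}]')
print(csv_rows == json_rows)
```

Because both readers return the same list-of-dicts shape, downstream code does not need to know which format the data came from.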
Data Labeling Module
Manages the end-to-end process of identifying and classifying sensitive or specific data elements. It orchestrates data preparation, model execution (deep learning, regex, column name), and result processing.
Related Classes/Methods:
dataprofiler/labelers/base_data_labeler.py
dataprofiler/labelers/character_level_cnn_model.py
dataprofiler/labelers/regex_model.py
dataprofiler/labelers/column_name_model.py
dataprofiler/labelers/data_processing.py
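To make the regex and column-name strategies mentioned above concrete, here is a small sketch that tries a value-based regex check first and falls back to a column-name hint. The patterns, the 0.8 threshold, and every function name are assumptions chosen for illustration; the deep learning path is omitted, and none of this is taken from the listed modules.

```python
import re

# Hypothetical label patterns; a real rule set would be far more thorough.
VALUE_PATTERNS = {
    "EMAIL": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "PHONE": re.compile(r"^\+?\d[\d\-\s]{6,14}\d$"),
}
# Hypothetical keywords that suggest a label from the column name alone.
NAME_HINTS = {"email": "EMAIL", "phone": "PHONE"}

def label_by_values(values, threshold=0.8):
    """Return the first label whose pattern matches at least `threshold` of the values."""
    for label, pattern in VALUE_PATTERNS.items():
        hits = sum(1 for v in values if pattern.match(v))
        if values and hits / len(values) >= threshold:
            return label
    return None

def label_by_name(column_name):
    """Return a label suggested by a keyword in the column name, if any."""
    lowered = column_name.lower()
    for hint, label in NAME_HINTS.items():
        if hint in lowered:
            return label
    return None

def label_column(column_name, values):
    """Prefer evidence from the values; fall back to the column name."""
    return label_by_values(values) or label_by_name(column_name) or "UNKNOWN"

print(label_column("contact", ["a@example.com", "b@example.org"]))
print(label_column("phone_number", ["n/a", ""]))
```

Checking values before names is a design choice in this sketch: a column called `email` that actually holds free text would otherwise be mislabeled.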
Data Profiling Engine
The central orchestrator for data profiling. It coordinates with specialized column profilers to extract various statistics and insights, applying user-defined configurations to generate comprehensive data profiles. This component internally manages different column-specific profilers and profiling options.
Related Classes/Methods:
dataprofiler/profilers/profile_builder.py
dataprofiler/profilers/categorical_column_profile.py
dataprofiler/profilers/data_labeler_column_profile.py
dataprofiler/profilers/numerical_column_stats.py
dataprofiler/profilers/text_column_profile.py
dataprofiler/profilers/unstructured_text_profile.py
dataprofiler/profilers/graph_profiler.py
dataprofiler/profilers/datetime_column_profile.py
dataprofiler/profilers/float_column_profile.py
dataprofiler/profilers/int_column_profile.py
dataprofiler/profilers/profiler_options.py
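The orchestration role described above, where one builder delegates to column-specific profilers and honors user options, might look roughly like the following. This is a simplified sketch with only two column types; `build_profile`, the `enabled` option key, and the profiler functions are hypothetical and are not the interfaces defined in the listed modules.

```python
import statistics
from collections import Counter

def is_int(value):
    """Return True if the string parses as an integer."""
    try:
        int(value)
        return True
    except ValueError:
        return False

def profile_int_column(values):
    """Numeric statistics for a column whose values all parse as integers."""
    numbers = [int(v) for v in values]
    return {"type": "int", "min": min(numbers), "max": max(numbers),
            "mean": statistics.mean(numbers)}

def profile_categorical_column(values):
    """Frequency statistics for any other column."""
    counts = Counter(values)
    top = counts.most_common(1)[0][0] if counts else None
    return {"type": "categorical", "unique": len(counts), "top": top}

def build_profile(columns, options=None):
    """Dispatch each column to a profiler; `options` can disable a column."""
    options = options or {}
    result = {}
    for name, values in columns.items():
        if not options.get(name, {}).get("enabled", True):
            continue  # the user switched this column off
        present = [v for v in values if v != ""]
        if present and all(is_int(v) for v in present):
            stats = profile_int_column(present)
        else:
            stats = profile_categorical_column(present)
        stats["missing"] = len(values) - len(present)
        result[name] = stats
    return result

columns = {"age": ["36", "41", ""], "city": ["Oslo", "Oslo", "Rome"], "secret": ["x"]}
profiles = build_profile(columns, {"secret": {"enabled": False}})
print(profiles)
```

Empty strings are treated as missing values here and are excluded before type detection, so a single blank cell does not turn a numeric column into a categorical one.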
Reporting & Visualization
Generates human-readable reports and visual representations (e.g., histograms, missing value matrices) from the collected data profiles, so that results can be interpreted at a glance.
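As a rough idea of what turning profile data into a readable summary involves, the sketch below renders a text histogram and a missing-value summary. It is plain standard-library Python; `render_histogram` and `render_missing` are hypothetical helpers, and the input shapes are assumptions rather than the format the Data Profiling Engine actually emits.

```python
def render_histogram(counts, width=20):
    """Render a {bucket: count} mapping as a text bar chart scaled to `width`."""
    peak = max(counts.values(), default=0)
    lines = []
    for bucket, count in counts.items():
        bar = "#" * (round(width * count / peak) if peak else 0)
        lines.append(f"{bucket:>8} | {bar} {count}")
    return "\n".join(lines)

def render_missing(missing_by_column, total_rows):
    """Render one line per column with its share of missing values."""
    lines = []
    for name, missing in missing_by_column.items():
        share = missing / total_rows if total_rows else 0.0
        lines.append(f"{name}: {missing}/{total_rows} missing ({share:.0%})")
    return "\n".join(lines)

histogram = render_histogram({"0-9": 2, "10-19": 4}, width=4)
missing = render_missing({"age": 1}, total_rows=4)
print(histogram)
print(missing)
```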
Related Classes/Methods: