Data Analysis and Model Documentation
Dataset Overview
The TeenSmartInsight project uses a dataset that captures adolescent technology usage and its potential impact on well-being. The dataset includes the following key features:
- Demographics: Age, gender, location, school grade
- Technology Usage: Daily usage hours, apps used daily, time on social media, time on gaming
- Behavioral Indicators: Sleep hours, phone checks per day, weekend usage hours
- Well-being Metrics: Academic performance, anxiety level, depression level, self-esteem
- Target Variable: Addiction level (scale of 1-10)
Exploratory Data Analysis
The exploratory data analysis (EDA) is performed in the Jupyter notebook 001_TeenAddiction.ipynb. The analysis includes:
- Data Cleaning and Preprocessing:
  - Handling missing values
  - Encoding categorical variables
  - Feature scaling
- Statistical Analysis:
  - Descriptive statistics of key variables
  - Correlation analysis between technology usage and addiction levels
  - Distribution analysis of addiction levels across different demographics
- Visualization:
  - Heatmaps showing correlation between variables
  - Scatter plots showing relationships between usage patterns and addiction levels
  - Bar charts comparing addiction levels across different demographics
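The cleaning and preprocessing steps listed above could be sketched as follows. This is a minimal illustration on made-up rows, not the notebook's actual code: the `Gender` column name, the median imputation, and the one-hot encoding are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy rows standing in for the real dataset (values are made up).
df = pd.DataFrame({
    "Daily_Usage_Hours": [4.0, np.nan, 6.5, 3.0],
    "Sleep_Hours": [7.0, 6.0, np.nan, 8.0],
    "Gender": ["Female", "Male", "Male", "Female"],  # assumed column name
})

numeric_cols = ["Daily_Usage_Hours", "Sleep_Hours"]

# 1. Handle missing values: fill numeric gaps with the column median.
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# 2. Encode categorical variables: one-hot encode Gender.
df = pd.get_dummies(df, columns=["Gender"])

# 3. Scale numeric features to zero mean and unit variance.
df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])

print(df)
```

Median imputation is used here only because it is simple and robust to outliers; the notebook may use a different strategy.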
Key visualizations from the analysis can be found in the figures/ directory, including:
- mapaCalorCorrelacion.png: Correlation heatmap between variables
- catNumRespecto_AddLvl.png: Categorical and numerical variables with respect to addiction level
Machine Learning Model
Model Selection
After evaluating several machine learning algorithms, a Random Forest Regressor was selected for the following reasons:
- Ability to handle non-linear relationships
- Robustness to outliers
- Built-in feature importance scores
- Good performance with moderate-sized datasets
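One way to compare candidate algorithms is cross-validated scoring, sketched below. The source does not say which algorithms were evaluated or how, so the candidate set, the 5-fold setup, and the synthetic non-linear data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with a non-linear target (not the project's dataset).
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(300, 4))
y = np.sin(X[:, 0]) * 3 + X[:, 1] ** 0.5 + rng.normal(0, 0.3, size=300)

# Hypothetical candidate set; the project's actual candidates are not documented.
candidates = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=42),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Mean 5-fold cross-validated R² for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in candidates.items()
}

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: mean R² = {score:.3f}")
```

On data like this, where the target depends non-linearly on the inputs, a linear model tends to score well below the tree-based models, which matches the first reason given above.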
Feature Engineering
The following features were selected for the model based on their correlation with addiction levels and domain knowledge:
features = [
    'Daily_Usage_Hours',
    'Apps_Used_Daily',
    'Time_on_Social_Media',
    'Time_on_Gaming',
    'Phone_Checks_Per_Day',
    'Sleep_Hours',
    'Weekend_Usage_Hours',
    'Academic_Performance'
]
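Ranking features by their correlation with the target, as described above, might look like this sketch. The data is synthetic, and the target column name `Addiction_Level` is an assumption; only a few of the feature names are used to keep the example short.

```python
import numpy as np
import pandas as pd

# Synthetic data (not the project's dataset): the target depends strongly
# on Daily_Usage_Hours and weakly, negatively, on Sleep_Hours.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "Daily_Usage_Hours": rng.uniform(0, 12, n),
    "Sleep_Hours": rng.uniform(4, 10, n),
    "Apps_Used_Daily": rng.integers(1, 20, n),
})
df["Addiction_Level"] = (  # assumed target column name
    0.7 * df["Daily_Usage_Hours"] - 0.3 * df["Sleep_Hours"] + rng.normal(0, 0.5, n)
)

# Pearson correlation of each candidate feature with the target.
corr = df.corr()["Addiction_Level"].drop("Addiction_Level")

# Rank by absolute value so strong negative correlations are not overlooked.
ranking = corr.abs().sort_values(ascending=False)
print(ranking)
```

Correlation only captures linear association, which is why the text pairs it with domain knowledge when choosing features.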
Model Pipeline
The model uses a scikit-learn pipeline that includes:
- Preprocessing: StandardScaler for numerical features
- Model: RandomForestRegressor with 100 estimators
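from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Standardize the numerical features listed above.
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), features),
])
# Chain preprocessing and the regressor into a single estimator.
pipeline = Pipeline([
    ('preproc', preprocessor),
    ('model', RandomForestRegressor(n_estimators=100, random_state=42)),
])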
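A sketch of how such a pipeline could be trained and inspected is shown below. The synthetic data, the `Addiction_Level` column name, and the 80/20 split are assumptions for illustration; the project's actual training script is not shown in this document.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

features = [
    "Daily_Usage_Hours", "Apps_Used_Daily", "Time_on_Social_Media",
    "Time_on_Gaming", "Phone_Checks_Per_Day", "Sleep_Hours",
    "Weekend_Usage_Hours", "Academic_Performance",
]

# Synthetic stand-in for the real data (values are made up).
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.uniform(0, 10, size=(400, len(features))), columns=features)
df["Addiction_Level"] = (  # assumed target column name
    0.6 * df["Daily_Usage_Hours"] - 0.4 * df["Sleep_Hours"] + 5
    + rng.normal(0, 0.5, size=400)
).clip(1, 10)

# Hold out 20% of the rows for testing (assumed split ratio).
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["Addiction_Level"], test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ("preproc", ColumnTransformer([("num", StandardScaler(), features)])),
    ("model", RandomForestRegressor(n_estimators=100, random_state=42)),
])
pipeline.fit(X_train, y_train)

# Feature importances from the fitted forest, paired with feature names.
importances = pd.Series(
    pipeline.named_steps["model"].feature_importances_, index=features
).sort_values(ascending=False)
print(importances.head(3))
print("Test R²:", round(pipeline.score(X_test, y_test), 3))
```

Keeping the scaler inside the pipeline means the same transformation learned on the training data is applied automatically at prediction time.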
Model Evaluation
The model is evaluated using the following metrics:
- Mean Squared Error (MSE): Measures the average squared difference between predicted and actual addiction levels
- Mean Absolute Error (MAE): Measures the average absolute difference between predicted and actual addiction levels
- R-squared (R²): Measures the proportion of variance in addiction levels explained by the model
The evaluation script (evaluate_model.py) also generates a scatter plot comparing predicted vs. actual addiction levels.
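The three metrics can be computed with `sklearn.metrics`, as in this small worked example. The values are made up for illustration, and this is not the content of evaluate_model.py.

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted addiction levels for four teens.
y_true = [3, 5, 7, 9]
y_pred = [2.5, 5.0, 8.0, 9.0]

mse = mean_squared_error(y_true, y_pred)   # (0.25 + 0 + 1 + 0) / 4
mae = mean_absolute_error(y_true, y_pred)  # (0.5 + 0 + 1 + 0) / 4
r2 = r2_score(y_true, y_pred)              # 1 - 1.25 / 20

print(f"MSE={mse:.4f}  MAE={mae:.4f}  R²={r2:.4f}")
# MSE=0.3125  MAE=0.3750  R²=0.9375
```

MSE penalizes the single one-point miss more heavily than MAE does, which is why reporting both gives a fuller picture of the error distribution.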
Model Deployment
The trained model is serialized using joblib and saved as rf_pipeline.pkl. This file is then used by the web application to make predictions based on user input.
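The save-and-reload round trip might look like the sketch below. A small stand-in model and a temporary directory are used so the example is self-contained; how the web application actually loads rf_pipeline.pkl is not described here, so the loading step is an assumption.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Tiny stand-in model trained on made-up data (not the project's pipeline).
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(50, 3))
y = X.sum(axis=1)
model = RandomForestRegressor(n_estimators=10, random_state=42).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "rf_pipeline.pkl"
    joblib.dump(model, path)      # serialize the fitted estimator
    restored = joblib.load(path)  # what a consumer would do before predicting

# The restored model reproduces the original predictions.
same = np.allclose(model.predict(X), restored.predict(X))
print("Predictions match after reload:", same)
```

Serializing the whole pipeline rather than the bare regressor keeps the scaling step bundled with the model, so the consumer only needs to pass raw feature values.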