This project is for the Erdős Institute Fall 2024 Data Science Boot Camp.
This project aims to uncover the hidden relationship between how often different news topics occur and their impact on stock prices. Many projects try to connect financial news or finance-related social media to stock prices. This project is instead interested in non-financial news: for example, how does political news influence stock prices? Our project provides a general method that can handle all kinds of topics.
There are two news datasets:
- Headline: This dataset from Kaggle contains around 210k news headlines, each labeled with a category, from 2012 to 2022. It is used to train the category classifier.
- All_news: This dataset contains 2,688,878 uncategorized news articles and essays from 27 American publications, spanning January 1, 2016 to April 2, 2020. It is used to cluster the topics and to build the features that predict future stock prices.
The stock price dataset comes from the Yahoo API: we use the historical stock prices of 20 companies from 2016-01-01 to 2019-12-31.
For both news datasets, we run an almost identical preprocessing pipeline:
- Null removal and column dropping
- Tokenization
- Stop word removal
- Lemmatization
For the All_news dataset, we keep 100 tokens of each article to serve as its headline.
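The preprocessing steps above can be sketched roughly as follows. This is a minimal, standard-library illustration rather than the project's actual pipeline: the stop-word list is a tiny placeholder, `naive_lemma` is a crude stand-in for a real lemmatizer, and the names `preprocess` and `max_tokens` are hypothetical. It also assumes the token limit is applied to the first tokens that remain after cleaning.

```python
import re

# Tiny illustrative stop-word list (an assumption); a real pipeline would
# use a full list from an NLP library.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "is", "are", "for"}


def naive_lemma(token):
    """Crude stand-in for lemmatization: strip a trailing plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(text, max_tokens=None):
    """Return cleaned tokens, or None when the input is null or blank."""
    if text is None or not text.strip():
        return None  # null removal
    tokens = re.findall(r"[a-z]+", text.lower())  # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop word removal
    tokens = [naive_lemma(t) for t in tokens]  # simplified lemmatization
    # Assumption: keep only the first max_tokens tokens when a limit is given.
    return tokens[:max_tokens] if max_tokens is not None else tokens


rows = ["The markets are reacting to the elections", None, "  "]
cleaned = [toks for toks in (preprocess(r) for r in rows) if toks is not None]
print(cleaned)
```

For All_news, a call such as `preprocess(article_text, max_tokens=100)` would produce the shortened token list under the assumption stated above.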
The exploratory data analysis can be found in EDA_headline, EDA_Example_All_News, and All_News_EDA.
Using Headline as the training set, we build a classifier. The model is a soft-voting ensemble classifier that combines:
- Multi-Logistic Classifier
- Random Forest
- XGBoost Classifier
- CNN
For this model, we choose the term frequency-inverse document frequency (TF-IDF) embedding to emphasize the importance of words in the headline. A more detailed discussion of the classification can be found in Classification.
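To illustrate what soft voting does, the sketch below averages the class probabilities produced by several models and picks the class with the highest average. It assumes equal weights for all models, and the category names and probability values are made up for illustration; the project's actual weighting and categories may differ.

```python
def soft_vote(probabilities):
    """Average per-model class probabilities; return (best_index, averaged)."""
    n_models = len(probabilities)
    n_classes = len(probabilities[0])
    averaged = [
        sum(model[c] for model in probabilities) / n_models
        for c in range(n_classes)
    ]
    best = max(range(n_classes), key=lambda c: averaged[c])
    return best, averaged


# Hypothetical probabilities from four models over three made-up categories.
categories = ["POLITICS", "SPORTS", "TECH"]
model_probs = [
    [0.6, 0.3, 0.1],  # e.g. logistic classifier
    [0.4, 0.5, 0.1],  # e.g. random forest
    [0.5, 0.2, 0.3],  # e.g. XGBoost
    [0.7, 0.2, 0.1],  # e.g. CNN
]
label, avg = soft_vote(model_probs)
print(categories[label], avg)
```

Unlike hard voting, which counts each model's top prediction, soft voting lets a confident model outweigh several uncertain ones.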
Using the classifier, we can label All_news.
By processing the content of All_news, we can cluster the articles into different topics. The model is based on:
- Latent Dirichlet Allocation (LDA)
- Hierarchical Dirichlet Process (HDP)
We choose around 500 clusters. More exploration can be found in explore_hdp. From here on, we refer to these clusters as topics.
Our model for stock price is
where
All the factors are global to the market. We get the factors from: . A detailed discussion can be found in FF5.
For the news model, we compare:
- Ridge
- Lasso
- Random Forest Regressor
- XGBoost Regressor
We choose XGBoost with a penalty term over the others because it gives the lowest mean squared error on the test set and retains more trading information. A detailed discussion can be found in Price_predicting.
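The selection criterion, lowest mean squared error on the test set, can be sketched as below. The candidate names and all numbers are toy placeholders and do not reflect the project's results; `pick_best` is a hypothetical helper.

```python
def mean_squared_error(y_true, y_pred):
    """Mean of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pick_best(y_test, predictions):
    """Return the candidate with the lowest test MSE, plus all scores."""
    scores = {
        name: mean_squared_error(y_test, pred)
        for name, pred in predictions.items()
    }
    best = min(scores, key=scores.get)
    return best, scores


# Toy test targets and predictions from two placeholder candidates.
y_test = [1.0, 2.0, 3.0]
predictions = {
    "candidate_a": [1.5, 2.5, 3.5],
    "candidate_b": [1.1, 1.9, 3.1],
}
best, scores = pick_best(y_test, predictions)
print(best, scores)
```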
Open the configuration file predict_stock_w_news.toml, set the dataset paths to your own local paths, and select the models you want to run. Then run predict_stock_w_news.py.

