Data Science Workshop 2019

Find more information about the workshop here

Pre-workshop Tutorials

July 22-26, 2019, 2-3pm, 251 LeConte Hall

We thank everyone involved for leading the tutorials before the workshop. The content of the tutorials can be found in the GitHub repos, linked below:


The projects below were completed during the 2019 Data Science Workshop, which ran from July 27 to August 17, 2019.

Mentor: Sumayah Rahman
Participants: Julia Cluceru, Dinara Ermakova, Ella Hiesmayr
Link to project

Using the Kenya Financial Diaries dataset, we tried to predict whether borrowers would repay their loans. We first extracted all loan-related data from the full dataset, which contains all financial transactions of about 250 Kenyan households. The hardest part was defining an outcome that measures how reliably someone repays loans. Once we had this outcome, we predicted it using several different modeling approaches.

Mentor: Frank Cleary
Participants: Daniel D Wooten, Susan Hao, Zhimin Chen, Kilean Hwang
Link to project

Our project used transfer learning with a pre-trained VGGFace model in TensorFlow Keras to build an emotion-classification CNN. We then developed a Flask application that takes in webcam video of faces and uses our model to predict which emotion each person is exhibiting. An emoji associated with the predicted emotion is overlaid on the person's face, along with a graph showing the probability of each emotion class as output by the neural net.
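The last step described above, turning the network's outputs into an emoji and a set of per-class probabilities, can be sketched in plain Python. This is only an illustration: the emotion labels, the emoji shortcodes, and the assumption that the network ends in a softmax over raw scores are all hypothetical and are not taken from the project itself.

```python
import math

# Hypothetical class labels and emoji shortcodes; the project's actual
# emotion classes are not listed in this summary.
EMOTIONS = ["angry", "happy", "neutral", "sad", "surprised"]
EMOJI_BY_EMOTION = {
    "angry": ":angry:",
    "happy": ":smile:",
    "neutral": ":neutral_face:",
    "sad": ":cry:",
    "surprised": ":open_mouth:",
}


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pick_emoji(scores):
    """Return the most likely emotion, its emoji, and all class probabilities."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    label = EMOTIONS[best]
    return label, EMOJI_BY_EMOTION[label], dict(zip(EMOTIONS, probs))


# Made-up scores for a single video frame.
label, emoji, probabilities = pick_emoji([0.1, 2.5, 0.3, -1.0, 0.0])
print(label, emoji)
```

The returned dictionary of probabilities is the kind of data a per-class bar graph could be drawn from.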

Mentor: Andy Vargas
Participants: Michael Yeh, Marvin Pohl, Yuem Park
Link to project

For many businesses, only a small fraction of customers produce the majority of the revenue. Therefore, identifying these potentially revenue-generating customers is critical for developing effective marketing strategies. Our project used visitor data from the online Google Merchandise Store to predict future customer revenue. Given that the vast majority of visitors never purchase anything in the future, we used a two-model approach: first, a logistic regression was used to quantify the probability that a given customer would make a purchase in the future, and second, a random forest was used to estimate the amount of money that a given customer was going to spend under the assumption that they did make a purchase in the future. By multiplying these two values together, we were able to estimate the expected value of any given customer.
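The combining step of this two-model approach can be sketched as follows. The sketch covers only the final multiplication, not the logistic regression or random forest themselves; the function name, visitor names, and numbers are hypothetical stand-ins for the two models' outputs.

```python
def expected_revenue(p_purchase, spend_if_purchase):
    """Expected revenue = P(purchase) * E[spend | purchase]."""
    if not 0.0 <= p_purchase <= 1.0:
        raise ValueError("p_purchase must be a probability in [0, 1]")
    if spend_if_purchase < 0:
        raise ValueError("spend_if_purchase must be non-negative")
    return p_purchase * spend_if_purchase


# Made-up outputs: (probability from the classifier,
#                   predicted spend from the regressor, given a purchase).
visitors = {
    "visitor_a": (0.02, 150.0),
    "visitor_b": (0.40, 25.0),
    "visitor_c": (0.00, 500.0),
}

# Rank visitors by expected revenue, highest first.
ranked = sorted(
    ((name, expected_revenue(p, spend)) for name, (p, spend) in visitors.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, value in ranked:
    print(f"{name}: {value:.2f}")
```

Splitting the problem this way lets the second model be trained only on visitors who actually purchased, so the many zero-revenue visitors do not dominate the spend estimate.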

Mentor: Henoch Wong
Participants: Sebastian Gude, Akshay Punhani, Takuma Kinoshita
Link to project

Preventing damage caused by malware is an important challenge. We find that using only the top-importance features in the Microsoft Malware dataset yields strong predictions with basic machine learning techniques.

Mentor: Karla Cabellero
Participants: Aummul Baneen Manasawala, Anu Kuncheria, Dat Mai, Tairu Lyu
Link to project

We used results from sentiment analysis of the top 25 news items from Reddit to improve market prediction from time series analysis. We applied logistic regression, random forest, and time series analysis techniques to predict market ups and downs.

Mentor: Mike Osorio
Participants: Yang Ha, Alex Robson, Jeffmin Lin, Yao Tang
Link to project

We used several deep learning techniques, notably residual CNNs, to study individual heartbeats and attempt to classify them by arrhythmia. Our test accuracy was competitive with state-of-the-art results both on our main dataset and when applying transfer learning to a secondary dataset.

Mentor: Mike Yen
Participants: Rebecca Cruz, Katherine Oosterbaan, Mehmet Dogan, Yishuang Chen
Link to project

We built natural language processing and computer vision models to predict the subject of a mathematical expression. We used the data from Stack Exchange posts to train, validate and test our models, and achieved accuracy values of 70-90%.

Mentor: Nicolas Soldi
Participants: Alison Nguyen
Link to project

We used natural language processing and machine learning to predict which professionals are best suited to answer questions about schools and careers. Our data set comes from CareerVillage, a community that enables students to learn more about potential jobs from people who have the very same jobs themselves.

Mentor: Santiago Miret
Participants: Christianna Lininger, Kate Groschner, Yusuke Kikuchi, Dana Miller
Link to project

We used data from a publicly available database reporting building electricity use and building features to predict 9-12 month hourly electricity usage for any given building. We applied random forest and long short-term memory (LSTM) machine learning techniques.