Big Data Experience

I have experience working with 18 big data sets: BTD, USTD, GDP6, LAQN, HD2015, USRDS, NIS, BDHS-99, BDHS-11, CARDIA, MESA, NGHS, NSD, NCD, K12SD, USPOST, EXPCS, and CSD. I have also explored two other big data sets from the DC Sentencing Commission, which I cannot disclose under a non-disclosure agreement (NDA).

BTD and USTD stand for Bangladesh temperature data and USA temperature data, and GDP6 stands for GDP data from six countries (Australia, Canada, Germany, Japan, UK and USA) from 1970 to 2015. These data are used in a graduate-level Time Series course for a comparative study of parametric and nonparametric time series models. LAQN stands for London Air Quality Network data. I am currently working with these data on one-step kernel log-likelihood estimation and two-step smoothing estimation of time-varying parameters.

USRDS stands for United States Renal Data System, and NIS stands for National Inpatient Sample. I worked on the USRDS and NIS data sets as a postdoctoral fellow at NIAMS, NIH. BDHS-99 and BDHS-11 stand for the Bangladesh Demographic and Health Survey for the years 1999 and 2011. NGHS stands for National Growth and Health Study, a longitudinal study conducted from September 1985 to March 2000 under the supervision of NIH. My PhD dissertation data come from this study. I am very familiar with the MESA and CARDIA studies, as I collaborated with NIH and Johns Hopkins University under the supervision of Dr. Colin Wu and Dr. Joao Lima. MESA stands for Multi-Ethnic Study of Atherosclerosis. CARDIA stands for Coronary Artery Risk Development in Young Adults. So far there are eight CARDIA studies, and I used a SAS macro to combine the SAS files from each study.

NSD stands for National School Data Set. I came across this data set when I worked as an analytic researcher at K12. K12SD stands for K12 School Data Set. NCD stands for National Crime Data Set. I worked on these data as a part-time statistical modeler at the Washington DC Sentencing Commission. USPOST stands for United States Postal Service Data Set. I worked on these data as a summer intern at IHS Global Insight in Washington DC. EXPCS stands for Excel Academy Public School Data Set. I used these data as chief data analyst at Excel Academy Public Charter School in Washington DC. CSD stands for Convenience Store Data Set, which I came across as a summer intern at NACS headquarters in Virginia.

Experience with Financial Data:

I am familiar with six big financial data sets, which I used for homework and projects while completing my master's degree in Quantitative and Computational Finance at Georgia Tech.

1. CRSP data: The Center for Research in Security Prices (CRSP) is a provider of historical stock market data. For each stock, it provides the daily or monthly opening price, closing price, minimum price, maximum price, and volume. It also provides a market return for each period. It comes in two versions: (a) DSF, the daily stock file, and (b) MSF, the monthly stock file.

2. COMPUSTAT Funda data: Compustat is a comprehensive market and corporate financial database published by Standard and Poor's, covering thousands of companies worldwide, with information dating back as far as 1950. It contains the fundamental information (balance sheet, cash flow, income statement) for each company.

3. HMDA data: The Home Mortgage Disclosure Act (HMDA) requires banks to make all mortgage application-related data publicly available.

4. SEC EDGAR data: We accessed the 8-K, 10-K, and 10-Q filings of each company for an NLP project. An 8-K is filed to report a significant event at the company, a 10-K is the annual filing, and a 10-Q is the quarterly filing.

5. Factor data from McKinley Capital Management: We used this factor data, which contains multiple factors for each company, to build a trading strategy for Prof. Deng's class.

6. VIX data: In the derivative securities class, we used option price data for the entire market to calculate the VIX.
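
To illustrate the kind of calculation behind item 6, here is a minimal Python sketch of a VIX-style variance for a single option expiry. It follows the publicly documented model-free formula, sigma^2 = (2/T) * sum(dK/K^2 * e^(RT) * Q(K)) - (1/T) * (F/K0 - 1)^2, and is not the code used in the class. The function names, argument names, and sample quotes are hypothetical, and the sketch leaves out steps of the full methodology, such as dropping zero-bid quotes and interpolating between two expiries to a 30-day horizon.

```python
import math


def vix_style_variance(strikes, call_mids, put_mids, forward, rate, t_years):
    """Single-expiry variance from a VIX-style model-free formula.

    strikes must be sorted ascending (at least two strikes); call_mids and
    put_mids hold the mid quote for each strike. All names are illustrative.
    """
    # K0: the highest listed strike at or below the forward level
    k0_idx = max(i for i, k in enumerate(strikes) if k <= forward)
    k0 = strikes[k0_idx]

    total = 0.0
    last = len(strikes) - 1
    for i, k in enumerate(strikes):
        # Strike spacing: half the gap between neighbours, one-sided at the ends
        if i == 0:
            dk = strikes[1] - strikes[0]
        elif i == last:
            dk = strikes[-1] - strikes[-2]
        else:
            dk = (strikes[i + 1] - strikes[i - 1]) / 2.0

        # Out-of-the-money quote: put below K0, call above K0, average at K0
        if i < k0_idx:
            q = put_mids[i]
        elif i > k0_idx:
            q = call_mids[i]
        else:
            q = (put_mids[i] + call_mids[i]) / 2.0

        total += dk / k ** 2 * math.exp(rate * t_years) * q

    return (2.0 / t_years) * total - (1.0 / t_years) * (forward / k0 - 1.0) ** 2


def vix_style_index(variance):
    """Convert a variance into an index level quoted in volatility points."""
    return 100.0 * math.sqrt(variance)


if __name__ == "__main__":
    # Made-up quotes for three strikes, purely to show the call pattern
    var = vix_style_variance(
        strikes=[90.0, 100.0, 110.0],
        call_mids=[12.0, 5.0, 1.0],
        put_mids=[1.0, 4.0, 11.0],
        forward=101.0,
        rate=0.0,
        t_years=0.25,
    )
    print(round(vix_style_index(var), 2))
```

With real data, the strikes and quotes would come from the market-wide option price file, and the result for two neighbouring expiries would then be blended to a constant 30-day horizon.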