This course is an instance of Advanced Topics in Data Science – CSCI 6720G. The focus of the course for 2018 is big data, with an emphasis on data integration, data curation, data discovery, and data cleaning.
Course Announcements and News
The first class is on September 5th at 11:10 am.
Office hours: Tuesdays, 4-5 pm (except reading week).
The following is a tentative outline of the course. Every student must read the assigned reading material. Each student must submit one one-pager on an assigned paper every three weeks.
- Course introduction and overview PDF
- First paper assignments
- Continuous Data Curation (presenter: Jarek Szlichta)
- F. Chiang, R. Miller: A unified model for data and constraint repair. ICDE 2011: 446-457; presenter: Nadia Maarfavi
- Mohamed Yakout, Ahmed K. Elmagarmid, Jennifer Neville, Mourad Ouzzani, Ihab F. Ilyas: Guided data repair. PVLDB 4(5): 279-289 (2011); presenter: Spencer Bryson
- M. Yakout, L. Berti-Equille, and A. K. Elmagarmid. Don’t be SCAREd: use SCalable Automatic REpairing with maximal likelihood and bounded changes. In SIGMOD, pages 553–564, 2013; presenter: Michael Lombardo
- Tim Kraska, Mohammad Alizadeh, Alex Beutel, Ed H. Chi, Ani Kristo, Guillaume Leclerc, Samuel Madden, Hongzi Mao, Vikram Nathan: SageDB: A Learned Database System. CIDR 2019; presenter: Chirag Karia
- Xu Chu, John Morcos, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Nan Tang, Yin Ye: KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing. SIGMOD Conference 2015: 1247-1261; presenter: Aryan Asadian
- Lukasz Golab, Howard J. Karloff, Flip Korn, Avishek Saha, Divesh Srivastava: Sequential Dependencies. PVLDB 2(1): 574-585 (2009); presenter: Andrei Torres
- Guoyao Feng, Lukasz Golab, Divesh Srivastava: Scalable Informative Rule Mining. ICDE 2017: 437-448; presenter: Alireza A Namanloo
- Nataliya Prokoshyna, Jaroslaw Szlichta, Fei Chiang, Renée J. Miller, Divesh Srivastava: Combining Quantitative and Logical Data Cleaning. PVLDB 9(4): 300-311 (2015); presenter: Alaadin Addas
- Philip Bohannon, Michael Flaster, Wenfei Fan, Rajeev Rastogi: A Cost-Based Model and Effective Heuristic for Repairing Constraints by Value Modification. SIGMOD Conference 2005: 143-154; presenter: Davood Zaman Farsa
- Hemant Saxena, Lukasz Golab, Ihab F. Ilyas: Distributed Implementations of Dependency Discovery Algorithms. PVLDB 12(11): 1624-1636 (2019); presenter: Rajinder Khurmi
- Guoliang Li, Xuanhe Zhou, Shifu Li, Bo Gao: QTune: A Query-Aware Database Tuning System with Deep Reinforcement Learning. PVLDB 12(12): 2118-2130 (2019); presenter: Spencer Bryson
- Reading Week!
- Shaoxu Song, Aoqian Zhang, Jianmin Wang, Philip S. Yu. SCREEN: Stream Data Cleaning under Speed Constraints. In Proc. of the ACM SIGMOD Conference on Management of Data, 827-841, 2015; presenter: Aryan Asadian
- Chelsea Finn, Pieter Abbeel, Sergey Levine: Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. ICML 2017: 1126-1135; presenter: Michael Lombardo
- Dana Van Aken, Andrew Pavlo, Geoffrey J. Gordon, Bohan Zhang: Automatic Database Management System Tuning Through Large-scale Machine Learning. SIGMOD Conference 2017: 1009-1024; presenter: Davood Zaman Farsa
- Sebastian Schelter, Dustin Lange, Philipp Schmidt, Meltem Celikel, Felix Bießmann, Andreas Grafberger: Automating Large-Scale Data Quality Verification. PVLDB 11(12): 1781-1794 (2018); presenter: Rajinder Khurmi
- Lei Cao, Wenbo Tao, Sungtae An, Jing Jin, Yizhou Yan, Xiaoyu Liu, Wendong Ge, Adam Sah, Leilani Battle, Jimeng Sun, Remco Chang, M. Brandon Westover, Samuel Madden, Michael Stonebraker: Smile: A System to Support Machine Learning on EEG Data at Scale. PVLDB 12(12): 2230-2241 (2019); presenter: Chirag Karia
- Raul Castro Fernandez, Essam Mansour, Abdulhakim Ali Qahtan, Ahmed K. Elmagarmid, Ihab F. Ilyas, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, Nan Tang: Seeping Semantics: Linking Datasets Using Word Embeddings for Data Discovery. ICDE 2018: 989-1000; presenter: Alireza A Namanloo
- Zhu, Guanghui, et al. “Efficient and Scalable Functional Dependency Discovery on Distributed Data-Parallel Platforms.” IEEE Transactions on Parallel and Distributed Systems (2019); presenter: Andrei Torres
Choose your own top-tier data science paper related to data curation. Papers are presented in (random) pairs. Presentations are extended to 40 minutes. Send the paper in advance so that students have enough time to read it.
- Liu X., Song G., Wang X. (2019) HATDC: A Holistic Approach for Time Series Data Repairing. In: Yang Q., Zhou ZH., Gong Z., Zhang ML., Huang SJ. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2019; presenters: Davood Zaman Farsa, Andrei Torres
- PIClean: a Probabilistic and Interactive Data Cleaning System, SIGMOD 2019; presenters: Michael Lombardo, Rajinder Khurmi
Choose your own top-tier data science paper related to data curation. Papers are presented in pairs. Presentations are extended to 40 minutes.
- Ryan Marcus, Olga Papaemmanouil: Towards a Hands-Free Query Optimizer through Deep Learning. CIDR 2019; presenters: Aryan Chirag Karia, Spencer Bryson
- Muhammad Ebraheem et al.: Distributed representations of tuples for entity resolution. VLDB J. 2018; presenters: Alireza A Namanloo, Aryan Asadian
- Student Project Presentations
A one-pager is a one-page summary of the research paper assigned as reading for the lecture (one per two weeks). It should not be longer than one page of printed letter paper (10 pt font preferred). The one-pager should address the following points:
- summarize the problem(s) addressed/solved by the research paper (1-2 sentences that clearly describe the problem: “The problem is … .”)
- briefly sketch the main ideas on which the solution of the problem is based
- briefly describe the research methodology of the paper (1-2 sentences)
- identify 3 strong points and 3 weak points of the paper
- summarize any assumptions the solution in the paper is based upon (any restrictions; separate stated assumptions from unstated ones)
- raise three non-trivial questions about the paper (including future work)
- other remarks (if any)
A one-pager receives one of four marks:
- 0 – nothing was handed in
- 1 – one pager is not detailed enough or too long
- 2 – one pager is good, all aspects covered
- 3 – one pager is exceptional, as it outlines further interesting points that go beyond the discussion in the paper
This course is project-based. You have to propose and carry out a project that investigates a clearly defined problem within the scope of data management and big data integration.
- one project per student
- independent exploration of specific problem
- project proposal, modeling, design, simulation, and analysis
- evaluation of project: timeliness, development, and presentation of the idea, i.e., in-class presentations (e.g., progress and final) and the proposal, progress, and final project reports
- no more than 12 pages (LNCS format).
In the course project you should demonstrate the ability to do research by solving a well-defined problem. The emphasis of the project is on applying a solid research methodology from beginning to end. You will learn what a solid research methodology is by reading and analyzing various research papers throughout the course.
Submit a one-page proposal (plus up to one page of appendix for examples and other details). The proposal should cover the following points:
- clear and concise problem statement: identify the problem (if all else fails, start with: “The problem is …”; 2 sentences or fewer)
- discuss the relevance: state why your problem is an important one and what impact a solution would have (i.e., what it would change)
- discuss why the problem is interesting, i.e., convince the reader that it is challenging to solve
- describe related approaches
- sketch your approach: say what you intend to do to solve the problem
- describe how you intend to solve the problem (i.e., implement and evaluate, model and simulate, define theorem and prove etc.)
- describe anticipated difficulties
Submit a one-page progress report (plus up to one page of appendix). The report should:
- describe the problem you are working on (this may simply be a repeat of your proposal, unless you got feedback to refine your problem statement)
- describe your approach in more detail
- summarize accomplishments to date
- summarize next steps
- describe problems you encountered and how you plan to solve them
- the problem description with a motivation
- a quick overview of related work
- the proposed solution
- a technical description of the solution
- encountered difficulties
- future work and conclusion
For your final project write-up you must use the LNCS format. Do not write more than 12 pages in that format. Your project report must be of “publishable quality”: the write-up should be free of typos and largely free of grammatical errors. This does not mean that your paper must be ready for publication in a major conference (even though that would be a desirable future result).
- You have 12 pages to present your project.
- The final deadline for your project is December 7th.
- The point of these strict rules is to approximate a conference/journal/proposal/patent submission process. You will often have to deal with formats imposed on you by someone else.
- Secondly, everybody should be allowed the same amount of space, or less, to present their ideas.
The course mark is broken down as follows:
- 15% One pagers
- 30% Presentations and discussion leading
- 20% Participation and interactions (discussion, feedback, readings, ideas, etc.)
- 35% Course project (proposal, progress reports, presentation, final report)
- Pager 1: October 5th, midnight
- Project proposals: October 22nd (Tue), midnight
- Pager 2: November 5th, midnight
- Pager 3: of November
- Project progress report: Nov 17th, midnight
- Final project presentations: last week of classes
- Final report due: December 7th, midnight