Advanced Topics in Data Science

This course is an instance of Advanced Topics in Data Science – CSCI 6720G. The focus of the course for 2018 is on big data, with an emphasis on data integration, data curation, data discovery, and data cleaning.

Course Announcements and News

The first class is on September 5th at 11:10 am.

Office hours: Tuesdays, 4-5 pm (except reading week).


The following constitutes a tentative outline of the course. The assigned reading material must be read by each student in the course. The student must submit a one-pager per assigned paper per three weeks.

Sept 5th

  • Course introduction and overview (PDF)
  • First paper assignments

Sept 12th

  • Continuous Data Curation (presenter: Jarek Szlichta)

Sept 19th

  • F. Chiang, R. Miller: A unified model for data and constraint repair. ICDE 2011: 446-457; presenter: Nadia Maarfavi
  • Mohamed Yakout, Ahmed K. Elmagarmid, Jennifer Neville, Mourad Ouzzani, Ihab F. Ilyas: Guided data repair. PVLDB 4(5): 279-289 (2011); presenter: Spencer Bryson
  • M. Yakout, L. Berti-Equille, and A. K. Elmagarmid. Don’t be SCAREd: use SCalable Automatic REpairing with maximal likelihood and bounded changes. In SIGMOD, pages 553–564, 2013; presenter: Michael Lombardo

Sept 26

  • Tim Kraska, Mohammad Alizadeh, Alex Beutel, Ed H. Chi, Ani Kristo, Guillaume Leclerc, Samuel Madden, Hongzi Mao, Vikram Nathan: SageDB: A Learned Database System. CIDR 2019; presenter: Chirag Karia
  • Xu Chu, John Morcos, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Nan Tang, Yin Ye: KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing. SIGMOD Conference 2015: 1247-1261; presenter: Aryan Asadian
  • Lukasz Golab, Howard J. Karloff, Flip Korn, Avishek Saha, Divesh Srivastava: Sequential Dependencies. PVLDB 2(1): 574-585 (2009); presenter: Andrei Torres

Oct 3rd

  • Guoyao Feng, Lukasz Golab, Divesh Srivastava: Scalable Informative Rule Mining. ICDE 2017: 437-448; presenter: Alireza A Namanloo
  • Nataliya Prokoshyna, Jaroslaw Szlichta, Fei Chiang, Renée J. Miller, Divesh Srivastava: Combining Quantitative and Logical Data Cleaning. PVLDB 9(4): 300-311 (2015); presenter: Alaadin Addas
  • Philip Bohannon, Michael Flaster, Wenfei Fan, Rajeev Rastogi: A Cost-Based Model and Effective Heuristic for Repairing Constraints by Value Modification. SIGMOD Conference 2005: 143-154; presenter: Davood Zaman Farsa

Oct 10th

  • Hemant Saxena, Lukasz Golab, Ihab F. Ilyas: Distributed Implementations of Dependency Discovery Algorithms. PVLDB 12(11): 1624-1636 (2019); presenter: Rajinder Khurmi
  • Guoliang Li, Xuanhe Zhou, Shifu Li, Bo Gao: QTune: A Query-Aware Database Tuning System with Deep Reinforcement Learning. PVLDB 12(12): 2118-2130 (2019); presenter: Spencer Bryson

Oct 17th

  • Reading Week!

Oct 24

  • Shaoxu Song, Aoqian Zhang, Jianmin Wang, Philip S. Yu. SCREEN: Stream Data Cleaning under Speed Constraints. In Proc. of the ACM SIGMOD Conference on Management of Data, 827-841, 2015; presenter: Aryan Asadian
  • Chelsea Finn, Pieter Abbeel, Sergey Levine: Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. ICML 2017: 1126-1135; presenter: Michael Lombardo

Oct 31

  • Dana Van Aken, Andrew Pavlo, Geoffrey J. Gordon, Bohan Zhang: Automatic Database Management System Tuning Through Large-scale Machine Learning. SIGMOD Conference 2017: 1009-1024; presenter: Davood Zaman Farsa
  • Sebastian Schelter, Dustin Lange, Philipp Schmidt, Meltem Celikel, Felix Bießmann, Andreas Grafberger: Automating Large-Scale Data Quality Verification. PVLDB 11(12): 1781-1794 (2018); presenter: Rajinder Khurmi

November 7th

  • Lei Cao, Wenbo Tao, Sungtae An, Jing Jin, Yizhou Yan, Xiaoyu Liu, Wendong Ge, Adam Sah, Leilani Battle, Jimeng Sun, Remco Chang, M. Brandon Westover, Samuel Madden, Michael Stonebraker: Smile: A System to Support Machine Learning on EEG Data at Scale. PVLDB 12(12): 2230-2241 (2019); presenter: Chirag Karia
  • Raul Castro Fernandez, Essam Mansour, Abdulhakim Ali Qahtan, Ahmed K. Elmagarmid, Ihab F. Ilyas, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, Nan Tang: Seeping Semantics: Linking Datasets Using Word Embeddings for Data Discovery. ICDE 2018: 989-1000; presenter: Alireza A Namanloo
  • Zhu, Guanghui, et al. “Efficient and Scalable Functional Dependency Discovery on Distributed Data-Parallel Platforms.” IEEE Transactions on Parallel and Distributed Systems (2019); presenter: Andrei Torres

November 14th

Choose your own top-tier data science paper related to data curation. Papers are presented in (random) pairs. Presentation time is extended to 40 minutes. Send the paper in advance so that students have enough time to read it.

  • Liu X., Song G., Wang X. (2019) HATDC: A Holistic Approach for Time Series Data Repairing. In: Yang Q., Zhou ZH., Gong Z., Zhang ML., Huang SJ. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2019; presenters: Davood Zaman Farsa, Andrei Torres
  • PIClean: a Probabilistic and Interactive Data Cleaning System, SIGMOD 2019; presenters: Michael Lombardo, Rajinder Khurmi

November 21st

Choose your own top-tier data science paper related to data curation. Papers are presented in pairs. Presentation time is extended to 40 minutes.

  • Ryan Marcus, Olga Papaemmanouil: Towards a Hands-Free Query Optimizer through Deep Learning. CIDR 2019; presenters: Chirag Karia, Spencer Bryson
  • Muhammad Ebraheem et al.: Distributed Representations of Tuples for Entity Resolution. VLDB J. 2018; presenters: Alireza A Namanloo, Aryan Asadian

November 28th

  • Student Project Presentations

One pager

A one pager is a one-page summary of the research paper assigned as reading for the lecture (one per two weeks); it has to be submitted every two weeks. It should be no longer than one page of printed letter paper (10 pt font preferred). The one pager should address the following points:

  • summarize the problem(s) addressed/solved by the research paper (1-2 sentences that clearly describe the problem: “The problem is … .”)
  • briefly sketch the main ideas on which the solution of the problem is based
    • briefly describe the research methodology of the paper (1-2 sentences)
  • identify 3 strong points and 3 weak points of the paper
  • summarize any assumptions the solution in the paper is based upon (any restrictions; divide these into stated and unstated assumptions)
  • raise three non-trivial questions about the paper (including future work)
  • other remarks (if any)

There are four marks for a one pager:

  • 0 – nothing was handed in
  • 1 – one pager is not detailed enough or too long
  • 2 – one pager is good, all aspects covered
  • 3 – one pager is exceptional, as it outlines further interesting points that go beyond the discussion in the paper

Class Project

This course is project-based. You have to propose and carry out a project that investigates a clearly defined problem within the scope of data management and big data integration.

Project guidelines

  • one project per student
  • independent exploration of a specific problem
    • project proposal, modeling, design, simulation, and analysis
  • evaluation of project: timeliness, development, and presentation of the idea (i.e., in-class presentations (e.g., progress and final) and the proposal, progress, and final project reports)
  • no more than 12 pages (LNCS format).

In the course project you should demonstrate the ability to do research by solving a well-defined problem. The emphasis of the project is on applying a solid research methodology from beginning to end. You will learn what a solid research methodology is by reading and analyzing various research papers throughout the course.

Project Proposal

Submit a one-page proposal (plus up to one page of appendix for examples and other details). The proposal should cover the following points:

  • clear and concise problem statement: identify the problem (if all else fails, start with: “The problem is …”; 2 sentences or fewer)
  • discuss the relevance: state why your problem is an important one and what impact a solution would have (i.e., what would it change)
  • discuss why the problem is interesting, i.e., convince the reader that it is challenging to solve
  • describe related approaches
  • sketch your approach: say what you intend to do to solve the problem
  • describe how you intend to solve the problem (i.e., implement and evaluate, model and simulate, state a theorem and prove it, etc.)
  • describe anticipated difficulties

Progress Report

Submit a one-page progress report (plus up to one page of appendix). The report should:

  • describe the problem you are working on (this may simply be a repeat of your proposal, unless you got feedback to refine your problem statement)
  • describe your approach in more detail
  • summarize accomplishments to date
  • summarize next steps
  • describe problems you encountered and how you plan to solve them

Final in-class Presentation

There will be one session at the end of the course where all projects will be presented. Each person will be given a 15-minute slot plus 5 minutes for questions. The project presentation should cover:

  • the problem description with a motivation
  • a quick overview of related work
  • the proposed solution
    • a technical description of the solution
    • encountered difficulties
  • future work and conclusion

Final Report

For your final project write-up you must use the LNCS format. Do not write more than 12 pages in this format. Your project report must be of “publishable quality”. This means the presentation should be free of typos, contain few grammatical errors, and so on. It does not mean that your paper must be ready for publication in a major conference (even though this would be a desirable future result).

  • You have 12 pages to present your project.
  • The final deadline for your project is December 7th.
  • The point of these strict rules is to approximate a conference/journal/proposal/patent submission process. You will often have to deal with formats imposed on you by someone else.
  • Secondly, everybody should be allowed the same amount of space, or less, to present their ideas.


The course mark is broken down as follows:

  • 15% One pagers
  • 30% Presentations and discussion leading
  • 20% Participation and interactions (discussion, feedback, readings, ideas, etc.)
  • 35% Course project (proposal, progress reports, presentation, final report)
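
As a rough illustration of how the breakdown above combines into a single mark, here is a minimal Python sketch. The weights come from the list above; everything else is an assumption made for illustration, not course policy: the function name `course_mark`, the component keys, and the idea that each component is first expressed as a score on a 0-100 scale. In particular, how one pager marks (0-3) translate into the 15% component is not specified here.

```python
# Weights (percent of the course mark) taken from the breakdown above.
WEIGHTS = {
    "one_pagers": 15,
    "presentations": 30,
    "participation": 20,
    "project": 35,
}


def course_mark(scores):
    """Weighted course mark from per-component scores.

    Assumption (not stated in the course outline): every component
    score is given on a 0-100 scale.
    """
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the four components")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical scores, for illustration only.
example = {
    "one_pagers": 80,
    "presentations": 90,
    "participation": 100,
    "project": 70,
}
print(course_mark(example))  # 83.5
```

The four weights sum to 100, so a student with full marks in every component ends up with a course mark of 100.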

Course Deadlines:

  • Pager 1: October 5th, midnight
  • Project proposals: October 22nd (Tue), midnight
  • Pager 2: November 5th, midnight
  • Pager 3: of November
  • Project progress report: Nov 17th, midnight
  • Final project presentations: last week of classes
  • Final report due: December 7th, midnight