This course is an instance of Advanced Topics in Information Science – CSCI 6720G. The focus of the course for 2018 is on big data, with an emphasis on data integration, data curation, and data cleaning.
Course Announcements and News
The first class is on September 12th, 11.10am.
New! Link to Mining Massive Datasets book (PDF version) and slides (PPT and PDF): http://www.mmds.org/
Office hours: Wed, 4.30-5.30pm (except reading week).
The following constitutes a tentative outline of the course. Each student must read the assigned reading material and submit a one-pager on an assigned paper every two weeks.
- Course introduction and overview PDF
- First paper assignments
- Bringing Order to Big Data (presenter: Jarek Szlichta) Slides posted on Blackboard
- F. Chiang, R. Miller: A unified model for data and constraint repair. ICDE 2011: 446-457; presenter: Samantha Stahlke.
- Lukasz Golab, Howard J. Karloff, Flip Korn, Avishek Saha, Divesh Srivastava: Sequential Dependencies. PVLDB 2(1): 574-585 (2009); presenter: Nour Halabi.
- Mohamed Yakout, Ahmed K. Elmagarmid, Jennifer Neville, Mourad Ouzzani, Ihab F. Ilyas: Guided data repair. PVLDB 4(5): 279-289 (2011); presenter: Bahare Askari.
- Nataliya Prokoshyna, Jaroslaw Szlichta, Fei Chiang, Renée J. Miller, Divesh Srivastava: Combining Quantitative and Logical Data Cleaning. PVLDB 9(4): 300-311 (2015); presenter: Andrei Stoica.
- Reading Week!
- George Beskales, Ihab F. Ilyas, Lukasz Golab, Artur Galiullin: On the relative trust between inconsistent data and inaccurate constraints. ICDE 2013: 541-552; presenter: Mike Valdron.
- Philip Bohannon, Michael Flaster, Wenfei Fan, Rajeev Rastogi: A Cost-Based Model and Effective Heuristic for Repairing Constraints by Value Modification. SIGMOD Conference 2005: 143-154; presenter: Samantha Stahlke
- Guoyao Feng, Lukasz Golab, Divesh Srivastava: Scalable Informative Rule Mining. ICDE 2017: 437-448; presenter: Nour Halabi.
- Xu Chu, John Morcos, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Nan Tang, Yin Ye: KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing. SIGMOD Conference 2015: 1247-1261; presenter: Bahare Askari
- CASCON Conference
Nov 7th (Special Review Session)
- Mining CFD Rules (paper posted on Blackboard); presenter: Samantha Stahlke
- (Pager due Nov. 10th on this paper)
- Sridevi Baskaran, Alexander Keller, Fei Chiang, Lukasz Golab, Jaroslaw Szlichta: Efficient Discovery of Ontology Functional Dependencies. CIKM 2017: 1847-1856; presenter: Mike Valdron
- Xu Chu, Ihab F. Ilyas, Paolo Papotti: Holistic data cleaning: Putting violations into context. ICDE 2013: 458-469; presenter: Andrei Stoica
- Shazia Wasim Sadiq, Tamraparni Dasu, Xin Luna Dong, Juliana Freire, Ihab F. Ilyas, Sebastian Link, Renée J. Miller, Felix Naumann, Xiaofang Zhou, Divesh Srivastava: Data Quality: The Role of Empiricism. SIGMOD Record 46(4): 35-43 (2017); presenter: Nour Halabi
- Thorsten Papenbrock, Felix Naumann: A Hybrid Approach to Functional Dependency Discovery. SIGMOD Conference 2016: 821-833; presenter: Bahare Askari
- Stefano Ortona, Venkata Vamsikrishna Meduri, Paolo Papotti: RuDiK: Rule Discovery in Knowledge Bases. PVLDB 11(12): 1946-1949 (2018); presenter: Mike Valdron
- Patricia C. Arocena, Boris Glavic, Giansalvatore Mecca, Renée J. Miller, Paolo Papotti, Donatello Santoro: Messing Up with BART: Error Generation for Evaluating Data-Cleaning Algorithms. PVLDB 9(2): 36-47 (2015); presenter: Andrei Stoica
- Student project Presentations
A one pager is a one-page summary of the research paper assigned as reading for the lecture. One is due every two weeks. It should not be longer than one page of printed letter paper (10 pt font preferred). The one pager should address the following points:
- summarize the problem(s) addressed/solved by the research paper (1-2 sentences that clearly describe the problem: “The problem is … .”)
- briefly sketch the main ideas on which the solution of the problem is based
- briefly describe the research methodology of the paper (1-2 sentences)
- identify 3 strong points and 3 weak points of the paper
- summarize any assumptions the solution in the paper is based upon (any restrictions; divide these into stated and unstated assumptions)
- raise three non-trivial questions about the paper (including future work)
- other remarks (if any)
There are four marks for a one pager:
- 0 – nothing was handed in
- 1 – one pager is not detailed enough or too long
- 2 – one pager is good, all aspects covered
- 3 – one pager is exceptional, as it outlines further interesting points that go beyond the discussion in the paper
This course is project-based. You have to propose and carry out a project that investigates a clearly defined problem within the scope of data management and big data integration.
- one project per student
- independent exploration of specific problem
- project proposal, modeling, design, simulation, and analysis
- evaluation of project: timeliness, development and presentation of idea, i.e., in-class presentations (e.g., progress and final) and the proposal, progress, and final project reports.
- no more than 12 pages (LNCS format).
In the course project you should demonstrate the ability to do research by solving a well-defined problem. The emphasis of the project is on applying a solid research methodology from beginning to end. You will learn what a solid research methodology is by reading and analyzing various research papers throughout the course.
Submit a one-page proposal. The proposal should cover the following points:
- clear and concise problem statement: identify the problem (if all else fails, start with: “The problem is …”; 2 sentences or fewer)
- discuss the relevance: state why your problem is an important one and what impact a solution would have (i.e., what it would change)
- discuss why the problem is interesting, i.e., convince the reader that it is challenging to solve
- describe related approaches
- sketch your approach: say what you intend to do to solve the problem
- describe how you intend to solve the problem (i.e., implement and evaluate, model and simulate, define theorem and prove etc.)
- describe anticipated difficulties
Submit a one-page progress report. The report should:
- describe the problem you are working on (this may simply be a repeat of your proposal, unless you got feedback to refine your problem statement)
- describe your approach in more detail
- summarize accomplishments to date
- summarize next steps
- describe problems you encountered and how you plan to solve them
- the problem description with a motivation
- a quick overview of related work
- the proposed solution
- a technical description of the solution
- encountered difficulties
- future work and conclusion
For your final project write-up you must use the LNCS format. Do not write more than 12 pages in the given format. Your project report must be of “publishable quality”. This means the presentation should not include typos, should not contain many grammatical errors, etc. It does not mean that your paper must be ready for publication in a major conference (even though this would be a desirable future result).
- You have 12 pages to present your project.
- The final deadline for your project is the 7th of December.
- The point of these strict rules is to approximate a conference/journal/proposal/patent submission process. You will often have to deal with formats imposed on you by someone else.
- Secondly, everybody should be allowed to write the same amount, or less, to present their ideas.
The course mark is broken down as follows:
- 15% One pagers
- 30% Presentations and discussion leading
- 20% Participation and interactions (discussion, feedback, readings, ideas, etc.)
- 35% Course project (proposal, progress reports, presentation, final report)
- Pager 1: October 6th, midnight
- Project proposals: 20th of October, midnight
- Pager 2: 27th of October, midnight
- Pager 3: 8th of November
- Project progress report: 17th of Nov, midnight
- Pager 4: 1st of Dec, midnight
- Final Project presentations: last week of classes
- Final report due: 12th of Dec, midnight