This course is an instance of Advanced Topics in Information Science – CSCI 6720G. The focus of the course for 2016 is data management and big data integration.
Course Announcements and News
The first meeting is on Thursday, September 8th, at 2:10 pm in UB A3 2034.
New! Link to Mining Massive Datasets book (PDF version) and slides (PPT and PDF): http://www.mmds.org/
The following is a tentative outline of the course. Every student must read the assigned material and submit a one-pager on an assigned paper every two weeks.
- Course introduction and overview PDF
- First paper assignments
- Bringing Order to Big Data (presenter: Jarek Szlichta) Slides posted on Blackboard
- F. Chiang, R. Miller: A unified model for data and constraint repair. ICDE 2011: 446-457; presenter: Thomas Galati.
- Lukasz Golab, Howard J. Karloff, Flip Korn, Avishek Saha, Divesh Srivastava: Sequential Dependencies. PVLDB 2(1): 574-585 (2009); presenter: Brandon Laughlin.
- Manish Singh, Arnab Nandi, H. V. Jagadish: Skimmer: rapid scrolling of relational query results. SIGMOD Conference 2012: 181-192; presenter: Hrim Mehta
- Bongsug Chae. 2015. Insights from hashtag #supplychain and Twitter analytics: Considering Twitter and Twitter data for supply chain practice and research. International Journal of Production Economics 165: 247–259; presenter: Dennis Kappen. PDF version of the paper posted on Blackboard.
- S. Bergamaschi, E. Domnori, F. Guerra, R. T. Lado, and Y. Velegrakis, “Keyword search over relational databases: A metadata approach,” in Proc. of SIGMOD’11, 2011; presenter: Alexander Keller.
- Nikolay Yakovets, Jarek Gryz, Stephanie Hazlewood, Paul van Run: From MDM to DB2: A Case Study of Security Enforcement Migration. DBSec 2012: 207-222; presenter: Amit Maraj.
- Jeffrey Dean, Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004: 137-150; presenter: Neil Seward, link: https://www.usenix.org/legacy/events/osdi04/tech/full_papers/dean/dean.pdf
- Katsiaryna Mirylenka, Themis Palpanas, Graham Cormode, Divesh Srivastava: Finding interesting correlations with conditional heavy hitters. ICDE 2013: 1069-1080; presenter: Yanwen Zhou.
20th of October
- Chris Lewis, Noah Wardrip-Fruin: Mining game statistics from web services: a World of Warcraft armory case study. FDG 2010: 100-107; presenter: Brandon Drenikow.
- Lijun Chang, Wei Li, Xuemin Lin, Lu Qin, Wenjie Zhang: pSCAN: Fast and exact structural graph clustering. ICDE 2016: 253-264; presenter: Mahboubeh Ahmadalinezhad.
27th of October
- Reading week
3rd of November (30 min presentation + 30 min round robin)
- Lukasz Golab, Howard J. Karloff, Flip Korn, Barna Saha, Divesh Srivastava: Discovering Conservation Rules. ICDE 2012: 738-749; presenter: Brandon Laughlin.
- A. Drachen, R. Sifa, C. Bauckhage, and C. Thurau, “Guns, swords and data: Clustering of player behavior in computer games in the wild,” 2012 IEEE Conf. Comput. Intell. Games, CIG 2012, pp. 163–170, 2012: http://geneura.ugr.es/cig2012/papers/paper87.pdf; Presenter: Thomas Galati
- Filip Radlinski, Thorsten Joachims: Query chains: learning to rank from implicit feedback. KDD 2005: 239-248; presenter: Fang Zhang.
10th of November
- Marc A. Smith, Ben Shneiderman, Natasa Milic-Frayling, Eduarda Mendes Rodrigues, Vladimir Barash, Cody Dunne, Tony Capone, Adam Perer, Eric Gleave: Analyzing (social media) networks with NodeXL. C&T 2009: 255-264; presenter: Dennis Kappen.
- Nick Koudas, Avishek Saha, Divesh Srivastava, Suresh Venkatasubramanian: Metric Functional Dependencies. ICDE 2009: 1275-1278; presenter: Yanwen Zhou.
- Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Anthony, Hao Liu, Raghotham Murthy: Hive – a petabyte scale data warehouse using Hadoop. ICDE 2010: 996-1005; presenter: Neil Seward.
17th of November
- Christopher G. Healey, Brent M. Dennis: Interest Driven Navigation in Visualization. IEEE Trans. Vis. Comput. Graph. 18(10): 1744-1756 (2012); presenter: Hrim Mehta.
- Xu Chu, Ihab F. Ilyas, Paraschos Koutris: Distributed Data Deduplication. PVLDB 9(11): 864-875 (2016); presenter: Alexander Keller.
- Chu, X., Ilyas, I. F., & Papotti, P. (2013). Holistic data cleaning: Putting violations into context. 2013 IEEE 29th International Conference on Data Engineering (ICDE). doi:10.1109/icde.2013.6544847; presenter: Fang Zhang.
24th of November
- Stallkamp, Johannes, et al. “Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition.” Neural networks 32 (2012): 323-332; presenter: Amit Maraj.
- Lindsay Wells, Aran Cauchi-Saunders, Ian Lewis, Lorenzo Monsif, Benjamin Geelan, Kristy de Salas: Mining for Gold (and Platinum): PlayStation Network Data Mining. CHI PLAY 2016; presenter: Brandon Drenikow.
- Jiliang Tang, Shiyu Chang, Charu C. Aggarwal, Huan Liu: Negative Link Prediction in Social Media. WSDM 2015: 87-96; presenter: Mahboubeh Ahmadalinezhad.
A one-pager is a one-page summary of the research paper assigned as reading for the lecture. One one-pager must be submitted every two weeks. It should be no longer than one printed letter-size page (11 pt font at minimum; 12 pt preferred). The one-pager should address the following points:
- summarize the problem(s) addressed/solved by the research paper (1-2 sentences that clearly describe the problem: “The problem is … .”)
- briefly sketch the main ideas on which the solution of the problem is based
- briefly describe the research methodology of the paper (1-2 sentences)
- identify 3 strong points and 3 weak points of the paper
- summarize any assumptions or restrictions the solution in the paper is based upon (separate these into stated and unstated assumptions)
- raise three non-trivial questions about the paper (including future work)
- other remarks (if any)
There are four possible marks for a one-pager:
- 0 – nothing was handed in
- 1 – one pager is not detailed enough or too long
- 2 – one pager is good, all aspects covered
- 3 – one pager is exceptional, as it outlines further interesting points that go beyond the discussion in the paper
This course is project-based. You have to propose and carry out a project that investigates a clearly defined problem within the scope of data management and big data integration.
- one project per student
- independent exploration of specific problem
- implementation and performance measurement
- modeling, design, simulation, and analysis
- experimentation and evaluation
- evaluation of project: timeliness, development, and presentation of the idea, i.e., the in-class presentations (progress and final) and the proposal, progress, and final project reports
- no more than 12 pages (LNCS format).
In the course project you should demonstrate the ability to do research by solving a well-defined problem. The emphasis of the project is on applying a solid research methodology from beginning to end. You will learn what a solid research methodology is by reading and analyzing various research papers throughout the course.
Submit a one-page proposal. The proposal should cover the following points:
- clear and concise problem statement: identify the problem in 2 sentences or fewer (if all else fails, start with: “The problem is …”)
- discuss the relevance: state why your problem is an important one and what impact a solution would have (i.e., what it would change)
- discuss why the problem is interesting, i.e., convince the reader that it is challenging to solve
- describe related approaches
- sketch your approach: say what you intend to do to solve the problem
- describe how you intend to solve the problem (i.e., implement and evaluate, model and simulate, state a theorem and prove it, etc.)
- describe anticipated difficulties
Submit a one-page progress report. The report should:
- describe the problem you are working on (this may simply be a repeat of your proposal, unless you got feedback to refine your problem statement)
- describe your approach in more detail
- summarize accomplishments to date
- summarize next steps
- describe problems you encountered and how you plan to solve them
- the problem description with a motivation
- a quick overview of related work
- the proposed solution
- a technical description of the solution
- encountered difficulties
- an evaluation
- future work and conclusion
For your final project write-up you must use the LNCS format. Do not write more than 12 pages in this format. Your project report must be of “publishable quality”. This means the presentation should be free of typos and largely free of grammatical errors. It does not mean that your paper must be ready for publication in a major conference (even though this would be a desirable future result).
- You have 12 pages to present your project.
- No code should be attached to the project write-up.
- The final deadline for your project is 15th of December.
- The point of these strict rules is to approximate a conference/journal/proposal/patent submission process. You will often have to deal with formats imposed on you by someone else.
- Secondly, everybody should be allowed the same amount of space, or less, to present their ideas.
The course mark is broken down as follows:
- 15% One pagers
- 30% Presentations and discussion leading
- 20% Participation and interactions (discussion, feedback, readings, ideas, etc.)
- 35% Course project (proposal, progress reports, presentation, final report)
- Pager 1: 1st of October, midnight
- Project proposals: 14th of October, midnight
- Pager 2: 15th of October, midnight
- Pager 3: 5th of November
- Project report: 19th of Nov, midnight
- Final presentations: last week of classes
- Final report due: 17th of Dec