Data Mining

The lecture gives an overview of knowledge discovery from (structured) data. Topics include, among others:

Preprocessing techniques
OLAP analysis & data warehousing
Clustering (k-means, k-medoids, DBSCAN, OPTICS)
Classification (k-nearest neighbor, Bayes, decision tree, support vector machine; bagging, boosting, e.g. Random Forest, AdaBoost)
Regression analysis (linear regression, logistic regression)
Association rule learning (Apriori, FP-Growth)
Introduction to deep learning

Organizational matters

Please note:

Due to the current circumstances, the Data Mining lecture is being moved to online teaching.
You can find all current information about the course on WueCampus2.
Please enroll in WueCampus2 early via the link above so that you get access to the course and also receive e-mails with important announcements.

Lecture
The lecture is held on Mondays, 10:15 - 11:45, via ZOOM.
Links for joining each lecture session will be provided in WueCampus2 in good time. Please make sure that ZOOM works on your system (see below) so that you can join the first session on April 20 without problems.
In the first session we will spend some time testing the technical setup to make sure the upcoming lectures run smoothly, and we will share all important organizational information for the semester.
Exercises
Thu, 14:15 - 15:45
Thu, 16:15 - 17:45
Fri, 14:15 - 15:45
The exercise sessions will also take place via ZOOM. For better supervision, please make sure you have a camera available (via your phone if necessary). As usual, the exact exercise format will be announced during the first session.
Exam
There will be an exam at the end of the semester. Its form, procedure, and exact date still have to be worked out and will be announced via WueCampus2 as soon as possible.
ZOOM
ZOOM requires a preinstalled client. No account is needed to participate. ZOOM is also available on Android and iOS.
(ZOOM also works in the browser. However, we cannot recommend this option because of its poorer performance.)
Current details will be shared in the announcement forum of the WueCampus2 course.
Stay healthy!

Literature

Knowledge Discovery in Databases: Techniken und Anwendungen. Ester, Martin; Sander, Jörg. 1st ed. Springer Berlin Heidelberg, 2000.
@book{ester2000knowledge, author = {Ester, Martin and Sander, Jörg}, edition = 1, keywords = {classification}, publisher = {Springer Berlin Heidelberg}, title = {Knowledge Discovery in Databases: Techniken und Anwendungen}, year = 2000 }
CRISP-DM 1.0 Step-by-step data mining guide. Chapman, Pete; Clinton, Julian; Kerber, Randy; Khabaza, Thomas; Reinartz, Thomas; Shearer, Colin; Wirth, Rudiger. The CRISP-DM consortium, 2000.
@techreport{crisp, author = {Chapman, Pete and Clinton, Julian and Kerber, Randy and Khabaza, Thomas and Reinartz, Thomas and Shearer, Colin and Wirth, Rudiger}, institution = {The CRISP-DM consortium}, keywords = {from:nosebrain}, month = {08}, title = {CRISP-DM 1.0 Step-by-step data mining guide}, year = 2000 }
Advances in Knowledge Discovery and Data Mining. Fayyad, Usama M.; Piatetsky-Shapiro, Gregory; Smyth, Padhraic; Uthurusamy, Ramasamy (eds.). AAAI/MIT Press, 1996.
@book{1996advances, editor = {Fayyad, Usama M. and Piatetsky-Shapiro, Gregory and Smyth, Padhraic and Uthurusamy, Ramasamy}, keywords = {from:nosebrain}, publisher = {AAAI/MIT Press}, title = {Advances in Knowledge Discovery and Data Mining}, year = 1996 }

Further literature for the lecture

Sequential minimal optimization: A fast algorithm for training support vector machines. Platt, J. 1998.
@misc{platt1998sequential, abstract = {This paper proposes a new algorithm for training support vector machines: Sequential Minimal Optimization, or SMO. Training a support vector machine requires the solution of a very large quadratic programming (QP) optimization problem. SMO breaks this large QP problem into a series of smallest possible QP problems. These small QP problems are solved analytically, which avoids using a time-consuming numerical QP optimization as an inner loop. The amount of memory required for SMO is linear in...}, author = {Platt, J.}, keywords = {from:nosebrain}, title = {Sequential minimal optimization: A fast algorithm for training support vector machines}, year = 1998 }
OPTICS: Ordering Points To Identify the Clustering Structure. Ankerst, Mihael; Breunig, Markus M.; Kriegel, Hans-Peter; Sander, Jörg. pp. 49–60. ACM Press, 1999.
@inproceedings{ankerst1999optics, abstract = {Cluster analysis is a primary method for database mining. It is either used as a stand-alone tool to get insight into the distribution of a data set, e.g. to focus further analysis and data processing, or as a preprocessing step for other algorithms operating on the detected clusters. Almost all of the well-known clustering algorithms require input parameters which are hard to determine but have a significant influence on the clustering result. Furthermore, for many real-data sets there does not even exist a global parameter setting for which the result of the clustering algorithm describes the intrinsic clustering structure accurately. We introduce a new algorithm for the purpose of cluster analysis which does not produce a clustering of a data set explicitly; but instead creates an augmented ordering of the database representing its density-based clustering structure. This cluster-ordering contains information which is equivalent to the density-based clusterings corresponding to a broad range of parameter settings. It is a versatile basis for both automatic and interactive cluster analysis. We show how to automatically and efficiently extract not only ‘traditional’ clustering information (e.g. representative points, arbitrary shaped clusters), but also the intrinsic clustering structure. For medium sized data sets, the cluster-ordering can be represented graphically and for very large data sets, we introduce an appropriate visualization technique. Both are suitable for interactive exploration of the intrinsic clustering structure offering additional insights into the distribution and correlation of the data.}, author = {Ankerst, Mihael and Breunig, Markus M. and Kriegel, Hans-Peter and Sander, Jörg}, keywords = {optics}, pages = {49–60}, publisher = {ACM Press}, title = {OPTICS: Ordering Points To Identify the Clustering Structure}, year = 1999 }
On End-to-End Program Generation from User Intention by Deep Neural Networks. Mou, Lili; Men, Rui; Li, Ge; Zhang, Lu; Jin, Zhi. In CoRR, abs/1510.07211. 2015.
@article{mou2015endtoend, author = {Mou, Lili and Men, Rui and Li, Ge and Zhang, Lu and Jin, Zhi}, journal = {CoRR}, keywords = {generation}, title = {On End-to-End Program Generation from User Intention by Deep Neural Networks.}, volume = {abs/1510.07211}, year = 2015 }
Mining Frequent Patterns without Candidate Generation. Han, Jiawei; Pei, Jian; Yin, Yiwen. In SIGMOD Conference, W. Chen, J. F. Naughton, P. A. Bernstein (eds.), pp. 1–12. ACM, 2000.
@inproceedings{han2000mining, author = {Han, Jiawei and Pei, Jian and Yin, Yiwen}, booktitle = {SIGMOD Conference}, crossref = {conf/sigmod/2000}, editor = {Chen, Weidong and Naughton, Jeffrey F. and Bernstein, Philip A.}, keywords = {from:nosebrain}, note = {SIGMOD Record 29(2), June 2000}, pages = {1-12}, publisher = {ACM}, title = {Mining Frequent Patterns without Candidate Generation}, year = 2000 }
Maximum likelihood from incomplete data via the EM algorithm. Dempster, A. P.; Laird, N. M.; Rubin, D. B. In Journal of the Royal Statistical Society: Series B, 39, pp. 1–38. 1977.
@article{dempster1977maximum, author = {Dempster, A. P. and Laird, N. M. and Rubin, D. B.}, journal = {Journal of the Royal Statistical Society: Series B}, keywords = {from:nosebrain}, pages = {1-38}, title = {Maximum likelihood from incomplete data via the {EM} algorithm}, volume = 39, year = 1977 }
Experiments with a New Boosting Algorithm. Freund, Yoav; Schapire, Robert E. In International Conference on Machine Learning, pp. 148–156. 1996.
@inproceedings{freund1996experiments, author = {Freund, Yoav and Schapire, Robert E.}, booktitle = {International Conference on Machine Learning}, keywords = {from:nosebrain}, pages = {148-156}, title = {Experiments with a New Boosting Algorithm}, year = 1996 }
Experimental evidence of massive-scale emotional contagion through social networks. Kramer, Adam D. I.; Guillory, Jamie E.; Hancock, Jeffrey T. In Proceedings of the National Academy of Sciences, 111(24), pp. 8788–8790. 2014.
@article{kramer2014experimental, abstract = {Emotional states can be transferred to others via emotional contagion, leading people to experience the same emotions without their awareness. Emotional contagion is well established in laboratory experiments, with people transferring positive and negative emotions to others. Data from a large real-world social network, collected over a 20-y period suggests that longer-lasting moods (e.g., depression, happiness) can be transferred through networks [Fowler JH, Christakis NA (2008) BMJ 337:a2338], although the results are controversial. In an experiment with people who use Facebook, we test whether emotional contagion occurs outside of in-person interaction between individuals by reducing the amount of emotional content in the News Feed. When positive expressions were reduced, people produced fewer positive posts and more negative posts; when negative expressions were reduced, the opposite pattern occurred. These results indicate that emotions expressed by others on Facebook influence our own emotions, constituting experimental evidence for massive-scale contagion via social networks. This work also suggests that, in contrast to prevailing assumptions, in-person interaction and nonverbal cues are not strictly necessary for emotional contagion, and that the observation of others’ positive experiences constitutes a positive experience for people.}, author = {Kramer, Adam D. I. and Guillory, Jamie E. and Hancock, Jeffrey T.}, journal = {Proceedings of the National Academy of Sciences}, keywords = {contagion}, number = 24, pages = {8788-8790}, title = {Experimental evidence of massive-scale emotional contagion through social networks}, volume = 111, year = 2014 }
Data Science and Prediction. Dhar, Vasant. In Commun. ACM, 56(12), pp. 64–73. ACM, New York, NY, USA, 2013.
@article{dhar2013science, abstract = {Big data promises automated actionable knowledge creation and predictive models for use by both humans and computers.}, address = {New York, NY, USA}, author = {Dhar, Vasant}, journal = {Commun. ACM}, keywords = {from:nosebrain}, month = 12, number = 12, pages = {64–73}, publisher = {ACM}, title = {Data Science and Prediction}, volume = 56, year = 2013 }
Data Science and its Relationship to Big Data and Data-Driven Decision Making. Provost, Foster; Fawcett, Tom. In Big Data, 1(1), pp. 51–59. Mary Ann Liebert Inc, 2013.
@article{provost2013science, author = {Provost, Foster and Fawcett, Tom}, journal = {Big Data}, keywords = {from:nosebrain}, month = {03}, number = 1, pages = {51–59}, publisher = {Mary Ann Liebert Inc}, title = {Data Science and its Relationship to Big Data and Data-Driven Decision Making}, volume = 1, year = 2013 }
Clustering by means of medoids. Kaufman, Leonard; Rousseeuw, Peter J. In Y. Dodge (ed.), pp. 405–416. North Holland / Elsevier, Amsterdam, 1987.
@misc{kaufmanl1987clustering, author = {Kaufman, Leonard and Rousseeuw, Peter J.}, editor = {Dodge, Y.}, keywords = {from:nosebrain}, pages = {405–416}, publisher = {North Holland / Elsevier}, title = {Clustering by means of medoids}, year = 1987 }
Bagging, Boosting, and C4.5. Quinlan, J. Ross. In AAAI/IAAI, Vol. 1, W. J. Clancey, D. S. Weld (eds.), pp. 725–730. AAAI Press / The MIT Press, 1996.
@inproceedings{quinlan1996bagging, author = {Quinlan, J. Ross}, booktitle = {AAAI/IAAI, Vol. 1}, crossref = {conf/aaai/1996-1}, editor = {Clancey, William J. and Weld, Daniel S.}, keywords = {from:nosebrain}, pages = {725-730}, publisher = {AAAI Press / The MIT Press}, title = {Bagging, Boosting, and C4.5}, year = 1996 }
Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. Agrawal, Rakesh; Gehrke, Johannes; Gunopulos, Dimitrios; Raghavan, Prabhakar. In Proceedings of the ACM SIGMOD Int’l Conference on Management of Data, Seattle, Washington, pp. 94–105. ACM Press, 1998.
@inproceedings{agrawal-98, author = {Agrawal, Rakesh and Gehrke, Johannes and Gunopulos, Dimitrios and Raghavan, Prabhakar}, booktitle = {Proceedings of the ACM SIGMOD Int'l Conference on Management of Data, Seattle, Washington}, keywords = {from:nosebrain}, month = {06}, pages = {94–105}, publisher = {ACM Press}, title = {Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications}, year = 1998 }
A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Ester, Martin; Kriegel, Hans-Peter; Sander, Jörg; Xu, Xiaowei. In Proc. of 2nd International Conference on Knowledge Discovery and, pp. 226–231. 1996.
@inproceedings{ester1996densitybased, author = {Ester, Martin and Kriegel, Hans-Peter and Sander, Jörg and Xu, Xiaowei}, booktitle = {Proc. of 2nd International Conference on Knowledge Discovery and}, keywords = {from:nosebrain}, pages = {226-231}, title = {A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise}, year = 1996 }
