Dr. Marian-Andrei Rizoiu is a lecturer at the University of Technology Sydney, where he leads the Behavioral Data Science group, which studies the dynamics of human attention in the online environment. He is interested in stochastic behavioural modelling of human actions online, at the intersection of applied statistics, artificial intelligence and social data science. His research has made several key contributions, particularly to the areas of online popularity prediction and online privacy. For the past four years, he has been developing theoretical models for online information diffusion, which can account for complex social phenomena, such as the rise and fall of online popularity, the spread of misinformation or the adoption of disruptive technologies. He has approached questions such as "Why did X become popular, but not Y?" and "How can items be promoted?", with implications for advertising and marketing. Marian-Andrei has also worked on detecting the evolution of privacy loss over time. His research has shown that privacy "leaks" over time, and it has identified the factors causing the loss: the individual's own actions and the environment. The conclusions were staggering: privacy continues to decrease even for users who have retired from activity.
Marian-Andrei's research has an interdisciplinary focus. He has led two research grants: the first on quantifying the social influence of automatic diffusion systems in the electoral process (with social scientists), and the second on detecting hate speech for the early prediction of mass atrocities and genocides (with political scientists).
Marian-Andrei has published in the most selective venues in the fields of Data Science and Web Research, such as the International World Wide Web Conference (WWW), the Conference on Web Search and Data Mining (WSDM), the International Conference on Web and Social Media (ICWSM), and the Conference on Information and Knowledge Management (CIKM). He serves as a PC member for prestigious conferences and journals, such as AAAI, WWW and ICWSM, and the Journal of Machine Learning Research. His work has received significant media attention, including from the Wikimedia Foundation for the work concerning the privacy of Wikipedia editors (which featured in the March 2016 Wikimedia Research Showcase). See more at http://www.rizoiu.eu
Media attention. Marian-Andrei's work has received significant media attention, among which:
- Both the Business Insider and the ANU Reporter wrote about our findings concerning the bot influence in the 2016 US elections.
- I presented my findings concerning the privacy of Wikipedia editors to the Wikimedia Foundation (the legal entity that handles and represents Wikipedia), in the March 2016 edition of the Wikimedia Research Showcase. The showcase was live streamed on YouTube and reached an international audience of both researchers and the general public.
- My Wikipedia privacy work was featured in ANU’s news media outlet.
- My work on social media popularity was covered by the ANU Reporter and NCI News.
Can supervise: YES
- Machine Learning for social media;
- Big Social Data Science: algorithms and applications;
- influence, polarisation, radicalisation through the prism of online social media;
- spatio-temporal information diffusion;
- (technical) stochastic point process modelling, epidemic models, Bayesian learning.
See here for the complete list of courses taught and student projects.
Teaching. I hold a pedagogical degree in higher education and have 10 years of teaching experience. Overall, I have delivered more than 600 hours of lectures and tutoring for Undergraduate, Masters and Honours students, and I have lectured in international degree programs of excellence, such as the Masters Erasmus Mundus Excellence DMKM and the Franco-Ukrainian Masters IDSM (a cooperation between the University Lumiere Lyon and the University of Kharkov, Ukraine).
Supervision completion. More than 45 students: 4 PhD students, 2 RA/postdoc, 1 visiting postgraduate student, 5 Honours (Masters by research) students, 4 summer scholar students, more than 30 coursework masters students. See here for the complete list of alumni students and their projects.
Teaching quality. For the past four years, I obtained high evaluations in ANU’s official Student Experience of Learning and Teaching (SELT) (see attached 2017 SELT evaluation of my teaching).
Diverse teaching. I have taught a wide range of CS subjects (Programming, Calculus, Networking, Algorithm Design), Machine Learning and Data Mining subjects (association rule mining, decision trees, clustering, symbolic learning, ensemble methods) and Social Media Analysis. This document details the complete list of these courses.
Kim, D, Graham, T, Wan, Z & Rizoiu, M-A 2019, 'Analysing user identity via time-sensitive semantic edit distance (t-SED): a case study of Russian trolls on Twitter', Journal of Computational Social Science, vol. 2, no. 2, pp. 331-351.
Kim, YM, Velcin, J, Bonnevay, S & Rizoiu, MA 2015, 'Temporal multinomial mixture for instance-oriented evolutionary clustering', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 9022, pp. 593-604.
© Springer International Publishing Switzerland 2015. Evolutionary clustering aims at capturing the temporal evolution of clusters. This issue is particularly important in the context of social media data, which are naturally temporally driven. In this paper, we propose a new probabilistic model-based evolutionary clustering technique. The Temporal Multinomial Mixture (TMM) is an extension of the classical mixture model that optimizes feature co-occurrences in a trade-off with temporal smoothness. Our model is evaluated on two recent case studies on opinion aggregation over time. We compare four different probabilistic clustering models and we show the superiority of our proposal in the task of instance-oriented clustering.
Rizoiu, M-A, Guille, A & Velcin, J 2015, 'CommentWatcher: An Open Source Web-based platform for analyzing discussions on web forums.', CoRR, vol. abs/1504.07459.
Rizoiu, M-A, Velcin, J & Lallich, S 2015, 'Semantic-enriched visual vocabulary construction in a weakly supervised context', Intelligent Data Analysis, vol. 19, no. 1, pp. 161-185.
Rizoiu, M-A, Velcin, J & Lallich, S 2014, 'How to Use Temporal-Driven Constrained Clustering to Detect Typical Evolutions.', International Journal on Artificial Intelligence Tools, vol. 23.
Rizoiu, M-A, Velcin, J & Lallich, S 2013, 'Unsupervised feature construction for improving data representation and semantics', Journal of Intelligent Information Systems, vol. 40, no. 3, pp. 501-527.
Muşat, C, Trǎuşan-Matu, S, Velcin, J & Rizoiu, MA 2012, 'Automatic extraction of conceptual labels from topic models', UPB Scientific Bulletin, Series C: Electrical Engineering, vol. 74, no. 2, pp. 57-68.
This work outlines a novel system that automatically extracts conceptual labels for statistically obtained topics. By creating a projection of the topic, which is a distribution over all the vocabulary words, over the WordNet ontology we succeed in associating concepts to the said groups of words. The most important contributions of this paper are connected to the validation of the role of these concepts as topical labels and the determination of correlations that emerge between the utility of these labels and the strength of the relation between the concepts and the topics.
Dawson, N, Rizoiu, M-A, Johnston, B & Williams, M-A, 'Adaptively selecting occupations to detect skill shortages from online job ads'.
This research develops a data-driven method to generate sets of highly
similar skills based on a set of seed skills using online job advertisements
(ads) data. This provides researchers with a novel method to adaptively select
occupations based on granular skills data. We apply this adaptive skills
similarity technique to a dataset of over 6.7 million Australian job ads in
order to identify occupations with the highest proportions of Data Science and
Analytics (DSA) skills. This uncovers 306,577 DSA job ads across 23
occupational classes from 2012-2019. We then propose five variables for
detecting skill shortages from online job ads: (1) posting frequency; (2)
salary levels; (3) education requirements; (4) experience demands; and (5) job
ad posting predictability. This contributes further evidence to the goal of
detecting skills shortages in real-time. In conducting this analysis, we also
find strong evidence of skills shortages in Australia for highly technical DSA
skills and occupations. These results provide insights to Data Science
researchers, educators, and policy-makers from other advanced economies about
the types of skills that should be cultivated to meet growing DSA labour
demands in the future.
Kong, Q, Rizoiu, M-A & Xie, L, 'Modeling Information Cascades with Self-exciting Processes via Generalized Epidemic Models'.
Epidemic models and self-exciting processes are two types of models used to
describe information diffusion. These models were originally developed in
different scientific communities, and their commonalities are under-explored.
This work establishes, for the first time, a general connection between the two
model classes via three new mathematical components. The first is a generalized
version of stochastic Susceptible-Infected-Recovered (SIR) model with arbitrary
recovery time distributions; the second is the relationship between the (latent
and arbitrary) recovery time distribution, recovery hazard function, and the
infection kernel of self-exciting processes; the third includes methods for
simulating, fitting, evaluating and predicting the generalized process with any
recovery time distribution. On three large Twitter diffusion datasets, we
conduct goodness-of-fit tests and holdout log-likelihood evaluation of
self-exciting processes with three infection kernels --- exponential, power-law
and Tsallis Q-exponential. We show that the modeling performance of the
infection kernels varies with respect to the temporal structures of diffusions,
and also with respect to user behavior, such as the likelihood of being bots.
We further improve the prediction of popularity by combining two models that
are identified as complementary by the goodness-of-fit tests.
Congestion prediction represents a major priority for traffic management
centres around the world to ensure timely incident response handling. The
increasing amounts of generated traffic data have been used to train machine
learning predictors for traffic, however this is a challenging task due to
inter-dependencies of traffic flow both in time and space. Recently, deep
learning techniques have shown significant prediction improvements over
traditional models, however open questions remain around their applicability,
accuracy and parameter tuning. This paper proposes an advanced deep learning
framework for simultaneously predicting the traffic flow on a large number of
monitoring stations along a highly circulated motorway in Sydney, Australia,
including exit and entry loop count stations, and over varying training and
prediction time horizons. The spatial and temporal features extracted from the
36.34 million data points are used in various deep learning architectures that
exploit their spatial structure (convolutional neural networks), their
temporal dynamics (recurrent neural networks), or both through a hybrid
spatio-temporal modelling (CNN-LSTM). We show that our deep learning models
consistently outperform traditional methods, and we conduct a comparative
analysis of the optimal time horizon of historical data required to predict
traffic flow at different time points in the future.
Wu, S, Rizoiu, M-A & Xie, L, 'Estimating Attention Flow in Online Video Networks'.
Online videos have shown a tremendous increase in Internet traffic. Most video
hosting sites implement recommender systems, which connect the videos into a
directed network and conceptually act as a source of pathways for users to
navigate. At present, little is known about how human attention is allocated
over such large-scale networks, and about the impacts of the recommender
systems. In this paper, we first construct the Vevo network -- a YouTube video
network with 60,740 music videos interconnected by the recommendation links,
and we collect their associated viewing dynamics. This results in a total of
310 million views every day over a period of 9 weeks. Next, we present
large-scale measurements that connect the structure of the recommendation
network and the video attention dynamics. We use the bow-tie structure to
characterize the Vevo network and we find that its core component (23.1% of the
videos), which occupies most of the attention (82.6% of the views), is made out
of videos that are mainly recommended among themselves. This is indicative of
the links between video recommendation and the inequality of attention
allocation. Finally, we address the task of estimating the attention flow in
the video recommendation network. We propose a model that accounts for the
network effects for predicting video popularity, and we show it consistently
outperforms the baselines. This model also identifies a group of artists
gaining attention because of the recommendation network. Altogether, our
observations and our models provide a new set of tools to better understand the
impacts of recommender systems on collective social attention.
Zhang, R, Walder, C & Rizoiu, M-A, 'Sparse Gaussian Process Modulated Hawkes Process'.
The Hawkes process has been widely applied to modeling self-exciting events,
including neuron spikes, earthquakes and tweets. To avoid designing parametric
kernel functions and to be able to quantify the prediction confidence,
non-parametric Bayesian Hawkes processes have been proposed. However the
inference of such models suffers from unscalability or slow convergence. In
this paper, we first propose a new non-parametric Bayesian Hawkes process whose
triggering kernel is modeled as a squared sparse Gaussian process. Second, we
present the variational inference scheme for the model optimization, which has
the advantage of linear time complexity by leveraging the stationarity of the
triggering kernel. Third, we contribute a tighter lower bound than the evidence
lower bound of the marginal likelihood for the model selection. Finally, we
exploit synthetic data and large-scale social media data to validate the
efficiency of our method and the practical utility of our approximate marginal
likelihood. We show that our approach outperforms state-of-the-art
non-parametric Bayesian and non-Bayesian methods.
In this paper, we develop an efficient nonparametric Bayesian estimation of
the kernel function of Hawkes processes. The non-parametric Bayesian approach
is important because it provides flexible Hawkes kernels and quantifies their
uncertainty. Our method is based on the cluster representation of Hawkes
processes. Utilizing the stationarity of the Hawkes process, we efficiently
sample random branching structures and thus, we split the Hawkes process into
clusters of Poisson processes. We derive two algorithms -- a block Gibbs
sampler and a maximum a posteriori estimator based on expectation maximization
-- and we show that our methods have a linear time complexity, both
theoretically and empirically. On synthetic data, we show our methods to be
able to infer flexible Hawkes triggering kernels. On two large-scale Twitter
diffusion datasets, we show that our methods outperform the current
state-of-the-art in goodness-of-fit and that the time complexity is linear in
the size of the dataset. We also observe that on diffusions related to online
videos, the learned kernels reflect the perceived longevity for different
content types such as music or pet videos.
Rizoiu, MA & Velcin, J 2011, 'Topic extraction for ontology learning' in Ontology Learning and Knowledge Discovery Using the Web: Challenges and Recent Advances, pp. 38-60.
This chapter addresses the issue of topic extraction from text corpora for ontology learning. The first part provides an overview of some of the most significant solutions present today in the literature. These solutions deal mainly with the inferior layers of the Ontology Learning Layer Cake. They are related to the challenges of the Terms and Synonyms layers. The second part shows how these pieces can be bound together into an integrated system for extracting meaningful topics. While the extracted topics are not proper concepts as yet, they constitute a convincing approach towards concept building and therefore ontology learning. This chapter concludes by discussing the research undertaken for filling the gap between topics and concepts as well as perspectives that emerge today in the area of topic extraction. © 2011, IGI Global.
Mihăiţă, AS, Liu, Z, Rizoiu, MA & Cai, C 2019, 'Arterial incident duration prediction using a bi-level framework of extreme gradient-tree boosting', ITS World Congress 2019 (ITSWC2019), Singapore.
Kong, Q, Rizoiu, M-A, Wu, S & Xie, L 2018, 'Will This Video Go Viral', Companion of The Web Conference 2018 (WWW '18), ACM Press.
Mishra, S, Rizoiu, MA & Xie, L 2018, 'Modeling popularity in asynchronous social media streams with recurrent neural networks', 12th International AAAI Conference on Web and Social Media, ICWSM 2018, International AAAI Conference on Web and Social Media, AAAI, Stanford, USA, pp. 201-210.
Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. Understanding and predicting the popularity of online items is an important open problem in social media analysis. Considerable progress has been made recently in data-driven predictions, and in linking popularity to external promotions. However, the existing methods typically focus on a single source of external influence, whereas for many types of online content such as YouTube videos or news articles, attention is driven by multiple heterogeneous sources simultaneously - e.g. microblogs or traditional media coverage. Here, we propose RNN-MAS, a recurrent neural network for modeling asynchronous streams. It is a sequence generator that connects multiple streams of different granularity via joint inference. We show RNN-MAS not only outperforms the current state-of-the-art YouTube popularity prediction system by 17%, but also captures complex dynamics, such as seasonal trends of unseen influence. We define two new metrics: the promotion score quantifies the gain in popularity from one unit of promotion for a YouTube video; the loudness level captures the effects of a particular user tweeting about the video. We use the loudness level to compare the effects of a video being promoted by a single highly-followed user (in the top 1% most followed users) against being promoted by a group of mid-followed users. We find that results depend on the type of content being promoted: superusers are more successful in promoting Howto and Gaming videos, whereas the cohort of regular users is more influential for Activism videos. This work provides more accurate and explainable popularity predictions, as well as computational tools for content producers and marketers to allocate resources for promotion campaigns.
Rizoiu, MA, Graham, T, Zhang, R, Zhang, Y, Ackland, R & Xie, L 2018, 'DEBATENIGHT: The role and influence of socialbots on twitter during the first 2016 U.S. presidential debate', 12th International AAAI Conference on Web and Social Media, ICWSM 2018, International AAAI Conference on Web and Social Media, AAAI, Stanford, USA, pp. 300-309.
Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. Serious concerns have been raised about the role of 'socialbots' in manipulating public opinion and influencing the outcome of elections by retweeting partisan content to increase its reach. Here we analyze the role and influence of socialbots on Twitter by determining how they contribute to retweet diffusions. We collect a large dataset of tweets during the 1st U.S. presidential debate in 2016 and we analyze its 1.5 million users from three perspectives: user influence, political behavior (partisanship and engagement) and botness. First, we define a measure of user influence based on the user's active contributions to information diffusions, i.e. their tweets and retweets. Given that Twitter does not expose the retweet structure - it associates all retweets with the original tweet - we model the latent diffusion structure using only tweet time and user features, and we implement a scalable novel approach to estimate influence over all possible unfoldings. Next, we use partisan hashtag analysis to quantify user political polarization and engagement. Finally, we use the BotOrNot API to measure user botness (the likelihood of being a bot). We build a two-dimensional 'polarization map' that allows for a nuanced analysis of the interplay between botness, partisanship and influence. We find that not only are socialbots more active on Twitter - starting more retweet cascades and retweeting more - but they are 2.5 times more influential than humans, and more politically engaged. Moreover, pro-Republican bots are both more influential and more politically engaged than their pro-Democrat counterparts. However we caution against blanket statements that software designed to appear human dominates politics-related activity on Twitter. Firstly, it is known that accounts controlled by teams of humans (e.g. organizational accounts) are often identified as bots. Seco...
Rizoiu, M-A, Mishra, S, Kong, Q, Carman, M & Xie, L 2018, 'SIR-Hawkes: Linking Epidemic Models and Hawkes Processes to Model Diffusions in Finite Populations', Web Conference 2018: Proceedings of the World Wide Web Conference (WWW2018), 27th World Wide Web (WWW) Conference, Association for Computing Machinery, Lyon, France, pp. 419-428.
Wu, S, Rizoiu, MA & Xie, L 2018, 'Beyond views: Measuring and predicting engagement in online videos', 12th International AAAI Conference on Web and Social Media, ICWSM 2018, International AAAI Conference on Web and Social Media, AAAI, Stanford, USA, pp. 434-443.
Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. The share of videos in internet traffic has been growing, and understanding how videos capture attention on a global scale is therefore also of growing importance. Most current research focuses on modeling the number of views, but we argue that video engagement, or time spent watching, is a more appropriate measure for resource allocation problems in attention, networking, and promotion activities. In this paper, we present a first large-scale measurement of video-level aggregate engagement from publicly available data streams, on a collection of 5.3 million YouTube videos published over two months in 2016. We study a set of metrics including time and the average percentage of a video watched. We define a new metric, relative engagement, that is calibrated against video properties and strongly correlates with recognized notions of quality. Moreover, we find that engagement measures of a video are stable over time, thus separating the concerns for modeling engagement and those for popularity - the latter is known to be unstable over time and driven by external promotions. We also find engagement metrics predictable from a cold-start setup, with most of their variance explained by video context, topics and channel information - R²=0.77. Our observations imply several prospective uses of engagement metrics - choosing engaging topics for video production, or promoting engaging videos in recommender systems.
Rizoiu, MA & Xie, L 2017, 'Online popularity under promotion: Viral potential, forecasting, and the economics of time', Proceedings of the 11th International Conference on Web and Social Media, ICWSM 2017, pp. 182-191.
© Copyright 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. Modeling the popularity dynamics of an online item is an important open problem in computational social science. This paper presents an in-depth study of popularity dynamics under external promotions, especially in predicting popularity jumps of online videos, and determining effective and efficient schedules to promote online content. The recently proposed Hawkes Intensity Process (HIP) models popularity as a non-linear interplay between exogenous stimuli and the endogenous reactions. Here, we propose two novel metrics based on HIP: to describe popularity gain per unit of promotion, and to quantify the time it takes for such effects to unfold. We make increasingly accurate forecasts of future popularity by including information about the intrinsic properties of the video, promotions it receives, and the non-linear effects of popularity ranking. We illustrate by simulation the interplay between the unfolding of popularity over time, and the time-sensitive value of resources. Lastly, our model lends a novel explanation of the commonly adopted periodic and constant promotion strategy in advertising, as increasing the perceived viral potential. This study provides quantitative guidelines about setting promotion schedules considering content virality, timing, and economics.
Rizoiu, M-A, Xie, L, Sanner, S, Cebrian, M, Yu, H & Van Hentenryck, P 2017, 'Expecting to be HIP: Hawkes Intensity Processes for Social Media Popularity', Proceedings of the 26th International Conference on World Wide Web (WWW'17), 26th International Conference on World Wide Web (WWW), Association for Computing Machinery, Perth, Australia, pp. 735-744.
Mishra, S, Rizoiu, M-A & Xie, L 2016, 'Feature Driven and Point Process Approaches for Popularity Prediction', CIKM'16: Proceedings of the 2016 ACM Conference on Information and Knowledge Management, 25th ACM International Conference on Information and Knowledge Management (CIKM), Association for Computing Machinery, IUPUI, Indianapolis, IN, pp. 1069-1078.
Rizoiu, M-A, Velcin, J, Bonnevay, S & Lallich, S 2016, 'ClusPath: a temporal-driven clustering to infer typical evolution paths', Data Mining and Knowledge Discovery, European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECMLPKDD), Springer, Riva del Garda, Italy, pp. 1324-1349.
Rizoiu, M-A, Xie, L, Caetano, T & Cebrian, M 2016, 'Evolution of Privacy Loss in Wikipedia', Proceedings of the Ninth ACM International Conference on Web Search and Data Mining (WSDM'16), 9th Annual ACM International Conference on Web Search and Data Mining (WSDM), Association for Computing Machinery, San Francisco, CA, pp. 215-224.
Rizoiu, M-A, Velcin, J & Lallich, S 2012, 'How to Use Temporal-Driven Constrained Clustering to Detect Typical Evolutions', International Journal on Artificial Intelligence Tools, IEEE 24th International Conference on Tools with Artificial Intelligence (ICTAI), World Scientific Publ Co Pte Ltd, Athens, Greece.
Rizoiu, MA 2013, 'Semi-supervised structuring of complex data', IJCAI International Joint Conference on Artificial Intelligence, pp. 3239-3240.
The objective of the thesis is to explore how complex data can be treated using unsupervised machine learning techniques, in which additional information is injected to guide the exploratory process. Starting from specific problems, our contributions take into account the different dimensions of the complex data: their nature (image, text), the additional information attached to the data (labels, structure, concept ontologies) and the temporal dimension. Special attention is given to data representation and how additional information can be leveraged to improve this representation.
Rizoiu, M-A, Velcin, J & Lallich, S 2012, 'Structuring typical evolutions using Temporal-Driven Constrained Clustering', 2012 IEEE 24th International Conference on Tools with Artificial Intelligence (ICTAI 2012), vol. 1, IEEE 24th International Conference on Tools with Artificial Intelligence (ICTAI), IEEE, Athens, Greece, pp. 610-617.
We propose a system which employs conceptual knowledge to improve topic models by removing unrelated words from the simplified topic description. We use WordNet to detect which topical words are not conceptually similar to the others and then test our assumptions against human judgment. Results obtained on two different corpora in different test conditions show that the words detected as unrelated had a much greater probability than the others to be chosen by human evaluators as not being part of the topic at all. We prove that there is a strong correlation between the said probability and an automatically calculated topical fitness and we discuss the variation of the correlation depending on the method and data used. © 2011 Springer-Verlag Berlin Heidelberg.
Musat, CC, Velcin, J, Trausan-Matu, S & Rizoiu, MA 2011, 'Improving topic evaluation using conceptual knowledge', IJCAI International Joint Conference on Artificial Intelligence, pp. 1866-1871.
The growing number of statistical topic models has led to the need to better evaluate their output. Traditional evaluation means estimate the model's fitness to unseen data. It has recently been proven that the output of human judgment can greatly differ from these measures. Thus, there is a pressing need for methods that better emulate human judgment. In this paper we present a system that computes the conceptual relevance of individual topics from a given model on the basis of information drawn from a given concept hierarchy, in this case WordNet. The notion of conceptual relevance is regarded as the ability to attribute a concept to each topic and separate words related to the topic from the unrelated ones based on that concept. In multiple experiments we prove the correlation between the automatic evaluation method and the answers received from human evaluators, for various corpora and difficulty levels. By changing the evaluation focus from a statistical one to a conceptual one we were able to detect which topics are conceptually meaningful and rank them accordingly.
Rizoiu, M-A, Velcin, J & Chauchat, J-H 2010, 'Regrouper les données textuelles et nommer les groupes à l'aide de classes recouvrantes.', EGC, Cépaduès-Éditions, pp. 561-572.
Rizoiu, M-A, Wang, T, Ferraro, G & Suominen, H, 'Transfer Learning for Hate Speech Detection in Social Media'.
In today's society more and more people are connected to the Internet, and
its information and communication technologies have become an essential part of
our everyday life. Unfortunately, the flip side of this increased connectivity
to social media and other online contents is cyber-bullying and -hatred, among
other harmful and anti-social behaviors. Models based on machine learning and
natural language processing provide a way to detect this hate speech in web
text in order to make discussion forums and other media and platforms safer.
The main difficulty, however, is annotating a sufficiently large number of
examples to train these models. In this paper, we report on developing
automated text analytics methods, capable of jointly learning a single
representation of hate from several smaller, unrelated data sets. We train and
test our methods on a total of 37,520 English tweets that have been
annotated for differentiating harmless messages from racist or sexist contexts
in the first detection task, and hateful or offensive contents in the second
detection task. Our most sophisticated method combines a deep neural network
architecture with transfer learning. It is capable of creating word and
sentence embeddings that are specific to these tasks while also embedding the
meaning of generic hate speech. Its prediction correctness is a
macro-averaged F1 of 78% and 72% in the first and second task,
respectively. This method enables generating an interpretable two-dimensional
text visualization --- called the Map of Hate --- that is capable of separating
different types of hate speech and explaining what makes text harmful. These
methods and insights hold a potential for not only safer social media, but also
reduced need to expose human moderators and annotators to distressing