Dr. Marian-Andrei Rizoiu is a lecturer with the University of Technology Sydney, leading the Behavioral Data Science group, studying the dynamics of human attention in the online environment. He is interested in stochastic behavioural modelling of human actions online, at the intersection of applied statistics, artificial intelligence and social data science. His research has made several key contributions, particularly to the areas of online popularity prediction and online privacy. For the past four years, he has been developing theoretical models for online information diffusion, which can account for complex social phenomena, such as the rise and fall of online popularity, the spread of misinformation or the adoption of disruptive technologies. He has approached questions such as "Why did X become popular, but not Y?" and "How can items be promoted?", with implications in advertising and marketing. Marian-Andrei has also worked on detecting the evolution of privacy loss over time. His research has shown that privacy "leaks" over time, and it identified the factors causing the loss: the individual's own actions and the environment. The conclusions were staggering: privacy continues to decrease even for users who retired from activity.
Marian-Andrei's research has an inter-disciplinary focus. He has led two research grants: the first on quantifying the social influence of automatic diffusion systems in the electoral process (with social scientists), and the second on detecting hate speech for the early prediction of mass atrocities and genocides (with political scientists).
Marian-Andrei has published in the most selective venues of the field of Data Science and Web Research, such as the International World Wide Web Conference (WWW), the conference on Web Search and Data Mining (WSDM), the International Conference on Web and Social Media (ICWSM), or the Conference on Information and Knowledge Management (CIKM). He serves as a PC member for prestigious conferences and journals, such as AAAI, WWW and ICWSM, and the Journal of Machine Learning Research. His work has received significant media attention, including from the Wikimedia Foundation for the work concerning the privacy of Wikipedia editors (which was featured in the March 2016 Wikimedia Research Showcase). See more at http://www.rizoiu.eu
Media attention. Marian-Andrei's work has received significant media attention, among which:
- Both the Business Insider and the ANU Reporter wrote about our findings concerning the bot influence in the 2016 US elections.
- I presented my findings concerning the privacy of Wikipedia editors to the Wikimedia Foundation (the legal entity that handles and represents Wikipedia), in the March 2016 edition of the Wikimedia Research Showcase. The showcase was live streamed on YouTube and it had an international reach to both researchers and general public.
- My Wikipedia privacy work was featured in ANU’s news media outlet.
- My work on social media popularity was covered by the ANU Reporter and NCI News.
Can supervise: YES
- Machine Learning for social media;
- Big Social Data Science: algorithms and applications;
- influence, polarisation, radicalisation through the prism of online social media;
- spatio-temporal information diffusion;
- (technical) stochastic point process modelling, epidemic models, Bayesian learning.
See here for the complete list of courses taught and student projects.
Teaching. I hold a pedagogical degree in higher education and I have 10 years of teaching experience. Overall, I have delivered more than 600 hours of lectures and tutoring for Undergraduates, Masters and Honours, and I have lectured in international excellence degree programs, such as the Masters Erasmus Mundus Excellence DMKM and the Franco-Ukrainian Masters IDSM (cooperation between the University Lumiere Lyon and the University of Kharkov, Ukraine).
Supervision completion. More than 45 students: 4 PhD students, 2 RA/postdocs, 1 visiting postgraduate student, 5 Honours (Masters by research) students, 4 summer scholar students, and more than 30 coursework masters students. See here for the complete list of alumni students and their projects.
Teaching quality. For the past four years, I obtained high evaluations in ANU’s official Student Experience of Learning and Teaching (SELT) (see attached 2017 SELT evaluation of my teaching).
Diverse teaching. I taught a wide range of CS subjects (Programming, Calculus, Networking, Algorithms Design), of Machine Learning and Data Mining subjects (association rules mining, decision trees, clustering, symbolic learning, ensemble methods) and Social Media Analysis. This document details the complete list of these courses.
Kern, ML, McCarthy, PX, Chakrabarty, D & Rizoiu, M-A 2019, 'Social media-predicted personality traits and values can help match people to their ideal jobs', Proceedings of the National Academy of Sciences of the United States of America, vol. 116, no. 52, pp. 26459-26464.
Work is thought to be more enjoyable and beneficial to individuals and society when there is congruence between one's personality and one's occupation. We provide large-scale evidence that occupations have distinctive psychological profiles, which can successfully be predicted from linguistic information unobtrusively collected through social media. Based on 128,279 Twitter users representing 3,513 occupations, we automatically assess user personalities and visually map the personality profiles of different professions. Similar occupations cluster together, pointing to specific sets of jobs that one might be well suited for. Observations that contradict existing classifications may point to emerging occupations relevant to the 21st century workplace. Findings illustrate how social media can be used to match people to their ideal occupation.
Kim, D, Graham, T, Wan, Z & Rizoiu, M-A 2019, 'Analysing user identity via time-sensitive semantic edit distance (t-SED): a case study of Russian trolls on Twitter', Journal of Computational Social Science, vol. 2, no. 2, pp. 331-351.
© 2019 Association for Computing Machinery. Online videos have shown tremendous increase in Internet traffic. Most video hosting sites implement recommender systems, which connect the videos into a directed network and conceptually act as a source of pathways for users to navigate. At present, little is known about how human attention is allocated over such large-scale networks, and about the impacts of the recommender systems. In this paper, we first construct the Vevo network — a YouTube video network with 60,740 music videos interconnected by the recommendation links, and we collect their associated viewing dynamics. This results in a total of 310 million views every day over a period of 9 weeks. Next, we present large-scale measurements that connect the structure of the recommendation network and the video attention dynamics. We use the bow-tie structure to characterize the Vevo network and we find that its core component (23.1% of the videos), which occupies most of the attention (82.6% of the views), is made out of videos that are mainly recommended among themselves. This is indicative of the links between video recommendation and the inequality of attention allocation. Finally, we address the task of estimating the attention flow in the video recommendation network. We propose a model that accounts for the network effects for predicting video popularity, and we show it consistently outperforms the baselines. This model also identifies a group of artists gaining attention because of the recommendation network. Altogether, our observations and our models provide a new set of tools to better understand the impacts of recommender systems on collective social attention.
Rizoiu, M-A, Guille, A & Velcin, J 2015, 'CommentWatcher: An Open Source Web-based platform for analyzing discussions on web forums.', CoRR, vol. abs/1504.07459.
Rizoiu, M-A, Velcin, J & Lallich, S 2015, 'Semantic-enriched visual vocabulary construction in a weakly supervised context', Intelligent Data Analysis, vol. 19, no. 1, pp. 161-185.
Rizoiu, M-A, Velcin, J & Lallich, S 2014, 'How to Use Temporal-Driven Constrained Clustering to Detect Typical Evolutions.', Int. J. Artif. Intell. Tools, vol. 23.
Rizoiu, M-A, Velcin, J & Lallich, S 2013, 'Unsupervised feature construction for improving data representation and semantics', Journal of Intelligent Information Systems, vol. 40, no. 3, pp. 501-527.
Muşat, C, Trǎuşan-Matu, S, Velcin, J & Rizoiu, MA 2012, 'Automatic extraction of conceptual labels from topic models', UPB Scientific Bulletin, Series C: Electrical Engineering, vol. 74, no. 2, pp. 57-68.
This work outlines a novel system that automatically extracts conceptual labels for statistically obtained topics. By creating a projection of the topic, which is a distribution over all the vocabulary words, over the WordNet ontology we succeed in associating concepts to the said groups of words. The most important contributions of this paper are connected to the validation of the role of these concepts as topical labels and the determination of correlations that emerge between the utility of these labels and the strength of the relation between the concepts and the topics.
Dawson, N, Molitorisz, S, Rizoiu, M-A & Fray, P, 'Layoffs, Inequity and COVID-19: A Longitudinal Study of the Journalism Jobs Crisis in Australia from 2012 to 2020'.
In Australia and beyond, journalism is reportedly an industry in crisis, a
crisis exacerbated by COVID-19. However, the evidence revealing the crisis is
often anecdotal or limited in scope. In this unprecedented longitudinal
research, we draw on data from the Australian journalism jobs market from
January 2012 until March 2020. Using Data Science and Machine Learning
techniques, we analyse two distinct data sets: job advertisements (ads) data
comprising 3,698 journalist job ads from a corpus of over 6.7 million
Australian job ads; and official employment data from the Australian Bureau of
Statistics. Having matched and analysed both sources, we address both the
demand for and supply of journalists in Australia over this critical period.
The data show that the crisis is real, but there are also surprises.
Counter-intuitively, the number of journalism job ads in Australia rose from
2012 until 2016, before falling into decline. Less surprisingly, for the entire
period studied the figures reveal extreme volatility, characterised by large
and erratic fluctuations. The data also clearly show that COVID-19 has
significantly worsened the crisis. We can also tease out more granular
findings, including: that there are now more women than men journalists in
Australia, but that gender inequity is worsening, with women journalists
getting younger and worse-paid just as men journalists are, on average, getting
older and better-paid; that, despite the crisis besetting the industry, the
demand for journalism skills has increased; and that the skills sought by
journalism job ads increasingly include social media and generalist
Dawson, N, Rizoiu, M-A, Johnston, B & Williams, M-A, 'Predicting Skill Shortages in Labor Markets: A Machine Learning Approach'.
Skill shortages are a drain on society. They hamper economic opportunities
for individuals, slow growth for firms, and impede labor productivity in
aggregate. Therefore, the ability to understand and predict skill shortages in
advance is critical for policy-makers and educators to help alleviate their
adverse effects. This research implements a high-performing Machine Learning
approach to predict occupational skill shortages. In addition, we demonstrate
methods to analyze the underlying skill demands of occupations in shortage and
the most important features for predicting skill shortages. For this work, we
compile a unique dataset of both Labor Demand and Labor Supply occupational
data in Australia from 2012 to 2018. This includes data from 7.7 million job
advertisements (ads) and 20 official labor force measures. We use these data as
explanatory variables and leverage the XGBoost classifier to predict yearly
skills shortage classifications for 132 standardized occupations. The models we
construct achieve macro-F1 average performance scores of up to 83 per cent. Our
results show that job ads data and employment statistics were the highest
performing feature sets for predicting year-to-year skills shortage changes for
occupations. We also find that features such as 'Hours Worked', years of
'Education', years of 'Experience', and median 'Salary' are highly important
features for predicting occupational skill shortages. This research provides a
robust data-driven approach for predicting and analyzing skill shortages, which
can assist policy-makers, educators, and businesses to prepare for the future
Kong, Q, Ram, R & Rizoiu, M-A, 'A Toolkit for Analyzing and Visualizing Online Users via Reshare Cascade Modeling'.
Modeling online discourse dynamics is a core activity in understanding the
spread of information, both offline and online, and emergent online behavior.
There is currently a disconnect between the practitioners of online social
media analysis - usually social, political and communication scientists - and
the accessibility to tools capable of handling large quantities of online data,
and examining online users and their behavior. We present two tools, birdspotter
and evently, for analyzing online users based on their involvement in retweet
cascades. birdspotter provides a toolkit to measure social influence and
botnets of Twitter users. While it leverages the multimodal information of
tweets, such as text contents, evently augments the user measurement by
modeling the temporal dynamics of information diffusions using self-exciting
processes. Both tools are designed for users with a wide range of computer
expertise and include tutorials and detailed documentation. We illustrate a
case study of a topical dataset relating to COVID-19, using both tools for
end-to-end analysis of online user behavior.
Kong, Q, Rizoiu, M-A & Xie, L, 'Describing and Predicting Online Items with Reshare Cascades via Dual Mixture Self-exciting Processes'.
It is well-known that online behavior is long-tailed, with most cascaded
actions being short and a few being very long. A prominent drawback in
generative models for online events is the inability to describe unpopular
items well. This work addresses these shortcomings by proposing dual mixture
self-exciting processes to jointly learn from groups of cascades. We first
start from the observation that maximum likelihood estimates for content
virality and influence decay are separable in a Hawkes process. Next, our
proposed model, which leverages a Borel mixture model and a kernel mixture
model, jointly models the unfolding of a heterogeneous set of cascades. When
applied to cascades of the same online items, the model directly characterizes
their spread dynamics and supplies interpretable quantities, such as content
virality and content influence decay, as well as methods for predicting the
final content popularities. On two retweet cascade datasets -- one relating to
YouTube videos and the second relating to controversial news articles -- we
show that our models capture the differences between online items at the
granularity of items, publishers and categories. In particular, we are able to
distinguish between far-right, conspiracy, controversial and reputable online
news articles based on how they diffuse through social media, achieving an F1
score of 0.945. On holdout datasets, we show that the dual mixture model
provides, for reshare diffusion cascades, especially unpopular ones, better
generalization performance and, for online items, accurate item popularity
McCarthy, PX, Rizoiu, M-A, Eghbal, S & Falster, DS, 'Long-term trends of diversity online'.
Ever since the web began, the number of websites has been growing
exponentially. These websites cover an ever-increasing range of online services
that fill a variety of social and economic functions across a growing range of
industries. Yet the networked nature of the web, combined with the economics of
preferential attachment, increasing returns and global trade, suggest that over
the long run a small number of competitive giants are likely to dominate each
functional market segment, such as search, retail and social media. Here we
perform a large scale longitudinal study to quantify the distribution of
attention given in the online environment to competing organizations. In two
large online social media datasets, containing more than 10 billion posts and
spanning more than a decade, we tally the volume of external links posted
towards the organizations' main domain name as a proxy for the online attention
Our analysis shows that despite the fact that we observe consistent growth in
all the macro indicators -- the total amount of online attention, in the number
of organizations with an online presence, and in the functions they perform --
we also observe that a smaller number of organizations account for an
ever-increasing proportion of total user attention, usually with one large
player dominating each function. These results highlight how evolution of the
online economy involves innovation, diversity, and then competitive dominance.
Mihaita, A-S, Li, H & Rizoiu, M-A, 'Traffic congestion anomaly detection and prediction using deep learning'.
Congestion prediction represents a major priority for traffic management
centres around the world to ensure timely incident response handling. The
increasing amounts of generated traffic data have been used to train machine
learning predictors for traffic; however, this is a challenging task due to
inter-dependencies of traffic flow both in time and space. Recently, deep
learning techniques have shown significant prediction improvements over
traditional models; however, open questions remain around their applicability,
accuracy and parameter tuning. This paper brings two contributions in terms of:
1) applying an outlier detection and anomaly adjustment method based on incoming
and historical data streams, and 2) proposing an advanced deep learning
framework for simultaneously predicting the traffic flow, speed and occupancy
on a large number of monitoring stations along a highly circulated motorway in
Sydney, Australia, including exit and entry loop count stations, and over
varying training and prediction time horizons. The spatial and temporal
features extracted from the 36.34 million data points are used in various deep
learning architectures that exploit their spatial structure (convolutional
neural networks), their temporal dynamics (recurrent neural networks), or
both through a hybrid spatio-temporal modelling (CNN-LSTM). We show that our
deep learning models consistently outperform traditional methods, and we
conduct a comparative analysis of the optimal time horizon of historical data
required to predict traffic flow at different time points in the future.
Lastly, we prove that the anomaly adjustment method brings significant
improvements to using deep learning in both time and space.
Mihaita, A-S, Papachatgis, Z & Rizoiu, M-A, 'Graph modelling approaches for motorway traffic flow prediction', In 23rd IEEE International Conference on Intelligent Transportation Systems (ITSC'20) (pp. 1--8). Rhodes, Greece (2020).
Traffic flow prediction, particularly in areas that experience highly dynamic
flows such as motorways, is a major issue faced in traffic management. Due to
increasingly large volumes of data sets being generated every minute, deep
learning methods have been used extensively in recent years for both short
and long term prediction. However, such models, despite their efficiency, need
large amounts of historical information to be provided, and they take a
considerable amount of time and computing resources to train, validate and
test. This paper presents two new spatial-temporal approaches for building
accurate short-term prediction along a popular motorway in Sydney, by making
use of the graph structure of the motorway network (including exits and
entries). The methods are built on proximity-based approaches, denoted
backtracking and interpolation, which use the most recent and closest traffic
flow information for each of the target counting stations along the motorway.
The results indicate that for short-term predictions (less than 10 minutes into
the future), the proposed graph-based approaches outperform state-of-the-art
deep learning models, such as long short-term memory, convolutional neural
networks or hybrid models.
Nurek, M, Michalski, R & Rizoiu, M-A, 'Hawkes-modeled telecommunication patterns reveal relationship dynamics and personality traits'.
It is not news that our mobile phones contain a wealth of private information
about us, and that is why we try to keep them secure. But even the traces of
how we communicate can also tell quite a bit about us. In this work, we start
from the calling and texting history of 200 students enrolled in the Netsense
study, and we link it to the type of relationships that students have with
their peers, and even with their personality profiles. First, we show that a
Hawkes point process with a power-law decaying kernel can accurately model the
calling activity between peers. Second, we show that the fitted parameters of
the Hawkes model are predictive of the type of relationship and that the
generalization error of the Hawkes process can be leveraged to detect changes
in the relation types as they are happening. Last, we build descriptors for the
students in the study by jointly modeling the communication series initiated by
them. We find that Hawkes-modeled telecommunication patterns can predict the
students' Big5 psychometric traits almost as accurately as the user-filled
surveys pertaining to hobbies, activities, well-being, grades obtained, health
condition and the number of books they read. These results are significant, as
they indicate that information that usually resides outside the control of
individuals (such as call and text logs) reveals information about the
relationships they have, and even their personality traits.
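As a rough illustration of the model class named in this abstract, the sketch below evaluates the conditional intensity of a Hawkes process with a power-law decaying kernel. The kernel parametrization `kappa * (dt + c) ** -(1 + theta)` and all function and parameter names are assumptions made for this example; the abstract does not state the exact form the authors use.

```python
from typing import Sequence


def power_law_kernel(dt: float, kappa: float, c: float, theta: float) -> float:
    """Assumed kernel: kappa * (dt + c) ** -(1 + theta) for dt > 0, else 0."""
    if dt <= 0:
        return 0.0
    return kappa * (dt + c) ** (-(1.0 + theta))


def hawkes_intensity(t: float, history: Sequence[float], mu: float,
                     kappa: float, c: float, theta: float) -> float:
    """Event rate at time t: background rate mu plus the decayed
    contribution of every past event (call or text) before t."""
    return mu + sum(power_law_kernel(t - ti, kappa, c, theta)
                    for ti in history if ti < t)


def branching_factor(kappa: float, c: float, theta: float) -> float:
    """Integral of the assumed kernel over (0, inf), i.e. the expected
    number of events directly triggered by one event. Requires c, theta > 0."""
    return kappa * c ** (-theta) / theta


if __name__ == "__main__":
    calls = [1.0, 1.5, 4.0]  # hypothetical event times
    print(hawkes_intensity(5.0, calls, mu=0.2, kappa=0.5, c=1.0, theta=1.0))
    print(branching_factor(kappa=0.5, c=1.0, theta=1.0))
```

Under this parametrization the process is stable only when the branching factor is below 1, which is one reason fitted kernel parameters can act as interpretable descriptors of a communication series.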
Wu, S, Rizoiu, M-A & Xie, L, 'Variation across Scales: Measurement Fidelity under Twitter Data Sampling'.
A comprehensive understanding of data quality is the cornerstone of
measurement studies in social media research. This paper presents in-depth
measurements on the effects of Twitter data sampling across different
timescales and different subjects (entities, networks, and cascades). By
constructing complete tweet streams, we show that the Twitter rate limit message is
an accurate indicator for the volume of missing tweets. Sampling also differs
significantly across timescales. While the hourly sampling rate is influenced
by the diurnal rhythm in different time zones, the millisecond level sampling
is heavily affected by the implementation choices. For Twitter entities such as
users, we find the Bernoulli process with a uniform rate approximates the
empirical distributions well. It also allows us to estimate the true ranking
with the observed sample data. For networks on Twitter, their structures are
altered significantly and some components are more likely to be preserved. For
retweet cascades, we observe changes in distributions of tweet inter-arrival
time and user influence, which will affect models that rely on these features.
This work calls attention to noise and potential biases in social data, and
provides a few tools to measure Twitter sampling effects.
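The Bernoulli sampling idea mentioned in this abstract can be sketched in a few lines: each item is kept independently with the same probability, so an observed count can be scaled back by the sampling rate to estimate the true count. This is a minimal illustration under that assumption, not the authors' tooling; the function names and the example rate are hypothetical.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def bernoulli_sample(items: Sequence[T], rate: float, rng: random.Random) -> List[T]:
    """Keep each item independently with probability `rate` (uniform rate)."""
    return [x for x in items if rng.random() < rate]


def estimate_true_count(observed_count: int, rate: float) -> float:
    """Unbiased estimate of the pre-sampling count: observed / rate."""
    if not 0.0 < rate <= 1.0:
        raise ValueError("rate must be in (0, 1]")
    return observed_count / rate


if __name__ == "__main__":
    rng = random.Random(42)              # fixed seed for reproducibility
    tweets = list(range(10_000))         # stand-in for one user's tweets
    sample = bernoulli_sample(tweets, rate=0.3, rng=rng)
    print(len(sample), estimate_true_count(len(sample), 0.3))
```

Because the observed count follows a binomial distribution, the estimate is noisier for entities with few items, so rankings recovered this way are least reliable in the tail.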
Zhang, R, Walder, CJ, Bonilla, EV, Rizoiu, M-A & Xie, L, 'Quantile Propagation for Wasserstein-Approximate Gaussian Processes'.
We develop a new approximate Bayesian inference method for Gaussian process
models with factorized non-Gaussian likelihoods. Our method---dubbed Quantile
Propagation (QP)---is similar to expectation propagation (EP) but minimizes the
L_2 Wasserstein distance rather than the Kullback-Leibler (KL) divergence. We
consider the case where likelihood factors are approximated by a Gaussian form.
We show that QP matches quantile functions rather than moments as in EP and has
the same mean update but a smaller variance update than EP, thereby alleviating
the over-estimation of the posterior variance exhibited by EP. Crucially, QP
has the same favorable locality property as EP, and thereby admits an efficient
algorithm. Experiments on classification and Poisson regression tasks
demonstrate that QP outperforms both EP and variational Bayes.
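For readers unfamiliar with the distance used here, the standard one-dimensional identities below (general facts about the L_2 Wasserstein distance, not taken from the paper itself) show why minimizing it amounts to matching quantile functions. The symbols p, q, F_p^{-1}, F_q^{-1}, \mu_i and \sigma_i are notation introduced for this note.

```latex
% L_2 Wasserstein distance between univariate distributions p and q,
% written in terms of their quantile functions F_p^{-1} and F_q^{-1}:
W_2^2(p, q) = \int_0^1 \left( F_p^{-1}(u) - F_q^{-1}(u) \right)^2 \, du

% Closed form when both are Gaussian,
% p = \mathcal{N}(\mu_1, \sigma_1^2), \; q = \mathcal{N}(\mu_2, \sigma_2^2):
W_2^2(p, q) = (\mu_1 - \mu_2)^2 + (\sigma_1 - \sigma_2)^2
```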
Rizoiu, MA & Velcin, J 2011, 'Topic extraction for ontology learning' in Ontology Learning and Knowledge Discovery Using the Web: Challenges and Recent Advances, pp. 38-60.
This chapter addresses the issue of topic extraction from text corpora for ontology learning. The first part provides an overview of some of the most significant solutions present today in the literature. These solutions deal mainly with the inferior layers of the Ontology Learning Layer Cake. They are related to the challenges of the Terms and Synonyms layers. The second part shows how these pieces can be bound together into an integrated system for extracting meaningful topics. While the extracted topics are not proper concepts as yet, they constitute a convincing approach towards concept building and therefore ontology learning. This chapter concludes by discussing the research undertaken for filling the gap between topics and concepts as well as perspectives that emerge today in the area of topic extraction. © 2011, IGI Global.
Kong, Q, Rizoiu, M-A & Xie, L 2020, 'Modeling Information Cascades with Self-exciting Processes via Generalized Epidemic Models', Proceedings of the 13th International Conference on Web Search and Data Mining (WSDM '20), 13th Annual ACM International Conference on Web Search and Data Mining (WSDM), Association for Computing Machinery, Houston, TX, pp. 286-294.
Dawson, NJ, Rizoiu, M-A, Johnston, B & Williams, M-A 2019, 'Adaptively selecting occupations to detect skill shortages from online job ads', IEEE International Conference on Big Data (IEEE Big Data 2019), IEEE International Conference on Big Data, Los Angeles, CA, USA, pp. 1-7.
Labour demand and skill shortages have historically been difficult to assess given the high costs of conducting representative surveys and the inherent delays of these indicators. This is particularly consequential for fast developing skills and occupations, such as those relating to Data Science and Analytics (DSA). This paper develops a data-driven solution to detecting skill shortages from online job advertisements (ads) data. We first propose a method to generate sets of highly similar skills based on a set of seed skills from job ads. This provides researchers with a novel method to adaptively select occupations based on granular skills data. Next, we apply this adaptive skills similarity technique to a dataset of over 6.7 million Australian job ads in order to identify occupations with the highest proportions of DSA skills. This uncovers 306,577 DSA job ads across 23 occupational classes from 2012-2019. Finally, we propose five variables for detecting skill shortages from online job ads: (1) posting frequency; (2) salary levels; (3) education requirements; (4) experience demands; and (5) job ad posting predictability. This contributes further evidence to the goal of detecting skills shortages in real-time. In conducting this analysis, we also find strong evidence of skills shortages in Australia for highly technical DSA skills and occupations. These results provide insights to Data Science researchers, educators, and policy-makers from other advanced economies about the types of skills that should be cultivated to meet growing DSA labour demands in the future.
Mihaita, A-S, Li, H, He, Z & Rizoiu, M-A 2019, 'Motorway Traffic Flow Prediction using Advanced Deep Learning', 2019 IEEE Intelligent Transportation Systems Conference (ITSC), IEEE, pp. 1683-1690.
Congestion prediction represents a major priority for traffic management centres around the world to ensure timely incident response handling. The increasing amounts of generated traffic data have been used to train machine learning predictors for traffic; however, this is a challenging task due to inter-dependencies of traffic flow both in time and space. Recently, deep learning techniques have shown significant prediction improvements over traditional models; however, open questions remain around their applicability, accuracy and parameter tuning. This paper proposes an advanced deep learning framework for simultaneously predicting the traffic flow on a large number of monitoring stations along a highly circulated motorway in Sydney, Australia, including exit and entry loop count stations, and over varying training and prediction time horizons. The spatial and temporal features extracted from the 36.34 million data points are used in various deep learning architectures that exploit their spatial structure (convolutional neural networks), their temporal dynamics (recurrent neural networks), or both through a hybrid spatio-temporal modelling (CNN-LSTM). We show that our deep learning models consistently outperform traditional methods, and we conduct a comparative analysis of the optimal time horizon of historical data required to predict traffic flow at different time points in the future.
Ram, R & Rizoiu, M-A 2019, 'A social science-grounded approach for quantifying online social influence', Australian Social Network Analysis Conference (ASNAC'19), pp. 2-2.
Zhang, R, Walder, C, Rizoiu, M-A & Xie, L 2019, 'Efficient Non-parametric Bayesian Hawkes Processes', Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence Organization, Macao, pp. 4299-4305.
In this paper, we develop an efficient non-parametric Bayesian estimation of the kernel function of Hawkes processes. The non-parametric Bayesian approach is important because it provides flexible Hawkes kernels and quantifies their uncertainty. Our method is based on the cluster representation of Hawkes processes. Utilizing the stationarity of the Hawkes process, we efficiently sample random branching structures and thus split the Hawkes process into clusters of Poisson processes. We derive two algorithms (a block Gibbs sampler and a maximum a posteriori estimator based on expectation maximization) and we show that our methods have a linear time complexity, both theoretically and empirically. On synthetic data, we show that our methods are able to infer flexible Hawkes triggering kernels. On two large-scale Twitter diffusion datasets, we show that our methods outperform the current state-of-the-art in goodness-of-fit and that the time complexity is linear in the size of the dataset. We also observe that on diffusions related to online videos, the learned kernels reflect the perceived longevity for different content types such as music or pet videos.
Rizoiu, M-A, Graham, T, Zhang, R, Zhang, Y, Ackland, R & Xie, L 2018, '#DebateNight: The Role and Influence of Socialbots on Twitter During the 1st 2016 U.S. Presidential Debate.', ICWSM, AAAI Press, pp. 300-309.
Kong, Q, Rizoiu, MA, Wu, S & Xie, L 2018, 'Will This Video Go Viral: Explaining and Predicting the Popularity of Youtube Videos', The Web Conference 2018 - Companion of the World Wide Web Conference, WWW 2018, pp. 175-178.
© 2018 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC BY 4.0 License. What makes content go viral? Which videos become popular and why others don't? Such questions have elicited significant attention from both researchers and industry, particularly in the context of online media. A range of models have been recently proposed to explain and predict popularity; however, there is a short supply of practical tools, accessible for regular users, that leverage these theoretical results. HIPie - an interactive visualization system - is created to fill this gap, by enabling users to reason about the virality and the popularity of online videos. It retrieves the metadata and the past popularity series of YouTube videos, it employs the Hawkes Intensity Process, a state-of-the-art online popularity model for explaining and predicting video popularity, and it presents videos comparatively in a series of interactive plots. This system will help both content consumers and content producers in a range of data-driven inquiries, such as to comparatively analyze videos and channels, to explain and to predict future popularity, to identify viral videos, and to estimate responses to online promotion.
Mishra, S, Rizoiu, MA & Xie, L 2018, 'Modeling popularity in asynchronous social media streams with recurrent neural networks', 12th International AAAI Conference on Web and Social Media, ICWSM 2018, International AAAI Conference on Web and Social Media, AAAI, Stanford, USA, pp. 201-210.
Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. Understanding and predicting the popularity of online items is an important open problem in social media analysis. Considerable progress has been made recently in data-driven predictions, and in linking popularity to external promotions. However, the existing methods typically focus on a single source of external influence, whereas for many types of online content such as YouTube videos or news articles, attention is driven by multiple heterogeneous sources simultaneously - e.g. microblogs or traditional media coverage. Here, we propose RNN-MAS, a recurrent neural network for modeling asynchronous streams. It is a sequence generator that connects multiple streams of different granularity via joint inference. We show RNN-MAS not only outperforms the current state-of-the-art YouTube popularity prediction system by 17%, but also captures complex dynamics, such as seasonal trends of unseen influence. We define two new metrics: the promotion score quantifies the gain in popularity from one unit of promotion for a YouTube video; the loudness level captures the effects of a particular user tweeting about the video. We use the loudness level to compare the effects of a video being promoted by a single highly-followed user (in the top 1% most followed users) against being promoted by a group of mid-followed users. We find that results depend on the type of content being promoted: superusers are more successful in promoting Howto and Gaming videos, whereas the cohort of regular users is more influential for Activism videos. This work provides more accurate and explainable popularity predictions, as well as computational tools for content producers and marketers to allocate resources for promotion campaigns.
Rizoiu, MA, Graham, T, Zhang, R, Zhang, Y, Ackland, R & Xie, L 2018, '#DebateNight: The role and influence of socialbots on twitter during the first 2016 U.S. presidential debate', 12th International AAAI Conference on Web and Social Media, ICWSM 2018, International AAAI Conference on Web and Social Media, AAAI, Stanford, USA, pp. 300-309.
Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. Serious concerns have been raised about the role of 'socialbots' in manipulating public opinion and influencing the outcome of elections by retweeting partisan content to increase its reach. Here we analyze the role and influence of socialbots on Twitter by determining how they contribute to retweet diffusions. We collect a large dataset of tweets during the 1st U.S. presidential debate in 2016 and we analyze its 1.5 million users from three perspectives: user influence, political behavior (partisanship and engagement) and botness. First, we define a measure of user influence based on the user's active contributions to information diffusions, i.e. their tweets and retweets. Given that Twitter does not expose the retweet structure - it associates all retweets with the original tweet - we model the latent diffusion structure using only tweet time and user features, and we implement a scalable novel approach to estimate influence over all possible unfoldings. Next, we use partisan hashtag analysis to quantify user political polarization and engagement. Finally, we use the BotOrNot API to measure user botness (the likelihood of being a bot). We build a two-dimensional "polarization map" that allows for a nuanced analysis of the interplay between botness, partisanship and influence. We find that not only are socialbots more active on Twitter - starting more retweet cascades and retweeting more - but they are 2.5 times more influential than humans, and more politically engaged. Moreover, pro-Republican bots are both more influential and more politically engaged than their pro-Democrat counterparts. However we caution against blanket statements that software designed to appear human dominates politics-related activity on Twitter. Firstly, it is known that accounts controlled by teams of humans (e.g. organizational accounts) are often identified as bots. Seco...
Rizoiu, M-A, Mishra, S, Kong, Q, Carman, M & Xie, L 2018, 'SIR-Hawkes: Linking Epidemic Models and Hawkes Processes to Model Diffusions in Finite Populations', Web Conference 2018: Proceedings of the World Wide Web Conference (WWW2018), 27th World Wide Web (WWW) Conference, Association for Computing Machinery, Lyon, France, pp. 419-428.
Wu, S, Rizoiu, MA & Xie, L 2018, 'Beyond views: Measuring and predicting engagement in online videos', 12th International AAAI Conference on Web and Social Media, ICWSM 2018, International AAAI Conference on Web and Social Media, AAAI, Stanford, USA, pp. 434-443.
Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. The share of videos in internet traffic has been growing; therefore, understanding how videos capture attention on a global scale is also of growing importance. Most current research focuses on modeling the number of views, but we argue that video engagement, or time spent watching, is a more appropriate measure for resource allocation problems in attention, networking, and promotion activities. In this paper, we present a first large-scale measurement of video-level aggregate engagement from publicly available data streams, on a collection of 5.3 million YouTube videos published over two months in 2016. We study a set of metrics including time and the average percentage of a video watched. We define a new metric, relative engagement, that is calibrated against video properties and strongly correlates with recognized notions of quality. Moreover, we find that engagement measures of a video are stable over time, thus separating the concerns for modeling engagement and those for popularity - the latter is known to be unstable over time and driven by external promotions. We also find engagement metrics predictable from a cold-start setup, having most of their variance explained by video context, topics and channel information - R² = 0.77. Our observations imply several prospective uses of engagement metrics - choosing engaging topics for video production, or promoting engaging videos in recommender systems.
Rizoiu, MA & Xie, L 2017, 'Online popularity under promotion: Viral potential, forecasting, and the economics of time', Proceedings of the 11th International Conference on Web and Social Media, ICWSM 2017, pp. 182-191.
© Copyright 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. Modeling the popularity dynamics of an online item is an important open problem in computational social science. This paper presents an in-depth study of popularity dynamics under external promotions, especially in predicting popularity jumps of online videos, and determining effective and efficient schedules to promote online content. The recently proposed Hawkes Intensity Process (HIP) models popularity as a non-linear interplay between exogenous stimuli and the endogenous reactions. Here, we propose two novel metrics based on HIP: to describe popularity gain per unit of promotion, and to quantify the time it takes for such effects to unfold. We make increasingly accurate forecasts of future popularity by including information about the intrinsic properties of the video, promotions it receives, and the non-linear effects of popularity ranking. We illustrate by simulation the interplay between the unfolding of popularity over time, and the time-sensitive value of resources. Lastly, our model lends a novel explanation of the commonly adopted periodic and constant promotion strategy in advertising, as increasing the perceived viral potential. This study provides quantitative guidelines about setting promotion schedules considering content virality, timing, and economics.
Rizoiu, M-A, Xie, L, Sanner, S, Cebrian, M, Yu, H & Van Hentenryck, P 2017, 'Expecting to be HIP: Hawkes Intensity Processes for Social Media Popularity', Proceedings of the 26th International Conference on World Wide Web (WWW'17), 26th International Conference on World Wide Web (WWW), Association for Computing Machinery, Perth, Australia, pp. 735-744.
Mishra, S, Rizoiu, M-A & Xie, L 2016, 'Feature Driven and Point Process Approaches for Popularity Prediction', CIKM'16: Proceedings of the 2016 ACM Conference on Information and Knowledge Management, 25th ACM International Conference on Information and Knowledge Management (CIKM), Association for Computing Machinery, IUPUI, Indianapolis, IN, pp. 1069-1078.
Rizoiu, M-A, Velcin, J, Bonnevay, S & Lallich, S 2016, 'ClusPath: a temporal-driven clustering to infer typical evolution paths', Data Mining and Knowledge Discovery, European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECMLPKDD), Springer, Riva del Garda, Italy, pp. 1324-1349.
Rizoiu, M-A, Xie, L, Caetano, T & Cebrián, M 2016, 'Evolution of Privacy Loss in Wikipedia', Proceedings of the Ninth ACM International Conference on Web Search and Data Mining (WSDM'16), 9th Annual ACM International Conference on Web Search and Data Mining (WSDM), Association for Computing Machinery, San Francisco, CA, pp. 215-224.
Kim, YM, Velcin, J, Bonnevay, S & Rizoiu, MA 2015, 'Temporal multinomial mixture for instance-oriented evolutionary clustering', Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 593-604.
© Springer International Publishing Switzerland 2015. Evolutionary clustering aims at capturing the temporal evolution of clusters. This issue is particularly important in the context of social media data that are naturally temporally driven. In this paper, we propose a new probabilistic model-based evolutionary clustering technique. The Temporal Multinomial Mixture (TMM) is an extension of the classical mixture model that optimizes feature co-occurrences in a trade-off with temporal smoothness. Our model is evaluated for two recent case studies on opinion aggregation over time. We compare four different probabilistic clustering models and we show the superiority of our proposal in the task of instance-oriented clustering.
Rizoiu, M-A, Velcin, J & Lallich, S 2012, 'How to Use Temporal-Driven Constrained Clustering to Detect Typical Evolutions', International Journal on Artificial Intelligence Tools, IEEE 24th International Conference on Tools with Artificial Intelligence (ICTAI), World Scientific Publ Co Pte Ltd, Athens, Greece.
Rizoiu, MA 2013, 'Semi-supervised structuring of complex data', IJCAI International Joint Conference on Artificial Intelligence, pp. 3239-3240.
The objective of the thesis is to explore how complex data can be treated using unsupervised machine learning techniques, in which additional information is injected to guide the exploratory process. Starting from specific problems, our contributions take into account the different dimensions of the complex data: their nature (image, text), the additional information attached to the data (labels, structure, concept ontologies) and the temporal dimension. Special attention is given to data representation and how additional information can be leveraged to improve this representation.
Rizoiu, M-A, Velcin, J & Lallich, S 2012, 'Structuring typical evolutions using Temporal-Driven Constrained Clustering', 2012 IEEE 24th International Conference on Tools with Artificial Intelligence (ICTAI 2012), Vol 1, IEEE 24th International Conference on Tools with Artificial Intelligence (ICTAI), IEEE, Athens, Greece, pp. 610-617.
We propose a system which employs conceptual knowledge to improve topic models by removing unrelated words from the simplified topic description. We use WordNet to detect which topical words are not conceptually similar to the others and then test our assumptions against human judgment. Results obtained on two different corpora in different test conditions show that the words detected as unrelated had a much greater probability than the others to be chosen by human evaluators as not being part of the topic at all. We prove that there is a strong correlation between the said probability and an automatically calculated topical fitness and we discuss the variation of the correlation depending on the method and data used. © 2011 Springer-Verlag Berlin Heidelberg.
Musat, CC, Velcin, J, Trausan-Matu, S & Rizoiu, MA 2011, 'Improving topic evaluation using conceptual knowledge', IJCAI International Joint Conference on Artificial Intelligence, pp. 1866-1871.
The growing number of statistical topic models led to the need to better evaluate their output. Traditional evaluation means estimate the model's fitness to unseen data. It has recently been proven that the output of human judgment can greatly differ from these measures. Thus, there is a pressing need for methods that better emulate human judgment. In this paper we present a system that computes the conceptual relevance of individual topics from a given model on the basis of information drawn from a given concept hierarchy, in this case WordNet. The notion of conceptual relevance is regarded as the ability to attribute a concept to each topic and separate words related to the topic from the unrelated ones based on that concept. In multiple experiments we prove the correlation between the automatic evaluation method and the answers received from human evaluators, for various corpora and difficulty levels. By changing the evaluation focus from a statistical one to a conceptual one we were able to detect which topics are conceptually meaningful and rank them accordingly.
Rizoiu, M-A, Velcin, J & Chauchat, J-H 2010, 'Regrouper les données textuelles et nommer les groupes à l'aide de classes recouvrantes.', EGC, Cépaduès-Éditions, pp. 561-572.
Zhang, R, Walder, C & Rizoiu, M-A, 'Variational Inference for Sparse Gaussian Process Modulated Hawkes Process'.
The Hawkes process (HP) has been widely applied to modeling self-exciting
events including neuron spikes, earthquakes and tweets. To avoid designing a
parametric triggering kernel and to be able to quantify the prediction
confidence, the non-parametric Bayesian HP has been proposed. However, the
inference of such models suffers from poor scalability or slow convergence. In
this paper, we aim to solve both problems. Specifically, first, we propose a
new non-parametric Bayesian HP in which the triggering kernel is modeled as a
squared sparse Gaussian process. Then, we propose a novel variational inference
schema for model optimization. We employ the branching structure of the HP so
that maximization of evidence lower bound (ELBO) is tractable by the
expectation-maximization algorithm. We propose a tighter ELBO which improves
the fitting performance. Further, we accelerate the novel variational inference
schema to linear time complexity by leveraging the stationarity of the
triggering kernel. Different from prior acceleration methods, ours enjoys
higher efficiency. Finally, we exploit synthetic data and two large social
media datasets to evaluate our method. We show that our approach outperforms
state-of-the-art non-parametric frequentist and Bayesian methods. We validate
the efficiency of our accelerated variational inference schema and practical
utility of our tighter ELBO for model selection. We observe that the tighter
ELBO exceeds the common one in model selection.
Rizoiu, M-A, Wang, T, Ferraro, G & Suominen, H, 'Transfer Learning for Hate Speech Detection in Social Media'.
In today's society more and more people are connected to the Internet, and
its information and communication technologies have become an essential part of
our everyday life. Unfortunately, the flip side of this increased connectivity
to social media and other online contents is cyber-bullying and -hatred, among
other harmful and anti-social behaviors. Models based on machine learning and
natural language processing provide a way to detect this hate speech in web
text in order to make discussion forums and other media and platforms safer.
The main difficulty, however, is annotating a sufficiently large number of
examples to train these models. In this paper, we report on developing
automated text analytics methods, capable of jointly learning a single
representation of hate from several smaller, unrelated data sets. We train and
test our methods on a total of 37,520 English tweets that have been
annotated for differentiating harmless messages from racist or sexist contexts
in the first detection task, and hateful or offensive contents in the second
detection task. Our most sophisticated method combines a deep neural network
architecture with transfer learning. It is capable of creating word and
sentence embeddings that are specific to these tasks while also embedding the
meaning of generic hate speech. It achieves a macro-averaged F1 of 78% and 72%
in the first and second task, respectively. This method enables generating an
interpretable two-dimensional text visualization, called the Map of Hate, that
is capable of separating
different types of hate speech and explaining what makes text harmful. These
methods and insights hold a potential for not only safer social media, but also
reduced need to expose human moderators and annotators to distressing