A friend asked me to explain how an automatic system for classifying documents, such as AIDR, works.
We are going to do this in three steps: first a preliminary example on the risk of having a stroke, then some generalities, then the real thing.
Preliminary: predicting stroke risk
Imagine a doctor with several patients that she has been following for several years. She has a clinical file for each patient in which she has noted the following: whether the patient smokes or not (which she writes as "smokes=y" or "smokes=n"), whether the patient has high blood pressure or not (which she writes as "hypertensive=y" or "hypertensive=n"), and whether the patient practices sports or not (which she writes as "sports=y" or "sports=n").
Finally, the doctor also notes whether the patient has had a stroke, written as "STROKE=y" or "STROKE=n":
- Patient 1: smokes=y, hypertensive=y, sports=n, STROKE=y
- Patient 2: smokes=y, hypertensive=n, sports=n, STROKE=y
- Patient 3: smokes=y, hypertensive=n, sports=y, STROKE=n
- Patient 4: smokes=n, hypertensive=y, sports=y, STROKE=n
- Patient 5: smokes=n, hypertensive=y, sports=n, STROKE=y
Now, one can extract certain statistics from this data. For instance, patients 3 and 4 practice sports and didn't have a stroke, while patients 1, 2, and 5 don't practice sports and did have a stroke. From this data alone, one could conclude that practicing sports may help prevent a stroke (where the "may help" part doesn't come from this data but just from the recognition that 5 patients is not a lot).
We can also see that 66% of the patients who smoke (two out of three) had a stroke in this sample.
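The kind of reasoning the doctor does above can be sketched as a tiny classifier. The following is a minimal, hypothetical illustration using naive Bayes on the five patient records (a real system such as AIDR uses far richer features and far more training data; none of the code below is specific to AIDR):

```python
from collections import Counter

# The five patient records from the doctor's files: features plus the
# STROKE label we want to learn to predict.
patients = [
    ({"smokes": "y", "hypertensive": "y", "sports": "n"}, "y"),
    ({"smokes": "y", "hypertensive": "n", "sports": "n"}, "y"),
    ({"smokes": "y", "hypertensive": "n", "sports": "y"}, "n"),
    ({"smokes": "n", "hypertensive": "y", "sports": "y"}, "n"),
    ({"smokes": "n", "hypertensive": "y", "sports": "n"}, "y"),
]

def train(records):
    # Count how often each label occurs, and how often each
    # (label, feature, value) combination occurs.
    labels = Counter(label for _, label in records)
    counts = Counter((label, feat, val)
                     for feats, label in records
                     for feat, val in feats.items())
    return labels, counts

def predict(model, feats):
    labels, counts = model
    total = sum(labels.values())
    best, best_score = None, -1.0
    for label, n_label in labels.items():
        score = n_label / total  # prior: fraction of patients with this label
        for feat, val in feats.items():
            # conditional probability with add-one (Laplace) smoothing,
            # so an unseen combination never gets probability zero
            score *= (counts[(label, feat, val)] + 1) / (n_label + 2)
        if score > best_score:
            best, best_score = label, score
    return best

model = train(patients)
# A new patient who doesn't smoke, isn't hypertensive, and does sports:
print(predict(model, {"smokes": "n", "hypertensive": "n", "sports": "y"}))  # n
```

The model captures exactly the regularities noted above: practicing sports pushes the prediction toward "no stroke", while smoking and not practicing sports push it the other way.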
A new and exciting dataset is available. It contains the number of visitors, average visit time, "tweets" on Twitter, and "likes" on Facebook, for a set of thousands of web pages. The data is aggregated in 5-minute windows over a period of 48 hours.
We are inviting researchers to participate in a competition: an ECML/PKDD Discovery Challenge that consists of predicting the total activity after 48 hours by observing only the first hour of life of a web page. This is an important task with significant practical applications.
Dataset available courtesy of Chartbeat Inc.
— Carlos Castillo and Josh Schwartz
Predictive Web Analytics Challenge Co-Chairs
I had the privilege to work with Wei Chen (Microsoft Research) and Laks V.S. Lakshmanan (University of British Columbia) on a book for the Synthesis Lectures on Data Management series, edited by M. Tamer Özsu and published by Morgan and Claypool.
This book starts with a detailed description of well-established diffusion models, including the independent cascade model and the linear threshold model, that have been successful at explaining propagation phenomena. We describe their properties as well as numerous extensions to them, introducing aspects such as competition, budget, and time-criticality, among many others.

We delve deep into the key problem of influence maximization: selecting key individuals to activate in order to influence a large fraction of a network. Influence maximization in classic diffusion models, including both the independent cascade and the linear threshold models, is computationally intractable, more precisely #P-hard, and we describe several approximation algorithms and scalable heuristics that have been proposed in the literature.

Finally, we deal with key issues that need to be tackled in order to turn this research into practice, such as learning the strength with which individuals in a network influence each other, as well as practical aspects including the availability of datasets and software tools for facilitating research. We conclude with a discussion of various research problems that remain open, both from a technical perspective and from the viewpoint of transferring the results of research into industry-strength applications.
Wired UK, 30 September 2013.
On 24 September a 7.7-magnitude earthquake struck south-west Pakistan, killing at least 300 people. The following day Patrick Meier at the Qatar Computer Research Institute (QCRI) received a call from the UN Office for the Coordination of Humanitarian Affairs (OCHA) asking him to help deal with the digital fallout -- the thousands of tweets, photos and videos that were being posted on the web containing potentially valuable information about the disaster.
[...] AIDR (Artificial Intelligence for Disaster Response) was the second project tested for the first time during the Pakistan earthquake, and is due to be launched officially at the CrisisMappers conference in Nairobi in November. It's an open-source tool relying on both human and machine computing, allowing human users to train algorithms to automatically classify tweets and determine whether or not they are relevant to a particular disaster.
In Pakistan, SBTF volunteers tagged 1,000 tweets, out of which 130 were used to create a classifier and train an algorithm that could be used to recognise relevant tweets with up to 80 percent accuracy ...