Sample.cat was born shortly before the terrorist attacks in Paris (November 2015). At the time, Fred & Mathieu were discussing the possibility of analysing Twitter feeds to infer real-time trends and possibly even predict future ones.
After the attacks, the project evolved into a general a posteriori analysis aimed at identifying sociological patterns.
Challenges & Steps
Several distinct steps must be completed before the full analysis can actually be performed:
- back-in-time and day-to-day tweet retrieval
- classifier implementation (supervised learning)
- manual tagging (crowd-tagging)
- metrics identification
- public data exploration website
As a sine qua non condition, there is the actual retrieval of tweets... massive amounts of tweets!
Several issues are linked to this, particularly fetching tweets from the past and efficiently storing and exploring the dataset.
To achieve this kind of tweet gathering, one must first circumvent one of Twitter's most annoying limitations: the famous website does not let you search more than 10 days into the past using the standard API.
The first step on sample.cat was thus to design a hack so we could actually work on a dataset related to the Paris attacks.
Day to day
This second fetching policy is not particularly easy to achieve either. Even if the Twitter API and a well-designed round-robin set of IPs do the trick for the retrieval itself, the main problem lies elsewhere.
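The round-robin idea can be sketched in a few lines. This is a minimal illustration only, assuming a pool of interchangeable credentials or endpoints to cycle through; the token names are placeholders, not real Twitter API keys.

```python
from itertools import cycle

# Hypothetical credential pool -- placeholder values, not real API keys.
CREDENTIALS = ["token-a", "token-b", "token-c"]

def make_rotator(tokens):
    """Return a callable that hands out the next token in the pool
    on each call, wrapping around when the pool is exhausted."""
    pool = cycle(tokens)
    return lambda: next(pool)

next_token = make_rotator(CREDENTIALS)
# Each fetch request would grab the next credential in turn,
# spreading the load (and the rate limits) across the pool.
used = [next_token() for _ in range(5)]
# used == ["token-a", "token-b", "token-c", "token-a", "token-b"]
```

In a real fetcher, each rotated token would be attached to the corresponding API request; the rotation itself is the only part sketched here.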
To get a good, automated day-to-day fetching system, it is important to know how to target critical or potentially interesting events.
This last part alone could spark an entire project.
Another important issue is finding the right way to target tweets so as not to miss anything, while avoiding an overly out-of-scope dataset.
Classification & Machine Learning
Sample.Cat stands for Social Analysis on Massively PubLished Emotions using Computer Assisted Techniques, and there are two reasons that call for a massive dataset:
- the main goal is to extend the classical "discrete" sampling technique used in sociology to a more "continuous" population. This way, sample.cat hopes to prove that a new approach is possible using computers and that the drawbacks arising from the sampling process can be circumvented.
- in order to achieve this objective, the team decided to focus on Machine Learning (ML) results in particular. This affects the dataset's size, which needs to be large enough for the assumptions underlying ML to be met.
A side task will be to define the two subsets of the main dataset needed to train an ML system with a supervised approach:
- training: to actually adjust the weights
- testing: to assess the quality of the training (i.e. the accuracy of the tuned weights)
Both of these subsets must be manually tagged (as implied by the word supervised) before being fed to the system.
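Splitting the tagged dataset into those two subsets can be sketched as follows. The ratio and the toy labels are illustrative assumptions, not choices the project has committed to.

```python
import random

def train_test_split(tweets, test_ratio=0.2, seed=42):
    """Shuffle the tagged tweets and split them into a training subset
    (used to adjust the weights) and a testing subset (used to assess
    accuracy on examples the system has never seen)."""
    rng = random.Random(seed)       # fixed seed -> reproducible split
    shuffled = list(tweets)         # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

# Toy tagged dataset: (tweet text, sentiment label) pairs.
tagged = [("tweet %d" % i, "positive" if i % 2 else "negative")
          for i in range(10)]
train, test = train_test_split(tagged)
# 10 tweets with test_ratio=0.2 -> 8 for training, 2 for testing
```

Shuffling before cutting matters: tweets arrive in chronological order, and a split without shuffling would train and test on different time periods.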
Because ML is central to the project, it is important to build a well-qualified training dataset that is large enough for the system to converge.
This is a challenge in itself, since it requires reading and manually tagging thousands, if not tens of thousands, of tweets.
Such a colossal task is not a job for one person, so the team plans to set up a crowd-tagging website. This approach has drawbacks of its own, particularly regarding the quality of the tagged dataset, since the taggers are not trained.
There are several ways to address the problem though:
- define strong annotation guidelines containing both rules and examples
- design a short "training" with an evaluation protocol, compare the results to those of a "trained" team, and finally grade each tagger according to his/her performance
- make sure each tweet is presented to at least two different taggers in order to get a kind of cross-validation
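Presenting each tweet to two taggers makes agreement measurable. One common choice for this (an assumption here, not something the project has specified) is Cohen's kappa, which corrects raw agreement for the agreement expected by chance:

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Inter-annotator agreement between two taggers over the same
    tweets, corrected for chance agreement (Cohen's kappa)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of tweets tagged identically.
    po = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each tagger's label frequencies.
    ca, cb = Counter(labels_a), Counter(labels_b)
    pe = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (po - pe) / (1 - pe) if pe != 1 else 1.0

# Toy example: two taggers labelling the same five tweets.
a = ["pos", "neg", "pos", "neu", "neg"]
b = ["pos", "neg", "neu", "neu", "neg"]
# cohen_kappa(a, b) is about 0.71: substantial but imperfect agreement.
```

A low kappa on a tagger's output could feed directly into the grading scheme described above, or flag tweets that need a third opinion.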
Metrics: identification & implementation
This point is particularly questionable: why are the metrics not defined before the tagging?
Several reasons exist, but here are the two main ones:
- the team chose to tag the tweets only according to the sentiment they exhibit, because sentiment is believed to give an important insight into what a population thinks about an event. This task is hard, and humans are required to get a good dataset: asking for a second task at the same time would probably diminish the tagging quality.
- the classes proposed during the manual tagging phase will inevitably bias the process. That said, while this bias cannot really be avoided, the analysis bias can be limited by decoupling the analysis itself from the dataset preparation. This is the approach chosen here.
Public dataset & open source tools
A fully conscious science is built upon open processes and open datasets. Accordingly, we will publish in time the full datasets related to the annotation process.
As the only true guarantor of science is the reproduction of results, we will also publish the full set of scripts and tools written for retrieval, annotation and analysis.
If time allows, we would like to provide some "data exploration" tools as well, so that everyone can play and test ideas against the datasets without having to download the full archive.
So... where are those wonders? Not here yet. We will release them over the course of the project, as they become ready. We are not in a hurry: this is a long-term project, and the datasets are not compiled yet, nor are most of the scripts written. Stay tuned!