I have seen several reports that collect and analyze millions of tweets and tease out a finding from them. The problem with these reports is that they do not start with any hypothesis; they find the very hypothesis they claim to be true by looking at large volumes of data. Just because we have Big Data does not mean we can suspend the application of mind.
In the case of Twitter analysis, only those with API skills had access to the data. And they applied those API skills well, collecting every possible tweet to make predictions that are nothing more than statistical anomalies. Given millions of tweets, anything that looks interesting will catch the eye of a determined programmer seeking to sensationalize his findings.
I believe in random sampling. I believe in reasonable sample sizes, not a whale of a sample size. I believe that an abundance of data does not obviate the need for theory or mind. I am not alone here. I posit that any relevant and actionable insight can only come from starting with a relevant hypothesis based on prior knowledge and then using random sampling to test that hypothesis. You do not need millions of tweets for hypothesis testing!
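To put a rough number on "you do not need millions": the standard sample-size formula for estimating a proportion shows how quickly a random sample becomes adequate. This is a generic illustration, not part of my script; the function name and defaults are my own.

```python
import math

def required_sample_size(margin_of_error, z=1.96, p=0.5):
    """Sample size needed to estimate a proportion within +/- margin_of_error.

    z=1.96 corresponds to 95% confidence; p=0.5 is the worst case
    (maximum variance), so this is a conservative estimate.
    """
    return math.ceil(z ** 2 * p * (1 - p) / margin_of_error ** 2)

# A +/-3% margin at 95% confidence needs only about a thousand tweets.
n = required_sample_size(0.03)
```

Note that this number does not grow with the size of the population: roughly a thousand randomly sampled tweets give a 3-point margin whether the population is a hundred thousand tweets or a hundred million.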
To make it easier for those seeking Twitter data to test their hypotheses, I am making available a really simple script to select tweets at random. You can find the code on GitHub. You can easily change it to do any type of randomization and search query. For instance, if you want to select random tweets that mention Justin Bieber, you can do that.
Does the script have bugs? It likely does. Surely others can pitch in to fix it.
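For those who want the core idea without reading the GitHub code: one standard way to pick tweets uniformly at random from a search result of unknown length is reservoir sampling. The sketch below is my own minimal illustration, not the actual script; the mock tweet list stands in for whatever the search query returns.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Select k items uniformly at random from a stream of unknown length.

    Keeps the first k items, then replaces each with decreasing
    probability, so every item ends up equally likely to be chosen.
    """
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample

# Stand-in for a large search result (e.g. tweets mentioning a keyword).
tweets = [f"tweet {n}" for n in range(10_000)]
picked = reservoir_sample(tweets, 10)
```

The appeal of this approach is that it makes one pass over the results and never holds more than k tweets in memory, so you can sample without collecting "every possible tweet" first.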
Relevance not Abundance!
Small samples and tests for statistical significance, rather than all the data and its statistical anomalies.
3 thoughts on “My First GitHub Commit – Select Random Tweets”
Very good post.
Maybe the problem is with the name (buzzword): Data Mining. People want to mine whatever data is available to find gold in it, while I believe we should first formulate a hypothesis and then test it with the data.
And I agree that it is necessary to test the hypothesis on a sample. When processing large amounts of data you can find false trends.
If you are seeking more granularity across different dimensions, then multiply the sample size accordingly.
Sounds good! How large do sample sizes need to be, do you think?