
Local Twitter

Filter and enrich tweets (1)

Given a directory of compressed daily English tweets, we filter and enrich the tweets (JSON format) as follows:

  • we keep only the tweets "geotagged" in the selected cities
  • we add the city to the JSON of each of these tweets
  • we also add ngrams (in our case 4-grams and shorter, including hashtags)

The output is saved in separate files per city per day.
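The enrichment step described above can be sketched roughly as follows. This is only an illustration, not the repository's script: the whitespace tokenization, the lowercasing, the helper names `ngrams` and `enrich`, and the field names `text`, `city` and `ngrams` are all assumptions. The geotag filter is left out because its rules are not described here.

```python
import json

def ngrams(tokens, max_n=4):
    """Return all n-grams of length 1..max_n as space-joined strings."""
    out = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

def enrich(tweet, city, max_n=4):
    """Return a copy of the tweet with 'city' and 'ngrams' fields added.

    Field names and whitespace tokenization are assumptions made for
    this sketch; hashtags survive because tokens are kept as-is.
    """
    tokens = tweet.get("text", "").lower().split()
    enriched = dict(tweet)
    enriched["city"] = city
    enriched["ngrams"] = ngrams(tokens, max_n)
    return enriched

# Toy tweet, made up for illustration.
tweet = {"text": "Go #RedSox tonight"}
print(json.dumps(enrich(tweet, "boston")))
```

Storing the ngrams next to the city in each tweet's JSON means later steps can count them without tokenizing the text again.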

./ time python input-file-gz output-dir

Wordcount for ngrams

We do a simple wordcount over the JSON ngrams list (generated at step 1). The input is a file of filtered tweets (per city per day, or an aggregate).

time python inputFile outputFile
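As a minimal sketch of such a wordcount, assuming one JSON object per line and an `ngrams` list field (both assumptions, as is the helper name `count_ngrams`), the counting could look like this:

```python
import json
from collections import Counter

def count_ngrams(lines):
    """Count n-gram occurrences across an iterable of JSON-encoded tweets."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        # 'ngrams' is an assumed field name; tweets without it add nothing.
        counts.update(tweet.get("ngrams", []))
    return counts

# Toy input, made up for illustration.
sample = [
    '{"city": "boston", "ngrams": ["go", "#redsox", "go #redsox"]}',
    '{"city": "boston", "ngrams": ["go", "home", "go home"]}',
]
for ngram, n in count_ngrams(sample).most_common(3):
    print(ngram, n)
```

Because the function accepts any iterable of lines, an open file handle for a per-day or an aggregate file can be passed in directly.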

We do more complex counts besides wordcount and also look at edges

Wordcount for ngrams with additional features

We do a wordcount plus additional counts (we look into edges and types of tweets).

python inputfileJSON outputfileWithFeatures

Visualizing topics between 2 cities, 1 vs. all

We want to discover local topics, so we select, for example, the top 1000 ngrams in a city and see how they behave in another city (or in all cities). We plot these ngrams in a scatter plot where x is the frequency in the first city and y is the frequency in the second.

all in one

python <inputFile1> <inputFile2> <k> <plotname.pdf> <data-file>


  • inputFile1 - the first file, containing ngrams ordered descending on one field
  • inputFile2 - the second file, in which we look up the values for the ngrams selected from the first file
  • k - the top k ngrams from file 1
  • plotname.pdf - the name of the PDF file in which to save the plot
  • data-file - an auxiliary file where we save the 2 distributions and the column we are interested in
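The selection behind the plot can be sketched as below. This is not the actual script: the helper name `top_k_pairs`, the use of raw counts as "frequency", and the choice of 0 for ngrams missing from the second file are all assumptions.

```python
def top_k_pairs(counts_a, counts_b, k):
    """Pair the k most frequent n-grams of counts_a with their counts in counts_b.

    N-grams absent from counts_b get 0 (an assumption of this sketch).
    """
    top = sorted(counts_a.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return [(ngram, freq_a, counts_b.get(ngram, 0)) for ngram, freq_a in top]

# Toy counts, made up for illustration.
boston = {"red sox": 50, "traffic": 30, "snow": 20}
chicago = {"traffic": 40, "snow": 25, "cubs": 60}
for ngram, x, y in top_k_pairs(boston, chicago, 2):
    print(ngram, x, y)
```

Each resulting triple gives one point of the scatter plot: x from the first city, y from the second, and the ngram as its label.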

2-step: preprocessing the distributions from the feature files + actual plotting


  2. simple_plot_scatter_2_distrib.py where:

    • inputFile1 - the preprocessed file with 3 columns: X, Y and labels


  1. lots of files per city per day - json - containing city and ngrams
  2. aggregate files on :
    • week
    • month
    • other: cat boston_2015110* > boston_1of3.json, cat boston_3of3.json boston_20151130.json > boston_3of3_plus.json