Local Twitter

Filter and enrich tweets (1)

Given a directory of compressed daily English tweets, we filter and enrich the tweets (JSON format) as follows:

  • we keep only the tweets "geotagged" in the selected cities
  • to each of these tweets we add the city in the JSON
  • we also add ngrams (in our case 4-grams and shorter, including hashtags)

The output is saved in separate files, one per city per day.

./ time python input-file-gz output-dir
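The enrichment step could look roughly like the sketch below. It is only an illustration, not the script's actual code: the tokenizer, the function names and the field names `text`, `city` and `ngrams` are assumptions, and the geotag filtering is left out.

```python
import re

# Assumed tokenizer: lowercased words and #hashtags (not the original script's rules).
TOKEN_RE = re.compile(r"#?\w+")

def extract_ngrams(text, max_n=4):
    """Return all n-grams of length 1..max_n as space-joined strings."""
    tokens = TOKEN_RE.findall(text.lower())
    ngrams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams

def enrich(tweet, city, max_n=4):
    """Return a copy of the tweet dict with 'city' and 'ngrams' added.

    The field names 'text', 'city' and 'ngrams' are assumptions.
    """
    enriched = dict(tweet)
    enriched["city"] = city
    enriched["ngrams"] = extract_ngrams(tweet.get("text", ""), max_n)
    return enriched
```

For example, `enrich({"text": "Go #RedSox go"}, "boston")` would add the city plus unigrams, bigrams and one trigram, with the hashtag kept as a single token.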

Wordcount for ngrams

We do a simple wordcount over the JSON ngrams list (generated at step 1). The input is a file of filtered tweets (per city per day, or an aggregate).

time python inputFile outputFile
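A minimal sketch of this counting step, assuming one JSON tweet per line and an `ngrams` field holding a list of strings (both are assumptions about the file layout, and `count_ngrams` is a hypothetical name):

```python
import json
from collections import Counter

def count_ngrams(lines):
    """Count ngram occurrences over an iterable of JSON lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        # 'ngrams' is the assumed name of the list added at step 1
        counts.update(tweet.get("ngrams", []))
    return counts
```

It could be applied to an open file handle, and `counts.most_common()` would then give the ngrams ordered by frequency.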

We do more complex counts besides wordcount and also look at edges

Wordcount for ngrams with additional features

We do wordcount but also additional counts (we look into edges and types of tweets)

python inputfileJSON outputfileWithFeatures

Merge multiple files with ngrams with additional features

Given 2 or more files, each corresponding to a city and containing tokens (from wordcount) with their features, we want to sum the features for the same token. This utility is useful when:

  1. we want to compare one city against the rest (the remaining 9)
  2. we want to merge files from different days into one and get features for a certain period; for example, we can merge 7 daily files into a one-week file by summing the feature values for each unique token.

How to:

./ boston /data/muntean/edge-features-10-cities-november/boston_3of3_plus.tsv

which converts into

time python ../ --o blacklist --c boston --minfreq 1 --out /data/muntean/edge-features-10-cities-november/all-without-boston_3of3_plus-blacklist.tsv /data/muntean/edge-features-10-cities-november/boston_3of3_plus.tsv
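The core of the merge, summing feature values per token, can be sketched as follows. It assumes tab-separated lines with the token in the first column, numeric features in the remaining columns and no header row. The function names are hypothetical, and the blacklist and minimum-frequency options of the real script are not modelled.

```python
def parse_feature_lines(lines):
    """Yield (token, [feature values]) from tab-separated lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        token, *values = line.split("\t")
        yield token, [float(v) for v in values]

def merge_features(files):
    """Sum feature vectors per token across several iterables of lines."""
    totals = {}
    for lines in files:
        for token, values in parse_feature_lines(lines):
            if token in totals:
                # same token seen before: add feature values column by column
                totals[token] = [a + b for a, b in zip(totals[token], values)]
            else:
                totals[token] = values
    return totals
```

Passing seven daily files (as open file handles or lists of lines) would give the one-week totals per token.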

Sort a file on a certain column

When we want to find topics, we look at tokens with high values in a certain column (wordcount, edgecount), so we sort the files on that column. This can also be done for several files at once (change how the for command iterates, more precisely the ls command).

How to: ./ or ./
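In Python, sorting on one column might be sketched like this, assuming tab-separated lines, a numeric sort column and a 0-based column index (the original scripts may number columns differently; `sort_lines_by_column` is a hypothetical name):

```python
def sort_lines_by_column(lines, column, descending=True):
    """Sort tab-separated lines numerically on the given 0-based column."""
    rows = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    rows.sort(key=lambda row: float(row[column]), reverse=descending)
    return ["\t".join(row) for row in rows]
```

Descending order puts the tokens with the highest counts first, which is the order needed when picking the top k tokens.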

Visualizing topics between 2 cities, 1 vs. all

We want to discover local topics, so we select, for example, the top 1000 ngrams in one city and see how they behave in another city (or in all cities). We plot these ngrams in a scatter plot where x is the frequency in the first city and y is the frequency in the second.

all in one

python <inputFile1> <inputFile2> <k> <plotname.pdf> <data-file>


  • inputFile1 - the first file, containing ngrams sorted in descending order on one field
  • inputFile2 - the second file, where we look up the values for the ngrams selected from the first file
  • k - the top k ngrams from file 1
  • plotname.pdf - the name of the file where to save the plot in pdf
  • data-file - an auxiliary file where we save the 2 distributions and the column we are interested in
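The data behind the scatter plot can be sketched as below. This only illustrates the idea: `top_k_pairs` is a hypothetical helper, the inputs are assumed to be already parsed, and giving tokens missing from the second file a frequency of 0 is an assumption.

```python
def top_k_pairs(first, second, k):
    """Build (x, y, label) triples for the top k tokens of the first file.

    first  - list of (token, frequency), already sorted descending
    second - dict mapping token -> frequency in the second file
    """
    triples = []
    for token, freq in first[:k]:
        # assumption: a token absent from the second file counts as 0
        triples.append((freq, second.get(token, 0), token))
    return triples
```

Tokens with a high x and a low y are frequent in the first city but rare elsewhere, which makes them candidates for local topics. The triples could then be written to the data file or passed to a plotting library.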

2-step: preprocessing the distrib from feat files + actual plotting


python /data/muntean/edge-features-10-cities-november/boston_1of3-sorted-col-4.tsv /data/muntean/edge-features-10-cities-november/all-without-boston_1of3-blacklist.tsv 4 5000 boston-vs-all-local-topics.tsv

  1. simple_plot_scatter_2_distrib.py where:
    • inputFile1 - the preprocessed file with 3 columns : X, Y and labels


  1. lots of files per city per day - json - containing city and ngrams
  2. aggregate files on :
    • week
    • month
    • other: cat boston_2015110* > boston_1of3.json, cat boston_3of3.json boston_20151130.json > boston_3of3_plus.json