Local Twitter

Filter and enrich tweets (1)

Given a directory of compressed daily English tweets, we filter and enrich the tweets (JSON format) as follows:

  • we keep only the tweets geotagged in the selected cities
  • to these tweets we add the city to the JSON
  • we also add ngrams (in our case 4-grams and shorter, including hashtags)

The results are saved in separate files, one per city per day.

./run-11.sh time python filter_tweets_by_city.py input-file-gz output-dir
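The actual logic lives in filter_tweets_by_city.py; a minimal sketch of the filter-and-enrich step might look like the following. The bounding boxes, the `city_of` / `enrich` helper names, and the tokenization are illustrative assumptions, not the script's real implementation.

```python
# Hypothetical city bounding boxes (lon_min, lat_min, lon_max, lat_max);
# the real script has its own city list.
CITY_BBOXES = {
    "boston": (-71.19, 42.23, -70.99, 42.40),
}

def city_of(tweet):
    """Return the matching city name if the tweet's point falls in a bbox, else None."""
    coords = (tweet.get("coordinates") or {}).get("coordinates")
    if not coords:
        return None
    lon, lat = coords
    for city, (lon_min, lat_min, lon_max, lat_max) in CITY_BBOXES.items():
        if lon_min <= lon <= lon_max and lat_min <= lat <= lat_max:
            return city
    return None

def ngrams_up_to(tokens, n=4):
    """All 1- to n-grams as space-joined strings (hashtags kept as tokens)."""
    return [" ".join(tokens[i:i + k])
            for k in range(1, n + 1)
            for i in range(len(tokens) - k + 1)]

def enrich(tweet):
    """Drop tweets outside the selected cities; add 'city' and 'ngrams' keys."""
    city = city_of(tweet)
    if city is None:
        return None
    tweet["city"] = city
    tweet["ngrams"] = ngrams_up_to(tweet["text"].lower().split())
    return tweet
```

Enriched tweets would then be appended to the per-city, per-day output file for their date.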

Wordcount for ngrams

We do a simple wordcount from the JSON ngrams list (generated at step 1). The input is a file of filtered tweets (per city per day, or an aggregate).

time python wordcount_from_json_list.py inputFile outputFile
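The core of this step is just tallying the ngram lists across the input lines. A sketch, assuming one JSON tweet per line with the "ngrams" key added at step 1 (the real wordcount_from_json_list.py may differ in output format):

```python
import json
from collections import Counter

def wordcount(lines):
    """Count ngram occurrences across a stream of one-JSON-per-line tweets."""
    counts = Counter()
    for line in lines:
        tweet = json.loads(line)
        counts.update(tweet.get("ngrams", []))
    return counts
```

The counts would then be written out, e.g. one `ngram<TAB>count` pair per line.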

Beyond the simple wordcount, we also compute more complex counts and look at edges.

Wordcount for ngrams with additional features

We do the wordcount but also compute additional counts (we look at edges and at the types of tweets).

python wordcount_from_json_list_with_edge_features.py inputfileJSON outputfileWithFeatures
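One way to read "counts per tweet type" is a per-ngram breakdown by retweet / reply / original, using the standard Twitter JSON fields. This is an assumed feature set for illustration, not necessarily what wordcount_from_json_list_with_edge_features.py computes:

```python
import json
from collections import Counter, defaultdict

def count_with_features(lines):
    """Per-ngram total plus per-tweet-type counts (retweet / reply / original)."""
    feats = defaultdict(Counter)
    for line in lines:
        tweet = json.loads(line)
        if tweet.get("retweeted_status"):
            kind = "retweet"
        elif tweet.get("in_reply_to_status_id"):
            kind = "reply"
        else:
            kind = "original"
        for gram in tweet.get("ngrams", []):
            feats[gram]["total"] += 1
            feats[gram][kind] += 1
    return feats
```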

Visualizing topics between 2 cities, 1 vs. all

We want to discover local topics, so we select for example the top 1000 ngrams in one city and see how they behave in another city (or in all cities). We plot these words in a scatter plot where x is the frequency in the first city and y is the frequency in the second.

all in one

python all_in_one_plot_scatter_2_distrib.py <inputFile1> <inputFile2> <k> <plotname.pdf> <data-file>

where:

  • inputFile1 - the first file, containing ngrams ordered descending on one field
  • inputFile2 - the second file, where we look up the values for the ngrams selected from the first file
  • k - the top k ngrams from file 1
  • plotname.pdf - the name of the file where to save the plot in pdf
  • data-file - an auxiliary file where we save the 2 distributions and the column we are interested in
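The selection-and-pairing part of this step can be sketched as below. `pair_top_k` is a hypothetical helper name; it takes the two per-city count dictionaries, keeps the top-k ngrams of the first city, and looks up their frequencies in the second (0 when absent), yielding the X, Y, labels columns that feed the scatter plot:

```python
def pair_top_k(counts1, counts2, k):
    """Top-k ngrams from city 1, paired with their frequencies in city 2."""
    top = sorted(counts1.items(), key=lambda kv: kv[1], reverse=True)[:k]
    labels = [gram for gram, _ in top]
    x = [counts1[gram] for gram in labels]
    y = [counts2.get(gram, 0) for gram in labels]
    return x, y, labels
```

The three columns can then be saved to the auxiliary data-file and passed to matplotlib's scatter for the PDF plot.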

2-step: preprocessing the distrib from feat files + actual plotting

  1. prepare_plot_scatter_2_distrib.py

  2. simple_plot_scatter_2_distrib.py, where:

    • inputFile1 - the preprocessed file with 3 columns: X, Y and labels

Notes!

  1. lots of files per city per day - JSON - containing city and ngrams
  2. aggregate files on:
    • week
    • month
    • other: cat boston_2015110* > boston_1of3.json, cat boston_3of3.json boston_20151130.json > boston_3of3_plus.json