Local Twitter
Filter and enrich tweets (1)
Given a directory of compressed daily English tweets, we filter and enrich the tweets (JSON format) as follows:
- we keep only the tweets "geotagged" in the selected cities
- we add the city to the JSON of each of these tweets
- we also add ngrams (in our case 4-grams and shorter, including hashtags)
The output is saved in separate files, one per city per day.
./run-11.sh
time python filter_tweets_by_city.py input-file-gz output-dir
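As a rough illustration of the enrichment step, the sketch below builds all ngrams up to length 4 from a tweet's text and attaches them, together with the city, to the tweet. It is not the code in filter_tweets_by_city.py: the function names, the tokenization rule, and the field names `text`, `city` and `ngrams` are assumptions made for the example.

```python
import re


def extract_ngrams(text, max_n=4):
    """Return all word ngrams of length 1..max_n, shortest first."""
    # Assumed tokenization: lowercase words, keeping a leading '#' or '@'
    # so that hashtags and mentions survive as single tokens.
    tokens = re.findall(r"[#@]?\w+", text.lower())
    ngrams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


def enrich(tweet, city, max_n=4):
    """Return a copy of the tweet dict with the city and its ngrams added."""
    enriched = dict(tweet)
    enriched["city"] = city  # assumed field name
    enriched["ngrams"] = extract_ngrams(tweet.get("text", ""), max_n)
    return enriched


if __name__ == "__main__":
    print(enrich({"text": "Go #RedSox go"}, "boston"))
```

The geotag filter itself is left out, since how a tweet is matched to a city is not described here.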
Wordcount for ngrams
We do a simple word count over the JSON ngrams list (generated at step 1). The input is a file of filtered tweets (per city per day, or an aggregate).
time python wordcount_from_json_list.py inputFile outputFile
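A minimal sketch of such a count is shown here, assuming the input has one JSON tweet per line and that the ngrams are stored under a field called `ngrams`. Both are assumptions, and `count_ngrams` is a hypothetical helper, not the function used in wordcount_from_json_list.py.

```python
import json
from collections import Counter


def count_ngrams(lines):
    """Count how often each ngram occurs across JSON-encoded tweets."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        # Assumed field name; tweets without it contribute nothing.
        counts.update(tweet.get("ngrams", []))
    return counts


if __name__ == "__main__":
    sample = ['{"ngrams": ["go", "red sox", "go"]}', '{"ngrams": ["go"]}']
    for ngram, count in count_ngrams(sample).most_common():
        print(ngram, count)
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, so a large aggregate file does not need to be loaded into memory at once.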
The next step does more complex counts besides the word count and also looks at edges.
Wordcount for ngrams with additional features
We do a word count but also additional counts (we look at edges and types of tweets)
python wordcount_from_json_list_with_edge_features.py inputfileJSON outputfileWithFeatures
Visualizing topics between 2 cities, 1 vs. all
We want to discover local topics, so we select, for example, the top 1000 ngrams in one city and see how they behave in another city (or in all cities). We plot these ngrams in a scatter plot where x is the frequency in the first city and y is the frequency in the second.
all in one
python all_in_one_plot_scatter_2_distrib.py <inputFile1> <inputFile2> <k> <plotname.pdf> <data-file>
where:
- inputFile1 - the first file, containing ngrams sorted in descending order on one field
- inputFile2 - the second file, in which we look up the values for the ngrams selected from the first file
- k - the number of top ngrams to take from file 1
- plotname.pdf - the name of the PDF file in which to save the plot
- data-file - an auxiliary file where we save the 2 distributions and the column we are interested in
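The selection behind the scatter plot can be sketched as follows: take the k most frequent ngrams of the first city and look up each one's frequency in the second. The function name `top_k_pairs` and the use of plain dicts are assumptions for illustration; the actual script reads its data from files whose layout is not described here.

```python
def top_k_pairs(freq_first, freq_second, k):
    """Pair the k most frequent ngrams of the first city with their
    frequency in the second city, as (x, y, label) tuples."""
    top = sorted(freq_first.items(), key=lambda item: item[1], reverse=True)[:k]
    # An ngram that never occurs in the second city gets y = 0.
    return [(count, freq_second.get(ngram, 0), ngram) for ngram, count in top]


if __name__ == "__main__":
    boston = {"red sox": 50, "traffic": 30, "pizza": 10}
    nyc = {"traffic": 45, "pizza": 40}
    for x, y, label in top_k_pairs(boston, nyc, 2):
        print(x, y, label)
```

An ngram that is frequent in the first city but rare in the second ends up with a large x and a small y, so it lands close to the x axis of the plot.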
2-step: preprocess the distributions from the feature files, then plot
prepare_plot_scatter_2_distrib.py
simple_plot_scatter_2_distrib.py where:
- inputFile1 - the preprocessed file with 3 columns : X, Y and labels
Notes!
- many JSON files, one per city per day, each containing the city and ngrams
- aggregate files by:
  - week
  - month
  - other:
cat boston_2015110* > boston_1of3.json
cat boston_3of3.json boston_20151130.json > boston_3of3_plus.json
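For environments without a shell, the same kind of aggregation can be approximated in Python. This is only a sketch: `concatenate` is a hypothetical helper, and it assumes the daily files can be joined byte for byte, as `cat` does.

```python
import glob
import shutil


def concatenate(pattern, out_path):
    """Append every file matching the glob pattern, in sorted name order,
    to out_path. Returns the list of files that were used."""
    # The pattern is expanded before out_path is opened, so a newly
    # created output file is never picked up as one of its own inputs.
    paths = sorted(glob.glob(pattern))
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
    return paths


if __name__ == "__main__":
    used = concatenate("boston_2015110*", "boston_1of3.json")
    print("aggregated", len(used), "files")
```

Sorting the matched names keeps the days in chronological order when the date is part of the file name, as in the examples above.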