Commit 723e14fbc27db9c8e11c9d589669900b6475ab74

Authored by Cristina Muntean
1 parent 24f10e5f

problem with prepare_plot

README.md
... ... @@ -18,20 +18,49 @@ The files are saved in separate files per city per day.
18 18 ## Wordcount for ngrams
19 19  
20 20 We do a simple wordcount from the JSON ngrams list (generated at step 1)
21   -We input a file of filterred tweets (per city per day or aggregate)
  21 +We input a file of filtered tweets (per city per day or aggregate)
22 22  
23 23 `time python wordcount_from_json_list.py inputFile outputFile`
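As a rough sketch (not the repository script itself), the wordcount step can be read like this: each input line is assumed to be a JSON-encoded list of ngrams, occurrences are counted, and token/count pairs are written as TSV. The function name and file layout here are illustrative assumptions.

```python
import json
from collections import Counter

def wordcount_from_json_list(input_path, output_path):
    """Count ngrams from a file where each line is a JSON list of tokens."""
    counts = Counter()
    with open(input_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # each non-empty line is assumed to be a JSON-encoded token list
            for ngram in json.loads(line):
                counts[ngram] += 1
    with open(output_path, "w") as out:
        # most_common() yields tokens sorted by count, descending
        for token, count in counts.most_common():
            out.write("%s\t%d\n" % (token, count))
```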
24 24  
25 25 We do more complex counts besides wordcount and also look at edges
26 26  
27 27  
28   -## Wordcount for ngrams with aditional features
  28 +## Wordcount for ngrams with additional features
29 29  
30 30 We do wordcount but also additional counts (we look into edges
31 31 and types of tweets)
32 32  
33 33 `python wordcount_from_json_list_with_edge_features.py inputfileJSON
34 34 outputfileWithFeatures`
  35 +
  36 +
  37 +## Merge multiple files with ngrams with additional features
  38 +
  39 +Given 2 or more files corresponding to cities with tokens (from wordcount)
  40 +and features, we want to sum the features for the same token. This utility
  41 +is useful when:
  42 + 1. we want to compare one city against the rest (the remaining 9)
  43 + 2. we want to merge files from different days into one and
  44 + get features corresponding to a certain period; we can merge 7 days
  45 + into a one-week file by summing the feature values for each unique token.
  46 +
  47 +How to:
  48 +`./aggregate-wc-with-edge.sh boston /data/muntean/edge-features-10-
  49 +cities-november/boston_3of3_plus.tsv` which expands to
  50 +`time python ../merge_wordcount_with_edge_features.py --o blacklist
  51 +--c boston --minfreq 1 --out /data/muntean/edge-features-10-cities-
  52 +november/all-without-boston_3of3_plus-blacklist.tsv /data/muntean/edge-
  53 +features-10-cities-november/boston_3of3_plus.tsv`
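The merge itself can be sketched as follows, assuming rows are TSV with the token in column 0 and numeric feature counts in the remaining columns; features for the same token are summed across all input files. The function name and the assumption that every file has the same feature columns are illustrative, not the exact merge_wordcount_with_edge_features.py implementation.

```python
def merge_feature_files(paths):
    """Sum per-token feature columns across several token/feature TSV files."""
    merged = {}
    for path in paths:
        with open(path) as f:
            for line in f:
                parts = line.rstrip("\n").split("\t")
                if not parts[0]:
                    continue
                token, feats = parts[0], [int(x) for x in parts[1:]]
                if token in merged:
                    # token seen before: sum the feature columns element-wise
                    merged[token] = [a + b for a, b in zip(merged[token], feats)]
                else:
                    merged[token] = feats
    return merged
```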
  54 +
  55 +## Sort a file on a certain column
  56 +
  57 +When we want to find topics we look at tokens with high values for a
  58 +certain column (wordcount, edgecount), so we sort those files on that
  59 +column. We can also do this for several files at once (change how the
  60 +for loop iterates, more precisely the ls command).
  61 +
  62 +How to:
  63 +`./sort-on-feat-aggregates.sh` or `./sort-on-feat.sh`
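In Python terms, the sort step amounts to ordering the TSV rows by one numeric column, descending. This is only a sketch of what the shell scripts do; the column index and file layout are assumptions.

```python
def sort_on_column(input_path, output_path, column):
    """Sort a TSV file on a numeric column, highest values first."""
    with open(input_path) as f:
        rows = [line.rstrip("\n").split("\t") for line in f if line.strip()]
    # descending sort on the chosen numeric column
    rows.sort(key=lambda r: int(r[column]), reverse=True)
    with open(output_path, "w") as out:
        for r in rows:
            out.write("\t".join(r) + "\n")
```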
35 64  
36 65 ## Visualizing topics between 2 cities, 1 vs. all
37 66  
... ... @@ -59,6 +88,11 @@ the column we are interested in
59 88 1. prepare_plot_scatter_2_distrib.py <inputFile1> <inputFile2> <column>
60 89 <k> <data-file>
61 90  
  91 +`python prepare_plot_scatter_2_distrib.py /data/muntean/edge-features-10-
  92 +cities-november/boston_1of3-sorted-col-4.tsv /data/muntean/edge-features-
  93 +10-cities-november/all-without-boston_1of3-blacklist.tsv 4 5000
  94 +boston-vs-all-local-topics.tsv`
  95 +
62 96 2. simple_plot_scatter_2_distrib.py <inputFile1> <plotname.pdf>
63 97 where:
64 98 - inputFile1 - the preprocessed file with 3 columns : X, Y and labels
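The prepare step above can be sketched roughly as: take the top-k tokens from the first (already sorted) file, look up each token's value in the second file, and emit X, Y, label rows for the scatter plot. The column index, k, and the three-column output follow the README; the default value of 0 for tokens missing from the second file is an assumption, not necessarily what prepare_plot_scatter_2_distrib.py does.

```python
def prepare_scatter_data(file1, file2, column, k, out_path):
    """Pair per-token values from two distributions into X, Y, label rows."""
    def read(path):
        with open(path) as f:
            return [line.rstrip("\n").split("\t") for line in f if line.strip()]

    a = read(file1)[:k]                 # file1 is assumed pre-sorted on `column`
    b = {row[0]: int(row[column]) for row in read(file2)}
    with open(out_path, "w") as out:
        for row in a:
            x = int(row[column])
            y = b.get(row[0], 0)        # token may be absent in file2
            out.write("%d\t%d\t%s\n" % (x, y, row[0]))
```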
... ...
prepare_plot_scatter_2_distrib.py
... ... @@ -41,6 +41,8 @@ def readFromFileMultipleEdges(filename, columnNumber):
41 41  
42 42 Meaning: self.id, self.description, self.nodeCount, self.edgeCount, self.mentionCount, self.replyCount,
43 43 self.RTCount, self.innerRTCount, self.outerRTCount, self.quoteCount, self.innerQuoteCount, self.outerQuoteCount
  44 +
  45 +
44 46 :param filename:
45 47 :param columnNumber:
46 48 :return:
... ... @@ -86,8 +88,10 @@ if __name__ == '__main__':
86 88  
87 89 # read distributions!
88 90 a, a_max = readFromFileMultipleEdges(inputFile1, columnNumber) # sorted
89   - b, b_max = readFromFileMultipleEdges(inputFile2, columnNumber)
  91 + b, b_max = readFromFileMultipleEdges(inputFile2, columnNumber) ### we can improve this and keep only the ones
  92 + # present in a list
90 93 print len(a), len(b)
  94 + print "Maxes", a_max, b_max
91 95  
92 96 # make b a default dict as we search for elements from a
93 97 bDict = {rows[0]: int(rows[1]) for rows in b}
... ...
resources/stop-word-list.txt
... ... @@ -353,5 +353,17 @@ doesn't
353 353 we'd
354 354 won't
355 355 ........
356   -
  356 +you'll
  357 +dont
  358 +they're
  359 +maybe
  360 +i'd
  361 +wasn't
  362 +wouldn't
  363 +we'll
  364 +couldn't
  365 +haven't
  366 +you've
  367 +we've
  368 +aren't
357 369  
... ...