It is pertinent to note that, in the early stages of this research, I proposed a similar trend categorization scheme predating the work by Kwak et al. A negative (left-sided) skew indicates gradual adoption of a tag before it reaches its peak of activity, whereas a positive (right-sided) skew indicates rapid adoption of a tag before its gradual decline, as is evident in micro-memes [Huang et al. This subsection discusses two such studies: one newly developed after the introduction of the Lists feature on Twitter [Kim, Jo, Moon and Oh, ].
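The left/right-skew distinction above can be made concrete by computing the skewness of a tag's activity-over-time distribution, weighting each day by its mention count. The following sketch is my own illustration (not code from the cited studies), with invented daily counts:

```python
def adoption_skew(daily_counts):
    """Fisher-Pearson skewness of the activity-over-time distribution:
    each day t is weighted by the number of mentions on that day."""
    n = sum(daily_counts)
    mean = sum(t * c for t, c in enumerate(daily_counts)) / n
    m2 = sum(c * (t - mean) ** 2 for t, c in enumerate(daily_counts)) / n
    m3 = sum(c * (t - mean) ** 3 for t, c in enumerate(daily_counts)) / n
    return m3 / (m2 ** 1.5)

# Gradual adoption building to a late peak: long left tail, negative skew.
gradual = adoption_skew([1, 2, 4, 8, 16, 32, 64])
# Rapid adoption followed by gradual decline (micro-meme): positive skew.
rapid = adoption_skew([64, 32, 16, 8, 4, 2, 1])
```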
Lists allow users to group their friends according to custom categories. The second study, on the other hand, is a thesis on the automatic classification of tweets [Horn, ]. Kim, Jo, Moon and Oh  have turned Twitter Lists into an invaluable source for detecting commonalities among users in the Twitter community, using information from the user and message domains combined.
They posit that lists are a publicly available data source on Twitter (see Section 4). Similar list names were combined into groups. Chi-square feature selection is applied to the corpus of all tweets belonging to each single user, and repeated for all users belonging to a single list group. To obtain the ground truth, human experimenters associated Twitter users with the particular keyword that best describes each individual user [Kim, Jo, Moon and Oh, ]. He proposed two mutually independent classification schemes: 1.
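Chi-square feature selection scores how strongly a term's presence is associated with one class (here, one user's or list group's tweets) against the rest. A minimal sketch of the standard 2x2 statistic follows; this is my own illustration, and the cited authors' exact setup may differ:

```python
def chi_square(both, term_only, class_only, neither):
    """2x2 chi-square score for a (term, class) pair.
    both       : tweets in the class containing the term
    term_only  : tweets outside the class containing the term
    class_only : tweets in the class lacking the term
    neither    : tweets outside the class lacking the term
    """
    a, b, c, d = both, term_only, class_only, neither
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# A term in 9 of a user's 10 tweets but only 1 of 10 others' scores highly,
# so it is retained as a discriminative feature for that user.
score = chi_square(9, 1, 1, 9)
```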
Several findings from the analysis in the C1 category showed that statistical figures — the typical number of distinct words used, the rankings of keywords, the average length of a tweet, and the average interval between tweets — clearly differ between samples from all three types of users.
The same finding applies to the analysis of the two tweet types (factual versus opinionated) in C2: factual tweets have almost double the number of distinct keywords compared to opinion tweets, and the average time between tweets is almost ten times higher for factual tweets. Different levels of sentiment are also found for each separate category; such research is further described in Section 3. Sriram et al. Human experimenters provided the ground truth by manually assigning a category to each tweet.
The 8Fs consisted of [Sriram et al. The studies by Sriram et al. However, the findings demonstrate the ability to use tweets to indirectly classify users; I therefore posit that incorporating features from both users and messages will increase the accuracy of future research on pattern detection on Twitter. Moh and Murmann and Lee et al. Kwak et al. [Kwak et al. In a similar vein, Lee et al.
Abrol and Khan , in the context of their Twitter geo-content study discussed in Section 3. The honeypots by Lee et al. Another comprehensive study by Thomas et al. Several key findings from [Thomas et al. Accounts exhibiting such activities are frequently suspended by Twitter Inc. The first kind consists of short-lived Twitter spam accounts which go all-out on spamming as much as possible before being suspended. The second kind generates spam tweets infrequently, but over a longer period of time.
Another group of spammers work within the allowed limits, but are banned by Twitter for different reasons. Meanwhile, Metaxas and Mustafaraj  observed real-world political opinion-spam disseminated through aggressive campaigns, with a similar modus operandi to Lee et al. This theme encompasses applications in the fields of: information retrieval, personalization, opinion and sentiment analysis.
These research areas are by no means new; the novelty comes from the fact that these research topics are applied directly on Twitter users and messages. Then, they sanitize their data by removing non-English tweets, before clustering together misspelled words using the JCluster algorithm.
Conversation and topic modeling, employing Latent Dirichlet Allocation (LDA) and Bayesian methods, is performed on the resulting message set. The authors found that coherent patterns do emerge from messages between users on Twitter, and that unsupervised modeling of dialogs on Twitter is indeed possible even with a large dataset.
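The essence of LDA can be sketched compactly. The following minimal collapsed Gibbs sampler is an illustration only (the surveyed papers used their own, far more elaborate implementations); it recovers topic-count structure from a toy set of "messages":

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for LDA (illustrative, unoptimized)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    V = len(vocab)
    widx = {w: i for i, w in enumerate(vocab)}
    ndk = [[0] * n_topics for _ in docs]        # document-topic counts
    nkw = [[0] * V for _ in range(n_topics)]    # topic-word counts
    nk = [0] * n_topics                         # tokens per topic
    z = []                                      # topic assignment per token
    for d, doc in enumerate(docs):
        zs = [rng.randrange(n_topics) for _ in doc]
        for w, k in zip(doc, zs):
            ndk[d][k] += 1; nkw[k][widx[w]] += 1; nk[k] += 1
        z.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                     # remove the token's assignment
                ndk[d][k] -= 1; nkw[k][widx[w]] -= 1; nk[k] -= 1
                weights = [(ndk[d][t] + alpha) * (nkw[t][widx[w]] + beta)
                           / (nk[t] + V * beta) for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k                     # resample and restore counts
                ndk[d][k] += 1; nkw[k][widx[w]] += 1; nk[k] += 1
    return ndk, nkw, vocab

messages = [["cat", "dog", "cat"], ["stock", "price", "stock"],
            ["dog", "cat"], ["price", "stock"]]
ndk, nkw, vocab = lda_gibbs(messages, n_topics=2, n_iter=50)
```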
As a result, the authors also come up with a ten-act conversation and topic model corresponding to different features, e.g. Puniyani et al. They construct an LDA [Blei et al. Due to the small size of a Twitter message, the approach in Puniyani et al. As the experiments were still in progress at the time of writing, their findings were still inconclusive [Puniyani et al. From the survey, 70 users categorized their Twitter use as personal, while 16 classified it as professional, indicating an overlap of Twitter usage for both personal and professional purposes.
Bernstein et al. Although still a work in progress, Bernstein et al. The Buzzer recommender system by Phelan et al. Wu et al. An observation from this section is that the research surveyed primarily operates on the message domain, incorporating only basic ideas from the user domain, such as the basic follower network.
Interestingly, findings from the message domain, such as the tagging and annotation algorithm by Wu et al. Here, the state of the art in this particular field is highlighted. Traditionally studied in the contexts of blogs, fora, and product reviews, cf. They hypothesize that comments on Twitter which reflect user sentiment are associated with a real-world decline in sales performance for the said movie. Using the TweetCritics sentiment analysis tool for Twitter, they are able to come up with a model of the difference in revenue between the first and second days as a function of the sum of negative tweets recorded in the same time interval.
Their secondary analysis reveals that automated sentiment analysis tools such as TweetCritics are comparable to manual sentiment analysis of tweets by human experimenters; Wasow and Baron used the online Amazon Mechanical Turk crowd-sourcing system. Their first finding corroborates the literature on spike analysis in Section 3. Shamma et al. The matching of context was possible between the actual texts of the debate, obtained through closed captioning, and the Twitter chatter; viz.
The said annotation is made possible by the availability of temporal data (message time-stamps), and the fact that statistics on Twitter messaging activity over a specified time interval can produce cues on the venue, structure, and activity level of a real-world program [Shamma et al. On the other hand, Bollen et al. Their list of such phenomena is adapted from a list of twenty real-life events, incorporating the corresponding stock market behavior of the Dow Jones Industrial Average and West Texas Intermediate oil price indices.
Jansen et al. In the message domain, the intention is categorized into four classes: sentiment, information-seeking, information-providing, and comments [Jansen et al. The message length of a tweet is also taken into account to study the linguistics of the user base when commenting on a brand; the co-occurrence of key terms and phrases, such as personal prepositions, is also identified in the messages.
Certain co-occurrences of words found in tweets were observed to correlate with the sentiments or intentions of their authors [Jansen et al. The findings from Jansen et al. Wrapping up this subsection, note that almost all the literature discussed here on opinion mining and sentiment analysis deals exclusively with the message domain on Twitter. This matches the traditional practice of using pure textual content for sentiment and opinion analysis.
Again, suggestions for future research include combining such analyses with observations and measurements from the user domain. This relates to other topics in this section in that search has traditionally been a problem area in information retrieval. A position paper by Suh et al. Term expansion is suggested as an approach to handle situations involving Twitter search.
They also identified weaknesses of the Twitter API, previously discovered in, e.g., Cheong and Lee and Kwak et al. One such important weakness is the very limited window for search results, restricted to merely two weeks; see Section 4. The usage of hashtags is said to improve the search experience, based on the results of the user study [Golovchinsky and Efron, ]. In order to understand the rationale behind human interaction on Twitter in the user domain, one needs to understand human factors.
Three of the five papers in this section deal with cellular phone communication networks, which have existed for more than a few decades. Communication Patterns Although not strictly within the realm of Twitter, Dearman et al. Research in Dearman et al. Rangaswamy et al. Over time, such discussion categories begin to evolve, resulting in some groups having overlapping categories [Rangaswamy et al.
By letting their test participants use customized Motorola phones to share these types of information, they infer that users directly or indirectly give away three types of cues about their current state: 1. Motion presence: gives hints to user location, activity, availability, destination and estimated time of arrival.
Music presence: gives hints to user location, activity, and availability. Photo presence: gives hints to user location, activity, and the presence of people around them. Battestini et al. However, simultaneous conversations were observed. The following is a non-ranked list of such categories proposed by Battestini et al. Quantitative studies by Battestini et al.
Several statistics on a survey of users are listed in Table 3. The table's features include gender (the ratio of males to females, roughly ) and age bracket. The authors also hypothesized that several of these statistics have a degree of correlation with others. Studies investigating primarily the message domain on Twitter have generally dealt with Twitter usage intentions and information-sharing.
In their pioneering work on Twitter, Section 3. This definition by Mischaud expanded on the original list by Java et al. The categories of information sharing were also deconstructed into seven distinct groups [Mischaud, ]. Written in early , the thesis [Mischaud, ] forecast the current trend of using Twitter for more than just publishing statuses about everyday minutiae. Naaman et al. They characterize message content on Twitter as a SAS into categories, some of which have already been discussed by Java et al.
Notable concepts unique to Naaman et al. Here, Naaman et al. Several books on Twitter that focus on user participation and the adoption of Twitter for marketing have also discussed user intentions and online presence. Here, new concepts apart from the ones in the previous paragraphs will be summarized.
In their preliminary study on status feedback, using a Web interface which allows users to give ratings (negative, neutral or positive) to tweets by others, they found that the amount of positive feedback outnumbers negative feedback by a factor of approximately four. The second study involves nine participants utilizing an iPhone-based photo-sharing application which allows them to take pictures (15 to 30 daily) and annotate them with a description of what to share and with whom.
Twitter is also sometimes used in conjunction with, or integrated into, other services such as Facebook. Such issues have been explored in academic literature, and the consequences of privacy breaches are commonplace in mass media. Lawler and Molluzzo, who authored a study on first-year college students' perceptions of privacy on Facebook, MySpace and Twitter, revealed that, on the whole, privacy on online social networks is neither clearly known nor well understood amongst users.
Despite their findings mainly focusing on personally-identifiable information on Facebook, certain pieces of information, such as real name and current location, are similarly available from a microblogging site such as Twitter. Humphreys et al. As per Lawler and Molluzzo, Humphreys et al. Their experiment focused on manually coding tweets for the presence of specific personal details. Most of their dataset did not contain any personally identifiable information; however, Humphreys et al.
As described above, for example, certain users might want to send detailed private tweets to friends and family, choosing to provide a vaguer description for workplace colleagues, and blocking it for everyone else. The proposal, which revolved around the usage of a mobile client, has two modes of operation.
Although still in the phases of a conceptual design, this research has shown the effectiveness of Twitter in providing social location-based information. From their analysis, Wilson  provided an insight on the types of information sought on Twitter, as well as the different synonyms frequently associated with social information needs on Twitter.
Banerjee et al. The user domain plays a role in isolating messages from users who are active, based on their frequency of writing tweets, and within specific major cities in the western hemisphere. By matching co-occurrences of such keywords in tweets, Banerjee et al. Research ideas in this regard normally take both user and message domains into account, where information is presented in a relevant and easy-to-understand format, e.g. Research in terms of CHI with respect to the visualization of tweets, the subject reviewed within this section, focuses on making user interaction with Twitter intuitive, graphical, and easy to understand.
From the earlier discussion of Bernstein et al. Figure 3. The paper by Mathioudakis and Koudas demonstrated TwitterMonitor, another Twitter visualization tool. Compared to Eddi, TwitterMonitor does not personalize streams per user; rather, it reflects the overall trend of the current public timeline of messages. Lastly, a noteworthy collection of visualization algorithms by Donath et al. These five algorithms [Donath et al. Lexigraphs is shown in Figure 3. Mycrocosm is shown in Figure 3.
I opine that the latter three visualizations can be adapted in future research to deal with microblog data, as Twitter allows the easy exploration of communities. Commercial Website Visualizations of Twitter Data Besides academic work on Twitter-based visualizations, there are several interactive Web 2.0 applications. This subsection of the literature review will detail several notable ones. TwitterVision [Troy, ] Figure 3. The user location is obtained via a set of coordinates generated by GPS-enabled devices or browsers.
Bloch and Carter from the New York Times published an experimental Flash applet, a geographically-distributed tag cloud of sorts, that visualizes Twitter activity during the Super Bowl. This is accomplished by mapping out the location and frequency of commonly used words in Super Bowl-related messages on a map of the United States; see Figure 3.
This is not dissimilar to the use of geography and time to track the spread of a current real-life event in Section 3. The concept of timeline visualization has also been implemented using keywords, hashtags, and trending keywords on Twitter. Given its humble beginnings, the rapid expansion and popularity of Twitter have prompted recent research into the characteristics of microblogging in the organizational context where it all began.
Thom-Santelli et al. They found that microblogging behavior differs among users based on their cultural norms; for example, the Indian branch of IBM has users who post more informal or expressive messages compared to those at the US site. Real-world cultural power distance also plays a role in influencing the types of status messages created [Thom-Santelli et al.
In the context of companies just starting to adopt microblogging, Zhang et al. By studying a Fortune company by means of a month data log on Yammer and some interviews with adopters, they found that the adoption of microblogging in an organization grows through four progressive stages.
Further study is still necessary to investigate whether these hubs correspond to superiors or high-ranking employees. Prior research [Java et al. Government Wigand authored a paper on the adoption of Twitter by government agencies in the United States. Education Du et al. Ebner et al. There are many more pieces of published research containing detailed analyses of Twitter as a facilitator for learning; these are not covered in this chapter as they are beyond the scope of my literature review.
Xu and Farkas have designed a working prototype of a decentralized microblogging service as a proof-of-concept arising from their findings. Since then, there has been a mushrooming of literature and related research on the subject, as well as applications hitherto not considered for research from the perspective of microblogging and social media.
I have, in this chapter, also identified several emerging topics of theoretical research. This chapter has also explored the idea of two separate yet interdependent domains, the user and the message, in microblogging services [Cheong and Lee, a; Cheong and Ray, ; Cormode et al. I have also covered the significance of both these domains, their relation to one another, and their interdependence in my evaluation of the current literature.
Research that merges the study of both these domains is still lacking; in spite of that, several promising studies reviewed in this chapter have leveraged the combination of both domains in pattern detection and classification. These, in order of discussion in this chapter, are: more extensive exploratory studies on Twitter (Chapter 5); a better understanding of the spread of information on Twitter (Chapter 7); applications of pattern recognition for the revelation of emergent behavior (Chapter 6); modeling and detection of sentiment with regard to trends on Twitter (Chapter 8); human factors on Twitter (Chapter 4); and practical applications of Twitter (Chapters 4, 5, and 7).
In the previous chapter, I surveyed the extant literature to identify the state of the art in research with respect to Twitter. Existing studies specialize in either the user or the message domain, but rarely both. There is a dearth of research dealing with the combination of both domains, with emphasis on the analysis of the raw metadata themselves, and on methods by which such raw data can be transformed into useful heuristics and information.
This contributes to solving Subgoal 2 of my overall thesis. Firstly, in Sections 4. Based on both existing literature and my original research, I describe how the different Twitter APIs can be used in tandem with one another, allowing programmers and researchers to access the vast amount of metadata on Twitter. I then describe several issues pertaining to the suitability of the APIs for my research, including weaknesses and workarounds.
Next, in Section 4. This includes coverage of the metadata format, as well as methods for using it. Section 4. This leads into Section 4. The heart of this chapter lies in Sections 4. These sections detail my algorithms and metrics for transforming the raw metadata from both domains into valuable statistics or inferences, based on my research and empirical observations. My contributions in this regard can be divided into three main areas: real-life demographic properties, Section 4.
Therefore, this section contains an overview of the core Twitter Application Programming Interfaces (APIs), their properties, and how one can make use of such APIs for metadata retrieval. The search method retrieves messages based on search criteria provided by the user.
As described by Twitter Inc. Summize was later acquired and rebranded as Twitter Search. Rebranding the site was easy, [however] fully integrating Twitter Search and its API into the Twitter codebase is more difficult. Figure 4. This API enables easy access to relevant keywords and hashtags that constitute the bulk of Twitter chatter at any given moment. Internal API Inconsistencies: Workarounds Hence, a workaround would be to use some other form of uniquely identifying metadata in the results from search, such as from_user, as recommended by Twitter Inc.
This workaround, albeit simple, does not yield a reliable unique identifier for a user, as a user can frequently change her username on Twitter. A user ID, on the other hand, is unique and constant. However, the problem remains for old or legacy data sets harvested using said API before November ; examples of such legacy data sets include the ones used in my early research [Cheong and Lee, , c].
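The difference matters when grouping harvested records: keying on the mutable username silently splits a renamed account into two users, whereas keying on the numeric ID does not. A small sketch with hypothetical records:

```python
# Hypothetical harvested records: the screen name can change, the id cannot.
records = [
    {"id": 42, "screen_name": "alice_2009", "text": "first tweet"},
    {"id": 42, "screen_name": "alice_nyc", "text": "tweet after a rename"},
    {"id": 77, "screen_name": "bob", "text": "hello"},
]

by_id, by_name = {}, {}
for r in records:
    by_id.setdefault(r["id"], []).append(r)             # stable: 2 users
    by_name.setdefault(r["screen_name"], []).append(r)  # unstable: 3 "users"
```

Grouping by screen name wrongly treats the renamed account as two different users; grouping by ID keeps both tweets under one user.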
Rate-limiting: Known issues Due to technical limitations, the Twitter search API returns only a limited number of tweets matching a user query. Two conditions affect this limitation, viz.: a hard upper bound of tweets for a given batch of search results. I identified this limit as early as [Cheong and Lee, ]; subsequently, Russell [a] independently checked that the limit was still enforced as of January . This limit is still present as of July . If the search result quota does not constrain the set of returned results, a soft limit then applies to the date range.
This is approximately 20 days before the current day, as I have found in my empirical studies [Cheong and Lee, , c]. For user information harvested from the REST-user API, I was able to retrieve user metadata for up to a maximum of 20,000 users per hour during my initial research. This was allowed only after being explicitly granted white-listing permissions from Twitter Inc. for research purposes [Cheong and Lee, c].
However, since the World Cup event, which took place circa June to July , the rate-limiting has been dynamically adjusted. The current estimate is requests per hour, with white-listing still disabled, which is unlikely to change in the near future. In short, at the time of writing, a significantly reduced number of accesses per hour to the REST-user API is attainable compared to when research on this PhD first started. The search API quota still stands at accesses per hour [Russell, a].
Rate-limiting: Workarounds Having said that, several workarounds for the rate-limiting problem have been developed in related work. These appear mainly in papers authored before the quota changes, when the usage of the old APIs was still commonplace (see Chapter 3).
In these papers, the authors performed a constrained crawl of users originating from a seed user, as their research focused mainly on user relationships. This is different from harvesting a set of users based on the message similarity e. Such a method of sampling was also proposed by Cormode et al.
The authors used a cluster of about 20 machines (with individual IP addresses), all of which were white-listed by Twitter Inc. By limiting each machine to 20k total API requests per hour, to avoid going against the terms of service, Kwak et al. They were also capable of harvesting ten Trending Topics and tweets every five minutes. This method of distributing the load of querying the Twitter API in parallel across multiple clients is an efficient way to overcome Twitter Terms of Service restrictions and obtain a near-complete set of data.
The obvious disadvantage of this approach was the high cost and the large amount of resources needed. Polling is done by periodically repeating the search query after a particular interval, to achieve a near-continuous stream of data. As the maximum number of users that can be queried from the REST-user API is fixed at users per hour (as in , after white-listing), the only workaround is to perform random sampling of users.
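The polling strategy described above can be sketched as follows. The quota constants are placeholders (the actual limits changed over time, as discussed earlier), and `search_fn` stands in for a call to the search API; results from overlapping polling windows are de-duplicated by tweet ID:

```python
import time

RATE_LIMIT_PER_HOUR = 150   # placeholder quota; check the current API terms
QUERIES_PER_POLL = 5        # distinct search queries issued per round

def poll_interval_seconds(rate_limit=RATE_LIMIT_PER_HOUR,
                          per_poll=QUERIES_PER_POLL):
    """Smallest polling interval that keeps hourly API calls within quota."""
    return 3600 / (rate_limit // per_poll)

def poll(search_fn, query, rounds, interval):
    """Repeat the same search periodically for a near-continuous stream,
    de-duplicating tweets that appear in overlapping result windows."""
    seen, results = set(), []
    for _ in range(rounds):
        for status in search_fn(query):
            if status["id"] not in seen:
                seen.add(status["id"])
                results.append(status)
        time.sleep(interval)
    return results
```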
The degree of sampling is highly variable based on the needs of individual experiments. Weaknesses of this method include missing message data if messages are produced quicker than they are being consumed; and the inconsistency between the total number of messages obtained versus the number of users due to rate limit differences.
The total number of search operations invoked (search queries per interval, multiplied by the number of intervals per hour) corresponds with the maximum user data retrieval limit of 20k per hour, as stated. To improve the speed of user data collection, a simple caching mechanism is used, saving user metadata in memory. The principle behind this is that users are likely to contribute more than one tweet during the observation period, based on studies conducted on the user base [Cheong and Lee, ] and studies on communication patterns [Boyd et al.
One positive side-effect of the cache mechanism is that it reveals anomalous user records which suddenly cease to be available for access on the Twitter API, despite their presence in previously-retrieved messages. This is a result of Twitter Inc. The cache mechanism keeps track of the count of such confirmed spam accounts. Related spam-detection heuristics were discussed earlier in Section 3. For the sake of completeness, I briefly describe these other resources, parts of which will be revisited in Section 4.
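The caching mechanism can be sketched as follows. This is a simplified reconstruction rather than the exact prototype; `fetch_fn` stands in for a REST-user API call, assumed here to return `None` for accounts that have since been suspended:

```python
class UserCache:
    """In-memory cache of user metadata keyed by user ID. Users who tweet
    repeatedly during the observation window are served from the cache,
    saving REST-user API calls; lookups that now fail (e.g. an account
    suspended for spam) are tallied separately."""

    def __init__(self, fetch_fn):
        self.fetch_fn = fetch_fn    # wrapper around the REST-user API
        self.cache = {}
        self.suspended_count = 0

    def get(self, user_id):
        if user_id in self.cache:
            return self.cache[user_id]
        record = self.fetch_fn(user_id)   # may be None if account removed
        if record is None:
            self.suspended_count += 1
            return None
        self.cache[user_id] = record
        return record
```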
Such geographic metadata will be discussed in Section 4. Initially, little attention was paid to the Streaming API; much of the published research in — [Cheong and Lee, ; Huberman et al. In late , I decided to use the Streaming API as a viable alternative for data collection, for the following reasons, mostly discussed in detail in Section 4. Quotas and rate-limiting of the on-demand APIs affected the amount of data I could retrieve for experimental purposes. The original limitations imposed when research for this PhD thesis started (a maximum of messages per search query; up to 20k users per hour after white-listing) were inconvenient, but did not pose a major hindrance to my research.
However, Twitter Inc. began to impose a dynamic but severely-limited quota in mid- , while discontinuing white-listing, allowing only hundreds of user queries per hour as of the time of writing. This further constraint made large-scale data collection practically infeasible.
The Streaming API, on the other hand, is capable of generating a very high number of metadata samples, without the imposition of rate-limits. Funke [pers. Furthermore, the Streaming API is more convenient than the on-demand APIs as it automatically embeds user metadata within the metadata of each message that it produces.
This eliminates the need for a separate API call to access a user's information, unlike the search API, which only returned a name and an id associated with a message and required a separate call to the REST-user API in order to fetch the user metadata, cf. If the connection is successful, the API then continuously streams a sample of public tweets, along with the associated message and user metadata, encapsulated in JSON format.
This process continues until the socket connection is terminated by the user, or due to an error such as overloading of the Streaming API or a network error. In a second process, consume statuses from your queue or store of choice, parse them, extract the fields relevant to your application, etc.
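The two-process pattern above can be sketched with a thread and a queue: one side reads raw JSON statuses off the stream as fast as possible, while the other parses them and keeps only the needed fields. The JSON field layout follows the message/user nesting described in this section; error handling is omitted:

```python
import json
import queue
import threading

def producer(stream_lines, q):
    """Enqueue raw JSON statuses from the streaming socket immediately,
    so that slow parsing never blocks the socket read."""
    for line in stream_lines:
        q.put(line)
    q.put(None)  # sentinel: the stream has closed

def consumer(q, out):
    """Parse each status and extract only the fields the application needs."""
    while True:
        line = q.get()
        if line is None:
            break
        status = json.loads(line)
        out.append({"id": status["id"],
                    "text": status["text"],
                    "user_id": status["user"]["id"]})

q = queue.Queue()
out = []
# A one-line stand-in for the live stream, using the assumed JSON shape.
fake_stream = ['{"id": 1, "text": "hi", "user": {"id": 9}}']
t = threading.Thread(target=consumer, args=(q, out))
t.start()
producer(fake_stream, q)
t.join()
```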
This sampling algorithm, as of time of writing, works as follows [Twitter Inc. This algorithm, in conjunction with the status id assignment algorithm, will tend to produce a random selection. Of interest is the track parameter, allowing the retrieval of tweets matching a particular search query string.
Result format: the on-demand APIs return data in a representational manner, with the connection ended once the request completes; the Streaming API streams data through the open socket until the socket is explicitly closed. Research implications: the on-demand APIs were used widely in research prior to , with the current rate-limit as the main obstacle; the Streaming API is a viable alternative for large-scale data collection. Table 4. For the sake of completeness, an in-depth technical explanation of every available metadata field returned from the Twitter API (as of the time of writing), as well as sample raw metadata, are provided in Appendix A.
The metadata in the next two subsections are discussed as-is. Their potential applications or uses will be covered by my research in Section 4. text: The raw message text (up to characters), the most visible attribute of a message. These items were used in my research [Cheong and Lee, a; Cheong and Ray, ], and also in the work of others as described in my earlier literature review (Chapter 3).
A preliminary investigation into the connections between metadata in the two Twitter domains, conducted in the early days of my PhD research (late to early ), briefly summarized the possible features that can potentially be inferred. From my literature reviews [Cheong and Lee, a; Cheong and Ray, ] in Chapter 3, I have identified several areas lacking in existing research, as illustrated in Figure 4. The following list elaborates further on the annotations covered in Figure 4. id: The unique identifier for each user, similar to its namesake in the message domain.
This can be free-form text naming a location, or exact geographical coordinates as generated by GPS-enabled Twitter clients. url: The user's website as published in their profile. Hughes and Palen, ; Sutton et al. These concepts were ported to Twitter in later work such as Kwak et al.
The number of records ranges from tens up to thousands of results [Cheong and Lee, , c,d]. This is compounded further by the need to look up user metadata separately from the message metadata discussed in Section 4. I introduce the Gigabyte Dataset, consisting of 7,, tweets (with complete message metadata) from 4,, unique users (again with complete user metadata). Funke, pers.
An in-depth discussion on the Gigabyte Dataset, including its properties, data collection techniques, idiosyncrasies, as well as the prototype used in collecting the data, is located in Section 5. Work by Joinson and Schrammel et al.
However, as discussed earlier in the definition of online social network characteristics in Table 3. Due to this limitation, Twitter has no facility to allow users to enter their gender in their profile information. Previous research [Cheong and Lee, ] has identified that users on Twitter frequently publish their real names (as opposed to an alias or nickname) as part of their user information. [U.S. Social Security Administration, ]. In the US SSA dataset, the most popular names used to register births in the United States for each given year have been recorded since the year [U.S. Social Security Administration, ].
Using Frequency Ranking of Real Names to Determine Gender on Twitter Independent of the studies cited above, my initial study of gender detection in Twitter involves a simple ranking algorithm to determine the gender of a person based on statistics released by the United States Government [Cheong and Lee, ]. Census Bureau, ] as the ranking data.
This dataset was created by the United States Census department based on raw census data, involving 6,, total first names after data sanitization, from a diverse range of ethnicities, sexes, and ages. My algorithm differs from existing ones [Warden, ; Daly and Orwant, ] in that I only perform simple string matching, as opposed to a hybrid exclusivity, weight, and metaphone-based algorithm.
Instead, the simplicity and processing speed of a simple ranking algorithm is preferred, to account for a potentially large number of input names. A hashing algorithm is first employed to pre-load all the first names in the census data for both males and females. The first name is extracted, before its rank is looked up in the hash tables of male and female name frequencies.
The gender of the queried name is inferred based on this frequency information [Cheong and Lee, c]. My approach is summarized in Algorithm 4. To describe the inner workings of Algorithm 4., consider the name Dorian: similar to real life, Dorian is mainly a male name, but is also used rather infrequently as a female name. To measure the accuracy of Algorithm 4., I conducted the following experiment.
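A minimal sketch of this constant-time lookup, assuming two preloaded hash tables keyed by lowercase first name. The frequency counts below are invented placeholders for illustration, not actual census figures:

```python
# Illustrative frequency tables; in the real algorithm these are preloaded
# from the census ranking data. Counts here are invented placeholders.
MALE_FREQ = {"dorian": 4500, "james": 4840000}
FEMALE_FREQ = {"dorian": 300, "mary": 3900000}

def infer_gender(first_name):
    """Infer gender via two constant-time hash lookups.

    Returns 'male', 'female', or 'unknown'; ties go to 'male' here,
    an arbitrary illustrative choice.
    """
    name = first_name.strip().lower()
    m = MALE_FREQ.get(name, 0)
    f = FEMALE_FREQ.get(name, 0)
    if m == 0 and f == 0:
        return "unknown"
    return "male" if m >= f else "female"
```

For the Dorian example above, the male count dominates the female count, so the name is classified as male despite its occasional female use.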
Experiment 4. To validate the accuracy of gender inferences using Algorithm 4. Method: Algorithm 4. is applied to each test set. For reference, this test set of 1, messages will be labeled as FirstNameTestSet. For comparison, the ground truth is obtained by using a human volunteer to determine the genders. Results and Discussion: The results are depicted in Table 4. Averaging the accuracy rates over each of the ten test sets, an average accuracy rate of approximately  is obtained. The comparison obtained is based on the underlying assumption that human manual detection always represents the ground truth, i.e. is always correct.
It is pertinent to note that the names used in this test [Cheong and Lee, c] were extracted in mid-, when the user base was estimated to be 75 million users5. This is especially relevant, as the majority of the names tested against this algorithm consist of Western first names; ideally, first names from a diverse range of cultures and languages would need to be analyzed. This motivated the use of the US SSA name data [U.S. Social Security Administration, ] as an alternative to the US Census name data for my simple ranking algorithm.
In my preliminary analysis of the US SSA dataset, I found that it is more thorough, as it covers first names from a wide variety of cultures (not merely limited to common Western first names), and is more up-to-date than the Census data [U.S. Census Bureau, ]. To perform such an adaptation, I summed the rank data for each name for a particular gender across 30 years. Total frequencies for each unique first name are then stored using two hash tables - one for males, another for females - indexed by first name [Cheong et al.
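The adaptation step might be sketched as follows; the (year, name, gender, count) records are invented stand-ins for the SSA raw data, which records per-year birth counts:

```python
from collections import defaultdict

# Invented sample rows standing in for the per-year SSA birth-name data.
records = [
    (1980, "Kim", "F", 2000), (1980, "Kim", "M", 150),
    (1981, "Kim", "F", 1800), (1981, "John", "M", 30000),
]

# Two hash tables - one per gender - indexed by lowercase first name.
male_freq = defaultdict(int)
female_freq = defaultdict(int)
for year, name, gender, count in records:
    table = male_freq if gender == "M" else female_freq
    table[name.lower()] += count  # sum frequencies across all years
```

The resulting tables plug directly into the same ranking lookup used with the Census data, so swapping data sources requires no change to the algorithm itself.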
The resulting ranking data is adapted from the SSA raw data [U.S. Social Security Administration, ]. To validate the accuracy of gender inferences using the SSA dataset coupled with Algorithm 4., the rest of the experiment is similar to Experiment 4. Results and Discussion: The bar chart in Figure 4. compares the two datasets. The average accuracy rate from using the SSA ranking data is .

Test set:   1   2   3   4   5   6   7   8   9   10
Census:     81  88  89  82  90  83  85  92  88  88
US SSA:     87  84  84  77  84  76  87  88  79  79

Figure 4.: Gender detection accuracy comparison on 1k-FirstNames: US Census dataset versus US SSA dataset.
However, to further validate the proposed advantages of the SSA ranking data and nullify the effect of Twitter's evolving user base, I decided to rerun Experiment 4. This reevaluation is described in Experiment 4., comparing the accuracy of gender inferences obtained from Algorithm 4. using both ranking datasets. Method: Another test set is created, using first names from current real-world Twitter users from a more diverse range of languages and cultures, drawn from the Gigabyte Dataset (Section 4.). Algorithm 4. is then run against both ranking datasets. Results and Discussion: The average accuracy obtained using the Census data is a mere . The augmented version with SSA ranking data clearly outperforms the original algorithm using US Census ranking data.
Evaluation: Based on the findings from Experiment 4., my algorithm works successfully for common names. Several limitations that affect the accuracy of human validation (and, by extension, algorithmic accuracy) have been identified. These include: 1. Non-common names: the algorithm is based on ranking data for common first names, and as such is not exhaustive; the same issue applies to humans, as not all names will be familiar to a human tester, such as names from cultures the tester is not familiar with.
2. Androgynous names: names such as Tracy, Kim and Lauren are applicable to people of both genders; hence the algorithm, and even human testing, is not able to determine the gender accurately without other cues. 3. Presence of names in non-human contexts: human names may appear in non-human contexts (e.g. account names belonging to organizations), in which case gender inference is not meaningful.
Compared to existing approaches, my proposed algorithm has the following advantages: 1. Speed: my algorithm only involves a simple hash-lookup operation which runs in constant time (slightly sacrificing memory as a trade-off for time); this is beneficial for large-scale gender detection. When tested on an input of one million names, it outperformed the Text::GenderFromName approach. 2. Adaptability: if a new ranked set of gender data is available, such as future updates to the US SSA rankings, the algorithm can be trivially adapted to incorporate the updated data set.
As I have documented in my paper resulting from this study [Cheong and Lee, c], as far as I know, this is the first time such name-based gender detection has taken place in the field of microblogging research. The repeated runs are to negate the influence of external factors such as CPU caching and background processes. I eliminated the issue of file fragmentation over the dataset by defragmenting the test data files, and ensured that Windows and background processes were not performing hard-disk-intensive operations, such as paging, during the experiment.
As discussed in Section 4., for this thesis I will be discounting the use of the Places feature on Twitter (metadata item place), as it is still in the experimental phase. In fact, several research studies [Java et al.] have relied on the user time zone instead. A weakness of this approach is that the user time zone can be inaccurate; this can be as trivial as the wrong time zone being set by a user.
Another reason is the intentional change of time zone by users, as seen more recently in the Iran Election controversy, where users from around the world changed their Twitter time zone to Tehran as a sign of solidarity [Cheong and Lee, b; Burns and Eltham, ].

Two-Phase Geolocation Approach

Based on the justifications in the previous subsection, I propose two methods [Cheong and Lee, c], used in conjunction with one another, to determine the country a particular Twitter user is currently residing in.
For tweets without accurate location data (especially tweets collected in the early stages of my research), the free-form location text presented by the user in the location user metadata item is used. In the latter case, however, the location field can be populated by names of places with different levels of detail. Mapping such locations to specific countries is a time-consuming operation, but is nonetheless meaningful in the absence of per-message or per-user geographic coordinate data.
The rationale is that Google has a comprehensive API which is free, programmer-friendly, and has an extensive set of location names built upon their rich Google Maps service. The following experiment measures the accuracy of my proposed two-phase geolocation approach, where both phases are outsourced to the third-party Google Maps Geocoder API.
Method: I ran reverse geocoding (coordinate lookup) and also location string lookup, based on the Google Geocoder API, on ten similar data sets, totaling 1, user records [Cheong and Lee, c]. I will refer to this dataset throughout this thesis as 1k-Locations. As ground truth for 1k-Locations, human volunteers manually identified the locations of the places - with the aid of the Google search engine8, the OpenStreetMap atlas site, Windows Live Maps (now Bing Maps), and Wikipedia - and categorized them according to countries.
For consistency, countries are uniquely identified by their two-character ISO country codes, to avoid conflict in naming conventions. Results and Discussion: Table 4. shows the results.

Test set:                             1   2   3   4   5   6   7   8   9   10
Correctly determined countries (%):   96  90  87  90  92  92  88  93  80  89

Table 4.: Algorithm performance: percentage of correctly determined countries per test set.

The obtained accuracy assumes that the human tester knows exactly where a particular location is in the world, and which country it exactly belongs to.
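A sketch of this naming-convention normalization, collapsing human-readable country name variants onto one ISO 3166-1 alpha-2 code. The mapping entries below are a small illustrative subset, not the actual lookup table used in the thesis:

```python
# Illustrative subset of a country-name-to-ISO-code mapping; a full
# implementation would cover all variants encountered in the data.
ISO_ALPHA2 = {
    "united kingdom": "GB", "great britain": "GB", "uk": "GB",
    "united states": "US", "usa": "US", "america": "US",
}

def to_iso_code(country_name):
    """Normalize a country name variant to its two-character ISO code."""
    return ISO_ALPHA2.get(country_name.strip().lower(), "??")
```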
Strong cases would be GPS coordinates and the presence of a full address. Several cases have been identified, however, where the location matching mechanism becomes weak. These include random and nonsensical place names, and the ambiguity of locations: e.g. Brighton Beach could refer to different places in either the United Kingdom or Australia.

Proposed Algorithm for Scalable and Robust Offline Geolocation

However, as the work on this thesis evolved, several drawbacks were noticed in the original proposal of using a third-party geolocation service. Furthermore, there are hard limits imposed on the quota of location records that can be geocoded in a window of time, and a high price factor is involved when a particular quota is reached.
The above drawbacks prevent this approach from being scaled to handle millions of records, especially since the Streaming API has superseded the original on-demand APIs (Section 4.); prior work is similarly limited. Based on my initial identification of the two phases involved in geocoding Twitter data, I propose a novel Two-phase Hybrid Geocoding method using open source and public domain data, that can easily be scaled as required to handle differing amounts of input.
My proposed approach involves the following two steps, used in tandem with one another: 1. Coordinate reverse-geolocation: if a coordinate point - expressed as a latitude-longitude pair - is found in Twitter message metadata (the coordinates extended metadata object), this point is reverse-geolocated to determine the country to which it belongs.
Message-specific metadata is favored, as it is generated every time a Twitter user publishes a tweet with a supported device with geotagging activated. In its absence, the location field in user metadata is checked to see if it contains a coordinate point, which was commonly embedded by older mobile software as early as  [Cheong and Lee, ]. Any returned coordinates from Geodict can then be parsed using coordinate reverse-geolocation.

Offline coordinate reverse-geolocation technique

I use public domain data on country boundaries from Natural Earth [Natural Earth, ].
The data is provided in the ESRI Shapefile format, which stores a series of coordinate points outlining the border of each country, represented as a polygon. Geospatial metadata on every country is included as part of the Shapefile, and can be accessed to return attributes for each country on the map. To improve lookup speed, I applied the quadtree algorithm [Finkel and Bentley, ] to preload the polygon boundary points in memory. Using a quadtree trades off memory in favor of search speed by narrowing the search space, and is commonly used in algorithms related to cartography and geographic information systems.
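A simplified sketch of the offline reverse-geolocation step: a bounding-box pre-filter stands in here for the quadtree narrowing, and ray casting decides polygon membership. The square "countries" below are toy polygons for illustration, not Natural Earth data:

```python
# Toy country polygons as (lon, lat) vertex lists; real data would come
# from the Natural Earth Shapefile boundaries.
COUNTRIES = {
    "AU": [(110.0, -45.0), (155.0, -45.0), (155.0, -10.0), (110.0, -10.0)],
    "NZ": [(165.0, -48.0), (179.0, -48.0), (179.0, -34.0), (165.0, -34.0)],
}

def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: odd number of edge crossings means inside."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):  # edge straddles the horizontal ray
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

def reverse_geolocate(lon, lat):
    """Return the country code containing the point, or None."""
    for code, polygon in COUNTRIES.items():
        xs = [p[0] for p in polygon]
        ys = [p[1] for p in polygon]
        # Cheap bounding-box pre-filter (the quadtree plays this role
        # at scale, narrowing candidates before the exact test).
        if not (min(xs) <= lon <= max(xs) and min(ys) <= lat <= max(ys)):
            continue
        if point_in_polygon(lon, lat, polygon):
            return code
    return None
```

The exact point-in-polygon test is only run on the few candidates surviving the pre-filter, which is the same memory-for-speed trade-off the quadtree makes.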
To determine the suitability of each of the three levels of detail, I conducted Experiment 4., to determine the speed and accuracy of reverse geolocation for each detail level of Natural Earth map data, and thereby decide the optimal map scale for use in Algorithm 4. Table 4. lists the level of detail (scale), average computation speed, and accuracy for each scale. From my evaluation, the choice of the 1:110m scale clearly favors speed, by heavily sacrificing accuracy.
The 1:10m scale has the highest accuracy, but is very slow, at approximately 1. This is infeasible for large-scale data processing. Hence, I decided to use the 1:50m scale, as it strikes a balance of both accuracy and speed [Cheong et al.

Free-form string parsing approach

For records without coordinates in user and message metadata, but with a free-form location string in user metadata, I proposed the use of the Geodict algorithm [Warden, ] to extract location strings from the free-form location text, and map them to real-world locations.
Geodict, as part of the Data Science Toolkit, is an open-source algorithm by Warden . It works by extracting tokens from a given string and attempting to match them against approximately four million records of real-world locations stored in a relational database, in order to deduce a particular geographic location, ranging from city level to country level.
The inner workings of Geodict are illustrated in Algorithm 4. The advantage of Geodict and the Data Science Toolkit is that they contain open, non-commercial mapping data that can easily be updated or expanded using a relational database. I trialed my Two-phase Hybrid Geocoding algorithm, combining my coordinate reverse-geolocation in Algorithm 4. with Geodict-based free-form string parsing.
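A minimal sketch of this token-matching idea, with a three-entry in-memory gazetteer standing in for Geodict's roughly four million database records (Geodict itself matches against a relational database and resolves to city-through-country granularity):

```python
# Tiny illustrative gazetteer mapping place names to country codes;
# a stand-in for Geodict's relational database of real-world locations.
GAZETTEER = {
    "melbourne": "AU",
    "new york": "US",
    "kuala lumpur": "MY",
}

def parse_location(free_text):
    """Return the country code of the first gazetteer match, else None."""
    tokens = free_text.lower().replace(",", " ").split()
    # Try longer spans first so multi-word places win over fragments.
    for span in (3, 2, 1):
        for i in range(len(tokens) - span + 1):
            candidate = " ".join(tokens[i:i + span])
            if candidate in GAZETTEER:
                return GAZETTEER[candidate]
    return None
```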
The results, detailed later in Section 5., cover categories including Web 2.0 services and Instant Messaging applications. My initial small-scale study [Cheong and Lee, ] is documented in Experiment 4.: an initial study to devise a categorizing scheme for software, or device classes, from source metadata.
Method: This initial study, conducted in  [Cheong and Lee, ], involved Twitter messages from various topics. These messages were collected using the old Search API as part of a clustering study, which will be discussed in Section 6. I manually extracted and categorized each source string acquired from the tweets. These source strings were collated, before I manually searched the Internet to find out more information about the specific software named in each source.
Results and Discussion: From this study [Cheong and Lee, ], I arrived at a list of six device classes, as per Table 4. In the next experiment (Experiment 4.), a follow-up study [Cheong and Lee, c] categorizes software by device class, by evaluating source metadata from a 14,-message sample. Method: This experiment involved the categorization of 66 unique software clients in the source variable, in a case study of approximately 14, messages.
Similar to Experiment 4., each source string was manually categorized. Results and Discussion: The results from this follow-up analysis [Cheong and Lee, c] extended the categorization performed in prior work [Cheong and Lee, ; Java et al.], for example: Twitter marketing tools: Twitter tools which are used for marketing purposes, including bulk messaging tools. One limitation remained in the number of category labels for classification: this is due to the rapid development of new Twitter software, and hence an increase of source strings [Cheong et al.
Another limitation is the lack of empirical findings on the various categories of source strings: this is again due to the magnitude of prior research, which only covered samples in the order of thousands. The Gigabyte Dataset, in contrast, contains raw metadata extracted from 7,, messages; more details of this dataset will be elaborated in Section 5.
To perform large-scale classification, expanding upon Experiments 4., I conducted the following. Method: Similar to my methods in both [Cheong and Lee, ] and [Cheong and Lee, c], for each of the records in the Gigabyte Dataset, I collated all the software client strings in the source metadata fields.
This time, I organized them into a frequency distribution beforehand; frequency bins with slight string differences, caused by escape characters or artefacts from character encoding, were merged. The resulting frequency distribution comprises 29, unique software client source bins.
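The merging step might look like the following sketch; the sample source strings are invented, and `html.unescape` plus whitespace trimming stand in for whatever artefact clean-up was actually applied:

```python
import html
from collections import Counter

# Invented sample source strings differing only by an HTML escape
# artefact or stray whitespace.
raw_sources = ["TweetDeck", " TweetDeck ", "Tweetie&amp;", "Tweetie&"]

def normalize(source):
    """Undo HTML escaping and trim whitespace so near-duplicates merge."""
    return html.unescape(source).strip().lower()

# Near-duplicate strings now fall into the same frequency bin.
bins = Counter(normalize(s) for s in raw_sources)
# bins: {"tweetdeck": 2, "tweetie&": 2}
```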
From this, I narrowed down the most frequently-used software clients, made up of . A complete listing of software clients is included in Appendix B. By searching the Internet to deduce the type of software a source string refers to, I classified each of the source strings into suitable device classes, using existing findings from Experiments 4.
Hitherto undiscovered source strings were collated into new groups based on their similarity. Results and Discussion: The device classes found through analysis of empirical data using the methodologies above [Cheong and Lee, , c] yielded a new categorization scheme of 14 device classes. A complete trend analysis of the data found in the Gigabyte Dataset will be discussed at length in the following chapter, in Section 5.
Other software client source strings found in the sample were also identified. From the dataset, it is observed that such software is a niche of the Japanese Twitter community, as five out of six identified bot programs had Japanese websites catering to the Japanese market. Twitter-based third-party sites: software or web services which provide a novel service (such as fancy visualization) which piggybacks on Twitter as the underlying technology.
Each of the software source strings studied can be stored as a key in a hash table, with its corresponding device class in the bucket, using the most popular source strings that I obtained in Experiment 4., as per the earlier discussion on hash tables in Section 4.

Device Class: Inferring Mobility and Usage Behavior

Based on the categorization scheme above, I can further infer the state of user mobility, and postulate usage behavior.
In this case, message metadata is used to infer properties of their authors. With this idea in mind, I suggest the usage of the device class property, proposed above, in inferring user mobility:

i. The use of desktop clients, the web interface, or Twitter interfaces only available on a non-mobile computer can suggest a fixed user mobility.

ii. The use of mobile clients, the mobile Twitter site, or the official SMS interface can suggest the user is in a mobile state.
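These two lookups can be sketched together as follows; the class labels and assignments below are illustrative examples, not the full 14-class scheme:

```python
# Hash table mapping source strings to device classes (illustrative subset).
DEVICE_CLASS = {
    "web": "website",
    "Twitter for iPhone": "mobile client",
    "TweetDeck": "desktop client",
    "txt": "SMS",
}

# Postulated mobility per device class, per points i and ii above.
MOBILITY = {
    "website": "fixed",
    "desktop client": "fixed",
    "mobile client": "mobile",
    "SMS": "mobile",
}

def infer_mobility(source):
    """Chain the two constant-time lookups: source -> class -> mobility."""
    device_class = DEVICE_CLASS.get(source, "unknown")
    return MOBILITY.get(device_class, "undetermined")
```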
Although not perfect (one can, say, even use a mobile Twitter interface at home), this idea does come into play when investigating cases of Twitter use in crisis and convergence events. In my paper [Cheong and Lee, ] discussing Twitter usage in civilian response to terror events (further elaborated in Section 7.), I reinforce this proposal based on Dearman et al. Similar to deducing user mobility, one can use the categorization scheme of Twitter device classes to determine usage behavior.
Examples of such inferences on usage behavior are: 1. The usage of Twitter clients in the social network integration category suggests that a section of Twitter users also participate in other Web 2.0 services. 2. The disproportionate absence of, say, mobile clients in Twitter metadata collected during such events can be used as an indicator of censorship. This can then be used in, say, targeted advertising, or for policy-making. Several profile items can be customized, as described in Section 4.

Profile Customization and the User

Given such information, I postulate that the degree of profile customization exhibited by a user, when used in conjunction with clustering and pattern recognition algorithms (to be discussed in Chapter 6), can reveal several traits about the user.
This sentiment was also shared by Erickson , in his discussion on user visibility. Such exhibitions of profile customization are evidence that Twitter users who customize their profile aim to reflect online presence, and are more likely to interact and participate in Twitter activity, as opposed to those who do not [Cheong and Lee, c].