The website (footnote 2) was used to collect tweet-ids (footnote 3); this website provides researchers with metadata from a (third-party-collected) corpus of Dutch tweets (Tjong Kim Sang and Van den Bosch, 2013). The tweet-ids allow the collection of tweets from the Twitter API that are older than 9 days (i.e., the historical limit when requesting tweets based on a search query). The R package 'rtweet' and its complementary 'lookup_status' function were used to collect tweets in JSON format. The JSON file comprises a table with the tweets' information, such as the creation date, the tweet text, and the source (i.e., type of Twitter client).

Data cleaning and preprocessing

The JSON files (footnote 4) were converted into an R data frame object. Non-Dutch tweets, retweets, and automated tweets (e.g., forecast-, advertisement-, and traffic-related tweets) were removed. In addition, we excluded tweets based on three user-related criteria: (1) We removed tweets that belonged to the top 0.5 percentile of user activity, such as users who created more than 2000 tweets within four weeks, because we considered them non-representative of the normal user population. (2) Tweets from users with early access to the 280-character limit were removed. (3) Tweets from users who were not represented in both the pre-CLC and post-CLC datasets were removed; this procedure ensured a consistent user sample over time (within-group design, N_users = 109,661). All cleaning procedures and corresponding exclusion numbers are presented in Table 2.
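Two of the user-level exclusions above (the activity cut-off and the requirement that a user appears in both periods) can be illustrated with a minimal Python sketch. The study itself worked in R, so this is not the authors' code. The function name `filter_tweets_by_user`, the `(user_id, text)` tuple layout, and the choice to count activity over both periods combined are all assumptions made for illustration.

```python
from collections import Counter


def filter_tweets_by_user(pre, post, top_fraction=0.005):
    """Apply two user-level exclusions to lists of (user_id, text) tuples.

    pre, post: tweets from the pre-CLC and post-CLC periods.
    top_fraction: share of the most active users to drop (0.005 = top 0.5%).
    Returns the filtered pre list, the filtered post list, and the kept users.
    """
    # Assumption: activity is the tweet count over both periods combined.
    activity = Counter(user for user, _ in pre + post)

    # Rank users from most to least active; ties are broken by user id
    # so the result is deterministic.
    ranked = sorted(activity, key=lambda u: (-activity[u], u))
    n_drop = int(len(ranked) * top_fraction)
    too_active = set(ranked[:n_drop])

    # Keep only users present in both periods (within-group design).
    pre_users = {u for u, _ in pre}
    post_users = {u for u, _ in post}
    kept = (pre_users & post_users) - too_active

    pre_kept = [(u, t) for u, t in pre if u in kept]
    post_kept = [(u, t) for u, t in post if u in kept]
    return pre_kept, post_kept, kept
```

With only a handful of users the default `top_fraction` drops nobody, because `int(n_users * 0.005)` rounds down to zero; the cut-off only becomes active on samples of at least 200 users.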

The tweet texts were converted to ASCII encoding. URLs, line breaks, tweet headers, screen names, and references to screen names were removed. URLs add to the character count when they are located within the tweet. However, URLs do not add to the character count when they are located at the end of a tweet. To prevent a misrepresentation of the actual character limit that users had to deal with, tweets with URLs (but not media URLs such as added pictures or videos) were excluded.
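A rough Python sketch of this text cleaning is given below. It is an illustration only, not the authors' R code. The regular expressions, the function names `contains_url` and `clean_tweet`, and the decision to drop (rather than transliterate) non-ASCII characters are assumptions. Distinguishing media URLs from other URLs requires tweet metadata and is not modelled here.

```python
import re

# Assumed patterns: a URL is anything starting with http:// or https://,
# and a screen-name reference is an @ followed by word characters.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def contains_url(text):
    """Return True if the tweet text contains at least one URL."""
    return URL_RE.search(text) is not None


def clean_tweet(text):
    """Reduce a tweet to ASCII text without URLs, mentions, or line breaks."""
    # Simplification: non-ASCII characters are dropped, not transliterated.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    # Collapse all whitespace runs, including line breaks, to single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()
```

In a pipeline along these lines, `contains_url` would be used first to exclude tweets, and `clean_tweet` would then be applied to the tweets that remain.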

Token and bigram analysis

The R package (footnote 5) 'quanteda' was used to tokenize the tweet texts into tokens (i.e., isolated words, punctuation [...]). In addition, token-frequency matrices were computed with: the frequency pre-CLC [f(token pre)], the relative frequency pre-CLC [P(token pre)], the frequency post-CLC [f(token post)], the relative frequency post-CLC [P(token post)], and T-scores. The T-test is similar to a standard T-statistic and computes the statistical difference between means (i.e., the relative word frequencies). Negative T-scores indicate a relatively higher occurrence of a token pre-CLC, whereas positive T-scores indicate a relatively higher occurrence of a token post-CLC. The T-score equation used in the analysis is presented as Eqs. (1) and (2). N is the total number of tokens per dataset (i.e., pre-CLC and post-CLC). This equation is based on the method for linguistic computations by Church et al. (1991; Tjong Kim Sang, 2011).
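To make the T-score concrete, the sketch below uses a common frequency-based approximation in the spirit of Church et al. (1991): the difference between the two relative frequencies, divided by the square root of f(token pre)/N_pre² + f(token post)/N_post². Whether this matches the study's Eqs. (1) and (2) term for term is an assumption, since those equations are not reproduced in this text. The function name `t_scores` is hypothetical, and Python is used for illustration although the study worked in R.

```python
from collections import Counter
from math import sqrt


def t_scores(pre_tokens, post_tokens):
    """Return a dict mapping each token to an approximate T-score.

    Positive scores mean the token is relatively more frequent post-CLC,
    negative scores mean it is relatively more frequent pre-CLC.
    """
    if not pre_tokens or not post_tokens:
        raise ValueError("both token lists must be non-empty")

    f_pre, f_post = Counter(pre_tokens), Counter(post_tokens)
    n_pre, n_post = len(pre_tokens), len(post_tokens)

    scores = {}
    for token in f_pre.keys() | f_post.keys():
        p_pre = f_pre[token] / n_pre      # relative frequency pre-CLC
        p_post = f_post[token] / n_post   # relative frequency post-CLC
        # Assumed variance approximation: f / N^2 for each dataset.
        variance = f_pre[token] / n_pre**2 + f_post[token] / n_post**2
        scores[token] = (p_post - p_pre) / sqrt(variance)
    return scores
```

The same function works for bigrams if each list element is a pair of adjacent tokens, for example `list(zip(tokens, tokens[1:]))`.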

Part-of-speech (POS) analysis

The R package (footnote 6) 'openNLP' was used to classify and count POS categories in the tweets (i.e., adjectives, adverbs, articles, conjunctions, interjections, nouns, numerals, prepositions, pronouns, punctuation, verbs, and miscellaneous). The POS tagger operates using a maximum entropy (maxent) probability model to predict the POS category based on contextual features (Ratnaparkhi, 1996). The Dutch maxent model used for the POS classification was trained on CoNLL-X Alpino Dutch Treebank data (Buchholz and [...]). The openNLP POS model has been reported to have an accuracy score of 87.3% when used for English social media data (Horsmann et al., 2015). An ostensible limitation of the current study is the reliability of the POS tagger. However, the same analyses were performed for both the pre-CLC and post-CLC datasets, meaning the accuracy of the POS tagger should be consistent across both datasets. Therefore, we assume there are no systematic confounds.
