Fast Company discusses a new report from researchers at the Mitre Corporation showing how a Twitter user's gender can be inferred from what they write:
The paper, “Discriminating Gender on Twitter,” which is being presented this week at the Conference on Empirical Methods in Natural Language Processing in Scotland, demonstrates that machines can often figure out a person’s gender on Twitter just by reading their tweets. […]
To conduct their research, the Mitre folks–John Burger, John Henderson, George Kim, and Guido Zarrella–first had to assemble a corpus of Twitter users whose gender they were confident of. Since Twitter doesn’t demand that users specify gender, they narrowed their focus to Twitter users who had linked to major blog sites in which they had filled out that information. In addition to collecting the tweets of these folks–many users had only tweeted once, while one of them had tweeted 4,000 times–Burger et al. collected the minimal profile data that Twitter users sometimes do include: screen name, full name, location, URL, and description.
The dataset was about 55% female, 45% male (which squares roughly with estimates of Twitter’s overall gender breakdown). Thus, by guessing “female” for every user, a computer would be right 55% of the time. Simply by examining the full name of the user, a computer was accurate about 89% of the time–a remarkable improvement, if not an especially interesting one, since first names are highly predictive of gender. The Mitre findings become intriguing, though, when the team limited its analysis to tweets alone. By scanning for patterns in all the tweets of a given user, Mitre’s program was able to guess the correct gender 75.8% of the time–a 20% improvement over the baseline. And even just by analyzing a single tweet of a user, it was right 65.9% of the time–an over 10% improvement over the baseline. […]
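The "improvement over the baseline" figures quoted above are differences in percentage points, not relative gains. A minimal sketch of the arithmetic, using only the numbers reported in the excerpt and treating the "about 55%" baseline as exactly 55 for illustration (the function and variable names are my own):

```python
# Majority-class baseline: always guessing "female" is right about 55% of the time.
BASELINE = 55.0

# Accuracies quoted in the article, in percent.
RESULTS = {
    "full name only": 89.0,
    "all tweets of a user": 75.8,
    "single tweet": 65.9,
}

def gains(accuracy, baseline=BASELINE):
    """Return (percentage-point gain, relative gain in percent) over the baseline."""
    points = accuracy - baseline
    relative = 100.0 * points / baseline
    return round(points, 1), round(relative, 1)

for feature, accuracy in RESULTS.items():
    points, relative = gains(accuracy)
    print(f"{feature}: +{points} points ({relative}% relative to baseline)")
```

Under that assumption, the all-tweets result is 20.8 points above the baseline and the single-tweet result is 10.9 points above it, which matches the article's "20%" and "over 10%" when those are read as percentage points.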
How is this possible? How can we give away so much with 140 characters or less? There is a whole branch of study called “sociolinguistics” that observes that different people speak differently. […]
Mitre found that given certain characters or combinations of characters, the computer could wisely bet on the gender of the tweeter. The mere fact of a tweet containing an exclamation mark or a smiley face meant that odds were a woman was tweeting, for instance.
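To make "characters or combinations of characters" concrete, here is a rough sketch of how a tweet might be turned into character-level features such as the presence of an exclamation mark or a smiley. This is an illustration only, not Mitre's code; the function names and the choice of markers are assumptions, and a real system would feed such features into a trained classifier rather than inspect them by hand.

```python
from collections import Counter

def char_ngrams(text, n_max=2):
    """Count every character n-gram of length 1..n_max in the text."""
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

def has_markers(text, markers=("!", ":)")):
    """Report which of the given character sequences occur in the text.

    The default markers are illustrative, taken from the two examples
    mentioned in the article (exclamation mark, smiley face).
    """
    grams = char_ngrams(text, n_max=max(len(m) for m in markers))
    return {m: grams[m] > 0 for m in markers}

# Example with a made-up tweet:
print(has_markers("so happy :)"))
```

A classifier trained on labeled users would learn a weight for each such feature; the article's point is that some of those weights turn out to lean noticeably toward one gender.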
Read the full article for more information on how gender data can be pulled from Twitter posts.