3:30 PM Saturday Room: 8338
The majority of interesting information on the web is in the form of unstructured natural language data, written by humans for consumption by other humans. Natural language processing tools allow us to take data such as news articles, blog posts, tweets, and reviews and extract meaningful structured information from it. For example, using named-entity recognition and sentiment analysis, your code can look at a document, identify which people, organizations, products, and places are mentioned in it, and determine whether they are described in a positive or negative light. Using natural language parsers, it's possible to take a sentence and recover who's doing what to whom. Other tools can automatically construct tag sets or identify interesting characteristic phrases.
In this session, I will provide a brief introduction to natural language processing, and an overview of what tool sets, APIs and libraries are available. Code samples will be presented in Python and Java. However, the talk should be of general interest to anyone working with language data.
Topics that will be covered include:
* Sentiment analysis
* Identification of named entities (e.g., people, organizations, and places)
* Natural language parsing
* Document classification and automatic extraction of tag sets
* Summarization of documents
Toolkits that will be covered include Python's Natural Language Toolkit (NLTK) and Stanford's JavaNLP. APIs discussed will be OpenCalais and AlchemyAPI.
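To give a flavor of the simplest approach to the first topic, here is a minimal sketch of lexicon-based sentiment scoring in plain Python. It is not taken from the talk and does not use NLTK, JavaNLP, OpenCalais, or AlchemyAPI; the word lists and the function names (`tokenize`, `sentiment_score`, `label`) are hypothetical and chosen only for illustration. Real tools rely on much larger lexicons or trained models.

```python
import re
from collections import Counter

# Hypothetical toy lexicons, assumed for illustration only.
# Real systems use far larger curated lists or trained classifiers.
POSITIVE = {"good", "great", "excellent", "love", "wonderful"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "awful"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_score(text):
    """Return (number of positive tokens) minus (number of negative tokens)."""
    counts = Counter(tokenize(text))
    pos = sum(counts[word] for word in POSITIVE)
    neg = sum(counts[word] for word in NEGATIVE)
    return pos - neg


def label(text):
    """Map the numeric score to a coarse sentiment label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for review in (
        "The food was great and the staff were wonderful.",
        "Terrible service, awful food, good coffee.",
        "The meeting is at noon.",
    ):
        print(label(review), "|", review)
```

Counting words like this ignores negation ("not good"), sarcasm, and context, which is one reason the parsers and trained models covered in the session matter in practice.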