The automatic analysis and categorization of web content has attracted growing interest as information becomes available in a wide variety of formats (txt, ppt, pdf, pictures, audio, movies, etc.), content, genres, and authorship. We present two intelligent search systems:
- Read-X, a tool that searches the web and, in real time, performs a) HTML-free text extraction, b) classification by thematic content, and c) evaluation of expected reading difficulty. We are now extending Read-X by modeling reader characteristics: word frequencies built from a theme-labeled corpus are used to predict vocabulary difficulty relative to the reader's prior familiarity with the thematic content.
- Intelligent video content analysis focusing on recovering scene structure in movies for object tracking and action retrieval (project led by Ben Taskar). A weakly supervised algorithm uses the screenplay and closed captions to parse a movie into a hierarchy of shots and scenes. Scene boundaries in the movie are aligned with screenplay scene labels. We use NLP techniques to a) retrieve descriptions of actions from the parsed text and b) resolve referential ambiguity in the screenplay. The alignment between text and movie is used to label character names and common actions. The resulting annotations will be shown on video from a popular TV series.
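
The reader-modeling idea in the Read-X item can be illustrated with a minimal sketch. It is not the Read-X implementation: the toy corpus, the function names, the familiarity weights, and the negative-log scoring rule are all assumptions chosen for illustration. The sketch builds per-theme relative word frequencies from a theme-labeled corpus and scores a word as harder when it is rare in the themes the reader already knows.

```python
from collections import Counter
import math

# Hypothetical toy corpus (theme label -> documents); not data from Read-X.
CORPUS = {
    "medicine": ["the patient received a dose of insulin",
                 "insulin lowers blood glucose"],
    "finance": ["the bank raised the interest rate",
                "bond yield and interest rate moved"],
}

def theme_frequencies(corpus):
    """Return the relative frequency of each word within each theme."""
    freqs = {}
    for theme, docs in corpus.items():
        counts = Counter(w for doc in docs for w in doc.lower().split())
        total = sum(counts.values())
        freqs[theme] = {w: c / total for w, c in counts.items()}
    return freqs

def word_difficulty(word, freqs, familiarity, floor=1e-6):
    """Assumed scoring rule: negative log of the word's frequency,
    averaged over themes and weighted by the reader's familiarity
    with each theme. Higher values mean a harder word."""
    weight_sum = sum(familiarity.values())
    if weight_sum <= 0:
        expected = 0.0
    else:
        expected = sum(
            familiarity.get(theme, 0.0) * table.get(word.lower(), 0.0)
            for theme, table in freqs.items()
        ) / weight_sum
    # The floor keeps unseen words at a finite, maximal difficulty.
    return -math.log(max(expected, floor))

FREQS = theme_frequencies(CORPUS)
medical_reader = {"medicine": 1.0, "finance": 0.0}
finance_reader = {"medicine": 0.0, "finance": 1.0}

for reader_name, reader in [("medical", medical_reader), ("finance", finance_reader)]:
    print(reader_name, round(word_difficulty("insulin", FREQS, reader), 2))
```

Under these assumptions, the same word ("insulin") scores as easier for the reader familiar with medical text than for the reader familiar with financial text, which is the relative notion of vocabulary difficulty described above.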