Lip Reading Word Classification

Code, Slides

Lip reading, also known as visual speech recognition, is the technique of understanding speech only through visually interpreting the movement of the lip. The goal of this project is to create a classifier model that can identify the speech content of a sequence of images of a speaker uttering a single word. The primary motivation behind this project is the existence of large volumes of videos on the internet without subtitles or captions. This project can help extract information from these videos online, benefiting hearing-impaired groups, and increasing their accessibility to video media. Our solution approach is to implement a CNN-LSTM model, which is well suited for sequence classification problems with spatial inputs, such as images and videos. Afterward, perform tuning to identify best performing hyperparameters.

Joint work with Kristen Chan.