This project, which I worked on with the Harvard Data to Actionable Knowledge Lab between March and August 2022, applies unsupervised machine learning and other techniques to a large, unorganized collection of open audio recordings to make the data more useful and easier to access. The original data consists of over 36,000 hours of meditations and lectures spanning five decades, made available through Dharma Seed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. Techniques for restructuring this data include topic modeling and transcription with speaker diarization. Work so far has used a corpus of the 2,200 most recent Dharma Seed recordings: each recording was first transcribed, and the transcripts were then used both to train a topic model on the corpus and to make the recordings themselves more accessible.
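The transcript-to-topic-model step can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the toy transcripts, the scikit-learn LDA implementation, and the two-topic setting are all assumptions standing in for the real 2,200-transcript corpus and its 50-topic model.

```python
# Hedged sketch: train a topic model on transcribed recordings,
# assuming a scikit-learn bag-of-words + LDA approach.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy stand-ins for transcripts (the real corpus has ~2,200 of them).
transcripts = [
    "breath awareness meditation practice sitting posture",
    "loving kindness metta practice compassion heart",
    "breath body awareness sitting meditation",
    "compassion kindness metta loving practice",
]

# Bag-of-words counts for each transcript.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(transcripts)

# The project used a 50-topic model; 2 topics suffice for this toy corpus.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# Each row is a probability distribution over topics for one recording.
print(doc_topics.shape)  # (4, 2)
```

Each talk ends up represented as a topic distribution, which is what later steps (visualization, recommendations) operate on.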
Cambridge Insight Meditation Center (CIMC) is a local partner for the project. While work continues on the larger Dharma Seed corpus, a prototype built from 228 recordings by CIMC teachers serves as a proof of concept. With this prototype, users can (1) search the full transcription of each recording rather than just its metadata, which makes search far more robust, (2) load lectures and meditations at specific points based on areas of interest, (3) read transcriptions broken out by speaker, and (4) see “recommended content” when opening a talk, computed by training a 50-topic model on the full corpus, projecting the talks into a t-SNE visualization, and displaying each talk’s nearest neighbors.
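The recommendation step in (4) could look roughly like this: embed per-talk topic distributions in two dimensions with t-SNE, then surface each talk's nearest neighbors. The random topic weights, corpus size, and parameter choices below are placeholders, not the project's actual values.

```python
# Hedged sketch of "recommended content": t-SNE embedding of topic
# distributions followed by a nearest-neighbor lookup (scikit-learn).
import numpy as np
from sklearn.manifold import TSNE
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Stand-in for the real 2,200 x 50 matrix of per-talk topic weights.
topic_weights = rng.dirichlet(np.ones(50), size=100)

# Project each talk to 2-D, as in the interactive visualization.
coords = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(topic_weights)

# Nearest neighbors in the embedding become the recommendations.
nn = NearestNeighbors(n_neighbors=4).fit(coords)
_, idx = nn.kneighbors(coords[:1])
recommended = idx[0][1:]  # drop the query talk itself
print(recommended)
```

Neighbors could equally be computed in the original 50-dimensional topic space; using the 2-D embedding keeps the recommendations consistent with what the visualization shows.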
t-SNE visualization of 2,200 Dharma Seed talks distributed among 50 topics. CIMC teachers’ talks are shown in red. Open the interactive visualization.
A core interest of CIMC was extracting valuable teacher advice from unplanned conversations between teachers and students during classes. Speaker diarization was used to (1) locate discussions and explicit advice within the recordings and (2) break talks into blocks that load at the start of that advice. Extraction was automated by using speaker diarization to detect new speakers, and experts at CIMC then selected the most relevant teacher conversations from the extracted data. Because transcriptions are timestamped, a time-based search is also possible.
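The speaker-change check at the heart of this extraction can be sketched as below. The segment format (start time, speaker label, text) is an assumption about what a diarization pass produces; real diarization output would vary by tool.

```python
# Hedged sketch: scan diarized, timestamped segments for speaker changes,
# which mark candidate starts of teacher-student exchanges.
def find_speaker_changes(segments):
    """Return start times where the speaker differs from the previous one.

    segments: list of (start_seconds, speaker_label, text) tuples,
    ordered by time, as a diarization pass might produce.
    """
    changes = []
    for prev, cur in zip(segments, segments[1:]):
        if cur[1] != prev[1]:
            changes.append(cur[0])  # candidate start of a new exchange
    return changes

# Toy diarized transcript: a teacher talk interrupted by a student question.
segments = [
    (0.0, "teacher", "Let's settle into the breath."),
    (95.0, "teacher", "Notice when the mind wanders."),
    (210.0, "student", "What if I keep falling asleep?"),
    (230.0, "teacher", "Try opening the eyes slightly."),
]

print(find_speaker_changes(segments))  # [210.0, 230.0]
```

Because each change point carries a timestamp, the same structure supports loading a recording directly at the start of an exchange, as described above.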