“(11/17/2013) Guided once more by the wise and powerful Dr. Lynn Cherny, MD (Master of Data), we first attempted to discourage cardiac fibrillation by applying Topic Modeling to Grimm’s Fairy Tales, and we then performed an exploratory Gephi Procedure on the resulting corpus. Both efforts were considered successful and the Gephi Procedure revealed that data visualization is possible with as little as 50% of the cerebellum intact. Unfortunately, the patients continue to exhibit symptoms of fatigue.” -Excerpt from Diary of a Novice Data Librarian.
Thanks again to Lynn for being a patient teacher and sharing her time with us. Our first step was converting the text of Grimm’s Tales into a CSV file for use with a topic modeling tool. TF-IDF (term frequency-inverse document frequency) is a statistic that estimates how important a word is to one document relative to the rest of a collection: a word scores highly when it appears often in one story but rarely in the others. Very common words such as articles and conjunctions, known as stop words, are disregarded. Our topic modeling tool uses TF-IDF on the converted Grimm CSV to generate lists of words that might be important to the content of the stories.
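For anyone who wants to see the arithmetic, here is a minimal sketch of TF-IDF in plain Python. It is not the code inside the tool we used in class; the tiny stop word list, the function names, and the three made-up one-sentence "stories" are all assumptions for illustration.

```python
import math
from collections import Counter

# A tiny, hypothetical stop word list; real tools ship much longer ones.
STOP_WORDS = {"the", "a", "an", "and", "or", "but", "of", "to", "in"}


def tokenize(text):
    """Lowercase the text, strip surrounding punctuation, drop stop words."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]


def tf_idf(documents):
    """Return one {word: score} dict per document.

    tf  = count of the word in the document / total words in the document
    idf = log(number of documents / number of documents containing the word)
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Count how many documents each word appears in.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


# Made-up sample sentences standing in for full stories.
stories = [
    "The wolf ate the grandmother and the wolf slept.",
    "The princess kissed the frog and the frog became a prince.",
    "The wolf blew down the house of the pig.",
]
scores = tf_idf(stories)
top_words = [max(s, key=s.get) for s in scores]
```

In this sample, "wolf" appears in two of the three stories, so it scores lower in the first story than "grandmother", which appears only there. That is the effect the statistic is meant to capture: words shared across many documents say less about any one of them.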
The topic modeling tool generated an HTML index with links to the topics and to the individual stories.
When viewing a story, its percent match with each of the generated topics is listed in ranked order. We were then able to visualize the results using Gephi.
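One way to get from ranked topic matches to something Gephi can draw is an edge list: a CSV with Source, Target, and Weight columns, which Gephi's spreadsheet import recognizes. The sketch below assumes the matches are already in a Python dictionary; the story names, topic labels, percentages, and the 10% cutoff are all hypothetical, and the file we actually loaded in class may have been laid out differently.

```python
import csv
import io

# Hypothetical percent matches between stories and topics.
story_topics = {
    "Story A": {"topic_0": 62.0, "topic_1": 30.5, "topic_2": 7.5},
    "Story B": {"topic_0": 12.0, "topic_1": 80.0, "topic_2": 8.0},
}


def rank_topics(matches):
    """Sort one story's (topic, percent) pairs from best to worst match."""
    return sorted(matches.items(), key=lambda item: item[1], reverse=True)


def edge_rows(story_topics, min_percent=10.0):
    """Build (story, topic, percent) rows, skipping weak matches."""
    rows = []
    for story, matches in story_topics.items():
        for topic, percent in rank_topics(matches):
            if percent >= min_percent:
                rows.append((story, topic, percent))
    return rows


def write_edge_csv(rows, handle):
    """Write the rows as an edge list with a Source,Target,Weight header."""
    writer = csv.writer(handle)
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(rows)


rows = edge_rows(story_topics)

# Written to an in-memory buffer here; pass an open file to save to disk.
buffer = io.StringIO()
write_edge_csv(rows, buffer)
edge_csv = buffer.getvalue()
```

Each row becomes a link between a story node and a topic node, with the percent match as the edge weight. Dropping weak matches keeps the diagram from turning into a hairball where every story touches every topic.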
Gephi is useful for exploring the different options for network diagrams.
When a file is first loaded, the generated visualization is raw. Using this as a foundation, the formatting is customized and polished.
Color, shape, and size can all be manipulated to illustrate relationships within the data. Finally, the project is exported for use on the web. The final visualization is interactive and can be displayed in a browser. It may need some testing and additional tuning, but Gephi does the large majority of the required work automatically. Learn more and download Gephi for free at: https://gephi.org/
The time Lynn spent with us was short but valuable. We were all very glad to have her leading us towards greater data fluency and look forward to the next time our paths cross.
View the shared class notes here: DST4LNotes11-19-2013