Our guest speaker on the 9th was Jay Luker, a software engineer for the Astrophysics Data System (ADS), which has been at the core of all of our group projects this semester. Jay spoke to us about two tools he has been using in his work: MongoDB and Logstash.
Many of us in the class are familiar with relational databases, which organize data into normalized tables related to each other by unique identifiers. MongoDB is very different. It is a NoSQL database of the “document-centric” kind, and its data is typically de-normalized: related information is stored together in one record rather than split across tables. Data lives in collections of JSON-like documents, and unlike a relational database, MongoDB does not require a predefined schema. Documents within a single collection can be structured completely differently from each other, with different sets of fields.
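To make the idea of schema-free documents concrete, here is a minimal sketch in plain Python, with no database involved. The two records, their field names, and the `bibcode` values are made up for illustration and are not taken from the ADS data we used in class.

```python
import json

# A "collection" modeled as a plain list of JSON-like documents.
# Both records are hypothetical; note that they do not share a schema.
collection = [
    {
        "bibcode": "hypothetical-0001",
        "title": "An Example Article",
        "author": ["Doe, J.", "Roe, R."],  # a list stored inside the record
        "year": 2012,
    },
    {
        "bibcode": "hypothetical-0002",
        "title": "An Example Talk",
        "keyword": ["surveys"],
        "pages": {"first": 1, "last": 8},  # a nested sub-document
    },
]


def shared_fields(docs):
    """Return the sorted field names that every document has in common."""
    return sorted(set.intersection(*(set(doc) for doc in docs)))


# Only two fields appear in both documents.
print(shared_fields(collection))  # ['bibcode', 'title']

# The whole collection survives a round trip through JSON text.
print(json.loads(json.dumps(collection)) == collection)  # True
```

In a relational design the author list and page range would probably live in separate tables joined by an identifier; here they simply sit inside the record they describe.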
To get a taste of working with a MongoDB database, we set up our own free MongoDB instances on MongoHQ and interacted with them from the IPython interpreter using PyMongo, the Python driver for MongoDB. Jay provided us with a small collection of bibliographic records from the ADS to import into our databases. We then followed along as he demonstrated how to connect to the database, create and modify collections, write some simple queries, and view the contents of documents (our steps can be found in the notes; for more on this, see the MongoDB Docs). MongoHQ also provides a nice interface for doing the same things in the browser.
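The kind of steps we followed can be sketched roughly as below. This is not a transcript of Jay's session: it uses calls from the current PyMongo API (`MongoClient`, `insert_many`, `find`, `update_one`), which may differ from the version we used in class, and the helper names and the field names `author`, `title`, `bibcode`, and `keyword` are assumptions made for illustration.

```python
def connect(uri, db_name):
    """Return a database handle for the given connection URI.

    pymongo is imported here so the helpers below can be read and
    tried out without a driver or server installed.
    """
    from pymongo import MongoClient
    return MongoClient(uri)[db_name]


def load_records(collection, records):
    """Insert a list of dict records and return how many were stored."""
    result = collection.insert_many(records)
    return len(result.inserted_ids)


def titles_by_author(collection, author):
    """Return the titles of all documents whose author list contains `author`."""
    # The second argument is a projection: return only the title field.
    cursor = collection.find({"author": author}, {"title": 1, "_id": 0})
    return [doc["title"] for doc in cursor]


def tag_record(collection, bibcode, keyword):
    """Add a keyword to one record and return the number of documents modified."""
    result = collection.update_one(
        {"bibcode": bibcode},
        {"$addToSet": {"keyword": keyword}},  # append only if not already present
    )
    return result.modified_count
```

With a running instance, something like `db = connect("mongodb://localhost:27017", "demo")` followed by `load_records(db["records"], my_records)` would create the collection on first insert; the URI, database name, and collection name here are placeholders.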
When asked whether NoSQL will replace relational databases, Jay answered that it will not: there will always be a place for both. He added that some hybrids already exist, such as Postgres, which he described as the main competitor to NoSQL systems. It is a relational database that also lets you work in a NoSQL, document-centered way.
The remainder of class was spent on Logstash, an open source tool for managing logs and usage data. Usage logs can provide valuable information about how users are interacting with systems, but each log entry is typically written as a single line of text, which makes the data hard to query programmatically until it has been parsed into separate fields. Jay recently switched to Logstash as a simpler way to extract, parse, and store log data for the ADS, and he gave us a quick overview of using Logstash and outputting the data to MongoDB.
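To illustrate why raw log lines are awkward and what parsing buys you, here is a small sketch in plain Python. It is not Logstash and does not use Logstash's configuration; it only shows the general idea of turning one line of text into a structured record that could be stored as a document. The sample line is invented and follows the common web server access-log layout, which is an assumption about the format rather than a description of the ADS logs.

```python
import re
from datetime import datetime

# Named groups pull the individual fields out of one access-log line.
LOG_PATTERN = re.compile(
    r'(?P<client>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    event = match.groupdict()
    # Convert text fields into types that can be sorted and compared.
    event["status"] = int(event["status"])
    event["size"] = 0 if event["size"] == "-" else int(event["size"])
    event["timestamp"] = datetime.strptime(
        event["timestamp"], "%d/%b/%Y:%H:%M:%S %z"
    )
    return event


# An invented example line, not real usage data.
sample = (
    '192.0.2.10 - - [10/Oct/2000:13:55:36 -0700] '
    '"GET /search?q=dark+matter HTTP/1.1" 200 5120'
)
event = parse_line(sample)
print(event["path"], event["status"], event["timestamp"].year)
```

Once a line is a dictionary like `event`, questions such as "which searches returned errors last week" become ordinary queries instead of string matching.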
Our collaborative notes (PDF 124 KB), like this post, cover primarily the MongoDB portion of the class, but there is additional information on both tools in Jay’s setup instructions and class outline.