7 minute read | 01-02-2020
Most people associate two things with libraries: first, an image of shelves stacked neatly with books; second, the distinct smell of old pages, which instils in you a longing for books. There is a word for the pleasure of smelling old pages: bibliosmia. Before the “Amazon” age, the easiest way to read was to go to a library and borrow a book. The advent of the digital age has transformed the way we read and naturally changed the landscape of libraries as well. eBooks and audiobooks are becoming popular formats for reading. It is now possible to borrow digital content from libraries right from home.
In this article I explore the trends in library usage, both old and new. In particular, I use the Seattle Public Library dataset as a representative window into library activity. The trends observed may well not generalize to all libraries, or even to other libraries in the country. However, in the absence of a unified dataset, this will have to suffice.
As part of the Open Data Program, the Seattle Public Library provides a dataset of checkouts containing over 35.1 million records collected since 2005. The library has over 2,681,971 items, which are cataloged here. The data can be downloaded from the Seattle open data portal over here.
The dataset consists of checkouts aggregated by month. It also has some metadata, such as the material type (eBook, paperback, audiobook, etc.), and general information like title, authors and publication year. The entire dataset is about 7 GB in size. There is another version of this dataset that is not aggregated and stores one record per checkout. That version can be found here.
The easiest statistic to pull out of any dataset is the top n, and it provides a gateway for further analysis.
This list is in no way definitive and masks a fair bit of assumption on our part. First,
it assumes that a book checked out equals a book read. We all know that isn't always the
case. Second, it assumes that the number of checkouts equates to qualitative value. Third,
the nature of the data lends itself to inherent errors. The largely textual data is fraught
with complications due to string formatting and redundancy. For example, there are 3
different versions of “Educated: a memoir” differing only in the format of the name:
Educated: A memoir,
Educated : A memoir,
Educated: A memoir / Tara Westover. I've tried
to take this into account and reformat them while processing, but in a dataset this large
errors are unavoidable. There are also several unnamed records with high checkout counts,
but since the title is unknown we will never know what they are.
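The kind of title clean-up described above can be sketched as follows. This is only an illustration of one possible approach, not the exact code used for the analysis: the helper `normalize_title` is hypothetical, and it assumes that an author statement, when present, is separated from the title by " / ".

```python
import re

import pandas as pd


def normalize_title(raw: str) -> str:
    """Collapse formatting variants of a title into one comparison key.

    Hypothetical helper: assumes " / " only ever introduces an author statement.
    """
    title = raw.split(" / ")[0]              # drop a trailing "/ Author" statement
    title = re.sub(r"\s*:\s*", ": ", title)  # unify spacing around colons
    title = re.sub(r"\s+", " ", title)       # collapse repeated whitespace
    return title.strip().casefold()          # ignore case differences


# The three variants mentioned in the text.
variants = pd.Series(
    [
        "Educated: A memoir",
        "Educated : A memoir",
        "Educated: A memoir / Tara Westover",
    ]
)

keys = variants.map(normalize_title)
print(keys.unique())
```

A key like this is only good for grouping. Titles that legitimately contain " / " would be truncated, so the original title is worth keeping alongside the key for display.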
There's a lot of overlap between the list shown here and the best-selling books of 2019 reported on other websites, which provides a comforting sanity check.
These are the all-time highest checkout counts over the last 15 years. It is worth mentioning that Educated: A Memoir by Tara Westover and Becoming by Michelle Obama, which take the top spots, were published only in 2018 and have amassed this many checkouts in a year. Educated also features on Bill Gates's list of must-reads.
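Both rankings above, for a single year and for all time, come down to a group-and-sum followed by taking the n largest values. Here is a minimal sketch on a toy table. The column names `Title`, `CheckoutYear` and `Checkouts` are assumptions about the schema, the numbers are made up for illustration, and `top_n_titles` is a hypothetical helper.

```python
from typing import Optional

import pandas as pd

# Toy stand-in for the monthly checkout records.
# Column names are assumed and the values are invented.
records = pd.DataFrame(
    {
        "Title": ["Educated", "Becoming", "Educated", "Circe", "Becoming", "Educated"],
        "CheckoutYear": [2019, 2019, 2019, 2019, 2018, 2018],
        "Checkouts": [120, 90, 80, 60, 70, 30],
    }
)


def top_n_titles(df: pd.DataFrame, n: int, year: Optional[int] = None) -> pd.Series:
    """Return the n titles with the most checkouts, optionally for a single year."""
    if year is not None:
        df = df[df["CheckoutYear"] == year]
    return df.groupby("Title")["Checkouts"].sum().nlargest(n)


top_2019 = top_n_titles(records, n=2, year=2019)
top_all_time = top_n_titles(records, n=2)
print(top_2019)
print(top_all_time)
```

On the real data, the grouping key would ideally be a cleaned-up title rather than the raw string, so that formatting variants of the same book are counted together.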
The most used format still seems to be the good old paperback. However, things have been changing over the last decade: eBooks and audiobooks are slowly rising in popularity. In 2005 paperbacks had 99.8% of the share, which dropped to 54% in 2019. eBooks and audiobooks, on the other hand, rose from a meagre 0.12% and 0.06% respectively in 2005 to 28.7% and 17.15% in 2019.
Audiobooks have been rising steadily in popularity, by 3% each year, and seem to be the future. eBooks, which initially seemed to rise by leaps and bounds, have tapered off but are still gaining ground on paperbacks. The popularity of both digital formats can plausibly be attributed to Amazon's foray into digital content. Its own proprietary Kindle ecosystem for eBooks and Audible for audiobooks have been highly successful.
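Format shares like the ones quoted above can be derived by pivoting checkouts into a year-by-format table and dividing each row by its total. The sketch below uses invented numbers, and the column names (`CheckoutYear`, `Format`, `Checkouts`) are assumptions rather than the dataset's actual schema.

```python
import pandas as pd

# Invented numbers for illustration only; column names are assumed.
records = pd.DataFrame(
    {
        "CheckoutYear": [2005, 2005, 2019, 2019, 2019],
        "Format": ["Paperback", "eBook", "Paperback", "eBook", "Audiobook"],
        "Checkouts": [990, 10, 500, 300, 200],
    }
)

# One row per year, one column per format, summed checkouts in each cell.
totals = records.pivot_table(
    index="CheckoutYear",
    columns="Format",
    values="Checkouts",
    aggfunc="sum",
    fill_value=0,
)

# Divide each row by its yearly total to get percentage shares.
share = totals.div(totals.sum(axis=1), axis=0) * 100
print(share.round(2))
```

Expressing each format as a share of the year's total, rather than as a raw count, keeps the comparison fair even if overall library usage grows or shrinks over time.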
Unsurprisingly, Fiction seems to be the most popular genre of the lot. This part of the analysis isn't clean, though. There is no clear genre list; instead, each book is tagged with a subject, which is basically a summary of the book listing all possible categories. This led to several overlapping sub-categories which I couldn't distil further. As a result, sub-classes and broad categories are mixed together.
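One way to work with such multi-category subject strings is to split them into individual tags and count checkouts per tag. The sketch below assumes the tags are comma-separated and stored in a column called `Subjects`. Both are assumptions, and the toy rows are invented.

```python
import pandas as pd

# Invented rows; the "Subjects" column and its comma-separated format are assumed.
records = pd.DataFrame(
    {
        "Title": ["A", "B", "C"],
        "Subjects": ["Fiction, Thriller", "Nonfiction, Business", "Fiction, Literature"],
        "Checkouts": [10, 5, 7],
    }
)

# Split each subject string into a list, then give every tag its own row.
tags = records.assign(Subject=records["Subjects"].str.split(",")).explode("Subject")
tags["Subject"] = tags["Subject"].str.strip()

# Sum checkouts per tag; a book counts once for every tag it carries.
by_subject = tags.groupby("Subject")["Checkouts"].sum().sort_values(ascending=False)
print(by_subject)
```

Because each checkout is counted once per tag, the per-tag totals add up to more than the real number of checkouts, and broad tags such as Fiction overlap with their sub-categories. This is the same mixing of levels noted above.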
We still have enough to draw some conclusions, though. Fiction has clearly been the most popular genre over the last several years. The last 5 years have also seen a slow rise in Nonfiction categories. In particular, Business and Biographies have seen a small but notable increase in readership.
There is a strong relationship between the month of the year and checkout activity. January, March, July and August invariably have the highest number of checkouts, and the rest of the months show much less activity in comparison. The only exceptions are June and October, which rank in between the extremes. This trend is consistent across the last decade.
The correlation most likely arises from the universities' schedule in Seattle. Almost all the universities in Seattle (University of Washington, Seattle University, etc.) follow the quarter system, which is shown below (dates are approximate):
The heatmap below visualizes normalized checkout values as a function of year and month. Each box, indexed by a month and a year, represents the number of books borrowed in that period. The shade indicates the magnitude, with lighter to darker representing fewer to more checkouts. It is evident that periods of high usage align with the breaks in the school term or the beginning of it, when students are most likely to borrow books.
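The matrix behind a year-by-month heatmap like this can be built with a pivot table. The sketch below uses invented numbers and assumed column names (`CheckoutYear`, `CheckoutMonth`, `Checkouts`). The normalization shown, dividing each month by that year's total, is one plausible choice and not necessarily the one used for the figure.

```python
import pandas as pd

# Invented numbers for illustration; column names are assumed.
records = pd.DataFrame(
    {
        "CheckoutYear": [2018, 2018, 2018, 2019, 2019, 2019],
        "CheckoutMonth": [1, 2, 7, 1, 2, 7],
        "Checkouts": [300, 100, 400, 450, 150, 600],
    }
)

# Rows are years, columns are months, cells are summed checkouts.
grid = records.pivot_table(
    index="CheckoutYear",
    columns="CheckoutMonth",
    values="Checkouts",
    aggfunc="sum",
    fill_value=0,
)

# Assumed normalization: each month as a fraction of its year's total,
# so that overall growth between years does not mask the seasonal pattern.
normalized = grid.div(grid.sum(axis=1), axis=0)
busiest_month = normalized.idxmax(axis=1)
print(normalized)

# With Seaborn installed, seaborn.heatmap(normalized) would draw the matrix.
```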
In the last few years, several best-selling books have been made into movies or TV shows. It would be interesting to see whether there is a reverse effect, wherein the release of a movie sparks interest in the books from which the plot arose. I consider 3 popular series of books: The Hunger Games, Harry Potter and Game of Thrones.
The original series consisted of 3 books: The Hunger Games, Catching Fire and Mockingjay, published in 2008, 2009 and 2010 respectively. The corresponding movies were released in 2012 and 2013, and the last book was adapted as 2 parts, released in 2014 and 2015. The checkout trends are visualized below as a heatmap. The movies appear to have influenced readers: there is a sharp spike in the number of checkouts for The Hunger Games in 2012, after which the successive books also show spikes. The increase in readership moves across the heatmap, almost like a staircase.
In contrast, the readership of the Harry Potter books doesn't display any such fluctuations. The last two books, Half-Blood Prince and Deathly Hallows, show increased checkouts around their respective times of release, 2005 and 2007. There's also an unexplained spike in the checkouts of the first book in 2018. Nothing conclusive can be said here, since there is no known underlying cause that explains the variations.
GoT is probably one of the most popular shows in the history of TV (barring the last season), with staunch followers. The original series of books (still not completed) began publication in the 1990s. The first season of Game of Thrones aired in April 2011. This caused a very obvious craze for the books, which is evident from the frenzy of colors on the graph for the first book. However, this popularity doesn't seem to transfer entirely to its successors, as indicated by the lighter shades on the graph. This could be because the show diverged from the books at this point, or simply because readers weren't interested in continuing the series. Either way, it's safe to say that this is a clear example of a screen adaptation sparking excitement for the books instead of the other way round.
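A simple way to put numbers on this kind of comparison is to find the year in which each title's checkouts peak and set it against the adaptation's release year. In the sketch below the checkout counts are invented, the column names are assumed, and only the release years (2012 and 2013) come from the text above.

```python
import pandas as pd

# Invented yearly checkout counts; column names are assumed.
yearly = pd.DataFrame(
    {
        "Title": ["The Hunger Games"] * 4 + ["Catching Fire"] * 4,
        "CheckoutYear": [2010, 2011, 2012, 2013] * 2,
        "Checkouts": [40, 60, 200, 120, 30, 50, 150, 180],
    }
)

# Rows are titles, columns are years.
by_year = yearly.pivot_table(
    index="Title", columns="CheckoutYear", values="Checkouts", aggfunc="sum"
)

# The year in which each title was checked out the most.
peak_year = by_year.idxmax(axis=1)

# Movie release years as stated in the article.
releases = pd.Series({"The Hunger Games": 2012, "Catching Fire": 2013})

# A lag of 0 means checkouts peaked in the release year itself.
lag = peak_year - releases
print(lag)
```

A peak that coincides with a release year is suggestive rather than conclusive, since other events in the same year could also drive checkouts.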
The data was processed using Pandas, and a combination of Seaborn and Plotnine
was used for plotting and visualization. The main challenge in processing this dataset
was its size. With 8 GB of RAM it is not feasible to load the entire dataset in one
shot. The solution was to use the chunking feature in
Pandas to process the data as
small aggregated data frames as required, which takes a lot of time. The other challenging
aspect was the textual nature of the data, with the formatting and redundancy
problems described earlier. The Jupyter notebook which has all the code is available here.
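For readers unfamiliar with chunked processing, here is a minimal sketch of the idea using the `chunksize` parameter of `pandas.read_csv`. It illustrates the pattern and is not the notebook's code. The in-memory CSV, its column names and the helper `checkouts_per_title` are all made up for the example.

```python
import io

import pandas as pd

# Tiny in-memory CSV standing in for the multi-gigabyte file.
# Column names are assumed and the values are invented.
csv_text = """Title,CheckoutYear,Checkouts
Educated,2019,120
Becoming,2019,90
Educated,2019,80
Circe,2019,60
Becoming,2018,70
"""


def checkouts_per_title(source, chunksize: int) -> pd.Series:
    """Sum checkouts per title while holding only one chunk in memory at a time."""
    partials = []
    # With chunksize set, read_csv yields DataFrames of at most `chunksize` rows.
    for chunk in pd.read_csv(source, chunksize=chunksize, usecols=["Title", "Checkouts"]):
        # Reduce each chunk to a small per-title summary before keeping it.
        partials.append(chunk.groupby("Title")["Checkouts"].sum())
    # A title can appear in several chunks, so combine the partial sums.
    return pd.concat(partials).groupby(level=0).sum()


totals = checkouts_per_title(io.StringIO(csv_text), chunksize=2)
print(totals)
```

On the real file, `source` would be the path to the downloaded CSV and `chunksize` would be far larger. Reading only the needed columns with `usecols` also reduces the memory used by each chunk.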