Leveraging Public Datasets and APIs using Python - ODSC Dublin Data Science MeetUp

Posting date: 24 February 2020
The second ODSC Meet Up of 2020 took place on Wednesday evening, 20th February, with a strong crowd in attendance. This week's Meet Up was all about public datasets and how developers can query and analyse big data from publicly available sources such as the Twitter API. Harvey Nash hosted the event and provided pizza, refreshments and beers for the crowd.

Speaker Profile: Johannes Ahlmann, CEO Sensatus.io, Algo Data Group Ltd

Johannes Ahlmann lives in Cork and is the CEO of Sensatus.io. Johannes has over 15 years of experience in the IT industry as Head of Data Science at Algo Data Group Ltd, Head of Data Science at ScrapingHub and Principal Developer at Dell EMC. He has programmed in Python for almost 20 years and loves learning about new programming languages and paradigms (Haskell, Rust, Scheme, Go, Scala, etc.).

Johannes eased the group into the presentation by explaining the fundamentals of datasets and why they matter. A dataset typically corresponds to the contents of a single database table, with each column representing a particular variable. When conducting sentiment analysis, for example, researchers can use the emojis attached to tweets to decide on the sentiment, which is far more time-efficient than analysing individual words. The emoji portrays the sentiment.
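The emoji idea can be illustrated with a minimal Python sketch. The lookup table and the function name `label_tweet` are assumptions made for illustration, not something shown in the talk; a real study would rely on a curated emoji sentiment lexicon.

```python
# Hypothetical emoji-to-sentiment lookup (illustrative only).
EMOJI_SENTIMENT = {
    "😀": 1, "😍": 1, "👍": 1,
    "😡": -1, "😢": -1, "👎": -1,
}


def label_tweet(text):
    """Label a tweet from the emojis it contains, ignoring all other characters."""
    # Characters missing from the lookup contribute 0 to the score.
    score = sum(EMOJI_SENTIMENT.get(ch, 0) for ch in text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for tweet in ["Great talk tonight 👍😀", "Missed the last bus 😢", "Pizza at 7pm"]:
        print(tweet, "->", label_tweet(tweet))
```

Labels produced this way could then serve as cheap training labels for a text-based model, although tweets without emojis end up as "neutral" and would need separate handling.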

For machine learning, researchers need a training dataset, which is the data actually used to teach the model to perform its task. Machine learning depends on data; without it, it is impossible for AI to “learn”. Data is the most important element because it makes training the algorithm possible. The training dataset must also be of high quality: poor data yields poor results.

Collecting voice data is a challenge for Amazon, for example. For this reason, the best way for Amazon to collect voice data is to keep selling its Echo devices at a discount, which in turn provides the company with revenue and a rich stream of voice data. Examples of publicly available data sources can be found below.

Google BigQuery 

BigQuery is a data warehouse exposed as a RESTful web service that enables scalable, cost-effective and fast analysis of big data, working in conjunction with Google Cloud Storage. It is a serverless Software as a Service with built-in machine learning capabilities. Because there is no infrastructure to manage, you can focus on analysing data to find meaningful insights using familiar SQL, without the need for a database administrator.
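As a rough sketch of what querying a public dataset from Python might look like, the example below counts Python-tagged Stack Overflow questions per year. It assumes the `google-cloud-bigquery` package is installed and Google Cloud credentials are configured; the choice of the `bigquery-public-data.stackoverflow.posts_questions` table and the helper names `questions_per_year` and `make_client` are illustrative assumptions, not taken from the talk.

```python
# Standard SQL against a BigQuery public dataset (table choice is an assumption).
SQL = """
SELECT EXTRACT(YEAR FROM creation_date) AS year, COUNT(*) AS questions
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE tags LIKE '%python%'  -- loose match, also catches tags such as python-3.x
GROUP BY year
ORDER BY year
"""


def questions_per_year(client, sql=SQL):
    """Run the query and return a list of (year, questions) tuples."""
    rows = client.query(sql).result()  # blocks until the query job finishes
    return [(row["year"], row["questions"]) for row in rows]


def make_client():
    """Create a BigQuery client; needs google-cloud-bigquery and credentials."""
    from google.cloud import bigquery  # imported lazily so the sketch loads without it
    return bigquery.Client()


# Usage, once credentials are set up (queries over public data may still be billed):
# for year, count in questions_per_year(make_client()):
#     print(year, count)
```

Passing the client in as an argument keeps the query logic separate from authentication, which also makes it easy to try the function with a stand-in client.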

Twitter API

Anyone who wants to access the Twitter API is required to register an application. By default, applications can only access public information on Twitter. Certain endpoints, such as those responsible for sending or receiving Direct Messages, require additional permissions from the user before the application can access that user's information. The endpoints are grouped into five primary categories: Accounts & Users, Tweets & Replies, Direct Messages, Ads, and Publisher Tools & SDKs. The Twitter API is user friendly and holds vast amounts of data, and sentiment analysis is a popular use of it.
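For a sense of what a request from a registered application might look like, here is a minimal sketch using only the Python standard library. It assumes the v1.1 standard search endpoint and app-only bearer token authentication as they existed around the time of the talk; endpoint availability and access rules may have changed since, and the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the v1.1 standard search endpoint, which may no longer be available.
SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"


def build_search_request(query, bearer_token, count=10):
    """Build (but do not send) an authenticated search request for public tweets."""
    params = urllib.parse.urlencode({"q": query, "count": count})
    return urllib.request.Request(
        f"{SEARCH_URL}?{params}",
        headers={"Authorization": f"Bearer {bearer_token}"},
    )


def tweet_texts(payload):
    """Extract the tweet texts from a decoded search response."""
    return [status["text"] for status in payload.get("statuses", [])]


# Usage, with a token issued for your registered application:
# request = build_search_request("data science", "YOUR_BEARER_TOKEN")
# with urllib.request.urlopen(request) as response:
#     print(tweet_texts(json.load(response)))
```

The extracted texts could then be passed to whatever sentiment analysis step you prefer.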

Johannes also discussed GitHub Archive, GDELT datasets, Stack Overflow, Common Crawl and Libraries.io, and encouraged the group to experiment with these (mostly) free resources.

The night provided some great insights into leveraging public data and the value within these datasets. Again, like so many of these events, it was 2 hours well spent! Thanks again to Johannes for taking the time to present and to Harvey Nash for their support!

Don’t miss our next Meet Up, join or view the event here.