Efficiently working with Spark partitions

It’s been quite some time since my last article, but here is the second one in the Apache Spark series. For those of you who are new to Spark, please refer to the first part of my previous article, which introduces the framework and its uses. In this article, I will show how to execute specific code on different partitions of your dataset. The use cases are varied: fitting different ML models on different subsets of data, generating group-specific features, and more.

Read more

Writing your own Gaussian Mixture Model Spark Estimator

Apache Spark is an open-source framework for distributed computation. It is particularly well suited to Big Data, where it effectively speeds up data analysis and data processing. Spark is also known for its very structured architecture, which allows customization. One of Spark’s key features is its Estimators, an abstraction of any learning algorithm. To get a quick, solid start in the Spark ecosystem, we will try implementing our own version of the Gaussian Mixture algorithm for 1-D data.
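To give an idea of what fitting a Gaussian Mixture on 1-D data involves, here is a minimal sketch of the expectation-maximization loop in plain Python. It is not the Spark Estimator from the article: the names `normal_pdf` and `fit_gmm_1d`, the quantile-based initialization, and the fixed number of iterations are all assumptions made for illustration.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(data, k=2, n_iter=50, min_var=1e-6):
    """Fit a k-component 1-D Gaussian mixture with EM (illustrative sketch)."""
    xs = sorted(data)
    n = len(xs)
    # Assumed initialization: means spread over the quantiles of the data,
    # shared variance, uniform weights.
    means = [xs[int((i + 0.5) * n / k)] for i in range(k)]
    overall_mean = sum(xs) / n
    overall_var = max(sum((x - overall_mean) ** 2 for x in xs) / n, min_var)
    variances = [overall_var] * k
    weights = [1.0 / k] * k

    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens) or 1e-300  # guard against underflow to zero
            resp.append([d / total for d in dens])

        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave an empty component unchanged
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = max(
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj,
                min_var,
            )
    return weights, means, variances


if __name__ == "__main__":
    rng = random.Random(0)
    sample = ([rng.gauss(0.0, 1.0) for _ in range(200)]
              + [rng.gauss(10.0, 1.0) for _ in range(200)])
    print(fit_gmm_1d(sample, k=2))
```

A Spark Estimator would wrap the same two steps behind `fit`, with the E-step and the sums of the M-step computed over the distributed dataset rather than a Python list.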

Read more

Scaling the A3C algorithm to multiple machines with TensorFlow.js

As I have been working on reinforcement learning and its application to web crawlers, I came across the A3C algorithm. The original A3C approach had its flaws and drawbacks when applied to the environment I set up for my needs. This blog post presents a different approach to the A3C algorithm, allowing us to scale it to multiple machines instead of multiple threads, while using TensorFlow.js on Node.js for the implementation.

Read more

Simple crawler using Puppeteer and Chrome Headless

The code below is a simple snippet showing how to use Puppeteer and headless Chrome to retrieve a list of proxies and additional information. It loops through the different pages of the website containing the proxy information and then saves it to a CSV file for further use.

Read more

Implementing SARSA(λ) in Python

This post shows how to implement the SARSA algorithm with eligibility traces in Python. It is part of a series of articles about reinforcement learning that I will be writing.
Please note that I will go into further detail as soon as I can. This is the first version of this article and I simply published the code, but I will soon explain the SARSA(λ) algorithm in depth, along with eligibility traces and their benefits.
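In the meantime, here is a minimal sketch of tabular SARSA(λ) with accumulating eligibility traces. It is not the code from the article: the function names, the hyperparameter defaults, and the five-state corridor environment (`corridor_step`) are hypothetical and only serve to make the example self-contained.

```python
import random
from collections import defaultdict


def epsilon_greedy(q, state, actions, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    best = max(q[(state, a)] for a in actions)
    return rng.choice([a for a in actions if q[(state, a)] == best])


def sarsa_lambda(step, start_state, actions, episodes=200, alpha=0.1,
                 gamma=0.95, lam=0.9, epsilon=0.1, max_steps=1000, seed=0):
    """Tabular SARSA(lambda) with accumulating traces (illustrative sketch)."""
    rng = random.Random(seed)
    q = defaultdict(float)
    for _ in range(episodes):
        traces = defaultdict(float)  # eligibility traces, reset every episode
        state = start_state
        action = epsilon_greedy(q, state, actions, epsilon, rng)
        for _ in range(max_steps):
            next_state, reward, done = step(state, action)
            next_action = epsilon_greedy(q, next_state, actions, epsilon, rng)
            # On-policy TD error: bootstrap from the action actually chosen next.
            target = reward if done else reward + gamma * q[(next_state, next_action)]
            delta = target - q[(state, action)]
            traces[(state, action)] += 1.0
            # Every recently visited pair shares the TD error, scaled by its trace.
            for key in list(traces):
                q[key] += alpha * delta * traces[key]
                traces[key] *= gamma * lam
            if done:
                break
            state, action = next_state, next_action
    return q


def corridor_step(state, action):
    """Hypothetical toy environment: states 0..4, reward 1 on reaching state 4."""
    next_state = min(max(state + action, 0), 4)
    done = next_state == 4
    return next_state, (1.0 if done else 0.0), done


if __name__ == "__main__":
    q_values = sarsa_lambda(corridor_step, start_state=0, actions=[-1, 1])
    for s in range(4):
        print(s, q_values[(s, -1)], q_values[(s, 1)])
```

The traces are what distinguish SARSA(λ) from one-step SARSA: a single reward at the end of the corridor updates every state-action pair visited on the way there, with a weight that decays by `gamma * lam` at each step.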

Read more

Browser fingerprints in a nutshell

Internet privacy has been a recurring subject over the last few years, as several social media platforms, such as Facebook and Twitter, have found themselves caught up in a ton of controversies and have been in the spotlight ever since.
This points to a trend: the average user is waking up and looking at their internet privacy from a new perspective. Regular users no longer see their privacy as a sacrifice they have to make in order to use their favorite social media, but rather as an aspect of the internet they should be able to control.
One of the least-known privacy threats is the browser fingerprint. As most users tend to focus on securing the data they know about, many have no idea about some of the newest methods in use.

Read more

Visualizing convolutional neural network outputs

Convolutional neural networks (CNNs) are the type of neural network most likely to let us understand what is happening internally, since, unlike many other types of neural nets (I am thinking of GANs, for example), CNNs are basically a representation of visual concepts.
Whether for debugging or for the personal satisfaction of seeing the magic happen inside your neural network, visualizing how the network interprets your input is an essential skill to master when getting into deep learning. I will start by presenting the most helpful method I discovered: the heatmap visualization of class activations in the input image.

Read more

Deploy your blog using Ghost and Docker on AWS

When I first heard about the Ghost platform for blogging, I was quite impressed by its simplicity, but at the same time the number of steps I had to go through before getting a running Ghost process on my server was a bit too much. I didn’t want something that would waste my time, but rather something simple to set up and able to display text written in Markdown or wiki syntax: I was looking for simplicity over features. Ghost seemed perfect for the job, as it was open source and didn’t have the many options that WordPress, Joomla and other platforms make sure to include and that I was sure I would never use.

Read more