In my last blog post about Gap-Statistics, I explained that most datasets have a lot of features (hundreds or sometimes even thousands). We consider a dataset with 100 features to be 100-dimensional, so each feature represents one dimension of the dataset. That’s quite a lot of dimensions for visualizing the data. We can’t just cut away all features except for three, can we? Of course not! One exception could be that 98 out of 100 features are highly correlated (find out with heatmaps), so we omit 97 of them and keep only a single one. Then we would end up with a three-dimensional dataset. …
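To make the idea of omitting highly correlated features concrete, here is a minimal sketch in plain Python. It is not the code from the post: the function names `pearson` and `drop_correlated`, the toy columns, and the 0.95 threshold are all assumptions chosen for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.95):
    """Keep a feature only if it is not highly correlated
    with any feature that was already kept (hypothetical helper)."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy data (assumption): "a_scaled" and "a_shifted" carry the same
# information as "a", so only one of the three needs to survive.
columns = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "a_scaled": [2.0, 4.0, 6.0, 8.0, 10.0],
    "a_shifted": [1.1, 2.1, 3.1, 4.1, 5.1],
    "b": [5.0, 1.0, 4.0, 2.0, 3.0],
    "c": [2.0, 5.0, 1.0, 3.0, 4.0],
}

kept = drop_correlated(columns)
print(kept)
```

On this toy data, three of the five columns remain, which mirrors the 100-to-3 example above on a smaller scale. Which feature of a correlated group is kept depends on the column order, so in practice you may want to choose the representative deliberately.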
There is a lot of code going on under the hood. That’s why I provide my GitHub repository at the end of this post and show only a little of the K-Means code.
Clustering is an important technique in Pattern Analysis for identifying distinct groups in data. Because most data has more than three dimensions, we apply dimensionality reduction methods such as PCA or Laplacian Eigenmaps before clustering. The data is then available in 2D or 3D, which lets us visualize the found clusters in a way that is easy for humans to read. …
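As a rough illustration of the clustering step, here is a minimal K-Means sketch in plain Python. It is not the code from the repository mentioned above. It assumes the data has already been reduced to 2D, and it uses the first `k` points as initial centroids, which is a simplification made here to keep the example deterministic.

```python
def kmeans(points, k, max_iter=100):
    """Very small K-Means for points given as tuples of floats."""
    # Naive initialization (assumption): the first k points are the centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = [None] * len(points)

    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid
        # (squared Euclidean distance).
        new_labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels

        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))

    return labels, centroids


# Toy 2D data (assumption): two well-separated groups, interleaved.
points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Real implementations usually pick the initial centroids more carefully (for example with k-means++) and run several restarts, because K-Means can get stuck in a poor local optimum.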
Digital signatures are on the rise. Since many of us are now working from home, a lot of confidential company emails need to be signed online.
Ian Goodfellow’s invention of Generative Adversarial Networks (GANs) showed how easy it is nowadays to generate fake handwritten digits in the style of the MNIST dataset. It is just a tiny step from there to generating imitated signatures in the handwriting style of any person. But isn’t that dangerous?
Can we use Machine Learning to distinguish between an original and an artificially crafted signature? Indeed we can! We don’t even necessarily need one of those fancy neural network approaches; we can go totally classic with Hidden Markov Models (HMM). …
Last month I participated in my first hackathon because I received a random ad email from my university promoting a very cool-sounding one. I clicked on it and saw that they were searching for teams with different skills, including Data Scientists. Awesome, I thought, I’m in!
I signed up, applied and got accepted. Bingo!
A little bit about my background could be helpful. I am currently in the first year of my Master’s in Computer Science in Germany, and I work part-time as a Machine Learning Engineer. …
This blog post shows how the marketing effectiveness of Airbnb can be enhanced by analyzing a dataset from 2016. To improve the marketing, the four Ps of the marketing mix should be addressed. The dataset contains listings of rented apartments and their attributes.
Product
Marketing is about fulfilling customer needs and expectations. Within product policy, the aim is to understand one’s market and to figure out which needs and wants the customers have. In general, the main need of travelers is to find accommodation, but nowadays it is less about finding any accommodation and more about discovering the right one. …
When I enrolled in the Udacity Data Science Nanodegree, I didn’t know where my journey was going to end. That time has now arrived: I chose the dataset provided by Arvato for the final project needed to graduate from the Nanodegree.
The dataset consists of four files:
This project is provided by Arvato Bertelsmann, a Supply Chain Management solutions company located in Germany (https://www.kaggle.com/c/udacity-arvato-identify-customers).
The project is about finding the people in the German population who are most likely to respond to a targeted marketing campaign. …