
Best European Tech Companies for data science?

5/3/2015

This list is by no means exhaustive. At Data Science Europe, we only reach out to a company if at least one student has shown interest in working there. Therefore, the companies below are tech companies that we liked and that attracted the interest of at least one student.
As we add more students and reach out to more companies, we will expand the list.
This list does not depend on whether a student got an offer from a given company. It just represents our opinion of each company after the entire process.
Unfortunately, there were quite a few tech companies that didn't impress us or even disappointed us, but we are too nice to name them here (we will definitely share the information during the course).

Firstly and most importantly, a future data scientist interested in tech has a big choice to make: joining a company with an already established team or joining a company where she would be the first data scientist. Both choices have pros and cons and should be evaluated based on long-term career goals.

For big companies with established teams, there are 2 tech companies that clearly stand out in Europe (Spotify could be added too, but lately it seems to have been focusing on building a machine learning team in NY):
  • Booking.com
  • King
Both companies have extremely high-level data scientists, behaved in a professional and efficient way during the recruiting process, and their people seemed nice.
Working there would give you the opportunity to have a great data science mentor, learn from the best, and get a brand name on your resume.

For small start-ups where a candidate would be the first DS and in charge of building a team, we really liked companies that had all of these characteristics: (1) going through hypergrowth (2) approachable CEO/executive team despite their success (3) clearly investing in DS as a core of their business (4) high level people:
  • Catapult Sports
  • Getyourguide
  • Kitman Labs
  • TravelBird

Joining one of those 4 companies (listed in alphabetical order) is very likely to put your career on the fast track. If you choose to work for a start-up, make sure that it meets all 4 points above. Otherwise, the chances of success are slim.

As we gather more data, we will update this list as well as select the best consulting companies and large non-tech companies in Europe.

P.S. Clicking on the company names will take you to their career pages. In any case, don't get discouraged if in the future you don't see any DS openings there. These are the best companies in Europe. If they think you are smart, they'll create a role for you, because that's what great companies do.

Data science in the U.S. vs Europe

5/3/2015

Probably the question we get asked the most is: what's the difference between getting a job as a data scientist in the US vs Europe? (We are only talking about real data science jobs here. We don't deal with the fake ones.)

After the first batch, we had students interviewing at many US as well as European companies, and that, plus our own prior experience, gave us a great opportunity to notice the main differences.
Europe PROS

1) In Europe, in many cases you will be the first data scientist, building the team. Not just in start-ups, but also in large companies. Your career can grow extremely fast (-> head of the dept in months).

2) If a European company is investing in data science now (a bit ahead of the curve), it shows that it has a cutting-edge mentality. Right now, you are less likely to end up with a crappy DS job in Europe than in the US.


3) The field is still small in Europe. You will soon get to know everyone. Being one of the first data scientists and getting to know the other first data scientists will be huge for your career.

4) Quality of life (I know it is highly subjective, but still...)





Europe CONS

1) Salaries are lower: US salaries are 25% higher than anywhere in Europe except Switzerland, where salaries are on par with the US.



2) Job candidates typically have more power in the US. For instance, if you get an offer, US companies will wait until you have finished interviewing everywhere else before asking you for a final decision. In Europe, they send you an offer and tell you to sign it within 1 week no matter what.

3) Europeans are pretty cheap with stocks, while in the US stocks are a relevant part of the compensation package.


4) If at some point you plan to start a company, the US is better.

The confession of a FAKE data scientist

5/3/2015

There is a well kept secret in the data science world: out of 10 professionals with a data scientist title, probably 3 do real data science work (modeling, machine learning, recommendation systems, identifying business opportunities in the data, etc.) and 7 are just rebranded analysts who are definitely not doing the 'sexiest job of the century' (they write SQL queries for a product manager and put up dashboards). 

Many extremely bright PhDs get hired as data scientists, think they are going to solve hard challenges and end up doing a job which is very different from what they were expecting.

I was reading this post on reddit, "no one knows am an alcoholic", and it made me think about what a similar post for "data science" would look like. Probably something like:

NO ONE KNOWS AM A FAKE DATA SCIENTIST

I spend nearly every day at the office alone, writing SQL queries and surfing the internet. I wake up nearly every morning bored and ashamed of myself. I swear I'm going to stop, and then the next day I'm writing SQL again. I don't black out, miss work or alienate friends with complaints about SQL (I try never to talk to or communicate with anyone while writing SQL. No one outside work knows I do this and I'm terrified of anyone finding out) so I'm probably 'functional' at this point, but I know that if I don't stop and start doing intellectually challenging work, I'm going to end up ruining my life. But I have no willpower. I don't think I can.

The problem here is that no one really has any incentive to uncover this situation: writing SQL queries is simple and boring, but it is a delicate task and employers love having smart people do it (and that's why they are willing to give a shiny title and a fat salary to PhDs for doing it).
On the other hand, "fake data scientists" have no reason to let people know how bad their job is. Partially because of pride (all my friends think I have a great job, that's a nice feeling!) and partially because when they change jobs it is in their best interest to pretend that in their previous job they solved hard challenges[1].
This leads to a situation that makes it extremely hard for someone from academia to figure out what kind of job they are really getting: a real data scientist or a rebranded analyst?

Part of the goal of DSE is to make sure you'll get a real data scientist job. Let me say this again because it is very important: we'll make sure you are not getting a rebranded analyst job. We'll tell you which questions to ask to figure that out (mentor advice here was invaluable). We already have alumni and mentors working at the best European and US companies telling us what they actually do.

Btw, if you see at least one of these keywords in a job ad, it is obviously a rebranded analyst job:
  • Excel
  • Tableau 
  • Dashboards
  • Reporting
  • Ad-hoc analyses
  • SQL wizard/guru/ninja/etc.


Note:
[1] Can a start-up please solve this problem and come up with a rigorous way to actually verify what an employee did in her previous job?? 

Who is a data scientist?

6/30/2014

Data scientist has been defined as the sexiest job of this century. But who is a data scientist? Why is the role in such demand? How much of it is hype and how much do companies really need this new professional figure?

The classical definition of data science is a field in between statistics and computer science: a statistician with coding skills or a software engineer who understands statistics. 
However, I feel that the best way to define a data scientist is through the goal she is supposed to achieve, rather than through her skills such as coding or knowing what's a Poisson distribution.

A data scientist is someone who is able to use the huge amounts of data companies have nowadays to positively affect the company. It is really that simple. And this is also why companies are running after data scientists, sometimes without even knowing exactly what they are looking for. 
They know they have tons of user-generated data. They know that it could be a goldmine (Which users like us? Why do some users convert and some don't? How can we make users who currently don't convert change their behavior?) and they want someone able to answer those questions and build products accordingly.

Some data scientists are more statisticians focusing on statistical tests and regression models, some are more computer scientists putting machine learning models into production, some focus primarily on data visualizations helping executives understand the business, but all of them are using data to improve the product their company sells. That's the lowest common denominator. 

Currently, there is a tremendous mismatch between demand and supply of data scientists. 
The general belief is that very few people are coming out of school with the skills required to be a data scientist and, at the same time, companies are trying to build entire data science teams to take advantage of their data. That's why data scientists are called unicorns, salaries have skyrocketed (~$125K entry level, ~$200K mid-career in Silicon Valley), and the "sexiest job of the century" definition was born.

However, in general, I disagree with the statement that there is a shortage of people with data science skills. Think about astrophysicists or particle physicists. They deal with huge data sets all the time, building statistical models to discover the event of interest. Or think about biostatisticians, constantly trying to identify which variables are related to a given disease. These problems are very similar to typical data scientist problems such as who clicks on an ad or why we aren't growing in Europe. In fact, pretty much any grad student in a quantitative field is a data scientist.

In my opinion, the data scientist demand-supply mismatch is not about technical skills, but is about:
  • specific tools used in the industry (SQL, Hive, Pig, R, GitHub)
  • ability to work with unstructured data and little project direction 
  • ability to do well in a fast-paced environment
  • ability to explain advanced models to a director of marketing instead of an MIT faculty member
  • matching problem: students not knowing how to get in touch with companies and companies not knowing where to find potential data scientists in the university world


Btw, at Data Science Europe, we are trying to solve exactly those points. We will teach you the tools used in the industry, you will tackle real-world data problems under time pressure, and we will teach you how to present to a non-technical audience.
Finally, we'll take you to the companies you are interested in and help you prepare for the interview.

Data scientist Job Interview questions

6/30/2014

These are real questions that have been asked during job interviews at tech companies in SF. The goal of this post is to use their questions to give an idea of the projects data scientists tackle in tech companies. 

Q1: A typical problem at our company is retention. People set up their account, but then they don't use it. How would you try to improve retention at our company?

This is REALLY a classical project you will face working as a data scientist. A weak area of the business will be identified (here it is retention, but it can be conversion, growth, or really anything) and your job as a data scientist is to suggest how to solve the problem.

Firstly, define the outcome variable. To keep things easy, always start by making it binary: in this case, 0/1 -> non_retained/retained. Look at the distribution of the time delta between when users create their account and when they take their first main action (post, upload a pic, etc.). Based on the plot, pick a cut-off point and define as non_retained the users who don't take that action within x days of creating their account.
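The labeling step can be sketched in a few lines of Python. This is only a sketch: the user records and the 7-day cut-off are made up for illustration, and label_retained is a hypothetical helper name. In practice the cut-off comes from looking at the distribution of the time delta.

```python
from datetime import date

def label_retained(signup, first_action, cutoff_days):
    """Return 1 (retained) if the first main action happened within
    cutoff_days of account creation, else 0 (non_retained)."""
    if first_action is None:  # the user never took the main action
        return 0
    return int((first_action - signup).days <= cutoff_days)

# Hypothetical users: (signup date, date of first main action or None)
users = {
    "u1": (date(2015, 1, 1), date(2015, 1, 2)),
    "u2": (date(2015, 1, 1), date(2015, 1, 20)),
    "u3": (date(2015, 1, 5), None),
}

# Assumed cut-off of 7 days, chosen only for this example
labels = {
    user: label_retained(signup, first_action, cutoff_days=7)
    for user, (signup, first_action) in users.items()
}
```

Here only u1 ends up labeled as retained: u2 took 19 days to act and u3 never did.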

Select the variables: typically user-based (demographics, region, language, etc.) and usage-based (how they got to the website, where they clicked before creating the account, where they clicked after, when they created the account, device, etc.).

Draw dependence plots of each variable against the outcome. Expect lots of spurious relationships here. The point is just to get a feel for the data, to make sure there is information somewhere, and potentially to apply some transformations, such as taking the log of some variables, if the distribution requires it.
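For a categorical variable, the numbers behind such a plot are just the retention rate per value. A minimal sketch, assuming each user is a dict with a 0/1 "retained" field (the rows and the "device" variable below are invented for the example):

```python
from collections import defaultdict

def retention_rate_by(rows, variable):
    """Share of retained users for each value of a categorical variable."""
    totals = defaultdict(int)
    retained = defaultdict(int)
    for row in rows:
        totals[row[variable]] += 1
        retained[row[variable]] += row["retained"]
    return {value: retained[value] / totals[value] for value in totals}

# Hypothetical rows; "retained" is the 0/1 outcome variable
rows = [
    {"device": "mobile", "retained": 1},
    {"device": "mobile", "retained": 0},
    {"device": "desktop", "retained": 1},
    {"device": "desktop", "retained": 1},
]

rates = retention_rate_by(rows, "device")
```

With so few rows the difference between mobile and desktop means nothing, which is exactly the kind of spurious relationship to watch out for.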

Use a binary model to predict retention (it can be any model, from a logistic regression to decision trees, boosted decision trees, etc.). The key part is for the model to be understandable, i.e. you can tell which variables affect the outcome and how.

After this, look at the model, understand how it works, and explain it. Always, always look for actionable insights. Ex: do people who visited few pages perform worse? If so, we need to build a data product that better recommends what people should do after creating the account.
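To make the modeling and interpretation steps concrete, here is a rough sketch of a one-variable logistic regression fitted by plain gradient descent. The data, the learning rate and the number of epochs are all assumptions made for the example. In a real project you would use an existing library and many more variables.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit P(retained) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the average log-loss with respect to w and b
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(sigmoid(w * x + b) - y for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Hypothetical data: pages visited after signup and the 0/1 retained label
pages_visited = [1, 1, 2, 2, 3, 5, 6, 8]
retained = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(pages_visited, retained)
p_few = sigmoid(w * 1 + b)   # predicted retention after visiting 1 page
p_many = sigmoid(w * 8 + b)  # predicted retention after visiting 8 pages
```

The sign of w is the readable part: on this toy data it is positive, so users who visit more pages are predicted to be retained more often, which is the kind of statement that leads to an actionable insight.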



Q2: Imagine you have a table with 3 columns: user who sends the message, user who receives the message and timestamp. What's the average length of a conversation?
Table:
Sender  Receiver  ts
A       B         10
A       C         13
B       A         15
B       C         21
C       A         35
A       B         43

Again, look at the distribution and choose after how long you define the end of a conversation (say 3 days from the last message). Ask the interviewer questions about corner cases, such as if A sends two consecutive messages to B. Let's say the interviewer tells you to not count the second message in the average length. 
Then write the code in whichever language you prefer, typically R or Python (I'll let you do it).
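If you want to check your answer, here is one possible sketch in Python. It rests on several assumptions that the question leaves open: a conversation is between an unordered pair of users, its length is the number of messages, it ends when the gap between two messages is larger than max_gap (in the same units as ts), and a message from the same sender as the previous one is not counted. The function name and the max_gap value of 30 are made up.

```python
from collections import defaultdict

def average_conversation_length(messages, max_gap):
    """Average number of counted messages per conversation."""
    # Group messages by the unordered pair of users involved
    by_pair = defaultdict(list)
    for sender, receiver, ts in messages:
        by_pair[frozenset((sender, receiver))].append((ts, sender))

    lengths = []
    for msgs in by_pair.values():
        msgs.sort()  # chronological order within the pair
        count, last_ts, last_sender = 0, None, None
        for ts, sender in msgs:
            # A gap larger than max_gap closes the current conversation
            if last_ts is not None and ts - last_ts > max_gap:
                lengths.append(count)
                count, last_sender = 0, None
            # Consecutive messages from the same sender count only once
            if sender != last_sender:
                count += 1
            last_ts, last_sender = ts, sender
        lengths.append(count)

    return sum(lengths) / len(lengths) if lengths else 0.0

# The table from the question
messages = [
    ("A", "B", 10), ("A", "C", 13), ("B", "A", 15),
    ("B", "C", 21), ("C", "A", 35), ("A", "B", 43),
]

avg_length = average_conversation_length(messages, max_gap=30)
```

With max_gap=30, the pair A/B has one conversation of 3 messages, A/C one of 2 and B/C one of 1, so the average is 2.0. A smaller max_gap splits the same messages into more, shorter conversations, which is why the cut-off has to come from the data.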



Q3: We have a table which logs each user, the product they buy and the timestamp (see below). Once a user has bought 10 different products in her lifetime, she becomes a VIP user and we send her a coupon. Write a script that identifies the new VIP users every day. Also, based on the data set you have, suggest new business and/or marketing strategies.
Table:
User  Product  timestamp
A     xxx      12
A     xzx      15
A     xxx      24
B     zzz      35

The code here is pretty simple. Order the purchases by ts and count the distinct products per user; whenever that count reaches 10, save the user in a VIP user table. Each day, select only the users who became VIP for the first time.
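A minimal sketch of that script in Python, assuming the purchases fit in memory as (user, product, ts) tuples and that a "day" is passed in as a half-open timestamp window. The function name and the window arguments are hypothetical, and the threshold is a parameter only so that the tiny table from the question can be used as a test.

```python
def new_vip_users(purchases, day_start, day_end, threshold=10):
    """Users whose distinct-product count first reached `threshold`
    at a timestamp in [day_start, day_end)."""
    seen = {}       # user -> set of distinct products bought so far
    vip_since = {}  # user -> timestamp at which they became VIP
    for user, product, ts in sorted(purchases, key=lambda p: p[2]):
        products = seen.setdefault(user, set())
        products.add(product)
        if len(products) >= threshold and user not in vip_since:
            vip_since[user] = ts
    return sorted(u for u, ts in vip_since.items() if day_start <= ts < day_end)

# The table from the question
purchases = [("A", "xxx", 12), ("A", "xzx", 15), ("A", "xxx", 24), ("B", "zzz", 35)]

# With an artificially low threshold of 2, A becomes VIP at ts 15
vips_first_window = new_vip_users(purchases, 10, 20, threshold=2)
vips_second_window = new_vip_users(purchases, 20, 40, threshold=2)
```

Note that A buying xxx a second time at ts 24 does not change anything, since only distinct products count, and A is not reported again in the second window.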
The second part of the question is about collaborative filtering. We know what users buy and when, and the goal is to build better recommendations on the website (think Amazon or YouTube).
The interviewer wants you to identify relationships between products. For instance, if a user buys product xxx, does she also buy xzx? Is there any product that maximizes the probability of users buying again after it? Here, you should be able to briefly talk about the technique you would use. It doesn't need to be fancy, correlation or cosine similarity are perfectly fine; but you need to show a good understanding of how they work.
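As an illustration of the cosine similarity option, here is a small sketch that treats each product as a binary vector over users (1 if the user ever bought it). For binary vectors the cosine reduces to the number of shared buyers divided by the square root of the product of the two buyer counts. The function name is made up and the data is the table from the question.

```python
import math
from collections import defaultdict

def product_cosine_similarity(purchases):
    """Cosine similarity between every pair of products, based on
    which users bought them (repeat purchases are ignored)."""
    buyers = defaultdict(set)
    for user, product, _ts in purchases:
        buyers[product].add(user)

    products = sorted(buyers)
    similarities = {}
    for i, p in enumerate(products):
        for q in products[i + 1:]:
            shared = len(buyers[p] & buyers[q])
            norm = math.sqrt(len(buyers[p]) * len(buyers[q]))
            similarities[(p, q)] = shared / norm
    return similarities

# The table from the question
purchases = [("A", "xxx", 12), ("A", "xzx", 15), ("A", "xxx", 24), ("B", "zzz", 35)]

similarities = product_cosine_similarity(purchases)
```

Here xxx and xzx get a similarity of 1.0 because the same single user bought both, while zzz has similarity 0.0 with everything else. With so little data this says nothing useful; the point is only to show how the number is computed.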



Q4: Let's say we have a continuous variable and you wanna bin it. How would you do it?

This is also super popular. It is very common to have to bin continuous variables for many reasons (even in question 1 we had to do it, where the continuous variable time_to_first_action became the categorical binary variable retained/non_retained).
In general, binning reduces the amount of noise but you obviously lose information. The goal is to minimize the variance within each bin and maximize the variance between bins. That is, all points that end up in a given bin should be similar to each other and very different from the points that belong to the other bins.
There are two possible answers to this question. The first is to plot the histogram and bin accordingly. You will then be asked to talk a bit about the hist and cut functions in R.
The second is percentile binning (5th, 10th, 15th, etc.), and you will be asked to write a simple function to compute the percentiles.
Finally, you should provide a comparison of the two techniques: main differences and pros and cons of each.
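To see the difference on actual numbers, here is a rough Python sketch of both approaches. Equal-width bins stand in for "bin according to the histogram", and the percentile function uses the nearest-rank definition, which is one of several common ones. The function names, the sample values and the choice of 4 bins are all assumptions for the example.

```python
import math
from bisect import bisect_left

def equal_width_bins(values, n_bins):
    """Split the range [min, max] into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:  # all values identical
        return [0] * len(values)
    # The maximum value would land in bin n_bins, so clamp it to the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of the
    data at or below it (p between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def percentile_bins(values, step=25):
    """Bin values using the step-th, 2*step-th, ... percentiles as edges."""
    edges = [percentile(values, p) for p in range(step, 100, step)]
    # bisect_left puts a value equal to an edge into the lower bin
    return [bisect_left(edges, v) for v in values]

# Hypothetical skewed variable with one large outlier
values = [1, 2, 3, 4, 100]

width_based = equal_width_bins(values, n_bins=4)
percentile_based = percentile_bins(values, step=25)
```

On this skewed sample the equal-width bins put the four small values together and leave the outlier alone, while the percentile bins spread the points more evenly. That trade-off (equal ranges vs equal counts) is a good starting point for the comparison.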


© 2014 | Data Science Europe
info@datascienceeurope.com
San Francisco, CA
Photo used under Creative Commons from Alan Light