Data science is a team sport.
D. J. Patil
Team Sport
The team includes data scientists, managers, data engineers who develop and maintain the infrastructure, people who have contact with the outside, etc. The positions include:
Data engineer | Build and administer databases; build production-level algorithms and implement them on servers; have skills in infrastructure development |
Data scientist | Do the actual analysis of the day-to-day data; pull, analyze, run experiments on and visualize the data, and communicate the results; hand off the algorithms they develop to data engineers to implement at scale |
Data science manager | Make sure everybody interacts with each other and keep things moving; recruit and build the data science team; interface with upper management and same-level collaborators |
The team works as a unit: each person works on an individual project or sub-problem, and the team comes together for joint meetings and presentations. They also interact with external folks.
- When you are just an early-stage startup, the first order of business is making sure your data house is in order, focusing on the infrastructure for storing the data.
- When you are a mid-size organization, you hopefully have the basic infrastructure in place, so you can think about building your real data science team. You can bring on board people who are actually data scientists.
- Once you have machine learning algorithms built by data scientists, you might need to implement them back in your system and scale them up. That means turning them back over to data engineers.
- If you are running a big organization, you need to manage the team and keep everybody on task and coordinated.
Data Engineers
The background of data engineers is usually computer science, computer engineering, information technology, or another quantitative field. They might do things like:
- Build infrastructure: hardware, software, databases, storage and computing systems, etc.
- Manage and monitor usage and security.
- Implement production tools.
The primary thing you are looking for is “can they execute the jobs that your organization needs them to execute?” Data engineers are the people who maintain the data stack for you. A few key characteristics:
- Be able to find answers on their own.
- Know a little about data science.
- Be able to work well under pressure.
- Have good interpersonal communication skills, which are very important.
Data Scientists
The background of data scientists includes statistics or biostatistics, or a quantitative field like physics or engineering. Data scientists usually have skills like R or Python programming and interactive visualization with tools like D3.js, plus experience with at least one database, so they are able to do:
- Statistics: inference (to come up with a new hypothesis), or prediction and machine learning (to build predictive tools).
- Data analysis, such as running experiments and pulling, cleaning and analyzing the data.
- Communication of the results, by creating nice visualizations and carefully expressing what is going on and how uncertain they are.
A key component of being a good data scientist is not being intimidated by new kinds of ideas or software. Are you a go-getter? Are you able to learn new things? A few key characteristics include:
- Be willing to find answers on their own.
- Be unintimidated by new data.
- Be willing to say “I don’t know”, because often the data come to no conclusion.
Data science requires both a lot of dedication to get all the details right, and the people skills to work well with others.
Data Science Manager
A data science manager needs some background in data science or data engineering plus a management background, and needs to know what can and cannot be achieved. The data science manager is responsible for:
- Building the team (identifying and recruiting data engineers and data scientists)
- Setting goals and priorities, and identifying the problems that need to be solved
- Making sure people interact with each other and with people outside the team
- Reporting to higher management and collaborating with people at the same level
Interviewing
The structure of a data science interview is usually as follows:
- individual or group meeting
- demonstration and presentation of skills
- test or assessment of technical skills
Management Strategies
It is important to have a system in place so that new members can quickly get into the workflow. The onboarding process usually starts with an initial meeting with managers. The meeting should go over:
- overview of positions
- what the expectations are
- what projects to complete
- whom to interact with and on what time scale
- any policies that need to be established
It’s really a good idea to have a regular meeting with each individual team member where they can present anything they are working on. Team meetings are also important: they are a good time to communicate with others. You need to set them up in such a way that people feel empowered to voice criticism if they need to, without being rude or mean.
Managers are also in charge of monitoring interactions. Data science and data engineering require large amounts of uninterrupted, concentrated effort, so introducing too many meetings will slow down the work. It is also important to manage the growth of the organization and make sure there are opportunities for individuals to learn new tools. Identifying opportunities for advancement for your people is also a critical component of being a manager.
Evaluating Success
On one level, you need to talk about group success. The metrics for success can be very specific or very vague, for example:
organizational problems | vague, hard to hit, data do not always work |
internal problems | concrete, definable, easy to hit |
Another thing is individual success, which is a little harder to identify by the completion of specific projects. It is often easier to monitor day-to-day activities, such as learning new techniques.
Examining failure is also important. It is up to the manager to take responsibility and to remove some of the heat from the people who are doing the experiments, so they feel empowered to do things the right way.
The next step is to identify problems, say a lack of communication. Proposing concrete steps, and naming the people responsible for taking those steps, is a key component of examining failure.
Celebrating success is important and will keep people motivated, especially when things get frustrating.
Embedded or Dedicated teams?
Should you build a stand-alone data science team all on its own? Or embed data scientists into other teams?
Embedding means having a data scientist sit with another team (marketing, business intelligence, etc.). This is a really good way to promote collaboration, but it makes it a little difficult for that person to communicate with other data scientists.
You could also imagine setting up an independent data science group (as the organization gets larger). The keys when making this comparison are:
Communication | Data scientists need to work on concrete problems, which often come from other teams (marketing, BI, etc.) |
Support | Embedded data scientists probably won’t find the support they need. |
Empowerment | Data often don’t tell you what you want to hear. Communicating these sorts of things to the people in the external units can be very difficult. We need to empower data scientists to be confident in reporting the results that are actually there. |
It is a good idea to combine both “dedicated” and “embedded” approaches.
Interaction with Other Groups
Consulting is an opportunity for your team to collaborate with other people, where they come to you. They bring problems from outside that they want to have solved. Collaboration is when the data science team works closely with one of the other teams to build out a whole project over a longer period of time. Through teaching, data science teams can play a useful role by educating other people about what data science is and what it can be used for. Finally, you can have data scientists be creative and propose new ideas.
Through data science training, you could enable people outside the data science team to use and interact with data. This can be done in many ways: online courses, internal talks, and documents or tools that can be shared with others.
There are always some difficulties when it comes to interaction, usually because of a lack of interaction, a lack of empowerment, or a lack of understanding. There will always be some potential internal difficulties too: some are related to personalities or interactions between people, and some are related to the way data scientists and data engineers tend to work. Interpersonal conflict happens in any organization, and in this case a Code of Conduct will help a lot. It is better to state policies upfront than to invent them on the fly when dealing with diverse people who have different expertise and maybe different expectations about what’s going on.
For more on building a data science team, please refer to the wonderful course here https://www.coursera.org/learn/build-data-science-team
I am Kesler Zhu, thank you for visiting my website. Check out more course reviews at https://KZHU.ai