How to Use GitHub For Data Science Research
Using GitHub for data science research is not yet mainstream, but it is growing in popularity. From our point of view, there is a good reason for that growth.
Data science has long been a siloed field of research, but that is changing as technology develops. It is now more common to see teams of data scientists working towards shared goals than a single data scientist sifting through data alone and handing the results to a team of developers or engineers.
In this guide, we will explain how data science and GitHub fit together, and why you would use GitHub for data science at all. After we answer those fundamental questions, we will discuss how to get started using GitHub to build a data science portfolio.
Do Data Scientists Use GitHub?
GitHub for data science is a more recent development because, for the longest time, GitHub was seen as a tool almost exclusively for software developers. The primary reason for this assumption was less about the tool itself and more about the people using it.
GitHub is geared towards collaboration, a vital pillar of the software development community. The task of creating code that works towards a certain goal or result while also minimizing bugs is daunting, at best. Working together is the only way to keep pace and succeed in the long run.
Data scientists, by contrast, were seen as sitting at the other end of the spectrum. Their work was pictured as a single person crunching numbers and sifting through raw data for weeks and months on end in order to find structure in the chaos.
However, that is no longer the case. The world of data science has changed dramatically, so in the next two sections we will break down what data science is and what GitHub is. Ultimately, this will give us a better idea of how, and why, data scientists are using GitHub more and more.
What Is Data Science?
The purpose of data science is to extract value from data. It draws on a variety of disciplines, including statistics, the scientific method, artificial intelligence (AI), and data analysis. Data science is one of today's most fascinating subjects, and the field continues to grow as demand for data management, evaluation, exploration, and visualization keeps pace with the growth of AI and other technological innovation. More importantly, it is critical for many industries because it is the backbone of understanding customers and creating new products and services.
Data has been at the core of research since before computers came on the scene.
Businesses and entire industries are sitting on a gold mine of data. Data volumes have expanded as contemporary technology has facilitated the generation and storage of ever-increasing quantities of data.
However, most of this data is languishing undisturbed in databases and data lakes. Data science makes this vault of data a source of innumerable applications for businesses.
What Is GitHub?
GitHub is a hosting platform for version-controlled projects, widely known and used in data science and many other industries. It is built on Git, the program that applies version control to your code. Files for a project are saved in a repository, and the copy hosted on GitHub serves as a central remote location.
GitHub Octocat logo
Changes you make locally on your system are recorded as commits. When you push those commits to GitHub, the remote repository is updated, keeping everyone in the loop and making working remotely a breeze. If you wish to go back to an earlier version of your project, you can do so using the commit history saved in your repository.
Since project files are stored remotely, anybody with access to the repository may download it and make modifications. Repositories also let you create branches: separate lines of work where you can try out changes and make sure you aren’t breaking anything before merging them into the main code.
This means that multiple data scientists can work together on the same project and interpret and use data at a much quicker pace. For data science, GitHub has some essential functions that will come in handy time and time again.
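The commit, push, and history steps described above can be sketched on the command line. This is a minimal sketch rather than a definitive workflow: the file name, commit messages, and identity are made up for illustration, and a local bare repository stands in for a real GitHub repository so the example runs without an account.

```shell
#!/bin/sh
set -e

# Scratch directory so the sketch touches nothing else on disk.
work=$(mktemp -d)

# Assumption: a local bare repository plays the role of the GitHub remote.
git init -q --bare "$work/remote.git"

# Create the local project and name its first branch "master".
git init -q "$work/project"
cd "$work/project"
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Analyst"        # hypothetical identity
git config user.email "analyst@example.com"

# Version 1 of a (hypothetical) analysis script.
echo 'print("mean of column A")' > analysis.py
git add analysis.py
git commit -q -m "Add first analysis script"

# Version 2 of the same script.
echo 'print("median of column A")' > analysis.py
git commit -q -am "Switch to median"

# Push both commits so collaborators can see them.
git remote add origin "$work/remote.git"
git push -q -u origin master

# The saved record: one line per commit, newest first.
git log --oneline

# Go back to the earlier version of the file.
git checkout -q HEAD~1 -- analysis.py
cat analysis.py
```

With a real project, the remote would be the URL of your GitHub repository instead of a local path.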
Why Use GitHub for Data Science Research
To summarize all these moving pieces, you should be using GitHub for your data science research because it lets multiple collaborators work on the same code and analysis in parallel without risking a cascading failure in the production process.
Traditionally, GitHub wasn’t used for data science, since the process of putting models into production (where version control becomes critically important) was handed off to software development or data engineering teams.
GitHub for data science repository
However, there is an upward trend in systems that make it much easier for data scientists to write their own code and deploy models into production rather than passing data off to another team entirely. This means businesses across various industries are able to reduce the number of teams they need without risking the quality of what they put into production.
Aside from offering the features you need to keep a project organized, GitHub also makes the process smoother and more predictable. Because every change is recorded, it helps make the production process reproducible, a difficult thing in any form of data science.
Now that you understand why GitHub is worth using for your data science research, we will discuss how to put it into practice: building a portfolio in a GitHub repository, working with the master branch, and creating branches so you can experiment without putting the rest of the project at risk.
How Do I Make a GitHub Data Science Portfolio?
Before you begin using GitHub for your data science research, you will need to make sure you have all of the necessary tools. First and foremost, you will need to set up a GitHub account, if you don’t have one already. Second, you will need to install Git.
To create your first repository, simply click the “New” button and fill in the necessary information to name your repository. Be sure to check the box that initializes the repository with a README file. Then, hit the “Create repository” button.
Now, in order to work on the code locally, select “clone or download” (labeled “Code” in the current interface) to copy the repository to your machine. This gives you a local version you can work on, separate from the master branch on GitHub.
When you are ready later on, you can commit your changes and push them to the master branch. Any mistakes you make locally will not affect the master branch on GitHub until you push, so you can wait until the code is all squared away.
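On the command line, cloning and then pushing back might look like the sketch below. The repository name, file contents, and identity are hypothetical, and the setup section builds a local stand-in for a GitHub repository that was created with a README, so the example runs without an account. With a real repository, you would pass the URL shown on GitHub to “git clone” instead of a local path.

```shell
#!/bin/sh
set -e
work=$(mktemp -d)

# --- Setup (assumption): a stand-in for a GitHub repo created with a README ---
git init -q --bare "$work/remote.git"
git --git-dir="$work/remote.git" symbolic-ref HEAD refs/heads/master
git init -q "$work/seed"
cd "$work/seed"
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Analyst"        # hypothetical identity
git config user.email "analyst@example.com"
echo "# my-portfolio" > README.md
git add README.md
git commit -q -m "Initial commit"
git push -q "$work/remote.git" master

# --- The step from the text: clone the repository to work on it locally ---
git clone -q "$work/remote.git" "$work/my-portfolio"
cd "$work/my-portfolio"
git config user.name "Example Analyst"
git config user.email "analyst@example.com"

# Edit and commit locally. Nothing on the remote changes yet.
echo "Notebooks will be listed here." >> README.md
git commit -q -am "Describe the portfolio"
remote_before=$(git --git-dir="$work/remote.git" rev-parse master)

# Only the push updates the master branch on the remote.
git push -q origin master
remote_after=$(git --git-dir="$work/remote.git" rev-parse master)

echo "remote master before push: $remote_before"
echo "remote master after push:  $remote_after"
```

The two printed hashes differ, which shows that the local commit reached the remote master branch only at the push.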
To start working on a new branch, first run the “git pull” command to make sure your local copy is up to date with the master branch. Then type “git branch my-branch” to create the branch and “git checkout my-branch” to switch to it.
When you are comfortable and confident with your code, you can merge the branch back into the master branch.
Once you merge a branch, the last step, after making sure everything is still working properly, is to use the “Delete Branch” button to remove your temporary branch. This avoids confusion and duplicated code or processes.
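The branch, merge, and delete cycle can be sketched as follows. This is a minimal local sketch under stated assumptions: the file name, its contents, and the identity are invented for illustration, and no remote is configured, so the “git pull” step appears only as a comment.

```shell
#!/bin/sh
set -e
work=$(mktemp -d)

# Setup (assumption): a small repository with one commit on master.
git init -q "$work/project"
cd "$work/project"
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Analyst"        # hypothetical identity
git config user.email "analyst@example.com"
echo "load,clean" > steps.txt
git add steps.txt
git commit -q -m "Add pipeline steps"

# With a GitHub remote configured, you would run "git pull" here first.

# Create the branch, then switch to it ("git branch" alone only creates it).
git branch my-branch
git checkout -q my-branch

# Experiment on the branch.
echo "load,clean,features" > steps.txt
git commit -q -am "Add feature step"

# Back on master, the file is untouched by the experiment.
git checkout -q master
cat steps.txt    # prints: load,clean

# Merge the finished work into master, then delete the temporary branch.
git merge -q my-branch
git branch -q -d my-branch
cat steps.txt    # prints: load,clean,features
```

The “Delete Branch” button removes the copy of the branch hosted on GitHub, while “git branch -d” removes the local one. If the branch was also pushed, “git push origin --delete my-branch” removes it from the remote from the command line.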
While there is a lot more you can do with GitHub, this is a great way to start using it for data science research. This information is just enough to get you started and help you put your best foot forward.
Looking For More Information On Data Science and GitHub?
Data science is a quickly changing and growing field. As technology advances further, new trends are always popping up. Do you think GitHub has a long-lasting place in the world of data science?
If you want to start using GitHub for data science, then we would love to help you get started!
You can contact us with any questions or read other articles related to data science, GitHub, machine learning and more elsewhere on our blog.