Salesforce CodeT5 vs GitHub Copilot: A Comparative Guide to Automatic Code Generators


Writing large amounts of code can be a long and tedious task. Developers today are looking for methods and tools that aid coding and improve lead time, accuracy, and overall productivity. As a result, automatic code generation capabilities are emerging and continue to evolve in programming languages and IDEs, including those that work at compile time. Automatic code generation can be a powerful tool with potential use cases in business settings. This article covers two recently developed tools for automatic code generation: Salesforce CodeT5 and GitHub Copilot.

Salesforce CodeT5

CodeT5 by Salesforce is an open-source machine learning model that can understand and generate code in real time. It is a unified, pre-trained, identifier-aware encoder-decoder model that enables a wide range of code intelligence applications. It aims to reduce the time spent writing software and to lower computational and operating costs. Its code-specific pre-training methods support a range of downstream applications across the software development lifecycle. CodeT5 uses a unified framework borrowed from natural language processing that casts every task as text to text, with the input and output always being text strings.
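The text-to-text framing can be illustrated with a minimal sketch. The `TextToTextExample` class, the `make_example` helper, and the `summarize` and `generate` prefixes are hypothetical names chosen for illustration; they are not taken from the CodeT5 code base. The point is only that two different tasks share one shape: a source string goes in and a target string comes out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextToTextExample:
    """One example in a text-to-text setup: both sides are plain strings."""
    source: str
    target: str


def make_example(task: str, source: str, target: str) -> TextToTextExample:
    # The "task: " prefix is an illustrative convention, not CodeT5's actual format.
    return TextToTextExample(source=f"{task}: {source}", target=target)


code = "def add(a, b):\n    return a + b"
summary = "Return the sum of two numbers."

examples = [
    make_example("summarize", code, summary),  # code in, natural language out
    make_example("generate", summary, code),   # natural language in, code out
]

for ex in examples:
    print(repr(ex.source), "->", repr(ex.target))
```

Because every task reduces to the same string-to-string interface, a single model can in principle be trained on summarization and generation without task-specific output layers.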


Existing methods for pre-training on code had two major limitations, which CodeT5 addresses. First, they often rely on an encoder-only model similar to BERT or a decoder-only model like GPT, which is suboptimal for generation and comprehension tasks, respectively. Second, they simply adopt conventional NLP pre-training techniques on source code by treating it as a sequence of tokens like natural language. This largely ignores the rich structural information present in programming languages, information that is vital to fully understanding code semantics.

Architecture and operation of CodeT5

Salesforce’s CodeT5 is built on an architecture similar to that of Google’s T5 framework, but it incorporates code-specific knowledge, which gives the model a better understanding of code. It takes the code and its accompanying comments as a single input sequence.
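A minimal sketch of how a comment and a piece of code might be joined into one input sequence is shown here. The `[CLS]` and `[SEP]` delimiters and the whitespace tokenization are assumptions made for illustration; a real model would use a trained subword tokenizer and its own special tokens.

```python
from typing import List


def build_bimodal_sequence(comment: str, code: str) -> List[str]:
    """Concatenate natural-language and code tokens with delimiter tokens."""
    # Whitespace splitting stands in for a real subword tokenizer.
    nl_tokens = comment.split()
    pl_tokens = code.split()
    return ["[CLS]", *nl_tokens, "[SEP]", *pl_tokens, "[SEP]"]


sequence = build_bimodal_sequence(
    "add two numbers",
    "def add(a, b): return a + b",
)
print(sequence)
```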

Some of CodeT5’s pre-training tasks include:

  • Masked Span Prediction: Randomly masks spans of arbitrary length, and the decoder recovers the original input. This captures syntactic information from the NL-PL input and teaches the model robust cross-lingual representations.
  • Identifier Tagging: The encoder learns to distinguish whether each code token is an identifier or not.
  • Masked Identifier Prediction: Uses the same mask placeholder for all occurrences of a unique identifier. The model learns code semantics by recovering the identifiers from the obfuscated code.
  • Bimodal Dual Generation: Jointly optimizes the conversion from code to comments and vice versa. This encourages better alignment between the NL and PL counterparts.
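The first three tasks can be sketched on a toy Python snippet. This is a simplified illustration, not CodeT5's training code: it uses Python's standard `tokenize` module instead of a subword tokenizer, a fixed span instead of random sampling, and hypothetical placeholder names (`<SPAN0>`, `<MASK0>`, and so on).

```python
import io
import keyword
import tokenize
from typing import Dict, List, Tuple


def code_tokens(source: str) -> List[str]:
    """Split Python source into lexical tokens, dropping layout-only tokens."""
    skip = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
            tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT}
    readline = io.StringIO(source).readline
    return [t.string for t in tokenize.generate_tokens(readline)
            if t.type not in skip]


def is_identifier(token: str) -> bool:
    # Names chosen by the developer, as opposed to language keywords.
    return token.isidentifier() and not keyword.iskeyword(token)


def mask_span(tokens: List[str], start: int, length: int,
              sentinel: str = "<SPAN0>") -> Tuple[List[str], List[str]]:
    """Masked span prediction: hide a span; the target restores it."""
    corrupted = tokens[:start] + [sentinel] + tokens[start + length:]
    target = [sentinel] + tokens[start:start + length]
    return corrupted, target


def tag_identifiers(tokens: List[str]) -> List[int]:
    """Identifier tagging: 1 for identifier tokens, 0 for everything else."""
    return [1 if is_identifier(t) else 0 for t in tokens]


def mask_identifiers(tokens: List[str]) -> Tuple[List[str], Dict[str, str]]:
    """Masked identifier prediction: one placeholder per unique identifier."""
    mapping: Dict[str, str] = {}
    masked: List[str] = []
    for tok in tokens:
        if is_identifier(tok):
            if tok not in mapping:
                mapping[tok] = f"<MASK{len(mapping)}>"
            masked.append(mapping[tok])
        else:
            masked.append(tok)
    return masked, mapping


source = "def area(width, height):\n    return width * height\n"
tokens = code_tokens(source)

corrupted, span_target = mask_span(tokens, start=8, length=4)
tags = tag_identifiers(tokens)
masked, mapping = mask_identifiers(tokens)

print(" ".join(corrupted))
print(tags)
print(" ".join(masked))
```

Note that both occurrences of `width` receive the same placeholder, so the model has to infer the name from how the value is used rather than from a nearby copy of it.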


Features of CodeT5

Some features of CodeT5 include:

  • Text-to-Code Generation: Generates code from a natural language description.
  • Automatic Code Completion: Completes an entire function body, given the name of the target function.
  • Code Summarization: Generates a natural language summary of a function.

Risks with CodeT5

While CodeT5 is a promising tool for automatic code generation, there are ethical risks that should be considered first. The CodeT5 team says it is still working to mitigate the following risks:

  • Automation Bias: The system can sometimes produce functions that look superficially correct but are not what the developer intended. If developers adopt these incorrect suggestions, they can introduce defects, lengthen debugging time considerably, and even cause significant security issues.
  • Security Implications: Pre-trained models may encode sensitive information from the training data. Cleaning the data may not remove all of it, and the model may produce code that adversely affects the software.

GitHub Copilot

GitHub Copilot is a tool created by GitHub and OpenAI and described as an AI pair programmer. It is available as a plugin for Visual Studio Code and automatically generates code based on the contents of the current file and the current cursor location. Copilot can generate entire multiline functions and can even create documentation and tests based on the context of a code file.

It is powered by Codex, a deep neural network language model trained on public code repositories on GitHub. Language models of this kind can be fine-tuned to achieve state-of-the-art results on a wide range of NLP problems.


How does it work?

Visual Studio Code sends the comments and code typed by the developer to the Copilot service, which synthesizes and suggests an implementation. GitHub presents Copilot as a writing aid for code and claims that it understands more context than most code assistants currently available. It uses the context provided to synthesize matching code. Copilot works with a wide range of frameworks and languages, such as Python, JavaScript, TypeScript, Ruby, and Go. The developer can browse alternative suggestions, accept or reject each one, and manually edit the suggested code.
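The interaction described above can be sketched as a small model of the editor side. Everything here is hypothetical: `CompletionRequest`, `build_request`, and `SuggestionSession` are illustrative names, the split of the file into text before and after the cursor is an assumption, and the hard-coded suggestion list stands in for whatever the real service returns. It does not reflect Copilot's actual protocol.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CompletionRequest:
    """Context an editor plugin might send: text before and after the cursor."""
    prefix: str
    suffix: str


def build_request(file_text: str, cursor_offset: int) -> CompletionRequest:
    return CompletionRequest(prefix=file_text[:cursor_offset],
                             suffix=file_text[cursor_offset:])


@dataclass
class SuggestionSession:
    """Lets the developer cycle through alternatives and accept one."""
    suggestions: List[str]
    index: int = 0

    def current(self) -> str:
        return self.suggestions[self.index]

    def next(self) -> str:
        # Wrap around so browsing never runs off the end of the list.
        self.index = (self.index + 1) % len(self.suggestions)
        return self.current()

    def accept(self, request: CompletionRequest) -> str:
        # Insert the chosen suggestion at the cursor position.
        return request.prefix + self.current() + request.suffix


file_text = "# return the square of x\ndef square(x):\n    \n"
cursor = file_text.rindex("\n")  # cursor sits at the end of the indented line
request = build_request(file_text, cursor)

# Stand-in for the service response; a real plugin would receive this remotely.
session = SuggestionSession(["return x * x", "return x ** 2"])
session.next()                    # browse to the second alternative
accepted = session.accept(request)
print(accepted)
```

Rejecting a suggestion corresponds to never calling `accept`, which leaves the file text unchanged; manual editing happens on the accepted text like any other code.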


Features of Copilot

Some features of GitHub Copilot include:

  • Convert comments to code: Write a comment that describes the logic, and Copilot assembles the code.
  • Easy autofill: Copilot can help produce repetitive code patterns quickly. Given a few examples, Copilot infers the pattern and generates the rest.
  • Test aids: Copilot suggests tests that match the code implementation.
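As an illustration of the comment-to-code and test-aid workflow, the snippet here pairs a descriptive comment with the kind of function and tests an assistant might propose. It was written by hand for this example and is not actual Copilot output; `is_palindrome` and `test_is_palindrome` are hypothetical names.

```python
# Comment written by the developer to describe the desired logic:
# Return True if the text reads the same forwards and backwards,
# ignoring case and any non-alphanumeric characters.
def is_palindrome(text: str) -> bool:
    cleaned = [ch.lower() for ch in text if ch.isalnum()]
    return cleaned == cleaned[::-1]


# Tests of the kind an assistant might suggest for the function above.
def test_is_palindrome() -> None:
    assert is_palindrome("A man, a plan, a canal: Panama")
    assert is_palindrome("")
    assert not is_palindrome("Copilot")


test_is_palindrome()
```

Suggested tests tend to mirror the suggested implementation, so they are best treated as a starting point and reviewed against the intended behavior.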

Risks with Copilot

GitHub Copilot may surface unknown issues once it is put to use, which is a potential risk factor. Some of these include:

  • Bugs During Implementation: A few developers who tried Copilot complained that the code it generated contained a number of bugs at runtime, even though the model was trained on a large number of GitHub projects.
  • Unwanted Results: From time to time, GitHub Copilot may produce unwanted output, including biased, discriminatory, abusive, or offensive results.

Summary

While automatic code generators aim to automate tedious and time-consuming coding work for developers, they come with their own limitations and risk factors. These issues are still being worked on and require sustained attention. In the near future, this technology should enable engineers to be more productive by reducing manual tasks and helping them focus on more interesting aspects of the job.



Victor Dey

Victor is an aspiring Data Scientist and holds a Master of Science in Data Science & Big Data Analytics. He is a researcher, data science influencer and also a former college football player. A great connoisseur of new developments in data science and artificial intelligence, he is committed to developing the data science community.


Margie D. Carlisle