Article reprint source: AIcore
Original source: CSDN
GitHub Copilot, an automatic code generation tool built on large language models, has been embraced by countless programmers since its launch. Many of them joke that they finally have a coding assistant that spares them from overtime!
According to interim data from the "2023 AI Developer Ecosystem Survey" recently launched by CSDN, 90% of respondents said they have used code generation tools in scenarios such as production, testing, and entertainment, and 35% said they use them every day.
But while these tools improve productivity, it is worth asking whether they bring blessings or hidden trouble: is the generated code really safe enough to be used as-is?
Recently, to study the security of Copilot-generated code, six researchers from Wuhan University, Central China Normal University, Massey University in New Zealand, and RMIT University (Royal Melbourne Institute of Technology) in Australia conducted an empirical study of security weaknesses in Copilot-generated code found on GitHub. Their paper, "Security Weaknesses of Copilot Generated Code in GitHub," takes a hard look at how safe this "just describe it and get the code" style of AI programming really is.
Paper address: https://browse.arxiv.org/pdf/2310.02059.pdf
Sample: 435 production code snippets covering 6 mainstream programming languages
In the study, the researchers collected 435 code snippets generated by GitHub Copilot from public GitHub projects, covering mainstream programming languages including Python, JavaScript, Java, C++, Go, and C#.
They then used CodeQL, an open-source static analysis tool that supports multiple languages (including Java, JavaScript, C++, C#, and Python), to scan and analyze the snippets, and classified the detected security weaknesses using the Common Weakness Enumeration (CWE).
Based on this research process, the researchers posed questions to investigate and verify along three dimensions.
RQ1: Is the Copilot-generated code in GitHub projects safe?
Rationale for this question: Copilot may generate code suggestions that contain security vulnerabilities, and developers may accept these suggestions, potentially making their programs vulnerable to attacks. The answer to RQ1 helps understand how often developers encounter security vulnerabilities when using Copilot in production.
RQ2: What security vulnerabilities exist in the code snippets generated by Copilot?
Rationale for this question: Copilot-generated code may contain security weaknesses, and developers should conduct a rigorous security review before accepting it. As the GitHub Copilot documentation clearly states, "Copilot users are responsible for ensuring the security and quality of their code." The answer to RQ2 helps developers better understand the security weaknesses that may appear in Copilot-generated code, so that they can prevent and fix them more effectively.
RQ3: How many security vulnerabilities belong to the MITRE CWE Top-25?
Rationale for this question: The MITRE CWE Top-25 list contains the 25 most dangerous software weaknesses. The answer to RQ3 helps developers understand whether Copilot-generated code contains these widely recognized weakness types and how well Copilot handles the most common ones.
Step 1: Identify “real” AI-generated code on GitHub
The researchers chose GitHub as the main data source for answering the research questions because it hosts millions of public repositories, giving access to a large volume of code across a variety of programming languages and project types.
However, directly identifying Copilot-generated code on GitHub is not easy: even with the help of tools, it is hard to tell whether a piece of code came from AI or from a human engineer.
Faced with this difficulty, the six researchers identified code snippets by searching repository descriptions and in-code comments for keywords such as "by GitHub Copilot", "use GitHub Copilot", and "with GitHub Copilot", which yielded the following results:
(Figure: the number of search results per programming language under the code tag.)
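As an aside for readers who want to try this kind of keyword search themselves, here is a minimal sketch (not the authors' actual script) that queries GitHub's code search API for the phrases above; the access token and the result handling are assumptions.

```python
import requests

# Hypothetical personal access token; GitHub's code search API requires authentication.
TOKEN = "ghp_your_token_here"
HEADERS = {
    "Authorization": f"token {TOKEN}",
    "Accept": "application/vnd.github+json",
}

# Keyword phrases the study used to locate Copilot-generated snippets.
QUERIES = ['"by GitHub Copilot"', '"use GitHub Copilot"', '"with GitHub Copilot"']

def search_code(query: str, per_page: int = 30) -> list:
    """Return the first page of code-search hits for a query string."""
    resp = requests.get(
        "https://api.github.com/search/code",
        headers=HEADERS,
        params={"q": query, "per_page": per_page},
    )
    resp.raise_for_status()
    return resp.json().get("items", [])

if __name__ == "__main__":
    for q in QUERIES:
        hits = search_code(q)
        print(f"{q}: {len(hits)} hits on the first page")
```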
Next comes the filtering phase. Here, the researchers note in the paper that they mainly followed three rules:
1. For search results under the repository tag, the researchers determined which projects were fully generated by Copilot based on statements in the project description or the accompanying README files, and retained code files in the major languages supported by Copilot: Python, JavaScript, Java, C++, C#, and Go.
2. For search results under the code tag, they retained files whose comments indicate that the code was generated by Copilot.
3. The study targets code snippets used in actual projects, so code files that merely solve simple algorithm problems from the LeetCode platform were excluded.
After completing a pilot round of data annotation, the first author of the paper checked the remaining search results, yielding a total of 465 code snippets. After removing duplicates, 435 distinct code snippets remained: 249 from repository tags and 186 from code tags.
Step 2: Data Analysis
In the analysis phase, the researchers ran two kinds of static analysis tools over each code snippet (CodeQL plus a specialized tool for each language) to improve the coverage and accuracy of the results.
The researchers first analyzed the code in the dataset with CodeQL. Each CodeQL standard query pack ships several useful query suites in its codeql-suites directory, and the default suite is codeql-suites/<language>-code-scanning.qls.
For this study, they used the security-oriented <language>-security-and-quality.qls suites to scan the code snippets. These suites check many security properties and cover a large number of CWEs: for example, the Python suite provides 168 security checks, the JavaScript suite 203, and the C++ suite 163.
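For readers unfamiliar with CodeQL, the sketch below shows one way such a scan can be driven, assuming the CodeQL CLI is installed and the snippets sit in a local directory; the paths, database name, and suite reference are illustrative assumptions, not the authors' setup.

```python
import subprocess

# Hypothetical layout: a local directory holding the collected Python snippets.
SOURCE_ROOT = "copilot_snippets/python"
DATABASE = "copilot-python-db"
# The security-and-quality suite for Python; depending on the CLI version this
# may need to be referenced via the full query pack path.
SUITE = "python-security-and-quality.qls"

# Build a CodeQL database from the snippet sources.
subprocess.run(
    ["codeql", "database", "create", DATABASE,
     "--language=python", f"--source-root={SOURCE_ROOT}"],
    check=True,
)

# Run the suite and write the findings as SARIF for later CWE classification.
subprocess.run(
    ["codeql", "database", "analyze", DATABASE, SUITE,
     "--format=sarif-latest", "--output=results.sarif"],
    check=True,
)
```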
In addition, for each programming language the researchers ran another popular static security analysis tool over the files: Bandit for Python, ESLint for JavaScript, Cppcheck for C++, FindBugs for Java, Roslyn analyzers for C#, and Gosec for Go. When a scan result did not directly report the CWE ID of a security issue, the researchers manually mapped the reported security property to the corresponding CWE.
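As a hedged sketch of the language-specific step for Python (the paper does not publish its scripts), the snippet below runs Bandit over a directory, reads its JSON report, and prints a CWE label per finding; the directory name is an assumption, and the issue_cwe field only exists in newer Bandit releases, so older output would need the kind of manual mapping the authors describe.

```python
import json
import subprocess

# Hypothetical directory of collected Copilot-generated Python snippets.
TARGET_DIR = "copilot_snippets/python"

# Run Bandit recursively with JSON output. Bandit exits non-zero when it finds
# issues, so we deliberately avoid check=True here.
proc = subprocess.run(
    ["bandit", "-r", TARGET_DIR, "-f", "json"],
    capture_output=True, text=True,
)
report = json.loads(proc.stdout)

for issue in report.get("results", []):
    cwe_id = issue.get("issue_cwe", {}).get("id")
    cwe = f"CWE-{cwe_id}" if cwe_id else "CWE not reported (map manually)"
    print(f'{issue["filename"]}:{issue["line_number"]} '
          f'{issue["test_id"]} [{issue["issue_severity"]}] -> {cwe}')
```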
35.8% of code snippets contain security weaknesses, with C++ showing the highest proportion, spanning 42 CWE types
After the analysis, the researchers drew conclusions for each of the three research questions raised above.
RQ1: Is the Copilot-generated code in GitHub projects safe?
Of the 435 code snippets generated by Copilot, 35.8% contained security weaknesses, and security issues appeared regardless of the programming language involved.
Python and JavaScript are the languages developers most commonly use with Copilot. Of the 251 Python code snippets collected, 39.4% carry security risks; of the 79 JavaScript snippets, 27.8% do. Among all languages, C++ snippets have the highest proportion of security weaknesses at 46.1%, and Go is also relatively high at 45.0%. By contrast, C# and Java have lower proportions of problematic files, at 25% and 23.2%, respectively.
RQ2: What security vulnerabilities exist in the code snippets generated by Copilot?
To answer RQ2, the researchers processed the scan results from RQ1 and removed duplicate security issues detected within the same code snippet. In total, 600 security weaknesses were identified across the 435 code snippets.
The detected weaknesses span 42 different CWEs, with CWE-78 (OS Command Injection), CWE-330 (Use of Insufficiently Random Values), and CWE-703 (Improper Check or Handling of Exceptional Conditions) appearing most frequently.
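To make the two most frequent categories concrete, the hypothetical Python fragment below contrasts patterns that scanners typically flag as CWE-78 and CWE-330 with safer alternatives; it is an illustration written for this article, not code taken from the studied repositories.

```python
import os
import random
import secrets
import subprocess

def ping_host_unsafe(host: str) -> None:
    # CWE-78: user-controlled input interpolated into a shell command string.
    os.system(f"ping -c 1 {host}")

def ping_host_safer(host: str) -> None:
    # Safer: argument list with no shell, so the host value cannot inject extra commands.
    subprocess.run(["ping", "-c", "1", host], check=False)

def session_token_unsafe() -> str:
    # CWE-330: the random module is not suitable for security-sensitive values.
    return "".join(random.choice("0123456789abcdef") for _ in range(32))

def session_token_safer() -> str:
    # Safer: the secrets module is designed for tokens and other secrets.
    return secrets.token_hex(16)
```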
RQ3: How many security vulnerabilities belong to the MITRE CWE Top-25?
Of the 42 CWEs identified, 11 appear in the 2022 edition of the widely recognized MITRE CWE Top-25.
Final Thoughts
In response, some netizens joked that their own bug-fixing skills might even be better than GitHub Copilot's.
Of course, the study is not meant to discourage developers from using AI-assisted coding tools in their daily work. Rather, it shows that while Copilot can improve efficiency in real-world development, everyone still needs to perform their own security assessments.
In particular, run appropriate security checks when accepting Copilot's code suggestions, so as to avoid potential risks and reduce losses.
For more details, see the paper: https://browse.arxiv.org/pdf/2310.02059.pdf
