For many companies, the question is no longer whether to use open-source tools, but rather how to implement them with the appropriate governance and controls. Have security concerns been accounted for? How does one effectively institute controls over bad code? Are there legal implications for using open-source software?
Open Source Security Risks
Open-source software is not inherently more or less prone to malicious code injections than proprietary software. It is true that anyone can push a code enhancement for a new version, and it may be possible for the senior contributors to miss intentional malware. However, in these circumstances, open source has an advantage over proprietary, coined in 1999 by Eric S. Raymond as Linus’s Law: “Given a large enough beta-tester and co-developer base, almost every problem will be characterized quickly and the fix obvious to someone.” It is unlikely that a deliberate security error goes unnoticed by the many pairs of eyes on each release. However, security issues persist.
Open-Source Security – An Example
Debian, a Unix-like computer operating system, was one of the first to be based on the Linux kernel. Like many systems, it utilizes OpenSSL, a software library that provides an open-source implementation of the Secure Sockets Layer (SSL) protocol, commonly used by applications that require secure communications over a network.
In 2006, a snippet of code was removed from Debian’s OpenSSL package after one of the contributors found that it caused runtime warnings generated by other packages. After the removal, the pseudorandom number generator (PRNG) generated SSL keys using only the process ID (in Linux, a number up to 32,768) to the exclusion of all other random data. Since a relatively small number of values was used, the keys created over a period of almost two years were too predictable to be used securely. Users became aware of the issue 20 months after the bug was introduced, leading to costly security resolutions for companies and individuals who relied on Debian’s OpenSSL implementation.[1]
OpenSSL was again the subject of negative attention when a bug dubbed ‘Heartbleed’ was introduced to the code in 2012 and disclosed to the public in 2014. A fixed version of OpenSSL was released on the same day the issue was announced. More than a month after the release, however, 1.5% of the 800,000 most popular affected websites were still vulnerable to the security bug. [2]
The good news is that such vulnerabilities are documented in the Common Vulnerabilities and Exposures (CVE) system, and they are not so common. For Python 2.7, the popular version released in 2010, 15 vulnerabilities were recorded from 2010 to 2016, only one of which is considered ‘High’ severity, with a CVSS score of 7.5. jQuery, a JavaScript library that simplifies some components of web application development and the most common open-source component identified in the latest Open Source Security and Risk Analysis (OSSRA) report, only has four known vulnerabilities from 2007 to 2017, none of which rank higher than a ‘Medium’. The CVE is just one tool available for improving the security profile of software applications, but technologists must remain vigilant and abreast of known issues. Corporate IT governance frameworks should be continuously updated to keep up with the changing structure of the underlying technology itself.
Bad Code
Serious security vulnerabilities may not be a daily occurrence, but bad code can affect software at any time. pandas, a popular open-source software library used in Python implementations for data manipulation and analysis, was first released in 2009. Since then, its contributors have identified over 10,000 issues, 1,933 of which are currently considered unresolved.[3] A company that relies on accurate output from a codebase that uses pandas needs to be vigilant not only in testing the code written by its in-house developers, but also in verifying that all outstanding known pandas issues are covered by workarounds and the rest of the functionality is sound. Developers and testers who are not intimately familiar with the pandas source code must devise creative testing tools to ensure complete integrity of applications that rely on it.
Bad Code – An Example
The Comma-Separated Values (CSV) file is one of many data formats that can be loaded for data manipulation and analysis by pandas, in this case using the built-in read_csv function. read_csv has a number of associated helper attributes intended to simplify the data import, one of which is parse_dates, which, as the name implies, tells pandas to automatically parse dates in the data using a recognition algorithm to determine the format in each date-populated column.
However, if a row of data contains a blank value where a date is expected, pandas may populate that field with today’s date — a bug first formalized in version 0.9 in 2012 [4] (closed three days after it was opened) and again in 2014.[5] The issue was not closed until the end of 2016, when one of the contributors noted that the tests passed for version 0.19, stating that he was “not sure when this was fixed, but it doesn’t seem like it occurred recently. [6]
In the meantime, pandas versions prior to 0.19 may have resulted in incorrect date-related parameters if blank fields were fed to the system. For example, a mortgage-backed security may have had an incorrect calculated weighted average loan age if some of its loans had blank first payment dates, causing these rows to have a loan age of zero.
In addition to implementing security testing, IT controls must include a clear framework for testing both in-house and open-source components of all applications, especially high-impact programs.
Open-Source Licensing
Finally, it is important to be aware of open-source licensing constraints and to maintain active licensing governance activities to avoid legal issues in the future. Similar to the copyright concept, some open source creators have adopted the concept of ‘copyleft’ to ensure that “anyone who redistributes the software, with or without changes, must pass along the freedom to further copy and change it. [7] This means that, legally, for any software that contains a copylefted open source component, whether it comprises 99% or 0.1% of the application code, the entire source code must be distributed with the software or be made available upon request. This is not an issue when the software is distributed internally among corporate users, but it can become more problematic when the company intends to sell or otherwise provide the software without revealing the internally developed codebase. Not all open-source software is copylefted – in fact, many popular licenses are highly permissive with very few restrictions. Below is a summary of the four most popular open-source licenses. [8]
Of the four, only the GNU General Public License (GPL, all versions) requires the creators to disclose the source code. Between 20% and 25% of all open-source software is covered by the GNU GPL.
OSSRA found that 75% of applications contained at least some components under the GPL family of licenses, and that only 45% of those applications complied with the GPL copyleft obligations. Overall, the Financial Services and FinTech industries maintained 89% of all applications with at least one licensing conflict.
Most open-source software, even that which is licensed under the GNU GPL, can be used commercially. For example, a company can use and internally distribute a financial model written in R, an open-source programming language licensed under the GNU GPL 2.0. However, important legal consequences must be considered if the developed code will be later distributed outside of the company as a proprietary application. If the organization were to sell the R-based model, the entire source code would have to be made available to the paying user, who would also be free to distribute the code, for free or at a price. Alternatively, a model implemented in Python, which is licensed under a Berkeley Software Distribution (BSD)-like agreement, could be distributed without exposing the source code.
Open-Source Governance and Controls
Governance risks are specific to how open-source tools are integrated into existing operations. These risks can stem from a lack of formal training, lack of service and support, violations of third-party intellectual property rights, or instability and incompatibility with existing operating environments. Successful users of open-source code and tools devise effective means of identifying and measuring these risks. They ensure that these risks are included in process risk assessments to facilitate identification and mitigation of potential control weaknesses. Security vulnerabilities, code issues, and software licensing should not deter developers from using the plethora of useful open-source tools. Open-source issues and bugs are viewed and tested by thousands of capable developers, increasing the likelihood of a speedy resolution. In addition, a company’s own development team has full access to the source code, making it possible to fix issues without relying on anyone else. As with any application, effective governance and controls are essential to a successful open-source application. These ensure that software is used securely and appropriately and that a comprehensive testing framework is applied to minimize inaccuracies. The world of open source is changing constantly –we all just need to keep up.
[1] https://www.theregister.co.uk/2008/05/21/massive_debian_openssl_hangover/
[2] http://www.theregister.co.uk/2014/05/20/heartbleed_still_prevalent/
[3] https://github.com/pandas-dev/pandas/issues/
[4] https://github.com/pandas-dev/pandas/issues/2263
[5] https://github.com/pandas-dev/pandas/issues/6428
[6] https://github.com/pandas-dev/pandas/pull/14862