Introducing SCANOSS Integration in Theia: Transparent License Compliance for AI-Generated Code

March 20, 2025 | 5 min Read

We are excited to introduce a powerful feature available in the AI-powered Theia IDE and the underlying Theia AI framework: SCANOSS integration. SCANOSS scans AI-generated (and of course any other) code against a database of known code. It identifies any matches, and provides developers with detailed information about matched sources and associated licenses, enabling informed decisions on code usage and compliance. And the best news: you can use this feature for free!

The SCANOSS integration in the AI-powered Theia IDE shows a match in code generated by the AI agent Theia Coder.

If you’re unfamiliar with Theia AI or the AI-powered Theia IDE, check out the Theia AI introduction and the AI Theia IDE overview, download the AI-powered Theia IDE and check out the Theia Coder documentation.

Why scan AI-generated Code for matches with SCANOSS?

With AI-assisted coding becoming increasingly prevalent, developers often incorporate AI-generated snippets without clear visibility into their origins or licensing implications. Licensing compliance is critical, as violations can lead to significant legal and operational risks—especially considering that most leading large language models (LLMs) have unknown training datasets.

A key challenge arises from the nature of AI code generation: if an LLM was trained on publicly available code, it could generate output that is identical or nearly identical to the original code. Depending on the license of the original code and the license of the project in which the generated code is used, this could result in a licensing violation. For instance, an LLM might generate a snippet that is an exact match to GPL-licensed code. If a developer unknowingly integrates this snippet into a proprietary project without adhering to GPL terms, this could trigger compliance issues. Even models with known training data, such as StarCoder, require caution regarding license compliance.

SCANOSS mitigates this challenge by analyzing AI-generated code snippets against a comprehensive and continuously updated database of open-source software. It identifies matches and clearly provides:

  • The source of the matched code
  • The licensing terms associated with the original source
  • A match percentage indicating the similarity between the generated snippet and existing open-source code
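
To make the three pieces of information above concrete, here is a minimal TypeScript sketch of how a tool might represent a match and decide what a developer should look at next. The interface, its field names, the `nextStep` and `describe` helpers, and the example URL are all hypothetical and chosen for illustration only; they are not the actual SCANOSS or Theia API.

```typescript
// Hypothetical shape of one match. Field names are assumptions for this
// sketch and are not taken from the SCANOSS or Theia APIs.
interface SnippetMatch {
    sourceUrl: string;        // where the matched code was found
    licenses: string[];       // license identifiers reported for the source; may be empty
    matchPercentage: number;  // similarity between snippet and source, 0..100
}

type NextStep = 'no-match-found' | 'review-license-terms' | 'investigate-source';

// Suggests what a developer might do next. This only routes attention;
// it does not judge whether a license is compatible with a project.
function nextStep(match?: SnippetMatch): NextStep {
    if (!match) {
        return 'no-match-found';
    }
    if (match.licenses.length === 0) {
        // No license reported: look at the original repository first.
        return 'investigate-source';
    }
    return 'review-license-terms';
}

// Builds a one-line, human-readable summary of a match.
function describe(match: SnippetMatch): string {
    const license = match.licenses.length > 0
        ? match.licenses.join(', ')
        : 'unknown license';
    return `${match.matchPercentage}% match with ${match.sourceUrl} (${license})`;
}

// Example: a low-percentage match without license information.
const lowMatch: SnippetMatch = {
    sourceUrl: 'https://example.com/lecture-repo', // placeholder URL
    licenses: [],
    matchPercentage: 5
};

console.log(describe(lowMatch));
console.log(nextStep(lowMatch));
```

Note that the sketch deliberately surfaces every match, including low-percentage ones, rather than filtering by a threshold, so that the decision stays with the developer.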

In the example video below, we force such a match by asking the advanced coding assistant “Theia Coder” to add some code to a specific file. The code we provide is known and already published under the EPL. As you can see, the SCANOSS integration seamlessly scans the generated code, shows a match and provides the details. Please note that for this example, we deliberately forced the underlying LLM to produce a duplicate of existing code.

However, the similarity matching of SCANOSS is also built to detect generated code that closely resembles existing code. In the following video, we show a more realistic second example in which we ask Theia Coder to generate a calculator as a node-based application. SCANOSS identifies a 5% match with existing code that has an unknown license and apparently originates from a lecture. Because the original repository is provided, a developer can now investigate the match in more detail and check whether a license is specified and whether the generated code actually is a concerning duplicate.

👉 Try it yourself: Download the AI-powered Theia IDE and check the SCANOSS documentation.

Why Transparency matters

Scanning generated code is not a unique feature; some proprietary AI-powered code assistants provide a similar capability. However, once you activate the option to scan generated code, proprietary solutions usually withhold code snippets silently when they detect a license issue. This is particularly absurd in scenarios like open-source projects, where newly generated code might naturally resemble existing project code.

In the AI-powered Theia IDE, our open and transparent approach provides complete visibility and control to users. If a match is detected, we openly display this to the developer, providing detailed information about the matched code and its associated license. This transparency enables developers to make informed decisions about how to handle potential licensing issues, whether by adding attribution, adjusting the usage, or reconsidering the integration altogether. We believe transparency should be the default, not the exception. This is precisely the openness that defines Theia’s approach to AI integration, putting control back into the developers’ hands.

Use it - for free!

For end-users of the AI-powered Theia IDE, this feature helps ensure that generated code can be quickly evaluated before it is integrated, reducing uncertainty around license compliance. Of course, neither Theia nor SCANOSS can guarantee that no licensing issues exist, even if no matches are detected. Still, this tool represents a significant step forward in minimizing compliance risks.

Hosted by the Software Transparency Foundation, the SCANOSS service integrated in Theia is open source and free to use, with rate limits applying only for unusually high usage. Learn more about configuring and using this feature in the documentation.

Tool builders leveraging the Theia AI framework to build their own tailored AI-native tools or IDEs can use the same SCANOSS integration out of the box. Adopting SCANOSS enables their tools to offer proactive compliance checks seamlessly, enhancing user confidence and reducing legal risks. If you want to deploy SCANOSS in a professional environment, be aware that services and SLA-level subscriptions are available for SCANOSS; these also remove the (very high) rate limits of the free version. See the product page for details. SCANOSS also provides software intelligence about security vulnerabilities, encryption and geographical provenance.

Explore the Theia AI SCANOSS documentation for more details and start using the SCANOSS integration in the AI-powered Theia IDE or your own tool today!

If you are interested in building custom AI-powered tools, EclipseSource provides consulting and implementation services backed by our extensive experience with successful AI tool projects. We also specialize in web- and cloud-based tools and support for popular platforms like Eclipse Theia and VS Code.

👉 Get in touch with us to learn more about how we can help you build custom AI-powered tools.

Jonas, Maximilian & Philip

Jonas Helming, Maximilian Koegel and Philip Langer co-lead EclipseSource, specializing in consulting and engineering innovative, customized tools and IDEs, with a strong …