AMÁLIA and the future of European Portuguese LLMs

TL;DR

Portugal announced the development of AMÁLIA, a large language model focused on European Portuguese, backed by a €5.5 million government investment. The project aims to create a high-quality, open-source NLP resource for Portugal. Key details about data, benchmarks, and open access are still emerging.

The Portuguese government announced in December 2024 a €5.5 million investment in AMÁLIA, a large language model (LLM) designed specifically for European Portuguese, aiming to bolster NLP tools for the language and promote open-source development.

AMÁLIA is a collaborative effort involving leading Portuguese universities and research labs, including NOVA, IST, IT, and FCT. It continues the EuroLLM pre-training, with changes to the architecture and a training-data mix focused on European Portuguese. Training used 107 billion tokens, of which approximately 5.8 billion (about 5.5% of the total) came from Arquivo.pt; that share was higher during supervised fine-tuning.
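As a quick sanity check on that proportion, the minimal Python sketch below recomputes the Arquivo.pt share from the two token counts given in this article. The variable and function names are illustrative, and the inputs are the rounded figures reported here, not exact values from the project.

```python
# Rounded token counts as reported in this article (assumed, not exact project figures).
TOTAL_TRAINING_TOKENS = 107e9   # total tokens used in training
ARQUIVO_PT_TOKENS = 5.8e9       # tokens sourced from Arquivo.pt


def share_percent(part: float, total: float) -> float:
    """Return `part` as a percentage of `total`."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * part / total


arquivo_share = share_percent(ARQUIVO_PT_TOKENS, TOTAL_TRAINING_TOKENS)
print(f"Arquivo.pt share of training tokens: {arquivo_share:.1f}%")
```

With these rounded inputs the result comes out slightly above 5.4%, in line with the approximate figure cited; the small gap probably reflects rounding in the reported counts.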

While the project emphasizes openness, including sharing code, data, and training logs, it has not yet publicly released model weights or the full datasets, which has raised questions about how open it really is. The team created four benchmarks specific to European Portuguese, including ALBA, to evaluate the model. Reported results show AMÁLIA surpassing some state-of-the-art models such as Qwen 3-8B on most benchmarks but still lagging on ALBA, which indicates room for improvement.

Why It Matters

This development is significant because it represents Portugal’s first large-scale effort to develop a dedicated NLP model for European Portuguese, a language with limited NLP resources compared to global languages like English or Spanish. The investment highlights a national priority to improve language-specific AI tools, which could impact education, government, and industry sectors, and set a precedent for smaller language communities to develop tailored NLP solutions.

Background

Prior to AMÁLIA, most NLP models for Portuguese were trained on either mixed or Brazilian Portuguese data, with limited focus on European Portuguese. The project follows recent efforts by other countries, such as Italy’s Minerva, to develop language-specific models. The initiative comes amid increasing global interest in multilingual and language-specific LLMs, but Portuguese remains underrepresented in large AI models, partly due to data scarcity. The project’s focus on open-source principles aligns with broader movements to democratize AI access, though actual openness remains limited at this stage.

“AMÁLIA aims to treat European Portuguese as a first-class citizen in NLP, with dedicated data and benchmarks.”

— Research team member

“This investment underscores Portugal’s commitment to advancing AI and digital sovereignty for our language.”

— Portuguese government spokesperson

What Remains Unclear

It remains unclear when the full model weights and datasets will be publicly released, and how much European Portuguese data is effectively incorporated into the training. The actual performance of AMÁLIA on real-world tasks beyond benchmarks is still to be demonstrated, and the impact of limited data on its capabilities is uncertain.

What’s Next

The next steps include a potential release of model weights and datasets, further benchmarking, and integration into Portuguese NLP applications. Monitoring the project’s progress and community feedback will be essential to assess its real-world impact and openness.

Key Questions

Will the model weights for AMÁLIA be released publicly?

Neither a release nor a release date for the weights has been confirmed. The team has emphasized open principles, so a future release is possible.

How much European Portuguese data was used in training AMÁLIA?

Approximately 5.8 billion tokens from Arquivo.pt were used, constituting about 5.5% of the total training tokens. The exact amount of European Portuguese data overall remains unclear.

How does AMÁLIA compare to other Portuguese NLP models?

AMÁLIA outperforms some models like Qwen 3-8B on most benchmarks but still lags on specific tests like ALBA, indicating potential for further improvement with more data or training.

What are the main challenges faced in developing AMÁLIA?

Data scarcity for European Portuguese and limited open access to training resources are key challenges, along with ensuring the model accurately captures Portugal-specific knowledge.
