AMÁLIA and the future of European Portuguese LLMs

TL;DR

Portugal announced the development of AMÁLIA, a large language model focused on European Portuguese, backed by a €5.5 million government investment. The project aims to create a high-quality, open-source NLP resource for Portugal. Key details about data, benchmarks, and open access are still emerging.

In December 2024, the Portuguese government announced a €5.5 million investment in AMÁLIA, a large language model (LLM) designed specifically for European Portuguese. The aim is to strengthen NLP tools for the language and promote open-source development.

AMÁLIA is a collaborative effort involving top Portuguese universities and research labs, including NOVA, IST, IT, and FCT. It continues the pre-training of EuroLLM, with modifications to the architecture and a training-data mix focused on European Portuguese. Training used 107 billion tokens, of which approximately 5.8 billion came from Arquivo.pt, representing about 5.5% of the total, with a higher share used during supervised fine-tuning.
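
As a quick sanity check on the reported data share, the arithmetic can be sketched as follows. This is a minimal illustration that uses only the two rounded token counts quoted in this article; the constant and function names are hypothetical and are not taken from the AMÁLIA codebase.

```python
# Token counts as quoted in the article (rounded figures, not exact counts).
TOTAL_TOKENS = 107e9      # total training tokens
ARQUIVO_TOKENS = 5.8e9    # tokens sourced from Arquivo.pt


def share_percent(part: float, total: float) -> float:
    """Return `part` as a percentage of `total`."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * part / total


arquivo_share = share_percent(ARQUIVO_TOKENS, TOTAL_TOKENS)
print(f"Arquivo.pt share: {arquivo_share:.1f}%")
```

With these rounded inputs the share comes out at roughly 5.4%, slightly below the "about 5.5%" quoted above. The gap is presumably due to rounding in the published token counts.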

The project emphasizes openness, including sharing code, data, and training logs, but the team has not yet publicly released the model weights or the full datasets, which has raised questions about how open it really is. The team created four benchmarks specific to European Portuguese, including ALBA, to evaluate the model. Results show AMÁLIA surpassing some state-of-the-art models, such as Qwen 3-8B, on most benchmarks, but it still lags on ALBA, indicating room for improvement.

Why It Matters

This development is significant because it represents Portugal’s first large-scale effort to develop a dedicated NLP model for European Portuguese, a language with far fewer NLP resources than global languages like English or Spanish. The investment signals a national priority to improve language-specific AI tools, which could affect education, government, and industry, and could set a precedent for smaller language communities to develop tailored NLP solutions.

Background

Prior to AMÁLIA, most NLP models for Portuguese were trained on either mixed or Brazilian Portuguese data, with limited focus on European Portuguese. The project follows recent efforts by other countries, such as Italy’s Minerva, to develop language-specific models. The initiative comes amid growing global interest in multilingual and language-specific LLMs, yet Portuguese remains underrepresented in large AI models, partly because of data scarcity. The project’s open-source focus aligns with broader movements to democratize AI access, though actual openness remains limited at this stage.

“AMÁLIA aims to treat European Portuguese as a first-class citizen in NLP, with dedicated data and benchmarks.”

— Research team member

“This investment underscores Portugal’s commitment to advancing AI and digital sovereignty for our language.”

— Portuguese government spokesperson

What Remains Unclear

It remains unclear when the full model weights and datasets will be publicly released, and how much European Portuguese data was actually incorporated into the training. AMÁLIA’s performance on real-world tasks beyond benchmarks has yet to be demonstrated, and the effect of limited data on its capabilities is uncertain.

What’s Next

The next steps include potential release of model weights and datasets, further benchmarking, and integration into Portuguese NLP applications. Monitoring the project’s progress and community feedback will be essential to assess its real-world impact and openness.

Key Questions

Will the model weights for AMÁLIA be released publicly?

It is not yet confirmed when the weights will be available, but the team has emphasized open principles, so a release is possible in the future.

How much European Portuguese data was used in training AMÁLIA?

Approximately 5.8 billion tokens from Arquivo.pt were used, constituting about 5.5% of the total training tokens. The exact amount of European Portuguese data overall remains unclear.

How does AMÁLIA compare to other Portuguese NLP models?

AMÁLIA outperforms some models like Qwen 3-8B on most benchmarks but still lags on specific tests like ALBA, indicating potential for further improvement with more data or training.

What are the main challenges faced in developing AMÁLIA?

Data scarcity for European Portuguese and limited open access to training resources are key challenges, along with ensuring the model accurately captures Portugal-specific knowledge.
