AMÁLIA and the future of European Portuguese LLMs

TL;DR

Portugal announced the development of AMÁLIA, a large language model focused on European Portuguese, backed by a €5.5 million government investment. The project aims to create a high-quality, open-source NLP resource for Portugal. Key details about data, benchmarks, and open access are still emerging.

In December 2024, the Portuguese government announced a €5.5 million investment in AMÁLIA, a large language model (LLM) designed specifically for European Portuguese. The aim is to strengthen NLP tools for the language and promote open-source development.

AMÁLIA is a collaborative effort involving top Portuguese universities and research labs, including NOVA, IST, IT, and FCT. It continues the pre-training of EuroLLM, with modifications to the architecture and a training-data mix focused on European Portuguese. Training used 107 billion tokens, of which approximately 5.8 billion (about 5.5% of the total) came from Arquivo.pt; the percentage was higher during supervised fine-tuning.
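As a quick sanity check on the figures above, the Arquivo.pt share can be recomputed from the two token counts quoted in the article. This is a minimal sketch: the constant and function names are illustrative, and the inputs are the rounded values reported here, not exact counts from the project.

```python
# Figures as quoted in the article (rounded); names are illustrative only.
TOTAL_TRAINING_TOKENS = 107e9  # 107 billion training tokens
ARQUIVO_PT_TOKENS = 5.8e9      # approximately 5.8 billion tokens from Arquivo.pt


def token_share(subset_tokens: float, total_tokens: float) -> float:
    """Return the subset's share of the total as a fraction between 0 and 1."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return subset_tokens / total_tokens


share = token_share(ARQUIVO_PT_TOKENS, TOTAL_TRAINING_TOKENS)
print(f"Arquivo.pt share of training tokens: {share:.1%}")
```

With these rounded inputs the quotient comes out near 5.4%, which is in line with the "about 5.5%" reported, given that both token counts are approximations.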

While the project emphasizes openness, including sharing code, data, and training logs, it has not yet publicly released the model weights or the full datasets, which has raised questions about how open it really is. The team created four benchmarks specific to European Portuguese, including ALBA, to evaluate the model. Results show that AMÁLIA surpasses some state-of-the-art models, such as Qwen 3-8B, on most benchmarks but still lags on ALBA, indicating room for improvement.

Why It Matters

This development is significant because it represents Portugal’s first large-scale effort to develop a dedicated NLP model for European Portuguese, a language with limited NLP resources compared to global languages like English or Spanish. The investment highlights a national priority to improve language-specific AI tools, which could impact education, government, and industry sectors, and set a precedent for smaller language communities to develop tailored NLP solutions.

Background

Prior to AMÁLIA, most NLP models for Portuguese were trained on either mixed or Brazilian Portuguese data, with limited focus on European Portuguese. The project follows recent efforts by other countries, such as Italy’s Minerva, to develop language-specific models. The initiative comes amid increasing global interest in multilingual and language-specific LLMs, but Portuguese remains underrepresented in large AI models, partly due to data scarcity. The project’s focus on open-source principles aligns with broader movements to democratize AI access, though actual openness remains limited at this stage.

“AMÁLIA aims to treat European Portuguese as a first-class citizen in NLP, with dedicated data and benchmarks.”

— Research team member

“This investment underscores Portugal’s commitment to advancing AI and digital sovereignty for our language.”

— Portuguese government spokesperson

What Remains Unclear

It remains unclear when the full model weights and datasets will be publicly released, and how much European Portuguese data is effectively incorporated into the training. The actual performance of AMÁLIA on real-world tasks beyond benchmarks is still to be demonstrated, and the impact of limited data on its capabilities is uncertain.

What’s Next

The next steps include potential release of model weights and datasets, further benchmarking, and integration into Portuguese NLP applications. Monitoring the project’s progress and community feedback will be essential to assess its real-world impact and openness.

Key Questions

Will the model weights for AMÁLIA be released publicly?

It is not yet confirmed when the weights will be available, but the team has emphasized open principles, so a release is possible in the future.

How much European Portuguese data was used in training AMÁLIA?

Approximately 5.8 billion tokens from Arquivo.pt were used, constituting about 5.5% of the total training tokens. The exact amount of European Portuguese data overall remains unclear.

How does AMÁLIA compare to other Portuguese NLP models?

AMÁLIA outperforms some models like Qwen 3-8B on most benchmarks but still lags on specific tests like ALBA, indicating potential for further improvement with more data or training.

What are the main challenges faced in developing AMÁLIA?

Data scarcity for European Portuguese and limited open access to training resources are key challenges, along with ensuring the model accurately captures Portugal-specific knowledge.
