Collaborative Development of Regional LLMs as Google Enters the Fray - Insights From ZDNet
Google is joining collaborative efforts to build large language models (LLMs) that better cater to Southeast Asia’s population and cultural mix.
Its research arm will work with AI Singapore to enhance datasets used to train, fine-tune, and assess AI models in languages specific to the region. Called Project Southeast Asian Languages in One Network Data (SEALD), the initiative aims to “improve cultural context awareness” in LLMs built for the region, AI Singapore said in a statement Monday.
The government agency added that the collaboration will focus first on Indonesian, Thai, Tamil, Filipino, and Burmese, with the two partners jointly developing translocalization and translation models. They will also develop tools to help scale translocalization capabilities, along with best practices for tuning datasets. Pre-training guides will be published for Southeast Asian languages.
All datasets and output from Project SEALD will be released as open source, AI Singapore added.
The initiative will further support training efforts for models under SEA-LION (Southeast Asian Languages in One Network), which the Singapore government agency launched last year.
Consisting of open-source LLMs pre-trained for the region’s societal nuances, the current iteration of SEA-LION runs on two base models: a three-billion-parameter model and a seven-billion-parameter model. Its training data comprises 981 billion language tokens. AI Singapore defines these tokens as fragments of words created by breaking down text during tokenization. These fragments include 623 billion English tokens, 128 billion Southeast Asian-language tokens, and 91 billion Chinese tokens.
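To put those token counts in proportion, the short Python sketch below works out each language group's share of the 981 billion total. The figures are the ones reported above; the variable and function names are illustrative only and are not drawn from AI Singapore's materials. The three listed groups do not add up to the full total, and the remainder is not itemized in this article.

```python
# Token counts in billions, as reported in the article.
TOTAL_TOKENS_B = 981
reported = {
    "English": 623,
    "Southeast Asian languages": 128,
    "Chinese": 91,
}


def token_shares(counts, total):
    """Return each group's share of the total as a percentage (one decimal)."""
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


shares = token_shares(reported, TOTAL_TOKENS_B)
itemized = sum(reported.values())       # tokens covered by the three listed groups
remainder = TOTAL_TOKENS_B - itemized   # tokens the article does not break down

for name, pct in shares.items():
    print(f"{name}: {pct}%")
print(f"Not itemized in the article: {remainder}B tokens")
```

On these numbers, English accounts for roughly 63.5% of the training tokens, Southeast Asian languages for about 13%, and Chinese for about 9.3%, with 139 billion tokens left unattributed.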
Project SEALD is currently working on a use case to improve communications with migrant workers in Singapore, who may converse more fluently in various regional languages than in English. Data collection efforts will reflect unique linguistic traits within this community and provide the foundation to improve engagement between the Singapore government and employers.
Datasets and output from Project SEALD will be integrated with generative AI applications developed by Google Cloud and the Singapore government, under the latter’s AI Trailblazers scheme, to support community outreach.
The Project SEALD partners will also work with industry, academia, and the public sector across functions such as data collection and quality checks. These efforts will include collaboration with academic institutions in different Southeast Asian countries to establish methodologies for evaluating and benchmarking generative AI applications across the region.
AI Singapore also plans to make SEA-LION LLMs available on Google Cloud’s Model Garden on Vertex AI, providing access to pre-verified AI models. The regional LLMs will be added to Hugging Face, an open-source repository for AI tools and pre-trained models focused mostly on natural language processing capabilities.
AI Singapore on Monday also announced that it had signed Memorandums of Understanding and Letters of Intent with various organizations in Indonesia, Malaysia, and Vietnam to develop datasets and applications for regional LLMs.
In addition, the Singapore agency said it is working with partners in Indonesia, Thailand, and the Philippines to build resources on regional language syntax and semantics. These include Thailand’s Vidyasirimedhi Institute of Science and Technology and the Philippines’ Ateneo Social Computing Science Laboratory.
In 2022, Google Research unveiled a partnership with the Indian Institute of Science to work on Project Vaani, which aims to gather anonymized speech data across 773 districts and build an LLM representing the country’s diverse population.
Last week, AI Singapore’s director of AI innovation Laurence Liew called for generative AI players to incorporate regional and local data models to ensure their products better reflect a diverse global population. Integrating SEA-LION, for instance, will help generative AI tools generate more accurate responses, Liew said, noting that the regional LLM generated a more accurate prediction compared to a global public platform when asked about a recent Asian election.
He added that most public generative AI tools today are not Asia-focused and might have inherent data bias. LLMs such as SEA-LION are more “culturally sensitive”, which he said will ensure AI-generated responses better reflect the region’s societal mix.
- Title: Collaborative Development of Regional LLMs as Google Enters the Fray - Insights From ZDNet
- Author: Donald
- Created at : 2024-12-10 17:11:21
- Updated at : 2024-12-12 19:00:16
- Link: https://some-tips.techidaily.com/collaborative-development-of-regional-llms-as-google-enters-the-fray-insights-from-zdnet/
- License: This work is licensed under CC BY-NC-SA 4.0.