Amazon aims to enhance the evaluation of AI models and increase human involvement in the process.
At the AWS re:Invent conference, Swami Sivasubramanian, AWS vice president of database, analytics, and machine learning, announced Model Evaluation on Bedrock, now available in preview, for models found in its repository, Amazon Bedrock. Without a way to transparently test models, developers may end up using one that is not accurate enough for a question-and-answer project or one that is too large for their use case.
“Model selection and evaluation is not just done at the beginning, but is something that’s repeated periodically,” Sivasubramanian said. “We think having a human in the loop is important, so we are offering a way to manage human evaluation workflows and metrics of model performance easily.”
Sivasubramanian told The Verge in a separate interview that developers often struggle to determine which model best suits their projects, and as a result end up using models that are too powerful or too large for the task.
Model Evaluation comprises automated and human evaluation components. In the automated version, developers can assess the performance of models on metrics such as robustness, accuracy, or toxicity for various tasks within the Bedrock console. Bedrock includes popular third-party AI models like Meta’s Llama 2, Anthropic’s Claude 2, and Stability AI’s Stable Diffusion.
While AWS provides test datasets, customers can also bring their own data into the benchmarking platform to better understand how the models behave. The system then generates a report.
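To make the idea of automated evaluation on a customer's own data concrete, here is a minimal sketch in Python. It is not the Bedrock API or AWS's method: the two "models" are hypothetical stand-in functions, the dataset is invented for illustration, accuracy is assumed to mean exact-match scoring, and robustness is assumed to mean accuracy after a simple change to the prompt's case and whitespace.

```python
from typing import Callable, Dict, List, Tuple

Dataset = List[Tuple[str, str]]  # (prompt, expected answer) pairs

# Hypothetical stand-ins for hosted models; a real setup would call a model endpoint.
def model_a(prompt: str) -> str:
    answers = {"capital of france?": "Paris", "2 + 2?": "4"}
    return answers.get(prompt.strip().lower(), "unknown")

def model_b(prompt: str) -> str:
    answers = {"capital of france?": "Paris"}
    # No strip(): this stand-in is deliberately sensitive to extra whitespace.
    return answers.get(prompt.lower(), "unknown")

def perturb(prompt: str) -> str:
    """Assumed robustness probe: change case and pad with whitespace."""
    return "  " + prompt.upper() + " "

def accuracy(model: Callable[[str], str], dataset: Dataset) -> float:
    """Fraction of prompts whose output exactly matches the expected answer."""
    hits = sum(
        model(prompt).strip().lower() == expected.strip().lower()
        for prompt, expected in dataset
    )
    return hits / len(dataset)

def evaluate(models: Dict[str, Callable[[str], str]],
             dataset: Dataset) -> Dict[str, Dict[str, float]]:
    """Build a per-model report with accuracy and robustness scores."""
    perturbed = [(perturb(prompt), expected) for prompt, expected in dataset]
    return {
        name: {
            "accuracy": accuracy(model, dataset),
            "robustness": accuracy(model, perturbed),
        }
        for name, model in models.items()
    }

# A customer-supplied dataset (invented for this sketch).
my_dataset: Dataset = [("Capital of France?", "Paris"), ("2 + 2?", "4")]

report = evaluate({"model_a": model_a, "model_b": model_b}, my_dataset)
for name, scores in report.items():
    print(name, scores)
```

The point of the sketch is the comparison: the same dataset is run against each candidate, and the resulting report makes it easier to see which model is accurate enough for a given project.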
For human evaluation, users can opt to work with an AWS human evaluation team or their own, specifying the task type, evaluation metrics, and the dataset they want to use. AWS will provide customized pricing and timelines for those working with its assessment team.
Vasi Philomin, AWS vice president for generative AI, told The Verge that a better understanding of how models perform guides development and lets companies check that a model meets responsible AI standards before they use it.
Sivasubramanian also emphasized that human evaluation of AI models can capture qualities that automated systems may overlook, such as empathy or friendliness.
Philomin clarified that AWS will not require all customers to benchmark models, offering flexibility based on individual needs. While the benchmarking service is in preview, AWS will only charge for the model inference used during the evaluation.
Because there is no specific standard for benchmarking AI models, Philomin said the goal of Bedrock benchmarking is to give companies a way to measure a model's impact on their own projects, rather than to evaluate models broadly.