In this work, we introduce a novel framework for Graph-Enhanced Question Answering for the AI Act. The approach is designed to improve the transparency and accuracy of answers drawn from complex legal texts, addressing both their complexity and the need for interpretability.
1 Introduction
1.1 Challenges and opportunities of a Graph-enhanced LLM-based Question Answering for the AI Act
Legal texts such as the AI Act present significant challenges for automated question-answering (QA) due to their complexity, formal language, and need for interpretability. Existing legal QA systems have focused on various national and international regulations, including statutory laws and privacy policies. Nevertheless, they have not yet directly addressed the recent European AI Act [3].
Moreover, they typically fall into three categories: (i) Information Retrieval (IR)-based approaches, which lack semantic depth and contextual awareness; (ii) ontology-based methods, which are resource-intensive and constrained; and (iii) neural models, which offer fluent but opaque responses with limited traceability. These limitations hinder their effectiveness in regulatory contexts, thus constraining their practical use by legal professionals.
To address these challenges, we propose a graph-enhanced, LLM-based question-answering (QA) system for retrieving information from the AI Act.
It combines: (i) a graph-based representation of the Act to improve the document’s organization and exploration, and (ii) a language model agent interacting with this graph to provide traceable and context-aware responses.
Inspired by GraphReader and adapted for legal contexts, the model organizes the Act into Chunks, AtomicFacts, and KeyElements. Compared to traditional knowledge graphs or ontology-based legal QA systems, this is flexible and scalable, as the graph representation can be extended to incorporate additional legal documents and adapted to reflect evolving regulatory frameworks. Moreover, our evaluation, combining standard automatic metrics with human judgment, suggests that the proposed framework enhances transparency, accuracy, and semantic understanding in retrieving information from the AI Act.
2 Background
2.1 Knowledge Graphs
Due to the unstructured nature of textual documents and, in this specific field, the intrinsic complexity of legal regulation, providing a structured and machine-readable format becomes crucial. For this purpose, Knowledge Graphs (KGs) represent a way to organize legal documents, to link concepts and to improve access and analysis of legal knowledge.
A Knowledge Graph is a network of interconnected entities and relationships that allows organizing the information in a way that enables reasoning, retrieval and knowledge discovery. Unlike unstructured texts, KGs store entities using nodes and relationships using edges, allowing AI systems to query and navigate information more efficiently. Later in the section, we discuss the application of KGs in structuring statutes, case law, and regulations, making them useful tools in legal research.
However, building a high-quality legal KG is a challenging task due to the ambiguities contained in legal texts, the domain-specific terminology, and the need for precise entity recognition and relation extraction.
2.2 Legal QA Systems
Existing approaches for Legal QA include:
- IR-based systems
- Ontology-based systems
- Neural systems
Traditional legal QA methods, when applied to a text such as the AI Act, often struggle with context fragmentation and lack of explainability.
3 Methodology
Unlike Knowledge Graph (KG)-based QA systems, which require large amounts of training data, or ontology-based systems, which depend on expert input, this approach focuses on structuring the Regulation into a graph, while relying on LLM agents for information retrieval and answer generation.
Concretely, it involves:
- The creation of a graph-based representation of the AI Act, leveraging the AI Act Explorer.
- The implementation of the GraphReader agent to navigate this graph and derive the answers for user queries.
While this work focuses on the AI Act, we could generalize the proposed framework for other legal and regulatory documents. In fact, it offers a scalable methodology for structuring and retrieving legal knowledge across different domains.
3.1 Graph-Based Representation of the AI Act
To enable efficient and interpretable question answering over legal content, we represent the AI Act as a structured graph stored in a Neo4j database. Indeed, this graph captures the hierarchical and semantic organization of the Regulation through a variety of node and relationship types. The core entities in the graph include Article, Annex, Recital, Chapter, and Section nodes. These nodes are enriched with properties such as IDs, titles, content, and dates, depending on their type. Relationships like HAS_ARTICLE and HAS_SECTION reveal the hierarchy of the Regulation by linking Chapters or Annexes to their respective Sections and Articles.
Each Article, Annex, and Recital is further segmented into smaller units called Chunk nodes, corresponding to their paragraphs (Article, Annex) or sentences (Recital). These Chunks are then linked to their parent nodes via HAS_CHUNK relationships and sequentially ordered using NEXT relationships. Cross-references within the Regulation are modeled using the HAS_REFERENCE relationship. Finally, the HAS_RELATED_RECITAL relationship captures associations between Chunks or Articles and relevant Recitals.
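The segmentation described above can be illustrated with a minimal in-memory sketch. It is not the Neo4j implementation used in this work: the Edge dataclass only stands in for graph relationships, and the function name chunk_edges and the chunk ID scheme are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """Stand-in for a graph relationship: (source)-[relation]->(target)."""
    source: str
    relation: str
    target: str


def chunk_edges(parent_id: str, paragraphs: list[str]) -> tuple[dict[str, str], list[Edge]]:
    """Create Chunk nodes for one parent and link them with HAS_CHUNK and NEXT."""
    chunks: dict[str, str] = {}
    edges: list[Edge] = []
    previous_id = None
    for index, text in enumerate(paragraphs, start=1):
        chunk_id = f"{parent_id}-chunk-{index}"  # hypothetical ID scheme
        chunks[chunk_id] = text
        # Every Chunk hangs off its parent Article, Annex, or Recital.
        edges.append(Edge(parent_id, "HAS_CHUNK", chunk_id))
        # Consecutive Chunks are chained to preserve reading order.
        if previous_id is not None:
            edges.append(Edge(previous_id, "NEXT", chunk_id))
        previous_id = chunk_id
    return chunks, edges
```

For a parent with three paragraphs, this yields three HAS_CHUNK edges and two NEXT edges.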
Moreover, to support fine-grained semantic querying, the graph includes AtomicFact nodes, representing concise, standalone factual statements extracted from Article Chunks: HAS_ATOMIC_FACT relationships connect them. Each AtomicFact is further decomposed into KeyElement nodes — important nouns, verbs, or adjectives representing core legal concepts — linked through HAS_KEY_ELEMENT. To reduce redundancy, KeyElements are filtered for duplicates during pre-processing by checking for semantic similarity before insertion.
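The duplicate filtering of KeyElements can be sketched as follows. This is only an illustration under assumptions: the text does not state the similarity threshold, so the value 0.9 is hypothetical, as are the function names and the list-based store that replaces the graph database.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def insert_key_element(store, text, embedding, threshold=0.9):
    """Insert a KeyElement unless a semantically similar one already exists.

    `store` is a list of (text, embedding) pairs. Returns the text of the
    KeyElement that represents `text` after the call. The threshold is an
    assumed value, not one reported in the source.
    """
    for existing_text, existing_embedding in store:
        if cosine(embedding, existing_embedding) >= threshold:
            return existing_text  # reuse the near-duplicate node
    store.append((text, embedding))
    return text
```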


3.2 GraphReader implementation for the Graph-Enhanced Question Answering for the AI Act
GraphReader is a graph-based agent system developed by S. Li, Y. He, H. Guo, X. Du, B. Bai, J. Liu, J. Liu, X. Qu, Y. Li, W. Ou-yang, et al. [6] for Question Answering (QA) on long documents. In their approach, a document is structured as a graph, and an agent navigates this graph to retrieve and derive the answer to a given input question. In this work, we developed two customized implementations of GraphReader and applied them to question answering on the AI Act:
- GraphReaderbase, which follows the original GraphReader design principles with modifications to suit the AI Act,
- GraphReaderaf-only, a variant that removes the Initial Node Selection phase and instead directly selects AtomicFact nodes based on similarity to the question.
As described in the original paper, both models follow three core phases:
1) Graph Construction
This phase begins by extracting AtomicFact and KeyElement nodes. In contrast to the original implementation, where each node v_i contains a KeyElement k_i and a set of AtomicFacts A_i, this work adopts a hierarchical design. Each Chunk node is linked to one or more AtomicFact nodes, which are in turn connected to one or more KeyElement nodes. This revised structure omits the concept of neighboring nodes used in the original paper but aligns more naturally with the coarse-to-fine reasoning strategy employed by the agent. After extraction, embeddings are computed for the content properties of both AtomicFact and KeyElement nodes. These vector representations are crucial for the similarity-based retrieval steps in later stages.
2) Graph Exploration
This phase involves a structured traversal of the graph, guided by a rational plan and focused on extracting relevant information from various node types.
a. Rational Plan Creation:
This phase prompts the LLM to generate a step-by-step strategy for answering the input question, guiding the agent’s navigation through the graph.
b. Initial Node Selection:
In this step, the agent selects KeyElement nodes to initiate graph exploration. Unlike the original approach, which involves prompting the LLM with all KeyElements, this implementation prompts the model only with the input question. The LLM then extracts relevant KeyElements directly from the question. Each selected element is subsequently matched to the most similar node in the graph based on cosine similarity (threshold: 0.5) between embeddings. This approach narrows the initial focus to question-relevant topics, thus improving efficiency.
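A minimal sketch of this matching step is given below. It assumes the KeyElements have already been extracted from the question by the LLM and embedded. The function name, the dictionary inputs, and the choice of treating the 0.5 threshold as inclusive are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_initial_nodes(extracted, graph_nodes, threshold=0.5):
    """Map each question-derived KeyElement to its most similar graph node.

    `extracted` maps KeyElement text to its embedding; `graph_nodes` maps
    KeyElement node IDs to embeddings. A match is kept only if its cosine
    similarity reaches the threshold.
    """
    selected = []
    for _, query_vec in extracted.items():
        best_id, best_score = None, -1.0
        for node_id, node_vec in graph_nodes.items():
            score = cosine(query_vec, node_vec)
            if score > best_score:
                best_id, best_score = node_id, score
        if best_id is not None and best_score >= threshold and best_id not in selected:
            selected.append(best_id)
    return selected
```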
c. AtomicFact Exploration:
In this phase the agent reads the AtomicFact nodes linked to the previously selected KeyElement nodes. While Li et al. use all the linked AtomicFacts, this implementation filters them by cosine similarity (threshold: 0.6) between each AtomicFact’s embedding and the question embedding. The threshold is lowered incrementally if no matches are found, eventually including all nodes if necessary. The filtered AtomicFacts are then passed to the LLM along with the question, rational plan, and notebook, and the LLM selects one of two actions:
i. read_chunk(List[ID]) – Retrieve and explore Chunk nodes linked to relevant AtomicFacts for deeper insight.
ii. stop_and_read_neighbor – Skip to the Neighbor Exploration phase.
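The threshold relaxation used to filter AtomicFacts can be sketched as follows. The starting value of 0.6 comes from the description above. The step size of 0.1 and the function name are assumptions, since the text only says that the threshold is lowered incrementally.

```python
def filter_atomic_facts(scored_facts, start=0.6, step=0.1):
    """Keep AtomicFacts whose similarity to the question reaches a threshold.

    `scored_facts` is a list of (fact_id, similarity) pairs. The threshold
    starts at `start` and is lowered by `step` (an assumed value) until at
    least one fact passes. If none ever passes, all facts are returned.
    """
    threshold = start
    while threshold > 0:
        kept = [fact_id for fact_id, score in scored_facts if score >= threshold]
        if kept:
            return kept
        # Rounding avoids floating-point drift such as 0.30000000000000004.
        threshold = round(threshold - step, 10)
    return [fact_id for fact_id, _ in scored_facts]
```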
d. Chunk Exploration:
This step of the GraphExploration phase is performed only if the chosen_action of the previous AtomicFact Exploration is read_chunk(List[ID]). At this point, the agent reads one Chunk node at a time, updating the notebook with new insights. For each unvisited Chunk, the agent performs two operations before reading its content:
i. Annex detection. If the Chunk references annexes (via HAS_REFERENCE), their IDs are added to check_chunks_queue. This allows the agent to extract relevant details contained only in annexes.
ii. Related Recital retrieval. Using the HAS_RELATED_RECITAL relationship (or fallback via the containing article), the agent adds Chunks from linked Recitals to the queue.
In this work, we intentionally modified the agent’s behavior by requiring it to read all the Chunk nodes in the check_chunks_queue before terminating. In contrast, the original implementation allows the agent to decide whether to stop its exploration after each Chunk Exploration step. We made this design choice to prevent the loss of potentially valuable information during graph traversal, resulting in two key implications. On one hand, we expect the generated answers to be more exhaustive, capturing all relevant insights from the explored nodes. On the other hand, they may include more information than is strictly necessary to address the input question. Finally, the process is bounded by the max_chunks parameter to prevent GraphRecursionError, as defined by LangGraph’s recursion limits.1
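The exhaustive, bounded reading of the queue can be sketched in plain Python as below. This is not the LangGraph implementation: the two dictionaries are hypothetical stand-ins for HAS_REFERENCE and HAS_RELATED_RECITAL lookups, read_chunk is a caller-supplied callback that would update the notebook, and the default value of max_chunks is an assumption.

```python
from collections import deque


def explore_chunks(start_ids, references, related_recitals, read_chunk, max_chunks=10):
    """Read every queued Chunk, bounded by `max_chunks`.

    Before a Chunk is read, the Chunks of any referenced annexes and related
    Recitals are appended to the queue, mirroring the two operations above.
    Returns the IDs of the Chunks that were read, in order.
    """
    queue = deque(start_ids)
    seen = set()
    visited = []
    while queue and len(visited) < max_chunks:
        chunk_id = queue.popleft()
        if chunk_id in seen:
            continue  # only unvisited Chunks are processed
        seen.add(chunk_id)
        queue.extend(references.get(chunk_id, []))        # annex detection
        queue.extend(related_recitals.get(chunk_id, []))  # related Recital retrieval
        read_chunk(chunk_id)
        visited.append(chunk_id)
    return visited
```

Unlike the original GraphReader, there is no early exit decided by the LLM inside the loop: the only stopping conditions are an empty queue and the max_chunks bound.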
e. Neighbor Exploration
This step of the Graph Exploration phase is performed if the previous chosen_action was either stop_and_read_neighbor (from AtomicFact Exploration) or search_neighbor (from Chunk Exploration). At this point, the agent identifies additional KeyElement nodes to explore. More specifically, the neighbor_check_queue list of KeyElement nodes, together with the input question, the rational plan, and the notebook, is passed to the LLM, which reads the KeyElement nodes and chooses one of the following functions:
i. read_neighbor_node(List[KeyElement]). If relevant KeyElements are identified, the LLM returns their contents in a list for further AtomicFact and Chunk exploration.
ii. termination. If none of the KeyElement nodes are relevant to the question and plan, the agent ends the graph traversal.
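The transitions between exploration steps described in this section can be summarized as a small routing table. This is a sketch of the control flow only. The phase names are hypothetical labels, and the actual implementation expresses these transitions through LangGraph rather than a dictionary.

```python
# (current phase, chosen action) -> next phase, as described in the text.
NEXT_PHASE = {
    ("atomic_fact_exploration", "read_chunk"): "chunk_exploration",
    ("atomic_fact_exploration", "stop_and_read_neighbor"): "neighbor_exploration",
    ("chunk_exploration", "search_neighbor"): "neighbor_exploration",
    ("neighbor_exploration", "read_neighbor_node"): "atomic_fact_exploration",
    ("neighbor_exploration", "termination"): "answer_generation",
}


def route(phase, chosen_action):
    """Return the phase that follows `phase` when the LLM picks `chosen_action`."""
    try:
        return NEXT_PHASE[(phase, chosen_action)]
    except KeyError:
        raise ValueError(f"no transition for {chosen_action!r} in {phase!r}") from None
```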
3) Answer Generation
To generate the final answer, the LLM takes as input the notebook of the agent, which contains all insights and information gathered during the Graph Exploration phase, and produces the final response to the input question.

3.2.1. Second Implementation: GraphReaderaf-only
As previously introduced, this version modifies the previous one by omitting the Initial Node Selection phase. The key difference from the previous model is the AtomicFact Exploration step of the Graph Exploration phase.
While the GraphReaderbase model selects AtomicFact nodes based on the associated KeyElement nodes extracted from the input question, in this case the selection is made directly on the AtomicFact nodes of the graph. Specifically, the agent exploits the Neo4j Vector Index to retrieve k_atomic_facts AtomicFact nodes, considering the cosine similarity between the embeddings of their content and the embedding of the input question. Then, as in the GraphReaderbase model, the selected AtomicFact nodes, together with the input question, the rational plan, and the notebook, are passed to the LLM, which reads all the AtomicFact contents and chooses between read_chunk(List[ID]) and stop_and_read_neighbor. The description of these actions, the output of the LLM, and the side effects of this step on the OverallState are the same as those described in the AtomicFact Exploration stage of the GraphReaderbase model. A visual representation of the implemented models follows.
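The vector-index lookup can be sketched as a parameterized Cypher query built in Python. The procedure db.index.vector.queryNodes is Neo4j's built-in vector search call. The index name, the returned property names (id, text), and the default value of k_atomic_facts are assumptions about the schema, not details taken from this work.

```python
# Cypher for Neo4j's built-in vector index procedure.
# The property names `id` and `text` are assumed, not confirmed by the source.
VECTOR_QUERY = """
CALL db.index.vector.queryNodes($index_name, $k, $question_embedding)
YIELD node, score
RETURN node.id AS id, node.text AS text, score
ORDER BY score DESC
"""


def build_vector_query(question_embedding, k_atomic_facts=5,
                       index_name="atomic_fact_embeddings"):
    """Return the (query, parameters) pair for a top-k AtomicFact lookup."""
    if k_atomic_facts < 1:
        raise ValueError("k_atomic_facts must be at least 1")
    if not question_embedding:
        raise ValueError("question_embedding must not be empty")
    params = {
        "index_name": index_name,  # hypothetical index name
        "k": k_atomic_facts,
        "question_embedding": [float(x) for x in question_embedding],
    }
    return VECTOR_QUERY, params
```

With the official Neo4j Python driver, the pair would typically be executed as session.run(query, params). The similarity function itself (cosine here) is configured when the vector index is created, not in the query.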


3.3 Implementation details
The storage and retrieval of graph data is managed through Neo4j, while the agent logic and workflow orchestration are built using LangChain, in combination with LangGraph. This setup provides several key advantages aligned with the requirements of this project:
- Message Clarity. LangChain ensures a clear distinction between system prompts and user inputs, thus facilitating the communication with the LLM model.
- Structured Outputs. By integrating LangChain with the Pydantic package2, it is possible to define a structured output schema for an LLM. This integration eliminates the need for complex and cumbersome parsing functions, ensuring more efficient data handling.
- Pipeline Creation. LangGraph enables the definition of multistep workflows, allowing each phase of the GraphReader to be composed into a coherent execution pipeline. It also supports the definition of agent state transitions, which ensures consistency across the process.
- Automated Evaluation. Once we define the pipeline, the final evaluation of the GraphReader becomes straightforward, as the framework autonomously manages the entire process.
The GraphConstruction phase, comprising the extraction of both AtomicFact and KeyElement nodes, was conducted using OpenAI’s gpt-4o-mini LLM model3. We also used the same model in each step of the GraphExploration phase. Each operation involving embeddings was conducted leveraging the text-embedding-3-small embedding model developed by OpenAI4, with an embedding dimension of 512.
4 Evaluation and results
To provide a broad assessment, we consider both automatic and human-based evaluation methods, as detailed in the following sections. Automatic evaluation is based on standard and scalable measurements, using metrics such as similarity between embeddings. However, these methods might not be effective for complex responses, so we also included human evaluation to provide a more nuanced assessment of response quality.
4.1 Data
To the best of our knowledge, no dataset currently exists for evaluating QA systems on the AI Act. Consequently, to address this gap, we created a small dataset consisting of two distinct groups of 10 questions each. In the first group, we paired each question with an expected answer. After that, a panel of lawyers with expertise in the AI Act reviewed it. This subset was finally used for automatic evaluation. The second group contains questions without predefined answers and was assessed manually by independent legal experts for human evaluation. The complete set of questions and answers is available in the experiment section of the supplementary material. While the number of questions is limited, the dataset provides a preliminary step toward evaluating the performance of our system. We further discuss this point in the limitations section.
4.2 Evaluation Metrics
4.2.1 Automatic Evaluation
As described by J. Martinez-Gil [7], automatic evaluation relies on standard metrics and tools to evaluate the performance of a given model. One of the main advantages of this approach is that it does not require intensive human participation, which not only saves time, but also minimizes subjective influences on the evaluation. Overall, this makes the evaluation process more standardized and less biased. Below, we describe the automatic metrics we used in this project.
Accuracy:
To measure the accuracy of the model-generated answer, we used BERTScore, introduced by T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi [12]. Many studies have employed this metric for text generation tasks. For example, building on BERTScore, S. Zhou, U. Alon, S. Agarwal, and G. Neubig [13] developed an evaluation metric for code generation, while J. Tobin, D. Li, S. Venugopalan, K. Seaver, R. Cave, and K. Tomanek [11] leveraged BERTScore to assess Automatic Speech Recognition (ASR) model quality.
BERTScore computes similarity between a model’s response and a reference answer using contextual embeddings from a pre-trained transformer model, returning the precision, recall and F1-score between the tokens in the model’s response and those of the reference answer. Moreover, unlike traditional n-gram-based metrics like BLEU or ROUGE, BERTScore works on contextual embeddings, making it robust to paraphrasing and word reordering. These properties make BERTScore a strong candidate for evaluating AI-generated answers.
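The greedy matching at the core of BERTScore can be illustrated on precomputed token vectors. This sketch is not the bert-score package. It assumes both inputs are non-empty lists of token embeddings and omits the optional idf weighting and baseline rescaling of the original metric. The function names are hypothetical.

```python
import math


def _unit(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def greedy_bertscore(candidate_vecs, reference_vecs):
    """Precision, recall, and F1 from greedy cosine matching of token vectors."""
    cand = [_unit(v) for v in candidate_vecs]
    ref = [_unit(v) for v in reference_vecs]

    def sim(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Precision: each candidate token is matched to its best reference token.
    precision = sum(max(sim(c, r) for r in ref) for c in cand) / len(cand)
    # Recall: each reference token is matched to its best candidate token.
    recall = sum(max(sim(r, c) for c in cand) for r in ref) / len(ref)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```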
Relevance:
To measure the relevance of the model’s answer to the input question, we used SentenceBERT as presented by N. Reimers and I. Gurevych [9]. This model generates sentence embeddings which are then used for semantic similarity comparisons. The evaluation process consists of the following steps:
- For each sentence in the model’s response, cosine similarity is computed in relation to the input question using the SentenceBERT model;
- The final relevance score is obtained by averaging the cosine similarity scores across all sentences in the response.
Given SentenceBERT’s ability to capture the semantic meaning of sentences, this method evaluates the responses based on their actual relevance to the question rather than on mere surface-level similarities.
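The two steps of the relevance computation can be sketched as follows. The embed argument is a caller-supplied function standing in for the SentenceBERT encoder, and the regular-expression sentence splitter is a simplifying assumption. Neither is taken from the actual evaluation code.

```python
import math
import re


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def relevance_score(question, answer, embed):
    """Average cosine similarity between the question and each answer sentence.

    `embed` maps a string to a vector; in the setup described above it would
    wrap a SentenceBERT model.
    """
    # Naive splitter: break after ., !, or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    if not sentences:
        return 0.0
    question_vec = embed(question)
    return sum(cosine(question_vec, embed(s)) for s in sentences) / len(sentences)
```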
4.2.2 Human Evaluation
Despite being easier to implement, automatic evaluation is not always sufficient, especially when assessing domain-specific responses generated by LLMs. In such cases, human evaluation is more reliable, providing a more accurate assessment.
As described by Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, H. Chen, X. Yi, C. Wang, Y. Wang [2], several factors influence the reliability of human-based evaluations, including the number of evaluators, the chosen evaluation criteria and the expertise level of the reviewers. Inspired by many authors [6, 26, 33], we adopted the following evaluation criteria:
- Accuracy. This criterion evaluates the ability of the language model to produce information that aligns with the response a domain expert would provide, avoiding errors and inaccuracies;
- Relevance. This criterion examines how well the response addresses the given question, ensuring that the information provided is pertinent and directly applicable;
- Transparency. This indicator assesses whether the response appropriately relies on the content and context of the European Artificial Intelligence Act, ensuring that no extraneous information is included.
Three independent legal experts evaluated each model and applied the above criteria to assess performance. We selected the experts based on prior collaborations and professional connections within our scientific network. To evaluate the models’ responses, the experts used a five-point Likert scale for each criterion. Finally, we collected human evaluations anonymously via a Google Form, in compliance with the General Data Protection Regulation (GDPR).
4.3 Baseline and Competitors
In our evaluation, we compare our methodology with a baseline and two competing models, as described below. The baseline model represents a reference point, implementing the simplest approach to QA on the AI Act. In contrast, the competitor models are more advanced, providing a stronger reference for evaluating the effectiveness of our implemented models. The following subsections describe each of these models in detail.
4.3.1 Baseline Model
Based on the survey of J. Martinez-Gil [7], we implemented the baseline model using the BM25 retrieval method. Introduced by S. Robertson and H. Zaragoza [10], BM25 is a classical probabilistic information retrieval (IR) method widely used in tasks involving document ranking and search. Notably, this method does not rely on deep learning: it retrieves documents based on term frequency and statistical weighting, making it an efficient and interpretable approach. Moreover, several studies have exploited BM25-based models. For example, S. Khazaeli, J. Pumkao, C. Morris, S. Sharma, B. Staub, M. Cole, C. Sheu-Webster, and D. Slakoff [4] implemented BM25 to rank legal documents based on relevance to user queries, while A. Askari, Z. Yang, Z. Ren, and S. Verberne [1] utilized BM25 for initial document retrieval in legal contexts.
In our setup, BM25 was applied using the Whoosh library to index Chunk nodes (from both articles and recitals of the AI Act) stored in Neo4j5. User queries were processed with KeyBERT to extract key topics, which were then used to formulate search queries6. Finally, the system retrieves and ranks relevant Chunks based on BM25 scores, with the top results forming the model’s response. A detailed description of the baseline model is provided in the supplementary material.
In general, while BM25 does not capture the semantic meaning of a Chunk’s content as effectively as modern neural or deep-learning-based models, it still provides a strong and widely used baseline for evaluating the retrieval quality of QA systems.
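For reference, the BM25 ranking function can be sketched in a few lines. This is a generic textbook formulation, not Whoosh's implementation: the idf variant (with +1 inside the logarithm to keep it non-negative) and the parameter values k1 = 1.5 and b = 0.75 are assumptions, and documents are assumed to be pre-tokenized.

```python
import math
from collections import Counter


def bm25_scores(query_terms, documents, k1=1.5, b=0.75):
    """Score each tokenized document against the query terms with BM25."""
    if not documents:
        return []
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Length normalization dampens the advantage of long documents.
            denom = term_freq[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / denom
        scores.append(score)
    return scores
```

Because scoring depends only on term overlap, a Chunk that paraphrases the query without sharing its terms receives a score of zero, which is the limitation noted above.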
4.3.2 Competitor Models
In addition to the baseline model, we compared our proposed approach against two more advanced competitor models.
Clairk
The first competitor model is Clairk, developed by the St. Gallen Endowment for Prosperity and Trade7 under the Digital Policy Alert initiative. This tool provides an interface for retrieving information from the AI Act and other regulations. Users can input a question, and the system generates an answer while also listing the sources from which the information was retrieved. Despite being well designed and user-friendly, the platform does not provide any details about how the underlying retrieval model works or how it is implemented; it only allows the user to select the document from which to retrieve the requested information.
SBERTGPT
The second competitor model, named SBERTGPT, combines SentenceBERT for semantic retrieval with a large language model (LLM) for answer generation.
This approach first identifies the most relevant Chunk nodes in the AI Act using SentenceBERT’s embeddings and cosine similarity. Then, an LLM refines and rephrases the extracted content to produce a final response. All things considered, this method represents a good alternative to the proposed models, as it balances the ability of SentenceBERT to capture semantic similarity between sentences with the readability of LLMs, which rearrange the retrieved passages.
In general, the effectiveness of SentenceBERT in generating meaningful embeddings has led several studies to adopt it in their QA systems. For example, T. N.-T. Nguyen, P.-P. H. La, T. T. Nguyen, K. Van Nguyen, and N. L.-T. Nguyen [8] combined SentenceBERT with BM25 in a two-stage QA system for medical texts, while T. Klein and M. Nabi [5] highlighted that combining GPT-2 with BERT-based models improved both question generation and answering tasks by leveraging their complementary strengths.
4.4 Evaluation
Table 2 summarizes the results obtained for the automatic evaluation.
For BERTScore, we report the average value of F1-score across all the answers, while for SentenceBERT we report the average value of the cosine similarity across all the answers, with the total score computed as the simple average.

With regard to human evaluation, Table 3 summarizes the results obtained. For each criterion, we report the average score across all 10 answers evaluated by each legal expert, resulting in a final average computed from 30 scores. Then, this average was weighted according to the level of confidence each expert reported regarding their familiarity with the AI Act. Specifically, each expert provided a self-assessed confidence score on a scale from 1 to 10, which was used to weight their individual evaluations.
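The confidence weighting can be sketched as a weighted mean. The text does not give the exact formula, so normalizing the weights by their sum is an assumption, as is the function name.

```python
def weighted_human_score(expert_means, confidences):
    """Confidence-weighted mean of per-expert average Likert scores.

    `expert_means` holds one average score per expert for a criterion;
    `confidences` holds each expert's self-assessed confidence (1 to 10).
    Weights are normalized by their sum (an assumed scheme).
    """
    if len(expert_means) != len(confidences):
        raise ValueError("one confidence value is required per expert")
    total = sum(confidences)
    if total <= 0:
        raise ValueError("confidences must sum to a positive value")
    return sum(m * c for m, c in zip(expert_means, confidences)) / total
```

With equal confidences this reduces to the plain average of the experts' scores.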

Lastly, the final score is computed as a weighted average of the automatic score and the human score, with a higher weight assigned to human evaluation to reflect its greater importance in legal contexts.
In this sense, human scores are given 70% of the total weight, while automatic metrics contribute 30% to the final score.
The motivation behind this choice lies in the limitations of automatic evaluation metrics, which, while useful for measuring linguistic and semantic similarity, fail to assess legal correctness, reasoning, and compliance with regulations. For instance, an answer may appear semantically similar to a reference answer but still be incorrect in legal interpretation, leading automatic metrics to assign it an unjustifiably high score. In contrast, human evaluators are able to identify legal inaccuracies and consider contextual reasoning, making their assessment more reliable.
This preference for human evaluation is further supported by research on large language model (LLM) evaluation, where studies indicate that human judgment provides a more comprehensive and accurate assessment, particularly in domain-specific tasks such as legal QA [2]. For these reasons, the final evaluation follows the formula:
Final Score = (0.3 * Automatic Score) + (0.7 * Normalized Human Score)
Where Automatic Score represents the Total Score of Table 2, and Normalized Human Score represents the normalized Total Score of Table 3. Note that each Total Score is normalized to [0, 1], to ensure consistency with the Automatic Score. Table 4 shows the final scores for each model, along with their respective automatic and human scores.
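The final score can be computed as in the sketch below. The 0.3 and 0.7 weights come from the formula above. The min-max normalization of the five-point Likert scale to [0, 1] is an assumption, since the text does not specify how the human score is normalized.

```python
def final_score(automatic_score, human_score, scale_min=1.0, scale_max=5.0,
                automatic_weight=0.3, human_weight=0.7):
    """Combine the automatic score (already in [0, 1]) with the human score.

    The human score is min-max normalized from the Likert range
    [scale_min, scale_max] to [0, 1]; this normalization is assumed.
    """
    normalized_human = (human_score - scale_min) / (scale_max - scale_min)
    return automatic_weight * automatic_score + human_weight * normalized_human
```

For example, an automatic score of 0.8 and a human score of 4.0 give 0.3 * 0.8 + 0.7 * 0.75 = 0.765 under this normalization.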
When considering the combined final score (Table 4), GraphReaderbase achieves the highest performance. These results suggest the benefits of a structured graph traversal strategy and point to the importance of the Initial Node Selection phase, which contributes to more targeted and contextually relevant responses.
The GraphReaderaf-only variant performs well in automatic evaluation, achieving the highest score among all models, but its human evaluation scores are lower. This discrepancy suggests that, despite the effective AtomicFact embedding-based retrieval, bypassing the initial use of KeyElement nodes reduces the perceived relevance, accuracy and transparency of the answers, underlining the limits of relying solely on vector similarity among entire sentences in Legal QA.
Clairk ranks second overall, with strong performance in both evaluation types. It demonstrates that dense retrieval methods can be competitive, although its black-box architecture limits interpretability and refinement potential. Meanwhile, BM25, as a baseline, performs the worst across all measures, reaffirming that traditional term-based retrieval is insufficient for the nuanced demands of legal texts.
Overall, the models developed in this work, particularly GraphReaderbase , successfully combine accurate retrieval with transparent reasoning. This balance is critical in Legal QA, where the ability to justify an answer through traceable sources is as important as the answer itself.
4.5 Limitations and Future Work
Although the proposed models show promising performance in question answering on the AI Act, several limitations remain and may be the subject of future work.
Firstly, the evaluation was conducted on a small dataset with a limited number of evaluators. Expanding both the dataset size and the number of human evaluators would improve the robustness of the results and increase their generalizability to real-world applications.
Another limitation is the lack of comparison with open-source LLMs and alternative embedding models, such as SentenceBERT, within the GraphReader pipeline. Indeed, evaluating alternative retrieval and reasoning architectures would provide deeper insights into how different models handle legal text interpretation and answer generation. Similarly, while this work compared GraphReader-based models with one baseline and two competitor models, further benchmarking against additional competitors would strengthen the comparative analysis.
Regarding the evaluation, we could explore alternative automatic metrics, such as LLM-as-a-Judge approaches. However, using an LLM to evaluate responses generated by another LLM raises concerns regarding bias and reliability. Ideally, a Knowledge Graph (KG)-based evaluation would have provided a structured and interpretable assessment, but this would have required extensive fine-tuning and additional computational resources.
5 Conclusion: a Graph-Enhanced Question Answering for the AI Act
In this work, we introduced a novel framework that integrates the interpretability of graph-based representations with the semantic reasoning capabilities of Large Language Models (LLMs) to support question answering on the AI Act. To our knowledge, this represents one of the first research efforts aimed at developing a Graph-Enhanced Question Answering system specifically for the AI Act.
As shown above, a key strength of our approach lies in its capacity to overcome several limitations of traditional legal QA systems. Specifically, it offers the following advantages: (i) No need for manually crafted ontologies or any other human intervention, as it automatically structures the AI Act using LLM-based information extraction; (ii) No reliance on training data or fine-tuning existing models. Instead, the final graph and its content represent the only reference for the answers provided by the system; (iii) LLMs allow the generation of both a structured and semantic-aware representation of the content of the AI Act; (iv) Answer generation is traceable and verifiable, as responses result from explicit graph traversal of the AI Act’s content.
Special thanks to our contributor:
Nicola Aggio
NOTES
1. https://langchain-ai.github.io/langgraph/how-tos/recursion-limit/
2. https://docs.pydantic.dev/latest/
3. https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/
4. https://openai.com/index/new-embedding-models-and-api-updates/
5. https://whoosh.readthedocs.io/en/latest/
6. https://maartengr.github.io/KeyBERT/api/keybert.html
7. https://www.stgallen-endowment.org/
REFERENCES
[1] A. Askari, Z. Yang, Z. Ren, and S. Verberne. Answer retrieval in legal community question answering. In European Conference on Information Retrieval, pages 477–485. Springer, 2024.
[2] Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, H. Chen, X. Yi, C. Wang, Y. Wang, et al. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology, 15(3):1–45, 2024.
[3] European Parliament. Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence and amending Regulations (EC) No 300/2008, (EU) No 167/2013, (EU) No 168/2013, (EU) 2018/858, (EU) 2018/1139 and (EU) 2019/2144 and Directives 2014/90/EU, (EU) 2016/797 and (EU) 2020/1828 (Artificial Intelligence Act), 2024. URL https://eur-lex.europa.eu/eli/reg/2024/1689/oj/eng.
[4] S. Khazaeli, J. Pumkao, C. Morris, S. Sharma, B. Staub, M. Cole, C. Sheu-Webster, and D. Slakoff. A free format legal question answering system. In Proceedings of the Natural Legal Language Processing Workshop 2021, pages 107–113, 2021.
[5] T. Klein and M. Nabi. Learning to answer by learning to ask: Getting the best of GPT-2 and BERT worlds. arXiv preprint arXiv:1911.02365, 2019.
[6] S. Li, Y. He, H. Guo, X. Du, B. Bai, J. Liu, J. Liu, X. Qu, Y. Li, W. Ou-yang, et al. GraphReader: Building graph-based agent to enhance long-context abilities of large language models. arXiv preprint arXiv:2406.14055, 2024.
[7] J. Martinez-Gil. A survey on legal question-answering systems. Computer Science Review, 48:100552, 2023.
[8] T. N.-T. Nguyen, P.-P. H. La, T. T. Nguyen, K. Van Nguyen, and N. L.-T. Nguyen. SPBERTQA: A two-stage question answering system based on sentence transformers for medical texts. In International Conference on Knowledge Science, Engineering and Management, pages 371–382. Springer, 2022.
[9] N. Reimers and I. Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084, 2019.
[10] S. Robertson, H. Zaragoza, et al. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends® in Information Retrieval, 3(4):333–389, 2009.
[11] J. Tobin, D. Li, S. Venugopalan, K. Seaver, R. Cave, and K. Tomanek. Assessing ASR model quality on disordered speech using BERTScore. arXiv preprint arXiv:2209.10951, 2022.
[12] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675, 2019.
[13] S. Zhou, U. Alon, S. Agarwal, and G. Neubig. CodeBERTScore: Evaluating code generation with pretrained models of code. arXiv preprint arXiv:2302.05527, 2023.
[14] K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis, S. Pfohl, et al. Large language models encode clinical knowledge. arXiv preprint arXiv:2212.13138, 2022.
[15] M. Zhong, Y. Liu, D. Yin, Y. Mao, Y. Jiao, P. Liu, C. Zhu, H. Ji, and J. Han. Towards a unified multi-dimensional evaluator for text generation. arXiv preprint arXiv:2210.07197, 2022.
