Huan He edited this page Feb 18, 2022 · 21 revisions

Foreword

MedTator is a serverless web tool that focuses on the core steps related to corpus annotation.

What does "serverless" mean?

"Serverless" does not mean that you cannot put MedTator on a server (you can certainly access MedTator through any web server), nor does it mean that MedTator lacks any annotation functionality. "Serverless" means that MedTator processes data entirely within your web browser and needs no server-side support.

Therefore, MedTator won't send any information anywhere, won't keep any data or annotations outside your local environment, and won't record any user operations (e.g., mouse clicks or key presses). You can use MedTator over the internet as a web page or locally like a standalone program.

Background

Natural language processing (NLP) and machine learning techniques have been widely applied in practice and research, and they usually rely on high-quality annotated datasets. Therefore, manual annotation is required to collect additional information from documents, and a suitable tool is needed to reduce the intensive labor involved. To address this need, many text annotation tools have been developed for a variety of tasks, such as text classification, named-entity recognition, and sequence prediction.

However, while existing tools provide many powerful features to cover various needs in text annotation, it is still challenging for non-expert users or annotators to leverage these tools in their own research tasks. Therefore, based on feedback from our domain experts and experienced annotators, we propose and implement MedTator to address these challenges.

System Architecture

MedTator is implemented in pure frontend JavaScript, with the annotation schema and files processed in the client’s web browser, which enables installation-free and cross-platform access for both administrators and annotators. Although MedTator is a pure frontend web application that doesn’t require any server components, its architecture still follows the Model-View-Controller (MVC) pattern and a refinement of MVC, the Model-View-ViewModel (MVVM) pattern. The MVVM pattern gives developers a blueprint for building frontend / client applications with more responsive user interaction and feedback, while avoiding costly duplication of code (e.g., DOM manipulation and CSS updates) and effort across the overall architecture.
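
To make the MVVM idea more concrete, the following is a minimal sketch in plain JavaScript. It is not Vue.js internals and not MedTator's actual code; the names createViewModel, fileName, and tagCount are hypothetical and chosen only for illustration. It shows how writes to a view model can notify a view binding automatically, so that application code never touches the DOM directly.

```javascript
// Minimal MVVM-style sketch (illustrative only; not Vue.js internals and
// not MedTator's actual code). All names here are hypothetical.

// Wrap a plain model object so that every property write notifies subscribers.
function createViewModel(model) {
  const subscribers = [];
  const viewModel = new Proxy(model, {
    set(target, key, value) {
      target[key] = value;
      // Notify each "view" binding that a property changed.
      subscribers.forEach((fn) => fn(key, value));
      return true;
    },
  });
  return {
    viewModel,
    subscribe(fn) {
      subscribers.push(fn);
    },
  };
}

// Hypothetical model: the selected annotation file and its tag count.
const { viewModel, subscribe } = createViewModel({ fileName: "", tagCount: 0 });

// Stand-in for the view. In a browser this callback would update a DOM node;
// here it records the rendered text so the sketch also runs under Node.js.
const rendered = {};
subscribe((key, value) => {
  rendered[key] = `${key}: ${value}`;
});

// Application code only writes to the view model, never to the view.
viewModel.fileName = "note_001.xml";
viewModel.tagCount = 3;

console.log(rendered);
```

In a framework such as Vue.js, the subscription step is handled by the framework's template bindings, which is what removes the hand-written DOM update code mentioned above.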

Due to the complexity of the annotation tasks, we designed four tabs, and each tab focuses on a certain task to avoid cognitive overload for users. Although the task for each tab is different, the functions and data structures used by each tab can be shared. Therefore, we leverage the features provided by Vue.js and other packages to implement MedTator’s architecture and the core functions needed for annotation tasks.

As shown in the above figure, the architecture of MedTator includes four layers, namely user interface layer, core modules layer, data persistence layer, and open-source packages layer.

The user interface layer contains the four tabs for the core annotation tasks, which are built on Metro UI and provide an experience similar to that of well-known desktop applications. In the core modules layer, we implement a Vue.js-based core app controller to route user requests to the core functions, such as importing schema and annotation files and calculating inter-annotator agreement (IAA). Because rendering tags and other visual effects is demanding, we also implement several modules related to visualization. For example, when showing relation tags, a polyline is drawn on the editor in SVG (Scalable Vector Graphics) format to indicate the entities being linked. To get the correct coordinates of the polyline in the different display modes (i.e., document mode and sentence mode), we developed modules that get the relative tag coordinates in the editor and map them to an SVG path in a different coordinate system. The data persistence layer handles requests to read and write files in various formats.
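
The coordinate mapping for relation polylines can be sketched as follows. This is only an illustration under stated assumptions: the function names toEditorCoords and relationPolylinePoints, the lift parameter, and the sample rectangles are hypothetical and are not taken from MedTator's source. The sketch assumes each tag's bounding rectangle is known in viewport coordinates and that the SVG overlay shares its origin with the editor container.

```javascript
// Illustrative sketch only: the names below are hypothetical and are not
// taken from MedTator's source code.

// Convert a rectangle given in viewport coordinates into coordinates relative
// to the editor container, which is assumed to be the SVG overlay's origin.
function toEditorCoords(tagRect, editorRect) {
  return {
    left: tagRect.left - editorRect.left,
    top: tagRect.top - editorRect.top,
    width: tagRect.width,
    height: tagRect.height,
  };
}

// Build the `points` attribute of an SVG <polyline> that rises from the top
// center of one tag, runs horizontally, and drops to the top center of the other.
function relationPolylinePoints(rectA, rectB, lift = 10) {
  const ax = rectA.left + rectA.width / 2;
  const bx = rectB.left + rectB.width / 2;
  // Place the horizontal segment above whichever tag sits higher.
  const y = Math.min(rectA.top, rectB.top) - lift;
  return [
    [ax, rectA.top],
    [ax, y],
    [bx, y],
    [bx, rectB.top],
  ]
    .map(([px, py]) => `${px},${py}`)
    .join(" ");
}

// Hand-written rectangles; in a browser they could come from
// element.getBoundingClientRect().
const editor = { left: 100, top: 50, width: 800, height: 600 };
const tagA = toEditorCoords({ left: 120, top: 90, width: 40, height: 20 }, editor);
const tagB = toEditorCoords({ left: 300, top: 130, width: 60, height: 20 }, editor);
const points = relationPolylinePoints(tagA, tagB);

console.log(`<polyline points="${points}" fill="none" stroke="black" />`);
```

Keeping the conversion to editor-relative coordinates separate from the path construction means the same path builder could be reused in both document mode and sentence mode, with only the input rectangles changing.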

Open-source Packages

The functions and features of MedTator are based on many open-source packages, which are served from free public content delivery network (CDN) services. As a result, users don’t need to install any runtime environment on a server or client to use it (i.e., no need to install Java, Python, R, or any other runtime). The open-source packages used and their details are listed below.

| Package Name | Version | Description |
| --- | --- | --- |
| Metro UI 4 | 4.3.2 | Metro 4 is an open-source toolkit for developing with HTML, CSS, and JS for quickly prototyping responsive web pages. |
| jQuery | 3.4.1 | jQuery is a fast, small, and feature-rich JavaScript library for HTML document traversal and manipulation, event handling, Ajax, etc. |
| jQuery UI | 1.12.0 | jQuery UI is a curated set of user interface interactions, effects, widgets, and themes built on top of the jQuery JavaScript Library. |
| Vue.js | 2.6.11 | Vue.js is an open-source Model–View–ViewModel frontend JavaScript framework for building user interfaces. |
| jszip | 3.2.0 | JSZip is an efficient JavaScript library for creating, reading, and editing .zip files with a simple API. |
| dayjs | 1.8.36 | Day.js is a minimalist JavaScript library that parses, validates, manipulates, and displays dates and times. |
| CodeMirror | 5.62.0 | CodeMirror is a versatile text editor implemented in JavaScript for editing code in the web browser. |
| PapaParse | 5.3.1 | Papa Parse is a fast in-browser CSV (or delimited text) parser for JavaScript, which is reliable according to RFC 4180. |
| Shepherd | 8.3.1 | Shepherd is a JavaScript library for guiding users through the main features of a web application. |
| winkNLP | 1.8.0 | winkNLP is a JavaScript NLP library that supports stemmer, lexicon, tokenizer, lemmatizer, etc. |
| Compromise | 13.11.4 | Compromise is a JavaScript NLP library that supports sentence splitting, token normalization, named-entity recognition, etc. |
| xml-formatter | 2.4.0 | xml-formatter is a JavaScript library for converting XML into a human-readable format while respecting the xml:space attribute. |

Run Your Own Copy

As MedTator is a serverless pure frontend web application, there are two ways to run your own copy:

  1. Standalone version: MedTator itself is just a single HTML file which contains everything needed. So, you can just open the HTML file directly to use it offline. Moreover, we cached all libraries used in the static folder, so you can use it even without internet access.
  2. Online version: You can fork your own copy on GitHub and run it under a domain name provided by GitHub.

Download Standalone Version

You could find the release link on the repo homepage:

Then, you could find the release zip file that only contains the standalone version:

In addition to the release version, you could also download the latest development version by downloading the whole repo:

Unzip the downloaded zip file and double-click docs/standalone.html to open the latest development version of the standalone MedTator.

Fork Online Version

  • First, go to the homepage of the MedTator repository: https://github.com/OHNLP/MedTator
  • Secondly, find the “Fork” button in the top right, next to the star button. Click this “Fork” button and follow the instructions to fork the MedTator repository to your own GitHub account.
  • Thirdly, go to the settings of your forked repo and switch to the “Pages” section. Set the source to branch “main” and folder “docs”, then save.

Then, GitHub will assign a customized domain name to this forked MedTator. After a few minutes, you can access your own MedTator copy with that customized domain name. For example, if your GitHub account name is username123, you can find your forked MedTator at https://username123.github.io/MedTator by default.
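
The default address pattern described above can be written as a small helper. This is a sketch for illustration only: defaultPagesUrl is a hypothetical name, and the helper assumes the fork keeps the repository name MedTator and uses no custom domain.

```javascript
// Hypothetical helper: builds the default GitHub Pages address of a forked
// project site from the account name and the repository name. It assumes the
// fork keeps its repository name and that no custom domain is configured.
function defaultPagesUrl(accountName, repoName = "MedTator") {
  return `https://${accountName}.github.io/${repoName}`;
}

console.log(defaultPagesUrl("username123"));
```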

In addition to the above default configuration, you can also specify a different branch or folder to serve as the MedTator homepage according to your own situation. More details about forking a repo on GitHub can be found at https://docs.github.com/en/get-started/quickstart/fork-a-repo and more details about GitHub Pages can be found at https://docs.github.com/articles/configuring-a-publishing-source-for-github-pages/ .

Comparison with Other Tools

Many tools have been developed to facilitate the annotation process. Although existing tools may provide powerful features to cover various needs of text annotation, they usually require users to install a runtime environment before annotators can start annotating. For example, most web-based tools provide project and user management for better authentication and multi-project support, which may help large annotation teams work collaboratively. As a result, a central database, such as MySQL or MariaDB, needs to be installed to save the information related to permissions and project settings. Other features usually also need some packages to be installed. Consequently, users must solve the installation issues before the annotators can run any tool for an annotation task, and these installation and configuration issues are usually not easy for non-technical users to solve.

Existing web-based annotation tools usually adopt a browser / server architecture, in which the server-side program provides various features for corpus annotation and the browser-side web pages provide the user interface to collect user input. To support those features, the server-side program is usually built upon backend web techniques to handle data processing and storage. Therefore, it requires the following:

  1. A stable server. A dedicated machine with a stable network connection is usually needed to host the server-side program.
  2. Technical expertise. The installation, configuration, and maintenance of the server environment often require a range of technical expertise, such as the management of Linux servers, networks and firewalls, virtual environments for programming language runtimes, and database administration. During installation, any environmental issue may cause the installation to fail (e.g., missing libraries, mismatched versions, or wrong environment variables), and sometimes environment updates may also make the installed tools unworkable (e.g., operating system patches or updates to dependent libraries). Resolving these issues related to installing and maintaining servers is also a challenge for administrators.
  3. Permissions. Unlike the annotation of public corpora (e.g., Twitter, Wikipedia, and PubMed metadata), the privacy and security of clinical documents are critical concerns in clinical settings. Researchers and annotators may need to communicate with the IT department regarding server permissions, such as software installation, network privileges, and data access.

This installation issue seems to stem from the need for various features, but in fact one of the root causes could be the lack of basic computing infrastructure and fundamental functions in web techniques in the past. For example, project and data management for multi-user annotation usually requires centralized storage and authentication. In the past, these services were not available or not easy to set up for individuals or small teams. Now, thanks to the popularity of cloud computing, this kind of service can be easily obtained from public cloud computing platforms or a private cloud and integrated into a local machine. The tool itself can then focus on its unique functions, and users can use their own local tools to manage data and projects. Moreover, as public cloud services become more popular, it is possible to develop, distribute, and evaluate a web-based application through public services to enable community engagement.

Another benefit comes from the evolution of HTML5 and modern web browsers. With the development of HTML5 techniques, the capabilities of modern web browsers have grown considerably. In particular, features that were once only available through web plugins such as Adobe Flash, Microsoft Silverlight, and Java Applets (e.g., local storage, complex visualization, NLP, machine learning algorithms, and in-memory databases) are now embedded in modern web browsers as default abilities or available through public content delivery networks. These improvements greatly empower the development of comprehensive web-based applications. As a result, it is possible to build better tools on these improvements to save time for users.

Therefore, we designed and implemented MedTator as a serverless frontend web application, which can run on public cloud services such as GitHub Pages. All the packages needed can be loaded from public CDN services. Moreover, installing MedTator on a user’s own server is optional and can be as simple as a few clicks on web pages.