Automating Manual Data Extraction

At Rearc Data, we build a wide variety of data pipelines from hard-to-process sources. These include PDF reports, web dashboards, and free text. Often, there is some pattern or visual guide that enables us to extract data cleanly. Other times, the data we want is entirely unstructured in that context, and there's no "right" way to extract it.

To illustrate this problem, let's use a real-world use case. One of our Health and Life Sciences (HLS) datasets pulls weekly influenza data from a PDF report published by a Spanish agency. Some of the data we want from this report is semi-structured as a table, but some is just plain text that gets re-written each week. Extracting the table is hard but doable.

Obtaining structured data from the free text is not as simple. The format of this subtype information changes from week to week, as does the phrasing used to describe it. For example, phrases like "..., all of which are type A," and "..., of which 7 are type A and the rest are type B" both appear frequently.

To produce reliable data despite these inconsistencies, we needed a solution that is correct, correctable, explainable, and timely. This article explains how we achieved just that.

First attempt: Traditional NLP

The best solution is often the simplest. There are at least three typical ways to structure information in free text.

The first approach is regular expressions. For example, we can define an expression like:

regex = re.compile(r'of which (\d+) were type A')

This expression allows us to find phrases that match that pattern, where (\d+) means "capture a sequence of digits". We can then retrieve the exact number that was mentioned and use that as the number of Subtype A flu cases.
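As a minimal, runnable sketch, applying the pattern and retrieving the captured number looks like this (the sentence is illustrative):

```python
import re

# Capture the digit sequence following "of which ... were type A"
regex = re.compile(r'of which (\d+) were type A')

sentence = "We detected 12 cases in total, of which 7 were type A"
match = regex.search(sentence)
count = int(match.group(1))  # -> 7
```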

While the above example is trivial and our real expression was much more sophisticated, regular expressions are generally very rigid and get complicated when adapting to wildly varied texts. Some of our other data pipelines use regular expressions for text that doesn't vary much from week to week; in this case, we needed something more flexible.

Without diving too far off the deep end, we tried both structured grammars and more traditional NLP tools like spaCy and NLTK. While we were able to build solutions that worked for all historical data, these approaches kept failing as new reports were published and the relevant text was structured in ever more creative ways. Ultimately, we needed something more intelligent and flexible.

Second attempt: LLMs and modern NLP

The rise of large language models (LLMs) offered a new approach. LLMs take loosely structured text as input and can be instructed to generally produce machine-readable outputs. This has proven to be incredibly flexible when the LLM is trained on a wide array of tasks, adapting easily to new kinds of tasks when provided instructions or examples of what it should do.

For example, we could pass OpenAI's GPT model some general task instructions, a few hand-annotated examples with solutions, and then request that it perform the same annotation on a new input and format the result as JSON. Done carefully, this works a surprising amount of the time and is easily tuned by simply improving the instructions, simplifying the output model, or providing more or better examples.
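As a sketch of that pattern, a few-shot chat prompt can be assembled like this (the instructions and worked example are illustrative, and any chat-completion client could consume `messages`):

```python
# Hypothetical few-shot prompt: system instructions, one worked example,
# then the new input to annotate. The model is asked to reply in JSON.
instructions = (
    "Extract each flu subtype and its count from the sentence. "
    'Respond only with JSON like {"subtypes": [{"name": "...", "count": 0}]}.'
)

examples = [
    ("Of the 20 cases, 12 were A(H1N1) and 8 were B.",
     '{"subtypes": [{"name": "A(H1N1)", "count": 12}, {"name": "B", "count": 8}]}'),
]

messages = [{"role": "system", "content": instructions}]
for text, answer in examples:
    messages += [{"role": "user", "content": text},
                 {"role": "assistant", "content": answer}]
messages.append({"role": "user", "content": "All 50 cases were type A."})
```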

We built our first prototype using the excellent Marvin library, using GPT3.5-turbo as the backing LLM. The resulting code looks something like this:

from marvin import ai_model
from pydantic import BaseModel

class FluSubtype(BaseModel):
    name: str
    count: int

@ai_model
class FluSubtypeInfo(BaseModel):
    """Given a sentence describing what flu subtypes were detected,
    parse out the name and count of each subtype."""
    subtypes: list[FluSubtype]

FluSubtypeInfo("In week 20 of 2023, we detected 50 cases of influenza, "
               "all of which were type A [30 A(H1N1), 15 A(H2N2)]")
# Output: FluSubtypeInfo(subtypes=[
#     FluSubtype(name='A(H1N1)', count=30),
#     FluSubtype(name='A(H2N2)', count=15)
# ])

Clearly, for the amount of code written, this result is mind-bogglingly good. It's trivial to adjust the code to look for total positive tests, limit the flu subtype names to known possible values, and more. We've written effectively no code, and we're relying on an LLM with famously good performance. Because it calls an external API, we don't need any heavyweight dependencies or local compute power; under the hood, we're just sending and receiving JSON over a regular web connection. The approach is cost-effective and runs in only a few seconds.

So what's the catch? Well, as with any ML system, we have no way of knowing when it fails. Sure, we could use a second LLM-based function to validate the output of the first. That's not a bad idea in general, but there's still a good chance that both functions let the same mistake through. Moreover, the same LLM might make different mistakes each time you run it, and if the model is being continually trained and updated, you might get different responses on different days.

In our tests, depending on how we phrased the problem instructions and validated the responses, we could get >90% accuracy with GPT3.5 and >98% accuracy with GPT4. Unfortunately, that's not good enough: our customers expect 100% accuracy.

So the LLM-based function is fast, cheap, easy to develop, works most of the time, but isn't quite accurate enough. If even the most advanced AI of the day isn't reliable enough, then what is?

Third attempt: Humans

People are very good at solving unstructured problems, but relative to code or LLMs, they're expensive, difficult to involve, and slow. We needed a way to involve people in our data pipelines without reinventing the wheel or taking forever. We chose AWS's SageMaker Augmented AI (A2I) service -- which doesn't technically require an AI to be in the loop -- as an on-demand way to submit work to human annotators. After trialing many options, we chose a dedicated third-party vendor to provide the annotation services. As a backup, A2I also lets our own engineers provide annotations via the same interface, either for testing or for high-priority dataflows.

Like AI, people are inconsistent. Any single human is likely to surpass the accuracy we were getting from LLMs, but people still aren't perfect. Since different people are likely to make different mistakes from each other, a common solution is to trust and verify: use multiple annotators and then reconcile their answers (using some sort of "consensus" algorithm). This increases the cost substantially but helps us achieve the 100% accuracy we're aiming for.
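A minimal sketch of such a consensus step, assuming exact-match majority voting over annotator answers (real responses are structured and may need fuzzier reconciliation):

```python
from collections import Counter

def consensus(responses, min_agreement=2):
    """Return the answer given by at least `min_agreement` annotators, else None."""
    answer, votes = Counter(responses).most_common(1)[0]
    return answer if votes >= min_agreement else None

consensus(["A(H1N1)", "A(H1N1)", "A(H3N2)"])  # -> 'A(H1N1)'
consensus(["A(H1N1)", "A(H3N2)", "B"])        # -> None: no agreement, escalate
```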

Involving humans in the loop, though, introduces timing difficulties. We don't produce enough tasks to justify a 24/7, always-available annotation staff. The annotators available through our vendor work normal business hours in a few different timezones, and we aren't their only, or even one of their bigger, customers. We can't count on getting annotations quickly; 1-2 business days is a realistic turnaround. So we designed our dataflows to be highly asynchronous: each run submits any new work, checks for any finished results, and leaves outstanding tasks untouched. A single pipeline run doesn't necessarily produce a complete dataset; it just reconciles our final dataset with whatever inputs and outputs are currently available from the human annotators' queue.
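That asynchronous behavior can be sketched as a single idempotent pass; the dictionaries here are stand-ins for our real task store:

```python
def run_pipeline_pass(inputs, statuses, results):
    """One idempotent pipeline pass: submit new work, fold in finished
    annotations, and leave pending tasks untouched until a later run."""
    dataset = {}
    for item in inputs:
        state = statuses.get(item)
        if state is None:
            statuses[item] = "pending"     # new work: enqueue for annotators
        elif state == "completed":
            dataset[item] = results[item]  # finished: include in the dataset
        # "pending" items simply wait for a future run
    return dataset
```

Each run produces as much of the dataset as is currently available, and re-running never duplicates outstanding work.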

Ok, so now we have a solution that's expensive and slow, but almost certainly correct. How do we make this faster, cheaper, and even more likely to be correct?

Fourth attempt: Humans and AI

LLMs make mistakes because they're incompetent. Humans make mistakes because, relative to computers, we're lazy, biased, and slow. (Some annotation marketplaces suffer from fraud as well, but using a trusted vendor largely eliminates this risk.) Whichever approach you take, you need checks and balances.

One easy solution is to use these two approaches to check and balance each other. If you want a quorum of three, you could use three humans with an LLM checking their answers, or two humans and an LLM all attempting the same task, or any combination of those options. In the former case, the LLM's check decides which human responses to keep or discard before they reach your consensus algorithm. In the latter, the LLM is treated as a peer of the humans, and all three results go into the consensus algorithm. LLMs are very likely to make different mistakes than humans, so the odds are good that where the LLM fails, the humans will succeed, and vice versa.

If the LLM is solving the same task anyway, yet another option is to use its results to prompt humans. If the LLM performance is generally good and you can trust your annotators to really review the LLM's guesses and not just take them for granted, then this can be an excellent way to save people's time. Run the LLM first, take its results, provide them to the annotators as a default response, have the annotators fix any mistakes the LLM made, and then perform your final consensus. This does bias your annotators, and it relies on them double-checking the AI's work consistently, but it can also save them a huge amount of time if the LLM generally does well. Trade-offs all around.
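As a sketch of that prefill step, the LLM's draft can travel with the task so the annotation form presents it as a default (the field names here are hypothetical, not the actual A2I task schema):

```python
def build_task(sentence, llm_draft):
    """Bundle the input sentence with the LLM's draft answer so the
    annotation form can present the draft as a default response."""
    return {
        "text": sentence,
        "default_response": llm_draft,  # annotators correct this rather than start fresh
    }

task = build_task("All 50 cases were type A.",
                  {"subtypes": [{"name": "A", "count": 50}]})
```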


The use case we've been working with is just one example. In reality, we want to use this human + AI workflow for a wide variety of tasks. That means the tooling needs to be reusable, and it needs to be easy to add new kinds of tasks, preferably without maintaining a centralized list of all such tasks. That need for generality guided our implementation. If you have a smaller number of larger use cases, a different approach may be better.

First, we needed to ensure we had all our services and credentials lined up. That included working with our third-party annotator to ensure we were on the same page about volume, task complexity, etc. We also needed access to the LLMs we wanted to use: in our case, OpenAI's APIs, though other use cases may require self-hosted models or alternative services.

Second, we needed interfaces: the developer interface, the annotator interface, and the AI interface.


Development Interface

For the developer interface, we took inspiration from Prefect Marvin, which we showed earlier. In Marvin, the AI takes on the role of a data extractor or some other unknown function: it takes instructions and a particular input, and generates a structured output. The interface is incredibly simple:

from marvin import ai_fn

@ai_fn
def convert_text_to_number(textual_number: str) -> int:
    """Given a number as text (e.g., "five"), convert it to an integer (e.g., 5)."""

convert_text_to_number("two hundred thirty one") # Output: 231

We borrowed this same interface for our new use case. Because the execution is not as straightforward, there's a little more configuration. Our result looks like this:

@human_ai_fn
def convert_text_to_number(textual_number: str) -> int:
    """Given a number as text (e.g., "five"), convert it to an integer (e.g., 5)."""

The output type is generally not a primitive; instead, we use pydantic to define the data model for the response. This is quite useful in a variety of ways:

  • It lets us apply validations, transforms, and much more to the response data.
  • It provides a structured way for describing the response schema, from which we can generate an HTML response form, including javascript-powered dynamic elements.
  • It lets us easily serialize and deserialize responses, e.g., using JSON, so responses can be stored and loaded from files.
  • It lets us upgrade our response model in minor ways without losing the ability to load pre-existing responses.
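The serialization point in particular can be sketched as a round trip through JSON (pydantic v2 API assumed):

```python
from pydantic import BaseModel

class FluSubtype(BaseModel):
    name: str
    count: int

class FluSubtypeInfo(BaseModel):
    subtypes: list[FluSubtype]

info = FluSubtypeInfo(subtypes=[FluSubtype(name="A(H1N1)", count=30)])
serialized = info.model_dump_json()            # store alongside the pipeline run
restored = FluSubtypeInfo.model_validate_json(serialized)
```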

Annotation Interface

The human annotators need an interface to use too. We used AWS Sagemaker A2I to coordinate delivering work to our annotators, but A2I still requires that we provide the HTML form the annotators will use. Since we're building this system to work for a variety of tasks, we ideally want HTML that is either dynamically generated up-front or dynamically generated client-side (the latter has the advantage of supporting dynamic fields, e.g., submitting an unknown number of flu subtype responses). This dynamic generation can be driven by our pydantic model, e.g., by exporting our model in JSON Schema format. Given that JSON Schema, there are various tools that can render that out as a dynamic web form.
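For instance, the same pydantic model can emit the JSON Schema that drives such a form (pydantic v2 API assumed):

```python
from pydantic import BaseModel

class FluSubtype(BaseModel):
    name: str
    count: int

# schema["properties"] describes each form field, e.g.
# {"name": {"title": "Name", "type": "string"},
#  "count": {"title": "Count", "type": "integer"}}
schema = FluSubtype.model_json_schema()
```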

LLM Interface

Finally, the AI needs an interface. Interacting with LLMs isn't just about networking, authentication, and SDKs. If you want structured data back, you need to give the model clear instructions about how to respond, and then you need good parsing and validation of that response. LLMs are, at their core, text completion systems, and getting a glorified autocompleter to solve a problem in a machine-readable way is its own challenge, prompt engineering included. Fortunately, Marvin does most of this for us, so we reuse Marvin and its mechanisms and treat this problem as solved.

The Final Result

When all's said and done, what do we have?

A developer finds an extraction task within a larger data pipeline that they can't practically solve with regular code but that they are expected to solve accurately. So they describe that step of the process, define the inputs and outputs of that step, and wrap that as a @human_ai_fn. They then use this function like they would any other function within their larger data pipeline. The function either returns the consensus result or a flag if there is not yet a consensus for the given input.

Getting to consensus is the variable and complex process we described above. A typical flow would look like:

  1. Submit the general instructions and specific input to the LLM.
  2. Take its responses and send them to 1-2 human annotators for validation.
  3. When all responses are available, consolidate them and compute the consensus.

It's critical that all intermediate data gets stored and that later artifacts can be re-generated from earlier ones. There are several reasons for this:

  • The AI and people cost money to involve, and shouldn't be tasked with the same work more than once.
  • The AI and people take time to respond, and may provide different responses each time. It's important to remember what responses were received to avoid unnecessary changes downstream.
  • The final results need to be auditable and correctable. It is convenient to implement this by allowing the injection of corrections into the Human + AI pipeline.
  • The data model for the output should be able to be upgraded without discarding pre-existing responses, for all the reasons listed above.

The implementation should manage its own infrastructure within AWS, pick up wherever previous executions left off, and avoid restarting any work that may be in progress. The consensus algorithm should be tailored to each problem, but for the kind of extraction tasks shown in the example above, it's convenient to have a general-purpose algorithm that can resolve small differences without discarding entire responses.
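One way to sketch such a general-purpose algorithm is field-wise majority voting, so a disagreement in a single field doesn't discard an annotator's whole response (exact-match voting shown; real reconciliation may be fuzzier):

```python
from collections import Counter

def fieldwise_consensus(responses):
    """Take the majority value for each field independently."""
    return {
        key: Counter(r[key] for r in responses).most_common(1)[0][0]
        for key in responses[0]
    }

fieldwise_consensus([
    {"name": "A(H1N1)", "count": 30},
    {"name": "A(H1N1)", "count": 30},
    {"name": "A(HlN1)", "count": 30},  # one annotator's typo is outvoted
])
```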

The result is a nearly zero-effort process for integrating humans, AIs, or both into an otherwise automated data pipeline to perform tasks that couldn't be automated any other way. The developer can decide on a task-by-task basis what their performance requirements are (e.g., using AI only if 100% accuracy is not a requirement) and tune the instructions and data model to be as simple or as complex as the AI and humans can handle.

In some cases, if latency is not a concern or the problem's scale is small, it may be faster, cheaper, and easier to handle simple tasks with humans and AIs than to attempt to automate them with traditional engineering. There may also be other kinds of AI better equipped for a particular problem (e.g., translation or named entity extraction) that can be integrated into this process to improve performance or handle other data modalities (images, PDF files, web links, etc.).

There are innumerable tasks that, at small enough scales, are more efficiently solved by AIs and manual annotation than by complex, robust data engineering. Making it easy to obtain those manual annotations opens another door for our data engineers to pick the best tool for each job and offer simple solutions to simple problems.

Going back to our example use case, once this tooling was in place, we were able to send dozens or hundreds of Spanish sentences (along with auto-translated English equivalents) to GPT4 and human annotators and transform the responses into a high-quality, 100% accurate step in our larger data pipeline. This let us replace several failed automation efforts, which had consumed months of expensive developer time, with a few minutes of development time and a couple hours of human annotator effort, while retaining the ability to handle dramatically larger volumes with ease. Moreover, the same tooling means we can achieve this simplicity not just for one problem but for any number of future problems as well.


In this post, we presented an example problem that was easy to solve manually but hard to automate and showed the various steps we took to turn that entire class of problems into an easily integrated capability within a larger, otherwise-automated data pipeline. We showed how this could be simple for developers to use while conforming to various constraints and choosing from a menu of trade-offs.

At Rearc Data, we strive to provide high-quality data products. In some cases, that involves working with sources that were never intended to be machine-usable. While manual annotation may seem like an expensive solution or a waste of people's valuable time, in many cases and at many scales, it's actually far more efficient than automation, freeing up not just funds but human effort for problems that are more worth automating. Having the ability to collaborate with human annotators within a larger automated data pipeline lets us achieve both high quality and cost-efficiency.
