
UPTEC K 20001

Degree project, 30 credits. January 2020

Automated analysis of battery articles

Robin Haglund


Faculty of Science and Technology, UTH unit

Visiting address:

Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0

Postal address:

Box 536, 751 21 Uppsala

Telephone:

018 – 471 30 03

Fax:

018 – 471 30 00

Website:

http://www.teknat.uu.se/student

Abstract

Automated analysis of battery articles

Robin Haglund

Journal articles are the formal medium for the communication of results among scientists, and often contain valuable data. However, manually collecting article data from a large field like lithium-ion battery chemistry is tedious and time-consuming, which is an obstacle when searching for statistical trends and correlations to inform research decisions. To address this, a platform for the automatic retrieval and analysis of large numbers of articles is created and applied to the field of lithium-ion battery chemistry. Example data produced by the platform is presented and evaluated, and sources of error limiting this type of platform are identified, with problems related to text extraction and pattern matching being especially significant. Some solutions to these problems are presented and potential future improvements are proposed.

ISSN: 1650-8297, UPTEC K 20001. Examiner: Peter Broqvist. Subject reader: Pavlin Mitev. Supervisor: Jonas Ångström


Popular science summary

Energy storage plays a central role in the fight against climate change. Rechargeable batteries are used as components in everything from mobile phones and electric vehicles to entire power grids, and the transition to renewable energy sources depends on the development of greener, more efficient and cheaper batteries.

Battery research aims to improve existing batteries and develop new ones, often by modifying or replacing the materials used. A major problem, however, is that batteries are complicated and unpredictable systems, and a small change in one component can have a large positive or negative impact on the battery's properties.

Since the battery field is large and active, many results are published in scientific articles, and these could in principle be used to find promising materials and techniques. In practice, however, these hundreds of thousands of articles are far too many to realistically be read by a human.

In this work, a system was therefore developed to automatically retrieve and analyze large numbers of articles. Articles are downloaded based on search terms, and interesting articles are identified using machine learning-based text analysis. In these articles, sentences are then categorized based on their content, and data such as battery capacity and electrode materials are extracted together with metadata such as publication year and other information identifying the article.

Since this information is stored in a standardized format, it can easily be used by different programs for statistical analysis. Data in different categories can, for example, be plotted against each other to find various kinds of correlations and trends, which can be used to optimize parameters or to find statistical outliers among published articles. By analyzing data from many articles at once, the hope is to see patterns that are not visible from the relatively few articles individual researchers read.

The work in this thesis mainly aimed to investigate the practical viability of this type of system and to identify strengths and weaknesses in the techniques used. The concept was demonstrated to work in practice and seems promising, but is limited by, among other things, problems with extracting text from PDF files and with matching certain types of data in text. These problems are expected to be at least partly remediable through the use of more advanced solutions, e.g. systems based on artificial neural networks.

Beyond this, the problems encountered in the project also point to the value of publishing data in a standardized and easily read format. This is already often done in various fields, but easier access to results, experimental parameters, materials, instrument settings and similar information would both facilitate this type of analysis and make it easier for readers to reproduce published experiments.


Table of Contents

Background
  1. Lithium-ion batteries
  2. LFP
  3. Battery research
  4. Article analysis
  5. Parameter selection
Purpose
Description
  1. Fundamentals
    Programming language
    Journal article access
    Libraries
  2. Collecting articles
    Searching
    Downloading
  3. Parsing
    File extensions
    PDF
    XML
    Further processing
  4. Classification
    Machine learning
    Article classification
    Sentence classification
  5. Data extraction
    Regular expressions
    Numeric values
    Exporting data
    Metadata
  6. Example run
Results
  1. LFP data
  2. General Li-ion battery data
  3. Reliability
Discussion
  1. Platform usefulness
  2. Machine-readable appendices
Conclusion
Acknowledgements
References
Appendix: Practical notes
  Obtaining training data
  Rate limits
  Limits of pattern matching
  Data structures
  Context of extracted values
  Probabilistic classification
  Degenerate categories
  Parallel processing
  Memory
  DOI collision
  Full text vs. abstract
  One-class SVMs


Background

1. Lithium-ion batteries

Rechargeable batteries are an important part of modern society, powering everything from handheld electronics to entire power grids. The search for cheaper, greener and more efficient energy storage solutions is a critical part of the ongoing transition away from fossil fuels. Large-scale deployment of renewable energy sources is reliant on new energy storage solutions to handle the natural fluctuations in the power output of sources such as wind and solar, as well as to store the energy used in electric vehicles.

Figure 1. LiFePO4 battery, schematic representation. [1]

Lithium-ion batteries represent the majority of rechargeable batteries sold today. As shown in the example in Figure 1, these consist of a negative electrode (anode) and a positive electrode (cathode) immersed in an electrolyte. Electrical energy is stored by applying a voltage to insert lithium ions into the anode and released by allowing them to flow to the cathode.

The cathode of a Li-ion battery generally consists of a lithium-metal compound, like LiCoO2 (LCO) or LiFePO4 (LFP), while the most common anode material is graphite, although other materials like silicon and Li4Ti5O12 (LTO) are also used. The electrolyte consists of some lithium salt, often LiPF6, dissolved in one or several organic solvents like ethylene carbonate (EC) or vinylene carbonate (VC).

In addition to these basic components, the battery also contains a semipermeable polymer film dividing the two electrodes and two current collectors acting as electric contacts for the electrodes, although these are not normally electrochemically active.

It is often helpful to use figures of merit to quantify important parameters. Some important ones for batteries include specific energy (Wh/kg), specific power (W/kg) and cycle durability (cycles until 80% of capacity remains) – these are usually listed prominently on battery product pages.

Similar figures are also used for their constituent materials, especially in research, where important examples include the specific capacity (mAh/g) and current density (mA/g) of electrode materials.

2. LFP

LFP – lithium iron phosphate, LiFePO4 – is a lithium-ion battery cathode material introduced in 1996 [2], and is the main material studied in this thesis. LFP batteries have fairly average specific energy and power compared to other cathode materials, but possess a number of competitive advantages:

LFP has a longer cycle life and calendar life than other Li-ion batteries, tolerates storage at full charge well, and displays slow self-discharge [3]. These properties make it especially suitable for applications like grid energy storage and backup power, and its longer life reduces the total cost of operation in all applications by requiring less frequent replacement.

Battery voltage remains largely flat during discharge [4], allowing it to deliver about the same power for most of its discharge. A flat voltage can also help simplify and economize the design of devices powered by LFP batteries by eliminating the need for voltage regulation circuitry.

The cheap and non-toxic elements used help reduce cost and environmental impact compared to other popular materials like LCO. The fact that it does not rely on scarce elements, like LCO's cobalt, is also an advantage from a geopolitical perspective, as the supply of such elements is often controlled by one or a few countries.

Finally, the 3.2 V nominal voltage of a typical LFP cell means four of them can be connected in series to produce a 12.8 V battery, which is close enough to automotive lead-acid batteries to allow vehicles to benefit from LFP’s long life and other advantages [5].

The main disadvantage of LFP is its low conductivity, which severely limits current density [6]. However, this can be remedied through carbon-coating, particle size reduction, doping, or some combination thereof [7]. This has allowed LFP to become commercially viable, and today LFP batteries represent about 10% of the rechargeable battery market [8].

3. Battery research

Solving LiFePO4's conductivity problem is a prime example of applied battery research. This field aims to understand and refine the behavior of batteries to better meet the needs of society.

The challenges addressed can often be seen as optimization problems where one or several figures of merit – current density in the above case – are optimized while trying to avoid adversely affecting others.

Another example of battery research is modeling and investigating the solid-electrolyte interphase (SEI) [9] to better control its growth and structure and reduce battery capacity loss over time. This is not an optimization problem in itself, but the success of different solutions can be evaluated by measuring cycle durability.

As batteries are quite complex systems it is difficult to predict what effects a given change will have on their parameters, meaning it is often hard to find the materials, synthesis techniques etc. needed to achieve a specific change. There is a wealth of published scientific articles which could theoretically be used to help inform these decisions by indicating promising research directions or even delivering complete solutions. The problem is that published battery articles are far too numerous to be read by any one researcher, which this thesis project aims to get around by creating a system to automatically read and analyze large numbers of battery articles for use in research.

4. Article analysis

The idea of collecting data from many journal articles is not new. Meta-analyses and review articles have long allowed readers to quickly get an overview of the state of a field [10], but producing these takes large amounts of time and effort, and it may not be possible to conduct a full-scale literature study each time some aspect of a field is to be investigated.

In the last few decades the availability and accessibility of scientific articles has been greatly improved thanks to internet-based publishing, and powerful machine learning- based text processing algorithms combined with increased computing power and user- friendly programming libraries have opened up new methods for mass text analysis.

These tools allow for the relatively easy creation of systems to automatically find, download, parse and analyze articles, extracting and compiling both numeric and textual data. This can then either be directly presented to the user or subjected to different types of statistical analysis to find trends and correlations as well as statistical outliers.

As this problem involves parsing article text to extract information, it needs some form of natural language processing (NLP) system.

NLP is a field that is central to many modern computer systems such as search engines, digital assistants, autocomplete algorithms and many other applications dealing with human-language text, but is also notoriously hard.

Ideally an NLP system for article analysis would be able to comprehend and summarize text as reliably as a trained human, except much faster. There are advanced NLP systems, for example AllenAI's Aristo [11], which use trained artificial neural networks (ANNs) to parse scientific text and answer specific types of questions, such as those found on standardized exams.

Such a system could theoretically be applied to battery articles to allow data to be extracted simply by posing queries like

“What is the anode material of the battery?”

instead of using more explicit methods.

However, implementing and training such a system requires much more time, expertise and training data than could reasonably be expected of a thesis project. For this reason, a simpler solution based on support-vector machine (SVM) classification is used.

SVMs are computer models that divide data into classes based on training data. A pre- trained SVM can sort articles and sentences into categories. Based on these categories, different forms of regular expression-based text matching can be applied to the categorized text to obtain different types of data.

A weakness of this type of system is that it does not actually understand the text it is processing in any meaningful sense – it is just matching patterns, not comprehending textual meaning. It cannot judge whether its own output is reasonable, so the system's output needs sanity checking and cleanup.

Patterns to be matched also need to be explicitly programmed rather than trained based on examples, which takes significant time and effort as well as some knowledge of battery chemistry.

5. Parameter selection

As mentioned, there are many different figures of merit and other parameters that can be extracted from battery articles, each containing information about some aspect of the battery. As the extraction of each requires work to implement and maintain, it is necessary to select ones that provide as much useful information as possible relative to the work needed. Based on this, the parameters chosen for extraction in this project can be seen in Table 1. These selections were largely based on informal questioning of local battery researchers about which parameters they considered the most important in a battery.

Table 1. The 15 selected parameters.

Category name          Example data
Anode material         graphite
Specific capacity      90 mAh g-1
Current density        30 mA g-1
Capacity retention     73%
Number of cycles       250
Electrolyte            DEC
Space group            Pmm2
Particle size          150 nm
Heat treatment         250 °C
Cell type              CR2032
Lattice parameters     a = 2.461 Å
Binder                 PVDF
wt% active material    69%
Composition            LiFePO4
Synthesis procedure    (not extracted)

The most significant of these parameters is probably specific capacity, representing the amount of charge that can be stored per unit mass, or sometimes area, of electrode material. As this is often the primary figure of merit for an electrode material, effort needs to be spent to ensure that it is extracted as accurately as possible.

Attention must be paid to units, to distinguish between mass-specific (mAh/g) and area-specific (mAh/cm2) capacity values. Different SI prefixes also appear, most frequently m (milli-), and must be handled to normalize values so that they can be accurately compared. Finally, it is also important to remember that specific capacity cannot be used to directly compare different materials, as specific energy, which is usually what one actually wants to compare, also depends on the operating voltage of the battery.
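The unit handling described above can be sketched with a regular expression. This is an illustrative sketch, not the platform's actual pattern: the regex, the prefix table and the choice to normalize to Ah/g are all assumptions made for the example.

```python
import re

# Illustrative pattern: a number followed by a mass-specific capacity unit
# written as mAh/g, mAh g-1 or similar. Requiring the gram denominator
# excludes area-specific values such as mAh/cm2.
CAPACITY_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(m?)Ah\s*(?:/\s*g|g\s*[-\u2212]1)")

PREFIX_FACTOR = {"": 1.0, "m": 1e-3}  # SI prefix factors (extend as needed)

def capacities_in_ah_per_g(text):
    """Extract mass-specific capacities from text, normalized to Ah/g."""
    return [float(value) * PREFIX_FACTOR[prefix]
            for value, prefix in CAPACITY_RE.findall(text)]
```

Note how "30 mA/g" (a current density, no "h") is not matched, so the two units the text warns about are kept apart by the pattern itself.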

Crystal space groups, on the other hand, are very easy to extract as text can simply be matched against a list of the 230 possible space groups [12]. Space group information may be useful as supplemental information in certain analyses, for example to confirm which material is being discussed, and was included as the performance and work overhead of doing so is negligible.
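Matching against a fixed list of symbols might look like the sketch below; only a handful of the 230 Hermann-Mauguin symbols are included here for illustration, and the symbol spellings in the list are assumptions for the example.

```python
import re

# A handful of space-group symbols, for illustration; the real list
# used for matching would contain all 230 [12].
SPACE_GROUPS = ["Pnma", "Pmm2", "P2/c", "Fd-3m", "R-3m", "I4/mmm"]

# Longest symbols first, so a longer symbol is not truncated by a
# shorter alternative that happens to be its prefix.
_SG_PATTERN = re.compile(
    "|".join(sorted((re.escape(sg) for sg in SPACE_GROUPS), key=len, reverse=True))
)

def find_space_groups(sentence):
    """Return all known space-group symbols appearing in a sentence."""
    return _SG_PATTERN.findall(sentence)
```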

Synthesis procedure, while important, is by far the hardest of the listed parameters to extract. This hypothetical parameter represents the steps used to produce the materials and assemble them into a battery. This information would be useful both for identifying important experimental parameters and for improving reproducibility, but it is far too complicated to extract in any meaningful way with the primitive pattern-matching techniques used in this platform, and as such was not included in the final version of the project. In the future, much more powerful systems, probably ANNs, may be able to understand article text well enough to produce synthesis information structured as a flowchart or similar, listing steps with associated information as well as how they relate to other steps – assuming the article actually contains the information needed for reproducibility.

Purpose

The goal of this thesis project is to develop, test and evaluate an AI platform for the mass collection of chemically relevant data from journal articles in the field of battery chemistry, to be used to inform research decisions and enable large-scale statistical analysis. The platform will be tested by applying it to both LiFePO4 and general lithium-ion battery articles.

Extracted data will be presented and discussed, and the platform as a whole will be evaluated in terms of both current usefulness and future potential. The extracted data will be analyzed with the purpose of evaluating whether the output of the platform is reasonable.

Major problems and sources of error limiting the accuracy and performance of this type of platform will be identified and discussed, and some potential solutions will be proposed.


Description

1. Fundamentals

Any large project involves some important early choices that affect both its results and the development process. These choices should be made carefully, since they directly impact the capability and results of the final implementation.

Programming language

Python was used because of its user-friendly syntax, large number of libraries, comfortable package management via Anaconda [13], plentiful documentation, and widespread use. JupyterLab was used as the IDE1 due to its convenient features and Anaconda integration.

The R language was used in the downloading module out of necessity, but as it is not particularly well-suited as a general programming language it was not used elsewhere.

Computational performance was not deemed critically important for this project, so a more high-performance language like C++ was not necessary.

Journal article access

Manually downloading thousands of articles is very time consuming. The most efficient method of fetching articles is direct downloading through provider APIs2. Ideally articles would be fetched from all available providers at once, but as providers use different APIs and procedures this was not feasible within the scope of this project.

Thus, one single provider had to be chosen to work with. This selection was influenced by a number of factors:

1 Integrated Development Environment: An application used for software development.

❖ Number of articles available in the field of battery chemistry

❖ Availability of full-text articles

❖ Available libraries and documentation

❖ Ease of API interaction

❖ Download rate limits

Based on these factors, Crossref was chosen. It has a generous rate limit of 50 requests per second with no weekly or monthly limit, many relevant articles, and a number of libraries available to facilitate API interaction. It also has excellent documentation [14].

Libraries

Libraries, roughly speaking, contain code that can be used for some specific purpose, in this case interaction with Crossref’s API.

Initially the library rcrossref [15] was selected based on its features and performed well, although it was eventually replaced by its parent library fulltext [16], which attempts to collect means of interacting with various providers in one package. By building the platform around fulltext it was hoped that expanding to support more providers would be easier.

fulltext is written in R, so to avoid having to use R for the rest of the platform it was run through the R-to-Python library rpy2 [17]. Getting fulltext to work through rpy2 was complicated by the fact that it did not use the same R kernel as JupyterLab, which made installing libraries somewhat difficult.

2 Application Programming Interface: A protocol for simplifying communication between a server and its client.

2. Collecting articles

The first task of the platform is to retrieve the articles that are to be analyzed. This process consists of searching for articles, saving the DOIs and metadata of the hits, retrieving download links based on those DOIs and then downloading articles using those links.

Searching

fulltext has its own search module, but it proved easier to do the searching manually than to handle fulltext’s search results. The downside of this is that the search module will require additional work to be adapted for use with other providers, as each provider has their own API.

Searches are performed through HTTP queries to Crossref's REST1 API. A query is sent containing one or more search terms, and a JSON2-format list of up to 1000 search results is returned, containing various metadata as well as a cursor pointing to the next page of results. This cursor can then be attached to the next search query to return the next page of results, analogous to clicking "next" in a browser. The cursors are followed all the way to the end of the search results, saving each list of results along the way. Finally, the DOIs from each page of results are extracted and combined into a single list, which is used to download articles.

The metadata, for example publication dates and titles, is left with the original search results for later use.
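The cursor-following loop can be sketched as below. The paging logic is separated from the network call so it can be tested offline; the endpoint URL and field names follow Crossref's public works API, but `rows`, `max_pages` and the helper names are illustrative assumptions, not the platform's actual code.

```python
def collect_dois(fetch_page, max_pages=50):
    """Follow deep-paging cursors: fetch_page(cursor) -> (dois, next_cursor)."""
    dois, cursor = [], "*"  # "*" asks Crossref to start a fresh cursor
    for _ in range(max_pages):
        page, cursor = fetch_page(cursor)
        if not page:
            break  # an empty page means the cursor is exhausted
        dois.extend(page)
    return dois

def crossref_fetcher(term, rows=1000):
    """Build a fetch_page function that queries the live Crossref works endpoint."""
    import requests  # only needed for live queries

    def fetch(cursor):
        resp = requests.get(
            "https://api.crossref.org/works",
            params={"query": term, "rows": rows, "cursor": cursor},
        )
        resp.raise_for_status()
        msg = resp.json()["message"]
        return [it["DOI"] for it in msg["items"] if "DOI" in it], msg["next-cursor"]

    return fetch
```

In use, `collect_dois(crossref_fetcher("LiFePO4"))` would walk all result pages for one search term.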

Since Crossref's search engine does not support the AND/OR/NOT operators found in many search engines, they are applied client-side as part of the platform. This is done by applying set operations to the sets of DOIs formed above. For example, "Prussian blue"3 AND "battery" is performed by taking the set-theoretical intersection of those two sets of search results, i.e. finding all DOIs that appear in the results for both search terms. This can be used both to reduce the number of articles to download and to eliminate many false hits – searching for "Prussian blue" alone returns many articles on the military history of Prussia, for example.

1 Representational State Transfer, a commonly used style of web service API using HTTP requests.

2 JavaScript Object Notation, a file format which stores data as attribute-value pairs, similar to Python's dict data type.
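The set operations described above map directly onto Python's built-in set type; the DOIs below are made up for illustration.

```python
# DOI result sets from two hypothetical searches
prussian_blue = {"10.1/a", "10.1/b", "10.1/c"}
battery = {"10.1/b", "10.1/c", "10.1/d"}

and_hits = prussian_blue & battery  # AND: intersection, matches both terms
or_hits = prussian_blue | battery   # OR: union, matches either term
not_hits = prussian_blue - battery  # NOT: difference, e.g. Prussian history articles
```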

Downloading

The downloading module passes article DOIs to fulltext which then fetches download links for the articles and retrieves them using those links. This worked well, but the download rate was unacceptably slow, taking up to a week to retrieve 10,000 articles.

To improve this situation, parallel downloading was implemented using R's parallelsugar [18] package. This package, normally used for parallel computation, allows several similar processes to run in parallel and is a convenient way to run multiple downloads at a time. The downloading module uses 24 parallel threads and can thus issue up to 24 simultaneous download requests.

Parallel downloading greatly increased the download speed, but also bypassed fulltext’s internal rate limiting as an unforeseen side effect. Normally fulltext would limit download rates to prevent downloads from exceeding the server-side maximum request rate, but the parallel download processes could not communicate with each other to detect that the sum of all download requests was in fact exceeding the rate limits.

3 A promising Na-ion battery cathode material.

Whenever Crossref's rate limits are exceeded, all connections are dropped. This means that downloads will frequently and mysteriously fail, and consistently exceeding the rate limits will eventually lead to being blocked from further downloading until the matter is resolved with Crossref's support staff.

To address this, more rigorous client-side controls were put in place to prevent the rate limits from being exceeded. Requests were bundled into batches smaller than the maximum number of requests per second, separated by a pause of one second. Although this throttling nominally decreased the download rate, the net effect was a significant acceleration of the download process: because connections were no longer being dropped, the actual download rate increased by as much as 200%.
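The batching scheme might look like the sketch below. The batch size of 40 and the one-second pause are the kind of values the text implies (staying under 50 requests per second), not the platform's exact settings, and `runner` is a hypothetical stand-in for a single download call.

```python
import time

def run_in_batches(tasks, runner, batch_size=40, pause=1.0):
    """Run tasks in batches kept below a per-second request limit.

    batch_size must stay under the provider's limit (50 req/s for
    Crossref); runner performs one task, e.g. one download request.
    """
    results = []
    for start in range(0, len(tasks), batch_size):
        for task in tasks[start:start + batch_size]:
            results.append(runner(task))
        if start + batch_size < len(tasks):
            time.sleep(pause)  # throttle so the sustained rate stays legal
    return results
```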

fulltext uses a cache to avoid re-downloading files. The platform uses this cache by leaving downloaded articles unmodified in the cache folder and instead copying them from the cache to the working directory as needed. It is important not to move or modify cached articles, as this will cause them to be re-downloaded due to no longer being recognized.

All of the mentioned improvements greatly speed up the download module, but it remains the slowest part of the platform. Formulating a search query that is as narrow as possible is key to reducing the time taken, and fetching more than a few thousand articles is an overnight job.

3. Parsing

The methods applied later in the platform are meant for plaintext data, but articles are downloaded as PDF or XML files which need to be processed before being used in later steps.

File extensions

Some PDF files are downloaded with the .xml file extension for unknown reasons. This is corrected using the fact that PDF files always start with the characters "%PDF" when read as plaintext. The first four characters of each .xml file are read, and the file is renamed if it is identified as a PDF, deleting any duplicates.
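The magic-byte check and rename can be sketched as follows; the function name and duplicate-handling policy are assumptions for the example, though the "%PDF" signature itself is standard.

```python
from pathlib import Path

def fix_pdf_extension(path):
    """Rename a mislabeled .xml file to .pdf if it starts with the PDF magic bytes."""
    p = Path(path)
    with open(p, "rb") as f:
        is_pdf = f.read(4) == b"%PDF"
    if not is_pdf:
        return p  # a genuine XML (or HTML) file: leave it alone
    target = p.with_suffix(".pdf")
    if target.exists():
        p.unlink()  # a correctly named copy already exists: drop the duplicate
        return target
    return p.rename(target)
```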

Another subset of .xml articles are actually HTML files. These are ignored since XML and HTML cannot be read by the same parser. A separate HTML parser could be implemented, but the number of HTML articles was judged to be too small to be worth the effort.

PDF

PDF (Portable Document Format) is something most people have come into contact with. It is a file format that focuses on visual presentation and tries to ensure that documents look the same on every system. This makes it excellent for human- readability, but the actual text content of these documents is difficult to access in an automated manner. The internal structure of a PDF consists of many individual elements positioned on the page to construct an image that makes sense to humans. Such a document is easy to parse visually but cannot be easily read as plaintext, as illustrated in Figure 2.

Fortunately, there are tools to deal with this.

pdfminer.six [19], which was used in this project, can extract plaintext fairly reliably. The main issue is related to margins. Because text is parsed based on the distance between elements on the page, the parser needs to distinguish the margins separating individual letters within a word from the whitespace between words. This behavior is controlled by margin settings which define the limits between the different types of margin, and incorrect margin settings may cause some sentences to end up with whitespace between e v e r y l e t t e r, or nowhitespace.

Figure 2. Part of a PDF file opened as plaintext – PDF is a binary format.

One way to prevent this would be to detect these issues in pdfminer.six output, for example by counting the proportion of single characters or very long words, and automatically adjust the margin settings used for that specific article.
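Such a detector could be as simple as the sketch below; the thresholds are illustrative guesses, and a real version would re-run pdfminer.six with adjusted margin settings (its `LAParams` char/word margins) whenever the check fails.

```python
def spacing_looks_broken(text, max_single_frac=0.3, max_word_len=25):
    """Flag extracted text whose tokens suggest bad margin settings.

    Many one-character tokens indicate "e v e r y  l e t t e r" spacing;
    an implausibly long token indicates missing whitespace. Both
    thresholds are illustrative, not tuned values.
    """
    tokens = text.split()
    if not tokens:
        return False
    single_frac = sum(1 for t in tokens if len(t) == 1) / len(tokens)
    return single_frac > max_single_frac or any(len(t) > max_word_len for t in tokens)
```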

It is also worth noting that PDF parsing is a computationally intensive task and typically takes about 2-4 seconds per document, which can add up to several hours of processing time when parsing many articles.

XML

XML (eXtensible Markup Language) is a file format that consists of a tree of elements with names and attributes that enclose text content, similar to HTML. It makes for highly structured documents that are significantly easier to parse than PDF. Part of a typical XML article can be seen in Figure 3.

However, the downloaded XML articles have many different structures, and identifying and implementing a solution to cleanly extract article text from each of them was not deemed feasible. As such, the XML parser simply extracts any non-element text string longer than 20 characters, separating strings by a single whitespace character, and saves the result as plaintext. This produces reasonable extraction of text, but does occasionally drop parts of sentences. This could be prevented by also capturing smaller strings, but that would also mean extracting more unwanted text.
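The extraction rule just described can be sketched with Python's standard library parser; the 20-character threshold is the one given in the text, while the function name is an assumption for the example.

```python
import xml.etree.ElementTree as ET

def extract_plaintext(xml_string, min_length=20):
    """Join all text fragments longer than min_length, regardless of element structure."""
    fragments = []
    for text in ET.fromstring(xml_string).itertext():
        text = text.strip()
        if len(text) > min_length:
            fragments.append(text)
    return " ".join(fragments)
```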

Further processing

PDF and XML files often contain problematic characters such as dashes and hyphens of different lengths, different whitespace characters, ligature characters used to improve the rendering of certain letter groups, and so forth. These cause issues when applying classifiers and regular expressions, and are converted to their regular counterparts whenever possible. For example, the ligature character ﬁ would be converted to the two letters fi – a subtle visual difference, but parsed as completely different characters.
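Much of this cleanup can be done with Unicode NFKC normalization, which among other things splits ligature characters, plus an explicit map for characters NFKC leaves alone; the mapping below covers only a few illustrative cases, not the platform's full conversion table.

```python
import unicodedata

# Characters NFKC leaves untouched but which still confuse regex patterns
CHAR_MAP = str.maketrans({
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
    "\u2212": "-",  # minus sign
})

def normalize_text(text):
    """NFKC-normalize (e.g. the ligature U+FB01 becomes 'fi'), then map dashes."""
    return unicodedata.normalize("NFKC", text).translate(CHAR_MAP)
```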

4. Classification

Only some articles are relevant to the field being studied and only some sentences in those articles are interesting for data extraction. Classifiers are used to separate these out in order to improve the quality of the platform’s final output.

Machine learning

Machine learning is a general term for models meant to perform some specific task without explicit instructions, for example filtering e-mail or translating text.

Supervised models, the most common type, use training data to create a model to inform future decisions on data of the same type.

Figure 3. Part of an XML document. Text that would be extracted by the parser is highlighted.

The classifiers used in this platform are support-vector machines (SVMs), which in simplified terms draw a hyperplane to separate two categories of training data as well as possible, i.e. with the largest possible margin between nearby points and the hyperplane, as seen in Figure 4. This hyperplane can then be applied to data of the same type to classify points into one category or the other.

In addition to training data these models also rely on so-called hyperparameters – parameters set beforehand instead of being derived from training. These significantly affect the output of the classifier, and should be optimized for the type of data the classifier is expected to encounter. One example of such a parameter in SVMs is tolerance, which determines when the classifier will stop trying to optimize the model, or in other words what counts as "close enough".

The hyperplane used by the classifier normally has to be flat, but it is possible to get around this by curving the space containing the points. This allows the flat plane to effectively follow a curved path, similar to how a straight scissor cut can make a complicated hole if the paper is folded [20]. An efficient way to do this is known as the kernel trick¹ and is frequently used to allow SVM and other linear classifiers to separate more complicated data.

¹ The origin of this term is shrouded in mystery.

There is much more to machine learning than can be described here, and the scikit-learn [21] project is recommended as a starting point for further reading.
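To make these concepts concrete, the sketch below trains a small SVM with scikit-learn. The toy data points, the choice of RBF kernel and the tolerance value are all assumptions made for illustration, not the platform's actual settings.

```python
# Toy SVM example (illustrative only; not the platform's actual code).
from sklearn.svm import SVC

# Two well-separated categories of 2D training points.
X = [[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]]
y = [0, 0, 0, 1, 1, 1]

# kernel and tol are hyperparameters: set beforehand, not learned.
# "rbf" applies the kernel trick; tol is the stopping tolerance.
clf = SVC(kernel="rbf", tol=1e-3)
clf.fit(X, y)

print(clf.predict([[0.5, 0.5], [3.5, 3.5]]))  # -> [0 1]
```

New points are assigned to whichever side of the (possibly kernel-curved) hyperplane they fall on.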

Article classification

The article classifier uses a support-vector machine to determine which articles are relevant to the subject being studied.

First, training data must be obtained, representing the two classes of articles. The relevant (positive) examples should be as representative of the type of article being studied as possible. A good source of these is the set of articles one tends to collect while studying a subject, and 10-100 well-chosen articles are usually enough to produce acceptable performance.

Non-relevant (negative) examples represent all other articles. This set is much harder to represent well, as it is not possible to predict every type of article that might appear.

Ideally negative training data includes both clearly irrelevant articles as well as edge cases, i.e. irrelevant articles that almost look like relevant articles. This helps better define the border between the two categories.

Once collected the positive and negative examples are converted to plaintext using the PDF and XML parsers as needed and placed in the positive and negative example folders to be fed to the article classifier as training data.

The trained classifier can then be applied to a folder of plaintext articles to classify them as relevant or not. Relevant articles are copied to another folder for further processing.
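The thesis does not show the classifier's internals, but the train-and-filter step could look roughly like the following sketch, where the TF-IDF feature representation and the example texts are assumptions for demonstration:

```python
# Hypothetical sketch of the article classifier: plaintext articles
# are vectorized (TF-IDF assumed) and fed to a linear SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny stand-ins for the positive/negative training folders.
positive = ["LiFePO4 cathode cycled at a current density of 17 mA/g",
            "the specific capacity of the carbon-coated cathode"]
negative = ["lithium treatment of bipolar disorder",
            "the subjects completed a battery of cognitive tests"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(positive + negative, [1, 1, 0, 0])

# Articles predicted as 1 (relevant) would be copied onward.
print(clf.predict(["measured specific capacity of the cathode"]))  # -> [1]
```

Note the negative examples include an edge case ("battery of cognitive tests") that shares vocabulary with the positive class, as recommended above.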

Figure 4. A one-dimensional hyperplane (i.e. a line) separating two groups of points in two dimensions.


Sentence classification

The sentence classifier determines which categories a sentence belongs to, if any. The principle is the same as with the article classifier except that in this case there are fifteen different classes representing the different parameters that are to be extracted, which were shown in Table 1.

The training data for the sentence classifier consists of sentences accompanied by 15 entries of 1 or 0 indicating whether the sentence belongs to that class. These are fed to the classifier, which essentially creates 15 classifiers in one.

Unlike the article classifier the sentence classifier essentially gets negative examples “for free”, as sentences tend to be negative for most categories. When a sentence is fed as a positive example in one or several categories it is also used as a negative example in all other categories.

When the trained classifier is applied to a sentence it will return a list of the categories the sentence belongs to. For example, a sentence discussing the cycling of a battery might be classed as belonging to both the specific capacity and current density categories.
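A minimal multi-label setup of this kind might be built with one binary SVM per category, for example via scikit-learn's OneVsRestClassifier. The one-vs-rest structure, the example sentences and the reduction to 3 of the 15 categories are all assumptions for illustration:

```python
# Sketch of multi-label sentence tagging (assumed implementation):
# one binary SVM per category, so a sentence can get several tags.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

sentences = [
    "The cell delivered 140 mAh/g at a current density of 17 mA/g.",
    "95% of the capacity was retained after 100 cycles.",
    "Samples were sintered at 700 degrees for 10 h.",
]
# Columns: specific capacity, current density, cycle number.
labels = np.array([[1, 1, 0],
                   [0, 0, 1],
                   [0, 0, 0]])

clf = make_pipeline(TfidfVectorizer(),
                    OneVsRestClassifier(LinearSVC()))
clf.fit(sentences, labels)

# Returns one 0/1 flag per category for each input sentence.
tags = clf.predict(["Cycling gave 120 mAh/g at 50 mA/g."])
print(tags.shape)  # -> (1, 3)
```

A positive example in one column is automatically a negative example in every other column, which is how the "for free" negatives described above arise.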

5. Data extraction

The data extraction module is responsible for identifying and extracting relevant data in classified sentences. Data may be numeric, for example a specific capacity value, or it may be non-numeric, for example the name of one of the electrolytes used in a battery.

Extraction is arguably the most complicated part of the platform as it involves parsing natural language sentences to extract their meaning. There are some machine learning-based methods which can do this reasonably well, for example the Aristo Project [11] by AllenAI which specializes in parsing certain types of scientific sentences.

Implementing such a system was far outside the scope of this project however, and a simpler solution was sought.

Regular expressions

Regular expressions are a powerful way to match patterns within text. They are relatively easy to create once the cryptic syntax is mastered and perform well when matching information that is in a known form. This includes value-unit pairs or variables that have a limited number of possibilities, for example space groups.

An example regular expression can be seen in Figure 5. This pattern matches specific capacity values fairly well but relies on certain assumptions: all values are given in specific units, the text has been properly processed by the PDF or XML parser, all whitespace consists of normal spaces, commas are thousands separators and not decimal separators, etc. When these assumptions occasionally fail the result is typically either no match or an incorrect value.

Context-sensitive data such as electrolyte composition poses special problems. It is easy to match the names of electrolytes in the text to determine their presence, but it is very difficult to actually determine the composition of the electrolyte mixture used in a battery experiment because there are numerous different ways to formulate sentences stating the ratios of the constituent components. These require deeper comprehension than simple pattern matching and cannot be handled well by regular expressions alone.

[\d,]+[\d\.]+? *m? *A *h(( *g *-? *1)|( *\/ *g))

Reading left to right: one or more digits, with comma separators; optionally decimals; zero or more spaces; “mAh” or “Ah”; then either “g -1” (also matching “g-1”) or, via the OR operator, “/g”.

Figure 5. A regular expression matching specific capacity values.
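The pattern in Figure 5 can be exercised directly in Python; the example strings below are invented to show a clean match, a match despite PDF-style spacing, and a non-match:

```python
import re

# The specific capacity pattern from Figure 5.
pattern = re.compile(r"[\d,]+[\d\.]+? *m? *A *h(( *g *-? *1)|( *\/ *g))")

print(pattern.search("a capacity of 140 mAh/g").group(0))   # -> 140 mAh/g
print(pattern.search("delivered 1,200 mAh g -1").group(0))  # -> 1,200 mAh g -1
print(pattern.search("an energy density of 500 Wh/kg"))     # -> None (not a capacity)
```

Note how the optional spaces let the pattern tolerate the stray whitespace that PDF parsing introduces, at the cost of also admitting some odd strings.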



Figure 6. The data extraction module's graphical output mode.

Writing the regular expressions used involves a combination of chemical reasoning and reading through sentences tagged by the sentence classifier. To facilitate this process a graphical output mode, as shown in Figure 6, was created.

This prints classified sentences in each article along with the results of pattern matching, and includes an optional troubleshooting mode to print much more information.

Numeric values

For categories with numeric data, value extraction is performed. This attempts to extract a numeric value from the matched text string. In the case of specific capacity and current density these values are normalized to the units mAhg-1 and mAg-1 respectively, so “530mAh g 1” would be interpreted as a numeric value of 530 while “1.5 Ah /g” would be normalized to 1500.
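A sketch of how this normalization might work; the parsing heuristics here are assumptions for illustration, not the platform's exact code:

```python
import re

def normalize_capacity(matched: str) -> float:
    """Return a specific capacity in mAh/g from a matched string (assumed logic)."""
    # Leading number; commas are treated as thousands separators.
    number = float(re.match(r"[\d.]+", matched.replace(",", "")).group(0))
    # "Ah" without the milli prefix means ampere-hours: scale by 1000.
    has_milli = re.search(r"m *A *h", matched) is not None
    return number if has_milli else number * 1000

print(normalize_capacity("530mAh g 1"))  # -> 530.0
print(normalize_capacity("1.5 Ah /g"))   # -> 1500.0
```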

Exporting data

From each match the matched string, numerical value (if found), category identifier, full sentence text and reduced DOI are compiled and added to a dictionary of hits. This represents the first five entries seen in Figure 7.

Metadata

Metadata is data about data, for example the title and publisher of an article. This information is not kept attached to each article through the whole program but is rather left with the search results and reattached once the extraction process is finished. This means the only information associated with each article through most of the process is its reduced DOI, which is stored in the file name, and its actual content. This greatly simplifies the data structures needed.

Figure 7. An entry of extracted data and metadata.

The metadata module iterates over the list of search results, each of which contains the metadata associated with an article. The search result’s DOI is converted to its reduced form by replacing non-alphanumeric characters with underscores and is compared to each entry’s Name.

Where they are identical the metadata is added to the match entry, producing an entry such as the one seen in Figure 7.

The output of the metadata module is a dictionary of such entries and is saved as a JSON file, forming the final output of the platform.
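The reduced-DOI scheme and the reattachment step can be sketched as follows, with field names and values taken from Figure 7 (the dictionary layout is an assumption):

```python
import re

def reduce_doi(doi: str) -> str:
    # Replace every non-alphanumeric character with an underscore.
    return re.sub(r"[^0-9A-Za-z]", "_", doi)

# A search result and a match entry (values from Figure 7).
search_result = {"DOI": "10.1007/s11164-011-0273-3",
                 "Publisher": "Springer Science and Business Media LLC",
                 "Date": 2011}
entry = {"Name": "10_1007_s11164_011_0273_3",
         "Match": "140 mAh/g", "Value": 140}

# Where the reduced DOI equals the entry's Name, attach the metadata.
if reduce_doi(search_result["DOI"]) == entry["Name"]:
    entry.update(search_result)

print(entry["Publisher"])  # -> Springer Science and Business Media LLC
```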

Sentence: “The discharge capacity of LiFePO4 was enhanced up to 140 mAh/g using a simple carbon coating method.”
Match: “140 mAh/g”
Category: 1
Name: “10_1007_s11164_011_0273_3”
Value: 140
Title: “Continuous supercritical hydrothermal synthesis: lithium secondary ion battery applications”
DOI: “10.1007/s11164-011-0273-3”
Publisher: “Springer Science and Business Media LLC”
Date: 2011


6. Example run

The platform is divided into eleven modules, shown in Figure 8. In this section the basic steps to go from search terms to extracted values are described.

Module 1: Searching

A list of all search terms to be used in the final query is entered. For example if the query “(lithium AND battery) NOT medicine” is to be performed the list would consist of the elements “lithium”, “battery”, and “medicine”. The file is then run and the search results are fetched. This step usually takes 10-15 seconds per page of 1000 results.
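The thesis does not show the exact request format, but a page of results from Crossref's public works endpoint might be requested with a URL built roughly like this. The endpoint and parameter names follow Crossref's REST API documentation; cursor-based paging returns up to 1000 results per page, matching the 10-15 seconds per page noted above:

```python
from urllib.parse import urlencode

def crossref_query_url(term: str, cursor: str = "*", rows: int = 1000) -> str:
    # "cursor=*" starts deep paging; each response carries the next cursor.
    params = urlencode({"query": term, "rows": rows, "cursor": cursor})
    return f"https://api.crossref.org/works?{params}"

print(crossref_query_url("lithium"))
```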

Module 2: DOI set operations

Here search terms are entered into the categories OR, AND and NOT to apply set operations to them. At least one term must be entered under OR. In the case previously mentioned “lithium” would be entered under OR, “battery” under AND, and “medicine” under NOT. The filename of the output is printed.
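These operations map directly onto Python's built-in set type, as this sketch of the query "(lithium AND battery) NOT medicine" shows (the DOIs are invented placeholders):

```python
# Each search term maps to the set of DOIs its search returned.
results = {
    "lithium":  {"10.1/a", "10.1/b", "10.1/c"},
    "battery":  {"10.1/b", "10.1/c", "10.1/d"},
    "medicine": {"10.1/c"},
}

# OR is union, AND is intersection, NOT is difference.
kept = (results["lithium"] & results["battery"]) - results["medicine"]
print(sorted(kept))  # -> ['10.1/b']
```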

Module 3: Downloading

The filename output by module 2 is entered and the file is run. It will then try to download the articles corresponding to every DOI listed in the file and save them to the cache. This step is very slow and downloading more than a few thousand articles is an overnight or weekend job, as previously mentioned.

Module 4: Article collection

The same filename as in module 3 is entered and the file is run. This copies every matching article from the cache to the working directory.

Figure 8. Overall flow of the platform: (1) search, (2) apply set operations, (3) download articles, (4) collect downloads, (5) correct filenames, (6) parse PDFs, (7) parse XMLs, (8) classify articles, (9) classify sentences, (10) extract data, (11) reattach metadata. The intermediate products run from search terms, DOIs and filtered DOIs through cached downloads, renamed articles, plaintext articles, relevant articles and tagged sentences to extracted data, which is combined with metadata and saved as JSON.


Module 5: Filename correction

When run this module attempts to correct any PDFs misnamed as XML in the set directory. It should never be used on the cache folder as any renamed files will no longer be seen by the caching feature, causing them to be re-downloaded.
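One plausible way to detect such files is to check for the PDF magic bytes; the heuristic below is an assumption about how the module works, not its actual code:

```python
import os

def fix_misnamed(path: str) -> str:
    """Rename a .xml file to .pdf if it actually contains PDF data."""
    with open(path, "rb") as f:
        header = f.read(4)
    # Every PDF starts with the magic bytes "%PDF".
    if path.endswith(".xml") and header == b"%PDF":
        new_path = path[:-4] + ".pdf"
        os.rename(path, new_path)
        return new_path
    return path
```

Genuine XML files begin with "<", so they pass through untouched.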

Module 6: PDF conversion

When run this module converts the contents of every PDF in the working directory to plaintext and saves them as .txt files. This typically takes 2-4 seconds per file.

Module 7: XML parsing

The file is run, extracting the plaintext content from each XML file in the working directory and saving it as .txt files.
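Extracting plaintext from article XML amounts to walking the element tree and collecting the text fragments; the element names below are invented for illustration, as real publisher schemas differ:

```python
import xml.etree.ElementTree as ET

# A toy article XML (hypothetical schema).
xml = ("<article><title>LiFePO4 study. </title>"
       "<body><p>The capacity was <b>140 mAh/g</b>.</p></body></article>")

root = ET.fromstring(xml)
# itertext() yields every text fragment in document order,
# including text inside nested tags like <b>.
plaintext = "".join(root.itertext())
print(plaintext)  # -> LiFePO4 study. The capacity was 140 mAh/g.
```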

Module 8: Article classification

Plaintext-format training data is placed in the indicated folders. The file is then run, which generates the article classifier and applies it to all .txt files in the working directory. Those classified as belonging to the positive class are then copied to a secondary working directory.

The trained classifier is saved and can be loaded in future runs instead of building a new one.

Module 9: Sentence classification

The file is run, loading an external classifier and applying it to the .txt files in the secondary working directory. These are split into sentences which are saved as plaintext-format .cls files, with sentences and associated tags on alternating rows.

Module 10: Data extraction

The output mode is set to 0, selecting JSON output, and the file is run. This attempts to extract values from .cls files in the secondary working directory and saves them in the format described under 5.2.

Output modes 1 and 2 can also be used to produce graphical output, as was shown in Figure 6.

Module 11: Metadata

A search term corresponding to a set of search results containing all articles being processed is entered. In this example the terms “lithium” and “battery” would both work. The file is then run, fetching metadata from all corresponding search result files and attaching it to any matching entries in the file generated by data extraction. The result is then saved and can be further processed by a program capable of parsing JSON.


Results

The platform was applied to produce two sets of data: A smaller set extracted from 304 LiFePO4 battery articles and a larger set from 6692 general lithium-ion battery articles. Glow effects have been applied to the points in all figures presented in this section to better highlight stacked data points.

1. LFP data

The LFP data set consists of 3171 data points. Of these 920 points in the specific capacity category can be seen in Figure 9.

The main trend that can be observed is the fact that the densest set of points, as indicated by a strong glow effect, lies between about 90 and 150 mAhg-1. This is reasonable, as modern LFP batteries can have capacities of more than 140 mAhg-1 [22]. 77% of all extracted values fall below this value, and the median value is 139 mAhg-1. The remaining values may refer to anode materials, hypothetical values, measurement errors, misparsings or in-article comparisons to other cathode materials such as LCO. Most very low (< 10 mAhg-1) values are caused by misparsed text, not actual experimental data.

2. General Li-ion battery data

This set of specific capacity data, shown in Figure 10, consists of 17278 data points. 95% of all values fall between 50 and 1000 mAhg-1, which is reasonable. Values are especially clustered around the ranges 50-200 and 350-1000, indicating two different categories, most likely higher-capacity anodes and lower-capacity cathodes. A simple exponential regression indicates a specific capacity growth of 1.7% per year, but as the data comes from a mix of different types of batteries and electrodes this value is dubious.
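The exponential regression amounts to fitting a line to log(capacity) against publication year, where the slope b gives a growth rate of exp(b) − 1 per year. The sketch below reproduces the method on synthetic data generated with exactly 1.7% annual growth; the real fit was of course done on the extracted data set:

```python
import math

# Synthetic capacities growing 1.7% per year (illustration only).
years = list(range(2000, 2020))
caps = [300 * 1.017 ** (y - 2000) for y in years]

# Least-squares slope of log(capacity) against year.
n = len(years)
xbar = sum(years) / n
ybar = sum(math.log(c) for c in caps) / n
slope = sum((x - xbar) * (math.log(c) - ybar)
            for x, c in zip(years, caps)) / sum((x - xbar) ** 2 for x in years)

growth = (math.exp(slope) - 1) * 100
print(f"{growth:.1f}% per year")  # -> 1.7% per year
```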

Figure 9. Specific capacity values in LFP battery articles (log-scale specific capacity in mAhg-1 against publication year, 2000-2020).

Figure 10. Specific capacity values in general Li-ion battery articles (log-scale specific capacity in mAhg-1 against publication year, 2000-2020).


Some categories contain values obviously chosen by humans – heat treatment temperatures, for example, are often divisible by neat numbers like 5, 10, 20 or 25 as indicated by the horizontal bands in Figure 11. The weight % active material category (not shown) exhibits very similar behavior, with strong preference for the values 10, 25 and 75%.

Cycle numbers, seen in Figure 12, appear to be a mix of two sets of values. One presumably comes from experiments where batteries were cycled until failure or some other condition while the other clearly consists of human-chosen values at nicely divisible numbers like 100 and 250. The overall effect is a random scatter of values overlaid by bands at cycle numbers well-liked by human researchers.

3. Reliability

Several major sources of error have been identified that contribute to the overall quality and quantity of the output.

Poor value extraction is the cause of most extreme values. This is caused both by insufficiently clever regular expressions and the fact that the PDF parser frequently fails to correctly parse text. Whitespaces are added or removed, characters are dropped or changed, tables are collapsed into strings of characters and ligatures replace regular characters. Numeric values may be joined into one large number if whitespace is removed, for example.

Figure 11. Heat treatment temperatures in Li-ion battery articles (temperature in °C against publication year, 2000-2020).

Figure 12. Cycle numbers in Li-ion battery articles (log-scale number of cycles against publication year, 2004-2020).


Some units may not be interpreted correctly; for example some capacity values in specific-area units (e.g. mAh·cm-2) are sometimes misinterpreted as specific mass (e.g. mAhg-1). This has been observed to be the cause of most very low specific capacity values. This specific error can be easily corrected but many similar undiscovered errors are expected to exist.

These issues can be, and have been, addressed by writing better regular expressions that tolerate the errors produced by the PDF parser, but this cannot cover every type of error. Automatic adjustment of the PDF parser’s margin settings can also help alleviate parsing problems, and tables can be handled by a purpose-built solution like tabula-py [23].

Incorrect classification of articles causes some articles of the wrong type to be included and analyzed. This is the source of many moderately high or low outliers. For example, the specific capacity of a lithium-silicon battery is much higher than that of an LFP battery [24], and these values sometimes show up in the LFP data as high outliers.

The article classifier also sometimes fails to include the articles that it should, which is the cause of the low (304) number of LFP articles studied.

The main way to solve both of these issues is with better training data for the article classifier.

Incorrect sentence classification causes some sentences to not receive the appropriate category tags, which in turn causes them to not be processed by the data extraction module. This reduces the overall volume of data produced by the module and may also produce a biasing effect if the type of sentences that are missed tend to contain the same type of value.

Some sentences are also occasionally tagged as belonging to the wrong category, which can lead to values being extracted in the wrong category. The overall effect of this is small as mistagged sentences usually don’t produce any valid matches when processed by the data extraction module, but may be significant for the capacity retention and weight % active material categories as both are unitless percentage values.

Incomplete article access is caused by the fact that the platform only fetches articles from Crossref. Many potentially interesting articles served by other providers will not be downloaded and will be unavailable for study. This mainly decreases the amount of data available but may also bias the data if the research groups covered by different providers do not study the same materials or use the same methods. The obvious solution is to implement downloading from more providers, assuming the appropriate access agreements are in place.

Discussion

1. Platform usefulness

The current version of the platform should be seen as a proof of concept. The data presented does not contain much new information, but should rather be seen as an indication that the platform is capable of extracting data fairly reliably.

The most immediately useful parts are the collection modules, as these provide a convenient solution for mass downloading.

The rest of the platform can be considered a framework for other more powerful modules. Any project using data from this version of the platform would need to take the significant sources of noise into account, and consider systematic biases. Of particular importance is the fact that it is hard to tell which specific material or experiment a given value comes from. It is often not clear whether capacity values refer to the anode, cathode or the battery as a whole, and the platform does not deal well with nonstandard phrasing or units.

However, if the issues brought up in this thesis were addressed this type of platform could become a useful research tool. It is unlikely that it would ever rival the accuracy of a trained human scientist, but the speed at which it can be applied still means it has a niche in the mass analysis of articles too numerous for manual processing.

2. Machine-readable appendices

One of the takeaways from this method is how difficult it is to cleanly and reliably extract data from articles. A possible solution to this is machine-readable appendices.

These are structured data files attached to articles, specifically meant to be machine-read by anything from a search engine indexing the article to a mass analysis tool like this one. The data contained would include standard metadata as well as experimental measurements and conditions, instruments used and their settings, data displayed in charts and so forth.

The benefits of this would be numerous, as reliable mass analysis opens up easier field overview and identification of candidate materials, improves searchability of articles and, via platforms like this one, simplifies the identification of correlations between parameters. Improved reporting of experimental data and conditions would also help reproducibility.

The main downside would be the additional work required of authors to compile these files, increasing the time taken to publish papers, although the time saved in analysis and experiment reproduction may well justify the additional effort.

Conclusion

A platform has been created for the rapid but shallow analysis of large volumes of scientific articles. Methods such as SVM classification, regular expressions, mass downloading and PDF and XML parsing have been implemented to this end.

The platform has been applied to articles on the subject of LFP batteries as well as general lithium-ion battery articles. Data from both sets has been presented and its validity and interpretation discussed.

A number of issues affecting the reliability of the platform have been identified, with inaccurate PDF parsing and pattern matching contributing most to the noise and systematic errors seen in the output. Some potential improvements to these problems have been proposed.

The automated analysis of scientific articles is expected to play a significant role in future research, with natural language processing-based approaches using artificial neural networks showing the most promise.

Acknowledgements

Jonas Ångström, supervisor, for much support and feedback, and for initially setting up the classifiers used.

Pavlin Mitev, subject specialist, for feedback during the writing process and for providing access to the Teoroo computer cluster.

Peter Broqvist, examiner, for feedback during the writing process.

Andreas Röckert, for feedback during the writing process.

Daniel Brandell and Mikael Ardre, for helping enable this thesis position.


References

[1] X. Zhang, O. Toprakçı and L. Ji, "Electrospun Nanofiber-Based Anodes, Cathodes, and Separators for Advanced Lithium-Ion Batteries," Polymer Reviews, vol. 51, no. 3, pp. 239-264, 2011.

[2] A. Padhi, K. Nanjundaswamy and J. Goodenough, "LiFePO4: A Novel Cathode Material for Rechargeable Batteries," Electrochemical Society Meeting Abstracts, vol. 96, no. 1, p. 73, 1996.

[3] E. Sarasketa-Zabala, G., L. M. Rodriguez-Martinez and I. Villarreal, "Calendar ageing analysis of a LiFePO4/graphite cell with dynamic model validations: Towards realistic lifetime predictions," Journal of Power Sources, vol. 272, pp. 45-57, 2015.

[4] S. Bodoardo, C. Gerbaldi, G. Meligrana, A. Tuel, S. Enzo and N. Penazzi, "Optimisation of some parameters for the preparation of nanostructured LiFePO4/C cathode," Ionics, vol. 15, no. 19, 2009.

[5] Batteryspace.com, "12.8V (12V) LiFePO4 Battery Packs," [Online]. Available: https://www.batteryspace.com/128vlifepo4batterypacks.aspx. [Accessed 27 January 2020].

[6] M. S. Islam, D. J. Driscoll, C. A. J. Fisher and P. R. Slater, "Atomic-scale investigation of defects, dopants, and lithium transport in the LiFePO4 olivine-type battery material," Chemistry of Materials, vol. 17, no. 20, pp. 5085-5092, 2005.

[7] T. Satyavani, A. S. Kumar and P. S. Rao, "Methods of synthesis and performance improvement of lithium iron phosphate for high rate Li-ion batteries: A review," Engineering Science and Technology, an International Journal, vol. 19, no. 1, pp. 178-188, 2016.

[8] A. Blidberg, "Iron Based Materials for Positive Electrodes in Li-ion Batteries: Electrode Dynamics, Electronic Changes, Structural Transformations," PhD dissertation, Uppsala, 2017.

[9] D. Li, D. Danilov, Z. Zhang, H. Chen, Y. Yang and P. H. L. Notten, "Modeling the SEI-Formation on Graphite Electrodes in LiFePO4 Batteries," Journal of the Electrochemical Society, vol. 162, no. 6, pp. A858-A869, 2015.

[10] A. M. Woodward, "The roles of reviews in information transfer," Journal of the American Society for Information Science, vol. 28, no. 3, 1977.

[11] P. Clark, "Project Aristo: Towards Machines that Capture and Reason with Science Knowledge," K-CAP '19 - Proceedings of the 10th International Conference on Knowledge Capture, pp. 1-2, 2019.

[12] W. Barlow, "I. Ueber die geometrischen Eigenschaften homogener starrer Structuren und ihre Anwendung auf Krystalle," Zeitschrift für Kristallographie, vol. 23, pp. 1-63, 1894.

[13] Anaconda, Inc., "Anaconda | The World's Most Popular Computer Science Platform," Nov. 2016. [Online]. Available: https://anaconda.com. [Accessed 2 Dec 2019].

[14] Crossref, "GitHub - CrossRef/rest-api-doc: Documentation for Crossref's REST API," [Online]. Available: https://github.com/CrossRef/rest-api-doc. [Accessed 2 Dec 2019].

[15] S. Chamberlain, H. Zhu, N. Jahn and C. Boettiger, "rcrossref: Client for Various 'Crossref' 'APIs'," 2019. [Online]. Available: https://CRAN.R-project.org/package=rcrossref. [Accessed 2 Dec 2019].

[16] S. Chamberlain, "fulltext: Full Text of Scholarly Articles Across Many Data Sources," 2019. [Online]. Available: https://CRAN.R-project.org/package=fulltext. [Accessed 2 Dec 2019].

[17] L. Gautier, "rpy2 · PyPI," 2019. [Online]. Available: https://pypi.org/project/rpy2/. [Accessed 2 Dec 2019].

[18] N. V. Houdnos, "GitHub - nathanvan/parallelsugar: R package to provide mclapply() syntax for Windows machines," [Online]. Available: https://github.com/nathanvan/parallelsugar. [Accessed 2 Dec 2019].

[19] Y. Shinyama et al., "GitHub - pdfminer/pdfminer.six: Python PDF Parser -- fork with Python 2+3 support using six," [Online]. Available: https://github.com/pdfminer/pdfminer.six. [Accessed 2 Dec 2019].

[20] E. D. Demaine, M. L. Demaine and A. Lubiw, "Folding and one straight cut suffice," Proceedings of the Tenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '99), pp. 891-892, 1999.

[21] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, A.
