A mine of information – why TDM presents a challenge for publishers

Publishers and information specialists can’t have failed to notice that AI has been all over the news in recent weeks. From debate about whether ChatGPT can deliver a reasonably standard undergraduate essay, to artists’ and writers’ concerns about copyright as other AI systems scrape their work to produce mashups for other users, AI is everywhere.

Publishers are the major owners, distributors, and hosts of all types of content, from literary works and academic articles and books to photographic and artistic works. As such, publishers should be concerned about how to protect their intellectual property, and that of their authors, in this changing world. They also need a clearer view of the role that text and data mining – and illegal mining or scraping – plays in the world of AI and machine learning (ML).

TDM at the core of AI

Artificial intelligence in its most commonly used form – machine learning – cannot really function without an underlying data set. The data is needed to train the algorithms, so there is a bank of information that forms the basis of their ‘learning decisions.’ Advances over the past 10 years have outstripped the expectations of many AI researchers and technologists. While writers and technologists often present AI or ML as miraculous, the core they are built on is human knowledge, creativity, and ingenuity, and most of this valuable resource sits in banks of online published information.

Text and data mining (TDM) is a vital input for AI and should be a wake-up call for publishers. TDM is far from a new service, but it was previously regarded as a niche field in which data scientists pulled large databanks to reanalyse existing data. Projects were often focused on extracting data for activities such as building large databases, and originally this work was done manually by individuals using set rules to codify information. Lately, with the explosion of computing power, the exponential growth in published information, the increasing sophistication of machine learning techniques, and the availability of publishers’ APIs, TDM can be done programmatically through code and APIs. As such, TDM is experiencing a moment of significant growth and interest.
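To make the shift from manual, rule-based codification to programmatic mining concrete, here is a minimal, hypothetical sketch in Python. It assumes a small in-memory set of abstracts standing in for records retrieved from a publisher’s API (the sample texts, the stopword list, and the `term_frequencies` helper are all illustrative, not any particular publisher’s interface), and simply counts the most frequent terms across the corpus – one of the simplest mining operations that once would have been done by hand:

```python
from collections import Counter
import re

# Toy corpus standing in for abstracts fetched from a publisher's API.
abstracts = [
    "Machine learning models require large training data sets.",
    "Text and data mining extracts structure from published text.",
    "Mining published data can surface new research insights.",
]

# A deliberately tiny stopword list for illustration only.
STOPWORDS = {"and", "from", "can", "the", "a"}

def term_frequencies(docs):
    """Tokenise each document, drop stopwords, and count term occurrences."""
    counts = Counter()
    for doc in docs:
        tokens = re.findall(r"[a-z]+", doc.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts

freqs = term_frequencies(abstracts)
print(freqs.most_common(3))  # the highest-frequency terms across the corpus
```

In a real project the corpus would be fetched in bulk through a licensed API and the analysis would be far richer (entity extraction, topic modelling, model training), but the principle is the same: once the content is machine-readable, mining becomes a matter of code rather than manual effort.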

New models, new opportunities

Many publishers still have a black-box view of how customers use TDM. They can see that TDM is an area of growing interest but are unsure what their customers are doing with it. Alternatively, they hold a traditional model, assuming the information is being used as a supplement or an additional research information stream, and still consider TDM a pure research output.

The traditional model of data mining has expanded to the point where analysis of mined research data is replacing parts of wet-bench science. The explosion of published information means that institutions and corporates are mining entire subject areas to generate additional insights. Case studies now abound of new discoveries generated by deep mining. Many of these discoveries, from the influence of genetic variants to the collation of genomic or chemical information, have substantial commercial value beyond the research discovery phase. Publishers need to understand how their data is being used beyond serving as an input into a pure R&D process, as it could well become part of a new business.

Protecting the value of your data

This is a very different outcome from using published information simply to inform research: TDM shifts the balance towards surfacing new discoveries in existing information. It puts a different value on the data, which can itself become part of a new product. This productising stream, whether driven by informatics and tech businesses or by the move from wet-bench to data-science research, means the value of the information is changing alongside its use. Data sources are now a vital seedbed for many developments, and publishers’ information, with its highly rigorous structure, is particularly suited to the many organisations seeking new business opportunities in areas as diverse as medicine, materials science, and finance.

Given the rapid growth and Silicon Valley background of many of the businesses utilising data, it should come as no surprise that start-ups and others scrape published data without payment. In several consulting projects for publishers, Maverick has found a number of start-ups scraping publishers’ information to build out their own data products. Publishers need to consider not only how to protect their sites but also how to turn this demand for information into an opportunity for sales or partnership.

In short, the explosion of publishers’ outputs, combined with the rapid growth in the applications of AI and ML as technologies, means publishing as an industry needs to engage with this technology. By doing so, publishers can ensure that their business strategy – including pricing, sales, licensing, and understanding the value of their holdings – is equipped to deal with, and take advantage of, these new developments.

To learn how Maverick’s TDM services can help your organisation protect and monetise its valuable research assets, contact your Maverick representative or reach out to Rebecca Moakes at Rebecca@maverick-os.com.

By Kate Wood, Senior Associate

Kate Wood has worked on machine learning products and has run numerous research and analysis projects on developments in text and data mining, machine learning, and informatics, and on what these mean for the publishing industry.

Further reading:

Maverick Case Study – Productizing TDM