LifeScience Hack

How a biology-side drug discovery researcher gets hold of AI (an exaggerated claim) (Python, Deep Learning, life science)

Getting paper information in bulk with Python (6): Extracting abstracts with EFetch

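It has been quite a while since the last post,
but here is the latest installment of the series on collecting paper information in bulk.
This time we use EFetch to retrieve abstracts.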

This post assumes you have understood the previous articles in the series.
The previous articles are listed here:

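● Getting paper information in bulk with Python (1): About the PubMed API

● Getting paper information in bulk with Python (2): Preparation

● Getting paper information in bulk with Python (3): Getting PMIDs with ESearch

● Getting paper information in bulk with Python (4): Getting paper titles with ESummary

● Getting paper information in bulk with Python (5): Saving as an Excel file with Openpyxl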

EFetch

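EFetch is used in basically the same way as ESummary.
The base URL for EFetch is https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi
To this we append ?db=pubmed to select the PubMed database, &retmode=xml to request XML,
and &id= to specify the PMID, which gives
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&id=
One difference to keep in mind: EFetch cannot return PubMed records as JSON, so we retrieve the data as XML.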
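
As a minimal sketch of how this URL is put together, the snippet below assembles the same query string from its three parameters using only the standard library. The names EFETCH_BASE and build_efetch_url are made up for this illustration and are not part of the code in this series.

```python
from urllib.parse import urlencode

# Base URL for EFetch, as given above
EFETCH_BASE = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'


def build_efetch_url(pmid):
    """Return the EFetch URL for a single PMID (hypothetical helper)."""
    params = {
        'db': 'pubmed',     # search the PubMed database
        'retmode': 'xml',   # ask for XML, since JSON is not available here
        'id': pmid,         # the PMID to fetch
    }
    return EFETCH_BASE + '?' + urlencode(params)


print(build_efetch_url('32755859'))
```

Building the query from a dictionary avoids stray spaces or missing "&" characters that are easy to introduce when concatenating the string by hand.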

Code for extracting abstracts with EFetch

  
import requests
from lxml import etree
import time

URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&id='
# Base URL for EFetch. Appending a pmid to it gives the query URL.

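queries = [URL + pmid for pmid in pmids]
# pmids is the list of PMIDs obtained in part 4 of this series.

responses_abst = {}  # dictionary that will hold the abstracts fetched below

# With ESummary we stored the whole JSON for each pmid, but that gets heavy,
# so here we extract the abstract and store only its text in responses_abst.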

for query in queries:
    response = requests.get(query)
    root = etree.fromstring(response.content)
    pmid = root.find('.//PMID').text  # extract the pmid
    abst = root.findall('.//AbstractText')
    if not abst:  # findall returns an empty list (never None) when there is no abstract
        abst_text = ''
    else:
        abst_text = ''.join(root.xpath('//Abstract//*/text()'))
    responses_abst[pmid] = abst_text
    time.sleep(0.2)  # short pause between requests
print(responses_abst)

#{'32755859': 'Non-local self-similarity is well-known to be an effective prior for the image denoising problem. However, little work has been done to incorporate it in convolutional neural networks, which surpass non-local model-based methods despite only exploiting local information. In this paper, we propose a novel end-to-end trainable neural network architecture employing layers based on graph convolution operations, thereby creating neurons with non-local receptive fields. The graph convolution operation generalizes the classic convolution to arbitrary graphs. In this work, the graph is dynamically computed from similarities among the hidden features of the network, so that the powerful representation learning capabilities of the network are exploited to uncover self-similar patterns. We introduce a lightweight Edge-Conditioned Convolution which addresses vanishing gradient and over-parameterization issues of this particular graph convolution. Extensive experiments show state-of-the-art performance with improved qualitative and quantitative results on both synthetic Gaussian noise and real noise.',
#'32755861': 'Estimating optical flow from successive video frames is one of the fundamental problems in computer vision and image processing. In the era of deep learning, many methods have been proposed to use convolutional neural networks (CNNs) for optical flow estimation in an unsupervised manner. However, the performance of unsupervised optical flow approaches is still unsatisfactory and often lagging far behind their supervised counterparts, primarily due to over-smoothing across motion boundaries and occlusion. To address these issues, in this paper, we propose a novel method with a new post-processing term and an effective loss function to estimate optical flow in an unsupervised, end-to-end learning manner. Specifically, we first exploit a CNN-based non-local term to refine the estimated optical flow by removing noise and decreasing blur around motion boundaries. This is implemented via automatically learning weights of dependencies over a large spatial neighborhood. Because of its learning ability, the method is effective for various complicated image sequences. Secondly, to reduce the influence of occlusion, a symmetrical energy formulation is introduced to detect the occlusion map from refined bi-directional optical flows. Then the occlusion map is integrated to the loss function. Extensive experiments are conducted on challenging datasets, i.e. FlyingChairs, MPI-Sintel and KITTI to evaluate the performance of the proposed method. The state-of-the-art results demonstrate the effectiveness of our proposed method.',
#'32755863': '3D models are commonly used in computer vision and graphics. With the wider availability of mesh data, an efficient and intrinsic deep learning approach to processing 3D meshes is in great need. Unlike images, 3D meshes have irregular connectivity, requiring careful design to capture relations in the data. To utilize the topology information while staying robust under different triangulations, we propose to encode mesh connectivity using Laplacian spectral analysis, along with mesh feature aggregation blocks (MFABs) that can split the surface domain into local pooling patches and aggregate global information amongst them. We build a mesh hierarchy from fine to coarse using Laplacian spectral clustering, which is flexible under isometric transformations. Inside the MFABs there are pooling layers to collect local information and multi-layer perceptrons to compute vertex features of increasing complexity. To obtain the relationships among different clusters, we introduce a Correlation Net to compute a correlation matrix, which can aggregate the features globally by matrix multiplication with cluster features. Our network architecture is flexible enough to be used on meshes with different numbers of vertices. We conduct several experiments including shape segmentation and classification, and our method outperforms state-of-the-art algorithms for these tasks on the ShapeNet and COSEG datasets.',
# '32755867': 'Neural networks (NNs) are effective machine learning models that require significant hardware and energy consumption in their computing process. To implement NNs, stochastic computing (SC) has been proposed to achieve a tradeoff between hardware efficiency and computing performance. In an SC NN, hardware requirements and power consumption are significantly reduced by moderately sacrificing the inference accuracy and computation speed. With recent developments in SC techniques, however, the performance of SC NNs has substantially been improved, making it comparable with conventional binary designs yet by utilizing less hardware. In this article, we begin with the design of a basic SC neuron and then survey different types of SC NNs, including multilayer perceptrons, deep belief networks, convolutional NNs, and recurrent NNs. Recent progress in SC designs that further improve the hardware efficiency and performance of NNs is subsequently discussed. The generality and versatility of SC NNs are illustrated for both the training and inference processes. Finally, the advantages and challenges of SC NNs are discussed with respect to binary counterparts.',
# '32755873': 'People with Type 1 diabetes (T1D) require regular exogenous infusion of insulin to maintain their blood glucose concentration in a therapeutically adequate target range. Although the artificial pancreas and continuous glucose monitoring have been proven to be effective in achieving closed-loop control, significant challenges still remain due to the high complexity of glucose dynamics and limitations in the technology. In this work, we propose a novel deep reinforcement learning model for single-hormone (insulin) and dual-hormone (insulin and glucagon) delivery. In particular, the delivery strategies are developed by double Q-learning with dilated recurrent neural networks. For designing and testing purposes, the FDA-accepted UVA/Padova Type 1 simulator was employed. First, we performed long-term generalized training to obtain a population model. Then, this model was personalized with a small data-set of subject-specific data. In silico results show that the single and dual-hormone delivery strategies achieve good glucose control when compared to a standard basal-bolus therapy with low-glucose insulin suspension. Specifically, in the adult cohort (n=10), percentage time in target range [70, 180] mg/dL improved from 77.6% to 80.9% with single-hormone control, and to 85.6% with dual-hormone control. In the adolescent cohort (n=10), percentage time in target range improved from 55.5% to 65.9% with single-hormone control, and to 78.8% with dual-hormone control. In all scenarios, a significant decrease in hypoglycemia was observed. These results show that the use of deep reinforcement learning is a viable approach for closed-loop glucose control in T1D.',
# '32756018': "Identifying neural activity biomarkers of brain disease is essential to provide objective estimates of disease burden, obtain reliable feedback regarding therapeutic efficacy, and potentially to serve as a source of control for closed-loop neuromodulation. In Parkinson's Disease (PD), microelectrode recordings (MER) are routinely performed in the basal ganglia to guide electrode implantation for deep brain stimulation (DBS). While pathologically-excessive oscillatory activity has been observed and linked to PD motor dysfunction broadly, the extent to which these signals provide quantitative information about disease expression and fluctuations, particularly at short timescales, is unknown. Furthermore, the degree to which informative signal features are similar or different across patients has not been rigorously investigated. We sought to determine the extent to which motor error in PD across patients can be decoded on a rapid timescale using spectral features of neural activity.",
# '32756365': 'Railway inspection has always been a critical task to guarantee the safety of the railway transportation. The development of deep learning technologies brings new breakthroughs in the accuracy and speed of image-based railway inspection application. In this work, a series of one-stage deep learning approaches, which are fast and accurate at the same time, are proposed to inspect the key components of railway track including rail, bolt, and clip. The inspection results show that the enhanced model, the second version of you only look once (YOLOv2), presents the best component detection performance with 93% mean average precision (mAP) at 35 image per second (IPS), whereas the feature pyramid network (FPN) based model provides a smaller mAP and much longer inference time. Besides, the detection performances of more deep learning approaches are evaluated under varying input sizes, where larger input size usually improves the detection accuracy but results in a longer inference time. Overall, the YOLO series models could achieve faster speed under the same detection accuracy.',
# '32756582': 'Moving towards a horizontal and vertical integrated curriculum, Work-Station Learning Activities (WSLA) were designed and implemented as a new learning instrument. Here, we aim to evaluate whether and how this specific learning model affects academic performance. To better understand how it is received by medical students, a mixed methods research study was conducted.',
# '32757235': 'Deep learning for medical imaging analysis uses convolutional neural networks pretrained on ImageNet (Stanford Vision Lab, Stanford, CA). Little is known about how such color- and scene-rich standard training images compare quantitatively to medical images. We sought to quantitatively compare ImageNet images to point-of-care ultrasound (POCUS), computed tomographic (CT), magnetic resonance (MR), and chest x-ray (CXR) images.',
# '32757455': 'To develop a computer-aided diagnosis (CAD) system for distinguishing malignant from benign pulmonary nodules on computed tomography (CT) scans, and to assess whether the diagnostic performance of radiologists with different experiences can be improved with the assistant of CAD.'}

The dictionary responses_abst now holds the abstract for each pmid.
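
To see what the extraction step does without calling the API, here is a small offline sketch. It is written under several assumptions: SAMPLE_XML and EMPTY_XML are hand-written, heavily simplified stand-ins for an EFetch response (not real records), extract_abstract is a hypothetical helper, and the standard library's xml.etree.ElementTree is used instead of lxml so the snippet runs without extra installs.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified stand-in for an EFetch response (not a real record)
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000001</PMID>
      <Article>
        <Abstract>
          <AbstractText Label="BACKGROUND">First part.</AbstractText>
          <AbstractText Label="RESULTS">Second <i>part</i>.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

# Same structure, but for a paper that has no abstract
EMPTY_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000002</PMID>
      <Article></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def extract_abstract(xml_text):
    """Return (pmid, abstract_text) from one response (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    pmid = root.findtext('.//PMID')
    # findall returns an empty list when the record has no abstract
    sections = root.findall('.//Abstract/AbstractText')
    # itertext() also collects text inside nested tags such as <i>
    parts = [''.join(section.itertext()).strip() for section in sections]
    return pmid, ' '.join(parts)


print(extract_abstract(SAMPLE_XML))
print(extract_abstract(EMPTY_XML))
```

Unlike the loop above, this sketch joins the sections of a structured abstract with a space, which is a design choice of the sketch rather than something the original code does.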

Putting everything together

I have wrapped all the steps so far (eSearch → eSummary → eFetch) into functions, shown below.
The code is slightly modified from the earlier posts.

Defining the functions

import requests
import pandas as pd
import json
from lxml import etree
import time

def eSearch(term, retmax=10):
    URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmode=json'
    option = '&retmax='+str(retmax)+'&term='+term
    query = URL + option
    response = requests.get(query)
    response_json = response.json()
    pmids = response_json['esearchresult']['idlist']
    return pmids

def eSummary(pmids):
    URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&retmode=json&id='
    queries = [URL + pmid for pmid in pmids]
    responses = {}
    for query in queries:
        response = requests.get(query)
        res_json = response.json()['result']
        responses.update(res_json)
        time.sleep(0.2)

    Summaries = [{'pmid':pmid, 
                  'Title':responses[pmid]['title'], 
                  'Author':responses[pmid]['sortfirstauthor'], 
                  'Journal' : responses[pmid]['source'],
                  'Pubdate':responses[pmid]['epubdate']} for pmid in pmids]
    summary_df = pd.DataFrame(Summaries)
    return summary_df

def eFetch(pmids):
    URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&id='

    queries = [URL + pmid for pmid in pmids]

    responses_abst = {}

    for query in queries:
        response = requests.get(query)
        root = etree.fromstring(response.content)
        pmid = root.find('.//PMID').text  # extract the pmid
        abst = root.findall('.//AbstractText')
        if not abst:  # findall returns an empty list, not None, when nothing matches
            abst_text = ''
        else:
            abst_text = ''.join(root.xpath('//Abstract//*/text()'))
        responses_abst[pmid] = abst_text
        time.sleep(0.2)

    # Build the DataFrame once, after all abstracts have been collected
    abst_df = pd.DataFrame.from_dict(responses_abst, orient='index', columns=['Abstract'])
    abst_df.index.name = 'pmid'

    return abst_df
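
The eFetch function above sends one request per PMID. The E-utilities id parameter also accepts a comma-separated list of IDs, so for larger searches the requests could be grouped. The following is only a sketch of the URL-building part: batch_queries is a hypothetical helper, and the default batch size of 100 is an arbitrary value chosen for illustration, not a documented limit.

```python
EFETCH_URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&id='


def batch_queries(pmids, batch_size=100):
    """Return one EFetch URL per batch of PMIDs (hypothetical helper)."""
    queries = []
    for start in range(0, len(pmids), batch_size):
        batch = pmids[start:start + batch_size]
        # PMIDs within one request are joined with commas
        queries.append(EFETCH_URL + ','.join(batch))
    return queries


example_pmids = ['32755859', '32755861', '32755863']
for url in batch_queries(example_pmids, batch_size=2):
    print(url)
```

Note that a batched response contains several PubmedArticle elements, so the parsing loop would need to iterate over them (for example with root.findall('.//PubmedArticle')) and extract the PMID and abstract from each one separately.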

Running it

We run the functions defined above and finally combine everything into a single DataFrame.

term = 'deep%20learning'
# term is the phrase to search for in PubMed. "%20" is a URL-encoded space.

pmids = eSearch(term)
# First get the PMIDs with eSearch. The default is up to 10; to change it, pass e.g. retmax=100.

summary_df = eSummary(pmids)
# Fetch the basic information for each paper; it is returned as a pandas DataFrame.

abst_df = eFetch(pmids)
# Then fetch the abstracts with eFetch; these are also returned as a pandas DataFrame.

df = pd.merge(summary_df, abst_df, on='pmid')
# Merge the summaries and abstracts into a single DataFrame.
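
The search term above has its space written as %20 by hand. For longer queries it may be easier to let the standard library do the encoding. A minimal sketch, where encode_term is a hypothetical helper name:

```python
from urllib.parse import quote


def encode_term(term):
    """Percent-encode a PubMed search phrase for use in a URL (hypothetical helper)."""
    return quote(term)


print(encode_term('deep learning'))
```

The result can be passed to eSearch in place of the hand-encoded string, for example eSearch(encode_term('deep learning')).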

If needed, you can save this as an Excel file.
For the conversion to Excel, please refer to the earlier post in this series.