今回は文部科学省のサイトにある『日本食品標準成分表2010』のPDFファイルから,約1800種類に及ぶ食品毎の栄養組成を抽出し,txtファイルに変換します.
資源調査分科会報告「日本食品標準成分表2010」について
以下の流れで処理を行います.
- PDFファイルのダウンロード
- テキストをコピーする
- EXCELに貼付ける
- VBAによる処理
1. PDFファイルのダウンロード
上記リンク先にあるPDFファイルをダウンロードします.ファイル名は1299012_1.pdfから1299012_18.pdfまでです.
2. テキストをコピーする
ドキュメント内の全テキストを選択するには,表示/ページ表示/単一ページ以外にして下さい.その上で『全てを選択』してコピーします.単一ページ表示ですと,表示しているページだけのテキストが選択されます.
テキストの選択およびコピー
3. EXCELに貼付ける
貼り付けのオプション/テキストファイルウィザードを使用を選択します.テキストファイルウィザード1/3では元のデータ形式で『カンマやタブなどの区切り文字によってフィールドごとに区切られたデータ』を選択します.テキストファイルウィザード2/3では特に変更なく次へ進みます.テキストファイルウィザード3/3では最初のカラムの列のデータ形式のみ『文字列』に変更して完了をクリックします.
4. VBAによる処理
Alt+F11キーを押下してVBEを起動します.挿入メニューから標準モジュールを選択し,下記のコードを貼り付けます.
Option Explicit
Sub Procedure1()
Dim mySht As Worksheet
Dim myAr1 As Variant
Dim myAr2 As Variant
Dim i As Long
Dim j As Long
Dim k As Long
Dim myAr(1878, 53) As String
Dim RegEx1 As Object
Dim Match1 As Object
Dim Matches1 As Object
Dim strPtn1 As String
Dim tempStr As String
Dim RegEx2 As Object
Dim Match2 As Object
Dim Matches2 As Object
Dim strPtn2 As String
Set mySht = ActiveSheet
Set myAr1 = Application.Intersect(mySht.UsedRange, mySht.Range("A:B"))
Set myAr2 = mySht.UsedRange
Set RegEx1 = CreateObject("VBScript.RegExp")
Set RegEx2 = CreateObject("VBScript.RegExp")
strPtn2 = "^[0-9]+(\.[0-9]*)?"
i = 0
For k = 1 To myAr1.Rows.Count
If Len(myAr1.Cells(k, 1).Value) = 5 And _
myAr1.Cells(k, 1).Value >= "01001" And _
myAr1.Cells(k, 1).Value <= "18016" And _
myAr1.Cells(k, 2) <> "(欠番)" Then
myAr(i, 0) = myAr2.Cells(k, 1)
strPtn1 = "^[0-9]+$"
RegEx1.Pattern = strPtn1
RegEx1.IgnoreCase = True
RegEx1.Global = True
Set Matches1 = RegEx1.Execute(myAr2.Cells(k, 2))
If Matches1.Count > 0 Then
tempStr = myAr2.Cells(k, 2) & myAr2.Cells(k, 3)
strPtn1 = "[^0-9][0-9]+$"
RegEx1.Pattern = strPtn1
RegEx1.IgnoreCase = True
RegEx1.Global = True
Set Matches1 = RegEx1.Execute(tempStr)
If Matches1.Count > 0 Then
Set Match1 = Matches1.Item(Matches1.Count - 1)
myAr(i, 1) = Left(tempStr, Match1.firstindex + 1)
myAr(i, 2) = Mid(tempStr, Match1.firstindex + 2)
For j = 3 To 52
Select Case True
Case myAr2.Cells(k, j + 1).Value = "-"
myAr(i, j) = 0
Case myAr2.Cells(k, j + 1).Value = "Tr"
myAr(i, j) = 0
Case myAr2.Cells(k, j + 1).Value = "(Tr)"
myAr(i, j) = 0
Case myAr2.Cells(k, j + 1).Value = "(0)"
myAr(i, j) = 0
Case Else
myAr(i, j) = myAr2.Cells(k, j + 1).Value
End Select
Next j
RegEx2.Pattern = strPtn2
RegEx2.IgnoreCase = True
RegEx2.Global = True
Set Matches2 = RegEx2.Execute(myAr2.Cells(k, 54))
On Error Resume Next
Set Match2 = Matches2.Item(0)
On Error GoTo 0
myAr(i, 53) = Match2.Value
Else
For j = 1 To 52
Select Case True
Case myAr2.Cells(k, j + 1).Value = "-"
myAr(i, j) = 0
Case myAr2.Cells(k, j + 1).Value = "Tr"
myAr(i, j) = 0
Case myAr2.Cells(k, j + 1).Value = "(Tr)"
myAr(i, j) = 0
Case myAr2.Cells(k, j + 1).Value = "(0)"
myAr(i, j) = 0
Case Else
myAr(i, j) = myAr2.Cells(k, j + 1).Value
End Select
Next j
RegEx2.Pattern = strPtn2
RegEx2.IgnoreCase = True
RegEx2.Global = True
Set Matches2 = RegEx2.Execute(myAr2.Cells(k, 54))
On Error Resume Next
Set Match2 = Matches2.Item(0)
On Error GoTo 0
myAr(i, 53) = Match2.Value
End If
Else
strPtn1 = "[^0-9][0-9]+$"
RegEx1.Pattern = strPtn1
RegEx1.IgnoreCase = True
RegEx1.Global = True
Set Matches1 = RegEx1.Execute(myAr2.Cells(k, 2))
If Matches1.Count > 0 Then
Set Match1 = Matches1.Item(Matches1.Count - 1)
myAr(i, 1) = Left(myAr2.Cells(k, 2), Match1.firstindex + 1)
myAr(i, 2) = Mid(myAr2.Cells(k, 2), Match1.firstindex + 2)
For j = 3 To 52
Select Case True
Case myAr2.Cells(k, j).Value = "-"
myAr(i, j) = 0
Case myAr2.Cells(k, j).Value = "Tr"
myAr(i, j) = 0
Case myAr2.Cells(k, j).Value = "(Tr)"
myAr(i, j) = 0
Case myAr2.Cells(k, j).Value = "(0)"
myAr(i, j) = 0
Case Else
myAr(i, j) = myAr2.Cells(k, j).Value
End Select
Next j
RegEx2.Pattern = strPtn2
RegEx2.IgnoreCase = True
RegEx2.Global = True
Set Matches2 = RegEx2.Execute(myAr2.Cells(k, 53))
On Error Resume Next
Set Match2 = Matches2.Item(0)
On Error GoTo 0
myAr(i, 53) = Match2.Value
Else
For j = 1 To 52
Select Case True
Case myAr2.Cells(k, j + 1).Value = "-"
myAr(i, j) = 0
Case myAr2.Cells(k, j + 1).Value = "Tr"
myAr(i, j) = 0
Case myAr2.Cells(k, j + 1).Value = "(Tr)"
myAr(i, j) = 0
Case myAr2.Cells(k, j + 1).Value = "(0)"
myAr(i, j) = 0
Case Else
myAr(i, j) = myAr2.Cells(k, j + 1).Value
End Select
Next j
RegEx2.Pattern = strPtn2
RegEx2.IgnoreCase = True
RegEx2.Global = True
Set Matches2 = RegEx2.Execute(myAr2.Cells(k, 54))
On Error Resume Next
Set Match2 = Matches2.Item(0)
On Error GoTo 0
myAr(i, 53) = Match2.Value
End If
End If
Else
i = i - 1
End If
i = i + 1
Next k
Set mySht = Worksheets.Add
With mySht
.Name = myAr2.Cells(1, 1).Value & myAr2.Cells(1, 2).Value
.Range("A1:BB1878").Value = myAr
End With
Set mySht = Nothing
Set myAr1 = Nothing
Set myAr2 = Nothing
Set RegEx1 = Nothing
Set Match1 = Nothing
Set Matches1 = Nothing
Set RegEx2 = Nothing
Set Match2 = Nothing
Set Matches2 = Nothing
End Sub
以上の処理をPDFファイルの数だけ繰り返します.更に追加されたワークシート以外のワークシートを削除し,以下のコードを実行して.txtファイルにすべてのレコードを出力します.
Sub AllSheets_to_TextFile()
Dim myBook As Workbook
Dim mySht As Worksheet
Dim tmpSht As Worksheet
Dim myRng As Range
Dim myAr(1877, 53) As String
Dim tempAr As Variant
Dim i As Long
Dim j As Long
Dim k As Long
Dim GOF As Variant
Dim RegExp As Object
Dim Matches As Object
Dim Match As Object
Dim strPtn As String
Set RegExp = CreateObject("VBScript.RegExp")
strPtn = "\\"
k = 0
For Each tmpSht In Worksheets
Set myRng = tmpSht.Range("A1").CurrentRegion
tempAr = myRng
For i = LBound(tempAr) To UBound(tempAr)
For j = LBound(tempAr, 2) To UBound(tempAr, 2)
myAr(k, j - 1) = tempAr(i, j)
Next j
k = k + 1
Next i
Next tmpSht
GOF = Application.GetOpenFilename(FileFilter:="PDF file,*.pdf", _
Title:="Select PDF file", _
MultiSelect:=False)
If TypeName(GOF) = "Boolean" Then Exit Sub
GOF = Left(GOF, Len(GOF) - 4) & ".txt"
With RegExp
.Pattern = strPtn
.IgnoreCase = True
.Global = True
End With
Set Matches = RegExp.Execute(GOF)
GOF = Left(GOF, Matches.Item(Matches.Count - 1).firstindex) & "\M_FOODS.txt"
Set mySht = Worksheets.Add
With mySht
.Name = "M_FOODS"
.Range("A1:BB1878") = myAr
.Move
End With
ActiveWorkbook.SaveAs Filename:=GOF, _
FileFormat:=xlText, _
CreateBackup:=False
Set myBook = ActiveWorkbook
Application.DisplayAlerts = False
myBook.Close
Application.DisplayAlerts = True
Set Match = Nothing
Set Matches = Nothing
Set RegExp = Nothing
Set myRng = Nothing
Set mySht = Nothing
Set tmpSht = Nothing
Set myBook = Nothing
End Sub
なお,文部科学省は著作権を理由にデータの複製を行う際には連絡するよう連絡先を示しています.この記事を書いた2011年11月26日時点ではまだ連絡しておりませんので,やり方だけ公開します.