class PDF::Reader::Page

A high-level representation of a single PDF page. It ties together the various low-level classes in PDF::Reader and provides access to the components of the page (text, images, fonts, etc.) in convenient formats.

If you require access to the raw PDF objects for this page, you can access the Page dictionary via the page_object accessor. You will need to use the objects accessor to help walk the page dictionary in any useful way.

Attributes

cache[R]

A Hash-like object for storing cached data. Generally this is scoped to the current document and is used to avoid repeating expensive operations.

objects[R]

Low-level, Hash-like access to all objects in the underlying PDF.

page_object[R]

The raw PDF object that defines this page.

Public Class Methods

new(objects, pagenum, options = {})

Creates a new page wrapper.

  • objects - an ObjectHash instance that wraps a PDF file

  • pagenum - an integer specifying the page number to expose, 1-indexed

  • options - an optional Hash; the :cache key supplies the cache object, which otherwise defaults to an empty Hash

# File lib/pdf/reader/page.rb, line 44
def initialize(objects, pagenum, options = {})
  @objects, @pagenum = objects, pagenum
  @page_object = objects.deref_hash(objects.page_references[pagenum - 1])
  @cache       = options[:cache] || {}

  unless @page_object.is_a?(::Hash)
    raise InvalidPageError, "Invalid page: #{pagenum}"
  end
end
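
The 1-indexed lookup and the validity check in the constructor can be sketched without the library. This is a minimal illustration, not the library's code: lookup_page, the local InvalidPageError class, and the plain Array standing in for objects.page_references are all assumptions made for the example.

```ruby
# Hypothetical stand-in for the library's error class.
class InvalidPageError < ArgumentError; end

# page_refs stands in for objects.page_references: one entry per page, in order.
def lookup_page(page_refs, pagenum)
  page = page_refs[pagenum - 1] # page 1 lives at index 0
  raise InvalidPageError, "Invalid page: #{pagenum}" unless page.is_a?(Hash)
  page
end

pages = [{ Type: :Page, Label: "first" }, { Type: :Page, Label: "second" }]
first = lookup_page(pages, 1)
```

Asking this sketch for page 3 of the two-page list raises InvalidPageError, because the lookup yields nil rather than a Hash.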

Public Instance Methods

attributes()

Returns the attributes that accompany this page, including attributes inherited from parents.

# File lib/pdf/reader/page.rb, line 69
def attributes
  @attributes ||= {}.tap { |hash|
    page_with_ancestors.reverse.each do |obj|
      hash.merge!(@objects.deref_hash(obj) || {})
    end
  }
  # This shouldn't be necessary, but some non-compliant PDFs leave MediaBox
  # out. Assuming 8.5" x 11" is what Acrobat does, so we do it too.
  @attributes[:MediaBox] ||= [0,0,612,792]
  @attributes
end
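
The merge order is what makes inheritance work: dictionaries are applied from the top of the page tree down, so the page's own entries override anything inherited. A minimal sketch with plain hashes follows; merged_attributes and the sample dictionaries are hypothetical and only mirror the logic shown above.

```ruby
# Merge dictionaries from the most distant ancestor down to the page itself,
# so that entries closer to the page overwrite inherited ones.
def merged_attributes(page_with_ancestors)
  merged = {}
  page_with_ancestors.reverse.each { |dict| merged.merge!(dict) }
  merged[:MediaBox] ||= [0, 0, 612, 792] # same 8.5" x 11" fallback as the source
  merged
end

page   = { Rotate: 90 }
parent = { Rotate: 0, Resources: { Font: {} } }
root   = { MediaBox: [0, 0, 595, 842] }

# The list is ordered page first, then parent, then root.
attrs = merged_attributes([page, parent, root])
```

Here attrs[:Rotate] comes from the page, attrs[:Resources] from the parent, and attrs[:MediaBox] from the root.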

boxes()

Returns the “boxes” that define the page object, each converted to an array with Rectangle#to_a. Values are defaulted according to section 7.7.3.3 of the PDF Spec 1.7.

DEPRECATED. Use Page#rectangles instead.

# File lib/pdf/reader/page.rb, line 191
def boxes
  # In ruby 2.4+ we could use Hash#transform_values
  Hash[rectangles.map{ |k,rect| [k,rect.to_a] } ]
end
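
The comment in the source mentions Hash#transform_values as an alternative on Ruby 2.4 and later. The sketch below compares the two forms; Rect is a hypothetical Struct standing in for PDF::Reader::Rectangle, chosen only because Struct#to_a returns its member values.

```ruby
# Hypothetical stand-in for PDF::Reader::Rectangle.
Rect = Struct.new(:llx, :lly, :urx, :ury)

rects = {
  MediaBox: Rect.new(0, 0, 612, 792),
  CropBox:  Rect.new(10, 10, 602, 782),
}

# The form used in the source, which also works on Ruby versions before 2.4.
via_map = Hash[rects.map { |key, rect| [key, rect.to_a] }]

# The shorter equivalent available from Ruby 2.4 onwards.
via_transform = rects.transform_values(&:to_a)
```

Both expressions build a new Hash with the same keys and array values, leaving rects untouched.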

height()

Returns the height of the page’s MediaBox, after any page rotation has been applied.

# File lib/pdf/reader/page.rb, line 81
def height
  rect = Rectangle.new(*attributes[:MediaBox])
  rect.apply_rotation(rotate) if rotate > 0
  rect.height
end

inspect()

Returns a friendly string representation of this page.

# File lib/pdf/reader/page.rb, line 62
def inspect
  "<PDF::Reader::Page page: #{@pagenum}>"
end

number()

Returns the number of this page within the full document.

# File lib/pdf/reader/page.rb, line 56
def number
  @pagenum
end

orientation()

Convenience method to identify the page’s orientation. Returns "portrait" when the page is taller than it is wide, and "landscape" otherwise.

# File lib/pdf/reader/page.rb, line 102
def orientation
  if height > width
    "portrait"
  else
    "landscape"
  end
end
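
The interaction between rotation, width, height and orientation can be modelled in a few lines. This is a simplified sketch that assumes a quarter-turn (90 or 270 degrees) simply swaps the two dimensions; rotated_size and orientation_for are hypothetical helpers, not part of the library.

```ruby
# Simplified model: a 90 or 270 degree rotation swaps width and height.
def rotated_size(media_box, rotate)
  llx, lly, urx, ury = media_box
  width  = (urx - llx).abs
  height = (ury - lly).abs
  [90, 270].include?(rotate) ? [height, width] : [width, height]
end

# Same comparison as the orientation method: only strictly taller pages are portrait.
def orientation_for(media_box, rotate)
  width, height = rotated_size(media_box, rotate)
  height > width ? "portrait" : "landscape"
end

letter = [0, 0, 612, 792]
upright = orientation_for(letter, 0)
turned  = orientation_for(letter, 90)
```

Because the comparison is strict, a square page is reported as "landscape" under this logic.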

origin()

Returns the bottom-left corner of the page’s MediaBox, after any page rotation has been applied.

# File lib/pdf/reader/page.rb, line 93
def origin
  rect = Rectangle.new(*attributes[:MediaBox])
  rect.apply_rotation(rotate) if rotate > 0

  rect.bottom_left
end

raw_content()

Returns the raw content stream for this page. If the page has several content streams, their unfiltered data is joined with a single space. This is plumbing, nothing to see here unless you’re a PDF nerd like me.

# File lib/pdf/reader/page.rb, line 165
def raw_content
  contents = objects.deref_stream_or_array(@page_object[:Contents])
  [contents].flatten.compact.map { |obj|
    objects.deref_stream(obj)
  }.compact.map { |obj|
    obj.unfiltered_data
  }.join(" ")
end
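
rectangles()

Returns the “boxes” that define the page object as Rectangle instances, keyed by box name. Values are defaulted according to section 7.7.3.3 of the PDF Spec 1.7: CropBox falls back to MediaBox, and BleedBox, TrimBox and ArtBox fall back to CropBox.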

# File lib/pdf/reader/page.rb, line 199
def rectangles
  # attributes[:MediaBox] can never be nil, but I have no easy way to tell sorbet that atm
  mediabox = objects.deref_array_of_numbers(attributes[:MediaBox]) || []
  cropbox = objects.deref_array_of_numbers(attributes[:CropBox]) || mediabox
  bleedbox = objects.deref_array_of_numbers(attributes[:BleedBox]) || cropbox
  trimbox = objects.deref_array_of_numbers(attributes[:TrimBox]) || cropbox
  artbox = objects.deref_array_of_numbers(attributes[:ArtBox]) || cropbox

  begin
    mediarect = Rectangle.from_array(mediabox)
    croprect = Rectangle.from_array(cropbox)
    bleedrect = Rectangle.from_array(bleedbox)
    trimrect = Rectangle.from_array(trimbox)
    artrect = Rectangle.from_array(artbox)
  rescue ArgumentError => e
    raise MalformedPDFError, e.message
  end

  if rotate > 0
    mediarect.apply_rotation(rotate)
    croprect.apply_rotation(rotate)
    bleedrect.apply_rotation(rotate)
    trimrect.apply_rotation(rotate)
    artrect.apply_rotation(rotate)
  end

  {
    MediaBox: mediarect,
    CropBox: croprect,
    BleedBox: bleedrect,
    TrimBox: trimrect,
    ArtBox: artrect,
  }
end
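
The defaulting chain is easier to see with plain arrays in place of Rectangle objects. The following is a minimal sketch that mirrors only the fallback logic; default_boxes is a hypothetical helper, and rotation and dereferencing are left out.

```ruby
# Apply the same fallbacks as the method above, on plain arrays:
# CropBox defaults to MediaBox; the remaining boxes default to CropBox.
def default_boxes(attributes)
  media = attributes[:MediaBox] || []
  crop  = attributes[:CropBox] || media
  {
    MediaBox: media,
    CropBox:  crop,
    BleedBox: attributes[:BleedBox] || crop,
    TrimBox:  attributes[:TrimBox]  || crop,
    ArtBox:   attributes[:ArtBox]   || crop,
  }
end

only_media = default_boxes(MediaBox: [0, 0, 612, 792])
with_crop  = default_boxes(MediaBox: [0, 0, 612, 792],
                           CropBox:  [10, 10, 602, 782],
                           TrimBox:  [20, 20, 592, 772])
```

With only a MediaBox present, every box ends up equal to it. Once a CropBox is given, BleedBox and ArtBox follow the CropBox while the explicit TrimBox is kept.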

rotate()

Returns the angle to rotate the page clockwise. Always 0, 90, 180 or 270; any other value in the PDF is treated as 0.

# File lib/pdf/reader/page.rb, line 176
def rotate
  value = attributes[:Rotate].to_i
  case value
  when 0, 90, 180, 270
    value
  else
    0
  end
end
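
The same normalisation can be tried on arbitrary inputs with a standalone function. normalize_rotation is a hypothetical name; the logic is a compact restatement of the method above.

```ruby
# Accept only the four right-angle rotations; everything else becomes 0.
def normalize_rotation(raw)
  value = raw.to_i # nil.to_i is 0, so a missing Rotate entry yields 0
  [0, 90, 180, 270].include?(value) ? value : 0
end

samples = [nil, 90, 270, 45, 360, -90].map { |raw| normalize_rotation(raw) }
```

Note that values such as 360 or -90 are not wrapped into the 0 to 270 range; they are treated as 0 like any other unexpected value.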

runs(opts = {})

Returns the runs of text collected from this page by a PageTextReceiver. opts is passed through to PageTextReceiver#runs.

# File lib/pdf/reader/page.rb, line 125
def runs(opts = {})
  receiver = PageTextReceiver.new
  walk(receiver)
  receiver.runs(opts)
end

text(opts = {})

Returns the plain text content of this page encoded as UTF-8. Any characters that can’t be translated will be returned as a ▯.

# File lib/pdf/reader/page.rb, line 113
def text(opts = {})
  receiver = PageTextReceiver.new
  walk(receiver)
  runs = receiver.runs(opts)

  # rectangles[:MediaBox] can never be nil, but I have no easy way to tell sorbet that atm
  mediabox = rectangles[:MediaBox] || Rectangle.new(0, 0, 0, 0)

  PageLayout.new(runs, mediabox).to_s
end

Also aliased as: to_s

to_s(opts = {})

Alias for: text

walk(*receivers)

Processes the raw content stream for this page in sequential order and passes callbacks to the receiver objects.

This is mostly low level and you can probably ignore it unless you need access to something like the raw encoded text. For an example of how this can be used as a basis for higher level functionality, see the text() method.

This method is intended to provide all the data required to faithfully render the entire page, should someone be motivated enough to do so. If you find some required data isn’t available, it’s a bug; let me know.

Many operators that generate callbacks will reference resources stored in the page header - think images, fonts, etc. To facilitate these operators, the first available callback is page=. If your receiver accepts that callback it will be passed the current PDF::Reader::Page object. Use the Page#resources method to grab any required resources.

It may help to think of each page as a self contained program made up of a set of instructions and associated resources. Calling walk() executes the program in the correct order and calls out to your implementation.

# File lib/pdf/reader/page.rb, line 154
def walk(*receivers)
  receivers = receivers.map { |receiver|
    ValidatingReceiver.new(receiver)
  }
  callback(receivers, :page=, [self])
  content_stream(receivers, raw_content)
end
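
A receiver is just an object with methods named after the callbacks it cares about. The sketch below shows the shape of such an object and how callbacks reach it. It runs without the library: dispatch is a hypothetical copy of the private callback helper, :fake_page stands in for a real Page, and the callback names show_text and begin_new_subpath are assumptions used for illustration.

```ruby
# A receiver that records the page it was given and any text it is shown.
class TextCollector
  attr_reader :page, :strings

  def initialize
    @strings = []
  end

  # The first callback sent by walk; receives the page being processed.
  def page=(page)
    @page = page
  end

  def show_text(string)
    @strings << string
  end
end

# Send a callback to every receiver that implements it; others are skipped.
def dispatch(receivers, name, params = [])
  receivers.each do |receiver|
    receiver.send(name, *params) if receiver.respond_to?(name)
  end
end

collector = TextCollector.new
dispatch([collector], :page=, [:fake_page])
dispatch([collector], :show_text, ["Hello"])
dispatch([collector], :begin_new_subpath, [10, 20]) # ignored: not implemented
```

Callbacks a receiver does not implement are silently skipped, so a receiver only needs to define the handful of methods it is interested in.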

width()

Returns the width of the page’s MediaBox, after any page rotation has been applied.

# File lib/pdf/reader/page.rb, line 87
def width
  rect = Rectangle.new(*attributes[:MediaBox])
  rect.apply_rotation(rotate) if rotate > 0
  rect.width
end

Private Instance Methods

ancestors(origin = @page_object[:Parent])

Returns the inheritable entries of each ancestor in the page tree, nearest parent first.

# File lib/pdf/reader/page.rb, line 276
def ancestors(origin = @page_object[:Parent])
  if origin.nil?
    []
  else
    obj = objects.deref_hash(origin)
    if obj.nil?
      raise MalformedPDFError, "parent must not be nil"
    end
    [ select_inheritable(obj) ] + ancestors(obj[:Parent])
  end
end

callback(receivers, name, params=[])

Calls the name callback method on each receiver object that responds to it, with params as the arguments.

# File lib/pdf/reader/page.rb, line 266
def callback(receivers, name, params=[])
  receivers.each do |receiver|
    receiver.send(name, *params) if receiver.respond_to?(name)
  end
end
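
content_stream(receivers, instructions)

Parses instructions, the raw content stream, and sends one callback to the receivers for each operator, passing along the operands that preceded it.
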
# File lib/pdf/reader/page.rb, line 247
def content_stream(receivers, instructions)
  buffer       = Buffer.new(StringIO.new(instructions), :content_stream => true)
  parser       = Parser.new(buffer, @objects)
  params       = []

  while (token = parser.parse_token(PagesStrategy::OPERATORS))
    if token.kind_of?(Token) and PagesStrategy::OPERATORS.has_key?(token)
      callback(receivers, PagesStrategy::OPERATORS[token], params)
      params.clear
    else
      params << token
    end
  end
rescue EOFError
  raise MalformedPDFError, "End Of File while processing a content stream"
end
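
PDF content streams are written in postfix order: operands come first and the operator that consumes them comes last. The loop above relies on that by buffering tokens until an operator appears. A minimal sketch of the same accumulation follows, using a pre-tokenised list instead of the real parser. Operator, DEMO_OPERATORS and replay are hypothetical, and the operator-to-callback mapping shown is an assumption for illustration.

```ruby
# Marks a token as an operator, so string operands are not mistaken for one.
Operator = Struct.new(:name)

# A tiny, hypothetical operator table mapping operators to callback names.
DEMO_OPERATORS = {
  "BT" => :begin_text_object,
  "Tf" => :set_text_font_and_size,
  "Tj" => :show_text,
  "ET" => :end_text_object,
}.freeze

# Buffer operands until an operator arrives, then record the callback
# together with the operands collected so far and start a fresh buffer.
def replay(tokens)
  params = []
  calls  = []
  tokens.each do |token|
    if token.is_a?(Operator) && DEMO_OPERATORS.key?(token.name)
      calls << [DEMO_OPERATORS[token.name], params.dup]
      params.clear
    else
      params << token
    end
  end
  calls
end

tokens = [Operator.new("BT"), :F1, 12, Operator.new("Tf"),
          "Hello", Operator.new("Tj"), Operator.new("ET")]
calls = replay(tokens)
```

Each entry in calls pairs a callback name with its operands, which is the information the real method hands to the receivers one operator at a time.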

page_with_ancestors()
# File lib/pdf/reader/page.rb, line 272
def page_with_ancestors
  [ @page_object ] + ancestors
end

resources()

Returns the resources that accompany this page. Includes resources inherited from parents.

# File lib/pdf/reader/page.rb, line 243
def resources
  @resources ||= Resources.new(@objects, @objects.deref_hash(attributes[:Resources]) || {})
end

root()
# File lib/pdf/reader/page.rb, line 236
def root
  @root ||= objects.deref_hash(@objects.trailer[:Root]) || {}
end

select_inheritable(obj)

Selects the elements from a Pages dictionary that can be inherited by child Page dictionaries.

# File lib/pdf/reader/page.rb, line 291
def select_inheritable(obj)
  ::Hash[obj.select { |key, value|
    [:Resources, :MediaBox, :CropBox, :Rotate, :Parent].include?(key)
  }]
end
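
Filtering and the recursive climb through :Parent combine to produce the ancestor list. The sketch below shows both on nested hashes. It is an illustration only: inheritable and ancestor_chain are hypothetical names, and parents are direct Ruby references here, whereas in a real PDF they are indirect references that must be dereferenced first.

```ruby
INHERITABLE_KEYS = [:Resources, :MediaBox, :CropBox, :Rotate, :Parent].freeze

# Keep only the entries a child page is allowed to inherit.
def inheritable(dict)
  dict.select { |key, _value| INHERITABLE_KEYS.include?(key) }
end

# Collect the inheritable entries of each ancestor, nearest parent first.
def ancestor_chain(dict)
  return [] if dict.nil?
  [inheritable(dict)] + ancestor_chain(dict[:Parent])
end

root   = { Type: :Pages, MediaBox: [0, 0, 612, 792], Count: 2 }
middle = { Type: :Pages, Rotate: 90, Parent: root, Kids: [] }
page   = { Type: :Page, Parent: middle }

chain = ancestor_chain(page[:Parent])
```

The resulting chain has one entry per ancestor, with non-inheritable keys such as :Type, :Kids and :Count dropped.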