PyAuto Desktop Documentation

Welcome to the official documentation for PyAuto Desktop. This library offers high-performance, thread-safe desktop automation tools including computer vision and input control.


Starter Guide

1. The Inspector Tool

PyAuto Desktop includes a GUI tool called the Inspector. Use this to capture screenshots (“needles”), test matches against your screen, and automatically generate the Python code for your script.

# Opens the GUI for snipping, editing, testing, and code generation
pyauto_desktop.inspector()

2. Simple Example

Here is a basic example of locating an image and clicking it.

import pyauto_desktop

# Initialize a session for the primary monitor
session = pyauto_desktop.Session(
    screen=0,
    source_resolution=(2560, 1440),
    source_dpr=1.25,
    scaling_type="dpr"
)

# Find an image and click it
submit_btn = session.locateOnScreen("submit_btn.png", confidence=0.8)
if submit_btn:
    print("Button found!")
    # Pass the match itself; click() accepts a match tuple, not a file path
    session.click(submit_btn)

Screen & Vision

class Session(screen=0, source_resolution=None, source_dpr=None, scaling_type=None, direct_input=False)

The core class that manages screen capture and coordinate translation.

Parameters:

Parameter          Type   Default  Description
screen             int    0        The logical screen index (0 is primary, 1 is secondary, etc.).
source_resolution  tuple  None     The width and height of the monitor where your template images were originally captured.
source_dpr         float  None     The DPI scaling factor of the source monitor. If None, it is auto-detected.
scaling_type       str    None     The strategy for coordinate translation: 'dpr' or 'resolution'. See Scaling Strategies below for details.
direct_input       bool   False    Windows only. Uses 'pydirectinput' (hardware scancodes) instead of 'pynput' (virtual keys). Essential for applications, games, or Citrix/RDP sessions that ignore standard software-simulated input.

Scaling Strategies

The scaling_type determines how coordinates are translated when moving between screens with different properties. The correct choice depends on how the target application renders its UI.

  • ‘dpr’ (DPI Dependent)

    Best for standard Windows applications (e.g., Web Browsers, Office, Discord). These apps respect the Windows “Scale and layout” settings (100%, 125%, 150%). When the scale changes, these apps resize and reposition their UI elements accordingly.

  • ‘resolution’ (Resolution Dependent)

    Best for full-screen applications and games. These applications generally ignore Windows DPI scaling and render 1:1 with the screen’s pixel resolution.

Tip

How to choose the right type

To determine if your target application is DPR or Resolution dependent, perform this simple test:

  1. Open the application.

  2. Go to Windows Display Settings -> Scale and layout.

  3. Change the scaling percentage (e.g., from 100% to 125%).

If the application window or UI elements physically resize or move on your screen, use 'dpr'. If the application UI looks exactly the same (common in full-screen games), use 'resolution'.
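
The difference between the two strategies can be sketched as plain arithmetic. The helper below is an illustration only, not the library's internal implementation: the names scale_factors and translate_point are hypothetical, and it assumes that 'dpr' scales by the ratio of DPI factors while 'resolution' scales by the ratio of pixel dimensions.

```python
def scale_factors(scaling_type, source_resolution, target_resolution,
                  source_dpr, target_dpr):
    """Return the (x, y) factors that map source pixels to target pixels.

    Assumption: 'dpr' follows the DPI ratio, 'resolution' follows the
    pixel-dimension ratio. The real library may compute this differently.
    """
    if scaling_type == "dpr":
        ratio = target_dpr / source_dpr
        return (ratio, ratio)
    if scaling_type == "resolution":
        return (target_resolution[0] / source_resolution[0],
                target_resolution[1] / source_resolution[1])
    raise ValueError(f"unknown scaling_type: {scaling_type!r}")


def translate_point(point, factors):
    """Scale an (x, y) point and round it to whole pixels."""
    return (round(point[0] * factors[0]), round(point[1] * factors[1]))


# Templates captured at 2560x1440 with 125% scaling, replayed on 1920x1080 at 100%
dpr = scale_factors("dpr", (2560, 1440), (1920, 1080), 1.25, 1.0)
res = scale_factors("resolution", (2560, 1440), (1920, 1080), 1.25, 1.0)
print(translate_point((1000, 500), dpr))  # (800, 400)
print(translate_point((1000, 500), res))  # (750, 375)
```

The two strategies give different answers for the same pair of monitors, which is why picking the wrong one tends to show up as clicks that drift further off the closer they are to the bottom-right corner.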

locateOnScreen(image, region=None, grayscale=False, confidence=0.9, source_resolution=None, time_out=0, downscale=3, use_pyramid=True)

Locates the single best instance of an image on the screen.

Parameters:

Parameter          Type          Default   Description
image              str | object  Required  File path to the template image or a PIL image object.
region             tuple         None      A bounding box (x, y, w, h) to limit the search area.
grayscale          bool          False     If True, converts images to grayscale. Faster, but ignores color information.
confidence         float         0.9       Match strictness (0.0 to 1.0). Higher is stricter.
source_resolution  tuple         None      Overrides the session source resolution for this specific search.
source_dpr         float         None      Overrides the session source DPR for this specific search.
scaling_type       str           None      Overrides the session scaling type for this specific search.
time_out           float         0         Maximum seconds to keep retrying if the image is not found immediately.
downscale          int           3         Scale factor for the initial pyramid search pass (higher is faster).
use_pyramid        bool          True      Whether to use multi-scale pyramid optimization.

Example:

start_button = session.locateOnScreen(image='images/start_button.png', region=(100, 200, 400, 500), grayscale=True, confidence=0.85, time_out=3)
if start_button:
    print("start_button found")
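
The time_out parameter waits for an image to appear. Waiting for something to disappear (a loading spinner, for example) needs a loop in your own script. The sketch below is one way to write it; wait_until_gone is a hypothetical helper, not part of the library, and it takes any zero-argument callable so it can wrap a locateOnScreen call.

```python
import time


def wait_until_gone(locate, time_out=5.0, poll_interval=0.2):
    """Poll `locate()` until it returns a falsy value or the timeout expires.

    Returns True if the target disappeared, False if it was still
    present when the timeout ran out.
    """
    deadline = time.monotonic() + time_out
    while True:
        if not locate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


# Intended use (requires a live session, so it is left commented out):
# gone = wait_until_gone(lambda: session.locateOnScreen("spinner.png"), time_out=10)
```

time.monotonic() is used rather than time.time() so that a system clock adjustment cannot shorten or extend the wait.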

locateAllOnScreen(image, region=None, grayscale=False, confidence=0.9, overlap_threshold=0.5, scaling_type=None, source_resolution=None, time_out=0, downscale=3, use_pyramid=True)

Locates all instances of an image on the screen.

Parameters:

Parameter          Type          Default   Description
image              str | object  Required  File path to the template image or a PIL image object.
region             tuple         None      A bounding box (x, y, w, h) to limit the search area.
grayscale          bool          False     If True, converts images to grayscale. Faster, but ignores color information.
confidence         float         0.9       Match strictness (0.0 to 1.0). Higher is stricter.
overlap_threshold  float         0.5       Maximum allowed overlap (0.0 to 1.0) between found matches before they are merged.
source_resolution  tuple         None      Overrides the session source resolution for this specific search.
source_dpr         float         None      Overrides the session source DPR for this specific search.
scaling_type       str           None      Overrides the session scaling type for this specific search.
time_out           float         0         Maximum seconds to keep retrying if the image is not found immediately.
downscale          int           3         Scale factor for the initial pyramid search pass (higher is faster).
use_pyramid        bool          True      Whether to use multi-scale pyramid optimization.

Returns: A list of tuples [(x, y, w, h), ...].
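
To get a feel for what overlap_threshold does, here is a minimal sketch of overlap filtering on (x, y, w, h) tuples. It assumes overlap is measured as intersection-over-union and that the earlier match is the one kept; the library may use a different measure or merge rule. The names iou and drop_overlaps are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes, from 0.0 to 1.0."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = min(ax + aw, bx + bw) - max(ax, bx)
    inter_h = min(ay + ah, by + bh) - max(ay, by)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # the boxes do not intersect
    inter = inter_w * inter_h
    return inter / (aw * ah + bw * bh - inter)


def drop_overlaps(matches, overlap_threshold=0.5):
    """Keep a match only if it overlaps every kept match by at most the threshold."""
    kept = []
    for match in matches:
        if all(iou(match, other) <= overlap_threshold for other in kept):
            kept.append(match)
    return kept


matches = [(100, 100, 50, 20), (102, 101, 50, 20), (300, 100, 50, 20)]
print(drop_overlaps(matches))  # [(100, 100, 50, 20), (300, 100, 50, 20)]
```

The second match sits almost on top of the first, so it is treated as the same button. A lower threshold merges more aggressively; a higher one lets closely packed items survive as separate matches.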

Example:

start_buttons = session.locateAllOnScreen(image='images/start_button.png', region=(100, 200, 400, 500), grayscale=True, confidence=0.85, time_out=3)
for start_button in start_buttons:
    print(f'Found start_button at {start_button}')

locateAny(tasks, time_out=0)

Searches for a list of different images and returns the first one found. Useful for detecting application states (e.g., loading finished).

Parameters:

Parameter  Type        Default   Description
tasks      list[dict]  Required  A list of task dictionaries. Each dict must have an image key (str or object). Optional keys: label, confidence, grayscale, region, downscale.
time_out   float       0         Maximum seconds to keep retrying if the image is not found immediately.

Returns: A tuple (label, match_tuple) or None.
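
Because the return value is either a (label, match) tuple or None, state handling can be written as a small dispatch table instead of a chain of if/elif branches. The helper below is a sketch; handle_state and the handler functions are hypothetical names, not part of the library.

```python
def handle_state(result, handlers, default=None):
    """Call the handler registered for the label in a (label, match) result.

    Returns `default` when nothing was found (result is None) or when
    no handler is registered for the label.
    """
    if result is None:
        return default
    label, match = result
    handler = handlers.get(label)
    if handler is None:
        return default
    return handler(match)


handlers = {
    "connected": lambda match: f"start automation at {match}",
    "disconnected": lambda match: f"reconnect via {match}",
}
print(handle_state(("connected", (10, 20, 30, 40)), handlers))
print(handle_state(None, handlers, default="nothing on screen yet"))
```

Keeping labels in one dictionary also avoids the easy mistake of spelling a label one way in the task list and another way in the branch that checks it.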

Example:

tasks = [
    {'label': 'connected', 'task': dict(image='images/connected.png', confidence=0.8, grayscale=False)},
    {'label': 'disconnected', 'task': dict(image='images/disconnected.png', confidence=0.8, grayscale=False)},
]
result = session.locateAny(tasks)
if result:
    label, match = result
    if label == 'connected':
        pass  # start automation
    elif label == 'disconnected':
        pass  # reconnect

locateAll(tasks, time_out=0)

Searches for multiple different images simultaneously and returns all matches for every image found.

Parameters:

Parameter  Type        Default   Description
tasks      list[dict]  Required  List of task dictionaries defining what images to look for.
time_out   float       0         Maximum seconds to keep retrying if the image is not found immediately.

Returns: A dictionary {label: (match)}.

Example:

tasks = [
    {'label': 'low_health', 'task': dict(image='images/low_health.png', confidence=0.8, grayscale=False)},
    {'label': 'low_mana', 'task': dict(image='images/low_mana.png', confidence=0.8, grayscale=False)},
]
result = session.locateAll(tasks)
# The result is keyed by label, so look each label up instead of unpacking it
if result.get('low_health'):
    pass  # use health potion
if result.get('low_mana'):
    pass  # use mana potion

read_text(region=None, mode='clean', use_det=False)

Captures the screen region and returns found text lines using Windows Native OCR.

Parameters:

Parameter  Type   Default  Description
region     tuple  None     A bounding box (x, y, w, h) to limit the OCR area.
mode       str    'clean'  Post-processing mode for the text.
use_det    bool   False    Whether to use additional detection models.

Returns: A list of strings containing the text lines found.
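
OCR output rarely matches a string exactly, so scripts usually search the returned lines rather than compare them for equality. A minimal sketch of that step follows; find_lines is a hypothetical helper that works on any list of strings, such as the one read_text returns.

```python
def find_lines(lines, keyword):
    """Return the lines that contain `keyword`, ignoring case and outer whitespace."""
    needle = keyword.casefold()
    return [line.strip() for line in lines if needle in line.casefold()]


ocr_lines = ["  Status: READY ", "Loading assets...", "ready to connect"]
print(find_lines(ocr_lines, "ready"))  # ['Status: READY', 'ready to connect']
```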

get_pixel(x, y)

Returns the RGB color of the pixel at the specified coordinates relative to the current screen.

Parameters:

Parameter  Type  Default   Description
x          int   Required  The X-coordinate of the pixel.
y          int   Required  The Y-coordinate of the pixel.

Returns: A tuple of (R, G, B) integers representing the color values. Returns None if capture fails.
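
Pixel colors often vary by a few units between machines because of color profiles and anti-aliasing, so an exact comparison can be brittle. The sketch below compares with a per-channel tolerance and treats a failed capture (None) as "no match". color_matches is a hypothetical helper, and the tolerance of 10 is an arbitrary starting point.

```python
def color_matches(pixel, expected, tolerance=10):
    """Return True if every RGB channel is within `tolerance` of `expected`.

    `pixel` may be None (capture failed), which counts as no match.
    """
    if pixel is None:
        return False
    return all(abs(p - e) <= tolerance for p, e in zip(pixel, expected))


print(color_matches((250, 3, 0), (255, 0, 0)))  # True
print(color_matches(None, (255, 0, 0)))         # False
```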

screenshot(imageFilename=None, region=None)

Takes a screenshot of the screen or a specific region, optionally saves it to a file, and returns a PIL Image object.

Parameters:

Parameter      Type   Default  Description
imageFilename  str    None     Optional file path to save the captured image (e.g., 'screen.png').
region         tuple  None     A bounding box (left, top, width, height) to limit the capture area.

Example:

# Keep the image in memory as a PIL Image object
im = session.screenshot()

# Save the entire screen to a file and return the PIL Image
im_saved = session.screenshot('my_capture.png')

# Capture and save a specific region of the screen
im_region = session.screenshot('my_region.png', region=(0, 0, 300, 400))
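
If you crop the returned image further, note that PIL's Image.crop expects a (left, upper, right, lower) box, while the region tuples in this documentation are (left, top, width, height). The conversion is sketched below; region_to_box is a hypothetical helper, not part of the library.

```python
def region_to_box(region):
    """Convert (left, top, width, height) to PIL's (left, upper, right, lower)."""
    left, top, width, height = region
    return (left, top, left + width, top + height)


print(region_to_box((10, 20, 300, 400)))  # (10, 20, 310, 420)

# Intended use with a captured image (requires a live session):
# cropped = im.crop(region_to_box((10, 20, 300, 400)))
```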

Mouse & Keyboard

click(target=None, y=None, offset=(0, 0), button='left', clicks=1, interval=0.2, hold_time=0)

Performs a mouse click. It can target a coordinate, a match result, a list of matches, or just click where the mouse currently is.

Parameters:

Parameter  Type                Default  Description
target     tuple | list | int  None     A match tuple (x, y, w, h), a coordinate tuple (x, y), a list of match tuples [(x, y, w, h), (x, y, w, h)], or None to click at the current mouse position.
y          int                 None     The Y-coordinate (only used if target is passed as an X integer).
offset     tuple               (0, 0)   Shifts the click position by (x, y) pixels relative to the target’s center.
button     str                 'left'   Which mouse button to use: 'left', 'right', 'mouse4', 'mouse5', or 'middle'.
clicks     int                 1        Number of times to click (e.g., 2 for double-click).
interval   float               0.2      Delay in seconds between clicks if clicking multiple times or a list of targets.
hold_time  float               0        Delay between mouseDown and mouseUp for direct input. Applies only when direct_input is True in the Session.
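
The relationship between a match tuple, its center, and offset can be worked out by hand when you need to predict where a click will land. The sketch below is an assumption about that arithmetic, not the library's code: click_point is a hypothetical name, the center is rounded down to whole pixels, and the offset is applied to plain (x, y) targets as well.

```python
def click_point(target, offset=(0, 0)):
    """Return the (x, y) point a click would land on for a given target.

    A 4-tuple is treated as a match (x, y, w, h) and resolved to its
    center; a 2-tuple is treated as a coordinate and used directly.
    """
    if len(target) == 4:
        x, y, w, h = target
        cx, cy = x + w // 2, y + h // 2
    else:
        cx, cy = target
    return (cx + offset[0], cy + offset[1])


print(click_point((100, 200, 40, 20)))                    # (120, 210)
print(click_point((100, 200, 40, 20), offset=(30, 0)))    # (150, 210)
```

An offset is handy when the template is a stable label but the thing to click is next to it, such as a checkbox to the right of its caption.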

moveTo(x, y, duration=0.0)

Moves the mouse cursor to a specific location.

Parameters:

Parameter  Type                Default  Description
target     tuple | list | int  None     A match tuple (x, y, w, h), a coordinate tuple (x, y), or a list of match tuples [(x, y, w, h), (x, y, w, h)].
y          int                 None     The Y-coordinate (only used if target is passed as an X integer).
offset     tuple               (0, 0)   Shifts the destination by (x, y) pixels relative to the target’s center.
duration   float               0.0      Time in seconds to slide the mouse to the target. 0.0 moves instantly.

write(message, interval=0.0)

Types a string of text using the keyboard.

Parameters:

Parameter  Type   Default   Description
message    str    Required  The text string to type.
interval   float  0.0       Delay in seconds between each character typed.

press(key)

Presses a single keyboard key. Key list can be found here: https://pynput.readthedocs.io/en/latest/keyboard.html#pynput.keyboard.Key

Parameters:

Parameter  Type  Default   Description
key        str   Required  The name of the key (e.g., "esc", "f1", "space") or a single character.

scroll(clicks, duration=0.0)

Scrolls the mouse wheel vertically.

Parameters:

Parameter  Type   Default   Description
clicks     int    Required  The number of steps to scroll. Positive values scroll up, negative values scroll down.
duration   float  0.0       The total time (in seconds) to complete the scroll. If greater than 0, the scroll is performed incrementally over the specified duration.
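
One way to picture an incremental scroll is as a list of single-step scrolls with an equal pause after each. The sketch below builds such a plan; it is an assumption about how the time could be divided, not the library's actual scheduling, and scroll_plan is a hypothetical name.

```python
def scroll_plan(clicks, duration=0.0):
    """Return a list of (step, delay_seconds) pairs for a scroll.

    With no duration the whole scroll is one step. With a duration the
    scroll is split into single-click steps spaced evenly over the time.
    """
    if clicks == 0:
        return []
    if duration <= 0:
        return [(clicks, 0.0)]
    direction = 1 if clicks > 0 else -1
    steps = abs(clicks)
    return [(direction, duration / steps)] * steps


print(scroll_plan(3, duration=1.5))  # [(1, 0.5), (1, 0.5), (1, 0.5)]
print(scroll_plan(-5))               # [(-5, 0.0)]
```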

keyDown(key)

Presses and holds down a single keyboard key. Key list can be found here: https://pynput.readthedocs.io/en/latest/keyboard.html#pynput.keyboard.Key

Parameters:

Parameter  Type  Default   Description
key        str   Required  The name of the key (e.g., "esc", "f1", "space") or a single character.

keyUp(key)

Releases a held keyboard key. Key list can be found here: https://pynput.readthedocs.io/en/latest/keyboard.html#pynput.keyboard.Key

Parameters:

Parameter  Type  Default   Description
key        str   Required  The name of the key (e.g., "esc", "f1", "space") or a single character.
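
A key that is pressed with keyDown stays down until keyUp is called, so an exception in between can leave a modifier stuck. A context manager is one way to guarantee the release. The sketch below assumes keyDown and keyUp are called on a session object, as click is in the Simple Example; held_keys is a hypothetical helper, and the key names "ctrl" and "shift" in the usage comment are assumed to be accepted.

```python
from contextlib import contextmanager


@contextmanager
def held_keys(session, *keys):
    """Hold `keys` down for the duration of a with-block, then release them.

    Keys are released in reverse order, and only the keys that were
    actually pressed are released if keyDown fails partway through.
    """
    pressed = []
    try:
        for key in keys:
            session.keyDown(key)
            pressed.append(key)
        yield
    finally:
        for key in reversed(pressed):
            session.keyUp(key)


# Intended use (requires a live session):
# with held_keys(session, "ctrl", "shift"):
#     session.press("s")
```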

mouseDown(button='left')

Presses and holds a mouse button. The button will remain pressed until mouseUp() is called.

Parameters:

Parameter  Type  Default  Description
button     str   "left"   The mouse button to press. Accepted values: "left", "right", "middle", "mouse4", "mouse5".

mouseUp(button='left')

Releases a previously held mouse button.

Parameters:

Parameter  Type  Default  Description
button     str   "left"   The mouse button to release. Accepted values: "left", "right", "middle", "mouse4", "mouse5".
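
mouseDown and mouseUp are the building blocks for a drag. The sketch below combines them with moveTo and releases the button even if the move raises. It assumes these are session methods and that moveTo accepts (x, y, duration=...) as in the signature shown above; drag is a hypothetical helper, not part of the library.

```python
def drag(session, start, end, duration=0.5, button="left"):
    """Press `button` at `start`, slide to `end`, and always release."""
    session.moveTo(start[0], start[1])
    session.mouseDown(button=button)
    try:
        session.moveTo(end[0], end[1], duration=duration)
    finally:
        # Release in a finally block so the button is never left held
        session.mouseUp(button=button)


# Intended use (requires a live session):
# drag(session, start=(200, 300), end=(600, 300), duration=0.8)
```

A non-zero duration matters here: many applications ignore a drag whose start and end arrive in the same instant.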

Window Control

Helper functions to manage application windows. These functions can target windows by Title (string) or PID (integer).

find_window(target)

Finds a window by Title (string) or PID (int). Returns the first matching window object or None.

Parameters:

Parameter  Type       Default   Description
target     str | int  Required  The Window Title or Process ID (PID).

move_window(target, x, y)

Moves the top-left corner of the window to (x, y).

Parameters:

Parameter  Type       Default   Description
target     str | int  Required  The Window Title or PID.
x          int        Required  The target X-coordinate.
y          int        Required  The target Y-coordinate.

resize_window(target, width, height)

Resizes the window to the specified width and height.

Parameters:

Parameter

Type

Default

Description

target

str | int

Required

The Window Title or PID.

width

int

Required

The target width in pixels.

height

int

Required

The target height in pixels.

focus_window(target)

Brings the window to the foreground. Automatically un-minimizes (restores) it if hidden.

Parameters:

Parameter  Type       Default   Description
target     str | int  Required  The Window Title or PID.

maximize_window(target)

Maximizes the specified window.

Parameters:

Parameter  Type       Default   Description
target     str | int  Required  The Window Title or PID.

minimize_window(target)

Minimizes the specified window.

Parameters:

Parameter  Type       Default   Description
target     str | int  Required  The Window Title or PID.

get_window_info(target)

Returns a dictionary of window properties (x, y, width, height, title, pid).

Parameters:

Parameter  Type       Default   Description
target     str | int  Required  The Window Title or PID.

get_focused_window()

Returns a dictionary of active window properties (title, x, y, width, height, pid).

Example:

active_window = pyauto_desktop.get_focused_window()
title = active_window.get('title')
x = active_window.get('x')
y = active_window.get('y')
width = active_window.get('width')
height = active_window.get('height')
pid = active_window.get('pid')  # Windows only
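
The window dictionary can be turned into a search region so that image lookups are limited to one application. The sketch below assumes window coordinates and region coordinates share the same screen space, which may not hold across monitors or DPI settings. window_region and window_center are hypothetical helpers.

```python
def window_region(info):
    """Build an (x, y, w, h) region tuple from a window-info dictionary."""
    return (info["x"], info["y"], info["width"], info["height"])


def window_center(info):
    """Return the (x, y) center point of the window."""
    return (info["x"] + info["width"] // 2, info["y"] + info["height"] // 2)


info = {"title": "Notepad", "x": 100, "y": 50, "width": 800, "height": 600}
print(window_region(info))  # (100, 50, 800, 600)
print(window_center(info))  # (500, 350)

# Intended use (requires a live session):
# match = session.locateOnScreen("save_btn.png", region=window_region(info))
```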